Benchmarking GPT-5 & GPT-OSS: A Responsible AI Approach
Evaluating dimensions often overlooked by traditional benchmarks.
By: GovTech AI Practice Responsible AI Team (Jessica Foo, Shaun Khoo, Gabriel Chua, Goh Jia Yi, Leanne Tan)
Summary: We’ve benchmarked the new GPT-OSS and GPT-5 models on our Responsible AI Bench, evaluating safety, robustness, and fairness dimensions often overlooked by traditional benchmarks. GPT-OSS performs comparably to other open-weight models with significantly higher parameter counts, and GPT-5 variants show steady gains over previous OpenAI models. This is very much a work in progress — we’re sharing early results and actively seeking community feedback to improve the framework together. Framework and results available at go.gov.sg/rai-bench.
As AI models evolve at breakneck speed, teams struggle to make sense of each new release. Traditional benchmarks focus heavily on capability metrics — reasoning, STEM/coding ability — but often overlook the responsible AI dimensions critical for real-world deployments.
Today, we’re sharing early results from our Responsible AI Bench, an early MVP designed to evaluate models across dimensions that matter for production systems: safety, robustness, and fairness. We’ve applied this framework to benchmark 28 of the leading and popular models, including the recently released GPT Open Source (GPT-OSS) variants and GPT-5.
A note on our approach: This is very much experimental work in progress. We’re sharing these initial findings not as definitive answers, but as a starting point for discussion. We know there’s much room for improvement, and we’re eager to collaborate with the community to refine these methods and make them more comprehensive and useful for everyone.
Why Another Benchmark?
Most existing benchmarks tell us how smart a model is, but not necessarily what the unintended consequences or risks are of deployment. A model that aces MMLU or HumanEval might still generate biased content, fail to refuse harmful requests, or confidently hallucinate when faced with out-of-scope queries.
For this MVP, we are focusing on three dimensions:
- 🛡️ Safety — Can the model refuse harmful requests, especially those with local context and nuance?
- 🦾 Robustness — How well does it handle queries beyond its knowledge base in RAG applications?
- ⚖️ Fairness — Does it generate consistent outputs regardless of demographic attributes?
Notably, the context and system prompts for these tests are based on, or heavily inspired by, real applications we’ve seen deployed in the Singapore public service.

Key Findings
For safety, Gemini 2.5 Flash tops the chart at 96%. GPT-OSS scores are strong for their size, with the 20B and 120B models averaging 88% and 91% refusal rates. GPT-5 lands at 90%, maintaining consistently high resistance.

For robustness, excluding a stealth model tested on OpenRouter, Claude 4 Sonnet leads at 71%. GPT-OSS trails with 53% (20B) and 51% (120B), while GPT-5 Chat (the snapshot of GPT-5 that is used in ChatGPT) reaches 72% — a marked leap over earlier OpenAI releases.
For fairness, measured via bias scores (lower is better), Claude 4 Opus leads with 0.08. GPT-5 follows at 0.11, while GPT-OSS records 0.25 (20B) and 0.35(120B).
Note: GPT-5’s safety score should be read with some nuance: GPT-5 is reportedly trained with safe-completion, a new output-centric safety approach. Instead of making a binary “comply or refuse” decision based on the prompt, GPT-5 aims to give the most helpful safe answer possible. This means our refusal-rate test may undercount GPT-5’s actual safety performance, since many of its safe responses are not refusals in the strict sense. We will review this as we continuously improve our internal rejection evaluator.
Methodology
1. Safety: Localised Harmful Content Refusal
We tested models using a sample of prompts from RabakBench across four application contexts: a general helpful chatbot, a chatbot oriented towards career advice, a physics tutoring chatbot, and a job description generator.
2. Robustness: RAG Out-of-Knowledge Queries
Here, we use our KnowOrNot framework, an open-source library that systematically creates “out-of-knowledge base” scenarios to test whether models appropriately say “I don’t know” rather than hallucinating answers. For the context, we used PolicyBench, 331 question-answer pairs from four Singapore government documents (immigration FAQs, CPF savings, MediShield, driving theory). We further test the models under two retrieval methods: (i) hypothetical document embeddings, and (ii) providing the context entirely (i.e., long context)
3. Fairness: Testimonial Generation Bias
Extending our original analysis of fairness evaluation in LLM-generated student testimonials, we generated 3,520 synthetic testimonials across different combinations of gender (male/female) and race (Chinese, Malay, Indian, Eurasian) while holding all other qualifications constant.
Our analysis examined two dimensions:
- Language Style: Sentiment and formality scores using DistilBERT and RoBERTa models
- Lexical Content: Distribution of adjectives across 7 stereotype categories (assertiveness, independence, instrumental competence, leadership competence, concern for others, sociability, emotional sensitivity)
We ran regression analysis to determine if gender or race influenced the generated outputs while controlling for actual student attributes (personal qualities, CCA records, academic achievements). The regression coefficients quantify how much demographic factors affect style and content — among statistically significant regression coefficients, we took the coefficient of the highest magnitude as our bias score.
Implementation Note: On Providers and Parameters
For GPT-OSS, latency considerations meant we ran tests through several providers — Fireworks, Groq, and Cerebras. We note that internal hosting implementations could differ in ways that affect the generation & thus results.
For GPT-5, we left the reasoning parameters at their API defaults for each model.
What This Means for Practitioners
- Model Selection is Multi-Dimensional. GPT-5 excels at robustness and fairness, but relatively less so for localised safety scores. Claude 4 Opus leads in fairness and safety but at higher computational cost. Choose based on your priorities.
- Parameter Count ≠ Responsible AI Performance. GPT-OSS (20B) matches or beats models with 70–235B parameters on safety metrics, making it ideal for resource-constrained deployments.
- Context Matters for Safety. The 58% gap in job description writing safety scores (42% to 100%) shows that evaluation must match your use case.
Looking Ahead
This is an early MVP of our benchmark framework, designed to evolve with community input. Your feedback is crucial. Whether you’re a researcher, practitioner, or simply interested in responsible AI, we want to hear from you.
We’re particularly interested in:
- Expanding the datasets to cover more local contexts and use cases
- Enhancing the scope of tests for each dimension
- Adding other dimensions of Responsible AI (e.g., transparency, explainability)
- Improving our evaluation methodologies based on real-world experiences
- Understanding what matters most for your specific applications, and trying to balance between that and generalisability
We encourage teams to use this as a reference point while conducting their own application-specific evaluations. Have a look at our benchmark at go.gov.sg/rai-bench, and let us know what you think!
PS: We’re hiring!