RabakBench: Multilingual AI Safety Evaluation Made Local
Global safety guardrails are often blind to local dialects and sensitivities.
By Gabriel Chua (Data Scientist, GovTech), Leanne Tan (Data Scientist, GovTech), Ziyu Ge (Research Officer, SUTD), and Roy Ka-Wei Lee (Assistant Professor, SUTD, and Smart Nation Fellow, GovTech)
🚨🚨🚨 Warning: This article and the linked materials may contain content that could be deemed highly offensive. Such content is presented here solely for educational and research purposes. These examples do not reflect the opinions of the authors or any affiliated organisations.

Global AI expansion often outpaces multilingual safety benchmarks, creating gaps. Singapore’s distinct multilingual mix — Singlish, Chinese, Malay, Tamil — exposes unique blind spots in English-focused evaluations. To bridge this gap, we developed RabakBench*, a Singapore-specific multilingual safety benchmark, in collaboration with researchers from SUTD.
*Rabak is a local Singapore expression meaning “extreme” or “intense.” It is often used to describe something risky, daring, or particularly outlandish.
Existing benchmarks like SGHateCheck are valuable but limited to Singlish hate speech. RabakBench broadens the evaluation scope to multiple languages and risk categories — including insults, sexual content, and self-harm — each annotated with clear severity levels for nuanced assessment. Native speakers provided sample translations and verified the final dataset for accuracy.
Today, we release:
- Our datasets — the public test set and a set of high-quality, human-verified translations
- Evaluation code
- Technical report
Key Findings

Evaluating 11 popular open-weight and closed-source guardrails on RabakBench revealed substantial multilingual performance gaps (a minimal evaluation sketch appears at the end of this section):
- Popular guardrail options like the OpenAI Moderation Endpoint and LlamaGuard 4 12B are not necessarily the best performers, with the former in particular struggling on Tamil (7%).
- The open-source WildGuard 7B excelled in Singlish (78.9%), but its performance deteriorated drastically on Tamil (2%).
- AWS Bedrock Guardrails effectively flagged unsafe content in Singlish but almost entirely missed harmful content in Chinese, Malay, and Tamil.
These results broadly align with the providers’ stated multilingual capabilities (as of the time of writing):
- AWS Bedrock Guardrails: Supports only English, French, and Spanish.
- Azure AI Content Safety: Multilingual support, with all four languages explicitly supported.
- OpenAI Moderation, Google Model Armor: Multilingual support stated, but full language list undisclosed.
- Perspective: Supports German, English, Spanish, French, Italian, Portuguese, Russian.
- LlamaGuard 3 & 4: Supports English, French, German, Hindi, Italian, Portuguese, Spanish, Thai.
- DuoGuard: Multilingual support based on Qwen 2.5, full language list undisclosed.
- PolyGuard: Based on Qwen 2.5, fine-tuned for Arabic, Chinese, Czech, Dutch, English, French, German, Hindi, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Swedish, and Thai.
- ShieldGemma: Based on Gemma 2, and supports only English.
- WildGuard: Based on Mistral-7B-v0.3, and supports only English.
These disparities highlight the urgent need for localised multilingual benchmarks like RabakBench to uncover language-specific blind spots, which is essential for mitigating risks in deployment.
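To make this concrete, here is a minimal sketch of how a single guardrail (the OpenAI Moderation endpoint) could be scored against the public test set, reporting the share of unsafe examples flagged per language. The file name and the `text`, `language`, and `is_unsafe` columns are hypothetical placeholders; refer to the released datasets and evaluation code for the actual schema and metrics.

```python
# Minimal sketch: per-language detection rate of one guardrail on the public test set.
# The file name and column names below are hypothetical; see the released
# datasets and evaluation code for the actual schema.
from collections import defaultdict

import pandas as pd
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment


def is_flagged(text: str) -> bool:
    """Return True if the OpenAI Moderation endpoint flags the text as unsafe."""
    response = client.moderations.create(model="omni-moderation-latest", input=text)
    return response.results[0].flagged


df = pd.read_csv("rabakbench_public_test.csv")  # hypothetical file name

hits, totals = defaultdict(int), defaultdict(int)
for row in df.itertuples():
    if not row.is_unsafe:  # measure only the recall on unsafe examples here
        continue
    totals[row.language] += 1
    if is_flagged(row.text):
        hits[row.language] += 1

for lang in sorted(totals):
    print(f"{lang}: {hits[lang] / totals[lang]:.1%} of unsafe examples flagged")
```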
Building RabakBench: Our Methodology
Building high-quality multilingual safety benchmarks for low-resource languages is labour-intensive and difficult to scale. Manual annotation requires large teams of trained linguists, but such expertise is rare for languages like Malay, Tamil, or Singlish. Traditional approaches — often feasible for well-resourced languages — do not work when data, experts, and guidance are all limited. How we built RabakBench demonstrates how human annotation can be scaled with LLM-assisted workflows and collaborative workshops, enabling rigorous, culturally grounded benchmarks even in data-scarce settings.
We built RabakBench through a three-stage process that combines human-in-the-loop annotation with LLM-assisted red-teaming and multilingual translation. This pipeline enables the rapid, consistent creation of nuanced safety datasets that reflect local context and risk categories missing from global benchmarks.
Stage 1: Content Generation and Red-Teaming 🔦
We sourced raw Singlish content from web forums, converting these posts into instructional statements using prompt templates. Additionally, we applied automated adversarial red-teaming against widely used guardrails to discover challenging edge cases — especially false positives and false negatives.
Further details on this automated red-teaming process can be found here.
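As a rough illustration of how such edge cases can be mined, the sketch below keeps only the candidates on which a guardrail's verdict disagrees with a provisional label, i.e. likely false positives and false negatives. The `guardrail_flags` adapter and the record format are hypothetical stand-ins for the actual red-teaming harness.

```python
# Hypothetical sketch of edge-case mining: keep candidates where a guardrail's
# verdict disagrees with a provisional label (likely false positives/negatives).
from typing import Callable


def mine_edge_cases(
    candidates: list[dict],                  # each: {"text": str, "provisional_unsafe": bool}
    guardrail_flags: Callable[[str], bool],  # adapter around the guardrail under test
) -> dict[str, list[dict]]:
    edge_cases = {"false_positive": [], "false_negative": []}
    for item in candidates:
        flagged = guardrail_flags(item["text"])
        if flagged and not item["provisional_unsafe"]:
            edge_cases["false_positive"].append(item)   # over-blocking benign content
        elif not flagged and item["provisional_unsafe"]:
            edge_cases["false_negative"].append(item)   # missed harmful content
    return edge_cases
```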
Stage 2: Alternative Testing and Labelling 🎯
Utilising the Alt-Test methodology, we selected Gemini Flash, o3-mini-low, and Claude 3.5 Haiku — LLMs that closely aligned with human annotations. We then derived our final labels through majority voting across these LLM outputs, ensuring accuracy while maintaining scalability.
For more precise labelling, we defined a multi-risk taxonomy with severity levels, aligned to popular frameworks like MLCommons and OpenAI. Categories include hate speech, insults, sexual content, self-harm, violence, and misconduct, some with clear severity distinctions — for instance, self-harm ranges from ideation (Level 1) to explicit action or suicide (Level 2). Please refer to our technical report annex for full details.
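As a simplified sketch of the label-aggregation step, the function below majority-votes per-category severity labels from the three LLM annotators. The label encoding (0 = not present, 1 and 2 = increasing severity) follows the taxonomy above, but the exact data format and the tie-breaking rule shown here are assumptions for illustration; see the technical report for the actual procedure.

```python
# Simplified sketch: majority voting over per-category severity labels from
# three LLM annotators (0 = not present, 1/2 = increasing severity).
# The data format and the tie-breaking rule are illustrative assumptions.
from collections import Counter

CATEGORIES = ["hate_speech", "insults", "sexual_content", "self_harm", "violence", "misconduct"]


def majority_vote(annotations: list[dict[str, int]]) -> dict[str, int]:
    final = {}
    for category in CATEGORIES:
        votes = [a.get(category, 0) for a in annotations]
        label, count = Counter(votes).most_common(1)[0]
        # No strict majority (e.g. votes of 0, 1, 2): fall back to the highest severity.
        final[category] = label if count > len(votes) // 2 else max(votes)
    return final


# Two annotators see self-harm ideation (Level 1), one sees explicit action (Level 2).
print(majority_vote([{"self_harm": 1}, {"self_harm": 1}, {"self_harm": 2}])["self_harm"])  # 1
```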

Stage 3: Toxicity-Aware Translation 📣
Translating unsafe content required careful handling due to distinct multilingual challenges:
- Malay and Tamil are relatively low-resource languages, raising concerns about machine-translation accuracy.
- LLM-generated translations often sanitise harmful expressions, losing critical nuances.
To overcome this, we developed a “toxicity-aware few-shot prompting” method:
- Structured Multi-Round Workshops: We conducted structured workshops involving multiple rounds with native-speaking annotators. Annotators initially reviewed several LLM-generated translations and either selected the best options or provided alternative translations. Subsequent rounds refined the selections, prioritising authenticity, tone, cultural context, and the preservation of harmful expressions.
- Prompt and Model Optimisation: Evaluating translations proved challenging without multilingual ground-truth references. To address this, we used semantic similarity as our primary metric, assessing the coherence and fidelity between (a) the original Singlish text and the translated Chinese/Malay/Tamil, and (b) the original Singlish text and the back-translated Singlish (i.e., Singlish → LLM-translated Chinese/Malay/Tamil → LLM-translated Singlish); a minimal sketch of this check appears at the end of this stage.
We experimented with:
- Multiple LLMs (GPT-4o Mini, Gemini 2.0 Flash, DeepSeek R1)
- Varying the number of few-shot examples
- Frameworks like DSPy
GPT-4o Mini emerged as the top performer, further validated by additional human reviews.
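To make this pipeline concrete, here is a minimal sketch under stated assumptions: the prompt wording (including the instruction to preserve offensive language) paraphrases the toxicity-aware idea rather than reproducing the production prompt, the few-shot examples are omitted, and LaBSE is one reasonable choice of multilingual embedding model for comparing Singlish text against Chinese, Malay, or Tamil, not necessarily the one we used.

```python
# Minimal sketch: toxicity-aware translation plus a back-translation fidelity check.
# Prompt wording is illustrative and few-shot examples are omitted; LaBSE is an
# assumed choice of multilingual embedding model.
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

client = OpenAI()
embedder = SentenceTransformer("sentence-transformers/LaBSE")


def translate(text: str, source: str, target: str) -> str:
    prompt = (
        f"Translate the following {source} text into {target}. "
        "Preserve the tone, severity, and any offensive or harmful expressions; "
        "do not soften or sanitise them.\n\n" + text
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def fidelity_scores(singlish: str, target_language: str) -> tuple[float, float]:
    translated = translate(singlish, "Singlish", target_language)
    back_translated = translate(translated, target_language, "Singlish")
    original, forward, back = embedder.encode([singlish, translated, back_translated])
    return (
        util.cos_sim(original, forward).item(),  # (a) original vs translation
        util.cos_sim(original, back).item(),     # (b) original vs back-translation
    )
```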
Measuring AI Safety in Singapore
Effective multilingual safety evaluation must reflect local realities. RabakBench fills critical gaps left by English-centric or generic multilingual benchmarks, offering essential tools to identify and address language-specific risks in Singapore’s diverse linguistic and cultural landscape.
Companies and developers looking to build AI solutions for Singapore can test their applications against RabakBench to evaluate whether their systems can detect and appropriately refuse unsafe requests across Singlish, Chinese, Malay, and Tamil.
For researchers, RabakBench serves as a foundational dataset and evaluation framework, encouraging further exploration into low-resource languages specific to the Singaporean context and addressing unique multilingual AI safety challenges.
For policymakers, RabakBench highlights significant safety gaps beyond English-centric evaluations, emphasising the importance of comprehensive multilingual testing and informed policy-making to ensure safer AI deployments.
We invite researchers, developers, and policymakers to utilise RabakBench and collectively advance safer, more responsible AI systems tailored to our multilingual communities.
Acknowledgements
We thank Ainul Mardiyyah Zil Husham, Anandh Kumar Kaliyamoorthy, Govind Shankar Ganesan, Lizzie Loh, Nurussolehah Binte Jaini, Nur Hasibah Binte Abu Bakar, Prakash S/O Perumal Haridas, Siti Noordiana Sulaiman, Syairah Nur ‘Amirah Zaid, Vengadesh Jayaraman, and other participants for their valuable contributions. Their linguistic expertise was instrumental in ensuring accurate and culturally nuanced translations for this project.