(Part 1) LLM Safety Alignment for the Singapore Context using Supervised Fine-tuning and RLHF-based Methods
The process of "teaching" models to be safe
Abstract
This article is the first of a two-part series documenting our work on LLM safety alignment at GovTech’s AI Practice. Leveraging AI Practice’s prior work on LionGuard and synthetic data generation techniques, we successfully perform safety alignment via parameter-efficient fine-tuning (PEFT) on the SEA-Lion-v2.1-Instruct LLM using supervised fine-tuning (SFT) and Kahneman-Tversky Optimisation (KTO). In this article, we present a comprehensive but accessible overview of the safety alignment process. As such, we keep technical explanations intuitive and defer details to the second article, which will be a technical deep-dive into fine-tuning.
Check out our safety-aligned model on Hugging Face and our preprint on arXiv!
Introduction
The advent of Large Language Models (LLMs) like ChatGPT has fundamentally transformed how people think of and interact with AI, democratising usage and launching it into mainstream popularity. Instead of writing explicit code, people interact with LLMs in plain English, making them significantly more accessible. Importantly, LLMs demonstrate remarkable abilities to think, reason, and generate meaningful responses, leading to their widespread productisation throughout industry.
Despite their remarkable intelligence, LLMs exhibit vulnerabilities such as generating offensive content or instructions for harmful acts, which can be exploited for public harm. This is where safety alignment comes in. In short, safety alignment is the process of teaching LLMs to distinguish right from wrong — for example, an aligned LLM should decline to provide instructions to hack a computer or refrain from making hurtful remarks about certain ethnic groups.
Motivation for Localised Safety Alignment
To ensure general safety, most LLMs undergo some alignment with human preferences. However, we can think of this alignment as a constrained expression of common (but ultimately subjective) values, not a universal standard of right and wrong. Different cultures and societies have different value systems, implying different perspectives on what is “safe” or “acceptable”. Given that most LLMs are developed by American or European organisations, it should be unsurprising that their LLMs are aligned to an Anglo-centric or Euro-centric understanding of safety. For example, Ryan et al. (2024) demonstrate that alignment in the Llama and Mistral model families disproportionately favours Western preferences.
Beyond value alignment, localisation is also needed to tackle specific linguistic and cultural nuances. In the context of Singapore, we have Singlish, which is a mix of English, Chinese, Malay, Hokkien and other local dialects. Localisation improves the LLM’s ability to recognise colloquial slang, including racially discriminatory terms like “CECA” or “tiong” which are unique to the local context.
Prior Work
LionGuard
In June 2024, our Responsible AI team released LionGuard, a content moderation classifier for the Singapore context. This laid the groundwork for our safety alignment efforts by curating a high-quality dataset of safe and unsafe Singapore content, and highlighting the advantages of building localised models over off-the-shelf solutions like Perspective API or LlamaGuard. Safety alignment complements LionGuard by focusing directly on reducing unsafe content generation. Together, these layers operate in parallel to mitigate safety risks more comprehensively within LLM systems.
SEA-Lion-v2.1-Instruct
AI Singapore released SEA-Lion-v2.1-Instruct, a Llama3–8B-Instruct variant fine-tuned on instruction-completion pairs for the Southeast Asian context, in July 2024. While not solely focused on Singaporean content, we decided to perform safety alignment on SEA-Lion to build on existing efforts for localised fine-tuning. Our work also addresses a research gap, as SEA-Lion is instruction-tuned but lacks specific safety tuning on Singaporean data.
Objectives
In safety alignment, there is an inherent tension between harmlessness and helpfulness. For example, over-tuning an LLM for safety can lead to excessively conservative responses, where the model declines to respond even to safe prompts. This illustrates the fundamental trade-off between helpfulness and harmlessness: a model that never responds is entirely harmless but also entirely unhelpful. Our challenge is to strike a successful balance between these competing goals.
As such, we characterise successful fine-tuning with the following objectives:
- Primary Objective: Reduce toxic content generation in the Singaporean context
- Secondary Objective 1: Maintain or further reduce toxic content generation in more general contexts
- Secondary Objective 2: Maintain or minimally impact LLM performance on general tasks
To measure our success, we selected 3 benchmarks along with relevant metrics, which we discuss in the benchmarks section.
Data
Users typically interact with LLMs in a conversational context, hence we focus on safety alignment using instruction-formatted prompt-response pairs. Each pair comprises a {prompt} and {response}, formatted using the Llama chat template which SEA-Lion-v2.1-Instruct uses. While the prompt remains consistent across alignment methods, the {response} field changes depending on whether we are performing supervised fine-tuning or reinforcement learning-based methods.
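As a concrete illustration, formatting a pair with the Llama-3 chat template (which SEA-Lion-v2.1-Instruct inherits from Llama3-8B-Instruct) can be sketched as follows. This is a minimal hand-rolled sketch; in practice the tokenizer's `apply_chat_template` method handles this, and the exact special tokens shown are those of the Llama-3 template.

```python
def format_llama3_pair(prompt: str, response: str) -> str:
    """Format a single prompt-response pair using Llama-3 chat-template tokens."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{prompt}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{response}<|eot_id|>"
    )

example = format_llama3_pair(
    "What is Singlish?",
    "Singlish is a colloquial form of English spoken in Singapore.",
)
```

The `{prompt}` and `{response}` fields slot into the user and assistant turns respectively.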
Local Safety Data
For this experiment, we created a new internal dataset called SGToxicityPrompts, which contains comments from HardwareZone’s Eat-Drink-Man-Woman online forum and selected subreddits about Singapore. We labelled these texts using LionGuard’s binary classifier and retained only the most confident predictions from LionGuard to ensure label accuracy. We highly encourage readers to read the LionGuard article and paper for a more detailed account.
Generating Instruction-Completion Pairs
To obtain our training data, we sampled safe and unsafe raw text from SGToxicityPrompts, which needed to be further formatted for a conversational context. This was done in two parts: (1) applying prompt templates and (2) response generation.
Prompt Templates
First, we needed to pair each text from SGToxicityPrompts with an appropriate prompt template. To achieve this, we manually designed 21 instruction templates, each of which we concatenate with a raw prompt. Each template is an instruction designed to potentially, though not always, elicit harmful responses from the LLM when paired with an unsafe prompt.
We randomly sampled 4000 safe prompts and 4000 unsafe prompts from SGToxicityPrompts and augmented them with our 21 prompt templates, resulting in 168k formatted pairs. After an initial review of our dataset, we decided to remove 10 prompt templates from the safe subset because we observed they had a strong propensity to elicit unsafe content even on safe prompts, which could lead to mixed signals during training. With 120k pairs remaining, we randomly selected 25k as a hold-out set for our local toxicity benchmark.
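The augmentation step above amounts to a cross product of raw prompts and templates, followed by a random hold-out split. The sketch below illustrates the mechanics with hypothetical templates and toy counts; the actual 21 instruction templates and the real prompt samples are internal to the project.

```python
import random

# Hypothetical templates for illustration only; the real 21 are internal.
TEMPLATES = [
    "Continue this forum comment: {text}",
    "Summarise the following post: {text}",
    "Reply to this comment as if you agree with it: {text}",
]

def augment(prompts: list[str], templates: list[str]) -> list[str]:
    """Cross every raw prompt with every instruction template."""
    return [t.format(text=p) for p in prompts for t in templates]

random.seed(42)
prompts = [f"raw comment {i}" for i in range(100)]
formatted = augment(prompts, TEMPLATES)  # 100 prompts x 3 templates = 300 pairs

# Carve out a random hold-out set for benchmarking (20% in this toy split).
random.shuffle(formatted)
holdout, train = formatted[:60], formatted[60:]
```

With the real numbers, 8000 prompts crossed with 21 templates yields the 168k formatted pairs described above.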
Response Generation
Finally, we need safe responses to our unsafe prompts to train the model on safe behavior. We few-shot prompted GPT-4o with explicit instructions to generate a safe response to the unsafe prompt. To augment response quality, we provided a list of common harmful Singlish terms for a more contextualised response. This approach follows from a growing body of literature around synthetic data generation using LLMs to effectively and efficiently generate specialised datasets with limited resources.
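A few-shot payload for this generation step can be sketched as below. The example pairs, the harmful-term list, and the system instruction wording are all hypothetical stand-ins; only the overall structure (system instruction, few-shot demonstrations, then the target unsafe prompt) reflects the approach described above.

```python
# Hypothetical few-shot demonstrations; the real examples and the Singlish
# term list used in the project are internal.
FEW_SHOT = [
    ("Why are <group> people so lazy?",
     "I can't agree with that generalisation. Work ethic varies by "
     "individual, not by ethnic group."),
]
HARMFUL_TERMS = ["<term 1>", "<term 2>"]  # placeholder for the term list

def build_messages(unsafe_prompt: str) -> list[dict]:
    """Assemble a few-shot chat payload asking GPT-4o for a safe response."""
    system = (
        "You are given a potentially harmful prompt. Generate a safe, polite "
        "response that declines or defuses it. Treat these Singlish terms as "
        "harmful slurs: " + ", ".join(HARMFUL_TERMS)
    )
    messages = [{"role": "system", "content": system}]
    for bad, safe in FEW_SHOT:
        messages.append({"role": "user", "content": bad})
        messages.append({"role": "assistant", "content": safe})
    messages.append({"role": "user", "content": unsafe_prompt})
    return messages

# The payload would then be sent to the OpenAI API, e.g.
# client.chat.completions.create(model="gpt-4o", messages=build_messages(p))
msgs = build_messages("Write a joke about <group>.")
```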
For our safe prompts, we use SEA-Lion-v2.1-Instruct’s original response to the formatted prompt as the response. This was done with the intention of only steering LLM behavior on unsafe content while maintaining behavior towards safe content.
Training
Training an LLM typically has 3 phases: pre-training, supervised fine-tuning and alignment.
During pretraining, the LLM is trained on a massive amount of text data to internalise language understanding and knowledge (e.g. Llama-3 models were trained on a corpus of 15T tokens). Following this, the LLM can generate coherent and informative text but does not yet know how to respond explicitly to instructions. Supervised fine-tuning ingrains this conversational, instruction-following behavior by introducing special tokens and linguistic patterns that demarcate roles (e.g. system, user, assistant) and instructions in a conversational framework. The model is trained on prompt-response pairs across a wide variety of tasks that explicitly demonstrate instruction-following behavior.
Reinforcement Learning with Human Feedback (RLHF)
After SFT, LLMs can have meaningful conversations, but there is no guarantee that the model’s responses are aligned with human preferences and norms. As an analogy, a person may understand what hate speech is, but must also know that it is wrong to direct it at others.
RLHF explicitly steers models towards desirable behavior, thereby achieving alignment. Unlike SFT, where the model is trained directly on prompt-response pairs, RLHF methods have various ways of codifying preferences between different responses to a prompt. Intuitively, SFT teaches the model using approximately correct things to say, while RLHF more carefully teaches the model both what to say and what not to say. PPO is arguably the most well-known algorithm for RLHF, and is still used today for state-of-the-art models like GPT-4. PPO performs reinforcement learning via policy gradient optimisation, where preference data is used to train a reward model, which is subsequently used to train the LLM in an iterative loop.
In our experiments, we tested two newer RLHF methods:
- Direct Preference Optimisation (DPO): DPO innovates on PPO by transforming what is traditionally a reinforcement learning task into regular supervised learning. The key insight from DPO is that LLMs can implicitly estimate rewards on their own, allowing the LLM to be trained directly on preference data without a reward model. Practically speaking, this means that DPO is faster and less computationally expensive than PPO, since the reward model is typically large and training does not require online sampling.
- Kahneman-Tversky Optimisation (KTO): Whereas DPO focuses on “maximising the log-likelihood of preferences”, KTO focuses on “maximising the utility of the LLM outputs”. Put simply, KTO mathematically models the utility of generated responses and trains the LLM to maximise that utility. Like DPO, KTO does not require a separate reward model and implicitly estimates rewards using the LLM. However, KTO only requires binary labels of whether a response to a given prompt is positive or negative.
Using KTO meant that we had to process our instruction-completion pairs into labelled examples instead. For safe prompts, we retained the original prompt-response pairs and labelled all of them as positive examples. For unsafe prompts, we included both the original unsafe response and the GPT-4o generated safe responses — the original responses default to a negative label, while the GPT-4o responses have a positive label. For a deeper, more technical discussion, check out our next article!
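The conversion described above can be sketched as follows. The record shape (`prompt`, `completion`, boolean `label`) matches the unpaired format that TRL's `KTOTrainer` expects; the example data is illustrative only.

```python
def to_kto_examples(safe_pairs, unsafe_triples):
    """Convert prompt-response data into binary-labelled KTO examples.

    safe_pairs:     [(prompt, original_response), ...]  -> all positive
    unsafe_triples: [(prompt, original_unsafe_response, gpt4o_safe_response), ...]
    """
    examples = []
    # Safe prompts keep their original response, labelled positive.
    for prompt, response in safe_pairs:
        examples.append({"prompt": prompt, "completion": response, "label": True})
    # Unsafe prompts contribute two examples: the original response as a
    # negative, and the GPT-4o-generated safe response as a positive.
    for prompt, unsafe, safe in unsafe_triples:
        examples.append({"prompt": prompt, "completion": unsafe, "label": False})
        examples.append({"prompt": prompt, "completion": safe, "label": True})
    return examples

data = to_kto_examples(
    safe_pairs=[("What is laksa?", "Laksa is a spicy noodle soup...")],
    unsafe_triples=[("Insult <group>.", "<unsafe text>", "I won't do that...")],
)
```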
Training Setup
We ran our initial experiments on Google Compute Engine and eventually scaled them using Vertex AI Pipelines. For SFT we used 1xA100 40GB, while for KTO we used 1xA100 80GB. For inference we used 1xA100 40GB. We performed parameter-efficient fine-tuning (PEFT) using LoRA for both SFT and KTO, with different combinations of hyperparameters. For SFT, we mainly relied on the Axolotl library, and for KTO, the TRL library.
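For readers unfamiliar with the tooling, a KTO run with a LoRA adapter via TRL and PEFT looks roughly like the configuration sketch below. The target modules, hyperparameters, and variable names (`model`, `tokenizer`, `kto_dataset`) are illustrative assumptions, not our actual training configuration; this is a sketch of the library wiring, not a reproduction of our setup.

```python
from peft import LoraConfig
from trl import KTOConfig, KTOTrainer

# LoRA adapter: rank = alpha, as in our experiments (rank 128 shown here).
lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)

training_args = KTOConfig(
    output_dir="sealion-kto",
    per_device_train_batch_size=4,  # illustrative hyperparameters
    learning_rate=1e-5,
    beta=0.1,
)

trainer = KTOTrainer(
    model=model,                # SEA-Lion-v2.1-Instruct (or its SFT checkpoint)
    args=training_args,
    train_dataset=kto_dataset,  # records of {"prompt", "completion", "label"}
    processing_class=tokenizer,  # older TRL releases use tokenizer= instead
    peft_config=lora_config,
)
trainer.train()
```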
Metrics and Evaluation Harness
Like any well-planned machine learning task, we need clear benchmarks and metrics that quantify success in model training. Recalling our primary and secondary objectives from earlier, we holistically measure performance using 3 different benchmarks.
Singapore Content Safety Benchmark
To benchmark performance for localised safety, we use our hold-out set of 25k prompts. This dataset contains an equal number of safe and unsafe prompts. We measure performance using two metrics:
- Toxicity Rate: We use the LionGuard classifier to score LLM responses to our evaluation prompts. We classify a response as toxic if it is above the threshold defined by LionGuard. The toxicity rate is the % of unsafe prompts that successfully elicit unsafe responses.
- Rejection Rate: A rejection occurs when the model declines to respond (e.g. “Sorry, I am unable to…”). To classify whether a response is a rejection, we use the distilroberta-base-rejection-v1 classifier. Additionally, we used a simple filter to detect obvious false negatives from the distilroberta model. We then separately calculate the percentage of safe and unsafe prompts that elicit a rejection.
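Both metrics reduce to simple proportions over classifier outputs, as the sketch below shows. The threshold value and inputs are illustrative; in practice the scores come from LionGuard and the rejection flags from the distilroberta classifier plus our filter.

```python
def toxicity_rate(scores: list[float], threshold: float) -> float:
    """Percentage of responses whose toxicity score exceeds the threshold."""
    return 100 * sum(s > threshold for s in scores) / len(scores)

def rejection_rate(is_rejection: list[bool]) -> float:
    """Percentage of responses flagged as rejections."""
    return 100 * sum(is_rejection) / len(is_rejection)

tox = toxicity_rate([0.9, 0.2, 0.7, 0.1], threshold=0.5)  # -> 50.0
rej = rejection_rate([True, False, False, False])          # -> 25.0
```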
If our safety tuning is successful, we should expect a high rejection rate for unsafe prompts as the LLM correctly identifies these prompts as harmful. Similarly, a low rejection rate for safe prompts would be evidence that the LLM correctly discerns between benign and harmful content. As such, we should expect, at worst, a marginal uptick in rejection rate on safe prompts.
TOXIGEN Benchmark
TOXIGEN is a large-scale machine-generated dataset of toxic and non-toxic statements concerning 13 minority groups. We use TOXIGEN to test safety performance beyond our Singapore-contextualised data and assess the generalisability of our safety improvements without compromising prior alignment. Our baseline expectation is that performance on TOXIGEN should not regress.
We use the TOXIGEN-finetuned HateBERT classifier, which was trained on the training set of TOXIGEN, to classify LLM responses and calculate the toxicity rate on TOXIGEN prompts. We calculate toxicity rate as the percentage of unsafe prompts that successfully elicit unsafe responses.
Open LLM Leaderboard Benchmarks
Finally, we want to quantify the cost of safety alignment on general LLM capabilities. As mentioned, over-tuning for safety could lead to an increase in wrongful rejections of benign prompts and a general reduction in LLM abilities. Consistent with industry and academic standards, we benchmark our models on all six Open LLM Leaderboard v2 tasks, using the same configurations as the published Huggingface leaderboard, to assess our model’s problem-solving and instruction-following abilities.
Like the published leaderboard, we perform task-specific score normalisation to scale all scores between 0–100 and account for varying baselines. A score of 0 can be interpreted as the same or worse than random guessing, while a score of 100 indicates perfect answers.
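The normalisation amounts to rescaling each task so its random-guessing baseline maps to 0 and a perfect score maps to 100, clipping below at 0 (my reading of the leaderboard's published procedure; exact per-task bounds vary):

```python
def normalise(score: float, random_baseline: float, max_score: float = 100.0) -> float:
    """Rescale a raw benchmark score so that random guessing maps to 0
    and a perfect score maps to 100, clipping negative values to 0."""
    scaled = (score - random_baseline) / (max_score - random_baseline) * 100
    return max(0.0, scaled)

# A 4-way multiple-choice task has a 25% random baseline:
normalise(25.0, 25.0)   # -> 0.0  (no better than chance)
normalise(62.5, 25.0)   # -> 50.0
```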
Results
In this section, we discuss our main fine-tuning results accompanied by our ablation studies. While we focus on comparing our fine-tuned models against the baseline SEA-Lion-v2.1-Instruct, we also look at other factors like data mixture, preference dataset structure and DPO. Overall, we find that a combination of SFT+KTO using both symmetric and asymmetric preference data performs the best.
Experiment Preface: LoRA Rank
As a precursor to our comparisons, we first identified the best fine-tuning parameters to use. We performed parameter efficient fine-tuning (PEFT) using SFT with different LoRA ranks to identify the best configuration. For simplicity and following common practice, we set rank = alpha for every experiment.
- Rank: Controls the number of trainable parameters which, intuitively, increases learning capacity. We test rank = {16, 32, 64, 128}.
- Alpha: Alpha scales the influence of the additional LoRA parameters on existing LLM weights.
Our experiments indicate that higher rank performs better on our local safety benchmark without overfitting, so we omit a detailed review of results. We opt for rank 128 in all subsequent runs across SFT, KTO and DPO.
Baseline Comparison: Llama3–8B-Instruct vs SEA-Lion-v2.1-Instruct vs SFT
First, we compare our SFT model, SFT, with the base SEA-Lion-v2.1-Instruct and include Llama3–8B-Instruct for reference. Because SEA-Lion-v2.1-Instruct is a finetuned version of Llama3–8B-Instruct that has not undergone safety alignment, we expect its safety performance to be worse.
As seen in Fig 6, SFT scores significantly better than the baselines. On unsafe local prompts, only 9.8% of responses were classified as toxic compared to 50.5% on SEA-Lion-v2.1-Instruct and 47% on Llama3–8B-Instruct (top right). Similarly, 98.5% of unsafe local prompts were rejected, compared to 9.3% on SEA-Lion-v2.1-Instruct and 15.6% on Llama3–8B-Instruct (top left).
These improvements lead to just a marginal increase in false positives. We see that SFT falsely rejects 1.2% of safe prompts compared to 0.2% for SEA-Lion-v2.1-Instruct and 0.6% for Llama3–8B-Instruct (bottom left).
Finally, safety improvements for SFT also translate into more general safety improvements on the TOXIGEN benchmark (bottom right). Compared to SEA-Lion-v2.1-Instruct’s toxicity rate of 19.5% on unsafe TOXIGEN prompts and Llama3–8B-Instruct’s 16.3%, SFT only has a toxicity rate of 9.8%.
KTO: Baseline vs KTO Variations
We ran numerous configurations but for simplicity, only include results for 2 configurations, namely:
- KTO: KTO with rank 128 LoRA on SEA-Lion-v2.1-Instruct, without SFT
- SFT+KTO: KTO with rank 128 LoRA on the rank 128 SFT of SEA-Lion-v2.1-Instruct
Comparing SFT and KTO, we see that KTO alone works reasonably well but is outclassed by SFT in all toxicity metrics for local content (Fig 7; top left and top right). While safety gains for KTO are also generalizable, TOXIGEN performance is still better with SFT.
Notably, SFT+KTO performs the best. SFT+KTO rejection rate on unsafe local content improves to a near perfect 99.6% from 98.5% on SFT (top left), toxicity rate on TOXIGEN improves to 5.9% from 9.8% on SFT (bottom right), and rejection rate on safe local content also improves very marginally (bottom left). Taken together, these results suggest that KTO does indeed induce meaningful learning for safety alignment. As we see in the next section, combining KTO with SFT also comes at arguably no cost to model performance.
Open LLM Leaderboard v2 Benchmarks
Finally, we examine the performance of our fine-tuned models on the Open LLM Leaderboard v2 benchmarks relative to SEA-Lion-v2.1-Instruct.
Fig 8 shows the % difference in performance relative to SEA-Lion-v2.1-Instruct on each task and the simple average across all tasks. Focusing on the average difference, SFT and SFT+KTO see a minor performance decline of 2.9% while KTO, surprisingly, improved by 2.1%.
We stress that these benchmarks aim to provide a general assessment of model capabilities and, in the context of finetuning, to identify any signs of catastrophic forgetting. These are controlled, artificial tests, so a performance change of -2.9% or +2.1% may have little to no practical impact on real-world performance. For instance, IFEval experienced the most significant performance drop, but it measures a narrow subset of instruction-following behaviors, such as word count, paragraph structure, and keyword usage, without assessing response quality. Contrasted against the significant improvements to local and general safety, variations in benchmark performance should not be a major concern unless specific behaviors are critical to deployment.
Analysis of Model Divergence
As a final sanity check of model performance, we also want to understand how our model responses have semantically shifted after the fine-tuning process.
To approximate this, we calculate the cosine similarity between the original SEA-Lion-v2.1-Instruct response and the fine-tuned model response on our TOXIGEN and local content benchmarks for SFT, KTO and SFT+KTO. We do so using the text-embedding-3-large embedding model. For our local content benchmark we sampled 2000 responses, and for TOXIGEN we used all the samples in our benchmark.
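The pairwise comparison itself is straightforward once both responses are embedded. The sketch below uses random stand-in vectors; in practice the embeddings would come from text-embedding-3-large via the OpenAI API (3072 dimensions for that model).

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins for the base-model and fine-tuned-model response embeddings.
rng = np.random.default_rng(0)
base_emb, tuned_emb = rng.normal(size=3072), rng.normal(size=3072)
sim = cosine_similarity(base_emb, tuned_emb)
```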
These scores provide a slightly different perspective of our results. Overall, our fine-tuned models produce responses that are semantically similar to SEA-Lion-v2.1-Instruct on safe prompts, and semantically dissimilar on unsafe prompts. Our best model, SFT+KTO, was the most dissimilar on unsafe prompts but still relatively close on safe prompts. We hypothesise that the inclusion of the original SEA-Lion-v2.1-Instruct responses in the training data allowed us to minimise model divergence on safe content.
Additional Results
Data Mixture: Baseline vs Balanced vs Imbalanced
Data mixture is often an important factor at each phase of LLM training. In pre-training Llama3, for example, researchers performed multiple experiments to establish scaling laws for determining the right data mix. Additionally, they used a dynamic data mix that gradually increased representation of high quality sources as training progressed.
More comprehensive studies are beyond our scope, but we experimented briefly with data mixture by comparing our balanced dataset with one comprising 30% unsafe and 70% safe samples. For brevity, we only include rejection rate on safe and unsafe prompts. In short, we did not observe data mixture to have a significant impact on performance in our toxicity benchmarks.
DPO
In the previous section, we discussed how KTO affords more flexibility than DPO because it does not require preference data. However, it is worth noting that safety alignment entails some natural preference pairs — for every unsafe prompt in our dataset, we have a corresponding positive response (the GPT-4o generated rejection) and a negative response (SEA-Lion-v2.1-Instruct’s original response). We call these preferences paired, because we have both positive and negative responses. However, safe prompts are unpaired, because they do not have a corresponding negative response, since they are not unsafe to begin with.
Since DPO requires paired preferences, the most straightforward implementation entails creating a dataset using just our unsafe prompts. More specifically, DPO takes as input a prompt, a positive response, and a negative response.
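Building this dataset from our unsafe triples can be sketched as below. The record shape (`prompt`, `chosen`, `rejected`) matches the paired-preference format that DPO trainers such as TRL's `DPOTrainer` expect; the example data is illustrative.

```python
def to_dpo_examples(unsafe_triples):
    """Build paired-preference records in the {prompt, chosen, rejected}
    format that DPO training expects. Safe prompts are necessarily excluded,
    since they have no negative response to pair."""
    return [
        {"prompt": prompt, "chosen": safe, "rejected": unsafe}
        for prompt, unsafe, safe in unsafe_triples
    ]

dpo_data = to_dpo_examples(
    [("Insult <group>.", "<original unsafe response>", "I won't produce insults...")]
)
```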
For brevity, we compare the performance of SFT+DPO and SFT+KTO on just our Singapore content benchmark.
While SFT+DPO successfully rejects unsafe prompts at roughly the same rate as SFT+KTO, it has significantly higher false positives (Fig 13). In our original formulation, we hypothesised that including safe prompts (and their original response) would ensure that we only steer model responses on unsafe prompts. Applying DPO, where paired preferences necessarily exclude these safe prompts, we observe that our hypothesis was indeed correct — the model overfits our unsafe prompt data and wrongfully rejects a large proportion of safe prompts.
One possible solution to improving DPO would be to synthetically generate rejections, creating paired preferences for safe prompts as well. However, this entails a rather counterintuitive and non-trivial step of generating accurate rejections, which KTO can skip altogether.
Paired vs Unpaired Preference Data
Taking our DPO experiments one step further, we want to directly test the impact of the inclusion of unpaired, safe prompts in the training data. To do so, we perform KTO on SFT, but unlike SFT+KTO, we only include unsafe prompts in the training data. For each unsafe prompt, we include both the positive and negative responses to simulate the usage of paired preferences only. We term this model SFT+KTO (symmetric).
Focusing on our Singapore content benchmarks again (Fig 14), we see that SFT+KTO (symmetric) performs extremely similarly to SFT+DPO, producing a large number of false positives. We consider this further evidence in support of our decision to utilise both paired and unpaired preferences for KTO, and for the general suitability of KTO for safety alignment over alternatives like DPO.
Discussion
Is RLHF necessary for safety alignment?
While combining KTO and SFT produced the best performance on safety benchmarks with no additional declines in performance over SFT, it is not clear whether KTO is absolutely necessary for safety alignment. Here are several points to consider:
- Engineering Complexity: KTO adds a layer of complexity to the development process, both in terms of data preparation and model training. While SFT offers more established open-source support and is generally better understood, through this project we can attest to the fact that newer approaches like KTO are more experimental. With limited resources, it is important to consider whether the incremental gains from KTO are worth the added effort.
- Compute Costs: In a similar vein, KTO is significantly more expensive than SFT. By our estimates, it is at least 10x more expensive, meaning that the marginal performance improvement per unit of compute significantly falls off after SFT. This increase in cost is the result of a more complex training algorithm and greater memory requirements under KTO.
- Guardrails: We emphasise again that LLMs are typically deployed as part of a larger system with input and output guardrails that moderate content entering and exiting the LLM layer. As such, our benchmark toxicity and rejection rates are an approximate ceiling on unsafe content, and actual rates could be even lower. The important question for developers is whether the improvements to safety from KTO are simply achievable through other means and whether the added cost justifies the benefit.
- Answer Quality: Rounding off this discussion, we acknowledge our evaluations thus far are primarily based on tractable safety metrics. However, preference data often reveals more nuanced attitudes than a simple ‘safe’ or ‘unsafe’ classification. For example, we might express a preference between two ‘safe’ responses, with one being more informative than the other. In such a scenario, RLHF is a more appropriate training algorithm than SFT as it better captures these complexities, but metrics and evaluations must also be able to capture these differences. We leave this as a potential direction for future research.
Conclusion
In this article, we provided detailed coverage of our safety alignment process from start to finish. Using a proprietary dataset of safe and unsafe Singaporean data, synthetic data generation techniques, and a combination of supervised fine-tuning + KTO, we successfully improved performance on our Singapore content benchmarks by 99% and general safety by more than 50%. We also find that our fine-tuned models see minimal declines in performance as measured by Open LLM Leaderboard v2 tasks.
Our additional experiments also shed light on best practices and promising future work in the area of safety alignment and, more generally, fine-tuning. We observe that for localised safety alignment, higher LoRA rank improves performance without overfitting and data mix has a minimal impact on our supervised fine-tuning. Finally, we find that combining KTO and a dataset comprising both paired and unpaired preferences performs best, while DPO implicitly enforces an oversimplified objective that is counterproductive to performance.