Validating Annotation Agreement between Humans and LLMs
Who Judges the Judge?
At GovTech’s AI Practice, we’ve been embracing what’s known as “LLM-as-a-judge” — essentially employing LLMs as evaluators across our AI workflows. This approach has become a powerful tool in our evaluation toolkit.
We use LLMs extensively across multiple areas: judging other LLM outputs (e.g. rejection detection when models refuse to respond), building factuality and faithfulness evaluators, and as guardrails for LLM applications. Even data labelling tasks can be framed as an evaluation problem — the LLM is essentially evaluating text and providing labels.
And honestly, it makes sense. LLMs are fast, scalable, and cost-effective. Plus, they’re consistent in their application of criteria, unlike humans who might have varying energy levels or interpretations on different days.

But here’s the hard question: How do we know if our LLM judges are actually good judges?
Validation is Difficult
When using LLMs as judges, we have to consider a lot of factors that may mess with the results. To name a few from recent studies (non-exhaustive!):
- Positional bias — They may favour options that appear first or last
- Authority bias — Confident-sounding information could sway them
- Style bias — They may prefer longer texts over shorter ones
- Self-preference — Models from the same family tend to rate their own outputs higher
- Inherent stochasticity — The same input can produce different outputs
- Sensitivity to prompting — Small changes in your system prompt can dramatically change the response

The validation challenge becomes trickier when you’re dealing with subjective tasks, where there’s no clear “ground truth” to compare against. Content moderation illustrates this well: some comments clearly cross the line, but many fall into a grey area — what one person sees as offensive, another may view as banter.
To validate LLMs, one may then use another LLM to validate the first. But this creates a new problem: each layer of evaluation introduces its own potential errors and biases, which then compound through the system in unpredictable ways. For example, suppose we have an LLM with a 5% error rate, and we detect its errors using another LLM that also has a 5% error rate. The combined accuracy is then 0.95 × 0.95 = 0.9025, meaning the compounded error rate rises from 5% to approximately 9.75%. Rather than reducing uncertainty, each additional validation step with an imperfect model significantly increases it.
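To make the compounding concrete, here is a minimal sketch (the 5% per-layer error rate is the hypothetical figure from above, and the independence of errors across layers is an assumption):

```python
def compounded_error(per_layer_error: float, layers: int) -> float:
    """Overall error rate when stacking independent, equally imperfect validators."""
    combined_accuracy = (1 - per_layer_error) ** layers
    return 1 - combined_accuracy

print(round(compounded_error(0.05, 1), 4))  # 0.05   — a single LLM judge
print(round(compounded_error(0.05, 2), 4))  # 0.0975 — judge + LLM validator
```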

And when does it end?
At what point do we have sufficient confidence that our evaluation chain is robust?
Reframing the Problem
Instead of trying to find some absolute measure of “correctness”, we decided to reframe the objective:
To what extent do LLMs concur with human annotations?
This shift in perspective led us to explore statistical inference and hypothesis testing — tools that are less arbitrary than agreement rates or “vibe checks”.
In our research, we came across this approach called the Alternative Annotator Test (Alt-Test), which proposes a systematic way to justify the use of LLMs as a better alternative to human annotators. The basic idea is simple: if an LLM can consistently align with human annotations, we can trust its abilities to model human judgements.

The framework allows you to benchmark an LLM against real annotators using just a modest sample (at least 30 instances) and at least three humans. It works as follows:
1. Leave-one-out:
From a group of at least three annotators, exclude a single human and treat the remaining humans as the ‘collective distribution’.
Then compare how closely the target LLM’s labels align with the remaining humans versus how the excluded human aligns with the same group.
2. Hypothesis testing:
For each excluded human j, compute two probabilities:
- pᶠⱼ: the chance the target LLM does at least as well as the excluded human j
- pʰⱼ: the chance the excluded human j does at least as well as the target LLM
Then run a t-test on pᶠⱼ − pʰⱼ (how much more often the target LLM aligns with the remaining humans than the excluded human j does), testing whether it exceeds a small cost-benefit margin ε.
(ε penalises pʰⱼ to reflect the higher cost that comes with human annotations.)

3. Since we’re conducting multiple hypothesis tests, control the false positives with a False Discovery Rate (FDR) procedure such as the Benjamini–Yekutieli (BY) procedure. This sets p-value thresholds at a target FDR level (0.05 in our case), ensuring that, on average, no more than 5% of LLM “wins” are false positives.
4. Aggregate the hypothesis testing results and obtain 2 key metrics:
- Winning Rate (ω): The percentage of excluded humans j that the target LLM “wins” (i.e., the null hypothesis is rejected).
If ω ≥ 0.5, you now have some statistical evidence that the target LLM is a better alternative to recruiting costly human annotators.

- Average Advantage Probability (AP): the average of pᶠⱼ (the chance the target LLM does at least as well as the excluded human j) across all excluded humans.
This measures how often target LLMs align with the remaining humans, without any statistical advantage given to the LLMs, and is used to directly compare LLMs.
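To make steps 3 and 4 concrete, here is a minimal sketch of the BY correction and the two aggregate metrics. The p-values and advantage probabilities below are hypothetical placeholders; in practice they come from the per-human t-tests in step 2.

```python
import numpy as np

def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a boolean mask of rejected null hypotheses under the BY procedure."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    c_m = np.sum(1.0 / np.arange(1, m + 1))   # harmonic correction for arbitrary dependence
    order = np.argsort(p)
    thresholds = np.arange(1, m + 1) * alpha / (m * c_m)
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])      # largest rank still under its threshold
        rejected[order[: k + 1]] = True
    return rejected

# Hypothetical results for six excluded humans
p_values = [0.002, 0.010, 0.030, 0.200, 0.450, 0.040]
advantage_probs = [0.81, 0.76, 0.70, 0.55, 0.48, 0.66]  # pᶠⱼ for each excluded human

rejected = benjamini_yekutieli(p_values, alpha=0.05)
winning_rate = float(rejected.mean())                  # ω: fraction of humans the LLM "wins"
avg_advantage_prob = float(np.mean(advantage_probs))   # AP
```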

To put this framework to the test, we applied it to multi-label toxicity detection, a task that’s both extremely subjective and highly relevant to our work. It was crucial for us to find out which LLMs could reliably identify toxic content in our local context, where cultural and language variations make the task especially challenging.
Applying this to Content Moderation Annotation
Our setup was straightforward: we fixed the system prompt and used zero-shot prompting for each LLM across a consistent set of texts sampled from an internal Singlish toxicity dataset. Our sample size was N=50, following the authors’ recommendation of N>=30 for statistical validity. Our dataset featured six toxicity labels (full definitions are provided in the Annex).

We recruited six annotators from within our organisation who were familiar with the taxonomy, and tested six then-leading open- and closed-source LLMs to see how they compared:
- OpenAI's o3-mini (at all three reasoning efforts: high, medium and low)
- Gemini 2.0 Flash
- Llama 3.3 70B
- Mistral Small 3 (2501) 24B
- Claude 3.5 Haiku
- Amazon Nova Lite
The framework works with any number of LLMs and human annotators, as long as you have at least three human annotators.
Our Modifications
While the original Alt-Test provided a solid foundation, we adapted it for our specific needs with three key modifications. The core implementation is based on the authors’ original code from their GitHub repository, to which we made the modifications described below:
1. Adding Multi-label Metrics
The original framework used accuracy for classification tasks; we added three metrics that capture partial correctness in multi-label scenarios, allowing for more robust evaluation.
- Set-based Jaccard similarity — computes overall overlap across both predicted and true labels
# Simple set-based Jaccard similarity
def simple_jaccard_similarity(pred: List[str], annotations: List[List[str]]) -> float:
    jaccard_scores = []
    for ann in annotations:
        # If both are empty, count as perfect agreement
        if not pred and not ann:
            jaccard_scores.append(1.0)
            continue
        intersection = len(set(pred) & set(ann))
        union = len(set(pred) | set(ann))
        jaccard_scores.append(intersection / union)
    return float(np.mean(jaccard_scores))
- Macro Jaccard Similarity — computes Jaccard Index for each label before averaging
# Macro-averaged Jaccard similarity
# (pred and ann are binarised into per-label indicator vectors before scoring)
def jaccard_similarity(pred: List[str], annotations: List[List[str]]) -> float:
    jaccard_scores = []
    for ann in annotations:
        jaccard_scores.append(jaccard_score(y_true=ann, y_pred=pred, average="macro"))
    return float(np.mean(jaccard_scores))
- Hamming Similarity — computes the fraction of positional similarity
# Hamming similarity
def hamming_similarity(pred: List[str], annotations: List[List[str]]) -> float:
    hamming_scores = []
    for ann in annotations:
        hamming_scores.append(1 - hamming_loss(y_true=ann, y_pred=pred))
    return float(np.mean(hamming_scores))
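As a quick worked example of the set-based variant (with hypothetical labels): if the LLM predicts {Hate, Insults} against two annotators who gave {Hate} and {Hate, Insults}, the per-annotator Jaccard scores are 0.5 and 1.0, averaging to 0.75.

```python
import numpy as np

pred = ["Hate", "Insults"]                     # hypothetical LLM prediction
annotations = [["Hate"], ["Hate", "Insults"]]  # two hypothetical human annotations

scores = []
for ann in annotations:
    inter = len(set(pred) & set(ann))
    union = len(set(pred) | set(ann))
    scores.append(inter / union)

similarity = float(np.mean(scores))  # (0.5 + 1.0) / 2 = 0.75
```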
2. Making Epsilon (ε) More Intuitive
The original paper experimented with different ε values and provided recommendations based on use case (e.g., setting it to 0.2 for expert annotations and 0.1 for crowdsourced annotations). However, we found this arbitrary number difficult to interpret and couldn’t tell what the best ε was for our task.

Instead of an abstract, absolute margin, we use a relative fraction that scales with human performance. Similar to the additive ε, the higher the ε, the more advantage is given to the LLM.

The multiplicative epsilon represents a “performance gap” between LLMs and humans — something much easier to interpret and justify. If you set epsilon to 0.1, the target LLM only needs to be 90% as good as the excluded human to “win” that comparison.
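A small sketch of the two margins, using hypothetical advantage probabilities: with ε = 0.1, the additive margin asks whether pᶠ is within 0.1 of pʰ in absolute terms, while the multiplicative margin asks whether pᶠ is at least 90% of pʰ.

```python
# Hypothetical advantage probabilities for one excluded human
p_f, p_h = 0.72, 0.78  # LLM vs. excluded human
eps = 0.1

additive_win = p_f >= p_h - eps              # absolute margin: 0.72 >= 0.68
multiplicative_win = p_f >= (1 - eps) * p_h  # relative margin: 0.72 >= 0.702
```

Note that the relative margin scales with how well the humans themselves perform, which is what makes it read as a "performance gap" rather than an abstract offset.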
3. Human Performance Insights
On top of calculating the LLM’s AP, we also calculated each human annotator’s AP. This extension gives us insights into human annotator consistency.
... original code ...
human_advantage_probs = {}
p_values, advantage_probs, humans = [], [], []
for excluded_h in humans_annotations:
    llm_indicators = []
    excluded_indicators = []
    instances = [i for i in i_set[excluded_h] if i in llm_annotations]
    for i in instances:
        human_ann = humans_annotations[excluded_h][i]
        llm_ann = llm_annotations[i]
        remaining_anns = [
            humans_annotations[h][i] for h in h_set[i] if h != excluded_h
        ]
        human_score = scoring_function(human_ann, remaining_anns)
        llm_score = scoring_function(llm_ann, remaining_anns)
        llm_indicators.append(1 if llm_score >= human_score else 0)
        ############################################################
        ### Modification: Compute human AP here
        excluded_indicators.append(1 if human_score >= llm_score else 0)
        ############################################################
    ############################################################
    ### Modification: Calculate p-value based on epsilon type
    if multiplicative_epsilon:
        diff_indicators = [
            exc_ind - (llm_ind / (1 - epsilon))
            for exc_ind, llm_ind in zip(excluded_indicators, llm_indicators)
        ]
        p = ttest(diff_indicators, 0)
    else:
        diff_indicators = [
            exc_ind - llm_ind
            for exc_ind, llm_ind in zip(excluded_indicators, llm_indicators)
        ]
        p = ttest(diff_indicators, epsilon)
    ############################################################
    p_values.append(p)
    advantage_probs.append(float(np.mean(llm_indicators)))
    humans.append(excluded_h)
    ############################################################
    ### Modification: Save human AP with corresponding LLM AP
    human_advantage_probs[excluded_h] = (
        float(np.mean(llm_indicators)),
        float(np.mean(excluded_indicators)),
    )
    ############################################################
... original code ...
Do LLMs align with humans?
Our results showed that Gemini 2.0 Flash best aligned with our sample of human annotators on the toxicity labelling task.
We measured the winning rate (ω) across different epsilon values to understand how much of an advantage each LLM needed to outperform human annotators. To recap, at:
- ε = 0: The LLM gets no advantage — it has to be as good as humans
- ε = 0.1: The LLM only needs to be 90% as good as the excluded human to “win”, and so on
As mentioned above, we claim that the LLM has an advantage over human annotators if the calculated ω >= 0.5.
Gemini 2.0 Flash was the only LLM that achieved non-zero ω against human annotators for 3 out of 4 metrics at ε = 0 (no advantage given), and ω >= 0.5 at ε=0.05 for all metrics, indicating that it offers a good alternative to using human annotators for our specific toxicity detection task.
By contrast, OpenAI o3-mini (high reasoning) only managed to achieve ω >= 0.5 at ε=0.25, and the smaller LLMs (Llama 3.3 70B and Mistral Small 3 (2501) 24B) remained at ω = 0.
Comparing between LLM Annotators
To compare the agreement magnitude between LLMs, we use the second metric calculated from this procedure, the Average Advantage Probability (AP).
Looking at the AP of each LLM, we can conclude that Gemini 2.0 Flash has the best human alignment, followed by Claude Haiku 3.5, then OpenAI o3-mini (at high reasoning effort).
When testing OpenAI’s o3-mini at the three reasoning effort levels (high, medium, and low), we observed that higher reasoning effort correlated with better alignment with human annotators, suggesting that the additional compute spent on reasoning improves performance on nuanced and subjective tasks like localised toxicity detection.
Analysing Annotator Deviation
We plotted LLM AP against Human AP for each comparison, with the following interpretation:
- X-axis: How well the target LLM agreed with the remaining humans
- Y-axis: How well each excluded human agreed with the remaining humans
- Blue zone (Human AP > LLM AP): excluded human had better agreement than the target LLM
- Red zone (LLM AP > Human AP): target LLM had better agreement than the excluded human
- Grey line: Equal agreement level between target LLM and excluded human
Gemini 2.0 Flash, Claude 3.5 Haiku, and o3-mini (high reasoning) consistently fell in the red zone, meaning they aligned better with the human consensus than individual humans did with each other. By contrast, we saw low agreement between all humans and the smaller LLMs (Llama 3.3 70B and Mistral Small 3 (2501) 24B).
Interestingly, we also found that o3-mini tended to fall on the 45-degree line, suggesting almost equal agreement with the human annotators. In other words, the LLM agrees (and disagrees) with the other humans much as the average human would. We hypothesise this may be due to the alignment process an LLM undergoes during post-training.
Lastly, in this chart, we observe that no single human deviates significantly in terms of agreement with the remaining humans.
Alt-Test vs. Traditional Measures
You may then ask: why choose the Alt-Test over traditional inter-annotator agreement (IAA) metrics or performance metrics (F1/accuracy)? IAA metrics also address the question of how well LLMs concur with humans, and traditional performance metrics provide an estimate of how well the LLM performs when human labels are taken as ground truth.
The authors address this in their FAQs, noting that traditional IAA measures assess agreement among annotators, while their goal is ‘to compare the LLM to the group to determine whether it can replace them’. On the other hand, traditional performance metrics ‘only evaluate whether the LLM matches human performance, not whether it provides a better alternative’. In these aspects, the Alt-Test offers two key advantages. Firstly, it’s actionable. The winning rate provides statistical evidence we can use to justify deploying LLMs to model humans for annotation tasks. Other IAA metrics, while informative, merely give us a descriptive statistic that requires subjective interpretation of what agreement is “good enough”.
Secondly, the Alt-Test analysis is more robust as it captures the variability amongst humans themselves, accounting for the fact that humans disagree with each other.
To put our results into perspective, we also computed a traditional IAA metric (Cohen’s kappa) and found that it remained valuable for validation. Measuring both LLM-human and human-human agreement reinforced our Alt-Test findings: the same three models (Gemini 2.0 Flash, Claude 3.5 Haiku, and o3-mini) emerged as top performers in both frameworks, and the human-human kappa scores also showed that each human had very similar agreement levels with all other humans, confirming no significant annotator deviation.

Limitations and Important Notes:
Task-specific Findings
These results came from our specific use case of Singlish toxicity detection, and we cannot confirm that they generalise to other domains, languages, or annotation tasks. The cultural nuances and linguistic patterns in Singlish create a unique challenge that may not translate to other contexts.
We hence recommend that readers run the original authors’ code on their own datasets to validate LLM performance for their specific use cases.
Annotator Bias and Distribution
Our annotators were all recruited from the same organisation and were well-versed with our toxicity taxonomy. This introduces potential bias in several ways:
- Organisational bias: Annotators share similar training and perspectives
- Expertise bias: Familiarity with the taxonomy may not reflect general population understanding
- Cultural homogeneity: Limited diversity in annotator backgrounds
To obtain more robust validation, future studies should include annotators from different backgrounds to better represent the target user population.
The Risk of Weak Annotations
As the original authors described, there are inherent limitations when human annotators are weak or when there’s high disagreement among humans.
If annotators are weak, the Alt-Test will identify which LLMs align with weak annotators rather than expert judgment. This is an inherent limitation in any human validation approach. The best we can do is to carefully select expert annotators and control the annotation setup to obtain high-quality ground truth data. Nonetheless, analysis of annotator deviation may help to identify weak annotators, if most annotators have good judgement.
In addition, when humans significantly disagree among themselves, it creates high variance in the results, and LLMs are unlikely to achieve high win rates in such scenarios. In these cases: (i) higher epsilon values can be applied to give LLMs more advantage; or (ii) more human annotators may be needed to establish clearer consensus.
Moving Forward
Despite these limitations, the Alt-Test framework provides a principled approach to LLM validation that is a systematic alternative to intuition-based (or vibe-based) selection. The key is to:
- Acknowledge the constraints of your specific use case
- Select representative annotators who match your target population
- Validate findings across different datasets and contexts
We recommend this framework as it gives us more confidence in our LLM choices for our use case before we scale up our annotations or evaluations.

The notebook containing the modifications to the original Alt-Test code can be found here. Feel free to swap in your own data for analysis!
Annex:
Our safety taxonomy for Undesirable Content.
Hate: Text that discriminates, criticises, insults, denounces, or dehumanises a person or group on the basis of a protected identity (e.g., race, religion, nationality, ethnicity, or other protected categories as defined under Singapore law).
Insults: Text that demeans, humiliates, mocks, or belittles a person or group without referencing a legally protected trait. This includes personal attacks on attributes such as someone’s appearance, intellect, behavior, or other non-protected characteristics.
Sexual: Text that depicts or indicates sexual interest, activity, or arousal, using direct or indirect references to body parts, sexual acts, or physical traits. This includes sexual content that may be inappropriate for certain audiences.
Physical Violence: Text that includes glorification of violence or threats to inflict physical harm or injury on a person, group, or entity.
Self-Harm: Text that promotes, suggests, or expresses intent to self-harm or commit suicide.
All Other Misconduct: Text that seeks or provides information about engaging in misconduct, wrongdoing, or criminal activity, or that threatens to harm, defraud, or exploit others. This includes facilitating illegal acts (under Singapore law) or other forms of socially harmful activity.