MetaEvaluator: Systematically Evaluate Your LLM Judges

Measure how well your app is performing and, more importantly, where it's failing.


The Case for Evals

When you’re building LLM applications, evaluations (or “evals”) are critical. They help measure how well your application is performing and, more importantly, where it’s failing and why. Good evals enable you to understand where your system falls short, fix it, and make meaningful improvements!

The thing is, doing evals at scale is hard. An ideal approach is to collect a set of LLM responses and get human annotators to evaluate all the responses. This works for a small evaluation set, but it becomes impractical when scaling up to hundreds or thousands of responses.

This is where LLM-as-a-judge comes in. We can scale evals by using an LLM to assess and ‘judge’ the response in place of humans. This, however, comes with its own set of challenges. For one, your task could be rather subjective — it then becomes more difficult to find a judge who can match human judgment. It also sounds notoriously circular to have LLMs evaluate LLMs (and research shows that LLMs often favour their own generations and exhibit sycophantic behaviours at times).

“Each layer of evaluation introduces its own potential errors and biases, which then compound through the system in unpredictable ways.”

We discussed this issue in a previous article: Validating Annotation Agreement

Instead of adding yet another layer of evaluation or selecting an LLM judge based on vibes, we choose to identify the judge that best aligns with our own judgment on our tasks.

For our own work in AI safety, we use LLM judges for various evals, be it toxicity detection, factuality checking, or safety rejection evaluation. To select the best LLM judge for our use case, we measure the judge’s alignment with human annotations. The workflow typically looks like this: run the candidate judges on a sample of responses, collect human annotations on the same responses, then compare the two and pick the judge that agrees best with the humans.
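As a toy illustration (independent of MetaEvaluator’s API, with made-up labels), the core comparison is just measuring how often each judge’s labels match the human labels:

```python
# Hypothetical "rejection" verdicts from humans and two candidate judges
human = ["yes", "no", "no", "yes", "no"]
judges = {
    "judge_1": ["yes", "no", "no", "yes", "yes"],
    "judge_2": ["no", "no", "yes", "yes", "no"],
}

def agreement(judge_labels, human_labels):
    """Fraction of examples where the judge matches the human annotation."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

for judge_id, labels in judges.items():
    print(f"{judge_id}: {agreement(labels, human):.2f}")
```

In practice you would want more nuanced metrics than raw agreement (covered in Step 3), but this is the essence of the alignment check.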

While this process is straightforward, we realised each project owner was implementing this on their own, which led to duplicative and inconsistent work.

A Tool for Judge Evaluation!

This is what led us to develop MetaEvaluator, a tool that works across all your LLM judge evaluation use cases.

Check it out here: GitHub | Documentation

Essentially, MetaEvaluator helps you to…

  1. Configure and manage multiple LLM judges with LiteLLM integration
  2. Collect human annotations through a simple, deployable Streamlit interface
  3. Compute alignment metrics and generate visualisations and reports

Let’s dive into each key feature. For a more detailed tutorial and customisation options, visit the documentation here.

Let’s begin by installing the package.

# Requires Python 3.13+
pip install git+https://github.com/govtech-responsibleai/meta-evaluator.git

(If you already have existing LLM judge outputs or human annotations, load your results directly and skip to Step 3!)

1. Configure and run your LLM judges

Firstly, define your evaluation task and data.

from meta_evaluator.data import DataLoader
from meta_evaluator.eval_task import EvalTask

# Load your data
data = DataLoader.load_csv(name="my_evaluation", file_path="my_data.csv")

# Evaluate chatbot responses for whether it is a rejection
task = EvalTask(
    task_schemas={
        "rejection": ["yes", "no"],  # Classification task
        "explanation": None,         # Free-form text
    },
    prompt_columns=["user_prompt"],         # Original prompt to the LLM
    response_columns=["chatbot_response"],  # LLM response to evaluate
    answering_method="structured",          # Use structured output parsing
    structured_outputs_fallback=True,       # Fallback to instructor or XML
)

Create a judges.yaml file with your desired judge configurations. (Check the LiteLLM documentation for the complete provider list and model naming conventions, i.e., llm_client and model.)

judges:
  - id: judge_1
    llm_client: openai
    model: gpt-5-mini
    prompt_file: ./prompt_v1.md

Put it all together to run your LLMs on your evaluation task and data. Results are automatically saved to the project directory (project_dir).

from meta_evaluator.meta_evaluator import MetaEvaluator

# Initialise evaluator
evaluator = MetaEvaluator(project_dir="project_dir")
evaluator.add_data(data)
evaluator.add_eval_task(task)

# Load and run judges
evaluator.load_judges_from_yaml(
    yaml_file="judges.yaml",
    async_mode=True,
)
evaluator.run_judges_async()

2. Spin up an interface to collect human annotations

In a separate script, load your evaluation data and task and launch the annotation interface.

# After loading your data and task, launch annotation interface
evaluator.launch_annotator(port=8501)

This runs a Streamlit app locally.

Access the app via the link printed in your CLI.
An example of the annotation interface.

To host it on the cloud via a public URL, try Docker deployment on your hosting provider or ngrok tunneling.
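For the Docker route, a minimal image might look like the following sketch (the entrypoint script name `annotate.py` is a placeholder for whichever script calls `launch_annotator`; adapt it to your hosting provider’s conventions):

```dockerfile
FROM python:3.13-slim
WORKDIR /app
RUN pip install git+https://github.com/govtech-responsibleai/meta-evaluator.git
COPY . .
EXPOSE 8501
CMD ["python", "annotate.py"]
```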

3. Compute alignment metrics

Once you have both the judge responses (Step 1) and human annotations (Step 2), score the judge responses with your desired metrics!

from meta_evaluator import MetaEvaluator
from meta_evaluator.scores import MetricConfig, MetricsConfig
from meta_evaluator.scores.metrics import ClassificationScorer

evaluator = MetaEvaluator(project_dir="project_dir", load=True)

# Configure one or more metrics
accuracy_scorer = ClassificationScorer(metric="accuracy")

config = MetricsConfig(
    metrics=[
        MetricConfig(
            scorer=accuracy_scorer,
            task_names=["rejection"],
            task_strategy="single",
        ),  # Add more here!
    ]
)

# Add metrics to evaluator and run comparison
evaluator.add_metrics_config(config)
evaluator.compare_async()

With the tool, existing capabilities developed from specific use cases (e.g., measuring toxicity alignment with the Alt-Test methodology) can now be applied to other use cases! Charts and tables are automatically saved to the project directory.

What you can expect to see after running evaluations

In Practice

Example 1: Safety Rejection Evaluation

In this mock example, let’s evaluate the best judge for whether an LLM accurately rejects/engages with the user when faced with unsafe prompts. This is especially critical for safety testing LLM applications, since we need to assess whether the LLM’s response engages with the user’s prompt or is a clear refusal.

After running Step 1 above, here’s a formatted example of what the dataset looks like and what different Judges might output.

First two rows of the dataset and sample judge outputs

From the intermediate judge evaluations, we see discrepancies between the judge outputs. judge_2 doesn’t seem to agree with the other judges on certain examples… Let’s test which judge aligns better with human annotators by running Steps 2 & 3 above.

For a more comprehensive view of alignment, we can use the agreement metrics (Accuracy, Cohen’s Kappa, Alt-Test) on the rejection classification label and the text/semantic similarity metrics for the explanation.
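As a refresher on why Cohen’s Kappa is a stronger signal than plain accuracy, it corrects raw agreement for the agreement you would expect by chance. A standalone sketch (independent of MetaEvaluator, with made-up labels):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's Kappa between two annotators' label lists."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # Expected agreement if both labelled independently at their own base rates
    p_e = sum((ca[c] / n) * (cb[c] / n) for c in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

human = ["yes", "no", "no", "yes", "no", "yes"]
judge = ["yes", "no", "yes", "yes", "no", "no"]
print(round(cohens_kappa(human, judge), 3))  # -> 0.333
```

Here raw agreement is 4/6 ≈ 0.67, but after discounting chance agreement (0.5 for two balanced binary annotators), Kappa drops to 0.33, a much less flattering picture of the judge.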

Reports of all judges across all metrics:

CLI output
Same table on the browser
In this example, `judge_4` and `judge_6` have relatively good human alignment scores.

From the alignment metrics, judge_2 has relatively lower alignment scores than the other judges, which matches the “vibe check” earlier. With these findings, we can decide whether to:

  • Choose only the top-performing judge for your use case
  • Use a jury of judges and exclude the worst-performing ones (like judge_2)
  • Investigate why judge_2 performs poorly and improve or replace it

By making alignment easy to analyse, teams can make better decisions about which judges to trust and scale their evals more quickly and accurately.

Example 2: Evaluating your BTT Chatbot

This time, you want to evaluate whether your deployed BTT (Basic Theory Test) chatbot is giving factually accurate responses.

Previously, you’d collect production logs, then manually label whether the chatbot’s responses were factually correct. You realise this is too costly, and make plans to scale up your evaluation with an LLM judge instead.

Since you already have past production logs and ground truth labels, all you want is to find an LLM judge that best “mimics” your judgment. Here’s where MetaEvaluator can help:

  1. Load your existing production data, consisting of the chatbot prompt + response + ground_truth.

# Load MetaEvaluator
evaluator = MetaEvaluator(project_dir="project_dir")

# Load data and task
data = DataLoader.load_csv(file_path="path/to/ground_truth_results.csv")
task = EvalTask(
    task_schemas={"factually": ["Factual", "Not factual"]},
    prompt_columns=["prompt"],
    response_columns=["response"],
)
evaluator.add_data(data)
evaluator.add_eval_task(task)

# Load annotations
evaluator.add_external_annotation_results(
    file_path="path/to/ground_truth_results.csv",
    annotator_id="my_id",
    run_id="my_run_id",
)

First row of ground_truth_results.csv

2. Then, evaluate the candidate judges on the same examples:

evaluator.load_judges_from_yaml(
    yaml_file="judges.yaml",
    async_mode=True,
)
evaluator.run_judges_async()

Different Judge outputs for the first row

As expected, different judges have different thoughts on your chatbot’s performance…

3. Using ClassificationScorer (as there is only 1 ground truth label here), let’s compare the judge outputs to your ground truth labels.

config = MetricsConfig(
    metrics=[
        MetricConfig(
            scorer=ClassificationScorer(metric="accuracy"),
            task_names=["factually"],  # matches the task schema defined above
            task_strategy="single",
        )
    ]
)
evaluator.add_metrics_config(config)
evaluator.compare_async()

MetaEvaluator generates a comparison report showing how each judge performs against your ground truth. In this example, if Judge 1 shows higher accuracy than Judge 2, you now have data-driven evidence to deploy Judge 1 for your production evals.

Using your existing evaluation dataset and labels, you can constantly benchmark new judges and prompt configurations for alignment to ensure the best judge for your use case is selected.

Easily benchmark new judges!
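For instance, benchmarking a new model or a revised prompt can be as simple as adding entries to judges.yaml and rerunning Steps 1 and 3 (the model names and prompt files below are illustrative):

```yaml
judges:
  - id: judge_1
    llm_client: openai
    model: gpt-5-mini
    prompt_file: ./prompt_v1.md
  - id: judge_1_new_prompt   # same model, revised prompt
    llm_client: openai
    model: gpt-5-mini
    prompt_file: ./prompt_v2.md
```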

Once you’ve selected the best judge, you can use that judge on any new data that is processed by your chatbot. Yay to scalable evals!

It’s a start!

Try It Out

MetaEvaluator aims to help you make better decisions about which LLM judges to deploy. Instead of rebuilding evaluation infrastructure for each project, you can focus on the more important stuff: choosing the judge that works best for your use case.

Try it out and let us know what you think! We’d love to hear how you’re using it and what features would be most helpful for your evals.

What’s Next

We’re planning upcoming improvements to the tool and are gathering feedback as we speak:

  • Improved score reporting with more dimensions of performance and better charts
  • Additional evaluation metrics to support other important judge characteristics (e.g., bias, consistency, etc.)
  • Support for evaluating jury performance, where multiple judges vote or reach a consensus on decisions
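Jury-style aggregation is not part of the tool yet; conceptually, a majority vote over judge verdicts could look something like this sketch:

```python
from collections import Counter

def majority_vote(verdicts):
    """Return the most common verdict among a jury of judges.

    Ties are broken by whichever verdict Counter encounters first.
    """
    return Counter(verdicts).most_common(1)[0][0]

print(majority_vote(["yes", "yes", "no"]))  # -> yes
```

An open design question for jury evaluation is whether to score the aggregated verdict against human labels, or to weight each judge's vote by its individual alignment score.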

Got suggestions or feature requests? Contact us or open an issue on GitHub; we welcome contributions and discussions!