Evaluating MOE’s SLS Learning Assistant: Using Synthetic Data and LLMs to Benchmark Faithfulness and Factuality

Safer, faster testing of student-facing AI before real-world deployment.

Introduction

In Singapore’s push to harness the power of technology in education, the Ministry of Education (MOE) launched the “Transforming Education through Technology” Masterplan 2030. At the heart of this initiative is the Singapore Student Learning Space (SLS), an online learning platform that empowers students from primary to pre-university levels to learn at their own pace, using content tailored to their needs and interests.

With the rise of Generative AI, MOE took a bold step forward by integrating an LLM-powered Learning Assistant (LEA) into SLS. LEA guides students through self-directed learning using conversational, question-based interactions. But the bar set for the use of AI in education is high, given its interactions with younger learners: Accuracy, factual integrity, and appropriate responses aren’t just nice to have, they’re essential. A hallucinated response isn’t just an error; it risks misinforming students and eroding trust in the tool.

That’s where GovTech’s AI Practice came in. We partnered with MOE to rigorously evaluate LEA before it went live. In this post, I’ll take you through the process: how we used synthetic data and LLMs to test LEA across a wide range of learning scenarios, how we created diverse student-chatbot conversations using different personas, and how we built an evaluation framework which aligns with MOE’s key concerns on using Generative AI.

Learning Assistant (LEA)

Learning Assistant (LEA) in SLS

LEA operates based on a “recipe” system:

  1. Teacher Selects a Recipe: This is a pre-defined prompt template that determines the assistant’s role (e.g., discussion facilitator, perspective builder, ideas generator).
  2. Teacher Adds a Knowledge Base: This is an article (within the LLM’s context window) that serves as the foundational knowledge for the conversation. Conversations between students and the assistant will centre around the knowledge base but, depending on the role the assistant plays, are not strictly limited to its scope.
  3. Teacher Provides Additional Instructions: These instructions refine the assistant’s behaviour and align it with specific learning outcomes.
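As an illustration only (these names are not LEA’s actual schema), the three teacher-supplied inputs can be modelled as a small configuration object that assembles the assistant’s system prompt:

```python
from dataclasses import dataclass

# Illustrative recipe templates; LEA's actual recipes will differ.
RECIPES = {
    "discussion_facilitator": "You facilitate a discussion grounded in the knowledge base.",
    "perspective_builder": "You help the student consider multiple perspectives.",
    "ideas_generator": "You help the student brainstorm ideas on the topic.",
}

@dataclass
class AssistantConfig:
    recipe: str                   # key into RECIPES (the pre-defined prompt template)
    knowledge_base: str           # teacher-provided article text
    extra_instructions: str = ""  # teacher's behavioural refinements

    def system_prompt(self) -> str:
        """Assemble the assistant's system prompt from the three teacher inputs."""
        parts = [RECIPES[self.recipe], f"Knowledge base:\n{self.knowledge_base}"]
        if self.extra_instructions:
            parts.append(f"Additional instructions:\n{self.extra_instructions}")
        return "\n\n".join(parts)

cfg = AssistantConfig(
    recipe="discussion_facilitator",
    knowledge_base="Photosynthesis converts light energy into chemical energy...",
    extra_instructions="Keep answers under 100 words.",
)
```

The point of the sketch is simply that recipe, knowledge base and instructions compose into one prompt, so the same recipe can be reused across subjects by swapping the other two inputs.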

For a deeper dive into LEA, check out this video or visit the SLS website.

Generating Synthetic Student-Chatbot Conversations

Since LEA was in its pre-deployment phase, we lacked real student conversations. To resolve this, we built an LLM-powered pipeline to generate synthetic data, mimicking realistic classroom interactions.

The pipeline does the following, where an LLM instance refers to a system made up of an LLM and a system prompt:

1. Scope Generation

An LLM instance was first prompted to generate data defining the scope of each conversation: the subject, the academic level of the student, and the topic of discussion. Subjects were drawn from those offered in Singapore’s education system (e.g. Maths, Science, English, Social Studies, History, Geography), and discussion topics were then generated for each subject. This allowed us to test LEA’s responses across the diverse range of topics, levels and subjects which students would be engaging the assistant on.

2. Knowledge Base Generation

The scope data generated from the previous stage was then fed into another LLM instance to generate a knowledge base of at least 2000 words, specific to the scope of the discussion. This was done to simulate the actual knowledge bases which the teachers would be using.

3. Conversation Generation

We provided both LEA and a third LLM instance, acting as a simulated student, with the same scope data and knowledge bases. The “student LLM instance” would then engage in a conversation with LEA on a given topic, continuing the dialogue either until it demonstrated sufficient understanding or until a predefined turn limit was reached. To more realistically mimic student behaviour, we created four distinct student profiles, each with specific instructions to guide their interaction style. This allowed us to test LEA’s ability to remain accurate and factual, even when conversations deviated from the norm.

The four simulated student profiles were:

  1. Good students — Stay on-topic and ask thoughtful, relevant questions.
  2. Weak English speakers — Use informal or grammatically incorrect English, including Singlish and shorthand.
  3. Out-of-topic students — Frequently derail the conversation with unrelated questions or tangents.
  4. Counterfactual students — Make incorrect or misleading claims as part of the conversation.

This pipeline generated a dataset of 400 conversations, featuring 5784 student-assistant prompt-response pairs. With this dataset, we can now start our evaluation!
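The three stages can be sketched in plain Python around any chat-completion callable. Everything below is illustrative: the `llm` stub, prompts and persona handling are placeholders rather than the actual pipeline code, and this sketch stops only at the turn limit, whereas the real pipeline could also end once the student demonstrated sufficient understanding:

```python
import json

def llm(system: str, user: str) -> str:
    """Placeholder for a chat-completion call (e.g. to gpt-4o-mini).
    Stubbed with canned replies so the sketch runs offline."""
    if "scope" in system:
        return json.dumps({"subject": "Science", "level": "Secondary 2",
                           "topic": "Photosynthesis"})
    if "2000 words" in system:
        return "Photosynthesis is the process by which green plants convert light energy. " * 250
    if "LEA" in system:
        return "Good question! Chlorophyll reflects green light."
    return "Can you explain why leaves are green?"

def generate_scope() -> dict:
    """Stage 1: pick a subject, academic level and topic."""
    return json.loads(llm("Generate a conversation scope as JSON.", ""))

def generate_knowledge_base(scope: dict) -> str:
    """Stage 2: write an article of at least 2000 words for the scope."""
    return llm("Write a knowledge base of at least 2000 words.", json.dumps(scope))

def generate_conversation(scope: dict, kb: str, persona: str, max_turns: int = 10) -> list:
    """Stage 3: alternate simulated-student and assistant turns up to a limit."""
    history = []
    for _ in range(max_turns):
        student = llm(f"You are a {persona} student.", json.dumps(history))
        assistant = llm(f"You are LEA. Knowledge base: {kb}", student)
        history.append({"student": student, "assistant": assistant})
    return history

scope = generate_scope()
kb = generate_knowledge_base(scope)
convo = generate_conversation(scope, kb, persona="counterfactual", max_turns=3)
```

Swapping the `persona` argument across the four profiles, and the scope across subjects and levels, is what produces the diversity in the 400-conversation dataset.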

Synthetic Conversation Generation Pipeline

gpt-4o-mini was used to generate the synthetic data, paired with different prompts for the different modules. The prompts for synthetic data generation can be found here. We chose a cheaper, less powerful model for this stage to save cost, as data generation does not require high analytical capability.

Evaluation

Metrics

Because LEA is designed to support student learning, the accuracy of its responses is paramount. Our evaluation focused on two key metrics:

  • Faithfulness to Knowledge Base (FKB): Measures how well LEA’s responses align with the teacher-provided knowledge base. This ensures LEA stays within the intended curriculum.
  • Factual Consistency (FC): Measures how consistent LEA’s responses are with trusted external sources of truth. This helps detect misinformation beyond the knowledge base.

Framework

With these metrics defined, we built an evaluation framework that processes LEA’s responses through two scoring components:

  • faithfulness evaluator, which checks alignment with the internal knowledge base.
  • factuality evaluator, which checks consistency with external facts.

These components assign FKB and FC scores to each assistant response for a structured and reliable assessment.

Framework for Faithfulness and Factuality Evaluation Using Synthetic Data

Implementation

Taking reference from the Loki fact-checker, we created a workflow that processes each LEA conversation step by step. Here’s how it works:

  1. Extract Q&A Pairs: We split each conversation into pairs of student’s question (prompt) and LEA’s corresponding answer (response) using their turn numbers.
  2. Claim Decomposition: Each assistant’s response is further broken down into individual claims using an LLM instance. This granularity allows us to evaluate each factual statement independently, rather than judging the response as a whole.
  3. Checkworthiness Detection: Using another LLM instance, we assess whether each claim is check-worthy (i.e., whether it makes a verifiable factual statement). Non-checkworthy claims (e.g., opinions or vague expressions) are excluded from evaluation.
  4. Answerability Classification: We use a third LLM instance to determine if the original question is answerable using only the teacher-provided knowledge base. If answerable, we run a faithfulness evaluation against the knowledge base. If not answerable, we run a factuality evaluation against trusted external sources.
  5. Fallback Factuality Check: For any claim that fails the faithfulness check, we perform a secondary factuality evaluation to see if it may still be factually correct, even if not explicitly supported by the internal knowledge base.

This approach ensures that every claim LEA makes is either grounded in the curriculum or backed by external facts — minimising hallucinations while respecting educational context.
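A minimal sketch of this per-claim routing, with each LLM instance stubbed as an injectable callable (all names are hypothetical; the production workflow is a graph, not a single function):

```python
def evaluate_response(question, response, kb,
                      decompose, is_checkworthy, is_answerable,
                      faithful, factual):
    """Route each check-worthy claim through faithfulness and/or factuality.
    The five callables stand in for the LLM instances described above."""
    results = []
    answerable = is_answerable(question, kb)
    for claim in decompose(response):
        if not is_checkworthy(claim):
            continue  # skip opinions and vague, unverifiable statements
        if answerable:
            if faithful(claim, kb):
                results.append((claim, "FKB_PASS"))
            elif factual(claim):  # fallback factuality check for unfaithful claims
                results.append((claim, "FC_PASS"))
            else:
                results.append((claim, "FAIL"))
        else:
            results.append((claim, "FC_PASS" if factual(claim) else "FAIL"))
    return results

# Offline demo with trivial stub evaluators.
out = evaluate_response(
    "What is photosynthesis?",
    "Plants make food. I think it's neat.",
    kb="Plants make food from light.",
    decompose=lambda r: [s.strip() for s in r.split(".") if s.strip()],
    is_checkworthy=lambda c: not c.startswith("I think"),
    is_answerable=lambda q, kb: True,
    faithful=lambda c, kb: c in kb,
    factual=lambda c: True,
)
```

The stubs make the control flow visible: the opinion is dropped at the checkworthiness gate, and the remaining claim passes faithfulness without needing the factuality fallback.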

The workflow was implemented using LangGraph, with the graph representation shown below. gpt-4o was used as the LLM as we needed the LLM instances to perform analytical tasks this time.

LangGraph Workflow for Faithfulness and Factuality Evaluation

Faithfulness Evaluation

To assess whether LEA’s responses stay aligned with the knowledge base provided by teachers, we used a proven prompting approach, similar to the one used by the Lynx hallucination detector.

The process is straightforward: given a question, a reference document (from the teacher-provided knowledge base), and a claim extracted from LEA’s response, an LLM is prompted to determine whether the claim can be logically inferred from the document. The model then returns a binary outcome: pass if the claim is supported, or fail if it is not.
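A hedged sketch of what such a judge might look like; the prompt wording and the `PASS`/`FAIL` protocol here are illustrative, not the actual prompt we used:

```python
FAITHFULNESS_PROMPT = """\
You are given a QUESTION, a DOCUMENT and a CLAIM.
Determine whether the CLAIM can be logically inferred from the DOCUMENT.
Answer with exactly one word: PASS or FAIL.

QUESTION: {question}
DOCUMENT: {document}
CLAIM: {claim}
"""

def check_faithfulness(judge_llm, question, document, claim) -> bool:
    """Return True iff the judge LLM answers PASS for this claim."""
    verdict = judge_llm(FAITHFULNESS_PROMPT.format(
        question=question, document=document, claim=claim))
    return verdict.strip().upper().startswith("PASS")

# Offline demo: a stub judge that just string-matches the claim.
stub_judge = lambda prompt: "PASS" if "CLAIM: Water boils at 100C" in prompt else "FAIL"
ok = check_faithfulness(stub_judge, "Boiling point?",
                        "Water boils at 100C.", "Water boils at 100C")
```

Constraining the judge to a single-word verdict keeps parsing trivial and the outcome binary, which is what makes the downstream pass/fail aggregation straightforward.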

Factuality Evaluation

Factuality evaluation is inherently more complex than faithfulness evaluation. While the underlying mechanism is similar (i.e. checking if a claim is supported by a context), the context in this case must be retrieved from the open web rather than taken from a fixed knowledge base.

However, web retrieval introduces its own challenges. The quality and relevance of retrieved content can significantly impact the evaluation outcome. Key factors include:

  1. The search engine used
  2. The query formulation strategy
  3. The number of search results considered
  4. Whether we rely on search snippets or fetch full web page content
  5. Whether the search is global or restricted to specific domains

ReAct Agent for Fact-Checking

To handle this complexity, we implemented a ReAct-style agent using LangGraph, designed specifically for factuality evaluation. The agent has access to two tools — both lightweight and free:

  • Web Search Tool — Uses the DuckDuckGo search engine to return top search results and snippets.
  • Web Content Fetcher — Retrieves the full HTML content of a selected web page using requests and BeautifulSoup.

Here’s how the fact-checking agent operates for each claim:

  1. Formulate a query based on the claim.
  2. Execute the search using the Web Search Tool.
  3. Evaluate the snippets to determine if they contain sufficient evidence to verify the claim.
  4. If needed, fetch the full content of the pages using the Web Content Fetcher.
  5. Reassess the retrieved information for adequacy.
  6. If insufficient, reformulate the query and repeat the process until a verdict is reached or a step limit is hit.

Factuality Evaluation Workflow
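The loop above can be sketched as follows, with the search and fetch tools injected as callables so the sketch runs offline. The judge protocol, tool signatures and the query-reformulation strategy are all illustrative; in our setup the tools wrapped DuckDuckGo and requests/BeautifulSoup:

```python
def fact_check(claim, judge, search, fetch, max_steps=3):
    """ReAct-style loop: search, judge the snippets, optionally fetch
    full pages, then reformulate and retry until a verdict or step limit."""
    query = claim
    for _ in range(max_steps):
        results = search(query)                  # list of (url, snippet) pairs
        snippets = " ".join(s for _, s in results)
        verdict = judge(claim, snippets)
        if verdict in ("SUPPORTED", "REFUTED"):
            return verdict
        # Snippets inconclusive: fetch full page content for more evidence.
        pages = " ".join(fetch(url) for url, _ in results[:2])
        verdict = judge(claim, snippets + " " + pages)
        if verdict in ("SUPPORTED", "REFUTED"):
            return verdict
        query = f"{claim} evidence"              # naive reformulation for the retry
    return "INCONCLUSIVE"

# Offline demo with stub tools.
verdict = fact_check(
    "Tardigrades can survive without oxygen",
    judge=lambda c, ctx: "SUPPORTED" if "cryptobiosis" in ctx else "UNSURE",
    search=lambda q: [("https://example.org", "Tardigrades enter cryptobiosis.")],
    fetch=lambda url: "",
)
```

Injecting the tools also makes the agent easy to test deterministically, which matters once you start evaluating the evaluator itself.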

Agentic Behaviour and Control Flow

According to Hugging Face’s taxonomy of agent capabilities, our fact-checking agent exhibits Agency Level 2. While the web search step is always executed, the agent exercises control in several key areas:

  • Conditional control flow: It decides whether to use the content fetcher based on snippet quality.
  • Dynamic query refinement: It may reformulate search queries based on previous results.
  • Evaluation gating: It determines whether enough information has been gathered to make a judgment.

This adaptive behaviour allows the agent to conduct more efficient and context-aware fact-checking, which is crucial in high-stakes domains like education.

Levels of Agency from https://huggingface.co/blog/smolagents

The prompts used for the faithfulness and factuality evaluations can be found here.

Results

Overall Results

Overall Passing Rates for All Claims

The results of the evaluation are shown above. As previously mentioned, the 400 conversations yielded 5784 question-answer pairs, of which 1346 questions were answerable from the knowledge base. The 5903 claims from these 1346 responses were checked for faithfulness to their knowledge bases, and 87.3% passed. Of the remaining 12.7% of claims which were not faithful, almost all (12.5 percentage points) passed the subsequent factuality check, resulting in a 99.8% overall passing rate for responses to questions answerable from the knowledge base.

The 4438 responses to the questions which were not answerable by the knowledge base consisted of 8291 claims, and 98.9% of these claims were factually correct according to DuckDuckGo search results. Combining the results for the responses to questions answerable by the knowledge base, and questions not answerable, 99.3% of the claims in the responses passed, meaning that they are either faithful or factual. In other words, only 105 out of the 14194 claims (0.7%) made by the assistant were neither faithful to the knowledge base nor factual.
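As a consistency check, the figures reported in this post can be reconciled with a few lines of arithmetic. All numbers below are taken from the text (including the 748 faithfulness failures and 734 recoveries discussed later); none are new measurements:

```python
# Figures quoted in the post.
answerable_claims = 5903    # claims from the 1346 answerable responses
unanswerable_claims = 8291  # claims from the 4438 non-answerable responses
fkb_fails = 748             # claims that failed the faithfulness check
fkb_fails_recovered = 734   # of those, claims that then passed factuality
total_fails = 105           # claims neither faithful nor factual

total_claims = answerable_claims + unanswerable_claims
fails_answerable = fkb_fails - fkb_fails_recovered   # 14 hard failures
fails_unanswerable = total_fails - fails_answerable  # 91 factuality failures

print(total_claims)                                                    # 14194
print(round(100 * total_fails / total_claims, 1))                      # 0.7
print(round(100 * (1 - fails_unanswerable / unanswerable_claims), 1))  # 98.9
```

The roll-up matches the reported totals: 14194 claims, a 0.7% overall failure rate, and a 98.9% factuality pass rate on the non-answerable branch.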

Results by Profile

Passing Rates by Student Profile

The chart above shows the faithfulness (FKB) and factuality (FC) scores for each of the simulated student profiles. The small spread of 0.7% in the combined score across all profiles shows that the students’ behaviours have little impact on the overall faithfulness and factuality of LEA’s responses. Regardless of the student’s prompt, LEA consistently replied with responses which are either faithful to the knowledge base or factually correct. Despite the high scores, there are some imperfections, which we will look at in the next section.

Faithfulness Failure Examples

We analysed the claims which failed the faithfulness test and found that in most cases, the assistant gave examples from its own knowledge of the topic which were not in the teacher-provided knowledge bases. This is not unexpected, as the assistant was acting as a discussion facilitator: while instructed to use the knowledge base, its scope was restricted to the topic rather than strictly to the knowledge base, so it could draw on related examples when the knowledge base offered no suitable ones.

In the following example, the assistant makes a claim which is not in the knowledge base but was subsequently found to be factual when checked against web resources.

Faithfulness Failure Example

Altogether, 748 claims failed the faithfulness test, out of which 734 subsequently passed the factuality test. We look at some of the claims which failed the factuality test in the next section.

Factual Consistency Failure Examples

Examples of Factual Consistency Failures

The table above shows examples of claims which failed the factuality check. Most failures are similar to these: fairly innocuous claims which failed the checks for two main reasons:

  1. The existence of some very peculiar examples which are not common (e.g. robots running on rubber bands, and living things which can live without oxygen)
  2. The decomposed claims are wrongly assessed to be false because of a lack of contextual information.

To fix 1), we could restrict the web searches to certain mainstream sites like wikipedia.org or news websites so that the fact-checking agent won’t go scouring the internet looking for the odd counter-example. For 2), we could prompt the answer decomposer to make each claim include the contextual information in both the question and the answer.
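The first fix amounts to adding domain filters to the search query. A sketch using the `site:` operator, which DuckDuckGo and most search engines support (the trusted-site list here is purely illustrative):

```python
# Illustrative allow-list; the actual choice of domains is a policy decision.
TRUSTED_SITES = ["wikipedia.org", "britannica.com"]

def restrict_query(query: str, sites=TRUSTED_SITES) -> str:
    """Limit a web search to trusted domains by appending site: filters."""
    site_filter = " OR ".join(f"site:{s}" for s in sites)
    return f"{query} ({site_filter})"

q = restrict_query("organisms that live without oxygen")
# 'organisms that live without oxygen (site:wikipedia.org OR site:britannica.com)'
```

Note that support for combining `site:` filters with `OR` varies slightly between engines, so the exact filter syntax should be verified against the engine in use.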

Evaluating the Evaluator

Given the above examples of imperfections with the evaluator, one might ask: “How much can we trust the evaluation results?”, followed by “What if I make changes to the evaluation workflow? Would it work better? Or worse?” To answer these questions, we need to evaluate the evaluator (yes, it sounds meta).

To evaluate the evaluator, we generated datasets with known labels to be used as the ground truth, using the knowledge bases which we had previously generated. The difference is that this time round, instead of generating free-flowing conversations, we are generating question-answer pairs with specific instructions so that we can get the ground-truth labels we want. This is done as follows:

Generating Question-Answer Pairs for Evaluating the Evaluator

With this pipeline and using gpt-4o as the LLM, we obtained three sets of question-answer pairs for evaluating the evaluator which are:

  1. Questions answerable from context, answers solely from context: This dataset allows us to measure if the evaluator is able to correctly identify answers which are faithful. All claims from the answers should pass the faithfulness evaluation.
  2. Questions not answerable from context, answers from web search: This dataset allows us to measure if the evaluator is able to correctly identify answers which are factual. All claims from the answers should pass the factuality evaluation.
  3. Questions not answerable from context, answers not from web search: This dataset allows us to measure if the evaluator is able to correctly identify answers which are not factual. All claims from the answers should fail the factuality evaluation.

Ideally, there should be a fourth dataset (questions answerable from context, answers not from context) to measure the evaluator’s ability to identify answers which are factual yet not drawn from the context. However, generating this dataset is challenging: there are few ways to answer a question that is answerable from the context without drawing on the context itself. We therefore omit this test and test only the factuality evaluation instead.

The prompts used to generate the datasets can be found here.

After post-processing to remove claims not belonging to the correct categories, we obtained a total of 3104 question-answer pairs consisting of 9490 claims across the three datasets. The results after passing them through the evaluator are shown below:

Evaluator’s Evaluation Results

From the results, it can be seen that the evaluator has small margins of error. For example, given a factually incorrect claim, there is currently a 4.4% chance that it will be evaluated as correct. The results of the evaluator’s evaluation can be used as a measure of confidence in LEA’s evaluation scores. In addition, they will be used to measure improvements to the evaluator as we make changes to it (e.g. changing the LLM or the search engine).

Limitations and Challenges

  1. Single-Dimensional Personas

In our evaluation, we tested LEA’s robustness by simulating conversations from distinct student personas such as those who go off-topic, use poor English, or make counterfactual claims. However, real-world student behaviour is often more nuanced. A single student may exhibit a mix of these traits within a single conversation, which could lead to unexpected or untested edge cases.

To address this, we plan to evaluate LEA on actual student-assistant interactions, once available, to better surface and understand complex, real-world failure modes.

2. Not All Failures Are Equal

Our current evaluation framework classifies claims as either faithful/unfaithful or factual/non-factual. However, not all failures carry the same weight:

  • Some unfaithful or unverified claims may be benign, adding extra but non-contradictory information.
  • Others may be harmful, directly contradicting the knowledge base or external facts.

To better reflect this spectrum, we can use finer-grained labels inspired by the MNLI (Multi-Genre Natural Language Inference) framework — such as ENTAILMENT, NEUTRAL, and CONTRADICTION. This allows stakeholders to apply nuanced thresholds, such as treating NEUTRAL claims as soft passes based on context.
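One possible mapping from three-way NLI labels to verdicts, with the NEUTRAL policy left as a stakeholder-configurable flag (illustrative, not our implemented scheme):

```python
from enum import Enum

class NLILabel(Enum):
    """MNLI-style three-way labels for a claim against its reference."""
    ENTAILMENT = "entailment"        # supported by the reference
    NEUTRAL = "neutral"              # extra but non-contradictory information
    CONTRADICTION = "contradiction"  # directly contradicts the reference

def to_verdict(label: NLILabel, neutral_is_pass: bool = False) -> str:
    """Collapse a three-way label to a verdict; stakeholders decide
    whether NEUTRAL claims count as a soft pass."""
    if label is NLILabel.ENTAILMENT:
        return "pass"
    if label is NLILabel.NEUTRAL:
        return "soft_pass" if neutral_is_pass else "fail"
    return "fail"
```

The finer-grained labels preserve the distinction between benign additions and harmful contradictions, so the pass threshold can be tuned per subject or per use case rather than baked into the evaluator.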

3. Latency and Cost of Evaluation Workflow

Running the evaluation pipeline, especially the fact-checking agent, can be resource-intensive and time-consuming, depending on the control flow path taken (e.g. whether multiple web queries and page fetches are required).

To mitigate this, we’re exploring several optimisations:

  • Leveraging smaller, faster models for tasks which require lower analytical capability (e.g. assessing if a question is answerable by the context).
  • Using semantic chunking to narrow the context window in both faithfulness and factuality checks, rather than evaluating against the entire knowledge base and web retrieved documents.

These changes aim to improve throughput and cost-efficiency, without compromising the integrity of the evaluation.

Lessons Learnt

Evaluating an LLM application or workflow is essential, but it’s rarely straightforward. One of the biggest challenges lies in aligning evaluation outcomes with users’ real-world concerns. By collaborating closely with teachers and stakeholders at MOE, we gained a deep understanding of their priorities and apprehensions. This enabled us to define evaluation metrics like faithfulness and factuality that directly addressed their key concerns around accuracy, trust, and pedagogical appropriateness.

This close partnership also gave us valuable insights into how students actually engage with the app. In the absence of production data, this understanding was crucial. It allowed us to simulate realistic student behaviours and generate synthetic evaluation conversations that reflect actual usage patterns. As a result, we were able to identify and fix issues in LEA’s responses before deployment, ensuring a safer and more trustworthy learning experience from day one.

Conclusion

As Eugene Yan aptly put it on X: “There’s a reason why evals is the first pattern [for building LLM-based systems and products]; can’t imagine building any other way.”

The importance of evaluation frameworks in LLM development can’t be overstated. Evals not only provide teams with the confidence that their systems are production-ready, but also offer an objective, repeatable way to measure system performance over time. Without such frameworks, iteration becomes guesswork — akin to trying to improve your running speed without a stopwatch — or worse, relying solely on manual inspection of model outputs with every tweak.

By partnering with MOE to develop a robust evaluation framework for the Learning Assistant (LEA), and running it against synthetic conversations tailored to realistic student behaviours, we’ve built an evaluation system that is well-scoped, benchmarked, and automated. In doing so, we’ve raised the maturity of MOE’s evaluation capabilities and made a significant step towards trustworthy, scalable deployment of generative AI in education.

Acknowledgements

This project would not have been possible without the contributions of the following people:

GovTech

  • Watson Chua
  • Wilson Ng
  • Rachel Shong
  • Chen Ping (Fiona)

MOE

  • Dean Yap
  • Kenny Low
  • Joe Tay
  • Norman Ng