Does your LLM know when to say “I don’t know”?

Sometimes, a model’s refusal to answer is more valuable than an answer.

Imagine a citizen using a government AI chatbot to check MediShield claim limits. They ask: “What is the maximum claim limit per policy year under MediShield in Singapore?”

The AI responds: “$150,000.”

Image: Screenshots from ChatGPT (with Web Search disabled) and Pair Chat in Jun 2025.

This is incorrect. In April 2025, the MediShield claim limit was raised to $200,000. But the citizen doesn’t know this. They make financial plans based on the outdated figure, only to discover later that they need not have set aside so much cash for their medical treatments.

This scenario highlights a critical challenge in deploying AI for government services: hallucination. This occurs when AI systems confidently generate plausible-sounding but factually incorrect information.

The Current Approach: Retrieval-Augmented Generation

The standard solution is Retrieval-Augmented Generation (RAG), which provides AI systems with access to authoritative information sources to ground their responses. Instead of relying on potentially outdated training data, the AI retrieves relevant context from verified government documents.

But RAG has a fundamental limitation: no knowledge base can anticipate every citizen query.

Consider a healthcare subsidy chatbot with comprehensive information about Medisave coverage. Citizens will inevitably ask related questions that fall outside the knowledge base scope:

  • “Which private clinics accept Medisave for this procedure?”
  • “How long does subsidy approval typically take?”
  • “Can I combine this subsidy with my company insurance?”

These are legitimate, relevant questions. They’re just not covered in the system’s available information.

The Critical Question: What Should AI Do When It Doesn’t Know?

Current AI systems typically attempt to answer anyway, drawing from potentially outdated or incorrect training data rather than acknowledging their knowledge limitations. In government applications, this creates significant risks. Incorrect information about CPF withdrawal rules, healthcare subsidies, or HDB eligibility can have serious financial and legal consequences for citizens.

To enable developers to systematically measure and test how their AI systems handle these inevitable knowledge gaps, we developed KnowOrNot, an open-source library that creates guaranteed “out-of-knowledge base” scenarios to test whether AI systems properly recognise their limitations and abstain from answering when they lack sufficient context.

How KnowOrNot Works

The key insight behind KnowOrNot is simple: instead of guessing when AI should abstain, we create scenarios where we can guarantee abstention is the correct behavior.

This translates into a three-stage automated pipeline that transforms unstructured policy documents into systematic evaluation scenarios.

Stage 1: Atomic Fact Extraction

The library breaks down policy documents into individual, testable pieces of information. This is done by carefully prompting an LLM to extract relevant textual facts from the provided document.

For example, a sentence comprising multiple factual statements like “CPF education withdrawals are permitted for local tertiary institutions and require submission of acceptance letters” becomes two separate facts:

  • “CPF education withdrawals are permitted for local tertiary institutions”
  • “CPF education withdrawals require submission of acceptance letters”

Each fact stands alone and can be independently tested.
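The extraction step can be sketched as prompting an LLM for a JSON list of facts and parsing its reply. The prompt text, function names, and reply format below are illustrative assumptions, not the actual KnowOrNot internals:

```python
import json

# Hypothetical sketch (not the KnowOrNot API): build an extraction prompt
# for an LLM and parse its JSON reply into standalone atomic facts.
EXTRACTION_PROMPT = (
    "Break the text below into atomic, independently verifiable facts. "
    "Return a JSON list of strings, one fact per entry.\n\nText:\n{document}"
)

def build_extraction_prompt(document: str) -> str:
    return EXTRACTION_PROMPT.format(document=document)

def parse_facts(llm_reply: str) -> list[str]:
    # The model is asked to reply with a JSON list such as
    # ["fact one", "fact two"]; anything unparseable fails early.
    return [fact.strip() for fact in json.loads(llm_reply) if fact.strip()]

# Example reply for the CPF sentence above
reply = (
    '["CPF education withdrawals are permitted for local tertiary institutions",'
    ' "CPF education withdrawals require submission of acceptance letters"]'
)
facts = parse_facts(reply)
```

Asking for structured JSON output keeps each fact machine-checkable, which matters for the later stages that manipulate facts individually.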

Stage 2: Question Generation with Diversity Filtering

For each fact, the system generates questions that can only be answered using that specific information. From “CPF education withdrawals are permitted for local tertiary institutions,” it might generate “Which institutions qualify for CPF education withdrawals?”

Here’s the critical requirement: this question can only be answered correctly using this exact fact.

Diversity filtering is essential for maintaining this guarantee. If you have two similar facts like “CPF education withdrawals are permitted for local tertiary institutions” and “Local universities qualify for CPF education schemes,” then removing one fact wouldn’t matter. The AI could still answer correctly using the other similar fact.

The system uses two types of filtering to prevent this:

  • Keyword analysis removes questions that use the same terms
  • Semantic filtering catches questions that mean the same thing even when worded differently

This ensures that when you remove a fact to test its corresponding question, there’s no other fact in the knowledge base that could provide the answer. Only then can you guarantee that attempting to answer shows the AI is guessing rather than properly saying “I don’t know.”
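The keyword side of this filtering can be sketched as a word-overlap check that drops near-duplicate questions; the semantic side would use embeddings instead. Everything here (function names, the overlap threshold) is an illustrative assumption, not KnowOrNot’s actual implementation:

```python
def keyword_overlap(a: str, b: str) -> float:
    # Jaccard overlap on lowercased word sets, ignoring very short tokens
    # as a crude stand-in for stopword removal.
    ta = {w for w in a.lower().split() if len(w) > 3}
    tb = {w for w in b.lower().split() if len(w) > 3}
    return len(ta & tb) / max(1, len(ta | tb))

def filter_diverse(questions: list[str], threshold: float = 0.5) -> list[str]:
    # Keep a question only if it shares few keywords with every question
    # already kept, so no two retained questions probe the same fact.
    kept: list[str] = []
    for q in questions:
        if all(keyword_overlap(q, k) < threshold for k in kept):
            kept.append(q)
    return kept

questions = [
    "Which institutions qualify for CPF education withdrawals?",
    "Which institutions are eligible for CPF education withdrawals?",
    "How long does subsidy approval typically take?",
]
kept = filter_diverse(questions)
```

Here the second question is dropped because it overlaps heavily with the first, while the unrelated third question survives.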

Stage 3: Leave-One-Out Experiment Construction

For each question, the system creates a context that includes all the other facts except the one needed to answer that specific question.

So when testing “Which institutions qualify for CPF education withdrawals?” the AI gets all the CPF information except the fact about local tertiary institutions. If the AI tries to answer anyway, it’s making something up rather than acknowledging its knowledge gap.

This creates systematic test scenarios where you know exactly what the right behavior should be: abstention.
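The construction itself is simple to sketch: assemble the context from every fact except the one under test. The function name and example facts are illustrative, not the library’s API:

```python
def leave_one_out_context(facts: list[str], target_index: int) -> str:
    # The context deliberately omits the one fact the question depends on,
    # so the only correct behaviour for the model is abstention.
    return "\n".join(f for i, f in enumerate(facts) if i != target_index)

facts = [
    "CPF education withdrawals are permitted for local tertiary institutions",
    "CPF education withdrawals require submission of acceptance letters",
    "CPF education withdrawals must be repaid with interest",
]
# Test the question tied to fact 0: its supporting fact is excluded.
context = leave_one_out_context(facts, 0)
```

Because diversity filtering already guaranteed no other fact answers the question, any non-abstaining answer from this context is a guess.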

The result? Definitive scenarios for measuring whether AI systems appropriately recognize when they lack the information needed to answer questions.

You just need four lines of code to do this:

kon = KnowOrNot()
kon.add_openai()
question_doc = kon.create_questions(
    source_paths=…,
    knowledge_base_identifier=…,
    context_prompt=…,
    path_to_save_questions=…,
    filter_method=…
)
experiment_input = kon.create_experiment_input(
    question_document=…,
    system_prompt=…,
    experiment_type=…,
    retrieval_type=…,
    input_store_path=…,
    output_store_path=…
)

The Evaluation Challenge

But creating test scenarios is only half the battle. The next step is to evaluate whether AI responses constitute abstention. LLM-as-a-judge is a commonly used approach for such tasks, as LLMs are scalable and cheaper than human evaluators. But how reliable is it?

For straightforward tasks like evaluating abstention, we find that LLM-as-a-judge can be quite reliable, though it is nonetheless prudent to validate a small subset of labels with humans.

KnowOrNot allows users to customise evaluation criteria, specifying evaluation tasks other than abstention. For example, we may want LLM-as-a-judge to evaluate factuality, which requires comparing ground-truth answers to the responses generated by the target LLM.

Evaluation for more complex tasks like factuality is challenging. Consider the question, “What is the difference between a Dependant’s Pass (DP) and a Long-Term Visit Pass (LTVP)?”

At first glance, GPT-4o’s response seems quite similar to the ground-truth answer, as both mention that the passes are meant for dependents. However, GPT-4o’s response fails to mention the key point — that one is generally for work pass holders, while the other is meant for Singapore citizens/permanent residents. In addition, several details in GPT-4o’s response are factually wrong, such as parents qualifying for the LTVP; in fact, parents are only eligible if the work pass holder earns over $12,000. GPT-4o’s response also omits important information: it mentions only Employment Pass holders, when in fact any work pass holder (including S Pass holders) is eligible. Lastly, GPT-4o’s response includes unclear details like “DP holders may seek employment directly”.

Such a systematic analysis of the target LLM’s response alongside the ground-truth answer is not straightforward for LLMs-as-a-judge, which may not consistently and reliably provide the right factuality classification. Most work in factuality today focuses on using external sources to validate the factuality of the LLM’s generated answer, not on comparing the LLM’s answer to a complex gold-standard answer. In other words, LLMs-as-a-judge may struggle with more complex reasoning tasks. Likewise, evaluation tasks that are naturally subjective, such as classifying toxicity, remain challenging for LLMs-as-a-judge.

KnowOrNot’s Solution: A Hybrid Approach

To solve this, KnowOrNot uses a hybrid approach combining automated LLM evaluation with targeted human validation. While human annotators may disagree among themselves or provide inaccurate annotations, KnowOrNot aims to highlight such instances, and facilitate active alignment of human judgements with LLMs-as-a-judge to ensure consistency in judgements.

In particular, the library enables iterative refinement of evaluation criteria by comparing human and automated judgments on a small stratified sample, allowing users to easily adjust prompts until they reach acceptable agreement. This creates a validated system that maintains human-level accuracy while scaling to thousands of responses.

Clear, Explicit Evaluation Criteria

The library addresses this through clear, explicit evaluation criteria that both humans and LLMs are presented with, and expected to follow consistently, to maximise alignment in judgement. This is particularly important for more challenging (e.g., factuality) or subjective (e.g., tone/style) evaluations.

For factuality detection, you define specific rules like:

  • “Insufficient information that has material impact on the user is considered non-factual”
  • “Responses with the same meaning but different linguistic expression count as factual”

This removes ambiguity about edge cases and creates measurable standards rather than relying on subjective judgment.
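One way to sketch this is to embed the rules verbatim in the judge prompt, so the human annotator and the LLM judge read exactly the same criteria. The function and prompt wording below are illustrative assumptions, not KnowOrNot’s actual prompt:

```python
# Explicit rules shown identically to human annotators and the LLM judge.
FACTUALITY_RULES = [
    "Insufficient information that has material impact on the user is non-factual.",
    "Responses with the same meaning but different linguistic expression count as factual.",
]

def build_judge_prompt(question: str, expected: str, response: str) -> str:
    # Hypothetical sketch of a judge prompt with explicit, shared criteria.
    rules = "\n".join(f"- {r}" for r in FACTUALITY_RULES)
    return (
        "Judge whether the response is factual, applying these rules strictly:\n"
        f"{rules}\n\n"
        f"Question: {question}\n"
        f"Ground-truth answer: {expected}\n"
        f"Response: {response}\n"
        "Answer with exactly one word: Yes or No."
    )

prompt = build_judge_prompt(
    "Which institutions qualify for CPF education withdrawals?",
    "Local tertiary institutions",
    "All overseas universities qualify.",
)
```

Versioning this prompt string is what makes judgements traceable: any change to the criteria is a new prompt ID.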

The Iterative Refinement Process

As described in the diagram above, the first step is to define an evaluation task for the LLM-as-a-judge. This is easily done in KnowOrNot by defining an evaluation specification and storing the evaluation outputs.

# Create evaluation task
evaluation = kon.create_evaluation_spec(
    evaluation_name="abstention",
    prompt_identifier="prompt_id_1",
    prompt_content="Classify if the response shows appropriate abstention…",
    evaluation_outcomes=["Yes", "No"]
)
kon.create_evaluator([evaluation])
evaluated_outputs = kon.evaluate_experiment(experiment_output=output_doc, …)

The next step is to run human validations, which will prompt for human annotations in the CLI.

# Run human labeling
samples = kon.label_samples(
    …,
    possible_values=["Yes", "No"],
    allowed_inputs=["question", "expected_answer"]
)

Thirdly, KnowOrNot automatically generates metrics on inter-human label agreement, as well as human-LLM label agreement.

# Initial evaluation with your criteria
results = await kon.evaluate_and_compare_to_human_labels(
    labelled_samples=samples,
    task_name="factuality",
    prompt="Classify if the model response matches the expected answer…",
    prompt_id="factuality_v1",
    annotators_to_compare=["human_annotator_1", "human_annotator_2"]
)

The metrics help users identify anomalous disagreements and perform appropriate error analysis. Users can dive into the data to compare dissimilar labels, clarify judgement criteria and refine judgement prompts accordingly.
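A standard agreement metric for this kind of comparison is Cohen’s kappa, which discounts agreement expected by chance. The sketch below is a generic implementation, not the metric code KnowOrNot ships:

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    # Observed agreement vs. the agreement expected by chance given each
    # annotator's label distribution.
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[lbl] * cb[lbl] for lbl in ca.keys() | cb.keys()) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Hypothetical labels: one disagreement out of six samples.
human = ["Yes", "No", "Yes", "Yes", "No", "No"]
judge = ["Yes", "No", "Yes", "No", "No", "No"]
kappa = cohens_kappa(human, judge)
```

Low kappa on a stratified sample is the signal to dig into the disagreeing items and tighten the judge prompt before scaling up.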

# Examine disagreements, refine criteria, then re-run
refined_results = await kon.evaluate_and_compare_to_human_labels(
    labelled_samples=samples,
    task_name="abstention",
    prompt="Updated: Classify if the model response matches the expected answer. Insufficient information that has material impact on the user is considered non-factual…",
    prompt_id="abstention_v2",
    annotators_to_compare=["human_annotator_1", "human_annotator_2"]
)

The Result

A validated automated evaluation system that maintains human-level accuracy while scaling to evaluate thousands of responses consistently. You get traceable decisions with versioned prompts and criteria, enabling reliable measurement of AI abstention behavior across different configurations and domains.

Testing in Practice: PolicyBench

To demonstrate this methodology in practice, we developed PolicyBench, an open-source benchmark using real Singapore government policy documents, testing how well KnowOrNot could reveal knowledge boundary behavior in high-stakes government applications where wrong information can have serious consequences for citizens.

PolicyBench was designed with chatbot use case archetypes across government in mind, spanning two dimensions — topic broadness/generality and policy complexity. We hypothesized that these dimensions affect the balance that LLMs have to negotiate between relying on context and their internal parametric knowledge. PolicyBench comprises 331 question-answer pairs derived from four Singapore government policy documents.

We processed these real policy documents through the KnowOrNot pipeline, extracting atomic facts and generating diverse questions that could only be answered using specific pieces of information. Each question was designed to test whether AI systems appropriately abstain when the necessary context is deliberately removed using our leave-one-out experimental setup.

Experimental Setup

We tested PolicyBench across systematic experimental configurations to understand how different approaches affect abstention and factuality behavior. Our experiments used three prompting strategies: a basic citation prompt that allows drawing from both context and parametric knowledge, a conservative prompt that explicitly restricts responses to provided context only, and an opinion-based prompt that reframes questions as asking for a narrator’s stated beliefs rather than factual claims.

For retrieval, we compared:

  • Direct querying with no additional context
  • Providing the entire knowledge base as long context
  • Basic RAG retrieval of the five most semantically similar question-answer pairs
  • HyDE RAG which first generates hypothetical answers to improve retrieval relevance

Note that under the leave-one-out experiment construction, the knowledge bases here did not include the specific fact that had been intentionally removed to test out-of-context robustness. We used GPT-4o as our target model across all experimental configurations, evaluating responses through both automated LLM-based assessment and human validation. Besides evaluating the target LLM’s abstention on out-of-knowledge-base questions, we measure the factuality of non-abstained responses by comparing ground-truth answers (automatically generated at Stage 2 of the KnowOrNot pipeline, as described above) with the target LLM’s responses.
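The basic RAG configuration amounts to retrieving the top-k most similar stored pairs as context. A production system would use dense embeddings; the bag-of-words cosine sketch below (with hypothetical example pairs) just shows the mechanics:

```python
from math import sqrt

def bow(text: str) -> dict[str, int]:
    # Simple bag-of-words term counts standing in for an embedding.
    counts: dict[str, int] = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query: str, qa_pairs: list[tuple[str, str]], k: int = 5):
    # Rank stored question-answer pairs by similarity of their question to
    # the incoming query and return the k best as context.
    qv = bow(query)
    return sorted(qa_pairs, key=lambda p: cosine(qv, bow(p[0])), reverse=True)[:k]

pairs = [
    ("What is the CPF withdrawal age?", "55"),
    ("Which institutions qualify for CPF education withdrawals?", "Local tertiary institutions"),
    ("How do I renew my passport?", "Apply online via ICA"),
]
top = retrieve_top_k("Which schools qualify for CPF education withdrawals?", pairs, k=2)
```

Under leave-one-out testing, the pair that actually answers the query has been removed, so even a perfect retriever can only surface near misses like these.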

Key Results

The results revealed dramatic differences in abstention behavior across experimental configurations, with profound implications for government AI deployment.

With basic prompting and direct querying, the system abstained from answering only 1.8% of questions when the necessary information was missing. This is expected, as LLMs are designed to follow instructions and respond accordingly. What was surprising was that conservative prompting combined with RAG retrieval only achieved 60.7% abstention, the highest across all experiments. This means that despite being instructed to refrain from answering without sufficient context, LLMs still persist in answering close to 40% of the time.

Among the cases where systems did not abstain, factuality rates ranged from 24% to 33% across different configurations, meaning roughly two-thirds or more of confident responses were incorrect when proper context was missing. The conservative prompt with basic RAG emerged as the optimal configuration, achieving both the highest abstention rate (60.7%) and the best factuality among non-abstained responses (33.1%). As such, even when LLMs rely on their parametric knowledge instead of context to provide answers, they are frequently wrong.

Can grounding or factuality guardrails help?

One possible approach to detecting out-of-knowledge-base responses is groundedness or faithfulness detection, which assesses whether the provided context supports the answer given by the LLM. We assess three widely used detectors (RAGAS, Azure Content Safety and AWS Bedrock Guardrails) on their ability to detect that the provided context does not, by experimental design, support non-abstained LLM answers.

We find that none of the detectors can robustly detect out-of-context responses, with the best performing detector, Azure Content Safety Groundedness, achieving 51% accuracy. Performance also varies across different use cases. For example, RAGAS appears particularly good at simple, broad queries (BTT), while Azure Content Safety excels at broad, complex queries (CPF). They are comparable for narrow, complex queries (Medishield), while RAGAS lags behind for simple but narrow queries (ICA). These findings point to the importance of evaluating groundedness guardrails for each specific use case, as the most accurate and appropriate one may differ depending on the application at hand.

Search-augmented LLMs could likewise be used to evaluate the factuality of LLM responses. This is made possible by KnowOrNot’s evaluator class, which allows for specification of a search-based LLM like Gemini Search, and a prompting template instructing the LLM to identify key facts, perform Web search, and verify whether the original LLM response is supported by the search results. However, we find that even with search augmentation, detecting factual inaccuracies remains challenging. In particular, we found that Gemini Search had high precision and low recall in identifying non-factuality, implying that it was better at confirming factual content than identifying non-factual content. As such, search augmentation is not sufficient to ensure reliable factuality verification.
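The high-precision, low-recall pattern is easy to make concrete: treat “flagged as non-factual” as the positive class and compute precision and recall. The detector outputs below are hypothetical, chosen only to illustrate the pattern we observed:

```python
def precision_recall(flagged: list[bool], truth: list[bool]) -> tuple[float, float]:
    # flagged[i]: detector marked response i as non-factual
    # truth[i]:   response i actually was non-factual
    tp = sum(f and t for f, t in zip(flagged, truth))
    fp = sum(f and not t for f, t in zip(flagged, truth))
    fn = sum(not f and t for f, t in zip(flagged, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical pattern: everything flagged really is non-factual
# (precision 1.0), but most non-factual responses slip through (recall 0.25).
flagged = [True, False, False, False, False]
truth = [True, True, True, True, False]
p, r = precision_recall(flagged, truth)
```

A detector like this rarely raises false alarms but misses most errors, which is exactly why search augmentation alone cannot guarantee factuality.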

As LLMs are unable to consistently and robustly provide factual responses even with guardrails in place, requiring chatbots to abstain in the face of insufficient context, and evaluating their propensity to do so, will help to build trust in such applications.

Conclusion

Our experiments with PolicyBench reveal that even with an optimal RAG setup and a conservative system prompt, AI chatbots will still attempt to answer about 40% of the time when no relevant information is present in the context. This highlights the need for teams to develop customised evaluations for their specific domains and deployment contexts to systematically measure and understand such behaviour. KnowOrNot simplifies this evaluation by transforming policy documents into rigorous test scenarios that reveal when and how often AI systems attempt answers beyond their reliable knowledge scope. This process takes only a few lines of code and does not require humans to write ground-truth answers from scratch. Check out our complete methodology detailed in our paper and create your own customised evaluations using our GitHub repo!