Building a Better RAG Pipeline for HR Policy Q&A: What Worked and What Didn’t

We tested the most effective approaches.

Hello! My name is Clare and I’m currently doing a full-time Master’s in Computer Science at the National University of Singapore. During my term break, I did a short internship with the AI Capability Development (AI Cap Dev) Team at GovTech’s AI Practice. During the internship, the team worked with a product team, GovText, on a project involving Retrieval-Augmented Generation (RAG), a technique that combines information retrieval with large language models to generate grounded, source-based answers.

The project addressed a practical and meaningful challenge: helping civil servants obtain accurate answers to HR-related questions. In the Singapore public service, all agencies follow a common set of HR guidelines. However, each agency has the flexibility to implement these guidelines differently. For example, while the guideline may require a minimum of 18 days of annual leave, an agency might offer more leave based on internal policies or special conditions. As a result, any system built to answer HR policy queries must retrieve information not only from an agency’s internal HR policies, but also from the public service-wide guidelines. It must then amalgamate the information from both sources to deliver accurate, complete answers. This task becomes even more challenging when users phrase their questions differently from how policies are written.

The GovText team has built a RAG-based assistant for this use case and the AI Cap Dev team is assisting them with evaluating the performance of the system and optimising the RAG pipeline. This builds on the team’s earlier work on the RAG playbook, which outlines best practices for designing effective RAG systems. While the playbook provides theoretical guidance, this project gave us the opportunity to apply and evaluate those strategies in a practical, policy-driven context. As a data science intern in this project, my goal was to explore how this assistant could be improved, both by retrieving more relevant information and by generating clearer, more accurate answers, through experimentation with different components and configurations of the RAG pipeline.

In the rest of this post, I’ll walk through what we tried, highlight what worked and what didn’t, and share key takeaways for building more reliable question-answering systems grounded in structured, domain-specific knowledge.

Setup and Evaluation

To evaluate each technique fairly, we built a controlled test environment with a fixed baseline RAG pipeline. Each method was layered on top of this baseline one at a time, so we could clearly observe its impact without interference from other changes.

Dataset

We created a synthetic dataset of HR-related questions grounded in actual policy documents. These questions reflected common employee concerns such as leave entitlements, training support and workplace benefits. Each question was paired with a reference answer extracted from official HR materials, providing a clear basis for assessing whether system outputs were accurate and aligned with source content.

In the Singapore public service, answers to HR queries can come from either the public service-wide guidelines or from agency-specific HR policies, and sometimes both. To reflect this real-world complexity, we structured the synthetic dataset to include 500 question-answer pairs across three categories:

  • Questions answerable only using the public service-wide guidelines, with no supporting information in the agency’s own HR documents.
  • Questions answerable only using the agency’s internal HR policy, with no matching content in the central guidelines.
  • Questions where both sources contain relevant information. However, in such cases, the agency’s version is considered the authoritative source and should be used preferentially.

Retrieval

Our baseline retriever used FAISS, a widely adopted vector search library, together with the bge-large-en-v1.5 embedding model. Policy documents were split into chunks of 6,000 tokens before being indexed. At query time, the retriever fetched the top-k most relevant chunks to serve as grounding context for the generator.
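The baseline's retrieval step can be sketched in miniature. The real pipeline uses FAISS over bge-large-en-v1.5 embeddings; below, a tiny in-memory index and plain cosine similarity stand in for FAISS to illustrate the same top-k nearest-neighbour lookup.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k):
    """Return the IDs of the k chunks most similar to the query.

    `index` maps chunk ID -> embedding. FAISS performs this same
    search, just at scale and with approximate-nearest-neighbour
    shortcuts rather than a full sort.
    """
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]),
                    reverse=True)
    return ranked[:k]

# Toy 2-D embeddings standing in for 1,024-dim bge vectors.
index = {"c1": [0.9, 0.1], "c2": [0.1, 0.9], "c3": [0.7, 0.7]}
print(top_k([1.0, 0.0], index, k=2))  # → ['c1', 'c3']
```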

Generation

For answer generation, we used GPT-4o via Azure OpenAI. Each prompt contained the user’s question and the retrieved chunks, with the generator tasked to produce a concise, accurate answer that stayed faithful to the provided context.

Evaluation

A RAG system cannot be judged solely on whether the answer “sounds good”. We measured both retrieval quality and generation quality, as strong performance in one stage is only useful if the other can leverage it effectively.

Retrieval Metrics:

  • Recall@k / MRR@k: How often relevant content appears in the top-k retrieved chunks and how highly it is ranked. Higher values indicate that the retriever is surfacing information the generator can use.
  • Context Precision and Recall: Precision measures how focused the retrieved content is; recall measures how comprehensive it is. Together, they show whether the model sees enough of the right information before generation.
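The two rank-based metrics above are simple to compute once you have the retriever's ranked chunk IDs and the gold chunk IDs for each question. A minimal sketch (the chunk IDs here are illustrative):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunk IDs that appear in the top-k retrieved IDs."""
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)

def mrr_at_k(retrieved, relevant, k):
    """Reciprocal rank of the first relevant chunk within the top k (0 if absent)."""
    for rank, chunk_id in enumerate(retrieved[:k], start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["c7", "c2", "c9", "c4"]  # retriever output, best first
relevant = {"c2", "c4"}               # gold chunks for this question
print(recall_at_k(retrieved, relevant, 3))  # → 0.5 (one of two gold chunks in top 3)
print(mrr_at_k(retrieved, relevant, 3))     # → 0.5 (first gold chunk at rank 2)
```

In practice these are averaged over the whole evaluation set; the per-question version shown here is the building block.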

Generation Metrics:

  • Faithfulness: Whether the answer stays grounded in retrieved context without adding unsupported claims.
  • Answer Relevance: How directly the output addresses the user’s question.
  • Factual Correctness (Precision and Recall): Precision reflects factual accuracy; recall reflects completeness compared to the reference answer.

By tracking these retrieval and generation metrics together, we could assess which techniques improved each stage individually and which ones strengthened the entire pipeline.

RAG Pipeline

Techniques We Explored

To identify which parts of the RAG pipeline have the most impact on answer quality, we took a modular approach, changing one component at a time while keeping the rest of the setup fixed. We focused on methods that could either improve the relevance and coverage of retrieved content or guide the model to make better use of that content when generating answers.

Retrieval Enhancements

ColBERT

ColBERT is a dense retriever that uses token-level interaction to compare queries with document passages. Unlike standard dense retrieval that encodes texts into a single vector, ColBERT retains fine-grained semantic information. This makes it better suited for nuanced policy questions where a single word can alter the meaning. For example, it can help distinguish between terms like “unpaid leave” and “unpaid childcare leave”.

ColBERT Retrieval Workflow
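The heart of ColBERT is its "late interaction" scoring: every query token vector is matched against its best document token vector, and those maxima are summed. The sketch below shows that MaxSim operation with toy 2-D vectors standing in for learned token embeddings.

```python
def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: for each query token embedding,
    take the best dot-product match among document token embeddings,
    then sum those maxima into a single relevance score."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Two query token embeddings (toy 2-D stand-ins).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.9, 0.1]]   # covers only the first query token
doc_b = [[1.0, 0.0], [0.0, 1.0]]   # covers both query tokens
print(maxsim_score(query, doc_a))  # → 1.1
print(maxsim_score(query, doc_b))  # → 2.0
```

Because each query token must find its own match, a document missing one key token (say, "childcare" in "unpaid childcare leave") scores visibly lower than one covering all of them, which a single pooled vector can blur.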

Hybrid Retrieval

We combined BM25 (keyword-based) with FAISS (dense-vector based) using reciprocal rank fusion. The dense retriever captures broader semantic meaning, while BM25 ensures exact matches for specific policy terms or acronyms. This pairing helps handle HR queries that blend formal terminology with everyday phrasing.

Hybrid Retrieval Workflow
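Reciprocal rank fusion itself is a few lines of code: each retriever contributes a score of 1/(k + rank) per document, and the summed scores decide the fused order. The constant k=60 is the conventional default from the original RRF paper; the chunk IDs below are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked ID lists into one, scoring each document
    by the sum of 1/(k + rank) over the rankings it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["c3", "c1", "c5"]   # keyword matches
dense_ranking = ["c1", "c4", "c3"]  # semantic matches
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# → ['c1', 'c3', 'c4', 'c5'] — c1 wins by appearing high in both lists
```

A nice property of RRF is that it only needs ranks, not raw scores, so BM25 and dense similarity never have to be calibrated against each other.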

Hierarchical Indexing

This method retrieves in two stages: first, it identifies a relevant “summary node” representing a broader section of the document, then it retrieves detailed chunks within that node. The idea is to avoid missing related content that might be scattered across a section, which can happen with flat top-k retrieval.

Hierarchical Index Retrieval Workflow
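The two-stage lookup can be sketched as follows. Toy dot-product scores and hand-written section names stand in for the real pipeline's embedded summaries and chunks; note that stage two only ever searches inside the section stage one picked, which is exactly where the failure mode discussed later comes from.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hierarchical_retrieve(query_vec, sections, k):
    """Two-stage retrieval: pick the best-matching summary node first,
    then rank only the detailed chunks inside that section."""
    best = max(sections, key=lambda s: dot(query_vec, sections[s]["summary"]))
    chunks = sections[best]["chunks"]
    ranked = sorted(chunks, key=lambda c: dot(query_vec, chunks[c]),
                    reverse=True)
    return ranked[:k]

sections = {
    "leave":    {"summary": [0.9, 0.1],
                 "chunks": {"annual": [1.0, 0.0], "medical": [0.6, 0.4]}},
    "benefits": {"summary": [0.1, 0.9],
                 "chunks": {"dental": [0.0, 1.0]}},
}
print(hierarchical_retrieve([1.0, 0.0], sections, k=1))  # → ['annual']
```

If the summary embedding of the right section is too generic to win stage one, the relevant chunk becomes unreachable no matter how well it matches the query.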

HYDE (Hypothetical Document Embedding)

HYDE first generates a hypothetical answer to the query using an LLM, then embeds that answer to perform retrieval. By steering the retrieval toward passages that match the intended meaning of the question, HYDE can be effective for vague or underspecified queries, which is common in HR contexts where employees do not know the exact policy terminology.
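The HYDE flow is a small wrapper around the existing retriever: generate first, then embed the generation instead of the question. In the sketch below, `generate_hypothetical`, `embed` and `search` are assumed helpers (the real pipeline would use an LLM call, the bge embedding model and FAISS respectively); the toy stand-ins just make the flow runnable.

```python
def hyde_retrieve(question, generate_hypothetical, embed, search, k):
    """HYDE: retrieve with the embedding of a *hypothetical answer*
    rather than the embedding of the question itself."""
    hypothetical = generate_hypothetical(question)
    return search(embed(hypothetical), k)

# Toy stand-ins for the LLM, embedder and vector search.
fake_llm = lambda q: "Officers are entitled to 18 days of annual leave."
fake_embed = lambda text: [float(len(text))]       # placeholder embedding
fake_search = lambda vec, k: ["chunk_leave_policy"][:k]

print(hyde_retrieve("How much leave do I get?", fake_llm, fake_embed,
                    fake_search, k=1))  # → ['chunk_leave_policy']
```

The intuition is that a hypothetical answer, even if factually wrong, is phrased like policy text, so its embedding lands closer to the right chunks than a terse, jargon-free question does.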

Query Expansion

We used an LLM to generate paraphrased variants of the original question. For instance, “How many days of childcare leave am I entitled to?” could also search for “parental leave” or “family support schemes”. This broadened the retriever’s coverage and helped capture alternative terms that appear in the documents.
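Mechanically, expansion means running retrieval once per query variant and merging the results. A minimal sketch, where `paraphrase` stands in for the LLM paraphrasing call and `retrieve` for the vector search (both assumed helpers, with toy lookup tables so the example runs):

```python
def expand_and_retrieve(question, paraphrase, retrieve, k):
    """Retrieve for the original question and each paraphrase,
    merging results in first-seen order and dropping duplicates."""
    merged = []
    for query in [question] + paraphrase(question):
        for chunk_id in retrieve(query, k):
            if chunk_id not in merged:
                merged.append(chunk_id)
    return merged

# Toy stand-ins for the LLM paraphraser and the retriever.
variants = {"childcare leave": ["parental leave", "family support schemes"]}
corpus = {"childcare leave": ["c1"],
          "parental leave": ["c2", "c1"],
          "family support schemes": ["c3"]}
paraphrase = lambda q: variants.get(q, [])
retrieve = lambda q, k: corpus.get(q, [])[:k]

print(expand_and_retrieve("childcare leave", paraphrase, retrieve, k=2))
# → ['c1', 'c2', 'c3']
```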

Subquery Creation

Long or complex questions were split into simpler sub-questions before retrieval. For example, “Can I take study leave and what is the approval process?” became two separate queries, each targeting a more focused set of results. This increased the chances of retrieving complete and relevant context for each part of the question.

Post-Retrieval Processing & Generation Guidance

Context Distillation

After retrieval, we applied sentence-level filtering to remove verbose or irrelevant content before passing context to the generator. This was most useful when the retrieved text was lengthy, as shorter and more focused context made it easier for the model to stay on topic.
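In sketch form, distillation splits the retrieved text into sentences and keeps only those judged relevant to the question. The naive word-overlap heuristic below is an illustrative stand-in for whatever relevance filter a pipeline actually uses, but it already shows the trade-off: unrelated sentences are removed, and an aggressive threshold could just as easily drop a qualifying clause.

```python
def distill(context, question, min_overlap=1):
    """Sentence-level filtering: keep only sentences sharing at least
    `min_overlap` words with the question (toy relevance heuristic)."""
    q_words = set(question.lower().split())
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    kept = [s for s in sentences
            if len(set(s.lower().split()) & q_words) >= min_overlap]
    return ". ".join(kept) + ("." if kept else "")

context = ("Officers accrue annual leave monthly. "
           "The office canteen opens at 8am. "
           "Unused annual leave may be carried forward.")
print(distill(context, "How much annual leave can I carry forward?"))
# The canteen sentence is dropped; both leave sentences survive.
```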

Reranking

Even after initial retrieval, the top-k chunks are not equally relevant. We used a cross-encoder to score and reorder them based on how well each chunk matched the query. Unlike dense retrieval, which scores chunks independently, the cross-encoder considers the query and passage together, improving relevance ranking and ensuring the most useful context appears at the top.
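Structurally, reranking is just a sort by a pairwise score. In the sketch below, `score_pair` stands in for a cross-encoder forward pass that sees the query and passage together; a toy word-overlap scorer makes the example runnable, but in the real pipeline this would be a trained model.

```python
def rerank(query, chunks, score_pair, top_n):
    """Reorder retrieved chunks by a pairwise (query, chunk) score,
    keeping the top_n. `score_pair` stands in for a cross-encoder."""
    return sorted(chunks, key=lambda c: score_pair(query, c),
                  reverse=True)[:top_n]

# Toy pairwise scorer: shared-word count between query and chunk.
def overlap_score(query, chunk):
    return len(set(query.lower().split()) & set(chunk.lower().split()))

chunks = ["annual leave entitlements", "dental claim procedure",
          "leave approval steps"]
print(rerank("leave approval process", chunks, overlap_score, top_n=2))
# → ['leave approval steps', 'annual leave entitlements']
```

Because the pairwise score is computed once per (query, chunk) pair, this step adds an inference call per candidate chunk, which is the latency cost noted at the end of the post.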

Autocut

Autocut is a filtering step applied immediately after retrieval, based on the retriever’s own relevance scores. If there was a large score drop or chunks fell below a set threshold, they were removed before generation. This kept the prompt concise and focused, and reduced the chance of irrelevant details distracting the generator.
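Both cut-off rules can be sketched in a few lines. The thresholds below are illustrative, not the values used in the project:

```python
def autocut(chunks, scores, min_score=0.2, max_drop=0.5):
    """Truncate the retrieved list at the first chunk whose score falls
    below an absolute floor, or drops by more than `max_drop` relative
    to the previous chunk's score."""
    kept = []
    prev = None
    for chunk, score in zip(chunks, scores):
        if score < min_score:
            break
        if prev is not None and score < prev * (1 - max_drop):
            break
        kept.append(chunk)
        prev = score
    return kept

chunks = ["c1", "c2", "c3", "c4"]
scores = [0.92, 0.88, 0.35, 0.30]  # retriever relevance scores, best first
print(autocut(chunks, scores))  # → ['c1', 'c2'] (0.35 is a >50% drop from 0.88)
```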

Chain-of-Thought Prompting

We tested prompts that explicitly instructed the model to reason step-by-step before producing an answer. While this technique is often used for logic-heavy tasks, we wanted to see if it could improve clarity and completeness in multi-step HR policy questions, such as those involving sequential approval processes.

Below is a simplified version of the prompt we used:

You are a helpful assistant. Before answering, reason through the problem step-by-step using only the provided context. Do not use external knowledge. If no clear answer can be found, return an empty string.

What Worked and What Didn’t

After running each experiment in isolation, we compared results across both retrieval and generation metrics. Some techniques delivered consistent improvements, while others introduced trade-offs or showed limited impact. The charts below summarise how each method performed.

Retrieval metrics by technique. Blue bars indicate improvement over the baseline; orange bars indicate a drop in performance.
Generation metrics by technique. Blue bars indicate improvement over the baseline; orange bars indicate a drop in performance.

Techniques that Worked Well

Query Expansion

Query expansion was one of the most effective strategies overall, consistently ranking among the top three techniques across multiple retrieval and generation metrics. Compared to the baseline, it showed strong improvements in recall@k, context recall, faithfulness and answer relevancy. This suggests that query expansion not only helped retrieve more relevant information but also led to more accurate and well-grounded answers.

By reformulating user queries into semantically related variants, query expansion helped bridge the gap between how users phrase their queries and how policies are actually written. This is particularly important in the HR domain, where the same concept might be expressed in many different ways across agencies.

Example: A query like “Can I claim dental benefits?” was expanded to include phrases like “reimbursement for dental treatment” or “dental coverage policy”, enabling the system to retrieve policy chunks that didn’t use the word “claim” at all.

Hybrid Retrieval

Pairing BM25 with FAISS through reciprocal rank fusion performed as well or better than the baseline across all retrieval metrics and increased factual correctness in generated answers. The combination captured both exact policy terms and broader semantic intent, giving the model a richer, more varied context. This approach was particularly effective for queries containing abbreviations, synonyms or less common phrasing.

ColBERT

ColBERT outperformed the baseline retrieval method across most retrieval metrics and also showed gains in faithfulness and answer relevancy. Unlike dense retrievers that encode an entire document into a single vector, ColBERT performs token-level matching between queries and document chunks. This fine-grained alignment allows it to surface passages that match user phrasing more precisely without compromising on semantic intent.

Example: When a user asked: “Do officers get a rest day if they work on a public holiday?”, dense retrieval returned generic content about public holiday entitlements. ColBERT, however, surfaced a chunk that specifically addressed compensatory rest entitlements for officers who are rostered to work during public holidays. This highlighted its strength in distinguishing between similar but policy-distinct concepts like rest days, off-in-lieu, and other forms of compensation.

Autocut

Autocut improved over the baseline by filtering out low-scoring chunks after retrieval, allowing only the most relevant content to be passed to the generator. While it did not change the retrieval method itself, it improved context recall and contributed to gains in faithfulness and answer relevancy during generation. This lightweight post-processing step was particularly helpful for narrow-scope questions, where extraneous or loosely related content could dilute the model’s focus. By trimming away noisy context, Autocut helped the generator stay on-topic and produce more precise answers.

Reranking

Reranking consistently outperformed the baseline by using a cross-encoder to reorder the top-k retrieved chunks, ensuring the most relevant ones appeared at the top. This improved MRR@k by promoting the gold chunks to the first or second position, making it more likely for the model to attend to the right information. The improved ranking also translated into better faithfulness and answer relevancy, especially for longer contexts where key information risked being lost in the middle.

Techniques with Limited Impact

Context Distillation

While intended to streamline the input, context distillation sometimes removed useful supporting details along with irrelevant sentences. For short factual questions, overly aggressive filtering could discard important evidence the model needed to justify its answer. In such cases, this filtering reduced grounding rather than improving focus.

Example: For the query “Can officers apply for gig work?”, the distilled context excluded a clause outlining key eligibility conditions. Although the remaining sentence stated that officers may apply, the absence of these qualifying details led the model to generate an answer that overstated the policy’s flexibility.

Hierarchical Indexing

Hierarchical indexing, which retrieves summary nodes before drilling down to detailed chunks, consistently underperformed across most metrics compared to the baseline. While the two-stage retrieval structure was designed to guide the system toward relevant sections, the initial summaries were often too generic. As a result, the second-stage retrieval sometimes missed precise details required to answer specific HR queries accurately.

Chain-of-Thought Prompting

Chain-of-thought prompting, which encourages step-by-step reasoning in the generation stage, did not show consistent improvements for factual HR queries. In our experiments, this technique led to longer but not more accurate answers, and in some cases, reduced factual correctness due to speculative or redundant reasoning steps. When the retrieved content was already clear and relevant, the extra reasoning added little value. This suggests that chain-of-thought prompting may be more useful in reasoning-heavy or multi-step domains, but offers limited benefit for short, fact-based policy questions.

Key Insight & What’s Next

Across all techniques we tested, the most effective approaches were those that strengthened both retrieval and generation. When the retriever surfaced context that was relevant, diverse and well-targeted, the model was more likely to produce answers that were more complete, accurate and consistent with the source material.

The stronger performers such as hybrid retrieval, query expansion, and reranking shared a common strength: they provided the model with richer, more useful input to work from. In contrast, techniques that did not improve retrieval, such as chain-of-thought prompting, generally had limited impact. Without high-quality context, additional reasoning steps sometimes added complexity without improving accuracy. Similarly, context distillation and hierarchical indexing often removed information that was useful for grounding the answer.

Looking ahead, it would be worth exploring how these techniques perform in combination. For example, methods that increase coverage, like hybrid retrieval, could be paired with filtering approaches, such as reranking or autocut, to both broaden and refine the retrieved context. Evaluating these combinations on real-world queries or historical logs would help determine how well they generalise beyond synthetic datasets.

Efficiency will be another key consideration. Some methods like reranking introduce extra inference steps that can increase latency and cost. Exploring lighter-weight alternatives or applying these methods selectively could help balance performance with practical constraints.