Fine-Tuning Language Models for Long-Context Data: Automated Stance Analysis of Citizen Discussions
Addressing technical challenges of processing high-volume public feedback for policy-making
Public officers frequently gather large volumes of public feedback on national and social issues, through public engagement or from monitoring public online discourse, to inform policy decisions and understand citizen sentiment. However, analysing this unstructured, often lengthy textual data is time-consuming, complex, and cognitively demanding. Officers must manually sift through diverse and sometimes conflicting viewpoints, identify stances, and extract meaningful insights — a process that is not only labour-intensive but also difficult to scale.
This process relies heavily on manual annotation and thematic analysis, which limits the speed and scope of such studies due to the resource intensity. This can slow down government communications and policy responsiveness as officers are less effective at picking up emerging sentiments on the ground to identify and execute an appropriate government response (e.g. communications, or policy intervention).
Proposed solution
To mitigate this, we conceptualised and implemented a conversation analysis tool that facilitates automated summarisation and stance detection of public discussions on national and social issues, supporting public officers in rapidly and accurately assessing public sentiment. The tool leverages Large Language Models (LLMs) and prompt engineering to automate much of the manual extractive analysis and quantification work, allowing public officers to focus on efficiently reviewing the outputs and making adjustments for accuracy and nuance where required.
In summary, the tool augments the analysis of public discussions by:
- Automatically summarising narratives across the discussion
- Automatically assessing contributors’ stances on each narrative and aggregating statistics to represent the key discussion sentiments
- Providing intuitive interfaces to review AI-generated content for accuracy and make edits where needed
- Ensuring secure and compliant use of AI by automating pre-processing and anonymisation of raw data as appropriate
Developing AI-driven Digital Products — Robust Evaluation and Continual Refinement
Driving operations improvements with digital transformation can often be a complex undertaking, especially when operations procedures have to be re-designed in tandem with the design and development of digital tools. This is further complicated when incorporating AI and data science due to uncertainty about the accuracy and consistency of such components as opposed to more deterministic logic-based workflows.
To ensure that solutions are fit for purpose, our product development team spends significant effort to evaluate AI performance on the specific use case before deployment, and incorporates user validation and monitoring of AI performance during operations into the product design. We also conduct regular studies into areas with potential for optimisation (e.g. performance, cost) to continually improve the solution.
For the conversation analysis tool, this evaluation covered the various AI-powered steps in the analytics workflow (e.g. narrative extraction, stance detection). This article focuses on a post-deployment optimisation study of the stance detection step.
Stance Detection Approach Deployed
To provide context for the rest of the article, we briefly describe the stance detection approach deployed in the solution. It was the result of extensive pre-deployment experimentation with several approaches, benchmarked against human-level performance derived from a human-annotated ground-truth dataset of some 3,000 stances labelled by a three-person panel.
The stance detection step takes as input (1) the conversation log, with messages timestamped and annotated with the respective author (i.e. contributor); and (2) the claim statement derived from the narratives extracted during the summarisation step. It then produces a binary output (“yes” or “no”) for each contributor, indicating whether that contributor expressed the narrative in the claim statement through their messages across the entire conversation. The output is produced via zero-shot prompting using commercial LLMs (GPT-4o). This contributor-level annotation is then used to generate overall statistics reflecting the relative prevalence of particular points of view expressed in the conversation.
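As an illustration, a zero-shot call for this step might assemble a prompt along the following lines. The wording and the `build_stance_prompt` helper are hypothetical stand-ins, not the production prompt:

```python
# Hypothetical sketch of the zero-shot stance-detection prompt described above.
def build_stance_prompt(claim, conversation):
    """Assemble a zero-shot prompt asking whether each contributor
    expressed the narrative in `claim` across their messages."""
    log = "\n".join(
        f"[{m['timestamp']}] {m['author']}: {m['text']}" for m in conversation
    )
    return (
        "You are analysing a public discussion.\n"
        f"Claim statement: {claim}\n\n"
        f"Conversation log:\n{log}\n\n"
        "For each contributor, answer strictly 'yes' or 'no': did this "
        "contributor express the narrative in the claim statement anywhere "
        "in their messages?"
    )

conversation = [
    {"timestamp": "10:02", "author": "User A", "text": "Not everyone can work until so old."},
    {"timestamp": "10:05", "author": "User B", "text": "They just don't want CPF to run out."},
]
prompt = build_stance_prompt(
    "Raising the CPF withdrawal age helps ensure retirement adequacy.", conversation
)
# The prompt would then be sent to a chat-completions endpoint (e.g. GPT-4o)
# and the per-contributor yes/no answers parsed from the response.
```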
Fine-tuning Stance Detection
Overall, the results using zero-shot LLMs were promising. However, we felt that fine-tuning presented a promising path to further elevate performance and domain-specific understanding. In our previous articles, we wrote about how fine-tuning improved the performance of language models, and we wanted to explore whether our data could be exploited to improve stance detection. With that in mind, we set out the following experimentation objectives:
- Closer Alignment with Human Judgment and Local Context: Fine-tuning on our labelled data aligns model predictions with public officers’ stance interpretation, boosting accuracy and trust. It also helps the model learn Singapore-specific terms, policies, and expressions that general LLMs might miss.
- Balancing Memory Requirements and Performance: Embedding prompt knowledge directly into the model enables faster, cheaper predictions with smaller models. We explore both small and large models to assess performance differences and manage memory-performance trade-offs for fine-tuning.
We utilise zero-shot GPT-4o results as a baseline, since this was the model utilised for the current setup. Standard classification metrics — F1 score, precision, recall, and accuracy — are used for benchmarking, with priority given to the F1 score to balance precision and recall. The input data used for fine-tuning includes one position statement, user conversations, and a label of agree or disagree. Since this is a classification task, we will explore fine-tuning both small and large language models, aligning with our aim to understand trade-offs and fine-tuning challenges across different models.
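For reference, the four benchmarking metrics reduce to simple counts over the confusion matrix. A toy computation with "agree" as the positive class (the labels here are illustrative, not our data):

```python
# Toy computation of the benchmarking metrics on hypothetical predictions.
def classification_metrics(y_true, y_pred, positive="agree"):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": correct / len(y_true)}

y_true = ["agree", "agree", "disagree", "agree", "disagree"]
y_pred = ["agree", "disagree", "disagree", "agree", "agree"]
metrics = classification_metrics(y_true, y_pred)
```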
Challenges Faced During Fine-Tuning for Long-Context Data
Fine-tuning presents several data challenges. The dataset comprises conversational threads, where messages may respond to a broader topic — such as a Singaporean policy, political event, or economic issue — or to earlier messages in the thread. Each user is classified against a position statement and may have multiple messages across different stages of the conversation. A hypothetical small snippet of the conversation could be:
User A:
I get what they’re trying to do — people live longer now. But not everyone same one leh, some jobs really cannot tahan until so old.
User B:
Ya lor, but I think they not asking people to work longer per se. More like, want to make sure people don’t finish their CPF too fast. Otherwise later really no money how?
User A:
True also… but like that, the system a bit too rigid right? Some people age faster, some work more jialat jobs. Maybe should have more leeway or options, not so one-size-fits-all.
User C:
I don’t think I like what the first person said.
A position statement reflecting the stance a policy officer seeks to understand is then curated. In this case: “Raising the CPF withdrawal age helps ensure retirement adequacy for Singaporeans.” We aim to automate the classification of whether users agree with this statement.
This nested, context-dependent structure means isolated messages often lack standalone meaning. For example, a response like “Ya lor, but I think they not asking people to work longer per se” is unclear without knowing what it’s replying to; it only makes sense when seen as a response to concerns about physically demanding jobs and ageing. Similarly, “True also… but like that, the system a bit too rigid right?” depends on the earlier mention of sustainability versus flexibility. Without sufficient context, the model struggles to capture intent, nuance, or relevance in such exchanges. Hence, retaining a large snippet or the full conversation is essential during fine-tuning to enable reasoning across multiple turns.
Fitting a full conversation, which frequently exceeds 150,000 characters, into a language model runs up against the limited context windows of such models, especially smaller BERT-style models, which often cap out at 512 tokens.
Recent advances in LLMs now allow much longer conversations to be processed. For instance, GPT-4o supports a 128,000-token context, while Gemini 1.5 Pro offers a 2-million-token window. However, the “lost-in-the-middle” issue remains common in long-context settings, where LLMs often neglect information located mid-sequence. This stems from positional attention biases and training on shorter contexts, causing models to prioritise the beginning and end of sequences. This limitation is pertinent to our setup, as a key user message may appear mid-text. Nonetheless, recent experiments suggest improvements in this area for OpenAI’s o1 and Gemini models.
Long-context challenges become more pronounced during fine-tuning due to GPU memory constraints. Longer sequences combined with larger batch sizes significantly increase memory demands. For this training data, batch sizes above one would likely exceed GPU capacity.
Options for Fine-tuning Long Context Language Models
Fine-Tuning Cross-Encoders and Bi-encoders
For classification tasks, we often prefer lightweight BERT-style models over large generative models (LLMs). BERT models have a much smaller output space due to their classification heads, allowing them to focus directly on the task with lower latency and computational overhead.
In our case, the input comprises two to three components: the position statement, the conversation, and additional context. To handle this, we explored both a cross-encoder (mMiniLMv2-L12-H384-v1) and a bi-encoder architecture (jina-embeddings), selected (somewhat arbitrarily) from models pre-trained on multilingual datasets.
Cross-encoders concatenate input pairs and process them jointly through a transformer, enabling full attention across tokens. This typically yields higher accuracy but incurs significant computational cost, especially during retrieval, as inference must be run for every input pair.
Bi-encoders, by contrast, encode each input independently and compare their embeddings. This approach is far more scalable and well-suited for real-time or large-scale applications. However, it may underperform on tasks requiring nuanced token-level interactions between inputs.
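The difference between the two scoring paths can be sketched with toy stand-ins. A bag-of-words vector replaces the learned embedding model here; `embed` and the example texts are illustrative only:

```python
import math

def embed(text):
    """Toy bag-of-words embedding; a real bi-encoder would use a learned
    sentence-embedding model."""
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

claim = "raising the withdrawal age helps retirement adequacy"
messages = [
    "raising the withdrawal age is too rigid",
    "the weather is nice today",
]

# Bi-encoder path: the claim embedding is computed once and reused, so scoring
# N messages costs N cheap vector comparisons rather than N transformer passes.
claim_vec = embed(claim)
scores = [cosine(claim_vec, embed(m)) for m in messages]

# Cross-encoder path (interface only): each (claim, message) pair would be
# concatenated and run through the transformer together, e.g.
#   scores = cross_encoder.predict([(claim, m) for m in messages])
# which allows full token-level attention but needs one forward pass per pair.
```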
A key challenge was handling long-context data within the 512-token limit. While truncation and summarisation are common solutions, truncation can omit crucial context, and summarisation — especially with larger models — adds latency, undermining the efficiency benefits of smaller models.
Our preprocessing strategy leveraged the data’s structure as multi-message conversations. We reduced the granularity to the message level, which required relabelling, as the original annotations were at the conversation level. Whether done manually or via a teacher model, relabelling risks introducing errors, and aggregating message-level predictions back to conversation-level labels adds further complexity.
Fine-Tuning Long Context BERT Style Model
ModernBERT builds on BERT’s core strengths with key innovations in pretraining and architecture. It features a hybrid attention mechanism that combines global and local attention for balanced contextual understanding and computational efficiency. Inspired by long-context models like Gemma, it applies Global Attention every third layer, allowing all tokens to attend to one another for holistic context. Other layers use Local Attention with a 128-token sliding window to maintain local coherence and reduce compute demands.
ModernBERT also uses Multi-Query Attention (MQA), sharing key/value vectors across all heads to cut memory and inference costs without sacrificing quality — a common technique in scalable transformers. For speed and scalability, it integrates FlashAttention, a GPU-optimised mechanism that fuses operations into a single kernel and processes SRAM-friendly tiled blocks, ideal for long sequences.
With support for an 8K-token context window, ModernBERT suits extended dialogues and document-level tasks. However, the longer context raises memory requirements, challenging resource-limited hardware such as the NVIDIA T4 GPUs on AWS g4dn instances. Gradient accumulation is used to simulate larger batch sizes while keeping the actual batch size at one.
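The mechanics of gradient accumulation can be illustrated with a toy scalar model. This is a sketch of the idea only, not our training loop:

```python
# Gradient accumulation: gradients from several batch-size-1 micro-batches are
# summed before a single optimiser step, mimicking a larger effective batch
# without ever holding it in memory. A scalar least-squares model stands in
# for the network.
def grad(w, x, y):
    # d/dw of 0.5 * (w*x - y)^2
    return (w * x - y) * x

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w, lr, accum_steps = 0.0, 0.01, 4

acc = 0.0
for step, (x, y) in enumerate(data, start=1):
    acc += grad(w, x, y)          # micro-batch of one sequence
    if step % accum_steps == 0:   # optimiser step only every `accum_steps`
        w -= lr * (acc / accum_steps)
        acc = 0.0
# One update has been applied, equivalent to a single batch of 4.
```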
Pretrained mainly on English text, ModernBERT underperforms in multilingual settings. Around one in seven samples in our training dataset is in Chinese, limiting direct use. Re-training on Chinese is resource-intensive, while translation pipelines introduce additional compute and operational overhead, reducing the efficiency gains of ModernBERT. We chose the translation option.
Long-Context Strategies for Classification Tasks
While we need not train ModernBERT at the message level, we still need to manage a finite context window. We explored several strategies for processing long conversational history in classification tasks:
- Truncation: Truncate the conversation to fit within the 8K token window.
- Summarisation: Summarise the user’s previous messages to condense history before classification.
- Snippets of Context: For each user message, extract a snippet of preceding context — including messages from others — up to a defined token limit that fits the context window.
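A minimal sketch of the third strategy, with whitespace tokenisation as a stand-in for the model tokeniser (the `context_snippet` helper is hypothetical):

```python
# For a target user message, walk backwards through the thread collecting
# preceding messages (from any author) until a token budget is exhausted.
def context_snippet(messages, target_idx, max_tokens=512):
    budget = max_tokens - len(messages[target_idx].split())
    snippet = []
    for msg in reversed(messages[:target_idx]):
        cost = len(msg.split())
        if cost > budget:
            break
        snippet.append(msg)
        budget -= cost
    snippet.reverse()
    return snippet + [messages[target_idx]]

thread = [
    "User A: some jobs really cannot tahan until so old",
    "User B: they not asking people to work longer per se",
    "User A: the system a bit too rigid right",
]
# With a tight budget, only the immediately preceding message survives,
# preserving the reply-to relationship the classifier needs.
window = context_snippet(thread, target_idx=2, max_tokens=25)
```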
Fine-Tuning Long Context Large Language Models
Large Language Models (LLMs), especially decoder-only architectures like GPT-style transformers, offer a strong alternative for classification tasks involving long-context inputs. Their native capacity to handle extended sequences makes them ideal for tasks requiring understanding of lengthy conversations or documents. We used Mistral-Instruct, which supports a 32k-token context length in version 0.2 and above.
Advantages:
- Native long-context support: LLMs capture interactions and dependencies across long sequences without architectural changes.
- Pretrained knowledge: Pretraining on large-scale web and conversational data helps LLMs generalise better to dialogue-rich or domain-specific contexts.
Challenges:
- Stability during fine-tuning: LLMs are sensitive to hyperparameters and prone to overfitting on small, imbalanced classification datasets.
- Context limitations remain: Despite extended windows, very long inputs may still require truncation, summarisation, or contextual snippet extraction — approaches we experimented with.
- Resource demands: LLMs are significantly larger than models like BERT and require more compute and memory. With 8,192-token sequences and the large Mistral model, training on a single A100 is not feasible.
Changes to LLM Model Architecture for Fine-Tuning
In our previous post, we explained how we utilised LoRA to improve the training process. To recap, LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that inserts small trainable low-rank matrices into each layer of a pre-trained model, keeping the original weights frozen. This drastically reduces the number of trainable parameters — often by orders of magnitude — making fine-tuning faster and more memory-efficient. LoRA is particularly valuable for long input sequences, where full fine-tuning is memory- and compute-intensive.
However, LoRA introduces memory overhead due to the adapter weights. To manage this, particularly with long inputs, we apply LoRA selectively to specific layers, such as the query (q), key (k), and value (v) projections in the attention mechanism, where most task-specific learning occurs. We also cap the LoRA rank at 8 to constrain the dimensionality of the low-rank matrices, reducing parameter count and memory usage at a potential cost in representational capacity. Together, these choices enable efficient fine-tuning within manageable memory limits.
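A back-of-envelope calculation shows why a rank-8 adapter on the q/k/v projections is cheap. The dimensions are illustrative (a Mistral-7B-like hidden size, treating every projection as d × d; Mistral's k/v projections are in fact smaller under grouped-query attention):

```python
# Rank-r LoRA replaces a trainable d x d update with two matrices A (d x r)
# and B (r x d), so trainable parameters drop from d^2 to 2*d*r per projection.
d_model, rank, n_layers, n_proj = 4096, 8, 32, 3   # q, k, v per layer

full_params_per_proj = d_model * d_model    # frozen base weight
lora_params_per_proj = 2 * d_model * rank   # A + B adapter weights

total_lora = lora_params_per_proj * n_proj * n_layers
total_frozen = full_params_per_proj * n_proj * n_layers
fraction = total_lora / total_frozen        # trainable share of attacked weights
```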
Quantisation reduces base model memory by compressing 16-/32-bit floats to 8-bit integers, cutting memory and compute by up to 4× without significant performance loss — if applied correctly. Unlike naïve rounding, it scales and maps floats to int8 using techniques like zero-point and absmax quantisation. In zero-point, values are scaled (e.g., by 127), rounded, and later restored by dividing by the same factor.
Absmax quantisation scales each tensor by its absolute max value. All elements are divided by this scale, multiplied by 127 to normalise to [−127, 127], then rounded and stored as int8. Dequantisation divides int8 values by the same scale during inference. Standard quantisation struggles with outlier activations — rare, extreme values that skew outputs. LLM.int8() addresses this by splitting matrix multiplication: outliers (e.g., >6) are handled in FP16, the rest use vector-wise int8 quantisation (row-wise for activations, column-wise for weights), then results are dequantised and merged in FP16.
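The absmax round trip described above can be written out in a few lines (a toy sketch on a plain Python list rather than a tensor):

```python
# Absmax int8 quantisation: scale by the tensor's absolute maximum so the
# largest value maps to 127, round to int8, and divide by the same scale
# to dequantise.
def absmax_quantise(xs):
    scale = 127.0 / max(abs(x) for x in xs)
    q = [round(x * scale) for x in xs]   # values now fit in int8
    return q, scale

def absmax_dequantise(q, scale):
    return [x / scale for x in q]

weights = [0.12, -0.5, 0.33, 0.9, -0.07]
q, scale = absmax_quantise(weights)
restored = absmax_dequantise(q, scale)
# Round-trip error is bounded by half a quantisation step (~max|x| / 254).
```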
QLoRA enables efficient fine-tuning by combining 4-bit quantisation with LoRA. The base model is quantised with NF4 or FP4, reducing memory. NF4 maps values to 16 non-uniform levels from a normal distribution, offering higher resolution near zero — where weights cluster — for better low-precision accuracy. We quantised the base model to 4-bit to fit consumer GPUs and trained only LoRA adapters, achieving scalability and efficiency.
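The idea behind NF4's non-uniform levels can be sketched by placing 16 levels at standard-normal quantiles. This reproduces the concept only; the actual NF4 codebook in bitsandbytes uses different constants and includes an exact zero:

```python
from statistics import NormalDist

# 16 levels at quantiles of a standard normal, rescaled so the extremes map
# to -1 and 1. Because normal density peaks at zero, levels cluster there,
# giving higher resolution where trained weights concentrate.
nd = NormalDist()
qs = [nd.inv_cdf((i + 0.5) / 16) for i in range(16)]
levels = [q / max(abs(q) for q in qs) for q in qs]

def nf4_like_quantise(x, absmax):
    """Quantise x (normalised by the block's absmax) to the nearest level index."""
    scaled = x / absmax
    return min(range(16), key=lambda i: abs(levels[i] - scaled))

idx_small = nf4_like_quantise(0.01, absmax=1.0)   # lands on a level near zero
gap_near_zero = levels[8] - levels[7]             # fine spacing at the centre
gap_at_tail = levels[15] - levels[14]             # coarse spacing at the tails
```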
Distributed Training for LLMs
DeepSpeed uses advanced parallelism and memory optimisation techniques. Tensor parallelism shards large tensors (e.g., Q/K/V projections, MLP weights) across GPUs along specific dimensions. Each GPU performs partial computations on its shard, synchronising via collective ops like all-reduce or reduce-scatter during forward and backward passes. This reduces per-GPU memory and accelerates matrix multiplications by distributing workloads. Pipeline parallelism splits the model into stages across GPUs. Using micro-batching and schedulers (e.g., 1F1B, GPipe), it overlaps forward and backward passes to maximise utilisation and spread activation storage.
ZeRO further reduces memory usage: Stage 1 shards optimiser states, Stage 2 shards gradients, and Stage 3 shards model parameters across GPUs. Stage 3 also supports CPU or NVMe offloading (ZeRO-Offload/Infinity), using PCIe/NVMe bandwidth to handle models exceeding GPU memory. Communication is optimised via NCCL over NVLink or PCIe, with asynchronous CUDA streams overlapping communication and computation. Fused kernels (e.g., fused all-reduce) cut overhead and boost throughput.
Activation checkpointing saves memory by recomputing intermediates during backprop. Tiling and kernel fusion improve GPU cache use and reduce memory bandwidth load. Dynamic memory management tracks allocation to reduce fragmentation and offload or reallocate memory, maximising GPU efficiency.
The entire system is abstracted by Axolotl, which integrates smoothly with PyTorch and Accelerate. A snippet of the configuration covering the concepts above is shown below. The input data should be formatted as a list of dictionaries with keys “user” and “assistant”, and AutoTokenizer will format it accordingly (into the Mistral chat format in this case).
base_model: mistralai/Mistral-7B-Instruct-v0.2

datasets:
  - path: /content/train_data.jsonl
    ds_type: json
    type: chat_template
    chat_template: tokenizer_default
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles:
      user: ["user"]
      assistant: ["assistant"]
    drop_system_message: true
    roles_to_train: ["assistant"]
    data_files:
      - /content/train_data.jsonl

# LoRA config
adapter: lora
lora_r: 8
lora_alpha: 16
lora_dropout: 0.1
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
lora_modules_to_save:
  - embed_tokens
  - lm_head

is_mistral_derived_model: true

# Format
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
sequence_len: 8192
pad_to_sequence_len: true
load_in_4bit: true
flash_attention: true
sequence_parallel_degree: 2
device_map: sequential

# DeepSpeed
deepspeed: /content/ds_config_zero3.json
And in the DeepSpeed configuration, we need to align the train batch size with the product of the micro-batch size per GPU, the gradient accumulation steps, and the number of GPUs (2 in this case):
{
  "train_batch_size": 16,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu"
    },
    "offload_param": {
      "device": "cpu"
    },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": {
    "enabled": true
  },
  "steps_per_print": 100,
  "wall_clock_breakdown": false
}
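That alignment constraint can be sanity-checked programmatically against the values above:

```python
# DeepSpeed enforces: train_batch_size ==
#   train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
ds = {
    "train_batch_size": 16,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
}
world_size = 2  # two GPUs in this setup

assert ds["train_batch_size"] == (
    ds["train_micro_batch_size_per_gpu"]
    * ds["gradient_accumulation_steps"]
    * world_size
)
```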
Ray is also a powerful alternative for distributed training of LLMs, especially when combined with DeepSpeed to enhance GPU communication efficiency and enable data parallelism. To leverage Ray, a Ray cluster must be configured, consisting of a head node and one or more worker nodes.
One key advantage of Ray over setups like Axolotl is its flexibility in distributing GPU workloads across multiple instances. For example, if only four single-GPU NVIDIA L4 instances are available — rather than one with four GPUs — Ray enables efficient parallel training across them. It also offers built-in hyperparameter tuning, coordinating jobs across workers for streamlined optimisation. Unlike Axolotl, Ray lets you fully customise the training pipeline while abstracting only the distributed infrastructure. Ray also supports dynamic autoscaling based on specified resource requirements, making it well-suited for long-term or always-on fine-tuning infrastructure. With Ray Serve, the same framework can be used for both continuous training and model deployment.
While Ray is typically deployed on Kubernetes, Vertex AI offers native support for Ray clusters. On Vertex AI, setup involves specifying a lightweight head node and GPU-enabled worker nodes according to your training requirements. Once the cluster is up, we use Ray’s data-loading APIs to prepare our training data.
train_text = [{"text": text} for text in list(df.prompt.values)]
ray_train_dataset = ray.data.from_items(train_text)

batch_size = 8
ray_train_datamap = ray_train_dataset.map_batches(
    preprocess_function,
    batch_format="pandas",
    batch_size=batch_size
)
Next, we feed the Ray data into the trainer to begin the training process.
import ray
from huggingface_hub import login
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)
from ray.train.huggingface.transformers import (
    RayTrainReportCallback,
    prepare_trainer,
)

def trainer_init_per_worker(config):
    # DeepSpeed configuration
    deepspeed_config = {
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "zero_optimization": {
            "stage": 3,
            "offload_param": {"device": "cpu"},
            "offload_optimizer": {"device": "cpu"}
        },
        "bf16": {"enabled": True},
        "tensor_parallel": {"tp_size": 2},
        "sequence_parallel": {"enabled": True}
    }

    # Load tokenizer
    login(token=config["hf_token"])
    tokenizer = AutoTokenizer.from_pretrained(
        "mistralai/Mistral-7B-Instruct-v0.2",
        trust_remote_code=True,
        padding_side="left",
        use_fast=False
    )
    tokenizer.pad_token = tokenizer.eos_token

    # Fetch this worker's shard of the Ray datasets
    train_dataset_shard = ray.train.get_dataset_shard("train")
    train_ds_iterable = train_dataset_shard.iter_torch_batches(batch_size=1)
    eval_dataset_shard = ray.train.get_dataset_shard("validation")
    eval_ds_iterable = eval_dataset_shard.iter_torch_batches(batch_size=1)

    # Model initialisation with quantisation config
    model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path=config.get("model_name"),
        load_in_4bit=True,
        trust_remote_code=True
    )
    model.gradient_checkpointing_enable()
    model = prepare_model_for_kbit_training(model)

    # PEFT configuration for LoRA
    peft_config = LoraConfig(
        lora_alpha=config.get("lora_alpha", 16),
        lora_dropout=config.get("lora_dropout", 0.1),
        r=config.get("lora_r", 8),
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj"]
    )
    model = get_peft_model(model, peft_config)
    print(f"Max Steps: {config.get('max_steps', 10)}")

    # Training arguments
    training_arguments = TrainingArguments(
        deepspeed=deepspeed_config,
        output_dir="/model",
        per_device_train_batch_size=config.get("per_device_train_batch_size", 1),
        gradient_accumulation_steps=config.get("gradient_accumulation_steps", 4),
        optim="paged_adamw_8bit",
        save_steps=config.get("save_steps", 50),
        logging_steps=config.get("logging_steps", 50),
        learning_rate=config.get("learning_rate", 2e-4),
        bf16=True,
        fp16=False,
        max_steps=config.get("max_steps", 10),
        weight_decay=config.get("weight_decay", 0.0001),
        logging_strategy="steps",
        save_strategy="steps",
        warmup_ratio=config.get("warmup_ratio", 0.03),
        group_by_length=False,
        lr_scheduler_type=config.get("lr_scheduler_type", "constant"),
        ddp_find_unused_parameters=False,
        push_to_hub=False,
        disable_tqdm=False,
        label_names=["input_ids", "attention_mask"],
        evaluation_strategy="steps",
        eval_steps=config.get("eval_steps", 100),
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False
    )

    # Initialise the Trainer inside the worker
    trainer = Trainer(
        model=model,
        args=training_arguments,
        train_dataset=train_ds_iterable,
        eval_dataset=eval_ds_iterable,
        tokenizer=tokenizer,
    )
    trainer.add_callback(RayTrainReportCallback())
    # Add callback for early stopping
    trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=5))
    trainer = prepare_trainer(trainer)

    # Run the training process inside the worker
    print("Starting training...")
    trainer.train()
    return trainer
Despite its advantages, Ray introduces significant communication overhead between worker nodes. On average, we observed that training on Ray was about 20% slower. Managing data bottlenecks between the client, dataloaders, slicers, and GPU memory can be challenging, especially for data with large context sizes. For our case, Axolotl offered more direct advantages in speed, but for more complex training processes, Ray can offer greater flexibility.
Results
We briefly discuss the key results of our experiment.
Key Takeaways from Fine-Tuning Results:
- Competitive Performance with GPT-4o: While the fine-tuned models do not yet surpass GPT-4o, they achieve closely comparable results, highlighting the potential of fine-tuning even with limited training data. With improved data augmentation strategies, particularly given the small dataset of just three conversations, further gains are anticipated.
- Stable Performance Across Pre-processing Variants: The fine-tuned language models exhibit consistent performance across different pre-processing methods, with a slight advantage observed when using the full context. This indicates robustness in the model’s ability to learn from varying input formats.
- Efficiency of Smaller Models: Notably, smaller language models demonstrated performance close to that of larger LLMs. This presents promising opportunities for deploying cost-effective models in production settings without a significant trade-off in performance.
- Challenges for Smaller Models: A key limitation observed was the sensitivity of the conversation-level aggregation approach — where a single incorrect message-level prediction could skew the overall label. To mitigate this, we introduced a pragmatic internal labelling heuristic: if a user expressed agreement with the position statement in any message, the conversation was labelled accordingly.
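The heuristic described above amounts to a simple any-agreement aggregation over a user's message-level predictions (sketch only; the label strings are illustrative):

```python
# Conversation-level label: "agree" if any of the user's message-level
# predictions is "agree", so one noisy "disagree" cannot flip an otherwise
# clear signal of agreement.
def aggregate_user_stance(message_predictions):
    return "agree" if any(p == "agree" for p in message_predictions) else "disagree"

aggregate_user_stance(["disagree", "agree", "disagree"])   # -> "agree"
aggregate_user_stance(["disagree", "disagree"])            # -> "disagree"
```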
Conclusion
As public agencies aim to understand and engage citizens more comprehensively, stance detection has become a key capability — one effectively enabled by language models. Fine-tuning on domain-specific data improves contextual accuracy, and while off-the-shelf models perform well, we plan to expand on the methods in this post to cover a wider range of public sentiment use cases.
We also explored the trade-offs between fine-tuning smaller and larger models for long-context classification. This has practical implications for use cases involving extended text. Smaller models require careful preprocessing to reduce input length, tailored to content characteristics. Larger models, however, face resource constraints — especially fitting long sequences and batch sizes into a single GPU. Techniques like gradient accumulation, quantisation, and distributed training help address these challenges.
Our work will continue towards building robust, scalable stance detection systems for the public sector. If you’re from a public agency working with long-context text or sentiment analysis, GovTech’s AI Practice Group would be happy to collaborate with you.
Credits
Credits to contributors: Arnold Ng, Gabriel Quek, Chong Yu Kai, Jacob Lim
Credits to reviewers: Watson Chua, Rachel Shong, Tiffany Ong
References
Belkada, Y., & Dettmers, T. (2022, August 17). A gentle introduction to 8-bit matrix multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes. Hugging Face. https://huggingface.co/blog/hf-bitsandbytes-integration
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv. https://arxiv.org/abs/2106.09685
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Le Scao, T., Lavril, T., Wang, T., Lacroix, T., & El Sayed, W. (2023). Mistral 7B. arXiv. https://doi.org/10.48550/arXiv.2310.06825
Leng, Q., Portes, J., Havens, S., Zaharia, M., & Carbin, M. (2024, October 8). The long context RAG capabilities of OpenAI o1 and Google Gemini. Mosaic AI Research, Databricks. https://www.databricks.com/blog/long-context-rag-capabilities-openai-o1-and-google-gemini
Li, T., Zhang, G., Do, Q. D., Yue, X., & Chen, W. (2024). Long-context LLMs struggle with long in-context learning. arXiv. https://doi.org/10.48550/arXiv.2404.02060
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. https://doi.org/10.48550/arXiv.2307.03172
Rajbhandari, S., Rasley, J., Ruwase, O., & He, Y. (2020). ZeRO: Memory optimizations toward training trillion parameter models. arXiv. https://doi.org/10.48550/arXiv.1910.02054
Ray Team. (n.d.). Ray use cases. Ray Documentation. Retrieved May 19, 2025, from https://docs.ray.io/en/latest/ray-overview/use-cases.html
Reimers, N., & Gurevych, I. (n.d.). Cross-Encoders — Sentence Transformers documentation. SBERT.net. Retrieved May 19, 2025, from https://www.sbert.net/examples/cross_encoder/applications/README.html
Warner, B., Chaffin, A., Clavié, B., Weller, O., Hallström, O., Taghadouini, S., Gallagher, A., Biswas, R., Ladhak, F., Aarsen, T., Cooper, N., Adams, G., Howard, J., & Poli, I. (2024). Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. arXiv. https://doi.org/10.48550/arXiv.2412.13663