Research foundations of modern LLMs — series

Information extraction didn’t disappear. It moved inside the workflow.

Yee Seng Chan — Sun, 05 Apr 2026 00:00:00 GMT

For thirty years, information extraction was a real subfield of NLP. Named entity recognition, relation extraction, event extraction, coreference resolution, slot filling, knowledge-base population. Annotated corpora: ACE, OntoNotes, KBP, TAC. A shared-task culture at every major conference. An entire industry of vendors and government-funded R&D organizations whose product was, essentially, structured records pulled out of unstructured text.

By 2024, the center of gravity had shifted.

The standard story is that LLMs replaced IE. The truer claim is that IE didn’t disappear. It moved inside the workflow. Where the pre-LLM IE pipeline was the product (extract entities, link them, populate a knowledge base, ship the KB), modern IE is the structured layer between raw text and downstream generation or action.

What changed wasn’t the existence of the work. It was the labor model. Pre-LLM IE meant labeling thousands of examples per task, engineering features, training task-specific classifiers, and tuning a pipeline. LLM-era IE means designing a JSON schema, writing a prompt, and evaluating the result. Both are real engineering. They are not the same engineering.

I worked across all three eras the article traces: pre-BERT IE during a postdoc, BERT-era production IE through the late 2010s, and LLM-era workflow IE now.

Figure 1: The three eras of information extraction. The pipeline shape is unchanged from pre-BERT to BERT era; only the components inside each box change. The LLM era replaces the linear pipeline with a workflow: parallel extraction calls feed structured frames into a downstream system, and the labor profile shifts from annotation and feature engineering to schema design, prompt engineering, and evaluation.

What IE used to look like

The canonical IE pipeline went something like this:

NER: tag spans of text with entity types (PER, ORG, GPE, LOC).
Coreference: cluster mentions that refer to the same entity.
Relation extraction: for each pair of entities, classify the relation between them, if any.
Event extraction: identify event triggers, classify the event type, then identify the arguments playing each role.
Knowledge-base population: combine the above into a structured record per entity, linked across documents.

Each of these was its own subfield. Each had its own datasets: CoNLL-2003 (Tjong Kim Sang and Meulder 2003) for NER, ACE-2005 (Walker et al. 2006) for events and relations, OntoNotes (Hovy et al. 2006) for coreference, the TAC KBP datasets (Ji and Grishman 2011) for end-to-end knowledge-base construction. Each had its own modeling tradition: HMMs and CRFs for sequence labeling, structured-prediction for relation classification, ILP-based joint inference for combining extractions.

The labor was substantial and concentrated on the input side. To build an NER system for a new entity type, you collected text, annotated mentions (typically thousands per type), engineered features (word identity, POS, gazetteer match, dependency context), and trained a classifier. Relation extraction added another annotation pass per relation type. Event extraction added another for triggers and another for each argument role.

I worked in this regime during my postdoc. The published work from that period is on relation extraction (Chan and Roth 2010, 2011) and minimally-supervised event causality (Do et al. 2011). The relation-extraction setup was typical for the era: a feature engineering pass over syntactic and semantic structures (constituency parses, dependency paths, semantic roles, Wikipedia categories as background knowledge), then a structured classifier.

The defining property of pre-BERT IE wasn’t any particular model. It was the labor model. Every new task, every new domain, every new ontology required a fresh labeling effort, a fresh feature-engineering pass, and a fresh trained model. Performance was decent on benchmarks and worse in domain transfer. Adapting to a new domain typically meant starting over.

BERT made IE better, not cheaper

BERT (Devlin et al. 2019) changed the modeling but not the labor. The pipeline shape stayed identical to the pre-BERT pipeline; the component models got better:

NER became encoder + token classifier with BIO tags (what Hugging Face exposes as AutoModelForTokenClassification).
Relation extraction became encoder + span-pair classifier: encode the sentence, pull out the two entity span representations, pool them, run through an MLP head.
Event extraction became encoder + token classifier for triggers, plus a span-pair-style classifier for arguments.
Coreference became span scoring and clustering over contextualized representations.
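To make the "encoder + BIO tagger" interface concrete, here is a minimal sketch of the decoding step that turns per-token BIO tags into typed spans. It assumes the tags have already been predicted; the function name and the lenient handling of a stray I- tag are my choices for illustration, not a fixed convention.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start_token, end_token_exclusive, entity_type)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Collapse a BIO tag sequence into typed token spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if its type matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Anything else closes the open span, if there is one.
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        # B- opens a new span; a stray I- is treated like B- (lenient choice).
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


tokens = ["Angela", "Merkel", "visited", "Paris", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
for s, e, t in bio_to_spans(tags):
    print(t, " ".join(tokens[s:e]))
```

Note what the format cannot express: each token carries exactly one tag, so nested or overlapping entities need a different output representation.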

The improvements were real. The transition from feature engineering to learned representations was, in retrospect, the largest single quality jump IE had seen in a decade.

But the labor model was unchanged. You still needed labeled data per task. You still needed an annotation effort per new domain. The encoder did the feature engineering for you, but the rest of the work (schema design, annotation, evaluation) was the same as before.

I spent most of the BERT era on this kind of production IE: event extraction (Chan et al. 2019), KBP-style end-to-end systems (DeYoung et al. 2017), domain-specific machine reading and few-shot event mention retrieval (Min et al. 2019, 2020). I designed and built the deep-learning IE platform these systems ran on, layered on the Hugging Face ecosystem and used across government-funded and industry projects.

One multilingual event extraction project captures how sophisticated BERT-era IE had become. The training data was English, the inference target was Arabic, and the ontology covered roughly forty event types with typed argument roles. The stack used XLM-R (Conneau et al. 2020), multi-stage fine-tuning, token classification, continued MLM pretraining on mixed English-Arabic text, and automated checkpoint selection. This was not “just put a classifier on BERT.”

A second project, rapid customization of event extraction for new ontologies (Chan et al. 2019), was already gesturing at the LLM-era interface. The setup: take a small set of event-type definitions and a handful of seed examples, produce a working extractor without a full annotation effort. We compressed the labeling step with bootstrapping; the modern version puts definitions and examples into a prompt instead. What changed is that the definitions-plus-examples now condition a frontier LLM directly rather than seeding a bootstrapping loop.

The point of dwelling on this isn’t nostalgia. Too many current writeups treat the BERT era as a stepping stone: “people used to fine-tune classifiers, then LLMs came along.” That undersells the engineering. The systems were good. They were just expensive to build per task, which is exactly the thing the next era changed.

LLMs changed the labor model

What changed with LLMs wasn’t that you could finally do IE. You could already do IE. What changed was the cost structure.

A frontier LLM with structured-output support can solve most NER, RE, and slot-filling tasks zero-shot or few-shot from a prompt and a JSON schema. Quality is task-dependent:

On standard benchmarks like CoNLL-2003, a well-prompted frontier LLM lands within a few F1 points of fine-tuned baselines.
On harder benchmarks like ACE 2005 event extraction, zero-shot LLMs still trail fine-tuned specialists by double-digit margins (Zhang et al. 2025).

The gap closes with a handful of in-context examples or light fine-tuning on synthetic data. But for most use cases the marginal value of building a specialized BERT-era IE pipeline collapsed, because the engineering cost of getting from “no extractor” to “working extractor” dropped by an order of magnitude even when the LLM isn’t strictly best on F1.

The mechanics are by now well-known. The prompt asks for structured output. The schema is enforced by the provider’s structured-output mode (OpenAI’s response_format), or by Pydantic schemas as a type-checked output contract. For NER, the schema is a list of typed entities. For RE, a list of (head, relation, tail) triples. For event extraction, a list of frames with event type and roles. The LLM call replaces the entire encoder-plus-classifier-head stack.
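The three output shapes described above can be sketched as plain data classes. This is an illustration under assumptions: it uses stdlib dataclasses as a stand-in for a Pydantic model so it runs without dependencies, the field names are hypothetical, and the JSON string is hand-written rather than returned by a real model call.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entity:
    text: str
    type: str  # e.g. PER, ORG, GPE, LOC


@dataclass
class Relation:
    head: str
    relation: str
    tail: str


@dataclass
class EventFrame:
    event_type: str
    trigger: str
    roles: Dict[str, str]  # role name -> argument text


@dataclass
class Extraction:
    entities: List[Entity]
    relations: List[Relation]
    events: List[EventFrame]


def parse_extraction(raw: str) -> Extraction:
    """Parse a JSON string into typed records.

    Raises KeyError for a missing top-level list and TypeError for a record
    with missing or unexpected fields. Dataclasses do not check value types;
    a validating library would add that.
    """
    data = json.loads(raw)
    return Extraction(
        entities=[Entity(**e) for e in data["entities"]],
        relations=[Relation(**r) for r in data["relations"]],
        events=[EventFrame(**ev) for ev in data["events"]],
    )


# Hand-written stand-in for a model response (illustrative only).
raw = json.dumps({
    "entities": [{"text": "Acme", "type": "ORG"},
                 {"text": "Jane Doe", "type": "PER"}],
    "relations": [{"head": "Jane Doe", "relation": "employed_by", "tail": "Acme"}],
    "events": [{"event_type": "Hire", "trigger": "hired",
                "roles": {"employer": "Acme", "employee": "Jane Doe"}}],
})
result = parse_extraction(raw)
print(result.relations[0])
```

The useful property is that the downstream code depends on the record types, not on the prompt: swapping the model or the prompt leaves the contract intact.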

Three modern IE patterns

Figure 2: Three modern IE patterns. They differ in where the LLM sits and what ships: Direct extraction calls the LLM in the serving path and uses its output as-is; the LLM-built specialist uses the LLM offline only to generate training data, then serves a small specialist; IE as workflow state runs parallel extraction calls whose metadata-rich frames feed a downstream product, so extraction is the structured layer rather than the deliverable.

Direct LLM extraction

The baseline pattern: the production system makes an LLM call per document or chunk, parses the structured output, and uses it directly. It works for moderate-volume, latency-tolerant settings, and it’s the reference point the other two patterns are defined against.

LLM for data, specialist for serving

The hybrid pattern uses an LLM to generate training data, then trains a smaller specialist for production serving.

I ran into this in relation extraction over the MITRE STIX cyber security ontology (threat-actor, attack-pattern, malware, target, etc.). The available pre-LLM datasets were too sparse, incompletely annotated, or didn’t match STIX cleanly. The path forward: use GPT to extract candidate entities and relations from a few hundred documents. These LLM-generated annotations became training data for a DeBERTa-v3 (He et al. 2021) span-pair classifier (relations) and a T5 (Raffel et al. 2020) seq2seq tagger (NER, chosen because entity spans overlapped). The serving system never called GPT. It used the smaller specialists, trained on GPT-generated data.

The labor shift is the whole point. The classical version would have started with months of annotation. The LLM-era version started with prompt and schema design. The production model was still a classical encoder; the annotation cost collapsed.

IE as workflow state

The other interesting pattern is when extraction stops being the product. The product is a downstream system that uses extraction as its structured intermediate representation.

I’ve designed and led a clinical-documentation pipeline that takes a clinician-patient encounter transcript and produces a structured clinical note. The system uses multiple specialized LLM extraction calls: separate calls for problems and major note sections. Each extraction returns structured frames, with metadata attached so a downstream system can route facts into the right place:

Example clinical note sections covered. History of Present Illness (HPI), Assessment & Plan (A&P), Labs / Test Results, Physical Exam, Allergies, Past Medical History.
Frame fields, for a clinical problem. Symptoms, severity, duration, status, associated findings, treatments, medications, plan items.
Metadata on each extracted item. Temporal status (past / present / future), source and stance (patient-reported vs clinician-assessed), section affinity.
Downstream routing. Present-tense problem details go to HPI; clinician assessment and treatment plans go to A&P; past resolved history goes to medical history.
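The routing rules in the list above are simple enough to sketch as a pure function over metadata. Everything named here is hypothetical: the `kind` field, the section labels, the NEEDS_REVIEW fallback, and the example items are invented for illustration and are not the production system's schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExtractedItem:
    text: str
    temporality: str  # "past" | "present" | "future"
    source: str       # "patient_reported" | "clinician_assessed"
    kind: str         # hypothetical: "problem_detail" | "assessment" | "plan" | "history"


def route(item: ExtractedItem) -> str:
    """Pick a note section from metadata alone, without re-reading the transcript."""
    if item.source == "clinician_assessed" and item.kind in ("assessment", "plan"):
        return "A&P"
    if item.temporality == "past":
        return "Past Medical History"
    if item.temporality == "present" and item.kind == "problem_detail":
        return "HPI"
    # Anything the rules do not cover is surfaced instead of guessed.
    return "NEEDS_REVIEW"


def build_note(items: List[ExtractedItem]) -> Dict[str, List[str]]:
    """Group item texts by the section each one routes to."""
    note: Dict[str, List[str]] = {}
    for item in items:
        note.setdefault(route(item), []).append(item.text)
    return note


# Invented example items, not real patient data.
items = [
    ExtractedItem("chest pain for two days", "present", "patient_reported", "problem_detail"),
    ExtractedItem("order ECG", "future", "clinician_assessed", "plan"),
    ExtractedItem("appendectomy, resolved", "past", "patient_reported", "history"),
]
note = build_note(items)
for section, lines in note.items():
    print(section, "->", lines)
```

The design point is that routing is deterministic and testable. The LLM fills the metadata; plain code decides where each fact goes.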

This is the modern form of IE I see most often: not an extractor that ships as the product, but a structured control layer that lets the rest of the workflow behave reliably. The product is the clinical note; the extractions are what the downstream system builds it from.

The pattern isn’t unique. SpecialtyScribe (Goyal et al. 2025) and GENIE (Ying et al. 2025) are published instances of the same idea: IE as the structured intermediate layer of a workflow, with the extraction done by a frontier LLM, a fine-tuned smaller specialist, or a mix.

GraphRAG (Edge et al. 2024) is another version of the same move: use IE to turn documents into entities, relations, and summaries, then run retrieval and reasoning over that structure rather than over raw chunks. These are entity extraction, relation extraction, and light coreference, with the graph as the structured layer they feed.

The new labor: schema, prompts, eval, orchestration

The labor profile is different from BERT-era IE in specific ways. There is no large annotation effort for training. The center of the work is schema design, and the contrast is concrete. A weak schema for clinical problems is {"problems": ["chest pain"]}, a flat list of strings. A schema that lets the downstream system route, deduplicate, and surface evidence for clinician review looks more like:

{
  "problem": "chest pain",
  "status": "active | resolved | historical",
  "temporality": "past | present | future_planned",
  "source": "patient_reported | clinician_assessed",
  "evidence_span": "..."
}

The richer schema is what makes the workflow possible: every extracted fact carries the metadata the downstream system needs to act on it without re-extracting. Around that, the new labor is:

Schema design: the frames, fields, metadata, and uncertainty behavior. The architecture decision that determines what the downstream system can do.
Prompt design: per-extraction prompts that reliably fill the schema, iterated against failure cases.
Evaluation: still labeled data, but at much smaller scale, for measuring rather than training.
Workflow orchestration: how extractions compose, how frames are routed, where metadata gates decisions.
Cost control: which calls run on a frontier model, which on a smaller specialist, which can be batched or cached.

The old labor was annotation, feature engineering, and per-task model training. The new labor is schema, prompts, eval, and orchestration. The label changed, but the work did not vanish. It moved from model training into workflow design.
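One way to see why the richer schema pays off: it can be checked mechanically before anything downstream consumes it. The sketch below validates a frame against the enumerated values from the schema shown earlier. The function name and error strings are hypothetical, and treating every field as required is my assumption.

```python
from typing import Dict, List

# Allowed values, copied from the richer clinical-problem schema in the text.
ALLOWED = {
    "status": {"active", "resolved", "historical"},
    "temporality": {"past", "present", "future_planned"},
    "source": {"patient_reported", "clinician_assessed"},
}
# Assumption for this sketch: all five fields must be present and non-empty.
REQUIRED = ["problem", "status", "temporality", "source", "evidence_span"]


def validate_frame(frame: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: List[str] = []
    for field in REQUIRED:
        if not frame.get(field):
            errors.append(f"missing field: {field}")
    for field, allowed in ALLOWED.items():
        value = frame.get(field)
        if value and value not in allowed:
            errors.append(f"invalid {field}: {value!r}")
    return errors


good = {
    "problem": "chest pain",
    "status": "active",
    "temporality": "present",
    "source": "patient_reported",
    "evidence_span": "my chest has been hurting",
}
bad = {
    "problem": "chest pain",
    "status": "ongoing",  # not one of the allowed values
    "temporality": "present",
    "source": "patient_reported",
}
print(validate_frame(good))
print(validate_frame(bad))
```

A flat list of strings offers nothing to validate, which is the practical difference between the weak and the rich schema.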

When classical IE still wins

The “LLMs absorbed IE” framing oversells. There are concrete situations where classical IE wins.

High-volume or latency-sensitive serving. A DeBERTa-v3-base NER system runs at thousands of tokens per second per GPU at near-zero marginal cost; a frontier LLM call costs cents per document at a fraction of that throughput. If you process millions of documents per day, or a pipeline step has to run in tens of milliseconds, the encoder wins on cost and latency by an order of magnitude. The hybrid pattern (LLM-generated data, classical model serves) is the standard answer here.

Fine-grained span tasks under audit constraint. Medical coding, legal entity extraction, financial reporting: the span boundaries matter exactly, and LLM hallucination is unacceptable. A trained span-classifier with auditable failure modes is the safer choice. The LLM may generate the training data; the production extractor stays classical.

Ontology-backed coding and normalization. ICD-10 coding is not just extraction. It requires mapping clinical evidence to a controlled code system, following rules around specificity, exclusions and so on. A standalone LLM should not be trusted to do this from parametric memory alone. Use retrieval or lookup against the official code set and deterministic validation.

Stable schemas with mature labeled data. With a CoNLL-2003-scale labeled dataset and a stable schema, a fine-tuned encoder is hard to beat on the headline metric. The crossover varies by task and prompting strategy, but with thousands of labeled examples the encoder typically pulls ahead.
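For the audit-constrained case above, one cheap guard is to accept an extracted span only if it occurs verbatim in the source and to record its character offsets. This is a minimal sketch, not a full audit trail: exact string matching is deliberately strict, it only finds the first occurrence, and the function names and example text are invented.

```python
from typing import Dict, List, Optional, Tuple


def ground_span(source: str, span: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets if span occurs verbatim, else None."""
    if not span:
        return None
    start = source.find(span)
    if start == -1:
        return None
    return start, start + len(span)


def audit(source: str, spans: List[str]) -> Dict[str, list]:
    """Split extracted spans into grounded (with offsets) and rejected."""
    report: Dict[str, list] = {"grounded": [], "rejected": []}
    for span in spans:
        offsets = ground_span(source, span)
        if offsets is None:
            report["rejected"].append(span)
        else:
            report["grounded"].append((span, offsets))
    return report


source = "Patient denies fever. Reports chest pain for two days."
report = audit(source, ["chest pain", "shortness of breath"])
print(report)
```

A rejected span is not necessarily wrong (the model may have paraphrased), but it cannot be shown to a reviewer as evidence, which is what the audit setting requires.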

Code companion: TNLP IE examples

To make the transition concrete, my TNLP repo includes three working IE examples, written in late 2023 and early 2024, that map onto three stages of the shift.

Token classification / NER (BERT-era classical). microsoft/deberta-v3-base with AutoModelForTokenClassification on CoNLL-2003, predicting the four canonical entity types. The standard “encoder + BIO tagger” pattern that became the default after BERT.
Span-pair classification / relation extraction (BERT-era). A custom implementation, because span-pair classification had no clean off-the-shelf Hugging Face head when the code was written: encode the sentence with DeBERTa-v3, pool the two span representations, classify the relation. This is what production BERT-era RE actually looked like: a custom span aggregation on top of an encoder.
Seq2seq NER with FLAN-T5 (the bridge to LLM-era). Instead of BIO tags, the model generates the sentence with bracket-tagged entity spans. google/flan-t5-base with LoRA (Hu et al. 2022) and 8-bit quantization. This handles overlapping and nested spans naturally (token classification can’t easily produce nested entities) and is structurally identical to how LLM-era extraction works: a generative model emits structured output.
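To illustrate why a generated, bracket-tagged sentence handles nesting, here is a small parser for one possible markup. The `[TYPE words]` format is an assumption made for this sketch; the repo's actual tagging scheme may differ.

```python
from typing import List, Tuple


def parse_bracketed(text: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Parse '[TYPE words ...]' markup, possibly nested.

    Returns (plain_text, [(entity_type, mention), ...]); inner entities are
    emitted before the outer entities that contain them.
    """
    plain: List[str] = []              # tokens of the untagged sentence
    stack: List[Tuple[str, int]] = []  # (type, index in plain where span starts)
    entities: List[Tuple[str, str]] = []
    # Pad brackets with spaces so whitespace splitting isolates them.
    for tok in text.replace("[", " [").replace("]", " ] ").split():
        if tok.startswith("["):
            stack.append((tok[1:], len(plain)))
        elif tok == "]":
            if not stack:
                raise ValueError("unbalanced ']'")
            label, start = stack.pop()
            entities.append((label, " ".join(plain[start:])))
        else:
            plain.append(tok)
    if stack:
        raise ValueError("unclosed '['")
    return " ".join(plain), entities


generated = "[ORG University of [GPE Texas]] hired [PER Jane Doe]"
sentence, entities = parse_bracketed(generated)
print(sentence)
print(entities)
```

The stack is what a flat BIO sequence lacks: "Texas" is both a GPE on its own and part of the enclosing ORG, and both survive decoding. A generative model can also emit malformed markup, which is why the parser raises on unbalanced brackets instead of guessing.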

What people get wrong

Treating IE as deprecated. It isn’t. It’s everywhere: inside agents, RAG and knowledge graphs, structured data extraction, and so on. It’s just not explicitly labeled “IE”.
Skipping schema design. The schema is the architecture of the extraction layer. The single highest-leverage improvement to most LLM-extraction setups is tightening it.
Underweighting evaluation. Labeled data didn’t go away. It moved from training to eval. You can’t ship an IE workflow you haven’t measured, and measuring requires gold annotations. The eval set is the new annotation effort: smaller, but irreducible.
Picking the wrong model for the operating constraints. Try the LLM first for new or ambiguous tasks. Train a specialist when cost, latency, scale, auditability, or span precision force it. Both reflexes (always train a specialist; always reach for the frontier LLM) are mistakes. The constraints decide, not the fashion.
Treating extraction as the product. In modern systems extraction is rarely the product. It’s the structured intermediate layer between raw text and downstream generation, retrieval, or action.
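The evaluation point above comes down to a small amount of code once gold annotations exist. A minimal sketch of exact-match precision, recall, and F1 over extracted items, assuming each item is a (doc_id, type, text) triple; the matching criterion and the example data are assumptions, and real evaluations often need partial-match or normalized variants.

```python
from typing import Dict, Set, Tuple

Item = Tuple[str, str, str]  # (doc_id, entity_type, mention_text)


def prf1(gold: Set[Item], pred: Set[Item]) -> Dict[str, float]:
    """Micro-averaged exact-match precision, recall, and F1."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented gold and predicted sets for two documents.
gold = {
    ("d1", "PER", "Jane Doe"),
    ("d1", "ORG", "Acme"),
    ("d2", "GPE", "Texas"),
    ("d2", "ORG", "University of Texas"),
}
pred = {
    ("d1", "PER", "Jane Doe"),
    ("d1", "ORG", "Acme"),
    ("d2", "GPE", "Texas"),
    ("d2", "PER", "Texas"),  # wrong type counts as a miss and a false positive
}
scores = prf1(gold, pred)
print(scores)
```

Because the gold set only measures and never trains, a few hundred carefully labeled items can be enough to compare prompts, schemas, and models against each other.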

Closing

For decades, information extraction was a product: annotate, train, extract, ship the knowledge base. BERT made the models better but left the labor model intact; every new domain still cost an annotation effort. LLMs didn’t make IE possible. IE was already possible. They made it low-startup-cost enough that it stopped being a product and became infrastructure: the structured layer between raw text and whatever the system does next.

The work didn’t disappear. The annotation-and-feature-engineering labor became schema-and-eval labor.

The extractor stopped being the deliverable and became workflow state.

The papers on the hard research problems (coreference, cross-document linking, strict argument-role constraints) are still being written; they’re just no longer where most production IE work happens. The next time someone says their system “doesn’t do information extraction,” it’s worth asking what the JSON between their LLM calls is.

References

Chan, Yee Seng, Joshua Fasching, Haoling Qiu, and Bonan Min. 2019. “Rapid Customization for Event Extraction.” Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Chan, Yee Seng, and Dan Roth. 2010. “Exploiting Background Knowledge for Relation Extraction.” Proceedings of the 23rd International Conference on Computational Linguistics (COLING).

Chan, Yee Seng, and Dan Roth. 2011. “Exploiting Syntactico-Semantic Structures for Relation Extraction.” Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.

Conneau, Alexis, Kartikay Khandelwal, Naman Goyal, et al. 2020. “Unsupervised Cross-Lingual Representation Learning at Scale.” Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://arxiv.org/abs/1911.02116.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). https://arxiv.org/abs/1810.04805.

DeYoung, Jay, Yee Seng Chan, Chester Pittapally, Hannah Provenza, Ryan Gabbard, and Marjorie Freedman. 2017. “BBN’s 2017 KBP EAL Submission.” Proceedings of the 2017 Text Analysis Conference (TAC).

Do, Quang, Yee Seng Chan, and Dan Roth. 2011. “Minimally Supervised Event Causality Identification.” Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing.

Edge, Darren, Ha Trinh, Newman Cheng, et al. 2024. “From Local to Global: A Graph RAG Approach to Query-Focused Summarization.” arXiv Preprint. https://arxiv.org/abs/2404.16130.

Goyal, Sagar, Eti Rastogi, Fen Zhao, Dong Yuan, and Andrew Beinstein. 2025. “SpecialtyScribe: Enhancing SOAP Note Scribing for Medical Specialties Using LLMs.” Proceedings of the Second Workshop on Patient-Oriented Language Processing (CL4Health).

He, Pengcheng, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. “DeBERTaV3: Improving DeBERTa Using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing.” arXiv Preprint. https://arxiv.org/abs/2111.09543.

Hovy, Eduard, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. “OntoNotes: The 90% Solution.” Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers.

Hu, Edward J., Yelong Shen, Phillip Wallis, et al. 2022. “LoRA: Low-Rank Adaptation of Large Language Models.” International Conference on Learning Representations. https://arxiv.org/abs/2106.09685.

Ji, Heng, and Ralph Grishman. 2011. “Knowledge Base Population: Successful Approaches and Challenges.” Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.

Min, Bonan, Yee Seng Chan, Haoling Qiu, and Joshua Fasching. 2019. “Towards Machine Reading for Interventions from Humanitarian-Assistance Program Literature.” Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

Min, Bonan, Yee Seng Chan, and Lingjun Zhao. 2020. “Towards Few-Shot Event Mention Retrieval: An Evaluation Framework and a Siamese Network Approach.” Proceedings of the Twelfth Language Resources and Evaluation Conference (LREC).

Raffel, Colin, Noam Shazeer, Adam Roberts, et al. 2020. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” Journal of Machine Learning Research. https://arxiv.org/abs/1910.10683.

Tjong Kim Sang, Erik F., and Fien De Meulder. 2003. “Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition.” Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 (CoNLL).

Walker, Christopher, Stephanie Strassel, Julie Medero, and Kazuaki Maeda. 2006. ACE 2005 Multilingual Training Corpus. Linguistic Data Consortium, LDC2006T06.

Ying, Huaiyuan, Hongyi Yuan, Jinsen Lu, et al. 2025. “GENIE: Generative Note Information Extraction Model for Structuring EHR Data.” arXiv Preprint. https://arxiv.org/abs/2501.18435.

Zhang, Zikun, Wei You, Tongtao Wu, Xiaolong Wang, Jianxin Li, and Min Zhang. 2025. “A Survey of Generative Information Extraction.” Proceedings of the 31st International Conference on Computational Linguistics (COLING).

The fine-tuning stack: one loss, different data

Yee Seng Chan — Tue, 31 Mar 2026 00:00:00 GMT

The standard story about post-pretraining is: first supervised fine-tuning (often called instruction tuning), then alignment with RLHF or DPO. Two stages, presented as several distinct techniques because the SFT stage has gone by many names: supervised fine-tuning, instruction tuning, distilled SFT, chat tuning. Mechanically, they’re the same operation.

Here’s the thesis: SFT shares its loss function with pretraining. They’re both next-token cross-entropy, differing only in what data goes in and which tokens contribute to the loss. Preference optimization is the only stage that introduces a genuinely new loss family.

Two loss families across three stages

Figure 1: The three stages, grouped by loss family. Pretraining and SFT share next-token cross-entropy: they differ only in what data goes in and which tokens contribute to the loss. Preference optimization is the only stage that introduces a new loss family.

Pretraining and SFT use the same loss: next-token cross-entropy. They differ in:

What data the model sees. Raw text for pretraining; (instruction, response) pairs for SFT.
Which tokens contribute to the loss. All tokens for pretraining; only response tokens for SFT (the prompt is masked out).

That’s the whole mechanical difference. SFT is pretraining on more curated data with a loss mask on the prompt.

Preference optimization is the only stage that introduces a new loss family. The output is no longer compared to a target sequence; it’s compared to a competing sequence under a preference framework. RLHF wraps this in a reward model and PPO; DPO (Rafailov et al. 2023) collapses the same idea into a single supervised loss; GRPO replaces the value network with group statistics. Each has its own deep-dive in this site’s RL series: PPO is REINFORCE Plus Five Fixes, DPO: RLHF Collapsed Into One Loss, and GRPO: The Algorithm Behind Reasoning Models.
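To show what "compared to a competing sequence" means in practice, here is the per-pair DPO loss written out, assuming the summed sequence log-probabilities under the policy and the frozen reference are already computed. The loss is -log σ(β[(log π(y_w|x) - log π_ref(y_w|x)) - (log π(y_l|x) - log π_ref(y_l|x))]); the argument names and the β = 0.1 default are illustrative choices, and the input numbers are made up.

```python
import math


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss from sequence log-probabilities.

    beta is a hyperparameter; 0.1 here is only an illustrative default.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) computed as softplus(-x) in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Policy identical to the reference: zero margin, loss = log 2.
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has moved probability toward the chosen response: loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(at_init, improved)
```

There is no target sequence anywhere in this function. The only signal is the difference between two responses, measured relative to the reference, which is what makes it a different loss family from next-token cross-entropy.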

Supervised fine-tuning

SFT’s loss is the same as pretraining’s. Everything interesting is in the data.

The mechanics, traced through one example. Take an instruction-response pair:

Instruction: “Classify the sentiment of this review and explain why: ‘The food was great but the service was terrible.’”
Response: “Mixed sentiment. The reviewer praises the food but criticizes the service.”

You concatenate the instruction and response into a single sequence and feed it through the model. The model computes next-token cross-entropy as it would in pretraining. The difference: a loss mask zeros out the contribution from instruction tokens. Only the response tokens carry gradient.

The wrong reading is that the model learns to generate instruction-and-response pairs. It doesn’t. The instruction tokens have no loss contribution, so the model gets no gradient signal saying “produce instructions like these.” What the model learns is to produce response tokens given the instruction tokens as context. At inference, you provide the instruction; the model continues with the response.
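The masking described above fits in a few lines. This sketch assumes per-token log-probabilities of the correct next token are already available; the numbers are made up so the arithmetic is easy to follow, and the function name is hypothetical.

```python
from typing import List


def masked_nll(token_logprobs: List[float], loss_mask: List[int]) -> float:
    """Mean negative log-likelihood over positions where loss_mask is 1."""
    if len(token_logprobs) != len(loss_mask):
        raise ValueError("log-probs and mask must have the same length")
    kept = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    if not kept:
        raise ValueError("mask removes every token")
    return sum(kept) / len(kept)


# Made-up log-probs: four instruction tokens followed by two response tokens.
logprobs = [-3.0, -2.0, -4.0, -1.0, -0.5, -1.5]

pretraining_mask = [1, 1, 1, 1, 1, 1]  # every token contributes
sft_mask = [0, 0, 0, 0, 1, 1]          # only response tokens contribute

pretraining_loss = masked_nll(logprobs, pretraining_mask)
sft_loss = masked_nll(logprobs, sft_mask)
print(pretraining_loss, sft_loss)
```

Same function, same log-probs, different mask. In PyTorch-based training code the mask is usually expressed by setting the labels of prompt tokens to -100, the default ignore_index of the cross-entropy loss, rather than by a separate mask tensor.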

That’s it for the loss. Everything else is data engineering. SFT data comes in a few forms:

Human-written demonstrations. The original InstructGPT (Ouyang et al. 2022) recipe: humans wrote responses to a curated set of instructions, and the model was fine-tuned on those pairs. High-quality, but expensive and slow.

Multitask instruction data. The Sept 2021 FLAN (Wei et al. 2022) paper showed that training on many NLP tasks formatted as instructions improves zero-shot performance on unseen tasks. T0 (Sanh et al. 2022) showed the same on encoder-decoder models; Tk-INSTRUCT (Wang et al. 2022) and FLAN-PaLM scaled to roughly 1,600 and 1,800 tasks, respectively. More tasks, more diverse templates, and larger base models all help. FLAN also found a size threshold: multitask SFT hurt held-out performance at 8B and below, but helped substantially at 137B. T0 found the threshold lower (3B) for encoder-decoder models, the kind of small-scale inductive-bias advantage that scales away (see Pretraining objectives: why decoder-only won).

Synthetic teacher-generated data. By late 2022 the bottleneck was data. Self-Instruct (Wang et al. 2023) showed an instruction-following LLM could generate its own training data: seed with a few human instructions, generate variations and responses, filter for quality. Alpaca and Zephyr (Tunstall et al. 2023) operationalized this (Zephyr’s UltraChat: 1.47M GPT-3.5 dialogues filtered to 200K). This distilled SFT (dSFT) pattern is now standard; the scale is set by what teacher models can produce, not what humans can write. Two caveats: the student inherits the teacher’s limits, and the filter does real work. The empirical lesson of the past two years is that 10K well-chosen examples often beat 1M scraped ones (Zhou et al. 2023).

Chat-format data. The same loss is applied across multi-turn conversations, with the loss mask zeroing out user turns and computing loss only on assistant turns. Mechanically identical to single-turn SFT; the data just has more turns.
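The multi-turn case only changes how the mask is built. A minimal sketch, assuming each turn is already tokenized; the token ids are invented, and real chat templates add role markers and special tokens that this ignores.

```python
from typing import List, Tuple

Turn = Tuple[str, List[int]]  # (role, token_ids)


def build_chat_example(turns: List[Turn]) -> Tuple[List[int], List[int]]:
    """Concatenate turns; the mask is 1 on assistant tokens and 0 elsewhere."""
    input_ids: List[int] = []
    loss_mask: List[int] = []
    for role, ids in turns:
        input_ids.extend(ids)
        loss_mask.extend([1 if role == "assistant" else 0] * len(ids))
    return input_ids, loss_mask


# Invented token ids for a two-exchange conversation.
conversation = [
    ("user", [11, 12, 13]),
    ("assistant", [21, 22]),
    ("user", [14, 15]),
    ("assistant", [23, 24, 25]),
]
input_ids, loss_mask = build_chat_example(conversation)
print(input_ids)
print(loss_mask)
```

Every assistant turn is trained with all earlier turns as context, so one conversation yields several supervised responses in a single forward pass.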

A note on single-task vs multi-task SFT. If your application is one task with consistent phrasing, single-task SFT is fine. If your application is a chat assistant handling varied user queries, multi-task SFT is doing real work: the diverse training data is what gives the model robustness across phrasings. Almost everyone does multi-task SFT, because almost everyone is building something at least chat-shaped.

LoRA and QLoRA change the memory story, not the objective

LoRA and QLoRA don’t change what the model is being trained on or how the loss is computed. They change what parameters are trainable and what fits in memory. That distinction matters for the article’s thesis: most of the post-pretraining stack is the same algorithm applied to different data; parameter-efficient methods are a memory-engineering layer on top, not a new training paradigm.

Figure 2: LoRA decomposes the fine-tuning update into two small matrices B (d×r) and A (r×k), with r much smaller than d or k. The base weight matrix W₀ stays frozen. Right: memory for fine-tuning a 7B model under three regimes; full fine-tuning at FP32 needs ~84GB, LoRA on FP16 base ~14GB, QLoRA on 4-bit base ~4GB.

LoRA (Hu et al. 2022) trains a small low-rank adapter on top of a frozen base model. The premise: full fine-tuning is overkill for most tasks because fine-tuning updates have low intrinsic rank. The base already knows grammar, world facts, and reasoning patterns; fine-tuning is usually a nudge, not a rewrite. LoRA replaces the full update ΔW (d×k) with the product of two much smaller matrices, ΔW = BA, with B of shape d×r and A of shape r×k, where the rank r is much smaller than the original matrix dimensions d and k. The base W₀ stays frozen; only the small B and A are trained. The phrase to remember: LoRA works because fine-tuning is usually steering, not rebuilding.
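A quick sketch of the low-rank update, in plain Python so the shapes stay visible. The dimensions d = k = 4096 and r = 8 are illustrative choices, not taken from any particular model config, and the tiny matrices are made up. Initializing B to zero follows the LoRA paper, so the adapted weight starts out equal to the base weight.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def lora_param_counts(d: int, k: int, r: int) -> Tuple[int, int]:
    """Trainable parameters for a full d x k update vs. a rank-r LoRA update."""
    full = d * k
    lora = r * (d + k)  # B is d x r, A is r x k
    return full, lora


def matmul(x: Matrix, y: Matrix) -> Matrix:
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def add(x: Matrix, y: Matrix) -> Matrix:
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(full, lora, f"{100 * lora / full:.2f}%")

# Tiny example: d=2, k=3, r=1. W0 is frozen; only B and A would be trained.
W0 = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]
A = [[0.5, -1.0, 2.0]]   # r x k
B_init = [[0.0], [0.0]]  # d x r, zero at initialization
B_trained = [[1.0], [2.0]]  # pretend values after some training

W_start = add(W0, matmul(B_init, A))
W_after = add(W0, matmul(B_trained, A))
print(W_start)
print(W_after)
```

With these illustrative dimensions the adapter holds well under one percent of the parameters a full update to the same matrix would need, which is where the memory savings come from.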

QLoRA (Dettmers et al. 2023) keeps LoRA’s structure but quantizes the frozen base to 4 bits while leaving the adapter in 16-bit. The QLoRA mechanics (NF4 quantization, double quantization of scale constants, paged optimizers for memory spikes) are covered in the QLoRA deep-dive post; they’re what makes the memory math work but they’re orthogonal to the article’s thesis.
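The base-weight part of the memory math is simple arithmetic. This sketch counts frozen weight storage only, assuming a round 7e9 parameters and 1 GB = 1e9 bytes; it leaves out activations, adapter weights, optimizer state, and quantization constants, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # assumption: a round 7B parameter count

fp32_gb = weight_memory_gb(n_params, 32)
fp16_gb = weight_memory_gb(n_params, 16)
four_bit_gb = weight_memory_gb(n_params, 4)
print(fp32_gb, fp16_gb, four_bit_gb)

# An ~84GB full fine-tuning budget corresponds to about 12 bytes per
# parameter; how that splits across weights, gradients, and optimizer
# state depends on the training setup and is not worked out here.
bytes_per_param_full = 84e9 / n_params
print(bytes_per_param_full)
```

The 16-bit and 4-bit numbers line up with the ~14GB and ~4GB regimes for LoRA and QLoRA: once the base is frozen, its storage format dominates the footprint.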

Almost all open-source fine-tuning in 2024-2026 uses LoRA or QLoRA. Full fine-tuning still happens for frontier-scale base-model training, but downstream specialization is overwhelmingly LoRA-based. The wave of fine-tuned open-source models from late 2023 onward is downstream of this.

When preference optimization actually matters

Once a model has been SFT’d, you can do another round of training that uses preference data: pairs of responses where one is judged better than the other. The goal is to push the model toward responses that match human (or proxy) preferences for properties like helpfulness, harmlessness, conciseness.

Three families dominate the post-2022 alignment landscape:

RLHF (InstructGPT (Ouyang et al. 2022) recipe): SFT, then train a reward model on preference pairs, then run PPO using the reward model as the reward signal. Four LLM-sized models in memory at training time: the policy, the frozen reference, the reward model, and the value network.
DPO (Direct Preference Optimization): skip the reward model and the RL machinery. A single supervised loss on preference pairs. Two models in memory: the policy and the frozen reference.
GRPO (Group Relative Policy Optimization): PPO with the value network removed. Memory-efficient, well-suited to verifiable-reward settings (math, code). The algorithm behind R1 and most open-source reasoning models.
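The GRPO entry above is easiest to see in the advantage computation: instead of a learned value network, each sampled response is scored against the other responses to the same prompt. A minimal sketch, assuming scalar rewards for one group of samples; using the population standard deviation and a small epsilon are my choices for illustration.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every sample gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one math prompt, rewarded 1.0 if verified correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# If every sample is right (or every sample is wrong), there is no signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

This is why the method suits verifiable-reward settings: a checker supplies the rewards, the group supplies the baseline, and no value model has to be trained or held in memory.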

The mechanics (why DPO’s derivation works, what PPO is doing under the hood, why GRPO works for reasoning) are covered in detail in this site’s RL series: PPO is REINFORCE plus five fixes, DPO: RLHF collapsed into one loss, and GRPO: the algorithm behind reasoning models. For this article the strategic question is simpler: when does preference optimization actually matter for your application?

For chat assistants serving general users, almost always. SFT alone produces a model that can follow instructions but doesn’t have stable behavioral preferences. It’ll be helpful one moment and verbose or evasive the next. Preference tuning is what shapes that into something consistent.

For applications where the desired behavior is itself contested (creative writing, advice-giving, judgments about taste), preference tuning is doing the bulk of the work. The model isn’t learning facts; it’s learning whose preferences to optimize for.

TNLP showpiece: the fine-tuning stack in code

The TNLP repo implements three stages of post-pretraining (instruction fine-tuning, chat fine-tuning, and DPO) on LLaMA-2-7B and Mistral-7B:

Instruction fine-tuning trains LLaMA-2-7B on the Alpaca dataset using AutoModelForCausalLM with LoRA and 8-bit loading.
Chat fine-tuning trains Mistral-7B on the multi-turn UltraChat dataset using TRL’s SFTTrainer with QLoRA (4-bit base, 16-bit adapters). Same next-token objective as instruction tuning; only the data format (turn-based user/assistant exchanges) differs.
DPO takes the chat-tuned model and runs preference optimization on UltraFeedback (binarized chosen/rejected pairs) via TRL’s DPOTrainer, again with QLoRA. This is where the objective itself changes: from next-token imitation to direct preference optimization against a frozen reference policy.
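The change of objective in the DPO stage can be sketched in a few lines. This is the per-pair loss from Rafailov et al. (2023) written in plain Python, not the TNLP or TRL code; the argument names and the default `beta` are assumptions, and each argument is taken to be the summed log-probability of a whole response.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * (policy log-ratio margin over the frozen reference)).

    pi_* come from the model being trained, ref_* from the frozen reference.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m)); fine for the moderate margins used here
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: the margin is zero and the loss is log 2.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen response: the loss falls.
later = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The reference log-probabilities never receive gradients, which is why only two models need to be resident during training.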

The QLoRA setup runs on a single 24GB GPU.

Closing

The post-pretraining stack looks complicated because every stage has its own name. Most of those names are about the data, not the algorithm. Pretraining and SFT share next-token cross-entropy; SFT just curates the data and masks the loss to response tokens. Preference optimization is the one place the loss family genuinely changes. LoRA and QLoRA don’t change the loss; they change what fits in memory.

That’s most of post-pretraining in a paragraph. The depth is in the data engineering (what instruction sources to use, how to filter synthetic data, how to balance task mixtures, when preference data is worth collecting) and in the algorithmic deep-dives for preference optimization.

References

Dettmers, Tim, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. “QLoRA: Efficient Finetuning of Quantized LLMs.” Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2305.14314.

Ouyang, Long, Jeff Wu, Xu Jiang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2203.02155.

Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2305.18290.

Sanh, Victor, Albert Webson, Colin Raffel, et al. 2022. “Multitask Prompted Training Enables Zero-Shot Task Generalization.” International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2110.08207.

Tunstall, Lewis, Edward Beeching, Nathan Lambert, et al. 2023. “Zephyr: Direct Distillation of LM Alignment.” arXiv Preprint. https://arxiv.org/abs/2310.16944.

Wang, Yizhong, Yeganeh Kordi, Swaroop Mishra, et al. 2023. “Self-Instruct: Aligning Language Models with Self-Generated Instructions.” Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL). https://arxiv.org/abs/2212.10560.

Wang, Yizhong, Swaroop Mishra, Pegah Alipoormolabashi, et al. 2022. “Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks.” Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). https://arxiv.org/abs/2204.07705.

Wei, Jason, Maarten Bosma, Vincent Y. Zhao, et al. 2022. “Finetuned Language Models Are Zero-Shot Learners.” International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2109.01652.

Zhou, Chunting, Pengfei Liu, Puxin Xu, et al. 2023. “LIMA: Less Is More for Alignment.” Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2305.11206.

Retrieval is older than RAG: from DPR to end-to-end

Yee Seng Chan — Fri, 27 Mar 2026 00:00:00 GMT

Most production RAG systems today are simple pipelines: a frozen embedding model, a vector database, and a frozen LLM connected by prompt assembly. That is closer to Dense Passage Retrieval (DPR), the 2020 bi-encoder retrieval method (Karpukhin et al. 2020), with a generator attached than to the original RAG paper (Lewis et al. 2020), which proposed joint training of the retriever and generator with marginalization over retrieved documents.

This article builds on paper-by-paper notes I wrote on https://chanys.github.io between 2022 and 2023. Those posts explain the individual papers. Here, I step back and connect them into a larger story: retrieval was already a mature technical lineage before “RAG” became the umbrella term. Later, I also point to my TNLP code that implements the more expensive end-to-end version.

What RAG meant in 2020 vs what it means today

The Lewis et al. (2020) paper proposed:

A DPR-style bi-encoder retriever
A BART seq2seq generator
Joint training of query encoder and generator (document encoder frozen)

What most production RAG deploys:

An off-the-shelf embedding model (BGE, E5, OpenAI text-embedding-3, Cohere), frozen
A vector database (Pinecone, Weaviate, Qdrant, Milvus, pgvector, Chroma)
An LLM API (OpenAI, Anthropic, Google), frozen
No joint training

This isn’t a critique. The simpler pattern works well for many use cases, doesn’t require ML engineering depth, and decouples the retriever from the generator.

A short history of dense retrieval

The history below is the chronology that produced the drift. Each step changed what was possible, and most of the production world adopted only part of what each step demonstrated. The story is short because it happened fast: almost all the field-defining work landed between 2020 and 2023, on top of a lexical baseline (BM25) you still need.

Figure 1: The dense-retrieval lineage on four paradigm lanes. BM25 (Robertson and Zaragoza 2009) has been the lexical baseline since the 1990s. Between 2020 and 2023 the dense-retrieval lineage splits across three other paths: bi-encoder dense (DPR), late interaction (ColBERT, ColBERTv2), and joint training (REALM, RAG, End-to-end RAG, RA-DIT). Production RAG today is essentially DPR plus a frozen generator: it took the bi-encoder lane and stopped there, while the joint-training lane kept going.

BM25 (Robertson and Zaragoza 2009)

The baseline that wouldn’t die. Lexical ranking over inverted indices, using TF-IDF with length normalization and term saturation. For roughly two decades, BM25 was the baseline neural methods couldn’t consistently beat. DPR’s primary contribution was beating it cleanly enough that the field finally moved on, but BM25 didn’t go away: hybrid setups (BM25 + dense, often with reciprocal rank fusion (Cormack et al. 2009)) are common in production today, and pure-dense systems frequently lose to hybrid on tasks with rare-term queries.
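A compact sketch of both pieces mentioned above: BM25 scoring with term saturation (`k1`) and length normalization (`b`), and reciprocal rank fusion for combining a lexical ranking with a dense one. The IDF form with `log(1 + ...)` is one common variant, the defaults `k1=1.5` and `b=0.75` are typical choices, and `k=60` is the constant from the RRF paper; the function names and toy documents are assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            # tf saturates as it grows; long documents are penalized via b
            sat = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            s += idf * sat
        scores.append(s)
    return scores

def rrf(rankings, k=60):
    """Fuse several best-first rankings of document ids by summed 1 / (k + rank)."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return [doc_id for doc_id, _ in fused.most_common()]

docs = [["rare", "term", "here"], ["common", "words", "only"], ["common", "term"]]
lexical = bm25_scores(["rare"], docs)
fused = rrf([["a", "b", "c"], ["b", "c", "a"]])
```

RRF only looks at ranks, never raw scores, which is why it can combine BM25 and dense similarity without calibrating the two score scales against each other.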

REALM (Guu et al. 2020)

Two months before DPR, REALM proposed something more ambitious: retrieval-augmented language model pretraining, with the retriever trained jointly with the language model from the masked-language-modeling (MLM) signal. It got far less production traction than DPR because joint pretraining was expensive, and DPR’s recipe was simpler and more transferable.

DPR (Karpukhin et al. 2020)

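The paper that made bi-encoder dense retrieval practical. Two BERT encoders: one for queries, $E_Q$, and one for passages, $E_P$. Score by inner product, $\mathrm{sim}(q, p) = E_Q(q)^\top E_P(p)$. Train with the negative log-likelihood of the positive passage $p^+$ against negatives $p^-_1, \dots, p^-_n$ (both explicit hard negatives, and in-batch negatives where the positives of other examples in the same batch serve as negatives):

$$L(q, p^+, p^-_1, \dots, p^-_n) = -\log \frac{e^{\mathrm{sim}(q, p^+)}}{e^{\mathrm{sim}(q, p^+)} + \sum_{j=1}^{n} e^{\mathrm{sim}(q, p^-_j)}}$$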

The contributions were practical: in-batch negatives (free supervision), independent encoders for query and passage (asymmetric architecture for asymmetric inputs), and a clean training recipe. The result was the first dense retriever to consistently beat BM25 on open-domain QA.
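The in-batch negatives trick can be sketched as a softmax cross-entropy over the batch similarity matrix, where row `i`'s correct column is `i`. This is a plain-Python illustration with toy 2-d vectors standing in for encoder outputs; the function names are assumptions, and hard negatives are left out for brevity.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_nll(query_vecs, passage_vecs):
    """Mean NLL where passage_vecs[i] is the positive for query_vecs[i].

    Every other passage in the batch acts as a negative at no labeling cost.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        scores = [dot(q, p) for p in passage_vecs]
        m = max(scores)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_nll(queries, [[5.0, 0.0], [0.0, 5.0]])
swapped = in_batch_nll(queries, [[0.0, 5.0], [5.0, 0.0]])
```

A batch of size B gives each query B - 1 negatives from a single encoder pass over the passages, which is why larger batches tend to help this recipe.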

DPR’s bi-encoder is the architecture that most RAG systems still use, with newer embedding models in place of BERT. The drift starts here: the production world adopted DPR’s architecture but not its training discipline.

RAG (Lewis et al. 2020)

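The paper RAG got its name from, and the one production drift moved furthest away from. DPR retriever plus BART seq2seq generator, jointly fine-tunable. The RAG-token model marginalizes over the top-k retrieved documents $z$ per output token $y_i$, with retriever $p_\eta$ and generator $p_\theta$:

$$p_{\text{RAG-Token}}(y \mid x) \approx \prod_{i=1}^{N} \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})$$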

Two important details. First, only the query encoder is updated during training; the document encoder (and the FAISS index) stay frozen. Re-encoding the corpus during training is too expensive. Second, the marginalization is per-token: at every output token, the model sums the probability of that token across the retrieved documents weighted by retrieval probability. This is genuinely different from concatenating top-k documents and running a single forward pass, which is what most production RAG does.
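One decoding step of the per-token marginalization, sketched with hand-written probabilities in place of real retriever and generator outputs. The function name and the toy distributions are assumptions; the point is only that each document produces its own next-token distribution and the retrieval probabilities mix them.

```python
def rag_token_step(doc_probs, token_probs_per_doc):
    """Marginal next-token distribution for a single output position.

    doc_probs[z] is p(z | x) from the retriever;
    token_probs_per_doc[z][tok] is p(tok | x, z, previous tokens) from the generator.
    """
    vocab = token_probs_per_doc[0].keys()
    return {
        tok: sum(pz * dist[tok] for pz, dist in zip(doc_probs, token_probs_per_doc))
        for tok in vocab
    }

doc_probs = [0.75, 0.25]              # two retrieved documents
per_doc = [
    {"Paris": 0.8, "Lyon": 0.2},      # generator conditioned on document 0
    {"Paris": 0.4, "Lyon": 0.6},      # generator conditioned on document 1
]
marginal = rag_token_step(doc_probs, per_doc)
```

The concatenation approach would instead run the generator once on both documents together, so there would be one distribution per step and nothing to weight by retrieval probability.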

ColBERT (Khattab and Zaharia 2020)

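Late interaction was the path the field didn’t ultimately take, though it remains the most interesting compromise between bi-encoder speed and cross-encoder accuracy. Encode query and document independently with BERT, then compute similarity at the token level, with $E_{q_i}$ and $E_{d_j}$ the per-token embeddings of the query and the document:

$$S_{q,d} = \sum_{i=1}^{|q|} \max_{j=1}^{|d|} E_{q_i} \cdot E_{d_j}$$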

For each query token, find the most similar document token, then sum the maxima. Encoding stays independent (documents can be indexed offline), but the matching is fine-grained. The cost is storage: ColBERT stores one vector per token instead of one per document, so a 100-token document needs roughly 100x as many vectors. That’s its main operational drawback.
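The MaxSim operation described above fits in two lines. This sketch uses toy 2-d token vectors in place of BERT outputs, and the function names are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_token_vecs, doc_token_vecs):
    """For each query token take its best-matching document token, then sum."""
    return sum(
        max(dot(q, d) for d in doc_token_vecs)
        for q in query_token_vecs
    )

query = [[1.0, 0.0], [0.0, 1.0]]   # two query tokens
doc = [[1.0, 0.0], [0.5, 0.5]]     # two document tokens, indexable offline
score = maxsim(query, doc)
```

Note that the document side is a list of vectors rather than a single vector, which is exactly where the storage cost comes from.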

ColBERTv2 (Santhanam et al. 2022)

Same architecture with two improvements: residual quantization (cluster the per-token embeddings, store each as (centroid_id, residual) with a 2-bit residual; storage drops 6-10x), and distillation from a cross-encoder. ColBERT and ColBERTv2 are the canonical late-interaction systems, and they work at scale. But modern strong bi-encoders have closed enough of the gap that the storage cost is hard to justify for most use cases.

End-to-end RAG (Siriwardhana et al. 2023)

If the original RAG paper described joint training with a frozen document encoder, this is the paper that made even the document encoder trainable. The engineering cost is what most teams won’t pay.

The original RAG paper kept the document encoder frozen during training because re-encoding the corpus was too expensive. End-to-end RAG removes that constraint with two asynchronous processes: one continuously re-encodes passages with the updated document encoder, and one rebuilds the index. Training proceeds in parallel; the index is updated periodically with the latest version of the encoder.

This makes joint training of all components feasible: query encoder, document encoder, and generator. The paper also adds an auxiliary loss: regenerate the input query from the retrieved passages. This forces the retriever to find passages that contain enough information to reconstruct the query, which is a useful signal when domain-specific labels are scarce.

This is the closest to a “real” end-to-end retrieval-augmented model. The async re-encoding pipeline is the cost most production teams won’t pay.

RA-DIT (Lin et al. 2023)

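The practical answer to the engineering question end-to-end RAG raises: how do you get most of the benefit of joint training without the async re-indexing pipeline? Two separate fine-tunings, run in sequence. First, fine-tune the LLM to use retrieved chunks. Second, fine-tune the retriever using LM-supervised retrieval (LSR): score each retrieved chunk $c$ by how much it raises the LM’s probability of the correct output $y$ given the prompt $x$, normalized over the retrieved set $\mathcal{C}'$ with temperature $\tau$:

$$p_{\text{LSR}}(c \mid x, y) = \frac{\exp\left(p_{\text{LM}}(y \mid c \circ x) / \tau\right)}{\sum_{c' \in \mathcal{C}'} \exp\left(p_{\text{LM}}(y \mid c' \circ x) / \tau\right)}$$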

Only the query encoder is updated; the document encoder stays frozen. The two stages can each be done with standard fine-tuning infrastructure. The paper reports SOTA results on MMLU, NQ, TriviaQA, and KILT subsets with a 65B-parameter LLM.
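A sketch of the LSR training signal under stated assumptions: the LM's probability of the correct answer, computed once per retrieved chunk, is turned into a target distribution by a temperature softmax, and the retriever's own distribution over the same chunks is pulled toward it with a KL term. The KL direction used here (retriever relative to target), the temperature value, and all names are my assumptions for illustration; check the paper before relying on them.

```python
import math

def softmax(xs, tau=1.0):
    m = max(x / tau for x in xs)
    exps = [math.exp(x / tau - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same chunks."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def lsr_loss(lm_answer_probs, retriever_scores, tau=0.1):
    """lm_answer_probs[c] is p_LM(y | chunk c prepended to x); higher is more helpful."""
    target = softmax(lm_answer_probs, tau)   # which chunks actually helped the LM
    retrieved = softmax(retriever_scores)    # what the retriever currently prefers
    return kl(retrieved, target)

lm_answer_probs = [0.9, 0.1, 0.1]            # chunk 0 makes the answer likely
agrees = lsr_loss(lm_answer_probs, [9.0, 1.0, 1.0])
disagrees = lsr_loss(lm_answer_probs, [1.0, 9.0, 1.0])
```

Because the target depends only on LM forward passes, the signal can be computed with the document index left untouched, which is what keeps this stage on standard fine-tuning infrastructure.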

The retrieval pattern most production systems actually use

Setting aside the joint-training story, here is the pipeline most teams have built. The bi-encoder/cross-encoder split is the operational story; everything else is supporting infrastructure.

Figure 2: The production RAG pipeline. Top: offline indexing. Bottom: online query, with the reranker as an optional but usually worthwhile stage. The vector index built once at indexing time is queried at every request.

Chunking. Source documents are split into chunks. Fixed-token chunking (256-512 tokens with 10-20% overlap) is the default and works well enough for most prose. Recursive chunking respects document structure (Markdown headers, paragraph breaks). Semantic chunking (splitting on sentence-embedding distance) is more expensive and rarely justifies the cost on standard text. The ceiling on retrieval quality is often set here: chunks too small lose context, chunks too large produce diffuse embeddings that match weakly. The default chunker in the framework is rarely the right one for your data: PDFs with tables, code with function boundaries, and Markdown with structured sections each need different handling.
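A minimal fixed-size chunker with overlap, as a sketch of the default described above. It operates on an already-tokenized list; the function name and the defaults (256 tokens, 32 overlap, which is 12.5%) are assumptions within the ranges mentioned in the text.

```python
def chunk_tokens(tokens, size=256, overlap=32):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

chunks = chunk_tokens(list(range(10)), size=4, overlap=1)
```

A structure-aware chunker would choose the split points from headers or paragraph breaks first and fall back to this kind of window only inside sections that are still too long.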

Embedding. Each chunk is encoded once with a frozen embedding model.

Vector database. The encoded chunks live in a vector index, e.g. HNSW or IVF-PQ under the hood. Pinecone, Weaviate, Qdrant, Milvus, pgvector, Chroma: pick on operational fit, not on retrieval quality. The choice of embedding model and chunking strategy usually dominates. pgvector is increasingly competitive when your data is already in Postgres and you don’t want a second system to operate. For the indexing intuition behind approximate nearest-neighbor search and product quantization, see my earlier KNN search note.

Top-k retrieval. Encode the query with the same embedding model used at indexing time, retrieve the top-$k$ chunks by cosine similarity (or dot product if vectors are normalized). Typical $k$ is 10-50. Larger $k$ is wasted unless you rerank.
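A brute-force sketch of top-k retrieval, standing in for what an approximate index does at scale. Vectors are normalized at indexing time so that the dot product equals cosine similarity; the function names and the toy index are assumptions.

```python
import heapq
import math

def normalize(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def top_k(query_vec, index, k=10):
    """index is a list of (chunk_id, unit_vector); returns (score, chunk_id) best first."""
    q = normalize(query_vec)
    scored = ((sum(a * b for a, b in zip(q, v)), cid) for cid, v in index)
    return heapq.nlargest(k, scored)

index = [
    ("a", normalize([1.0, 0.0])),
    ("b", normalize([1.0, 1.0])),
    ("c", normalize([0.0, 1.0])),
]
hits = top_k([2.0, 0.1], index, k=2)
```

This exact scan is linear in corpus size; HNSW or IVF-PQ trade a little recall for sub-linear lookup while returning the same kind of result.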

Reranking (optional but usually worth it). A cross-encoder reranker takes the query and each retrieved chunk together and produces a relevance score. The bi-encoder is fast but cannot model query-document interactions; the cross-encoder can. Bi-encoder retrieves 50, cross-encoder reranks to top 5-10. The off-the-shelf rerankers in 2024-2025 are strong enough that the “no reranker” failure mode is increasingly hard to justify: BGE-reranker-v2-m3, Cohere Rerank 3, mxbai-rerank-large-v1, and Jina Reranker v2 are all usable before you train your own reranker.

A note on the bi-encoder vs cross-encoder choice: cross-encoders score query-document pairs jointly, so they cannot be precomputed. Running one over millions of documents at query time is infeasible. They belong in reranking, after a bi-encoder narrows the candidates.

Context assembly. A prompt template combines the system instruction, the retrieved-and-reranked chunks, and the query. Order matters: long-context models attend more to the start and end of the prompt than to the middle (the “lost in the middle” effect). Some systems include chunk metadata (source URL, section title) to help the LLM cite.
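One way to act on the ordering point is to place the strongest chunks at the edges of the context and the weakest in the middle. The sketch below is one such heuristic, not a method from this article; the chunk fields `source` and `text`, the prompt layout, and the function names are all assumptions.

```python
def edges_first(chunks_best_first):
    """Reorder so top-ranked chunks sit at the start and end, weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

def build_prompt(system, chunks_best_first, query):
    """Assemble system instruction, source-tagged context, and the user query."""
    ordered = edges_first(chunks_best_first)
    context = "\n\n".join(f"[{c['source']}] {c['text']}" for c in ordered)
    return f"{system}\n\nContext:\n{context}\n\nQuestion: {query}"

order = edges_first([1, 2, 3, 4, 5])  # ranks, best first
prompt = build_prompt(
    "Answer using only the context and cite sources in brackets.",
    [{"source": "doc-a", "text": "Alpha."}, {"source": "doc-b", "text": "Beta."}],
    "What is alpha?",
)
```

Whether this reordering helps depends on the model and the context length, so it is worth measuring on your own evaluation set rather than assuming.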

LLM call. Frozen API or local model. The LLM generates the answer from the assembled prompt.

This pipeline is what most blog posts mean when they say RAG. It’s also what most teams should build first before they consider anything more complex.

Code companion: TNLP end-to-end RAG

Most production RAG systems stop at frozen retrieval plus prompt assembly. In TNLP, I implemented the more expensive pattern: end-to-end RAG with retriever-generator coupling and asynchronous index refresh.

The point of the exercise isn’t that everyone should joint-train. Most teams shouldn’t. The point is to make the distinction concrete: production RAG is usually a pipeline; end-to-end RAG is a trained retrieval-augmented model.

RAG evaluation

Three things tend to go wrong when teams evaluate this pipeline.

Treating RAG as monolithic. RAG is a pipeline. Each stage (chunking, embedding, retrieval, reranking, prompt assembly, generation) has its own quality and latency tradeoffs. “Our RAG isn’t working” is rarely diagnosed by treating the system as a black box.
Vibes-only evaluation. “It seems to work” is not evaluation. At minimum: a labeled set of (query, relevant_chunk) pairs, retrieval metrics on a held-out set, and end-to-end answer correctness. Without these, you don’t know whether the embedding model, the reranker, or the LLM is failing.
Not measuring retrieval recall before LLM accuracy. If the relevant chunks aren’t retrieved, no LLM can save you. Measure retrieval recall@k first, then end-to-end accuracy.
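Retrieval recall@k needs very little code once labeled (query, relevant_chunk) pairs exist. A minimal sketch, with hypothetical chunk ids and function names:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top k retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one labeled relevant chunk")
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

def mean_recall_at_k(runs, k):
    """runs is a list of (retrieved_ids, relevant_ids) pairs, one per query."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)

runs = [
    (["a", "b", "c"], {"c", "d"}),   # one of two relevant chunks, found at rank 3
    (["x", "y", "z"], {"x"}),        # the only relevant chunk, found at rank 1
]
at_2 = mean_recall_at_k(runs, k=2)
at_3 = mean_recall_at_k(runs, k=3)
```

If recall at the k you actually pass to the LLM is low, changes to the prompt or the generator cannot recover the missing evidence, so this number is the one to fix first.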

Closing

The retrieval lineage from DPR through RA-DIT is a story of escalating sophistication: bi-encoder dense retrieval, then joint training with a frozen index, then joint training with an updating index, then a two-stage practical compromise. Production RAG, meanwhile, mostly stopped at step one. That’s not a critique. The simpler pattern works well for most use cases, and the engineering cost of going further is real.

But the gap is worth knowing. When the off-the-shelf pipeline isn’t working on your data, the next step isn’t to swap embedding models or try another vector database. It’s to figure out where in the pipeline the loss is happening, and whether the answer is a configuration change, a fine-tune, or the more ambitious territory the original RAG paper was actually about.

References

Cormack, Gordon V., Charles L. A. Clarke, and Stefan Buettcher. 2009. “Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods.” Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval.

Guu, Kelvin, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. “REALM: Retrieval-Augmented Language Model Pre-Training.” Proceedings of the 37th International Conference on Machine Learning (ICML). https://arxiv.org/abs/2002.08909.

Karpukhin, Vladimir, Barlas Oğuz, Sewon Min, et al. 2020. “Dense Passage Retrieval for Open-Domain Question Answering.” Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). https://arxiv.org/abs/2004.04906.

Khattab, Omar, and Matei Zaharia. 2020. “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.” Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. https://arxiv.org/abs/2004.12832.

Lewis, Patrick, Ethan Perez, Aleksandra Piktus, et al. 2020. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2005.11401.

Lin, Xi Victoria, Xilun Chen, Mingda Chen, et al. 2023. “RA-DIT: Retrieval-Augmented Dual Instruction Tuning.” arXiv Preprint. https://arxiv.org/abs/2310.01352.

Robertson, Stephen, and Hugo Zaragoza. 2009. “The Probabilistic Relevance Framework: BM25 and Beyond.” Foundations and Trends in Information Retrieval 3 (4): 333–89.

Santhanam, Keshav, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. “ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction.” Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). https://arxiv.org/abs/2112.01488.

Siriwardhana, Shamane, Rivindu Weerasekera, Elliott Wen, Tharindu Kaluarachchi, Rajib Rana, and Suranga Nanayakkara. 2023. “Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question Answering.” Transactions of the Association for Computational Linguistics (TACL). https://arxiv.org/abs/2210.02627.

The encoder didn’t die. It became the embedding model

Yee Seng Chan — Mon, 23 Mar 2026 00:00:00 GMT

The LLM story is usually told as a generation story: GPT scaling, instruction tuning, RLHF, chat, agents.

But most LLM-powered systems also depend on a quieter model running in the background: an embedding model. RAG, semantic search, reranking, clustering, recommendation, deduplication, and classification all depend on embeddings. The generator gets the headlines, but the embedding model often decides what the generator gets to see.

This is where encoders went.

The previous article argued that decoder-only models won the general-purpose generation interface. This article makes the complementary argument: encoders didn’t die. They became the default architecture for turning text into reusable vectors.

More precisely, the embedding role was won by encoder-style machinery: bidirectional attention, pooled output representations, and contrastive fine-tuning. Even modern decoder-based embedders often move in this direction during fine-tuning, relaxing causal attention and training the model to produce reusable vectors.

I have been circling this topic for a while. Since late 2022, I have written separate deep dives on SBERT, SGPT, MTEB, MPNet, PLM/XLNet, knowledge distillation and DistilBERT, and the relevant loss functions and KNN search over at https://chanys.github.io. Those posts covered the individual papers and techniques; this article steps back from them to make the broader point.

It is also a bridge between those older paper notes and the TNLP codebase: the paper trail explains the ideas, and the code shows what they look like when implemented.

One terminology clarification before going further. “Embedding” refers to two different things: the token embedding table, which maps token IDs to input vectors, and the output embedding, which is the pooled vector representing a sentence, paragraph, or document. When practitioners today say “an embedding,” they usually mean the latter. This article is about the latter.

Where embeddings actually live in modern systems

Embeddings are the input layer of many AI systems. They show up anywhere we need to turn text into something searchable, comparable, clusterable, or rankable.

Retrieval (RAG). Embed the document corpus once, index the vectors in a vector database, embed the query at runtime, and retrieve nearest neighbors. If the embedding model can’t tell that “what causes type 2 diabetes” and “diabetes risk factors” should be close, the RAG system breaks before the LLM ever sees the query.
Reranking. Retrieval gives you candidates; a stronger model reranks the top results. This second model is often a cross-encoder, another transformer encoder used in a different way.
Classification heads. Encode text once, then run a classifier for sentiment, intent, moderation, or routing. The “encoder plus linear head” recipe predates BERT, but BERT made it the default.
Semantic deduplication. Large training datasets need more than exact-match deduplication. Embeddings catch near-duplicates that lexical matching misses.
Clustering and topic discovery. Embed a document collection, cluster the vectors, then inspect the clusters. This is a standard recipe for analyzing customer feedback, support tickets, or other text corpora.
Recommendation and semantic search. User embeddings, item embeddings, query-item matching, and document search are all variations of the same idea: represent things as vectors, then compare them.

The practical test is simple: pick an LLM-powered product and ask where the embedding model is. In many systems, it is doing critical work before the LLM ever sees the prompt.

How a modern embedding model is built

A modern embedding model is usually built in two stages.

First, start with a pretrained language model backbone. For encoder-based embedders, this is often a BERT-like, MPNet-like, or DeBERTa-like model trained with MLM, RTD, or a related objective. Pretraining gives the model general language understanding: syntax, semantics, factual associations, and domain patterns.

Second, fine-tune it contrastively. This is the step that turns a language model into an embedding model.

Raw pretrained representations are not automatically good sentence embeddings. If you simply pool BERT outputs and compare them with cosine similarity, the geometry is often poor (Ethayarajh 2019; Li et al. 2020): unrelated texts can still end up with surprisingly high similarity scores. Sentence-BERT (Reimers and Gurevych 2019) made this practical problem visible.

Contrastive training fixes the geometry by pulling related texts closer and pushing unrelated texts farther apart. The result is an embedding space where cosine similarity becomes useful for retrieval, clustering, classification, and semantic matching.

The key distinction is simple: pretraining gives the model language understanding; contrastive fine-tuning gives it useful embedding geometry.

Pooling

Transformers produce one vector per token. Embedding systems usually need one vector for the whole input, so they need a pooling strategy.

Mean pooling is the safest encoder default: average the token representations. [CLS] pooling can work, but only if the model was trained to make [CLS] meaningful. Raw BERT [CLS] is usually weak for sentence similarity. Last-token pooling is common in decoder-based embedders, where the final token has seen the previous context.
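The three pooling strategies side by side, as a sketch over plain lists. It assumes right-padded inputs with an attention mask of 1s and 0s and a [CLS] token at position 0; the function names are illustrative.

```python
def mean_pool(token_vecs, mask):
    """Average the vectors of real (unmasked) tokens, ignoring padding."""
    kept = [v for v, m in zip(token_vecs, mask) if m]
    return [sum(col) / len(kept) for col in zip(*kept)]

def cls_pool(token_vecs):
    """Use the first position; only meaningful if training made [CLS] meaningful."""
    return token_vecs[0]

def last_token_pool(token_vecs, mask):
    """Use the final real token, which under causal attention has seen the whole input."""
    last = max(i for i, m in enumerate(mask) if m)
    return token_vecs[last]

token_vecs = [[1.0, 1.0], [3.0, 5.0], [9.0, 9.0]]
mask = [1, 1, 0]  # third position is padding
```

Forgetting the mask in mean pooling is a common bug: padding vectors get averaged in and the embedding changes with batch composition.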

How embedding models are used for retrieval

Once you have an embedding model, the next question is how to use it to score query-document relevance.

Figure 1: Three architectures for similarity scoring. Bi-encoders encode independently and compare; cross-encoders encode the pair jointly; late interaction keeps token-level representations and aggregates with MaxSim.

Bi-encoder. Encode the query and document separately, then compare their vectors. Document vectors can be precomputed, so this is the standard choice for first-stage retrieval.

Cross-encoder. Encode the query and document together, then output a relevance score. This is usually more accurate, but too expensive to run over an entire corpus.

Late interaction. Models like ColBERT (Khattab and Zaharia 2020) keep token-level vectors and compare query tokens against document tokens. This sits between bi-encoders and cross-encoders in both cost and accuracy.

The standard production recipe is simple: bi-encoder for retrieval; cross-encoder or late-interaction model for reranking.

Code companion: the TNLP embedding experiments

The TNLP repo has my working code for the ideas in this article. The most relevant example is a pair of contrastive-training experiments on BioASQ11, a biomedical retrieval dataset with questions, chosen answers, and rejected PubMed snippets. Both experiments train the same behavior: pull the query and chosen answer closer, push the rejected candidate farther away.

I implemented this two ways, to show two real workflows.

The first uses sentence-transformers with intfloat/e5-base-v2 (Wang et al. 2022): take a strong existing embedder, rely on the library, and fine-tune with relatively little code. This is what adapting an existing embedding model to your domain looks like in practice.

The second uses a custom DeBERTa-v3 triplet model with a hand-rolled training loop. Every hidden choice becomes visible: backbone, pooling, projection, distance function, triplet construction, loss. This is what turning a pretrained encoder into an embedding model yourself looks like, when you can’t or don’t want to lean on the library defaults.
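For readers who have not written one by hand, a triplet objective looks roughly like this. It is a generic sketch, not the TNLP code: cosine distance and the margin of 0.5 are assumptions, and the toy vectors stand in for pooled encoder outputs.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)

query = [1.0, 0.0]        # embedded question
chosen = [1.0, 0.1]       # embedded chosen answer
rejected = [0.0, 1.0]     # embedded rejected snippet

satisfied = triplet_loss(query, chosen, rejected)
violated = triplet_loss(query, rejected, chosen)
```

Each of the visible choices in the text (distance function, margin, how triplets are built) is a parameter or an input here, which is the sense in which the hand-rolled version exposes what the library hides.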

The two are not a head-to-head benchmark. E5 is already contrastively pretrained; DeBERTa-v3 is a general encoder backbone. They’re paired here because together they cover the two starting points a real production team faces.

Decoder-only models can do embeddings too

The article so far has argued that encoder-style machinery won the embedding role. That does not mean only encoder backbones can produce embeddings. Decoder-only models can, and recent work has pushed this direction hard.

The early version of this idea was SGPT (Muennighoff 2022): take a GPT-style decoder-only model, pool token representations with a position-weighted mean, and contrastively fine-tune for semantic search. It worked, but it was expensive relative to encoder-based alternatives.
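Position-weighted mean pooling can be sketched directly: token i (1-indexed) gets weight i divided by the sum of all positions, so later tokens, which have attended to more context under causal attention, count for more. The function name is an assumption and padding handling is omitted.

```python
def position_weighted_mean(token_vecs):
    """Weighted average where the weight of token i is i / (1 + 2 + ... + n)."""
    n = len(token_vecs)
    total = n * (n + 1) / 2
    dim = len(token_vecs[0])
    return [
        sum((i + 1) / total * vec[j] for i, vec in enumerate(token_vecs))
        for j in range(dim)
    ]

pooled = position_weighted_mean([[1.0], [1.0], [4.0]])  # weights 1/6, 2/6, 3/6
```

With plain mean pooling the same input would give 2.0; the weighting shifts the result toward the last token.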

The current generation is stronger. Large decoder backbones (Llama, Mistral) can be turned into embedders that compete at the top of MTEB (Muennighoff et al. 2022), especially on retrieval and reranking. But notice the recipe they converge on: take a strong decoder-only LLM, relax or replace causal attention during fine-tuning, pool the output representations, and contrastively fine-tune on a large mix of data.

That recipe looks encoder-like. Bidirectional access, pooled output, contrastive geometry. The starting weights may come from a decoder-only LLM, but the embedding behavior is built by the fine-tuning stage.

So “decoders caught up” is only half right. The better framing is that large decoder backbones can be adapted into strong embedders when the fine-tuning recipe starts to look encoder-like. For most production systems, small encoder-based embedders remain attractive because they are fast and cheap. For highest-quality retrieval where compute is available, large decoder-based embedders are now real competitors.

Closing

Decoder-only models won the visible part of the LLM story: generation, chat, instruction following, agents.

But encoder-style machinery won a quieter role: bidirectional access, pooled output representations, and contrastive fine-tuning. That machinery became the default way to turn text into reusable vectors, and it now powers retrieval, reranking, clustering, classification, deduplication, and semantic search across the modern AI stack.

That is why the encoder did not die. It moved into the infrastructure.

The next article picks up where this one leaves off: how the retrieval part of RAG was solved well before “RAG” became the popular term, from Dense Passage Retrieval through end-to-end joint training of retriever and generator.

References

Ethayarajh, Kawin. 2019. “How Contextual Are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Contextualized Representations.” Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). https://arxiv.org/abs/1909.00512.

Li, Bohan, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, and Lei Li. 2020. “On the Sentence Embeddings from Pre-Trained Language Models.” Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). https://arxiv.org/abs/2011.05864.

Muennighoff, Niklas. 2022. “SGPT: GPT Sentence Embeddings for Semantic Search.” arXiv Preprint. https://arxiv.org/abs/2202.08904.

Muennighoff, Niklas, Nouamane Tazi, Loïc Magne, and Nils Reimers. 2022. “MTEB: Massive Text Embedding Benchmark.” arXiv Preprint. https://arxiv.org/abs/2210.07316.

Reimers, Nils, and Iryna Gurevych. 2019. “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks.” Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). https://arxiv.org/abs/1908.10084.

Wang, Liang, Nan Yang, Xiaolong Huang, et al. 2022. “Text Embeddings by Weakly-Supervised Contrastive Pre-Training.” arXiv Preprint. https://arxiv.org/abs/2212.03533.

Pretraining objectives: why decoder-only won

Yee Seng Chan — Wed, 18 Mar 2026 00:00:00 GMT

The standard tutorial story: “BERT was bidirectional, T5 was sequence-to-sequence, GPT was autoregressive, and decoder-only won because it was simpler.” Sometimes “scaling worked better.” Sometimes “instruction-tuning needed it.”

I think this story is shallow. The objective didn’t win; the paradigm did.

I’ve been writing about these architectures individually since late 2022: deep-dives on transformer architecture, BERT, RoBERTa, T5, GPT-1, ELECTRA, DeBERTa, DeBERTa-v3, UL2, FLAN, and LLaMA-2 over at https://chanys.github.io. This article steps back from those individual treatments to make an argument those posts don’t make explicitly: the field’s reasons for converging on decoder-only are often misstated. Decoder-only did not win because causal language modeling was magically superior. It won because the decoder-only paradigm made pretraining, prompting, generation, inference, and deployment all line up.

A short tour of the pretraining objectives

The objectives that defined the pretraining era can be characterized cleanly by what they hold out, what they leave visible, and which architectural family they imply. Throughout, let $x = (x_1, \dots, x_n)$ be a sequence of $n$ tokens.

Masked language modeling (MLM, BERT)

BERT’s (Devlin et al. 2019) pretraining objective. Sample a set of positions $M \subset \{1, \dots, n\}$ (typically 15% of tokens). For each $i \in M$, replace $x_i$ with a [MASK] token (80% of the time), a random token (10%), or leave it unchanged (10%). Train to predict the original tokens at the masked positions:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log p_\theta(x_i \mid \tilde{x})$$

where $\tilde{x}$ is the corrupted input. The loss only fires at the ~15% of positions that were sampled, so the remaining ~85% of the input contributes to the gradient via attention but produces no direct prediction. The 80/10/10 mixing prevents the model from learning that [MASK] is the only token requiring prediction (see BERT deep-dive for more details).

The encoder is bidirectional: every position attends to every other position. However, the cost is that the model can’t generate naturally, since at inference time you’d have to feed [MASK] tokens and the model isn’t trained to compose tokens left-to-right.
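The corruption step described above can be sketched in a few lines. This is a minimal illustration, not BERT's actual data pipeline: the function name `mlm_corrupt`, the whitespace tokenization, and the toy vocabulary are assumptions made for the example.

```python
import random

MASK = "[MASK]"

def mlm_corrupt(tokens, vocab, mask_rate=0.15, seed=0):
    """Return (corrupted, labels); labels[i] is None where no loss is computed."""
    rng = random.Random(seed)
    n_pick = max(1, round(mask_rate * len(tokens)))
    picked = set(rng.sample(range(len(tokens)), n_pick))
    corrupted, labels = [], []
    for i, tok in enumerate(tokens):
        if i not in picked:
            corrupted.append(tok)      # visible context, no prediction here
            labels.append(None)
            continue
        labels.append(tok)             # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            corrupted.append(MASK)                # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            corrupted.append(tok)                 # 10%: leave unchanged
    return corrupted, labels

tokens = "the cat sat on the mat while the dog slept by the door at night".split()
corrupted, labels = mlm_corrupt(tokens, vocab=sorted(set(tokens)))
```

Only the positions with a non-`None` label contribute to the loss; every other token is there purely as context.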

Replaced token detection (RTD, ELECTRA)

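ELECTRA (Clark et al. 2020) replaces MLM with a discriminative objective. A small generator network (typically a smaller MLM model) replaces some tokens with plausible alternatives drawn from its own predictions, producing a corrupted input $\tilde{x}$. A discriminator then predicts, for every token, whether it was original or replaced:

$$\mathcal{L}_{\text{RTD}} = -\sum_{i=1}^{n} \Big[ \mathbb{1}(\tilde{x}_i = x_i) \log D(\tilde{x}, i) + \mathbb{1}(\tilde{x}_i \neq x_i) \log\big(1 - D(\tilde{x}, i)\big) \Big]$$

RTD stands for replaced token detection.
The indicator function $\mathbb{1}(\tilde{x}_i = x_i) = 1$ if the token was not replaced, and $0$ if replaced.
The discriminator $D(\tilde{x}, i) \in (0, 1)$ is the predicted probability that the token at position $i$ is original.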

Two advantages over MLM: (1) the loss fires on every token, not just 15%, giving roughly 4× sample efficiency; (2) the input the model sees during training matches what it sees at evaluation time (real-looking text), instead of [MASK]-laden gibberish.
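To make the "loss on every token" point concrete, here is a small sketch of how RTD labels and the discriminator loss could be computed. The helper names `rtd_labels` and `rtd_loss` are hypothetical, the discriminator probabilities are made-up numbers, and the loss is averaged rather than summed for readability.

```python
import math

def rtd_labels(original, corrupted):
    """1 where the token is still the original, 0 where the generator changed it.

    If the generator happens to sample the correct token, the position
    counts as original.
    """
    return [1 if o == c else 0 for o, c in zip(original, corrupted)]

def rtd_loss(labels, p_original, eps=1e-12):
    """Mean binary cross-entropy over every position, not just the replaced ones."""
    total = 0.0
    for y, p in zip(labels, p_original):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)

original = "the chef cooked the meal".split()
corrupted = "the chef ate the meal".split()   # generator swapped one token
labels = rtd_labels(original, corrupted)

# Hypothetical discriminator outputs: probability that each token is original.
p_original = [0.9, 0.8, 0.3, 0.9, 0.7]
loss = rtd_loss(labels, p_original)
```

All five positions contribute a term to `loss`, whereas an MLM loss over the same sentence would typically score only one of them.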

At matched scale ELECTRA-Base outperforms BERT-Base on GLUE (85.1 vs 82.2 in the original paper), and ELECTRA-Large reaches RoBERTa-comparable (Liu et al. 2019) quality with under 1/4 the compute. The RTD objective got picked up later in DeBERTa-v3 (He et al. 2021), which combines it with DeBERTa’s disentangled attention.

Span corruption (T5)

T5 (Raffel et al. 2020) corrupts contiguous spans rather than individual tokens. Sample a set of spans (mean length 3, total ~15% of tokens), replace each span with a unique sentinel (<X>, <Y>, <Z>, ...), and train an encoder-decoder to autoregressively generate the missing spans as the target sequence. So with the original The cat sat on the mat watching:

Encoder input: The cat <X> the mat <Y>
Decoder target: <X> sat on <Y> watching <Z>

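The decoder is autoregressive over the target, so the loss is:

$$\mathcal{L}_{\text{SC}} = -\sum_{j=1}^{m} \log p_\theta(y_j \mid y_{<j}, \tilde{x})$$

where $\tilde{x}$ is the corrupted encoder input and $y = (y_1, \dots, y_m)$ is the decoder target.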

This is generative: the decoder produces a sequence of tokens. But the corruption rate is low so most of the input stays observable to the encoder. T5 reframes a wide variety of NLP tasks (translation, summarization, classification, QA) as text-to-text under this objective.
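The transformation from a raw sentence to an (encoder input, decoder target) pair is mechanical once the spans are chosen. The sketch below assumes the spans are given rather than sampled, and uses `<X>`, `<Y>`, `<Z>` as illustrative sentinel names; `span_corrupt` is a hypothetical helper, not a function from the T5 codebase.

```python
def span_corrupt(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Build (encoder_input, decoder_target) for T5-style span corruption.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    """
    if len(spans) + 1 > len(sentinels):
        raise ValueError("need one sentinel per span plus a closing sentinel")
    enc, tgt, cursor = [], [], 0
    for k, (start, end) in enumerate(spans):
        enc.extend(tokens[cursor:start])   # keep the visible text
        enc.append(sentinels[k])           # the span collapses to one sentinel
        tgt.append(sentinels[k])           # the same sentinel introduces the span
        tgt.extend(tokens[start:end])      # followed by the held-out tokens
        cursor = end
    enc.extend(tokens[cursor:])
    tgt.append(sentinels[len(spans)])      # final sentinel closes the target
    return enc, tgt

tokens = "The cat sat on the mat watching".split()
enc, tgt = span_corrupt(tokens, [(2, 4), (6, 7)])
print(" ".join(enc))
print(" ".join(tgt))
```

Note how short the target is relative to the input: the decoder only ever produces the held-out spans, which is why most of the sentence stays observable to the encoder.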

Causal language modeling (CLM, GPT)

The simplest objective. Predict each token from its left context:

$$\mathcal{L}_{\text{CLM}} = -\sum_{i=1}^{n} \log p_\theta(x_i \mid x_{<i})$$

No masking, no replacement, no sentinels: the data is fed unchanged. Loss fires on every position. The architecture must use causal self-attention so position $i$ only sees positions $j \le i$. The same parameters compute representations and generate tokens; there’s no encoder-decoder split. See GPT-1 (Radford et al. 2018) for more details.
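Because the data is fed unchanged, the training examples are just shifted views of the same sequence. A minimal sketch, assuming no beginning-of-sequence token (so the first token is never a target) and a hypothetical `next_token_prob(context, token)` callable standing in for the model:

```python
import math

def clm_pairs(tokens):
    """(left context, next token) pairs: every position after the first is a target."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

def clm_loss(tokens, next_token_prob):
    """Mean negative log-likelihood of each token given its left context."""
    pairs = clm_pairs(tokens)
    return -sum(math.log(next_token_prob(ctx, tok)) for ctx, tok in pairs) / len(pairs)

tokens = "the cat sat on the mat".split()
vocab_size = 5

# Stand-in model: a uniform distribution over a 5-word vocabulary.
uniform = lambda context, token: 1.0 / vocab_size
loss = clm_loss(tokens, uniform)   # equals log(vocab_size) for a uniform model
```

The uniform model is a useful sanity check: its loss is exactly the log of the vocabulary size, and any trained model should come in below that.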

Mixture-of-denoisers (UL2)

UL2 (Tay et al. 2023) unifies multiple objectives within a single training run by mode-switching. Three denoising paradigms, signaled by special mode tokens ([R], [S], [X]) at the start of the sequence:

R-denoiser (regular): standard T5-style span corruption (mean span length 3, ~15% rate).
S-denoiser (sequential): prefix LM: split the sequence into prefix and target, and predict the target autoregressively given a bidirectional prefix.
X-denoiser (extreme): aggressive corruption with long spans ($\geq 12$ tokens) or high rates ($\geq 30\%$).

The model learns to handle all three modes through the explicit mode tokens. UL2 reports that the mixture beats both pure CLM (GPT-style) and pure span corruption (T5-style) across a broad set of benchmarks.

The interesting part for the argument here is that the S-denoiser is essentially CLM with a bidirectional prefix. In a normal causal LM, every token can only attend to previous tokens. In UL2’s S-denoiser / prefix-LM setup, the prefix is treated more like an encoder context: prefix tokens can see each other bidirectionally before the model starts generating the target.

UL2 explicitly recognizes prefix-LM as a useful interpolation between encoder-decoder and decoder-only.

Fill-in-the-middle (FIM)

The infilling objective (Bavarian et al. 2022) used in code models (StarCoder, Code Llama, OpenAI Codex). Take a sequence, split into prefix, middle, and suffix, then rearrange the data and train autoregressively:

Training-time order:

<PRE> prefix <SUF> suffix <MID> middle

This is just CLM on rearranged data. The model learns to generate middle conditioned on prefix and suffix, while remaining a pure decoder-only autoregressive language model. The cleverness is in the data layout, not the objective. The inference-time prompt is

<PRE> {user_prefix} <SUF> {user_suffix} <MID>

after which the model autoregressively generates the missing middle.

This pattern, getting a span-fill capability by rearranging data into a CLM-shaped task, is a recurring theme in what comes next.
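The fill-in-the-middle rearrangement is easy to see in code. This sketch assumes character-level splitting at two random cut points and uses `<PRE>`, `<SUF>`, `<MID>` as placeholder sentinel strings; real code models define their own special tokens, and the helper names here are hypothetical.

```python
import random

PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"   # placeholder sentinel strings

def fim_training_example(text, rng):
    """Cut text into prefix/middle/suffix and lay it out as prefix, suffix, middle."""
    i, j = sorted(rng.sample(range(len(text) + 1), 2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    # The middle goes last, so ordinary left-to-right training learns to
    # produce it conditioned on both the prefix and the suffix.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

def fim_prompt(user_prefix, user_suffix):
    """Inference-time prompt: the model continues after <MID> with the missing middle."""
    return f"{PRE}{user_prefix}{SUF}{user_suffix}{MID}"

source = "def add(a, b):\n    return a + b\n"
example = fim_training_example(source, random.Random(0))
prompt = fim_prompt("def add(a, b):\n    return ", "\n")
```

Nothing about the model or the loss changes: `example` is just another string to run next-token prediction over.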

Figure 1: One sentence under five pretraining objectives: same source text, different held-out tokens.

Architecture is mostly an attention pattern

The key move is to stop treating encoder, encoder-decoder, and decoder-only models as completely separate species. At the transformer level, much of the difference comes down to which tokens are allowed to attend to which other tokens.

Figure 2: Attention masks across the three transformer families. Filled cells indicate attendable positions; the encoder-decoder panel shows the joint attention pattern over a concatenated [encoder, decoder] sequence.

An encoder gives every token bidirectional access to every other token. A decoder-only model gives each token access only to earlier tokens. An encoder-decoder splits the computation into two parts: the encoder reads the input bidirectionally, while the decoder generates the output autoregressively while attending back to the encoder. These are different attention patterns over tokens.

This matters because some capabilities that look architecture-specific can be recovered by changing the data layout. Prefix-LM gives a decoder-style model bidirectional access to a prefix before generating a continuation. Fill-in-the-middle gives a decoder-only model infilling behavior by rearranging the sequence into prefix, suffix, then middle. Span corruption can be viewed similarly: decide what is visible, decide what is hidden, then train the model to predict the missing text.
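One way to see how close these families are is to write the masks down. In the sketch below, `mask[i][j]` is `True` when position `i` may attend to position `j`; the function names are illustrative, and plain nested lists are used instead of tensors to keep the example dependency-free.

```python
def bidirectional_mask(n):
    """Encoder-style: every position attends to every position."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-only: position i attends to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def prefix_lm_mask(n, prefix_len):
    """Prefix-LM: the prefix is fully visible to everyone; the rest is causal."""
    return [[j < prefix_len or j <= i for j in range(n)] for i in range(n)]

def show(mask):
    return "\n".join("".join("x" if cell else "." for cell in row) for row in mask)

print(show(prefix_lm_mask(4, prefix_len=2)))
```

Setting `prefix_len=0` recovers the causal mask and `prefix_len=n` recovers the bidirectional one, which is the sense in which prefix-LM interpolates between the two.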

So the important question is not simply, “Which objective is best?” The better question is: Which training setup gives the most useful capabilities per unit of data, compute, and deployment complexity?

From that angle, decoder-only models had a major advantage. They could absorb many task formats into a single pattern: put context in the prefix, then generate the continuation.

The claim is not that CLM is theoretically superior to every denoising objective. The claim is that decoder-only made the fewest assumptions about the shape of the task. Once everything can be represented as context followed by continuation, the same model can support pretraining, instruction following, few-shot prompting, chat, tool use, and long-form generation without changing architecture.

At small scale, objective matters more

At small scale, pretraining objectives strongly shape what the model learns efficiently.

BERT-style MLM gives the model a bidirectional bias, which helps classification, sequence labeling, and extraction. ELECTRA improves sample efficiency by producing a learning signal at every token. T5-style span corruption gives the model a natural input-output format for tasks like summarization and translation.

These advantages are real, but they become less decisive as models scale. The objectives are not equivalent, but they all train on the same underlying distribution of language. At large scale, each objective still pushes the model to learn many of the same patterns: syntax, semantics, factual associations, discourse structure, and task formats. The objective matters, but it no longer dominates general-purpose capability the way it does at small scale.

So the question gradually shifts from “Which objective gives the best inductive bias?” to “Which architecture is easiest to scale, prompt, serve, and adapt?”

That changes the tradeoff. Once objective-level advantages become less dominant, the practical advantages of decoder-only become harder to ignore: simpler data preparation, direct autoregressive generation, natural prompting, easier serving, and a single architecture for many behaviors.

So the point is not that MLM, span corruption, or RTD were bad ideas. The point is that their advantages mattered most in regimes where model size, data scale, and deployment patterns looked very different from modern LLMs.

The actual reason decoder-only won: unification

Decoder-only did not win because MLM was “wrong” or span corruption was “worse.” It won because the field moved from many specialized NLP tasks toward one general-purpose modeling interface.

BERT-style models were excellent for classification, sequence labeling, span extraction, and retrieval. But each task usually required a specific setup: a head, a pooling strategy, a masking scheme, or a fine-tuned output format.

Decoder-only models made the interface much simpler: Put the context in the prefix, then generate the continuation.

That one pattern absorbs many tasks. Classification becomes completion. Question answering becomes completion. Dialogue becomes completion. Code generation becomes completion. Tool use becomes completion over an action-observation trace.
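As a rough illustration of "classification becomes completion", the sketch below formats a sentiment task as a few-shot prompt and picks the label whose continuation scores highest. The `continuation_logprob(prompt, continuation)` callable is a hypothetical stand-in for a real model's scoring function, and `toy_logprob` is a hard-coded stub so the example runs on its own.

```python
def build_prompt(instruction, examples, query):
    """Put the task description and examples in the prefix; leave the label open."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    lines += [f"Review: {query}", "Sentiment:"]
    return "\n".join(lines)

def classify(prompt, labels, continuation_logprob):
    """Choose the label the model considers the most likely continuation."""
    return max(labels, key=lambda label: continuation_logprob(prompt, " " + label))

def toy_logprob(prompt, continuation):
    """Stub scorer (not a model): keys off the word 'loved' in the final review."""
    last_review = prompt.rsplit("Review:", 1)[1]
    positive_cue = "loved" in last_review
    return 0.0 if (continuation.strip() == "positive") == positive_cue else -5.0

examples = [("A joy from start to finish.", "positive"), ("I want my money back.", "negative")]
prompt = build_prompt("Label the sentiment of each review.", examples, "I loved every minute.")
prediction = classify(prompt, ["positive", "negative"], toy_logprob)
```

There is no classification head anywhere in this setup; the label set lives entirely in the prompt and the candidate continuations.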

So the deeper reason decoder-only won is unification: many task formats became one modeling problem: context in, continuation out.

1. One interface for many tasks. Decoder-only models turn many tasks into the same pattern: put the task description, examples, prior turns, or tool outputs in the prefix, then generate the continuation. Classification, QA, dialogue, code generation, and tool use all become variations of completion.

2. Flexible training data. CLM trains directly on raw token sequences. No masking, span sampling, task-specific heads, or encoder-decoder split. Documents, code, conversations, transcripts, math, and structured text can all be modeled in the same format.

3. Training matches inference. Modern LLM use is generative: chat, instruction following, summarization, code generation, agents, and tool use. Decoder-only models are trained in the same mode in which they are used: condition on previous tokens, then generate the next ones. Encoders remain excellent representation learners, but open-ended generation is not their native mode.

4. New capabilities become formatting problems. Decoder-only models can absorb behaviors that once looked like they required specialized objectives. Fill-in-the-middle reframes span filling as autoregressive prediction over rearranged data. Instruction tuning, chat formatting, tool traces, and preference tuning follow the same pattern: keep the architecture fixed, change the data format and training signal.

The production benefits follow from this unification. One architecture is easier to scale, cache, batch, serve, and adapt than a collection of task-specific modeling setups.

What encoders still do

This story is specifically about general-purpose generative LLMs. It is not a story about encoders becoming useless. Encoders did not disappear; they specialized.

Encoders are still the right tool when:

The output is structured per-token (named entity recognition, sequence labeling, span extraction).
The output is a fixed-length representation of the whole input (sentence embeddings, classification, retrieval).
The latency budget rules out autoregressive generation.
The downstream task has small training data and benefits from the strong inductive bias of bidirectional MLM pretraining.

DeBERTa-v3 is in many ways the high-water mark of encoder pretraining, and it’s still the model I reach for on small-data IE tasks. The disentangled attention from DeBERTa, the RTD objective from ELECTRA, and DeBERTa-v3’s gradient-disentangled embedding sharing combine into a hard-to-beat package below 1B parameters.

The mistake is to turn “decoder-only won general-purpose generation” into “encoders are obsolete.” The first claim does not imply the second. For the IE-shaped tasks they were always good at, encoders are still ahead of decoder-only models on a Pareto frontier of accuracy and inference cost.

Code companion: where this shows up in TNLP

The TNLP repo at https://github.com/chanys/tnlp contains working examples for the model families discussed in this article and the rest of the foundations series.

The examples most relevant to this article are:

Token classification / NER: DeBERTa-v3 base with a token classification head on CoNLL-2003. This is the classic encoder use case: one contextual representation per token, one label per token.
Span-pair classification / relation extraction: DeBERTa-v3 with custom span-pair pooling on NYT-H. This shows why encoders remain useful for information extraction: they produce cheap, dense representations over the whole sentence.
Contrastive modeling / retrieval: E5 and DeBERTa-based triplet-loss models on BioASQ11. This connects directly to the next article: encoders did not disappear; they became embedding and retrieval models.

The same repo also includes examples from the other branches of the pretraining story:

FLAN-T5 seq2seq examples, representing the encoder-decoder path.
LLaMA-2 and Mistral instruction/chat tuning, representing the decoder-only path after CLM pretraining.
DPO examples, representing the alignment stage that comes after supervised instruction tuning.

I will return to these examples in later articles. For this piece, the main point is narrower: the old pretraining families did not vanish. They separated into different roles. Decoder-only became the default for general-purpose generation, while encoders remained highly useful for embeddings, retrieval, classification, and information extraction.

References

Bavarian, Mohammad, Heewoo Jun, Nikolas Tezak, et al. 2022. “Efficient Training of Language Models to Fill in the Middle.” arXiv Preprint. https://arxiv.org/abs/2207.14255.

Clark, Kevin, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. “ELECTRA: Pre-Training Text Encoders as Discriminators Rather Than Generators.” International Conference on Learning Representations. https://arxiv.org/abs/2003.10555.

Liu, Yinhan, Myle Ott, Naman Goyal, et al. 2019. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” arXiv Preprint. https://arxiv.org/abs/1907.11692.

Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. OpenAI.

Tay, Yi, Mostafa Dehghani, Vinh Q. Tran, et al. 2023. “UL2: Unifying Language Learning Paradigms.” International Conference on Learning Representations. https://arxiv.org/abs/2205.05131.