Deep Research — State of the Art LLM Guide

Craig Swift
43 min read · Feb 5, 2025

--

Everyone’s been raving about OpenAI’s deep research prowess, so I decided to test it out myself. I’m interested in understanding what the state of the art is in LLMs, so that’s what I asked it:

Based on the latest research papers, including DeepSeek V3, DeepSeek R1, LLaMA 3, LLaMA 3.3, Moonshot Kimi k1.5, Google Gemini, and DeepMind,

Note: DeepSeek V3 is the most advanced open-weight model, although there may be innovations in other models that have not been incorporated into DeepSeek.

Additionally, include any other relevant cutting-edge research papers covering LLM architecture, LLM pre-training, and LLM post-training (including reinforcement learning). Also, consider papers on data preparation techniques, such as RefinedWeb, and large-scale training methodologies, such as mega-scale training.

Based on all this knowledge, write a comprehensive guide on creating a state-of-the-art LLM. The guide should be structured like an academic paper, covering all essential details. The sections should include:

1. Data Collection and Preparation
• Include datasets that can be downloaded from Hugging Face.
• Detail data preparation techniques to ensure high-quality training data.

2. Cutting-Edge LLM Architecture
• Assume no prior knowledge—explain everything from scratch.
• Cover the leading state-of-the-art architecture in complete detail, including:
• Every layer of the LLM.
• All tensor operations and the floating-point precision used.
• Optimizations and inter-GPU communication strategies.
• Context length extension to 1 million or even 100 million tokens.
• KV cache compression, as described in DeepSeek V3.
• Mixture of Experts (MoE) and MoE balancing strategies.
• Speculative decoding and the training of Medusa heads.

3. Post-Training & Reinforcement Learning
• Cover reinforcement learning-based post-training in full detail.
• Include:
• Self-play techniques for model improvement.
• Reward function design for fine-tuning model behavior.

This guide should be as detailed as an academic paper, including all technical aspects necessary to develop a state-of-the-art LLM.

Upon entering this prompt, it asked me some clarification questions, to which I responded:

Imagine you are writing a paper for an academic journal; the objective is to explain the current state of cutting-edge LLM research and knowledge, with the clear objective of providing sufficiently detailed knowledge for the reader to implement it.

Make decisions to the above questions based on this.

I tried this a number of times; the quality of the generated report varies wildly. Here is the best one:

# Latest Advancements in Large Language Models: A Comprehensive Guide

**Abstract**

This guide provides a detailed overview of state-of-the-art techniques in large language model (LLM) development. We explore modern data collection and preparation methods, highlighting high-quality open datasets and advanced curation techniques that enable trillion-token scale training. We then delve into cutting-edge model architectures, dissecting transformer-based LLM designs layer by layer, including modifications for ultra-long context windows (up to 100 million tokens) and Mixture-of-Experts (MoE) innovations for massive model capacity. We discuss post-training and reinforcement learning strategies for alignment, such as RLHF (Reinforcement Learning from Human Feedback) and automated feedback methods, used to refine model behavior in models like DeepSeek V3 and Google’s Gemini. Finally, we present implementation-level details — mathematical formulations, tensor operations, and pseudocode — to guide practitioners in implementing high-performance LLMs based on these advancements. All sections are supported with references to the latest research in the field.

## 1. Data Collection and Preparation

High-quality data is the foundation of modern LLMs. Recent research emphasizes scaling data to trillions of tokens while maintaining stringent quality controls. This section surveys several open-source datasets (available on Hugging Face) and discusses techniques for data cleaning, deduplication, and refinement at mega-scale.

### 1.1 High-Quality LLM Pretraining Datasets

Gathering raw text is only the first step; data preparation — cleaning and refining the corpus — is crucial for training high-performance LLMs. At trillions of tokens, data preparation becomes a massive engineering challenge. Researchers have developed advanced techniques for filtering noise, removing duplicates, and balancing data sources:

- **Filtering and Cleaning:** Quality filters eliminate undesirable content and noise. This includes removing boilerplate text (e.g., navigation menus, HTML/JS gibberish), non-natural-language strings, profanity/hate content, and any other irrelevant text. For example, FineWeb applied the C4 pipeline (language identification, bad-word removal, etc.) and then extended it with custom rules. Heuristic filters are often handcrafted: e.g., dropping lines containing placeholders like “Lorem ipsum” or random character sequences, and stripping content with excessive repetition or spam indicators. Domain-specific knowledge helps: if crawling technical papers, one might filter out reference sections; for code data, one might remove large auto-generated files. Filtering is done at multiple levels (line-level and document-level) to progressively refine the text. Researchers also perform statistical analysis on the corpus (e.g., distribution of document lengths, frequencies of tokens) to spot outliers or anomalies, informing further filter rules. Balanced filtering is important: filters should remove as much noise as possible without overly pruning useful data. An aggressive filter that eliminates 5% of good text might harm downstream performance, so teams validate each filter via ablation studies (checking model performance with/without the filter). The outcome of these efforts in projects like FineWeb is a corpus with significantly improved signal-to-noise ratio, enabling better pretraining efficiency. A sketch of such line-level heuristic filters appears after this list.

- **Deduplication:** Large web crawls contain many duplicated or near-duplicated passages (e.g., boilerplate text, reprinted articles). Removing duplicates is vital to prevent the model from overcounting certain data and to reduce overfitting. However, deduplication at trillions of tokens is non-trivial. Traditional exact-match dedup (hashing full lines or documents) fails to catch fuzzy duplicates and is memory-intensive at this scale. Projects like SlimPajama pioneered the use of MinHash LSH (Locality-Sensitive Hashing) to deduplicate trillion-token datasets (SlimPajama: A 627B token, cleaned and deduplicated version of RedPajama — Cerebras). MinHash LSH can efficiently identify high-overlap text spans by hashing smaller chunks and finding approximate matches. Cerebras had to improve existing open-source LSH tools to handle RedPajama’s scale, creating a distributed, multi-threaded dedup pipeline that runs on large clusters. The payoff was huge: SlimPajama cut nearly half of the data as duplicates or low-quality, dramatically reducing the size without hurting diversity. Similarly, the Falcon RefinedWeb team deduplicated across 5 trillion tokens of CommonCrawl, ensuring that repeated content (common on the web) didn’t inflate the data volume. Deduplication also helps avoid data contamination in evaluations (where a test set answer might appear in the training data), thus giving more reliable benchmarks. A small MinHash LSH sketch is shown after this list.

- **Refinement and Curation:** Beyond removing bad data, some efforts curate the dataset composition for maximum impact. For example, the concept of “pretraining annealing” involves re-weighting data sources over the course of training — e.g., gradually increasing the fraction of high-quality or task-relevant data in later epochs. Google reported using such a tactic: toward the end of Gemini’s pretraining, they fed more “relevant, high-quality data” to boost performance on key benchmarks. Another approach is to upsample under-represented but important domains. If code data is only 5% of the corpus but is crucial for coding ability, one might sample code examples more frequently during training. Researchers also create refined subsets of data targeting specific capabilities; for instance, the FineWeb team built “FineWeb-Edu,” a 1.3T-token subset of educational and high-information web pages, which yielded superior performance on knowledge-heavy tasks (MMLU, ARC, etc.). This kind of targeted curation can sharpen an LLM’s strengths in certain areas without manual labeling — essentially steering a portion of the unsupervised data toward the desired knowledge distribution.

- **Mega-Scale Training Strategies:** When training on trillions of tokens, one must decide how to traverse such a massive corpus. Common strategies include single-epoch vs. multi-epoch training. Often, these huge datasets are so large that a model sees each example only once (one epoch) or just a few times. Ensuring the model doesn’t “forget” early data requires a carefully scheduled learning-rate decay or even mixing in a small amount of earlier data in later training. Another challenge is I/O and storage for multi-terabyte datasets; practitioners use streaming data loaders and fast decompression (LMDB or cache systems) to keep GPUs fed. The scale also demands early benchmarking — FineWeb advocates “early signal” evaluation by training small models on a subset of data to estimate dataset quality. If a data refinement improves a 1.3B model’s perplexity or zero-shot accuracy, it likely will help at 70B scale too. This iterative loop allows continuous dataset improvement even before full-scale training. In summary, training on mega-scale data is as much an engineering feat as a modeling one, requiring distributed data pipelines, careful curation, and sometimes creative scheduling of data usage over the course of training.
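
As referenced in the filtering discussion above, here is a minimal sketch of line-level heuristic filters in Python. The thresholds and banned snippets are illustrative assumptions, not the exact rules used by C4 or FineWeb:

```python
import re

# Illustrative line-level heuristic filters (thresholds are assumptions, not FineWeb's exact rules)
MIN_WORDS_PER_LINE = 3
MAX_NON_ALPHA_RATIO = 0.4
BANNED_SNIPPETS = ("lorem ipsum", "click here to subscribe", "javascript is disabled")

def keep_line(line: str) -> bool:
    """Return True if a line looks like natural language worth keeping."""
    text = line.strip().lower()
    if not text:
        return False
    if len(text.split()) < MIN_WORDS_PER_LINE:        # drop menu items / stray fragments
        return False
    non_alpha = sum(1 for c in text if not (c.isalpha() or c.isspace()))
    if non_alpha / len(text) > MAX_NON_ALPHA_RATIO:   # drop markup / code-like gibberish
        return False
    if any(snippet in text for snippet in BANNED_SNIPPETS):  # drop boilerplate placeholders
        return False
    if re.search(r"(.)\1{9,}", text):                 # drop long runs of a repeated character
        return False
    return True

def clean_document(doc: str) -> str:
    return "\n".join(line for line in doc.splitlines() if keep_line(line))
```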
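
And here is the MinHash LSH deduplication sketch referenced above, using the open-source `datasketch` library. The shingle size and similarity threshold are illustrative assumptions rather than SlimPajama’s production settings:

```python
from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128, shingle_size: int = 5) -> MinHash:
    """Build a MinHash signature from word shingles of a document."""
    words = text.lower().split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(1, len(words) - shingle_size + 1)):
        shingle = " ".join(words[i:i + shingle_size])
        m.update(shingle.encode("utf-8"))
    return m

def deduplicate(docs: dict[str, str], threshold: float = 0.8) -> list[str]:
    """Return document ids, keeping one representative per near-duplicate cluster."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for doc_id, text in docs.items():
        sig = minhash_signature(text)
        if lsh.query(sig):          # an approximate near-duplicate was already kept
            continue
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
    return kept
```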

By combining robust filtering, effective deduplication, and thoughtful data curation, today’s LLM projects produce corpora that maximize quality without sacrificing scale. Open efforts like FineWeb and SlimPajama have set new standards, showing that transparent, reproducible dataset construction is possible even at the 10¹² token frontier. Table 1 (below) summarizes a few key datasets and their properties:

*Table 1: Selection of large-scale datasets for LLM pretraining. ENG = primarily English text.*

## 2. Cutting-Edge Architecture

Transformer-based architectures continue to be the bedrock of LLMs, but recent innovations have significantly extended their capabilities. This section provides an in-depth look at the latest LLM architectures, including layer-by-layer breakdowns, modifications for extreme context lengths, mixture-of-experts designs, and optimized training/inference strategies. We also discuss how massive models are trained across many GPUs with specialized communication patterns and precision techniques.

### 2.1 Advanced Transformer Layer Designs

State-of-the-art LLMs are built on the Transformer decoder block, with improvements in normalization, activation functions, and positional encoding to enhance training stability and performance. A modern Transformer layer typically consists of (1) Multi-Head Self-Attention and (2) a Feed-Forward Network (FFN), each preceded or followed by normalization (RMSNorm or LayerNorm) and wrapped in residual skip connections (A Complete Beginner’s Guide to Llama 2, SingleStore). For example, Meta’s LLaMA-2 architecture (7B–70B parameters) uses pre-normalization with RMSNorm (a variation of LayerNorm that normalizes by the root-mean-square) and a SwiGLU activation in the FFN, instead of the standard ReLU. These choices, inspired by earlier models like PaLM, improve the gradient flow and expressive power of the network. The feed-forward “intermediate” size is typically a few times the hidden size (e.g., 11008 for LLaMA-2-7B’s 4096 hidden dimension; SwiGLU variants use roughly 8/3× to keep the parameter count comparable to a 4× ReLU FFN) and uses the gated linear unit (SwiGLU), which was found to boost model quality. Many models also employ Rotary Positional Embeddings (RoPE) to encode token positions implicitly into the attention mechanism, enabling extrapolation to longer sequences with minimal loss. In summary, today’s transformer block includes several refinements: RMSNorm (for stability), gated activation functions like SwiGLU (for expressiveness), and advanced positional encodings (for longer contexts).
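
Here is a minimal PyTorch sketch of the pre-norm RMSNorm + SwiGLU building blocks described above; the dimensions and the absence of dropout are simplifying assumptions, not any specific model’s configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: scale by the RMS of the features, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUFFN(nn.Module):
    """Gated feed-forward network: FFN(x) = W_down( SiLU(W_gate x) * (W_up x) )."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Pre-norm residual wiring of one decoder block (attention module omitted for brevity):
#   x = x + Attention(RMSNorm(x))
#   x = x + SwiGLUFFN(RMSNorm(x))
```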

Another innovation is Grouped-Query Attention (GQA), used in some large models to reduce memory usage. In GQA, multiple attention heads share a single key/value projection — effectively grouping heads so that instead of each head having its own $W^K$ and $W^V$, a group of heads uses one set of key/value matrices. This was adopted in LLaMA 2’s larger variants to save memory at high head counts. For instance, a 70B model with 64 query heads might use 8 key/value groups, meaning only 8 independent key/value sets per layer. GQA can dramatically cut the size of the KV cache and the attention computation, with a minimal hit to performance. In one example, a 65B model with grouped-query attention and an 8-bit quantized KV cache can hold a 512K-token context in ~86 GB of memory (which is still huge but otherwise would be far larger). Techniques like GQA thus help make extremely long context handling more tractable by trimming down the per-layer memory overhead.
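
A minimal sketch of the key/value head sharing behind GQA follows; the head counts (16 query heads, 4 KV groups) are illustrative assumptions, and real models choose their own grouping:

```python
import torch
import torch.nn.functional as F

def expand_kv_for_gqa(k: torch.Tensor, v: torch.Tensor, n_query_heads: int):
    """k, v: [batch, n_kv_heads, seq_len, d_k]. Repeat each KV head so every
    group of query heads attends to its shared key/value projection."""
    n_kv_heads = k.shape[1]
    group_size = n_query_heads // n_kv_heads      # query heads per shared KV head
    k = k.repeat_interleave(group_size, dim=1)    # -> [batch, n_query_heads, seq_len, d_k]
    v = v.repeat_interleave(group_size, dim=1)
    return k, v

# Example: 16 query heads sharing 4 KV heads; only 4 KV sets are stored in the cache.
batch, seq, d_k = 2, 128, 64
q = torch.randn(batch, 16, seq, d_k)
k = torch.randn(batch, 4, seq, d_k)
v = torch.randn(batch, 4, seq, d_k)
k_full, v_full = expand_kv_for_gqa(k, v, n_query_heads=16)
out = F.scaled_dot_product_attention(q, k_full, v_full, is_causal=True)
```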

**Multi-Head Attention Mechanics:** Given the importance of attention, it’s worth revisiting how a Transformer processes input at the tensor-operation level. For each layer and each token, the model computes query ($Q$), key ($K$), and value ($V$) vectors by multiplying the input hidden state $x$ by learned projection matrices ($W^Q, W^K, W^V$). Multi-head attention splits these vectors into $h$ heads (e.g., 16 or 32 heads) such that $Q_i, K_i, V_i$ are the $i$-th head’s projections (of dimension $d_k$). The attention output for each head is:

$$\text{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i,$$

which produces a weighted combination of values based on the similarity of queries and keys ([100M Token Context Windows — Magic](https://magic.dev/blog/100m-token-context-windows)) ([You Only Cache Once, arXiv:2405.05254](https://ar5iv.org/abs/2405.05254)). In practice, this is implemented with high-efficiency tensor kernels: matrix multiplications for $QK^{\top}$ and for multiplying by $V$. The softmax operation yields attention weights that sum to 1 over the sequence length. The outputs of all heads are then concatenated and projected through another matrix $W^O$ to form the layer’s final attention output. Pseudocode for a simplified attention forward pass is shown below:

```python
# Pseudocode: multi-head self-attention (single sequence, no batching for simplicity)
import math
import torch
import torch.nn.functional as F

def self_attention_forward(x, Wq, Wk, Wv, Wo, h):
    seq_len, d_model = x.shape
    d_k = d_model // h
    Q = x @ Wq  # [seq_len, d_model] -> [seq_len, h*d_k]
    K = x @ Wk
    V = x @ Wv
    # Reshape Q, K, V to [h, seq_len, d_k] for h heads
    Q = Q.view(seq_len, h, d_k).transpose(0, 1)
    K = K.view(seq_len, h, d_k).transpose(0, 1)
    V = V.view(seq_len, h, d_k).transpose(0, 1)
    outputs = []
    for i in range(h):
        attn_scores = F.softmax(Q[i] @ K[i].T / math.sqrt(d_k), dim=-1)  # [seq_len, seq_len]
        head_out = attn_scores @ V[i]  # [seq_len, d_k]
        outputs.append(head_out)
    # Concatenate heads and project back to the model dimension
    multi_head_out = torch.cat(outputs, dim=-1)  # [seq_len, h*d_k]
    return multi_head_out @ Wo  # [seq_len, d_model]
```

In modern implementations, efficient kernels (like FlashAttention) are used to compute attention in a memory-optimized way (fusing the softmax and avoiding explicitly materializing large QK^T matrices), which is crucial for long sequences. Still, the core operations are dense matrix multiplies — making Transformers amenable to GPU/TPU acceleration.
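
As a concrete illustration, PyTorch exposes such a fused kernel through `torch.nn.functional.scaled_dot_product_attention`, which dispatches to FlashAttention-style implementations when available. This is a usage sketch with assumed shapes, not the internals of any particular model:

```python
import torch
import torch.nn.functional as F

batch, n_heads, seq_len, d_k = 1, 32, 4096, 128
q = torch.randn(batch, n_heads, seq_len, d_k, device="cuda", dtype=torch.bfloat16)
k = torch.randn(batch, n_heads, seq_len, d_k, device="cuda", dtype=torch.bfloat16)
v = torch.randn(batch, n_heads, seq_len, d_k, device="cuda", dtype=torch.bfloat16)

# Fused attention: softmax(QK^T / sqrt(d_k)) V, computed without materializing the
# full [seq_len, seq_len] score matrix; is_causal applies the autoregressive mask.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # [batch, heads, seq, d_k]
```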

Comment below if you want me to keep going and sort out the formatting of this document here!

### 2.2 Scaling to Extreme Context Lengths

One of the most dramatic advancements is the extension of LLM context windows from the typical 2K–4K tokens to tens of thousands, and recently, to millions or more. Two challenges arise with long contexts: memory usage (storing and processing the KV cache) and computational cost (attention operations scale quadratically with sequence length). Several architectural innovations address these challenges:

- **Efficient KV Cache Management:** DeepSeek-AI’s research introduced Multi-Head Latent Attention (MLA), a technique to compress the key/value cache at each layer ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1)). Instead of storing full-dimension $K$ and $V$ for every past token, MLA projects them into a smaller latent space. Concretely, the model learns a pair of projection matrices to reduce the dimensionality of keys and values: $K_{\text{compressed}} = K \, W_{\text{down}}$, and similarly for $V$. These compressed vectors (of size, say, 512 instead of 8192) are what get stored as the context. When computing attention, an up-projection is applied to map the compressed $K/V$ back to the original space for the dot-product with $Q$. By choosing a low rank for this compression, the memory overhead of the cache becomes significantly smaller. DeepSeek-V3, for example, sets the compression dimension to 512 (for K/V) and also compresses queries to a smaller size to save activation memory. This yields massive memory savings: the KV cache size can be reduced by multiples (depending on the compression ratio). You Only Cache Once (YOCO) is another approach, from Microsoft Research, that reimagines the Transformer as a two-stage decoder architecture in which *only one layer stores a full KV cache* ([You Only Cache Once, arXiv:2405.05254](https://ar5iv.org/abs/2405.05254)). YOCO uses a *self-decoder* to encode the sequence into a global cache, which a *cross-decoder* then attends to (like an encoder-decoder model, but both are the same stack). Because the cache is shared, it *avoids storing separate KV for each layer*. This drastically reduces memory — effectively by a factor of $N_{\text{layers}}$ in the limit. YOCO demonstrated scaling to *1 million tokens of context* with near-perfect retrieval (i.e., the model could recover specific “needle in a haystack” tokens from 1M-token prompts), and profiling showed huge improvements in memory and throughput across context lengths. These advancements — low-rank compression (DeepSeek’s MLA) and caching once (YOCO) — are complementary and represent the state of the art in managing long contexts. A minimal sketch of low-rank KV compression appears after this list.

- **100,000,000+ Token Context Windows:** Pushing context even further, some experimental models have achieved extraordinary lengths. Magic.dev’s LTM-2 is a recently announced model capable of a *100 million token* context window — essentially allowing the model to attend to entire codebases or libraries of books in one pass ([100M Token Context Windows — Magic](https://magic.dev/blog/100m-token-context-windows)). To handle $10^8$ tokens, Magic did not simply scale vanilla attention (which would be infeasible); instead, they developed a *sequence-dimension* algorithm that is **1000× cheaper than standard attention at 100M length**. They also leveraged techniques like multi-hop attention (where the model learns to attend in stages, jumping through the context via hashing) and custom CUDA kernels. The result is that while a naive 405B model at 100M context would require *638* 80GB GPUs just to store the KV cache (≈51 TB), Magic’s model can do it in under one GPU’s memory. In their approach, they hash segments of the context to enable constant-time lookups and use a fixed memory per token (~800 bytes/token) by sharing and compressing caches aggressively. These developments are still cutting-edge research, but they indicate a trend: **hybrid architectures** (combining Transformers with retrieval systems or hashing schemes) are enabling *extreme context lengths*. While most practical LLMs in 2024 top out around 32K to 256K context using scaled RoPE or other interpolation methods, experimental systems are showing that context length can potentially be stretched to millions by rethinking the attention mechanism entirely.

- **Local and Hierarchical Attention:** Another class of techniques for long context is to restrict or structure the attention pattern. Models like Longformer, BigBird, etc., introduced *sparse attention* patterns (sliding windows, global tokens, etc.) to handle longer texts efficiently. More recently, approaches like **hierarchical attention** or **fragment-based processing** allow breaking a long input into chunks that are processed independently and then combined at a higher level. These are not as universal as the methods above, since they often target specific tasks (like long document understanding) and can sacrifice exact equivalence to full attention. Nonetheless, they are useful tools. For instance, an LLM could read a 100-page document by first summarizing each page (with a Transformer that can handle that page length), then attending to the summaries with another Transformer. This two-stage approach is analogous to how YOCO introduces a cross-decoder — first encode, then attend — but usually done outside the model architecture (as a separate process or prompt engineering technique).
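
As referenced in the KV-cache discussion above, here is a minimal sketch of low-rank KV compression in the spirit of MLA. The dimensions (8192 model width, 512 latent width) and the simple down/up projection pair are illustrative assumptions, not DeepSeek-V3’s exact formulation (which, among other things, handles the RoPE components separately):

```python
import torch
import torch.nn as nn

class LowRankKVCache(nn.Module):
    """Store a compressed latent per token instead of full keys/values (MLA-style sketch)."""
    def __init__(self, d_model: int = 8192, d_latent: int = 512):
        super().__init__()
        self.w_down = nn.Linear(d_model, d_latent, bias=False)   # compression at cache time
        self.w_up_k = nn.Linear(d_latent, d_model, bias=False)   # reconstruct keys at attention time
        self.w_up_v = nn.Linear(d_latent, d_model, bias=False)   # reconstruct values at attention time
        self.cache = []                                          # list of [batch, 1, d_latent] latents

    def append(self, hidden_state: torch.Tensor) -> None:
        # hidden_state: [batch, 1, d_model] for the newly generated token
        self.cache.append(self.w_down(hidden_state))

    def keys_values(self) -> tuple[torch.Tensor, torch.Tensor]:
        latents = torch.cat(self.cache, dim=1)                   # [batch, seq_so_far, d_latent]
        return self.w_up_k(latents), self.w_up_v(latents)        # full-width K, V for attention
```

In this configuration only 512 values per token per layer are cached, a 16× reduction compared with storing 8192-wide keys and values directly.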

In summary, state-of-the-art LLMs have moved beyond the 2048-token limitation through a mix of clever architectural changes (compressing or sharing the attention cache) and algorithmic innovations (sparse or multi-hop attention). The *long context* challenge is being actively tackled: DeepSeek-V3 handles 128K tokens natively ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1)), models like Mistral-7B have been fine-tuned to 100K+ tokens with ALiBi/RoPE scaling, and new research is breaking the 1M-token barrier ([You Only Cache Once, arXiv:2405.05254](https://ar5iv.org/abs/2405.05254)). These breakthroughs unlock new capabilities, such as feeding an entire code repository or multiple books to an LLM and having it reason across them in one go.

### 2.3 Massive Model Scaling and Mixture-of-Experts (MoE)

Scaling the number of parameters in LLMs has been a reliable path to improving performance (as per scaling laws), but it runs into the wall of limited computation: a 500B+ parameter dense model is extraordinarily expensive to train and use. **Mixture-of-Experts (MoE)** is a paradigm that allows models to scale to trillions of parameters **without a proportional increase in computation per token**. In an MoE layer, multiple parallel FFN “expert” networks are available, and a learned gating function routes each input token to a subset of these experts. Only those experts are activated to process the token, making the computation *sparse* (each token sees only a fraction of the model’s parameters).

Modern MoE architectures, like those in DeepSeek-V3 and Google’s GLaM, incorporate *hundreds of experts per layer*. DeepSeek-V3, for example, replaces most FFN layers with MoE layers (all except the first three transformer blocks) ([DeepSeek-V3 Technical Report](https://arxiv.org/pdf/2412.19437)). Each MoE layer has **256 routed expert networks** plus 1 shared expert, and each token is routed to the top-8 experts for its input (with non-zero gating weights). This means that at inference time, although the model has a *total* of 671B parameters across all experts combined, only ~37B parameters are activated per token (the attention layers, the shared expert, and the 8 routed experts selected for that token). The gating network (a small neural network or linear layer) takes the token’s hidden state and produces a probability distribution over the experts (256+1 in this case). The top-$k$ experts with the highest probability are selected (sparsely), and the token’s representation is sent through each of those experts’ feed-forward networks. The outputs are combined (summed, weighted by the gate probabilities) to produce the layer output. Mathematically, if $h$ is the input to the MoE layer, $E_i(h)$ is the $i$-th expert’s FFN output, and $G_i(h)$ is the gate’s softmax weight for expert $i$, then the MoE output is:

$$\text{MoE}(h) = \sum_{i \in \mathcal{S}_k(h)} G_i(h)\; E_i(h),$$

where $\mathcal{S}_k(h)$ is the set of top-$k$ selected experts for input $h$. This operation is implemented efficiently by parallelizing the experts (each expert is like a separate feed-forward layer matrix; with 256 experts, one could distribute them across accelerators or process in batches) and using sparse dispatch (only compute $E_i(h)$ for selected $i$).
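
A minimal sketch of top-$k$ MoE routing implementing the equation above follows; the expert count, $k$, and the dense loop over experts are simplifying assumptions (real systems use sparse dispatch and expert parallelism rather than iterating over every expert):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [n_tokens, d_model]
        gate_logits = self.gate(h)                                 # [n_tokens, n_experts]
        topk_logits, topk_idx = gate_logits.topk(self.k, dim=-1)   # select top-k experts per token
        topk_weights = F.softmax(topk_logits, dim=-1)              # renormalize over the selected k
        out = torch.zeros_like(h)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                                # expert id chosen in this slot
            w = topk_weights[:, slot].unsqueeze(-1)                # gate weight G_i(h)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * expert(h[mask])         # G_i(h) * E_i(h)
        return out
```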

One challenge with MoE is ensuring **load balancing** — i.e., that different experts are used approximately evenly. Without care, the gating mechanism might learn to send most tokens to a few experts, leading to bottlenecks (and under-utilization of others). Early MoE models (e.g., Google’s Switch Transformer) introduced an auxiliary loss to encourage a uniform assignment of tokens to experts. DeepSeek-V3 *pioneered an auxiliary-loss-free strategy* for MoE load balancing: it keeps expert usage balanced by dynamically adjusting a per-expert bias on the routing scores, avoiding the need for an extra loss term. This is important because tuning the auxiliary-loss weight can be tricky — too high and it hurts model quality, too low and it fails its purpose. By eliminating it, DeepSeek-V3 simplified MoE training while still keeping experts evenly loaded.

Overall, MoE allows “scaling width” almost arbitrarily: DeepSeek-V3’s 671B-parameter MoE model (61 layers, hidden size 7168, 128 attention heads) uses 256 routed experts per MoE layer ([DeepSeek-V3 Technical Report](https://arxiv.org/pdf/2412.19437)), and Google’s GLaM had an even larger configuration (up to 1.2T parameters with 64 experts per layer). These models show near-linear scaling of certain abilities with total parameter count, *without* the compute cost that a dense model of equal size would incur. However, training MoEs at scale requires excellent distributed training infrastructure, since experts can be placed on different GPUs, and tokens in a batch must therefore be **communicated** to the appropriate device (this is known as the **“expert parallel”** regime). DeepSpeed and custom frameworks use *All-to-All* communication primitives to shuffle token representations among GPUs during MoE layers. Despite the complexity, MoE is a promising direction: it offers a way to get **“the best of both worlds”** — extremely high parameter count (capacity) and relatively low computation per token (since only a fraction of the network is active).

In addition to MoE, another architectural addition — aimed at throughput rather than capacity — is **multi-token prediction heads**. Recently, models are being trained to predict multiple tokens in a single forward pass (rather than the standard single next token). For example, DeepSeek-V3 employed a *multi-token prediction objective* with prediction depth $D=1$ (meaning each forward pass predicts the true next token plus one additional token beyond it) ([DeepSeek-V3 Technical Report](https://arxiv.org/pdf/2412.19437)). The **Medusa** framework (discussed in the next subsection) similarly adds extra decoding heads to generate several tokens in parallel. These strategies barely change the parameter count, but they alter the output mechanism to improve throughput. By predicting, say, 2–4 tokens per step, the model can effectively double or quadruple its generation speed at inference (at the cost of some training complexity). Importantly, multi-token prediction can be combined with MoE: DeepSeek-V3’s MoE layers and compression are all still utilized, and on top of that it performs $D=1$ extra-token prediction.

### 2.4 Inter-GPU Communication and Training Methodologies

Training “massive-scale” models (tens to hundreds of billions of parameters and beyond) requires distributing the model across multiple GPUs or even multiple servers. Two primary forms of parallelism are used: **data parallelism** and **model parallelism**. In *data parallelism*, each GPU has a copy of the model, and they each train on different batches of data (periodically synchronizing gradients). In *model parallelism*, the model itself is split across GPUs so that each GPU holds only a chunk of the parameters. Modern LLM training often uses a combination (hybrid) of both.

**Intra-layer (Tensor) Parallelism:** One common scheme is **tensor model parallelism**, where heavy matrix multiplications are split among devices. For example, a weight matrix of size $8192 \times 8192$ could be split column-wise across 4 GPUs (each handles an $8192 \times 2048$ slice). During a forward pass, each GPU computes its partial result; an all-reduce operation (summing across GPUs) then gathers the full output. This requires high-speed interconnects (NVLink or InfiniBand) because gradients must be all-reduced across GPUs for each layer. NVIDIA’s Megatron-LM library employs tensor parallelism within nodes, as all-reduce latency is lower when GPUs are on the same server (e.g., 8 GPUs on an A100 DGX share NVLink). By splitting matrices, tensor parallelism reduces memory per GPU and allows layers to be larger than a single GPU’s memory. The cost is added communication every forward/backward pass.
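
A minimal sketch of the split-and-all-reduce pattern described above, using `torch.distributed`; the row-wise split (each rank holds a slice of the input features and the matching rows of the weight matrix, and partial products are summed) and the two-GPU world size are illustrative assumptions rather than Megatron-LM’s exact scheme:

```python
import torch
import torch.distributed as dist

def row_parallel_linear(x_shard: torch.Tensor, w_shard: torch.Tensor) -> torch.Tensor:
    """x_shard: [batch, d_in / world_size], w_shard: [d_in / world_size, d_out].
    Each rank computes a partial product; an all-reduce sums them into the full output."""
    partial = x_shard @ w_shard                       # partial [batch, d_out] on this rank
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)    # sum partial results across GPUs
    return partial                                    # full output, identical on every rank

# Usage sketch (launched with torchrun --nproc_per_node=2, one shard per GPU):
# dist.init_process_group("nccl")
# rank, half = dist.get_rank(), d_in // 2
# x_shard = x[:, rank * half:(rank + 1) * half].cuda(rank)
# w_shard = W[rank * half:(rank + 1) * half, :].cuda(rank)
# y = row_parallel_linear(x_shard, w_shard)
```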

**Inter-layer (Pipeline) Parallelism:** Another dimension is **pipeline parallelism**, where different layers of the model are assigned to different GPUs. For instance, in a 4-way pipeline parallel setup, GPU0 holds layers 1–15, GPU1 holds layers 16–30, etc. A microbatch of data is fed into GPU0, which processes its layers and then passes the intermediate activations to GPU1, and so on. Pipeline parallelism keeps each GPU busy with a portion of the network. However, naive pipelining can lead to bubbles (idle times) while one microbatch is still being processed by earlier stages. Techniques like the 1F1B (One Forward One Backward) schedule overlap the computation: after a warmup, each GPU alternates between a forward pass on one microbatch and a backward pass on another, keeping the pipeline full. The *PipeDream-Flush* or *1F1B* schedule significantly reduces pipeline idle time, effectively achieving high hardware utilization. Pipeline parallelism is generally used across nodes (since it’s easier to partition layers across different machines than to do a fast all-reduce across machines every layer). For example, a 70B model might use 8-way tensor parallelism within each node and 4-way pipeline parallelism across 4 nodes (32 GPUs in total).

**Optimizer State Sharding:** In addition to splitting the forward pass, systems like **DeepSpeed ZeRO** partition the *optimizer states and gradients* across data-parallel workers to save memory. In full data parallel training, each GPU would have its own copy of the 70B model’s gradients and optimizer moments (so 70B worth of states per GPU). ZeRO stage 1/2 splits those so each GPU holds only 1/N of them, and only the model weights (and perhaps some activations) are replicated. Communication (all-gather or reduce-scatter) is used to fetch parameters when needed and to aggregate gradients. This enables training ultra-large models with limited GPU memory.

**Low-Precision Training:** Another key optimization is **mixed-precision** arithmetic. Almost all large-scale training runs now use FP16 or bfloat16 (BF16) for model weights and activations, either with an FP32 master copy of weights for stability (classic mixed precision) or relying on BF16’s range to avoid an FP32 copy. Cutting precision in half yields significant speedups on tensor cores and halves memory usage for storage. The bleeding edge is training with **FP8** precision. Recent work (NVIDIA’s Transformer Engine, and papers like *FP8-LM*) shows that with careful scaling and range management, FP8 can be used for GEMM operations in training ([DeepSeek-V3 Technical Report](https://arxiv.org/pdf/2412.19437)). DeepSeek-V3 implemented an FP8 training framework: most matrix multiplies (forward, backward weight gradient, backward activation gradient) run in FP8, roughly doubling throughput, while certain sensitive operations remain in higher precision. They kept layer normalization, softmax, and the *gating for MoE* in BF16/FP32, as well as the accumulation of weight updates (the “master weights”) in higher precision. With these precautions, they achieved <0.25% relative training-loss deviation using FP8 compared to BF16. The reward is faster training and reduced memory traffic. As hardware (like the NVIDIA H100) supports FP8, we expect broader adoption of FP8 in large-model training.
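
A minimal sketch of a BF16 mixed-precision training step using `torch.autocast` follows; FP8, as used by DeepSeek-V3, additionally requires specialized kernels (e.g., NVIDIA’s Transformer Engine) and per-tensor scaling that this sketch does not attempt:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(4096, 4096).cuda()           # stand-in for a transformer block
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def training_step(x: torch.Tensor, target: torch.Tensor) -> float:
    optimizer.zero_grad(set_to_none=True)
    # Matrix multiplies run in bfloat16 inside the autocast region; the parameters and
    # optimizer state stay in FP32, playing the role of the "master" copy.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        out = model(x)
        loss = F.mse_loss(out, target)
    loss.backward()                                   # gradients accumulate in FP32
    optimizer.step()
    return loss.item()
```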

In summary, training a frontier model might use *data-parallel groups of 8 GPUs*, each group using *tensor parallelism* for matrix ops, with *pipeline parallelism* across 4 groups (32 GPUs in total, for instance). The training loop orchestrates communication so that gradients are synchronized (all-reduced) among data-parallel replicas, optimizer states are sharded (ZeRO) to save memory, and pipeline micro-batching keeps all GPUs busy. On each GPU, computations are done in mixed precision (BF16 or FP8), with occasional high-precision operations for stability. All these techniques enable training models like DeepSeek-V3 (671B parameters) in a reasonable time: DeepSeek-V3 was trained in roughly 2.788 million H800 GPU-hours, which, while large, is attainable on a decent cluster.

### 2.5 Efficient Inference and Decoding Strategies

Training huge models is only half the battle; *efficient inference* (text generation) is equally important. Naively, generating each token with a 100B-parameter model requires on the order of the full 100B parameters’ worth of multiply-accumulate operations (attention + MLP), plus attention over the entire context (e.g., 2048 tokens) — and generation is inherently sequential. Researchers have developed methods to accelerate inference without retraining the model from scratch.

**Speculative Decoding:** This method uses a **draft model** (a smaller or distilled model) to accelerate a larger model’s sampling. The idea, introduced by Leviathan et al., is to let a fast draft model generate several tokens in one go, and then have the large model *verify* those tokens in a single pass. For example, the small model might predict 4 tokens greedily; the large model then runs, and if it agrees on the first token, we accept it and check the second, and so on until a mismatch occurs (at which point decoding resumes normally). By doing so, one large-model call can produce multiple tokens on average, achieving >2× speed-ups in practice with negligible quality loss. The drawback is needing a second model and the logic to manage the speculative execution. It’s useful in settings where latency is critical and one can afford to maintain an extra smaller model.
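
A minimal sketch of greedy speculative decoding under simplifying assumptions: both models are assumed to expose a hypothetical `logits(tokens)` call returning next-token logits for every position, and verification accepts a drafted token only when the large model’s greedy choice agrees (the rejection-sampling variant that preserves the sampling distribution is omitted):

```python
def speculative_decode(draft_model, large_model, tokens: list[int],
                       n_draft: int = 4, n_new: int = 64) -> list[int]:
    while len(tokens) < n_new:
        # 1) Draft: the small model proposes n_draft tokens greedily.
        proposal = list(tokens)
        for _ in range(n_draft):
            logits = draft_model.logits(proposal)          # [len(proposal), vocab]
            proposal.append(int(logits[-1].argmax()))
        drafted = proposal[len(tokens):]

        # 2) Verify: one large-model pass scores all drafted positions at once.
        logits = large_model.logits(proposal)              # [len(proposal), vocab]
        accepted = []
        for i, tok in enumerate(drafted):
            greedy = int(logits[len(tokens) + i - 1].argmax())
            if greedy == tok:
                accepted.append(tok)                       # large model agrees: keep the draft
            else:
                accepted.append(greedy)                    # mismatch: take the large model's token
                break                                      # and discard the remaining drafts
        tokens.extend(accepted)
    return tokens
```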

**Medusa: Multi-Head Decoding:** The **Medusa framework** eliminates the need for a separate draft model by augmenting the LLM itself with extra *decoding heads*. Normally, a Transformer has one output head (the final linear layer to vocabulary logits). Medusa adds, say, 2 or 4 additional heads that also take the last layer’s hidden state as input and produce logits for tokens further ahead. During inference, the model uses a special attention mechanism (a “tree-based attention”) that allows these heads to propose multiple next tokens in parallel and then verify them within a few steps. Essentially, it can branch out the generation like a tree and confirm which branch is valid. Medusa can achieve speed-ups similar to speculative decoding (2–3× faster) **without an external model**. There are two training modes: *Medusa-1*, which trains only the new heads on a frozen backbone (so the original model’s outputs remain unchanged, giving *lossless* acceleration), and *Medusa-2*, which fine-tunes the whole model together with the heads for even better prediction accuracy. In evaluations, Medusa-1 reached ~2.2× speedup with essentially identical output quality, and Medusa-2 achieved up to 3.6× speedup with a slight fine-tuning of the model. Medusa’s approach is a form of model augmentation: by investing some compute in parallel heads, it cuts down the number of sequential decoding steps required.
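
A minimal sketch of attaching extra decoding heads to a frozen backbone in the spirit of Medusa-1; the single-linear-layer heads and the head count are illustrative assumptions (the Medusa paper uses small residual heads and tree-based verification, both omitted here):

```python
import torch
import torch.nn as nn

class MultiHeadDecoder(nn.Module):
    """Backbone LM head plus extra heads that predict tokens 2, 3, ... steps ahead."""
    def __init__(self, d_model: int, vocab_size: int, n_extra_heads: int = 3):
        super().__init__()
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)       # standard next-token head
        self.extra_heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size, bias=False) for _ in range(n_extra_heads)
        )

    def forward(self, hidden: torch.Tensor) -> list[torch.Tensor]:
        # hidden: [batch, seq_len, d_model] from the (frozen) backbone's last layer
        return [self.lm_head(hidden)] + [head(hidden) for head in self.extra_heads]

# Medusa-1-style training: freeze the backbone and train only the extra heads, so that
# head i learns to predict the token (i + 1) positions ahead of the current one.
# for p in backbone.parameters():
#     p.requires_grad_(False)
```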

**Parallelism in Beam Search or Sampling:** Even without fancy new methods, one can exploit model parallelism for faster inference. If a model is serving multiple requests (or multiple beams in a search) simultaneously, it can batch them and use GPU parallelism to process many tokens at once. This doesn’t reduce latency for a single sequence, but increases throughput. On the other hand, techniques like **early exit** in decoding (e.g., if all beams in a beam search have the same prefix, some compute can be shared) can save some cycles.

**Quantization for Inference:** Reducing model precision at inference is very popular. Int8 quantization (with calibration for outlier features) can run a 175B model on much smaller hardware at the cost of a slight hit in perplexity. Newer methods such as *GPTQ* allow quantizing to 4-bit weights with minimal accuracy drop, enabling LLMs to run on a single GPU or even a CPU. These methods don’t change the architecture, but we note them as an important practical technique. The extreme-context methods described earlier also influence inference: for instance, a 100K-token context is slow to process in one go; some implementations cache earlier layers on CPU or disk and only do full attention every N tokens (relying on partial attention in between). All these engineering tricks ensure that users can actually *use* these large models in reasonable time.

In closing, the frontier of LLM architecture combines *scale (MoE, billions of parameters)*, *long memory (extended context, retrieval, compression)*, and *speed (efficient decoding and optimized kernels)* to push performance. Table 2 summarizes a few architectural innovations and which models employ them:

| **Innovation** | **Description** | **Used in** |
|---|---|---|
| **RMSNorm + SwiGLU** | Pre-normalization & gated activation in Transformer blocks for stable deep training. | LLaMA-2, PaLM, etc. |
| **Rotary Positional Embedding (RoPE)** | Rotating token embeddings in Q/K space for better long-range modeling and extrapolation. | GPT-NeoX, LLaMA, Mistral |
| **Grouped-Query Attention (GQA)** | Multiple heads share key/value projections to reduce memory (KV cache grouped). | LLaMA 2 70B |
| **Context Compression (MLA)** | Low-rank compression of attention keys/values to shrink the memory footprint of long contexts. | DeepSeek-V2/V3 |
| **YOCO (Cache Once)** | One shared KV cache for all layers (decoder-decoder architecture) for extreme context lengths. | Research prototype (Microsoft) |
| **MoE Layers (256 experts)** | Sparsely activated feed-forward experts to expand model capacity without proportional compute cost. | DeepSeek-V3 (671B), GLaM (1.2T) |
| **Multi-Token Decoding** | Model predicts multiple tokens per step (extra decoding heads or objectives) to accelerate generation. | DeepSeek-V3 (2-token), Medusa |
| **Speculative Decoding** | Use a smaller draft model or extra heads to propose tokens faster, then validate with the main model. | Medusa (no draft model), SpecInfer, etc. |
| **FP8 Training** | 8-bit floating point for matrix multiplications in training (with higher-precision accumulations). | DeepSeek-V3, NVIDIA Transformer Engine |

*Table 2: Notable architectural and training innovations in recent LLMs.*

## 3. Post-Training & Reinforcement Learning

After the base model is pre-trained on large-scale data, a critical phase begins: **post-training alignment**. This is where the model is guided to produce useful, safe, and human-aligned outputs. Techniques in this stage include *supervised fine-tuning on instructions*, *reinforcement learning from feedback*, and other automated alignment methods. We also discuss “self-play” strategies where models improve by generating and evaluating content themselves.

### 3.1 Instruction Tuning and Supervised Fine-Tuning (SFT)

The first step in aligning an LLM is often **supervised fine-tuning (SFT)** on human-written demonstrations. For example, OpenAI’s GPT-3 was later refined into InstructGPT by fine-tuning on prompt-response pairs labeled as ideal by humans. SFT teaches the model to follow instructions and stick to a desired style/tone. Datasets for SFT include prompt-response pairs from various tasks (summarization, Q&A, etc.), often curated by humans or distilled from the web. Open-source efforts like Alpaca, LLaMA-Adapter, Dolly 2.0 collected or generated such instruction-following data for models like LLaMA. While SFT is effective at teaching the model *what to do*, it doesn’t fully guarantee *how* the model should balance different objectives (helpfulness vs. correctness vs. harmlessness, etc.) when those trade-offs aren’t explicit in data.

### 3.2 Reinforcement Learning from Human Feedback (RLHF)

To further align models with human preferences, **Reinforcement Learning from Human Feedback (RLHF)** has become a standard approach ([RLAIF vs. RLHF](https://arxiv.org/html/2309.00267v3)). The classic RLHF pipeline (as used in OpenAI’s InstructGPT and Anthropic’s Claude) involves:

1. **Collecting human feedback** on model outputs. Usually, human annotators are shown a prompt and two or more model responses, and they rank which response they prefer. This creates a dataset of comparisons or a scalar reward signal for outputs.
2. **Training a Reward Model (RM)**. A separate neural network is trained to predict human preference scores. It takes as input a prompt and a candidate response, and outputs a reward score. The RM is trained on the human-labeled comparisons (using techniques like Bradley-Terry or direct regression to a preference score). Once trained, this RM serves as a stand-in for human judgment.
3. **Fine-tuning the policy (the original LLM)** with reinforcement learning, using the RM as a guide. Typically, an algorithm like Proximal Policy Optimization (PPO) is used, where the “policy” is the LLM that generates text. The reward signal comes from the RM’s evaluation of the LLM’s output. Additionally, a **KL-divergence penalty** is often included to keep the new policy close to the original pre-trained model, to avoid drifting into nonsensical but high-reward regions. The objective for the policy $π_\theta$ on prompt $x$ with generated response $y$ can be:

$$L(\theta) = -\mathbb{E}_{y \sim π_\theta}[ r(x,y) ] + \beta \, \mathrm{KL}(π_\theta(\cdot|x)\;||\;π_{\text{pretrain}}(\cdot|x) ),$$

which the policy seeks to minimize (here $r(x,y)$ is the reward model’s output, and $\beta$ is a weighting for the KL penalty). In practice, the gradient is estimated via policy gradient methods (like PPO which uses an advantage estimator and clipping to stabilize training).

OpenAI’s implementation and also Google’s Gemini, for instance, use such RLHF with some variations. Google reportedly applied vanilla policy gradient (REINFORCE) with a large reward model and KL regularization similar to InstructGPT. The KL term can be seen as coming from a probabilistic interpretation (optimal reward maximization leads to KL control). By penalizing divergence, the model is less likely to produce odd outputs that exploit the reward model’s flaws.

**Group-PPO Variants:** One novel twist is DeepSeek’s use of **Group Relative Policy Optimization (GRPO)**. GRPO is a variant of PPO that does not require a separate value network (or critic). Instead, it generates a *group of candidate outputs* for each prompt using the current policy (or an older version of the policy) and uses the distribution of reward scores within that group as the baseline. Essentially, for each prompt, sample, say, M=8 outputs from the model and evaluate all of them with the reward model. If one particular output has a reward higher than the group’s average, it gets a positive advantage; if lower, a negative advantage. The policy is then updated to maximize the expected reward advantage, which implicitly pushes it to produce more outputs like the high-reward ones ([DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1)). GRPO removes the need to train a separate value function (which in PPO is usually a clone of the model, doubling memory). By using *relative rewards* within a batch of samples, it cleverly estimates the advantage. This approach worked for DeepSeek-V2 and V3, enabling reinforcement learning at the scale of a model with ~37B active parameters without an extra critic network.
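
A minimal sketch of the group-relative advantage computation described above; the group size and the mean/std normalization reflect the GRPO formulation, while the surrounding PPO-style clipped update is omitted:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [n_prompts, group_size] reward-model scores for M sampled outputs per prompt.
    Each output's advantage is its reward relative to its own group's statistics."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, M = 4 sampled outputs each.
rewards = torch.tensor([[0.10, 0.70, 0.40, 0.20],
                        [0.90, 0.80, 0.95, 0.85]])
advantages = group_relative_advantages(rewards)
# Outputs scoring above their group's mean receive positive advantages and are reinforced.
```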

**Reward Design:** Crafting the reward function is central to RLHF. Usually the reward comes from a learned model as above, but sometimes you can incorporate **rule-based rewards** for objective criteria. DeepSeek-V3 did this: for math questions with a known correct answer, they gave a positive reward if the model’s answer was exactly correct (e.g., matching a solution given in a box). For coding problems, they actually ran the code and used unit tests or a compiler to check correctness, using that as a reward signal. These *rule-based rewards* are high-precision and resistant to exploitation (the model can’t “trick” a compiler — it either passes the tests or not). Wherever a ground-truth check or rule can be applied, it’s beneficial to use it; this reduces the load on the learned reward model to handle everything. For more open-ended tasks (creative writing, general Q&A), only a learned reward model can judge quality. DeepSeek trained a reward model starting from their SFT model and, interestingly, had it not only output a score but also model the **chain-of-thought leading to that score** (perhaps to help interpretability and reduce reward hacking).
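
A minimal sketch of such rule-based rewards under simplifying assumptions: the math answer is extracted from a `\boxed{...}` pattern, and the code reward simply runs the candidate together with assert-style tests in a subprocess (production systems sandbox execution far more carefully):

```python
import re
import subprocess
import tempfile

def math_reward(model_output: str, reference_answer: str) -> float:
    """+1 if the boxed answer matches the reference exactly, else 0."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match and match.group(1).strip() == reference_answer.strip():
        return 1.0
    return 0.0

def code_reward(candidate_code: str, test_code: str, timeout_s: int = 5) -> float:
    """+1 if the candidate passes the unit tests, else 0 (a crash or timeout counts as failure)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
```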

**Automated Feedback (RLAIF):** Human feedback is costly. An emerging idea is **Reinforcement Learning from AI Feedback (RLAIF)**, where AI models generate feedback labels instead of humans ([RLAIF vs. RLHF](https://arxiv.org/html/2309.00267v3)). Recent studies (e.g., by Google/DeepMind) have found that using an off-the-shelf strong model (like GPT-4) to label responses can produce a reward model nearly as good as a human-labeled one. RLAIF involves using a large language model to annotate or score outputs, training a reward model on those *AI-generated preference labels*, and then doing RL as usual. Impressively, RLAIF has achieved **comparable performance to RLHF** on summarization and dialogue tasks. In some cases, it even *outperforms* the human-labeled approach, especially if the AI labeler is very capable and consistent in its evaluations. This line of research suggests we can scale alignment methods without a linear increase in human labor — a crucial factor for building models like GPT-4, which likely required a huge number of feedback examples. Of course, one must ensure the AI feedback model is aligned itself; otherwise, biases or errors in its judgment could propagate. Techniques like *direct RLAIF*, where the policy model queries a separate LLM (asking, in effect, “Would a human prefer this output?”) and directly uses that as a reward signal, have also been proposed. The overall trend is **automating alignment**, either by using AI to generate training prompts and answers (self-instruct, self-play) or AI to generate feedback (RLAIF). This reduces reliance on scarce human labelers and may unlock faster iteration.

### 3.3 Self-Play and Other Post-Training Methods

**Self-Play for Skill Improvement:** In certain domains, an LLM can be fine-tuned by *interacting with itself* or a simulated environment. This concept is borrowed from reinforcement learning in games (e.g., AlphaZero self-play). For LLMs, “self-play” might mean the model is put in dialogue with a copy of itself: one plays the user, one the assistant, generating new conversational data. This was used in OpenAI’s early dialogue models — they let the model chat with itself about a topic and then had humans review some of those chats to pick high-quality ones for fine-tuning. Self-play can rapidly create a lot of in-domain data. Another example is reasoning or tool use: you can have the model try to solve puzzles or use tools in a simulated loop and learn from successes and failures (perhaps with an automated checker giving reward).

**Chain-of-Thought Distillation:** A technique where a larger model’s reasoning traces (its step-by-step thinking) are used to fine-tune a smaller model. Post-training, one might take a strong model like GPT-4, have it produce an answer *with an explanation*, and then fine-tune the target LLM to generate similar explanations and answers. This guides the model toward more human-understandable reasoning and can indirectly align it (because the explanations can be verified or at least make the model’s internal process more transparent).

**Constitutional AI:** Proposed by Anthropic, this method removes human labelers from the direct loop by instead defining a set of principles (a “constitution”) and having the AI judge its outputs according to those principles. For example, a principle might be “The assistant should not be rude.” The model generates an output, then *a second step* where the model (or a separate model) evaluates the output against the principles and revises it if needed. This can be done iteratively to refine behavior. In training, they generate many prompt-response pairs and have an AI judge rewrite any response that violates rules. The rewritten responses form the fine-tuning data. This is another form of AI-assisted alignment: the model learns from feedback given by a fixed set of rules enforced by an AI reviewer.

**Safety and Red-Teaming:** Post-training alignment also involves adversarial testing. Teams will “red-team” the model (try to elicit bad behavior) and then use those findings to further fine-tune the model on *avoiding traps*. For instance, if a model discloses private info in a prompt, one may add training examples where the correct response is a refusal. RLHF can be extended to RL for safety: define some reward for adhering to safety and a negative reward for unsafe completions (with the reward model or rules detecting unsafe content). By training on a wide range of adversarial prompts, the model becomes more robust.

DeepSeek-V3 and Gemini, as large cutting-edge models, incorporate many of these post-training techniques. DeepSeek-V3’s post-training involved a combination of **Supervised Fine-Tuning (SFT)** on curated instructions, then **RL** using both rule-based and model-based rewards as described, and even a form of model self-improvement in which they distilled reasoning from an earlier model (DeepSeek-R1) into the final model. They also mention “self-rewarding” and ensuring the model doesn’t exploit the reward function. Google’s Gemini (and the smaller Gemma variant) also went through extensive fine-tuning. According to reports, Google’s alignment for Gemini included **reinforcement learning from human feedback** (as in ChatGPT’s method), and they notably scaled up the size of the reward model (possibly using a model as large as the policy), combined with the KL penalty to keep outputs in check. Gemini’s multimodal nature (text + images) means they also had to do multimodal RLHF, which is even more complex (ensuring it doesn’t produce offensive images, etc.), and they encountered some issues, as noted in their handling of image prompts.

To guide implementation of RLHF, here is high-level pseudocode for the training loop:

```python
# Pseudocode: RLHF fine-tuning (PPO-like updates); loader/helper names are illustrative
policy_model = load_sft_model()        # initialize policy from SFT weights
reward_model = load_reward_model()     # pretrained reward model (kept frozen)
reference_model = load_sft_model()     # frozen copy of the SFT model for the KL penalty

for iteration in range(num_iterations):
    # Step 1: Sample outputs from the current policy
    prompts = sample_batch(prompts_dataset)
    with torch.no_grad():
        outputs, logprobs = policy_model.generate(prompts)  # outputs and their log-probabilities

    # Step 2: Compute rewards for each output
    with torch.no_grad():
        rewards = reward_model.score(prompts, outputs)  # scalar reward per output

    # (Optional: subtract a baseline to form advantages,
    #  e.g., the batch mean reward or a reference-policy reward)
    advantages = rewards - rewards.mean()

    # Step 3: Compute policy loss with KL penalty
    old_logprobs = logprobs.detach()  # full PPO would use the new/old ratio with clipping
    # Recompute log-probabilities under the current (trainable) policy
    new_logprobs = policy_model.compute_logprobs(prompts, outputs)
    # Policy-gradient loss (REINFORCE objective: maximize reward -> minimize -reward * logprob)
    pg_loss = -advantages * new_logprobs
    # KL divergence penalty (current policy vs. frozen reference/SFT model)
    kl_penalty = beta * (new_logprobs - reference_model.logprob(prompts, outputs))
    loss = (pg_loss + kl_penalty).mean()

    # Step 4: Update the policy model (e.g., Adam on the loss)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice, implementations use batches of prompts, and the policy model might generate multiple samples per prompt (for GRPO, etc.), but the above pseudocode captures the essence: sample from the model, score with reward model, compute a policy gradient update with a regularizer.
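
For the group-based variant (GRPO-style) mentioned above, a minimal sketch of computing group-relative advantages, in which several completions are sampled per prompt and rewards are normalized within each group so that no separate value network is needed, might look like this (tensor shapes are illustrative):

```python
import torch

# Group-relative advantages (GRPO-style): sample G completions per prompt,
# score each with the reward model, then normalize rewards within each group.
def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: [num_prompts, G] -- one scalar reward per sampled completion
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)   # [num_prompts, G]

# Example: 2 prompts, 4 samples each
rewards = torch.tensor([[1.0, 0.2, 0.5, 0.9],
                        [0.1, 0.1, 0.8, 0.4]])
advantages = group_relative_advantages(rewards)
```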

### 3.4 Evaluation and Iteration

After each stage of alignment training, the model is evaluated on a suite of benchmarks: truthfulness and helpfulness (e.g., TruthfulQA, helpful/harmless dialogue evaluations), harmlessness (e.g., checking that the model refuses when asked for disallowed content), and general abilities (to ensure alignment didn’t degrade core skills). If issues are found (e.g., the model starts refusing too much, a phenomenon known as *over-alignment* or becoming overly cautious), the team may adjust the reward balance or add more data to correct it.

Both DeepSeek-V3 and Gemini underwent these iterative improvements. DeepSeek-V3, for instance, claims strong performance on a wide range of evaluations, approaching GPT-4’s level, after alignment. They also note the importance of stability: DeepSeek-V3’s entire training and RL phase proceeded without loss spikes or restarts, which is a testament to good hyperparameter choices and robust training procedures.

In conclusion, post-training alignment techniques like RLHF have become as crucial as the initial unsupervised training. They transform a raw model into a usable AI assistant. Innovations like rule-based rewards, AI-generated feedback (RLAIF), and improved RL algorithms (GRPO, DPO, etc.) are making this process more effective and efficient. The end result is models that not only excel on academic benchmarks but also behave in ways that are more aligned with human expectations and ethical norms ([RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback](https://arxiv.org/html/2309.00267v3#:~:text=Furthermore%2C%20we%20take%20a%20step,the%20scalability%20limitations%20of%20RLHF)).

## 4. Implementation-Level Details

This final section bridges the concepts into implementation guidance. We provide some mathematical formulations and pseudocode patterns relevant to building advanced LLMs, tying together the data, architecture, and training considerations discussed above.

### 4.1 Model Architecture Implementation

An LLM’s code typically defines the model as a stack of Transformer decoder blocks. Each block will implement multi-head attention and an MLP, often leveraging library primitives (e.g., PyTorch’s `nn.MultiheadAttention` or custom fused kernels for FlashAttention). Key steps to implement:

- **Rotary Embeddings:** When implementing attention, if using RoPE, you’ll apply a rotation to the Q and K vectors before the $QK^T$ computation. This can be done by interleaving the vector into complex planes and multiplying by a rotation matrix whose phase advances with position ([100M Token Context Windows — Magic](https://magic.dev/blog/100m-token-context-windows#:~:text=For%20each%20decoded%20token%2C%20LTM,a%20100M%20token%20context%20window)). Many open-source codebases have efficient RoPE implementations (often just a few lines); a minimal standalone sketch appears at the end of this subsection.

- **Mixed Precision:** Wrap model layers with a decorator or context to use FP16/BF16, e.g., PyTorch’s `autocast` or the `transformer_engine` library for FP8. Take care that LayerNorms and the attention Softmax run in FP32 or at least BF16 for stability ([DeepSeek-V3 Technical Report](https://arxiv.org/pdf/2412.19437)). For FP8, one must calibrate scaling factors. NVIDIA’s TransformerEngine handles this with per-tensor scaling, or fine-grained scaling as described in DeepSeek-V3 (they use tile-wise grouping to extend the FP8 dynamic range) ([DeepSeek-V3 Technical Report](https://arxiv.org/pdf/2412.19437)). Implementing FP8 from scratch is non-trivial, but with existing libraries one can swap in FP8 linear layers fairly easily.

- **MoE Routing:** If adding MoE layers, one implementation approach is to split the batch’s tokens among experts. Libraries like Fairseq’s MoE modules or Tutel (from Microsoft) provide optimized all-to-all communication and expert kernels. The pseudocode might look like:

```python
# Simplified MoE forward (assuming one token per batch element for clarity)
gates = softmax(gating_network(x))                    # [batch, n_experts]
topk_experts = gates.topk(k=8, dim=-1)                # top-k routing (k=8 here)
expert_outputs = torch.zeros(x.size(0), n_experts, d_model)  # scratch buffer

# Dispatch each token to its top experts
for token_idx, (exp_indices, exp_scores) in enumerate(zip(topk_experts.indices,
                                                          topk_experts.values)):
    for j, exp_idx in enumerate(exp_indices):
        y = experts[exp_idx](x[token_idx])            # forward token through expert network
        expert_outputs[token_idx, exp_idx] = exp_scores[j] * y

# Sum the gate-weighted outputs from the selected experts for each token
output = expert_outputs.sum(dim=1)
```

In practice, you use batch processing: group tokens by expert, concatenate their inputs, run the expert in one go on that batch, then scatter outputs back. The frameworks handle these details and the necessary communication if experts are on different devices.

- **Long Contexts:** To implement context lengths beyond the trained limit (say your model was trained on 4K, and you want 16K), one might use RoPE extrapolation or fine-tune with position interpolation. Implementation-wise, ensure that `max_position_embeddings` is set high and that the model uses positional encodings (like sinusoidal or RoPE) that can extrapolate. For extremely long contexts (100K+), one needs to chunk the attention or use a special technique, such as *treating the transformer as a recurrent mechanism or using IO-aware streaming*. Magic’s method, for example, likely required custom CUDA kernels for hash-based memory access. As an implementer, start with more modest long-context methods: e.g., use Hugging Face’s **Longformer** attention pattern or implement a sliding window if needed. The YOCO architecture (decoder-decoder) would be a significant change: two sets of transformer layers and a routing of cache from one to the other ([[2405.05254] You Only Cache Once: Decoder-Decoder Architectures for Language Models](https://ar5iv.org/abs/2405.05254#:~:text=which%20only%20caches%20key,compared%20to%20Transformer%20in%20various)). That is an advanced project, but using available long-context models (like MPT-100k or LongLLaMA) as references can guide how to handle positional encodings and attention variants.
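
To make the RoPE step from the first bullet concrete, here is a minimal standalone sketch, assuming the half-split channel layout used by many open-source implementations; production code would precompute and cache the cos/sin tables.

```python
import torch

# Minimal rotary position embedding (RoPE) sketch: rotate pairs of channels
# of q and k by a position-dependent angle before computing Q K^T.
def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: [batch, seq_len, n_heads, head_dim] with even head_dim
    b, s, h, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)      # [half]
    angles = torch.arange(s, dtype=torch.float32)[:, None] * freqs[None, :]   # [seq, half]
    cos = angles.cos()[None, :, None, :]   # [1, seq, 1, half]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by the per-position angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# q = apply_rope(q); k = apply_rope(k)   # then attention scores = q @ k^T as usual
```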

### 4.2 Training Loop and Optimization

Implementing a training loop for LLMs requires careful attention to efficiency and stability:

- **Data Pipeline:** Use streaming datasets if data is huge. Tokenize on the fly if needed, but it may be better to pre-tokenize and store in a memory-mapped file for throughput. Ensure shuffling at scale (e.g., shuffle indices, or shuffle chunks of data periodically) to avoid bias from data ordering. For multi-node, each worker should get a different slice of data (handle with DistributedSampler in PyTorch or similar).

- **Gradient Accumulation:** Due to memory limits, you might accumulate gradients over multiple microbatches before updating (effectively increasing batch size without needing it all in GPU at once). This just means calling `loss.backward()` on each microbatch and stepping the optimizer every N microbatches, taking care to scale the loss appropriately or divide gradients by N.

- **Learning Rate Schedule:** Large models often use a long warmup (to avoid instability at the start) and then a cosine or linear decay, e.g., warm up for 1–2% of total steps, then cosine-decay to a small fraction of the peak LR (a minimal sketch of such a schedule, combined with gradient accumulation and clipping, follows this list). Choose an appropriate max LR based on batch size and model size (too high can blow up training). DeepSeek-V3 mentioned they did not encounter loss spikes, indicating they tuned things well (likely a low enough LR and perhaps gradient clipping). Always monitor training loss: if it diverges (sudden NaNs or infs), the usual suspects are a learning rate that is too high, FP16 overflow (use grad clipping or dynamic loss scaling), or a bug in the model (like a LayerNorm instability).

- **Checkpointing and Resumption:** Save checkpoints periodically. For huge models, saving full optimizer states is heavy; some opt to save just model weights (and perhaps a smaller set of states) and rely on the learning rate schedule and some luck to resume. Libraries like DeepSpeed or PyTorch distributed can save sharded checkpoints (each rank saves its shard of the model). Make sure to test that you can load a checkpoint and resume deterministically.

- **Monitoring**: Log perplexity on a validation set. For trillion-token training, having an “early validation” scheme as FineWeb suggests is useful: e.g., after every 10B tokens, evaluate on a fixed set of tasks (like Lambada accuracy, etc.) to see if the model is improving and not overfitting. Given the expense, one cannot iterate as fast as with small models, so these checks are precious.
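
As referenced in the learning-rate bullet above, here is a minimal sketch of a training step combining linear warmup plus cosine decay, gradient accumulation, and gradient clipping; the names `model`, `optimizer`, and `batches` are assumed to exist elsewhere, and an HF-style model output with a `.loss` field is assumed.

```python
import math
import torch

# Linear warmup followed by cosine decay from max_lr down to min_lr.
def lr_at(step, max_lr, min_lr, warmup_steps, total_steps):
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

def train_step(model, optimizer, batches, step, accum=8, clip=1.0,
               max_lr=3e-4, min_lr=3e-5, warmup_steps=2000, total_steps=100_000):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step, max_lr, min_lr, warmup_steps, total_steps)
    optimizer.zero_grad(set_to_none=True)
    for micro in batches[:accum]:              # accumulate over microbatches
        loss = model(**micro).loss / accum     # scale so gradients average over microbatches
        loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)  # guard against spikes
    optimizer.step()
```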

### 4.3 RLHF and Fine-tuning Implementation

Implementing RLHF is complex, but tools like Hugging Face’s TRL (Transformer Reinforcement Learning) library provide abstractions. Key points:

- **Reward Model**: Typically a GPT-style model with a scalar head. Libraries such as Hugging Face’s TRL include reward-model training utilities (e.g., a `RewardTrainer`), or one can simply take a pretrained model and add a linear layer on top of the final hidden state (at the end-of-sequence token) to predict a reward. Train this on your human preference data (usually pairs of responses, trained to score the preferred one higher); a minimal sketch of such a model and its pairwise loss follows this list.

- **Policy Model**: Start from your pre-trained (or SFT) model weights. It’s beneficial to initialize policy and reward from the same base (so they understand the language similarly). However, freeze the reward model when doing RL to prevent it from drifting.

- **PPO Loop**: Use a small batch size for PPO (like 16–64 sequences per update) and generate multiple sequences per prompt if doing a multi-sample strategy. Limit the length of generation to something reasonable to avoid huge variance. Compute rewards, advantages (typically using a value head or the group method). Ensure to clip gradients or use PPO’s clipping on the objective to keep the policy update in a safe range (PPO clip param epsilon ~0.2). Also anneal the KL coefficient — often one starts with a certain KL target and adjusts β so that the KL divergence between the new policy and original stays near that target (too high β and the model won’t move, too low and it might wander off).

- **Safety Techniques**: It’s common to also bake in a “reject” or “don’t know” response in the training. For instance, if a prompt is to produce something disallowed, the desired behavior is to refuse. That can be part of the reward function (reward refusals when appropriate).
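
As a concrete reference for the reward-model bullet above, here is a minimal sketch of a scalar-head reward model with the standard pairwise preference loss; the `gpt2` backbone name is only illustrative.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

# Minimal reward-model sketch: a pretrained backbone plus a scalar head on the
# last non-padding token, trained with -log sigmoid(r_chosen - r_rejected).
class RewardModel(nn.Module):
    def __init__(self, base_name: str = "gpt2"):   # backbone name is illustrative
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        last_idx = attention_mask.sum(dim=1) - 1                      # final real token per sequence
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]  # [batch, hidden]
        return self.value_head(last_hidden).squeeze(-1)               # scalar reward per sequence

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise ranking loss: push the preferred response's reward above the other's.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
```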

### 4.4 Inference: Serving the Model

When it comes time to serve the model (for experimentation or deployment), some implementation tips:

- Use **model parallel** if the model is bigger than one GPU’s memory (most 70B models can fit in 1x80GB with 4-bit quantization, but not in full FP16). Libraries like Hugging Face Accelerate or DeepSpeed Inference can load a model across multiple GPUs. They handle splitting weights and also coordinate the forward pass (with attention communication as needed).

- **KV Cache**: Preallocate a big tensor for the cache, shaped [layers × batch × heads × seq_len × head_dim]. If doing streaming generation, keep appending to it. For very long sequences, you might need to offload older parts of the cache to CPU if memory is a concern (some frameworks do this automatically for certain attention patterns).

- **Efficient Batch Decoding**: If generating for multiple prompts concurrently, pad the prompts and use a batch generation to make use of vectorization. But note that if lengths diverge, you’ll waste compute on padding for the shorter sequences. A common approach is to regroup queries by similar lengths or run a loop that at each step drops finished sequences and continues with the rest.

- **Medusa Implementation**: If one has fine-tuned a model with Medusa heads, generating with it means producing several tokens in parallel. The code needs to run the model and its extra heads and then follow the tree-search procedure (as outlined in the Medusa paper) to accept or reject the proposed tokens. This is more complex than standard autoregressive generation, so one would likely use the authors’ open-sourced code (the Medusa paper provides it ([[2401.10774] Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads](https://arxiv.org/abs/2401.10774#:~:text=Comments%3A%20The%20code%20for%20this,for%20this%20version%29%20https%3A%2F%2Fdoi.org%2F10.48550%2FarXiv.2401.10774))). If not using Medusa, one can still try simpler speculative decoding with an available smaller model: generate k tokens with the small model, then run one step of the large model to approve them (open-source speculative-decoding repositories and the Hugging Face blog have reference code); a minimal greedy draft-and-verify sketch follows this list.
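
For the simpler speculative-decoding option in the last bullet, here is a minimal greedy draft-and-verify sketch (a simplification of the full rejection-sampling algorithm), assuming two Hugging Face-style causal LMs that share a tokenizer and a batch size of one.

```python
import torch

# Greedy draft-and-verify: the small draft model proposes k tokens, the large
# model scores them in one forward pass, and tokens are accepted up to the
# first disagreement. A full implementation also handles EOS and sampling.
@torch.no_grad()
def speculative_step(draft_model, target_model, input_ids, k=4):
    # 1) Draft k tokens greedily with the small model.
    draft_ids = draft_model.generate(input_ids, max_new_tokens=k, do_sample=False)
    proposed = draft_ids[:, input_ids.shape[1]:]                    # [1, k]
    # 2) One forward pass of the large model over prompt + proposals.
    logits = target_model(draft_ids).logits                         # [1, len, vocab]
    # Target model's greedy choice at each proposed position.
    preds = logits[:, input_ids.shape[1] - 1 : -1].argmax(dim=-1)   # [1, k]
    # 3) Accept proposals up to the first mismatch, then take the target's token.
    accepted = []
    for i in range(proposed.shape[1]):
        if proposed[0, i] == preds[0, i]:
            accepted.append(proposed[0, i])
        else:
            accepted.append(preds[0, i])
            break
    new_tokens = torch.stack(accepted).unsqueeze(0)
    return torch.cat([input_ids, new_tokens], dim=1)
```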

### 4.5 Mathematical Footing

It’s often helpful for implementers to understand some of the math behind these models, especially when debugging or modifying them:

- **Layer Normalization vs RMSNorm:** RMSNorm computes $\text{RMSNorm}(h) = \frac{h}{\sqrt{\frac{1}{d}\sum_i h_i^2 + \epsilon}} \cdot g$ (where $g$ is a learned gain vector and there is no bias). It omits the mean subtraction that LayerNorm performs, which works fine for transformers and saves computation. If implementing from scratch, ensure you handle the dimensions correctly and include the epsilon for stability; a minimal module appears at the end of this subsection.

- **Optimizer States:** Adam is commonly used; it keeps first and second moment estimates. These can be memory heavy for 30B+ models. Sharded optimizers or low-memory alternatives (e.g., Adafactor for very large models) can be considered. Adafactor saves memory by not keeping a full second moment for all params, using a factored approximation, at some cost to convergence quality.

- **Loss Scaling:** In mixed precision, to avoid FP16 underflow, loss scaling multiplies the loss by a factor, does backward (so grads are scaled up), then divides grads by the factor. Most frameworks now do dynamic loss scaling automatically. Monitor for overflow warnings; if using BF16, it’s usually not needed since BF16 has wide range.

- **Equation References:** For those inclined, refer to the original Transformer paper *Attention is All You Need* (Vaswani et al. 2017) for the exact equations of multi-head attention and the position-wise feedforward. The equations we informally described match those: attention uses $Q, K, V$ projections and softmax weighting ([100M Token Context Windows — Magic](https://magic.dev/blog/100m-token-context-windows#:~:text=For%20each%20decoded%20token%2C%20LTM,a%20100M%20token%20context%20window)), and the feedforward is typically two linear layers with an activation in between (e.g., $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$), or a gated variant such as SwiGLU, $\mathrm{FFN}(x) = (\mathrm{SiLU}(xW_1) \odot xW_2)\,W_3$, which uses three weight matrices.

- **Memory complexity:** If implementing a long-context model, be aware of the $O(n^2)$ attention cost. FlashAttention avoids materializing the full $n \times n$ attention matrix (so activation memory grows roughly linearly with sequence length), but the compute remains quadratic. If a model is 20B parameters and you go from a 2K to a 16K context, attention compute grows by $8^2 = 64\times$ (and so does the naive attention-matrix memory), which can blow past GPU limits. So techniques like sliding windows or spilling the KV cache to CPU memory become necessary. Evaluate the trade-offs: for example, processing a 16K context in 2K chunks attended with a stride will not exactly emulate full attention, but it can approximate it.
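
As referenced in the RMSNorm bullet above, a minimal PyTorch module matching that formula might look like this:

```python
import torch
import torch.nn as nn

# Minimal RMSNorm: normalize by the root-mean-square of the features
# (no mean subtraction, no bias) and scale by a learned gain g.
class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gain = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute in FP32 for stability, then cast back (see the mixed-precision notes).
        rms = x.float().pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return (x.float() * rms).type_as(x) * self.gain
```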

### 4.6 Verification and Testing

Finally, after implementing all pieces, it’s important to verify correctness:

- Compare the forward pass of your model on a short dummy input with a reference implementation (like Hugging Face) to ensure your customizations (RoPE, etc.) line up.
- Write unit tests for submodules (does your attention give the same result as a naive attention for small sizes? Does your MoE routing preserve shapes and gradients?).
- When training on a small dataset or for a few steps, monitor that the loss goes down and the model can overfit a small batch (a sanity check that training is functioning).
- Use pretrained weights if possible to sanity-check: for example, load a public 6B model in your code and ensure you get identical outputs to the official version. This can catch subtle bugs.
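
As an example of such a unit test, here is a minimal check of an attention implementation (PyTorch's `scaled_dot_product_attention` stands in for a custom kernel) against a naive softmax reference on a tiny input:

```python
import torch

# Naive softmax(QK^T / sqrt(d)) V reference for small inputs.
def naive_attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

def test_attention_matches_reference():
    torch.manual_seed(0)
    q, k, v = (torch.randn(2, 4, 8, 16) for _ in range(3))  # [batch, heads, seq, dim]
    fast = torch.nn.functional.scaled_dot_product_attention(q, k, v)
    ref = naive_attention(q, k, v)
    assert torch.allclose(fast, ref, atol=1e-4), "attention mismatch"

test_attention_matches_reference()
```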

---

By integrating the above components — massive curated data, advanced model architecture, and rigorous alignment and optimization — one can implement a cutting-edge LLM. The current state-of-the-art, as described, involves unprecedented *scale* (trillions of tokens, hundreds of billions of parameters) and *sophistication* (long memory, modular experts, feedback-driven fine-tuning). While challenging, the open-source community has produced tools, reference models, and datasets that make it feasible for researchers and engineers to build high-performance LLMs following the blueprint of recent breakthroughs. Each piece, from data cleaning to inference optimization, contributes to the overall system. This guide, with references to the latest literature, aims to provide a roadmap for replicating and learning from the best practices in large language model development as of 2024–2025.

**References:**

4. Penedo, G., Malartic, Q., Hesslow, D., *et al.* (2023). “The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only.” *arXiv preprint arXiv:2306.01116*
5. Hugging Face (2023). “FineWeb: decanting the web for the finest text data at scale.” *HuggingFace Technical Report*
6. Cerebras (2023). “SlimPajama: A 627B token, cleaned and deduplicated version of RedPajama.” *Cerebras Blog*. https://cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama/
7. Kili Technology (2023). “Building High-Quality Datasets: Insights from Hugging Face’s FineWeb.” *Kili Blog*
8. Allen Institute for AI (2023). “Dolma: 3 trillion token open corpus for language model pretraining.” *AI2 Blog / arXiv preprint*
9. Sun, Y., *et al.* (2024). “You Only Cache Once: Decoder-Decoder Architectures for Language Models.” *arXiv:2405.05254*. https://arxiv.org/abs/2405.05254
10. DeepSeek-AI (2024). “DeepSeek-V3 Technical Report.” *arXiv:2412.19437*. https://arxiv.org/abs/2412.19437
11. Magic.dev (2024). “100M Token Context Windows.” *Magic.dev Blog*. https://magic.dev/blog/100m-token-context-windows
12. Reddit discussion on Magic (2024). “My thoughts on Magic’s 100M context length.” *r/LocalLLaMA*
13. Cai, T., *et al.* (2024). “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads.” *arXiv:2401.10774*
14. Google DeepMind (2023). “RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback.” *arXiv:2309.00267*. https://arxiv.org/abs/2309.00267
15. Google (2023). “Gemini Technical Report Highlights.” *Interconnects.ai Blog*
16. DeepSeek-AI (2024). “DeepSeek-V3 Technical Report” (Reinforcement Learning section). *arXiv:2412.19437*
17. Ouyang, L., *et al.* (2022). “Training language models to follow instructions with human feedback.” *NeurIPS 2022* (InstructGPT paper).

18. Bai, Y., *et al.* (2022). “Constitutional AI: Harmlessness from AI Feedback.” *Anthropic Tech Report / arXiv:2212.08073*
*(Additional references from provided context have been integrated inline in IEEE style citations.)*
