Companion post to the LayerNorm & RMSNorm carousel. Watch the companion video: Why LayerNorm and RMSNorm Matter in a C LLM Runtime. Previously: Activation Functions: The Nonlinearity Inside Neural Networks.
The previous post was about the nonlinearities that bend a transformer into something expressive. This sequel is about the quieter operation that keeps those nonlinearities trainable once depth, residuals, and large learning rates enter the picture. LayerNorm and RMSNorm rarely get the spotlight, but they are the reason modern transformer stacks stay numerically sane long enough to learn.
A good way to frame the pair is simple. Activations change what the network can represent. Normalization changes whether the optimizer can survive the trip through dozens of layers without the hidden state exploding, collapsing, or drifting into a regime where gradients become useless.
Stability: Normalization does not add new information. It makes existing information easier to optimize through depth.
Roadmap for this post
We will start with why activations drift in deep networks, then unpack the full LayerNorm formula, build a visual analogy for γ and β, compare Pre-LN to Post-LN, and finally explain why RMSNorm became the default in many modern decoder-only models.
The second half moves from intuition to implementation: a full forward-pass walkthrough, backward-pass formulas, transformer placement, and how the C-Kernel-Engine project turns these equations into SIMD kernels.
Section 1: Why Normalization Matters
The classical phrase here is internal covariate shift. As training updates the weights of earlier layers, the distribution seen by later layers keeps changing under their feet. What counted as a “normal” activation on one optimization step can look shifted, stretched, or wildly rescaled a few hundred updates later.
That instability compounds with depth. By layer 10 to 30, distributions can become completely unmanageable if every block receives hidden states on a different numerical scale than it saw on the previous batch. The network is still computing, but the optimizer is no longer working with a calm, stationary target.
Layer 10→30: Deep stacks amplify small statistical drifts until later layers see inputs on wildly different scales.
The symptoms are familiar if you have ever watched a bad training curve. Loss spikes appear even when the model seemed stable a moment earlier. Gradients either explode into huge updates or vanish into values too small to matter, and individual channels begin producing extreme activations that dominate everything around them.
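To make the compounding concrete, here is a minimal toy sketch. It assumes each layer does nothing but multiply its input by a fixed gain of 1.3, an arbitrary number chosen for illustration and not a model of any real network. The helper names `toy_layer` and `run_stack` are hypothetical.

```python
import math

def rms(v):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(a * a for a in v) / len(v))

def toy_layer(v, gain=1.3):
    """Stand-in for a learned layer: rescales every feature by a fixed gain (assumption)."""
    return [gain * a for a in v]

def run_stack(v, depth, normalize):
    """Apply the toy layer `depth` times, optionally rescaling to unit RMS after each one."""
    for _ in range(depth):
        v = toy_layer(v)
        if normalize:
            scale = rms(v)
            v = [a / scale for a in v]
    return v

x = [2.0, -1.0, 0.5, 3.0, -0.5]
raw = run_stack(x, depth=30, normalize=False)
normed = run_stack(x, depth=30, normalize=True)
print("input RMS:", rms(x))
print("RMS after 30 layers without normalization:", rms(raw))
print("RMS after 30 layers with normalization:", rms(normed))
```

A per-layer gain only slightly above one is enough to push the unnormalized scale up by orders of magnitude over 30 layers, while the normalized run stays at unit scale.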

In transformers, normalization is usually applied token by token across the feature dimension. That means the model is not comparing one sample in the batch against another. It is cleaning up the [C] feature vector for one token position so the next sublayer receives a predictable numeric range.
Token-wise: LayerNorm and RMSNorm normalize across features inside one token, not across the batch.
Once that cleanup exists, residual connections become far more useful. The residual branch can carry information forward while the normalized branch feeds attention or the feed-forward network with inputs at a controlled scale. That combination is the core stability trick behind modern transformer blocks.
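A small sketch of what token-wise means in practice. The `hidden` values are made up, γ and β are left out for brevity, and plain Python lists stand in for a [T, C] tensor. Each row is normalized using only its own statistics.

```python
import math

def layer_norm_token(token, eps=1e-5):
    """Normalize one token's feature vector using only that token's mean and variance."""
    c = len(token)
    mu = sum(token) / c
    var = sum((v - mu) ** 2 for v in token) / c
    return [(v - mu) / math.sqrt(var + eps) for v in token]

# Hypothetical hidden states: 3 token positions, C = 4 features each.
hidden = [
    [0.1, 0.2, 0.3, 0.4],
    [10.0, 20.0, 30.0, 40.0],
    [-5.0, 5.0, -5.0, 5.0],
]

# Each token is handled independently; no statistic is shared across rows.
normed = [layer_norm_token(tok) for tok in hidden]
for tok in normed:
    print([round(v, 3) for v in tok])
```

The first two tokens differ in scale by a factor of 100 but come out nearly identical, because each one is judged only against its own mean and spread.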
What normalization buys you
Normalization is a scale-management layer. It keeps later blocks from having to relearn the meaning of “large” and “small” activations every few steps.
Just as importantly, it makes gradients flow through a network whose hidden states would otherwise widen or shrink unpredictably with depth.
Section 2: The LayerNorm Formula — Step by Step
The full LayerNorm equation is y = γ ⊙ (x − μ) / √(σ² + ε) + β. Every symbol there has a job. The first half standardizes one token vector, and the second half gives the model learned knobs for restoring the representation style it wants.
| Tensor | Shape | Role |
|---|---|---|
| x | [C] | Input features for one token position. |
| μ | scalar | Mean across all C features. |
| σ² | scalar | Variance across those same features. |
| γ | [C] | Learned per-feature scale. |
| β | [C] | Learned per-feature shift. |
| y | [C] | Output vector after normalization and affine restoration. |
Step 1: Compute the mean
The mean is μ = (1/C) Σ xᵢ. For one token, it answers a plain question: where is the center of this feature vector right now? If the hidden state is globally biased upward or downward, the mean captures that offset in a single scalar.
μ: The mean is one number summarizing the center point of all features in the token.
Step 2: Compute the variance
The variance is σ² = (1/C) Σ (xᵢ − μ)². Now the model asks how spread out the features are relative to their center. Large variance means the vector has strong contrast between channels; tiny variance means the channels are packed tightly together.
σ²: Variance measures spread, not direction. It tells you how wide the feature cloud is around its mean.
Step 3: Normalize the vector
The standardized coordinates are x̂ᵢ = (xᵢ − μ) / √(σ² + ε). Subtracting the mean centers the vector around zero. Dividing by the standard deviation rescales the spread so the variance is approximately one.
ε ≈ 1e-5: The epsilon term prevents division by zero when a token vector has vanishing variance.
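The role of ε is easiest to see on a degenerate input. The sketch below uses a hypothetical token whose features are all equal, so the variance is exactly zero. The helper name `standardize` is made up for this example.

```python
import math

def standardize(x, eps):
    """Center and rescale one token vector; eps is added under the square root."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

flat = [3.0, 3.0, 3.0, 3.0]  # hypothetical token with zero variance

# Without epsilon the denominator is sqrt(0), and Python raises on the division.
try:
    standardize(flat, eps=0.0)
    failed_without_eps = False
except ZeroDivisionError:
    failed_without_eps = True

safe = standardize(flat, eps=1e-5)
print("fails without eps:", failed_without_eps)
print("with eps:", safe)
```

In a C kernel the same input would not raise. It would quietly produce NaN or infinity, which is harder to notice and is the reason ε is part of the formula itself.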
Step 4: Restore expressive power with γ and β
Pure standardization would force every layer to live with the exact same mean and scale forever. That would stabilize training, but it would also be too restrictive. So LayerNorm immediately follows the normalized vector with a learned affine transform: y = γ ⊙ x̂ + β.
γ and β: γ stretches or compresses each feature; β shifts each feature to a learned center.

The key subtlety is that γ and β are not global scalars. They are vectors of length C. Each feature channel gets its own learned scale and shift, which means the network can preserve some dimensions as high-variance specialists while keeping others quiet and centered.
import numpy as np
x = np.array([2.0, -1.0, 0.5, 3.0, -0.5], dtype=np.float32)
gamma = np.ones_like(x)
beta = np.zeros_like(x)
eps = 1e-5
mu = x.mean()
var = ((x - mu) ** 2).mean()
xhat = (x - mu) / np.sqrt(var + eps)
y = gamma * xhat + beta
print(mu)
print(var)
print(np.round(xhat, 4))
print(np.round(y, 4))
One-line intuition
LayerNorm says: first make the token numerically well behaved, then give the model learned per-channel controls so that “well behaved” does not mean “information destroyed.”
Section 3: Visual Intuition — The Photo Editor Analogy
Imagine each transformer layer as a professional photo editor receiving an image from the previous stage. Sometimes that incoming photo is underexposed, sometimes overexposed, and sometimes the contrast is so extreme that small details disappear into the shadows or highlights. Raw activations behave the same way.
LayerNorm’s first move is the neutralization step. The expression (x − μ) / σ is like resetting brightness and contrast to a standardized baseline before any artistic choice is made. The editor now has a clean canvas instead of inherited chaos.
The second move is stylization. γ acts like a contrast knob that widens or compresses the distribution. β acts like a brightness offset that moves the whole distribution left or right after standardization. Normalize first, stylize second. That is the whole LayerNorm story in four words.

This analogy matters because it explains why the network keeps γ and β trainable. The model does not want every layer to look identical. It wants every layer to start from a stable baseline and then learn the exact contrast and offset that help the downstream computation.
Contrast vs brightness: γ changes spread. β changes center. Both are learned during training.
Once you see LayerNorm this way, the formula stops looking ceremonial. It is a pragmatic division of labor. Statistics clean up the signal; learned parameters put the model’s preferred style back in.
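The contrast and brightness reading can be checked numerically. This sketch sets γ = 2 and β = 3 uniformly across channels. That is an illustrative simplification, since real γ and β are learned per channel. It then measures the output statistics.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """y = gamma * (x - mu) / sqrt(var + eps) + beta for one token vector."""
    c = len(x)
    mu = sum(x) / c
    var = sum((v - mu) ** 2 for v in x) / c
    rstd = 1.0 / math.sqrt(var + eps)
    return [g * (v - mu) * rstd + b for v, g, b in zip(x, gamma, beta)]

def mean_and_std(v):
    """Population mean and standard deviation of a vector."""
    m = sum(v) / len(v)
    return m, math.sqrt(sum((a - m) ** 2 for a in v) / len(v))

x = [2.0, -1.0, 0.5, 3.0, -0.5]
neutral = layer_norm(x, gamma=[1.0] * 5, beta=[0.0] * 5)  # the clean canvas
styled = layer_norm(x, gamma=[2.0] * 5, beta=[3.0] * 5)   # illustrative contrast and brightness

print("neutral mean/std:", mean_and_std(neutral))
print("styled mean/std: ", mean_and_std(styled))
```

The neutral output sits at roughly zero mean and unit spread. The styled output has its spread set by γ and its center set by β, which is the division of labor the analogy describes.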
Why the analogy sticks
A photo editor does not destroy an image by correcting exposure. It makes the next creative adjustment predictable. LayerNorm does the same thing for hidden states.
Section 4: Pre-LN vs Post-LN — The 2-Line Revolution
The first big transformer generation popularized the Post-LN layout: y = LayerNorm(x + F(x)). BERT (2018) used that structure, and it works, but the gradient has to pass through a normalization operation at every block on its way backward. As depth rises, that repeated detour becomes a real optimization tax.
BERT ≈ 24 layers: Post-LN models were typically kept much shallower because optimization became fragile as depth increased.
GPT-2’s Pre-LN variant changed only the order: y = x + F(LayerNorm(x)). That tiny edit leaves the residual branch itself untouched by normalization. The result is a cleaner gradient highway that can carry useful signal through far more layers.
You can summarize the intuition with dy/dx = 1 + dF/dx. That identity term is the hero. Even if the learned branch is noisy or small, the residual route still gives the optimizer a direct path backward.
Gradient highway: The explicit “1” in the derivative keeps gradients from collapsing as easily across deep residual stacks.
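A scalar caricature shows why the identity term matters over many layers. It assumes every layer's branch derivative dF/dx is the same small number, 0.01. Real Jacobians are matrices that vary per layer, so treat this as intuition only. The function name `chain_gradient` is hypothetical.

```python
def chain_gradient(branch_grads, residual):
    """Multiply per-layer scalar derivatives; include the identity term when residual=True."""
    total = 1.0
    for g in branch_grads:
        total *= (1.0 + g) if residual else g
    return total

# Assumption: 96 layers, each with a small branch derivative of 0.01.
branch = [0.01] * 96
print("without residual path:", chain_gradient(branch, residual=False))
print("with residual path:   ", chain_gradient(branch, residual=True))
```

Without the "1", the product of small terms collapses toward zero. With it, the product stays on the order of one even when every branch contributes almost nothing.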
# Post-LN (BERT): normalize after the residual add
x = layer_norm(x + attention(x))
x = layer_norm(x + ffn(x))
# Pre-LN (GPT-2): normalize only the input of each sublayer
x = x + attention(layer_norm(x))
x = x + ffn(layer_norm(x))
| Question | Post-LN | Pre-LN |
|---|---|---|
| Where is normalization? | After the residual add. | Before the sublayer function. |
| Backward path | Must traverse LayerNorm each block. | Residual branch keeps an identity route. |
| Training depth | Fragile at large depth. | Stable enough for very deep LLM stacks. |
| Historical examples | BERT-family encoders. | GPT-2, GPT-3, and many later decoder stacks. |
This was not a cosmetic refactor. It was the architectural choice that helped unlock 96-layer scale in models like GPT-3. In practice, Pre-LN also reduces the need for extremely delicate warmup schedules because the residual path is numerically friendlier from the beginning.
GPT-3 = 96 layers: Very deep decoder-only transformers lean on Pre-LN to keep optimization stable at scale.
That is why calling it a two-line revolution is not exaggeration. The code change is tiny. The optimization consequence is enormous. Sometimes the most important architecture win is not a new block. It is moving one old block two lines upward.
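The two layouts can be run side by side on a toy block. The `sublayer` function below is a hypothetical stand-in for attention or the FFN (a fixed element-wise map), and γ and β are omitted. The point is only where the normalization sits relative to the residual add.

```python
import math

def layer_norm(x, eps=1e-5):
    """Standardize one token vector (no learned gamma/beta in this sketch)."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    """Hypothetical stand-in for attention or FFN."""
    return [0.5 * math.tanh(v) for v in x]

def post_ln_block(x):
    # Normalization wraps the residual sum, so the skip path is normalized too.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x):
    # Normalization touches only the branch input; x itself is added back untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [2.0, -1.0, 0.5, 3.0, -0.5]
post = post_ln_block(x)
pre = pre_ln_block(x)

# In Pre-LN the original input survives inside the output: pre - branch == x.
branch = sublayer(layer_norm(x))
recovered = [p - b for p, b in zip(pre, branch)]

print("Post-LN output:     ", [round(v, 4) for v in post])
print("Pre-LN output:      ", [round(v, 4) for v in pre])
print("Pre-LN minus branch:", [round(v, 4) for v in recovered])
```

The last line prints the original input back, which is the residual highway in miniature. The Post-LN output has been re-centered and rescaled, so the input is no longer directly recoverable from it.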
Why Pre-LN became the default
Pre-LN lets the residual stream behave like a stable carrier signal while the normalized branch does the harder nonlinear work. That separation is exactly what deep language models needed.
Section 5: RMSNorm — Dropping the Mean
RMSNorm keeps the scale control idea but simplifies the operation. Its formula is y = γ ⊙ x / √(mean(x²) + ε). There is no mean subtraction and no learned β term.
No μ, no β: RMSNorm keeps scale normalization and removes explicit recentering and the output bias term.
Why can that work? Because in many transformer settings, the essential stabilizer is not recentering the vector to zero mean. It is controlling the overall magnitude so one token does not enter the next matrix multiply with a wildly different norm than its neighbors.
Mean subtraction is often redundant because surrounding linear layers already have biases or affine freedom that can absorb a recentering effect. If the model mainly cares about keeping the hidden state norm predictable, then RMSNorm gives most of the benefit at lower computational cost. That trade-off is attractive in massive decoder-only models where the same operation runs billions of times.
Compute savings: Skipping the mean pass removes work from both forward and backward kernels.
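Here is a minimal sketch of the RMSNorm formula with γ fixed to ones. It feeds in two hypothetical tokens that point in the same direction but differ in magnitude by a factor of 100, to show the scale being controlled while the mean is left alone.

```python
import math

def rms_norm(x, gamma, eps=1e-5):
    """y = gamma * x / sqrt(mean(x^2) + eps); no mean subtraction and no beta."""
    mean_sq = sum(v * v for v in x) / len(x)
    rstd = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * rstd for v, g in zip(x, gamma)]

def rms(v):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(a * a for a in v) / len(v))

gamma = [1.0] * 5
small = [2.0, -1.0, 0.5, 3.0, -0.5]
large = [100.0 * v for v in small]  # same direction, 100x the magnitude

out_small = rms_norm(small, gamma)
out_large = rms_norm(large, gamma)

print("input RMS: ", rms(small), rms(large))
print("output RMS:", rms(out_small), rms(out_large))
print("output mean:", sum(out_small) / len(out_small))
```

Both outputs land at an RMS of roughly one, so the next matrix multiply sees the same magnitude either way. The output mean is not zero, because nothing in the formula recenters the vector.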

| Property | LayerNorm | RMSNorm |
|---|---|---|
| Centering | Subtracts mean μ. | Keeps the original mean. |
| Scale control | Divides by standard deviation. | Divides by root-mean-square. |
| Learned parameters | γ and β. | Usually only γ. |
| Typical use today | Still common in encoders and many libraries. | Very common in modern decoder-only LLMs. |
That is why names like LLaMA, Mistral, and Gemma show up in RMSNorm conversations so often. Most modern decoder-only transformer families now choose the simpler norm because it is cheaper and empirically strong. The industry conclusion has been blunt: if dropping the mean barely hurts quality and saves real work, keep the simpler layer.
LLaMA / Mistral / Gemma: RMSNorm became a practical default for efficient decoder-only transformer stacks.
The core idea of RMSNorm
Normalize the size of the vector, not necessarily its center. For many language models, that is the stabilizer that matters most.
Section 6: Forward Pass — Numerical Walkthrough
Take the concrete vector x = [2.0, -1.0, 0.5, 3.0, -0.5]. A numerical walkthrough is useful because LayerNorm and RMSNorm feel similar in words but produce visibly different outputs. Here we will use the default initialization idea γ = 1 and β = 0 unless stated otherwise.
For LayerNorm, the mean is μ = (2.0 + (-1.0) + 0.5 + 3.0 + (-0.5)) / 5 = 0.8. The variance is ((2.0−0.8)² + (−1.0−0.8)² + (0.5−0.8)² + (3.0−0.8)² + (−0.5−0.8)²) / 5 = 2.26. So the standard deviation is about 1.5033.
The normalized LayerNorm vector is therefore approximately [0.7982, -1.1973, -0.1996, 1.4634, -0.8647]. Because γ = 1 and β = 0 in this demonstration, the output y is the same as x̂. Its mean is now zero and its variance is now one up to epsilon-level numerical differences.
Zero-centered: LayerNorm explicitly recenters the vector before rescaling it.
For RMSNorm, start with the squared values [4, 1, 0.25, 9, 0.25]. Their mean is 2.9, so rms = √(2.9 + ε) ≈ 1.7029. Divide the original vector by that rms and you get approximately [1.1744, -0.5872, 0.2936, 1.7617, -0.2936].
Mean preserved: RMSNorm controls scale but does not force the vector to have zero mean.
import numpy as np
x = np.array([2.0, -1.0, 0.5, 3.0, -0.5], dtype=np.float32)
eps = 1e-5
mu = x.mean()
var = ((x - mu) ** 2).mean()
layernorm_y = (x - mu) / np.sqrt(var + eps)
rms = np.sqrt((x ** 2).mean() + eps)
rmsnorm_y = x / rms
print('mu =', round(float(mu), 4))
print('var =', round(float(var), 4))
print('LayerNorm =', np.round(layernorm_y, 4))
print('RMS =', round(float(rms), 4))
print('RMSNorm =', np.round(rmsnorm_y, 4))
What the numbers reveal
LayerNorm changes both center and scale. RMSNorm changes mostly scale. That is why the two outputs can look directionally similar while still encoding different statistical assumptions.
Section 7: Backward Pass — LayerNorm Gradients
The forward formula is tidy, but the backward pass is where LayerNorm becomes mathematically interesting. Suppose the next layer hands us an upstream gradient dL/dy. We need gradients for the learned parameters and for the original input vector.
The easy pieces are the affine parameters. Per token, dL/dβ = dL/dy because β is added directly. Likewise dL/dγ = dL/dy ⊙ x̂ because γ scales the normalized vector coordinate by coordinate.
Easy branch: Parameter gradients come straight from the affine tail of LayerNorm.
The hard part is dL/dx. Every input coordinate influences the shared mean and shared variance, so each output coordinate depends on every input coordinate indirectly. That is why LayerNorm backward has cross-terms instead of a purely element-wise derivative.
Coupled features: Once μ and σ depend on the whole vector, no input feature is isolated in backward mode.
A compact and useful expression is dL/dx = (1/σ) · (dL/dx̂ − mean(dL/dx̂) − x̂ · mean(dL/dx̂ · x̂)). Kernel code often rewrites the same math as scale = rstd / D and then dX[d] = scale · (D · dY[d] · γ[d] − Σ(dY·γ) − x̂[d] · Σ(dY·γ·x̂)). That form makes the reduction structure visible and maps cleanly onto vectorized loops.
This is exactly the sort of formula used in the C-Kernel-Engine LayerNorm backward kernel. A common implementation pattern is a two-pass algorithm. Pass one computes dX using per-token reductions, and pass two accumulates dγ and dβ across tokens for the learnable parameters.
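The kernel-style formula above can be sanity-checked against finite differences. The sketch below is a single-token reference in plain Python, not the C-Kernel-Engine code. The γ, β, and upstream gradient values are made up, and the loss is defined as Σ dY·y so that dL/dy equals the chosen dY.

```python
import math

def layernorm_forward(x, gamma, beta, eps=1e-5):
    d = len(x)
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    rstd = 1.0 / math.sqrt(var + eps)
    xhat = [(v - mu) * rstd for v in x]
    y = [g * h + b for g, h, b in zip(gamma, xhat, beta)]
    return y, xhat, rstd

def layernorm_backward(dy, xhat, rstd, gamma):
    d = len(dy)
    dxhat = [u * g for u, g in zip(dy, gamma)]                  # dY * gamma
    sum_dxhat = sum(dxhat)                                      # sum(dY * gamma)
    sum_dxhat_xhat = sum(a * h for a, h in zip(dxhat, xhat))    # sum(dY * gamma * xhat)
    scale = rstd / d
    dx = [scale * (d * a - sum_dxhat - h * sum_dxhat_xhat)
          for a, h in zip(dxhat, xhat)]
    dgamma = [u * h for u, h in zip(dy, xhat)]                  # per-token contribution
    dbeta = list(dy)                                            # per-token contribution
    return dx, dgamma, dbeta

x = [2.0, -1.0, 0.5, 3.0, -0.5]
gamma = [1.0, 0.5, 2.0, 1.5, 1.0]   # illustrative values
beta = [0.0, 0.1, -0.2, 0.3, 0.0]   # illustrative values
dy = [0.3, -0.7, 0.2, 0.5, -0.1]    # made-up upstream gradient

def loss(xv):
    """Scalar loss whose gradient with respect to y is exactly dy."""
    y, _, _ = layernorm_forward(xv, gamma, beta)
    return sum(u * v for u, v in zip(dy, y))

_, xhat, rstd = layernorm_forward(x, gamma, beta)
dx, dgamma, dbeta = layernorm_backward(dy, xhat, rstd, gamma)

# Central finite differences, one input coordinate at a time.
h = 1e-5
numeric = []
for i in range(len(x)):
    up = list(x)
    up[i] += h
    down = list(x)
    down[i] -= h
    numeric.append((loss(up) - loss(down)) / (2 * h))

max_err = max(abs(a - n) for a, n in zip(dx, numeric))
print("analytic dX:", [round(v, 6) for v in dx])
print("numeric dX: ", [round(v, 6) for v in numeric])
print("max abs error:", max_err)
```

The analytic gradient also sums to zero across features. That is the backward-pass signature of mean subtraction: shifting every input by the same constant cannot change the output.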

Backward intuition
LayerNorm backward is expensive because the normalization statistics tie the whole feature vector together. The gradient for one coordinate has to remember what happened to all the others.
Section 8: Backward Pass — RMSNorm Gradients
RMSNorm backward is simpler because there is no mean-subtraction branch. The affine parameter rule stays familiar: dL/dγ = dL/dy ⊙ x̂ with x̂ = x · rstd. The only shared reduction comes from the root-mean-square scale factor.
A clean formula is m = (1/D) · Σ(dY · γ · x̂), then dX[d] = rstd · (dY[d] · γ[d] − x̂[d] · m). That is still coupled because the shared rms touches every feature. But the coupling is weaker than LayerNorm’s because there is no explicit mean term to backpropagate through.
≈20% faster backward: In optimized kernels, RMSNorm backward is often measurably faster because the graph is smaller and the reductions are simpler.
This simplification is one reason RMSNorm is attractive in giant decoder-only models. You save work in forward, you save work in backward, and you often preserve model quality well enough that the simpler kernel wins outright. Scale coupling remains, but centering coupling disappears. RMSNorm keeps the norm problem and drops the centering problem.
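The RMSNorm backward formula can be checked the same way. This is a single-token sketch with made-up γ and upstream gradient values, verified against central finite differences. It is a reference for the math, not a copy of any production kernel.

```python
import math

def rmsnorm_forward(x, gamma, eps=1e-5):
    d = len(x)
    rstd = 1.0 / math.sqrt(sum(v * v for v in x) / d + eps)
    xhat = [v * rstd for v in x]
    y = [g * h for g, h in zip(gamma, xhat)]
    return y, xhat, rstd

def rmsnorm_backward(dy, xhat, rstd, gamma):
    d = len(dy)
    # The single shared reduction: m = (1/D) * sum(dY * gamma * xhat)
    m = sum(u * g * h for u, g, h in zip(dy, gamma, xhat)) / d
    dx = [rstd * (u * g - h * m) for u, g, h in zip(dy, gamma, xhat)]
    dgamma = [u * h for u, h in zip(dy, xhat)]  # per-token contribution
    return dx, dgamma

x = [2.0, -1.0, 0.5, 3.0, -0.5]
gamma = [1.0, 0.5, 2.0, 1.5, 1.0]   # illustrative values
dy = [0.3, -0.7, 0.2, 0.5, -0.1]    # made-up upstream gradient

def loss(xv):
    """Scalar loss whose gradient with respect to y is exactly dy."""
    y, _, _ = rmsnorm_forward(xv, gamma)
    return sum(u * v for u, v in zip(dy, y))

_, xhat, rstd = rmsnorm_forward(x, gamma)
dx, dgamma = rmsnorm_backward(dy, xhat, rstd, gamma)

# Central finite differences, one input coordinate at a time.
h = 1e-5
numeric = []
for i in range(len(x)):
    up = list(x)
    up[i] += h
    down = list(x)
    down[i] -= h
    numeric.append((loss(up) - loss(down)) / (2 * h))

max_err = max(abs(a - n) for a, n in zip(dx, numeric))
print("analytic dX:", [round(v, 6) for v in dx])
print("numeric dX: ", [round(v, 6) for v in numeric])
print("max abs error:", max_err)
```

Compared with the LayerNorm version, the backward function needs one reduction instead of two, which is the smaller graph the text refers to.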

Why modern decoder stacks like it
The operation is cheaper, the backward graph is simpler, and the empirical quality is strong. That is a compelling combination at LLM scale.
Section 9: Where Normalization Lives in a Transformer
A standard transformer block uses normalization twice. One norm sits before attention and another sits before the feed-forward network. In Pre-LN notation, the pattern is y = x + Attention(Norm₁(x)), then z = y + FFN(Norm₂(y)).
That count adds up quickly. GPT-3 has 96 layers, which means 192 normalization operations inside the stack before you even count the final normalization before the output head. Each of those operations carries its own learnable parameters.
96 × 2 = 192: Deep LLMs spend a surprising amount of time running normalization layers.
If the norm is LayerNorm, every block has separate γ and β vectors for attention and for FFN. If the norm is RMSNorm, each branch usually keeps only its own γ. Either way, the model is not sharing one universal normalization setting across the network.
Separate parameters: Norm₁ and Norm₂ are different learned modules because attention and FFN want different feature scalings.

| Location | What it stabilizes | Typical learned parameters |
|---|---|---|
| Before attention | The hidden state fed into Q, K, V projections. | γ or γ, β for Norm₁. |
| Before FFN | The hidden state entering the two-layer MLP. | γ or γ, β for Norm₂. |
| Final norm | The representation before the output logits. | A separate final normalization module. |
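The bookkeeping in the table can be turned into a quick count. The sketch assumes a Pre-LN decoder with two norms per block plus one final norm. The 96 layers come from the text, while d_model = 12288 is an assumption here (the width usually quoted for the largest GPT-3 configuration). The function name `norm_parameter_count` is hypothetical.

```python
def norm_parameter_count(num_layers, d_model, kind):
    """Count normalization modules and their learnable parameters in a Pre-LN stack."""
    # LayerNorm keeps gamma and beta; RMSNorm usually keeps only gamma.
    vectors_per_norm = {"layernorm": 2, "rmsnorm": 1}[kind]
    # Two norms per block (before attention, before FFN) plus the final norm.
    num_norms = 2 * num_layers + 1
    return num_norms, num_norms * vectors_per_norm * d_model

# Assumed configuration: 96 layers (from the text), d_model = 12288 (assumption).
for kind in ("layernorm", "rmsnorm"):
    norms, params = norm_parameter_count(96, 12288, kind)
    print(kind, "modules:", norms, "parameters:", params)
```

Under these assumptions the parameters come to a few million, negligible next to the rest of the model. What normalization costs is the work of running that many reductions for every token, not the storage.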
Placement is part of the architecture
Normalization is not a decorative post-process. Its location determines how gradients travel and what numerical range reaches attention, MLPs, and the final output head.
Section 10: C-Kernel-Engine Implementation
To see how these equations land in real systems, it helps to study an implementation-focused project such as C-Kernel-Engine. Its normalization kernels turn textbook formulas into explicit loops, reductions, SIMD loads, and memory-traffic decisions. That is where the abstractions stop being symbolic and start becoming hardware work.
LayerNorm forward — three kernel variants
C-Kernel-Engine exposes multiple LayerNorm forward styles for different performance goals. The rolled slice kernel is cache-friendly and processes one token at a time with clean sequential access. The unrolled slice kernel increases throughput by unrolling the loop and, on AVX-512 hardware, can handle 64 floats per iteration using four 16-float accumulators.
The project also keeps a naive serial kernel as a scalar reference for benchmarking and debugging. That matters because correctness baselines are what let you trust aggressive vectorized variants later. Production speedups are only useful when a slower version exists to prove the math still matches.
Reference path: Fast kernels need a slower truth source for parity checks and regression testing.
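A parity check of this kind can be sketched in a few lines. The code below is not the engine's test harness. It imitates a four-accumulator reduction in plain Python (the helper `sum_four_accumulators` is hypothetical) and compares the result against a straightforward scalar reference on deterministic made-up data.

```python
import math

def layernorm_reference(x, gamma, beta, eps=1e-5):
    """Scalar reference: plain left-to-right reductions."""
    d = len(x)
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    rstd = 1.0 / math.sqrt(var + eps)
    return [(v - mu) * rstd * g + b for v, g, b in zip(x, gamma, beta)]

def sum_four_accumulators(values):
    """Imitate a 4-way unrolled reduction: four partial sums combined at the end."""
    acc = [0.0, 0.0, 0.0, 0.0]
    for i, v in enumerate(values):
        acc[i % 4] += v
    return (acc[0] + acc[1]) + (acc[2] + acc[3])

def layernorm_unrolled(x, gamma, beta, eps=1e-5):
    """Same math as the reference, but with the reordered reductions."""
    d = len(x)
    mu = sum_four_accumulators(x) / d
    var = sum_four_accumulators([(v - mu) ** 2 for v in x]) / d
    rstd = 1.0 / math.sqrt(var + eps)
    return [(v - mu) * rstd * g + b for v, g, b in zip(x, gamma, beta)]

# Deterministic made-up input with D = 64 features.
x = [math.sin(0.37 * i) * (1.0 + 0.1 * i) for i in range(64)]
gamma = [1.0 + 0.01 * i for i in range(64)]
beta = [0.05 * (i % 3) for i in range(64)]

ref = layernorm_reference(x, gamma, beta)
fast = layernorm_unrolled(x, gamma, beta)
max_diff = max(abs(a - b) for a, b in zip(ref, fast))
print("max abs difference:", max_diff)
```

Reordering a floating-point sum can change the last few bits, so the comparison uses a tolerance and not exact equality. Choosing that tolerance is part of designing the parity test.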
// LayerNorm Forward (simplified)
#include <math.h>  // sqrtf

void layernorm_forward(float *out, float *in,
                       float *gamma, float *beta,
                       float *mean, float *rstd,
                       int T, int D) {
    for (int t = 0; t < T; t++) {
        // Pass 1: compute mean
        float m = 0.0f;
        for (int d = 0; d < D; d++) m += in[t*D + d];
        m /= D;
        mean[t] = m;
        // Pass 2: compute variance
        float v = 0.0f;
        for (int d = 0; d < D; d++) {
            float diff = in[t*D + d] - m;
            v += diff * diff;
        }
        v /= D;
        float rs = 1.0f / sqrtf(v + 1e-5f);
        rstd[t] = rs;
        // Pass 3: normalize, scale, shift
        for (int d = 0; d < D; d++) {
            float xhat = (in[t*D + d] - m) * rs;
            out[t*D + d] = xhat * gamma[d] + beta[d];
        }
    }
}
RMSNorm forward — the shorter kernel
RMSNorm drops the mean pass entirely. The kernel only needs one reduction for sum_sq, one reciprocal square root, and one fused scale loop. That shorter dependency chain is exactly why it is attractive for hot decoder paths.
#include <math.h>  // sqrtf

void rmsnorm_forward(float *out, float *in,
                     float *gamma, float *rstd,
                     int T, int D) {
    for (int t = 0; t < T; t++) {
        // Single reduction: sum of squares for this token
        float sum_sq = 0.0f;
        for (int d = 0; d < D; d++)
            sum_sq += in[t*D + d] * in[t*D + d];
        float rs = 1.0f / sqrtf(sum_sq / D + 1e-5f);
        rstd[t] = rs;
        // Fused normalize and scale
        for (int d = 0; d < D; d++)
            out[t*D + d] = in[t*D + d] * rs * gamma[d];
    }
}
On modern CPUs, the real speed story is SIMD. AVX-512 can process 16 float lanes at once with instructions like _mm512_fmadd_ps, while AVX2 handles 8 lanes with _mm256_fmadd_ps. The engine also uses prefetch hints such as _mm_prefetch(in + j + 128, _MM_HINT_T0) and four-accumulator unrolling to hide latency and keep the pipeline busy.
SIMD: Vector width turns normalization from one-float-at-a-time math into many-floats-per-cycle throughput.

| Variant | SIMD | Floats/Iteration | Use Case |
|---|---|---|---|
| Rolled Slice | AVX512 | 16 | Cache-friendly sequential |
| Unrolled Slice | AVX512 | 64 | High-throughput batch |
| Rolled AVX2 | AVX2 | 8 | Older CPUs |
| Naive Serial | None | 1 | Reference/debug |
| Exact (GGML) | None | 1 | Double-precision parity |
Fused kernels change the memory story
A fused RMSNorm+Linear kernel can keep the normalized vector in registers or L1 instead of writing it to DRAM and reading it back. The result is a 2–4× memory-traffic reduction compared with the unfused path.
A fused RMSNorm+QKV kernel pushes the same idea further by computing Q, K, and V from normalized input in one pass, often delivering roughly 1.5–2× speedup over separate operations.
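The algebra behind fusion is simple enough to sketch. The code below assumes a hypothetical 3×5 weight matrix and made-up inputs. It shows that the shared rstd factors out of each dot product, so the normalized vector never has to exist as a separate buffer. Plain Python cannot demonstrate the memory-traffic savings, only that the two paths agree.

```python
import math

def rmsnorm(x, gamma, eps=1e-5):
    rstd = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * rstd * g for v, g in zip(x, gamma)]

def linear(x, weight):
    """weight has shape [out_features][in_features]; no bias."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def fused_rmsnorm_linear(x, gamma, weight, eps=1e-5):
    """Same math as linear(rmsnorm(x)), without materializing the normalized vector."""
    rstd = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    out = []
    for row in weight:
        acc = 0.0
        for w, v, g in zip(row, x, gamma):
            acc += w * (v * g)       # the gamma scaling happens inside the dot product
        out.append(acc * rstd)       # the shared rstd factors out of the whole sum
    return out

x = [2.0, -1.0, 0.5, 3.0, -0.5]
gamma = [1.0, 0.5, 2.0, 1.5, 1.0]    # illustrative values
weight = [                           # hypothetical 3x5 projection
    [0.1, -0.2, 0.3, 0.0, 0.5],
    [-0.4, 0.2, 0.1, 0.3, -0.1],
    [0.2, 0.2, -0.3, 0.1, 0.4],
]

unfused = linear(rmsnorm(x, gamma), weight)
fused = fused_rmsnorm_linear(x, gamma, weight)
max_diff = max(abs(a - b) for a, b in zip(unfused, fused))
print("unfused:", [round(v, 6) for v in unfused])
print("fused:  ", [round(v, 6) for v in fused])
print("max abs difference:", max_diff)
```

A fused QKV variant follows the same pattern with three weight matrices sharing one rstd computation.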
The project also carries quantized variants. BF16 paths keep input and output in BF16 while accumulating in FP32. INT8 wrappers convert between INT8 and FP32, and INT4 variants rely on nibble packing with two 4-bit values stored per byte.
| Variant | Storage format | Computation detail |
|---|---|---|
| BF16 | BF16 tensors | Accumulate in FP32 for stability. |
| INT8 | INT8 tensors | Wrapper converts INT8 ↔ FP32 around the kernel. |
| INT4 | Packed nibbles | Two values per byte, unpacked for compute. |
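Nibble packing itself is a small bit-manipulation exercise. The sketch below assumes unsigned 4-bit codes in the range 0 to 15 with the first value stored in the low nibble. The project's actual nibble order and signedness may differ, so treat both as assumptions.

```python
def pack_int4(values):
    """Pack unsigned 4-bit values (0..15) two per byte, first value in the low nibble."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    return bytes(values[i] | (values[i + 1] << 4) for i in range(0, len(values), 2))

def unpack_int4(packed):
    """Reverse pack_int4: split every byte back into its low and high nibble."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out

codes = [1, 15, 0, 7, 9, 3]          # made-up quantized codes
packed = pack_int4(codes)
print(len(codes), "values ->", len(packed), "bytes:", packed.hex())
print("round trip:", unpack_int4(packed))
```

Storage halves compared with one value per byte, at the price of an unpack step before any arithmetic. That is the "unpacked for compute" detail in the table.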
Several engineering rules inside C-Kernel-Engine are worth calling out because they shape kernel design as much as the formulas do. There is no malloc inside hot kernels because memory comes from a bump allocator. Parallelization is handled by an orchestrator rather than dropping OpenMP pragmas into every inner loop, and the overall goal stays deterministic computation that is easy to benchmark and reproduce.
No malloc: Hot kernels avoid heap allocation so latency and determinism stay under control.
No OpenMP in kernels: Threading lives at the orchestration level, keeping inner loops small and predictable.
Deterministic: Stable reduction order and explicit orchestration make performance numbers easier to trust.
Section 11: Summary & What’s Next
LayerNorm is the full version: subtract the mean, divide by the standard deviation, then apply learned scale and shift. RMSNorm is the leaner version: normalize by root-mean-square and usually keep only the learned scale. Pre-LN is the architectural placement that made very deep transformers practical by preserving a clean residual gradient path.
On the calculus side, LayerNorm backward is harder because its shared statistics create D-dependent cross-terms. RMSNorm backward is simpler because the centering branch disappears. On the systems side, projects like C-Kernel-Engine show how those mathematical differences translate into very different kernel shapes, SIMD strategies, and memory-traffic decisions.
The next natural topic is attention itself. Queries, keys, and values depend on these normalized hidden states because stable scale is what keeps attention logits from becoming numerically erratic before softmax ever gets a chance to act. Normalization is the quiet stabilizer behind the probability engine. Attention decides where to look. Normalization makes that decision numerically trustworthy.
If you want a practical rule of thumb, LayerNorm is the more explicit statistical reset and RMSNorm is the leaner scale-control tool. Encoders, research code, and general-purpose frameworks still use LayerNorm heavily because its semantics are familiar and complete. Decoder-only LLM stacks often prefer RMSNorm because the saved work compounds over every token, every layer, and every training step.
Rule of thumb: Use LayerNorm when you want full centering and scaling; use RMSNorm when scale control is the main goal and efficiency matters.
Takeaway
If activations are the expressive bends in a transformer, normalization is the alignment rack that keeps the whole machine drivable. Without it, depth becomes a liability instead of an advantage.
LayerNorm teaches the full normalization story. RMSNorm shows how much of that story modern LLMs can keep while dropping some of the cost.