Previously: Attention: The Core Of The Transformer.
The previous post showed how attention routes information sideways across a sequence. This post is about the architectural trick that lets that routing stack to enormous depth without collapsing during training. Residual connections are the gradient highway underneath the entire transformer era.
If one idea deserves the title of most important concept in deep learning, this is the one. Without the skip connection, depth becomes a liability because every extra layer multiplies the backward signal one more time. With the skip connection, depth stops feeling like an obstacle course and starts feeling like usable capacity. Residual connections are not a cosmetic add-on. They are the reason modern deep models can remain optimizable as they grow.
Roadmap for this post
Sections 1 and 2 tell the story of the vanishing-gradient crisis and the 29-year gap between mainstream backprop and practical deep networks.
Sections 3 through 5 explain the residual formula, the gradient math, and the central thesis of this post: the skip path must stay clean.
Sections 6 through 10 move inside the transformer and into C-Kernel-Engine, where residual adds, gradient copies, and accumulation points become concrete implementation details.
Section 11 closes with the historical claim in plain language: the clean residual path is the bridge from ResNet to GPT.
Section 1: The Problem — Why Deep Networks Couldn't Train
Before ResNet in 2015, training networks deeper than roughly 20 layers was already uncomfortable and often unstable. Researchers knew that deeper models should be more expressive, but optimization curves refused to cooperate. Adding layers often made the model worse, not better, even on the training set.
The famous paradox from the ResNet paper was brutal in its simplicity. On CIFAR-10, a 56-layer plain network performed worse than a 20-layer plain network on both training error and test error. That matters because worse training error rules out overfitting as the explanation: the optimizer failed to find a good solution in the first place. The 56-layer model should have been able to imitate the 20-layer model and then improve on top of it. The fact that it could not was the smoking gun.
The root cause is vanishing gradients. In a deep network the backward signal is a long chain-rule product: dL/dx = dL/df_N · df_N/df_{N-1} · ... · df_2/df_1 · df_1/dx. Every Jacobian in that chain is another opportunity to shrink the signal before it reaches the earliest layers.
If each Jacobian has typical singular values below one, the product decays exponentially with depth. Even a gentle shrink factor compounds hard: 0.9^50 ≈ 0.005. By the time the loss gradient reaches the first layers, those parameters barely feel the learning signal at all. The first layers can no longer hear the loss function. They are mathematically connected to it, but the signal arrives as a whisper.
That is why people say the earliest layers in a very deep plain network cannot hear the loss. The graph is intact, but the useful magnitude is gone. Weights near the input keep moving too slowly, so the full stack never coordinates into a strong solution.
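The compounding argument above is easy to check numerically. The sketch below is a deliberately crude model, assuming every layer's Jacobian can be summarized by a single scalar shrink factor; the function name `surviving_gradient` is made up for this illustration.

```python
# Hypothetical illustration: one scalar shrink factor stands in for each layer's Jacobian.
def surviving_gradient(shrink_per_layer: float, depth: int) -> float:
    """Fraction of the gradient magnitude left after `depth` identical factors."""
    return shrink_per_layer ** depth

for depth in (10, 20, 50, 100):
    # Each extra layer multiplies the signal by the same factor once more.
    print(f"depth {depth:3d}: {surviving_gradient(0.9, depth):.6f}")
```

Real Jacobians are matrices whose effect varies by direction, so this only conveys the order of magnitude, not an exact prediction.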
History makes this even more striking. Backpropagation was popularized in 1986 by Rumelhart, Hinton, and Williams, yet truly deep feed-forward models did not become routine until almost three decades later. Hochreiter in 1991 and Bengio et al. in 1994 made the gradient-flow diagnosis explicit: depth was not waiting for more ambition; it was waiting for a path that gradients could survive. The vanishing-gradient problem is why backprop existed long before deep learning felt practical. The algorithm arrived first. The architecture arrived later.

Optimization, not overfitting, was the real enemy
The 56-layer plain network did not fail because it had too much capacity. It failed because the optimizer could not drive that capacity into a useful configuration.
Residual connections matter because they attack the optimization bottleneck directly, not because they merely regularize the model.
Section 2: The History — From Backprop to Deep Learning
The story starts earlier than most retellings begin. In 1970, Seppo Linnainmaa published the mathematics of automatic differentiation, the underlying machinery that makes reverse-mode gradient computation possible. In 1974, Paul Werbos applied that idea to neural networks in his PhD thesis, sketching the route that later became backprop in modern language.
Then came 1986, the canonical breakthrough moment. Rumelhart, Hinton, and Williams published "Learning representations by back-propagating errors" in Nature and turned backprop into a mainstream training recipe. But the community then spent almost 29 years learning a painful lesson: having the gradient formula is not the same thing as having an architecture that can carry gradients through depth. The 29-year gap between 1986 and 2015 is one of the deepest lessons in machine learning history. We had the learning rule. We lacked the stable path.
By 1991, Hochreiter had formally identified the vanishing-gradient problem, and by 1994 Bengio and collaborators had systematically analyzed how long chains destroy gradient flow, especially in recurrent settings. LSTM in 1997 was a partial answer: insert gates and an explicit memory path so important information can survive longer. That was a hint of the future because it recognized that architecture, not just optimization heuristics, must protect signal flow.
The next major leap arrived in 2015 with two closely related ideas. Highway Networks introduced learned skip gates, and ResNet removed the gates entirely, leaving the most stripped-down version possible: an identity shortcut added to a learned branch. That simplification turned out to be the winner.
From there the historical line is remarkably direct. The 2017 transformer placed residual connections around attention and feed-forward sublayers, and the 2019 GPT-2 Pre-LN layout preserved an even cleaner path by moving normalization off the highway. The LayerNorm and RMSNorm post explained that norm placement matters; this post explains why it matters so much: the residual stream is the thing being protected. The modern transformer did not replace ResNet's insight. It inherited it, applied it twice per block, and scaled it to hundreds of residual additions.
| Year | Milestone | Why it mattered |
|---|---|---|
| 1970 | Linnainmaa publishes automatic differentiation | Provides the mathematical foundation for reverse-mode gradient computation. |
| 1974 | Werbos applies backprop to neural networks | Connects automatic differentiation to learning in neural nets. |
| 1986 | Rumelhart, Hinton, Williams popularize backprop | Makes gradient-based representation learning mainstream. |
| 1991 | Hochreiter formalizes vanishing gradients | Explains why long chains are hard to optimize. |
| 1994 | Bengio et al. analyze gradient flow | Shows the problem is systematic, not anecdotal. |
| 1997 | LSTM introduces gated memory | Protects important signals with explicit paths. |
| 2015 | Highway Networks and ResNet | Skip connections turn depth from fragile to trainable. |
| 2017 | Transformer uses residuals everywhere | Makes skip connections foundational for sequence models. |
| 2019 | GPT-2 adopts Pre-LN | Keeps the residual path clean enough for much deeper stacks. |

Backprop won, but depth exposed a transport problem
Backpropagation became the dominant learning rule because it worked across regression, CNNs, recurrent systems, and eventually transformers. The problem was not that backprop failed. The problem was that very deep networks made the gradient pass through too many fragile transformations before it reached early layers.
Residual connections did not replace backprop. They made backprop scale deeper by giving the gradient an identity highway: a clean route through the network while the learned branch still does the actual modeling work.
Section 3: The Solution — y = x + F(x)
The residual connection is almost insultingly simple. Instead of asking a block to learn y = F(x), we ask it to learn y = x + F(x). The branch F now learns only the residual, meaning the deviation from identity.
That sounds like a tiny rewrite, but optimization feels it immediately. If the best transformation is close to the identity map, then a residual block can get there by setting F(x) ≈ 0. Learning “do almost nothing” is far easier than forcing a nonlinear branch to learn the full copy operation from scratch. Residual learning is a reparameterization. The model class is still expressive, but the optimization landscape becomes dramatically friendlier.
import numpy as np

def activation(z):
    # ReLU, used here only as an example nonlinearity
    return np.maximum(z, 0.0)

# Without residual (plain network)
def plain_layer(x, W, b):
    return activation(W @ x + b)  # Must learn the full mapping

# With residual connection (W must be square so x and F_x share a shape)
def residual_layer(x, W, b):
    F_x = activation(W @ x + b)   # Learn only the residual
    return x + F_x                # Add the identity shortcut

This is why He et al. framed residual blocks as a solution to a degradation problem. A deeper network should be able to do at least as well as a shallower one by copying the shallower computation and setting extra layers to identity. Plain networks were bad at discovering that identity behavior, while residual networks make identity the default fallback.
The practical consequence is profound. When a new block is unhelpful early in training, a residual network can behave almost like a shallower network and keep learning anyway. That means adding depth no longer forces the optimizer to solve every new layer perfectly on day one. Residual blocks make “do no harm first, improve later” a natural learning strategy.
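The "do no harm first" behavior can be seen directly by zeroing out the branch. The sketch below uses plain Python lists and a ReLU activation; both are choices made for this illustration, and the helper names `relu` and `matvec` are hypothetical.

```python
def relu(v):
    return [max(0.0, t) for t in v]

def matvec(W, x):
    # Dense matrix-vector product on nested lists.
    return [sum(w * t for w, t in zip(row, x)) for row in W]

def plain_layer(x, W, b):
    return relu([p + q for p, q in zip(matvec(W, x), b)])

def residual_layer(x, W, b):
    branch = plain_layer(x, W, b)                 # F(x)
    return [p + q for p, q in zip(x, branch)]     # x + F(x)

x = [1.5, -2.0, 0.25]
W_zero = [[0.0] * 3 for _ in range(3)]  # a branch that has learned nothing yet
b_zero = [0.0] * 3

print(plain_layer(x, W_zero, b_zero))     # the input is wiped out
print(residual_layer(x, W_zero, b_zero))  # the input passes through unchanged
```

With a silent branch, the plain layer destroys its input while the residual layer returns it exactly, which is the identity fallback from the table below.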
| Formulation | What the branch must learn | Easy fallback |
|---|---|---|
| Plain block: y = F(x) | The entire mapping from input to output | None; even copying the input must be learned |
| Residual block: y = x + F(x) | Only the deviation from identity | F(x) = 0 gives a safe identity map |
Once residual learning is framed this way, the next question is automatic. Why does this tiny algebraic change matter so much for backpropagation? The answer lives in one derivative term.
Section 4: The Math — Why Gradients Can't Vanish
Start with the forward equation y = x + F(x). Differentiate it with respect to x and the crucial structure appears immediately: dy/dx = I + dF(x)/dx. That added identity matrix is the entire game.
Without the skip connection, the block Jacobian is just dF/dx. With the skip connection, the block Jacobian always includes an identity contribution that does not depend on learned weights. Even when the learned branch is small, noisy, or poorly conditioned, the backward signal still has a direct route through the +I term. This is the gradient highway in one line: dy/dx = I + dF/dx. The identity path exists before the model has learned anything useful.
Layer L: x_L = x_{L-1} + F_L(x_{L-1})
Layer L-1: x_{L-1} = x_{L-2} + F_{L-1}(x_{L-2})
...
Layer 1: x_1 = x_0 + F_1(x_0)
(x_0 is the network input and y_L = x_L is the final output; in dL the L is the loss, in subscripts it is the depth.)
Gradient through all layers:
dL/dx_0 = dL/dy_L · (I + dF_L/dx_{L-1}) · (I + dF_{L-1}/dx_{L-2}) · ... · (I + dF_1/dx_0)
Expanding the product, writing dF_l for dF_l/dx_{l-1}:
= dL/dy_L · (I + dF_L + dF_{L-1} + ... + dF_1 + dF_L·dF_{L-1} + ... + higher-order terms)
The identity contribution guarantees a direct path from loss to input.
That direct path is why residual networks keep gradients alive far deeper than plain networks.

Another way to say it is that the total gradient becomes a sum of routes, not a single fragile tunnel. Some routes still pass through many learned Jacobians, but one route is the clean identity lane. The deeper the network gets, the more valuable that untouched lane becomes.
Compare that with the plain case: dL/dx_0 = dL/dy_L · dF_L · dF_{L-1} · ... · dF_1. Now every factor is learned, so every factor can shrink or distort the signal. Residual networks do not remove all optimization difficulty, but they remove the requirement that the entire signal survive only through learned multipliers. The clean path does not have to be guessed by optimization. It is built into the graph before training starts.
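A scalar toy makes the contrast between the two products tangible. The sketch below assumes every branch Jacobian collapses to a single number, which ignores matrix structure and ordering entirely; the function names are invented for this illustration.

```python
import math

def plain_gradient(branch_derivs):
    """Scalar stand-in for dF_L · dF_{L-1} · ... · dF_1."""
    return math.prod(branch_derivs)

def residual_gradient(branch_derivs):
    """Scalar stand-in for (1 + dF_L) · (1 + dF_{L-1}) · ... · (1 + dF_1)."""
    return math.prod(1.0 + d for d in branch_derivs)

# Fifty layers whose branches are nearly silent (derivative 0.01 each).
branch_derivs = [0.01] * 50
print(plain_gradient(branch_derivs))     # vanishes
print(residual_gradient(branch_derivs))  # stays on the order of one

# Two-layer expansion: (1 + a)(1 + b) = 1 + a + b + a*b.
# The leading 1 is the identity route; the other terms are routes through branches.
a, b = 0.3, -0.2
expanded = 1.0 + a + b + a * b
print(residual_gradient([a, b]), expanded)
```

The same toy also shows the flip side: if the branch derivatives are large and positive, the residual product can grow, so the skip path removes the vanishing failure mode without making every optimization problem disappear.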

Why “highway” is the right metaphor
A highway is valuable because it bypasses local friction. The residual path does the same thing for gradients by bypassing repeated learned transformations.
The branch still matters for representation learning, but the skip path is what keeps the full depth trainable.
Section 5: The Clean Path — Why It Cannot Be Polluted
This is the core insight
The skip connection path must remain clean. If you put normalization, activation, dropout, or any other learned or stateful transform on the highway itself, you weaken the whole residual argument.
The point is not that the gradient be approximately one. The point is that the identity term be exact and unconditional.
This is where transformer engineering becomes mathematically revealing. A residual block works best when the branch does the hard nonlinear work and the skip path does nothing except carry the signal forward unchanged. The entire Pre-LN versus Post-LN debate is really a debate about whether the highway stays clean.
Post-LN, as used in BERT-style blocks, wraps the residual sum in LayerNorm: y = layer_norm(x + F(x)). That means the derivative becomes dy/dx = dLN/d(x + F(x)) · (I + dF/dx). The normalization derivative now rescales both the branch and the skip path, so the identity lane is no longer pure. Once LayerNorm sits on top of the residual sum, the highway is polluted. Gradients must pass through LN at every layer.
# Post-LN: LayerNorm wraps the residual
y = layer_norm(x + F(x))
# Backward:
# dy/dx = d(LN)/d(x + F(x)) * (I + dF/dx)
# The LN derivative multiplies BOTH paths.
# Gradients must flow through LayerNorm at every layer.

That polluted path is why very deep Post-LN transformers become fragile. BERT-scale models around 12 to 24 layers were workable, but optimization usually required learning-rate warmup and careful tuning to avoid instability. Stack that same pattern to 96 layers and the compounded normalization factors become a serious liability.
# Pre-LN: LayerNorm is only on the branch
y = x + F(layer_norm(x))
# Backward:
# dy/dx = I + dF/d(LN(x)) * d(LN)/dx
# The identity term is pure.
# The LN derivative affects only the branch, not the highway.

Pre-LN moves normalization off the skip path and onto the branch: y = x + F(layer_norm(x)). That preserves the exact identity contribution in the derivative, which is why GPT-2-style stacks scale much more gracefully to 96 layers and beyond. The normalization post introduced this as a norm-placement story; the deeper truth is that Pre-LN preserves the clean residual path that large language models depend on. GPT-2, GPT-3, ChatGPT, and LLaMA all rely on the clean residual path. The two-line switch from Post-LN to Pre-LN helped unlock the LLM era.
| Question | Post-LN | Pre-LN |
|---|---|---|
| Formula | y = LN(x + F(x)) | y = x + F(LN(x)) |
| Does the skip stay clean? | No; LayerNorm wraps the residual sum | Yes; normalization stays on the branch |
| Backward highway | Rescaled by dLN/d(sum) at every layer | Retains a pure identity contribution |
| Typical depth behavior | More fragile as depth rises | Stable enough for very deep decoder stacks |
| Practical consequence | Warmup and tuning become critical | Optimization is calmer from the start |
The code change is tiny. The mathematical consequence is enormous. When people say the residual stream must remain clean, they mean exactly this: nothing is allowed to sit on the identity lane and distort it.
A clean path is not a stylistic preference. It is the condition that makes the phrase “gradient highway” literally true. If the highway contains a toll booth at every block, it is no longer a highway. The most important detail in a deep transformer is often not the flashy sublayer. It is whether the skip path has been left untouched.
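The difference between the two layouts can be measured on the skip path alone. The sketch below is a toy under stated assumptions: LayerNorm has no learned gain or bias, the branch F is a simple elementwise scaling (`scale = 0.0` models a branch that contributes nothing), and the Jacobian is estimated with central finite differences. All function names are hypothetical.

```python
import math

def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((t - mean) ** 2 for t in v) / len(v)
    return [(t - mean) / math.sqrt(var + eps) for t in v]

def branch(v, scale):
    # Toy F: elementwise scaling. scale = 0.0 silences the branch.
    return [scale * t for t in v]

def post_ln(x, scale):
    return layer_norm([p + q for p, q in zip(x, branch(x, scale))])

def pre_ln(x, scale):
    return [p + q for p, q in zip(x, branch(layer_norm(x), scale))]

def jacobian(f, x, h=1e-6):
    """Central finite differences: J[i][j] approximates d f_i / d x_j."""
    n = len(x)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = f(xp), f(xm)
        for i in range(n):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J

x = [1.0, -2.0, 0.5, 3.0]
J_pre = jacobian(lambda v: pre_ln(v, 0.0), x)    # skip path only
J_post = jacobian(lambda v: post_ln(v, 0.0), x)  # skip path seen through LN

print([round(J_pre[i][i], 4) for i in range(len(x))])   # diagonal of the Pre-LN Jacobian
print([round(J_post[i][i], 4) for i in range(len(x))])  # diagonal of the Post-LN Jacobian
```

With the branch silenced, the Pre-LN Jacobian is the identity up to finite-difference noise, while the Post-LN Jacobian is the LayerNorm Jacobian, which rescales the signal and removes its mean component. That is the toll booth in numbers.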

Section 6: Residual Connections in the Transformer Block
A transformer decoder layer contains two residual additions, not one. First the model computes x₁ = x + Attention(Norm(x)). Then it computes x₂ = x₁ + FFN(Norm(x₁)).
That means a 96-layer transformer contains 192 residual additions. Every one of them creates another direct route from the loss back into the residual stream. This is why many practitioners describe the residual stream as a bus: layers keep reading from it, writing updates back to it, and passing it onward. Depth 96 does not mean one highway. It means a long sequence of carefully preserved local highways stitched into one residual stream.
#include <stddef.h>  /* size_t */

// C-Kernel-Engine: Residual Add (forward)
void ck_residual_add_token_major(const float *a, const float *b,
                                 float *out, int tokens, int aligned_embed_dim) {
    size_t total = (size_t)tokens * (size_t)aligned_embed_dim;
    for (size_t i = 0; i < total; i++)
        out[i] = a[i] + b[i];  // Element-wise: skip + branch
}

// C-Kernel-Engine: Residual Add (backward)
void ck_residual_add_backward(const float *d_out, float *d_a, float *d_b,
                              int tokens, int aligned_embed_dim) {
    size_t total = (size_t)tokens * (size_t)aligned_embed_dim;
    for (size_t i = 0; i < total; i++) {
        float v = d_out[i];
        d_a[i] = v;  // Gradient copies to skip path
        d_b[i] = v;  // Gradient copies to branch path
    }
}

Forward:
Step 1: rmsnorm(input) -> ln1_out
Step 2: qkv_project(ln1_out) -> q, k, v
Step 3: attention(q, k, v) -> attn_out
Step 4: attn_proj(attn_out) -> proj_tmp
Step 5: residual_add(input, proj_tmp) -> residual1 <- RESIDUAL #1
Step 6: rmsnorm(residual1) -> ln2_out
Step 7: mlp_up(ln2_out) -> fc1_out
Step 8: swiglu(fc1_out) -> swiglu_out
Step 9: mlp_down(swiglu_out) -> mlp_out
Step 10: residual_add(residual1, mlp_out) -> output <- RESIDUAL #2

The forward kernel is almost comically simple: element-wise addition. The backward kernel is the important part because it reveals the residual logic directly. The outgoing gradient is copied to both inputs; it is not split, halved, or averaged.
That copy behavior is exactly what the math predicted. The skip path receives the full gradient, and the branch path also receives the full gradient. Later merge points accumulate those contributions, but the residual add itself is a duplication point, not a bottleneck. The simplest kernel in the engine is also the one that keeps the whole stack trainable.
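The copy rule can be sanity-checked against a numerical derivative. The sketch below mirrors the two kernels in plain Python and uses a made-up linear loss (`toy_loss`) so that the gradient with respect to the output is known in advance; it is an illustration, not the engine's test code.

```python
def residual_add(a, b):
    return [p + q for p, q in zip(a, b)]

def residual_add_backward(d_out):
    # d(out)/d(a) and d(out)/d(b) are both 1 element-wise,
    # so each input receives a full copy of the incoming gradient.
    return list(d_out), list(d_out)

def toy_loss(a, b, weights):
    # Hypothetical linear loss, so dLoss/d_out is simply `weights`.
    out = residual_add(a, b)
    return sum(w * o for w, o in zip(weights, out))

a = [1.0, 2.0, 3.0]
b = [0.5, -0.5, 0.25]
weights = [0.5, -1.0, 2.0]

d_a, d_b = residual_add_backward(weights)

# Central finite difference with respect to a[0].
h = 1e-6
a_plus = [a[0] + h] + a[1:]
a_minus = [a[0] - h] + a[1:]
numeric_d_a0 = (toy_loss(a_plus, b, weights) - toy_loss(a_minus, b, weights)) / (2 * h)

print(d_a, d_b, numeric_d_a0)
```

Both analytic gradients match the incoming gradient, and the numerical estimate agrees with the first entry, which is what "copied, not split" means in practice.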

Section 7: Backward Through the Full Transformer Block
Backward propagation through a transformer block is easiest to understand if you literally reverse the forward plan. Start at the output, hit the second residual add, and watch the gradient fan out into a skip contribution and an MLP contribution, each a full copy. Then run the MLP branch backward, accumulate, and repeat the same logic at the attention residual.
Backward (reverse of forward):
Step 10: d_output -> residual_add_backward -> d_residual1_from_mlp, d_mlp_out
Step 9: d_mlp_out -> mlp_down backward -> d_swiglu
Step 8: d_swiglu -> swiglu backward -> d_fc1
Step 7: d_fc1 -> mlp_up backward -> d_ln2_out
Step 6: d_ln2_out -> rmsnorm backward -> d_from_ln2
CRITICAL: d_residual1 = d_residual1_from_mlp + d_from_ln2
Step 5: d_residual1 -> residual_add_backward -> d_input_from_attn, d_proj
Step 4: d_proj -> attn_proj backward -> d_attn_out
Step 3: d_attn_out -> attention backward -> d_q, d_k, d_v
Step 2: d_q,d_k,d_v -> qkv_project backward -> d_ln1_out
Step 1: d_ln1_out -> rmsnorm backward -> d_from_ln1
CRITICAL: d_input = d_input_from_attn + d_from_ln1

Those two accumulation lines are the places implementers most often get wrong. At each point where the stream forked in the forward pass, one gradient comes back through the skip path and another comes back through the normalized branch. They must be added together because both paths read from the same upstream tensor. Residual adds copy the gradient on the way back, and the fork points accumulate it. Missing either one silently corrupts training.
// CORRECT: accumulate gradient from both branches
ck_add_inplace(d_residual1, d_from_ln2, T, aligned_embed);
// WRONG: overwriting one gradient with the other
d_residual1 = d_from_ln2; // BUG! Lost the skip gradient!

This bug is nasty because nothing crashes. The model still runs, losses still print, and kernels still execute. The only symptom is that learning becomes mysteriously weak or unstable because one of the mathematically required gradient paths was discarded.
Residual correctness is therefore not only about forward shape compatibility. It is about honoring the backward graph exactly: copy at the add, then accumulate when branches reunite. That is the systems-level version of “keep the highway clean.” A broken accumulation line can destroy the highway just as effectively as a bad architecture diagram.
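A finite-difference check is one way to catch the overwrite bug before it reaches a training run. The sketch below is a scalar toy, assuming `tanh` as a stand-in for the norm on the branch and a single hypothetical weight `W` as the sublayer; the `accumulate` flag switches between the correct rule and the buggy one.

```python
import math

W = 0.7  # hypothetical branch weight

def block_forward(x):
    normed = math.tanh(x)     # stand-in for the norm on the branch
    branch_out = W * normed   # stand-in for the attention or MLP sublayer
    return x + branch_out     # residual add

def block_backward(x, d_out, accumulate=True):
    d_skip, d_branch = d_out, d_out                      # residual add copies the gradient
    d_normed = d_branch * W                              # back through the sublayer
    d_from_norm = d_normed * (1.0 - math.tanh(x) ** 2)   # back through the norm stand-in
    if accumulate:
        return d_skip + d_from_norm   # correct: both paths read x, so both gradients add
    return d_from_norm                # bug: the skip gradient was overwritten

x = 0.4
h = 1e-6
numeric = (block_forward(x + h) - block_forward(x - h)) / (2 * h)
correct = block_backward(x, 1.0, accumulate=True)
buggy = block_backward(x, 1.0, accumulate=False)

print(numeric, correct, buggy)
```

The accumulated gradient matches the numerical derivative, while the overwritten one is off by the entire identity contribution. Neither version raises an error, which is why a gradient check like this is worth running.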

Residual bugs are often silent
If you forget an accumulation, the network usually does not throw an error. It simply learns the wrong function because part of the gradient graph never reaches upstream parameters.
That is why residual correctness belongs in the mental model of both researchers and systems engineers.
Section 8: ResNet — Where It All Started
The decisive paper was He et al. 2015: Deep Residual Learning for Image Recognition. It trained a 152-layer convolutional network, an unprecedented depth for that era, and won the ILSVRC 2015 classification task. Just as important as the benchmark victory was the explanation of why plain depth had been failing.
The headline number was 3.57% top-5 error on the ImageNet test set, achieved by an ensemble of residual networks and beating the previous best by a large margin. But the conceptual result was even more valuable: deeper residual networks kept getting better, while deeper plain networks degraded. In one stroke, ResNet turned depth from a warning sign into a lever. ResNet-152 did not just win a competition. It demonstrated that 100+ trainable layers were no longer absurd.
The paper's logic was elegant. If adding layers only increases representational capacity, then a deeper network should be able to copy a shallower network by setting the new layers to identity. The fact that plain 56-layer models underperformed plain 20-layer models proved that optimization, not representational limits, was the barrier.
Residual connections solve that exact contradiction. With y = x + F(x), the identity map is the default behavior when F(x) = 0. The network no longer has to discover copying as a difficult special case; copying is built into the parameterization from the start. ResNet made identity easy. That is why extra depth stopped hurting.
| Architecture | Observed behavior with more depth |
|---|---|
| Plain networks | Training error worsens as depth grows past the comfortable range. |
| Residual networks | Training error falls as depth grows because identity shortcuts keep the deeper network optimizable. |
Seen in hindsight, the transformer is a descendant of this result. The sublayers changed from convolutions to attention and MLPs, but the optimization lesson stayed identical. Every modern deep transformer still stands on the ResNet idea that depth needs a default identity route.

Section 9: Variants — Highway, DenseNet, and Beyond
Residual connections spawned variants, but the simple identity-add form aged the best. Highway Networks added gates so the model could decide how much transformed content versus copied content to use. DenseNet concatenated features instead of summing them, making every layer see all previous activations directly.
| Method | Formula | Skip Type | Parameters | Used In |
|---|---|---|---|---|
| ResNet | y = x + F(x) | Identity add | 0 extra | CNNs, Transformers |
| Highway | y = T·H(x) + (1-T)·x | Gated | Gate weights | Early research |
| DenseNet | y = [x, F(x)] | Concatenation | 0 extra | Efficient CNNs |
| Pre-LN Transformer | y = x + F(LN(x)) | Clean identity | 0 extra | GPT-2, GPT-3, LLaMA |
The surprising winner was the least ornate design. Large-scale practice kept returning to simple addition because it is cheap, stable, and easy to reason about in both forward and backward passes. No gates, no concatenation, no extra parameters on the highway: just keep the path clean and let the branch do the work. The best skip connection for large language models adds zero extra parameters to the highway itself.
That simplicity matters more as models scale. A giant transformer already has enough expressive power in attention, MLPs, and normalization. The skip path does not need to be clever; it needs to be reliable.
This is one reason the residual connection feels almost inevitable in retrospect. Once the community saw that a clean identity path solved the hardest optimization problem, every more complicated alternative had to justify why the highway should be anything other than clean. Most alternatives never won that argument at scale. Simplicity won because the highway is infrastructure, not decoration.
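The three skip styles from the table can be written side by side in a few lines. The sketch below is schematic: `F` is a made-up branch, and the Highway gate is a single fixed scalar, whereas real Highway gates are learned and applied per unit.

```python
def F(x):
    # Hypothetical learned branch; a fixed scaling keeps the example readable.
    return [0.1 * t for t in x]

def resnet_block(x):
    # y = x + F(x): the skip term is x itself, with coefficient exactly 1.
    return [p + q for p, q in zip(x, F(x))]

def highway_block(x, gate):
    # y = T·H(x) + (1 - T)·x, with F playing the role of H and `gate` the role of T.
    return [gate * h + (1.0 - gate) * p for p, h in zip(x, F(x))]

def densenet_block(x):
    # y = [x, F(x)]: concatenation, so the feature width grows with every block.
    return list(x) + F(x)

x = [1.0, 2.0]
print(resnet_block(x))        # same width, input carried at full strength
print(highway_block(x, 0.5))  # same width, input scaled by (1 - gate)
print(densenet_block(x))      # width doubles
```

Only the plain residual add keeps both the shape and the full-strength copy of the input, which is the property the rest of this post leans on.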
Why the plain residual add won
It preserves shapes, adds no new gate parameters, gives a direct backward route, and composes cleanly with hardware-friendly kernels.
For large transformer training, those advantages compound more than architectural cleverness does.
Section 10: C-Kernel-Engine Implementation
In C-Kernel-Engine, the Pre-LN transformer layer makes the residual logic explicit at the orchestration level. RMSNorm sits on the branch, attention or MLP computes an update, and the result is added back to the running residual stream. The implementation is almost a direct translation of the mathematical picture from the earlier sections.
// Attention sub-block
rmsnorm_forward(ln1_out, input, ln1_gamma, ...); // LN on branch only
qkv_project(ln1_out, q, k, v, ...);
rope_forward_qk(q, k, cos_cache, sin_cache, ...);
attention_forward(q, k, v, attn_out, ...);
attn_project(attn_out, proj_tmp, ...);
ck_residual_add_token_major(input, proj_tmp, residual1, ...); // SKIP #1
// MLP sub-block
rmsnorm_forward(ln2_out, residual1, ln2_gamma, ...); // LN on branch only
mlp_up(ln2_out, fc1_out, ...);
swiglu(fc1_out, swiglu_out, ...);
mlp_down(swiglu_out, mlp_out, ...);
ck_residual_add_token_major(residual1, mlp_out, output, ...); // SKIP #2

// Backward: gradient highway in action
ck_residual_add_backward(d_output, d_residual1, d_mlp_out, ...);
// ... MLP backward ...
rmsnorm_backward(d_ln2_out, residual1, ...);
// CRITICAL: accumulate gradient from LN path into skip path
ck_add_inplace(d_residual1, d_from_ln2, T, aligned_embed);
ck_residual_add_backward(d_residual1, d_input, d_proj_tmp, ...);
// ... Attention backward ...
rmsnorm_backward(d_ln1_out, input, ...);
// CRITICAL: accumulate gradient from LN path into skip path
ck_add_inplace(d_input, d_from_ln1, T, aligned_embed);

Systems details reinforce the same design philosophy. No per-token malloc/free churn means the residual stream can stay in predictable buffers, token-parallel loops keep the operation embarrassingly simple, and 64-byte alignment helps the surrounding kernels stay SIMD-friendly. Even at the engine level, the residual path wants to be boring, contiguous, and untouched. The residual add kernel is simple enough to disappear in a profiler, yet critical enough that a wrong backward accumulation can ruin training.
That is why the clean path concept is not just theory for paper diagrams. It shows up in the exact order of operations, in the absence of unnecessary transforms on the skip, and in the discipline of accumulating gradients where branches merge. A production transformer engine is a machine for protecting the residual stream while letting sublayers write useful updates into it.
Seen this way, a transformer block is not attention plus MLP plus norms plus adds as separate trivia items. It is one residual stream with two excursions: one through attention and one through the feed-forward network. Everything else is branch work around a protected highway. The residual stream is the stable carrier signal. The sublayers are temporary detours that write edits back onto it.
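The "one stream, two excursions" picture fits in a short loop. The sketch below is not the engine's code: `toy_attention` and `toy_mlp` are hypothetical stand-ins that only preserve the shape of the computation, and the RMSNorm here has no learned gain.

```python
import math

def rms_norm(v, eps=1e-6):
    """RMSNorm without a learned gain (a simplification)."""
    rms = math.sqrt(sum(t * t for t in v) / len(v) + eps)
    return [t / rms for t in v]

def toy_attention(v):
    """Hypothetical stand-in: every position receives a scaled mean."""
    mean = sum(v) / len(v)
    return [0.1 * mean for _ in v]

def toy_mlp(v):
    """Hypothetical stand-in: a scaled ReLU."""
    return [0.1 * max(0.0, t) for t in v]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def run_stack(stream, num_layers):
    residual_adds = 0
    for _ in range(num_layers):
        # Excursion 1: read the stream through a norm, write the update back.
        stream = add(stream, toy_attention(rms_norm(stream)))
        residual_adds += 1
        # Excursion 2: same pattern through the feed-forward branch.
        stream = add(stream, toy_mlp(rms_norm(stream)))
        residual_adds += 1
    return stream, residual_adds

final_stream, adds = run_stack([1.0, -0.5, 2.0, 0.25], num_layers=96)
print(adds)  # 192
```

Note that `stream` is never normalized in place. The norms only ever see a copy on the way into a branch, which is the Pre-LN rule expressed as control flow.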
Implementation rule of thumb
If an operation does not belong to the branch, keep it off the skip path. That principle gives you the right mental model for Pre-LN transformer design and for residual backward debugging.
In other words: protect the highway first, then optimize the branch.
Section 11: Summary
Residual connections are the single most important architectural innovation in deep learning because they make depth trainable. The forward rule y = x + F(x) changes the backward rule into dy/dx = I + dF/dx, injecting a direct identity route into the gradient graph. That route is the reason the earliest layers can still receive useful learning signal in very deep networks.
The central thesis of this post is the clean path concept. A residual connection only becomes a real highway if the skip lane remains untouched by normalization, activation, dropout, or any other transformation. Post-LN pollutes that lane, Pre-LN protects it, and modern LLM depth depends on that protection. The clean residual path is the hidden infrastructure of GPT-scale training.
ResNet proved in 2015 that identity shortcuts solve the degradation problem in deep CNNs. Transformers inherited the same idea and placed it twice per block, while GPT-2-style Pre-LN made sure the highway remained mathematically clean. In C-Kernel-Engine, that story becomes concrete as element-wise add, full-gradient copy, and correct accumulation.
Takeaways
Residual connections are not a convenience feature. They are the optimization infrastructure that made modern deep learning practical.
The two-line move from Post-LN to Pre-LN was historically decisive because it preserved the clean highway needed for 96-layer-plus transformers.
Next: putting every sublayer together into one complete transformer forward and backward pass, end to end.