dL/d(LLM): The Full Backward Pass

Lab note

Previously: Residual Connections: The Gradient Highway. This is the capstone post of the transformer-building-block series.

The previous entries in this series built every kernel a transformer needs. Scalar operations, matrix multiplies, softmax, layer normalization, positional encoding, tokenization, attention, and residual connections. Each post showed the forward pass, then derived the backward pass, then showed it running in C-Kernel-Engine. This post stitches the entire thing together from loss to embeddings.

The question this post answers is deceptively simple: given a single scalar loss, how does every weight in a language model receive its gradient? The answer is the chain rule applied methodically, one kernel at a time, in exact reverse order. No magic, no handwaving, no "autograd handles it." We will walk the gradient from the output of the model all the way back to the embedding table, showing exactly which function runs at each step and what it computes. dL/d(LLM) is not one derivative. It is a chain of local derivatives, each computed by a kernel that already knows its own math. The framework just calls them in order.

Roadmap for this post

Sections 1 through 5 walk the full forward pass from token IDs to loss, building the computation graph that backward will reverse.

Sections 6 through 9 reverse the graph step by step: loss → LM head → final norm → decoder layers → embeddings, showing the chain rule at every junction.

Sections 10 and 11 zoom out to the training loop and gradient accumulation.

Sections 12 and 13 compare C-Kernel-Engine's static IR-based stitching to PyTorch's dynamic autograd graph, and close the series.

Section 1: The Forward Pass — From Token IDs to Loss

Before we can differentiate anything, we need to know what we are differentiating through. The forward pass of a decoder-only transformer follows a strict sequence. Every modern LLM, from GPT-2 to LLaMA to Mistral, follows this same template with minor variations.

The high-level pipeline is: 7 stages Embedding → [Norm → Attention → Residual → Norm → FFN → Residual] × N layers → Final Norm → LM Head → Loss

Each stage is a kernel we have already studied. The embedding lookup is a table gather (the tokenization post). The norm is RMSNorm (the LayerNorm and RMSNorm post). The attention block is the Q/K/V → RoPE → scaled dot-product → output projection pipeline (the attention post and the positional encoding post). The FFN is a gated MLP with SwiGLU activation (the matrix wx+b post and the activation-functions post). The residual adds are the gradient highways from the residual-connections post. The LM head is one more GEMM projecting back to vocabulary size. And the loss is cross-entropy with softmax (the softmax post).

Complete forward pass pipeline from token IDs through N decoder layers to loss

Section 2: Embedding — Tokens Become Vectors

The model receives a sequence of integer token IDs. Each ID is used as an index into the embedding table, a matrix of shape [vocab_size, embed_dim]. The forward pass is just a table lookup: row token_id gets copied into the activation buffer.

C-Kernel-Engine — embedding forward

// Forward: gather rows from embedding table
// token_ids[t] → embedding_table[token_ids[t], :]
embedding_forward(
    input_token_ids,       // [T] integer token IDs
    g_active_tokens,       // number of tokens in this batch
    1024,                  // vocab_size
    weight_token_emb,      // [vocab_size × embed_dim] embedding table
    NULL,                  // positional embeddings (NULL = using RoPE)
    act_embedding_out,     // [T × embed_dim] output activations
    256,                   // embed_dim
    256,                   // aligned_embed_dim
    g_active_tokens,       // context_window
    0                      // add_pos flag (0 = no learned pos)
);

There is no mathematical operation here, just a copy. But the backward pass will be interesting: since multiple tokens might share the same ID, the gradient for a given embedding row is the sum of gradients from every position that used that row. This is sparse scatter-add, and getting it right matters.

Section 3: The Decoder Layer — One Complete Forward Block

Each decoder layer follows the Pre-LN pattern that modern models adopted after GPT-2. The sequence inside one layer is exactly 15 kernel calls in C-Kernel-Engine's generated code.

The pattern for a single decoder layer: 15 ops RMSNorm → Q proj → K proj → V proj → QK norm → RoPE → Attention → Out proj → Residual add → RMSNorm → MLP gate-up → SwiGLU → MLP down → Residual add → (next layer input)

C-Kernel-Engine — one decoder layer forward (generated)

// ─── Attention sub-block ───
// op 2: Pre-attention RMSNorm
rmsnorm_forward(prev_residual, ln1_gamma, norm_out, rstd_cache,
                tokens, 256, 256, 1e-6f);

// ops 3-5: Q, K, V projections (three GEMMs)
gemm(norm_out, Wq, bq, q_proj, T, 256, 256);  // Q: [T×256]
gemm(norm_out, Wk, bk, k_proj, T, 128, 256);  // K: [T×128]
gemm(norm_out, Wv, bv, v_proj, T, 128, 256);  // V: [T×128]

// op 6: QK normalization (optional per-head norm)
qk_norm_forward(q_head, k_head, q_norm, k_norm,
                num_heads, num_kv_heads, T, head_dim, 1e-5f);

// op 7: RoPE — inject positional information
rope_forward_qk(q_head, k_head, cos_cache, sin_cache,
                num_heads, num_kv_heads, T, head_dim, ...);

// op 8: Causal attention (scores saved for backward)
attention_forward_causal_head_major_gqa_exact(
    q, k, v, saved_attn_weights, attn_out,
    num_heads, num_kv_heads, T, head_dim, ...);

// op 9: Output projection
gemm(attn_out, Wo, bo, out_proj, T, 256, 256);

// op 10: First residual add
ck_residual_add_token_major(out_proj, prev_residual,
                            residual_0_out, T, 256);

// ─── FFN sub-block ───
// op 11: Post-attention RMSNorm
rmsnorm_forward(residual_0_out, ln2_gamma, norm2_out,
                rstd_cache2, T, 256, 256, 1e-6f);

// op 12: MLP gate-up projection
gemm(norm2_out, W1, b1, gate_up, T, 2048, 256);

// op 13: SwiGLU activation
swiglu_forward_exact(gate_up, silu_out, T, 1024);

// op 14: MLP down projection
gemm(silu_out, W2, b2, mlp_down, T, 256, 1024);

// op 15: Second residual add
ck_residual_add_token_major(mlp_down, residual_0_out,
                            residual_1_out, T, 256);

Two things to notice. First, attention saves its weight matrix saved_attn_weights during forward. This is the [num_heads, T, T] matrix of softmax probabilities. Backward needs it to compute gradients through the softmax-matmul composition. This is a memory-for-compute trade-off: we could recompute it (Flash Attention does this), but the generated training runtime saves it explicitly.

Second, every RMSNorm saves its rstd_cache, the reciprocal standard deviation per token. Backward needs this to avoid recomputing the forward statistics.

Section 4: The Output Head — From Hidden State to Vocabulary

After the final decoder layer, the hidden state passes through one more RMSNorm and then the LM head. The LM head is simply a GEMM that projects from embed_dim to vocab_size.

C-Kernel-Engine — output head forward

// op 16: Final RMSNorm
rmsnorm_forward(last_residual, final_ln_weight, final_norm_out,
                final_rstd, T, 256, 256, 1e-6f);

// op 18: LM head projection (no bias)
gemm(final_norm_out, output_weight, NULL, logits, T, 1024, 256);
// logits: [T × vocab_size] — raw unnormalized scores

The logits are raw scores, not probabilities. The softmax that converts them to probabilities is fused into the loss computation because that fusion gives us the famously clean gradient p - one_hot. Never apply softmax separately before cross-entropy. The fused version is numerically stable and produces a trivially simple gradient. Separating them invites overflow and wastes compute.

Section 5: The Loss — Where Backward Begins

The loss function is the starting point of every backward pass. For language modeling, it is cross-entropy loss with mean reduction. The fused softmax-cross-entropy kernel does three things in one pass over the logits:

First, it computes the numerically stable softmax using the max-subtract trick from the softmax post. Second, it computes the loss: L = -log(softmax[target]), averaged over valid tokens. Third, it computes the gradient: d_logits = softmax - one_hot(target), scaled by 1/num_valid_tokens.

C-Kernel-Engine — softmax cross-entropy (loss_kernels.c)

// For each token t:
float max_logit = row[0];
for (int v = 1; v < vocab_size; ++v)
    if (row[v] > max_logit) max_logit = row[v];

double sum_exp = 0.0;
for (int v = 0; v < vocab_size; ++v) {
    float e = expf(row[v] - max_logit);
    drow[v] = e;              // softmax numerator
    sum_exp += (double)e;
}
float inv_sum = 1.0f / (float)sum_exp;
for (int v = 0; v < vocab_size; ++v)
    drow[v] *= inv_sum;        // drow is now softmax(logits)

total_loss += -(double)row[target] + (double)max_logit + log(sum_exp);

// The key line: gradient = p - one_hot
drow[target] -= 1.0f;         // subtract 1 at the target index

That last line is the entire magic of the softmax-cross-entropy backward. After computing softmax probabilities, the gradient of the loss with respect to every logit is just p[v] for non-target classes and p[target] - 1 for the target class. The derivation was covered in the softmax post, but the intuition is clean: the gradient pushes the target probability toward 1 and all other probabilities toward 0. The magnitude of the push is proportional to how far each probability is from its target. p − 1 If the model assigns probability 0.9 to the correct token, the gradient magnitude is only 0.1. If it assigns 0.01, the gradient is 0.99. The loss is self-calibrating.

Loss gradient: softmax probabilities minus one-hot target

After the per-token gradients are computed, they are scaled by 1 / num_valid_tokens for mean reduction. Tokens marked with ignore_index = -100 (padding tokens) get zero gradients and do not contribute to the loss denominator. This is PyTorch-aligned semantics, implemented identically in C-Kernel-Engine.

Section 6: Backward Through the Output Head

The loss kernel gave us d_logits of shape [T, vocab_size]. Now we need to push that gradient backward through the LM head GEMM and the final RMSNorm.

The LM head is logits = final_norm_out @ W_output. This is a standard GEMM, and the backward was derived in the matrix wx+b post:

GEMM backward — LM head

python

# Forward: logits = X @ W^T (no bias for LM head)
# Backward:
d_X = d_logits @ W            # gradient w.r.t. input (final_norm_out)
d_W = d_logits^T @ X          # gradient w.r.t. weight

C-Kernel-Engine — LM head backward (generated)

// op 20: LM head backward
gemm_backward_f32_train_parallel_dispatch_v2(
    grad_act_logits,            // d_logits [T × vocab_size]
    act_final_norm_out,         // saved forward input
    weight_output_weight,       // W_output [vocab_size × embed_dim]
    tmp_d_input,                // → d_final_norm_out [T × embed_dim]
    tmp_d_W,                    // → d_W_output [vocab_size × embed_dim]
    NULL,                       // no bias gradient
    T, 256, 1024, num_threads
);

// ops 21-22: accumulate into persistent gradient buffers
gradient_accumulate_f32(grad_act_final_norm, tmp_d_input, 2048);
gradient_accumulate_f32(grad_weight_output, tmp_d_W, 262144);

Next, the final RMSNorm backward. This uses the saved rstd_cache from forward to avoid recomputing the variance.

C-Kernel-Engine — final norm backward

// op 23: Final RMSNorm backward
rmsnorm_backward(
    grad_act_final_norm,         // d_output from LM head backward
    act_last_residual,           // saved forward input
    weight_final_ln,             // gamma
    saved_rstd_cache,            // 1/√(mean(x²) + ε) from forward
    tmp_d_input,                 // → d_last_residual [T × embed_dim]
    tmp_d_gamma,                 // → d_gamma [embed_dim]
    T, 256, 256
);

// ops 24-25: accumulate
gradient_accumulate_f32(grad_act_last_residual, tmp_d_input, 2048);
gradient_accumulate_f32(grad_weight_final_ln, tmp_d_gamma, 256);

At this point, we have the gradient at the output of the last decoder layer. The chain so far: dL/d_logits → dL/d_final_norm_out → dL/d_last_residual. Now we enter the decoder layers in reverse.

Backward flow through LM head and final normalization

Section 7: Backward Through One Decoder Layer

This is the core of the entire backward pass. The decoder layer has two sub-blocks (attention and FFN), each wrapped with a residual connection. The backward must reverse each sub-block in exact reverse order, handling the residual splits correctly.

The backward through one decoder layer has this structure: ~40 ops Each decoder layer backward generates roughly 40 operations in C-Kernel-Engine: kernel calls plus gradient accumulations. A 32-layer model has ~1280 backward operations.

Step 1: Second Residual Add Backward

The forward was residual_1_out = mlp_down + residual_0_out. As covered in the residual-connections post, the residual backward is trivially simple: the gradient copies to both branches.

Residual backward — gradient splits

// op 26: Residual add backward
ck_residual_add_backward(
    grad_act_residual_1,     // incoming gradient
    tmp_d_mlp_down,          // → copy to FFN branch
    tmp_d_residual_0,        // → copy to skip branch
    T, 256
);
// Both outputs are identical copies of the input gradient.
// The skip path flows directly to the next residual split.

Step 2: FFN Backward (MLP Down → SwiGLU → MLP Gate-Up → Norm)

Working backward through the FFN sub-block:

FFN backward — three kernels in reverse

python

# Forward was: norm2 → gate_up → swiglu → mlp_down
# Backward:

# MLP down backward (GEMM backward)
d_silu_out = d_mlp_down @ W2        # [T × 1024]
d_W2 = d_mlp_down.T @ silu_out      # [256 × 1024]

# SwiGLU backward
# Forward: out[i] = silu(gate[i]) * x[i]
# where gate = input[:, :dim], x = input[:, dim:]
d_gate = d_out * x * silu'(gate)     # silu'(g) = σ(g)(1 + g(1-σ(g)))
d_x = d_out * silu(gate)
d_gate_up = concat(d_gate, d_x)     # [T × 2048]

# MLP gate-up backward (GEMM backward)
d_norm2 = d_gate_up @ W1            # [T × 256]
d_W1 = d_gate_up.T @ norm2_out      # [2048 × 256]

# Post-attention RMSNorm backward
d_residual_0, d_ln2_gamma = rmsnorm_backward(
    d_norm2, residual_0_out, ln2_gamma, rstd_cache2
)

Step 3: First Residual Add Backward

At this point two gradient streams meet. The skip path from Step 1 already contributed d_residual_0 (the skip branch of the second residual). The FFN backward also produced d_residual_0 (through norm → FFN path). These are accumulated before entering the first residual backward.

First residual backward

// Gradient accumulation: two streams merge
gradient_accumulate_f32(grad_act_residual_0, d_from_skip, 2048);
gradient_accumulate_f32(grad_act_residual_0, d_from_ffn, 2048);

// op: First residual add backward
ck_residual_add_backward(
    grad_act_residual_0,
    tmp_d_out_proj,           // → attention branch
    tmp_d_prev_residual,      // → skip to layer input
    T, 256
);

Step 4: Attention Backward (the most complex single step)

The attention backward reverses the entire attention block from the attention post. Working in reverse: output projection → attention core → RoPE → QK norm → Q/K/V projections → pre-attention norm.

Attention backward — complete reverse

python

# Output projection backward
d_attn_out = d_out_proj @ Wo           # [T × 256]
d_Wo = d_out_proj.T @ attn_out         # [256 × 256]

# Attention core backward (using saved attention weights)
# This is the heart of the backward pass
for each head h:
    # Step A: d_V += weights_h.T @ d_output_h
    d_V[h] += attn_weights[h].T @ d_out[h]

    # Step B: d_scores = d_out[h] @ V[h].T
    d_scores = d_out[h] @ V[h].T

    # Step C: softmax backward
    # d_prescore[i,j] = w[i,j] * (d_scores[i,j] - sum_k(w[i,k] * d_scores[i,k]))
    row_sum = (attn_weights[h] * d_scores).sum(axis=-1, keepdim=True)
    d_prescore = attn_weights[h] * (d_scores - row_sum)

    # Step D: undo the 1/√d_k scaling
    d_prescore *= scale  # scale = 1/√head_dim

    # Step E: d_Q, d_K from the QK^T dot product
    d_Q[h] += d_prescore @ K[h]
    d_K[h] += d_prescore.T @ Q[h]

# RoPE backward: reverse the rotation
d_q_pre_rope, d_k_pre_rope = rope_backward(d_Q, d_K, cos, sin)

# QK norm backward
d_q_proj, d_k_proj, d_q_norm, d_k_norm = qk_norm_backward(...)

# Q, K, V projection backward (three GEMM backwards)
d_norm1 += d_q_proj @ Wq + d_k_proj @ Wk + d_v_proj @ Wv
d_Wq = d_q_proj.T @ norm1_out
d_Wk = d_k_proj.T @ norm1_out
d_Wv = d_v_proj.T @ norm1_out

# Pre-attention RMSNorm backward
d_prev_residual, d_ln1_gamma = rmsnorm_backward(
    d_norm1, layer_input, ln1_gamma, rstd_cache1
)

A critical detail: the three Q/K/V projection backwards all produce gradients with respect to the same input (the norm output). These three gradient contributions must be summed. C-Kernel-Engine handles this with explicit gradient_accumulate_f32 calls after each GEMM backward. Missing any one of these accumulations would silently produce incorrect gradients. When three projections share an input, the backward produces three separate gradient contributions that must be added. This is not optional — it is the sum rule of calculus. Forget one and the model trains on wrong gradients without any error message.

Complete backward flow through one decoder layer

Section 8: Backward Through Embeddings

After reversing every decoder layer, the gradient arrives at the embedding output. The embedding backward has a unique character: it is sparse.

C-Kernel-Engine — embedding backward

// Embedding backward: scatter-add gradients to embedding rows
embedding_backward(
    input_token_ids,             // [T] — which rows to update
    T,                           // number of tokens
    grad_act_embedding_out,      // [T × embed_dim] — incoming gradient
    grad_weight_token_emb,       // [vocab_size × embed_dim] — accumulate
    grad_weight_pos_emb,         // positional embedding grads (if used)
    vocab_size,
    embed_dim,
    aligned_embed_dim,
    context_window,
    0                            // add_pos flag
);

The forward was a gather: copy row token_id from the table. The backward is a scatter-add: for each token position t, add the gradient at position t to the row indexed by token_ids[t]. If the same token appears at multiple positions (which is common — "the" might appear 5 times in a sentence), its embedding row receives the sum of all 5 gradients. sparse Only T rows of the vocab_size × embed_dim table receive gradient updates per step. For a 32K vocabulary with 8 tokens, only 0.025% of embedding weights get a gradient signal.

Section 9: The Chain Rule — How Local Derivatives Become Global

Every kernel we just walked has its own local backward function. The softmax knows how to differentiate softmax. The GEMM kernel knows how to differentiate matrix multiplication. RMSNorm knows its own two-pass backward. Attention knows its three-step algorithm (d_scores via V, softmax backward, dQ/dK via scaled dot product). RoPE knows how to reverse rotations. Residual add knows that gradients copy.

None of these kernels know anything about the others. The chain rule is what connects them. If the forward pass is f(g(h(x))), then the backward is: dL/dx = dL/df · df/dg · dg/dh · dh/dx. Each factor is computed by one kernel's backward function. The framework's job is to call them in order and pipe the output of one into the input of the next.

Written out for the full transformer, the chain looks like this:

The full chain rule — dL/d(LLM)

python

# Loss seeds the backward with d_logits = softmax(logits) - one_hot
d = d_logits                                    # [T × V]

# Output head
d = gemm_backward(d, final_norm_out, W_out)     # → d_final_norm
d = rmsnorm_backward(d, last_residual, γ, rstd) # → d_last_residual

# For layer = N-1 down to 0:
for layer in reversed(range(N)):
    # 2nd residual split
    d_ffn, d_skip2 = residual_backward(d)

    # FFN backward
    d_ffn = gemm_backward(d_ffn, silu_out, W2)    # mlp_down
    d_ffn = swiglu_backward(d_ffn, gate_up)        # activation
    d_ffn = gemm_backward(d_ffn, norm2_out, W1)    # gate-up
    d_ffn = rmsnorm_backward(d_ffn, res0, γ2, rstd2)

    # Merge skip + FFN path
    d = d_skip2 + d_ffn

    # 1st residual split
    d_attn, d_skip1 = residual_backward(d)

    # Attention backward
    d_attn = gemm_backward(d_attn, attn_out, Wo)  # out proj
    d_attn = attention_backward(d_attn, Q, K, V, weights)
    d_q, d_k = rope_backward(d_attn_q, d_attn_k, cos, sin)
    d_norm1 = qkv_proj_backward(d_q, d_k, d_v)    # 3 GEMMs
    d_attn = rmsnorm_backward(d_norm1, input, γ1, rstd1)

    # Merge skip + attention path
    d = d_skip1 + d_attn

# Embedding backward
embedding_backward(d, token_ids)  # scatter-add to embedding table

Complete chain rule from loss through N layers to embedding

That is the entire dL/d(LLM). Every weight gradient is a by-product of these kernel backwards. The GEMM backward for the Q projection produces both d_norm1 (the activation gradient that continues backward) and d_Wq (the weight gradient that gets stored for the optimizer). By the time the backward pass finishes, every weight in the model has a gradient waiting in its gradient buffer.

Section 10: The Training Loop — Forward, Zero, Backward, Optimize

The training loop orchestrates these pieces. In C-Kernel-Engine, it is a four-phase cycle with optional gradient accumulation.

C-Kernel-Engine — ck_train_step_ex (training loop)

int ck_train_step_ex(const int32_t *token_ids,
                     const int32_t *targets,
                     int valid_tokens,
                     float *loss_out, float lr)
{
    // 1. FORWARD — fill all activation buffers
    int fwd = ck_train_forward_step();

    // 2. BACKWARD — compute all gradients
    //    (includes loss computation as first op)
    int bwd = ck_train_backward_step();

    // 3. GRADIENT ACCUMULATION
    g_accum_step++;
    if (g_accum_step >= CK_GRAD_ACCUM_STEPS) {
        // 4. OPTIMIZER — AdamW update
        int opt = ck_train_optimizer_step(lr);

        // 5. ZERO GRADIENTS for next accumulation window
        g_accum_step = 0;
        ck_zero_grad();
    }

    *loss_out = g_loss_scalar[0];
    return fwd + bwd;
}

Gradient accumulation is a simple but essential trick. Instead of updating weights after every forward-backward pair, we sum gradients over K steps and then update once. This simulates a batch size of K × micro_batch_size without needing K times the memory. Before the optimizer step, the accumulated gradients are scaled by 1/K to match the per-step semantics. ck_zero_grad Forgetting to zero gradients before a new accumulation window is one of the most common training bugs. The gradients from the previous window leak into the next one, producing subtly wrong updates that can take hours to diagnose.

Section 11: The Optimizer — AdamW

Once all gradients are computed and optionally accumulated, the optimizer applies the weight update. C-Kernel-Engine uses AdamW with per-tensor gradient clipping, implemented as a fused multi-tensor update.

AdamW update — what happens to each weight

python

# For each weight tensor w with gradient g:
# (m = first moment, v = second moment, t = step number)

m = β₁ · m + (1 - β₁) · g          # momentum (smoothed gradient)
v = β₂ · v + (1 - β₂) · g²         # velocity (smoothed squared gradient)

m̂ = m / (1 - β₁ᵗ)                  # bias correction
v̂ = v / (1 - β₂ᵗ)                  # bias correction

w = w - lr · (m̂ / (√v̂ + ε) + λ · w)  # update with weight decay

C-Kernel-Engine — fused multi-tensor AdamW (generated)

// All 19 weight tensors updated in one fused call
float *grads[19]   = {grad_final_ln, grad_b1, grad_b2, grad_bk,
                      grad_bo, grad_bq, grad_bv, grad_k_norm,
                      grad_ln1, grad_ln2, grad_q_norm,
                      grad_w1, grad_w2, grad_wk, grad_wo,
                      grad_wq, grad_wv, grad_output, grad_emb};
float *weights[19] = { /* corresponding weight pointers */ };
float *m_states[19] = { /* first moment buffers */ };
float *v_states[19] = { /* second moment buffers */ };
size_t numels[19]  = {256, 2048, 256, 128, 256, 256, 128, 32,
                      256, 256, 32, 524288, 262144, 32768,
                      65536, 65536, 32768, 262144, 262144};

adamw_clip_update_multi_f32(
    grads, weights, m_states, v_states, numels,
    19,              // tensor_count
    lr,              // learning rate
    0.9f,            // β₁
    0.999f,          // β₂
    1e-8f,           // ε
    0.01f,           // weight_decay (λ)
    1.0f,            // max_grad_norm (clipping)
    g_opt_step       // step number for bias correction
);

The optimizer maintains two extra buffers per weight tensor: the first moment m (momentum) and second moment v (adaptive learning rate). This triples the memory needed for trainable parameters: weights + m + v. For a model with 1.8M parameters, that is 5.4M floats, or about 21 MB. For a 7B parameter model, that is 21B floats, or about 84 GB just for the optimizer state.

Training loop: forward, backward, accumulate, optimize

Section 12: Memory Layout — One Contiguous Arena

C-Kernel-Engine allocates the entire model state in one contiguous array. No malloc or free during training. Every tensor, activation, gradient, and optimizer buffer is a pointer offset into this single arena.

C-Kernel-Engine — arena layout

// Single allocation: everything lives in g_memory
static float g_memory[CK_TRAIN_TOTAL_FLOATS];  // ~122 MB

// Pointers are offsets into the arena
static float *weight_token_emb;     // g_memory + OFF_weight_token_emb
static float *act_embedding_out;    // g_memory + OFF_act_embedding_out
static float *grad_act_embedding;   // g_memory + OFF_grad_act_embedding
static float *grad_weight_token_emb;// g_memory + OFF_grad_weight_token_emb
static float *m_token_emb;          // g_memory + OFF_m_token_emb (AdamW)
static float *v_token_emb;          // g_memory + OFF_v_token_emb (AdamW)

// Layout regions:
// ├─ weights          [1.8M floats]   model parameters
// ├─ activations      [6.2M floats]   forward pass outputs (saved)
// ├─ grad_activations [6.2M floats]   activation gradients
// ├─ grad_weights     [1.8M floats]   weight gradients
// ├─ optimizer.m      [1.8M floats]   AdamW first moment
// ├─ optimizer.v      [1.8M floats]   AdamW second moment
// ├─ saved_*          attention weights, rstd caches
// └─ tmp_*            temporary buffers for backward

This design has no runtime allocation cost. The code generator knows every tensor size at compile time and produces exact offsets. There is no fragmentation, no garbage collection, no allocation failure during training. The CPU sees one large contiguous block and the prefetcher can work efficiently. No malloc, no free, no allocation failures. The memory layout is a solved problem at compile time. The training loop is pure computation against a flat array.

Section 13: C-Kernel-Engine vs Autograd — Two Ways to Stitch

At this point we can compare the two major approaches to building the backward pass.

What they share

Both approaches require hand-written backward functions for each kernel. PyTorch's autograd does not invent derivatives. When you call torch.matmul, PyTorch does not derive the matrix multiplication gradient from first principles at runtime. Instead, someone wrote the backward function for matmul (which computes dX = dout @ W and dW = dout.T @ X) and registered it in the autograd dispatch table. The same is true for softmax, layer norm, attention, RoPE, and every other operation.

This is the key insight that makes the entire series hang together. At every level of abstraction, someone had to sit down and derive the backward pass by hand, verify it mathematically, and implement it in optimized code. The only question is how those hand-written backward functions get called. Neither autograd nor C-Kernel-Engine invents derivatives. Both rely on human-derived backward implementations for each kernel. The difference is how the stitching happens, not whether the math was done.

How autograd stitches

PyTorch records a dynamic computation graph during the forward pass. Every operation creates a node in this graph, storing which tensors were inputs and which backward function to call. When you call loss.backward(), PyTorch walks the graph in reverse topological order, calling each node's backward function and piping the output gradient as input to the next node.

PyTorch autograd — dynamic graph

python

# Forward: autograd records the graph as a side effect
logits = model(input_ids)       # builds graph dynamically
loss = cross_entropy(logits, targets)

# Backward: autograd walks the graph in reverse
loss.backward()
# Internally:
#   1. Calls cross_entropy_backward → d_logits
#   2. Calls linear_backward (LM head) → d_hidden, d_W
#   3. Calls layernorm_backward → d_residual
#   4. ... walks through every layer in reverse ...
#   5. Calls embedding_backward → d_embedding_table

# The graph is built fresh every forward pass.
# This is why PyTorch can handle dynamic control flow
# (different graph shapes per batch).

How C-Kernel-Engine stitches

C-Kernel-Engine takes a fundamentally different approach. Every inference kernel has a paired backprop kernel. When you define gemm_forward, the framework knows it has a corresponding gemm_backward_f32. When you define attention_forward_causal_head_major_gqa, the framework knows it has attention_backward_causal_head_major_gqa. RMSNorm forward pairs with rmsnorm_backward. RoPE forward pairs with rope_backward_qk. This pairing is the core design principle.

The pairing is formalized in a grad rules contract. Every forward operation maps to a named grad rule, and every grad rule declares which tensors the forward pass must save for the backward.

C-Kernel-Engine — grad_rules_v7.json (the kernel pairing contract)

json

// Forward op → grad rule mapping
"template_op_to_rule": {
  "rmsnorm":     "rmsnorm_backward",
  "q_proj":      "matmul_backward",
  "k_proj":      "matmul_backward",
  "v_proj":      "matmul_backward",
  "qk_norm":     "qk_norm_backward",
  "rope_qk":     "rope_qk_backward",
  "attn":        "attention_backward",
  "out_proj":    "matmul_backward",
  "residual_add":"residual_add_backward",
  "mlp_gate_up": "matmul_backward",
  "silu_mul":    "swiglu_backward",
  "mlp_down":    "matmul_backward",
  "logits":      "matmul_no_bias_backward",
  "dense_embedding_lookup": "dense_embedding_lookup_backward"
}

// Each grad rule declares what forward must save
"rules": {
  "attention_backward": {
    "requires_saved": ["q", "k", "v"],
    "extra_saved": ["attn_weights"]       // materialized scores
  },
  "rmsnorm_backward": {
    "requires_saved": ["input", "gamma", "rstd_cache"]
  },
  "swiglu_backward": {
    "requires_saved": ["input"]           // gate values for silu'
  },
  "matmul_backward": {
    "requires_saved": ["input", "W"]      // both needed for dX and dW
  }
}

The backward pass is not discovered at runtime. The IR builder reads the model template, looks up the grad rule for each forward op, automatically derives the save_for_backward tensors, and the code generator emits the entire backward pass as explicit C calls in reverse topological order. This is the contract: you write a forward kernel and a backward kernel, you declare what the forward must save, and the framework handles everything else — the ordering, the buffer allocation, the gradient accumulation, and the stitching. Every inference kernel creates its own backward kernel. The framework's job is to pair them, order them, and wire the gradient buffers. The kernel author owns both the forward and backward math.

C-Kernel-Engine — static generated backward

// Generated at compile time, not discovered at runtime
int ck_train_backward_step(void) {
    int call_count = 0;

    // op 19: loss backward — softmax_cross_entropy
    ck_train_ce_dispatch(logits, targets, T, V, d_logits, loss);
    call_count++;

    // op 20: LM head backward — gemm_backward_f32
    gemm_backward_f32(..., d_logits, norm_out, W_out,
                      d_norm, d_W_out, ...);
    call_count++;

    // ops 21-22: gradient accumulate
    gradient_accumulate_f32(grad_norm, d_norm, 2048);
    gradient_accumulate_f32(grad_W_out, d_W_out, 262144);
    call_count += 2;

    // op 23: final norm backward — rmsnorm_backward
    rmsnorm_backward(grad_norm, residual, gamma, rstd,
                     d_residual, d_gamma, T, 256, 256);
    call_count++;

    // ... 54 more operations, all explicit ...

    return call_count;  // total: 59 operations
}

The framework accommodates kernels at any granularity. A kernel can be as granular as a single element-wise add (like residual_add and its trivial backward that copies the gradient), or as explicit as a full attention block (3900 lines of SIMD code with a sophisticated 3-step backward algorithm). The header file lists the evidence. On the simple end: ck_residual_add_token_major / ck_residual_add_backward — four lines of per-element copy. On the complex end: attention_forward_causal_head_major_gqa_exact / attention_backward_causal_head_major_gqa — 18 forward variants, 3 backward variants covering GQA, flash, decode, BF16, and sliding-window paths. In between: swiglu_forward_exact / swiglu_backward_exact, rope_forward_qk / rope_backward_qk, rmsnorm_forward / rmsnorm_backward — each at its natural level of complexity. The framework does not care about the size of the block. It only cares that every forward kernel has a paired backward kernel, that the grad rule declares what to save, and that the code generator can wire them in reverse topological order.

PyTorch also needs hardcoded kernels

Here is the nuance that most explanations miss. PyTorch's autograd stitches beautifully when the operations decompose into basic primitives (add, multiply, matmul, exp, log). But when you have a novel kernel or algorithm where the derivative derivation is too complex to express as a composition of basic operations, or where the mathematically optimal backward path does not factor into autograd's primitive vocabulary, you still have to hardcode your backward pass.

PyTorch — custom kernel with hardcoded backward

python

# When autograd's primitives are not enough:
class FlashAttentionFunc(torch.autograd.Function):
    @staticmethod
    def forward(ctx, q, k, v):
        # Custom CUDA kernel — not decomposable into basic ops
        out = flash_attn_cuda.forward(q, k, v)
        ctx.save_for_backward(q, k, v, out)
        return out

    @staticmethod
    def backward(ctx, d_out):
        # Hand-derived backward — autograd cannot stitch this
        q, k, v, out = ctx.saved_tensors
        d_q, d_k, d_v = flash_attn_cuda.backward(d_out, q, k, v, out)
        return d_q, d_k, d_v

Flash Attention, fused layer norm, fused SwiGLU, custom RoPE implementations — all of these are custom autograd functions in PyTorch with hand-written backward passes. This is the same pattern as C-Kernel-Engine: the kernel author writes both the forward and backward, and the framework stitches them into the larger graph. The difference is not really about granularity. It is about the stitching model. PyTorch gives you a choice: use basic primitives (autograd stitches automatically) or write a custom kernel (you hardcode the backward, autograd still stitches it into the graph). C-Kernel-Engine always takes the second path — every kernel is explicit, and the code generator stitches them.

The trade-off

Aspect	PyTorch Autograd	C-Kernel-Engine
Graph construction	Dynamic, built during forward	Static, generated at compile time
Kernel pairing	Built-in ops auto-stitched; custom ops via `autograd.Function`	Every kernel explicitly paired: forward ↔ backward
Granularity range	Primitives (add, mul) up to fused blocks (FlashAttention)	Same range: element-wise add up to full attention block
Novel algorithms	Must write custom `autograd.Function` with hardcoded backward	Must write paired backward kernel
Runtime overhead	Graph bookkeeping, Python dispatch	Zero — flat C function calls
Memory allocation	Dynamic (allocator + GC)	Static arena (one malloc at init)
Flexibility	Any model topology, dynamic control flow	Fixed topology per generated binary
Parallelism	GPU-optimized (CUDA kernels)	CPU-optimized (AVX-512, feature-parallel)

The real trade-off is between the stitching model, not the kernel granularity. PyTorch autograd gives you a dynamic graph walker that can stitch anything — basic primitives, fused kernels, or custom blocks — at the cost of runtime dispatch overhead. C-Kernel-Engine gives you a static code generator that emits a flat sequence of kernel calls at compile time, at the cost of needing to regenerate when the model topology changes. Both accommodate kernels at any level of granularity, from a single add to a full attention block. Both require someone to write the backward pass for each kernel. The difference is when and how the stitching decision is made. Autograd stitches at runtime with a graph walker. C-Kernel-Engine stitches at compile time with a code generator. Both produce identical gradients because the underlying math — and the human-derived backward passes — are the same.

The critical point: autograd is not doing something fundamentally different from what we just walked through. It is calling the same backward functions (softmax backward, GEMM backward, norm backward, attention backward) in the same order. Autograd discovers the calling order by walking a graph at runtime. C-Kernel-Engine hardcodes it at compile time. And in both systems, when the derivative is too complex or the performance-critical path demands a fused implementation, someone sits down, derives the backward by hand, and writes optimized code for it.

Section 14: The Full Picture — Everything Connected

Let us zoom out one final time and see the complete gradient flow as a single diagram.

Complete dL/d(LLM) backward pass from loss to embeddings

The complete dL/d(LLM) — all operations

text

FORWARD (19 ops per layer):
  tokens → embedding_lookup → [rmsnorm → q_proj → k_proj → v_proj →
  qk_norm → rope → attention → out_proj → residual_add →
  rmsnorm → mlp_gate_up → swiglu → mlp_down → residual_add] × N →
  rmsnorm → lm_head → logits

LOSS:
  logits + targets → softmax_cross_entropy → scalar_loss + d_logits

BACKWARD (59 ops for 1 layer):
  d_logits → gemm_backward (lm_head) → rmsnorm_backward (final) →
  [residual_backward → gemm_backward (mlp_down) → swiglu_backward →
  gemm_backward (gate_up) → rmsnorm_backward →
  residual_backward → gemm_backward (out_proj) →
  attention_backward → rope_backward → qk_norm_backward →
  gemm_backward (v_proj) → gemm_backward (k_proj) →
  gemm_backward (q_proj) → rmsnorm_backward] × N →
  embedding_backward

OPTIMIZER:
  for each weight: AdamW(gradient, weight, m, v, step, lr)
  zero all gradient buffers

For a 1-layer model, C-Kernel-Engine generates 19 forward operations, 59 backward operations, and 19 optimizer updates. For a 32-layer model, that becomes roughly 480 forward operations, ~1900 backward operations, and ~600 optimizer updates. Every single one of those operations is a concrete C function call against a flat memory array. No interpreter, no graph walker, no dynamic dispatch.

Section 15: What This Series Built

Nine posts ago we started with a scalar multiply: y = w · x + b. We derived its gradient by hand: dy/dw = x, dy/dx = w, dy/db = 1. Then we scaled that idea to matrices, to softmax, to activations, to normalization, to positional encoding, to tokenization, to full multi-head attention, to residual connections.

At each step, the pattern was the same:

Define the forward computation. Derive the local backward (the Jacobian or its efficient form). Implement both in C with SIMD. Show that the backward is not magic — it is the same calculus you learned in school, applied kernel by kernel.

This post completed the circle. The full backward pass is nothing more than calling each kernel's backward function in reverse topological order, piping the output gradient of one into the input of the next, and accumulating weight gradients along the way. That is all autograd does. That is all C-Kernel-Engine does. That is all any training framework does. The derivative of a language model is not one formula. It is a pipeline of local derivatives, each derived by hand, each implemented in optimized code, stitched together by the chain rule. Welcome to C-Kernel-Engine.

The complete series: from scalar wx+b to full transformer backprop

Series summary — posts 25 through 33

Matrix wx+b: From Scalars To Transformers — scalar to GEMM, forward and backward.

Softmax: The Probability Engine — stable normalization, the Jacobian, cross-entropy fusion.

Activation Functions: The Nonlinearity Inside Neural Networks — ReLU, GELU, SiLU, SwiGLU and their derivatives.

LayerNorm & RMSNorm: Stabilizing The Signal — statistical normalization, two-pass backward.

Positional Encoding: Teaching Transformers Where To Look — sinusoidal, learned, RoPE, rotation gradients.

Tokenization: The First Decision That Shapes Everything — BPE, WordPiece, SentencePiece, custom tokens.

Attention: The Core Of The Transformer — Q/K/V, multi-head, the 7-step backward pass.

Residual Connections: The Gradient Highway — the gradient highway, Pre-LN vs Post-LN.

This capstone: dL/d(LLM), the full backward pass stitched together.

dL/d(LLM): The Full Backward Pass

Roadmap for this post

Section 1: The Forward Pass — From Token IDs to Loss

Section 2: Embedding — Tokens Become Vectors

Section 3: The Decoder Layer — One Complete Forward Block

Section 4: The Output Head — From Hidden State to Vocabulary

Section 5: The Loss — Where Backward Begins

Section 6: Backward Through the Output Head

Section 7: Backward Through One Decoder Layer

Step 1: Second Residual Add Backward

Step 2: FFN Backward (MLP Down → SwiGLU → MLP Gate-Up → Norm)

Step 3: First Residual Add Backward

Step 4: Attention Backward (the most complex single step)

Section 8: Backward Through Embeddings

Section 9: The Chain Rule — How Local Derivatives Become Global

Section 10: The Training Loop — Forward, Zero, Backward, Optimize

Section 11: The Optimizer — AdamW

Section 12: Memory Layout — One Contiguous Arena

Section 13: C-Kernel-Engine vs Autograd — Two Ways to Stitch

What they share

How autograd stitches

How C-Kernel-Engine stitches

PyTorch also needs hardcoded kernels

The trade-off

Section 14: The Full Picture — Everything Connected

Section 15: What This Series Built

Series summary — posts 25 through 33

Need an intelligent system to work on real hardware?

Embedded systems · Robotics · Constrained AI · CPU and HPC · Accelerators · Distributed systems

ShivasNotes

Read

Support

dL/d(LLM): The Full Backward Pass

Roadmap for this post

Section 1: The Forward Pass — From Token IDs to Loss

Section 2: Embedding — Tokens Become Vectors

Section 3: The Decoder Layer — One Complete Forward Block

Section 4: The Output Head — From Hidden State to Vocabulary

Section 5: The Loss — Where Backward Begins

Section 6: Backward Through the Output Head

Section 7: Backward Through One Decoder Layer

Step 1: Second Residual Add Backward

Step 2: FFN Backward (MLP Down → SwiGLU → MLP Gate-Up → Norm)

Step 3: First Residual Add Backward

Step 4: Attention Backward (the most complex single step)

Section 8: Backward Through Embeddings

Section 9: The Chain Rule — How Local Derivatives Become Global

Section 10: The Training Loop — Forward, Zero, Backward, Optimize

Section 11: The Optimizer — AdamW

Section 12: Memory Layout — One Contiguous Arena

Section 13: C-Kernel-Engine vs Autograd — Two Ways to Stitch

What they share

How autograd stitches

How C-Kernel-Engine stitches

PyTorch also needs hardcoded kernels

The trade-off

Section 14: The Full Picture — Everything Connected

Section 15: What This Series Built

Series summary — posts 25 through 33

Subscribe

Subscribe to emails from Anthony

Need an intelligent system to work on real hardware?

Embedded systems · Robotics · Constrained AI · CPU and HPC · Accelerators · Distributed systems

ShivasNotes

Read

Support