Muon Optimizer: SGD vs AdamW vs Matrix-Aware Training Updates

Optimizer research note

This post continues the optimizer path from SGD and AdamW into Muon, a matrix-aware training update that changes what the optimizer kernel has to do. Previously in this series: Tokenization: The First Decision That Shapes Everything.

Muon is one of the more interesting optimizer ideas showing up in modern LLM training discussions because it does not merely change a learning-rate schedule or add another element-wise moment buffer. It changes the geometry of the update for hidden-layer weight matrices. AdamW looks at each parameter mostly as an element with its own first and second moment statistics. Muon looks at a 2D matrix update and asks whether the update direction can be made more orthogonal, balanced, and matrix-aware.

C-Kernel-Engine does not currently ship a Muon kernel, and that is important to say clearly. This post is therefore a research and implementation target: what Muon is, what the update rule does, why it differs from AdamW, and what a clean CKE implementation path would look like.

Scope: This is not an inference kernel. Muon is a training optimizer. It matters when CKE updates weights, not when CKE only runs a fixed checkpoint.

Muon optimizer step showing gradient, momentum, Nesterov, Newton-Schulz orthogonalization, and weight update. — Muon changes the optimizer step by inserting a matrix orthogonalization-style update before weights are modified.

Sources and scope

This draft references Keller Jordan’s Muon optimizer writeup, the Muon reference repository, and the technical report Muon is Scalable for LLM Training.

The CKE sections are explicitly written as an implementation roadmap. They should not be read as claiming a completed CKE Muon kernel.

Why Another Optimizer?

Post 34 built the optimizer path from SGD to Momentum to AdamW. That sequence is still the right baseline. SGD gives the simplest possible update. Momentum remembers a smoothed direction. Adam adds per-parameter adaptive scaling. AdamW fixes weight decay by decoupling it from the gradient update.

AdamW became the default because it is robust, easy to tune relative to older methods, and works across many parameter types. But AdamW pays for that robustness with optimizer state. For every weight it usually stores a first moment and a second moment. At scale, optimizer state becomes a major memory object.

Muon asks a different question. Many important neural network weights are not isolated scalars. They are 2D matrices: projection weights, MLP weights, expert weights, and other hidden-layer transforms. If the parameter is a matrix, maybe the update should respect matrix geometry instead of treating the entries as unrelated scalar slots.

SGD follows raw gradient direction, AdamW scales coordinates elementwise, and Muon shapes the matrix update with Newton-Schulz orthogonalization. — SGD, AdamW, and Muon occupy the same training-loop slot, but they convert gradients into weight updates in different geometries.

The diagram is not saying one optimizer is universally better. It is showing the shape of the update each optimizer believes in. All three methods start from the same object: a gradient computed by backpropagation. The difference is what happens before that gradient becomes a weight update.

How to read the three boxes

SGD / Momentum is drawn as a single arrow because the update mostly follows the gradient direction. Plain SGD says: move the weights in the opposite direction of the current gradient. Momentum adds memory by keeping a running direction, so the update does not jerk around as much from batch to batch. But the mental model is still direction-following: gradient comes in, update direction comes out.

AdamW is drawn as many small coordinate blocks because AdamW treats each parameter coordinate with its own adaptive scale. It keeps a first moment, roughly the smoothed gradient, and a second moment, roughly the smoothed squared gradient. That lets AdamW damp noisy coordinates and boost quieter ones. This is why AdamW is robust, but also why it carries more optimizer state: each weight needs extra moment buffers.

Muon is drawn as a matrix path because Muon is aimed at 2D weight matrices, not isolated scalar slots. For matrix weights, Muon first builds a momentum-style update and then uses a Newton-Schulz style transformation to make that matrix update more orthogonalized or balanced. So the optimizer contains real matrix work. It is no longer only an elementwise loop over parameters; it becomes a kernel with GEMM-like operations, scratch buffers, layout decisions, and numerical parity requirements.

This is the core mental shift. SGD asks, “which direction does the gradient point?” AdamW asks, “how should each coordinate be scaled based on first and second moments?” Muon asks, “if this parameter is a matrix, should the update itself be shaped like a matrix operator?” That is why Muon belongs in a kernel-engineering series. The optimizer is no longer just a vectorized element-wise pass. The optimizer now contains matrix multiplications, temporary buffers, dtype decisions, and layout-sensitive performance behavior.

Comparison chart between AdamW and Muon across state, update shape, best target, kernel cost, and CKE status. — AdamW is broad and mature. Muon is more specialized: strongest for 2D hidden-layer matrices, not every parameter in the model.

The Core Muon Step

The PyTorch implementation makes the high-level structure clear. For each 2D parameter matrix with gradient G, Muon keeps a momentum buffer B. It computes a momentum or Nesterov-style update, orthogonalizes that update using a Newton-Schulz iteration, applies decoupled weight decay, adjusts the learning rate based on matrix shape, and subtracts the resulting update from the parameter.

\[ g_t = \nabla_{\theta} L_t(\theta_{t-1}) \] First compute the raw gradient for the current mini-batch or accumulated batch.

\[ B_t = \mu B_{t-1} + g_t \] Then update the momentum buffer. This part is conceptually similar to SGD with momentum.

\[ \widetilde{B}_t = \begin{cases} g_t + \mu B_t, & \text{Nesterov} \\ B_t, & \text{otherwise} \end{cases} \] With Nesterov enabled, Muon forms a look-ahead update from the current gradient and the momentum buffer.

\[ O_t = \operatorname{NewtonSchulz}_k(\widetilde{B}_t) \] This is the defining move: the update is passed through a matrix orthogonalization-style transform.

\[ \theta_t \leftarrow (1-\eta\lambda)\theta_{t-1} - \eta_{\mathrm{shape}} O_t \] Finally apply decoupled weight decay and subtract the shape-adjusted orthogonalized update.

Muon update skeleton — Python-like pseudocode

python

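for W in two_dimensional_hidden_layer_weights:
    G = W.grad
    B = momentum_buffers[W]  # one persistent momentum buffer per matrix

    # Momentum accumulation, then the optional Nesterov look-ahead.
    B = momentum * B + G
    momentum_buffers[W] = B
    U = (G + momentum * B) if nesterov else B

    # Orthogonalize the update direction.
    O = newton_schulz_orthogonalize(U, steps=5)

    # Weight decay uses the base lr; the update uses the shape-adjusted lr.
    adjusted_lr = adjust_lr(lr, W.shape)
    W *= (1.0 - lr * weight_decay)
    W -= adjusted_lr * O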

Why Only 2D Parameters?

This point matters. Muon is not meant to replace every optimizer path in the model. The local PyTorch implementation rejects non-2D parameters. The documentation notes that other parameters, such as biases and embeddings, should be optimized by a standard method such as AdamW.

This is a natural fit for transformer hidden layers. The most expensive trainable objects in the model are matrix-shaped: W_Q, W_K, W_V, W_O, MLP up/down/gate projections, and MoE expert matrices. The final LM head is also a matrix, but Keller Jordan’s writeup recommends keeping the input embedding and the output head on a standard optimizer such as AdamW, so treat it as a routing decision rather than an automatic Muon target. Embeddings, biases, scalar norm weights, and MoE routers may likewise need a different optimizer route. Muon is best understood as a matrix optimizer for hidden-layer weights, not a universal replacement for every parameter tensor.

Parameter type	Shape	Muon?	Optimizer route
Attention projection	`[D, D]` or rectangular	yes	Muon
MLP up/down/gate	`[D, H]`, `[H, D]`	yes	Muon
MoE expert weight	2D per expert	yes	Muon or expert-group Muon
Bias	1D	no	AdamW / SGD
Norm scale	1D	no	AdamW / SGD
Embedding table	2D but sparse semantics	case-dependent	usually AdamW-style handling

The Newton-Schulz Part

The Newton-Schulz loop is where Muon becomes interesting for kernel engineering. The update matrix is transposed when it has more rows than columns, so the Gram matrix X Xᵀ is formed on the smaller dimension, then normalized, and then repeatedly transformed using matrix products. The local PyTorch implementation uses approximately these default coefficients:

\[ a=3.4445,\qquad b=-4.7750,\qquad c=2.0315,\qquad k=5 \] These are the common Muon Newton-Schulz coefficients and default step count used in the local PyTorch implementation.

At a high level, the loop looks like this:

Newton-Schulz loop showing X0 normalization, A equals X X transpose, polynomial update, and repeated matrix multiplications. — The Newton-Schulz loop converts the momentum update into a matrix-shaped update with more orthogonalized geometry.

Newton-Schulz orthogonalization — simplified

python

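import torch

def zeropower_via_newton_schulz(G, a, b, c, steps, eps):
    # Run the iteration in bfloat16.
    X = G.bfloat16()

    # Work on the wide orientation so X @ X.T is the smaller Gram matrix.
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T

    # Frobenius normalization, clamped so a zero gradient cannot divide by zero.
    X = X / X.norm().clamp(min=eps)

    for _ in range(steps):
        A = X @ X.T
        P = b * A + c * (A @ A)
        X = a * X + P @ X

    # Restore the original orientation.
    if transposed:
        X = X.T

    return X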

From a kernel perspective, that loop is not mysterious. It is a sequence of matrix multiplies plus scalar combinations. But it is very different from AdamW. AdamW is mostly element-wise vector math. Muon introduces real matrix work inside the optimizer step.

Newton-Schulz iterations make the update matrix singular values more balanced, turning an anisotropic update into a more orthogonalized update. — One useful intuition: Newton-Schulz reshapes the spectrum of the update matrix, making the update direction more balanced instead of letting a few directions dominate.
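The spectral picture can be checked without any matrix code. Writing X = U S Vᵀ, one step X ← aX + (bA + cA²)X with A = X Xᵀ leaves U and V alone and maps every singular value s to a·s + b·s³ + c·s⁵. The sketch below iterates that scalar map after the same Frobenius normalization the loop applies. The function names and the starting singular values (3.0 and 0.5) are assumptions chosen for illustration, not part of any reference implementation.

```python
import math

# Newton-Schulz coefficients quoted above.
A_COEF, B_COEF, C_COEF = 3.4445, -4.7750, 2.0315


def ns_scalar_step(s: float) -> float:
    """Effect of one Newton-Schulz step on a single singular value."""
    return A_COEF * s + B_COEF * s**3 + C_COEF * s**5


def reshape_spectrum(singular_values, steps=5):
    """Normalize by the Frobenius norm, then apply the scalar map `steps` times."""
    fro = math.sqrt(sum(s * s for s in singular_values))
    out = [s / fro for s in singular_values]
    for _ in range(steps):
        out = [ns_scalar_step(s) for s in out]
    return out


if __name__ == "__main__":
    before = [3.0, 0.5]  # illustrative, strongly anisotropic update
    after = reshape_spectrum(before)
    print("ratio before:", max(before) / min(before))
    print("after:", [round(s, 3) for s in after])
    print("ratio after:", round(max(after) / min(after), 3))
```

In this example the singular values end up in a band around 1 rather than exactly at 1, which is why the text says "more balanced" rather than "exactly orthogonal".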

The exact implementation details matter. The common Muon implementation first normalizes the update matrix so the iteration is stable. It transposes tall matrices so the iteration forms its Gram matrix on the smaller side. Then each Newton-Schulz step builds products such as X Xᵀ, A², and P X. For a square n × n matrix each of these products is O(n³) work. For rectangular transformer matrices the cost depends heavily on which side is smaller and how the implementation chooses its orientation.
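To put rough numbers on that cost, here is a minimal estimator for the three products in the loop. It assumes the wide orientation (m ≤ n), counts a dense (p × q) @ (q × r) product as 2·p·q·r FLOPs, and ignores the scalar scaling and additions. The function name and the example shapes are hypothetical.

```python
def newton_schulz_flops(rows: int, cols: int, steps: int = 5) -> int:
    """Approximate FLOPs for the matrix products in the Newton-Schulz loop."""
    m, n = min(rows, cols), max(rows, cols)
    gram = 2 * m * m * n      # A = X @ X.T   -> (m x n) @ (n x m)
    gram_sq = 2 * m * m * m   # A @ A         -> (m x m) @ (m x m)
    apply = 2 * m * m * n     # P @ X         -> (m x m) @ (m x n)
    return steps * (gram + gram_sq + apply)


if __name__ == "__main__":
    # Illustrative transformer-like shapes, not taken from a specific model.
    for shape in [(4096, 4096), (4096, 11008), (11008, 4096)]:
        tflops = newton_schulz_flops(*shape) / 1e12
        print(shape, f"{tflops:.2f} TFLOPs per optimizer step")
```

Because the estimate depends only on the smaller and larger dimension, a tall matrix and its transpose cost the same once the implementation picks the wide orientation.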

Why this is not “just another optimizer flag”

AdamW can be written as a fused element-wise multi-tensor update. Muon needs a matrix path. That means scratch buffers, GEMM reuse, dtype policy, shape-aware learning-rate adjustment, and a parameter router. The optimizer becomes part of the kernel/runtime design surface.

Why This Is Interesting For C-Kernel-Engine

CKE already treats training as explicit kernels, saved tensors, generated backward paths, and optimizer updates. Post 34 covered the current optimizer surface: SGD momentum, AdamW, gradient norm clipping, and fused multi-tensor update paths. Muon would add a new kind of optimizer kernel: a matrix-update optimizer instead of a pure element-wise optimizer.

That makes it a good next implementation target because it naturally exercises the parts CKE cares about: matrix layout, scratch buffers, BF16/FP32 behavior, GEMM reuse, deterministic update contracts, optimizer routing, and parity against a known reference.

Proposed C-Kernel-Engine Muon build order from scalar reference to SIMD and AMX paths. — A clean CKE Muon implementation should start scalar/reference-first, then optimize only after parity.

CKE implementation rule

The first Muon kernel should not be “fast.” It should be correct, inspectable, and parity-tested. Only then should the Newton-Schulz matrix products be lowered into optimized GEMM, SIMD, AMX, or future CPU matrix-extension paths.

Proposed CKE Kernel Surface

A C implementation should not begin as one giant monolithic function. It should expose the algorithmic pieces separately so each piece can be tested:

Kernel / helper	Purpose	Test contract
`muon_momentum_f32`	`B = μB + G`	matches scalar reference for all shapes
`muon_nesterov_f32`	`U = G + μB` or `U = B`	Nesterov flag changes only this path
`muon_normalize_matrix_f32`	scale by matrix norm with epsilon clamp	stable for zero / tiny matrices
`muon_newton_schulz_f32`	run `k` NS iterations	matches PyTorch reference tolerance
`muon_apply_update_f32`	decoupled weight decay + subtract update	same ordering as reference implementation
`optimizer_route_muon_adamw`	Muon for 2D matrices, AdamW for others	parameter manifest routes correctly

Proposed CKE API shape — not implemented yet

typedef struct {
    float lr;
    float weight_decay;
    float momentum;
    float eps;
    float ns_a;
    float ns_b;
    float ns_c;
    int   ns_steps;
    int   nesterov;
    int   adjust_lr_mode;
} ck_muon_config_t;

void ck_muon_update_matrix_f32(
    float *weight,             // [rows, cols]
    const float *grad,          // [rows, cols]
    float *momentum_buffer,     // [rows, cols]
    float *scratch,             // workspace for X, A, A2, P
    int rows,
    int cols,
    const ck_muon_config_t *cfg);

The scratch pointer matters. Newton-Schulz needs temporary matrices. If the optimizer allocates inside every matrix update, the implementation will become noisy, slow, and hard to reason about. CKE should allocate a training scratch arena once, plan the largest required temporary shape, and reuse it across parameters.
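A planning pass for that arena can be very small. The sketch below assumes the workspace holds X (m × n) plus A, A², and P (each m × m) in the wide orientation with m ≤ n, as the comment in the proposed signature suggests, and that elements are FP32. The function names and the example shapes are hypothetical.

```python
def muon_scratch_elems(rows: int, cols: int) -> int:
    """Elements for X (m x n) plus A, A2, P (each m x m), with m <= n."""
    m, n = min(rows, cols), max(rows, cols)
    return m * n + 3 * m * m


def plan_scratch_arena(shapes, bytes_per_elem: int = 4) -> int:
    """Bytes for one arena sized for the largest Muon-owned matrix."""
    largest = max((muon_scratch_elems(r, c) for r, c in shapes), default=0)
    return largest * bytes_per_elem


if __name__ == "__main__":
    # Illustrative shapes, not taken from a specific model.
    shapes = [(4096, 4096), (4096, 11008), (11008, 4096)]
    print(f"arena: {plan_scratch_arena(shapes) / 2**20:.1f} MiB")
```

A real kernel may need one more m × n buffer, because the product P X cannot safely be written over X while X is still being read.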

AdamW State vs Muon State

AdamW carries first and second moment buffers. Muon carries a momentum buffer and computes an orthogonalized update. This does not mean Muon is automatically cheaper in wall-clock time. It shifts cost from element-wise state math toward matrix products during the optimizer step.

Simplified optimizer state memory comparison for SGD, Momentum, AdamW, and Muon. — Muon may use less persistent optimizer state than AdamW for matrix weights, but it introduces heavier per-step matrix computation.

The practical trade-off is this: AdamW is memory-state heavy and element-wise compute friendly. Muon is matrix-update heavy and may be attractive when the training dynamics improvement is worth the optimizer-side GEMM work.

Trade-off: Muon is not “free speed.” It is a different optimizer geometry. The right benchmark is convergence per token, memory footprint, and total training cost, not only optimizer kernel latency.
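The persistent-state side of this trade-off is simple arithmetic. The sketch below counts only the buffers that survive between steps (none for plain SGD, one for momentum and Muon, two for AdamW) and assumes FP32 state. The names are hypothetical, and scratch workspace is deliberately excluded because it is shared across parameters.

```python
# Persistent optimizer buffers kept per parameter element.
BUFFERS_PER_PARAM = {"sgd": 0, "momentum": 1, "adamw": 2, "muon": 1}


def optimizer_state_bytes(num_params: int, optimizer: str,
                          bytes_per_elem: int = 4) -> int:
    """Bytes of persistent optimizer state for one parameter tensor."""
    return BUFFERS_PER_PARAM[optimizer] * num_params * bytes_per_elem


if __name__ == "__main__":
    num_params = 4096 * 4096  # one illustrative square projection matrix
    for name in BUFFERS_PER_PARAM:
        mib = optimizer_state_bytes(num_params, name) / 2**20
        print(f"{name:9s} {mib:6.1f} MiB")
```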

Shape-Based Learning Rate Adjustment

Muon implementations usually adjust the learning rate based on matrix shape. The local PyTorch implementation includes two modes. The original mode scales the learning rate by the square root of the ratio between the first and second dimension, floored at 1. The match_rms_adamw mode attempts to match AdamW-style RMS behavior.

\[ \eta_{\mathrm{original}} = \eta \sqrt{\max\left(1,\frac{A}{B}\right)} \] Here A and B are the first two dimensions of the 2D parameter matrix (its row and column counts), not the Gram matrix A or the momentum buffer B used earlier.

\[ \eta_{\mathrm{match\_rms\_adamw}} = 0.2\,\eta\,\sqrt{\max(A,B)} \] The scalable LLM training variant attempts to make Muon more compatible with AdamW-tuned learning-rate and weight-decay settings.
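Both formulas translate directly into a few lines. This is a sketch of the two modes as written above. The function name and the mode strings are assumptions for illustration, and the example shapes and learning rates are arbitrary.

```python
import math


def adjust_lr(lr: float, shape, mode: str = "original") -> float:
    """Shape-adjusted learning rate for a 2D parameter of the given shape."""
    a, b = shape[:2]  # first two dimensions of the parameter
    if mode == "original":
        ratio = math.sqrt(max(1.0, a / b))
    elif mode == "match_rms_adamw":
        ratio = 0.2 * math.sqrt(max(a, b))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return lr * ratio


if __name__ == "__main__":
    print(adjust_lr(0.02, (1024, 256)))                       # tall: scaled up
    print(adjust_lr(0.02, (256, 1024)))                       # wide: unchanged
    print(adjust_lr(0.001, (1024, 4096), "match_rms_adamw"))
```

In the original mode only tall matrices get a larger step, while the match_rms_adamw mode scales every matrix by its larger dimension.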

This matters for CKE because the optimizer cannot just read a flat parameter array and blindly apply one update rule. It needs the parameter manifest. The manifest must know whether a tensor is 2D, whether it should use Muon, what shape adjustment applies, and whether a fallback optimizer should own that parameter.

Where Muon Fits In The Training Pipeline

Muon sits exactly where AdamW sits: after forward, loss, backward, gradient accumulation, optional gradient clipping, and before zeroing gradients for the next step. It does not change the chain rule. It changes how the final gradient buffer modifies the weights.

Training loop placement

for (int step = 0; step < training_steps; ++step) {
    ck_forward(model, batch);
    ck_cross_entropy_loss(model, targets);
    ck_backward(model);

    ck_gradient_allreduce_or_accumulate(model);
    ck_clip_grad_norm_if_needed(model);

    // Existing path:
    // ck_adamw_update_all(model);

    // Proposed mixed optimizer path:
    ck_optimizer_route_update(model, ADAMW_FOR_NON_2D, MUON_FOR_2D);

    ck_zero_grad(model);
}

The mixed optimizer route is the real systems problem. If Muon only supports hidden-layer matrices, then the optimizer pass must split parameters into groups: Muon-owned matrices, AdamW-owned embeddings/bias/norm parameters, and possibly special cases like MoE routers.

A parameter manifest routes hidden-layer 2D matrices to Muon and biases, norm weights, embeddings, and special cases to AdamW fallback. — A practical Muon implementation is a mixed-optimizer implementation. The manifest decides which tensors are Muon-owned and which tensors remain on AdamW or another fallback.

This is where a generated-runtime project like CKE has an advantage if it is disciplined. The optimizer router should not be a pile of string checks like “if the tensor name contains wq, use Muon.” It should be a property of the model manifest and lowered training graph. Each tensor should carry enough metadata to answer: is it trainable, is it 2D, is it a hidden-layer matrix, is it sparse-like, does it need weight decay, does it use Muon, and what fallback owns it if Muon does not.
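A metadata-driven router along those lines might look like the sketch below. The `TensorMeta` fields, the tensor names, and the `route_optimizer` function are all hypothetical. They mirror the questions listed above and do not describe CKE's actual manifest format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TensorMeta:
    """Hypothetical manifest entry; field names are illustrative only."""
    name: str
    shape: Tuple[int, ...]
    trainable: bool = True
    hidden_matrix: bool = False  # set by graph lowering, not by name matching
    sparse_like: bool = False    # e.g. embedding tables
    weight_decay: bool = True


def route_optimizer(t: TensorMeta) -> str:
    """Return "muon", "adamw", or "frozen" using metadata only."""
    if not t.trainable:
        return "frozen"
    if len(t.shape) == 2 and t.hidden_matrix and not t.sparse_like:
        return "muon"
    return "adamw"  # fallback owner for everything else


if __name__ == "__main__":
    manifest = [
        TensorMeta("layer0.attn.wq", (4096, 4096), hidden_matrix=True),
        TensorMeta("layer0.mlp.up", (4096, 11008), hidden_matrix=True),
        TensorMeta("layer0.norm.scale", (4096,), weight_decay=False),
        TensorMeta("tok_embed", (32000, 4096), sparse_like=True),
    ]
    for t in manifest:
        print(f"{t.name:20s} -> {route_optimizer(t)}")
```

The router never inspects `name`, so renaming a tensor cannot silently move it to a different optimizer.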

What To Benchmark

A good Muon benchmark is not “does the optimizer step run fast once?” That is too shallow. The optimizer matters because it changes the training trajectory. A serious CKE benchmark should track:

Metric	Why it matters
loss vs tokens	Does Muon reach the same loss with fewer tokens?
loss vs wall-clock	Does better convergence beat the heavier optimizer step?
optimizer memory	How much state is carried per parameter group?
parity vs PyTorch	Does the scalar C reference match known behavior?
BF16/FP32 drift	Does repeated Newton-Schulz update drift over long runs?
shape sensitivity	Do rectangular matrices behave correctly under LR adjustment?

How This Differs From “AdamW But Faster”

A common mistake is to evaluate Muon as if it were trying to be a faster AdamW kernel. That is not the right comparison. AdamW is usually cheap per step relative to the rest of training because the update is mostly element-wise. Muon deliberately adds matrix computation to the optimizer step. The bet is that the changed update geometry can improve the training trajectory enough to justify that extra work.

That means a fair benchmark needs at least three axes: loss per token, loss per second, and total memory/runtime cost. If Muon reaches the same loss with fewer tokens but each step costs more, the question becomes whether the net training run is cheaper. If Muon uses less persistent optimizer state for matrix weights but needs more scratch workspace during the step, the question becomes whether the memory plan is better for the actual machine. This is why optimizer work belongs beside systems work, not above it.

\[ \text{useful optimizer} = \frac{\text{quality gain per token}}{\text{wall-clock cost} + \text{memory cost} + \text{implementation risk}} \] This is not a formal Muon equation. It is the engineering lens CKE should use before accepting a more complex optimizer kernel.

Why Muon Belongs After The AdamW Post

The previous optimizer post explained the default training path. Muon is the right follow-up because it asks a deeper question: should the optimizer know the tensor is a matrix? AdamW says every element gets adaptive scalar statistics. Muon says the update for a matrix can be shaped as a matrix.

That is exactly the kind of topic that belongs in this series. The point of the series is not to memorize formulas. The point is to connect math, tensor shape, kernel implementation, memory, and training behavior. Muon touches all of those layers.

Implementation thesis

CKE should implement Muon only after a scalar reference, PyTorch parity harness, optimizer parameter-routing manifest, and scratch-arena plan exist. The optimized path can then reuse GEMM kernels and later target AVX-512, AMX, ARM SVE2, or other CPU vector and matrix extensions.

Summary

Muon is best understood as a matrix-aware optimizer for hidden-layer weights. It keeps momentum, optionally uses Nesterov, transforms the update through Newton-Schulz orthogonalization, applies decoupled weight decay, adjusts learning rate based on matrix shape, and updates the weight matrix.

For C-Kernel-Engine, the key is not to chase hype. The key is to translate the optimizer into contracts: which parameters use it, which tensors are saved, which scratch buffers are required, which GEMM paths are reused, and which parity tests prove the update is correct.

That makes Muon a useful next research target. It is not part of CKE yet, but it fits the project’s direction: build the training stack from math to kernels, make every step inspectable, and only optimize after correctness is nailed down.
