Lab note

Companion post to the Activation Functions carousel. Previously: Softmax: The Probability Engine.

The previous Softmax post ended with a strange contrast. Softmax was beautifully smooth, but its Jacobian was dense, global, and a little expensive. This sequel walks in the other direction. Activation functions are local, element-wise, and deceptively simple, yet they are the reason a transformer can represent anything more interesting than a giant linear map.

A transformer block is full of linear algebra. Queries, keys, values, projections, and feed-forward layers are all matrix multiplies. But matrix multiplies alone cannot bend a straight decision boundary into a curve. They need a nonlinearity in the middle.

That nonlinearity is the activation function. Sometimes it is a classic sigmoid or tanh. Sometimes it is the sharp hinge of ReLU. In modern language models it is usually GELU or a gated cousin such as SwiGLU.

nonlinearity: Without activations, stacked linear layers collapse into one affine map.

Roadmap for this post

We will start with the reason activations exist, move through the historical sequence sigmoid → tanh → ReLU → GELU → SwiGLU, then connect all of them to backpropagation and low-level kernels.

The unifying idea is simple: activations give networks shape, and because they are element-wise, they keep the backward pass cheap.

Section 1: Why Activations Exist

Suppose layer one computes h = W₁x + b₁ and layer two computes y = W₂h + b₂. If there is no activation between them, then you can expand the whole thing into y = W₂W₁x + W₂b₁ + b₂. That is still just one affine transformation. No matter how many such layers you stack, the composition remains affine.

This is the collapse problem. Depth looks impressive on a diagram, but without a nonlinear break between layers, the network has not gained expressive power. It has only factorized one big matrix into several smaller ones. That can help with optimization or parameterization tricks, but not with representation itself.
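The collapse is easy to confirm numerically. The sketch below uses small hand-picked matrices (the specific values are illustrative assumptions, not taken from any real model) and checks that two stacked linear layers equal one merged affine map, and that inserting a ReLU breaks the equivalence.

```python
import numpy as np

# Hand-picked illustrative weights: 2 inputs -> 2 hidden -> 1 output.
W1 = np.array([[1.0, -2.0], [0.5, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([[2.0, 1.0]])
b2 = np.array([0.5])
x = np.array([1.0, 1.0])

# Two stacked linear layers with no activation in between.
y_stacked = W2 @ (W1 @ x + b1) + b2

# The same computation written as a single affine layer.
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
y_merged = W_merged @ x + b_merged

collapses = np.allclose(y_stacked, y_merged)

# With a ReLU between the layers, the merged affine map no longer matches.
y_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2
relu_differs = not np.allclose(y_relu, y_merged)

print(collapses, relu_differs)  # True True
```

Here the hidden pre-activation is [-1, 0.5], so the ReLU zeroes the first coordinate and the output moves from -1.0 to 1.0.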

Activations break that collapse. You compute h = f(W₁x + b₁), and now layer two sees a warped version of the hidden state rather than a purely linear one. The next matrix multiply can combine those warped coordinates into curves, thresholds, and region boundaries that a single line could never express.

A tiny scalar example makes the point concrete. If f(x) = x, then 2·f(3x) is just 6x. If f(x) = max(0, x), then 2·f(3x) is 0 for negative x and 6x for positive x. The network now has a hinge.

hinge: A nonlinearity turns stacked linear maps into piecewise or smooth functions.

In practice the activation is applied element-wise. If a hidden state is an [M×N] matrix, each entry is transformed independently: Y[i, j] = f(X[i, j]). There is no mixing between columns, no normalization across the row, and no probability budget to share. That is the opposite of softmax.

Element-wise structure has a huge consequence for calculus. The Jacobian of an activation is diagonal because output coordinate y_i only depends on input coordinate x_i. All off-diagonal terms are zero. Backpropagation therefore becomes an element-wise multiply rather than a dense matrix product.

local: Softmax couples a row. Activations reshape each value independently.

One sentence summary

Linear layers move and mix information; activations bend it.

That bending is exactly what turns depth into expressive power.

A matrix view of element-wise nonlinearity

Imagine an activation applied to a 2×3 matrix. [[x₁₁, x₁₂, x₁₃], [x₂₁, x₂₂, x₂₃]] becomes [[f(x₁₁), f(x₁₂), f(x₁₃)], [f(x₂₁), f(x₂₂), f(x₂₃)]]. Each cell keeps its location. Only its value is reshaped.

That is why activation kernels are so friendly to SIMD. You can load a register of 16 floats, apply the same formula to each lane, and store the result. No lane needs to know what its neighbor is doing. Modern deep learning lives on this kind of independence.

Section 2: Sigmoid — σ(x) = 1/(1+e⁻ˣ)

Sigmoid is the classic squashing function. Its formula is:

Sigmoid Formula

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]

Large negative numbers map close to zero, large positive numbers map close to one, and zero lands exactly at 0.5.

That range (0, 1) made sigmoid the early default for neural networks. It feels probability-like. You can interpret the output as a soft yes-or-no score. That is still why it survives in binary classifiers and recurrent gates.

Sigmoid curve on x from -6 to 6 with dashed horizontal lines marking the output range between 0 and 1.

The derivative is remarkably neat. It can be written entirely in terms of the forward output:

Sigmoid Backward

\[ \sigma'(x) = \sigma(x)\left(1 - \sigma(x)\right) \]

You can reuse the forward value and avoid recomputing the exponential from scratch in the backward pass.

The bad news is hidden in the size of that derivative. It is always at most 0.25, with the maximum happening at x = 0. As soon as the neuron saturates near zero or one, the derivative shrinks toward zero. That is the vanishing gradient problem in one line.

0.25¹⁰: Ten sigmoid-like slopes can shrink a gradient to roughly one millionth.

Sigmoid derivative curve with a highlighted peak at 0.25 when x equals zero.

You can make the failure mode numerical. If ten consecutive layers each contribute a derivative around 0.25, then the chain rule multiplies them into roughly one millionth. A useful learning signal becomes microscopic before it reaches the early layers.
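A quick sketch of that arithmetic, under a simplifying assumption: it multiplies only the activation slopes and ignores the weight matrices that also appear in the real chain rule. The helper name `sigmoid_grad` is made up for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x):
    # Local slope of sigmoid, written in terms of the forward value.
    s = sigmoid(x)
    return s * (1.0 - s)


depth = 10

# Best case: every layer sits exactly at x = 0, where the slope peaks at 0.25.
best_case = sigmoid_grad(0.0) ** depth

# Saturated case: every layer sits at x = 4, far out on the flat tail.
saturated = sigmoid_grad(4.0) ** depth

print(best_case)   # about 9.5e-07
print(saturated)   # many orders of magnitude smaller still
```

Even the best case loses six orders of magnitude in ten layers, and saturation makes it far worse.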

This does not make sigmoid obsolete. It makes sigmoid specialized. When you truly want a gate that interpolates between closed and open, the (0, 1) range is perfect. That is why LSTM and GRU gates still use it, and why Swish and SwiGLU inherit it internally.

NumPy sigmoid forward and backward python
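import numpy as np


def sigmoid(x):
    # Forward: squash each element independently into (0, 1).
    # For very negative x, np.exp(-x) can overflow with a warning; the result still rounds to 0.
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_backward(x, dy):
    # Backward: dx = dy * s * (1 - s), written in terms of the forward output s.
    s = sigmoid(x)
    return dy * s * (1.0 - s)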

Take x = [-4, 0, 4]. The outputs are approximately [0.018, 0.5, 0.982]. The middle point still has some slope, but the outer points are almost frozen. That is exactly what saturation means.
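To see the saturation next to the outputs, this short sketch evaluates both the sigmoid and its slope at the same three points.

```python
import numpy as np

x = np.array([-4.0, 0.0, 4.0])

s = 1.0 / (1.0 + np.exp(-x))   # forward values
slope = s * (1.0 - s)          # local derivative at each point

print(np.round(s, 3))      # values: 0.018, 0.5, 0.982
print(np.round(slope, 3))  # values: 0.018, 0.25, 0.018
```

The slope at the outer points is roughly fourteen times smaller than at the center, which is what "almost frozen" looks like in numbers.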

At the kernel level, sigmoid is more expensive than ReLU because the exponential is expensive. The C-Kernel-Engine implementation therefore uses a polynomial approximation for exp in AVX512, processing 16 floats per vector. Backward then reduces to dx = dy * σ(x) * (1 − σ(x)), which is mostly multiplies once the forward value is available.

(0,1): Sigmoid is still useful when the model needs a bounded gate.

Mental model for sigmoid

Sigmoid is a soft switch, not a modern default hidden activation.

Use it when you want “how open is this gate?” not when you want gradients to stay large through dozens of layers.

Section 3: Tanh — tanh(x)

Tanh is the symmetric cousin of sigmoid. Its formula is:

Tanh Formula

\[ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \]

An equivalent identity is tanh(x) = 2σ(2x) − 1. So tanh is literally a rescaled sigmoid.
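The identity is easy to check numerically. This is only a sanity check on a grid of points, not a proof.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


x = np.linspace(-5.0, 5.0, 101)

lhs = np.tanh(x)
rhs = 2.0 * sigmoid(2.0 * x) - 1.0   # stretch the input by 2, then map (0, 1) onto (-1, 1)

max_gap = np.max(np.abs(lhs - rhs))
print(max_gap)  # floating-point noise only
```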

The output range is (-1, 1). That matters because the activation is now zero-centered. Negative inputs produce negative outputs and positive inputs produce positive outputs, rather than everything being pushed into the positive range as with sigmoid.

Tanh curve on x from -4 to 4 with dashed lines marking saturation at minus one and plus one.

The derivative is:

Tanh Backward

\[ \frac{d}{dx}\tanh(x) = 1 - \tanh^2(x) \]

At the origin, the slope is exactly 1. That is far better than sigmoid's maximum slope of 0.25.

Zero-centered output often makes optimization behave better. Gradients do not have to fight a built-in positive bias in the activation statistics. For older recurrent networks, tanh was often easier to train than sigmoid for exactly this reason.

2σ(2x)-1: Tanh is a zero-centered rescaling of sigmoid.

Tanh derivative curve peaking at 1.0 at the origin and shrinking toward zero as the input saturates.

But tanh is not magic. As |x| grows, tanh still saturates near ±1, so the derivative still goes to zero in the tails. The vanishing gradient problem is delayed, not eliminated.

A good scalar comparison is this. At x = 0, sigmoid outputs 0.5 with derivative 0.25. Tanh outputs 0 with derivative 1.0. Near the origin, tanh is both centered and steeper.

Modern transformers rarely use tanh as the main feed-forward activation, yet tanh still hides inside approximations. The famous fast GELU formula uses a tanh expression as its inner nonlinearity. In the C-Kernel-Engine notes, tanh does not get a standalone kernel at all because it is embedded inside tanh512_fast() for GELU.

tanh_fast: Tanh can live as a helper inside the GELU approximation path.

When to remember tanh

Tanh is the bridge between old and new neural networks.

It shows why centering and derivative scale matter, and it quietly reappears inside faster approximations for more modern activations.

Section 4: ReLU — max(0, x)

ReLU stands for rectified linear unit. Its formula is almost comically short: ReLU(x) = max(0, x). If the input is positive, pass it through. If the input is negative, clamp it to zero.

That simplicity is exactly why ReLU changed deep learning. On the positive side the derivative is 1, so gradients can flow without shrinking at every layer. The activation does not saturate for large positive values.

ReLU function with a sharp kink at the origin and a highlighted red point marking the hinge.

There is a catch, and it is mathematically interesting. ReLU is not differentiable at x = 0: the function itself is continuous, but its slope jumps from 0 to 1 there. This is the same sort of kink we mentioned in the softmax post when min and max entered the picture.

Why does practice not care? First, hitting exactly zero has measure zero for continuous inputs. Second, optimization libraries use a subgradient convention, often assigning the derivative at zero to 0. Third, away from the kink, the derivative is wonderfully simple.

max(0,x): ReLU has a kink at zero, but training uses a practical subgradient convention.

ReLU derivative shown as a step function with red markers at x equals zero indicating the undefined point.

The derivative is 0 for x < 0 and 1 for x > 0. So the positive branch behaves like an identity map during backprop. That is the anti-vanishing part of the ReLU story.
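In NumPy the forward and backward passes are each one line. This sketch assumes the common convention that the slope at exactly x = 0 is taken to be 0; the function names are chosen for this example.

```python
import numpy as np


def relu(x):
    # Forward: clamp negatives to zero, pass positives through.
    return np.maximum(0.0, x)


def relu_backward(x, dy):
    # Backward: the boolean mask (x > 0) acts as a 0/1 slope.
    # Assumed convention: the slope at exactly x == 0 is 0.
    return dy * (x > 0.0)


x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
dy = np.ones_like(x)

out = relu(x)
dx = relu_backward(x, dy)
print(out)  # values: 0, 0, 0, 0.5, 2
print(dx)   # values: 0, 0, 0, 1, 1
```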

The failure mode is the dead neuron. If a neuron falls into a regime where its pre-activation is always negative, then its output is always zero and its local gradient is always zero. Nothing nudges it back into usefulness. It becomes a permanently silent feature detector.

dead ReLU: Negative-only ReLU paths output zero and receive zero local gradient.
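A dead neuron can be reproduced in a few lines. The sketch below is a contrived setup: one neuron, inputs bounded in [-1, 1], and a bias pushed far negative, so the pre-activation can never become positive. All names and numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.uniform(-1.0, 1.0, size=(256, 3))  # batch of inputs bounded in [-1, 1]
w = np.array([0.5, -0.3, 0.2])
b = -5.0                                   # bias pushed far into the negative regime

# |w . x| <= 0.5 + 0.3 + 0.2 = 1.0, so z <= -4 for every row of the batch.
z = X @ w + b
a = np.maximum(0.0, z)

# Pretend the upstream gradient dL/da is 1 for every example.
upstream = np.ones_like(a)
dz = upstream * (z > 0.0)

grad_w = X.T @ dz   # gradient with respect to the weights
grad_b = dz.sum()   # gradient with respect to the bias

print(a.max(), grad_w, grad_b)  # all zeros: no signal reaches w or b
```

Because every gradient is exactly zero, no amount of further training on inputs like these can move w or b back toward the active region.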

Older CNNs, ResNets, and many plain MLPs used ReLU because the training speedup was dramatic compared with sigmoid-style saturating activations. Hardware liked it too. A max with zero is cheaper than an exponential, a tanh, or a cumulative Gaussian.

The C-Kernel-Engine version makes the systems point painfully clear. Forward is one AVX512 instruction: _mm512_max_ps(zero, x). Backward builds a mask with _mm512_cmp_ps_mask() and gates the incoming gradient by that mask.

max(0,x): ReLU maps directly to a cheap vector max and a backward mask.

ReLU in one sentence

ReLU won the previous era because it preserved gradient flow on the positive side and was trivial to compute.

Its weakness is that “hard zero” is both its feature and its failure mode.

Section 5: GELU — x·Φ(x) (The Smooth ReLU)

GELU stands for Gaussian Error Linear Unit. The exact definition is:

GELU Exact

\[ \mathrm{GELU}(x) = x\,\Phi(x) \]

\(\Phi\) is the standard normal cumulative distribution function.

Intuitively, it scales each value by the probability that a standard normal sample falls below it. Large positive values pass through almost unchanged, and large negative values are almost zeroed.

In production code we usually use the tanh approximation. The common fast approximation is:

GELU Tanh Approximation

\[ \mathrm{GELU}(x) \approx \frac{1}{2}x\left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right]\right) \]

That expression is smooth everywhere and cheap enough to vectorize well.
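How close is the approximation to the exact definition? The sketch below compares the two on a grid, using the identity Φ(x) = 0.5·(1 + erf(x/√2)) with `math.erf` from the standard library. The function names are chosen for this example.

```python
import math

import numpy as np


def gelu_exact(x):
    # x * Phi(x), with Phi written through the error function.
    return np.array([0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x])


def gelu_tanh(x):
    # The tanh approximation from the formula above.
    g = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + np.tanh(g))


x = np.linspace(-6.0, 6.0, 1201)
max_abs_err = np.max(np.abs(gelu_exact(x) - gelu_tanh(x)))
print(max_abs_err)  # small compared with the O(1) scale of the function
```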

GELU curve overlaid with a dashed ReLU curve, including an annotation showing the slight negative dip below zero.

Compared with ReLU, GELU does not draw a hard line at zero. Small negative values are not crushed to zero immediately. They become small negative outputs. On the negative side the function is even slightly non-monotonic: it dips to a minimum of about -0.17 near x ≈ -0.75 before rising back toward zero, which makes it act like a soft probabilistic gate.

That small negative dip is not a bug. It tells you GELU is willing to leak a little information through the negative side instead of shutting it off completely. Transformers seem to like that smoother behavior.

FC1 → GELU → FC2: GELU is the classic transformer MLP activation used in BERT and GPT-style blocks.
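You can locate the dip with a simple grid search. This sketch uses the tanh approximation, so the numbers are approximate.

```python
import numpy as np


def gelu(x):
    g = np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + np.tanh(g))


x = np.linspace(-4.0, 0.0, 4001)
y = gelu(x)

i = np.argmin(y)
x_min, y_min = x[i], y[i]
print(x_min, y_min)  # minimum near x ≈ -0.75 with value ≈ -0.17
```

To the left of that minimum the output climbs back toward zero as x becomes more negative, which is the non-monotonic stretch of the curve.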

GELU derivative shown as a smooth curve compared against ReLU derivative as a dashed step function.

The derivative looks intimidating but follows the chain rule cleanly. If \(g = \sqrt{2/\pi}(x + 0.044715x^3)\), so that \(g' = \sqrt{2/\pi}(1 + 3\cdot 0.044715x^2)\), then the derivative is:

GELU Backward

\[ \mathrm{GELU}'(x) = \frac{1}{2}\left(1 + \tanh(g)\right) + \frac{1}{2}x\,\operatorname{sech}^2(g)\,g' \]

The important practical fact is not the exact algebra. It is that the derivative changes smoothly, with no ReLU-style jump.

Smoothness helps optimization because nearby inputs produce nearby gradients. You avoid the sharp kink at zero while still keeping roughly identity-like behavior for large positive values. GELU is more expensive than ReLU, but the optimization trade-off has been worth it in transformer-scale workloads.

NumPy GELU forward and backward (tanh approximation) python
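import numpy as np

SQRT_2_OVER_PI = 0.7978845608  # sqrt(2 / pi)
C = 0.044715                   # cubic coefficient of the tanh approximation


def gelu(x):
    # Forward: 0.5 * x * (1 + tanh(g)) with g = sqrt(2/pi) * (x + C * x^3).
    g = SQRT_2_OVER_PI * (x + C * x**3)
    return 0.5 * x * (1.0 + np.tanh(g))


def gelu_backward(x, dy):
    # Backward: product rule on 0.5 * x * (1 + tanh(g)), then chain rule through g.
    g = SQRT_2_OVER_PI * (x + C * x**3)
    gp = SQRT_2_OVER_PI * (1.0 + 3.0 * C * x**2)  # dg/dx
    tanh_g = np.tanh(g)
    sech2 = 1.0 - tanh_g**2                       # d tanh(g) / dg
    return dy * (0.5 * (1.0 + tanh_g) + 0.5 * x * sech2 * gp)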

If you compare outputs at x = -1, 0, 1, ReLU gives [0, 0, 1] while GELU gives roughly [-0.159, 0, 0.841]. The positive side stays close to identity, but the negative side is softened rather than amputated. That is the whole design philosophy in three numbers.
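A finite-difference check is a cheap way to gain confidence in a hand-derived backward formula like the one above. The sketch restates the forward and derivative so it runs on its own; `gelu_grad` and `K` are names chosen for this example.

```python
import numpy as np

K = np.sqrt(2.0 / np.pi)
C = 0.044715


def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(K * (x + C * x**3)))


def gelu_grad(x):
    # Analytic derivative of the tanh approximation.
    g = K * (x + C * x**3)
    gp = K * (1.0 + 3.0 * C * x**2)
    t = np.tanh(g)
    return 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t**2) * gp


x = np.linspace(-3.0, 3.0, 13)
eps = 1e-5

# Central difference: (f(x + eps) - f(x - eps)) / (2 * eps).
numeric = (gelu(x + eps) - gelu(x - eps)) / (2.0 * eps)
max_gap = np.max(np.abs(numeric - gelu_grad(x)))
print(max_gap)  # tiny if the analytic derivative is right
```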

The C-Kernel-Engine implementation reflects that extra complexity. The gelu_fast_inplace() kernel spans hundreds of lines because it needs polynomial approximations for the inner tanh and exp operations, plus scalar, vector, and tail handling. Compared with ReLU’s one-instruction core, GELU is a tiny math pipeline.

smooth gate: GELU spends more instructions to avoid a hard ReLU threshold.

Why people call GELU a smooth ReLU

For large positive inputs, GELU behaves almost like the identity, just like ReLU.

Near zero and slightly below zero, GELU becomes soft, graded, and differentiable everywhere.

Section 6: SwiGLU — The Gated Activation

Modern large language models increasingly use a gated activation instead of a single scalar nonlinearity. SwiGLU is a popular choice. Its formula is:

SwiGLU Formula

\[ \mathrm{SwiGLU}(g, v) = \mathrm{Swish}(g)\odot v, \qquad \mathrm{Swish}(g) = g\,\sigma(g) \]

The hidden projection that feeds SwiGLU has size 2D. One half becomes the gate vector and the other half becomes the value vector. The gate goes through Swish, the value path stays linear, and then the two halves are multiplied element-wise to produce an output of size D.

Swish activation curve overlaid with ReLU and GELU to compare how the gate behaves around the origin.

This is more expressive than a plain activation because the model is not merely reshaping one vector. It is letting one learned vector decide how much of another learned vector gets through. That is the essence of a gated linear unit.

2D → D: SwiGLU splits the hidden projection into gate and value halves, then collapses back to D.

The derivative splits naturally into two branches. The two local gradients are:

SwiGLU Backward

\[ \frac{\partial L}{\partial g} = \frac{\partial L}{\partial y}\odot v\odot \mathrm{Swish}'(g), \qquad \frac{\partial L}{\partial v} = \frac{\partial L}{\partial y}\odot \mathrm{Swish}(g) \]

So backward is still just a handful of element-wise multiplies once the sigmoid has been computed.
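The two branch gradients can be checked against finite differences. In this sketch the loss is assumed to be L = sum(dy * y), so that ∂L/∂y is exactly the chosen `dy`; the function names are made up for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def swiglu(g, v):
    # Swish(g) * v, element-wise.
    return g * sigmoid(g) * v


def swiglu_grads(g, v, dy):
    s = sigmoid(g)
    swish = g * s
    swish_prime = s * (1.0 + g * (1.0 - s))
    return dy * v * swish_prime, dy * swish   # (dL/dg, dL/dv)


rng = np.random.default_rng(0)
g, v, dy = rng.normal(size=(3, 8))   # three independent vectors of length 8

dg, dv = swiglu_grads(g, v, dy)

# Everything is element-wise, so perturbing all entries at once still gives
# the per-element derivative.
eps = 1e-6
num_dg = dy * (swiglu(g + eps, v) - swiglu(g - eps, v)) / (2.0 * eps)
num_dv = dy * (swiglu(g, v + eps) - swiglu(g, v - eps)) / (2.0 * eps)

print(np.max(np.abs(dg - num_dg)), np.max(np.abs(dv - num_dv)))
```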

Swish itself is smooth and slightly negative for some negative inputs, much like GELU. That makes it a gentler gate than a hard ReLU gate would be. Instead of saying “block or pass,” it learns “suppress, leak, or amplify.”

Flow diagram showing an input of size 2D split into gate and value branches, with Swish applied to the gate before element-wise multiplication into an output of size D.

Architecturally, the transformer MLP becomes FC_gate ∥ FC_value → SwiGLU → FC_out. That is the pattern used in LLaMA, Mistral, Gemma, and many other modern models. The gate makes the feed-forward block feel less like a blunt expansion-and-contract sequence and more like a learned routing mechanism.

gated MLP: Modern LLM feed-forward blocks use gates, not just bigger ReLUs.
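Here is a minimal shape-level sketch of that block. The sizes, weight names, and random initialization are illustrative assumptions, and biases are omitted for simplicity; it is not the layout of any particular model.

```python
import numpy as np


def swish(x):
    return x / (1.0 + np.exp(-x))   # same as x * sigmoid(x)


def swiglu_mlp(x, W_gate, W_value, W_out):
    gate = x @ W_gate               # [tokens, D]
    value = x @ W_value             # [tokens, D]
    hidden = swish(gate) * value    # gated hidden state, [tokens, D]
    return hidden @ W_out           # back to [tokens, d_model]


rng = np.random.default_rng(0)
tokens, d_model, D = 4, 8, 16       # hypothetical sizes

x = rng.normal(size=(tokens, d_model))
W_gate = rng.normal(size=(d_model, D)) * 0.1
W_value = rng.normal(size=(d_model, D)) * 0.1
W_out = rng.normal(size=(D, d_model)) * 0.1

y = swiglu_mlp(x, W_gate, W_value, W_out)
print(y.shape)  # (4, 8)

# Equivalent view: one fused projection of width 2D, then split in half.
fused = x @ np.concatenate([W_gate, W_value], axis=1)   # [tokens, 2D]
gate, value = np.split(fused, 2, axis=-1)
y_fused = (swish(gate) * value) @ W_out
```

The fused form is the "2D → D" picture: one wide projection, a split, and an element-wise product that brings the width back to D.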

NumPy SwiGLU forward and backward python
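import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def swish(x):
    # Swish(x) = x * sigmoid(x)
    s = sigmoid(x)
    return x * s


def swish_prime(x):
    # d/dx [x * s] = s + x * s * (1 - s) = s * (1 + x * (1 - s))
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))


def swiglu_forward(x):
    # x has size 2D on the last axis: first half is the gate, second half the value.
    gate, value = np.split(x, 2, axis=-1)
    return swish(gate) * value


def swiglu_backward(x, dy):
    # Returns the gradient in the same [gate | value] layout as the input.
    gate, value = np.split(x, 2, axis=-1)
    gate_grad = dy * value * swish_prime(gate)
    value_grad = dy * swish(gate)
    return np.concatenate([gate_grad, value_grad], axis=-1)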

A scalar example helps. If the gate entry is -2, then Swish(-2) ≈ -0.24, so the value channel is shrunk to about a quarter of its size (with a sign flip) rather than abruptly erased. If the gate entry is 3, then Swish(3) ≈ 2.86, so the value channel passes through amplified nearly threefold.
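To put numbers on that scalar example, this sketch evaluates Swish at the two gate values and applies each to a value entry of 1.0 (the value entry is an arbitrary choice for illustration).

```python
import math


def swish(x):
    return x / (1.0 + math.exp(-x))   # x * sigmoid(x)


value = 1.0
for gate in (-2.0, 3.0):
    print(gate, round(swish(gate), 3), round(swish(gate) * value, 3))
# gate -2.0 -> Swish ≈ -0.238
# gate  3.0 -> Swish ≈  2.858
```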

The C-Kernel-Engine kernel swiglu_forward() mirrors this logic exactly. It reads the interleaved [gate|value] layout, runs a fast polynomial sigmoid on the gate half, multiplies by the gate to get Swish, and then multiplies by the value half. Backward computes separate gate and value gradients, just as the equations predict.

router: The gate decides which learned value channels pass through.

Why SwiGLU feels like the modern endpoint

Sigmoid learned to open and close a channel.

SwiGLU turns that idea into the central nonlinearity of the transformer feed-forward block.

Section 7: The Big Comparison

Seen one by one, these activations can feel like historical accidents. Seen together, they tell a coherent story about gradient flow and representational shape. The field kept moving toward activations that preserve useful slope where it matters and avoid brittle discontinuities where possible.

Overlay of sigmoid, tanh, ReLU, GELU, and Swish on a single set of axes for direct visual comparison.

Sigmoid and tanh squash strongly and saturate in both tails. ReLU throws away the negative half entirely but keeps a perfect slope of one on the positive side. GELU and Swish keep the positive friendliness of ReLU while softening the negative side and the transition through zero.

"The history of activations is really the history of asking one question more carefully: where can the gradient move, and where does it get stuck?"
A field-wide design pattern

Overlay of the derivatives of sigmoid, tanh, ReLU, GELU, and Swish showing how their gradient profiles differ.

The derivative plot is the real cheat sheet. Sigmoid never exceeds 0.25. Tanh reaches 1 but still collapses in the tails. ReLU is brutally sparse, GELU is smooth, and Swish interpolates between gating and identity-like flow.

| Activation | Formula | Range | Derivative | Max derivative | Pros | Cons | Used in |
|---|---|---|---|---|---|---|---|
| Sigmoid | 1 / (1 + e⁻ˣ) | (0, 1) | σ(x)(1-σ(x)) | 0.25 | Probability-like gate, simple backward reuse | Strong saturation and vanishing gradients | Binary outputs, LSTM/GRU gates, Swish internals |
| Tanh | tanh(x) | (-1, 1) | 1 - tanh²(x) | 1.0 | Zero-centered, steeper near zero | Still saturates for large magnitudes | Older RNNs, normalization intuition, GELU internals |
| ReLU | max(0, x) | [0, ∞) | 0 / 1 away from zero | 1.0 | Cheap, fast, no positive-side vanishing | Dead neurons, kink at zero | CNNs, ResNets, older MLPs |
| GELU | x·Φ(x) | (≈-0.17, ∞) | Smooth tanh-based expression | 1.13 near the positive shoulder | Smooth, soft negative side, great for transformers | Much more arithmetic than ReLU | BERT, GPT, many transformer MLPs |
| SwiGLU | Swish(g) ⊙ v | Output depends on value path | Two branch-wise products | Gate slope depends on Swish | Learned gating and strong empirical performance | Requires doubled hidden projection and more parameters than plain activation | LLaMA, Mistral, Gemma, modern LLMs |
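The "Max derivative" column can be reproduced with a grid search. This sketch evaluates each derivative on a dense grid; GELU uses the tanh approximation, and the Swish row is an addition for comparison (the table only says the gate slope depends on Swish).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


x = np.linspace(-8.0, 8.0, 16001)
s = sigmoid(x)

# Pieces of the GELU tanh-approximation derivative.
k = np.sqrt(2.0 / np.pi)
g = k * (x + 0.044715 * x**3)
gp = k * (1.0 + 3.0 * 0.044715 * x**2)
t = np.tanh(g)

slopes = {
    "sigmoid": s * (1.0 - s),
    "tanh": 1.0 - np.tanh(x) ** 2,
    "relu": (x > 0.0).astype(float),
    "gelu": 0.5 * (1.0 + t) + 0.5 * x * (1.0 - t**2) * gp,
    "swish": s * (1.0 + x * (1.0 - s)),
}

max_slope = {name: float(d.max()) for name, d in slopes.items()}
for name, m in max_slope.items():
    print(f"{name:8s} {m:.3f}")
```

GELU and Swish both peak slightly above 1 on the positive side, which is the "positive shoulder" the table refers to.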

The comparison table also hides a philosophical shift. Early activations were chosen mostly for mathematical convenience. Modern activations are chosen for optimization behavior inside extremely deep architectures running on vector hardware.

Section 8: Element-wise Means Diagonal Jacobian

Now return to the calculus thread that connects this post to softmax. For an element-wise activation y_i = f(x_i), output i depends on exactly one input coordinate. Therefore the Jacobian has this sparse structure:

Diagonal Jacobian

\[ \frac{\partial y_i}{\partial x_j} = \begin{cases} f'(x_i), & i = j \\ 0, & i \ne j \end{cases} \]

The Jacobian matrix is diagonal. For ReLU on a five-element vector, you might literally get diag([0, 1, 1, 0, 1]). That is not a metaphor. It is the actual linear map used by the backward pass at that point.

Two heatmaps side by side comparing a diagonal activation Jacobian against the dense Jacobian of softmax.

Backpropagation therefore simplifies to:

Activation Backward

\[ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\odot f'(x) \]

You do not need a matrix multiply with a dense Jacobian. You just multiply each incoming gradient component by the local slope of the matching input element.

diagonal: Activation Jacobians are diagonal, so backward is an element-wise multiply.
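The contrast with softmax shows up directly if you build both Jacobians by finite differences. The helper `numerical_jacobian` is written for this example only; it is a brute-force check, not how a framework computes gradients.

```python
import numpy as np


def numerical_jacobian(f, x, eps=1e-6):
    # J[i, j] = d f(x)[i] / d x[j], estimated by central differences.
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        J[:, j] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return J


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


x = np.array([-1.0, 0.5, 2.0, -0.3])
dy = np.array([0.1, -0.2, 0.3, 0.4])   # pretend upstream gradient

J_tanh = numerical_jacobian(np.tanh, x)
J_soft = numerical_jacobian(softmax, x)

off_diag = ~np.eye(x.size, dtype=bool)
tanh_is_diagonal = np.allclose(J_tanh[off_diag], 0.0)
softmax_is_dense = bool(np.all(np.abs(J_soft[off_diag]) > 1e-4))

# The vector-Jacobian product collapses to an element-wise multiply.
vjp_matches = np.allclose(J_tanh.T @ dy, dy * (1.0 - np.tanh(x) ** 2), atol=1e-6)

print(tanh_is_diagonal, softmax_is_dense, vjp_matches)  # True True True
```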

This is why activation backward kernels are usually bandwidth-bound rather than algebra-bound. The work per element is tiny: maybe a comparison and a multiply for ReLU, or a few multiplies and reused forward values for sigmoid. There is no cross-element communication step.

It is also why activations compose so naturally with the matrix layers around them. The expensive linear algebra happens in the projections. The activation just reshapes each scalar independently before handing the tensor onward. Conceptually local, computationally local.

O(N): Element-wise backward stays linear in the number of elements.

A useful contrast to memorize

Softmax asks “how should this entire row share one probability budget?”

An activation asks “how should this one scalar be reshaped before the next layer sees it?”

Section 9: What This Looks Like in C

The fastest way to see the personality of each activation is to read its kernel. Forward is usually one SIMD pass over memory. Backward is another SIMD pass that multiplies the upstream gradient by the local derivative.

What changes is how much math each element needs. ReLU is almost nothing. Sigmoid needs an exponential approximation. GELU needs a tanh chain built on top of an approximate exponential. SwiGLU adds a split, a Swish gate, and an element-wise multiply with the value path.

Simplified sigmoid kernel pattern c
#include <math.h>  // expf

// sigmoid_kernels.c: AVX512 processes 16 floats
// Uses polynomial exp approximation (5-term, ~1e-4 error)
// Scalar version shown here; the vector path swaps expf for the polynomial approximation
float sigmoid_scalar(float x) {
    return 1.0f / (1.0f + expf(-x));
}

// Backward: dx = dy * sig * (1 - sig)

Simplified ReLU kernel pattern c
// relu_kernels.c: ONE SIMD instruction!
// _mm512_max_ps(zero, input): that's the entire forward
// Backward: mask = _mm512_cmp_ps_mask(input, zero, _CMP_GT_OQ)
//           dx = _mm512_maskz_mov_ps(mask, dy)

Simplified GELU kernel pattern c
// gelu_kernels.c: 719 lines for all variants
// g = sqrt(2/pi) * (x + 0.044715 * x^3)
// output = 0.5 * x * (1 + tanh(g))
// tanh computed via polynomial exp approximation

Simplified SwiGLU kernel pattern c
// swiglu_kernels.c: input [2D] split into gate/value
// gate -> Swish(gate) = gate * sigmoid(gate)
// output = Swish(gate) ⊙ value
// Backward: dgate = dy * value * σ(g) * (1 + g*(1-σ(g)))
//           dvalue = dy * Swish(gate)

Read those snippets as a complexity ladder. ReLU forward is literally a max. Sigmoid adds a nonlinear approximation. GELU wraps that approximation inside another one. SwiGLU combines a smooth gate with a learned value path.

max vs tanh: ReLU is cheap. GELU spends more instructions to buy smoother optimization.

| Activation | Kernel file | Forward pattern | Backward pattern | SIMD width | Approx. work per element |
|---|---|---|---|---|---|
| Sigmoid | sigmoid_kernels.c | Polynomial exp then reciprocal | dy * s * (1-s) | 16 floats in AVX512 | Moderate: exp approximation + 2 multiplies |
| Tanh | Embedded in tanh512_fast() | (exp(2x)-1)/(exp(2x)+1) via polynomial exp | Usually reused inside GELU derivative logic | 16 floats in AVX512 | Moderate to high when used |
| ReLU | relu_kernels.c | max(0, x) | Mask then zero incoming gradient | 16 floats in AVX512 | Tiny: one max + one mask path |
| GELU | gelu_kernels.c | Cubic + tanh approximation + multiply | Smooth chain-rule expression | 16 floats in AVX512 | High: roughly a few dozen scalar-like ops |
| SwiGLU | swiglu_kernels.c | Split, sigmoid, multiply gate, multiply value | Two branch gradients plus Swish derivative | 16 floats per vector half | High: gate math + value blend |

The broader pattern is that activations are embarrassingly parallel. You stream through memory, transform lanes independently, and write them back. That is why even the “expensive” activations are still tractable at scale.

From a kernel engineer’s perspective, the design question becomes: how much nonlinear sophistication are you willing to buy per token per layer? Transformers answer that question differently from old CNNs. They are willing to spend more instructions on GELU or SwiGLU because the optimization behavior pays off across enormous models.

kernel cost: Smooth activations buy optimization behavior with extra instructions.

Section 10: Summary

Activation functions are the nonlinearity that makes deep learning work. Without them, depth collapses into one affine map. With them, each layer can bend, gate, saturate, or smooth the hidden representation before the next matrix multiply sees it.

  • Sigmoid is a bounded soft gate with a clean derivative, but its maximum slope of 0.25 makes gradients vanish quickly through depth.
  • Tanh fixes the zero-centering problem and reaches unit slope at the origin, yet still saturates for large magnitudes.
  • ReLU keeps gradients alive on the positive side and is almost free to compute, but its hard zero can strand neurons and its kink is not differentiable at exactly zero.
  • GELU smooths the ReLU transition and leaks small negative values, which is why transformers such as BERT and GPT adopted it.
  • SwiGLU adds learned gating on top of smooth activation behavior, which is why it dominates modern LLM feed-forward blocks.

There is a clean evolutionary story here. Sigmoid and tanh taught the field about saturation. ReLU taught the field the value of preserving slope. GELU and SwiGLU taught the field that smooth gating often beats hard clipping inside transformer-scale models.

"The arc from sigmoid → tanh → ReLU → GELU → SwiGLU is really the arc from “make it nonlinear” to “make gradients flow well at scale.”"
A one-line historical summary

Just as importantly, activations stay local. Their Jacobians are diagonal, their backpropagation is element-wise, and their kernels map naturally onto SIMD hardware. That makes them the perfect complement to the heavy matrix multiplies surrounding them.

Next in the series

We now have enough pieces to assemble the transformer’s main event.

Next: the full attention mechanism — queries, keys, values, multi-head attention, and how all the building blocks lock together into one forward pass.

Softmax showed us how a row of scores becomes a distribution. Activations show us how a tensor of features becomes trainably nonlinear. Put those ideas together and the transformer stops looking like magic and starts looking like careful engineering.