Layer Normalization¶
Formula¶
\[
\mu = \frac{1}{d}\sum_{i=1}^d x_i,\quad
\sigma^2=\frac{1}{d}\sum_{i=1}^d (x_i-\mu)^2
\]
\[
\mathrm{LN}(x)=\gamma \odot \frac{x-\mu}{\sqrt{\sigma^2+\epsilon}}+\beta
\]
Parameters¶
- \(x\in\mathbb{R}^d\): features for one example/token
- \(\gamma,\beta\): learnable scale and shift
- \(\epsilon\): numerical stability constant
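As a quick worked example (with \(\gamma=\mathbf{1}\), \(\beta=\mathbf{0}\), and \(\epsilon\) negligible), take \(x=(1,3)\) and \(d=2\):
\[
\mu = \frac{1+3}{2} = 2,\quad
\sigma^2 = \frac{(1-2)^2+(3-2)^2}{2} = 1,\quad
\mathrm{LN}(x) \approx \frac{(1-2,\ 3-2)}{\sqrt{1}} = (-1,\ 1)
\]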
What it means¶
Normalizes features within a single example/token, then applies a learned affine transform.
What it's used for¶
- Stabilizing training in Transformers.
- Reducing sensitivity to activation scale.
Key properties¶
- Independent of batch size.
- Common in pre-norm and post-norm Transformer variants.
Common gotchas¶
- Normalize over the correct axis (feature dimension, not batch).
- Pre-norm vs post-norm changes training dynamics.
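The axis gotcha can be made concrete in NumPy; the shapes below are illustrative assumptions, with the batch on axis 0 and features on the last axis:

```python
import numpy as np

x = np.arange(6.0).reshape(2, 3)  # (batch=2, features=3)

# Correct for layer norm: statistics per example, over the feature axis.
mu_feat = x.mean(axis=-1, keepdims=True)   # shape (2, 1)

# Wrong for layer norm: statistics per feature, over the batch axis
# (that is batch norm's reduction, and it couples examples together).
mu_batch = x.mean(axis=0, keepdims=True)   # shape (1, 3)

print(mu_feat.ravel())   # [1. 4.]  — one mean per example
print(mu_batch.ravel())  # [1.5 2.5 3.5]  — one mean per feature
```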
Example¶
In a Transformer block, layer norm is applied either before (pre-norm) or after (post-norm) each attention/MLP sublayer, depending on the architecture.
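The two placements can be sketched as follows, with `sublayer` standing in for attention or the MLP and `ln` for a layer-norm function; the names and signatures are illustrative, not an actual API:

```python
def pre_norm_block(x, sublayer, ln):
    # Pre-norm: normalize the sublayer's input; the residual path
    # carries the un-normalized signal straight through.
    return x + sublayer(ln(x))

def post_norm_block(x, sublayer, ln):
    # Post-norm (original Transformer): normalize after the residual add.
    return ln(x + sublayer(x))
```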
How to Compute (Pseudocode)¶
Input: feature vector/tensor x for one token/example, gamma, beta, epsilon
Output: layer-normalized output
compute mean and variance over the feature dimension(s)
normalize x elementwise using those statistics
apply affine transform with gamma and beta
return output
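The steps above can be sketched in NumPy; the `eps` default and the last-axis convention are assumptions, not part of the pseudocode:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize x over its last (feature) axis, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)       # mean over the feature axis
    var = x.var(axis=-1, keepdims=True)       # variance over the feature axis
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize elementwise
    return gamma * x_hat + beta               # learned affine transform

# One token with d=4 features; gamma/beta at their usual initial values.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma=1` and `beta=0`, the output has (near-)zero mean and unit variance over the feature axis.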
Complexity¶
- Time: \(O(m)\) per normalized slice, where \(m\) is the number of elements reduced over
- Space: \(O(m)\) for activations/output, plus parameter storage for gamma/beta
- Assumptions: Normalization axes are the feature dimensions; total model cost is dominated by surrounding attention/MLP layers