Layer Normalization¶
Formula¶
\[
\mu = \frac{1}{d}\sum_{i=1}^d x_i,\quad
\sigma^2=\frac{1}{d}\sum_{i=1}^d (x_i-\mu)^2
\]
\[
\mathrm{LN}(x)=\gamma \odot \frac{x-\mu}{\sqrt{\sigma^2+\epsilon}}+\beta
\]
Parameters¶
- \(x\in\mathbb{R}^d\): features for one example/token
- \(\gamma,\beta\): learnable scale and shift
- \(\epsilon\): numerical stability constant
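As a quick worked example (with \(\gamma=\mathbf{1}\), \(\beta=\mathbf{0}\), and \(\epsilon\) negligible), take \(x=(1,3)\) and \(d=2\):
\[
\mu = \frac{1+3}{2} = 2,\quad
\sigma^2 = \frac{(1-2)^2+(3-2)^2}{2} = 1,\quad
\mathrm{LN}(x) \approx \frac{(1-2,\ 3-2)}{\sqrt{1}} = (-1,\ 1)
\]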
What it means¶
Normalizes features within a single example/token, then applies a learned affine transform.
What it's used for¶
- Stabilizing training in Transformers.
- Reducing sensitivity to activation scale.
Key properties¶
- Independent of batch size.
- Common in pre-norm and post-norm Transformer variants.
Common gotchas¶
- Normalize over the correct axis (feature dimension, not batch).
- Pre-norm vs post-norm changes training dynamics.
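The axis gotcha can be made concrete in NumPy; the shapes below are illustrative assumptions, with the batch on axis 0 and features on the last axis:

```python
import numpy as np

x = np.arange(6.0).reshape(2, 3)  # (batch=2, features=3)

# Correct for layer norm: statistics per example, over the feature axis.
mu_feat = x.mean(axis=-1, keepdims=True)   # shape (2, 1)

# Wrong for layer norm: statistics per feature, over the batch axis
# (that is batch norm's reduction, and it couples examples together).
mu_batch = x.mean(axis=0, keepdims=True)   # shape (1, 3)

print(mu_feat.ravel())   # [1. 4.]  — one mean per example
print(mu_batch.ravel())  # [1.5 2.5 3.5]  — one mean per feature
```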
Example¶
In a Transformer block, layer norm is applied either before (pre-norm) or after (post-norm) each attention/MLP sublayer, depending on the architecture.
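The two placements can be sketched as follows, with `sublayer` standing in for attention or the MLP and `ln` for a layer-norm function; the names and signatures are illustrative, not an actual API:

```python
def pre_norm_block(x, sublayer, ln):
    # Pre-norm: normalize the sublayer's input; the residual path
    # carries the un-normalized signal straight through.
    return x + sublayer(ln(x))

def post_norm_block(x, sublayer, ln):
    # Post-norm (original Transformer): normalize after the residual add.
    return ln(x + sublayer(x))
```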
How to Compute (Pseudocode)¶
Input: feature vector/tensor x for one token/example, gamma, beta, epsilon
Output: layer-normalized output
compute mean and variance over the feature dimension(s)
normalize x elementwise using those statistics
apply affine transform with gamma and beta
return output
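The steps above can be sketched in NumPy; the `eps` default and the last-axis convention are assumptions, not part of the pseudocode:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize x over its last (feature) axis, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)       # mean over the feature axis
    var = x.var(axis=-1, keepdims=True)       # variance over the feature axis
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize elementwise
    return gamma * x_hat + beta               # learned affine transform

# One token with d=4 features; gamma/beta at their usual initial values.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma=1` and `beta=0`, the output has (near-)zero mean and unit variance over the feature axis.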
Complexity¶
- Time: \(O(m)\) per normalized slice, where \(m\) is the number of elements reduced over
- Space: \(O(m)\) for activations/output, plus parameter storage for gamma/beta
- Assumptions: Normalization axes are the feature dimensions; total model cost is dominated by surrounding attention/MLP layers