Layer Normalization

Formula

\[ \mu = \frac{1}{d}\sum_{i=1}^d x_i,\quad \sigma^2=\frac{1}{d}\sum_{i=1}^d (x_i-\mu)^2 \]
\[ \mathrm{LN}(x)=\gamma \odot \frac{x-\mu}{\sqrt{\sigma^2+\epsilon}}+\beta \]
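As a small worked example of the formula, take \(d=3\), \(x=(1,2,3)\), \(\gamma=\mathbf{1}\), \(\beta=\mathbf{0}\), and ignore the negligible \(\epsilon\):

\[ \mu = \frac{1+2+3}{3} = 2,\quad \sigma^2 = \frac{(1-2)^2+(2-2)^2+(3-2)^2}{3} = \tfrac{2}{3} \]
\[ \mathrm{LN}(x) \approx \frac{(-1,\,0,\,1)}{\sqrt{2/3}} \approx (-1.225,\ 0,\ 1.225) \]

The output always has (approximately) zero mean and unit variance before the affine transform.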

Parameters

  • \(x\in\mathbb{R}^d\): features for one example/token
  • \(\gamma,\beta\): learnable scale and shift
  • \(\epsilon\): numerical stability constant

What it means

Normalizes features within a single example/token, then applies a learned affine transform.

What it's used for

  • Stabilizing training in Transformers.
  • Reducing sensitivity to activation scale.

Key properties

  • Independent of batch size.
  • Common in pre-norm and post-norm Transformer variants.

Common gotchas

  • Normalize over the correct axis (feature dimension, not batch).
  • Pre-norm vs post-norm changes training dynamics.
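The axis gotcha above can be made concrete with a minimal NumPy sketch (gamma/beta omitted for brevity; the function name is illustrative, not a library API):

```python
import numpy as np

def layer_norm(x, axis=-1, eps=1e-5):
    """Normalize x over the given axis (no learned affine, for brevity)."""
    mu = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Batch of 2 tokens with 4 features each.
x = np.array([[ 1.0,  2.0,  3.0,  4.0],
              [10.0, 20.0, 30.0, 40.0]])

correct = layer_norm(x, axis=-1)  # per-token statistics: layer norm
wrong   = layer_norm(x, axis=0)   # cross-batch statistics: batch-norm-like, not layer norm

print(correct.mean(axis=-1))        # each row has mean ~0
print(np.allclose(correct, wrong))  # False: the axis choice changes the result
```

Normalizing over axis 0 mixes statistics across examples, which is exactly the bug the gotcha warns against.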

Example

In a Transformer block, layer norm is applied either before each attention/MLP sublayer (pre-norm, inside the residual branch) or after the residual addition (post-norm), depending on the architecture.
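A minimal sketch of the two placements, assuming a stand-in sublayer (a real block would use attention or an MLP) and an illustrative `layer_norm` helper:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize over the feature (last) axis; affine omitted for brevity."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_norm_block(x, sublayer):
    # Pre-norm: normalize the sublayer input; the residual path is untouched.
    return x + sublayer(layer_norm(x))

def post_norm_block(x, sublayer):
    # Post-norm: normalize after the residual addition (original Transformer layout).
    return layer_norm(x + sublayer(x))

# Stand-in sublayer for illustration only.
sublayer = lambda h: 0.5 * h

x = np.random.default_rng(0).normal(size=(2, 8))
out_pre = pre_norm_block(x, sublayer)
out_post = post_norm_block(x, sublayer)
```

Note that post-norm output is always normalized, while pre-norm leaves an unnormalized residual stream, which is one reason their training dynamics differ.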

How to Compute (Pseudocode)

Input: feature vector/tensor x for one token/example, gamma, beta, epsilon
Output: layer-normalized output

compute mean and variance over the feature dimension(s)
normalize x elementwise using those statistics
apply affine transform with gamma and beta
return output
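The steps above can be sketched in NumPy (normalizing over the last axis, a common but not universal convention):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer norm over the last axis, following the pseudocode steps."""
    # 1. mean and variance over the feature dimension
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    # 2. normalize elementwise using those statistics
    x_hat = (x - mu) / np.sqrt(var + eps)
    # 3. learned affine transform
    return gamma * x_hat + beta

d = 4
x = np.array([[1.0, 2.0, 3.0, 4.0]])
gamma = np.ones(d)   # scale, typically initialized to 1
beta = np.zeros(d)   # shift, typically initialized to 0
y = layer_norm(x, gamma, beta)
print(y)  # roughly [-1.342, -0.447, 0.447, 1.342]
```

With `gamma=1` and `beta=0` the output has zero mean and (approximately) unit variance along the feature axis.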

Complexity

  • Time: \(O(m)\) per normalized feature vector/tensor slice with \(m\) normalized elements
  • Space: \(O(m)\) for activations/output, plus parameter storage for gamma/beta
  • Assumptions: Normalization axes are the feature dimensions; total model cost is dominated by surrounding attention/MLP layers
