Batch Normalization

Formula

\[ \hat{x}=\frac{x-\mu_B}{\sqrt{\sigma_B^2+\epsilon}},\qquad y=\gamma \hat{x}+\beta \]

Parameters

  • \(\mu_B,\sigma_B^2\): batch mean and variance
  • \(\gamma,\beta\): learnable scale and shift
  • \(\epsilon\): numerical stability constant
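
For a quick numerical check of the formula, consider a toy batch (illustrative only), taking \(\epsilon\) small enough to ignore and \(\gamma=1,\ \beta=0\):

\[ x=(1,2,3),\qquad \mu_B=2,\qquad \sigma_B^2=\tfrac{2}{3},\qquad \hat{x}\approx(-1.22,\ 0,\ 1.22),\qquad y=\hat{x} \]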

What it means

Normalizes activations using batch statistics during training, then applies a learnable affine transform.

What it's used for

  • Stabilizing and speeding up training in many CNN/MLP architectures.
  • Reducing sensitivity to initialization and learning rate.

Key properties

  • Uses batch statistics in training and running estimates at inference (see the sketch after this list).
  • Behavior depends on batch size and axis conventions.
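
A minimal sketch of the train/inference distinction using PyTorch's nn.BatchNorm1d (the feature count and batch shape are illustrative assumptions):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=4)   # gamma/beta live in bn.weight / bn.bias
x = torch.randn(8, 4)                 # toy batch: 8 samples, 4 features

bn.train()                            # training mode: normalize with batch mean/var,
y_train = bn(x)                       # which also updates bn.running_mean / bn.running_var

bn.eval()                             # inference mode: normalize with the running
y_eval = bn(x)                        # estimates accumulated during training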

Common gotchas

  • Small batch sizes make the batch statistics noisy, which can degrade BN's behavior.
  • BN is less common than layer normalization (LN) inside Transformers.

Example

CNN activations are often batch-normalized after convolution and before nonlinear activation.
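
As a sketch of that ordering in PyTorch (channel counts and kernel size are arbitrary assumptions, not prescribed by BN itself):

import torch.nn as nn

# convolution -> batch norm -> nonlinearity
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(num_features=16),  # normalizes each channel over (N, H, W)
    nn.ReLU(),
)

Setting bias=False on the convolution is a common choice here, since BN's \(\beta\) shift makes the convolution bias redundant.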

How to Compute (Pseudocode)

Input: batch activations x, learned gamma/beta, epsilon, training flag, running mean/variance state
Output: normalized activations y

if training:
  compute batch mean and variance over the normalization axes
  normalize x with the batch statistics to get x_hat
  update the running mean/variance estimates
else:
  normalize x with the running mean/variance to get x_hat
apply the affine transform: y <- gamma * x_hat + beta
return y
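
A runnable version of this pseudocode in NumPy (the 2-D (N, C) layout, the momentum value, and the state dictionary are assumptions made for illustration):

import numpy as np

def batch_norm(x, gamma, beta, state, eps=1e-5, momentum=0.1, training=True):
    """Batch normalization over the batch axis of x with shape (N, C)."""
    if training:
        mean = x.mean(axis=0)                    # batch mean per feature
        var = x.var(axis=0)                      # batch variance per feature
        # exponential moving average update of the running statistics
        state["mean"] = (1 - momentum) * state["mean"] + momentum * mean
        state["var"] = (1 - momentum) * state["var"] + momentum * var
    else:
        mean, var = state["mean"], state["var"]  # running estimates at inference
    x_hat = (x - mean) / np.sqrt(var + eps)      # normalize
    return gamma * x_hat + beta                  # learnable affine transform

state = {"mean": np.zeros(4), "var": np.ones(4)}
y = batch_norm(np.random.randn(8, 4), gamma=np.ones(4), beta=np.zeros(4), state=state)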

Complexity

  • Time: \(O(m)\) per batch for \(m\) normalized activation values (plus reduction operations for mean/variance)
  • Space: \(O(m)\) for activations and normalized outputs, plus \(O(c)\) running stats/affine parameters for normalized channels/features
  • Assumptions: Exact axes and constants depend on BN variant (1D/2D/3D) and tensor layout

See also