GELU (Gaussian Error Linear Unit)¶
Formula¶
\[
\mathrm{GELU}(x)=x\,\Phi(x)
\]
\[
\mathrm{GELU}(x)\approx \frac{1}{2}x\left(1+\tanh\!\left(\sqrt{\frac{2}{\pi}}\left(x+0.044715x^3\right)\right)\right)
\]
Plot¶
fn: 0.5*x*(1+tanh(sqrt(2/PI)*(x+0.044715*x^3)))
xmin: -4
xmax: 4
ymin: -1.0
ymax: 4.2
height: 280
title: GELU(x) (tanh approximation)
Parameters¶
- \(x\): scalar input (applied elementwise)
- \(\Phi(x)\): standard normal CDF
What it means¶
GELU smoothly gates inputs by their magnitude, rather than hard-thresholding like ReLU.
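To make the gating concrete, here is a minimal sketch of the exact form \(x\,\Phi(x)\) using only Python's standard library, with \(\Phi\) expressed via the error function as \(\Phi(x)=\tfrac{1}{2}\left(1+\mathrm{erf}(x/\sqrt{2})\right)\). A small negative input is scaled down rather than zeroed out, unlike ReLU:

```python
import math

def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF,
    # computed as Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x: float) -> float:
    # Hard threshold at zero, for comparison.
    return max(0.0, x)

# ReLU hard-thresholds the input; GELU passes a scaled-down version of it.
print(relu(-0.5))  # 0.0
print(gelu(-0.5))  # roughly -0.154
```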
What it's used for¶
- Common hidden activation in Transformer MLP blocks.
- Deep models where smooth activations can help optimization.
Key properties¶
- Smooth and non-monotonic: it dips to a minimum of about \(-0.17\) near \(x\approx-0.75\) before approaching 0 for more negative inputs.
- Behaves roughly like a softened ReLU for positive inputs.
Common gotchas¶
- More expensive than ReLU: the exact form requires \(\mathrm{erf}/\Phi\), and even the tanh approximation costs more than a simple max.
- Implementation details vary (exact vs. approximate GELU), which can produce small numerical differences when porting models between frameworks.
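To gauge how much the exact and approximate variants actually differ, this sketch scans a grid over \([-4, 4]\) and records the largest absolute gap between \(x\,\Phi(x)\) and the tanh approximation. The grid spacing and range here are illustrative choices, not part of any standard:

```python
import math

def gelu_exact(x: float) -> float:
    # x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # The tanh approximation from the formula above.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Scan a uniform grid and record the worst-case discrepancy.
xs = [i / 100.0 for i in range(-400, 401)]
max_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(max_err)  # small, but nonzero
```

The discrepancy is tiny per element, but it can accumulate across layers, which is why mixing exact and approximate GELU between frameworks occasionally shifts model outputs.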
Example¶
For large positive \(x\), \(\mathrm{GELU}(x)\approx x\); for large negative \(x\), \(\mathrm{GELU}(x)\approx 0\).
How to Compute (Pseudocode)¶
Input: tensor/vector x
Output: y = GELU(x) applied elementwise
for each element x_i in x:
    y_i <- x_i * Phi(x_i)   # or a standard GELU approximation
return y
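The pseudocode above maps directly to plain Python; this sketch uses `math.erf` for \(\Phi\) and loops elementwise over a list, mirroring the loop structure rather than a vectorized tensor implementation:

```python
import math

def gelu(xs):
    # Elementwise GELU following the pseudocode: y_i = x_i * Phi(x_i),
    # with Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    ys = []
    for x in xs:
        phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
        ys.append(x * phi)
    return ys

print(gelu([-2.0, 0.0, 2.0]))  # large-negative inputs near 0, large-positive near x
```

In a real framework this loop would be a single vectorized op (e.g. a fused elementwise kernel), but the per-element arithmetic is identical.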
Complexity¶
- Time: \(O(m)\) elementwise operations for \(m\) inputs
- Space: \(O(m)\) for the output tensor/vector (or \(O(1)\) extra if done in place)
- Assumptions: Elementwise application over \(m\) scalars; exact constant factors depend on operations like \(\exp\), \(\tanh\), or \(\mathrm{erf}/\Phi\) approximations