GELU (Gaussian Error Linear Unit)

Formula

\[ \mathrm{GELU}(x)=x\,\Phi(x) \]
\[ \approx \frac{1}{2}x\left(1+\tanh\!\left(\sqrt{\frac{2}{\pi}}\left(x+0.044715x^3\right)\right)\right) \]
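Both forms above can be checked numerically. Below is a minimal sketch (names `gelu_exact` and `gelu_tanh` are illustrative) comparing the exact erf-based GELU with the tanh approximation:

```python
import math

def gelu_exact(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF (via erf)."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation from the formula above."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```

The two curves agree to roughly three decimal places across typical activation ranges, which is why the tanh form is a common drop-in replacement.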

Plot: GELU(x), tanh approximation, for x in [-4, 4].

Parameters

  • \(x\): scalar input (applied elementwise)
  • \(\Phi(x)\): standard normal CDF

What it means

GELU scales each input by \(\Phi(x)\), the probability that a standard normal variable falls below \(x\). This acts as a smooth, probabilistic gate on inputs, rather than the hard threshold that ReLU applies.

What it's used for

  • Common hidden activation in Transformer MLP blocks.
  • Deep models where smooth activations can help optimization.

Key properties

  • Smooth and non-monotonic: it dips slightly below zero for moderately negative inputs before approaching zero.
  • Behaves roughly like a softened ReLU for positive inputs.
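The non-monotonicity can be located with a simple grid scan (a sketch; `gelu` is an illustrative helper, and the grid resolution is arbitrary):

```python
import math

def gelu(x):
    # exact GELU via the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# scan negative inputs to find the dip below zero
xs = [i / 1000.0 for i in range(-3000, 1)]
xmin = min(xs, key=gelu)
print(xmin, gelu(xmin))  # minimum near x ~ -0.75, value ~ -0.17
```

The minimum sits near \(x \approx -0.75\) with value \(\approx -0.17\), so unlike ReLU, GELU can output small negative values.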

Common gotchas

  • More expensive than ReLU: the exact form requires \(\mathrm{erf}/\Phi\), and even the tanh approximation costs more than a simple threshold.
  • Implementation details vary across frameworks (exact erf-based vs. tanh-approximate GELU), so outputs can differ slightly between libraries.

Example

For large positive \(x\), \(\mathrm{GELU}(x)\approx x\); for large negative \(x\), \(\mathrm{GELU}(x)\approx 0\) (approached from slightly below zero).
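These limiting behaviors are easy to verify directly (a minimal sketch; `gelu` is an illustrative helper):

```python
import math

def gelu(x):
    # exact GELU via the error function
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(gelu(5.0))   # essentially 5.0
print(gelu(-5.0))  # essentially 0 (a tiny negative number)
```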

How to Compute (Pseudocode)

Input: tensor/vector x
Output: y = GELU(x) applied elementwise

for each element x_i in x:
  y_i <- x_i * Phi(x_i)   # or a standard GELU approximation
return y
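The pseudocode above translates directly into a runnable sketch (assuming a plain Python list of floats; `gelu_elementwise` is an illustrative name):

```python
import math

def gelu_elementwise(x):
    """Apply exact GELU elementwise, mirroring the pseudocode above."""
    phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))  # standard normal CDF
    return [xi * phi(xi) for xi in x]

y = gelu_elementwise([-2.0, 0.0, 2.0])
print(y)
```

In practice, frameworks apply this as a vectorized tensor operation rather than a Python loop, but the per-element computation is the same.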

Complexity

  • Time: \(O(m)\) elementwise operations for \(m\) inputs
  • Space: \(O(m)\) for the output tensor/vector (or \(O(1)\) extra if done in place)
  • Assumptions: Elementwise application over \(m\) scalars; exact constant factors depend on operations like \(\exp\), \(\tanh\), or \(\mathrm{erf}/\Phi\) approximations

See also