GELU (Gaussian Error Linear Unit)¶
Formula¶
\[
\mathrm{GELU}(x)=x\,\Phi(x)
\]
\[
\mathrm{GELU}(x)\approx \frac{1}{2}x\left(1+\tanh\!\left(\sqrt{\frac{2}{\pi}}\left(x+0.044715x^3\right)\right)\right)
\]
Plot¶
fn: 0.5*x*(1+tanh(sqrt(2/PI)*(x+0.044715*x^3)))
xmin: -4
xmax: 4
ymin: -1.0
ymax: 4.2
height: 280
title: GELU(x) (tanh approximation)
Parameters¶
- \(x\): scalar input (applied elementwise)
- \(\Phi(x)\): standard normal CDF
What it means¶
GELU smoothly gates inputs by their magnitude, rather than hard-thresholding like ReLU.
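To make the gating concrete, here is a minimal sketch of the exact form \(x\,\Phi(x)\) using only Python's standard library, with \(\Phi\) expressed via the error function as \(\Phi(x)=\tfrac{1}{2}\left(1+\mathrm{erf}(x/\sqrt{2})\right)\). A small negative input is scaled down rather than zeroed out, unlike ReLU:

```python
import math

def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF,
    # computed as Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x: float) -> float:
    # Hard threshold at zero, for comparison.
    return max(0.0, x)

# ReLU hard-thresholds the input; GELU passes a scaled-down version of it.
print(relu(-0.5))  # 0.0
print(gelu(-0.5))  # roughly -0.154
```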
What it's used for¶
- Common hidden activation in Transformer MLP blocks.
- Deep models where smooth activations can help optimization.
Key properties¶
- Smooth and non-monotonic: it dips to a minimum of about \(-0.17\) near \(x\approx-0.75\) before approaching 0 for more negative inputs.
- Behaves roughly like a softened ReLU for positive inputs.
Common gotchas¶
- More expensive than ReLU: the exact form requires \(\mathrm{erf}/\Phi\), and even the tanh approximation costs more than a simple max.
- Implementation details vary (exact vs. approximate GELU), which can produce small numerical differences when porting models between frameworks.
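To gauge how much the exact and approximate variants actually differ, this sketch scans a grid over \([-4, 4]\) and records the largest absolute gap between \(x\,\Phi(x)\) and the tanh approximation. The grid spacing and range here are illustrative choices, not part of any standard:

```python
import math

def gelu_exact(x: float) -> float:
    # x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # The tanh approximation from the formula above.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Scan a uniform grid and record the worst-case discrepancy.
xs = [i / 100.0 for i in range(-400, 401)]
max_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(max_err)  # small, but nonzero
```

The discrepancy is tiny per element, but it can accumulate across layers, which is why mixing exact and approximate GELU between frameworks occasionally shifts model outputs.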
Example¶
For large positive \(x\), \(\mathrm{GELU}(x)\approx x\); for large negative \(x\), \(\mathrm{GELU}(x)\approx 0\).
How to Compute (Pseudocode)¶
Input: tensor/vector x
Output: y = GELU(x) applied elementwise
for each element x_i in x:
    y_i <- x_i * Phi(x_i)   # or a standard GELU approximation
return y
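The pseudocode above maps directly to plain Python; this sketch uses `math.erf` for \(\Phi\) and loops elementwise over a list, mirroring the loop structure rather than a vectorized tensor implementation:

```python
import math

def gelu(xs):
    # Elementwise GELU following the pseudocode: y_i = x_i * Phi(x_i),
    # with Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    ys = []
    for x in xs:
        phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
        ys.append(x * phi)
    return ys

print(gelu([-2.0, 0.0, 2.0]))  # large-negative inputs near 0, large-positive near x
```

In a real framework this loop would be a single vectorized op (e.g. a fused elementwise kernel), but the per-element arithmetic is identical.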
Complexity¶
- Time: \(O(m)\) elementwise operations for \(m\) inputs
- Space: \(O(m)\) for the output tensor/vector (or \(O(1)\) extra if done in place)
- Assumptions: Elementwise application over \(m\) scalars; exact constant factors depend on operations like \(\exp\), \(\tanh\), or \(\mathrm{erf}/\Phi\) approximations