Swish / SiLU¶
Formula¶
\[
\mathrm{Swish}(x)=x\,\sigma(\beta x)
\]
\[
\mathrm{SiLU}(x)=x\,\sigma(x)\quad (\beta=1)
\]
Plot¶
Plot: SiLU / Swish (\(\beta=1\)), i.e. \(f(x)=x/(1+e^{-x})\), over \(x\in[-6,6]\).
Parameters¶
- \(x\): scalar input (applied elementwise)
- \(\sigma(\cdot)\): the sigmoid function, \(\sigma(t)=1/(1+e^{-t})\)
- \(\beta\): slope parameter (fixed or learned)
What it means¶
Swish/SiLU multiplies the input by a sigmoid gate, giving a smooth activation that can suppress or pass values gradually.
What it's used for¶
- Hidden activations in modern CNNs/MLPs/transformer variants.
- Alternative to ReLU/GELU in some architectures.
Key properties¶
- Smooth everywhere and non-monotonic: the output dips below zero for moderately negative inputs before approaching zero.
- SiLU is a special case of Swish with \(\beta=1\).
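The non-monotonicity is easy to verify numerically. A minimal sketch in plain Python (the grid range and step size are arbitrary choices): it scans the negative axis and locates the dip, whose minimum for \(\beta=1\) sits near \(x\approx-1.28\) with value \(\approx-0.278\).

```python
import math

def silu(x):
    # SiLU: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

# Scan a grid on the negative axis; SiLU dips below zero there,
# then turns back up toward 0, so it is not monotonic.
xs = [-3.0 + 0.001 * i for i in range(3001)]
vals = [silu(x) for x in xs]
x_min = xs[vals.index(min(vals))]   # near x = -1.28
```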
Common gotchas¶
- "Swish" and "SiLU" are often used interchangeably, but Swish can include \(\beta\neq 1\).
- Slightly more compute than ReLU: each element requires a sigmoid evaluation (one \(\exp\)) plus a multiply.
Example¶
At \(x=0\), \(\mathrm{SiLU}(0)=0\). For large positive \(x\), \(\mathrm{SiLU}(x)\approx x\); for large negative \(x\), \(\mathrm{SiLU}(x)\approx 0\).
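A quick numeric check of this behavior (the sample points \(\pm 10\) are arbitrary):

```python
import math

def silu(x):
    # SiLU: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

print(silu(0.0))    # exactly 0.0
print(silu(10.0))   # ~9.9995, close to x
print(silu(-10.0))  # ~-0.00045, close to 0
```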
How to Compute (Pseudocode)¶
Input: tensor/vector x
Output: y = SiLU/Swish(x) applied elementwise
for each element x_i in x:
    y_i <- x_i * sigmoid(beta * x_i)   # beta = 1 for SiLU
return y
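The loop above vectorizes directly. A minimal NumPy sketch (the function name `swish` and the `beta` keyword are illustrative, not a library API):

```python
import numpy as np

def swish(x, beta=1.0):
    """Elementwise Swish: x * sigmoid(beta * x); beta=1 gives SiLU."""
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
y_silu  = swish(x)             # beta = 1, i.e. SiLU
y_sharp = swish(x, beta=4.0)   # larger beta sharpens the gate
```

Most frameworks ship this as a built-in (e.g. `torch.nn.SiLU` in PyTorch); note that the naive `exp(-beta*x)` here can overflow for very negative inputs, so production implementations typically use a numerically stable sigmoid.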
Complexity¶
- Time: \(O(m)\) elementwise operations for \(m\) inputs
- Space: \(O(m)\) for the output tensor/vector (or \(O(1)\) extra if done in place)
- Assumptions: Elementwise application over \(m\) scalars; exact constant factors depend on how the sigmoid's \(\exp\) is evaluated or approximated