Swish / SiLU¶
Formula¶
\[
\mathrm{Swish}(x)=x\,\sigma(\beta x)
\]
\[
\mathrm{SiLU}(x)=x\,\sigma(x)\quad (\beta=1)
\]
Plot¶
Plot: SiLU / Swish (\(\beta=1\)), i.e. \(f(x)=x/(1+e^{-x})\), over \(x\in[-6,6]\).
Parameters¶
- \(x\): scalar input (applied elementwise)
- \(\sigma(\cdot)\): the sigmoid function, \(\sigma(t)=1/(1+e^{-t})\)
- \(\beta\): slope parameter (fixed or learned)
What it means¶
Swish/SiLU multiplies the input by a sigmoid gate, giving a smooth activation that can suppress or pass values gradually.
What it's used for¶
- Hidden activations in modern CNNs/MLPs/transformer variants.
- Alternative to ReLU/GELU in some architectures.
Key properties¶
- Smooth everywhere and non-monotonic: the output dips below zero for moderately negative inputs before approaching zero.
- SiLU is a special case of Swish with \(\beta=1\).
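The non-monotonicity is easy to verify numerically. A minimal sketch in plain Python (the grid range and step size are arbitrary choices): it scans the negative axis and locates the dip, whose minimum for \(\beta=1\) sits near \(x\approx-1.28\) with value \(\approx-0.278\).

```python
import math

def silu(x):
    # SiLU: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

# Scan a grid on the negative axis; SiLU dips below zero there,
# then turns back up toward 0, so it is not monotonic.
xs = [-3.0 + 0.001 * i for i in range(3001)]
vals = [silu(x) for x in xs]
x_min = xs[vals.index(min(vals))]   # near x = -1.28
```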
Common gotchas¶
- "Swish" and "SiLU" are often used interchangeably, but Swish can include \(\beta\neq 1\).
- Slightly more compute than ReLU: each element requires a sigmoid evaluation (one \(\exp\)) plus a multiply.
Example¶
At \(x=0\), \(\mathrm{SiLU}(0)=0\). For large positive \(x\), \(\mathrm{SiLU}(x)\approx x\); for large negative \(x\), \(\mathrm{SiLU}(x)\approx 0\).
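A quick numeric check of this behavior (the sample points \(\pm 10\) are arbitrary):

```python
import math

def silu(x):
    # SiLU: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

print(silu(0.0))    # exactly 0.0
print(silu(10.0))   # ~9.9995, close to x
print(silu(-10.0))  # ~-0.00045, close to 0
```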
How to Compute (Pseudocode)¶
Input: tensor/vector x
Output: y = SiLU/Swish(x) applied elementwise
for each element x_i in x:
    y_i <- x_i * sigmoid(beta * x_i)   # beta = 1 for SiLU
return y
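The loop above vectorizes directly. A minimal NumPy sketch (the function name `swish` and the `beta` keyword are illustrative, not a library API):

```python
import numpy as np

def swish(x, beta=1.0):
    """Elementwise Swish: x * sigmoid(beta * x); beta=1 gives SiLU."""
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
y_silu  = swish(x)             # beta = 1, i.e. SiLU
y_sharp = swish(x, beta=4.0)   # larger beta sharpens the gate
```

Most frameworks ship this as a built-in (e.g. `torch.nn.SiLU` in PyTorch); note that the naive `exp(-beta*x)` here can overflow for very negative inputs, so production implementations typically use a numerically stable sigmoid.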
Complexity¶
- Time: \(O(m)\) elementwise operations for \(m\) inputs
- Space: \(O(m)\) for the output tensor/vector (or \(O(1)\) extra if done in place)
- Assumptions: Elementwise application over \(m\) scalars; exact constant factors depend on how the sigmoid's \(\exp\) is evaluated or approximated