
Swish / SiLU

Formula

\[ \mathrm{Swish}(x)=x\,\sigma(\beta x) \]
\[ \mathrm{SiLU}(x)=x\,\sigma(x)\quad (\beta=1) \]

Plot

(Plot of \(f(x)=x/(1+e^{-x})\), i.e. SiLU / Swish with \(\beta=1\), over \(x\in[-6,6]\).)

Parameters

  • \(x\): scalar input (applied elementwise)
  • \(\sigma(\cdot)\): sigmoid function
  • \(\beta\): slope parameter (fixed or learned)

What it means

Swish/SiLU multiplies the input by a sigmoid gate, giving a smooth activation that can suppress or pass values gradually.

What it's used for

  • Hidden activations in modern CNNs/MLPs/transformer variants.
  • Alternative to ReLU/GELU in some architectures.

Key properties

  • Smooth and non-monotonic: the output dips slightly below zero for negative inputs (minimum \(\approx -0.28\) for SiLU) before returning to \(0\) as \(x\to-\infty\).
  • SiLU is a special case of Swish with \(\beta=1\).
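The non-monotonicity is easy to see numerically. A minimal sketch (plain Python, no framework assumed): the SiLU minimum sits near \(x\approx-1.28\), so the function first falls and then rises as \(x\) increases.

```python
import math

def silu(x):
    """SiLU: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

print(silu(-2.0))  # ≈ -0.238
print(silu(-1.0))  # ≈ -0.269 (lower than at x = -2: not monotonic)
print(silu(0.0))   # 0.0
```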

Common gotchas

  • "Swish" and "SiLU" are often used interchangeably, but Swish can include \(\beta\neq 1\).
  • Slightly more compute than ReLU, since each element requires evaluating \(\exp\) for the sigmoid.

Example

At \(x=0\), \(\mathrm{SiLU}(0)=0\). For large positive \(x\), \(\mathrm{SiLU}(x)\approx x\); for large negative \(x\), \(\mathrm{SiLU}(x)\approx 0\).
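These limits can be checked directly; a quick sketch in plain Python:

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))  # x * sigmoid(x)

print(silu(0.0))   # 0.0 exactly
print(silu(10.0))  # ≈ 9.9995, already close to x itself
```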

How to Compute (Pseudocode)

Input: tensor/vector x
Output: y = SiLU/Swish(x) applied elementwise

for each element x_i in x:
  y_i <- x_i * sigmoid(beta * x_i)   # beta=1 for SiLU
return y
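A direct, runnable translation of the pseudocode above, as a minimal pure-Python sketch over a list of floats (deep-learning frameworks apply the same formula vectorized over tensors, e.g. `torch.nn.SiLU`):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish(x, beta=1.0):
    """Elementwise Swish: y_i = x_i * sigmoid(beta * x_i); beta=1 gives SiLU."""
    return [xi * sigmoid(beta * xi) for xi in x]
```

For example, `swish([-1.0, 0.0, 1.0])` returns roughly `[-0.269, 0.0, 0.731]`.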

Complexity

  • Time: \(O(m)\) elementwise operations for \(m\) inputs
  • Space: \(O(m)\) for the output tensor/vector (or \(O(1)\) extra if done in place)
  • Assumptions: elementwise application over \(m\) scalars; exact constant factors depend on how the sigmoid (one \(\exp\) per element) is evaluated or approximated

See also