Softmax¶

Formula¶

\[ \mathrm{softmax}(z)_i=\frac{e^{z_i}}{\sum_j e^{z_j}} \]

Parameters¶

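\(z = (z_1, \dots, z_K)\): vector of \(K\) real-valued logits
\(\mathrm{softmax}(z)_i\): normalized probability for class/component \(i\); the sum over \(j\) in the denominator runs over all \(K\) components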

What it means¶

Softmax converts a vector of real-valued scores (logits) into a probability distribution over classes/components.

What it's used for¶

Multiclass classification output layers.
Normalizing attention scores into attention weights.
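As an illustration of the attention use case, the sketch below turns raw scores for one query into attention weights. The scaled dot-product scoring (dividing by the square root of the vector length) and the names `attention_weights`, `query`, and `keys` are assumptions made for this example; they are not defined on this page.

```python
import math

def softmax(z):
    # Stable softmax: shift by the maximum before exponentiating.
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    den = sum(exps)
    return [e / den for e in exps]

def attention_weights(query, keys):
    """Score one query against each key, then normalize the scores with softmax.

    Assumption: scores are scaled dot products, q.k / sqrt(d).
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    return softmax(scores)

# Hypothetical toy data: the first key aligns with the query, the last opposes it.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = attention_weights(query, keys)
print(weights)  # weights decrease from the first key to the last and sum to 1
```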

Key properties¶

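Outputs are strictly positive in exact arithmetic (they can underflow to 0 in floating point) and sum to 1.
Invariant to adding the same constant to all logits: \(\mathrm{softmax}(z + c\mathbf{1}) = \mathrm{softmax}(z)\) for any scalar \(c\).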
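
The shift invariance follows directly from the formula. For any scalar \(c\) (a symbol introduced here for illustration), the common factor \(e^{c}\) cancels:

\[ \mathrm{softmax}(z + c\mathbf{1})_i = \frac{e^{z_i + c}}{\sum_j e^{z_j + c}} = \frac{e^{c}\,e^{z_i}}{e^{c}\sum_j e^{z_j}} = \frac{e^{z_i}}{\sum_j e^{z_j}} = \mathrm{softmax}(z)_i \]

Choosing \(c = -\max_j z_j\) gives the numerically stable form, in which every exponent is at most 0.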

Common gotchas¶

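Compute with the stability trick: subtract \(\max_i z_i\) before exponentiating. Without it, \(e^{z_i}\) overflows for large logits; with it, the result is unchanged because softmax is invariant to constant shifts.
Large gaps between logits (not large values as such) make the distribution overly sharp, approaching one-hot.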
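
Both gotchas can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the function names `softmax_naive` and `softmax_stable` are hypothetical. In pure Python the naive version raises `OverflowError`, whereas array libraries typically produce `inf` or `nan` instead.

```python
import math

def softmax_naive(z):
    # Direct use of the formula: exp() can overflow for large logits.
    exps = [math.exp(v) for v in z]
    den = sum(exps)
    return [e / den for e in exps]

def softmax_stable(z):
    # Subtract the maximum first, so every exponent is <= 0.
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    den = sum(exps)
    return [e / den for e in exps]

big = [1000.0, 1001.0]
try:
    softmax_naive(big)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True  # math.exp(1000.0) exceeds the float range

print(naive_overflowed)      # True
print(softmax_stable(big))   # same distribution as for logits [0, 1]

# Scaling the logits widens the gap between them and sharpens the output.
for scale in (1.0, 10.0):
    print(scale, softmax_stable([scale * v for v in (1.0, 2.0)]))
```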

Example¶

If \(z=[1,2]\), then \(\mathrm{softmax}(z)\approx [0.269,0.731]\).
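
Worked out term by term (components indexed from 1, as in the pseudocode), the example reads:

\[ \mathrm{softmax}([1,2])_1 = \frac{e^{1}}{e^{1}+e^{2}} = \frac{1}{1+e^{1}} \approx \frac{1}{3.718} \approx 0.269, \qquad \mathrm{softmax}([1,2])_2 = 1 - \mathrm{softmax}([1,2])_1 \approx 0.731 \]

Only the difference \(z_2 - z_1 = 1\) matters here, so \(z=[0,1]\) or \(z=[101,102]\) would give the same result.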

How to Compute (Pseudocode)¶

Input: logits z[1..K]
Output: probabilities p[1..K]

z_max <- max_i z[i]            # numerical stability trick
for i from 1 to K:
  e[i] <- exp(z[i] - z_max)
den <- sum_{i=1..K} e[i]
for i from 1 to K:
  p[i] <- e[i] / den
return p
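
The pseudocode above could be written in Python roughly as follows. This is a sketch rather than a reference implementation: it uses only the standard library, indexes from 0 rather than 1, and assumes a non-empty list of finite logits.

```python
import math

def softmax(z):
    """Return softmax probabilities for a non-empty sequence of finite logits."""
    z_max = max(z)                           # numerical stability trick
    exps = [math.exp(v - z_max) for v in z]  # every exponent is <= 0, so no overflow
    den = sum(exps)                          # den >= 1 because exp(0) = 1 is one of the terms
    return [e / den for e in exps]

p = softmax([1.0, 2.0])
print([round(v, 3) for v in p])  # [0.269, 0.731]
```

Because the largest logit always contributes \(e^{0} = 1\) to the denominator, the division can never be by zero.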

Complexity¶

Time: \(O(K)\) for a \(K\)-class logit vector
Space: \(O(K)\) for exponentials/output probabilities (or \(O(1)\) extra in place)
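Assumptions: One logit vector shown; batched softmax repeats this work for every vector it normalizes, so total cost is linear in the batch size and in each sequence dimension (e.g., \(O(BK)\) for a batch of \(B\) logit vectors)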
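
To make the batched case concrete, the sketch below applies softmax independently to each row of a small batch. The helper name `softmax_rows` and the toy batch are hypothetical; with \(B\) rows of \(K\) logits each, the loop does \(O(BK)\) work.

```python
import math

def softmax(z):
    # O(K) time for one vector of K logits.
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    den = sum(exps)
    return [e / den for e in exps]

def softmax_rows(batch):
    """Normalize each row on its own; rows never influence each other."""
    return [softmax(row) for row in batch]

# Hypothetical batch of B = 3 logit vectors with K = 2 classes each.
batch = [[1.0, 2.0], [0.0, 0.0], [5.0, -5.0]]
probs = softmax_rows(batch)
for row in probs:
    print(row, sum(row))  # each row sums to 1 up to floating-point rounding
```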