Negative Log-Likelihood (NLL)¶
Formula¶
\[
\mathrm{NLL}(\theta) = -\sum_{i=1}^n \log p_\theta(x_i)
\]
Plot¶
\(-\log x\) for \(x \in (0, 1)\): the negative log-likelihood for one observed event, diverging as \(x \to 0\) and vanishing as \(x \to 1\).
Parameters¶
- \(p_\theta(x)\): model density/mass
- \(x_1,\dots,x_n\): samples
- Often averaged: \(-\frac{1}{n}\sum_i \log p_\theta(x_i)\)
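A minimal sketch of the total and averaged forms above (the helper names `nll` and `mean_nll` are our own; the inputs are the probabilities \(p_\theta(x_i)\) the model assigned to each observed sample):

```python
import math

def nll(probs):
    """Total NLL: -sum_i log p_theta(x_i).

    probs: probability the model assigned to each observed sample.
    (Illustrative helper; not from a particular library.)
    """
    return -sum(math.log(p) for p in probs)

def mean_nll(probs):
    """Averaged form: -(1/n) * sum_i log p_theta(x_i)."""
    return nll(probs) / len(probs)

# Probabilities an assumed model assigned to four observed samples
probs = [0.8, 0.6, 0.9, 0.7]
print(nll(probs))       # total NLL over the sample
print(mean_nll(probs))  # per-sample average
```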
What it means¶
Training by maximum likelihood is equivalent to minimizing NLL: the loss penalizes the model for assigning low probability to observed data, diverging as \(p_\theta(x_i) \to 0\).
What it's used for¶
- Training probabilistic models via maximum likelihood estimation.
- Comparing models by (held-out) log-likelihood.
Key properties¶
- Bernoulli (binary classification) → NLL equals binary log loss
- Categorical (multiclass) → NLL equals multiclass cross-entropy
- Gaussian regression with fixed \(\sigma\): \[ -\log \mathcal{N}(y \mid \mu_\theta(x), \sigma^2) = \frac{(y-\mu_\theta(x))^2}{2\sigma^2} + \text{const}, \] so MSE is a special case of NLL (up to constants and scaling).
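The Gaussian/MSE connection can be checked numerically: subtracting the \(\mu\)-independent constant from the exact Gaussian NLL leaves exactly the scaled squared error (helper name `gaussian_nll` is our own):

```python
import math

def gaussian_nll(y, mu, sigma):
    """Exact -log N(y | mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2)

sigma = 2.0
y, mu = 1.0, 0.3
const = 0.5 * math.log(2 * math.pi * sigma**2)   # does not depend on mu
mse_term = (y - mu) ** 2 / (2 * sigma**2)        # scaled squared error

print(gaussian_nll(y, mu, sigma) - const)  # equals mse_term
```

Because the constant does not depend on \(\mu_\theta\), minimizing this NLL over the model parameters is the same as minimizing MSE.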
Common gotchas¶
- Make sure \(p_\theta\) is a valid normalized probability density/mass.
- For densities, values can exceed 1, so \(\log p\) is positive and the per-sample NLL term negative; this is still valid.
Example¶
For a Bernoulli model with \(y=1\) and \(p=0.8\), \(\mathrm{NLL} = -\log 0.8 \approx 0.223\).
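The Bernoulli example follows from the general formula \(-\log \mathrm{Bernoulli}(y \mid p) = -\bigl(y \log p + (1-y)\log(1-p)\bigr)\); a quick check (helper name `bernoulli_nll` is our own):

```python
import math

def bernoulli_nll(y, p):
    """-log Bernoulli(y | p) = -(y*log p + (1-y)*log(1-p))."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(bernoulli_nll(1, 0.8))  # -log 0.8 ~ 0.2231
print(bernoulli_nll(0, 0.8))  # -log 0.2 ~ 1.6094
```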
How to Compute (Pseudocode)¶
Input: predicted probabilities (or model likelihoods) and true labels/observations
Output: negative log-likelihood
- For each example, look up the probability the model assigned to the observed outcome.
- Accumulate \(-\log\) of that probability.
- Average over examples if reporting the mean loss.
- Return the aggregated value.
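The pseudocode above can be sketched for the multiclass case, where each example comes with a normalized probability vector and an integer class label (helper name `nll_from_probs` and the sample data are our own):

```python
import math

def nll_from_probs(prob_rows, labels, mean=True):
    """NLL over a batch of examples.

    prob_rows: one normalized probability vector per example.
    labels: integer index of the observed class for each example.
    Accumulates -log of the probability assigned to each true label,
    then averages if `mean` is set. (Sketch; assumes valid inputs.)
    """
    total = 0.0
    for row, label in zip(prob_rows, labels):
        total -= math.log(row[label])
    return total / len(labels) if mean else total

probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
labels = [0, 1]
print(nll_from_probs(probs, labels))  # mean of -log 0.7 and -log 0.8
```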
Complexity¶
- Time: \(O(n)\) once per-example predicted probabilities/likelihood terms are available
- Space: \(O(1)\) extra space for running accumulation
- Assumptions: Exact formula depends on binary vs multiclass vs sequence likelihood setup; model forward-pass cost is excluded