Maximum Likelihood Estimation (MLE)

Formula

\[ \hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^n p(x_i\mid \theta) = \arg\max_\theta \sum_{i=1}^n \log p(x_i\mid \theta) \]

Parameters

  • \(\theta\): model parameters
  • \(x_1,\dots,x_n\): observed data, assumed i.i.d. given \(\theta\) (which justifies the product form)
  • \(p(x_i\mid \theta)\): likelihood contribution of the \(i\)-th observation

What it means

MLE chooses parameter values that make the observed data most likely under the model.
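
As a minimal illustration, the Python snippet below (a sketch; the data and candidate values are made up) scores two candidate parameter values by log-likelihood under a Bernoulli model and keeps the one that makes the observed data more likely:

import math

data = [1, 1, 0, 1]  # observed coin flips (1 = heads)

def log_likelihood(p, xs):
    # sum of log Bernoulli(p) probabilities of the observations
    return sum(math.log(p if x == 1 else 1 - p) for x in xs)

candidates = [0.5, 0.75]
best = max(candidates, key=lambda p: log_likelihood(p, data))
print(best)  # 0.75: it assigns the observed flips higher probability than 0.5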

What it's used for

  • Parameter estimation in statistics and machine learning.
  • Deriving many standard estimators and training losses.

Key properties

  • The log-likelihood is usually easier to optimize than the raw likelihood: sums replace products and numerical underflow is avoided.
  • Maximizing the likelihood is equivalent to minimizing the negative log-likelihood (NLL); see the identity below.
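
Concretely, because \(\log\) is monotone increasing and negating flips a maximum into a minimum:

\[ \hat{\theta}_{\text{MLE}} = \arg\min_\theta \left( -\sum_{i=1}^n \log p(x_i \mid \theta) \right) \]

This is why minimizing an NLL training loss is the same computation as MLE.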

Common gotchas

  • The likelihood is a function of \(\theta\), not a probability distribution over \(\theta\).
  • MLE can overfit or be unstable with small samples (e.g., the Bernoulli MLE is \(\hat{p}=0\) if no successes are observed).

Example

For i.i.d. Bernoulli data, the MLE of the success probability \(p\) is the sample mean \(\hat{p} = \frac{1}{n}\sum_{i=1}^n x_i\).
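
The derivation is a one-line calculus exercise: with \(k = \sum_{i=1}^n x_i\) successes, the log-likelihood and its stationary point are

\[ \ell(p) = k\log p + (n-k)\log(1-p), \qquad \ell'(p) = \frac{k}{p} - \frac{n-k}{1-p} = 0 \;\Rightarrow\; \hat{p} = \frac{k}{n}. \]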

How to Compute (Pseudocode)

Input: data and a parametric likelihood model p(x; theta)
Output: MLE estimate theta_hat

write the log-likelihood l(theta) = sum_i log p(x_i; theta) from the observed data
maximize l(theta) over theta (closed form if available, otherwise a numerical optimizer)
return the maximizing parameter estimate theta_hat
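
A minimal sketch of these steps in Python, assuming i.i.d. Gaussian data (the model, the synthetic data, and the choice of scipy optimizer are illustrative, not prescribed):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)  # synthetic observations

def neg_log_likelihood(params, xs):
    mu, log_sigma = params  # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    # Gaussian NLL up to an additive constant:
    # n*log(sigma) + sum((x - mu)^2) / (2*sigma^2)
    return len(xs) * np.log(sigma) + np.sum((xs - mu) ** 2) / (2 * sigma**2)

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)  # should land near the true values 2.0 and 1.5

For the Gaussian this optimum is also available in closed form (\(\hat{\mu}\) is the sample mean, \(\hat{\sigma}^2\) the mean squared deviation), which is a useful check on the numerical result.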

Complexity

  • Time: Depends on model likelihood evaluation cost and optimization method (closed-form solutions can be cheap; numerical optimization is iterative)
  • Space: Depends on parameter dimension and optimizer state, plus data storage
  • Assumptions: Parametric likelihood model specified; optimization tolerance and initialization affect practical runtime

See also