MLP (Multi-Layer Perceptron)

Formula

\[ h^{(1)}=\phi(W^{(1)}x+b^{(1)}),\quad h^{(\ell+1)}=\phi(W^{(\ell+1)}h^{(\ell)}+b^{(\ell+1)}),\quad \ell=1,\dots,L-2 \]

\[ \hat{y}=W^{(L)}h^{(L-1)}+b^{(L)} \]

Parameters

\(x\): input vector
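\(W^{(\ell)}, b^{(\ell)}\): weight matrix and bias vector of layer \(\ell\), for \(\ell=1,\dots,L\)
\(h^{(\ell)}\): hidden activation vector produced by layer \(\ell\)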
\(\phi\): activation function (e.g., ReLU, GELU)
\(\hat{y}\): output (logit, regression value, etc.)

What it means

An MLP is a feedforward neural network made of stacked linear layers plus nonlinear activations.

What it's used for

Tabular data modeling and generic function approximation.
Classifier/regressor heads on top of learned embeddings.
Transformer feedforward blocks (position-wise MLPs).

Key properties

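Universal approximation: with a non-polynomial activation, a single hidden layer of sufficient width can approximate any continuous function on a compact domain to arbitrary accuracy. The result guarantees existence only; it does not say how wide the layer must be or whether training will find the weights.
Effective capacity depends on depth, width, activation choice, and regularization.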

Common gotchas

Without activations, stacked layers collapse to a single affine map (one weight matrix plus one bias), no matter how many layers are stacked.
MLPs do not explicitly model sequence order or graph structure on their own.
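
To see why stacking layers without activations adds nothing, take two layers and drop \(\phi\). Substituting the first layer into the second gives:

\[ W^{(2)}(W^{(1)}x+b^{(1)})+b^{(2)} = (W^{(2)}W^{(1)})\,x + (W^{(2)}b^{(1)}+b^{(2)}) \]

The right-hand side is one layer with weight \(W^{(2)}W^{(1)}\) and bias \(W^{(2)}b^{(1)}+b^{(2)}\). Repeating the substitution reduces any number of activation-free layers in the same way.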

Example

A 2-layer MLP classifier might map \(x\in\mathbb{R}^{d}\) to hidden size 128 with GELU, then to class logits with a final linear layer.
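
Written out with shapes, and assuming \(C\) output classes (\(C\) is a symbol introduced here for illustration, not part of the notation above), that example reads:

\[ h^{(1)}=\mathrm{GELU}(W^{(1)}x+b^{(1)}),\quad W^{(1)}\in\mathbb{R}^{128\times d},\ b^{(1)}\in\mathbb{R}^{128} \]

\[ \hat{y}=W^{(2)}h^{(1)}+b^{(2)},\quad W^{(2)}\in\mathbb{R}^{C\times 128},\ b^{(2)}\in\mathbb{R}^{C} \]

Under these assumptions the model has \(128d+128+128C+C\) trainable parameters.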

How to Compute (Pseudocode)

Input: x and layer parameters {(W^(l), b^(l))} for l = 1..L
Output: network output y_hat

h <- x
for l = 1 to L-1:            (hidden layers)
  h <- activation(W^(l) h + b^(l))
y_hat <- W^(L) h + b^(L)     (output layer, no activation)
return y_hat
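
The pseudocode above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the names `relu`, `affine`, and `mlp_forward` and the toy weights are hypothetical, ReLU is assumed as the activation, and nested lists stand in for the arrays a numerical library would normally provide.

```python
def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, z) for z in v]


def affine(W, b, h):
    """Compute W h + b, where W is a list of rows and b, h are lists of floats."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]


def mlp_forward(x, layers, activation=relu):
    """Forward pass: apply the activation after every layer except the last."""
    h = x
    for W, b in layers[:-1]:       # hidden layers 1 .. L-1
        h = activation(affine(W, b, h))
    W_out, b_out = layers[-1]      # output layer L, no activation
    return affine(W_out, b_out, h)


# Hypothetical toy network: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[2.0, 1.0]], [0.5]),
]
y_hat = mlp_forward([3.0, 1.0], layers)
print(y_hat)
```

Passing `activation` as an argument keeps the loop identical for ReLU, GELU, or any other elementwise function.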

Complexity

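Time: \(O\big(B\sum_{\ell=1}^{L} n_{\ell-1}n_{\ell}\big)\) for a forward pass with batch size \(B\) and layer widths \(n_0,\dots,n_L\); the dense matrix multiplies dominate, while activations add only \(O\big(B\sum_{\ell} n_{\ell}\big)\)
Space: \(O\big(\sum_{\ell} n_{\ell-1}n_{\ell}\big)\) for parameters, plus \(O\big(B\sum_{\ell} n_{\ell}\big)\) for activations when they are cached for backpropagation (inference only needs the current layer's input and output)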
Assumptions: Dense MLP shown; batch size and layer widths determine concrete runtime/memory costs
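
As a rough way to turn these costs into numbers, the sketch below counts parameters and multiply-adds for a dense MLP from its layer widths. The function name `dense_mlp_cost`, the example widths, and the batch size are assumptions made for illustration. The activation count assumes the input and every layer output are kept for backpropagation, which real frameworks may or may not do.

```python
def dense_mlp_cost(widths, batch_size=1):
    """Parameter and multiply-add counts for a dense MLP.

    widths lists the layer sizes from input to output,
    e.g. [d, 128, C] for one hidden layer of 128 units.
    """
    params = 0
    macs = 0
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        params += n_in * n_out + n_out      # weight matrix plus bias vector
        macs += batch_size * n_in * n_out   # one multiply-add per weight per example
    # Values held if the input and every layer output are cached.
    cached = batch_size * sum(widths)
    return {"params": params, "macs": macs, "cached_activations": cached}


# Hypothetical sizes: 64 inputs, 128 hidden units, 10 outputs, batch of 32.
cost = dense_mlp_cost([64, 128, 10], batch_size=32)
print(cost)
```

Only the multiply-add and cached-activation counts scale with batch size; the parameter count does not.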