MLP (Multi-Layer Perceptron)¶
Formula¶
\[
h^{(1)}=\phi(W^{(1)}x+b^{(1)}),\quad
h^{(\ell+1)}=\phi(W^{(\ell+1)}h^{(\ell)}+b^{(\ell+1)})
\]
\[
\hat{y}=W^{(L)}h^{(L-1)}+b^{(L)}
\]
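Unrolling the recursion with \(L=2\) gives the familiar one-hidden-layer form:
\[
\hat{y}=W^{(2)}\phi\!\left(W^{(1)}x+b^{(1)}\right)+b^{(2)}
\]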
Parameters¶
- \(x\): input vector
- \(W^{(\ell)}, b^{(\ell)}\): weights and biases at layer \(\ell\)
- \(\phi\): activation function (e.g., ReLU, GELU)
- \(\hat{y}\): output (logit, regression value, etc.)
What it means¶
An MLP is a feedforward neural network built by stacking affine (linear) layers, each followed by a nonlinear activation; the final layer is typically left linear so its output can serve directly as logits or regression values.
What it's used for¶
- Tabular data modeling and generic function approximation.
- Classifier/regressor heads on top of learned embeddings.
- Transformer feedforward blocks (position-wise MLPs); a sketch of this block follows the list.
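To make the last use case concrete, here is a minimal PyTorch sketch of a position-wise feedforward block; the dimensions d_model=512 and d_ff=2048 are illustrative assumptions, not values from this page.
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Two-layer MLP applied independently at every sequence position."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):  # illustrative sizes
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # W^(1), b^(1)
            nn.GELU(),                  # phi
            nn.Linear(d_ff, d_model),   # W^(2), b^(2); final layer is linear
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); nn.Linear acts on the last dimension,
        # so the same MLP is applied at each position.
        return self.net(x)

ffn = PositionwiseFFN()
out = ffn(torch.randn(2, 16, 512))  # -> (2, 16, 512)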
Key properties¶
- Universal approximation (with sufficient width under standard conditions).
- Capacity depends on depth, width, activation choice, and regularization.
Common gotchas¶
- Without activations, stacked linear layers collapse to a single linear map (see the numeric check after this list).
- A plain MLP does not model sequence order or graph structure on its own; any such structure must come from the surrounding architecture or the input features.
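A minimal NumPy check of the first gotcha, with arbitrary illustrative shapes: two stacked linear layers without an activation equal a single linear layer whose weight is the product of the two.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(128, 64)), rng.normal(size=128)
W2, b2 = rng.normal(size=(10, 128)), rng.normal(size=10)
x = rng.normal(size=64)

# Two linear layers with no activation in between...
stacked = W2 @ (W1 @ x + b1) + b2
# ...equal one linear layer with W = W2 W1 and b = W2 b1 + b2.
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)
assert np.allclose(stacked, collapsed)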
Example¶
A 2-layer MLP classifier might map \(x\in\mathbb{R}^{d}\) to hidden size 128 with GELU, then to class logits with a final linear layer.
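A minimal PyTorch sketch of that example; the input dimension d=32 and the 10 classes are illustrative assumptions.
import torch
import torch.nn as nn

d, num_classes = 32, 10  # illustrative sizes, not from the text
mlp = nn.Sequential(
    nn.Linear(d, 128),            # W^(1), b^(1): d -> hidden size 128
    nn.GELU(),                    # phi
    nn.Linear(128, num_classes),  # final linear layer producing class logits
)
logits = mlp(torch.randn(4, d))   # batch of 4 inputs -> (4, num_classes)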
How to Compute (Pseudocode)¶
Input:  x and layer parameters {(W^(l), b^(l)) for l = 1..L}
Output: network output y_hat
h <- x
for l = 1 to L-1:                    # hidden layers
    h <- activation(W^(l) h + b^(l))
y_hat <- W^(L) h + b^(L)             # final layer is linear
return y_hat
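The same computation as a runnable NumPy sketch; the layer widths in the usage example are illustrative assumptions.
import numpy as np

def mlp_forward(x, params, activation=lambda z: np.maximum(z, 0.0)):
    # params: list of (W, b) pairs, one per layer; ReLU on hidden layers,
    # final layer left linear, matching the pseudocode above.
    h = x
    for W, b in params[:-1]:  # hidden layers l = 1..L-1
        h = activation(W @ h + b)
    W_L, b_L = params[-1]
    return W_L @ h + b_L

# Usage with illustrative widths 64 -> 128 -> 10.
rng = np.random.default_rng(0)
params = [
    (0.1 * rng.normal(size=(128, 64)), np.zeros(128)),
    (0.1 * rng.normal(size=(10, 128)), np.zeros(10)),
]
y_hat = mlp_forward(rng.normal(size=64), params)  # shape (10,)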
Complexity¶
- Time: dominated by the dense matrix multiplies, roughly \(O(\sum_{\ell} d_{\ell-1} d_{\ell})\) per input, where \(d_{\ell}\) is the width of layer \(\ell\); activations add only a lower-order \(O(d_{\ell})\) term per layer.
- Space: \(O(\sum_{\ell} d_{\ell-1} d_{\ell})\) for parameters, plus \(O(\sum_{\ell} d_{\ell})\) per input when activations are cached for backpropagation.
- Assumptions: dense MLP as shown; multiply per-input costs by the batch size for concrete runtime/memory.
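A short Python sketch of these counts, with illustrative widths:
def mlp_costs(widths):
    # widths: [d_0, d_1, ..., d_L], input width through output width.
    # Returns (parameter count, multiply-accumulates per input).
    params = sum(d_in * d_out + d_out for d_in, d_out in zip(widths, widths[1:]))
    macs = sum(d_in * d_out for d_in, d_out in zip(widths, widths[1:]))
    return params, macs

print(mlp_costs([64, 128, 10]))  # -> (9610, 9472)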