Causal Language Modeling

Formula

\[ \mathcal{L}=-\sum_{t=1}^T \log P(x_t\mid x_{\lt t}) \]

Parameters

  • \(x_t\): token at position \(t\)
  • \(x_{\lt t}\): all tokens before position \(t\) (the past context only)
  • \(\mathcal{L}\): negative log-likelihood training loss, summed over positions

What it means

Causal language modeling trains a model to predict each token from only the tokens that precede it, minimizing the negative log-likelihood of the sequence under the model.

What it's used for

  • Decoder-only LLM training.
  • Autoregressive text generation.

Key properties

  • Uses causal masking in self-attention.
  • Training and generation objectives align naturally.
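Causal masking can be made concrete with a minimal numpy sketch: entries above the diagonal are set to \(-\infty\) so that, after the softmax, position \(t\) assigns zero attention weight to any future position.

```python
import numpy as np

# Build a causal mask for a sequence of length 4 (toy size for illustration).
# Entry (i, j) is -inf when j > i, i.e. when position i would attend to a
# future position; adding this mask to attention scores before the softmax
# zeroes out those weights.
T = 4
mask = np.triu(np.full((T, T), -np.inf), k=1)
print(mask)
```

Row \(i\) of the mask is all zeros up to column \(i\) and \(-\infty\) afterwards, which is exactly the "past context only" constraint from the formula above.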

Common gotchas

  • "Teacher forcing" during training conditions each prediction on the ground-truth prefix, whereas free-running generation at inference conditions on the model's own previous outputs (exposure bias).
  • Sequence packing can leak context across document boundaries if the attention mask is only causal and not also block-diagonal per packed sequence.
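The packing gotcha can be illustrated with a hedged numpy sketch: a correct mask for packed sequences is causal *within* each segment and additionally blocks attention across packing boundaries. The segment layout here (two sequences of lengths 2 and 3 packed into one row) is a made-up example.

```python
import numpy as np

# Hypothetical packing: sequences of lengths 2 and 3 share one row of length 5.
seg_ids = np.array([0, 0, 1, 1, 1])   # which packed sequence each token belongs to
T = len(seg_ids)

causal = np.tril(np.ones((T, T), dtype=bool))      # position t attends to <= t
same_seq = seg_ids[:, None] == seg_ids[None, :]    # same packed segment only
mask = causal & same_seq                           # True = attention allowed

# Token 2 starts the second sequence, so it must not see tokens 0-1:
print(mask[2])   # [False False  True False False]
```

Using only the causal mask (dropping `same_seq`) would let the second sequence condition on the first, which is the leakage described above.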

Example

Given "The sky is", the model predicts the next token distribution for words like "blue".
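A toy numpy sketch of that prediction step, with made-up logits over a four-word candidate vocabulary (the scores and vocabulary are illustrative, not from a real model):

```python
import numpy as np

# Hypothetical scores for candidate next tokens after "The sky is".
vocab = ["blue", "clear", "falling", "the"]
logits = np.array([3.2, 1.1, -0.5, -2.0])   # made-up values

# Softmax turns scores into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# "blue" receives the highest probability because it has the largest logit.
print(dict(zip(vocab, probs.round(3))))
```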

How to Compute (Pseudocode)

Input: token sequence batch and a decoder-only/causal LM
Output: causal LM training loss

shift targets so each position predicts the next token
run the model with a causal mask to obtain logits for all positions
compute cross-entropy loss between the logits and the next-token targets
average (or sum) over valid positions, ignoring padding
return loss
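The steps above can be sketched in numpy for a single unpadded sequence. This is a minimal illustration assuming the model's logits are already computed; it is not a full training loop, and the function name is chosen here for clarity.

```python
import numpy as np

def causal_lm_loss(logits, tokens):
    """Mean next-token cross-entropy for one sequence.

    logits: shape (T, V) -- one score vector per position.
    tokens: shape (T,)   -- the token ids of the sequence.
    """
    # Shift: position t's logits predict token t+1.
    pred = logits[:-1]                      # (T-1, V)
    target = tokens[1:]                     # (T-1,)

    # Log-softmax computed stably by subtracting the per-row max.
    z = pred - pred.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

    # Negative log-likelihood of each true next token, averaged.
    nll = -log_probs[np.arange(len(target)), target]
    return nll.mean()
```

With all-zero logits (a uniform distribution over a vocabulary of size \(V\)), the loss is \(\log V\), which is a handy sanity check when debugging a training setup.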

Complexity

  • Time: Depends on model architecture; for Transformers, each layer costs roughly \(O(T^2 d)\) for masked self-attention plus \(O(T d^2)\) for the feed-forward block per sequence of length \(T\) with hidden size \(d\), scaled by batch size
  • Space: Depends on model size and activation storage across sequence length; naive attention stores \(O(T^2)\) scores per head and can dominate memory
  • Assumptions: Teacher-forced training on full sequences; exact complexity inherits the underlying model's (for example, a Transformer's) runtime and memory behavior

See also