Random Forest¶
Formula¶
\[
\hat f(x)=\frac{1}{B}\sum_{b=1}^{B} T_b(x)
\]
Parameters¶
- \(T_b\): tree grown on a bootstrap sample with feature subsampling
- \(B\): number of trees
What it means¶
A random forest averages the predictions of many decision trees; bootstrapping and per-split feature subsampling keep the trees decorrelated, which reduces the variance of the averaged prediction and improves generalization.
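A minimal sketch of the formula above, assuming scikit-learn and a synthetic regression dataset (both illustrative choices, not part of this entry): the forest's prediction is the average of its individual trees' predictions.

```python
# Sketch: a regression forest's prediction is the average of its trees' predictions.
# scikit-learn and the synthetic dataset are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=10, noise=0.5, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Compute (1/B) * sum_b T_b(x) by hand over the fitted trees.
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The ensemble's predict() matches the hand-computed average (up to float error).
print(np.allclose(manual_average, forest.predict(X[:5])))  # expected: True
```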
What it's used for¶
- Strong tabular baseline.
- Feature importance and robust nonlinear modeling.
Key properties¶
- Usually needs less tuning than boosting.
- Out-of-bag (OOB) estimates can provide quick validation signals (see the sketch after this list).
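A hedged sketch of the out-of-bag idea, assuming scikit-learn and an illustrative synthetic dataset: enabling oob_score yields a validation-like accuracy estimate without a separate holdout set.

```python
# Sketch of out-of-bag (OOB) validation; dataset and settings are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,   # score each sample only with trees whose bootstrap sample excluded it
    bootstrap=True,   # OOB estimates require bootstrap sampling
    random_state=0,
)
forest.fit(X, y)

# Rough generalization signal without a separate validation split.
print(f"OOB accuracy: {forest.oob_score_:.3f}")
```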
Common gotchas¶
- Impurity-based importances can be biased toward high-cardinality features (see the permutation-importance sketch after this list).
- Large forests can be memory-heavy.
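A sketch of the importance-bias gotcha, assuming scikit-learn; the dataset and settings are illustrative. Permutation importance measured on held-out data is a common, less cardinality-sensitive alternative to the built-in impurity-based importances.

```python
# Compare impurity-based importances with permutation importances.
# Dataset, split, and parameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Impurity-based importances: computed from training-time splits, can favor high-cardinality features.
print("impurity-based:", forest.feature_importances_.round(3))

# Permutation importances: score drop on held-out data when one feature is shuffled.
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
print("permutation:   ", perm.importances_mean.round(3))
```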
Example¶
A fraud model averages 500 trees trained on bootstrap samples and random feature subsets.
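A minimal sketch of a configuration like the one in this example, assuming scikit-learn; the data is a synthetic stand-in for real fraud features.

```python
# Illustrative configuration only: 500 trees, bootstrap samples, random feature subsets per split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for transaction features with rare positives.
X, y = make_classification(n_samples=5000, n_features=30, weights=[0.97, 0.03], random_state=0)

model = RandomForestClassifier(
    n_estimators=500,      # B = 500 trees, as in the example
    bootstrap=True,        # each tree is grown on a bootstrap sample
    max_features="sqrt",   # random feature subset considered at each split
    n_jobs=-1,             # trees are independent, so training parallelizes well
    random_state=0,
)
model.fit(X, y)
print(model.predict_proba(X[:3])[:, 1])  # averaged per-tree positive-class probabilities
```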
How to Compute (Pseudocode)¶
Input: training data X, y; number of trees B; tree settings
Output: trained forest {T_1, ..., T_B}

forest <- empty list
for b from 1 to B:
    draw a bootstrap sample of the training data
    grow a decision tree T_b using random feature subsampling at each split
    add T_b to the forest

# Prediction
for query x:
    aggregate the tree predictions (average for regression, vote/probability average for classification)
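The pseudocode maps onto the following hedged sketch, which uses scikit-learn decision trees as base learners (an illustrative choice, not a requirement) and implements the bootstrap-and-average loop by hand for regression.

```python
# From-scratch sketch of the pseudocode (regression case); base learner and data are assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=300, n_features=10, noise=0.5, random_state=0)

B = 100
forest = []
for b in range(B):
    # Draw a bootstrap sample of the training data (sample rows with replacement).
    idx = rng.integers(0, len(X), size=len(X))
    # Grow a tree T_b with random feature subsampling at each split (max_features).
    tree = DecisionTreeRegressor(max_features="sqrt", random_state=b)
    tree.fit(X[idx], y[idx])
    forest.append(tree)

# Prediction: average the B tree predictions for each query point.
def predict(forest, X_query):
    return np.mean([t.predict(X_query) for t in forest], axis=0)

print(predict(forest, X[:3]))
```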
Complexity¶
- Time: Training scales roughly as \(B\) times the cost of growing one randomized tree; prediction scales as \(B\) times the per-tree traversal cost (see the timing sketch after this list)
- Space: Typically \(O(B \cdot \mathrm{tree\_size})\) for storing the ensemble, plus training data storage
- Assumptions: Exact complexity depends on tree depth limits, split search strategy, feature subsampling, and whether trees are trained in parallel
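A rough way to observe the near-linear dependence on \(B\), assuming scikit-learn and synthetic data; absolute times depend on hardware, dataset size, and tree settings.

```python
# Rough timing sketch: training cost grows roughly linearly with the number of trees B.
# Dataset, tree settings, and hardware all affect the absolute numbers.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

for B in (50, 100, 200, 400):
    start = time.perf_counter()
    RandomForestClassifier(n_estimators=B, n_jobs=1, random_state=0).fit(X, y)
    print(f"B={B:4d}  fit time: {time.perf_counter() - start:.2f}s")
```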