Random Forest

Formula

\[ \hat f(x)=\frac{1}{B}\sum_{b=1}^{B} T_b(x) \]

Parameters

  • \(T_b\): tree grown on a bootstrap sample with feature subsampling
  • \(B\): number of trees

What it means

Random forest averages many decorrelated trees to reduce variance and improve generalization.
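The averaging in the formula above can be checked directly: for regression, a fitted forest's prediction equals the mean of its individual trees' predictions. A minimal sketch using scikit-learn (synthetic data for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# \hat f(x) = (1/B) * sum_b T_b(x): average the per-tree predictions manually
per_tree = np.stack([t.predict(X) for t in forest.estimators_])
manual_avg = per_tree.mean(axis=0)

assert np.allclose(manual_avg, forest.predict(X))
```

The `assert` passing illustrates that the ensemble does nothing beyond averaging the decorrelated trees; the variance reduction comes entirely from that average.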

What it's used for

  • Strong tabular baseline.
  • Feature importance and robust nonlinear modeling.

Key properties

  • Usually needs less tuning than boosting.
  • Out-of-bag estimates can provide quick validation signals.
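Because each tree is trained on a bootstrap sample, roughly a third of the rows are left out of any given tree; predicting those rows with only the trees that never saw them yields the out-of-bag (OOB) estimate. A sketch with scikit-learn, on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True computes an OOB accuracy estimate during fit,
# so no separate validation split is needed for a quick signal.
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)

print(clf.oob_score_)  # OOB accuracy, typically close to held-out accuracy
```

The OOB score is a quick validation signal, not a replacement for a proper held-out evaluation when the stakes are high.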

Common gotchas

  • Impurity-based importances can be biased toward high-cardinality features.
  • Large forests can be memory-heavy.
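One common mitigation for the importance bias noted above is permutation importance, which measures the score drop when each feature is shuffled rather than relying on impurity reductions. A sketch, assuming scikit-learn's `permutation_importance` helper and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each feature n_repeats times and record the drop in score;
# less biased toward high-cardinality features than impurity importances.
result = permutation_importance(clf, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)  # one mean importance per feature
```

For an unbiased read, run permutation importance on held-out data rather than the training set used here for brevity.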

Example

A fraud model averages 500 trees trained on bootstrap samples and random feature subsets.
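The fraud example could look like the following sketch. The dataset is a synthetic stand-in (hypothetical; real inputs would be features such as transaction amount or merchant category), with a class imbalance typical of fraud:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in for fraud data: ~3% positive class
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.97, 0.03], random_state=0)

# 500 trees on bootstrap samples, with a random feature subset per split
clf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                             class_weight="balanced", random_state=0)
clf.fit(X, y)

scores = clf.predict_proba(X)[:, 1]  # averaged fraud probability per row
```

`class_weight="balanced"` is one common choice for imbalanced targets, not something the original example specifies.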

How to Compute (Pseudocode)

Input: training data X, y; number of trees B; tree settings
Output: trained forest {T_1, ..., T_B}

forest <- empty list
for b from 1 to B:
  draw a bootstrap sample (X_b, y_b) of the training data
  grow a decision tree T_b on (X_b, y_b), choosing a random feature subset at each split
  add T_b to forest

# Prediction
for query x:
  aggregate tree predictions (average for regression, vote/probability average for classification)
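The pseudocode above can be sketched in Python for the regression case, using scikit-learn's `DecisionTreeRegressor` as the base tree (an illustrative choice, not prescribed by the pseudocode):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

B = 25
forest = []
for b in range(B):
    # Bootstrap sample: draw n rows with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" gives random feature subsampling at each split
    tree = DecisionTreeRegressor(max_features="sqrt", random_state=b)
    tree.fit(X[idx], y[idx])
    forest.append(tree)

def predict(x_query):
    # Aggregate: average for regression
    return np.mean([t.predict(x_query) for t in forest], axis=0)

pred = predict(X[:5])
```

For classification, the aggregation step would instead take a majority vote or average the per-tree class probabilities, as the pseudocode notes.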

Complexity

  • Time: Training scales roughly linearly with the number of trees \(B\), times the cost of training one randomized tree; prediction scales with \(B\) times per-tree traversal cost
  • Space: Typically \(O(B \cdot \mathrm{tree\_size})\) for storing the ensemble, plus training data storage
  • Assumptions: Exact complexity depends on tree depth limits, split search strategy, feature subsampling, and whether trees are trained in parallel

See also