Decision Tree

Formula

\[ \text{Choose split } s^* = \arg\max_s\ \Delta I(s) \]
\[ \Delta I(s)=I(\text{parent})-\sum_{c\in\{L,R\}}\frac{n_c}{n}I(c) \]
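The impurity decrease can be computed directly from the formula above. A minimal Python sketch, assuming Gini impurity and a binary split whose left/right label lists are already in hand (the function names are illustrative, not from any library):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Delta I(s) = I(parent) - sum over children of (n_c / n) * I(c)."""
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))

# A perfect split of a balanced binary node removes all impurity:
print(impurity_decrease([0, 0, 1, 1], [0, 0], [1, 1]))  # 0.5
```

The greedy learner evaluates this quantity for every candidate split `s` and keeps the maximizer `s*`.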

Parameters

  • \(I\): impurity measure (e.g., Gini or entropy for classification, MSE for regression)
  • \(s\): candidate split

What it means

A decision tree recursively partitions feature space into regions with simple prediction rules.

What it's used for

  • Interpretable tabular modeling.
  • Capturing nonlinear interactions without feature scaling.

Key properties

  • Handles mixed feature types, though many implementations require categorical features to be encoded first.
  • Can overfit without depth/min-samples constraints.

Common gotchas

  • Small data changes can produce very different trees.
  • Greedy splitting is not globally optimal.

Example

A tree might first split on income, then on account age to predict default risk.
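A fitted tree of that shape is just nested threshold rules. A hypothetical sketch (the thresholds and labels here are made up for illustration; a real tree learns them from data):

```python
def predict_default_risk(income, account_age_years):
    # Illustrative, hand-picked thresholds -- not learned from real data.
    if income < 40_000:                  # first split: income
        if account_age_years < 2:        # second split: account age
            return "high"
        return "medium"
    return "low"

print(predict_default_risk(30_000, 1))  # "high"
```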

How to Compute (Pseudocode)

Input: training data X, targets y, stopping rules
Output: decision tree

function BUILD_NODE(data):
  if stopping_rule_met(data):
    return leaf(prediction_from(data))

  evaluate candidate splits and choose best split s*
  split data into left_data, right_data using s*

  node <- internal_node(split=s*)
  node.left <- BUILD_NODE(left_data)
  node.right <- BUILD_NODE(right_data)
  return node

return BUILD_NODE((X, y))
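The pseudocode above can be sketched as a runnable Python implementation. This is a minimal greedy classifier assuming Gini impurity, exhaustive split search over `feature <= threshold` candidates, and dict-based nodes; all names (`build_node`, `predict`) are my own, and the toy data echoes the income/account-age example:

```python
from collections import Counter

def gini(y):
    """Gini impurity of a list of class labels."""
    n = len(y)
    return 1.0 - sum((c / n) ** 2 for c in Counter(y).values())

def build_node(X, y, depth=0, max_depth=3, min_samples=2):
    """Greedy recursive partitioning, mirroring BUILD_NODE above."""
    # Stopping rules: pure node, depth limit, or too few samples.
    if len(set(y)) == 1 or depth >= max_depth or len(y) < min_samples:
        return {"leaf": Counter(y).most_common(1)[0][0]}

    best = None  # (gain, feature, threshold, left indices, right indices)
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            if not left or not right:
                continue
            n = len(y)
            gain = (gini(y)
                    - len(left) / n * gini([y[i] for i in left])
                    - len(right) / n * gini([y[i] for i in right]))
            if best is None or gain > best[0]:
                best = (gain, j, t, left, right)

    if best is None or best[0] <= 0:
        return {"leaf": Counter(y).most_common(1)[0][0]}

    _, j, t, left, right = best
    return {"feature": j, "threshold": t,
            "left": build_node([X[i] for i in left], [y[i] for i in left],
                               depth + 1, max_depth, min_samples),
            "right": build_node([X[i] for i in right], [y[i] for i in right],
                                depth + 1, max_depth, min_samples)}

def predict(node, x):
    """Walk from the root down to a leaf."""
    while "leaf" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["leaf"]

# Toy data: [income in thousands, account age in years] -> risk label.
X = [[25, 1], [30, 5], [50, 2], [60, 8]]
y = ["high", "medium", "low", "low"]
tree = build_node(X, y)
print([predict(tree, x) for x in X])  # ['high', 'medium', 'low', 'low']
```

The `max_depth` and `min_samples` arguments are the depth/min-samples constraints mentioned under "Key properties"; loosening them lets the tree memorize the training set.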

Complexity

  • Time: Depends on the split search strategy and implementation; a common training-time bound is roughly \(O(n d \log n)\) to \(O(n d \cdot \text{depth})\) for greedy tree growth on tabular data
  • Space: Typically \(O(n d)\) for data storage plus tree structure and recursion/work buffers
  • Assumptions: \(n\) samples, \(d\) features; exact split search, feature types, and sorting reuse strongly affect practical complexity

See also