Bootstrap¶
Formula¶
\[
\hat{\theta}^{*(b)} = s(X^{*(b)}),\quad X^{*(b)}\sim \text{resample with replacement from } X
\]
Parameters¶
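- \(X\): observed sample
- \(X^{*(b)}\): the \(b\)-th bootstrap resample, the same size as \(X\)
- \(s(\cdot)\): statistic of interest (e.g., mean, median, model metric)
- \(\hat{\theta}^{*(b)}\): value of the statistic on the \(b\)-th resample
- \(b\): bootstrap replicate index, \(b = 1, \dots, B\)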
What it means¶
The bootstrap estimates sampling variability by repeatedly resampling the observed data with replacement.
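One common way to turn the replicates into an uncertainty summary is shown below. The symbols \(\bar{\theta}^{*}\) (mean of the replicates), \(\alpha\) (one minus the confidence level), and \(q_{p}\) (the empirical \(p\)-quantile of the replicates) are introduced here for illustration. The interval shown is the percentile interval, which is only one of several bootstrap interval constructions.

\[
\widehat{\mathrm{SE}}_{\text{boot}} = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{\theta}^{*(b)} - \bar{\theta}^{*}\right)^{2}},\qquad \bar{\theta}^{*} = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^{*(b)}
\]
\[
\mathrm{CI}_{1-\alpha} = \left[\,q_{\alpha/2},\; q_{1-\alpha/2}\,\right]
\]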
What it's used for¶
- Confidence intervals, standard errors, and stability checks when analytic formulas are hard.
- Model performance uncertainty estimates.
Key properties¶
- Nonparametric and broadly applicable.
- Works best when the sample is representative of the population and the resampled units are (approximately) independent and identically distributed.
Common gotchas¶
- Resample the right unit (e.g., user/session/cluster) to match dependence structure.
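- The naive bootstrap, which resamples individual observations as if they were independent, can fail under strong dependence such as time series, because resampling breaks the dependence between neighboring observations.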
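To illustrate resampling the right unit, here is a minimal sketch of a cluster-level bootstrap, assuming rows are grouped by an identifier such as a user ID. The function name `cluster_bootstrap` and its parameters are hypothetical, chosen for this example. Whole clusters are drawn with replacement, so all rows belonging to a drawn cluster stay together.

```python
import numpy as np

def cluster_bootstrap(values, cluster_ids, statistic, n_resamples=1000, seed=0):
    """Resample whole clusters with replacement, then apply `statistic`.

    Hypothetical helper for illustration; not taken from any library.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    cluster_ids = np.asarray(cluster_ids)

    # Group row indices by cluster once, up front.
    unique_ids = np.unique(cluster_ids)
    rows_by_cluster = [np.flatnonzero(cluster_ids == cid) for cid in unique_ids]
    n_clusters = len(rows_by_cluster)

    replicates = np.empty(n_resamples)
    for b in range(n_resamples):
        # Draw cluster positions with replacement; a cluster may appear
        # several times, and all of its rows come along each time.
        picked = rng.integers(0, n_clusters, size=n_clusters)
        rows = np.concatenate([rows_by_cluster[k] for k in picked])
        replicates[b] = statistic(values[rows])
    return replicates
```

For instance, `cluster_bootstrap(revenue, user_id, np.mean)` would return one resampled mean per replicate, where `revenue` and `user_id` are assumed to be equal-length arrays with one entry per row.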
Example¶
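Estimate a 95% CI for median revenue: resample users with replacement 10,000 times, compute the median on each resample, and summarize the 10,000 medians (for example, by their 2.5th and 97.5th percentiles).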
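The example above can be sketched in NumPy as follows. The revenue values are synthetic (drawn from a lognormal distribution purely as a stand-in), one value per user is assumed, and the percentile interval is used as one possible way to form the CI.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for revenue, one value per user; not real data.
revenue = rng.lognormal(mean=3.0, sigma=1.0, size=500)

B = 10_000          # number of bootstrap resamples
n = revenue.size    # each resample has the same size as the original sample

medians = np.empty(B)
for b in range(B):
    # Resample users with replacement and recompute the statistic.
    resample = rng.choice(revenue, size=n, replace=True)
    medians[b] = np.median(resample)

# Percentile interval: 2.5th and 97.5th percentiles of the replicates.
ci_low, ci_high = np.percentile(medians, [2.5, 97.5])
print(f"median = {np.median(revenue):.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

Because each user contributes a single row here, resampling rows and resampling users coincide; with several rows per user, the resampling would need to happen at the user level instead.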
How to Compute (Pseudocode)¶
Input: dataset, statistic s(.), number of bootstrap resamples B
Output: bootstrap replicates and uncertainty summary
for b from 1 to B:
    sample a bootstrap dataset by resampling with replacement
    compute theta_star[b] <- s(resampled_data)
aggregate theta_star values (SE, CI, quantiles, etc.)
return bootstrap summary
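The pseudocode above maps fairly directly onto a small generic function. This is a sketch under stated assumptions: the name `bootstrap`, its parameters, and the shape of the returned dictionary are hypothetical, the data are assumed to be independent rows of an array, and the interval is the simple percentile interval.

```python
import numpy as np

def bootstrap(data, statistic, n_resamples=2000, alpha=0.05, seed=None):
    """Return bootstrap replicates of `statistic` plus a basic summary.

    Hypothetical helper for illustration; assumes rows of `data` are
    independent resampling units.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = data.shape[0]

    theta_star = np.empty(n_resamples)
    for b in range(n_resamples):
        # Resample row indices with replacement (same size as the data).
        idx = rng.integers(0, n, size=n)
        theta_star[b] = statistic(data[idx])

    # Aggregate: standard error and percentile interval of the replicates.
    lo, hi = np.quantile(theta_star, [alpha / 2, 1 - alpha / 2])
    return {
        "estimate": statistic(data),
        "se": theta_star.std(ddof=1),
        "ci": (lo, hi),
        "replicates": theta_star,
    }
```

A call such as `bootstrap(x, np.mean, n_resamples=5000, seed=0)` would return the point estimate on the original data alongside the bootstrap standard error and interval. Storing all replicates costs \(O(B)\) memory; a streaming variant could keep only running summaries.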
Complexity¶
- Time: \(O(B \cdot \mathrm{StatCost})\), where \(\mathrm{StatCost}\) is the cost to compute the statistic on one resample
- Space: \(O(B)\) to store bootstrap replicates (or \(O(1)\) extra if streaming a summary only) plus resample/statistic workspace
- Assumptions: Resampling unit and dependence structure must match the study design; \(B\) controls Monte Carlo error and runtime