Interpretability is indirect: the hundreds of underlying trees prevent direct inspection. Common interpretation tools include feature importances and partial dependence plots.
Model Properties
Low Variance Model: Random forests achieve low variance through averaging, so predictions stay stable when the training data changes slightly. This property makes them common in production pipelines.
Extrapolation Limitations
Random forests cannot extrapolate outside the training range. Beyond it, predictions flatten to the average of the training targets in the nearest leaf regions.
Example:
- Training data: inputs confined to a bounded range, e.g. $x \in [0, 10]$
- Prediction at a point outside that range, e.g. $x = 15$, typically returns a value close to the forest's output at the edge of the training region
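This flattening behaviour is easy to demonstrate. The sketch below (assuming scikit-learn is available; the target $y = 2x$ and the ranges are illustrative choices) trains a forest on inputs in $[0, 10]$ and queries it well outside that range:

```python
# Sketch: random forests cannot extrapolate beyond the training range.
# The linear target y = 2x and the ranges are illustrative choices.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(500, 1))   # training inputs in [0, 10]
y_train = 2 * X_train.ravel()                 # simple linear target

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Inside the range the forest tracks the line; outside it the prediction
# flattens to roughly the value at the edge of the training region.
inside = forest.predict([[5.0]])[0]    # close to 10
outside = forest.predict([[15.0]])[0]  # close to 20, NOT 30
print(inside, outside)
```

A linear model would return 30 at $x = 15$; the forest returns roughly the value of its rightmost leaves.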
Function Approximation
Random forests approximate functions through local partitioning of the feature space, producing a piecewise-constant prediction:

$$\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)$$

where each tree $T_b$ is constant within every leaf region. Averaging over the $B$ trees smooths the discontinuities of the individual piecewise surfaces.
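The piecewise-constant nature is visible directly: a single shallow tree evaluated on a fine input grid produces only a handful of distinct output values. A minimal sketch, assuming scikit-learn and an illustrative $\sin$ target:

```python
# Sketch: tree predictions are piecewise constant — on a fine grid a single
# depth-3 tree produces at most 2**3 = 8 distinct output values.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X.ravel())      # illustrative smooth target

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
grid = np.linspace(0, 1, 1000).reshape(-1, 1)
n_levels = len(np.unique(tree.predict(grid)))  # number of constant pieces
print(n_levels)
```

Averaging many such trees yields many more (but still finitely many) levels, which is why the forest surface looks smoother than any single tree.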
Training Procedure
Random forests follow a simple randomized training procedure:
- Draw bootstrap samples
- Grow decision trees
- At each split, choose a random subset of features
- Optimize each split with a greedy impurity criterion, e.g. the squared-error reduction $\sum_i (y_i - \bar{y})^2$ for regression or the Gini impurity for classification
No search over mathematical forms occurs.
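The steps above can be sketched in a few lines. This is a minimal from-scratch version, assuming scikit-learn's `DecisionTreeRegressor` handles the per-split optimisation (its `max_features` option implements the random feature subset):

```python
# Minimal sketch of the random-forest training loop:
# bootstrap sample -> grow tree with random feature subsets -> average.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def train_forest(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    n = X.shape[0]
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)           # bootstrap sample (with replacement)
        tree = DecisionTreeRegressor(
            max_features="sqrt",                   # random feature subset at each split
            random_state=int(rng.integers(1 << 31)),
        )
        tree.fit(X[idx], y[idx])                   # greedy squared-error splits
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    return np.mean([t.predict(X) for t in trees], axis=0)  # average the trees

# Illustrative usage on a simple additive target
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 3))
y = X[:, 0] + X[:, 1]
preds = predict_forest(train_forest(X, y), X)
```

Note that every step is numerical optimisation over split points; no search over mathematical expression forms occurs anywhere in the loop.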
Symbolic Approximation via Model Distillation
Why Approximate with Symbolic Regression
Random forests provide strong predictive accuracy and robustness to noise. Symbolic Regression provides interpretable functional structure and compact mathematical representation. Combining them produces a high-accuracy model with a simplified analytical description.
Distillation Pipeline
Step 1 — Train the random forest on the original labelled data to obtain $f_{\text{RF}}$.
Step 2 — Generate synthetic training data (this decouples the formula search from the noisy raw labels: the symbolic model fits the forest's smoothed response surface instead). Sample inputs from the feature space:

$$x_1, \dots, x_m \sim p(x)$$

Compute labels using the forest:

$$\tilde{y}_i = f_{\text{RF}}(x_i)$$

The new dataset $\{(x_i, \tilde{y}_i)\}_{i=1}^{m}$ represents the response surface learned by the forest.
Step 3 — Run symbolic regression: Train symbolic regression on $\{(x_i, \tilde{y}_i)\}$ to obtain a compact formula $f_{\text{SR}}(x) \approx f_{\text{RF}}(x)$.
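The three steps can be sketched end to end. To keep the example runnable with numpy and scikit-learn alone, a degree-2 polynomial fit stands in for a real symbolic regression search (libraries such as gplearn or PySR would replace the final step); the target $y = x^2 + 1$ is an illustrative choice:

```python
# Sketch of the distillation pipeline; polyfit is a stand-in for symbolic regression.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Step 1 — train the random forest on noisy data
X = rng.uniform(-3, 3, size=(400, 1))
y = X.ravel() ** 2 + 1 + rng.normal(0, 0.1, size=400)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Step 2 — sample the feature space and label with the forest
X_syn = rng.uniform(-3, 3, size=(1000, 1))
y_syn = forest.predict(X_syn)        # the forest's response surface

# Step 3 — fit a compact formula to the synthetic dataset
coeffs = np.polyfit(X_syn.ravel(), y_syn, deg=2)
print(np.round(coeffs, 2))           # roughly [1, 0, 1], i.e. x^2 + 1
```

The fitted coefficients recover the generating formula because the forest's surface is itself a good approximation of $x^2 + 1$ over the sampled range.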
Benefits of Distillation
- Model compression: a forest with hundreds of trees becomes a single formula
- Example: Random forest (500 trees) → Symbolic formula (8 terms)
- Interpretability: Exposes relationships such as polynomial growth, logarithmic scaling, interaction terms
- Analytical manipulation: compute derivatives $\partial f_{\text{SR}} / \partial x_j$, integrals, and asymptotic behaviour
Limitations
- Approximation error: Symbolic regression cannot always perfectly reproduce the forest surface; complex forests may require large formulas
- Feature interactions: Forests can capture extremely complex interaction boundaries that symbolic regression may simplify
- Stability: Symbolic approximations may vary depending on sampling strategy and search randomness
Comparing Symbolic Formula to Forest Predictions
Prediction Agreement
The most direct comparison measures the prediction difference between the two models on a shared evaluation set:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(f_{\text{RF}}(x_i) - f_{\text{SR}}(x_i)\right)^2}, \qquad \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left|f_{\text{RF}}(x_i) - f_{\text{SR}}(x_i)\right|$$

These measure how closely the symbolic formula reproduces the forest.
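Computing the agreement metrics takes one line each. A minimal numpy sketch, with illustrative placeholder prediction arrays standing in for `forest.predict(X)` and the evaluated formula:

```python
# Sketch: agreement metrics between forest and formula predictions on a
# shared evaluation set; the prediction arrays are illustrative placeholders.
import numpy as np

pred_forest = np.array([1.0, 2.1, 3.9, 8.2])
pred_formula = np.array([1.1, 2.0, 4.0, 8.0])

diff = pred_forest - pred_formula
rmse = np.sqrt(np.mean(diff ** 2))   # root-mean-squared disagreement
mae = np.mean(np.abs(diff))          # mean absolute disagreement
print(rmse, mae)
```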
Functional Similarity
A functional distance between the two models weights the disagreement by the input distribution $p(x)$:

$$D = \int \left(f_{\text{RF}}(x) - f_{\text{SR}}(x)\right)^2 p(x)\, dx$$

A small $D$ indicates the symbolic model is a good surrogate.
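In practice the integral is estimated by Monte Carlo: draw samples from $p(x)$ and average the squared gap. A sketch with two illustrative stand-in functions (the forest side would be `forest.predict` in a real pipeline):

```python
# Monte Carlo estimate of the functional distance D between two models.
import numpy as np

rng = np.random.default_rng(0)

f_forest = lambda x: x ** 2            # stand-in for the forest's prediction
f_formula = lambda x: x ** 2 + 0.1     # stand-in for the fitted formula

xs = rng.uniform(0, 1, size=100_000)   # samples from p(x)
D = np.mean((f_forest(xs) - f_formula(xs)) ** 2)
print(D)                               # the constant gap of 0.1 gives D = 0.01
```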
Behavioural Diagnostics
- Partial dependence comparison: compare $\text{PD}_{\text{RF}}(x_j)$ vs $\text{PD}_{\text{SR}}(x_j)$ to verify the symbolic formula reproduces the same trend for individual features
- Interaction analysis: verify that complex interactions captured by the forest (e.g. a product term such as $x_1 x_2$) are approximated in the symbolic formula
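Partial dependence can be computed by hand without any library support: pin feature $j$ to each grid value, average the model's predictions over the data, and compare the two curves. A sketch with plain callables standing in for the two models (a fitted forest's `.predict` works the same way):

```python
# Sketch: manual partial-dependence comparison between two models.
import numpy as np

def partial_dependence(model, X, j, grid):
    curve = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v                       # pin feature j to the grid value
        curve.append(model(Xv).mean())     # average over the other features
    return np.array(curve)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
forest_like = lambda Z: Z[:, 0] ** 2 + Z[:, 1]       # stand-in for the forest
formula = lambda Z: Z[:, 0] ** 2 + Z[:, 1] + 0.05    # stand-in for the formula

grid = np.linspace(-2, 2, 9)
pd_forest = partial_dependence(forest_like, X, 0, grid)
pd_formula = partial_dependence(formula, X, 0, grid)
gap = np.max(np.abs(pd_forest - pd_formula))
print(gap)   # a small, constant gap means the two curves share the same trend
```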
Out-of-Sample Validation
Evaluate both models on validation data:
| Model | $R^2$ |
|---|---|
| Random forest | 0.92 |
| Symbolic regression | 0.84 |
The symbolic model typically sacrifices some predictive accuracy for interpretability.
Complexity Comparison
Compare model sizes:
- Random forest: 400 trees, depth ≈ 10, ~10,000 decision nodes
- Symbolic formula: a single closed-form expression with a handful of terms (e.g. the 8-term formula above)
Symbolic models are dramatically smaller.
Residuals Analysis
Evaluate the residuals of each model against the true labels:

$$r_i^{\text{RF}} = y_i - f_{\text{RF}}(x_i), \qquad r_i^{\text{SR}} = y_i - f_{\text{SR}}(x_i)$$
Typical pattern: symbolic model captures global trend, forest captures local irregularities.
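A minimal sketch of the residual comparison, with illustrative placeholder labels and predictions (real values would come from a held-out validation set):

```python
# Sketch: residuals of each model against held-out labels.
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
res_forest = y_true - np.array([1.05, 1.9, 3.1, 3.95])   # placeholder predictions
res_formula = y_true - np.array([0.8, 2.2, 2.9, 4.3])    # placeholder predictions

# The forest's residuals are typically smaller and less structured;
# the formula's residuals reflect the local detail it smooths away.
print(np.abs(res_forest).mean(), np.abs(res_formula).mean())
```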