Interpretability is indirect: the hundreds of underlying trees prevent direct inspection. Common interpretation tools include feature importance scores, partial dependence plots, and per-prediction attributions such as SHAP values.

Model Properties

Low Variance Model: Random forests achieve low variance through averaging, so predictions stay stable when the training data changes slightly. This stability makes them a common choice for production pipelines.
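
The variance reduction can be checked empirically. This sketch assumes scikit-learn is available and refits a single tree and a small forest on resampled datasets, then compares how much each model's prediction at one query point varies across refits:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Refit a single tree and a forest on 20 resampled datasets and compare
# how much their prediction at one query point varies across refits.
x_query = np.array([[5.0]])
tree_preds, forest_preds = [], []
for seed in range(20):
    rs = np.random.default_rng(seed)
    X = rs.uniform(0, 10, size=(200, 1))
    y = np.sin(X[:, 0]) + rs.normal(0, 0.3, size=200)
    tree_preds.append(
        DecisionTreeRegressor(random_state=0).fit(X, y).predict(x_query)[0]
    )
    forest_preds.append(
        RandomForestRegressor(n_estimators=50, random_state=0)
        .fit(X, y).predict(x_query)[0]
    )

# Averaging across trees damps the dataset-to-dataset wobble.
print(np.var(tree_preds), np.var(forest_preds))
```

The forest's prediction variance across refits is markedly smaller than the single tree's, which is the property that makes it safe to retrain on slightly shifted data.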

Extrapolation Limitations

Random forests cannot extrapolate beyond the training range. Outside it, predictions flatten to the average target value of the nearest training region.

Example:

  • Training data: inputs confined to a bounded interval
  • Prediction at a point far beyond that interval returns values close to those in the maximum training region
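
A minimal demonstration of this behaviour, assuming scikit-learn (the linear target and the query points are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = 2.0 * X[:, 0]  # the true relationship keeps growing linearly

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

inside = rf.predict(np.array([[5.0]]))[0]     # interpolation: close to 10
outside = rf.predict(np.array([[100.0]]))[0]  # extrapolation: stuck near max(y) ~ 20
print(inside, outside)
```

Even though the true value at `x = 100` would be 200, the forest returns roughly the average of its rightmost leaves, i.e. a value near the training maximum of 20.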

Function Approximation

Random forests approximate functions through local partitioning of the feature space, creating piecewise-constant regions:

$$\hat{f}_{RF}(x) = \frac{1}{B} \sum_{b=1}^{B} \sum_{m} c_{m,b}\, \mathbf{1}\!\left[x \in R_{m,b}\right]$$

where $R_{m,b}$ are the leaf regions of tree $b$ and $c_{m,b}$ their leaf averages.

The forest smooths discontinuities through averaging.
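
Both effects, the per-tree staircase and the smoothing from averaging, can be observed directly; a sketch assuming scikit-learn:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X[:, 0])
grid = np.linspace(0, 10, 1000).reshape(-1, 1)

# One shallow tree: a coarse staircase (at most 2^3 = 8 leaf values).
tree_levels = len(np.unique(
    DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y).predict(grid)
))

# A forest of such trees: averaging blends the staircases into a finer surface.
forest_levels = len(np.unique(
    RandomForestRegressor(n_estimators=50, max_depth=3, random_state=0)
    .fit(X, y).predict(grid)
))
print(tree_levels, forest_levels)
```

The single tree produces only a handful of distinct prediction levels, while the forest's average takes many more values because each tree's breakpoints fall in different places.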

Training Procedure

Random forests follow a fixed, mechanical training procedure:

  1. Draw bootstrap samples
  2. Grow decision trees
  3. At each split, choose a random subset of features
  4. Optimize each split greedily against an impurity criterion (variance reduction / MSE for regression, Gini impurity for classification)

No search over mathematical forms occurs.
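
The four steps above can be sketched as a toy implementation. This is illustrative only, not how production libraries are organised internally; it assumes scikit-learn's `DecisionTreeRegressor` for the per-tree growth:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def train_forest(X, y, n_trees=50, max_features="sqrt", seed=0):
    """Minimal sketch of the procedure above."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))    # 1. bootstrap sample
        tree = DecisionTreeRegressor(
            max_features=max_features,                # 3. random feature subset per split
            random_state=int(rng.integers(2**31)),
        )
        tree.fit(X[idx], y[idx])                      # 2+4. grow tree with greedy MSE splits
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    return np.mean([t.predict(X) for t in trees], axis=0)  # average the ensemble

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 2))
y = X[:, 0] + X[:, 1]
forest = train_forest(X, y)
print(predict_forest(forest, np.array([[0.5, 0.5]])))  # close to 1.0
```

Note that every step is sampling or greedy impurity optimisation; at no point does the procedure search over candidate mathematical expressions.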

Symbolic Approximation via Model Distillation

Why Approximate with Symbolic Regression

Random forests provide strong predictive accuracy and robustness to noise. Symbolic Regression provides interpretable functional structure and compact mathematical representation. Combining them produces a high-accuracy model with a simplified analytical description.

Distillation Pipeline

Step 1 — Train the random forest $\hat{f}_{RF}$ on the original data $(X, y)$.

Step 2 — Generate synthetic training data: sample inputs $\tilde{x}_1, \dots, \tilde{x}_n$ from the feature space. (This step densifies coverage, so the surrogate is fit to the forest's full response surface rather than only the original training points.)

Compute labels using the forest: $\tilde{y}_i = \hat{f}_{RF}(\tilde{x}_i)$

The new dataset $\{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^{n}$ represents the response surface learned by the forest.

Step 3 — Run symbolic regression: train symbolic regression on $\{(\tilde{x}_i, \tilde{y}_i)\}$ to obtain a compact formula $\hat{g}(x) \approx \hat{f}_{RF}(x)$.
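
The three steps can be sketched end to end. For self-containment, a degree-2 polynomial fit stands in for a full symbolic-regression search; the target function, sampling range, and forest size are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Step 1: train the forest on the original data (here, a known quadratic + noise).
X = rng.uniform(-3, 3, size=(400, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.1, size=400)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Step 2: sample synthetic inputs and label them with the forest.
X_syn = rng.uniform(-3, 3, size=(2000, 1))
y_syn = rf.predict(X_syn)

# Step 3: fit a compact surrogate on (X_syn, y_syn).
# A polynomial stands in here for a genuine symbolic-regression search.
coeffs = np.polyfit(X_syn[:, 0], y_syn, deg=2)
surrogate = np.poly1d(coeffs)
print(coeffs)  # leading coefficient close to 1, recovering x^2
```

With a real symbolic-regression library, Step 3 would instead search over expression trees, but the distillation structure — sample, query the forest, fit a compact model to its outputs — is the same.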

Benefits of Distillation

  • Model compression: A forest with hundreds of trees becomes a single formula
    • Example: Random forest (500 trees) → Symbolic formula (8 terms)
  • Interpretability: Exposes relationships such as polynomial growth, logarithmic scaling, interaction terms
  • Analytical manipulation: Compute derivatives, integrals, and asymptotic behaviour of $\hat{g}$
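
For instance, an extracted formula can be manipulated symbolically. This sketch uses SymPy on a hypothetical distilled expression; the formula below is invented purely for illustration:

```python
import sympy as sp

x1, x2 = sp.symbols("x1 x2")

# Hypothetical distilled formula (illustrative; not from a real forest).
g = 2.1 * x1**2 + 0.8 * sp.log(x2 + 1) + 0.3 * x1 * x2

dg_dx1 = sp.diff(g, x1)             # exact derivative w.r.t. x1
area = sp.integrate(g, (x1, 0, 1))  # exact definite integral over x1
print(dg_dx1)
print(area)
```

None of these operations are possible on the forest itself, which can only be queried pointwise.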

Limitations

  • Approximation error: Symbolic regression cannot always perfectly reproduce the forest surface; complex forests may require large formulas
  • Feature interactions: Forests can capture extremely complex interaction boundaries that symbolic regression may simplify
  • Stability: Symbolic approximations may vary depending on sampling strategy and search randomness

Comparing Symbolic Formula to Forest Predictions

Prediction Agreement

The most direct comparison measures how the two models' predictions differ on the same inputs:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{f}_{RF}(x_i) - \hat{g}(x_i)\right)^2 \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{f}_{RF}(x_i) - \hat{g}(x_i)\right|$$

These measure how closely the symbolic formula reproduces the forest.
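
A minimal helper for these agreement metrics, assuming NumPy (the function name and toy prediction arrays are illustrative):

```python
import numpy as np

def agreement_metrics(pred_rf, pred_sr):
    """Prediction-difference metrics between forest and formula outputs."""
    diff = np.asarray(pred_rf) - np.asarray(pred_sr)
    return {
        "mse": float(np.mean(diff**2)),
        "mae": float(np.mean(np.abs(diff))),
        "max_abs": float(np.max(np.abs(diff))),
    }

m = agreement_metrics([1.0, 2.0, 3.0], [1.1, 2.0, 2.7])
print(m)
```

Reporting the maximum absolute difference alongside the averages is useful because a surrogate can have a small MSE yet deviate badly in one region.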

Functional Similarity

Functional distance between the two models:

$$D(\hat{f}_{RF}, \hat{g}) = \mathbb{E}_{x \sim p(x)}\!\left[\left(\hat{f}_{RF}(x) - \hat{g}(x)\right)^2\right]$$

A small $D$ indicates the symbolic model is a good surrogate.
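
The expectation can be estimated by Monte Carlo sampling. A sketch with NumPy, using two simple stand-in functions whose true distance is known analytically (for $x^2$ vs $x$ on $U(0,1)$ it equals $1/30$):

```python
import numpy as np

def functional_distance(f, g, sampler, n=10000, seed=0):
    """Monte Carlo estimate of D(f, g) = E_x[(f(x) - g(x))^2]."""
    rng = np.random.default_rng(seed)
    X = sampler(rng, n)
    return float(np.mean((f(X) - g(X)) ** 2))

# Example: distance between x^2 and its linear stand-in on U(0, 1).
D = functional_distance(
    lambda X: X[:, 0] ** 2,
    lambda X: X[:, 0],
    lambda rng, n: rng.uniform(0, 1, size=(n, 1)),
)
print(D)  # close to 1/30
```

In practice `f` and `g` would be the forest's `predict` and the symbolic formula, and `sampler` would draw from the same distribution used for distillation.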

Behavioural Diagnostics

  • Partial dependence comparison: Compare $\mathrm{PD}_{RF}(x_j)$ vs $\mathrm{PD}_{SR}(x_j)$ to verify the symbolic formula reproduces the same trend for individual features
  • Interaction analysis: Verify that complex interactions (e.g. terms such as $x_1 x_2$) captured by the forest are approximated in the symbolic formula
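
A bare-bones partial-dependence helper, assuming NumPy; the two lambdas below stand in for the forest's and the formula's prediction functions, and all names are illustrative:

```python
import numpy as np

def partial_dependence(model_fn, X, feature, grid):
    """Average model output while one feature sweeps a grid, others held at data values."""
    out = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v   # pin the feature of interest, keep the rest as observed
        out.append(model_fn(Xv).mean())
    return np.array(out)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
grid = np.linspace(0, 1, 5)

# Stand-ins for forest and formula: near-identical models, shifted by a constant.
pd_f = partial_dependence(lambda Z: Z[:, 0] + 2 * Z[:, 1], X, 0, grid)
pd_g = partial_dependence(lambda Z: Z[:, 0] + 2 * Z[:, 1] + 0.01, X, 0, grid)
print(pd_f)
print(pd_g)
```

If the two curves share the same shape (here, slope 1 in the swept feature), the surrogate reproduces the forest's marginal trend even when the absolute values differ slightly.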

Out-of-Sample Validation

Evaluate both models on validation data:

| Model | Validation score ($R^2$) |
|---|---|
| Random forest | 0.92 |
| Symbolic regression | 0.84 |

The symbolic model typically sacrifices some predictive accuracy for interpretability.

Complexity Comparison

Compare model sizes:

  • Random forest: 400 trees, depth ≈ 10, ~10000 decision nodes
  • Symbolic formula: a single compact expression with a handful of terms

Symbolic models are dramatically smaller.

Residuals Analysis

Evaluate the residuals of each model against the true targets:

$$r_i^{RF} = y_i - \hat{f}_{RF}(x_i) \qquad r_i^{SR} = y_i - \hat{g}(x_i)$$

Typical pattern: symbolic model captures global trend, forest captures local irregularities.