Symbolic regression (SR) searches for an explicit functional form by exploring a space of mathematical expressions. Both the structure of the model and its parameters are learned simultaneously.
Example output: an explicit formula such as y = a·sin(x1) + b·x2^2
Expressions are typically represented as trees:
- Internal nodes: operators (e.g. +, -, *, /, sin, exp)
- Leaves: variables and constants
Core Mechanism
Genetic Programming (GP)
SR is usually implemented via evolutionary search over expression trees.
- Population: a set of candidate formulas
- Evaluation: each formula is scored using a loss (e.g. MSE)
- Selection: better formulas are retained
- Crossover: subtrees are swapped between formulas
- Mutation: random structural or operator changes
The process iterates over generations.
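The loop above can be sketched in a few lines of Python. This is a toy, mutation-only GP with made-up settings (population 50, depth-2 trees); real engines such as PySR also apply crossover and constant optimisation:

```python
import random

# Expressions are nested tuples: (op, left, right); leaves are "x" or constants.
OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b,
       "mul": lambda a, b: a * b}

def evaluate(expr, x):
    if expr == "x":
        return x
    if isinstance(expr, (int, float)):
        return expr
    op, left, right = expr
    return OPS[op](evaluate(left, x), evaluate(right, x))

def random_expr(depth=2):
    if depth == 0 or random.random() < 0.3:
        return "x" if random.random() < 0.5 else random.randint(-2, 2)
    op = random.choice(list(OPS))
    return (op, random_expr(depth - 1), random_expr(depth - 1))

def mse(expr, xs, ys):
    return sum((evaluate(expr, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
xs = [float(i) for i in range(-5, 6)]
ys = [x * x + x for x in xs]                     # target: y = x^2 + x

population = [random_expr() for _ in range(50)]
for generation in range(30):
    population.sort(key=lambda e: mse(e, xs, ys))          # evaluation
    survivors = population[:25]                            # selection
    # fresh random trees stand in for mutation/crossover here
    population = survivors + [random_expr() for _ in survivors]

best = min(population, key=lambda e: mse(e, xs, ys))
print(best, mse(best, xs, ys))
```

Because the best survivor is always retained, the best error is non-increasing across generations.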
Objective Function and Parsimony
To control overfitting and expression growth, SR optimises a penalised objective of the form score(f) = loss(f) + λ · complexity(f), where:
- complexity(f) measures tree size / operator cost
- λ controls the simplicity–accuracy trade-off
The result is a Pareto front: a set of non-dominated models balancing error and complexity.
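Extracting the non-dominated set can be sketched as follows, assuming each candidate is summarised as a (complexity, error) pair:

```python
# A model is dominated if another model is at least as simple AND at least
# as accurate, and strictly better in one of the two.
def pareto_front(candidates):
    front = []
    for c, e in sorted(candidates):        # by complexity, then error
        if not front or e < front[-1][1]:  # keep only if error improves
            front.append((c, e))
    return front

candidates = [(3, 0.90), (5, 0.40), (5, 0.55), (7, 0.38), (9, 0.10), (11, 0.12)]
print(pareto_front(candidates))  # [(3, 0.9), (5, 0.4), (7, 0.38), (9, 0.1)]
```

Sorting by complexity first means the front is swept left to right, keeping each model only if it strictly improves on the best error seen so far.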
Practical Workflow
Given feature matrix X and target y:
- Initialise random expression trees
- Evaluate each candidate: compare predictions f(X) vs y
- Evolve population via crossover and mutation
- Extract Pareto-optimal equations
Example (PySR)
from pysr import PySRRegressor
model = PySRRegressor(
niterations=100,
binary_operators=["+", "-", "*", "/"],
unary_operators=["sin", "exp", "inv(x) = 1/x"],
complexity_of_operators={"sin": 2, "exp": 3},
)
model.fit(X, y)
model.equations_  # Pareto front
Computational Profile
- Time: high (search over expression space)
- Memory: relatively low
- Scaling: worsens with feature count and operator set size
Suitable for long-running batch processes (e.g. headless execution on low-power systems).
Behavioural Properties
Extrapolation
Because the model is an explicit function, it can extrapolate beyond training data.
Example: if the recovered model is f(x) = 2x^2, it remains well defined for inputs far outside the training range.
This is useful in scientific and physical systems.
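A small contrast, using a hypothetical recovered formula y = 2x^2, between a symbolic model and a lookup-based one:

```python
# Training data on x in [0, 1]; both models agree inside this range.
train_xs = [i / 10 for i in range(11)]
train_ys = [2 * x * x for x in train_xs]

def symbolic(x):
    # explicit recovered formula: extrapolates by construction
    return 2 * x ** 2

def nearest_neighbour(x):
    # lookup baseline: effectively clamps to the training range
    i = min(range(len(train_xs)), key=lambda j: abs(train_xs[j] - x))
    return train_ys[i]

print(symbolic(5.0))            # 50.0
print(nearest_neighbour(5.0))   # 2.0 (value at the closest training point)
```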
Stability
SR is high variance:
- Small data changes can yield different expressions
- Multiple equivalent formulations may exist
Example: x(x + 1) and x^2 + x are structurally different trees but algebraically identical.
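A quick numeric check of two equivalent formulations:

```python
import math

# Two structurally different expression trees with identical behaviour,
# illustrating why separate SR runs can return different-looking models.
f1 = lambda x: x * (x + 1)
f2 = lambda x: x ** 2 + x

assert all(math.isclose(f1(x / 7), f2(x / 7)) for x in range(-20, 21))
```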
Drift Analysis
Symbolic models allow structural inspection of drift.
Structural Drift
Change in functional form: e.g. y = a·x becoming y = a·x + b·sin(x).
Indicates new relationships or variables.
Parametric Drift
Same structure, different coefficients: e.g. y = 2.1·x becoming y = 3.4·x.
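Re-estimating only the coefficient while holding the structure y = a·x fixed can be done in closed form (least squares through the origin); the regimes below are made-up data:

```python
# Closed-form slope for y = a*x: a = sum(x*y) / sum(x^2)
def fit_slope(xs, ys):
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

xs = [1.0, 2.0, 3.0, 4.0]
old_ys = [2.1 * x for x in xs]   # earlier regime
new_ys = [3.4 * x for x in xs]   # later regime: same law, new coefficient

print(fit_slope(xs, old_ys), fit_slope(xs, new_ys))  # ≈ 2.1, ≈ 3.4
```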
Updating Strategies
Updating SR models is non-trivial:
- Global optimisation means structure may change entirely
Common approaches:
- Periodic retraining
- Warm-start using previous expressions
- Track stability across Pareto front over time
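One simple way to quantify stability across retrains, shown here with hypothetical expression strings, is set overlap between successive Pareto fronts:

```python
# Jaccard similarity between the expression sets of two successive fronts.
front_t0 = {"x0", "x0 + x1", "x0 + sin(x1)"}
front_t1 = {"x0", "x0 + x1", "x0 * exp(x1)"}

jaccard = len(front_t0 & front_t1) / len(front_t0 | front_t1)
print(jaccard)  # 0.5: half the combined front changed between retrains
```

A falling Jaccard score over time is a cheap signal of structural drift; note it treats algebraically equivalent but differently written expressions as distinct.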
Error Characteristics
SR explicitly trades accuracy for simplicity via the complexity penalty in its objective. This biases toward compact, interpretable expressions rather than purely predictive models.
Function Properties
Typical outputs are:
- Continuous
- Often differentiable
- Compact representations
Example: f(x1, x2) = x1 · exp(-x2), which is continuous, differentiable, and compact.
Interpretability
Interpretability is intrinsic:
- Direct visibility of variable relationships
- Explicit nonlinear structure
- Clear scaling effects
Advanced Uses
Feature Engineering: Extract sub-expressions as engineered features:
- e.g. promote a recurring sub-expression such as x1/x2 to a new input feature
Model Compression: Approximate black-box models:
- Train SR on predictions (X, ŷ) from a complex model
- Produces an analytic surrogate
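A stand-in for the surrogate workflow: rather than running a full SR engine, a tiny candidate library is fitted to the black box's predictions (the black box and candidate forms below are invented for illustration):

```python
import math

def black_box(x):
    # pretend this is a large opaque model; it is nearly 3*sin(x)
    return 3.0 * math.sin(x) + 0.001 * x

xs = [i / 10 for i in range(-30, 31)]
yhat = [black_box(x) for x in xs]          # train the surrogate on (X, y_hat)

candidates = {
    "3*sin(x)": lambda x: 3.0 * math.sin(x),
    "3*x":      lambda x: 3.0 * x,
    "x**2":     lambda x: x ** 2,
}

def mse(f):
    return sum((f(x) - y) ** 2 for x, y in zip(xs, yhat)) / len(xs)

best = min(candidates, key=lambda name: mse(candidates[name]))
print(best)  # 3*sin(x): the analytic surrogate
```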
Dimensional Constraints: Enforce unit consistency in physical systems:
- Prevent invalid expressions (e.g. adding metres to seconds)
- And limit complexity of output formulas
Invariant Discovery: Find conserved relationships by searching for an expression g such that g(x) ≈ constant across all observations.
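A minimal numeric check of a conserved quantity, using hypothetical oscillator states and the candidate invariant q^2 + v^2:

```python
# (position, velocity) samples from a frictionless oscillator (made up).
states = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]

def candidate(q, v):
    # candidate invariant: proportional to total energy
    return q * q + v * v

values = [candidate(q, v) for q, v in states]
spread = max(values) - min(values)
print(spread)  # ~0: the expression is conserved across observations
```

In practice an SR engine would search over candidate expressions and score them by how small this spread (or variance) is.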
Implementation Ecosystem
- PySR (state-of-the-art, Julia backend)
- gplearn (scikit-learn style, simpler)
Notes
How SR works:
- Population-based search over functions
- Operators used in symbolic regression: mutation, crossover, simplification, optimisation
- Island-based evolution can improve diversity
- Widely used in physics, control systems, and scientific modelling
References
- PySR: https://github.com/MilesCranmer/PySR
- Cranmer et al. (2020): Discovering Symbolic Models from Data
- Koza (1992): Genetic Programming