Symbolic regression (SR) searches for an explicit functional form by exploring a space of mathematical expressions. Both the structure of the model and its parameters are learned simultaneously.
Example output: an explicit formula such as y = a·sin(x1) + b·x2^2
Expressions are typically represented as trees:
- Internal nodes: operators (e.g. +, -, *, /, sin, exp)
- Leaves: variables and constants
Core Mechanism
Genetic Programming (GP)
SR is usually implemented via evolutionary search over expression trees.
- Population: a set of candidate formulas
- Evaluation: each formula is scored using a loss (e.g. MSE)
- Selection: better formulas are retained
- Crossover: subtrees are swapped between formulas
- Mutation: random structural or operator changes
The process iterates over generations.
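The loop above can be sketched in a few lines of Python. This is a toy, mutation-only GP with made-up settings (population 50, depth-2 trees); real engines such as PySR also apply crossover and constant optimisation:

```python
import random

# Expressions are nested tuples: (op, left, right); leaves are "x" or constants.
OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b,
       "mul": lambda a, b: a * b}

def evaluate(expr, x):
    if expr == "x":
        return x
    if isinstance(expr, (int, float)):
        return expr
    op, left, right = expr
    return OPS[op](evaluate(left, x), evaluate(right, x))

def random_expr(depth=2):
    if depth == 0 or random.random() < 0.3:
        return "x" if random.random() < 0.5 else random.randint(-2, 2)
    op = random.choice(list(OPS))
    return (op, random_expr(depth - 1), random_expr(depth - 1))

def mse(expr, xs, ys):
    return sum((evaluate(expr, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
xs = [float(i) for i in range(-5, 6)]
ys = [x * x + x for x in xs]                     # target: y = x^2 + x

population = [random_expr() for _ in range(50)]
for generation in range(30):
    population.sort(key=lambda e: mse(e, xs, ys))          # evaluation
    survivors = population[:25]                            # selection
    # fresh random trees stand in for mutation/crossover here
    population = survivors + [random_expr() for _ in survivors]

best = min(population, key=lambda e: mse(e, xs, ys))
print(best, mse(best, xs, ys))
```

Because the best survivor is always retained, the best error is non-increasing across generations.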
Objective Function and Parsimony
To control overfitting and expression growth, SR optimises a penalised objective of the form score(f) = loss(f) + λ · complexity(f), where:
- complexity(f) measures tree size / operator cost
- λ controls the simplicity–accuracy trade-off
The result is a Pareto front: a set of non-dominated models balancing error and complexity.
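Extracting the non-dominated set can be sketched as follows, assuming each candidate is summarised as a (complexity, error) pair:

```python
# A model is dominated if another model is at least as simple AND at least
# as accurate, and strictly better in one of the two.
def pareto_front(candidates):
    front = []
    for c, e in sorted(candidates):        # by complexity, then error
        if not front or e < front[-1][1]:  # keep only if error improves
            front.append((c, e))
    return front

candidates = [(3, 0.90), (5, 0.40), (5, 0.55), (7, 0.38), (9, 0.10), (11, 0.12)]
print(pareto_front(candidates))  # [(3, 0.9), (5, 0.4), (7, 0.38), (9, 0.1)]
```

Sorting by complexity first means the front is swept left to right, keeping each model only if it strictly improves on the best error seen so far.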
Practical Workflow
Given feature matrix X and target y:
- Initialise random expression trees
- Evaluate each candidate: compare predictions f(X) vs y
- Evolve population via crossover and mutation
- Extract Pareto-optimal equations
Example (PySR)
from pysr import PySRRegressor
model = PySRRegressor(
niterations=100,
binary_operators=["+", "-", "*", "/"],
unary_operators=["sin", "exp", "inv(x) = 1/x"],
complexity_of_operators={"sin": 2, "exp": 3},
)
model.fit(X, y)
model.equations_  # Pareto front
Computational Profile
- Time: high (search over expression space)
- Memory: relatively low
- Scaling: worsens with feature count and operator set size
Suitable for long-running batch processes (e.g. headless execution on low-power systems).
Behavioural Properties
Extrapolation
Because the model is an explicit function, it can extrapolate beyond training data.
Example: if the recovered model is f(x) = 2x^2, it remains well defined for inputs far outside the training range.
This is useful in scientific and physical systems.
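A small contrast, using a hypothetical recovered formula y = 2x^2, between a symbolic model and a lookup-based one:

```python
# Training data on x in [0, 1]; both models agree inside this range.
train_xs = [i / 10 for i in range(11)]
train_ys = [2 * x * x for x in train_xs]

def symbolic(x):
    # explicit recovered formula: extrapolates by construction
    return 2 * x ** 2

def nearest_neighbour(x):
    # lookup baseline: effectively clamps to the training range
    i = min(range(len(train_xs)), key=lambda j: abs(train_xs[j] - x))
    return train_ys[i]

print(symbolic(5.0))            # 50.0
print(nearest_neighbour(5.0))   # 2.0 (value at the closest training point)
```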
Stability
SR is high variance:
- Small data changes can yield different expressions
- Multiple equivalent formulations may exist
Example: x(x + 1) and x^2 + x are structurally different trees but algebraically identical.
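A quick numeric check of two equivalent formulations:

```python
import math

# Two structurally different expression trees with identical behaviour,
# illustrating why separate SR runs can return different-looking models.
f1 = lambda x: x * (x + 1)
f2 = lambda x: x ** 2 + x

assert all(math.isclose(f1(x / 7), f2(x / 7)) for x in range(-20, 21))
```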
Drift Analysis
Symbolic models allow structural inspection of drift.
Structural Drift
Change in functional form: e.g. y = a·x becoming y = a·x + b·sin(x).
Indicates new relationships or variables.
Parametric Drift
Same structure, different coefficients: e.g. y = 2.1·x becoming y = 3.4·x.
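Re-estimating only the coefficient while holding the structure y = a·x fixed can be done in closed form (least squares through the origin); the regimes below are made-up data:

```python
# Closed-form slope for y = a*x: a = sum(x*y) / sum(x^2)
def fit_slope(xs, ys):
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

xs = [1.0, 2.0, 3.0, 4.0]
old_ys = [2.1 * x for x in xs]   # earlier regime
new_ys = [3.4 * x for x in xs]   # later regime: same law, new coefficient

print(fit_slope(xs, old_ys), fit_slope(xs, new_ys))  # ≈ 2.1, ≈ 3.4
```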
Updating Strategies
Updating SR models is non-trivial:
- Global optimisation means structure may change entirely
Common approaches:
- Periodic retraining
- Warm-start using previous expressions
- Track stability across Pareto front over time
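One simple way to quantify stability across retrains, shown here with hypothetical expression strings, is set overlap between successive Pareto fronts:

```python
# Jaccard similarity between the expression sets of two successive fronts.
front_t0 = {"x0", "x0 + x1", "x0 + sin(x1)"}
front_t1 = {"x0", "x0 + x1", "x0 * exp(x1)"}

jaccard = len(front_t0 & front_t1) / len(front_t0 | front_t1)
print(jaccard)  # 0.5: half the combined front changed between retrains
```

A falling Jaccard score over time is a cheap signal of structural drift; note it treats algebraically equivalent but differently written expressions as distinct.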
Error Characteristics
SR explicitly trades accuracy for simplicity via the complexity penalty in its objective. This biases toward compact, interpretable expressions rather than purely predictive models.
Function Properties
Typical outputs are:
- Continuous
- Often differentiable
- Compact representations
Example: f(x1, x2) = x1 · exp(-x2), which is continuous, differentiable, and compact.
Interpretability
Interpretability is intrinsic:
- Direct visibility of variable relationships
- Explicit nonlinear structure
- Clear scaling effects
Advanced Uses
Feature Engineering: Extract sub-expressions as engineered features:
- e.g. promote a recurring sub-expression such as x1/x2 to a new input feature
Model Compression: Approximate black-box models:
- Train SR on predictions (X, ŷ) from a complex model
- Produces an analytic surrogate
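A stand-in for the surrogate workflow: rather than running a full SR engine, a tiny candidate library is fitted to the black box's predictions (the black box and candidate forms below are invented for illustration):

```python
import math

def black_box(x):
    # pretend this is a large opaque model; it is nearly 3*sin(x)
    return 3.0 * math.sin(x) + 0.001 * x

xs = [i / 10 for i in range(-30, 31)]
yhat = [black_box(x) for x in xs]          # train the surrogate on (X, y_hat)

candidates = {
    "3*sin(x)": lambda x: 3.0 * math.sin(x),
    "3*x":      lambda x: 3.0 * x,
    "x**2":     lambda x: x ** 2,
}

def mse(f):
    return sum((f(x) - y) ** 2 for x, y in zip(xs, yhat)) / len(xs)

best = min(candidates, key=lambda name: mse(candidates[name]))
print(best)  # 3*sin(x): the analytic surrogate
```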
Dimensional Constraints: Enforce unit consistency in physical systems:
- Prevent invalid expressions (e.g. adding metres to seconds)
- And limit complexity of output formulas
Invariant Discovery: Find conserved relationships by searching for an expression g such that g(x) ≈ constant across all observations.
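A minimal numeric check of a conserved quantity, using hypothetical oscillator states and the candidate invariant q^2 + v^2:

```python
# (position, velocity) samples from a frictionless oscillator (made up).
states = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]

def candidate(q, v):
    # candidate invariant: proportional to total energy
    return q * q + v * v

values = [candidate(q, v) for q, v in states]
spread = max(values) - min(values)
print(spread)  # ~0: the expression is conserved across observations
```

In practice an SR engine would search over candidate expressions and score them by how small this spread (or variance) is.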
Implementation Ecosystem
- PySR (state-of-the-art, Julia backend)
- gplearn (scikit-learn style, simpler)
Notes
How SR works:
- Population-based search over functions
- Operators used in symbolic regression: mutation, crossover, simplification, optimisation
- Island-based evolution can improve diversity
- Widely used in physics, control systems, and scientific modelling
References
- PySR: https://github.com/MilesCranmer/PySR
- Cranmer et al. (2020): Discovering Symbolic Models from Data
- Koza (1992): Genetic Programming