A learning curve is a diagnostic plot that shows model performance (e.g., accuracy or error) on the training and validation sets as a function of training set size. As a model-selection tool, it helps assess whether a model is underfitting, overfitting, or has converged in performance.
Core Concepts:
- A learning curve plots model scores (typically training and validation) against the number of training samples. It reveals how the model generalizes as it is trained on more data.
Typical behaviors:
- Overfitting (high variance): High training score, much lower validation score, typically most pronounced on small training sets. The model memorizes the training data but generalizes poorly.
- Underfitting (high bias): Both training and validation scores are low and stay low as data is added. The model is too simple for the data.
- Convergence: As the dataset grows, training and validation scores approach each other. Once convergence is reached, adding more data does not improve performance significantly.
Key Insight:
- If both curves flatten and sit close together, the model has likely reached its capacity. In this case, adding more training data will not help; a more expressive model may be needed instead. The sketch below turns these heuristics into a rough numeric check.
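As a minimal sketch of these heuristics, the helper below compares the final training and validation scores. Everything here is illustrative: the diagnose function, its gap_tol/slope_tol thresholds, and the 0.7 score floor are assumptions chosen for demonstration, not standard values.

import numpy as np

def diagnose(train_mean, val_mean, gap_tol=0.02, slope_tol=0.005):
    # train_mean/val_mean: mean CV scores per training size,
    # higher is better (assumes an accuracy-like metric)
    gap = train_mean[-1] - val_mean[-1]    # train/validation gap at the largest size
    slope = val_mean[-1] - val_mean[-2]    # latest improvement in validation score
    if gap > gap_tol:
        return "Overfitting: large train/validation gap; more data may help."
    if val_mean[-1] < 0.7:                 # 0.7 is an arbitrary illustrative floor
        return "Underfitting: both scores are low; try a more expressive model."
    if abs(slope) <= slope_tol:
        return "Converged: curves are flat and close; more data is unlikely to help."
    return "Still improving: validation score has not plateaued yet."

# Example: curves that flatten close together -> convergence
print(diagnose(np.array([0.95, 0.93, 0.92]), np.array([0.88, 0.91, 0.915])))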
Implementation Example (scikit-learn):
from sklearn.model_selection import learning_curve
# estimator is any scikit-learn estimator; returns the absolute training sizes
# plus per-fold train and validation scores (each of shape n_sizes x n_folds)
train_sizes, train_scores, val_scores = learning_curve(estimator, X, y, cv=5)
Use this to plot performance curves and analyze model behavior as training size varies.
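Putting it together, here is a self-contained sketch. The make_classification dataset and LogisticRegression estimator are stand-in assumptions chosen so the example runs end to end:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic data and a simple estimator (illustrative choices)
X, y = make_classification(n_samples=2000, random_state=0)
estimator = LogisticRegression(max_iter=1000)

train_sizes, train_scores, val_scores = learning_curve(
    estimator, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10)
)

# Average the per-fold scores at each training size
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

plt.plot(train_sizes, train_mean, "o-", label="Training score")
plt.plot(train_sizes, val_mean, "o-", label="Validation score")
plt.xlabel("Training set size")
plt.ylabel("Accuracy")
plt.title("Learning curve")
plt.legend()
plt.show()

Averaging across folds smooths fold-to-fold noise; shading one standard deviation around each curve is a common extension.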
Use Cases:
- Diagnose model complexity
- Identify data limitations
- Support decisions on collecting more data vs. changing the model