A Random Forest is a Model Ensemble method that combines many Decision Trees to improve accuracy and generalisation.
How does Random Forest work?
- Each tree is trained on a random bootstrap sample of the training data (bagging).
- At each split, the tree considers only a random subset of features (commonly √p features if p is the total number of features).
- The final prediction is obtained by majority vote (classification) or averaging (regression).
This randomness reduces the correlation between trees, making the overall model more robust and better able to generalise.
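As a rough illustration of these three steps, here is a minimal sketch that builds a forest by hand, assuming scikit-learn's DecisionTreeClassifier as the base learner: each tree is trained on a bootstrap sample, restricted to a random subset of candidate features per split via max_features="sqrt", and the ensemble predicts by majority vote. The dataset and settings are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

n_trees = 25
trees = []
for i in range(n_trees):
    # Bagging: draw a bootstrap sample (rows sampled with replacement).
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt": only a random subset of features is considered at each split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Majority vote across the trees (classification); regression would average instead.
all_preds = np.stack([t.predict(X) for t in trees])                 # (n_trees, n_samples)
votes = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, all_preds)
print("ensemble training accuracy:", (votes == y).mean())
```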
Key properties:
- Can be used for classification and regression, for feature selection / dimensionality reduction (via feature importances), and with data containing missing values.
- Flexible, robust, and resistant to overfitting compared to a single Decision Tree.
- Works well with high-dimensional data.
Hyperparameters to tune:
- n_estimators: number of trees.
- max_depth: maximum depth of each tree.
- max_features: number of features considered at each split (default √p for classification).
- n_jobs: controls parallelism during training (more cores = faster training, but watch out for system slowdown).
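For reference, a minimal sketch of how these hyperparameters map onto scikit-learn's RandomForestClassifier; the dataset is synthetic and the specific values are illustrative, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(
    n_estimators=200,     # number of trees
    max_depth=None,       # grow each tree until its leaves are pure
    max_features="sqrt",  # features considered at each split
    n_jobs=-1,            # use all available CPU cores for training
    random_state=0,
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```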
Strengths:
- Reduces variance compared to a single tree.
- Handles large datasets and mixed feature types well.
- Provides feature importance estimates.
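As an example of the feature importance estimates mentioned above, the sketch below reads scikit-learn's impurity-based feature_importances_ attribute from a fitted forest; the dataset is synthetic and only meant to show the API.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = forest.feature_importances_        # one value per feature, sums to 1
for i in np.argsort(importances)[::-1][:3]:      # three most important features
    print(f"feature {i}: importance {importances[i]:.3f}")
```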
Weaknesses:
- Can still overfit with noisy or very high-dimensional data.
- Less interpretable than a single decision tree.
Evaluation:
- Out-of-bag (OOB) error can be used as an internal validation metric; it is computed on the data not included in each tree's bootstrap sample.
- Model performance can be refined by tuning Hyperparameters.
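A short sketch of both evaluation ideas, assuming scikit-learn: oob_score=True scores each sample only on the trees whose bootstrap sample did not contain it, and GridSearchCV illustrates basic hyperparameter tuning; the dataset and parameter grid are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Out-of-bag evaluation: each sample is predicted by the trees
# that did not see it during bootstrap sampling.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)
print("OOB accuracy:", forest.oob_score_)

# Simple hyperparameter tuning with cross-validated grid search.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=3,
)
search.fit(X, y)
print("best parameters:", search.best_params_)
```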
Related: