What is Random Forest Regression?

Random Forest is an ensemble learning method that builds multiple decision trees and averages their predictions.

  • Works for both classification and regression
  • Reduces overfitting compared to a single decision tree
  • Handles non-linear relationships well
  • Built-in feature importance ranking

We will use the built-in mtcars dataset to predict miles per gallon (mpg).

The Math Behind It

Each tree \(T_b\) is built on a bootstrap sample of the training data (rows drawn with replacement). The final prediction is the average:

\[\hat{y} = \frac{1}{B} \sum_{b=1}^{B} T_b(x)\]

Where:

  • \(B\) = number of trees
  • \(T_b(x)\) = prediction from the \(b\)-th tree
  • \(x\) = input feature vector

For regression, each tree picks its splits to reduce the Mean Squared Error (MSE):

\[MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]

Here \(y_i\) is the observed value and \(\hat{y}_i\) the predicted value for observation \(i\), out of \(n\) observations.
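
The two formulas above can be traced by hand in base R. The sketch below is an illustration, not what randomForest does internally: it uses a one-split tree (a "stump") on the single feature wt, and it skips the random feature selection that a real random forest adds. The helper names fit_stump and predict_stump and the choice B = 200 are assumptions made for this example.

```r
# Fit a depth-1 regression tree ("stump") on one numeric feature:
# pick the threshold that minimizes the sum of squared errors.
fit_stump <- function(x, y) {
  xs <- sort(unique(x))
  if (length(xs) < 2) {
    # No split possible: predict the mean everywhere
    return(list(cut = Inf, left = mean(y), right = mean(y)))
  }
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between distinct values
  sse <- sapply(cuts, function(cut) {
    l <- y[x <= cut]
    r <- y[x > cut]
    sum((l - mean(l))^2) + sum((r - mean(r))^2)
  })
  best <- cuts[which.min(sse)]
  list(cut = best, left = mean(y[x <= best]), right = mean(y[x > best]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$cut, stump$left, stump$right)
}

set.seed(42)
B <- 200            # number of trees (illustrative choice)
n <- nrow(mtcars)

# T_b(x): one stump per bootstrap sample; one column of predictions per tree
tree_preds <- sapply(seq_len(B), function(b) {
  idx <- sample(n, replace = TRUE)
  stump <- fit_stump(mtcars$wt[idx], mtcars$mpg[idx])
  predict_stump(stump, mtcars$wt)
})

# y_hat = (1/B) * sum over b of T_b(x)
y_hat <- rowMeans(tree_preds)

# MSE = (1/n) * sum over i of (y_i - y_hat_i)^2
# (computed on the training rows, so it is optimistic)
mse <- mean((mtcars$mpg - y_hat)^2)
mse
```

A single stump predicts only two distinct values; averaging many stumps fitted to different bootstrap samples smooths that step into a more gradual curve, which is the variance-reduction effect of the ensemble.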

The Dataset
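
A quick way to look at the data before modelling, using only base R. mtcars ships with R and holds 32 cars described by 11 numeric columns; mpg is the response used here and the remaining 10 columns serve as predictors.

```r
# mtcars is part of the datasets package that loads with R
data(mtcars)

dim(mtcars)          # rows and columns
str(mtcars)          # column names and types
summary(mtcars$mpg)  # distribution of the response
head(mtcars)         # first few cars
```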

Feature Importance

R Code: Building the Model

library(randomForest)

# Fix the seed so the bootstrap samples and feature subsets are reproducible
set.seed(42)

# Fit model with all features
model_rf <- randomForest(
  mpg ~ .,
  data = mtcars,
  ntree = 500,
  importance = TRUE
)

# View summary
print(model_rf)

# Feature importance
importance(model_rf)
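
The importance table is easier to read when sorted. The sketch below refits the same model so that it runs on its own, then ranks the features by %IncMSE (the increase in out-of-bag error when a feature is permuted) and draws the package's built-in importance plot. The variable names imp and imp_sorted are chosen for this example.

```r
library(randomForest)

set.seed(42)
model_rf <- randomForest(mpg ~ ., data = mtcars, ntree = 500, importance = TRUE)

# For regression with importance = TRUE the matrix has two columns:
# "%IncMSE" (permutation importance) and "IncNodePurity"
imp <- importance(model_rf)

# Rank features from most to least important by permutation importance
imp_sorted <- imp[order(imp[, "%IncMSE"], decreasing = TRUE), , drop = FALSE]
imp_sorted

# Dot chart of both importance measures
varImpPlot(model_rf)
```

The exact ranking can shift slightly with the seed, since the data set is small and several predictors are strongly correlated.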

Model Performance
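
One way to quantify performance is to compare the fit on the training rows with the out-of-bag (OOB) fit, where each car is predicted only by trees that did not see it. The sketch below refits the model so that it runs on its own; the helper r2 and the variable names are assumptions made for this example.

```r
library(randomForest)

set.seed(42)
model_rf <- randomForest(mpg ~ ., data = mtcars, ntree = 500, importance = TRUE)

y <- mtcars$mpg

# R^2 = 1 - residual sum of squares / total sum of squares
r2 <- function(y, y_hat) 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)

# In-sample predictions: every tree votes on every row, including rows it trained on
pred_train <- predict(model_rf, newdata = mtcars)

# Out-of-bag predictions stored in the fitted object
pred_oob <- model_rf$predicted

train_mse <- mean((y - pred_train)^2)
oob_mse   <- mean((y - pred_oob)^2)
train_r2  <- r2(y, pred_train)
oob_r2    <- r2(y, pred_oob)

round(c(train_mse = train_mse, oob_mse = oob_mse,
        train_r2 = train_r2, oob_r2 = oob_r2), 3)
```

The OOB numbers are the ones to report as an estimate of performance on new data; the in-sample numbers are flattering because the trees have already seen those rows.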

Conclusion

Random Forest Regression is powerful because:

  • It averages \(B = 500\) decision trees to reduce variance
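  • The in-sample \(R^2\) on mtcars is typically > 0.97; the out-of-bag \(R^2\) (reported as "% Var explained" by print(model_rf)) is lower and is a better guide to performance on new data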
  • Features wt and hp drive MPG the most
  • Works well even without data scaling or normalization

Limitations:

  • Less interpretable than linear regression
  • Computationally heavier on large datasets
  • Can overfit on very small datasets

Random Forest is a go-to model in data science for its balance of accuracy and flexibility.