Random Forest Regression

What is Random Forest Regression?

Random Forest is an ensemble learning method that builds multiple decision trees and averages their predictions.

Works for both classification and regression
Reduces overfitting compared to a single decision tree
Handles non-linear relationships well
Built-in feature importance ranking

We will use the built-in mtcars dataset to predict miles per gallon (mpg).

The Math Behind It

Each tree \(T_b\) is built on a bootstrap sample. The final prediction is the average:

\[\hat{y} = \frac{1}{B} \sum_{b=1}^{B} T_b(x)\]

Where:

\(B\) = number of trees
\(T_b(x)\) = prediction from the \(b\)-th tree
\(x\) = input feature vector

The model minimizes the Mean Squared Error (MSE):

\[MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]
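
To make the averaging formula concrete, the sketch below fits a small forest and checks that the forest's prediction matches the mean of the individual tree predictions. It assumes the randomForest package is installed. The names fit, pred, manual_avg, and mse are illustrative, and ntree = 50 is chosen only to keep the example fast.

```r
library(randomForest)

set.seed(42)

# Small forest on mtcars; ntree = 50 is an arbitrary choice for speed
fit <- randomForest(mpg ~ ., data = mtcars, ntree = 50)

# predict.all = TRUE returns the aggregated prediction plus the
# per-tree predictions T_b(x) as a matrix (rows = cars, columns = trees)
pred <- predict(fit, newdata = mtcars, predict.all = TRUE)

# Manual version of y_hat = (1/B) * sum over b of T_b(x)
manual_avg <- rowMeans(pred$individual)

# Compare the manual average with the forest's own output
all.equal(unname(manual_avg), unname(pred$aggregate))

# MSE of these predictions against the observed mpg values
# (in-sample, so it is an optimistic figure)
mse <- mean((mtcars$mpg - pred$aggregate)^2)
mse
```

Because the predictions here are made on the same rows the forest was trained on, this MSE understates the error to expect on new cars.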

The Dataset
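
A quick look at the data can be sketched as follows. mtcars ships with base R in the datasets package, so nothing extra needs to be installed. The variable name predictors is illustrative.

```r
# mtcars lives in the datasets package, which is attached by default
data(mtcars)

# 32 cars and 11 numeric columns: mpg is the target, the other 10 are predictors
dim(mtcars)

# First few rows
head(mtcars)

# Distribution of the response variable
summary(mtcars$mpg)

# Predictors picked up by the formula mpg ~ .
predictors <- setdiff(names(mtcars), "mpg")
predictors
```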

Feature Importance
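
One way to produce an importance ranking is sketched below, assuming the randomForest package is installed. The object names fit, imp, and imp_sorted are illustrative. The exact ordering of variables can change with the random seed, so the output should be read as indicative rather than fixed.

```r
library(randomForest)

set.seed(42)

# importance = TRUE is needed for the permutation-based %IncMSE measure
fit <- randomForest(mpg ~ ., data = mtcars, ntree = 500, importance = TRUE)

# Matrix with one row per predictor and two columns:
# %IncMSE (permutation importance) and IncNodePurity
imp <- importance(fit)

# Rank predictors by permutation importance, largest first
imp_sorted <- imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]
imp_sorted

# Dot chart of both importance measures
varImpPlot(fit, main = "Variable importance for mpg")
```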

R Code: Building the Model

library(randomForest)

set.seed(42)

# Fit model with all features
model_rf <- randomForest(
  mpg ~ .,
  data = mtcars,
  ntree = 500,
  importance = TRUE
)

# View summary
print(model_rf)

# Feature importance
importance(model_rf)

Model Performance
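
Performance can be summarised in more than one way. The sketch below, which assumes the randomForest package is installed, compares out-of-bag (OOB) predictions with in-sample predictions. The helper r2 and the other object names are illustrative. In-sample \(R^2\) is usually noticeably higher than the OOB value, because every tree that saw a given car during training contributes to its in-sample prediction.

```r
library(randomForest)

set.seed(42)
fit <- randomForest(mpg ~ ., data = mtcars, ntree = 500)

y <- mtcars$mpg

# Coefficient of determination: 1 - SS_residual / SS_total
r2 <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# OOB predictions: calling predict() without newdata returns, for each car,
# the average over trees whose bootstrap sample did not contain that car
oob_pred <- predict(fit)
oob_r2   <- r2(y, oob_pred)
oob_rmse <- sqrt(mean((y - oob_pred)^2))

# In-sample predictions: all trees are used, including those trained on the car
train_pred <- predict(fit, newdata = mtcars)
train_r2   <- r2(y, train_pred)

c(oob_r2 = oob_r2, oob_rmse = oob_rmse, train_r2 = train_r2)
```

The OOB figures are the better guide to how the model would behave on cars it has not seen.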

Conclusion

Random Forest Regression is powerful because:

It averages \(B = 500\) decision trees to reduce variance
The in-sample \(R^2\) on mtcars is typically > 0.97; the out-of-bag estimate is lower and is a fairer measure of predictive accuracy
Features wt and hp drive MPG the most
Works well even without data scaling or normalization
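
The point about scaling can be checked with a small experiment. Tree splits depend only on the ordering of each predictor's values, so rescaling a predictor should leave the fitted partitions unchanged. The sketch below assumes the randomForest package is installed. It converts wt from units of 1000 lbs to lbs, refits with the same seed, and compares predictions. The names mtcars_lbs, fit_raw, fit_lbs, and max_diff are illustrative, and ntree = 200 is an arbitrary choice.

```r
library(randomForest)

# Copy of mtcars with weight rescaled from 1000 lbs to lbs
mtcars_lbs <- mtcars
mtcars_lbs$wt <- mtcars_lbs$wt * 1000

set.seed(42)
fit_raw <- randomForest(mpg ~ ., data = mtcars, ntree = 200)

# Same seed, so both forests draw the same bootstrap samples
# and the same candidate variables at each split
set.seed(42)
fit_lbs <- randomForest(mpg ~ ., data = mtcars_lbs, ntree = 200)

# Predict each data set with the forest that was trained on it
pred_raw <- predict(fit_raw, newdata = mtcars)
pred_lbs <- predict(fit_lbs, newdata = mtcars_lbs)

# Largest absolute difference between the two sets of predictions
max_diff <- max(abs(pred_raw - pred_lbs))
max_diff
```

A linear model's coefficients would change under the same rescaling, which is one reason tree ensembles need less preprocessing.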

Limitations:

Less interpretable than linear regression
Computationally heavier on large datasets
Can overfit on very small datasets

Random Forest is a go-to model in data science for its balance of accuracy and flexibility.