Supervised Learning - Regression

Predicting Highway Mileage

Objective: Compare the performance of four regression models on the mpg dataset and determine the winner based on test-set RMSE.

Models:

Linear Regression.
Generalized Additive Model (GAM).
Random Forest.
Gradient Boosting Machine (GBM).

Evaluation Metrics:

Root Mean Squared Error (RMSE).
Mean Squared Error (MSE).

Todo:
- Load the necessary libraries and the dataset.
- Split the data into a training set and a test set.
- Train various machine learning models (linear regression, generalized additive models, random forests, and gradient boosting machines) on the training set.
- Evaluate the models on the test set using RMSE (Root Mean Squared Error) and MSE (Mean Squared Error).
- Compare the performance of the models and determine the winner.

Load necessary libraries

library(ggplot2)
library(caret)

## Loading required package: lattice

library(glmnet)

## Loading required package: Matrix

## Loaded glmnet 4.1-8

library(randomForest)

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(gbm)

## Loaded gbm 2.1.8.1

library(magrittr)
library(mgcv)  # Load the mgcv package

## Loading required package: nlme

## This is mgcv 1.9-0. For overview type 'help("mgcv-package")'.

library(kableExtra)
library(knitr)

Load the dataset

data(mpg)

Split the data into 80% train and 20% test

set.seed(9202023)
trainIndex <- createDataPartition(mpg$hwy, p = 0.8, 
                                  list = FALSE, 
                                  times = 1)
trainData <- mpg[trainIndex, ]
testData <- mpg[-trainIndex, ]
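A quick check can confirm that the two subsets have the expected sizes and broadly similar hwy distributions. The sketch below repeats the split so that it runs on its own; the name `split_summary` is introduced here for illustration only.

```r
library(ggplot2)  # provides the mpg dataset
library(caret)    # provides createDataPartition()

data(mpg)
set.seed(9202023)
trainIndex <- createDataPartition(mpg$hwy, p = 0.8, list = FALSE, times = 1)
trainData <- mpg[trainIndex, ]
testData <- mpg[-trainIndex, ]

# Row counts and hwy summary statistics for each subset
split_summary <- data.frame(
  Set = c("train", "test"),
  Rows = c(nrow(trainData), nrow(testData)),
  MeanHwy = c(mean(trainData$hwy), mean(testData$hwy)),
  SdHwy = c(sd(trainData$hwy), sd(testData$hwy))
)
print(split_summary)
```

If the means or standard deviations differed sharply, the test-set errors would be harder to interpret.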

Define a function to calculate RMSE and MSE

# Return the test-set RMSE and MSE for a fitted model
calculate_errors <- function(model, testData) {
  predictions <- predict(model, newdata = testData)
  mse <- mean((testData$hwy - predictions)^2)  # mean squared error
  rmse <- sqrt(mse)                            # RMSE is the square root of MSE
  c(RMSE = rmse, MSE = mse)
}
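To make the two metrics concrete, here is a small worked example on toy numbers. The vectors `actual` and `predicted` are made up for illustration and are not taken from the mpg data.

```r
# Toy values, not taken from the mpg data
actual <- c(30, 25, 20, 15)
predicted <- c(28, 26, 20, 18)

squared_errors <- (actual - predicted)^2  # 4, 1, 0, 9
mse <- mean(squared_errors)               # 14 / 4 = 3.5
rmse <- sqrt(mse)                         # square root of 3.5, about 1.87

print(c(RMSE = rmse, MSE = mse))
```

Because RMSE is just the square root of MSE, the two metrics always rank models in the same order. RMSE has the advantage of being in the same units as hwy.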

Train various models

# Exploratory plot before fitting: highway mileage vs. engine displacement
trainData %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point()

Linear Regression

lm_model <- lm(hwy ~ displ + cty + cyl + year, data = trainData)

Generalized Additive Model

gam_model <- gam(hwy ~ s(displ) + cty + cyl + year, data = trainData)
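One way to see how much the smooth term adds over a straight line is to look at its effective degrees of freedom (edf) in the model summary. The sketch below is an illustration only: it refits the same formula on all rows of mpg so that it runs on its own, and the name `gam_fit` is introduced here.

```r
library(ggplot2)  # provides the mpg dataset
library(mgcv)     # provides gam()

data(mpg)
# Same formula as above, fit on all rows for a self-contained example
gam_fit <- gam(hwy ~ s(displ) + cty + cyl + year, data = mpg)

# Table of smooth terms: an edf close to 1 means s(displ) is nearly linear
smooth_table <- summary(gam_fit)$s.table
print(smooth_table)
```

An edf close to 1 would suggest the GAM behaves much like the linear model. A larger edf indicates a curved relationship between displ and hwy.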

Random Forest

rf_model <- randomForest(hwy ~ displ + cty + cyl + year, data = trainData)
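One setting that may matter for this formula is `mtry`, the number of predictors tried at each split. For regression, randomForest defaults to a third of the predictors (rounded down, minimum 1), which is 1 with four predictors. The sketch below compares the final out-of-bag MSE across `mtry` values. It uses all rows of mpg for brevity, and `rf_data` and `oob_mse` are names introduced here, so treat it as an illustration, not as part of the analysis.

```r
library(ggplot2)       # provides the mpg dataset
library(randomForest)  # provides randomForest()

data(mpg)
# Keep only the columns used in the formula
rf_data <- as.data.frame(mpg[, c("hwy", "displ", "cty", "cyl", "year")])

set.seed(9202023)
# Final out-of-bag MSE for each candidate mtry
oob_mse <- sapply(1:4, function(m) {
  fit <- randomForest(hwy ~ displ + cty + cyl + year, data = rf_data, mtry = m)
  tail(fit$mse, 1)  # fit$mse holds the OOB MSE after each tree
})
names(oob_mse) <- paste0("mtry_", 1:4)
print(oob_mse)
```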

Gradient Boosting Machine

gbm_model <- gbm(hwy ~ displ + cty + cyl + year, data = trainData, n.trees = 100)

## Distribution not specified, assuming gaussian ...
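The call above fixes the number of trees at 100. One common alternative is to let cross-validation choose it. The sketch below is a minimal illustration, not the method used in this analysis. The values for `n.trees`, `shrinkage`, `interaction.depth` and `cv.folds` are arbitrary assumptions, all rows of mpg are used for brevity, and `gbm_data`, `gbm_cv` and `best_trees` are names introduced here.

```r
library(ggplot2)  # provides the mpg dataset
library(gbm)      # provides gbm() and gbm.perf()

data(mpg)
# Keep only the columns used in the formula
gbm_data <- as.data.frame(mpg[, c("hwy", "displ", "cty", "cyl", "year")])

set.seed(9202023)
gbm_cv <- gbm(hwy ~ displ + cty + cyl + year, data = gbm_data,
              distribution = "gaussian",  # stated explicitly to avoid the message above
              n.trees = 500, shrinkage = 0.05,
              interaction.depth = 2,
              cv.folds = 5, n.cores = 1)

# Number of trees with the lowest cross-validated error
best_trees <- gbm.perf(gbm_cv, method = "cv", plot.it = FALSE)
print(best_trees)
```

The selected value can then be passed to `predict()` through its `n.trees` argument.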

Evaluate models

lm_errors <- calculate_errors(lm_model, testData)
gam_errors <- calculate_errors(gam_model, testData)
rf_errors <- calculate_errors(rf_model, testData)
gbm_errors <- calculate_errors(gbm_model, testData)

## Using 100 trees...

Create a table to compare models

model_comparison <- data.frame(
  Model = c("Linear Regression", "Generalized Additive Model", 
            "Random Forest", "Gradient Boosting Machine"),
  RMSE = c(lm_errors["RMSE"], gam_errors["RMSE"], rf_errors["RMSE"], gbm_errors["RMSE"]),
  MSE = c(lm_errors["MSE"], gam_errors["MSE"], rf_errors["MSE"], gbm_errors["MSE"])
)

Determine the winner model

winner_model <- model_comparison[which.min(model_comparison$RMSE), ]

Results:

The following table shows the RMSE and MSE for each model:

model_comparison

##                        Model     RMSE      MSE
## 1          Linear Regression 1.255960 1.577435
## 2 Generalized Additive Model 1.226731 1.504869
## 3              Random Forest 2.267690 5.142416
## 4  Gradient Boosting Machine 1.918309 3.679910

winner_model

##                        Model     RMSE      MSE
## 2 Generalized Additive Model 1.226731 1.504869

Print the table and winner model

# Set the caption in the first kable() call; piping the styled HTML into a
# second kable() call wraps it in another one-column table
model_comparison %>%
  kable("html", caption = "Model Comparison Table") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center") %>%
  add_header_above(c(" " = 1, "Model Comparison" = 2))

Model Comparison Table

                              Model Comparison
Model                          RMSE        MSE
Linear Regression          1.255960   1.577435
Generalized Additive Model 1.226731   1.504869
Random Forest              2.267689   5.142416
Gradient Boosting Machine  1.918309   3.679910

cat("Winner Model: ", winner_model$Model, " (RMSE = ", winner_model$RMSE, ", MSE = ", winner_model$MSE, ")\n")

## Winner Model:  Generalized Additive Model  (RMSE =  1.226731 , MSE =  1.504869 )

Print a table with all models

# Build the winner row as a data frame so that RMSE and MSE stay numeric
# (binding a character vector would coerce both columns to text)
winner_row <- data.frame(Model = "Winner", RMSE = winner_model$RMSE, MSE = winner_model$MSE)
full_model_comparison <- rbind(model_comparison, winner_row)

# Print the table
full_model_comparison %>%
  kable("html", caption = "Model Comparison Table") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center") %>%
  add_header_above(c(" " = 1, "Model Comparison" = 2))

Model Comparison Table

                              Model Comparison
Model                          RMSE        MSE
Linear Regression          1.255960   1.577435
Generalized Additive Model 1.226731   1.504869
Random Forest              2.267689   5.142416
Gradient Boosting Machine  1.918309   3.679910
Winner                     1.226731   1.504869

Conclusion

The Generalized Additive Model performed best, with the lowest RMSE (1.23) and MSE (1.50) on the test set. Linear Regression was a close second (RMSE 1.26). The Gradient Boosting Machine (RMSE 1.92) and the Random Forest (RMSE 2.27) were clearly worse, with the Random Forest in last place. Because MSE is the square of RMSE, both metrics rank the models identically.

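Based on this comparison, I recommend the Generalized Additive Model for predicting highway mileage (hwy). Note, however, that the result comes from a single 80/20 split, and that the Random Forest and Gradient Boosting Machine were fit with largely default settings, so the ranking could change with tuning or with a different split. More generally, the best model for a particular problem depends on the specific data and the desired outcome.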
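Since the comparison above rests on one train/test split, k-fold cross-validation is one way to check how stable the ranking is. The sketch below does this for the two closest models only, using 5 folds. The fold count and the names `cv_data`, `fold_rmse` and `cv_rmse` are assumptions made for illustration, and its numbers will differ from the table above.

```r
library(ggplot2)  # provides the mpg dataset
library(caret)    # provides createFolds()
library(mgcv)     # provides gam()

data(mpg)
cv_data <- as.data.frame(mpg[, c("hwy", "displ", "cty", "cyl", "year")])

set.seed(9202023)
folds <- createFolds(cv_data$hwy, k = 5)  # list of held-out row indices

# RMSE of one fitted model on one held-out fold
fold_rmse <- function(fit, held_out) {
  sqrt(mean((held_out$hwy - predict(fit, newdata = held_out))^2))
}

# One column per fold, one row per model
cv_rmse <- sapply(folds, function(idx) {
  train <- cv_data[-idx, ]
  test <- cv_data[idx, ]
  c(
    lm = fold_rmse(lm(hwy ~ displ + cty + cyl + year, data = train), test),
    gam = fold_rmse(gam(hwy ~ s(displ) + cty + cyl + year, data = train), test)
  )
})

# Mean RMSE per model across the five folds
print(rowMeans(cv_rmse))
```

If the two means are close relative to the spread across folds, the gap between the linear model and the GAM may not be meaningful.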