Objective: To compare the performance of four regression models on the mpg dataset and determine the winner based on RMSE.
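Models: linear regression, generalized additive model (GAM), random forest, and gradient boosting machine (GBM).
Evaluation Metrics: RMSE (Root Mean Squared Error) and MSE (Mean Squared Error).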
Todo:
- Load the necessary libraries and the dataset.
- Split the data into a training set and a test set.
- Train various machine learning models (linear regression, generalized
additive models, random forests, and gradient boosting machines) on the
training set.
- Evaluate the models on the test set using RMSE (Root Mean Squared
Error) and MSE (Mean Squared Error).
- Compare the performance of the models and determine the winner.
Load necessary libraries
library(ggplot2)
library(caret)
## Loading required package: lattice
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(gbm)
## Loaded gbm 2.1.8.1
library(magrittr)
library(mgcv) # Load the mgcv package
## Loading required package: nlme
## This is mgcv 1.9-0. For overview type 'help("mgcv-package")'.
library(kableExtra)
library(knitr)
Load the dataset
data(mpg)
Split the data into 80% train and 20% test
set.seed(9202023)
trainIndex <- createDataPartition(mpg$hwy, p = 0.8,
                                  list = FALSE,
                                  times = 1)
trainData <- mpg[trainIndex, ]
testData <- mpg[-trainIndex, ]
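To make the split step more concrete, here is a minimal base R sketch of a stratified 80/20 split on a numeric outcome. It only approximates the idea behind `createDataPartition` (sampling within percentile groups of the outcome) and is not caret's implementation. The helper name `stratified_split`, the `groups` argument, and the use of `mtcars$mpg` as a stand-in outcome are all assumptions made for illustration.

```r
# Hypothetical helper: stratified split on a numeric outcome, base R only.
# This is an illustrative approximation, not caret's actual algorithm.
stratified_split <- function(y, p = 0.8, groups = 5) {
  # Bin the outcome at its quantiles so each bin holds a similar share of rows.
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups + 1)))
  bins <- cut(y, breaks = breaks, include.lowest = TRUE)
  # Sample a fraction p of the row indices inside each bin.
  idx <- lapply(split(seq_along(y), bins), function(rows) {
    rows[sample.int(length(rows), size = ceiling(p * length(rows)))]
  })
  sort(unlist(idx, use.names = FALSE))
}

set.seed(1)
y <- mtcars$mpg                       # stand-in outcome; the report uses mpg$hwy
train_idx <- stratified_split(y)
test_idx <- setdiff(seq_along(y), train_idx)

# The two index sets should be disjoint and together cover every row.
length(train_idx)
length(test_idx)
```

Sampling within bins keeps the distribution of the outcome similar in the training and test sets, which matters more on small datasets such as this one.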
Define a function to calculate RMSE and MSE
calculate_errors <- function(model, testData) {
  # Predict highway mpg for the held-out rows
  predictions <- predict(model, newdata = testData)
  # MSE is the mean squared residual; RMSE is its square root
  mse <- mean((testData$hwy - predictions)^2)
  rmse <- sqrt(mse)
  return(c(RMSE = rmse, MSE = mse))
}
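As a quick check on the two metrics, the following sketch works through MSE and RMSE by hand. The `actual` and `predicted` vectors are made-up toy values, not taken from the mpg data.

```r
# Toy values, chosen for illustration only (not from the mpg data).
actual    <- c(30, 25, 28, 22)
predicted <- c(28, 26, 27, 24)

residuals <- actual - predicted   # 2, -1, 1, -2
mse  <- mean(residuals^2)         # (4 + 1 + 1 + 4) / 4 = 2.5
rmse <- sqrt(mse)                 # about 1.58, in the same units as the outcome
```

Because RMSE is the square root of MSE, the two metrics always rank models in the same order; RMSE is simply easier to read because it is expressed in the units of the outcome.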
trainData %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point()
lm_model <- lm(hwy ~ displ + cty + cyl + year, data = trainData)
gam_model <- gam(hwy ~ s(displ) + cty + cyl + year, data = trainData)
rf_model <- randomForest(hwy ~ displ + cty + cyl + year, data = trainData)
gbm_model <- gbm(hwy ~ displ + cty + cyl + year, data = trainData, n.trees = 100)
## Distribution not specified, assuming gaussian ...
lm_errors <- calculate_errors(lm_model, testData)
gam_errors <- calculate_errors(gam_model, testData)
rf_errors <- calculate_errors(rf_model, testData)
gbm_errors <- calculate_errors(gbm_model, testData)
## Using 100 trees...
model_comparison <- data.frame(
  Model = c("Linear Regression", "Generalized Additive Model",
            "Random Forest", "Gradient Boosting Machine"),
  RMSE = c(lm_errors["RMSE"], gam_errors["RMSE"], rf_errors["RMSE"], gbm_errors["RMSE"]),
  MSE = c(lm_errors["MSE"], gam_errors["MSE"], rf_errors["MSE"], gbm_errors["MSE"])
)
winner_model <- model_comparison[which.min(model_comparison$RMSE), ]
Results:
The following table shows the RMSE and MSE for each model:
model_comparison
## Model RMSE MSE
## 1 Linear Regression 1.255960 1.577435
## 2 Generalized Additive Model 1.226731 1.504869
## 3 Random Forest 2.267690 5.142416
## 4 Gradient Boosting Machine 1.918309 3.679910
winner_model
## Model RMSE MSE
## 2 Generalized Additive Model 1.226731 1.504869
# Call kable() once; piping a styled table into kable() again wraps the HTML in a one-column table
model_comparison %>%
  kable("html", caption = "Model Comparison Table") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center") %>%
  add_header_above(c(" " = 1, "Model Comparison" = 2))
| Model | RMSE | MSE |
|:---|---:|---:|
| Linear Regression | 1.255960 | 1.577435 |
| Generalized Additive Model | 1.226731 | 1.504869 |
| Random Forest | 2.267689 | 5.142416 |
| Gradient Boosting Machine | 1.918309 | 3.679910 |
cat("Winner Model: ", winner_model$Model, " (RMSE = ", winner_model$RMSE, ", MSE = ", winner_model$MSE, ")\n")
## Winner Model: Generalized Additive Model (RMSE = 1.226731 , MSE = 1.504869 )
# Append the winner as a data frame row; c("Winner", ...) would coerce RMSE and MSE to character
full_model_comparison <- rbind(
  model_comparison,
  data.frame(Model = "Winner", RMSE = winner_model$RMSE, MSE = winner_model$MSE)
)
# Print the table
full_model_comparison %>%
  kable("html", caption = "Model Comparison Table") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center") %>%
  add_header_above(c(" " = 1, "Model Comparison" = 2))
| Model | RMSE | MSE |
|:---|---:|---:|
| Linear Regression | 1.255960 | 1.577435 |
| Generalized Additive Model | 1.226731 | 1.504869 |
| Random Forest | 2.267689 | 5.142416 |
| Gradient Boosting Machine | 1.918309 | 3.679910 |
| Winner | 1.226731 | 1.504869 |
The Generalized Additive Model performed best, with the lowest RMSE (1.227) and MSE (1.505). Linear Regression was a close second (RMSE 1.256). The Gradient Boosting Machine (RMSE 1.918) and the Random Forest (RMSE 2.268) trailed clearly, with the Random Forest performing worst on both metrics.
Based on these results, I recommend the Generalized Additive Model for predicting highway mpg (hwy). Note, however, that this ranking comes from a single 80/20 split, and the random forest and gradient boosting models were fit without tuning (the latter with n.trees = 100), so the ordering could change with tuning or a different split.
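Since the comparison above rests on one train/test split, a k-fold cross-validated RMSE is one way to see how stable a model's error is. The sketch below uses base R only and a linear model. The helper name `cv_rmse`, the `mtcars` data, and the formula `mpg ~ wt + hp` are stand-ins chosen for illustration; they are not part of the original analysis.

```r
# Hypothetical helper: k-fold cross-validated RMSE for a linear model.
# mtcars and the formula below are stand-ins; the report models hwy in mpg.
cv_rmse <- function(formula, data, k = 5) {
  outcome <- all.vars(formula)[1]
  # Assign each row to one of k folds at random.
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  fold_rmse <- vapply(seq_len(k), function(i) {
    fit <- lm(formula, data = data[folds != i, ])   # train on k - 1 folds
    held_out <- data[folds == i, ]                  # evaluate on the remaining fold
    sqrt(mean((held_out[[outcome]] - predict(fit, newdata = held_out))^2))
  }, numeric(1))
  c(mean = mean(fold_rmse), sd = sd(fold_rmse))
}

set.seed(1)
cv_result <- cv_rmse(mpg ~ wt + hp, data = mtcars, k = 5)
cv_result
```

The standard deviation across folds gives a rough sense of whether a gap as small as the one between the GAM and the linear model is larger than the split-to-split noise.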