Predicting Highway Mileage

Objective: To compare the performance of four regression models at predicting highway mileage (hwy) in the mpg dataset, and to select a winner based on RMSE.

Models:

- Linear Regression
- Generalized Additive Model (GAM)
- Random Forest
- Gradient Boosting Machine (GBM)

Evaluation Metrics:

- RMSE (Root Mean Squared Error)
- MSE (Mean Squared Error)
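
With $y_i$ the observed highway mileage, $\hat{y}_i$ a model's prediction, and $n$ the number of test observations:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$

Since RMSE is the square root of MSE, the two metrics always rank the models identically; RMSE is reported as well because it is in the same units as hwy (miles per gallon).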

Todo:
- Load the necessary libraries and the dataset.
- Split the data into a training set and a test set.
- Train the four regression models (linear regression, a generalized additive model, a random forest, and a gradient boosting machine) on the training set.
- Evaluate the models on the test set using RMSE (Root Mean Squared Error) and MSE (Mean Squared Error).
- Compare the performance of the models and determine the winner.

Load necessary libraries

library(ggplot2)
library(caret)
## Loading required package: lattice
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(gbm)
## Loaded gbm 2.1.8.1
library(magrittr)
library(mgcv)  # for fitting GAMs
## Loading required package: nlme
## This is mgcv 1.9-0. For overview type 'help("mgcv-package")'.
library(kableExtra)
library(knitr)

Load the dataset

data(mpg)
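
The mpg data ships with ggplot2, so no separate download is needed; a quick look at the three variables used below:

# displ and cty are the predictors; hwy (highway miles per gallon) is the target
str(mpg[, c("displ", "cty", "hwy")])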

Split the data into 80% train and 20% test

set.seed(9202023)
trainIndex <- createDataPartition(mpg$hwy, p = 0.8, 
                                  list = FALSE, 
                                  times = 1)
trainData <- mpg[trainIndex, ]
testData <- mpg[-trainIndex, ]
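
As a quick sanity check, the partition sizes can be confirmed; createDataPartition on a numeric outcome samples within percentile groups of hwy, so the split is roughly balanced on the response:

# Roughly 80/20 of mpg's 234 rows
nrow(trainData)
nrow(testData)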

Define a function to calculate RMSE and MSE

calculate_errors <- function(model, testData) {
  predictions <- predict(model, newdata = testData)
  mse <- mean((testData$hwy - predictions)^2)  # mean squared error
  rmse <- sqrt(mse)                            # RMSE is just the square root of MSE
  return(c(RMSE = rmse, MSE = mse))
}

Train various models

Linear Regression

lm_model <- lm(hwy ~ displ + cty, data = trainData)
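
The baseline fit can be inspected before evaluation, for example:

summary(lm_model)  # coefficients and training-set R-squared for displ and cty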

Generalized Additive Model

gam_model <- gam(hwy ~ s(displ) + s(cty), data = trainData)
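
A benefit of the GAM is that its fitted smooths can be inspected directly, for example:

summary(gam_model)          # effective degrees of freedom per smooth term
plot(gam_model, pages = 1)  # partial effects of s(displ) and s(cty) on one page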

Random Forest

rf_model <- randomForest(hwy ~ displ + cty, data = trainData)
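
The forest's variable importance can be checked to see how the two predictors contribute, for example:

importance(rf_model)  # node-purity-based importance for displ and cty
varImpPlot(rf_model)  # the same importances as a dot chart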

Gradient Boosting Machine

gbm_model <- gbm(hwy ~ displ + cty, data = trainData, n.trees = 100)
## Distribution not specified, assuming gaussian ...
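
The message above appears because the call let gbm() infer the loss from the response type. A more explicit variant (a sketch only; gbm_model_explicit is a hypothetical name and the hyperparameter values are illustrative, not tuned) would be:

# Hypothetical, more explicit fit: same squared-error loss, stated up front
gbm_model_explicit <- gbm(hwy ~ displ + cty, data = trainData,
                          distribution = "gaussian",  # squared-error loss
                          n.trees = 100,              # same tree count as above
                          interaction.depth = 2,      # illustrative: allow 2-way splits
                          shrinkage = 0.1)            # gbm's default learning rate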

Evaluate models

lm_errors <- calculate_errors(lm_model, testData)
gam_errors <- calculate_errors(gam_model, testData)
rf_errors <- calculate_errors(rf_model, testData)
gbm_errors <- calculate_errors(gbm_model, testData)
## Using 100 trees...
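
The "Using 100 trees..." message comes from predict.gbm, which picks a tree count when n.trees is not supplied. Passing it explicitly (here, all 100 fitted trees) silences the message and should reproduce the value from calculate_errors():

# Explicitly use all 100 boosting iterations when predicting
gbm_pred <- predict(gbm_model, newdata = testData, n.trees = 100)
sqrt(mean((testData$hwy - gbm_pred)^2))  # RMSE for the GBM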

Create a table to compare models

model_comparison <- data.frame(
  Model = c("Linear Regression", "Generalized Additive Model", 
            "Random Forest", "Gradient Boosting Machine"),
  RMSE = c(lm_errors["RMSE"], gam_errors["RMSE"], rf_errors["RMSE"], gbm_errors["RMSE"]),
  MSE = c(lm_errors["MSE"], gam_errors["MSE"], rf_errors["MSE"], gbm_errors["MSE"])
)

Determine the winner model

winner_model <- model_comparison[which.min(model_comparison$RMSE), ]

Results:

The following table shows the RMSE and MSE for each model:

model_comparison
##                        Model     RMSE      MSE
## 1          Linear Regression 1.344991 1.809000
## 2 Generalized Additive Model 1.403406 1.969548
## 3              Random Forest 1.460389 2.132737
## 4  Gradient Boosting Machine 2.105117 4.431519
winner_model
##               Model     RMSE   MSE
## 1 Linear Regression 1.344991 1.809
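
Since knitr, kableExtra, and magrittr are already loaded (but unused so far), the comparison could also be rendered as a formatted table in the knitted report; a minimal sketch:

# Nicely formatted version of the comparison table
model_comparison %>%
  kable(digits = 3, caption = "Test-set error by model") %>%
  kable_styling(full_width = FALSE)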

Conclusion

The Linear Regression model performed best, with the lowest RMSE and MSE. The Generalized Additive Model and Random Forest were close behind, with slightly higher errors. The Gradient Boosting Machine performed worst, with the highest RMSE and MSE, which is perhaps unsurprising given that its hyperparameters were left at their defaults rather than tuned.

Based on the results of this comparison, I recommend the Linear Regression model for predicting highway mileage (hwy). However, it is important to note that the best model for a particular problem depends on the specific data and the desired outcome.