(14) In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
Loading the necessary libraries
library(ISLR2)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
library(class)
library(e1071)
## Warning: package 'e1071' was built under R version 4.4.3
library(ggplot2)
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
data(Auto) # Loading the dataset
(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
mpg_median = median(Auto$mpg)
# creating a new variable to see if mpg > median
Auto$mpg01 = ifelse(Auto$mpg > mpg_median, 1, 0)
table(Auto$mpg01)
##
## 0 1
## 196 196
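Adding the column directly to Auto already gives a single data set containing mpg01 and the other variables. For completeness, an equivalent construction with data.frame(), as the hint suggests, could look like the sketch below (Auto01 is just an illustrative name):
# Equivalent: build a separate data frame holding mpg01 plus the original Auto variables
Auto01 <- data.frame(mpg01 = ifelse(Auto$mpg > mpg_median, 1, 0),
                     Auto[, setdiff(names(Auto), "mpg01")])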
(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
# Scatterplot matrix
pairs(Auto[, c("mpg", "horsepower", "weight", "acceleration", "displacement")])
# Boxplots
ggplot(Auto, aes(x = as.factor(mpg01), y = horsepower)) +
  geom_boxplot() +
  labs(title = "Horsepower vs mpg01", x = "mpg01 (0=Low, 1=High)", y = "Horsepower")
ggplot(Auto, aes(x = as.factor(mpg01), y = weight)) +
  geom_boxplot() +
  labs(title = "Weight vs mpg01", x = "mpg01 (0=Low, 1=High)", y = "Weight")
ggplot(Auto, aes(x = as.factor(mpg01), y = acceleration)) +
  geom_boxplot() +
  labs(title = "Acceleration vs mpg01", x = "mpg01 (0=Low, 1=High)", y = "Acceleration")
Based on the graphs, horsepower, weight, and displacement are strongly associated with mpg01: high-mileage cars (mpg01 = 1) tend to have noticeably lower values of all three. In addition, acceleration also seems somewhat useful, although the separation between the two classes is weaker.
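As a numeric complement to the plots, the correlation of mpg01 with each numeric variable can be checked (a quick sketch; the name column is a factor and is dropped before computing correlations):
# Correlation of mpg01 with every numeric column of Auto
round(cor(Auto[, sapply(Auto, is.numeric)])["mpg01", ], 2)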
(c) Split the data into a training set and a test set.
set.seed(30) # For reproducibility
train_index = createDataPartition(Auto$mpg01, p = 0.7, list = FALSE) # 70% train, 30% test
# Create training and test sets
train = Auto[train_index, ]
test = Auto[-train_index, ]
# Define predictor variables based on (b)
predictors = c("horsepower", "weight", "displacement", "acceleration")
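An optional sanity check that the 70/30 split keeps the two classes roughly balanced in both sets:
# Class proportions in the training and test sets (each should be close to 0.5)
prop.table(table(train$mpg01))
prop.table(table(test$mpg01))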
(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
#Perform LDA
lda.fit <- lda(mpg01 ~ horsepower + weight + displacement + acceleration, data = train)
lda.pred <- predict(lda.fit, test)
lda.class <- lda.pred$class
# Confusion matrix & accuracy
table(lda.class, test$mpg01)
##
## lda.class 0 1
## 0 51 2
## 1 7 56
mean(lda.class == test$mpg01)
## [1] 0.9224138
# This LDA model performs well on the test set
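Since the question asks for the test error rather than the accuracy, it can also be reported directly from the same predictions:
# Test error = proportion of misclassified test observations (1 - accuracy)
mean(lda.class != test$mpg01) # 1 - 0.9224 is about 0.078 for this split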
(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
qda.fit <- qda(mpg01 ~ horsepower + weight + displacement + acceleration, data = train)
qda.pred <- predict(qda.fit, test)
qda.class <- qda.pred$class
table(qda.class, test$mpg01)
##
## qda.class 0 1
## 0 52 5
## 1 6 53
mean(qda.class == test$mpg01) # Accuracy ~0.905, so the test error is ~0.095
## [1] 0.9051724
(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
glm.fit = glm(mpg01 ~ horsepower + weight + displacement + acceleration, data = train, family = binomial)
glm.probs = predict(glm.fit, test, type = "response")
glm.pred = ifelse(glm.probs > 0.5, 1, 0)
table(glm.pred, test$mpg01)
##
## glm.pred 0 1
## 0 51 7
## 1 7 51
mean(glm.pred == test$mpg01) # Accuracy ~0.879, so the test error is ~0.121
## [1] 0.8793103
(g) Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
nb.fit = naiveBayes(as.factor(mpg01) ~ horsepower + weight + displacement + acceleration, data = train)
nb.pred = predict(nb.fit, test)
table(nb.pred, test$mpg01)
##
## nb.pred 0 1
## 0 51 7
## 1 7 51
mean(nb.pred == test$mpg01) # Accuracy ~0.879 (test error ~0.121); this model also performs well
## [1] 0.8793103
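Before moving on to KNN, the four classifiers fitted so far can be compared on the same test set. This short summary only reuses the prediction objects created above:
# Test error (misclassification rate) of each method on the 30% test set
round(c(LDA         = mean(lda.class != test$mpg01),
        QDA         = mean(qda.class != test$mpg01),
        Logistic    = mean(glm.pred  != test$mpg01),
        Naive_Bayes = mean(nb.pred   != test$mpg01)), 3)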
(h) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
train.X <- as.matrix(train[, predictors])
test.X <- as.matrix(test[, predictors])
train.Y <- train$mpg01
test.Y <- test$mpg01
knn.pred = knn(train.X, test.X, train.Y, k = 5)
mean(knn.pred == test.Y)
## [1] 0.9137931
knn.pred = knn(train.X, test.X, train.Y, k = 3)
mean(knn.pred == test.Y)
## [1] 0.9051724
knn.pred = knn(train.X, test.X, train.Y, k = 7)
mean(knn.pred == test.Y)
## [1] 0.9051724
Of the values tried, K = 5 gives the highest test accuracy (about 91.4%, i.e., a test error of about 8.6%), so K = 5 seems to perform best on this data set.
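Note that knn() breaks ties at random, so individual runs can differ slightly; a small loop with a fixed seed makes the comparison over several values of K reproducible and easy to extend. The sketch below reuses the train.X, test.X, train.Y, and test.Y objects defined above; the grid of K values is arbitrary:
# Compare KNN test errors over a grid of K values
set.seed(30)
k_values <- c(1, 3, 5, 7, 9, 15)
knn_err <- sapply(k_values, function(k) {
  pred <- knn(train.X, test.X, train.Y, k = k)
  mean(pred != test.Y)
})
data.frame(K = k_values, test_error = round(knn_err, 3))
One refinement worth considering (not done here, to match the analysis above) is scaling the predictors before running KNN, since weight and displacement are on much larger numeric scales than acceleration and can dominate the distance calculation.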