This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
library(knitr)
library(ISLR)
library(MASS)
library(class)
library(e1071)
attach(Weekly)
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
str(Weekly)
## 'data.frame': 1089 obs. of 9 variables:
## $ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
## $ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
## $ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
## $ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
## $ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
## $ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
## $ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
To find patters, I used th cor()
function.
Volume
appears to be strongly correlated with
Year
at 0.84194162.
cor(Weekly[, -9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
Using the pairs()
function, we can see a clear upward
trend of Volume
over time. The average number of shares
traded daily increased from 1990 to 2010.
pairs(Weekly)
plot(Volume)
(b) Use the full data set to perform a logistic regression
with Direction
as the response and the five lag variables
plus Volume
as predictors. Use the summary function to
print the results. Do any of the predictors appear to be statistically
significant? If so, which ones?
Using the glm()
function, I’ve performed a logistic
regression to find any significant predictors. Lag2
appears
to be statistically significant in predicting the market direction with
a p-value of 0.0296.
glm.fit <- glm(Direction~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
Below I have predicted the probabilities and converted them into
class labels “Up” or “Down”. Then, created a confusion matrix using the
table()
function. The matrix reveals that the model
correctly predicted 557 “Up” weeks (true positives) and 54 “Down” weeks
(true negatives). The model predicts “Up” majority of the time, hence
the 557 true positives and 430 false negatives. There were some mistakes
made as well. The model predicted “Up” 430 times when it was actually
“Down”. It also predicted “Down” 48 times when the actual value was
“Up”. This model heavily favors predicting “Up”, resulting in many false
positives.
glm.probs <- predict(glm.fit, type = "response")
glm.pred <- rep("Down", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred, Weekly$Direction)
##
## glm.pred Down Up
## Down 54 48
## Up 430 557
The overall fraction of correct predictions is approximately 56.1%
mean(glm.pred == Direction)
## [1] 0.5610652
(d) Now fit the logistic regression model using a training
data period from 1990 to 2008, with Lag2
as the only
predictor. Compute the confusion matrix and the overall fraction of
correct predictions for the held out data (that is, the data from 2009
and 2010).
The logistic regression model trained on data from 1990 to 2008 using Lag2 as the only predictor achieved 62.5% accuracy on the 2009 to 2010 test data. It predicted 65 correct “Up” weeks and 9 correct “Down” weeks. There are 34 falsely predicted “Up” values. The confusion matrix revealed that the model is still more successful at identifying “Up” weeks.
train <- (Year < 2009)
Weekly_test <- Weekly[!train, ]
Direction_test <- Direction[!train]
glm.fit.train <- glm(Direction ~ Lag2, data = Weekly, subset = train, family = binomial)
summary(glm.fit.train)
##
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly,
## subset = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
glm.prob.test <- predict(glm.fit.train, Weekly_test, type = "response")
glm.pred.test <- rep("Down", length(glm.prob.test))
glm.pred.test[glm.prob.test > 0.5] = "Up"
table(glm.pred.test, Direction_test)
## Direction_test
## glm.pred.test Down Up
## Down 9 5
## Up 34 56
glm.accuracy <- mean(glm.pred.test == Direction_test)
glm.accuracy
## [1] 0.625
(e) Repeat (d) using LDA.
lda.fit <- lda(Direction ~ Lag2, data = Weekly, subset = train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
lda.pred <- predict(lda.fit, Weekly_test)
lda.class <- lda.pred$class
table(lda.class, Direction_test)
## Direction_test
## lda.class Down Up
## Down 9 5
## Up 34 56
lda.accuracy <- mean(lda.class == Direction_test)
lda.accuracy
## [1] 0.625
(f) Repeat (d) using QDA.
qda.fit <- qda(Direction ~ Lag2, data = Weekly, subset = train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
qda.pred <- predict(qda.fit, Weekly_test)
qda.class <- qda.pred$class
table(qda.class, Direction_test)
## Direction_test
## qda.class Down Up
## Down 0 0
## Up 43 61
qda.accuracy <- mean(qda.class == Direction_test)
qda.accuracy
## [1] 0.5865385
(g) Repeat (d) using KNN with \(K = 1\).
train_X <- as.matrix(Lag2[train])
test_X <- as.matrix(Lag2[!train])
Direction_train <- Direction[train]
Direction_test <- Direction[!train]
set.seed(123)
knn.pred <- knn(train_X, test_X, Direction_train, k = 1)
table(knn.pred, Direction_test)
## Direction_test
## knn.pred Down Up
## Down 21 29
## Up 22 32
knn.accuracy <- mean(knn.pred == Direction_test)
knn.accuracy
## [1] 0.5096154
(h) Repeat (d) using naive Bayes.
nb.fit <- naiveBayes(Direction ~ Lag2, data = Weekly, subset = train)
nb.fit
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Down Up
## 0.4477157 0.5522843
##
## Conditional probabilities:
## Lag2
## Y [,1] [,2]
## Down -0.03568254 2.199504
## Up 0.26036581 2.317485
nb.pred <- predict(nb.fit, Weekly_test)
table(nb.pred, Direction_test)
## Direction_test
## nb.pred Down Up
## Down 0 0
## Up 43 61
nb.accuracy <- mean(nb.pred == Direction_test)
nb.accuracy
## [1] 0.5865385
(i) Which of these methods appears to provide the best results on this data?
Below I have created some visuals to easily compare the results from each model’s confusion matrix, along with comparing accuracy. Overall, logistic regression and LDA are the best methods of choice. Each offer the highest test accuracy at 62.5% and a decent balance of true positive and true negative predictions. The Naive Bayes and QDA models had 0 true negatives, which is not a preferred outcome. As for KNN, it had the lowest accuracy at approximately 50.96% and 33 false negatives, the most.
kable(conf_summary, caption = "Confusion Matrix Summary for All Models")
Method | TP | TN | FP | FN |
---|---|---|---|---|
Logistic Regression | 56 | 9 | 34 | 5 |
LDA | 56 | 9 | 34 | 5 |
QDA | 61 | 0 | 43 | 4 |
KNN (K=1) | 32 | 21 | 30 | 33 |
Naive Bayes | 61 | 0 | 43 | 4 |
accuracy_values <- c(glm.accuracy, lda.accuracy, qda.accuracy, knn.accuracy, nb.accuracy)
model_names <- c("Logistic Regression", "LDA", "QDA", "KNN", "Naive Bayes")
barplot(accuracy_values,
names.arg = model_names,
col = "lightgreen",
ylim = c(0, 0.7),
main = "Comparison of Model Accuracy",
ylab = "Accuracy",
las = 2,
cex.names = 0.7)
(j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for \(K\) in the KNN classifier.
First, I’ll start with some transformations such as quadratic and
logarithmic. Next, I will test some interactions between
Lag2
and Volume
, and Lag1
and
Lag2
. Each method used in the previous problems will be
used again. Findings are included after model testing.
Quadratic Transformation Lag2
:
glm.quad <- glm(Direction ~ Lag2 + I(Lag2^2), data = Weekly, subset = train, family = binomial)
glm.prob.quad <- predict(glm.quad, Weekly_test, type = "response")
glm.pred.quad <- ifelse(glm.prob.quad > 0.5, "Up", "Down")
table(glm.pred.quad, Direction_test)
## Direction_test
## glm.pred.quad Down Up
## Down 8 4
## Up 35 57
mean(glm.pred.quad == Direction_test)
## [1] 0.625
Logarithmic Transformation Lag2
:
glm.log <- glm(Direction ~ Lag2 + log(Lag2 + 20), data = Weekly, subset = train, family = binomial)
glm.prob.log <- predict(glm.log, Weekly_test, type = "response")
glm.pred.log <- ifelse(glm.prob.log > 0.5, "Up", "Down")
table(glm.pred.log, Direction_test)
## Direction_test
## glm.pred.log Down Up
## Down 9 5
## Up 34 56
mean(glm.pred.log == Direction_test)
## [1] 0.625
Interaction Lag2
& Volume
:
glm.inter3 <- glm(Direction ~ Lag2 + Lag2:Volume, data = Weekly, subset = train, family = binomial)
glm.prob.inter3 <- predict(glm.inter3, Weekly_test, type = "response")
glm.pred.inter3 <- ifelse(glm.prob.inter3 > 0.5, "Up", "Down")
table(glm.pred.inter3, Direction_test)
## Direction_test
## glm.pred.inter3 Down Up
## Down 9 5
## Up 34 56
mean(glm.pred.inter3 == Direction_test)
## [1] 0.625
Quadratic Transformation Lag2
:
lda.quad <- lda(Direction ~ Lag2 + I(Lag2^2), data = Weekly, subset = train)
lda.pred.quad <- predict(lda.quad, Weekly_test)
lda.quad.class <- lda.pred.quad$class
table(lda.quad.class, Direction_test)
## Direction_test
## lda.quad.class Down Up
## Down 7 4
## Up 36 57
mean(lda.quad.class == Direction_test)
## [1] 0.6153846
Logarithmic Transformation Lag2
:
lda.log <- lda(Direction ~ Lag2 + log(Lag2 + 20), data = Weekly, subset = train)
lda.pred.log <- predict(lda.log, Weekly_test)
lda.log.class <- lda.pred.log$class
table(lda.log.class, Direction_test)
## Direction_test
## lda.log.class Down Up
## Down 9 5
## Up 34 56
mean(lda.log.class == Direction_test)
## [1] 0.625
Interaction Lag2
& Volume
:
lda.inter1 <- lda(Direction ~ Lag2 + Lag2:Volume, data = Weekly, subset = train)
lda.pred.inter1 <- predict(lda.inter1, Weekly_test)
lda.inter1.class <- lda.pred.inter1$class
table(lda.inter1.class, Direction_test)
## Direction_test
## lda.inter1.class Down Up
## Down 9 5
## Up 34 56
mean(lda.inter1.class == Direction_test)
## [1] 0.625
Quadratic Transformation Lag2
:
qda.quad = qda(Direction ~ Lag2 + I(Lag2^2), data = Weekly, subset = train)
qda.pred.quad = predict(qda.quad, Weekly_test)
qda.quad.class = qda.pred.quad$class
table(qda.quad.class, Direction_test)
## Direction_test
## qda.quad.class Down Up
## Down 7 3
## Up 36 58
mean(qda.quad.class == Direction_test)
## [1] 0.625
Logarithmic Transformation Lag2
:
qda.log <- qda(Direction ~ Lag2 + log(Lag2 + 20), data = Weekly, subset = train)
qda.pred.log <- predict(qda.log, Weekly_test)
qda.log.class <- qda.pred.log$class
table(qda.log.class, Direction_test)
## Direction_test
## qda.log.class Down Up
## Down 2 2
## Up 41 59
mean(qda.log.class == Direction_test)
## [1] 0.5865385
Interaction Lag2
& Volume
:
qda.inter1 <- qda(Direction ~ Lag2 + Lag2:Volume, data = Weekly, subset = train)
qda.pred.inter1 <- predict(qda.inter1, Weekly_test)
qda.inter1.class <- qda.pred.inter1$class
table(qda.inter1.class, Direction_test)
## Direction_test
## qda.inter1.class Down Up
## Down 25 31
## Up 18 30
mean(qda.inter1.class == Direction_test)
## [1] 0.5288462
Quadratic Transformation Lag2
:
nb.quad <- naiveBayes(Direction ~ Lag2 + I(Lag2^2), data = Weekly, subset = train)
nb.pred.quad <- predict(nb.quad, Weekly_test)
table(nb.pred.quad, Direction_test)
## Direction_test
## nb.pred.quad Down Up
## Down 0 0
## Up 43 61
mean(nb.pred.quad == Direction_test)
## [1] 0.5865385
Logarithmic Transformation Lag2
:
nb.log <- naiveBayes(Direction ~ Lag2 + log(Lag2 + 20), data = Weekly, subset = train)
nb.pred.log <- predict(nb.log, Weekly_test)
table(nb.pred.log, Direction_test)
## Direction_test
## nb.pred.log Down Up
## Down 0 0
## Up 43 61
mean(nb.pred.log == Direction_test)
## [1] 0.5865385
Interaction Lag2
& Volume
:
Weekly$Lag2_Volume = Weekly$Lag2 * Weekly$Volume
nb.inter1 <- naiveBayes(Direction ~ Lag2 + Volume + Lag2_Volume, data = Weekly, subset = train)
nb.pred.inter1 <- predict(nb.inter1, Weekly_test)
table(nb.pred.inter1, Direction_test)
## Direction_test
## nb.pred.inter1 Down Up
## Down 43 57
## Up 0 4
mean(nb.pred.inter1 == Direction_test)
## [1] 0.4519231
Interaction Lag1
& Lag2
:
Weekly$Lag1_Lag2 = Weekly$Lag1 * Weekly$Lag2
nb.inter2 <- naiveBayes(Direction ~ Lag1 + Lag2 + Lag1_Lag2, data = Weekly, subset = train)
nb.pred.inter2 <- predict(nb.inter2, Weekly_test)
table(nb.pred.inter2, Direction_test)
## Direction_test
## nb.pred.inter2 Down Up
## Down 3 8
## Up 40 53
mean(nb.pred.inter2 == Direction_test)
## [1] 0.5384615
K = 3
:
set.seed(123)
knn.pred3 <- knn(train_X, test_X, Direction_train, k = 3)
table(knn.pred3, Direction_test)
## Direction_test
## knn.pred3 Down Up
## Down 16 20
## Up 27 41
mean(knn.pred3 == Direction_test)
## [1] 0.5480769
K = 10
:
set.seed(123)
knn.pred10 <- knn(train_X, test_X, Direction_train, k = 10)
table(knn.pred10, Direction_test)
## Direction_test
## knn.pred10 Down Up
## Down 20 19
## Up 23 42
mean(knn.pred10 == Direction_test)
## [1] 0.5961538
Quadratic Transformation Lag2
with
K = 10
:
train_X <- cbind(Lag2, Lag2^2)[train, ]
test_X <- cbind(Lag2, Lag2^2)[!train, ]
set.seed(123)
knn.pred.quad <- knn(train_X, test_X, Direction_train, k = 10)
table(knn.pred.quad, Direction_test)
## Direction_test
## knn.pred.quad Down Up
## Down 19 20
## Up 24 41
mean(knn.pred.quad == Direction_test)
## [1] 0.5769231
Logarithmic Transformation Lag2
with
K = 10
:
train_X <- cbind(Lag2, log(Lag2 + 20))[train, ]
test_X <- cbind(Lag2, log(Lag2 + 20))[!train, ]
set.seed(123)
knn.pred.log <- knn(train_X, test_X, Direction_train, k = 10)
table(knn.pred.log, Direction_test)
## Direction_test
## knn.pred.log Down Up
## Down 20 19
## Up 23 42
mean(knn.pred.log == Direction_test)
## [1] 0.5961538
Interaction Lag2
& Volume
with
K = 10
:
train_X <- cbind(Lag2, Lag2:Volume)[train, ]
test_X <- cbind(Lag2, Lag2:Volume)[!train, ]
set.seed(123)
knn.pred.inter1 <- knn(train_X, test_X, Direction_train, k = 10)
table(knn.pred.inter1, Direction_test)
## Direction_test
## knn.pred.inter1 Down Up
## Down 17 26
## Up 26 35
mean(knn.pred.inter1 == Direction_test)
## [1] 0.5
Several models achieved the highest accuracy of 62.5% on the 2009–2010 test data, including:
Logistic regression with a quadratic transformation of
Lag2
: Lag2 + I(Lag2^2)
Logistic regression with a logarithmic transformation of
Lag2
: Lag2 + log(Lag2 + 20)
Logistic regression with an added interaction of
Lag2
with Volume
:
Lag2 + Lag2:Volume
LDA with a quadratic transformation of Lag2
:
Lag2 + log(Lag2 + 20)
LDA with an added interaction of Lag2
with
LAg2
:Lag2 + Lag1:Lag2
QDA with a quadratic transformation of Lag2
:
Lag2 + I(Lag2^2)
Three of these models provided the same exact results: 62.5% accuracy, 56 true positives, 34 false positives, 9 true negatives, and 5 false negatives. These would be the logarithmic transformations from the logistic regression and LDA, and the added interaction of Lag1 and Lag2 from the LDA. Models from Naive Bayes and KNN were dismissed. They under performed drastically compared to the top models in terms of accuracy and ability to predict “Down” weeks. These models have the best balance of predictions and highest accuracy score, making either of these the best choice.
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
library(ISLR)
attach(Auto)
(a) Create a binary variable, mpg01
, that
contains a 1 if mpg
contains a value above its median, and
a 0 if mpg
contains a value below its median. You can
compute the median using the median()
function. Note you may find it helpful to use the data.frame() function to create a single
data set containing both mpg01
and the other Auto variables.
mpg01 <- ifelse(mpg > median(mpg), 1, 0)
Auto <- data.frame(Auto, mpg01)
(b) Explore the data graphically in order to investigate the
association between mpg01
and the other features. Which of
the other features seem most likely to be useful in predicting
mpg01
? Scatterplots and boxplots may be useful tools to
answer this question. Describe your findings.
Below I have scatterplots of all predictors in the auto set.
Additionally, I have color coded mpg01 = 1
as red and
mpg01 = 0
as black.
pairs(Auto[, c("mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "origin", "year")], col = mpg01 + 1)
The boxplot of mpg01
and weight
highlights
a strong distinction between fuel efficiency categories. Cars classified
as mpg01 = 1
consistently have a lower weight compared to
those withmpg01 = 0
. The corresponding scatterplot
reinforces this findingr. All red points cluster at lower weights an
higher mpg values, and all black points are concentrated at higher
weights and lower mpg. These patterns establish weight
as a
key predictor of mpg01
. Similarly, the boxplot and
scatterplot for displacement
reveal that cars with a higher
mpg tend to have smaller engine displacement, whereas cars with low mpg
have larger engine displacement. This negative relationship confirms
that displacement
is also a key predictor in in
distinguishing mpg classification. The boxplot for
cylinders
shows that cars with high mpg typically have
fewer cylinders. Vehicles coded mpg01 = 1
are clustered
around 4 cylinders, while low mpg cars are more likely to have 6 or 8
cylinders. This makes cylinder count another strong predictor for fuel
efficiency classification. The boxplot for origin
shows
that cars with a higher mpg are commonly associated with origins 2 and
3, while cars with low mpg are mostly from origin 1. This suggests a
regional influence on fuel efficiency. Certain regions produce more fuel
efficient vehicles than other regions, making origin
a key
predictor for mpg01
. Looking at the scatterplot
foryear
, high mpg cars are more prevalent in later models,
whereas low mpg cars appear more frequently in the earlier years. This
trend implies that the year of manufacture plays a role in classifying a
vehicle’s fuel efficiency. Finally, the scatterplot for
horsepower
illustrates a relationship similar to that of
weight
. Cars with high mpg are concentrated at lower
horsepower, while lower mpg cars are congregated at higher horsepower.
This emphasizes horsepower
as another key predictor.
par(mfrow = c(2, 3))
boxplot(weight ~ mpg01, data = Auto, main = "Weight vs mpg01")
boxplot(displacement ~ mpg01, data = Auto, main = "Displacement vs mpg01")
boxplot(origin ~ mpg01, data = Auto, main = "Origin vs mpg01")
boxplot(horsepower ~ mpg01, data = Auto, main = "Horsepower vs mpg01")
boxplot(cylinders ~ mpg01, data = Auto, main = "Cylinders vs mpg01")
boxplot(year ~ mpg01, data = Auto, main = "Year vs mpg01")
par(mfrow = c(2, 2))
plot(weight, mpg, col = mpg01 + 1, pch = 19, main = "Weight vs mpg")
plot(displacement, mpg, col = mpg01 + 1, pch = 19, main = "Displacement vs mpg")
plot(year, mpg, col = mpg01 + 1, pch = 19, main = "Year vs mpg")
plot(horsepower, mpg, col = mpg01 + 1, pch = 19, main = "Horsepower vs mpg")
(c) Split the data into a training set and a test set.
I have randomly split the Auto dataset into 70% training and 30% testing sets below.
set.seed(123)
train <- sample(1:nrow(Auto), size = 0.7 * nrow(Auto))
Auto_train <- Auto[train, ]
Auto_test <- Auto[-train, ]
(d) Perform LDA on the training data in order to predict
mpg01
using the variables that seemed most associated with
mpg01
in (b). What is the test error of the model
obtained?
I fit a LDA model using weight
,
displacement
, horsepower
,
cylinders
, year
, and origin
as
predictors. The resulting test error rate was 11.01695%.
lda.fit <- lda(mpg01 ~ weight + displacement + horsepower + cylinders + year + origin, data = Auto_train)
lda.pred <- predict(lda.fit, Auto_test)
lda.class <- lda.pred$class
table(lda.class, Auto_test$mpg01)
##
## lda.class 0 1
## 0 50 3
## 1 10 55
mean(lda.class != Auto_test$mpg01)
## [1] 0.1101695
(e) Perform QDA on the training data in order to predict
mpg01
using the variables that seemed most associated with
mpg01
in (b). What is the test error of the model
obtained?
I fit a QDA model using weight
,
displacement
, horsepower
,
cylinders
, year
, and origin
as
predictors. The resulting test error rate was 9.322034%.
qda.fit <- qda(mpg01 ~ weight + displacement + horsepower + cylinders + year + origin, data = Auto_train)
qda.pred <- predict(qda.fit, Auto_test)
qda.class <- qda.pred$class
table(qda.class, Auto_test$mpg01)
##
## qda.class 0 1
## 0 53 4
## 1 7 54
mean(qda.class != Auto_test$mpg01)
## [1] 0.09322034
(f) Perform logistic regression on the training data in order
to predict mpg01
using the variables that seemed most
associated with mpg01
in (b). What is the test error of the
model obtained?
I fit a logistic regression model using weight
,
displacement
, horsepower
,
cylinders
, year
, and origin
as
predictors. The resulting test error rate was 9.322034%, the same as the
LDA fit.
glm.fit <- glm(mpg01 ~ weight + displacement + horsepower + cylinders + year + origin, data = Auto_train, family = binomial)
glm.prob <- predict(glm.fit, Auto_test, type = "response")
glm.pred <- ifelse(glm.prob > 0.5, 1, 0)
table(glm.pred, Auto_test$mpg01)
##
## glm.pred 0 1
## 0 56 7
## 1 4 51
mean(glm.pred != Auto_test$mpg01)
## [1] 0.09322034
(g) Perform naive Bayes on the training data in order to
predict mpg01
using the variables that seemed most
associated with mpg01
in (b). What is the test error of the
model obtained?
I fit a Naive Bayes model using weight
,
displacement
, horsepower
,
cylinders
, year
, and origin
as
predictors. The resulting test error rate was 9.322034%, the same as the
LDA and logistic regression fit.
nb.fit <- naiveBayes(as.factor(mpg01) ~ weight + displacement + horsepower + cylinders + year + origin, data = Auto_train)
nb.pred <- predict(nb.fit, Auto_test)
table(nb.pred, Auto_test$mpg01)
##
## nb.pred 0 1
## 0 51 2
## 1 9 56
mean(nb.pred != Auto_test$mpg01)
## [1] 0.09322034
(h) Perform KNN on the training data, with several values of
\(K\), in order to predict
mpg01.
Use only the variables that seemed most associated
with mpg01
in (b). What test errors do you obtain? Which
value of \(K\) seems to perform the
best on this data set?
I tested KNN with K = 1, 3, 5, and 7. The error rates are as follows:
K = 1
: 16.95%K = 3
: 13.56%K = 5
: 11.02%K = 7
: 10.17%As K increased, the test error rate decreased, leaving
K = 7
as the best performing model on the Auto dataset.
train_X <- cbind(weight, displacement, horsepower, cylinders, year, origin)[train, ]
test_X <- cbind(weight, displacement, horsepower, cylinders, year, origin)[-train, ]
train_Y <- mpg01[train]
set.seed(123)
knn.pred1 <- knn(train_X, test_X, train_Y, k = 1)
table(knn.pred1, Auto_test$mpg01)
##
## knn.pred1 0 1
## 0 51 11
## 1 9 47
mean(knn.pred1 != Auto_test$mpg01)
## [1] 0.1694915
set.seed(123)
knn.pred3 <- knn(train_X, test_X, train_Y, k = 3)
table(knn.pred3, Auto_test$mpg01)
##
## knn.pred3 0 1
## 0 52 8
## 1 8 50
mean(knn.pred3 != Auto_test$mpg01)
## [1] 0.1355932
set.seed(123)
knn.pred5 <- knn(train_X, test_X, train_Y, k = 5)
table(knn.pred5, Auto_test$mpg01)
##
## knn.pred5 0 1
## 0 52 5
## 1 8 53
mean(knn.pred5 != Auto_test$mpg01)
## [1] 0.1101695
set.seed(123)
knn.pred7 <- knn(train_X, test_X, train_Y, k = 7)
table(knn.pred7, Auto_test$mpg01)
##
## knn.pred7 0 1
## 0 53 5
## 1 7 53
mean(knn.pred7 != Auto_test$mpg01)
## [1] 0.1016949
Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, Naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.
Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.
First, I will create a new target variable crime01
where
1 equals high crime and 0 equals low crime. I will then split the data
into training and testing sets. Predictors rm
,
dis
, nox
, rad
, tax
,
lstat
will be used in the classification models.
attach(Boston)
crime01 <- median(Boston$crim)
Boston$crim01 <- ifelse(Boston$crim > crime01, 1, 0)
set.seed(123)
train_X <- sample(1:nrow(Boston), nrow(Boston) * 0.7)
Boston_train <- Boston[train_X, ]
Boston_test <- Boston[-train_X, ]
The logistic regression model achieved a 10.52632% error rate and correctly identified 69 tracts with high crime and 67 tracts with low crime. It made 8 false positive predictions and 8 false negative predictions.
glm.fit <- glm(crim01 ~ rm + dis + nox + rad + tax + lstat, data = Boston_train, family = binomial)
glm.probs <- predict(glm.fit, Boston_test, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
table(glm.pred, Boston_test$crim01)
##
## glm.pred 0 1
## 0 67 8
## 1 8 69
mean(glm.pred != Boston_test$crim01)
## [1] 0.1052632
The LDA model performed slightly worse than logistic regression with a 12.5% error rate. While it did only have 5 false positives, it with struggled by predicting 14 false negatives. This model missed high-crime tracts.
lda.fit <- lda(crim01 ~ rm + dis + nox + rad + tax + lstat, data = Boston_train)
lda.pred <- predict(lda.fit, Boston_test)$class
table(lda.pred, Boston_test$crim01)
##
## lda.pred 0 1
## 0 70 14
## 1 5 63
mean(lda.pred != Boston_test$crim01)
## [1] 0.125
The Naive Bayes model performed the worst with an error rate of 13.81579%. It predicted 13 false negatives, the worst of all models, and 8 false positives.
nb.fit <- naiveBayes(as.factor(crim01) ~ rm + dis + nox + rad + tax + lstat, data = Boston_train)
nb.pred <- predict(nb.fit, Boston_test)
table(nb.pred, Boston_test$crim01)
##
## nb.pred 0 1
## 0 67 13
## 1 8 64
mean(nb.pred != Boston_test$crim01)
## [1] 0.1381579
The KNN model with K = 1
performed better than the above
models with a 5.263158% error rate. It predicted only 3 false positives
and 5 false negatives. This is a big improvement for both high and low
crime tracts compared to other models. K = 3
provided the
lowest error rate of 3.947368%. It achieved to predict 74 true positives
and 74 true negatives, and only 3 false positives and 3 false negatives.
This is by far the most accurate and balanced model in the comparison.
Finally, we have K = 5
. This model had the same error rate
as K = 1
, which was 5.263158%. However, it’s distribution
of predictions is slightly different with 4 false positives and 4 false
negatives.
train_X <- cbind(Boston_train$rm, Boston_train$dis, Boston_train$nox,
Boston_train$rad, Boston_train$tax, Boston_train$lstat)
test_X <- cbind(Boston_test$rm, Boston_test$dis, Boston_test$nox,
Boston_test$rad, Boston_test$tax, Boston_test$lstat)
train_Y <- Boston_train$crim01
set.seed(123)
knn.pred <- knn(train_X, test_X, train_Y, k = 1)
table(knn.pred, Boston_test$crim01)
##
## knn.pred 0 1
## 0 72 5
## 1 3 72
mean(knn.pred != Boston_test$crim01)
## [1] 0.05263158
set.seed(123)
knn.pred3 <- knn(train_X, test_X, train_Y, k = 3)
table(knn.pred3, Boston_test$crim01)
##
## knn.pred3 0 1
## 0 72 3
## 1 3 74
mean(knn.pred3 != Boston_test$crim01)
## [1] 0.03947368
set.seed(123)
knn.pred5 <- knn(train_X, test_X, train_Y, k = 5)
table(knn.pred5, Boston_test$crim01)
##
## knn.pred5 0 1
## 0 71 4
## 1 4 73
mean(knn.pred5 != Boston_test$crim01)
## [1] 0.05263158
The KNN model with K = 3
is the best model choice. It
had the lowest error rate, a balanced false positive and false negative
count, and the highest true positive count.