Assignment #3

Problem 13

This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

library(knitr)
library(ISLR)
library(MASS)
library(class)
library(e1071)
attach(Weekly)

(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

str(Weekly)

## 'data.frame':    1089 obs. of  9 variables:
##  $ Year     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
##  $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
##  $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
##  $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
##  $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
##  $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
##  $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...

summary(Weekly)

##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
##

To find patters, I used th cor() function. Volume appears to be strongly correlated with Year at 0.84194162.

cor(Weekly[, -9])

##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000

Using the pairs() function, we can see a clear upward trend of Volume over time. The average number of shares traded daily increased from 1990 to 2010.

pairs(Weekly)

plot(Volume)

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

Using the glm() function, I’ve performed a logistic regression to find any significant predictors. Lag2 appears to be statistically significant in predicting the market direction with a p-value of 0.0296.

glm.fit <- glm(Direction~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(glm.fit)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

Below I have predicted the probabilities and converted them into class labels “Up” or “Down”. Then, created a confusion matrix using the table() function. The matrix reveals that the model correctly predicted 557 “Up” weeks (true positives) and 54 “Down” weeks (true negatives). The model predicts “Up” majority of the time, hence the 557 true positives and 430 false negatives. There were some mistakes made as well. The model predicted “Up” 430 times when it was actually “Down”. It also predicted “Down” 48 times when the actual value was “Up”. This model heavily favors predicting “Up”, resulting in many false positives.

glm.probs <- predict(glm.fit, type = "response")
glm.pred <- rep("Down", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred, Weekly$Direction)

##         
## glm.pred Down  Up
##     Down   54  48
##     Up    430 557

The overall fraction of correct predictions is approximately 56.1%

mean(glm.pred == Direction)

## [1] 0.5610652

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

The logistic regression model trained on data from 1990 to 2008 using Lag2 as the only predictor achieved 62.5% accuracy on the 2009 to 2010 test data. It predicted 65 correct “Up” weeks and 9 correct “Down” weeks. There are 34 falsely predicted “Up” values. The confusion matrix revealed that the model is still more successful at identifying “Up” weeks.

train <- (Year < 2009)
Weekly_test <- Weekly[!train, ]
Direction_test <- Direction[!train]

glm.fit.train <- glm(Direction ~ Lag2, data = Weekly, subset = train, family = binomial)
summary(glm.fit.train)

## 
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly, 
##     subset = train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4

glm.prob.test <- predict(glm.fit.train, Weekly_test, type = "response")
glm.pred.test <- rep("Down", length(glm.prob.test))
glm.pred.test[glm.prob.test > 0.5] = "Up"
table(glm.pred.test, Direction_test)

##              Direction_test
## glm.pred.test Down Up
##          Down    9  5
##          Up     34 56

glm.accuracy <- mean(glm.pred.test == Direction_test)
glm.accuracy

## [1] 0.625

(e) Repeat (d) using LDA.

lda.fit <- lda(Direction ~ Lag2, data = Weekly, subset = train)
lda.fit

## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162

lda.pred <- predict(lda.fit, Weekly_test)
lda.class <- lda.pred$class
table(lda.class, Direction_test)

##          Direction_test
## lda.class Down Up
##      Down    9  5
##      Up     34 56

lda.accuracy <- mean(lda.class == Direction_test)
lda.accuracy

## [1] 0.625

(f) Repeat (d) using QDA.

qda.fit <- qda(Direction ~ Lag2, data = Weekly, subset = train)
qda.fit

## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581

qda.pred <- predict(qda.fit, Weekly_test)
qda.class <- qda.pred$class
table(qda.class, Direction_test)

##          Direction_test
## qda.class Down Up
##      Down    0  0
##      Up     43 61

qda.accuracy <- mean(qda.class == Direction_test)
qda.accuracy

## [1] 0.5865385

(g) Repeat (d) using KNN with \(K = 1\).

train_X <- as.matrix(Lag2[train])
test_X <- as.matrix(Lag2[!train])
Direction_train <- Direction[train]
Direction_test <- Direction[!train]

set.seed(123)
knn.pred <- knn(train_X, test_X, Direction_train, k = 1)
table(knn.pred, Direction_test)

##         Direction_test
## knn.pred Down Up
##     Down   21 29
##     Up     22 32

knn.accuracy <- mean(knn.pred == Direction_test)
knn.accuracy

## [1] 0.5096154

(h) Repeat (d) using naive Bayes.

nb.fit <- naiveBayes(Direction ~ Lag2, data = Weekly, subset = train)
nb.fit

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Conditional probabilities:
##       Lag2
## Y             [,1]     [,2]
##   Down -0.03568254 2.199504
##   Up    0.26036581 2.317485

nb.pred <- predict(nb.fit, Weekly_test)
table(nb.pred, Direction_test)

##        Direction_test
## nb.pred Down Up
##    Down    0  0
##    Up     43 61

nb.accuracy <- mean(nb.pred == Direction_test)
nb.accuracy

## [1] 0.5865385

(i) Which of these methods appears to provide the best results on this data?

Below I have created some visuals to easily compare the results from each model’s confusion matrix, along with comparing accuracy. Overall, logistic regression and LDA are the best methods of choice. Each offer the highest test accuracy at 62.5% and a decent balance of true positive and true negative predictions. The Naive Bayes and QDA models had 0 true negatives, which is not a preferred outcome. As for KNN, it had the lowest accuracy at approximately 50.96% and 33 false negatives, the most.

kable(conf_summary, caption = "Confusion Matrix Summary for All Models")

Confusion Matrix Summary for All Models
Method	TP	TN	FP	FN
Logistic Regression	56	9	34	5
LDA	56	9	34	5
QDA	61	0	43	4
KNN (K=1)	32	21	30	33
Naive Bayes	61	0	43	4

accuracy_values <- c(glm.accuracy, lda.accuracy, qda.accuracy, knn.accuracy, nb.accuracy)
model_names <- c("Logistic Regression", "LDA", "QDA", "KNN", "Naive Bayes")

barplot(accuracy_values,
        names.arg = model_names,
        col = "lightgreen",
        ylim = c(0, 0.7),
        main = "Comparison of Model Accuracy",
        ylab = "Accuracy",
        las = 2,
        cex.names = 0.7)

(j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for \(K\) in the KNN classifier.

First, I’ll start with some transformations such as quadratic and logarithmic. Next, I will test some interactions between Lag2 and Volume, and Lag1 and Lag2. Each method used in the previous problems will be used again. Findings are included after model testing.

Logistic Regression

Quadratic Transformation Lag2:

glm.quad <- glm(Direction ~ Lag2 + I(Lag2^2), data = Weekly, subset = train, family = binomial)
glm.prob.quad <- predict(glm.quad, Weekly_test, type = "response")
glm.pred.quad <- ifelse(glm.prob.quad > 0.5, "Up", "Down")
table(glm.pred.quad, Direction_test)

##              Direction_test
## glm.pred.quad Down Up
##          Down    8  4
##          Up     35 57

mean(glm.pred.quad == Direction_test)

## [1] 0.625

Logarithmic Transformation Lag2:

glm.log <- glm(Direction ~ Lag2 + log(Lag2 + 20), data = Weekly, subset = train, family = binomial)
glm.prob.log <- predict(glm.log, Weekly_test, type = "response")
glm.pred.log <- ifelse(glm.prob.log > 0.5, "Up", "Down")
table(glm.pred.log, Direction_test)

##             Direction_test
## glm.pred.log Down Up
##         Down    9  5
##         Up     34 56

mean(glm.pred.log == Direction_test)

## [1] 0.625

Interaction Lag2 & Volume:

glm.inter3 <- glm(Direction ~ Lag2 + Lag2:Volume, data = Weekly, subset = train, family = binomial)
glm.prob.inter3 <- predict(glm.inter3, Weekly_test, type = "response")
glm.pred.inter3 <- ifelse(glm.prob.inter3 > 0.5, "Up", "Down")
table(glm.pred.inter3, Direction_test)

##                Direction_test
## glm.pred.inter3 Down Up
##            Down    9  5
##            Up     34 56

mean(glm.pred.inter3 == Direction_test)

## [1] 0.625

LDA

Quadratic Transformation Lag2:

lda.quad <- lda(Direction ~ Lag2 + I(Lag2^2), data = Weekly, subset = train)
lda.pred.quad <- predict(lda.quad, Weekly_test)
lda.quad.class <- lda.pred.quad$class
table(lda.quad.class, Direction_test)

##               Direction_test
## lda.quad.class Down Up
##           Down    7  4
##           Up     36 57

mean(lda.quad.class == Direction_test)

## [1] 0.6153846

Logarithmic Transformation Lag2:

lda.log <- lda(Direction ~ Lag2 + log(Lag2 + 20), data = Weekly, subset = train)
lda.pred.log <- predict(lda.log, Weekly_test)
lda.log.class <- lda.pred.log$class
table(lda.log.class, Direction_test)

##              Direction_test
## lda.log.class Down Up
##          Down    9  5
##          Up     34 56

mean(lda.log.class == Direction_test)

## [1] 0.625

Interaction Lag2 & Volume:

lda.inter1 <- lda(Direction ~ Lag2 + Lag2:Volume, data = Weekly, subset = train)
lda.pred.inter1 <- predict(lda.inter1, Weekly_test)
lda.inter1.class <- lda.pred.inter1$class
table(lda.inter1.class, Direction_test)

##                 Direction_test
## lda.inter1.class Down Up
##             Down    9  5
##             Up     34 56

mean(lda.inter1.class == Direction_test)

## [1] 0.625

QDA

Quadratic Transformation Lag2:

qda.quad = qda(Direction ~ Lag2 + I(Lag2^2), data = Weekly, subset = train)
qda.pred.quad = predict(qda.quad, Weekly_test)
qda.quad.class = qda.pred.quad$class
table(qda.quad.class, Direction_test)

##               Direction_test
## qda.quad.class Down Up
##           Down    7  3
##           Up     36 58

mean(qda.quad.class == Direction_test)

## [1] 0.625

Logarithmic Transformation Lag2:

qda.log <- qda(Direction ~ Lag2 + log(Lag2 + 20), data = Weekly, subset = train)
qda.pred.log <- predict(qda.log, Weekly_test)
qda.log.class <- qda.pred.log$class
table(qda.log.class, Direction_test)

##              Direction_test
## qda.log.class Down Up
##          Down    2  2
##          Up     41 59

mean(qda.log.class == Direction_test)

## [1] 0.5865385

Interaction Lag2 & Volume:

qda.inter1 <- qda(Direction ~ Lag2 + Lag2:Volume, data = Weekly, subset = train)
qda.pred.inter1 <- predict(qda.inter1, Weekly_test)
qda.inter1.class <- qda.pred.inter1$class
table(qda.inter1.class, Direction_test)

##                 Direction_test
## qda.inter1.class Down Up
##             Down   25 31
##             Up     18 30

mean(qda.inter1.class == Direction_test)

## [1] 0.5288462

Naive Bayes

Quadratic Transformation Lag2:

nb.quad <- naiveBayes(Direction ~ Lag2 + I(Lag2^2), data = Weekly, subset = train)
nb.pred.quad <- predict(nb.quad, Weekly_test)
table(nb.pred.quad, Direction_test)

##             Direction_test
## nb.pred.quad Down Up
##         Down    0  0
##         Up     43 61

mean(nb.pred.quad == Direction_test)

## [1] 0.5865385

Logarithmic Transformation Lag2:

nb.log <- naiveBayes(Direction ~ Lag2 + log(Lag2 + 20), data = Weekly, subset = train)
nb.pred.log <- predict(nb.log, Weekly_test)
table(nb.pred.log, Direction_test)

##            Direction_test
## nb.pred.log Down Up
##        Down    0  0
##        Up     43 61

mean(nb.pred.log == Direction_test)

## [1] 0.5865385

Interaction Lag2 & Volume:

Weekly$Lag2_Volume = Weekly$Lag2 * Weekly$Volume
nb.inter1 <- naiveBayes(Direction ~ Lag2 + Volume + Lag2_Volume, data = Weekly, subset = train)
nb.pred.inter1 <- predict(nb.inter1, Weekly_test)
table(nb.pred.inter1, Direction_test)

##               Direction_test
## nb.pred.inter1 Down Up
##           Down   43 57
##           Up      0  4

mean(nb.pred.inter1 == Direction_test)

## [1] 0.4519231

Interaction Lag1 & Lag2:

Weekly$Lag1_Lag2 = Weekly$Lag1 * Weekly$Lag2
nb.inter2 <- naiveBayes(Direction ~ Lag1 + Lag2 + Lag1_Lag2, data = Weekly, subset = train)
nb.pred.inter2 <- predict(nb.inter2, Weekly_test)
table(nb.pred.inter2, Direction_test)

##               Direction_test
## nb.pred.inter2 Down Up
##           Down    3  8
##           Up     40 53

mean(nb.pred.inter2 == Direction_test)

## [1] 0.5384615

KNN

K = 3:

set.seed(123)
knn.pred3 <- knn(train_X, test_X, Direction_train, k = 3)
table(knn.pred3, Direction_test)

##          Direction_test
## knn.pred3 Down Up
##      Down   16 20
##      Up     27 41

mean(knn.pred3 == Direction_test)

## [1] 0.5480769

K = 10:

set.seed(123)
knn.pred10 <- knn(train_X, test_X, Direction_train, k = 10)
table(knn.pred10, Direction_test)

##           Direction_test
## knn.pred10 Down Up
##       Down   20 19
##       Up     23 42

mean(knn.pred10 == Direction_test)

## [1] 0.5961538

Quadratic Transformation Lag2 with K = 10:

train_X <- cbind(Lag2, Lag2^2)[train, ]
test_X <- cbind(Lag2, Lag2^2)[!train, ]
set.seed(123)
knn.pred.quad <- knn(train_X, test_X, Direction_train, k = 10)
table(knn.pred.quad, Direction_test)

##              Direction_test
## knn.pred.quad Down Up
##          Down   19 20
##          Up     24 41

mean(knn.pred.quad == Direction_test)

## [1] 0.5769231

Logarithmic Transformation Lag2 with K = 10:

train_X <- cbind(Lag2, log(Lag2 + 20))[train, ]
test_X <- cbind(Lag2, log(Lag2 + 20))[!train, ]
set.seed(123)
knn.pred.log <- knn(train_X, test_X, Direction_train, k = 10)
table(knn.pred.log, Direction_test)

##             Direction_test
## knn.pred.log Down Up
##         Down   20 19
##         Up     23 42

mean(knn.pred.log == Direction_test)

## [1] 0.5961538

Interaction Lag2 & Volume with K = 10:

train_X <- cbind(Lag2, Lag2:Volume)[train, ]
test_X <- cbind(Lag2, Lag2:Volume)[!train, ]
set.seed(123)
knn.pred.inter1 <- knn(train_X, test_X, Direction_train, k = 10)
table(knn.pred.inter1, Direction_test)

##                Direction_test
## knn.pred.inter1 Down Up
##            Down   17 26
##            Up     26 35

mean(knn.pred.inter1 == Direction_test)

## [1] 0.5

Several models achieved the highest accuracy of 62.5% on the 2009–2010 test data, including:

Logistic regression with a quadratic transformation of Lag2: Lag2 + I(Lag2^2)
Logistic regression with a logarithmic transformation of Lag2: Lag2 + log(Lag2 + 20)
Logistic regression with an added interaction of Lag2 with Volume: Lag2 + Lag2:Volume
LDA with a quadratic transformation of Lag2: Lag2 + log(Lag2 + 20)
LDA with an added interaction of Lag2 with LAg2:Lag2 + Lag1:Lag2
QDA with a quadratic transformation of Lag2: Lag2 + I(Lag2^2)

Three of these models provided the same exact results: 62.5% accuracy, 56 true positives, 34 false positives, 9 true negatives, and 5 false negatives. These would be the logarithmic transformations from the logistic regression and LDA, and the added interaction of Lag1 and Lag2 from the LDA. Models from Naive Bayes and KNN were dismissed. They under performed drastically compared to the top models in terms of accuracy and ability to predict “Down” weeks. These models have the best balance of predictions and highest accuracy score, making either of these the best choice.

Problem 14

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

library(ISLR)
attach(Auto)

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

mpg01 <- ifelse(mpg > median(mpg), 1, 0)
Auto <- data.frame(Auto, mpg01)

(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

Below I have scatterplots of all predictors in the auto set. Additionally, I have color coded mpg01 = 1 as red and mpg01 = 0 as black.

pairs(Auto[, c("mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "origin", "year")], col = mpg01 + 1)

The boxplot of mpg01 and weight highlights a strong distinction between fuel efficiency categories. Cars classified as mpg01 = 1 consistently have a lower weight compared to those withmpg01 = 0. The corresponding scatterplot reinforces this findingr. All red points cluster at lower weights an higher mpg values, and all black points are concentrated at higher weights and lower mpg. These patterns establish weight as a key predictor of mpg01. Similarly, the boxplot and scatterplot for displacement reveal that cars with a higher mpg tend to have smaller engine displacement, whereas cars with low mpg have larger engine displacement. This negative relationship confirms that displacement is also a key predictor in in distinguishing mpg classification. The boxplot for cylinders shows that cars with high mpg typically have fewer cylinders. Vehicles coded mpg01 = 1 are clustered around 4 cylinders, while low mpg cars are more likely to have 6 or 8 cylinders. This makes cylinder count another strong predictor for fuel efficiency classification. The boxplot for origin shows that cars with a higher mpg are commonly associated with origins 2 and 3, while cars with low mpg are mostly from origin 1. This suggests a regional influence on fuel efficiency. Certain regions produce more fuel efficient vehicles than other regions, making origin a key predictor for mpg01. Looking at the scatterplot foryear, high mpg cars are more prevalent in later models, whereas low mpg cars appear more frequently in the earlier years. This trend implies that the year of manufacture plays a role in classifying a vehicle’s fuel efficiency. Finally, the scatterplot for horsepower illustrates a relationship similar to that of weight. Cars with high mpg are concentrated at lower horsepower, while lower mpg cars are congregated at higher horsepower. This emphasizes horsepower as another key predictor.

par(mfrow = c(2, 3))
boxplot(weight ~ mpg01, data = Auto, main = "Weight vs mpg01")
boxplot(displacement ~ mpg01, data = Auto, main = "Displacement vs mpg01")
boxplot(origin ~ mpg01, data = Auto, main = "Origin vs mpg01")
boxplot(horsepower ~ mpg01, data = Auto, main = "Horsepower vs mpg01")
boxplot(cylinders ~ mpg01, data = Auto, main = "Cylinders vs mpg01")
boxplot(year ~ mpg01, data = Auto, main = "Year vs mpg01")

par(mfrow = c(2, 2))
plot(weight, mpg, col = mpg01 + 1, pch = 19, main = "Weight vs mpg")
plot(displacement, mpg, col = mpg01 + 1, pch = 19, main = "Displacement vs mpg")
plot(year, mpg, col = mpg01 + 1, pch = 19, main = "Year vs mpg")
plot(horsepower, mpg, col = mpg01 + 1, pch = 19, main = "Horsepower vs mpg")

(c) Split the data into a training set and a test set.

I have randomly split the Auto dataset into 70% training and 30% testing sets below.

set.seed(123)
train <- sample(1:nrow(Auto), size = 0.7 * nrow(Auto))
Auto_train <- Auto[train, ]
Auto_test <- Auto[-train, ]

(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

I fit a LDA model using weight, displacement, horsepower, cylinders, year, and origin as predictors. The resulting test error rate was 11.01695%.

lda.fit <- lda(mpg01 ~ weight + displacement + horsepower + cylinders + year + origin, data = Auto_train)
lda.pred <- predict(lda.fit, Auto_test)
lda.class <- lda.pred$class
table(lda.class, Auto_test$mpg01)

##          
## lda.class  0  1
##         0 50  3
##         1 10 55

mean(lda.class != Auto_test$mpg01)

## [1] 0.1101695

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

I fit a QDA model using weight, displacement, horsepower, cylinders, year, and origin as predictors. The resulting test error rate was 9.322034%.

qda.fit <- qda(mpg01 ~ weight + displacement + horsepower + cylinders + year + origin, data = Auto_train)
qda.pred <- predict(qda.fit, Auto_test)
qda.class <- qda.pred$class
table(qda.class, Auto_test$mpg01)

##          
## qda.class  0  1
##         0 53  4
##         1  7 54

mean(qda.class != Auto_test$mpg01)

## [1] 0.09322034

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

I fit a logistic regression model using weight, displacement, horsepower, cylinders, year, and origin as predictors. The resulting test error rate was 9.322034%, the same as the LDA fit.

glm.fit <- glm(mpg01 ~ weight + displacement + horsepower + cylinders + year + origin, data = Auto_train, family = binomial)
glm.prob <- predict(glm.fit, Auto_test, type = "response")
glm.pred <- ifelse(glm.prob > 0.5, 1, 0)
table(glm.pred, Auto_test$mpg01)

##         
## glm.pred  0  1
##        0 56  7
##        1  4 51

mean(glm.pred != Auto_test$mpg01)

## [1] 0.09322034

(g) Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

I fit a Naive Bayes model using weight, displacement, horsepower, cylinders, year, and origin as predictors. The resulting test error rate was 9.322034%, the same as the LDA and logistic regression fit.

nb.fit <- naiveBayes(as.factor(mpg01) ~ weight + displacement + horsepower + cylinders + year + origin, data = Auto_train)
nb.pred <- predict(nb.fit, Auto_test)
table(nb.pred, Auto_test$mpg01)

##        
## nb.pred  0  1
##       0 51  2
##       1  9 56

mean(nb.pred != Auto_test$mpg01)

## [1] 0.09322034

(h) Perform KNN on the training data, with several values of \(K\), in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of \(K\) seems to perform the best on this data set?

I tested KNN with K = 1, 3, 5, and 7. The error rates are as follows:

K = 1: 16.95%
K = 3: 13.56%
K = 5: 11.02%
K = 7: 10.17%

As K increased, the test error rate decreased, leaving K = 7 as the best performing model on the Auto dataset.

train_X <- cbind(weight, displacement, horsepower, cylinders, year, origin)[train, ]
test_X <- cbind(weight, displacement, horsepower, cylinders, year, origin)[-train, ]
train_Y <- mpg01[train]
set.seed(123)
knn.pred1 <- knn(train_X, test_X, train_Y, k = 1)
table(knn.pred1, Auto_test$mpg01)

##          
## knn.pred1  0  1
##         0 51 11
##         1  9 47

mean(knn.pred1 != Auto_test$mpg01)

## [1] 0.1694915

set.seed(123)
knn.pred3 <- knn(train_X, test_X, train_Y, k = 3)
table(knn.pred3, Auto_test$mpg01)

##          
## knn.pred3  0  1
##         0 52  8
##         1  8 50

mean(knn.pred3 != Auto_test$mpg01)

## [1] 0.1355932

set.seed(123)
knn.pred5 <- knn(train_X, test_X, train_Y, k = 5)
table(knn.pred5, Auto_test$mpg01)

##          
## knn.pred5  0  1
##         0 52  5
##         1  8 53

mean(knn.pred5 != Auto_test$mpg01)

## [1] 0.1101695

set.seed(123)
knn.pred7 <- knn(train_X, test_X, train_Y, k = 7)
table(knn.pred7, Auto_test$mpg01)

##          
## knn.pred7  0  1
##         0 53  5
##         1  7 53

mean(knn.pred7 != Auto_test$mpg01)

## [1] 0.1016949

Problem 16

Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, Naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.

Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.

First, I will create a new target variable crime01 where 1 equals high crime and 0 equals low crime. I will then split the data into training and testing sets. Predictors rm, dis, nox, rad, tax, lstat will be used in the classification models.

attach(Boston)
crime01 <- median(Boston$crim)
Boston$crim01 <- ifelse(Boston$crim > crime01, 1, 0)

set.seed(123)
train_X <- sample(1:nrow(Boston), nrow(Boston) * 0.7)
Boston_train <- Boston[train_X, ]
Boston_test  <- Boston[-train_X, ]

Logistic Regression

The logistic regression model achieved a 10.52632% error rate and correctly identified 69 tracts with high crime and 67 tracts with low crime. It made 8 false positive predictions and 8 false negative predictions.

glm.fit <- glm(crim01 ~ rm + dis + nox + rad + tax + lstat, data = Boston_train, family = binomial)
glm.probs <- predict(glm.fit, Boston_test, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
table(glm.pred, Boston_test$crim01)

##         
## glm.pred  0  1
##        0 67  8
##        1  8 69

mean(glm.pred != Boston_test$crim01)

## [1] 0.1052632

LDA

The LDA model performed slightly worse than logistic regression with a 12.5% error rate. While it did only have 5 false positives, it with struggled by predicting 14 false negatives. This model missed high-crime tracts.

lda.fit <- lda(crim01 ~ rm + dis + nox + rad + tax + lstat, data = Boston_train)
lda.pred <- predict(lda.fit, Boston_test)$class
table(lda.pred, Boston_test$crim01)

##         
## lda.pred  0  1
##        0 70 14
##        1  5 63

mean(lda.pred != Boston_test$crim01)

## [1] 0.125

Naive Bayes

The Naive Bayes model performed the worst with an error rate of 13.81579%. It predicted 13 false negatives, the worst of all models, and 8 false positives.

nb.fit <- naiveBayes(as.factor(crim01) ~ rm + dis + nox + rad + tax + lstat, data = Boston_train)
nb.pred <- predict(nb.fit, Boston_test)
table(nb.pred, Boston_test$crim01)

##        
## nb.pred  0  1
##       0 67 13
##       1  8 64

mean(nb.pred != Boston_test$crim01)

## [1] 0.1381579

KNN

The KNN model with K = 1 performed better than the above models with a 5.263158% error rate. It predicted only 3 false positives and 5 false negatives. This is a big improvement for both high and low crime tracts compared to other models. K = 3 provided the lowest error rate of 3.947368%. It achieved to predict 74 true positives and 74 true negatives, and only 3 false positives and 3 false negatives. This is by far the most accurate and balanced model in the comparison. Finally, we have K = 5. This model had the same error rate as K = 1, which was 5.263158%. However, it’s distribution of predictions is slightly different with 4 false positives and 4 false negatives.

train_X <- cbind(Boston_train$rm, Boston_train$dis, Boston_train$nox,
                 Boston_train$rad, Boston_train$tax, Boston_train$lstat)
test_X <- cbind(Boston_test$rm, Boston_test$dis, Boston_test$nox,
                Boston_test$rad, Boston_test$tax, Boston_test$lstat)
train_Y <- Boston_train$crim01
set.seed(123)
knn.pred <- knn(train_X, test_X, train_Y, k = 1)
table(knn.pred, Boston_test$crim01)

##         
## knn.pred  0  1
##        0 72  5
##        1  3 72

mean(knn.pred != Boston_test$crim01)

## [1] 0.05263158

set.seed(123)
knn.pred3 <- knn(train_X, test_X, train_Y, k = 3)
table(knn.pred3, Boston_test$crim01)

##          
## knn.pred3  0  1
##         0 72  3
##         1  3 74

mean(knn.pred3 != Boston_test$crim01)

## [1] 0.03947368

set.seed(123)
knn.pred5 <- knn(train_X, test_X, train_Y, k = 5)
table(knn.pred5, Boston_test$crim01)

##          
## knn.pred5  0  1
##         0 71  4
##         1  4 73

mean(knn.pred5 != Boston_test$crim01)

## [1] 0.05263158

The KNN model with K = 3 is the best model choice. It had the lowest error rate, a balanced false positive and false negative count, and the highest true positive count.

Assignment #3

Dominica Peri

2025-06-12

Problem 13

Logistic Regression

LDA

QDA

Naive Bayes

KNN

Problem 14

Problem 16

Logistic Regression

LDA

Naive Bayes

KNN