This question should be answered using the Weekly data set, which […] contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
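Throughout, the code assumes a set-up chunk along these lines (not shown in the original); note that black appears as a Boston predictor later on, so the MASS version of that data set is the one in use here:
library(ISLR)     # Weekly and Auto data sets
library(MASS)     # lda(), qda(), and the Boston data set
library(class)    # knn()
library(e1071)    # naiveBayes()
library(corrplot) # corrplot()
library(dplyr)    # mutate() and %>%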
corrplot(cor(Weekly[,-9]), method = "number")
From the correlation plot, we observe that Year and
Volume are highly correlated, while none of the remaining
variables show a significant linear relationship with one another.
(b) Use the full data set to perform a logistic
regression with Direction as the response and the five lag
variables plus Volume as predictors. Use the
summary() function to print the results. Do any of the
predictors appear to be statistically significant? If so, which
ones?
glm1.fit = glm(Direction ~ Lag1+Lag2+Lag3+Lag4+Lag5+Volume, Weekly, family = binomial)
summary(glm1.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
We observe from the output above that Lag2 is the only
statistically significant predictor of Direction using a
significance level of \(\alpha=.05\).
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
glm1.probs = predict(glm1.fit, type = "response")
glm1.pred = rep("Down", length(glm1.probs))
glm1.pred[glm1.probs >.5] = "Up"
table(glm1.pred, Weekly$Direction)
##
## glm1.pred Down Up
## Down 54 48
## Up 430 557
By taking the correct predictions from the confusion matrix and dividing by the total number of observations, we get an overall accuracy of 56.1%, as shown below. This suggests that, overall, the model predicts only slightly better than random chance.
(54+557)/1089
## [1] 0.5610652
When looking at the model's ability to predict Up and Down respectively, we notice a significant difference between the two. Below are the calculations for each group, showing that the model correctly predicted Up 92.1% of the time, while correctly predicting Down just 11.2% of the time.
557/(557+48) # Up
## [1] 0.9206612
54/(54+430) # Down
## [1] 0.1115702
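These per-class rates can also be computed programmatically. Below is a small helper of my own (not part of the original solution); it assumes the table's rows are predictions and its columns are the true classes, with Up treated as the positive class.
# Overall accuracy plus per-class recall from a 2x2 confusion table
conf_stats = function(tab) {
  c(accuracy = sum(diag(tab)) / sum(tab),
    up_recall = tab["Up", "Up"] / sum(tab[, "Up"]),
    down_recall = tab["Down", "Down"] / sum(tab[, "Down"]))
}
conf_stats(table(glm1.pred, Weekly$Direction))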
(d) Now fit the logistic regression model using a
training data period from 1990 to 2008, with Lag2 as the
only predictor. Compute the confusion matrix and the overall fraction of
correct predictions for the held out data (that is, the data from 2009
and 2010).
w_train = (Weekly$Year < 2009)
w_test = Weekly[!w_train,]
glm2.fit = glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = w_train)
glm2.prob = predict(glm2.fit, w_test, type = "response")
glm2.pred = rep("Down", length(glm2.prob))
glm2.pred[glm2.prob > 0.5] = "Up"
Direction.0910 = Weekly$Direction[!w_train]
table(glm2.pred, Direction.0910)
## Direction.0910
## glm2.pred Down Up
## Down 9 5
## Up 34 56
(9+56)/(9+5+34+56) # Overall
## [1] 0.625
56/(56+5) # Up
## [1] 0.9180328
9/(9+34) # Down
## [1] 0.2093023
In comparison with the model from (b), this model using
Lag2 as the only predictor and a train/test split
improved the overall accuracy from 56.1% to 62.5% (note that the
56.1% figure was computed on the training data, whereas 62.5% is
held-out accuracy). Likewise, the model's ability to correctly predict
Down improved substantially, while its accuracy on Up decreased slightly.
(e) Repeat (d) using LDA.
lda1.fit = lda(Direction ~ Lag2, data = Weekly, subset = w_train)
lda1.pred = predict(lda1.fit, w_test)
table(lda1.pred$class, Direction.0910)
## Direction.0910
## Down Up
## Down 9 5
## Up 34 56
acc_lda <- mean(lda1.pred$class == Direction.0910)
cat("LDA Model Accuracy:", acc_lda, "\n")
## LDA Model Accuracy: 0.625
The LDA model produces identical results to the logistic regression from (d).
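Identical confusion matrices do not by themselves guarantee identical predictions, so it is worth verifying the agreement observation by observation (my addition, reusing the objects fitted above):
# TRUE if logistic regression and LDA classify every held-out week the same way
all(glm2.pred == as.character(lda1.pred$class))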
(f) Repeat (d) using QDA.
qda1.fit = qda(Direction ~ Lag2, data = Weekly, subset = w_train)
qda1.pred = predict(qda1.fit, w_test)
table(qda1.pred$class, Direction.0910)
## Direction.0910
## Down Up
## Down 0 0
## Up 43 61
acc_qda <- mean(qda1.pred$class == Direction.0910)
cat("QDA Model Accuracy:", acc_qda, "\n")
## QDA Model Accuracy: 0.5865385
While the overall accuracy of the QDA model (58.7%) is only slightly below that of the previous two models, the confusion matrix shows that it exclusively predicted "Up". In other words, the model did not learn to discriminate between the classes at all.
(g) Repeat (d) using KNN with \(K=1\).
train.X = as.matrix(Weekly$Lag2[w_train])
test.X = as.matrix(Weekly$Lag2[!w_train])
train.Direction = Weekly$Direction[w_train]
set.seed(1)
knn1.pred = knn(train.X, test.X, train.Direction, k=1)
table(knn1.pred, Direction.0910)
## Direction.0910
## knn1.pred Down Up
## Down 21 30
## Up 22 31
acc_knn <- mean(knn1.pred == Direction.0910)
cat("KNN Model Accuracy:", acc_knn, "\n")
## KNN Model Accuracy: 0.5
The overall accuracy of the KNN model is exactly 50%, i.e., no better than a coin flip.
(h) Repeat (d) using naive Bayes.
bayes1.fit = naiveBayes(Direction ~ Lag2, data = Weekly, subset = w_train)
bayes1.pred = predict(bayes1.fit, w_test)
table(bayes1.pred, Direction.0910)
## Direction.0910
## bayes1.pred Down Up
## Down 0 0
## Up 43 61
acc_bayes <- mean(bayes1.pred == Direction.0910)
cat("Naive Bayes Model Accuracy:", acc_bayes, "\n")
## Naive Bayes Model Accuracy: 0.5865385
Like the QDA, the naive Bayes model exclusively predicted “Up”, making the model rather useless.
(i) Which of these methods appears to provide the best results on this data?
The LDA and the logistic regression model using only Lag2
performed equally well, both yielding an overall accuracy of 62.5%.
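For reference, the held-out accuracies can be collected in one place (assembled from the quantities computed above):
# Side-by-side held-out accuracy of the five classifiers from (d)-(h)
data.frame(model = c("Logistic (Lag2)", "LDA", "QDA", "KNN (K=1)", "Naive Bayes"),
           accuracy = c(mean(glm2.pred == Direction.0910),
                        acc_lda, acc_qda, acc_knn, acc_bayes))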
(j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for \(K\) in the KNN classifier.
Testing a logistic regression with an interaction term between
Lag1 and Lag2:
glm3.fit = glm(Direction ~ Lag1*Lag2, data = Weekly, family = binomial, subset = w_train)
glm3.prob = predict(glm3.fit, w_test, type = "response")
glm3.pred = rep("Down", length(glm3.prob))
glm3.pred[glm3.prob > 0.5] = "Up"
Direction.0910 = Weekly$Direction[!w_train]
table(glm3.pred, Direction.0910)
## Direction.0910
## glm3.pred Down Up
## Down 7 8
## Up 36 53
acc_glm3 <- mean(glm3.pred == Direction.0910)
cat("Accuracy of Logit Model w. Interaction Term:", acc_glm3, "\n")
## Accuracy of Logit Model w. Interaction Term: 0.5769231
Testing an LDA with a polynomial transformation of Lag2:
lda2.fit = lda(Direction ~ poly(Lag2, 2), data = Weekly, subset = w_train)
lda2.pred = predict(lda2.fit, w_test)
table(lda2.pred$class, Direction.0910)
## Direction.0910
## Down Up
## Down 7 4
## Up 36 57
acc_lda2 <- mean(lda2.pred$class == Direction.0910)
cat("Accuracy of LDA Model w. Poly Transformation:", acc_lda2, "\n")
## Accuracy of LDA Model w. Poly Transformation: 0.6153846
Testing a KNN classifier using \(k=5\):
train.X = as.matrix(Weekly$Lag2[w_train])
test.X = as.matrix(Weekly$Lag2[!w_train])
train.Direction = Weekly$Direction[w_train]
set.seed(1)
knn2.pred = knn(train.X, test.X, train.Direction, k=5)
table(knn2.pred, Direction.0910)
## Direction.0910
## knn2.pred Down Up
## Down 16 21
## Up 27 40
acc_knn2 <- mean(knn2.pred == Direction.0910)
cat("Accuracy of KNN Model w. K=5:", acc_knn2, "\n")
## Accuracy of KNN Model w. K=5: 0.5384615
Testing a KNN classifier using \(k=20\):
train.X = as.matrix(Weekly$Lag2[w_train])
test.X = as.matrix(Weekly$Lag2[!w_train])
train.Direction = Weekly$Direction[w_train]
set.seed(1)
knn3.pred = knn(train.X, test.X, train.Direction, k=20)
table(knn3.pred, Direction.0910)
## Direction.0910
## knn3.pred Down Up
## Down 21 21
## Up 22 40
acc_knn3 <- mean(knn3.pred == Direction.0910)
cat("Accuracy of KNN Model w. K=20:", acc_knn3, "\n")
## Accuracy of KNN Model w. K=20: 0.5865385
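Rather than trying values of \(K\) one at a time, a short loop (a sketch, not part of the original write-up) scans a grid and records the held-out accuracy of each:
# Held-out accuracy of KNN for K = 1, ..., 30
set.seed(1)
k_acc = sapply(1:30, function(k)
  mean(knn(train.X, test.X, train.Direction, k = k) == Direction.0910))
which.max(k_acc)  # K with the highest held-out accuracy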
Testing a naive Bayes model with all lags as predictors:
bayes2.fit = naiveBayes(Direction ~ Lag1+Lag2+Lag3+Lag4+Lag5, data = Weekly, subset = w_train)
bayes2.pred = predict(bayes2.fit, w_test)
table(bayes2.pred, Direction.0910)
## Direction.0910
## bayes2.pred Down Up
## Down 10 21
## Up 33 40
acc_bayes2 <- mean(bayes2.pred == Direction.0910)
cat("Accuracy of Naive Bayes Model w. All Lags:", acc_bayes2, "\n")
## Accuracy of Naive Bayes Model w. All Lags: 0.4807692
After testing interaction terms, additional predictors, transformations,
and different values of \(k\), I did not achieve any better results than
before. In conclusion, the original LDA and logistic regression models
using Lag2 as the only predictor remain the best of the classifiers
considered.
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
(a) Create a binary variable, mpg01, that
contains a 1 if mpg contains a value above its median, and
a 0 if mpg contains a value below its median. You can
compute the median using the median()
function.
Auto <- Auto %>%
mutate(mpg01 = as.numeric(mpg > median(mpg)))
…
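As a quick sanity check (not shown in the original output), a median split should produce roughly balanced classes, with observations exactly at the median falling into class 0:
table(Auto$mpg01)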
(b) Explore the data graphically in order to investigate
the association between mpg01 and the other features. Which
of the other features seem most likely to be useful in predicting
mpg01? Scatterplots and boxplots may be useful tools to
answer this question. Describe your findings.
corrplot(cor(Auto[,-c(1, 9)]), method = "number", number.cex = 0.9)
From the correlation plot we observe that mpg01 exhibits
the strongest linear correlations with cylinders,
weight, displacement, and
horsepower, all of which show negative correlations with
mpg01. This suggests that vehicles with more cylinders,
greater weight, larger engine displacements, and more horsepower tend to
have below-median fuel efficiency. To further investigate these
relationships, below are four boxplots illustrating the distribution of
these variables in relation to mpg01.
boxplot(cylinders ~ mpg01, Auto)
boxplot(weight ~ mpg01, Auto)
boxplot(displacement ~ mpg01, Auto)
boxplot(horsepower ~ mpg01, Auto)
As expected, the boxplots reveal distinct differences in the
distributions of cylinders, weight,
displacement, and horsepower when
mpg01 is 0 compared to when it is 1.
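A compact numeric complement to the boxplots (my addition) is the median of each variable within the two mpg01 groups:
# Group medians of the four candidate predictors
aggregate(cbind(cylinders, weight, displacement, horsepower) ~ mpg01,
          data = Auto, FUN = median)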
(c) Split the data into a training set and a test set.
set.seed(1)
index = sample(nrow(Auto), 0.7*nrow(Auto), replace = FALSE)
a_train = Auto[index,]
a_test = Auto[-index,]
The data is split into a training set and test set using a 70/30 split.
(d) Perform LDA on the training data in order to predict
mpg01 using the variables that seemed most associated with
mpg01 in (b). What is the test error of the model obtained?
lda.fit1 = lda(mpg01 ~ cylinders + weight + displacement + horsepower,
data = a_train)
lda.pred1 = predict(lda.fit1, a_test)
table(lda.pred1$class, a_test$mpg01)
##
## 0 1
## 0 50 3
## 1 11 54
error_lda <- mean(lda.pred1$class != a_test$mpg01)
cat("LDA Model Test Error:", error_lda, "\n")
## LDA Model Test Error: 0.1186441
The LDA model results in a test error of 11.86%.
(e) Perform QDA on the training data in order to predict
mpg01 using the variables that seemed most associated with
mpg01 in (b). What is the test error of the model obtained?
qda.fit1 = qda(mpg01 ~ cylinders + weight + displacement + horsepower,
data = a_train)
qda.pred1 = predict(qda.fit1, a_test)
table(qda.pred1$class, a_test$mpg01)
##
## 0 1
## 0 52 5
## 1 9 52
error_qda <- mean(qda.pred1$class != a_test$mpg01)
cat("QDA Model Test Error:", error_qda, "\n")
## QDA Model Test Error: 0.1186441
The QDA model yields exactly the same overall test error as the LDA model (11.86%). A closer look at the confusion matrices, however, reveals differences in the per-class predictions: QDA correctly classifies more of the low-mileage cars (52 vs. 50) but fewer of the high-mileage cars (52 vs. 54). In other words, the two models' per-class precision and recall differ even though their overall error rates coincide.
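This can be made concrete by computing per-class recall from the two tables (my addition): taking column-wise proportions places each class's recall on the diagonal.
# Per-class recall (proportion of each true class predicted correctly)
diag(prop.table(table(lda.pred1$class, a_test$mpg01), margin = 2))  # LDA
diag(prop.table(table(qda.pred1$class, a_test$mpg01), margin = 2))  # QDA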
(f) Perform logistic regression on the training data in
order to predict mpg01 using the variables that seemed most
associated with mpg01 in (b). What is the test error of the
model obtained?
glm.fit1 = glm(mpg01 ~ cylinders + weight + displacement + horsepower,
data = a_train, family = binomial)
glm.prob1 = predict(glm.fit1, a_test, type = "response")
glm.pred1 = rep(0, length(glm.prob1))
glm.pred1[glm.prob1 > 0.5] = 1
table(glm.pred1, a_test$mpg01)
##
## glm.pred1 0 1
## 0 53 3
## 1 8 54
error_glm1 <- mean(glm.pred1 != a_test$mpg01)
cat("Logistic Regression Model Test Error:", error_glm1, "\n")
## Logistic Regression Model Test Error: 0.09322034
The logistic regression model results in a test error of 9.32%, making it the best-performing model so far, ahead of both the LDA and QDA in terms of test error.
(g) Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
bayes.fit1 = naiveBayes(mpg01 ~ cylinders + weight + displacement + horsepower,
data = a_train)
bayes.pred1 = predict(bayes.fit1, a_test)
table(bayes.pred1, a_test$mpg01)
##
## bayes.pred1 0 1
## 0 52 4
## 1 9 53
error_bayes1 <- mean(bayes.pred1 != a_test$mpg01)
cat("Naive Bayes Model Test Error:", error_bayes1, "\n")
## Naive Bayes Model Test Error: 0.1101695
The Naive Bayes model results in a test error of 11.02%.
(h) Perform KNN on the training data, with several values
of K, in order to predict mpg01. Use only the variables
that seemed most associated with mpg01 in (b). What test
errors do you obtain? Which value of K seems to perform the best on this
data set?
train.A <- as.matrix(a_train[, c("displacement", "horsepower", "weight", "cylinders")])
test.A <- as.matrix(a_test[, c("displacement", "horsepower", "weight", "cylinders")])
set.seed(1)
knn.pred1 <- knn(train.A, test.A, a_train$mpg01, k = 3)
table(knn.pred1, a_test$mpg01)
##
## knn.pred1 0 1
## 0 52 4
## 1 9 53
err_knn1 <- mean(knn.pred1 != a_test$mpg01)
cat("KNN Model Test Error:", err_knn1, "\n")
## KNN Model Test Error: 0.1101695
After experimenting with different values of \(k\) ranging from 1 to 10, I found that \(k=3\) yields the lowest test error of 11.02%, matching the naive Bayes model. This slightly outperforms the LDA and QDA models, while the logistic regression model remains superior among the assessed methods in terms of test error.
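The experiment described above can be written as a loop (a sketch of what was done, assuming the train.A and test.A matrices from the previous chunk):
# Held-out error of KNN for K = 1, ..., 10
set.seed(1)
sapply(1:10, function(k)
  mean(knn(train.A, test.A, a_train$mpg01, k = k) != a_test$mpg01))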
Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.
To prepare and examine the data, I will:
* Create the binary variable for crime, crim01
* Split the data into a training set and a testing set using a 70/30 split
* Use the corrplot() function to detect which predictors exhibit the strongest linear relationship with crim01:
Boston <- Boston %>%
mutate(crim01 = as.numeric(crim > median(crim)))
set.seed(1)
index = sample(nrow(Boston), 0.7*nrow(Boston), replace = FALSE)
b_train = Boston[index,]
b_test = Boston[-index,]
corrplot(cor(Boston[,-1]), method = "number", number.cex = 0.65)
The correlation plot reveals that crim01 has the
strongest linear correlations with indus,
nox, age, dis, rad,
and tax. All of these variables are positively correlated
with crim01 except dis, which has a negative
linear relationship with it.
My first model uses logistic regression to predict the response variable from the predictors described above.
my.glm.fit = glm(crim01 ~ indus + nox + age + dis + rad + tax , b_train, family = binomial)
my.glm.prob = predict(my.glm.fit, b_test, type = "response")
my.glm.pred = rep(0, length(my.glm.prob))
my.glm.pred[my.glm.prob > 0.5] = 1
table(my.glm.pred, b_test$crim01)
##
## my.glm.pred 0 1
## 0 58 6
## 1 15 73
acc.my.glm <- mean(my.glm.pred == b_test$crim01)
cat("Accuracy of Logit Model w. Linearly Correlated Predictors:", acc.my.glm, "\n")
## Accuracy of Logit Model w. Linearly Correlated Predictors: 0.8618421
Next, I will assess a model containing all predictors (excluding
crim):
my.glm.fit2 = glm(crim01 ~ .-crim , b_train, family = binomial)
my.glm.prob2 = predict(my.glm.fit2, b_test, type = "response")
my.glm.pred2 = rep(0, length(my.glm.prob2))
my.glm.pred2[my.glm.prob2 > 0.5] = 1
table(my.glm.pred2, b_test$crim01)
##
## my.glm.pred2 0 1
## 0 62 5
## 1 11 74
acc.my.glm2 <- mean(my.glm.pred2 == b_test$crim01)
cat("Accuracy of Logit Model w. All Predictors:", acc.my.glm2, "\n")
## Accuracy of Logit Model w. All Predictors: 0.8947368
Third, I will test a logistic regression model using the predictors retained by stepwise selection under the AIC criterion.
null_model = glm(crim01 ~ 1, data = b_train, family = binomial)
full_model = my.glm.fit2
step.model.AIC = step(null_model, scope = list(upper = full_model),
direction = "both", test = "Chisq", trace = F)
summary(step.model.AIC)
##
## Call:
## glm(formula = crim01 ~ nox + rad + tax + dis + zn + rm + lstat +
## black, family = binomial, data = b_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -37.090639 6.921915 -5.358 8.39e-08 ***
## nox 47.749134 7.891777 6.050 1.44e-09 ***
## rad 0.675929 0.154962 4.362 1.29e-05 ***
## tax -0.007918 0.002798 -2.830 0.00466 **
## dis 0.910433 0.253567 3.591 0.00033 ***
## zn -0.108743 0.040446 -2.689 0.00718 **
## rm 1.467346 0.511761 2.867 0.00414 **
## lstat 0.108594 0.052865 2.054 0.03996 *
## black -0.007586 0.005156 -1.471 0.14121
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 490.65 on 353 degrees of freedom
## Residual deviance: 148.44 on 345 degrees of freedom
## AIC: 166.44
##
## Number of Fisher Scoring iterations: 9
Since black is not statistically significant at the 5% level, I drop it and refit with the remaining selected predictors:
my.glm.fit3 = glm(crim01 ~ nox + rad + tax + dis + zn + rm + lstat, b_train, family = binomial)
my.glm.prob3 = predict(my.glm.fit3, b_test, type = "response")
my.glm.pred3 = rep(0, length(my.glm.prob3))
my.glm.pred3[my.glm.prob3 > 0.5] = 1
table(my.glm.pred3, b_test$crim01)
##
## my.glm.pred3 0 1
## 0 59 8
## 1 14 71
acc.my.glm3 <- mean(my.glm.pred3 == b_test$crim01)
cat("Accuracy of Logit Model w. Selected Predictors:", acc.my.glm3, "\n")
## Accuracy of Logit Model w. Selected Predictors: 0.8552632
When comparing the three logistic regression models assessed above, we observe the greatest accuracy for the model using all predictors (89.5%). However, the model using the linearly correlated predictors and the model using the stepwise-selected predictors perform almost equally well, achieving accuracies of 86.2% and 85.5% respectively. The simplicity of these two models is generally preferable to the full model, especially considering the minor difference in predictive power.
The fourth model I will test is an LDA using the predictors from stepwise selection.
my.lda.fit = lda(crim01 ~ nox + rad + tax + dis + zn +rm +lstat, b_train)
my.lda.pred = predict(my.lda.fit, b_test)
table(my.lda.pred$class, b_test$crim01)
##
## 0 1
## 0 73 20
## 1 0 59
acc.my.lda <- mean(my.lda.pred$class == b_test$crim01)
cat("Accuracy of LDA Model:", acc.my.lda, "\n")
## Accuracy of LDA Model: 0.8684211
The fifth model applies QDA to the selected predictors.
my.qda.fit = qda(crim01 ~ nox + rad + tax + dis + zn +rm +lstat, b_train)
my.qda.pred = predict(my.qda.fit, b_test)
table(my.qda.pred$class, b_test$crim01)
##
## 0 1
## 0 67 8
## 1 6 71
acc.my.qda <- mean(my.qda.pred$class == b_test$crim01)
cat("Accuracy of QDA Model:", acc.my.qda, "\n")
## Accuracy of QDA Model: 0.9078947
Next, we will test a naive Bayes model on the selected predictors.
my.bayes.fit = naiveBayes(crim01 ~ nox + rad + tax + dis + zn +rm +lstat, b_train)
my.bayes.pred = predict(my.bayes.fit, b_test)
table(my.bayes.pred, b_test$crim01)
##
## my.bayes.pred 0 1
## 0 63 20
## 1 10 59
acc.my.bayes <- mean(my.bayes.pred == b_test$crim01)
cat("Accuracy of Naive Bayes Model:", acc.my.bayes, "\n")
## Accuracy of Naive Bayes Model: 0.8026316
The final model will employ KNN on the selected predictors, using \(k=1\) after testing different values of \(k\) for optimal accuracy.
train.B <- as.matrix(b_train[, c("nox", "rad", "tax", "dis", "zn", "rm", "lstat")])
test.B <- as.matrix(b_test[, c("nox", "rad", "tax", "dis", "zn", "rm", "lstat")])
set.seed(1)
my.knn.pred <- knn(train.B, test.B, b_train$crim01, k = 1)
table(my.knn.pred, b_test$crim01)
##
## my.knn.pred 0 1
## 0 67 1
## 1 6 78
acc.my.knn <- mean(my.knn.pred == b_test$crim01)
cat("Accuracy of KNN Model:", acc.my.knn, "\n")
## Accuracy of KNN Model: 0.9539474
After evaluating the seven models above, employing various methodologies and different combinations of predictors for logistic regression, as well as testing a range of values of \(k\) for KNN, the model that exhibited the best performance was the KNN model with \(k=1\). It achieved an accuracy of 95.39%, followed by the QDA model with an accuracy of 90.79%. In conclusion, the KNN model was the best at predicting whether a census tract has a crime rate above or below the median.
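One caveat: KNN is distance-based, so running it on unscaled predictors lets large-scale variables such as tax dominate the distance metric. A variant worth checking (a sketch under the same train/test split, not part of the analysis above) standardizes each column using the training means and standard deviations:
# Standardize the predictors before KNN; distance-based methods are
# sensitive to scale (here tax and rad dwarf variables like nox)
vars = c("nox", "rad", "tax", "dis", "zn", "rm", "lstat")
train.Bs = scale(b_train[, vars])
test.Bs = scale(b_test[, vars],
                center = attr(train.Bs, "scaled:center"),
                scale = attr(train.Bs, "scaled:scale"))
set.seed(1)
mean(knn(train.Bs, test.Bs, b_train$crim01, k = 1) == b_test$crim01)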