Problem 13

library(ISLR2)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(corrplot)
## corrplot 0.95 loaded
library(caret)
## Loading required package: lattice
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## The following object is masked from 'package:ISLR2':
## 
##     Boston
library(class)
library(e1071)

#Load Weekly dataset
data('Weekly')

a.

Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

str(Weekly)
## 'data.frame':    1089 obs. of  9 variables:
##  $ Year     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
##  $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
##  $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
##  $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
##  $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
##  $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
##  $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
pairs(Weekly)

corrplot(cor(Weekly[, sapply(Weekly, is.numeric)]), method = 'number')

ggplot(Weekly, aes(x = Year, y = Volume, group = Year)) +
  geom_boxplot() +
  ggtitle('Trading Volume has increased over time')

ggplot(Weekly, aes(x = Today)) +
  geom_histogram(binwidth = 0.5, fill = 'blue', color = 'black') +
  ggtitle('Returns has a normally distributed behavior')

ggplot(Weekly, aes(x = Direction, y = Volume, fill = Direction)) +
  geom_boxplot() + 
  ggtitle("Volume does not differ between Directions")

Findings:

Lags are likely to be multicollinear because the value of one, directly affects the other, and summary statistics look eerily similar.

The only strong correlation is the positive correlation between Volume and Year. Volume's trend is positive with each increase in Year.

Today has a normal distribution. This shows that there aren’t more positive or negative returns, they are about the same.

Volume does not differ between Direction's which shows that Trading Volume is probably not associated with Direction.

Our data seems to be balanced. Direction has 484 ‘Down’ and 605 ‘Up’.

b.

Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

logistic_model_13b <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, 
                          data = Weekly, family = 'binomial')
summary(logistic_model_13b)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = "binomial", data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Answer

Lag2 is the only predictor in our model that is statistically significant. P-value(0.0296) is lower than 0.05.

c.

Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

#Probabilities
pred_probs_13c <- predict(logistic_model_13b, type = 'response')

#Convert probabilities to class labels
pred_class_13c <- ifelse(pred_probs_13c > 0.5, 'Up', 'Down')

#Confusion Matrix
confusionMatrix(as.factor(pred_class_13c), Weekly$Direction, positive = 'Up')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down  Up
##       Down   54  48
##       Up    430 557
##                                          
##                Accuracy : 0.5611         
##                  95% CI : (0.531, 0.5908)
##     No Information Rate : 0.5556         
##     P-Value [Acc > NIR] : 0.369          
##                                          
##                   Kappa : 0.035          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.9207         
##             Specificity : 0.1116         
##          Pos Pred Value : 0.5643         
##          Neg Pred Value : 0.5294         
##              Prevalence : 0.5556         
##          Detection Rate : 0.5115         
##    Detection Prevalence : 0.9063         
##       Balanced Accuracy : 0.5161         
##                                          
##        'Positive' Class : Up             
## 

Answer

Our confusion matrix is telling us that our model is biased towards predicting Up, as seen in the large number of False Positives(430). It struggles to correctly predict Down cases, meaning it overestimates the likelihood of the market going up.

For model improvement, an option would be to use the pROC to find the optimal threshold of balance.

d.

Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

train_df <- Weekly$Year <= 2008
test_df <- Weekly$Year > 2008
logistic_model_13d <- glm(Direction ~ Lag2, data = Weekly, family = 'binomial', subset = train_df)

#Prediction probabilities
pred_probs_13d <- predict(logistic_model_13d, Weekly[test_df, ], type = 'response')
#Class
pred_class_13d <- ifelse(pred_probs_13d > 0.5, 'Up', 'Down')

confusionMatrix(as.factor(pred_class_13d), Weekly$Direction[test_df], positive = 'Up')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9  5
##       Up     34 56
##                                          
##                Accuracy : 0.625          
##                  95% CI : (0.5247, 0.718)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.2439         
##                                          
##                   Kappa : 0.1414         
##                                          
##  Mcnemar's Test P-Value : 7.34e-06       
##                                          
##             Sensitivity : 0.9180         
##             Specificity : 0.2093         
##          Pos Pred Value : 0.6222         
##          Neg Pred Value : 0.6429         
##              Prevalence : 0.5865         
##          Detection Rate : 0.5385         
##    Detection Prevalence : 0.8654         
##       Balanced Accuracy : 0.5637         
##                                          
##        'Positive' Class : Up             
## 

e.

Repeat (d) using LDA.

#Fit model
lda_model_13e <- lda(Direction ~ Lag2, data = Weekly, subset = train_df)

#Probabilities
pred_probs_13e <- predict(lda_model_13e, Weekly[test_df, ])

#Class
pred_class_13e <- pred_probs_13e$class

#Confusion matrix
confusionMatrix(pred_class_13e, Weekly$Direction[test_df], positive = 'Up')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9  5
##       Up     34 56
##                                          
##                Accuracy : 0.625          
##                  95% CI : (0.5247, 0.718)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.2439         
##                                          
##                   Kappa : 0.1414         
##                                          
##  Mcnemar's Test P-Value : 7.34e-06       
##                                          
##             Sensitivity : 0.9180         
##             Specificity : 0.2093         
##          Pos Pred Value : 0.6222         
##          Neg Pred Value : 0.6429         
##              Prevalence : 0.5865         
##          Detection Rate : 0.5385         
##    Detection Prevalence : 0.8654         
##       Balanced Accuracy : 0.5637         
##                                          
##        'Positive' Class : Up             
## 

f.

Repead (d) using QDA.

qda_model_13f <- qda(Direction ~ Lag2, data = Weekly, subset = train_df)

#Prediction probabilities
pred_probs_13f <- predict(qda_model_13f, Weekly[test_df, ])
pred_class_13f<- pred_probs_13f$class

#Confusion Matrix
confusionMatrix(pred_class_13f, Weekly$Direction[test_df], positive = 'Up')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    0  0
##       Up     43 61
##                                           
##                Accuracy : 0.5865          
##                  95% CI : (0.4858, 0.6823)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.5419          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 1.504e-10       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.5865          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.5865          
##          Detection Rate : 0.5865          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Up              
## 

g.

Repeat (d) using KNN with K = 1.

train_x <- Weekly[train_df, 'Lag2', drop = FALSE]
test_x <- Weekly[test_df, 'Lag2', drop = FALSE]
train_y <- Weekly$Direction[train_df]
test_y <- Weekly$Direction[test_df]

# KNN model
set.seed(42)
knn_preds_13g <- knn(train_x, test_x, train_y, k = 1)

#Confusion Matrix
confusionMatrix(knn_preds_13g, test_y, positive = 'Up')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down   21 30
##       Up     22 31
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.4003, 0.5997)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.9700          
##                                           
##                   Kappa : -0.0033         
##                                           
##  Mcnemar's Test P-Value : 0.3317          
##                                           
##             Sensitivity : 0.5082          
##             Specificity : 0.4884          
##          Pos Pred Value : 0.5849          
##          Neg Pred Value : 0.4118          
##              Prevalence : 0.5865          
##          Detection Rate : 0.2981          
##    Detection Prevalence : 0.5096          
##       Balanced Accuracy : 0.4983          
##                                           
##        'Positive' Class : Up              
## 

h.

Repeat (d) using naive Bayes.

nb_model_13h <- naiveBayes(Direction ~ Lag2, data = Weekly, subset = train_df)

#Predict on test set
pred_probs_13h <- predict(nb_model_13h, Weekly[test_df, ])

#Confusion matrix
confusionMatrix(pred_probs_13h, Weekly$Direction[test_df], positive = 'Up')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    0  0
##       Up     43 61
##                                           
##                Accuracy : 0.5865          
##                  95% CI : (0.4858, 0.6823)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.5419          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 1.504e-10       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.5865          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.5865          
##          Detection Rate : 0.5865          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Up              
## 

i.

Which of these methods appears to provide the best results on this data?

Answer

Logistic Regression and LDA both appear to provide the best results. Both have an accuracy of 62.5%, sensitivity of 91.8%, and sensitivity 20.9%. However, the best model would be Logistic Regression due to the interpret-ability we can get from the summary() function.

j.

Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

Logistic Regression

log_model_13j <- glm(Direction ~ I(Lag1^2) * Lag2, data = Weekly, subset = train_df, family = 'binomial')
summary(log_model_13j)
## 
## Call:
## glm(formula = Direction ~ I(Lag1^2) * Lag2, family = "binomial", 
##     data = Weekly, subset = train_df)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)     0.1954492  0.0719405   2.717  0.00659 **
## I(Lag1^2)       0.0001890  0.0074888   0.025  0.97986   
## Lag2            0.0716183  0.0315379   2.271  0.02316 * 
## I(Lag1^2):Lag2 -0.0007183  0.0011226  -0.640  0.52229   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1349.3  on 981  degrees of freedom
## AIC: 1357.3
## 
## Number of Fisher Scoring iterations: 4
pred_probs_13j_log <- predict(log_model_13j, Weekly[test_df, ], type = 'response')

pred_class_13j_log <- ifelse(pred_probs_13j_log > 0.5, 'Up', 'Down')

confusionMatrix(as.factor(pred_class_13j_log), Weekly$Direction[test_df], positive = 'Up')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    8  2
##       Up     35 59
##                                           
##                Accuracy : 0.6442          
##                  95% CI : (0.5443, 0.7357)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.1364          
##                                           
##                   Kappa : 0.1728          
##                                           
##  Mcnemar's Test P-Value : 1.435e-07       
##                                           
##             Sensitivity : 0.9672          
##             Specificity : 0.1860          
##          Pos Pred Value : 0.6277          
##          Neg Pred Value : 0.8000          
##              Prevalence : 0.5865          
##          Detection Rate : 0.5673          
##    Detection Prevalence : 0.9038          
##       Balanced Accuracy : 0.5766          
##                                           
##        'Positive' Class : Up              
## 

LDA

#Fit model
lda_model_13j <- lda(Direction ~ Lag1 * Lag2 + Volume, data = Weekly, subset = train_df)

#Probabilities
pred_probs_13j <- predict(lda_model_13j, Weekly[test_df, ])

#Class
pred_class_13j <- pred_probs_13j$class

#Confusion matrix
confusionMatrix(pred_class_13j, Weekly$Direction[test_df], positive = 'Up')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down   27 32
##       Up     16 29
##                                          
##                Accuracy : 0.5385         
##                  95% CI : (0.438, 0.6367)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.86308        
##                                          
##                   Kappa : 0.0979         
##                                          
##  Mcnemar's Test P-Value : 0.03038        
##                                          
##             Sensitivity : 0.4754         
##             Specificity : 0.6279         
##          Pos Pred Value : 0.6444         
##          Neg Pred Value : 0.4576         
##              Prevalence : 0.5865         
##          Detection Rate : 0.2788         
##    Detection Prevalence : 0.4327         
##       Balanced Accuracy : 0.5517         
##                                          
##        'Positive' Class : Up             
## 

QDA

qda_model_13j <- qda(Direction ~ Lag1 * Lag2, data = Weekly, subset = train_df)

#Prediction probabilities
pred_probs_13j <- predict(qda_model_13j, Weekly[test_df, ])
pred_class_13j<- pred_probs_13j$class

#Confusion Matrix
confusionMatrix(pred_class_13j, Weekly$Direction[test_df], positive = 'Up')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down   23 36
##       Up     20 25
##                                          
##                Accuracy : 0.4615         
##                  95% CI : (0.3633, 0.562)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.99616        
##                                          
##                   Kappa : -0.0524        
##                                          
##  Mcnemar's Test P-Value : 0.04502        
##                                          
##             Sensitivity : 0.4098         
##             Specificity : 0.5349         
##          Pos Pred Value : 0.5556         
##          Neg Pred Value : 0.3898         
##              Prevalence : 0.5865         
##          Detection Rate : 0.2404         
##    Detection Prevalence : 0.4327         
##       Balanced Accuracy : 0.4724         
##                                          
##        'Positive' Class : Up             
## 

KNN

train_x <- Weekly[train_df, c('Lag1', 'Lag2'), drop = FALSE]
test_x <- Weekly[test_df, c('Lag1', 'Lag2'), drop = FALSE]
train_y <- Weekly$Direction[train_df]
test_y <- Weekly$Direction[test_df]

train_x <- scale(train_x)
test_x <- scale(test_x)

# KNN model
set.seed(42)
knn_preds_13j <- knn(train_x, test_x, train_y, k = 10)

#Confusion Matrix
confusionMatrix(knn_preds_13j, test_y, positive = 'Up')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down   23 28
##       Up     20 33
##                                          
##                Accuracy : 0.5385         
##                  95% CI : (0.438, 0.6367)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.8631         
##                                          
##                   Kappa : 0.0738         
##                                          
##  Mcnemar's Test P-Value : 0.3123         
##                                          
##             Sensitivity : 0.5410         
##             Specificity : 0.5349         
##          Pos Pred Value : 0.6226         
##          Neg Pred Value : 0.4510         
##              Prevalence : 0.5865         
##          Detection Rate : 0.3173         
##    Detection Prevalence : 0.5096         
##       Balanced Accuracy : 0.5379         
##                                          
##        'Positive' Class : Up             
## 

Naive Bayes

nb_model_13j <- naiveBayes(Direction ~ Lag1 + Lag2 + Lag3 + Volume, data = Weekly, subset = train_df)

#Predict on test set
pred_probs_13j <- predict(nb_model_13j, Weekly[test_df, ])

#Confusion matrix
confusionMatrix(pred_probs_13j, Weekly$Direction[test_df], positive = 'Up')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down   41 58
##       Up      2  3
##                                           
##                Accuracy : 0.4231          
##                  95% CI : (0.3268, 0.5239)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.9997          
##                                           
##                   Kappa : 0.0022          
##                                           
##  Mcnemar's Test P-Value : 1.243e-12       
##                                           
##             Sensitivity : 0.04918         
##             Specificity : 0.95349         
##          Pos Pred Value : 0.60000         
##          Neg Pred Value : 0.41414         
##              Prevalence : 0.58654         
##          Detection Rate : 0.02885         
##    Detection Prevalence : 0.04808         
##       Balanced Accuracy : 0.50133         
##                                           
##        'Positive' Class : Up              
## 

Answer

In terms of accuracy, our logistic regression model with the interaction of Lag1 ^2 and Lag2 gave us the best with 64.44%. The only concerning thing about this model is that most of our models predictions are ‘Up’

In terms of balance in sensitivity and specificity as well as high accuracy, our KNN and LDA models show solid performance by both having an accuracy of 53.85%, with KNN having the best balance of sensitivity and specificity of 54.10% and 53.49%.

KNN used Lag1 and Lag2 while LDA used the interaction of Lag1 and Lag2 + Volume.

Problem 14

data('Auto')

a.

Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

Auto$mpg01 <- ifelse(Auto$mpg > median(Auto$mpg), 1, 0)

b.

Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

pairs(Auto[, sapply(Auto, is.numeric)])

We have got a few variables with linear negative relationships to mpg like displacement, horsepower, weight, and a few positive linear relationships like acceleration and year.

boxplot(Auto$horsepower ~ Auto$mpg01, main = "Horsepower vs mpg01", xlab = "mpg01", ylab = "Horsepower", 
        col = c("red", "blue"))

boxplot(Auto$displacement ~ Auto$mpg01, main = "Displacement vs mpg01", xlab = "mpg01", ylab = "Displacement", 
        col = c("red", "blue"))

boxplot(Auto$weight ~ Auto$mpg01, main = "Weight vs mpg01", xlab = "mpg01", ylab = "Weight", 
        col = c("red", "blue"))

boxplot(Auto$acceleration ~ Auto$mpg01, main = "Acceleration vs mpg01", xlab = "mpg01", ylab = "Acceleration", 
        col = c("red", "blue"))

boxplot(Auto$year ~ Auto$mpg01, main = "Year vs mpg01", xlab = "mpg01", ylab = "Year", 
        col = c("red", "blue"))

boxplot(Auto$cylinders ~ Auto$mpg01, main = "Cylinders vs mpg01", xlab = "mpg01", ylab = "Cylinders", 
        col = c("red", "blue"))

Since we are interested in our new categorical variable of mpg01, we use this with boxplots to see if our predictors can separate our variable.

Based on our boxplots, we can see that those same variables that had a linear relationship with mpg also have a trend that separates our mpg01 which just reinforces our findings of using those variables to predict higher vs lower mileage. Another interesting finding is that cylinders shows a big separation, which could be due to an unbalance in the amount of cars with lower cylinders, but generally more cylinders mean lower mileage.

knitr::kable(table(Auto$cylinders), col.names = c('Cylinders', 'Frequency'), align = 'c')
Cylinders Frequency
3 4
4 199
5 3
6 83
8 103

We can confirm that our box plots behavior for cylinders was not due to an unbalance in the data, so it has potential for being a significant predictor to our outcome variable mpg01.

c.

Split the data into a training set and a test set.

set.seed(42)

train_index_Auto <- createDataPartition(Auto$mpg01, p = 0.7, list = FALSE)
train_data_Auto <- Auto[train_index_Auto, ]
test_data_Auto <- Auto[-train_index_Auto, ]

d.

Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

lda_model_14d <- lda(mpg01 ~ horsepower + weight + displacement + acceleration + as.factor(cylinders) + year, 
                     data = train_data_Auto)

lda_preds_14d <- predict(lda_model_14d, test_data_Auto)
lda_class_14d <- lda_preds_14d$class

confusionMatrix(lda_class_14d, as.factor(test_data_Auto$mpg01))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 52  4
##          1  6 54
##                                           
##                Accuracy : 0.9138          
##                  95% CI : (0.8472, 0.9579)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8276          
##                                           
##  Mcnemar's Test P-Value : 0.7518          
##                                           
##             Sensitivity : 0.8966          
##             Specificity : 0.9310          
##          Pos Pred Value : 0.9286          
##          Neg Pred Value : 0.9000          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4483          
##    Detection Prevalence : 0.4828          
##       Balanced Accuracy : 0.9138          
##                                           
##        'Positive' Class : 0               
## 

After creating a LDA model using the variables that seemed associated the most with mpg01, we got a test accuracy of 91.38%.

e.

Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

qda_model_15e <- qda(mpg01 ~ horsepower + weight + displacement + acceleration + cylinders + year, 
                     data = train_data_Auto)

qda_preds_15e <- predict(qda_model_15e, test_data_Auto)
qda_class_15e <- qda_preds_15e$class

confusionMatrix(qda_class_15e, as.factor(test_data_Auto$mpg01))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 50  3
##          1  8 55
##                                           
##                Accuracy : 0.9052          
##                  95% CI : (0.8367, 0.9517)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8103          
##                                           
##  Mcnemar's Test P-Value : 0.2278          
##                                           
##             Sensitivity : 0.8621          
##             Specificity : 0.9483          
##          Pos Pred Value : 0.9434          
##          Neg Pred Value : 0.8730          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4310          
##    Detection Prevalence : 0.4569          
##       Balanced Accuracy : 0.9052          
##                                           
##        'Positive' Class : 0               
## 

After creating a QDA model using the variables that seemed associated the most with mpg01, we got a test accuracy of 90.52%.

Also, we kept cylinders as a numerical value because QDA does not take categorical variables as predictors.

f.

Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b).

What is the test error of the model obtained?

log_model_15f <- glm(mpg01 ~ horsepower + weight + displacement + acceleration + as.factor(cylinders) + year, 
                     data = train_data_Auto, family = 'binomial')

log_probs_15f <- predict(log_model_15f, test_data_Auto, type = 'response')
log_preds_15f <- ifelse(log_probs_15f > 0.5, 1, 0)

confusionMatrix(as.factor(log_preds_15f), as.factor(test_data_Auto$mpg01))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 51  3
##          1  7 55
##                                           
##                Accuracy : 0.9138          
##                  95% CI : (0.8472, 0.9579)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8276          
##                                           
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.8793          
##             Specificity : 0.9483          
##          Pos Pred Value : 0.9444          
##          Neg Pred Value : 0.8871          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4397          
##    Detection Prevalence : 0.4655          
##       Balanced Accuracy : 0.9138          
##                                           
##        'Positive' Class : 0               
## 

After creating a logistic regression model using the variables that seemed associated the most with mpg01, we got a test accuracy of 91.38%.

g.

Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

nb_model_15g <- naiveBayes(mpg01 ~ horsepower + weight + displacement + acceleration + as.factor(cylinders) + year, 
                     data = train_data_Auto)

nb_preds_15g <- predict(nb_model_15g, test_data_Auto)

confusionMatrix(as.factor(nb_preds_15g), as.factor(test_data_Auto$mpg01))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 46  1
##          1 12 57
##                                         
##                Accuracy : 0.8879        
##                  95% CI : (0.816, 0.939)
##     No Information Rate : 0.5           
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.7759        
##                                         
##  Mcnemar's Test P-Value : 0.005546      
##                                         
##             Sensitivity : 0.7931        
##             Specificity : 0.9828        
##          Pos Pred Value : 0.9787        
##          Neg Pred Value : 0.8261        
##              Prevalence : 0.5000        
##          Detection Rate : 0.3966        
##    Detection Prevalence : 0.4052        
##       Balanced Accuracy : 0.8879        
##                                         
##        'Positive' Class : 0             
## 

After creating a Naive Bayes model using the variables that seemed associated the most with mpg01, we got a test accuracy of 88.79%.

h.

Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b).

What test errors do you obtain? Which value of K seems to perform the best on this data set?

set.seed(42)
train_x_15h <- train_data_Auto[, c('horsepower', 'weight', 'displacement', 'acceleration', 'cylinders', 'year')]
test_x_15h <- test_data_Auto[, c('horsepower', 'weight', 'displacement', 'acceleration', 'cylinders', 'year')]
train_y_15h <- train_data_Auto$mpg01
test_y_15h <- test_data_Auto$mpg01

knn_preds_15h <- knn(train_x_15h, test_x_15h, train_y_15h, k = 1)
knn_preds_15h2 <- knn(train_x_15h, test_x_15h, train_y_15h, k = 4)
knn_preds_15h3 <- knn(train_x_15h, test_x_15h, train_y_15h, k = 8)

#Confusion Matrix
confusionMatrix(as.factor(knn_preds_15h), as.factor(test_y_15h))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 47  4
##          1 11 54
##                                           
##                Accuracy : 0.8707          
##                  95% CI : (0.7957, 0.9258)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7414          
##                                           
##  Mcnemar's Test P-Value : 0.1213          
##                                           
##             Sensitivity : 0.8103          
##             Specificity : 0.9310          
##          Pos Pred Value : 0.9216          
##          Neg Pred Value : 0.8308          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4052          
##    Detection Prevalence : 0.4397          
##       Balanced Accuracy : 0.8707          
##                                           
##        'Positive' Class : 0               
## 
confusionMatrix(as.factor(knn_preds_15h2), as.factor(test_y_15h))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 50  4
##          1  8 54
##                                           
##                Accuracy : 0.8966          
##                  95% CI : (0.8263, 0.9454)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7931          
##                                           
##  Mcnemar's Test P-Value : 0.3865          
##                                           
##             Sensitivity : 0.8621          
##             Specificity : 0.9310          
##          Pos Pred Value : 0.9259          
##          Neg Pred Value : 0.8710          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4310          
##    Detection Prevalence : 0.4655          
##       Balanced Accuracy : 0.8966          
##                                           
##        'Positive' Class : 0               
## 
confusionMatrix(as.factor(knn_preds_15h3), as.factor(test_y_15h))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 48  3
##          1 10 55
##                                         
##                Accuracy : 0.8879        
##                  95% CI : (0.816, 0.939)
##     No Information Rate : 0.5           
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.7759        
##                                         
##  Mcnemar's Test P-Value : 0.09609       
##                                         
##             Sensitivity : 0.8276        
##             Specificity : 0.9483        
##          Pos Pred Value : 0.9412        
##          Neg Pred Value : 0.8462        
##              Prevalence : 0.5000        
##          Detection Rate : 0.4138        
##    Detection Prevalence : 0.4397        
##       Balanced Accuracy : 0.8879        
##                                         
##        'Positive' Class : 0             
## 

After creating 3 KNN models using the variables that seemed associated the most with mpg01, we got:

For k = 1 a test accuracy of 87.07%

For k = 4 a test accuracy of 89.66%

For k = 8 a test accuracy of 88.79%

Problem 16

Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.

Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.

data('Boston')

Create Response Variable

# (1) is high crime and (0) is low crime
Boston$crim01 <- ifelse(Boston$crim > median(Boston$crim), 1, 0)

Data Exploration

knitr::kable(summary(Boston))
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv crim01
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000 Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130 Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32 Min. : 1.73 Min. : 5.00 Min. :0.0
1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38 1st Qu.: 6.95 1st Qu.:17.02 1st Qu.:0.0
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000 Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207 Median : 5.000 Median :330.0 Median :19.05 Median :391.44 Median :11.36 Median :21.20 Median :0.5
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917 Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795 Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67 Mean :12.65 Mean :22.53 Mean :0.5
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23 3rd Qu.:16.95 3rd Qu.:25.00 3rd Qu.:1.0
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000 Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90 Max. :37.97 Max. :50.00 Max. :1.0
pairs(Boston)

chas is 0’s and 1’s so we convert it to categorical

Boston$chas <- as.factor(Boston$chas)
Boston$crim01 <- as.factor(Boston$crim01)
boxplot(Boston$indus, Boston$crim01, main = 'Industrial Area Perfectly separates Crime Rate', 
        xlab = 'Crime Rate 0 or 1', ylab = 'Proportion of Industrial Area', 
        col = c('red', 'blue'))

boxplot(Boston$nox, Boston$crim01, main = 'Air Contamination Perfectly separates Crime Rate', 
        xlab = 'Crime Rate 0 or 1', ylab = 'Nitrogen Oxide Concentration', 
        col = c('red', 'blue'))

boxplot(Boston$tax, Boston$crim01, main = 'Tax rate values Perfectly separates Crime Rate', 
        xlab = 'Crime Rate 0 or 1', ylab = 'Properties Tax Rate', 
        col = c('red', 'blue'))

boxplot(Boston$medv, Boston$crim01, main = 'Value of homes Perfectly separates Crime Rate', 
        xlab = 'Crime Rate 0 or 1', ylab = 'value of homes', 
        col = c('red', 'blue'))

boxplot(Boston$rad, Boston$crim01, 
        main = 'Zone with highway access Perfectly separates Crime Rate', 
        xlab = 'Crime Rate 0 or 1', ylab = 'Access to highways', 
        col = c('red', 'blue'))

boxplot(Boston$rm, Boston$crim01, 
        main = 'Zone with highway access Perfectly separates Crime Rate', 
        xlab = 'Crime Rate 0 or 1', ylab = 'Access to highways', 
        col = c('red', 'blue'))

It appears as though indus, nox, tax, medv, rad and rm all perfectly separate crime between high and low.

Train-Test split

set.seed(42)
train_index_Boston <- createDataPartition(Boston$crim01, p = 0.7, list = FALSE)

train_data_Boston <- Boston[train_index_Boston, ]
test_data_Boston <- Boston[-train_index_Boston, ]

Logistic Regression

log_model_16 <- glm(crim01 ~ indus + nox + tax + medv + rad, data = train_data_Boston, 
                    family = 'binomial')

probs_log <- predict(log_model_16, test_data_Boston, type = 'response')
preds_log <- ifelse(probs_log > 0.5, 1, 0)

confusionMatrix(as.factor(preds_log), as.factor(test_data_Boston$crim01))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 69 14
##          1  6 61
##                                           
##                Accuracy : 0.8667          
##                  95% CI : (0.8016, 0.9166)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7333          
##                                           
##  Mcnemar's Test P-Value : 0.1175          
##                                           
##             Sensitivity : 0.9200          
##             Specificity : 0.8133          
##          Pos Pred Value : 0.8313          
##          Neg Pred Value : 0.9104          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4600          
##    Detection Prevalence : 0.5533          
##       Balanced Accuracy : 0.8667          
##                                           
##        'Positive' Class : 0               
## 

Logistic Regression model has an accuracy of 86.67% when predicting if crime is high or low.

LDA

lda_model_16 <- lda(crim01 ~ indus + tax + nox + medv + rad, data = train_data_Boston)

lda_preds_16 <- predict(lda_model_16, test_data_Boston)$class

# Confusion Matrix
confusionMatrix(lda_preds_16, as.factor(test_data_Boston$crim01))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 74 19
##          1  1 56
##                                           
##                Accuracy : 0.8667          
##                  95% CI : (0.8016, 0.9166)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7333          
##                                           
##  Mcnemar's Test P-Value : 0.0001439       
##                                           
##             Sensitivity : 0.9867          
##             Specificity : 0.7467          
##          Pos Pred Value : 0.7957          
##          Neg Pred Value : 0.9825          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4933          
##    Detection Prevalence : 0.6200          
##       Balanced Accuracy : 0.8667          
##                                           
##        'Positive' Class : 0               
## 

LDA model has an accuracy of 86.67% when predicting if crime is high or low.

Naive Bayes

nb_model_16 <- naiveBayes(crim01 ~ indus + tax + nox + medv + rad, data = train_data_Boston)

nb_preds_16 <- predict(nb_model_16, test_data_Boston)

confusionMatrix(nb_preds_16, as.factor(test_data_Boston$crim01))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 73 24
##          1  2 51
##                                           
##                Accuracy : 0.8267          
##                  95% CI : (0.7564, 0.8835)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6533          
##                                           
##  Mcnemar's Test P-Value : 3.814e-05       
##                                           
##             Sensitivity : 0.9733          
##             Specificity : 0.6800          
##          Pos Pred Value : 0.7526          
##          Neg Pred Value : 0.9623          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4867          
##    Detection Prevalence : 0.6467          
##       Balanced Accuracy : 0.8267          
##                                           
##        'Positive' Class : 0               
## 

Naive Bayes model has an accuracy of 82.67% when predicting if crime is high or low.

KNN

set.seed(42)
train_x_16 <- train_data_Boston[, c('indus', 'tax', 'nox', 'medv', 'rad')]
test_x_16 <- test_data_Boston[, c('indus', 'tax', 'nox', 'medv', 'rad')]
train_y_16 <- train_data_Boston$crim01
test_y_16 <- test_data_Boston$crim01

knn_preds_16 <- knn(train_x_16, test_x_16, train_y_16, k = 1)
knn_preds_16_2 <- knn(train_x_16, test_x_16, train_y_16, k = 2)
knn_preds_16_3 <- knn(train_x_16, test_x_16, train_y_16, k = 4)

#Confusion Matrix
confusionMatrix(as.factor(knn_preds_16), as.factor(test_y_16))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 65  5
##          1 10 70
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.8404, 0.9429)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8             
##                                           
##  Mcnemar's Test P-Value : 0.3017          
##                                           
##             Sensitivity : 0.8667          
##             Specificity : 0.9333          
##          Pos Pred Value : 0.9286          
##          Neg Pred Value : 0.8750          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4333          
##    Detection Prevalence : 0.4667          
##       Balanced Accuracy : 0.9000          
##                                           
##        'Positive' Class : 0               
## 
confusionMatrix(as.factor(knn_preds_16_2), as.factor(test_y_16))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 68  3
##          1  7 72
##                                           
##                Accuracy : 0.9333          
##                  95% CI : (0.8808, 0.9676)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8667          
##                                           
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.9067          
##             Specificity : 0.9600          
##          Pos Pred Value : 0.9577          
##          Neg Pred Value : 0.9114          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4533          
##    Detection Prevalence : 0.4733          
##       Balanced Accuracy : 0.9333          
##                                           
##        'Positive' Class : 0               
## 
confusionMatrix(as.factor(knn_preds_16_3), as.factor(test_y_16))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 67  3
##          1  8 72
##                                           
##                Accuracy : 0.9267          
##                  95% CI : (0.8726, 0.9628)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8533          
##                                           
##  Mcnemar's Test P-Value : 0.2278          
##                                           
##             Sensitivity : 0.8933          
##             Specificity : 0.9600          
##          Pos Pred Value : 0.9571          
##          Neg Pred Value : 0.9000          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4467          
##    Detection Prevalence : 0.4667          
##       Balanced Accuracy : 0.9267          
##                                           
##        'Positive' Class : 0               
## 

After creating 3 KNN models using the variables that seemed associated the most with crim01, we got:

For k = 1 a test accuracy of 90%

For k = 2 a test accuracy of 93.33%

For k = 4 a test accuracy of 92.67%

Final Findings

After exploring Logistic Regression, LDA, Naive Bayes, and KNN models, we saw the best model for our data was KNN. We ran 3 different models of KNN basically changing the number of K, and found that specifically K = 2 was our best performing model with an accuracy of 93.33% which is very high.

On the other hand, our worst performing model was our Naive Bayes with an accuracy of 82.67% which could still be considered relatively high, but is still vastly outperformed by our KNN model.

Our findings also suggest that changing the predictor variables could help improve other models, as well as using transformations, since each model interprets data differently.