Problem 4.13

This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
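
The code below assumes the following packages are loaded (a setup sketch; caret supplies confusionMatrix() and createDataPartition(), MASS supplies lda(), qda(), and stepAIC(), class supplies knn(), and e1071 supplies naiveBayes()):

library(ISLR2) # Weekly, Auto data sets
library(caret) # confusionMatrix(), createDataPartition()
library(MASS)  # lda(), qda(), stepAIC(); loading after ISLR2 means Boston comes from MASS (with the black column used in Problem 4.16)
library(class) # knn()
library(e1071) # naiveBayes()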

weekly <- Weekly
  (a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
summary(weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
pairs(weekly, col=weekly$Direction)

Looking at the summary statistics, Lag1 through Lag5 have very similar distributions, with nearly identical values for each statistic. We also see that there are more weeks with an upward Direction, consistent with the positive mean and median of Today.

When we examine the pairwise plots, we see a strong relationship between Direction and Today: high values of Today correspond to Up and low values to Down, which reflects the fact that Direction is derived from Today. The other strong relationship is between Year and Volume: as Year increases, Volume tends to increase as well.
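
We can check the Volume trend numerically; for example, the correlation between Year and Volume:

cor(weekly$Year, weekly$Volume) # strongly positive, matching the trend in the pairs plot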

  (b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
weekly_predictors <- weekly[-c(1,8)] # drop Year (col 1) and Today (col 8), keeping the five lags plus Volume

# recode the response: Up is 1, Down is 0
weekly_predictors$Direction <- ifelse(weekly_predictors$Direction == 'Up', 1, 0)

logistic_b <- glm(Direction ~ ., data=weekly_predictors, family = 'binomial')
summary(logistic_b)
## 
## Call:
## glm(formula = Direction ~ ., family = "binomial", data = weekly_predictors)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Using logistic regression, the only predictor significant at the 0.05 level is Lag2.

  (c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
# confusion matrix
predicted_2c <- predict(logistic_b, weekly, type='response') # fitted probabilities of 'Up'
predicted_2c <- ifelse(predicted_2c >= 0.5, 1, 0) # threshold at 0.5

confusionMatrix(as.factor(predicted_2c), as.factor(weekly_predictors$Direction), mode='everything',positive='1')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  54  48
##          1 430 557
##                                          
##                Accuracy : 0.5611         
##                  95% CI : (0.531, 0.5908)
##     No Information Rate : 0.5556         
##     P-Value [Acc > NIR] : 0.369          
##                                          
##                   Kappa : 0.035          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.9207         
##             Specificity : 0.1116         
##          Pos Pred Value : 0.5643         
##          Neg Pred Value : 0.5294         
##               Precision : 0.5643         
##                  Recall : 0.9207         
##                      F1 : 0.6997         
##              Prevalence : 0.5556         
##          Detection Rate : 0.5115         
##    Detection Prevalence : 0.9063         
##       Balanced Accuracy : 0.5161         
##                                          
##        'Positive' Class : 1              
## 

There is a sensitivity of 92% and a specificity of 11%. This means it detects true positives (a direction of up) 92% of the time and true negatives (a direction of down) 11% of the time. The confusion matrix shows a positive predictive value of 56% and a negative predictive value of 52%. The F1 score is 0.70, and the overall accuracy is 0.56.

From this we can tell that our model is much better at predicting when the market is up than when it is down. However, this is slightly deceptive: the PPV is a meager 56%, meaning only 56% of its 'Up' predictions are correct. Going back to the confusion matrix, this is because roughly 90% of its predictions are Up, while the market is actually up in only about 56% of weeks. In short, the model predicts Up the vast majority of the time, which gives it a high sensitivity but mediocre scores on the other metrics.
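
The overall fraction of correct predictions can also be computed directly:

# fraction correct = (correct Down + correct Up) / total weeks
(54 + 557) / 1089 # about 0.561, matching the reported accuracy
# equivalently, compare predictions to the truth:
mean(predicted_2c == weekly_predictors$Direction)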

  (d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
weekly_before_2008 = weekly[weekly$Year <= 2008,] # training period: 1990-2008
weekly_after_2008 = weekly[weekly$Year > 2008,]   # held-out period: 2009-2010

logistic_d <- glm(Direction ~ Lag2, data=weekly_before_2008, family = 'binomial')

predicted_d <- predict(logistic_d, newdata=weekly_after_2008,type='response')
predicted_d <- ifelse(predicted_d >= 0.5, 'Up', 'Down')

# compute confusion matrix
cm.lr <- confusionMatrix(as.factor(predicted_d), as.factor(weekly_after_2008$Direction), mode='everything',positive='Up')

print(cm.lr)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9  5
##       Up     34 56
##                                          
##                Accuracy : 0.625          
##                  95% CI : (0.5247, 0.718)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.2439         
##                                          
##                   Kappa : 0.1414         
##                                          
##  Mcnemar's Test P-Value : 7.34e-06       
##                                          
##             Sensitivity : 0.9180         
##             Specificity : 0.2093         
##          Pos Pred Value : 0.6222         
##          Neg Pred Value : 0.6429         
##               Precision : 0.6222         
##                  Recall : 0.9180         
##                      F1 : 0.7417         
##              Prevalence : 0.5865         
##          Detection Rate : 0.5385         
##    Detection Prevalence : 0.8654         
##       Balanced Accuracy : 0.5637         
##                                          
##        'Positive' Class : Up             
## 

After training on data from 1990 through 2008 and then testing on 2009 and 2010, we calculate an accuracy (the fraction of correct predictions) of 62.5%, an improvement over what we found in part (c).

  (e) Repeat (d) using LDA.

lda_fit <- lda(Direction ~ Lag2, data=weekly_before_2008)


predicted_lda <- predict(lda_fit, newdata=weekly_after_2008)

# compute confusion matrix
cm.lda <- confusionMatrix(as.factor(predicted_lda$class), as.factor(weekly_after_2008$Direction), mode='everything',positive='Up')

print(cm.lda)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9  5
##       Up     34 56
##                                          
##                Accuracy : 0.625          
##                  95% CI : (0.5247, 0.718)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.2439         
##                                          
##                   Kappa : 0.1414         
##                                          
##  Mcnemar's Test P-Value : 7.34e-06       
##                                          
##             Sensitivity : 0.9180         
##             Specificity : 0.2093         
##          Pos Pred Value : 0.6222         
##          Neg Pred Value : 0.6429         
##               Precision : 0.6222         
##                  Recall : 0.9180         
##                      F1 : 0.7417         
##              Prevalence : 0.5865         
##          Detection Rate : 0.5385         
##    Detection Prevalence : 0.8654         
##       Balanced Accuracy : 0.5637         
##                                          
##        'Positive' Class : Up             
## 

Using LDA to predict for 2009 and 2010 gives us an accuracy of 62.5%. This performance is identical to logistic regression using Lag2.

  (f) Repeat (d) using QDA.
qda_fit <- qda(Direction ~ Lag2, data=weekly_before_2008)

predicted_qda <- predict(qda_fit, newdata=weekly_after_2008)

# compute confusion matrix
cm.qda <- confusionMatrix(as.factor(predicted_qda$class), as.factor(weekly_after_2008$Direction), mode='everything',positive='Up')

print(cm.qda)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    0  0
##       Up     43 61
##                                           
##                Accuracy : 0.5865          
##                  95% CI : (0.4858, 0.6823)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.5419          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 1.504e-10       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.5865          
##          Neg Pred Value :    NaN          
##               Precision : 0.5865          
##                  Recall : 1.0000          
##                      F1 : 0.7394          
##              Prevalence : 0.5865          
##          Detection Rate : 0.5865          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Up              
## 

Using QDA we get an accuracy of 58.65%. Note that the sensitivity of 100% is misleading as to the performance of the model: it predicts 'Up' for every observation, so it trivially catches every true 'Up' while never identifying a 'Down'.
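
We can confirm this degenerate behavior by inspecting the QDA posterior probabilities, which should all favor 'Up':

summary(predicted_qda$posterior[, 'Up']) # every value is at least 0.5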

  (g) Repeat (d) using KNN with K = 1.
# column 3 of Weekly is Lag2; column 9 is Direction
knn_predicted <- knn(train=weekly_before_2008[3], test=weekly_after_2008[3], cl = weekly_before_2008[,9], k=1)

# compute confusion matrix
cm.knn <- confusionMatrix(as.factor(knn_predicted), as.factor(weekly_after_2008[,9]), mode='everything',positive='Up')
print(cm.knn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down   21 29
##       Up     22 32
##                                          
##                Accuracy : 0.5096         
##                  95% CI : (0.4097, 0.609)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.9540         
##                                          
##                   Kappa : 0.0127         
##                                          
##  Mcnemar's Test P-Value : 0.4008         
##                                          
##             Sensitivity : 0.5246         
##             Specificity : 0.4884         
##          Pos Pred Value : 0.5926         
##          Neg Pred Value : 0.4200         
##               Precision : 0.5926         
##                  Recall : 0.5246         
##                      F1 : 0.5565         
##              Prevalence : 0.5865         
##          Detection Rate : 0.3077         
##    Detection Prevalence : 0.5192         
##       Balanced Accuracy : 0.5065         
##                                          
##        'Positive' Class : Up             
## 

Using KNN with k=1 we find an accuracy of 51%, making it the worst-performing model so far.
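
One caveat: knn() breaks ties at random, so repeated runs can differ slightly when distance or vote ties occur; fixing a seed makes the result reproducible. A usage sketch:

set.seed(1) # any fixed seed
knn_predicted <- knn(train=weekly_before_2008[3], test=weekly_after_2008[3],
                     cl = weekly_before_2008[,9], k=1)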

  (h) Repeat (d) using naive Bayes.
nb_fit <- naiveBayes(Direction ~ Lag2, data=weekly_before_2008)

pred_nb <- predict(nb_fit, newdata = weekly_after_2008, type = "raw")
pred_nb <- ifelse(pred_nb[,2] >= 0.5, 'Up', 'Down') # column 2 holds the posterior probability of 'Up'

cm.nb <- confusionMatrix(as.factor(pred_nb), as.factor(weekly_after_2008$Direction),mode='everything',positive='Up')
## Warning in confusionMatrix.default(as.factor(pred_nb),
## as.factor(weekly_after_2008$Direction), : Levels are not in the same order for
## reference and data. Refactoring data to match.
print(cm.nb)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    0  0
##       Up     43 61
##                                           
##                Accuracy : 0.5865          
##                  95% CI : (0.4858, 0.6823)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.5419          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 1.504e-10       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.5865          
##          Neg Pred Value :    NaN          
##               Precision : 0.5865          
##                  Recall : 1.0000          
##                      F1 : 0.7394          
##              Prevalence : 0.5865          
##          Detection Rate : 0.5865          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Up              
## 

We see an accuracy of 58.65% using naive Bayes, the same as QDA. We also see the same issue as with QDA: the model attains a perfect sensitivity of 100% by predicting 'Up' for everything.

  (i) Which of these methods appears to provide the best results on this data?

The best-performing methods using Lag2 were LDA and logistic regression, which produced exactly the same results, with an overall accuracy of 62.5%.

  (j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

We will experiment with different predictor subsets for each model: Lag1 through Lag3, all five lag variables, and all five lags plus Volume. We will not fit further logistic regression subsets, since Lag2 was the only term shown to be significant; its Lag2-only fit from part (d) is repeated at the end for comparison. The results were generated with a loop along the lines of the sketch below.
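
A condensed sketch (the subsets list is our own naming; qda() slots in for lda(), and naiveBayes() as well, noting its predict() returns the class factor directly):

subsets <- list(
  'lag 1, 2, and 3'    = Direction ~ Lag1 + Lag2 + Lag3,
  'all lag'            = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5,
  'all lag and volume' = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume
)
for (nm in names(subsets)) {
  fit <- lda(subsets[[nm]], data = weekly_before_2008)
  pred <- predict(fit, newdata = weekly_after_2008)$class
  cm <- confusionMatrix(as.factor(pred), as.factor(weekly_after_2008$Direction),
                        mode = 'everything', positive = 'Up')
  print(paste0('LDA with ', nm, ':'))
  print(cm$table); print(cm$overall['Accuracy']); print(cm$byClass)
}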

## [1] LDA with lag 1, 2, and 3:
##           Reference
## Prediction Down Up
##       Down    8  9
##       Up     35 52
##  Accuracy 
## 0.5769231
##          Sensitivity          Specificity       Pos Pred Value 
##            0.8524590            0.1860465            0.5977011 
##       Neg Pred Value            Precision               Recall 
##            0.4705882            0.5977011            0.8524590 
##                   F1           Prevalence       Detection Rate 
##            0.7027027            0.5865385            0.5000000 
## Detection Prevalence    Balanced Accuracy 
##            0.8365385            0.5192528
## [1] LDA with all lag:
##           Reference
## Prediction Down Up
##       Down    9 13
##       Up     34 48
##  Accuracy 
## 0.5480769
##          Sensitivity          Specificity       Pos Pred Value 
##            0.7868852            0.2093023            0.5853659 
##       Neg Pred Value            Precision               Recall 
##            0.4090909            0.5853659            0.7868852 
##                   F1           Prevalence       Detection Rate 
##            0.6713287            0.5865385            0.4615385 
## Detection Prevalence    Balanced Accuracy 
##            0.7884615            0.4980938
## [1] LDA with all lag and volume:
##           Reference
## Prediction Down Up
##       Down   31 44
##       Up     12 17
##  Accuracy 
## 0.4615385
##          Sensitivity          Specificity       Pos Pred Value 
##            0.2786885            0.7209302            0.5862069 
##       Neg Pred Value            Precision               Recall 
##            0.4133333            0.5862069            0.2786885 
##                   F1           Prevalence       Detection Rate 
##            0.3777778            0.5865385            0.1634615 
## Detection Prevalence    Balanced Accuracy 
##            0.2788462            0.4998094

We see that the best LDA performance comes from using only Lag2, with an accuracy of 62.5%, higher than any of the other LDA models shown. Performance worsens as more variables are added; the one exception is specificity, which rises to 72% with all lags and Volume compared to 21% with Lag2 alone.

## [1] QDA with lag 1, 2, and 3:
##           Reference
## Prediction Down Up
##       Down    6 10
##       Up     37 51
##  Accuracy 
## 0.5480769
##          Sensitivity          Specificity       Pos Pred Value 
##            0.8360656            0.1395349            0.5795455 
##       Neg Pred Value            Precision               Recall 
##            0.3750000            0.5795455            0.8360656 
##                   F1           Prevalence       Detection Rate 
##            0.6845638            0.5865385            0.4903846 
## Detection Prevalence    Balanced Accuracy 
##            0.8461538            0.4878002
## [1] QDA with all lag:
##           Reference
## Prediction Down Up
##       Down   10 23
##       Up     33 38
##  Accuracy 
## 0.4615385
##          Sensitivity          Specificity       Pos Pred Value 
##            0.6229508            0.2325581            0.5352113 
##       Neg Pred Value            Precision               Recall 
##            0.3030303            0.5352113            0.6229508 
##                   F1           Prevalence       Detection Rate 
##            0.5757576            0.5865385            0.3653846 
## Detection Prevalence    Balanced Accuracy 
##            0.6826923            0.4277545
## [1] QDA with all lag and volume:
##           Reference
## Prediction Down Up
##       Down   33 49
##       Up     10 12
##  Accuracy 
## 0.4326923
##          Sensitivity          Specificity       Pos Pred Value 
##            0.1967213            0.7674419            0.5454545 
##       Neg Pred Value            Precision               Recall 
##            0.4024390            0.5454545            0.1967213 
##                   F1           Prevalence       Detection Rate 
##            0.2891566            0.5865385            0.1153846 
## Detection Prevalence    Balanced Accuracy 
##            0.2115385            0.4820816

Similarly, for QDA the overall accuracy decreases as terms are added, starting at 59% (Lag2 only) and falling to a low of 43%. The sensitivity of 100% also drops, to a measly 20%, while the specificity rises from 0% to 76%.

Next we perform KNN on the same subsets for k values of 1, 3, 5, 10, 15, 20, 25, and 30. To keep things readable we compare only the overall accuracy of each; the loop follows the sketch below.
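
A sketch for one subset (in Weekly, Lag1 through Lag3 are columns 2 to 4 and Direction is column 9; the other subsets just change the column range):

k_values <- c(1, 3, 5, 10, 15, 20, 25, 30)
for (k in k_values) {
  pred <- knn(train = weekly_before_2008[, 2:4],
              test  = weekly_after_2008[, 2:4],
              cl    = weekly_before_2008[, 9], k = k)
  cm <- confusionMatrix(pred, as.factor(weekly_after_2008$Direction), positive = 'Up')
  print(paste0('For k=', k))
  print(cm$overall['Accuracy'])
}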

## [1] KNN with lag2:
## [1] For k=1
## Accuracy 
##      0.5 
## [1] For k=3
##  Accuracy 
## 0.5384615 
## [1] For k=5
##  Accuracy 
## 0.5384615 
## [1] For k=10
##  Accuracy 
## 0.5865385 
## [1] For k=15
##  Accuracy 
## 0.5865385 
## [1] For k=20
##  Accuracy 
## 0.5576923 
## [1] For k=25
##  Accuracy 
## 0.5480769 
## [1] For k=30
##  Accuracy 
## 0.5288462
## [1] KNN with lag 1,2,3:
## [1] For k=1
##  Accuracy 
## 0.4903846 
## [1] For k=3
##  Accuracy 
## 0.5096154 
## [1] For k=5
##  Accuracy 
## 0.5288462 
## [1] For k=10
##  Accuracy 
## 0.5769231 
## [1] For k=15
##  Accuracy 
## 0.5865385 
## [1] For k=20
##  Accuracy 
## 0.6153846 
## [1] For k=25
##  Accuracy 
## 0.6153846 
## [1] For k=30
##  Accuracy 
## 0.6153846
## [1] KNN with all lag:
## [1] For k=1
##  Accuracy 
## 0.5192308 
## [1] For k=3
##  Accuracy 
## 0.5576923 
## [1] For k=5
##  Accuracy 
## 0.5480769 
## [1] For k=10
##  Accuracy 
## 0.6153846 
## [1] For k=15
##  Accuracy 
## 0.5384615 
## [1] For k=20
##  Accuracy 
## 0.5673077 
## [1] For k=25
## Accuracy 
##      0.5 
## [1] For k=30
##  Accuracy 
## 0.5096154
## [1] KNN with all lag, volume:
## [1] For k=1
##  Accuracy 
## 0.4807692 
## [1] For k=3
##  Accuracy 
## 0.5096154 
## [1] For k=5
## Accuracy 
##      0.5 
## [1] For k=10
##  Accuracy 
## 0.5192308 
## [1] For k=15
##  Accuracy 
## 0.5576923 
## [1] For k=20
##  Accuracy 
## 0.5192308 
## [1] For k=25
##  Accuracy 
## 0.5480769 
## [1] For k=30
##  Accuracy 
## 0.5480769
## [1] KNN with lag 1, 2, and 3, and k=25:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down   22 19
##       Up     21 42
##                                           
##                Accuracy : 0.6154          
##                  95% CI : (0.5149, 0.7091)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.3110          
##                                           
##                   Kappa : 0.2015          
##                                           
##  Mcnemar's Test P-Value : 0.8744          
##                                           
##             Sensitivity : 0.6885          
##             Specificity : 0.5116          
##          Pos Pred Value : 0.6667          
##          Neg Pred Value : 0.5366          
##               Precision : 0.6667          
##                  Recall : 0.6885          
##                      F1 : 0.6774          
##              Prevalence : 0.5865          
##          Detection Rate : 0.4038          
##    Detection Prevalence : 0.6058          
##       Balanced Accuracy : 0.6001          
##                                           
##        'Positive' Class : Up              
## 

k=10 seems to have the best performance overall, reaching a peak accuracy of about 61.5% when using all five lag variables as predictors. Similar performance is reached by k=20, 25, and 30 with Lag1 through Lag3. Overall, adding predictors appears to help up to the point where all lag variables are included, after which performance starts to decline; the worst results come from all lags plus Volume. KNN is the only method that appears to improve with additional variables, but even KNN suffers a drop in performance when every predictor is used.

## [1] NB with lag 1,2,3:
##           Reference
## Prediction Down Up
##       Down    5 10
##       Up     38 51
##  Accuracy 
## 0.5384615
##          Sensitivity          Specificity       Pos Pred Value 
##            0.8360656            0.1162791            0.5730337 
##       Neg Pred Value            Precision               Recall 
##            0.3333333            0.5730337            0.8360656 
##                   F1           Prevalence       Detection Rate 
##            0.6800000            0.5865385            0.4903846 
## Detection Prevalence    Balanced Accuracy 
##            0.8557692            0.4761723
## [1] NB with all lag:
##           Reference
## Prediction Down Up
##       Down   10 21
##       Up     33 40
##  Accuracy 
## 0.4807692
##          Sensitivity          Specificity       Pos Pred Value 
##            0.6557377            0.2325581            0.5479452 
##       Neg Pred Value            Precision               Recall 
##            0.3225806            0.5479452            0.6557377 
##                   F1           Prevalence       Detection Rate 
##            0.5970149            0.5865385            0.3846154 
## Detection Prevalence    Balanced Accuracy 
##            0.7019231            0.4441479
## [1] NB with all lag, volume:
##           Reference
## Prediction Down Up
##       Down   42 56
##       Up      1  5
##  Accuracy 
## 0.4519231
##          Sensitivity          Specificity       Pos Pred Value 
##           0.08196721           0.97674419           0.83333333 
##       Neg Pred Value            Precision               Recall 
##           0.42857143           0.83333333           0.08196721 
##                   F1           Prevalence       Detection Rate 
##           0.14925373           0.58653846           0.04807692 
## Detection Prevalence    Balanced Accuracy 
##           0.05769231           0.52935570

For NB we see a similar pattern to what occurred with LDA and QDA. The accuracy steadily decreased from 59% with Lag2 alone down to 45% with all lags and Volume. As with QDA, the initial sensitivity of 100% fell to 8% as the model stopped predicting purely positive responses, with a corresponding increase in specificity from 0% to 98%.

## [1] Logistic regression:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9  5
##       Up     34 56
##                                          
##                Accuracy : 0.625          
##                  95% CI : (0.5247, 0.718)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.2439         
##                                          
##                   Kappa : 0.1414         
##                                          
##  Mcnemar's Test P-Value : 7.34e-06       
##                                          
##             Sensitivity : 0.9180         
##             Specificity : 0.2093         
##          Pos Pred Value : 0.6222         
##          Neg Pred Value : 0.6429         
##               Precision : 0.6222         
##                  Recall : 0.9180         
##                      F1 : 0.7417         
##              Prevalence : 0.5865         
##          Detection Rate : 0.5385         
##    Detection Prevalence : 0.8654         
##       Balanced Accuracy : 0.5637         
##                                          
##        'Positive' Class : Up             
## 
## [1] LDA:
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9  5
##       Up     34 56
##                                          
##                Accuracy : 0.625          
##                  95% CI : (0.5247, 0.718)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.2439         
##                                          
##                   Kappa : 0.1414         
##                                          
##  Mcnemar's Test P-Value : 7.34e-06       
##                                          
##             Sensitivity : 0.9180         
##             Specificity : 0.2093         
##          Pos Pred Value : 0.6222         
##          Neg Pred Value : 0.6429         
##               Precision : 0.6222         
##                  Recall : 0.9180         
##                      F1 : 0.7417         
##              Prevalence : 0.5865         
##          Detection Rate : 0.5385         
##    Detection Prevalence : 0.8654         
##       Balanced Accuracy : 0.5637         
##                                          
##        'Positive' Class : Up             
## 

After this further analysis, the best performance came from logistic regression and LDA with Lag2 as the only predictor, both achieving an accuracy of 62.5%. Adding extra predictors only decreased model performance.

KNN with k=10 and all lag predictors came closest, as it did see some improvement when extra variables were added, but it only reached an accuracy of about 61.5%.

Problem 4.14

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

auto <- Auto
  (a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
auto$mpg01 <- ifelse(auto$mpg > median(auto$mpg), 1, 0) # strictly above the median is 1, matching the problem statement
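
As a quick sanity check, the two classes should be roughly balanced:

table(auto$mpg01) # roughly a 50/50 split around the median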
  (b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
par(mfrow = c(2,2))
# plot mpg01 against every column except name (col 9) and mpg01 itself (col 10);
# invisible() suppresses the uninformative printed return value of sapply()
invisible(sapply(names(auto)[-(9:10)], function(col){
  x <- auto[[col]]
  plot(auto$mpg01, x, main = paste("mpg01 vs", col), xlab = "mpg01", ylab = col)
  boxplot(x ~ auto$mpg01, main = paste("Boxplot of", col, "by mpg01"), xlab = "mpg01", ylab = col)
}))

Excluding mpg itself, the variables most likely to be associated with mpg01 appear to be cylinders, displacement, horsepower, weight, and origin.

  (c) Split the data into a training set and a test set.
set.seed(123)
trainIndex <- createDataPartition(auto$mpg01, p = 0.8, list = F) # 80% training and 20% test

train_auto <- auto[trainIndex,]
test_auto <- auto[-trainIndex,]
  (d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
lda <- lda(mpg01 ~ cylinders + displacement + horsepower + weight + origin, data = train_auto)

lda_pred <- predict(lda, test_auto)$class

# note: caret's confusionMatrix() expects (predictions, reference); the arguments
# are reversed here and below, which transposes the table and the class-wise statistics
cm <- confusionMatrix(as.factor(test_auto$mpg01), lda_pred)
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 31  8
##          1  1 38
##                                           
##                Accuracy : 0.8846          
##                  95% CI : (0.7922, 0.9459)
##     No Information Rate : 0.5897          
##     P-Value [Acc > NIR] : 1.097e-08       
##                                           
##                   Kappa : 0.7692          
##                                           
##  Mcnemar's Test P-Value : 0.0455          
##                                           
##             Sensitivity : 0.9688          
##             Specificity : 0.8261          
##          Pos Pred Value : 0.7949          
##          Neg Pred Value : 0.9744          
##              Prevalence : 0.4103          
##          Detection Rate : 0.3974          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.8974          
##                                           
##        'Positive' Class : 0               
## 

Using LDA, the test accuracy is 0.88 (a test error of about 0.12), with a reported sensitivity of 0.97 and specificity of 0.83.

  (e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
qda <- qda(mpg01 ~ cylinders + displacement + horsepower + weight + origin, data = train_auto)

qda_pred <- predict(qda, test_auto)$class

cm <- confusionMatrix(as.factor(test_auto$mpg01), qda_pred)
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 33  6
##          1  2 37
##                                           
##                Accuracy : 0.8974          
##                  95% CI : (0.8079, 0.9547)
##     No Information Rate : 0.5513          
##     P-Value [Acc > NIR] : 3.51e-11        
##                                           
##                   Kappa : 0.7949          
##                                           
##  Mcnemar's Test P-Value : 0.2888          
##                                           
##             Sensitivity : 0.9429          
##             Specificity : 0.8605          
##          Pos Pred Value : 0.8462          
##          Neg Pred Value : 0.9487          
##              Prevalence : 0.4487          
##          Detection Rate : 0.4231          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.9017          
##                                           
##        'Positive' Class : 0               
## 

QDA gives us an accuracy of 0.90 (a test error of about 0.10), with a sensitivity of 0.94 and specificity of 0.86. Compared to LDA, a little sensitivity is sacrificed for slightly higher accuracy and specificity.

  (f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
lr <- glm(mpg01 ~ cylinders + displacement + horsepower + weight + origin, data = train_auto, family = 'binomial')

lr_pred <- predict(lr, test_auto, type='response') # predicted probabilities, not the default link scale
lr_pred <- ifelse(lr_pred >= 0.5, 1, 0)

cm <- confusionMatrix(as.factor(test_auto$mpg01), as.factor(lr_pred))
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 34  5
##          1  2 37
##                                           
##                Accuracy : 0.9103          
##                  95% CI : (0.8238, 0.9632)
##     No Information Rate : 0.5385          
##     P-Value [Acc > NIR] : 1.083e-12       
##                                           
##                   Kappa : 0.8205          
##                                           
##  Mcnemar's Test P-Value : 0.4497          
##                                           
##             Sensitivity : 0.9444          
##             Specificity : 0.8810          
##          Pos Pred Value : 0.8718          
##          Neg Pred Value : 0.9487          
##              Prevalence : 0.4615          
##          Detection Rate : 0.4359          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.9127          
##                                           
##        'Positive' Class : 0               
## 

Logistic regression gives us an accuracy of 0.91 (a test error of about 0.09), with a sensitivity of 0.94 and specificity of 0.88. This is the highest accuracy and specificity so far, with the same sensitivity as QDA.

  (g) Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
nb <- naiveBayes(mpg01 ~ cylinders + displacement + horsepower + weight + origin, data = train_auto)

nb_pred <- predict(nb, test_auto)


cm <- confusionMatrix(as.factor(test_auto$mpg01), as.factor(nb_pred))
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 31  8
##          1  1 38
##                                           
##                Accuracy : 0.8846          
##                  95% CI : (0.7922, 0.9459)
##     No Information Rate : 0.5897          
##     P-Value [Acc > NIR] : 1.097e-08       
##                                           
##                   Kappa : 0.7692          
##                                           
##  Mcnemar's Test P-Value : 0.0455          
##                                           
##             Sensitivity : 0.9688          
##             Specificity : 0.8261          
##          Pos Pred Value : 0.7949          
##          Neg Pred Value : 0.9744          
##              Prevalence : 0.4103          
##          Detection Rate : 0.3974          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.8974          
##                                           
##        'Positive' Class : 0               
## 

Naive Bayes gives the exact same results as LDA, with an accuracy of 0.88, sensitivity of 0.97, and specificity of 0.83.

  (h) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

We will use the following variables for this problem:

cylinders + displacement + horsepower + weight + origin

set.seed(123)
# select the predictor columns: cylinders, displacement, horsepower, weight (cols 2:5) and origin (col 8)
knn_train <- train_auto[,c(2:5, 8)]
knn_test <- test_auto[,c(2:5, 8)]
knn_train_cl <- train_auto$mpg01 # correct classifications for training
knn_test_cl <- test_auto$mpg01 # for testing

# from k = 1 to k = 20
k_list <- 1:20
knn_results <- lapply(k_list, function(k){
  knn_pred <- knn(knn_train, knn_test, knn_train_cl, k = k)
  knn_cm <- confusionMatrix(as.factor(knn_pred), as.factor(knn_test_cl))
  return(list(k = k, cm = knn_cm))
})
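
To compare values of K, we can extract the overall accuracy from each stored confusion matrix (a small summary sketch):

accuracies <- sapply(knn_results, function(r) r$cm$overall['Accuracy'])
names(accuracies) <- paste0('k=', k_list)
round(accuracies, 4)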

The highest accuracies, around 0.91, occur at k=8 and k=12. Both give very similar sensitivity (around 0.85) and specificity (around 0.98). With these two k values the performance is close to that of logistic regression, trading lower sensitivity for higher specificity.

Problem 4.16

Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.

Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.

set.seed(123)
# creating the response variable
med.crim <- median(Boston$crim)

# 0 for crim at or below the median, 1 for above (a tract exactly at the median counts as low crime)
is.high.crime <- ifelse(Boston$crim > med.crim, 1, 0)

boston.crime <- Boston[-1] # remove crim from dataset
boston.crime$high.crime <- is.high.crime
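
A quick check that the response is roughly balanced:

table(boston.crime$high.crime) # expect approximately a 50/50 split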

After creating the response variable, we next decide which subsets to use. First, we use stepAIC() from the MASS package with stepwise selection in both directions to find the optimal logistic regression model.

# logistic regression
full.model <- glm(high.crime~. , data=boston.crime, family='binomial')
null.model <- glm(high.crime~1, data=boston.crime, family='binomial')

model.selection <- stepAIC(null.model, direction = 'both', scope=list(lower=null.model, upper=full.model), trace = F)
summary(model.selection)
## 
## Call:
## glm(formula = high.crime ~ nox + rad + tax + zn + black + dis + 
##     ptratio + medv + age, family = "binomial", data = boston.crime)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -31.441272   6.048989  -5.198 2.02e-07 ***
## nox          43.195824   6.452812   6.694 2.17e-11 ***
## rad           0.718773   0.142066   5.059 4.21e-07 ***
## tax          -0.007676   0.002503  -3.066  0.00217 ** 
## zn           -0.082567   0.031424  -2.628  0.00860 ** 
## black        -0.012866   0.006334  -2.031  0.04224 *  
## dis           0.634380   0.207634   3.055  0.00225 ** 
## ptratio       0.303502   0.109255   2.778  0.00547 ** 
## medv          0.112882   0.034362   3.285  0.00102 ** 
## age           0.022851   0.009894   2.310  0.02091 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 701.46  on 505  degrees of freedom
## Residual deviance: 216.22  on 496  degrees of freedom
## AIC: 236.22
## 
## Number of Fisher Scoring iterations: 9

As we see, the optimal logistic regression model uses the following predictors: nox + rad + tax + zn + black + dis + ptratio + medv + age.

For the different subsets used to predict whether a census tract has a crime rate above the median, we will add these predictors in that order: first nox, then nox + rad, then nox + rad + tax, and so on. We will do this for LR, LDA, NB, and KNN (with k values of 1, 3, 5, 10, 15, 20, 25, and 30). The metrics we will use to compare are accuracy, sensitivity, specificity, and F1 score, which we compare visually at the end.

# creating test and training dataset
tr.size = floor(0.7*nrow(boston.crime)) # 70% training, 30% test
tr.ind = sample(seq_len(nrow(boston.crime)), tr.size)
boston.tr = boston.crime[tr.ind,]
# table(boston.tr$high.crime)
boston.te = boston.crime[-tr.ind,] # everything not sampled for training
# table(boston.te$high.crime)

Now that we have the desired subsets and have split the data into the testing and training sets, we can proceed with fitting the models.
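
The comparisons below come from loops over the nested predictor subsets; a condensed sketch for logistic regression (the preds vector and loop structure are our own, and lda(), naiveBayes(), or knn() slot in analogously):

preds <- c('nox', 'rad', 'tax', 'zn', 'black', 'dis', 'ptratio', 'medv', 'age')
for (i in seq_along(preds)) {
  f <- reformulate(preds[1:i], response = 'high.crime') # e.g. high.crime ~ nox + rad
  fit <- glm(f, data = boston.tr, family = 'binomial')
  prob <- predict(fit, newdata = boston.te, type = 'response')
  pred <- ifelse(prob >= 0.5, 1, 0)
  cm <- confusionMatrix(factor(pred, levels = c(0, 1)),
                        factor(boston.te$high.crime, levels = c(0, 1)), positive = '1')
  print(c(n.predictors = i, accuracy = unname(cm$overall['Accuracy'])))
}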

In terms of sensitivity, the KNN models performed significantly better than the other models; the closest non-KNN competitor was logistic regression, which still trailed most of the KNN models. Of the KNN models, the best performer was k=1, across all numbers of predictors, with a maximum sensitivity of around 93%.

For specificity, KNN again has the best overall performance, particularly at k=1 with 2 to 4 predictors, where the estimated specificity is about 98%. The k=3 model comes close at 2, 4, and 8 predictors.

For accuracy, the KNN models again perform best, with k=1 achieving 96% at 2 to 4 predictors. LR also performs well and approaches k=1 at 8 predictors with an accuracy just below 93%, but is ultimately outperformed.

The F1 score paints a very similar picture to accuracy: KNN with k=1 performs best at 4 predictors with an F1 of around 97%, and LR gets close at 8 predictors.

Overall, the best performing model was KNN with k=1 and 4 predictors (nox + rad + tax + zn), as it outperformed every other model on all metrics. There is a concern about overfitting, however: k and the predictor subset were chosen using the held-out test data, with no separate validation set.

While LR was outperformed by KNN with k=1, the optimal 8-predictor model still performed relatively well on all metrics, around 5 to 7 percentage points behind the best k=1 model. Given the likelihood that the KNN model overfit, LR is a solid contender for the best overall model.