Question 10: This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1, 089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

Load the data set and packages

library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.3
library(MASS)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.3
## corrplot 0.84 loaded
library(ggplot2)
library(caret)
## Warning: package 'caret' was built under R version 4.0.3
## Loading required package: lattice
library(class)
data(Weekly)
  1. Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
head(Weekly)
##   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today Direction
## 1 1990  0.816  1.572 -3.936 -0.229 -3.484 0.1549760 -0.270      Down
## 2 1990 -0.270  0.816  1.572 -3.936 -0.229 0.1485740 -2.576      Down
## 3 1990 -2.576 -0.270  0.816  1.572 -3.936 0.1598375  3.514        Up
## 4 1990  3.514 -2.576 -0.270  0.816  1.572 0.1616300  0.712        Up
## 5 1990  0.712  3.514 -2.576 -0.270  0.816 0.1537280  1.178        Up
## 6 1990  1.178  0.712  3.514 -2.576 -0.270 0.1544440 -1.372      Down
str(Weekly)
## 'data.frame':    1089 obs. of  9 variables:
##  $ Year     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
##  $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
##  $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
##  $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
##  $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
##  $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
##  $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
pairs(Weekly)

correlation<-cor(Weekly[-9])
correlation
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000
corrplot(correlation, type = 'upper')

There are weak correlations between the variables except Year and Volume.

  1. Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

With a .05 significance value, only the lag2 variable is significant. All other variables have p-value greater than .05 and are not significant

  1. Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
predicted <- factor(ifelse(predict(glm.fit, type = "response") < 0.5, "Down", "Up"))

confusionMatrix(predicted, Weekly$Direction, positive = "Up")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down  Up
##       Down   54  48
##       Up    430 557
##                                          
##                Accuracy : 0.5611         
##                  95% CI : (0.531, 0.5908)
##     No Information Rate : 0.5556         
##     P-Value [Acc > NIR] : 0.369          
##                                          
##                   Kappa : 0.035          
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.9207         
##             Specificity : 0.1116         
##          Pos Pred Value : 0.5643         
##          Neg Pred Value : 0.5294         
##              Prevalence : 0.5556         
##          Detection Rate : 0.5115         
##    Detection Prevalence : 0.9063         
##       Balanced Accuracy : 0.5161         
##                                          
##        'Positive' Class : Up             
## 

Based on the confusion matrix, Accuracy = 56.11%. The model has a Sensitivity of 92.07% and 11.16% of Specificity. From these percentages, we can see that the prediction is correct 92% of the time, when the market is going Up. However, when the market is going Down, the model is only correct 11.16% of the time. An ideal model would have a high rate of Sensitivity and Specificity.

  1. Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
train <- Weekly[Weekly$Year <= 2008, ]
test <- Weekly[Weekly$Year > 2008, ]

glm.fit2 <- glm(Direction ~ Lag2, 
               data = train, 
               family = "binomial")

predicted <- factor(ifelse(predict(glm.fit2, newdata = test, type = "response") < 0.5, "Down", "Up"))

confusionMatrix(predicted, test$Direction, positive = "Up")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9  5
##       Up     34 56
##                                          
##                Accuracy : 0.625          
##                  95% CI : (0.5247, 0.718)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.2439         
##                                          
##                   Kappa : 0.1414         
##                                          
##  Mcnemar's Test P-Value : 7.34e-06       
##                                          
##             Sensitivity : 0.9180         
##             Specificity : 0.2093         
##          Pos Pred Value : 0.6222         
##          Neg Pred Value : 0.6429         
##              Prevalence : 0.5865         
##          Detection Rate : 0.5385         
##    Detection Prevalence : 0.8654         
##       Balanced Accuracy : 0.5637         
##                                          
##        'Positive' Class : Up             
## 

Accuracy is not 62.5% when all variables except lag2 are eliminated Sensitivity = 91.08% which has decreased from last model Specificity = 20.93% which has increased from last model

  1. Repeat (d) using LDA.
lda.fit <- lda(Direction ~ Lag2, data = train)


predicted_lda <- predict(lda.fit, newdata = test)

confusionMatrix(data = predicted_lda$class,
                reference = test$Direction, 
                positive = "Up")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9  5
##       Up     34 56
##                                          
##                Accuracy : 0.625          
##                  95% CI : (0.5247, 0.718)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.2439         
##                                          
##                   Kappa : 0.1414         
##                                          
##  Mcnemar's Test P-Value : 7.34e-06       
##                                          
##             Sensitivity : 0.9180         
##             Specificity : 0.2093         
##          Pos Pred Value : 0.6222         
##          Neg Pred Value : 0.6429         
##              Prevalence : 0.5865         
##          Detection Rate : 0.5385         
##    Detection Prevalence : 0.8654         
##       Balanced Accuracy : 0.5637         
##                                          
##        'Positive' Class : Up             
## 

The Accuracy score is still at 62.5% using the LDA model. The Sensitivity score is still at 91.80%, and the Specificity score remained at 20.93%.

  1. Repeat (d) using QDA.
qda.fit <- qda(Direction ~ Lag2, data = train)


predicted_qda <- predict(qda.fit, newdata = test)

confusionMatrix(data = predicted_qda$class, 
                reference = test$Direction, 
                positive = "Up")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    0  0
##       Up     43 61
##                                           
##                Accuracy : 0.5865          
##                  95% CI : (0.4858, 0.6823)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.5419          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 1.504e-10       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.5865          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.5865          
##          Detection Rate : 0.5865          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Up              
## 

Accuracy has dropped to 58.65% The model is only predicting when market is Up, hence sensitivity 100% and specificity %

  1. Repeat (d) using KNN with K = 1.
train.X <- cbind(train$Lag2)
test.X <- cbind(test$Lag2)
train.Direction <- train$Direction

set.seed(1)

knn.fit<- knn(train.X, test.X, train.Direction, k = 1)
confusionMatrix(knn.fit, test$Direction, positive = "Up")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down   21 30
##       Up     22 31
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.4003, 0.5997)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.9700          
##                                           
##                   Kappa : -0.0033         
##                                           
##  Mcnemar's Test P-Value : 0.3317          
##                                           
##             Sensitivity : 0.5082          
##             Specificity : 0.4884          
##          Pos Pred Value : 0.5849          
##          Neg Pred Value : 0.4118          
##              Prevalence : 0.5865          
##          Detection Rate : 0.2981          
##    Detection Prevalence : 0.5096          
##       Balanced Accuracy : 0.4983          
##                                           
##        'Positive' Class : Up              
## 

The model accuracy for KNN is 50%. The Sensitivity is 50.82% and Specificity is 48.84%, which is almost even for both market directions.

  1. Which of these methods appears to provide the best results on this data?

Using the target metric as the Accuracy, LDA and Logistic Regression have the highest accuracy scores of 62.5%

  1. Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

KNN - selecting best K

knn.fit2 <- knn(train.X, test.X, train.Direction, k = 10)
confusionMatrix(knn.fit2, test$Direction, positive = "Up")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down   17 18
##       Up     26 43
##                                           
##                Accuracy : 0.5769          
##                  95% CI : (0.4761, 0.6732)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.6193          
##                                           
##                   Kappa : 0.1031          
##                                           
##  Mcnemar's Test P-Value : 0.2913          
##                                           
##             Sensitivity : 0.7049          
##             Specificity : 0.3953          
##          Pos Pred Value : 0.6232          
##          Neg Pred Value : 0.4857          
##              Prevalence : 0.5865          
##          Detection Rate : 0.4135          
##    Detection Prevalence : 0.6635          
##       Balanced Accuracy : 0.5501          
##                                           
##        'Positive' Class : Up              
## 

KNN using K = 10 Accuracy = 59.73% Sensitivity = 68.85% Specificity = 39.53% Accuracy increased compared to when using K = 1, but logistic and lda still have a better accuracy

knn.fit3 <- knn(train.X, test.X, train.Direction, k = 50)
confusionMatrix(knn.fit3, test$Direction, positive = "Up")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down   20 22
##       Up     23 39
##                                           
##                Accuracy : 0.5673          
##                  95% CI : (0.4665, 0.6641)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.6921          
##                                           
##                   Kappa : 0.1048          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.6393          
##             Specificity : 0.4651          
##          Pos Pred Value : 0.6290          
##          Neg Pred Value : 0.4762          
##              Prevalence : 0.5865          
##          Detection Rate : 0.3750          
##    Detection Prevalence : 0.5962          
##       Balanced Accuracy : 0.5522          
##                                           
##        'Positive' Class : Up              
## 

KNN using K = 50 Accuracy = 56.73% Sensitivity = 63.93% Specificity = 46.51% Accuracy increased compared to when using K = 1, but logistic and lda still have a better accuracy

knn.fit4 <- knn(train.X, test.X, train.Direction, k = 100)
confusionMatrix(knn.fit4, test$Direction, positive = "Up")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    9 12
##       Up     34 49
##                                         
##                Accuracy : 0.5577        
##                  95% CI : (0.457, 0.655)
##     No Information Rate : 0.5865        
##     P-Value [Acc > NIR] : 0.75792       
##                                         
##                   Kappa : 0.0136        
##                                         
##  Mcnemar's Test P-Value : 0.00196       
##                                         
##             Sensitivity : 0.8033        
##             Specificity : 0.2093        
##          Pos Pred Value : 0.5904        
##          Neg Pred Value : 0.4286        
##              Prevalence : 0.5865        
##          Detection Rate : 0.4712        
##    Detection Prevalence : 0.7981        
##       Balanced Accuracy : 0.5063        
##                                         
##        'Positive' Class : Up            
## 

KNN using K = 100 Accuracy = 55.77% Sensitivity = 78.69% Specificity = 23.26% Accuracy increased compared to when using K = 1, but logistic and lda still have a better accuracy

Logistic Regression using Lag1 and Lag2 and their Interaction.

glm.fit3<-glm(Direction ~ Lag2+Lag1 + Lag1*Lag2, data=train, family=binomial)
glm.fit3.probs<-predict(glm.fit3, newdata = test, type = 'response')
glm.fit3.preds<- rep('Down', length(glm.fit3.probs))
glm.fit3.preds[glm.fit3.probs>0.5]='Up'

confusionMatrix(as.factor(glm.fit3.preds), test$Direction, positive = 'Up')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    7  8
##       Up     36 53
##                                           
##                Accuracy : 0.5769          
##                  95% CI : (0.4761, 0.6732)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.6193          
##                                           
##                   Kappa : 0.035           
##                                           
##  Mcnemar's Test P-Value : 4.693e-05       
##                                           
##             Sensitivity : 0.8689          
##             Specificity : 0.1628          
##          Pos Pred Value : 0.5955          
##          Neg Pred Value : 0.4667          
##              Prevalence : 0.5865          
##          Detection Rate : 0.5096          
##    Detection Prevalence : 0.8558          
##       Balanced Accuracy : 0.5158          
##                                           
##        'Positive' Class : Up              
## 

This model with interaction has accuracy = 57.69% which is lower than accuracy score of model with just lag Sensitivity and specificity are also lower at 86.89% and 16.28%

LDA with Lag1 and Lag2 Variables

lda.fit2 <- lda(Direction ~ Lag1 + Lag2, data = train)


predicted_lda2 <- predict(lda.fit2, newdata = test)

confusionMatrix(data = predicted_lda2$class,
                reference = test$Direction, 
                positive = "Up")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down    7  8
##       Up     36 53
##                                           
##                Accuracy : 0.5769          
##                  95% CI : (0.4761, 0.6732)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 0.6193          
##                                           
##                   Kappa : 0.035           
##                                           
##  Mcnemar's Test P-Value : 4.693e-05       
##                                           
##             Sensitivity : 0.8689          
##             Specificity : 0.1628          
##          Pos Pred Value : 0.5955          
##          Neg Pred Value : 0.4667          
##              Prevalence : 0.5865          
##          Detection Rate : 0.5096          
##    Detection Prevalence : 0.8558          
##       Balanced Accuracy : 0.5158          
##                                           
##        'Positive' Class : Up              
## 

Including the Lag1 variable gives an Accuracy score of 57.69%, which is still lower than the original LDA model with an Accuracy score of 62.5%

QDA with Lag1 and Lag2 Variables and Their Interaction

qda.fit2 <- qda(Direction ~ Lag1 + Lag2 + Lag1*Lag2, data = train)


predicted_qda2 <- predict(qda.fit2, newdata = test)

confusionMatrix(data = predicted_qda2$class, 
                reference = test$Direction, 
                positive = "Up")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down Up
##       Down   23 36
##       Up     20 25
##                                          
##                Accuracy : 0.4615         
##                  95% CI : (0.3633, 0.562)
##     No Information Rate : 0.5865         
##     P-Value [Acc > NIR] : 0.99616        
##                                          
##                   Kappa : -0.0524        
##                                          
##  Mcnemar's Test P-Value : 0.04502        
##                                          
##             Sensitivity : 0.4098         
##             Specificity : 0.5349         
##          Pos Pred Value : 0.5556         
##          Neg Pred Value : 0.3898         
##              Prevalence : 0.5865         
##          Detection Rate : 0.2404         
##    Detection Prevalence : 0.4327         
##       Balanced Accuracy : 0.4724         
##                                          
##        'Positive' Class : Up             
## 

Including the Lag1 variable and the interaction between Lag1 and Lag2 gives an even lower Accuracy score of 46.15%, which is still lower than the original QDA model with an Accuracy score of 58.65%. The Sensitivity = 40.98% is lower, however, with the Specificity = 53.49%, the model is also predicting when the market direction is Down, unlike the first QDA model.

The original LDA and Logistic Regression models with the Accuracy score of 62.5% are the best models, with target metric as accuracy

QUESTION 11: In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

  1. Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
attach(Auto)
## The following object is masked from package:ggplot2:
## 
##     mpg
mpg01<-rep(0, length(mpg))
mpg01[mpg > median(mpg)] = 1
Auto <- data.frame(Auto,mpg01)
  1. Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
pairs(Auto[ ,-9])

M<-cor(Auto[,-9])
corrplot(M, type="upper")

boxplot(cylinders ~ mpg01, data = Auto, main = "Cylinders vs mpg01")

boxplot(displacement ~ mpg01, data = Auto, main = "Displacement vs mpg01")

boxplot(horsepower ~ mpg01, data = Auto, main = "Horsepower vs mpg01")

boxplot(weight ~ mpg01, data = Auto, main = "Weight vs mpg01")

boxplot(acceleration ~ mpg01, data = Auto, main = "Acceleration vs mpg01")

boxplot(year ~ mpg01, data = Auto, main = "Year vs mpg01")

Per the correlation plots, there is a positive correlation between displacement and horsepower, cylinder and horsepower, horsepower and weight, and displacement and weight, and cylinder and weight.

All four variables are inversely correlated with mpg 01.

cor(Auto[ ,-9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
## mpg01         0.8369392 -0.7591939   -0.7534766 -0.6670526 -0.7577566
##              acceleration       year     origin      mpg01
## mpg             0.4233285  0.5805410  0.5652088  0.8369392
## cylinders      -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement   -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower     -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight         -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration    1.0000000  0.2903161  0.2127458  0.3468215
## year            0.2903161  1.0000000  0.1815277  0.4299042
## origin          0.2127458  0.1815277  1.0000000  0.5136984
## mpg01           0.3468215  0.4299042  0.5136984  1.0000000

The results above support our initial observation about which variables are correlated, with cylinders, displacement, horsepower, and weight having a negative correlation with mpg01 of over 0.77.

  1. Split the data into a training set and a test set.

Train/Test Splot of 80/20

set.seed(1)
index <- sample(1:nrow(Auto), 0.8*nrow(Auto))
Auto.train<-Auto[index, ]
Auto.test<-Auto[-index, ]
  1. Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
set.seed(1)
autolda.fit <- lda(mpg01 ~ cylinders + displacement + weight + horsepower, data=Auto.train)
autolda.preds <- predict(autolda.fit, Auto.test)
autolda.class <- autolda.preds$class

confusionMatrix(autolda.class, as.factor(Auto.test$mpg01), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 35  0
##          1  7 37
##                                           
##                Accuracy : 0.9114          
##                  95% CI : (0.8259, 0.9636)
##     No Information Rate : 0.5316          
##     P-Value [Acc > NIR] : 2.819e-13       
##                                           
##                   Kappa : 0.8241          
##                                           
##  Mcnemar's Test P-Value : 0.02334         
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.8333          
##          Pos Pred Value : 0.8409          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.4684          
##          Detection Rate : 0.4684          
##    Detection Prevalence : 0.5570          
##       Balanced Accuracy : 0.9167          
##                                           
##        'Positive' Class : 1               
## 
mean(autolda.class!=Auto.test$mpg01)
## [1] 0.08860759

The test error rate for the LDA model is 8.86%.

  1. Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
autoqda.fit <- qda(mpg01 ~ cylinders + displacement + weight + horsepower, data=Auto.train)
autoqda.preds <- predict(autoqda.fit, Auto.test)
autoqda.class <- autoqda.preds$class

confusionMatrix(autoqda.class, as.factor(Auto.test$mpg01), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 37  2
##          1  5 35
##                                           
##                Accuracy : 0.9114          
##                  95% CI : (0.8259, 0.9636)
##     No Information Rate : 0.5316          
##     P-Value [Acc > NIR] : 2.819e-13       
##                                           
##                   Kappa : 0.8229          
##                                           
##  Mcnemar's Test P-Value : 0.4497          
##                                           
##             Sensitivity : 0.9459          
##             Specificity : 0.8810          
##          Pos Pred Value : 0.8750          
##          Neg Pred Value : 0.9487          
##              Prevalence : 0.4684          
##          Detection Rate : 0.4430          
##    Detection Prevalence : 0.5063          
##       Balanced Accuracy : 0.9134          
##                                           
##        'Positive' Class : 1               
## 
mean(autoqda.class!=Auto.test$mpg01)
## [1] 0.08860759

The test error rate for the QDA model is 8.86%.

  1. Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
autoglm.fit <- glm(mpg01 ~ cylinders + displacement + weight + horsepower, data=Auto.train, family = binomial)
autoglm.preds <- predict(autoglm.fit, newdata = Auto.test, type = "response")
autoglm.class <- ifelse(autoglm.preds>0.5,1,0)

confusionMatrix(as.factor(autoglm.class), as.factor(Auto.test$mpg01), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 38  1
##          1  4 36
##                                           
##                Accuracy : 0.9367          
##                  95% CI : (0.8584, 0.9791)
##     No Information Rate : 0.5316          
##     P-Value [Acc > NIR] : 2.725e-15       
##                                           
##                   Kappa : 0.8735          
##                                           
##  Mcnemar's Test P-Value : 0.3711          
##                                           
##             Sensitivity : 0.9730          
##             Specificity : 0.9048          
##          Pos Pred Value : 0.9000          
##          Neg Pred Value : 0.9744          
##              Prevalence : 0.4684          
##          Detection Rate : 0.4557          
##    Detection Prevalence : 0.5063          
##       Balanced Accuracy : 0.9389          
##                                           
##        'Positive' Class : 1               
## 
mean(autoglm.class!=Auto.test$mpg01)
## [1] 0.06329114

The test error rate for the Logistic Regression model is 6.33%.

  1. Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

K = 1

trainX.Auto <- cbind(Auto.train$cylinders, Auto.train$displacement, Auto.train$weight, Auto.train$horsepower)
testX.Auto <- cbind(Auto.test$cylinders, Auto.test$displacement, Auto.test$weight, Auto.test$horsepower)
Auto.mpg01 <- Auto.train$mpg01

set.seed(1)
autoknn1<- knn(trainX.Auto, testX.Auto, Auto.mpg01, k=1)

confusionMatrix(autoknn1, as.factor(Auto.test$mpg01), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 36  4
##          1  6 33
##                                           
##                Accuracy : 0.8734          
##                  95% CI : (0.7795, 0.9376)
##     No Information Rate : 0.5316          
##     P-Value [Acc > NIR] : 1.017e-10       
##                                           
##                   Kappa : 0.7466          
##                                           
##  Mcnemar's Test P-Value : 0.7518          
##                                           
##             Sensitivity : 0.8919          
##             Specificity : 0.8571          
##          Pos Pred Value : 0.8462          
##          Neg Pred Value : 0.9000          
##              Prevalence : 0.4684          
##          Detection Rate : 0.4177          
##    Detection Prevalence : 0.4937          
##       Balanced Accuracy : 0.8745          
##                                           
##        'Positive' Class : 1               
## 
mean(autoknn1!=Auto.test$mpg01)
## [1] 0.1265823

Using a K=1, the test error is at 12.66% Accuracy = 87.34%.

K = 10

set.seed(1)
autoknn2<- knn(trainX.Auto, testX.Auto, Auto.mpg01, k=10)

confusionMatrix(autoknn2, as.factor(Auto.test$mpg01), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 35  2
##          1  7 35
##                                           
##                Accuracy : 0.8861          
##                  95% CI : (0.7947, 0.9466)
##     No Information Rate : 0.5316          
##     P-Value [Acc > NIR] : 1.615e-11       
##                                           
##                   Kappa : 0.7731          
##                                           
##  Mcnemar's Test P-Value : 0.1824          
##                                           
##             Sensitivity : 0.9459          
##             Specificity : 0.8333          
##          Pos Pred Value : 0.8333          
##          Neg Pred Value : 0.9459          
##              Prevalence : 0.4684          
##          Detection Rate : 0.4430          
##    Detection Prevalence : 0.5316          
##       Balanced Accuracy : 0.8896          
##                                           
##        'Positive' Class : 1               
## 
mean(autoknn2!=Auto.test$mpg01)
## [1] 0.1139241

Using a K=10, the test error is at 11.39%, Accuracy = 88.61%.

K = 80

set.seed(1)
autoknn3<- knn(trainX.Auto, testX.Auto, Auto.mpg01, k=80)

confusionMatrix(autoknn3, as.factor(Auto.test$mpg01), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 35  4
##          1  7 33
##                                           
##                Accuracy : 0.8608          
##                  95% CI : (0.7645, 0.9284)
##     No Information Rate : 0.5316          
##     P-Value [Acc > NIR] : 5.745e-10       
##                                           
##                   Kappa : 0.7217          
##                                           
##  Mcnemar's Test P-Value : 0.5465          
##                                           
##             Sensitivity : 0.8919          
##             Specificity : 0.8333          
##          Pos Pred Value : 0.8250          
##          Neg Pred Value : 0.8974          
##              Prevalence : 0.4684          
##          Detection Rate : 0.4177          
##    Detection Prevalence : 0.5063          
##       Balanced Accuracy : 0.8626          
##                                           
##        'Positive' Class : 1               
## 
mean(autoknn3!=Auto.test$mpg01)
## [1] 0.1392405

Using a K=80, the test error is at 13.92%, Accuracy = 86.08%.

The k=10 model seems to perform the best, with the lowest test error and highest Accuracy score.

QUESTION 13: Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.

attach(Boston)

#new column 
crime01 <- rep(0, length(crim))
crime01[crim > median(crim)] = 1
Boston <- data.frame(Boston, crime01)

#create test and train  
train = 1:(dim(Boston)[1]/2)
test = (dim(Boston)[1]/2 + 1):dim(Boston)[1]
Boston.train = Boston[train, ]
Boston.test = Boston[test, ]
crime01.test = crime01[test]

Logistic model

glm.crim.fit = glm(crime01 ~ . - crime01 - crim, data = Boston.train, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.crim01.probs <- predict(glm.crim.fit, Boston.test, type = 'response')
glm.crim01.preds <- rep(0, length(glm.crim01.probs))
glm.crim01.preds[glm.crim01.probs > 0.5] = 1

caret::confusionMatrix(as.factor(glm.crim01.preds), as.factor(Boston.test$crime01), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  68  24
##          1  22 139
##                                          
##                Accuracy : 0.8182         
##                  95% CI : (0.765, 0.8637)
##     No Information Rate : 0.6443         
##     P-Value [Acc > NIR] : 9.016e-10      
##                                          
##                   Kappa : 0.6053         
##                                          
##  Mcnemar's Test P-Value : 0.8828         
##                                          
##             Sensitivity : 0.8528         
##             Specificity : 0.7556         
##          Pos Pred Value : 0.8634         
##          Neg Pred Value : 0.7391         
##              Prevalence : 0.6443         
##          Detection Rate : 0.5494         
##    Detection Prevalence : 0.6364         
##       Balanced Accuracy : 0.8042         
##                                          
##        'Positive' Class : 1              
## 
mean(glm.crim01.preds != crime01.test)
## [1] 0.1818182

For a logistic model, the accuracy = 81.82% and the test error rate = 18.18%

LDA model

lda.crim01.fit <- lda(crime01 ~ . - crim, Boston.train)
lda.crim01.preds <- predict(lda.crim01.fit, Boston.test)
lda.crim01.class <- lda.crim01.preds$class

mean(lda.crim01.preds$class != crime01.test)
## [1] 0.1343874
mean(lda.crim01.preds$class == crime01.test)
## [1] 0.8656126

For an LDA model, the accuracy = 86.56% and the test error rate = 13.44%

KNN model k = 1

train.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, 
    lstat, medv)[train, ]
test.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, 
    lstat, medv)[test, ]
train.crime01 = crime01[train]
set.seed(1)

knn.pred = knn(train.X, test.X, train.crime01, k = 1)
caret::confusionMatrix(knn.pred, as.factor(Boston.test$crime01), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  85 111
##          1   5  52
##                                           
##                Accuracy : 0.5415          
##                  95% CI : (0.4779, 0.6041)
##     No Information Rate : 0.6443          
##     P-Value [Acc > NIR] : 0.9997          
##                                           
##                   Kappa : 0.2085          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.3190          
##             Specificity : 0.9444          
##          Pos Pred Value : 0.9123          
##          Neg Pred Value : 0.4337          
##              Prevalence : 0.6443          
##          Detection Rate : 0.2055          
##    Detection Prevalence : 0.2253          
##       Balanced Accuracy : 0.6317          
##                                           
##        'Positive' Class : 1               
## 
mean(knn.pred != crime01.test)
## [1] 0.458498

When k = 1, the model accuracy is 54.15% and the test error rate is 45.85%

k = 5

knn.pred = knn(train.X, test.X, train.crime01, k = 5)
caret::confusionMatrix(knn.pred, as.factor(Boston.test$crime01), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  84  37
##          1   6 126
##                                          
##                Accuracy : 0.83           
##                  95% CI : (0.778, 0.8742)
##     No Information Rate : 0.6443         
##     P-Value [Acc > NIR] : 5.088e-11      
##                                          
##                   Kappa : 0.6558         
##                                          
##  Mcnemar's Test P-Value : 4.763e-06      
##                                          
##             Sensitivity : 0.7730         
##             Specificity : 0.9333         
##          Pos Pred Value : 0.9545         
##          Neg Pred Value : 0.6942         
##              Prevalence : 0.6443         
##          Detection Rate : 0.4980         
##    Detection Prevalence : 0.5217         
##       Balanced Accuracy : 0.8532         
##                                          
##        'Positive' Class : 1              
## 
mean(knn.pred != crime01.test)
## [1] 0.1699605

When k = 5, the model accuracy is 83% and the test error rate is 17.00%

k = 7

knn.pred = knn(train.X, test.X, train.crime01, k = 7)
caret::confusionMatrix(knn.pred, as.factor(Boston.test$crime01), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  83  22
##          1   7 141
##                                           
##                Accuracy : 0.8854          
##                  95% CI : (0.8395, 0.9219)
##     No Information Rate : 0.6443          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.7589          
##                                           
##  Mcnemar's Test P-Value : 0.00933         
##                                           
##             Sensitivity : 0.8650          
##             Specificity : 0.9222          
##          Pos Pred Value : 0.9527          
##          Neg Pred Value : 0.7905          
##              Prevalence : 0.6443          
##          Detection Rate : 0.5573          
##    Detection Prevalence : 0.5850          
##       Balanced Accuracy : 0.8936          
##                                           
##        'Positive' Class : 1               
## 
mean(knn.pred != crime01.test)
## [1] 0.1146245

When k = 7, the model accuracy is 88.54% and the test error rate is 11.46%