Classification

Author

Justin Pons

13

This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

library(MASS)
library(tidyverse)
library(ISLR2)
library(corrplot)
library(caret)
library(e1071)
library(class)
data(Weekly)

13A

Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

plot(Weekly)

There appears to be a logarithmic/exponential relationship between Year and Volume.

Weekly |> 
  select(-Direction) |> 
  cor() |> 
  corrplot(method = "number")

No variable are highly correlated with one another.

13B

Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

mod.log1 <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 +Lag5 + Volume, data = Weekly, family = binomial)
summary(mod.log1)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
    Volume, family = binomial, data = Weekly)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  0.26686    0.08593   3.106   0.0019 **
Lag1        -0.04127    0.02641  -1.563   0.1181   
Lag2         0.05844    0.02686   2.175   0.0296 * 
Lag3        -0.01606    0.02666  -0.602   0.5469   
Lag4        -0.02779    0.02646  -1.050   0.2937   
Lag5        -0.01447    0.02638  -0.549   0.5833   
Volume      -0.02274    0.03690  -0.616   0.5377   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1496.2  on 1088  degrees of freedom
Residual deviance: 1486.4  on 1082  degrees of freedom
AIC: 1500.4

Number of Fisher Scoring iterations: 4

Lag2 is the only significant variable used.

13C

Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression

probs <- predict(mod.log1, type = "response")
preds <- rep("Down",nrow(Weekly))
preds[probs > .5] <- "Up"
preds <- as.factor(preds)
confusionMatrix(preds, Weekly$Direction)
Confusion Matrix and Statistics

          Reference
Prediction Down  Up
      Down   54  48
      Up    430 557
                                         
               Accuracy : 0.5611         
                 95% CI : (0.531, 0.5908)
    No Information Rate : 0.5556         
    P-Value [Acc > NIR] : 0.369          
                                         
                  Kappa : 0.035          
                                         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.11157        
            Specificity : 0.92066        
         Pos Pred Value : 0.52941        
         Neg Pred Value : 0.56434        
             Prevalence : 0.44444        
         Detection Rate : 0.04959        
   Detection Prevalence : 0.09366        
      Balanced Accuracy : 0.51612        
                                         
       'Positive' Class : Down           
                                         
mean(preds == Weekly$Direction)
[1] 0.5610652

56% Accuracy. The models is predicting “Up” far more often than “Down”, resulting in poor accuracy and sensitivity.

13D

Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

train <- Weekly[Weekly$Year < 2009,]
test <- Weekly[Weekly$Year > 2008,]
mod.log2 <- glm(Direction ~ Lag2, data = train, family = binomial)
probs <- predict(mod.log1, newdata = test, type = "response")
preds <- ifelse(probs < .5, "Down","Up")
confusionMatrix(as.factor(preds), test$Direction)
Confusion Matrix and Statistics

          Reference
Prediction Down Up
      Down   17 13
      Up     26 48
                                         
               Accuracy : 0.625          
                 95% CI : (0.5247, 0.718)
    No Information Rate : 0.5865         
    P-Value [Acc > NIR] : 0.24395        
                                         
                  Kappa : 0.1907         
                                         
 Mcnemar's Test P-Value : 0.05466        
                                         
            Sensitivity : 0.3953         
            Specificity : 0.7869         
         Pos Pred Value : 0.5667         
         Neg Pred Value : 0.6486         
             Prevalence : 0.4135         
         Detection Rate : 0.1635         
   Detection Prevalence : 0.2885         
      Balanced Accuracy : 0.5911         
                                         
       'Positive' Class : Down           
                                         

Accuracy improves to 62.5% and sensitivity improves to 40%.

13E

Repeat (d) using LDA.

mod.lda <- lda(Direction ~ Lag2, data = train)
preds <- predict(mod.lda, test)
confusionMatrix(preds$class, test$Direction)
Confusion Matrix and Statistics

          Reference
Prediction Down Up
      Down    9  5
      Up     34 56
                                         
               Accuracy : 0.625          
                 95% CI : (0.5247, 0.718)
    No Information Rate : 0.5865         
    P-Value [Acc > NIR] : 0.2439         
                                         
                  Kappa : 0.1414         
                                         
 Mcnemar's Test P-Value : 7.34e-06       
                                         
            Sensitivity : 0.20930        
            Specificity : 0.91803        
         Pos Pred Value : 0.64286        
         Neg Pred Value : 0.62222        
             Prevalence : 0.41346        
         Detection Rate : 0.08654        
   Detection Prevalence : 0.13462        
      Balanced Accuracy : 0.56367        
                                         
       'Positive' Class : Down           
                                         

LDA results in 62% accuracy

13F

Repeat (d) using QDA.

mod.qda <- qda(Direction ~ Lag2, data = train)
preds <- predict(mod.qda, test)
confusionMatrix(preds$class, test$Direction)
Confusion Matrix and Statistics

          Reference
Prediction Down Up
      Down    0  0
      Up     43 61
                                          
               Accuracy : 0.5865          
                 95% CI : (0.4858, 0.6823)
    No Information Rate : 0.5865          
    P-Value [Acc > NIR] : 0.5419          
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : 1.504e-10       
                                          
            Sensitivity : 0.0000          
            Specificity : 1.0000          
         Pos Pred Value :    NaN          
         Neg Pred Value : 0.5865          
             Prevalence : 0.4135          
         Detection Rate : 0.0000          
   Detection Prevalence : 0.0000          
      Balanced Accuracy : 0.5000          
                                          
       'Positive' Class : Down            
                                          

QDA results in an accuracy of 58%

13H

Repeat (d) using naive Bayes.

mod.bayes <- naiveBayes(Direction ~ Lag2, data = train)
preds <- predict(mod.bayes, test)
confusionMatrix(preds, test$Direction)
Confusion Matrix and Statistics

          Reference
Prediction Down Up
      Down    0  0
      Up     43 61
                                          
               Accuracy : 0.5865          
                 95% CI : (0.4858, 0.6823)
    No Information Rate : 0.5865          
    P-Value [Acc > NIR] : 0.5419          
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : 1.504e-10       
                                          
            Sensitivity : 0.0000          
            Specificity : 1.0000          
         Pos Pred Value :    NaN          
         Neg Pred Value : 0.5865          
             Prevalence : 0.4135          
         Detection Rate : 0.0000          
   Detection Prevalence : 0.0000          
      Balanced Accuracy : 0.5000          
                                          
       'Positive' Class : Down            
                                          

Naive Bayes has an accuracy of 58%

13I

Which of these methods appears to provide the best results on this data?

LDA and Logistic Regression achieve the same results of about 62% accuracy.

14

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

data(Auto)

14A

Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

df <- Auto |> 
  mutate(mpg01 = if_else(mpg <= median(mpg), 0, 1)) |> 
  mutate(mpg01 = as.factor(mpg01))

table(df$mpg01)

  0   1 
196 196 

14B

Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

df |> dplyr::select(-c(mpg, name)) |> 
  plot()

Displacement, horsepower, acceleration, weight, and cylinders all appear to separate the data fairly well.

df |> 
  ggplot(aes(x = horsepower, y = acceleration, color =mpg01)) +
  geom_point()

14C

Split the data into a training set and a test set.

set.seed(333)
idx <- sample(nrow(df),nrow(df)*.75)
train <- df[idx,]
test <- df[-idx,]

14D

Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

lda.mod <- lda(mpg01 ~ weight + acceleration + horsepower + year, data = train)
preds <- predict(lda.mod, test)
confusionMatrix(preds$class, test$mpg01)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 45  4
         1  9 40
                                          
               Accuracy : 0.8673          
                 95% CI : (0.7838, 0.9274)
    No Information Rate : 0.551           
    P-Value [Acc > NIR] : 1.976e-11       
                                          
                  Kappa : 0.7347          
                                          
 Mcnemar's Test P-Value : 0.2673          
                                          
            Sensitivity : 0.8333          
            Specificity : 0.9091          
         Pos Pred Value : 0.9184          
         Neg Pred Value : 0.8163          
             Prevalence : 0.5510          
         Detection Rate : 0.4592          
   Detection Prevalence : 0.5000          
      Balanced Accuracy : 0.8712          
                                          
       'Positive' Class : 0               
                                          

87% accuracy

14E

Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

qda.mod <- qda(mpg01 ~ weight + acceleration + horsepower + year, data = train)
preds <- predict(qda.mod, test)
confusionMatrix(preds$class, test$mpg01)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 47  5
         1  7 39
                                          
               Accuracy : 0.8776          
                 95% CI : (0.7959, 0.9351)
    No Information Rate : 0.551           
    P-Value [Acc > NIR] : 3.596e-12       
                                          
                  Kappa : 0.7536          
                                          
 Mcnemar's Test P-Value : 0.7728          
                                          
            Sensitivity : 0.8704          
            Specificity : 0.8864          
         Pos Pred Value : 0.9038          
         Neg Pred Value : 0.8478          
             Prevalence : 0.5510          
         Detection Rate : 0.4796          
   Detection Prevalence : 0.5306          
      Balanced Accuracy : 0.8784          
                                          
       'Positive' Class : 0               
                                          

88% accuracy.

14F

Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

log.mod <- glm(mpg01 ~ weight + acceleration + horsepower + year, data = train, family = "binomial")
preds <- predict(log.mod, test, type = "response")
preds <- if_else(preds >= .5, 1, 0) |> 
  as.factor()
confusionMatrix(preds, test$mpg01)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 47  4
         1  7 40
                                         
               Accuracy : 0.8878         
                 95% CI : (0.808, 0.9426)
    No Information Rate : 0.551          
    P-Value [Acc > NIR] : 5.976e-13      
                                         
                  Kappa : 0.7746         
                                         
 Mcnemar's Test P-Value : 0.5465         
                                         
            Sensitivity : 0.8704         
            Specificity : 0.9091         
         Pos Pred Value : 0.9216         
         Neg Pred Value : 0.8511         
             Prevalence : 0.5510         
         Detection Rate : 0.4796         
   Detection Prevalence : 0.5204         
      Balanced Accuracy : 0.8897         
                                         
       'Positive' Class : 0              
                                         

89%

14G

Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

mod.bayes <- naiveBayes(mpg01 ~ weight + acceleration + horsepower + year, data = train)
preds <- predict(mod.bayes, test)
confusionMatrix(preds, test$mpg01)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 45  5
         1  9 39
                                          
               Accuracy : 0.8571          
                 95% CI : (0.7719, 0.9196)
    No Information Rate : 0.551           
    P-Value [Acc > NIR] : 9.971e-11       
                                          
                  Kappa : 0.7137          
                                          
 Mcnemar's Test P-Value : 0.4227          
                                          
            Sensitivity : 0.8333          
            Specificity : 0.8864          
         Pos Pred Value : 0.9000          
         Neg Pred Value : 0.8125          
             Prevalence : 0.5510          
         Detection Rate : 0.4592          
   Detection Prevalence : 0.5102          
      Balanced Accuracy : 0.8598          
                                          
       'Positive' Class : 0               
                                          

86%

14H

Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

train.X <- cbind(train$weight,train$acceleration,train$horsepower,train$year)
test.X <- cbind(test$weight,test$acceleration,test$horsepower,test$year)
train.Direction <- df$mpg01[idx]
knn.pred <- knn(train = train.X, test = test.X, cl = train.Direction, k = 5)
confusionMatrix(as.factor(knn.pred), test$mpg01)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 46  8
         1  8 36
                                          
               Accuracy : 0.8367          
                 95% CI : (0.7484, 0.9037)
    No Information Rate : 0.551           
    P-Value [Acc > NIR] : 2.007e-09       
                                          
                  Kappa : 0.67            
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.8519          
            Specificity : 0.8182          
         Pos Pred Value : 0.8519          
         Neg Pred Value : 0.8182          
             Prevalence : 0.5510          
         Detection Rate : 0.4694          
   Detection Prevalence : 0.5510          
      Balanced Accuracy : 0.8350          
                                          
       'Positive' Class : 0               
                                          

86% with k=5

16

Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings. Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.

data(Boston)
str(Boston)
'data.frame':   506 obs. of  13 variables:
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
df <- Boston |> 
  mutate(response = if_else(crim >= median(crim), 1, 0)) |> 
  dplyr::select(-crim) |> 
  mutate(response = as.factor(response))
idx <- sample(nrow(df),nrow(df)*.8)
train <- df[idx,]
test <- df[-idx,]
log.mod <- glm(response ~ ., data = train, family = "binomial")
preds <- predict(log.mod, test, type = "response")
preds <- if_else(preds >= .5, 1, 0) |> 
  as.factor()
confusionMatrix(preds, test$response)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 47  4
         1  5 46
                                          
               Accuracy : 0.9118          
                 95% CI : (0.8391, 0.9589)
    No Information Rate : 0.5098          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.8235          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.9038          
            Specificity : 0.9200          
         Pos Pred Value : 0.9216          
         Neg Pred Value : 0.9020          
             Prevalence : 0.5098          
         Detection Rate : 0.4608          
   Detection Prevalence : 0.5000          
      Balanced Accuracy : 0.9119          
                                          
       'Positive' Class : 0               
                                          

91% accuracy is achieved with a logit model