Assignment 3

STA6543

Author

Stephen Garcia (wqr974)

Published

July 2, 2025

Chapter 4

Classification

Applied Exercises:

Problem 13

This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

(a)
Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

library(ISLR2)
Warning: package 'ISLR2' was built under R version 4.4.3
library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:ISLR2':

    Boston
library(ggplot2)
library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
library(dplyr)

Attaching package: 'dplyr'
The following object is masked from 'package:MASS':

    select
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(class)
library(e1071)
library(tidyr)
library(caret)
Warning: package 'caret' was built under R version 4.4.2
Loading required package: lattice
data("Weekly")

# Display the structure of the Weekly dataset
str(Weekly)
'data.frame':   1089 obs. of  9 variables:
 $ Year     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
 $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
 $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
 $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
 $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
 $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
 $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
 $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
 $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
# Display a summary of the Weekly dataset
summary(Weekly)
      Year           Lag1               Lag2               Lag3         
 Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
 1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
 Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
 Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
 3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
 Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
      Lag4               Lag5              Volume            Today         
 Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
 1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
 Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
 Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
 3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
 Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
 Direction 
 Down:484  
 Up  :605  
           
           
           
           
# Let's first check the pairs of the Weekly dataset to visualize relationships
plot_obj <- ggpairs(
    Weekly,
    aes(color = factor(Direction), alpha = 0.5),
    upper = list(continuous = wrap("cor", size = 2.5)),
    lower = list(continuous = wrap("points", alpha = 0.6, size = 1.5)),
    diag = list(continuous = wrap("densityDiag"))
) +
    theme_minimal() +
    labs(
        title = "Scatterplot Matrix of Weekly Data",
        subtitle = "Colored by Direction",
        color = "Direction"
    )

suppressWarnings(suppressMessages(print(plot_obj)))

Response:
The scatterplot matrix above provides a visual summary of the relationships between the variables in the Weekly dataset. For the most part the plots are unremarkable, with few discernible patterns. However, there are some observations:
- Although the correlation between Year and Volume is high (r = 0.84), the relationship does not appear strictly linear. A visual inspection of the scatterplot suggests a nonlinear, upward-curving trend in Volume over time, indicating that trading volume has generally increased over the years (a quick sketch after this list makes the trend explicit).
- The lag variables (Lag1 through Lag5) exhibit only weak relationships with the Direction variable (whether the market went Up or Down in a given week). Among these, Lag2, Lag4, and Lag5 stand out slightly, but none shows a clear visual separation between Up and Down weeks.
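
To make the Year-Volume pattern concrete, a minimal sketch (reusing the Weekly data and ggplot2 loaded above, not part of the required answer) overlays a loess smoother on the raw points; the upward curvature is what the pairs plot hints at.

# Sketch: trading volume over time with a loess smoother
ggplot(Weekly, aes(x = Year, y = Volume)) +
    geom_point(alpha = 0.3) +
    geom_smooth(method = "loess", se = FALSE) +
    labs(title = "Weekly Trading Volume by Year") +
    theme_minimal()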

(b)
Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

# Logistic regression with Direction as the response and lag variables plus Volume as predictors
logistic_model <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, 
                     data = Weekly, 
                     family = binomial)

summary(logistic_model)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
    Volume, family = binomial, data = Weekly)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  0.26686    0.08593   3.106   0.0019 **
Lag1        -0.04127    0.02641  -1.563   0.1181   
Lag2         0.05844    0.02686   2.175   0.0296 * 
Lag3        -0.01606    0.02666  -0.602   0.5469   
Lag4        -0.02779    0.02646  -1.050   0.2937   
Lag5        -0.01447    0.02638  -0.549   0.5833   
Volume      -0.02274    0.03690  -0.616   0.5377   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1496.2  on 1088  degrees of freedom
Residual deviance: 1486.4  on 1082  degrees of freedom
AIC: 1500.4

Number of Fisher Scoring iterations: 4

Response:
In the logistic regression model, the coefficient for Lag2 is estimated at 0.05844 with a standard error of 0.02686, resulting in a z-value of 2.175 and a p-value of 0.0296. Since the p-value is below the 0.05 significance threshold, we conclude that Lag2 is statistically significant at the 5% level.

This positive coefficient indicates that higher returns two weeks ago are associated with an increased likelihood that the market will go Up in the current week. Specifically, for every 1% increase in Lag2, the log-odds of an Up week increase by approximately 0.05844. When translated to odds, this equates to a 6% increase in the odds of the market going Up, holding all other variables constant:

\[\begin{equation} \text{Odds Ratio} = e^{0.05844} \approx 1.060 \end{equation}\]

This result suggests that Lag2 may carry some limited predictive power for weekly market direction.
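
As a quick check on the odds-ratio arithmetic, the sketch below (assuming the logistic_model object fit above is still in the workspace) exponentiates the Lag2 coefficient and a profile-likelihood confidence interval; no new results are reproduced here.

# Sketch: odds ratio for Lag2 and a 95% profile-likelihood CI from logistic_model
exp(coef(logistic_model)["Lag2"])     # ~1.06, i.e. ~6% higher odds of an Up week
exp(confint(logistic_model, "Lag2"))  # 95% CI on the odds-ratio scale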

(c)
Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

# Compute predicted probabilities
predicted_probs <- predict(logistic_model, type = "response")

# Convert probabilities to binary predictions
predicted_classes <- ifelse(predicted_probs > 0.5, "Up", "Down")

# Create confusion matrix
confusion_matrix <- table(Predicted = predicted_classes, Actual = Weekly$Direction)

# Compute overall fraction of correct predictions
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)

# Print confusion matrix and accuracy
print(confusion_matrix)
         Actual
Predicted Down  Up
     Down   54  48
     Up    430 557
cat("Overall fraction of correct predictions (accuracy):", accuracy, "\n") 
Overall fraction of correct predictions (accuracy): 0.5610652 

Response:
The model correctly predicted the market direction in 611 out of 1089 weeks, resulting in an overall accuracy of:

\[\begin{equation} \text{Accuracy} = \frac{54 + 557}{1089} \approx 0.561 \end{equation}\]

This is only slightly better than always predicting "Up", which would be correct in 605/1089 ≈ 0.556 of weeks.

Notably, the model frequently misclassifies Down weeks, predicting “Up” in 430 of the 484 weeks when the market actually went down. This indicates a strong bias toward predicting the more common outcome, market increases, at the expense of accurately detecting downturns. This imbalance may limit the model’s practical usefulness, especially for applications like trading strategies or risk management, where correctly identifying Down weeks is critical.
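
To quantify this asymmetry, a short sketch (using the confusion_matrix object computed above) breaks the overall accuracy into class-specific rates; the values follow directly from the printed table.

# Sketch: class-specific rates from the full-data confusion matrix above
sens_up   <- confusion_matrix["Up", "Up"] / sum(confusion_matrix[, "Up"])       # 557/605, about 0.92
spec_down <- confusion_matrix["Down", "Down"] / sum(confusion_matrix[, "Down"]) # 54/484, about 0.11
cat("Up weeks correctly predicted:", round(sens_up, 3), "\n")
cat("Down weeks correctly predicted:", round(spec_down, 3), "\n")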

(d)
Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

# Subset the data into training and testing sets
train_data <- subset(Weekly, Year < 2009)
test_data <- subset(Weekly, Year >= 2009)

# Fit the logistic regression model
logistic_model_1990_2008 <- glm(Direction ~ Lag2, data = train_data, family = binomial)

# Compute predicted probabilities on the test set
predicted_probs_test <- predict(logistic_model_1990_2008, newdata = test_data, type = "response")

# Convert probabilities to binary predictions
predicted_classes_test <- ifelse(predicted_probs_test > 0.5, "Up", "Down")

# Create confusion matrix
confusion_matrix_test <- table(Predicted = predicted_classes_test, Actual = test_data$Direction)

# Compute overall fraction of correct predictions
accuracy_test <- sum(diag(confusion_matrix_test)) / sum(confusion_matrix_test)

# Print confusion matrix and accuracy
print(confusion_matrix_test)
         Actual
Predicted Down Up
     Down    9  5
     Up     34 56
cat("Overall fraction of correct predictions (accuracy):", accuracy_test, "\n")
Overall fraction of correct predictions (accuracy): 0.625 

(e)
Repeat (d) using LDA.

# Fit the LDA model
lda_model <- lda(Direction ~ Lag2, data = train_data)

# Compute predicted probabilities on the test set
predicted_probs_lda <- predict(lda_model, newdata = test_data)$posterior[, "Up"]

# Convert probabilities to binary predictions
predicted_classes_lda <- ifelse(predicted_probs_lda > 0.5, "Up", "Down")

# Create confusion matrix
confusion_matrix_lda <- table(Predicted = predicted_classes_lda, Actual = test_data$Direction)

# Compute overall fraction of correct predictions
accuracy_lda <- sum(diag(confusion_matrix_lda)) / sum(confusion_matrix_lda)

# Print confusion matrix and accuracy
print(confusion_matrix_lda)
         Actual
Predicted Down Up
     Down    9  5
     Up     34 56
cat("Overall fraction of correct predictions (accuracy):", accuracy_lda, "\n")
Overall fraction of correct predictions (accuracy): 0.625 

(f)
Repeat (d) using QDA.

# Fit the QDA model
qda_model <- qda(Direction ~ Lag2, data = train_data)

# Compute predicted probabilities on the test set
predicted_probs_qda <- predict(qda_model, newdata = test_data)$posterior[, "Up"]

# Convert probabilities to binary predictions
predicted_classes_qda <- ifelse(predicted_probs_qda > 0.5, "Up", "Down")

# Ensure both predicted and actual are factors with consistent levels
predicted_classes_qda <- factor(predicted_classes_qda, levels = c("Down", "Up"))
actual_classes_qda <- factor(test_data$Direction, levels = c("Down", "Up"))

# Create confusion matrix
confusion_matrix_qda <- table(Predicted = predicted_classes_qda, Actual = actual_classes_qda)

# Compute overall fraction of correct predictions
accuracy_qda <- sum(diag(confusion_matrix_qda)) / sum(confusion_matrix_qda)

# Print confusion matrix and accuracy
print(confusion_matrix_qda)
         Actual
Predicted Down Up
     Down    0  0
     Up     43 61
cat("Overall fraction of correct predictions (accuracy):", accuracy_qda, "\n")
Overall fraction of correct predictions (accuracy): 0.5865385 

(g)
Repeat (d) using KNN with K = 1.

# Prepare training and test sets for k-NN
# Only use Lag2 as the predictor (consistent with the earlier models)
train_X <- data.frame(Lag2 = train_data$Lag2)
test_X <- data.frame(Lag2 = test_data$Lag2)

# Target variable must be a factor
train_Y <- train_data$Direction

# Apply k-NN with k = 1
predicted_classes_knn <- knn(train = train_X, test = test_X, cl = train_Y, k = 1)

# Create confusion matrix
confusion_matrix_knn <- table(Predicted = predicted_classes_knn, Actual = test_data$Direction)

# Compute overall fraction of correct predictions
accuracy_knn <- sum(diag(confusion_matrix_knn)) / sum(confusion_matrix_knn)

# Print confusion matrix and accuracy
print(confusion_matrix_knn)
         Actual
Predicted Down Up
     Down   21 30
     Up     22 31
cat("Overall fraction of correct predictions (accuracy):", accuracy_knn, "\n")
Overall fraction of correct predictions (accuracy): 0.5 

(h)
Repeat (d) using naive Bayes.

# Fit the Naive Bayes model using only Lag2
nb_model <- naiveBayes(Direction ~ Lag2, data = train_data)

# Compute predicted probabilities on the test set
predicted_probs_nb <- predict(nb_model, newdata = test_data, type = "raw")[, "Up"]

# Convert probabilities to binary predictions
predicted_classes_nb <- ifelse(predicted_probs_nb > 0.5, "Up", "Down")

# Ensure both predicted and actual are factors with the same levels
predicted_classes_nb <- factor(predicted_classes_nb, levels = c("Down", "Up"))
actual_classes <- factor(test_data$Direction, levels = c("Down", "Up"))

# Create confusion matrix
confusion_matrix_nb <- table(Predicted = predicted_classes_nb, Actual = actual_classes)

# Compute overall fraction of correct predictions
accuracy_nb <- sum(diag(confusion_matrix_nb)) / sum(confusion_matrix_nb)

# Print confusion matrix and accuracy
print(confusion_matrix_nb)
         Actual
Predicted Down Up
     Down    0  0
     Up     43 61
cat("Overall fraction of correct predictions (accuracy):", accuracy_nb, "\n")
Overall fraction of correct predictions (accuracy): 0.5865385 

(i)
Which of these methods appears to provide the best results on this data?

# First, store the accuracy values from each model
accuracy_results <- data.frame(
  Model = c("Logistic Regression", "LDA", "QDA", "k-NN (k = 1)", "Naive Bayes"),
  Accuracy = c(accuracy_test, accuracy_lda, accuracy_qda, accuracy_knn, accuracy_nb)
)

# Print as a clean table (requires knitr package)
library(knitr)
kable(accuracy_results, digits = 3, caption = "Model Accuracy Comparison")
Model Accuracy Comparison
Model Accuracy
Logistic Regression 0.625
LDA 0.625
QDA 0.587
k-NN (k = 1) 0.500
Naive Bayes 0.587

Response:
Logistic regression and LDA provide the best results on this data, with identical accuracy. This is expected since both models produce linear decision boundaries and are fit using only a single predictor (Lag2). In such cases, their predictions often align closely. More complex models like QDA or k-NN may overfit or underperform due to the simplicity of the data.
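
One way to verify that the two classifiers really do align is to compare their test-set predictions directly; the sketch below assumes the predicted_classes_test (logistic regression) and predicted_classes_lda vectors created above.

# Sketch: agreement between logistic regression and LDA predictions on the test set
mean(predicted_classes_test == predicted_classes_lda)
table(Logistic = predicted_classes_test, LDA = predicted_classes_lda)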

Problem 14

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

(a)
Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

# Load the Auto dataset
data("Auto")

# Remove rows with missing values
Auto <- na.omit(Auto)

# Create binary response variable mpg01: 1 if mpg > median, else 0
cat("Median mpg:", median(Auto$mpg), "\n")
Median mpg: 22.75 
Auto$mpg01 <- factor(ifelse(Auto$mpg > median(Auto$mpg), "High", "Low"))

# Convert relevant variables to factors
Auto$cylinders <- factor(Auto$cylinders)
Auto$year      <- factor(Auto$year)
Auto$origin    <- factor(Auto$origin)

# Drop the 'name' identifier and the 'year' variable (neither is used in the models below)
Auto <- Auto %>% select(-name, -year)

# View structure and sample of cleaned dataset
str(Auto)
'data.frame':   392 obs. of  8 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : Factor w/ 5 levels "3","4","5","6",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
 $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ origin      : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ mpg01       : Factor w/ 2 levels "High","Low": 2 2 2 2 2 2 2 2 2 2 ...
 - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
  ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
head(Auto)
  mpg cylinders displacement horsepower weight acceleration origin mpg01
1  18         8          307        130   3504         12.0      1   Low
2  15         8          350        165   3693         11.5      1   Low
3  18         8          318        150   3436         11.0      1   Low
4  16         8          304        150   3433         12.0      1   Low
5  17         8          302        140   3449         10.5      1   Low
6  15         8          429        198   4341         10.0      1   Low

(b)
Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

# Select numeric predictors only (excluding mpg01, which is now a factor)
predictors <- Auto %>%
  select(-mpg01) %>%
  select(where(is.numeric))

# Create the pairs plot with mpg01 as the color
plot_obj <- ggpairs(
  data = bind_cols(predictors, mpg01 = Auto$mpg01),
  mapping = aes(color = mpg01, alpha = 0.5),
  upper = list(continuous = wrap("cor", size = 2.5)),
  lower = list(continuous = wrap("points", alpha = 0.6, size = 1.5)),
  diag = list(continuous = wrap("densityDiag"))
) +
  theme_minimal() +
  labs(
    title = "Scatterplot Matrix of Auto Dataset",
    subtitle = "Colored by mpg01 (1 = high MPG, 0 = low MPG)",
    color = "mpg01"
  )

# Suppress warnings and print the plot
suppressWarnings(suppressMessages(print(plot_obj)))

Response:
Based on the pairs plot, the features most useful in predicting ‘mpg01’ appear to be weight, horsepower, and displacement. These variables show strong negative correlations with ‘mpg’ and clear class separation between high and low MPG vehicles in both scatterplots and boxplots. Vehicles with lower weight, horsepower, and displacement are much more likely to fall into the high MPG category. In contrast, acceleration appears to have a weaker relationship and may be less useful for classification on its own.
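
To back up the separation claim for the numeric predictors, a brief sketch (reusing the cleaned Auto data and the dplyr/tidyr/ggplot2 packages loaded above) draws faceted boxplots of weight, horsepower, and displacement by mpg01.

# Sketch: boxplots of the three strongest numeric predictors by mpg01
Auto %>%
  select(mpg01, weight, horsepower, displacement) %>%
  pivot_longer(cols = -mpg01, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = mpg01, y = value, fill = mpg01)) +
  geom_boxplot(alpha = 0.7, show.legend = FALSE) +
  facet_wrap(~variable, scales = "free_y") +
  labs(title = "Key Numeric Predictors by mpg01", x = NULL, y = NULL) +
  theme_minimal()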

# Select relevant categorical predictors and mpg01
categorical_vars <- Auto %>%
  select(mpg01, cylinders, origin)

# Pivot into long format for faceted plotting
long_cat <- categorical_vars %>%
  pivot_longer(cols = c(cylinders, origin), names_to = "variable", values_to = "category")

# Plot proportion of mpg01 = 1 vs 0 for each category of each variable
ggplot(long_cat, aes(x = category, fill = mpg01)) +
  geom_bar(position = "fill") +
  facet_wrap(~variable, scales = "free_x") +
  labs(title = "Proportion of High vs Low MPG by Categorical Variables",
       y = "Proportion", x = NULL, fill = "mpg01") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Reshape for boxplot faceting
boxplot_data <- Auto %>%
  select(mpg, cylinders, origin) %>%
  pivot_longer(cols = c(cylinders, origin), names_to = "variable", values_to = "category")

# Faceted boxplots of mpg by each categorical variable
ggplot(boxplot_data, aes(x = category, y = mpg)) +
  geom_boxplot(fill = "#69b3a2", alpha = 0.7) +
  facet_wrap(~variable, scales = "free_x") +
  labs(title = "Distribution of MPG by Categorical Predictors",
       x = NULL, y = "MPG") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

(c)
Split the data into a training set and a test set.

# Set a seed for reproducibility
set.seed(123)

# Create a stratified split: 70% train, 30% test (preserving mpg01 balance)
train_index <- createDataPartition(Auto$mpg01, p = 0.7, list = FALSE)

# Split the data
train_data <- Auto[train_index, ]
test_data  <- Auto[-train_index, ]

# Drop highly correlated predictors
cor_matrix <- cor(select_if(train_data, is.numeric))
high_cor <- findCorrelation(cor_matrix, cutoff = 0.75)
if (length(high_cor) > 0) {
  train_data <- train_data[, -high_cor]
  test_data <- test_data[, -high_cor]
  print(high_cor)
}
[1] 3 2 4
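
One caveat worth noting: findCorrelation() returns positions within cor_matrix (the numeric columns only), which need not line up with the column positions of the full data frames, so a name-based drop is the safer pattern. A brief sketch of that alternative (drop_names, train_data_alt, and test_data_alt are hypothetical objects; the results reported below use the index-based code above):

# Sketch: translate findCorrelation() indices (positions in cor_matrix) into
# column names before dropping them from the full data frames.
drop_names <- colnames(cor_matrix)[high_cor]
print(drop_names)                    # names of the flagged numeric predictors
train_data_alt <- train_data[, !(names(train_data) %in% drop_names)]
test_data_alt  <- test_data[,  !(names(test_data)  %in% drop_names)]
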
# Confirm the class balance is preserved
prop.table(table(train_data$mpg01))

High  Low 
 0.5  0.5 
prop.table(table(test_data$mpg01))

High  Low 
 0.5  0.5 
# I am going to use some of the cross-validation techniques we learned in class.
# I will try to weave this into the questions as I go along.

# Define the cross-validation control
train_control <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 3,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

(d)
Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

# Fit LDA model using caret
model_lda <- train(
  mpg01 ~ .,                # Predict mpg01 using all other variables
  data = train_data,
  method = "lda",
  trControl = train_control,
  metric = "ROC"
)

# Print cross-validation results
print(model_lda)
Linear Discriminant Analysis 

276 samples
  4 predictor
  2 classes: 'High', 'Low' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 248, 248, 248, 250, 248, 248, ... 
Resampling results:

  ROC       Sens       Spec     
  0.997619  0.9445055  0.9855311
# Predict on the test set
pred_classes <- predict(model_lda, newdata = test_data)
pred_probs   <- predict(model_lda, newdata = test_data, type = "prob")

# Compute confusion matrix
conf_matrix <- confusionMatrix(pred_classes, test_data$mpg01, positive = "High")
print(conf_matrix)
Confusion Matrix and Statistics

          Reference
Prediction High Low
      High   55   2
      Low     3  56
                                          
               Accuracy : 0.9569          
                 95% CI : (0.9023, 0.9859)
    No Information Rate : 0.5             
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9138          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.9483          
            Specificity : 0.9655          
         Pos Pred Value : 0.9649          
         Neg Pred Value : 0.9492          
             Prevalence : 0.5000          
         Detection Rate : 0.4741          
   Detection Prevalence : 0.4914          
      Balanced Accuracy : 0.9569          
                                          
       'Positive' Class : High            
                                          
# Calculate accuracy and test error
accuracy <- conf_matrix$overall["Accuracy"]
test_error <- 1 - accuracy

# Print accuracy and test error
cat("Accuracy on test set:", round(accuracy, 3), "\n")
Accuracy on test set: 0.957 
cat("Test Error Rate:", round(test_error, 3), "\n")
Test Error Rate: 0.043 

(e)
Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

# Fit QDA model using caret
model_qda <- train(
  mpg01 ~ .,                # Predict mpg01 using all other variables
  data = train_data,
  method = "qda",
  trControl = train_control,
  metric = "ROC"
)

# Print cross-validation results
print(model_qda)
Quadratic Discriminant Analysis 

276 samples
  4 predictor
  2 classes: 'High', 'Low' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 249, 248, 249, 248, 248, 248, ... 
Resampling results:

  ROC        Sens       Spec     
  0.9787687  0.9637363  0.8842491
# Predict on the test set
pred_classes <- predict(model_qda, newdata = test_data)
pred_probs   <- predict(model_qda, newdata = test_data, type = "prob")

# Compute confusion matrix
conf_matrix <- confusionMatrix(pred_classes, test_data$mpg01, positive = "High")
print(conf_matrix)
Confusion Matrix and Statistics

          Reference
Prediction High Low
      High   55   4
      Low     3  54
                                          
               Accuracy : 0.9397          
                 95% CI : (0.8796, 0.9754)
    No Information Rate : 0.5             
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.8793          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.9483          
            Specificity : 0.9310          
         Pos Pred Value : 0.9322          
         Neg Pred Value : 0.9474          
             Prevalence : 0.5000          
         Detection Rate : 0.4741          
   Detection Prevalence : 0.5086          
      Balanced Accuracy : 0.9397          
                                          
       'Positive' Class : High            
                                          
# Calculate accuracy and test error
accuracy <- conf_matrix$overall["Accuracy"]
test_error <- 1 - accuracy

# Print accuracy and test error
cat("Accuracy on test set:", round(accuracy, 3), "\n")
Accuracy on test set: 0.94 
cat("Test Error Rate:", round(test_error, 3), "\n")
Test Error Rate: 0.06 

(f)
Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

# Fit logistic regression model using only key predictors
model_logit <- train(
  mpg01 ~ .,  # all remaining predictors after the correlation filter in (c)
  data = train_data,
  method = "glm",
  family = "binomial",
  trControl = train_control,
  metric = "ROC"
)
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
(this pair of warnings was emitted for each resampling fit and the final model; repeats omitted)
# Print cross-validation results
print(model_logit)
Generalized Linear Model 

276 samples
  4 predictor
  2 classes: 'High', 'Low' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 248, 248, 249, 248, 248, 249, ... 
Resampling results:

  ROC        Sens       Spec
  0.9983778  0.9855311  1   
# Predict on the test set
pred_classes <- predict(model_logit, newdata = test_data)
pred_probs   <- predict(model_logit, newdata = test_data, type = "prob")

# Compute confusion matrix
conf_matrix <- confusionMatrix(pred_classes, test_data$mpg01, positive = "High")
print(conf_matrix)
Confusion Matrix and Statistics

          Reference
Prediction High Low
      High   58   0
      Low     0  58
                                     
               Accuracy : 1          
                 95% CI : (0.9687, 1)
    No Information Rate : 0.5        
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.0        
            Specificity : 1.0        
         Pos Pred Value : 1.0        
         Neg Pred Value : 1.0        
             Prevalence : 0.5        
         Detection Rate : 0.5        
   Detection Prevalence : 0.5        
      Balanced Accuracy : 1.0        
                                     
       'Positive' Class : High       
                                     
# Calculate accuracy and test error
accuracy <- conf_matrix$overall["Accuracy"]
test_error <- 1 - accuracy

# Print accuracy and test error
cat("Accuracy on test set:", round(accuracy, 3), "\n")
Accuracy on test set: 1 
cat("Test Error Rate:", round(test_error, 3), "\n")
Test Error Rate: 0 

(g)
Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

# Fit Naive Bayes model using selected predictors
model_nb <- train(
  mpg01 ~ .,
  data = train_data,
  method = "naive_bayes",
  trControl = train_control,
  metric = "ROC"
)

# Print cross-validation results
print(model_nb)
Naive Bayes 

276 samples
  4 predictor
  2 classes: 'High', 'Low' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 248, 248, 248, 248, 250, 250, ... 
Resampling results across tuning parameters:

  usekernel  ROC        Sens       Spec     
  FALSE      0.9737270  0.9664835  0.8620879
   TRUE      0.9974097  0.9690476  0.9761905

Tuning parameter 'laplace' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were laplace = 0, usekernel = TRUE and adjust = 1.
# Predict on the test set
pred_classes <- predict(model_nb, newdata = test_data)
pred_probs   <- predict(model_nb, newdata = test_data, type = "prob")

# Compute confusion matrix
conf_matrix <- confusionMatrix(pred_classes, test_data$mpg01, positive = "High")
print(conf_matrix)
Confusion Matrix and Statistics

          Reference
Prediction High Low
      High   56   1
      Low     2  57
                                          
               Accuracy : 0.9741          
                 95% CI : (0.9263, 0.9946)
    No Information Rate : 0.5             
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9483          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.9655          
            Specificity : 0.9828          
         Pos Pred Value : 0.9825          
         Neg Pred Value : 0.9661          
             Prevalence : 0.5000          
         Detection Rate : 0.4828          
   Detection Prevalence : 0.4914          
      Balanced Accuracy : 0.9741          
                                          
       'Positive' Class : High            
                                          
# Calculate accuracy and test error
accuracy <- conf_matrix$overall["Accuracy"]
test_error <- 1 - accuracy

# Print accuracy and test error
cat("Accuracy on test set:", round(accuracy, 3), "\n")
Accuracy on test set: 0.974 
cat("Test Error Rate:", round(test_error, 3), "\n")
Test Error Rate: 0.026 

(h)
Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

# Define a grid of K values to try (e.g., 1 to 20)
k_grid <- expand.grid(k = 1:20)

# Fit KNN model using caret with selected predictors
model_knn <- train(
  mpg01 ~ .,
  data = train_data,
  method = "knn",
  trControl = train_control,
  tuneGrid = k_grid,
  metric = "ROC"
)

# Print cross-validation results for all K values
print(model_knn)
k-Nearest Neighbors 

276 samples
  4 predictor
  2 classes: 'High', 'Low' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 248, 249, 248, 248, 248, 249, ... 
Resampling results across tuning parameters:

  k   ROC        Sens       Spec     
   1  0.8560440  0.8739927  0.8380952
   2  0.8936043  0.8668498  0.8424908
   3  0.9074382  0.8838828  0.8357143
   4  0.9102307  0.8716117  0.8357143
   5  0.9079887  0.8622711  0.8210623
   6  0.9147999  0.8743590  0.8382784
   7  0.9160604  0.8695971  0.8331502
   8  0.9194607  0.8670330  0.8430403
   9  0.9224570  0.8695971  0.8426740
  10  0.9248355  0.8717949  0.8424908
  11  0.9274539  0.8793040  0.8452381
  12  0.9295466  0.8840659  0.8525641
  13  0.9304895  0.8793040  0.8478022
  14  0.9322314  0.8794872  0.8426740
  15  0.9341414  0.8719780  0.8428571
  16  0.9366009  0.8816850  0.8503663
  17  0.9348740  0.8816850  0.8452381
  18  0.9363196  0.8864469  0.8551282
  19  0.9369148  0.8791209  0.8551282
  20  0.9374381  0.8840659  0.8622711

ROC was used to select the optimal model using the largest value.
The final value used for the model was k = 20.
plot(model_knn)

# Predict on the test set using the best K found via CV
pred_classes <- predict(model_knn, newdata = test_data)
pred_probs   <- predict(model_knn, newdata = test_data, type = "prob")

# Confusion matrix
conf_matrix <- confusionMatrix(pred_classes, test_data$mpg01, positive = "High")
print(conf_matrix)
Confusion Matrix and Statistics

          Reference
Prediction High Low
      High   53   5
      Low     5  53
                                          
               Accuracy : 0.9138          
                 95% CI : (0.8472, 0.9579)
    No Information Rate : 0.5             
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.8276          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.9138          
            Specificity : 0.9138          
         Pos Pred Value : 0.9138          
         Neg Pred Value : 0.9138          
             Prevalence : 0.5000          
         Detection Rate : 0.4569          
   Detection Prevalence : 0.5000          
      Balanced Accuracy : 0.9138          
                                          
       'Positive' Class : High            
                                          
# Accuracy and test error
accuracy <- conf_matrix$overall["Accuracy"]
test_error <- 1 - accuracy

cat("Best K:", model_knn$bestTune$k, "\n")
Best K: 20 
cat("Accuracy on test set:", round(accuracy, 3), "\n")
Accuracy on test set: 0.914 
cat("Test Error Rate:", round(test_error, 3), "\n")
Test Error Rate: 0.086 

Response:
Caret automatically selects the best K based on cross-validation performance, here the largest ROC AUC across the grid of K values, and it chose K = 20. That is a fairly large neighborhood for KNN, and the plot of ROC against K suggests diminishing returns beyond roughly K = 14. The sketch below makes that reading a little less subjective, and I then force caret to use K = 14 to see how it performs.
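
One way to check the plateau is to look at the resampled ROC means together with their standard deviations, which caret stores in model_knn$results; a small sketch (the within_1SD column name below is mine, not caret's):

# Sketch: flag the K values whose mean CV ROC is within one SD of the best,
# using the resampling summaries stored by caret in model_knn$results.
knn_cv <- model_knn$results[, c("k", "ROC", "ROCSD")]
best_row <- which.max(knn_cv$ROC)
knn_cv$within_1SD <- knn_cv$ROC >= knn_cv$ROC[best_row] - knn_cv$ROCSD[best_row]
knn_cv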

# Force caret to use K = 14
model_knn_k14 <- train(
  mpg01 ~ .,
  data = train_data,
  method = "knn",
  tuneGrid = data.frame(k = 14),   # <- explicitly set K = 14
  trControl = train_control,
  metric = "ROC"
)

# Print model details
print(model_knn_k14)
k-Nearest Neighbors 

276 samples
  4 predictor
  2 classes: 'High', 'Low' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 248, 248, 248, 249, 248, 248, ... 
Resampling results:

  ROC        Sens       Spec     
  0.9366703  0.8772894  0.8549451

Tuning parameter 'k' was held constant at a value of 14
# Predict class labels and probabilities on test set
pred_classes <- predict(model_knn_k14, newdata = test_data)
pred_probs   <- predict(model_knn_k14, newdata = test_data, type = "prob")

# Compute confusion matrix
conf_matrix <- confusionMatrix(pred_classes, test_data$mpg01, positive = "High")
print(conf_matrix)
Confusion Matrix and Statistics

          Reference
Prediction High Low
      High   52   5
      Low     6  53
                                          
               Accuracy : 0.9052          
                 95% CI : (0.8367, 0.9517)
    No Information Rate : 0.5             
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.8103          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.8966          
            Specificity : 0.9138          
         Pos Pred Value : 0.9123          
         Neg Pred Value : 0.8983          
             Prevalence : 0.5000          
         Detection Rate : 0.4483          
   Detection Prevalence : 0.4914          
      Balanced Accuracy : 0.9052          
                                          
       'Positive' Class : High            
                                          
# Calculate accuracy and test error
accuracy <- conf_matrix$overall["Accuracy"]
test_error <- 1 - accuracy

cat("Accuracy on test set (K = 14):", round(accuracy, 3), "\n")
Accuracy on test set (K = 14): 0.905 
cat("Test Error Rate (K = 14):", round(test_error, 3), "\n")   
Test Error Rate (K = 14): 0.095 

Response:
The KNN model with K = 14 achieved a test accuracy of 0.905, slightly below the 0.914 obtained with the cross-validated choice of K = 20. Since the difference is small and K = 14 sits where the CV ROC curve levels off, I will use K = 14 for the rest of the analysis.

Problem 16

Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.

# Load the Boston dataset
data("Boston")

# Create a binary response: 1 if crim > median, else 0
median_crim <- median(Boston$crim)
Boston$crim01 <- factor(ifelse(Boston$crim > median_crim, "High", "Low"))

# Drop 'crim' to avoid leakage
Boston <- Boston %>% select(-crim)
# Split the data into training and test sets
set.seed(123)

train_index <- createDataPartition(Boston$crim01, p = 0.7, list = FALSE)
train_data <- Boston[train_index, ]
test_data <- Boston[-train_index, ]

# Check class balance
prop.table(table(train_data$crim01))

High  Low 
 0.5  0.5 
model_logit <- train(
  crim01 ~ nox + rad + dis + tax + lstat,
  data = train_data,
  method = "glm",
  family = "binomial",
  trControl = train_control,
  metric = "ROC"
)
model_lda <- train(
  crim01 ~ nox + rad + dis + tax + lstat,
  data = train_data,
  method = "lda",
  trControl = train_control,
  metric = "ROC"
)
model_nb <- train(
  crim01 ~ nox + rad + dis + tax + lstat,
  data = train_data,
  method = "naive_bayes",
  trControl = train_control,
  metric = "ROC"
)
knn_grid <- expand.grid(k = 1:20)

model_knn <- train(
  crim01 ~ nox + rad + dis + tax + lstat,
  data = train_data,
  method = "knn",
  tuneGrid = knn_grid,
  trControl = train_control,
  metric = "ROC"
)
# Predict and evaluate for each model
results <- list()

# Logistic Regression
pred_logit <- predict(model_logit, newdata = test_data)
cm_logit <- confusionMatrix(pred_logit, test_data$crim01, positive = "High")
results$Logistic <- cm_logit$overall["Accuracy"]

# LDA
pred_lda <- predict(model_lda, newdata = test_data)
cm_lda <- confusionMatrix(pred_lda, test_data$crim01, positive = "High")
results$LDA <- cm_lda$overall["Accuracy"]

# Naive Bayes
pred_nb <- predict(model_nb, newdata = test_data)
cm_nb <- confusionMatrix(pred_nb, test_data$crim01, positive = "High")
results$NaiveBayes <- cm_nb$overall["Accuracy"]

# KNN
pred_knn <- predict(model_knn, newdata = test_data)
cm_knn <- confusionMatrix(pred_knn, test_data$crim01, positive = "High")
results$KNN <- cm_knn$overall["Accuracy"]
# Convert results list to a data frame (unname() prevents the names of the
# unlisted vector from leaking into the table as row names)
accuracy_df <- data.frame(
  Model = names(results),
  Accuracy = round(unname(unlist(results)), 3)
)

# Display table
library(knitr)
kable(accuracy_df, caption = "Test Set Accuracy by Model")
Test Set Accuracy by Model
Model Accuracy
Logistic 0.927
LDA 0.913
NaiveBayes 0.900
KNN 0.953

Model Comparison Results

We evaluated four classification models to predict whether a Boston census tract has a crime rate above the median. Test set accuracies were as follows: KNN (95.3%), Logistic Regression (92.7%), LDA (91.3%), and Naive Bayes (90.0%). KNN performed the best overall, suggesting that neighborhood-based classification effectively captures the structure in the data. All models performed well, with only slight differences in accuracy.