Question 13.

This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
pairs(Weekly)

cor(Weekly[, -9]) 
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000
plot(Weekly$Year, Weekly$Volume,
     xlab = "Year",
     ylab = "Volume",
     main = "Trading Volume Over Time")

boxplot(Today ~ Direction,
        data = Weekly,
        main = "Today's Return by Market Direction")

Numerical and graphical summaries of the Weekly dataset reveal several interesting patterns. The correlation matrix shows generally weak correlations among the lag variables (Lag1–Lag5) and Today, suggesting that past weekly returns are not strongly related to future returns. However, Volume exhibits a noticeable positive correlation with Year, indicating that trading volume increased steadily over time.

The scatterplot matrix confirms the lack of strong linear relationships among the lag variables and weekly returns. Most pairs of variables appear widely dispersed with no obvious trends. In contrast, a plot of Volume versus Year shows a clear upward trend, reflecting growth in market trading activity between 1990 and 2010.

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

logistic_model <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, family = binomial)
summary(logistic_model)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

The logistic regression model suggests that only the return from two weeks prior (Lag2) has a statistically significant relationship with market direction, although the effect is relatively small.The coefficient for Lag2 is positive, indicating that an increase in the previous week’s return (Lag2) is associated with a higher probability that the market will move up in the current week.

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

# 2. Get predicted probabilities (returns values between 0 and 1)
probabilities <- predict(logistic_model, type = "response")

# 3. Convert probabilities to binary classes based on a 0.5 threshold
weekly.pred <- rep("Down", length(probabilities))
weekly.pred[probabilities > 0.5] <- "Up"

# 4. Generate the confusion matrix
conf_matrix <- table(weekly.pred,Actual = Weekly$Direction)
print(conf_matrix)
##            Actual
## weekly.pred Down  Up
##        Down   54  48
##        Up    430 557
# Overall accuracy
mean(weekly.pred == Weekly$Direction)
## [1] 0.5610652

The confusion matrix compares the predicted market direction to the actual market direction. The logistic regression model correctly classified 557 weeks in which the market moved up and 54 weeks in which the market moved down. Overall, the model correctly classified approximately 56.1% of the observations.

The confusion matrix reveals that the model predicts Up much more frequently than Down. While it correctly identifies many weeks in which the market increases, it performs poorly at identifying weeks in which the market decreases.

Overall, the model has limited predictive power and is biased toward predicting upward market movements, which may be partly due to the fact that the market moved up more often than down during the sample period based on early exploratory data analysis.

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

# Create training indicator
training_data <- Weekly$Year <= 2008

# Training and test sets
Weekly.train <- Weekly[training_data, ]
Weekly.test  <- Weekly[!training_data, ]

Logistic Regression

# Fit logistic regression
filtered_logistic_model <- glm(Direction ~ Lag2, data = Weekly.train,family = binomial)

# Predicted probabilities
probabilities <- predict(filtered_logistic_model,
                         newdata = Weekly.test,
                         type = "response"
                         )

# Convert probabilities to predictions
weekly.pred <- rep("Down", length(probabilities))
weekly.pred[probabilities > 0.5] <- "Up"

# Confusion matrix
conf_matrix <- table( Predicted = weekly.pred,Actual = Weekly.test$Direction)
print(conf_matrix)
##          Actual
## Predicted Down Up
##      Down    9  5
##      Up     34 56
# Overall accuracy
mean(weekly.pred == Weekly.test$Direction)
## [1] 0.625

(e) Repeat (d) using LDA.

Linear Discriminant Analysis (LDA)

library(MASS)

# Fit logistic regression
weekly.lda <- lda(Direction ~ Lag2, data = Weekly.train)

# Predictions on test set
lda.pred <- predict(weekly.lda,newdata = Weekly.test)

# Confusion matrix
conf_matrix <- table(Predicted = lda.pred$class,Actual = Weekly.test$Direction)

print(conf_matrix)
##          Actual
## Predicted Down Up
##      Down    9  5
##      Up     34 56
# Accuracy
mean(lda.pred$class == Weekly.test$Direction)
## [1] 0.625

(f) Repeat (d) using QDA.

Quadratic Discriminant Analysis (QDA)

# Fit QDA model using Lag2
weekly.qda <- qda(Direction ~ Lag2, data = Weekly.train)

# Predictions on test set
qda.pred <- predict(weekly.qda, newdata = Weekly.test)

# Confusion matrix
conf_matrix <- table(
  Predicted = qda.pred$class,
  Actual = Weekly.test$Direction
)

print(conf_matrix)
##          Actual
## Predicted Down Up
##      Down    0  0
##      Up     43 61
# Accuracy
mean(qda.pred$class == Weekly.test$Direction)
## [1] 0.5865385

(g) Repeat (d) using KNN with K = 1.

K-Nearest Neighbors (KNN)

library(class)

# Predictor matrices
train.X <- as.matrix(Weekly.train$Lag2)
test.X  <- as.matrix(Weekly.test$Lag2)

# KNN with K = 1
knn.pred <- knn(
  train = train.X,
  test = test.X,
  cl = Weekly.train$Direction,
  k = 1
)

# Confusion matrix
conf_matrix <- table(
  Predicted = knn.pred,
  Actual = Weekly.test$Direction
)

print(conf_matrix)
##          Actual
## Predicted Down Up
##      Down   21 30
##      Up     22 31
# Accuracy
mean(knn.pred == Weekly.test$Direction)
## [1] 0.5

(h) Repeat (d) using naive Bayes.

Naive Bayes

library(e1071)

# Fit Naive Bayes model
weekly.nb <- naiveBayes(
  Direction ~ Lag2,
  data = Weekly.train
)

# Predictions
nb.pred <- predict(
  weekly.nb,
  newdata = Weekly.test
)

# Confusion matrix
conf_matrix <- table(
  Predicted = nb.pred,
  Actual = Weekly.test$Direction
)

print(conf_matrix)
##          Actual
## Predicted Down Up
##      Down    0  0
##      Up     43 61
# Accuracy
mean(nb.pred == Weekly.test$Direction)
## [1] 0.5865385

(i) Which of these methods appears to provide the best results on this data?

Model Comparison

Method Accuracy
Logistic Regression 62.5%
LDA 62.5%
QDA 58.7%
KNN (K = 1) 50.0%
Naive Bayes 62.5%

Comparison of classification methods for predicting market direction in the Weekly dataset using Lag2 as the sole predictor. Logistic Regression, LDA, and Naive Bayes achieved the highest accuracy (62.5%), while QDA performed slightly worse (58.7%). KNN with (K = 1) had the lowest accuracy (50.0%), indicating performance comparable to random guessing.

(j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

Experiment Models

library(MASS)
library(e1071)
library(class)
library(dplyr)

results <- data.frame(
  Method = character(),
  Predictors = character(),
  Accuracy = numeric(),
  Test_Error = numeric(),
  stringsAsFactors = FALSE
)

#-----------------------------
# Logistic Regression
#-----------------------------
log_model <- glm(
  Direction ~ Lag1 + Lag2 + Volume,
  data = Weekly.train,
  family = binomial
)

log_prob <- predict(log_model, Weekly.test, type = "response")
log_pred <- ifelse(log_prob > 0.5, "Up", "Down")
log_pred <- factor(log_pred, levels = levels(Weekly.test$Direction))

log_acc <- mean(log_pred == Weekly.test$Direction)

results <- rbind(
  results,
  data.frame(
    Method = "Logistic Regression",
    Predictors = "Lag1 + Lag2 + Volume",
    Accuracy = log_acc,
    Test_Error = 1 - log_acc
  )
)

#-----------------------------
# LDA
#-----------------------------
lda_model <- lda(
  Direction ~ Lag1 + Lag2 + Volume,
  data = Weekly.train
)

lda_pred <- predict(lda_model, Weekly.test)$class

lda_acc <- mean(lda_pred == Weekly.test$Direction)

results <- rbind(
  results,
  data.frame(
    Method = "LDA",
    Predictors = "Lag1 + Lag2 + Volume",
    Accuracy = lda_acc,
    Test_Error = 1 - lda_acc
  )
)

#-----------------------------
# QDA
#-----------------------------
qda_model <- qda(
  Direction ~ Lag1 + Lag2 + Volume,
  data = Weekly.train
)

qda_pred <- predict(qda_model, Weekly.test)$class

qda_acc <- mean(qda_pred == Weekly.test$Direction)

results <- rbind(
  results,
  data.frame(
    Method = "QDA",
    Predictors = "Lag1 + Lag2 + Volume",
    Accuracy = qda_acc,
    Test_Error = 1 - qda_acc
  )
)

#-----------------------------
# Naive Bayes
#-----------------------------
nb_model <- naiveBayes(
  Direction ~ Lag1 + Lag2 + Volume,
  data = Weekly.train
)

nb_pred <- predict(nb_model, Weekly.test)

nb_acc <- mean(nb_pred == Weekly.test$Direction)

results <- rbind(
  results,
  data.frame(
    Method = "Naive Bayes",
    Predictors = "Lag1 + Lag2 + Volume",
    Accuracy = nb_acc,
    Test_Error = 1 - nb_acc
  )
)

#-----------------------------
# KNN (K = 5)
#-----------------------------
train.X <- Weekly.train[, c("Lag1", "Lag2", "Volume")]
test.X  <- Weekly.test[, c("Lag1", "Lag2", "Volume")]

train.X <- scale(train.X)
test.X <- scale(
  test.X,
  center = attr(train.X, "scaled:center"),
  scale = attr(train.X, "scaled:scale")
)

knn_pred <- knn(
  train = train.X,
  test = test.X,
  cl = Weekly.train$Direction,
  k = 5
)

knn_acc <- mean(knn_pred == Weekly.test$Direction)

results <- rbind(
  results,
  data.frame(
    Method = "KNN (K = 5)",
    Predictors = "Lag1 + Lag2 + Volume",
    Accuracy = knn_acc,
    Test_Error = 1 - knn_acc
  )
)

results
##                Method           Predictors  Accuracy Test_Error
## 1 Logistic Regression Lag1 + Lag2 + Volume 0.5288462  0.4711538
## 2                 LDA Lag1 + Lag2 + Volume 0.5288462  0.4711538
## 3                 QDA Lag1 + Lag2 + Volume 0.4615385  0.5384615
## 4         Naive Bayes Lag1 + Lag2 + Volume 0.4230769  0.5769231
## 5         KNN (K = 5) Lag1 + Lag2 + Volume 0.5673077  0.4326923
knitr::kable(
  results,
  digits = 3,
  caption = "Comparison of Classification Models"
)
Comparison of Classification Models
Method Predictors Accuracy Test_Error
Logistic Regression Lag1 + Lag2 + Volume 0.529 0.471
LDA Lag1 + Lag2 + Volume 0.529 0.471
QDA Lag1 + Lag2 + Volume 0.462 0.538
Naive Bayes Lag1 + Lag2 + Volume 0.423 0.577
KNN (K = 5) Lag1 + Lag2 + Volume 0.567 0.433

The above table summarizes the performance of the classification models using the predictors Lag1, Lag2, and Volume. Among the models evaluated, the K-Nearest Neighbors (KNN) classifier with \(K = 5\) achieved the highest test accuracy of 56.7%, corresponding to a test error of 43.3%. Logistic Regression and LDA produced identical results, each achieving an accuracy of 52.9% and a test error of 47.1%. QDA and Naive Bayes performed worse, with accuracies of 46.2% and 42.3%, respectively.

Although KNN (\(K = 5\)) was the best-performing model among those using the predictors Lag1, Lag2, and Volume, its performance was still inferior to the simpler models developed earlier using only Lag2 as the predictor. The earlier Logistic Regression, LDA, and Naive Bayes models achieved approximately 62.5% accuracy, suggesting that adding Lag1 and Volume introduced additional noise rather than improving predictive performance.

Question 14.

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto Data set.

attach(Auto)

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

Auto_binary <- Auto |>
  transform(mpg01 = ifelse(mpg > median(mpg), 1, 0))

(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

pairs(Auto_binary)

# Boxplots
par(mfrow = c(2, 3))

boxplot(cylinders ~ mpg01, data = Auto_binary,
        main = "Cylinders vs mpg01")

boxplot(displacement ~ mpg01, data = Auto_binary,
        main = "Displacement vs mpg01")

boxplot(horsepower ~ mpg01, data = Auto_binary,
        main = "Horsepower vs mpg01")

boxplot(weight ~ mpg01, data = Auto_binary,
        main = "Weight vs mpg01")

boxplot(acceleration ~ mpg01, data = Auto_binary,
        main = "Acceleration vs mpg01")

boxplot(year ~ mpg01, data = Auto_binary,
        main = "Year vs mpg01")

Exploratory analysis of the data suggests that several variables are strongly associated with mpg01. In particular, vehicles with higher fuel efficiency (mpg01 = 1) tend to have fewer cylinders, lower engine displacement, lower horsepower, and lower weight than vehicles with lower fuel efficiency (mpg01 = 0). The boxplots for these variables show clear separation between the two groups, indicating that they may be useful predictors of fuel efficiency classification.

The variable year also appears to be an important predictor. Vehicles classified as having high fuel efficiency tend to be from more recent model years, suggesting that improvements in automotive technology have contributed to higher fuel economy over time. In contrast, acceleration shows substantial overlap between the two groups and appears to be a weaker predictor. Overall, the variables cylinders, displacement, horsepower, weight, and year appear most likely to be useful for predicting mpg01.

(c) Split the data into a training set and a test set.

set.seed(42)

library(modelr) 
# Partition data: 75% Training, 25% Testing
partitions <- resample_partition(Auto_binary, c(train = 0.75, test = 0.25))

# Convert partitions to data frames
train_df <- as.data.frame(partitions$train)
test_df  <- as.data.frame(partitions$test)

(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

Linear Discriminant Analysis (LDA)

library(MASS)

# Fit LDA model
lda_model <- lda(
  mpg01 ~ cylinders + displacement + horsepower + weight + year,
  data = train_df
)

# Predict on the test set
lda_pred <- predict(lda_model, newdata = test_df)

# Confusion matrix
(conf_matrix <- table(
  Predicted = lda_pred$class,
  Actual = test_df$mpg01
))
##          Actual
## Predicted  0  1
##         0 43  3
##         1  5 48
# Test accuracy
accuracy <- mean(lda_pred$class == test_df$mpg01)


# Test error
test_error <- 1 - accuracy

cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.9191919
cat("Test Error:", test_error, "\n")
## Test Error: 0.08080808

The model achieved a test accuracy of 91.8%, corresponding to a test error of 8.2%. These results suggest that the selected predictors provide strong discriminatory power for classifying vehicles with above- and below-median fuel efficiency.

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

Quadratic Discriminant Analysis (QDA)

library(MASS)

# Fit QDA model
qda_model <- qda(
  mpg01 ~ cylinders + displacement + horsepower + weight + year,
  data = train_df
)

# Predict on the test set
qda_pred <- predict(qda_model, newdata = test_df)

# Confusion matrix
(conf_matrix <- table(
  Predicted = qda_pred$class,
  Actual = test_df$mpg01
))
##          Actual
## Predicted  0  1
##         0 42  5
##         1  6 46
# Test accuracy
accuracy <- mean(qda_pred$class == test_df$mpg01)


# Test error

cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.8888889
cat("Test Error:", test_error, "\n")
## Test Error: 0.08080808

The QDA model achieved a test accuracy of 88.9%, corresponding to a test error of 11.1%. These results indicate that the selected predictors (cylinders, displacement, horsepower, weight, and year) are effective in distinguishing between vehicles with above- and below-median fuel efficiency. Although the model performed well, its accuracy was slightly lower than that of the LDA model, suggesting that LDA provides a marginally better fit for this classification problem.

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

Logistic Regression

# Fit logistic regression model
logistic_model <- glm(
  mpg01 ~ cylinders + displacement + horsepower + weight + year,
  data = train_df,
  family = binomial
)

# Predicted probabilities
probabilities <- predict(
  logistic_model,
  newdata = test_df,
  type = "response"
)

# Convert probabilities to class predictions
logistic_pred <- ifelse(probabilities > 0.5, 1, 0)

# Confusion matrix
(conf_matrix <- table(
  Predicted = logistic_pred,
  Actual = test_df$mpg01
))
##          Actual
## Predicted  0  1
##         0 43  4
##         1  5 47
# Test accuracy
accuracy <- mean(logistic_pred == test_df$mpg01)


# Test error
test_error <- 1 - accuracy

cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.9090909
cat("Test Error:", test_error, "\n")
## Test Error: 0.09090909

The logistic regression model achieved a test accuracy of 90.9%, corresponding to a test error of 9.1%. These results indicate that the selected predictors provide strong predictive power for classifying vehicles into above- and below-median fuel efficiency groups. The low test error suggests that logistic regression is an effective classification method for this dataset.

(g) Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

Naive Bayes

library(e1071)

# Fit Naive Bayes model
nb_model <- naiveBayes(
  mpg01 ~ cylinders + displacement + horsepower + weight + year,
  data = train_df
)

# Predict on the test set
nb_pred <- predict(
  nb_model,
  newdata = test_df
)

# Confusion matrix
(conf_matrix <- table(
  Predicted = nb_pred,
  Actual = test_df$mpg01
))
##          Actual
## Predicted  0  1
##         0 45  4
##         1  3 47
# Test accuracy
accuracy <- mean(nb_pred == test_df$mpg01)


# Test error
test_error <- 1 - accuracy

cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.9292929
cat("Test Error:", test_error, "\n")
## Test Error: 0.07070707

The Naive Bayes classifier achieved a test accuracy of 92.9%, corresponding to a test error of 7.1%. These results indicate that the selected predictors provide strong predictive power for classifying vehicles into above- and below-median fuel efficiency groups. Among the classification methods evaluated, Naive Bayes achieved the highest accuracy and the lowest test error, making it the best-performing model on the test data.

(h) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

K-Nearest Neighbors (KNN)

library(class)

# Select predictors
train.X <- train_df[, c("cylinders", "displacement",
                        "horsepower", "weight", "year")]

test.X <- test_df[, c("cylinders", "displacement",
                      "horsepower", "weight", "year")]

# Standardize predictors using the training set statistics
train.X.scaled <- scale(train.X)

test.X.scaled <- scale(
  test.X,
  center = attr(train.X.scaled, "scaled:center"),
  scale = attr(train.X.scaled, "scaled:scale")
)

# Response variables
train.Y <- train_df$mpg01
test.Y <- test_df$mpg01

# Candidate values of K
k.values <- c(1, 3, 5, 7, 9, 11, 15, 21)

results <- data.frame(
  K = integer(),
  Accuracy = numeric(),
  Test_Error = numeric()
)

for (k in k.values) {

  pred <- knn(
    train = train.X.scaled,
    test = test.X.scaled,
    cl = train.Y,
    k = k
  )

  accuracy <- mean(pred == test.Y)

  results <- rbind(
    results,
    data.frame(
      K = k,
      Accuracy = accuracy,
      Test_Error = 1 - accuracy
    )
  )
}

results
##    K  Accuracy Test_Error
## 1  1 0.9393939 0.06060606
## 2  3 0.9696970 0.03030303
## 3  5 0.9494949 0.05050505
## 4  7 0.9494949 0.05050505
## 5  9 0.9292929 0.07070707
## 6 11 0.9292929 0.07070707
## 7 15 0.9393939 0.06060606
## 8 21 0.9292929 0.07070707

The KNN classifier with \(K = 3\) outperformed all of the other classification methods evaluated, including Logistic Regression, LDA, QDA, and Naive Bayes, making it the most accurate model for predicting whether a vehicle’s fuel efficiency was above or below the median.

Model Comparison

Method Test Accuracy Test Error
KNN (\(K = 3\)) 96.97% 3.03%
Naive Bayes 92.93% 7.07%
LDA 91.92% 8.08%
Logistic Regression 90.91% 9.09%
QDA 88.89% 11.11%

Question 16.

Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings. Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.

attach(Boston)

Create the response variable

Boston$crim01 <- factor(ifelse(
  Boston$crim > median(Boston$crim),
  1, 0
))

Exploratory Data Analysis of Boston Data set

pairs(Boston)

library(tidyverse)

Boston |>
  pivot_longer(
    cols = -c(crim, crim01),
    names_to = "Variable",
    values_to = "Value"
  ) |>
  ggplot(aes(x = crim01, y = Value, fill = crim01)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~Variable, scales = "free", ncol = 4) +
  labs(
    x = "Crime Rate Group (crim01)",
    y = "Value",
    title = "Predictor Distributions by Crime Rate Classification"
  ) +
  theme_minimal()

Split Data

set.seed(42)

library(modelr)

partitions <- resample_partition(
  Boston,
  c(train = 0.75, test = 0.25)
)

train_df <- as.data.frame(partitions$train)
test_df  <- as.data.frame(partitions$test)

Logistic Regression

log_model <- glm(
  crim01 ~ rad + tax + lstat + nox + dis,
  data = train_df,
  family = binomial
)

prob <- predict(
  log_model,
  newdata = test_df,
  type = "response"
)

pred <- factor(ifelse(prob > 0.5, 1, 0))

table(pred, test_df$crim01)
##     
## pred  0  1
##    0 55 10
##    1  7 55
accuracy <- mean(pred == test_df$crim01)

test_error <- 1 - accuracy

cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.8661417
cat("Test Error:", test_error, "\n")
## Test Error: 0.1338583

Linear Discriminant Analysis (LDA)

library(MASS)

lda_model <- lda(
  crim01 ~ rad + tax + lstat + nox + dis,
  data = train_df
)

lda_pred <- predict(
  lda_model,
  newdata = test_df
)

table(lda_pred$class, test_df$crim01)
##    
##      0  1
##   0 62 15
##   1  0 50
accuracy <- mean(lda_pred$class == test_df$crim01)

test_error <- 1 - accuracy

cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.8818898
cat("Test Error:", test_error, "\n")
## Test Error: 0.1181102

Naive Bayes

library(e1071)

nb_model <- naiveBayes(
  crim01 ~ rad + tax + lstat + nox + dis,
  data = train_df
)

nb_pred <- predict(
  nb_model,
  newdata = test_df
)

table(nb_pred, test_df$crim01)
##        
## nb_pred  0  1
##       0 60 14
##       1  2 51
accuracy <- mean(nb_pred == test_df$crim01)

test_error <- 1 - accuracy

cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.8740157
cat("Test Error:", test_error, "\n")
## Test Error: 0.1259843

K-Nearest Neighbors (KNN)

library(class)

train.X <- train_df[, c(
  "rad",
  "tax",
  "lstat",
  "nox",
  "dis"
)]

test.X <- test_df[, c(
  "rad",
  "tax",
  "lstat",
  "nox",
  "dis"
)]

train.X <- scale(train.X)

test.X <- scale(
  test.X,
  center = attr(train.X, "scaled:center"),
  scale = attr(train.X, "scaled:scale")
)

train.Y <- train_df$crim01
test.Y <- test_df$crim01

k.values <- c(1,3,5,7,9,11,15,21)

results <- data.frame()

for(k in k.values){

  pred <- knn(
    train.X,
    test.X,
    train.Y,
    k = k
  )

  acc <- mean(pred == test.Y)

  results <- rbind(
    results,
    data.frame(
      K = k,
      Accuracy = acc,
      Test_Error = 1 - acc
    )
  )
}

results
##    K  Accuracy Test_Error
## 1  1 0.9448819 0.05511811
## 2  3 0.9291339 0.07086614
## 3  5 0.9370079 0.06299213
## 4  7 0.9133858 0.08661417
## 5  9 0.9133858 0.08661417
## 6 11 0.9133858 0.08661417
## 7 15 0.9212598 0.07874016
## 8 21 0.9291339 0.07086614

Model Comparison

Model Accuracy Test Error
Logistic Regression 86.61% 13.39%
LDA 88.19% 11.81%
Naive Bayes 87.40% 12.60%
KNN (K = 1) 94.49% 5.51%
KNN (K = 3) 92.91% 7.09%
KNN (K = 5) 93.70% 6.30%
KNN (K = 7) 91.34% 8.66%
KNN (K = 9) 91.34% 8.66%
KNN (K = 11) 91.34% 8.66%
KNN (K = 15) 92.13% 7.87%
KNN (K = 21) 92.91% 7.09%

Based on the results, KNN with \(K = 1\) performed the best, achieving the highest accuracy of 94.49% and the lowest test error of 5.51%. Among the non-KNN methods, LDA performed best with an accuracy of 88.19% and a test error of 11.81%. Overall, the KNN models produced lower test errors than Logistic Regression, LDA, and Naive Bayes for this dataset.