ISLR Chapter 04 (page 189): 13, 14, 16

Assignment 3

Q-13.

This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

library(ISLR2)
data(Weekly)

(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

str(Weekly)
## 'data.frame':    1089 obs. of  9 variables:
##  $ Year     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
##  $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
##  $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
##  $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
##  $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
##  $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
##  $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.5.2
tbl_summary(Weekly)
Characteristic N = 1,0891
Year 2,000.0 (1,995.0, 2,005.0)
Lag1 0.24 (-1.15, 1.41)
Lag2 0.24 (-1.15, 1.41)
Lag3 0.24 (-1.16, 1.41)
Lag4 0.24 (-1.16, 1.41)
Lag5 0.23 (-1.17, 1.41)
Volume 1.00 (0.33, 2.05)
Today 0.24 (-1.15, 1.41)
Direction
    Down 484 (44%)
    Up 605 (56%)
1 Median (Q1, Q3); n (%)
library(GGally)
## Warning: package 'GGally' was built under R version 4.5.2
ggpairs(Weekly, columns = 1:8) 

> There is no distinct pattern within variables, they all seem to be scattered randomly however there is a clear positive correlation between volume and year, showing that volume consistently grows over time.

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

full_model <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(full_model)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Yes, only Lag2 variable seem to be statistically significant with coefficiency value 0.05844 and p-value is 0.0296 suggesting that holding all other variables constant, a one-unit increase in the Lag2 increases the log-odds of the market moving Up by 0.05844 exponentiating this value \(e^{0.05844} \approx 1.060\) reveals that for every one-unit increase in Lag2, the odds of the market going up increase by approximately 6%

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

pred_probs <- predict(full_model, type = "response")

pred_labels <- ifelse(pred_probs > 0.5, "Up", "Down")

conf_matrix <- table(pred_labels, Weekly$Direction)

print(conf_matrix)
##            
## pred_labels Down  Up
##        Down   54  48
##        Up    430 557

Our full model predicted down actually down is 54 and predicted up and actually up 557. Our model predicted up but actually down is 48, these are our False ups. Model predicted down but actually up is 430, these are False downs. we can take a look at this by ratio for better grasp.

Our model accuracy is \((54+557)/1089 = 56.11\%\). Our model has a strong capability to reliable up predictions but fails heavily on down predictions.

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

train_data <- subset(Weekly, Year >= 1990 & Year <= 2008)
test_data  <- subset(Weekly, Year >= 2009 & Year <= 2010)

model_lag2 <- glm(Direction ~ Lag2, family = binomial, data = train_data)

pred_probs <- predict(model_lag2, newdata = test_data, type = "response")

pred_labels <- ifelse(pred_probs > 0.5, "Up", "Down")
pred_labels <- factor(pred_labels, levels = c("Down", "Up"))

actual_labels <- test_data$Direction
conf_matrix_lag2 <- table(Predicted = pred_labels, Actual = actual_labels)
print(conf_matrix_lag2)
##          Actual
## Predicted Down Up
##      Down    9  5
##      Up     34 56

By using only Lag2 variable, our new model correctly predicted the market’s direction 62.5% of the time (65 out of 104 weeks). This is a noticeable improvement over our first model. However, the model still has a strong habit of guessing up too often, which means it accurately caught most upward weeks but missed many downward weeks.

(e) Repeat (d) using LDA.

library(MASS)
## Warning: package 'MASS' was built under R version 4.5.3
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:gtsummary':
## 
##     select
## The following object is masked from 'package:ISLR2':
## 
##     Boston
lda_model <- lda(Direction ~ Lag2, data = train_data)

pred_lda <- predict(lda_model, newdata = test_data)

lda_labels <- pred_lda$class

conf_matrix_lda <- table(Predicted = lda_labels, Actual = test_data$Direction)

print(conf_matrix_lda)
##          Actual
## Predicted Down Up
##      Down    9  5
##      Up     34 56

LDA model prediction is identical to our previous model_lag2 prediction.

(f) Repeat (d) using QDA.

qda_model <- qda(Direction ~ Lag2, data = train_data)

pred_qda <- predict(qda_model, newdata = test_data)

qda_labels <- pred_qda$class

conf_matrix_qda <- table(Predicted = qda_labels, Actual = test_data$Direction)

print(conf_matrix_qda)
##          Actual
## Predicted Down Up
##      Down    0  0
##      Up     43 61

Based on our QDA model’s confusion matrix above, the model correctly predicted the market’s direction 58.65% of the time (61 out of 104 weeks). Our model correctly flagged 61 Up weeks but missed 43 of them. Also our model completely failed to predict any downward movement, resulting in a score of zero for all Down weeks.

(g) Repeat (d) using KNN with K =1.

library(class)


x_train <- data.frame(train_data$Lag2)
x_test  <- data.frame(test_data$Lag2)

y_train <- train_data$Direction

set.seed(210)

knn_pred <- knn(train = x_train, test = x_test, cl = y_train, k = 1)

conf_matrix_knn <- table(Predicted = knn_pred, Actual = test_data$Direction)

print(conf_matrix_knn)
##          Actual
## Predicted Down Up
##      Down   21 29
##      Up     22 32

Based on our KNN (K=1) model’s confusion matrix, the model correctly predicted the market’s direction 50.96% of the time (53 out of 104 weeks). It accurately captured 21 downward weeks and 32 upward weeks. However, model’s accuracy almost same as coin flip and it performs much worse than our previous models.

(h) Repeat (d) using naive Bayes.

library(e1071)
## Warning: package 'e1071' was built under R version 4.5.2
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:ggplot2':
## 
##     element
nbayes_model <- naiveBayes(Direction ~ Lag2, data = train_data)

nb_class <- predict(nbayes_model, newdata = test_data)

conf_matrix_nb <- table(Predicted = nb_class, Actual = test_data$Direction)

print(conf_matrix_nb)
##          Actual
## Predicted Down Up
##      Down    0  0
##      Up     43 61

Our Naive Bayes model performed identically to our previous QDA model with accuracy of 58.65% (61 out of 104 weeks).

(i) Which of these methods appears to provide the best results on this data?

As result, lda_model and model_lag2 performed the best with the accuray of 62.5%.

(j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

First lets try all our variables besides Today variable since it is directly related to trend of our response variable.

model_try1 <- glm(Direction ~ Lag1+Lag2+Lag3+Lag4+Lag5+Year+Volume, family = binomial, data = train_data)

pred_probs <- predict(model_try1, newdata = test_data, type = "response")

pred_labels <- ifelse(pred_probs > 0.5, "Up", "Down")
pred_labels <- factor(pred_labels, levels = c("Down", "Up"))

actual_labels <- test_data$Direction
conf_matrix_try1 <- table(Predicted = pred_labels, Actual = actual_labels)
print(conf_matrix_try1)
##          Actual
## Predicted Down Up
##      Down   30 44
##      Up     13 17

Accuracy for new model is still below than our best performed qda_model and model_lag2. Next, I will try different combinations of KNN models with different K values. On this experiment I will use for loop to detect best K value and prediction accuracy between all K values between 1 to 100.

library(class)

k_values <- 1:100
accuracies <- numeric(length(k_values))

for (i in seq_along(k_values)) 
  {
  set.seed(210) 
  pred <- knn(train = x_train, test = x_test, cl = y_train, k = k_values[i])
  accuracies[i] <- mean(pred == test_data$Direction)
  }

knn_results <- data.frame(K = k_values, Accuracy = accuracies)

plot(knn_results$K, knn_results$Accuracy, type = "b", 
     xlab = "Number of Neighbors (K)", ylab = "Test Accuracy",
     main = "Finding best K Value for KNN",
     col = "blue", pch = 16)

best_row <- knn_results[which.max(knn_results$Accuracy), ]
abline(v = best_row$K, col = "red", lty = 2)

cat("Best K Value:", best_row$K, "\n")
## Best K Value: 47
cat("Highest Accuracy Score:", round(best_row$Accuracy, 4), "\n\n")
## Highest Accuracy Score: 0.6154
set.seed(210)

knn_pred_last <- knn(train = x_train, test = x_test, cl = y_train, k = 47)

conf_matrix_knn_last <- table(Predicted = knn_pred_last, Actual = test_data$Direction)

print(conf_matrix_knn_last)
##          Actual
## Predicted Down Up
##      Down   23 20
##      Up     20 41

Even best K value for KNN model performed slight worse than our best performed models.

Q-14.

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

data(Auto)

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

median(Auto$mpg)
## [1] 22.75
mpg01 <- ifelse(Auto$mpg > median(Auto$mpg), 1, 0)

Auto_new <- data.frame(Auto, mpg01)

head(Auto_new)

(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatter plots and box plots may be useful tools to answer this question. Describe your findings.

par(mfrow = c(2, 3))


boxplot(cylinders ~ mpg01, data = Auto_new, main = "Cylinders vs mpg01", xlab = "mpg01", ylab = "Cylinders", col = "lightblue")
boxplot(displacement ~ mpg01, data = Auto_new, main = "Displacement vs mpg01", xlab = "mpg01", ylab = "Displacement", col = "lightgreen")
boxplot(horsepower ~ mpg01, data = Auto_new, main = "Horsepower vs mpg01", xlab = "mpg01", ylab = "Horsepower", col = "lightpink")
boxplot(weight ~ mpg01, data = Auto_new, main = "Weight vs mpg01", xlab = "mpg01", ylab = "Weight", col = "lightyellow")
boxplot(acceleration ~ mpg01, data = Auto_new, main = "Acceleration vs mpg01", xlab = "mpg01", ylab = "Acceleration", col = "lavender")
boxplot(year ~ mpg01, data = Auto_new, main = "Year vs mpg01", xlab = "mpg01", ylab = "Year", col = "lightgray")

> Based on our graphical analysis above, weight, displacement, horsepower, and cylinders show as the most useful features for predicting mpg01. Vehicles with above median fuel efficiency are strongly associated with lower weight, smaller engine displacements, lower horsepower, and fewer cylinders. On the other hand, year shows only a moderate positive association as newer cars lean toward higher efficiency.

(c) Split the data into a training set and a test set.

set.seed(210)

train_size <- floor(0.5 * nrow(Auto_new))

train_part <- sample(seq_len(nrow(Auto_new)), size = train_size)

train_set <- Auto_new[train_part, ]
test_set  <- Auto_new[-train_part, ]

dim(train_set)
## [1] 196  10
dim(test_set)
## [1] 196  10

(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

lda_auto_model <- lda(mpg01 ~ cylinders+displacement+horsepower+weight, data = train_set)
pred_lda_auto_model <- predict(lda_auto_model, newdata = test_set)

lda_auto_labels <- pred_lda_auto_model$class

conf_matrix_lda_auto <- table(Predicted = lda_auto_labels, Actual = test_set$mpg01)

print(conf_matrix_lda_auto)
##          Actual
## Predicted  0  1
##         0 80  8
##         1 18 90
error_rate_lda_auto <- 1 - sum(diag(conf_matrix_lda_auto)) / sum(conf_matrix_lda_auto)
error_rate_lda_auto
## [1] 0.1326531

Our model performed very well with hint of diagnostic plots variables. Our prediction accuracy 87% and our error rate %13.

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

qda_auto_model <- qda(mpg01 ~ cylinders+displacement+horsepower+weight, data = train_set)

pred_qda_auto <- predict(qda_auto_model, newdata=test_set)

qda_auto_labels <- pred_qda_auto$class

conf_matrix_qda_auto <- table(Predicted = qda_auto_labels, Actual=test_set$mpg01)

print(conf_matrix_qda_auto)
##          Actual
## Predicted  0  1
##         0 84 12
##         1 14 86
error_rate_qda_auto <- 1 - sum(diag(conf_matrix_qda_auto)) / sum(conf_matrix_qda_auto)
error_rate_qda_auto
## [1] 0.1326531

our QDA model performed with the same accuracy and error rate with lda model however, our True pozitives are less while our true negatives are more than lda model.

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

log_auto_model <- glm(mpg01 ~ cylinders+displacement+horsepower+weight, data = train_set, family = binomial)

pred_log_auto <- predict(log_auto_model, newdata = test_set, type = "response")

log_auto_labels <- ifelse(pred_log_auto > 0.5, 1, 0)

conf_matrix_log_auto <- table(Predicted = log_auto_labels, Actual = test_set$mpg01)
print(conf_matrix_log_auto)
##          Actual
## Predicted  0  1
##         0 81 10
##         1 17 88
error_rate_log_auto <- 1 - sum(diag(conf_matrix_log_auto)) / sum(conf_matrix_log_auto)
error_rate_log_auto
## [1] 0.1377551

Our log model performed slight worse than our qda model with the accuracy of 86.22% and the error rate is 13.78%

(g) Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

nb_auto_model <- naiveBayes(mpg01 ~ cylinders+displacement+horsepower+weight, data = train_set)

pred_nb_auto <- predict(nb_auto_model, newdata=test_set)

nb_auto_labels <- pred_nb_auto

conf_matrix_nb_auto <- table(Predicted = nb_auto_labels, Actual=test_set$mpg01)

print(conf_matrix_nb_auto)
##          Actual
## Predicted  0  1
##         0 81 10
##         1 17 88
error_rate_nb_auto <- 1 - sum(diag(conf_matrix_nb_auto)) / sum(conf_matrix_nb_auto)
error_rate_nb_auto
## [1] 0.1377551

Our naive Bayes model performed identical to our log_auto_model with the accuracy of 86.22% and the error rate is 13.78%

(h) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

predictors <- c("cylinders", "displacement", "horsepower", "weight")

k_values_auto <- 1:100
accuracies <- numeric(length(k_values_auto))

for (i in seq_along(k_values_auto)) {
  set.seed(210) 
  pred <- knn(train = train_set[, predictors], 
              test = test_set[, predictors], 
              cl = train_set$mpg01, 
              k = k_values_auto[i])
  
  accuracies[i] <- mean(pred == test_set$mpg01)}

knn_results <- data.frame(K = k_values_auto, Accuracy = accuracies)

plot(knn_results$K, knn_results$Accuracy, type = "b", 
     xlab = "Number of Neighbors (K)", ylab = "Test Accuracy", 
     main = "Finding best K Value for KNN", col = "blue", pch = 16)

best_row <- knn_results[which.max(knn_results$Accuracy), ]
abline(v = best_row$K, col = "red", lty = 2)

print(best_row)
##   K  Accuracy
## 3 3 0.8520408
set.seed(210)

knn_pred_auto <- knn(train = train_set[, c("cylinders", "displacement", "horsepower", "weight")], 
                     test = test_set[, c("cylinders", "displacement", "horsepower", "weight")], 
                     cl = train_set$mpg01, k = 3)

conf_matrix_knn_auto <- table(Predicted = knn_pred_auto, Actual = test_set$mpg01)
print(conf_matrix_knn_auto)
##          Actual
## Predicted  0  1
##         0 76  7
##         1 22 91
error_rate_knn_auto <- 1 - sum(diag(conf_matrix_knn_last)) / sum(conf_matrix_knn_last)
error_rate_knn_auto
## [1] 0.3846154

even the best knn value performed worse than out previous models with the accuracy of 85.20% with the error rate 14.79%.

Q-16.

Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.

Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.

First we have to load Boston dataset into R.

data("Boston")

Next, we will split dataset as test and train then we detect some predictors for our model tryings.

set.seed(210)

median_crim <- median(Boston$crim)
Boston$crim_high <- ifelse(Boston$crim > median_crim, 1, 0)

boston_train_idx <- sample(1:nrow(Boston), 0.5 * nrow(Boston))
boston_train_set <- Boston[boston_train_idx, ]
boston_test_set  <- Boston[-boston_train_idx, ]

predictors <- c("nox", "rad", "tax", "age")
boston_formula_sub <- as.formula(paste("crim_high ~", paste(predictors, collapse = " + ")))

I have selected the following variables as primary predictors on first model; nox (pollution), rad (highway access), tax (property tax), age (older homes) now I will start with fitting logistic regressin model first.

boston_log_model <- glm(boston_formula_sub, data = boston_train_set, family = binomial)
boston_log_probs <- predict(boston_log_model, boston_test_set, type = "response")
boston_log_preds <- ifelse(boston_log_probs > 0.5, 1, 0)

boston_log_error <- mean(boston_log_preds != boston_test_set$crim_high)
cat("Logistic Regression Test Error:", boston_log_error, "\n")
## Logistic Regression Test Error: 0.1304348

Our logistic model’s error rate is 13.04% means log model accurately classified whether a census tract is above or below the median crime rate roughly 87% of the time using 50/50 split. Now, lets try lda model.

boston_lda_model <- lda(boston_formula_sub, data = boston_train_set)
boston_lda_preds <- predict(boston_lda_model, boston_test_set)$class

boston_lda_error <- mean(boston_lda_preds != boston_test_set$crim_high)
cat("LDA Test Error:", boston_lda_error, "\n")
## LDA Test Error: 0.1581028

Our lda model’s error rate is 15.18% means our lda model accurately classified whether a census tract is above or below the median crime rate roughly 84% of the time using 50/50 split. Now, lets continue with next model.

boston_nb_model <- naiveBayes(boston_formula_sub, data = boston_train_set)
boston_nb_preds <- predict(boston_nb_model, boston_test_set)

boston_nb_error <- mean(boston_nb_preds != boston_test_set$crim_high)
cat("Naive Bayes Test Error:", boston_nb_error, "\n")
## Naive Bayes Test Error: 0.1699605

Naive Bayes error rate is nearly 17% percent means that model accurately classified whether a census tract is above or below the median crime rate roughly 83% of the time using 50/50 split. Log model is winning right now but lets take a look at KNN model.

boston_scale_train <- scale(boston_train_set[, predictors])
boston_scale_test  <- scale(boston_test_set[, predictors], 
                            center = attr(boston_scale_train, "scaled:center"), 
                            scale = attr(boston_scale_train, "scaled:scale"))
boston_k_values <- 1:100
boston_accuracies <- numeric(length(boston_k_values))

for (i in seq_along(boston_k_values)) {
  set.seed(210)
  boston_pred <- knn(train = boston_scale_train, 
                     test = boston_scale_test, 
                     cl = boston_train_set$crim_high, 
                     k = boston_k_values[i])
  boston_accuracies[i] <- mean(boston_pred == boston_test_set$crim_high)}

boston_knn_results <- data.frame(K = boston_k_values, Accuracy = boston_accuracies)

plot(boston_knn_results$K, boston_knn_results$Accuracy, type = "b", 
     xlab = "Number of Neighbors (K)", ylab = "Test Accuracy", 
     main = "Finding Best K Value for Boston KNN Model", 
     col = "blue", pch = 16)

boston_best_row <- boston_knn_results[which.max(boston_knn_results$Accuracy), ]
abline(v = boston_best_row$K, col = "red", lty = 2)

print(boston_best_row)
##   K  Accuracy
## 1 1 0.9249012

This is pretty great accuracy now lets fit our final knn model with k-value=1.

set.seed(210)

boston_knn_pred_k1 <- knn(train = boston_scale_train, 
                          test = boston_scale_test, 
                          cl = boston_train_set$crim_high, 
                          k = 1)

boston_knn_acc_k1 <- mean(boston_knn_pred_k1 == boston_test_set$crim_high)
boston_knn_err_k1 <- mean(boston_knn_pred_k1 != boston_test_set$crim_high)

cat("KNN (K=1) Test Accuracy:", boston_knn_acc_k1, "\n")
## KNN (K=1) Test Accuracy: 0.9249012
cat("KNN (K=1) Test Error Rate:", boston_knn_err_k1, "\n")
## KNN (K=1) Test Error Rate: 0.07509881

Our knn model wins the game, error rate is 7.5% percent means that model accurately classified whether a census tract is above or below the median crime rate roughly 92.4% of the time using 50/50 split.

Conclusion

I applied several different classification models to this dataset to find the most accurate results. This process showed me that finding the best model requires a holistic understanding of both the variables and the underlying algorithms. Replicating these steps across different setups was a great practice for me. Tuning hyperparameters like the optimal K value and splitting the data into train and test sets gave me a much stronger, more dimensional perspective on data behavior.