Loading the dataset and libaries.

library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.3.2
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.2
## 
## Attaching package: 'ISLR'
## The following objects are masked from 'package:ISLR2':
## 
##     Auto, Credit
library(class)
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.3.2
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::expand() masks Matrix::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ tidyr::pack()   masks Matrix::pack()
## ✖ tidyr::unpack() masks Matrix::unpack()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(datasets)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## The following object is masked from 'package:ISLR2':
## 
##     Boston
library(ggplot2)
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.2
library(naivebayes)
## Warning: package 'naivebayes' was built under R version 4.3.2
## naivebayes 0.9.7 loaded
data("Weekly")

13) This question should be answered using the Weekly data set, whichis part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1, 089weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
pairs(Weekly[, -9], col = "green")

It looks like Volume and Year are the only two variables that have any meaningful correlation.

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

# Fit logistic regression model
model <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(model)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Only Lag2 is significantly significant to reject the null hypothesis.

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

# Predictions
predictions <- ifelse(predict(model, type = "response") > 0.5, "Up", "Down")
conf_matrix <- table(Weekly$Direction, predictions)
conf_matrix
##       predictions
##        Down  Up
##   Down   54 430
##   Up     48 557
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy
## [1] 0.5610652

Thus, the model predicted the weekly market trends around 56.1% of the times. Slightly better than random guessing. Looking at the weeks that went up, we can see that it predicted it correctly (557/(557+48)=0.920661157) around 92.1%. Versus for down, it predicted it correctly (54/(54+430))=0.111570248) around 11.2%. So we can see that the model is not good at predicting down weeks. Also, we can see that the model tends to predict that most weeks will be up

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

training_data <- subset(Weekly, Year < 2009)
testing_data <- subset(Weekly, Year >= 2009)
logistic_model_lag2 <- glm(Direction ~ Lag2, data = training_data, family = "binomial")
predicted_direction_lag2 <- ifelse(predict(logistic_model_lag2, newdata = testing_data, type = "response") > 0.5, "Up", "Down")
confusion_matrix_lag2 <- table(Actual = testing_data$Direction, Predicted = predicted_direction_lag2)
accuracy_lag2 <- sum(diag(confusion_matrix_lag2)) / sum(confusion_matrix_lag2)
print(confusion_matrix_lag2)
##       Predicted
## Actual Down Up
##   Down    9 34
##   Up      5 56
cat("Overall accuracy with Lag2:", accuracy_lag2, "\n")
## Overall accuracy with Lag2: 0.625

When split up by years, the model correctly predicts 62.5%. However, like the previous model, it still predicts Up with a much higher percentage (56/56+5)=0.918032787 => 91.8%. And down with 9/(9+34)=0.209302326 => 20.9%. But, the down percentage is better than the previous model.

(e) Repeat (d) using LDA.

training_data <- subset(Weekly, Year <= 2008)
lda_model <- lda(Direction ~ Lag2, data = training_data)
held_out_data <- subset(Weekly, Year > 2008)
predictions <- predict(lda_model, newdata = held_out_data)$class

# Confusion matrix for held-out data
conf_matrix <- table(held_out_data$Direction, predictions)
conf_matrix
##       predictions
##        Down Up
##   Down    9 34
##   Up      5 56
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy
## [1] 0.625

Linear Discriminant Analysis (LDA) has a similar result to logistic regression

(f) Repeat (d) using QDA.

training_data <- subset(Weekly, Year <= 2008)
qda_model <- qda(Direction ~ Lag2, data = training_data)
held_out_data <- subset(Weekly, Year > 2008)
predictions <- predict(qda_model, newdata = held_out_data)$class

# Confusion matrix for held-out data
conf_matrix <- table(held_out_data$Direction, predictions)
conf_matrix
##       predictions
##        Down Up
##   Down    0 43
##   Up      0 61
# Overall fraction of correct predictions for held-out data
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy
## [1] 0.5865385

Quadratic Discriminant Analysis (QDA) has a lower accuracy with only 58.7% and only considers the weeks to be going up every weeek.

(g) Repeat (d) using KNN with K =1.

training_data <- subset(Weekly, Year <= 2008)

knn_model <- knn(train = as.matrix(training_data$Lag2), 
                 test = as.matrix(held_out_data$Lag2), 
                 cl = training_data$Direction, 
                 k = 1)

# Confusion matrix for held-out data
conf_matrix <- table(held_out_data$Direction, knn_model)
conf_matrix
##       knn_model
##        Down Up
##   Down   21 22
##   Up     30 31
# Overall fraction of correct predictions for held-out data
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy
## [1] 0.5

Both up and down are essentially at 50%. So, this is truely something that is acting almost as a random guess.

(h) Repeat (d) using naive Bayes.

training_data <- subset(Weekly, Year <= 2008)
naive_model <- naiveBayes(Direction ~ Lag2, data = training_data)
predictions <- predict(naive_model, newdata = held_out_data)

# Confusion matrix for held-out data
conf_matrix <- table(held_out_data$Direction, predictions)
conf_matrix
##       predictions
##        Down Up
##   Down    0 43
##   Up      0 61
# Overall fraction of correct predictions for held-out data
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy
## [1] 0.5865385

This predicted a 61.54% rate, which is one of the higher rates. Also, we can see similar results to the logistic regression in it predicting Ups much better than Downs.

(i) Which of these methods appears to provide the best results on this data? i) Based on the overall accuracy results on the held-out data:

Overall accuracy with Naive Bayes: 0.5865385

Overall accuracy with KNN: 0.5096154

Overall accuracy with QDA: 0.5865385

Overall accuracy with LDA: 0.625

Overall accuracy with Logistic Regression using Lag2: 0.625

It appears that Logistic Regression with Lag2 and LDA have the highest overall accuracy, both achieving 62.5%. Therefore, based on this evaluation, both Logistic Regression with Lag2 and LDA seem to provide the best results on this specific dataset.

(j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

predictors_combination <- c("Lag2", "Lag3")  # Modify this with different combinations
training_data_sub <- subset(training_data, select = c("Direction", predictors_combination))
testing_data_sub <- subset(testing_data, select = c("Direction", predictors_combination))

# Fit the model (e.g., logistic regression)
model_sub <- glm(Direction ~ ., data = training_data_sub, family = "binomial")
predicted_direction_sub <- ifelse(predict(model_sub, newdata = testing_data_sub, type = "response") > 0.5, "Up", "Down")

# Evaluate performance
confusion_matrix_sub <- table(Actual = testing_data_sub$Direction, Predicted = predicted_direction_sub)
accuracy_sub <- sum(diag(confusion_matrix_sub)) / sum(confusion_matrix_sub)
print(confusion_matrix_sub)
##       Predicted
## Actual Down Up
##   Down    8 35
##   Up      4 57
cat("Overall accuracy with subset of predictors:", accuracy_sub, "\n")
## Overall accuracy with subset of predictors: 0.625
predictors_combination <- c("Lag2", "Lag3")
training_data_sub <- subset(training_data, select = c("Direction", predictors_combination))
testing_data_sub <- subset(testing_data, select = c("Direction", predictors_combination))
k_values <- c(1, 3, 5, 7, 9)

for (k in k_values) {
  cat("K =", k, "\n")
  model_knn <- knn(train = training_data_sub[, -1], test = testing_data_sub[, -1], cl = training_data_sub$Direction, k = k)
  confusion_matrix_knn <- table(Actual = testing_data_sub$Direction, Predicted = model_knn)
  accuracy_knn <- sum(diag(confusion_matrix_knn)) / sum(confusion_matrix_knn)
  
  # Print confusion matrix and accuracy
  print(confusion_matrix_knn)
  cat("Accuracy with K =", k, ":", accuracy_knn, "\n")
}
## K = 1 
##       Predicted
## Actual Down Up
##   Down   19 24
##   Up     23 38
## Accuracy with K = 1 : 0.5480769 
## K = 3 
##       Predicted
## Actual Down Up
##   Down   16 27
##   Up     20 41
## Accuracy with K = 3 : 0.5480769 
## K = 5 
##       Predicted
## Actual Down Up
##   Down   15 28
##   Up     20 41
## Accuracy with K = 5 : 0.5384615 
## K = 7 
##       Predicted
## Actual Down Up
##   Down   11 32
##   Up     17 44
## Accuracy with K = 7 : 0.5288462 
## K = 9 
##       Predicted
## Actual Down Up
##   Down   12 31
##   Up     17 44
## Accuracy with K = 9 : 0.5384615

LDA and logistic regression still have the best accuracy with 65.4%. But,squaring Lag2 in QDA did improve the accuracy of that model to 62.5%.

14) In this problem, you will develop a model to predict whether a given In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

Loading the Auto dataset

data(Auto)
summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median.

Auto$mpg01 <- ifelse(Auto$mpg > median(Auto$mpg), 1, 0)
Auto$mpg01 <- as.factor(Auto$mpg01)

(b) Explore the data graphically in order to investigate the association between mpg01 and the other features

pairs(Auto[, c("mpg01", "cylinders", "horsepower", "weight", "displacement")])

Cylinders, Displacement, and Weight have a strong correlation to mpg01. Horsepower and origin have a more minor, but potentially relevant, correlation with mpg01 as well.

boxplot(weight ~ mpg01, data = Auto)

(c) Split the data into a training set and a test set.

set.seed(123)
train_index <- sample(1:nrow(Auto), 0.7 * nrow(Auto))
train_data <- Auto[train_index, ]
test_data <- Auto[-train_index, ]

(d) Perform LDA on the training data and finding the test error

lda_model <- lda(mpg01 ~ cylinders + horsepower + weight + displacement, data = train_data)
lda_pred <- predict(lda_model, test_data)
lda_error <- mean(lda_pred$class != test_data$mpg01)
print(paste("LDA Test Error:", lda_error))
## [1] "LDA Test Error: 0.110169491525424"

Test Error Rate of LDA is 11.01%

(e) Perform QDA on the training data and finding the test error

qda_model <- qda(mpg01 ~ cylinders + horsepower + weight + displacement, data = train_data)
qda_pred <- predict(qda_model, test_data)
qda_error <- mean(qda_pred$class != test_data$mpg01)
print(paste("QDA Test Error:", qda_error))
## [1] "QDA Test Error: 0.101694915254237"

Test Error rate of QDA is 10.16%

(f) Perform Logistic Regression on the training data and finding the test error

logreg_model <- glm(mpg01 ~ cylinders + horsepower + weight + displacement, family = binomial, data = train_data)
logreg_pred <- predict(logreg_model, test_data, type = "response")
logreg_pred_class <- ifelse(logreg_pred > 0.5, 1, 0)
logreg_error <- mean(logreg_pred_class != test_data$mpg01)
print(paste("Logistic Regression Test Error:", logreg_error))
## [1] "Logistic Regression Test Error: 0.110169491525424"

Test Error rate of Logistic Regression is 11.01%

(g) Perform naive Bayes on the training data and finding the test error

nb_model <- naive_bayes(mpg01 ~ cylinders + horsepower + weight + displacement, data = train_data)
nb_pred_probs <- predict(nb_model, newdata = test_data, type = "prob")
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
nb_pred_probs_class1 <- nb_pred_probs[, "1"]
nb_pred_class <- ifelse(nb_pred_probs_class1 > 0.5, 1, 0)

# Calculate the Naive Bayes test error
nb_error <- mean(nb_pred_class != as.numeric(test_data$mpg01))
print(paste("Naive Bayes Test Error:", nb_error))
## [1] "Naive Bayes Test Error: 0.932203389830508"

Test Error rate of Naive Bayes is 9.32%

(h) Perform KNN on the training data and finding the test error

k_values <- c(3, 5, 7)  # Example K values to try
for (k in k_values) {
  knn_pred <- knn(train_data[, c("cylinders", "horsepower", "weight", "displacement")],
                  test_data[, c("cylinders", "horsepower", "weight", "displacement")],
                  train_data$mpg01, k = k)
  knn_error <- mean(knn_pred != test_data$mpg01)
  print(paste("KNN Test Error (K =", k, "):", knn_error))
}
## [1] "KNN Test Error (K = 3 ): 0.127118644067797"
## [1] "KNN Test Error (K = 5 ): 0.110169491525424"
## [1] "KNN Test Error (K = 7 ): 0.101694915254237"

Test Error Rate K=3: 12.70%, Test Error Rate K=5: 11.01%, and Test Error Rate K=7: 10.16% Test Error Rate K=7 seems to have the lowest error rate when compared to higher values.