library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.3.2
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.2
##
## Attaching package: 'ISLR'
## The following objects are masked from 'package:ISLR2':
##
## Auto, Credit
library(class)
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.3.2
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::expand() masks Matrix::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tidyr::pack() masks Matrix::pack()
## ✖ tidyr::unpack() masks Matrix::unpack()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(datasets)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
##
## The following object is masked from 'package:ISLR2':
##
## Boston
library(ggplot2)
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.2
library(naivebayes)
## Warning: package 'naivebayes' was built under R version 4.3.2
## naivebayes 0.9.7 loaded
data("Weekly")
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
pairs(Weekly[, -9], col = "green")
It looks like Volume and Year are the only two variables that have any
meaningful correlation.
(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
# Fit logistic regression model
model <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(model)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Only Lag2 is significantly significant to reject the null hypothesis.
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
# Predictions
predictions <- ifelse(predict(model, type = "response") > 0.5, "Up", "Down")
conf_matrix <- table(Weekly$Direction, predictions)
conf_matrix
## predictions
## Down Up
## Down 54 430
## Up 48 557
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy
## [1] 0.5610652
Thus, the model predicted the weekly market trends around 56.1% of the times. Slightly better than random guessing. Looking at the weeks that went up, we can see that it predicted it correctly (557/(557+48)=0.920661157) around 92.1%. Versus for down, it predicted it correctly (54/(54+430))=0.111570248) around 11.2%. So we can see that the model is not good at predicting down weeks. Also, we can see that the model tends to predict that most weeks will be up
(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
training_data <- subset(Weekly, Year < 2009)
testing_data <- subset(Weekly, Year >= 2009)
logistic_model_lag2 <- glm(Direction ~ Lag2, data = training_data, family = "binomial")
predicted_direction_lag2 <- ifelse(predict(logistic_model_lag2, newdata = testing_data, type = "response") > 0.5, "Up", "Down")
confusion_matrix_lag2 <- table(Actual = testing_data$Direction, Predicted = predicted_direction_lag2)
accuracy_lag2 <- sum(diag(confusion_matrix_lag2)) / sum(confusion_matrix_lag2)
print(confusion_matrix_lag2)
## Predicted
## Actual Down Up
## Down 9 34
## Up 5 56
cat("Overall accuracy with Lag2:", accuracy_lag2, "\n")
## Overall accuracy with Lag2: 0.625
When split up by years, the model correctly predicts 62.5%. However, like the previous model, it still predicts Up with a much higher percentage (56/56+5)=0.918032787 => 91.8%. And down with 9/(9+34)=0.209302326 => 20.9%. But, the down percentage is better than the previous model.
(e) Repeat (d) using LDA.
training_data <- subset(Weekly, Year <= 2008)
lda_model <- lda(Direction ~ Lag2, data = training_data)
held_out_data <- subset(Weekly, Year > 2008)
predictions <- predict(lda_model, newdata = held_out_data)$class
# Confusion matrix for held-out data
conf_matrix <- table(held_out_data$Direction, predictions)
conf_matrix
## predictions
## Down Up
## Down 9 34
## Up 5 56
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy
## [1] 0.625
Linear Discriminant Analysis (LDA) has a similar result to logistic regression
(f) Repeat (d) using QDA.
training_data <- subset(Weekly, Year <= 2008)
qda_model <- qda(Direction ~ Lag2, data = training_data)
held_out_data <- subset(Weekly, Year > 2008)
predictions <- predict(qda_model, newdata = held_out_data)$class
# Confusion matrix for held-out data
conf_matrix <- table(held_out_data$Direction, predictions)
conf_matrix
## predictions
## Down Up
## Down 0 43
## Up 0 61
# Overall fraction of correct predictions for held-out data
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy
## [1] 0.5865385
Quadratic Discriminant Analysis (QDA) has a lower accuracy with only 58.7% and only considers the weeks to be going up every weeek.
(g) Repeat (d) using KNN with K =1.
training_data <- subset(Weekly, Year <= 2008)
knn_model <- knn(train = as.matrix(training_data$Lag2),
test = as.matrix(held_out_data$Lag2),
cl = training_data$Direction,
k = 1)
# Confusion matrix for held-out data
conf_matrix <- table(held_out_data$Direction, knn_model)
conf_matrix
## knn_model
## Down Up
## Down 21 22
## Up 30 31
# Overall fraction of correct predictions for held-out data
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy
## [1] 0.5
Both up and down are essentially at 50%. So, this is truely something that is acting almost as a random guess.
(h) Repeat (d) using naive Bayes.
training_data <- subset(Weekly, Year <= 2008)
naive_model <- naiveBayes(Direction ~ Lag2, data = training_data)
predictions <- predict(naive_model, newdata = held_out_data)
# Confusion matrix for held-out data
conf_matrix <- table(held_out_data$Direction, predictions)
conf_matrix
## predictions
## Down Up
## Down 0 43
## Up 0 61
# Overall fraction of correct predictions for held-out data
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
accuracy
## [1] 0.5865385
This predicted a 61.54% rate, which is one of the higher rates. Also, we can see similar results to the logistic regression in it predicting Ups much better than Downs.
(i) Which of these methods appears to provide the best results on this data? i) Based on the overall accuracy results on the held-out data:
Overall accuracy with Naive Bayes: 0.5865385
Overall accuracy with KNN: 0.5096154
Overall accuracy with QDA: 0.5865385
Overall accuracy with LDA: 0.625
Overall accuracy with Logistic Regression using Lag2: 0.625
It appears that Logistic Regression with Lag2 and LDA have the highest overall accuracy, both achieving 62.5%. Therefore, based on this evaluation, both Logistic Regression with Lag2 and LDA seem to provide the best results on this specific dataset.
(j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
predictors_combination <- c("Lag2", "Lag3") # Modify this with different combinations
training_data_sub <- subset(training_data, select = c("Direction", predictors_combination))
testing_data_sub <- subset(testing_data, select = c("Direction", predictors_combination))
# Fit the model (e.g., logistic regression)
model_sub <- glm(Direction ~ ., data = training_data_sub, family = "binomial")
predicted_direction_sub <- ifelse(predict(model_sub, newdata = testing_data_sub, type = "response") > 0.5, "Up", "Down")
# Evaluate performance
confusion_matrix_sub <- table(Actual = testing_data_sub$Direction, Predicted = predicted_direction_sub)
accuracy_sub <- sum(diag(confusion_matrix_sub)) / sum(confusion_matrix_sub)
print(confusion_matrix_sub)
## Predicted
## Actual Down Up
## Down 8 35
## Up 4 57
cat("Overall accuracy with subset of predictors:", accuracy_sub, "\n")
## Overall accuracy with subset of predictors: 0.625
predictors_combination <- c("Lag2", "Lag3")
training_data_sub <- subset(training_data, select = c("Direction", predictors_combination))
testing_data_sub <- subset(testing_data, select = c("Direction", predictors_combination))
k_values <- c(1, 3, 5, 7, 9)
for (k in k_values) {
cat("K =", k, "\n")
model_knn <- knn(train = training_data_sub[, -1], test = testing_data_sub[, -1], cl = training_data_sub$Direction, k = k)
confusion_matrix_knn <- table(Actual = testing_data_sub$Direction, Predicted = model_knn)
accuracy_knn <- sum(diag(confusion_matrix_knn)) / sum(confusion_matrix_knn)
# Print confusion matrix and accuracy
print(confusion_matrix_knn)
cat("Accuracy with K =", k, ":", accuracy_knn, "\n")
}
## K = 1
## Predicted
## Actual Down Up
## Down 19 24
## Up 23 38
## Accuracy with K = 1 : 0.5480769
## K = 3
## Predicted
## Actual Down Up
## Down 16 27
## Up 20 41
## Accuracy with K = 3 : 0.5480769
## K = 5
## Predicted
## Actual Down Up
## Down 15 28
## Up 20 41
## Accuracy with K = 5 : 0.5384615
## K = 7
## Predicted
## Actual Down Up
## Down 11 32
## Up 17 44
## Accuracy with K = 7 : 0.5288462
## K = 9
## Predicted
## Actual Down Up
## Down 12 31
## Up 17 44
## Accuracy with K = 9 : 0.5384615
LDA and logistic regression still have the best accuracy with 65.4%. But,squaring Lag2 in QDA did improve the accuracy of that model to 62.5%.
Loading the Auto dataset
data(Auto)
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median.
Auto$mpg01 <- ifelse(Auto$mpg > median(Auto$mpg), 1, 0)
Auto$mpg01 <- as.factor(Auto$mpg01)
(b) Explore the data graphically in order to investigate the association between mpg01 and the other features
pairs(Auto[, c("mpg01", "cylinders", "horsepower", "weight", "displacement")])
Cylinders, Displacement, and Weight have a strong correlation to mpg01.
Horsepower and origin have a more minor, but potentially relevant,
correlation with mpg01 as well.
boxplot(weight ~ mpg01, data = Auto)
(c) Split the data into a training set and a test set.
set.seed(123)
train_index <- sample(1:nrow(Auto), 0.7 * nrow(Auto))
train_data <- Auto[train_index, ]
test_data <- Auto[-train_index, ]
(d) Perform LDA on the training data and finding the test error
lda_model <- lda(mpg01 ~ cylinders + horsepower + weight + displacement, data = train_data)
lda_pred <- predict(lda_model, test_data)
lda_error <- mean(lda_pred$class != test_data$mpg01)
print(paste("LDA Test Error:", lda_error))
## [1] "LDA Test Error: 0.110169491525424"
Test Error Rate of LDA is 11.01%
(e) Perform QDA on the training data and finding the test error
qda_model <- qda(mpg01 ~ cylinders + horsepower + weight + displacement, data = train_data)
qda_pred <- predict(qda_model, test_data)
qda_error <- mean(qda_pred$class != test_data$mpg01)
print(paste("QDA Test Error:", qda_error))
## [1] "QDA Test Error: 0.101694915254237"
Test Error rate of QDA is 10.16%
(f) Perform Logistic Regression on the training data and finding the test error
logreg_model <- glm(mpg01 ~ cylinders + horsepower + weight + displacement, family = binomial, data = train_data)
logreg_pred <- predict(logreg_model, test_data, type = "response")
logreg_pred_class <- ifelse(logreg_pred > 0.5, 1, 0)
logreg_error <- mean(logreg_pred_class != test_data$mpg01)
print(paste("Logistic Regression Test Error:", logreg_error))
## [1] "Logistic Regression Test Error: 0.110169491525424"
Test Error rate of Logistic Regression is 11.01%
(g) Perform naive Bayes on the training data and finding the test error
nb_model <- naive_bayes(mpg01 ~ cylinders + horsepower + weight + displacement, data = train_data)
nb_pred_probs <- predict(nb_model, newdata = test_data, type = "prob")
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
nb_pred_probs_class1 <- nb_pred_probs[, "1"]
nb_pred_class <- ifelse(nb_pred_probs_class1 > 0.5, 1, 0)
# Calculate the Naive Bayes test error
nb_error <- mean(nb_pred_class != as.numeric(test_data$mpg01))
print(paste("Naive Bayes Test Error:", nb_error))
## [1] "Naive Bayes Test Error: 0.932203389830508"
Test Error rate of Naive Bayes is 9.32%
(h) Perform KNN on the training data and finding the test error
k_values <- c(3, 5, 7) # Example K values to try
for (k in k_values) {
knn_pred <- knn(train_data[, c("cylinders", "horsepower", "weight", "displacement")],
test_data[, c("cylinders", "horsepower", "weight", "displacement")],
train_data$mpg01, k = k)
knn_error <- mean(knn_pred != test_data$mpg01)
print(paste("KNN Test Error (K =", k, "):", knn_error))
}
## [1] "KNN Test Error (K = 3 ): 0.127118644067797"
## [1] "KNN Test Error (K = 5 ): 0.110169491525424"
## [1] "KNN Test Error (K = 7 ): 0.101694915254237"
Test Error Rate K=3: 12.70%, Test Error Rate K=5: 11.01%, and Test Error Rate K=7: 10.16% Test Error Rate K=7 seems to have the lowest error rate when compared to higher values.