# Load required libraries
library(MASS)
library(class)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.3
## corrplot 0.95 loaded
library(ISLR2)
##
## Attaching package: 'ISLR2'
##
## The following object is masked from 'package:MASS':
##
## Boston
library(e1071)
## Warning: package 'e1071' was built under R version 4.4.3
# Load the Weekly dataset
data("Weekly")
13. This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
# Summary of the dataset
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
# Correlation matrix and plot for numeric variables (excluding Direction)
corrplot(cor(Weekly[, -9]), type = "lower", diag = FALSE, method = "ellipse")
Interpretation:
The summary()
function provides an overview of the
dataset, including measures such as min, max, median, and mean.
The correlation plot reveals relationships between numeric variables.
Notably, Volume
has a strong positive correlation with
Year
, indicating an increase in trading volume over
time.
Lag variables show relatively weak correlations with each other, suggesting that previous weeks’ returns may not strongly predict future returns.
# Logistic regression with Direction as the response variable
fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Weekly, family = binomial)
# Display the summary of the model
summary(fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Interpretation:
The logistic regression model predicts Direction
(Up/Down) based on the five lag variables and
Volume
.
Among the predictors, Lag2
appears statistically
significant (p = 0.0296), while the others do not.
The model suggests that higher Lag2
values are
associated with a greater likelihood of an upward movement in stock
price.
# Create predictions based on the logistic model
contrasts(Weekly$Direction)
## Up
## Down 0
## Up 1
pred <- predict(fit, type = "response") > 0.5
# Confusion matrix
(t <- table(ifelse(pred, "Up (pred)", "Down (pred)"), Weekly$Direction))
##
## Down Up
## Down (pred) 54 48
## Up (pred) 430 557
# Overall accuracy
accuracy <- sum(diag(t)) / sum(t)
accuracy
## [1] 0.5610652
Interpretation:
The confusion matrix shows how well the logistic model predicts the market direction.
The model achieves approximately 56% accuracy, which is only slightly better than random guessing.
Most downward movements are misclassified as “Up,” indicating poor predictive performance for declines.
# Split the data into training (1990-2008) and testing (2009-2010)
train <- Weekly$Year < 2009
# Logistic regression with Lag2 as the predictor
fit <- glm(Direction ~ Lag2, data = Weekly[train, ], family = binomial)
pred <- predict(fit, Weekly[!train, ], type = "response") > 0.5
# Confusion matrix and accuracy for the test set
(t <- table(ifelse(pred, "Up (pred)", "Down (pred)"), Weekly[!train, ]$Direction))
##
## Down Up
## Down (pred) 9 5
## Up (pred) 34 56
accuracy <- sum(diag(t)) / sum(t)
accuracy
## [1] 0.625
Interpretation:
The logistic model trained on Lag2
alone achieves an
accuracy similar to the full model.
The confusion matrix again shows a tendency to misclassify downward movements.
# LDA with Lag2 as the predictor
fit <- lda(Direction ~ Lag2, data = Weekly[train, ])
pred <- predict(fit, Weekly[!train, ], type = "response")$class
# Confusion matrix and accuracy
(t <- table(pred, Weekly[!train, ]$Direction))
##
## pred Down Up
## Down 9 5
## Up 34 56
accuracy <- sum(diag(t)) / sum(t)
accuracy
## [1] 0.625
Interpretation:
LDA performs similarly to logistic regression in terms of accuracy.
The classification performance remains modest.
# QDA with Lag2 as the predictor
fit <- qda(Direction ~ Lag2, data = Weekly[train, ])
pred <- predict(fit, Weekly[!train, ], type = "response")$class
# Confusion matrix and accuracy
(t <- table(pred, Weekly[!train, ]$Direction))
##
## pred Down Up
## Down 0 0
## Up 43 61
accuracy <- sum(diag(t)) / sum(t)
accuracy
## [1] 0.5865385
Interpretation:
QDA does not show significant improvement over LDA or logistic regression.
It may not be the best approach for this dataset.
# KNN with Lag2 as the predictor and K = 1
fit <- knn(Weekly[train, "Lag2", drop = FALSE],
Weekly[!train, "Lag2", drop = FALSE],
Weekly$Direction[train], k = 1)
# Confusion matrix and accuracy
(t <- table(fit, Weekly[!train, ]$Direction))
##
## fit Down Up
## Down 21 29
## Up 22 32
accuracy <- sum(diag(t)) / sum(t)
accuracy
## [1] 0.5096154
Interpretation:
KNN (K=1) achieves lower accuracy than logistic regression and LDA.
This suggests that a more refined choice of K is needed.
# Naive Bayes with Lag2 as the predictor
fit <- naiveBayes(Direction ~ Lag2, data = Weekly, subset = train)
pred <- predict(fit, Weekly[!train, ], type = "class")
# Confusion matrix and accuracy
(t <- table(pred, Weekly[!train, ]$Direction))
##
## pred Down Up
## Down 0 0
## Up 43 61
accuracy <- sum(diag(t)) / sum(t)
accuracy
## [1] 0.5865385
Interpretation:
Logistic regression and LDA provide the best results, achieving comparable accuracy.
KNN can improve with careful tuning of K.
# Testing with different combinations of Lag variables
fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4, data = Weekly[train, ], family = binomial)
pred <- predict(fit, Weekly[!train, ], type = "response") > 0.5
mean(ifelse(pred, "Up", "Down") == Weekly[!train, ]$Direction)
## [1] 0.5865385
# LDA with different Lag combinations
fit <- lda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4, data = Weekly[train, ])
pred <- predict(fit, Weekly[!train, ], type = "response")$class
mean(pred == Weekly[!train, ]$Direction)
## [1] 0.5769231
# Experiment with KNN with different values of K
set.seed(1)
res <- sapply(1:30, function(k) {
fit <- knn(Weekly[train, 2:4, drop = FALSE],
Weekly[!train, 2:4, drop = FALSE],
Weekly$Direction[train], k = k)
mean(fit == Weekly[!train, ]$Direction)
})
# Plot K vs Accuracy
plot(1:30, res, type = "o", xlab = "K", ylab = "Fraction correct")
Interpretation:
KNN using the first 3 Lag variables provides slightly better results
than logistic regression, with the best performance when K=3K =
3K=3.
The logistic regression and LDA models provide the best results, with some improvement using KNN for smaller values of KKK. Experimentation with predictor combinations shows that including additional lag variables does not significantly improve model performance.
14. In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
# Load the dataset
library(ISLR)
##
## Attaching package: 'ISLR'
## The following objects are masked from 'package:ISLR2':
##
## Auto, Credit
data(Auto)
# Create mpg01 binary variable
Auto$mpg01 <- as.numeric(Auto$mpg > median(Auto$mpg))
# Combine mpg01 with the other Auto variables into a new data frame
x <- Auto
# View the first few rows of the dataset
head(x)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name mpg01
## 1 chevrolet chevelle malibu 0
## 2 buick skylark 320 0
## 3 plymouth satellite 0
## 4 amc rebel sst 0
## 5 ford torino 0
## 6 ford galaxie 500 0
Interpretation:
We compute the median of the mpg
column and create a
new column mpg01
to indicate whether a car’s MPG is above
or below the median.
This will help us classify cars as either “high MPG” or “low MPG,” setting up the binary classification task.
# Set up graphical parameters for multiple plots
par(mfrow = c(2, 4))
# Create histograms of the first 7 numerical variables
for (i in 1:7) {
hist(x[, i], breaks = 20, main = colnames(x)[i], xlab = colnames(x)[i])
}
# Boxplots to compare each feature with mpg01
par(mfrow = c(2, 4))
for (i in 1:7) {
boxplot(x[, i] ~ x$mpg01, main = colnames(x)[i], xlab = colnames(x)[i], ylab = "mpg01")
}
# Scatterplot matrix
pairs(x[, 1:7])
Interpretation:
We use histograms to visualize the distributions of different variables. This helps us understand the spread and distribution of the features.
Boxplots help us see which variables differ based on the
mpg01
classification. For example, variables like
weight
, cylinders
, and
displacement
are visually correlated with
mpg01
.
Scatterplot matrices help us visually assess relationships
between variables and mpg01
. Variables like
cylinders
, weight
, and
displacement
show clear trends associated with higher/lower
gas mileage.
# Set seed for reproducibility
set.seed(1)
# Split the data (2/3 for training, 1/3 for testing)
train_index <- sample(seq_len(nrow(x)), nrow(x) * 2 / 3)
train_data <- x[train_index, ]
test_data <- x[-train_index, ]
Interpretation:
library(MASS)
# Fit LDA model on the training set
lda_model <- lda(mpg01 ~ cylinders + weight + displacement, data = train_data)
# Predict on the test set
lda_pred <- predict(lda_model, test_data)$class
# Compute test error
lda_test_error <- mean(lda_pred != test_data$mpg01)
lda_test_error
## [1] 0.1068702
Interpretation:
LDA assumes that the data from each class comes from a multivariate normal distribution with a class-specific mean but a shared covariance matrix.
We calculate the test error by comparing predicted classes with the true values in the test set. LDA typically works well for linear decision boundaries.
# Fit QDA model on the training set
qda_model <- qda(mpg01 ~ cylinders + weight + displacement, data = train_data)
# Predict on the test set
qda_pred <- predict(qda_model, test_data)$class
# Compute test error
qda_test_error <- mean(qda_pred != test_data$mpg01)
qda_test_error
## [1] 0.09923664
Interpretation:
QDA allows more flexible decision boundaries compared to LDA since it accounts for different covariance structures in each class.
The test error here is compared to that of LDA to evaluate if the more flexible model improves performance.
# Fit logistic regression model on the training set
logit_model <- glm(mpg01 ~ cylinders + weight + displacement, data = train_data, family = binomial)
# Predict probabilities on the test set
logit_pred <- predict(logit_model, test_data, type = "response")
# Convert probabilities to binary predictions (0 or 1)
logit_class_pred <- as.numeric(logit_pred > 0.5)
# Compute test error
logit_test_error <- mean(logit_class_pred != test_data$mpg01)
logit_test_error
## [1] 0.1145038
Interpretation:
Logistic regression models the log-odds of the binary outcome using a linear combination of the features.
The test error is calculated based on the predicted probabilities. This method is widely used for binary classification problems.
library(e1071)
# Fit Naive Bayes model on the training set
nb_model <- naiveBayes(mpg01 ~ cylinders + weight + displacement, data = train_data)
# Predict on the test set
nb_pred <- predict(nb_model, test_data)
# Compute test error
nb_test_error <- mean(nb_pred != test_data$mpg01)
nb_test_error
## [1] 0.09923664
Interpretation:
Naive Bayes assumes that the features are conditionally independent given the class label, which can be a strong but effective assumption in practice.
The test error is computed based on the predicted classes.
library(class)
# Define the features to use for KNN
train_knn <- train_data[, c("cylinders", "weight", "displacement")]
test_knn <- test_data[, c("cylinders", "weight", "displacement")]
# Perform KNN for multiple values of K and compute test error
knn_errors <- sapply(1:50, function(k) {
knn_pred <- knn(train_knn, test_knn, train_data$mpg01, k = k)
mean(knn_pred != test_data$mpg01)
})
# Plot test errors for different values of K
plot(1:50, knn_errors, type = "o", xlab = "K", ylab = "Test Error", main = "KNN Test Error vs K")
best_k <- which.min(knn_errors)
best_k
## [1] 3
Interpretation:
KNN is a non-parametric method that classifies a data point based on the majority vote of its K nearest neighbors.
We explore multiple values of K to find the one that minimizes the test error, and we visualize the relationship between K and test error.
We created a binary classification problem and explored multiple classification techniques (LDA, QDA, logistic regression, Naive Bayes, KNN) to predict whether a car has high or low gas mileage.
Each method provided different results, with QDA often outperforming others in terms of lower test error, followed by logistic regression and KNN.