Assignment 3

Author

Jonathan McCanlas

library(ISLR2)
data(Weekly)

Assignmnet 3

Problem 13

Part A

summary(Weekly)

      Year           Lag1               Lag2               Lag3         
 Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
 1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
 Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
 Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
 3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
 Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
      Lag4               Lag5              Volume            Today         
 Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
 1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
 Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
 Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
 3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
 Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
 Direction 
 Down:484  
 Up  :605

cor(Weekly[, -9])  # exclude Direction, which is qualitative

              Year         Lag1        Lag2        Lag3         Lag4
Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
               Lag5      Volume        Today
Year   -0.030519101  0.84194162 -0.032459894
Lag1   -0.008183096 -0.06495131 -0.075031842
Lag2   -0.072499482 -0.08551314  0.059166717
Lag3    0.060657175 -0.06928771 -0.071243639
Lag4   -0.075675027 -0.06107462 -0.007825873
Lag5    1.000000000 -0.05851741  0.011012698
Volume -0.058517414  1.00000000 -0.033077783
Today   0.011012698 -0.03307778  1.000000000

pairs(Weekly[, -9])

plot(Weekly$Volume, type = "l", main = "Volume Over Time", ylab = "Volume")

boxplot(Today ~ Direction, data = Weekly,
        main = "Today’s Return vs Direction",
        ylab = "Today",
        xlab = "Market Direction")

Pairs plot: Shows that the lag variables (Lag1 to Lag5) don’t have strong patterns with the target variable Today, but some relationships between lag values themselves are slightly visible.

Volume over time: There’s a noticeable upward trend in trading volume over the years, especially after the mid-1990s, which could reflect increasing market activity.

Today’s return vs. direction: The boxplot shows that when the market goes Up, returns tend to be higher and more positive. When it goes Down, returns are more negative and spread out, with some extreme outliers.

Part B

# Fit logistic regression model
log_model <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, 
                 data = Weekly, 
                 family = binomial)

summary(log_model)


Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
    Volume, family = binomial, data = Weekly)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  0.26686    0.08593   3.106   0.0019 **
Lag1        -0.04127    0.02641  -1.563   0.1181   
Lag2         0.05844    0.02686   2.175   0.0296 * 
Lag3        -0.01606    0.02666  -0.602   0.5469   
Lag4        -0.02779    0.02646  -1.050   0.2937   
Lag5        -0.01447    0.02638  -0.549   0.5833   
Volume      -0.02274    0.03690  -0.616   0.5377   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1496.2  on 1088  degrees of freedom
Residual deviance: 1486.4  on 1082  degrees of freedom
AIC: 1500.4

Number of Fisher Scoring iterations: 4

Statistically Significant Predictor: Lag2

Estimate = 0.05844 p-value = 0.0296 Since p-value < 0.05, this variable is statistically significant.

Part C

glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
               family = binomial, data = Weekly)

# Step 1: Predict probabilities
glm.probs <- predict(glm.fit, type = "response")

# Step 2: Convert probabilities to "Up"/"Down"
glm.pred <- ifelse(glm.probs > 0.5, "Up", "Down")

# Step 3: Create confusion matrix
confusion <- table(Predicted = glm.pred, Actual = Weekly$Direction)
print(confusion)

         Actual
Predicted Down  Up
     Down   54  48
     Up    430 557

# Step 4: Compute accuracy
accuracy <- mean(glm.pred == Weekly$Direction)
print(accuracy)

[1] 0.5610652

The logistic regression model correctly predicted the market direction about 56% of the time. It mostly predicts “Up,” and while it gets many of those right, it performs poorly when the market actually goes “Down,” often misclassifying them. This suggests the model isn’t very reliable for detecting downward trends.

Part D

# Create training and test indicators
train <- Weekly$Year <= 2008

# Fit logistic regression on training data with Lag2 only
glm.fit <- glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)

# Predict probabilities on test data (2009–2010)
glm.probs <- predict(glm.fit, newdata = Weekly[!train, ], type = "response")

# Convert probabilities to class labels
glm.pred <- ifelse(glm.probs > 0.5, "Up", "Down")
glm.pred <- factor(glm.pred, levels = levels(Weekly$Direction))  # ensure matching factor levels

# Actual values for test data
actual <- Weekly$Direction[!train]

# Confusion matrix
table(Predicted = glm.pred, Actual = actual)

         Actual
Predicted Down Up
     Down    9  5
     Up     34 56

# Accuracy
mean(glm.pred == actual)

[1] 0.625

Parts E through H

library(ISLR2)
library(MASS)       # for LDA, QDA


Attaching package: 'MASS'

The following object is masked from 'package:ISLR2':

    Boston

library(class)      # for KNN
library(e1071)      # for Naive Bayes

# Train/test split
train <- Weekly$Year <= 2008
test <- Weekly$Year > 2008

# Training and test sets
train.X <- Weekly$Lag2[train]
test.X <- Weekly$Lag2[test]
train.Y <- Weekly$Direction[train]
test.Y <- Weekly$Direction[test]

# Must be data frames for LDA/QDA/Bayes
train.df <- data.frame(Direction = train.Y, Lag2 = train.X)
test.df <- data.frame(Lag2 = test.X)

LDA

lda.fit <- lda(Direction ~ Lag2, data = train.df)
lda.pred <- predict(lda.fit, test.df)$class

# Confusion matrix and accuracy
table(Predicted = lda.pred, Actual = test.Y)

         Actual
Predicted Down Up
     Down    9  5
     Up     34 56

mean(lda.pred == test.Y)

[1] 0.625

QDA

qda.fit <- qda(Direction ~ Lag2, data = train.df)
qda.pred <- predict(qda.fit, test.df)$class

# Confusion matrix and accuracy
table(Predicted = qda.pred, Actual = test.Y)

         Actual
Predicted Down Up
     Down    0  0
     Up     43 61

mean(qda.pred == test.Y)

[1] 0.5865385

KNN with K=1

train.X.knn <- as.matrix(train.X)
test.X.knn <- as.matrix(test.X)

knn.pred <- knn(train = train.X.knn, test = test.X.knn, cl = train.Y, k = 1)

# Confusion matrix and accuracy
table(Predicted = knn.pred, Actual = test.Y)

         Actual
Predicted Down Up
     Down   21 29
     Up     22 32

mean(knn.pred == test.Y)

[1] 0.5096154

Naive Bayes

nb.fit <- naiveBayes(Direction ~ Lag2, data = train.df)
nb.pred <- predict(nb.fit, test.df)

# Confusion matrix and accuracy
table(Predicted = nb.pred, Actual = test.Y)

         Actual
Predicted Down Up
     Down    0  0
     Up     43 61

mean(nb.pred == test.Y)

[1] 0.5865385

Logistic regression and LDA gave the best results, each correctly predicting about 62.5% of market directions. They outperformed QDA, Naive Bayes, and KNN, which had lower accuracy or predicted only one outcome. Overall, logistic regression or LDA are the most reliable choices for this data.

Part J

data("Weekly")

# Split data: train = before 2009, test = 2009–2010
train <- subset(Weekly, Year <= 2008)
test <- subset(Weekly, Year > 2008)

glm.fit <- glm(Direction ~ Lag1 * Lag2, data = train, family = binomial)
glm.probs <- predict(glm.fit, test, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, "Up", "Down")

# Make prediction factors to match "Direction"
glm.pred <- factor(glm.pred, levels = levels(test$Direction))

# Confusion matrix
table(Predicted = glm.pred, Actual = test$Direction)

         Actual
Predicted Down Up
     Down    7  8
     Up     36 53

# Accuracy
mean(glm.pred == test$Direction)

[1] 0.5769231

LDA with Lag2 and Volume (Transformed)

train$logVolume <- log(train$Volume + 1)
test$logVolume <- log(test$Volume + 1)

lda.fit <- lda(Direction ~ Lag2 + logVolume, data = train)
lda.pred <- predict(lda.fit, test)$class
table(Predicted = lda.pred, Actual = test$Direction)

         Actual
Predicted Down Up
     Down   17 19
     Up     26 42

mean(lda.pred == test$Direction)

[1] 0.5673077

QDA with Transformed Lag1

train$sqrtLag1 <- sqrt(abs(train$Lag1))
test$sqrtLag1 <- sqrt(abs(test$Lag1))

qda.fit <- qda(Direction ~ sqrtLag1 + Lag2, data = train)
qda.pred <- predict(qda.fit, test)$class
table(Predicted = qda.pred, Actual = test$Direction)

         Actual
Predicted Down Up
     Down    0  0
     Up     43 61

mean(qda.pred == test$Direction)

[1] 0.5865385

KNN with Optimized K

library(class)
train.X <- cbind(train$Lag2, train$Lag1)
test.X <- cbind(test$Lag2, test$Lag1)
train.Direction <- train$Direction

# Try k = 1 to 10
for (k in 1:10) {
  knn.pred <- knn(train.X, test.X, train.Direction, k = k)
  acc <- mean(knn.pred == test$Direction)
  cat("K =", k, "Accuracy =", acc, "\n")
}

K = 1 Accuracy = 0.4807692 
K = 2 Accuracy = 0.4903846 
K = 3 Accuracy = 0.5192308 
K = 4 Accuracy = 0.5673077 
K = 5 Accuracy = 0.4903846 
K = 6 Accuracy = 0.4807692 
K = 7 Accuracy = 0.5288462 
K = 8 Accuracy = 0.5865385 
K = 9 Accuracy = 0.5384615 
K = 10 Accuracy = 0.4711538

best.knn <- knn(train.X, test.X, train.Direction, k = 6)
table(Predicted = best.knn, Actual = test$Direction)

         Actual
Predicted Down Up
     Down   23 31
     Up     20 30

Problem 14

Part A

library(ISLR2)

# Calculate the median of mpg
mpg_median <- median(Auto$mpg)

# Create binary variable: 1 if mpg > median, else 0
mpg01 <- ifelse(Auto$mpg > mpg_median, 1, 0)

# Add mpg01 to the Auto dataset
Auto.MPG01 <- data.frame(Auto, mpg01)

summary(Auto.MPG01)

      mpg          cylinders      displacement     horsepower        weight    
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
 1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
 Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
 Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
 Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
                                                                               
  acceleration        year           origin                      name    
 Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
 1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
 Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
 Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
 3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
 Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
                                                 (Other)           :365  
     mpg01    
 Min.   :0.0  
 1st Qu.:0.0  
 Median :0.5  
 Mean   :0.5  
 3rd Qu.:1.0  
 Max.   :1.0

Part B

boxplot(displacement ~ mpg01, data = Auto.MPG01, main = "Displacement vs mpg01", xlab = "mpg01", ylab = "Displacement")

boxplot(horsepower ~ mpg01, data = Auto.MPG01, main = "Horsepower vs mpg01", xlab = "mpg01", ylab = "Horsepower")

boxplot(weight ~ mpg01, data = Auto.MPG01, main = "Weight vs mpg01", xlab = "mpg01", ylab = "Weight")

boxplot(acceleration ~ mpg01, data = Auto.MPG01, main = "Acceleration vs mpg01", xlab = "mpg01", ylab = "Acceleration")

Cars with high mpg (mpg01 = 1) usually have lower displacement, horsepower, and weight. These three features show the biggest difference between high and low mpg cars. Acceleration also shows a small difference, but it’s not as strong. So, displacement, horsepower, and weight are the most useful for predicting fuel efficiency.

Part C

set.seed(1)
train_index <- sample(1:nrow(Auto.MPG01), 0.7 * nrow(Auto.MPG01))
train <- Auto.MPG01[train_index, ]
test <- Auto.MPG01[-train_index, ]

Part D

Step 1: Identify variables most associated with mpg01

From the boxplots in part (b), the strongest predictors of mpg01 appear to be:

displacement

horsepower

weight

acceleration

Step 2: Run LDA on training data

library(MASS)

lda.fit <- lda(mpg01 ~ displacement + horsepower + weight + acceleration, data = train)

Step 3: Predict on test data

lda.pred <- predict(lda.fit, test)
lda.class <- lda.pred$class

Step 4: Create confusion matrix & calculate test error

table(Predicted = lda.class, Actual = test$mpg01)

         Actual
Predicted  0  1
        0 47  1
        1 14 56

mean(lda.class != test$mpg01)  # test error rate

[1] 0.1271186

Part E

Run QDA

# Fit QDA model on training data
qda.model <- qda(mpg01 ~ displacement + horsepower + weight + acceleration, data = train)

# Predict on test set
qda.pred <- predict(qda.model, test)$class

# Create confusion matrix
table(Predicted = qda.pred, Actual = test$mpg01)

         Actual
Predicted  0  1
        0 49  5
        1 12 52

# Calculate test error rate
mean(qda.pred != test$mpg01)

[1] 0.1440678

Part F

glm.fit <- glm(mpg01 ~ weight + horsepower + displacement + acceleration,
               data = train, family = binomial)

glm.probs <- predict(glm.fit, newdata = test, type = "response")

glm.pred <- ifelse(glm.probs > 0.5, 1, 0)

table(Predicted = glm.pred, Actual = test$mpg01)

         Actual
Predicted  0  1
        0 53  4
        1  8 53

mean(glm.pred != test$mpg01)

[1] 0.1016949

Part G

library(e1071)

nb.model <- naiveBayes(mpg01 ~ displacement + horsepower + weight + acceleration, data = train)

nb.pred <- predict(nb.model, test)

table(Predicted = nb.pred, Actual = test$mpg01)

         Actual
Predicted  0  1
        0 49  3
        1 12 54

mean(nb.pred != test$mpg01)

[1] 0.1271186

Part H

# Load required library
library(class)

# Select only the relevant predictors
train.X <- train[, c("displacement", "horsepower", "weight", "acceleration")]
test.X <- test[, c("displacement", "horsepower", "weight", "acceleration")]

# Set up the response variable
train.Y <- train$mpg01

# Try different values of K and calculate test error
for (k in 1:10) {
  knn.pred <- knn(train.X, test.X, train.Y, k = k)
  error_rate <- mean(knn.pred != test$mpg01)
  cat("K =", k, "Test Error Rate =", round(error_rate, 4), "\n")
}

K = 1 Test Error Rate = 0.1356 
K = 2 Test Error Rate = 0.1525 
K = 3 Test Error Rate = 0.1102 
K = 4 Test Error Rate = 0.1102 
K = 5 Test Error Rate = 0.1271 
K = 6 Test Error Rate = 0.1356 
K = 7 Test Error Rate = 0.1271 
K = 8 Test Error Rate = 0.1356 
K = 9 Test Error Rate = 0.1441 
K = 10 Test Error Rate = 0.1441

Looks like K=3 and K=4 gave me the best results. The error rate was the lowest at around 11%.

Problem 16

data("Boston")

median_crim <- median(Boston$crim)
Boston$crim01 <- ifelse(Boston$crim > median_crim, 1, 0)

Boston$crim01 <- as.factor(Boston$crim01)

set.seed(1)
train_index <- sample(1:nrow(Boston), nrow(Boston) * 0.7)
train <- Boston[train_index, ]
test <- Boston[-train_index, ]

log_model <- glm(crim01 ~ nox + rad + tax + dis + lstat, data = train, family = binomial)
log_probs <- predict(log_model, test, type = "response")
log_preds <- ifelse(log_probs > 0.5, 1, 0)
mean(log_preds != test$crim01)

[1] 0.1907895

lda_model <- lda(crim01 ~ nox + rad + tax + dis + lstat, data = train)
lda_preds <- predict(lda_model, test)$class
mean(lda_preds != test$crim01)

[1] 0.1513158

nb_model <- naiveBayes(crim01 ~ nox + rad + tax + dis + lstat, data = train)
nb_preds <- predict(nb_model, test)
mean(nb_preds != test$crim01)

[1] 0.1644737

# Normalize predictors
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
Boston_norm <- as.data.frame(lapply(Boston[, c("nox", "rad", "tax", "dis", "lstat")], normalize))
Boston_norm$crim01 <- Boston$crim01

train_knn <- Boston_norm[train_index, ]
test_knn <- Boston_norm[-train_index, ]

# Try different K values
for (k in 1:10) {
  knn_pred <- knn(train_knn[, 1:5], test_knn[, 1:5], train_knn$crim01, k = k)
  cat("K =", k, "Test Error Rate =", mean(knn_pred != test_knn$crim01), "\n")
}

K = 1 Test Error Rate = 0.05921053 
K = 2 Test Error Rate = 0.05921053 
K = 3 Test Error Rate = 0.06578947 
K = 4 Test Error Rate = 0.05921053 
K = 5 Test Error Rate = 0.07236842 
K = 6 Test Error Rate = 0.09868421 
K = 7 Test Error Rate = 0.08552632 
K = 8 Test Error Rate = 0.09210526 
K = 9 Test Error Rate = 0.07236842 
K = 10 Test Error Rate = 0.07894737

KNN clearly outperformed the other models in this case — especially with K = 1. That suggests local patterns in the data are very strong for classifying crime rate, and simpler distance-based models capture them well. Logistic regression and LDA worked, but not as precisely.