library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.3
library(MASS)
## Warning: package 'MASS' was built under R version 4.4.3
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
library(class)
## Warning: package 'class' was built under R version 4.4.3
library(e1071)
## Warning: package 'e1071' was built under R version 4.4.3
data("Weekly")
This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
The Weekly dataset contains weekly stock market data from 1990 to 2010. The lag variables (Lag1 to Lag5) have average values close to zero and show no strong trends. The “Volume” variable stands out: it has increased steadily over time, which is also reflected in its high correlation with “Year” (correlation = 0.84). The lag variables, however, have only weak correlations with “Today” (and hence carry little information about “Direction”), meaning they don’t strongly predict whether the market goes up or down. Overall, no obvious patterns appear except for the upward trend in trading volume.
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
cor(Weekly[, -9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
plot(Weekly$Volume, type = "l", main = "Volume Over Time", ylab = "Volume", xlab = "Week")
pairs(Weekly[, 2:6], main = "Lag Variables Pairs Plot", col = ifelse(Weekly$Direction == "Up", "blue", "red"))
A logistic regression model was built using Lag1 to Lag5 and Volume to predict the market Direction. Based on the output, only Lag2 was statistically significant, with a p-value of 0.0296. This suggests Lag2 has some predictive value. The intercept was also significant. All other variables—Lag1, Lag3, Lag4, Lag5, and Volume—had p-values much higher than 0.05, meaning they were not statistically significant and likely don’t help the model much.
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Weekly, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
The confusion matrix shows that the logistic regression model predicted “Up” much more often than “Down.” It correctly predicted 557 “Up” weeks and only 54 “Down” weeks, while misclassifying 430 “Down” weeks and 48 “Up” weeks. The overall accuracy was about 56.1%, barely better than the 55.6% (605/1089) obtained by always guessing “Up.” This suggests the model heavily favors predicting “Up,” even when it’s wrong, leading to many false positives for that class.
glm.probs <- predict(glm.fit, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, "Up", "Down")
table(Predicted = glm.pred, Actual = Weekly$Direction)
## Actual
## Predicted Down Up
## Down 54 48
## Up 430 557
mean(glm.pred == Weekly$Direction)
## [1] 0.5610652
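The imbalance described above can be quantified with class-specific rates; a minimal sketch (the helper object conf is illustrative, and both ratios follow directly from the counts already shown):
conf <- table(Predicted = glm.pred, Actual = Weekly$Direction)
conf["Up", "Up"] / sum(conf[, "Up"]) # true positive rate for "Up": 557/605
## [1] 0.9206612
conf["Down", "Down"] / sum(conf[, "Down"]) # true negative rate: 54/484
## [1] 0.1115702
The model catches almost every “Up” week but recognizes barely one in nine “Down” weeks.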
A simpler logistic regression model was created using only Lag2 as the predictor and trained on data from 1990 to 2008. When tested on the held-out 2009–2010 data, the model achieved an improved accuracy of 62.5%. Note that the full model’s 56.1% was measured on the same data used to fit it, so this out-of-sample figure is the more honest benchmark. The confusion matrix, while still favoring “Up” predictions, shows proportionally fewer mistakes, suggesting the smaller model is more reliable than the one with all variables.
train <- Weekly$Year <= 2008
test <- Weekly$Year > 2008
glm.fit2 <- glm(Direction ~ Lag2, data = Weekly, subset = train, family = binomial)
glm.probs2 <- predict(glm.fit2, Weekly[!train, ], type = "response")
glm.pred2 <- ifelse(glm.probs2 > 0.5, "Up", "Down")
table(Predicted = glm.pred2, Actual = Weekly$Direction[!train])
## Actual
## Predicted Down Up
## Down 9 5
## Up 34 56
mean(glm.pred2 == Weekly$Direction[!train])
## [1] 0.625
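For context, the baseline of always predicting “Up” on the test period can be computed directly (the 2009–2010 weeks contain 61 “Up” and 43 “Down,” as the later confusion matrices show, so this ratio is fully determined by counts in the document):
# Share of "Up" weeks in the 2009-2010 test period: 61/104
mean(Weekly$Direction[!train] == "Up")
## [1] 0.5865385
So the 62.5% accuracy is a modest improvement over this 58.7% baseline.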
Linear Discriminant Analysis (LDA) was applied using only Lag2 as the predictor, trained on data from 1990 to 2008. When tested on the 2009–2010 data, the model achieved an accuracy of 62.5%, with a confusion matrix identical to that of the logistic regression model. This suggests that LDA and logistic regression perform equally well when using Lag2 alone to predict the market direction.
lda.fit <- lda(Direction ~ Lag2, data = Weekly, subset = train)
lda.pred <- predict(lda.fit, Weekly[!train, ])
table(Predicted = lda.pred$class, Actual = Weekly$Direction[!train])
## Actual
## Predicted Down Up
## Down 9 5
## Up 34 56
mean(lda.pred$class == Weekly$Direction[!train])
## [1] 0.625
QDA was used with Lag2 as the only predictor, trained on data from 1990 to 2008. On the 2009–2010 test data, it achieved an accuracy of 58.7%. The confusion matrix showed that the model predicted every week as “Up,” resulting in zero correct predictions for weeks when the market actually went down. This means the model was biased and performed worse than LDA and logistic regression.
qda.fit <- qda(Direction ~ Lag2, data = Weekly, subset = train)
qda.pred <- predict(qda.fit, Weekly[test, ])
table(Predicted = qda.pred$class, Actual = Weekly$Direction[test])
## Actual
## Predicted Down Up
## Down 0 0
## Up 43 61
mean(qda.pred$class == Weekly$Direction[test])
## [1] 0.5865385
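One way to see why QDA collapses to a single class is to inspect the posterior probabilities it assigns; a sketch (output omitted): if the posterior for “Down” never exceeds 0.5 on the test weeks, every week is classified “Up.”
# Largest posterior probability QDA assigns to "Down" on the test set
max(qda.pred$posterior[, "Down"])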
KNN with K = 1 didn’t work well. It achieved 50% accuracy, no better than random guessing, suggesting that K = 1 is too flexible and overfits the training data.
train.X <- data.frame(Lag2 = Weekly$Lag2[train])
test.X <- data.frame(Lag2 = Weekly$Lag2[!train])
train.Direction <- Weekly$Direction[train]
set.seed(1)
knn.pred <- knn(train.X, test.X, train.Direction, k = 1)
table(Predicted = knn.pred, Actual = Weekly$Direction[!train])
## Actual
## Predicted Down Up
## Down 21 30
## Up 22 31
mean(knn.pred == Weekly$Direction[!train])
## [1] 0.5
Naive Bayes was used with Lag2 as the predictor and trained on data from 1990 to 2008. On the 2009–2010 test set, it predicted “Up” for every week, just like QDA. The confusion matrix showed that it got the 61 “Up” weeks right and missed all 43 “Down” weeks, for an overall accuracy of 58.7%. Although Naive Bayes is often competitive, in this case it made the same one-sided mistake as QDA.
nb.fit <- naiveBayes(Direction ~ Lag2, data = Weekly, subset = train)
nb.pred <- predict(nb.fit, Weekly[!train, ])
table(Predicted = nb.pred, Actual = Weekly$Direction[!train])
## Actual
## Predicted Down Up
## Down 0 0
## Up 43 61
mean(nb.pred == Weekly$Direction[!train])
## [1] 0.5865385
Among all the models tested, logistic regression and LDA with Lag2 tied for the best performance, each with 62.5% accuracy and identical confusion matrices. QDA and Naive Bayes performed worse, predicting only “Up” and missing every “Down” week. KNN with K = 1 had the lowest accuracy at just 50%. Overall, logistic regression (and equivalently LDA) was the most reliable and balanced method for this data.
To improve the model, interaction terms and additional predictors were tested. A logistic regression model using Lag1 * Lag2, Lag3, and Volume was fit (below, on the full data set, to examine the coefficients); Lag2 remained statistically significant, while the interaction term Lag1:Lag2 was not. On the 2009–2010 test data, this model achieved an accuracy of 63.5%, better than the model using Lag2 alone (a sketch of the test-set evaluation follows the summary output below). For KNN, increasing K to 3 or 5 also improved accuracy compared to K = 1, showing that a moderate K helps reduce overfitting. Overall, the best result came from logistic regression with selected variables and interactions, making it the most effective approach for predicting market direction in this dataset.
glm.fit3 <- glm(Direction ~ Lag1 * Lag2 + Lag3 + Volume, data = Weekly, family = binomial)
summary(glm.fit3)
##
## Call:
## glm(formula = Direction ~ Lag1 * Lag2 + Lag3 + Volume, family = binomial,
## data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.255045 0.085134 2.996 0.00274 **
## Lag1 -0.035440 0.028143 -1.259 0.20793
## Lag2 0.057488 0.026829 2.143 0.03213 *
## Lag3 -0.015122 0.026497 -0.571 0.56819
## Volume -0.019599 0.036768 -0.533 0.59400
## Lag1:Lag2 0.002407 0.006757 0.356 0.72171
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1487.6 on 1083 degrees of freedom
## AIC: 1499.6
##
## Number of Fisher Scoring iterations: 4
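Neither the 63.5% test accuracy nor the larger-K KNN results are shown in the output above; a minimal sketch of how they could be computed, refitting the interaction model on the 1990–2008 training years and reusing the Lag2 objects defined earlier (the names glm.fit3t, glm.pred3, and knn.k are illustrative; numbers not reproduced here):
# Refit the interaction model on the training period and score 2009-2010
glm.fit3t <- glm(Direction ~ Lag1 * Lag2 + Lag3 + Volume,
                 data = Weekly, subset = train, family = binomial)
glm.pred3 <- ifelse(predict(glm.fit3t, Weekly[!train, ], type = "response") > 0.5, "Up", "Down")
mean(glm.pred3 == Weekly$Direction[!train]) # test accuracy
# KNN with moderate K on the same Lag2 train/test split
set.seed(1)
for (k in c(1, 3, 5)) {
  knn.k <- knn(train.X, test.X, train.Direction, k = k)
  cat("k =", k, "accuracy:", round(mean(knn.k == Weekly$Direction[!train]), 3), "\n")
}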
A new column called mpg01 was created to indicate whether a car has high or low gas mileage: it equals 1 when a car’s miles per gallon is above the median and 0 otherwise.
data("Auto")
Auto$mpg01 <- ifelse(Auto$mpg > median(Auto$mpg), 1, 0)
table(Auto$mpg01)
##
## 0 1
## 196 196
The boxplots show that cars with high MPG usually have lower weight, less horsepower, smaller engine displacement, and fewer cylinders, making these the most useful features for prediction.
par(mfrow = c(2, 3))
boxplot(displacement ~ mpg01, data = Auto, main = "Displacement")
boxplot(horsepower ~ mpg01, data = Auto, main = "Horsepower")
boxplot(weight ~ mpg01, data = Auto, main = "Weight")
boxplot(acceleration ~ mpg01, data = Auto, main = "Acceleration")
boxplot(cylinders ~ mpg01, data = Auto, main = "Cylinders")
boxplot(origin ~ mpg01, data = Auto, main = "Origin")
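The visual pattern can be cross-checked numerically; a quick sketch comparing group medians (output omitted):
# Median of each candidate predictor within the low- and high-mileage groups
aggregate(cbind(weight, horsepower, displacement, cylinders) ~ mpg01,
          data = Auto, FUN = median)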
The data was split into training (70%) and test (30%) sets so the models could learn from one part and be evaluated on the other, giving an honest measure of out-of-sample performance.
set.seed(1)
train_idx <- sample(1:nrow(Auto), nrow(Auto) * 0.7)
train <- Auto[train_idx, ]
test <- Auto[-train_idx, ]
Linear Discriminant Analysis (LDA) was used to predict mpg01 using the variables most related to fuel efficiency, such as weight, horsepower, and displacement. The model was trained on the training set and tested on the test set, where it achieved an accuracy of approximately 88.1%, with a test error rate of 11.9%. The confusion matrix showed that the model correctly classified most observations, making it a strong performer for this task.
lda.fit <- lda(mpg01 ~ cylinders + weight + horsepower + displacement, data = train)
lda.pred <- predict(lda.fit, test)
table(Predicted = lda.pred$class, Actual = test$mpg01)
## Actual
## Predicted 0 1
## 0 50 3
## 1 11 54
mean(lda.pred$class != test$mpg01)
## [1] 0.1186441
Quadratic Discriminant Analysis (QDA) was applied to predict mpg01 using key features like weight, horsepower, and displacement. On the test set, the model achieved an accuracy of 88.1%, resulting in a test error rate of 11.9%, the same as LDA. The confusion matrix showed that QDA correctly classified most cases, with slightly fewer false positives than LDA, and performed well overall.
qda.fit <- qda(mpg01 ~ cylinders + weight + horsepower + displacement, data = train)
qda.pred <- predict(qda.fit, test)
table(Predicted = qda.pred$class, Actual = test$mpg01)
## Actual
## Predicted 0 1
## 0 52 5
## 1 9 52
mean(qda.pred$class != test$mpg01)
## [1] 0.1186441
Logistic regression was used to predict mpg01 with the most relevant variables, including weight, horsepower, and displacement. The model performed very well on the test set, achieving an accuracy of 90.7% and a test error rate of 9.3%. The confusion matrix showed it correctly predicted both high- and low-MPG cars, making it the best-performing model so far.
glm.fit <- glm(mpg01 ~ cylinders + weight + horsepower + displacement,
data = train, family = binomial)
glm.probs <- predict(glm.fit, test, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
table(Predicted = glm.pred, Actual = test$mpg01)
## Actual
## Predicted 0 1
## 0 53 3
## 1 8 54
mean(glm.pred != test$mpg01)
## [1] 0.09322034
Naive Bayes was used to predict mpg01 using the most important features related to fuel efficiency. On the test set, the model achieved an accuracy of 88.98%, with a test error rate of approximately 11.0%. The confusion matrix showed it made a few more mistakes than logistic regression but still performed very well overall.
nb.fit <- naiveBayes(as.factor(mpg01) ~ cylinders + weight + horsepower + displacement, data = train)
nb.pred <- predict(nb.fit, test)
table(Predicted = nb.pred, Actual = test$mpg01)
## Actual
## Predicted 0 1
## 0 52 4
## 1 9 53
mean(nb.pred != test$mpg01)
## [1] 0.1101695
K-Nearest Neighbors (KNN) was used to predict mpg01 using key features, with several values of K tested. For K = 1 and K = 3, the test error rate was 14.4%, while K = 5 gave a slightly better result with a test error rate of 13.6%. This shows that using a slightly larger K helps improve accuracy by reducing overfitting.
train.X <- scale(train[, c("cylinders", "weight", "horsepower", "displacement")])
test.X <- scale(test[, c("cylinders", "weight", "horsepower", "displacement")])
train.Y <- train$mpg01
set.seed(1)
for (k in c(1, 3, 5)) {
knn.pred <- knn(train.X, test.X, train.Y, k = k)
err <- mean(knn.pred != test$mpg01)
cat("Test error for k =", k, ":", round(err, 3), "\n")
}
## Test error for k = 1 : 0.144
## Test error for k = 3 : 0.144
## Test error for k = 5 : 0.136
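One caveat about the chunk above: scale() standardizes the training and test sets separately, each with its own means and standard deviations. A common alternative (a sketch of the idea, not what was run above; vars, train.sc, and test.sc are illustrative names) reuses the training parameters for the test set:
vars <- c("cylinders", "weight", "horsepower", "displacement")
train.sc <- scale(train[, vars])
# Standardize the test set with the training means and SDs
test.sc <- scale(test[, vars],
                 center = attr(train.sc, "scaled:center"),
                 scale = attr(train.sc, "scaled:scale"))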
To find out whether a neighborhood in Boston has a high crime rate, a new variable called crim01 was created. This variable is 1 if the crime rate is above the median, and 0 if it is below. The analysis used a few features from the dataset that are related to neighborhood conditions:
- nox: air pollution
- rm: number of rooms in homes
- dis: distance to employment centers
- tax: property tax
- lstat: percentage of low-income residents
The data was then split into 70% for training and 30% for testing, and different models were tried to see which could best predict high or low crime.
data("Boston")
Boston$crim01 <- ifelse(Boston$crim > median(Boston$crim), 1, 0)
Boston <- Boston[, !names(Boston) %in% "crim"]
set.seed(1)
train_idx <- sample(1:nrow(Boston), nrow(Boston) * 0.7)
train <- Boston[train_idx, ]
test <- Boston[-train_idx, ]
A logistic regression model was trained using the predictors nox, rm, dis, tax, and lstat. On the test set, it correctly classified most neighborhoods, with 61 true negatives and 64 true positives. It made 27 errors in total (12 false positives and 15 false negatives), resulting in a test error rate of about 17.8%, or 82.2% accuracy. The fitted coefficients indicated that higher air pollution (nox) and property tax (tax) were linked to more crime, while more rooms and greater distance from employment centers were associated with less crime. The model performed well and was easy to interpret.
glm.fit <- glm(crim01 ~ nox + rm + dis + tax + lstat, data = train, family = binomial)
glm.probs <- predict(glm.fit, test, type = "response")
glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
table(Predicted = glm.pred, Actual = test$crim01)
## Actual
## Predicted 0 1
## 0 61 15
## 1 12 64
mean(glm.pred != test$crim01)
## [1] 0.1776316
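The coefficient signs mentioned above are not printed in this output; they can be checked with the model summary (a sketch; output omitted):
# Positive coefficients push predictions toward high crime, negative ones toward low crime
summary(glm.fit)$coefficients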
The LDA model used the same predictors as logistic regression and performed well on the test data. It correctly predicted 68 low-crime and 60 high-crime neighborhoods, while making 24 mistakes (5 false positives and 19 false negatives). This gave a test error rate of about 15.8%, slightly better than logistic regression. LDA did a good job overall, especially at correctly identifying low-crime areas.
lda.fit <- lda(crim01 ~ nox + rm + dis + tax + lstat, data = train)
lda.pred <- predict(lda.fit, test)
table(Predicted = lda.pred$class, Actual = test$crim01)
## Actual
## Predicted 0 1
## 0 68 19
## 1 5 60
mean(lda.pred$class != test$crim01)
## [1] 0.1578947
The Naive Bayes model also used the same predictors and gave strong results. It correctly predicted 64 low-crime and 64 high-crime neighborhoods, with 24 total errors (9 false positives and 15 false negatives). The test error rate was about 15.8%, matching the performance of LDA. This model was simple, fast, and accurate, making it a reliable choice for this classification task.
nb.fit <- naiveBayes(as.factor(crim01) ~ nox + rm + dis + tax + lstat, data = train)
nb.pred <- predict(nb.fit, test)
table(Predicted = nb.pred, Actual = test$crim01)
## Actual
## Predicted 0 1
## 0 64 15
## 1 9 64
mean(nb.pred != test$crim01)
## [1] 0.1578947
The KNN model was tested using several values of K. The best performance came from K = 3, with a test error rate of only 6.6%, the lowest among all models. K = 1 also performed well with an error of 8.6%, while K = 5 and K = 10 had slightly higher errors of 9.2%. These results show that KNN is highly accurate when the right K value is chosen, especially with properly scaled data. However, KNN can be harder to interpret and is more sensitive to noise or irrelevant features.
train.X <- scale(train[, c("nox", "rm", "dis", "tax", "lstat")])
test.X <- scale(test[, c("nox", "rm", "dis", "tax", "lstat")])
train.Y <- train$crim01
for (k in c(1, 3, 5, 10)) {
knn.pred <- knn(train.X, test.X, train.Y, k = k)
err <- mean(knn.pred != test$crim01)
cat("Test error for k =", k, ":", round(err, 3), "\n")
}
## Test error for k = 1 : 0.086
## Test error for k = 3 : 0.066
## Test error for k = 5 : 0.092
## Test error for k = 10 : 0.092
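Note that knn() breaks distance ties at random and no seed is set immediately before this loop (the earlier set.seed(1) precedes the train/test split), so rerunning the loop in isolation may give slightly different errors. A reproducible variant (a sketch):
set.seed(1)
for (k in c(1, 3, 5, 10)) {
  knn.pred <- knn(train.X, test.X, train.Y, k = k)
  cat("Test error for k =", k, ":", round(mean(knn.pred != test$crim01), 3), "\n")
}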
In this analysis, several classification models were used to predict whether a Boston neighborhood has a high or low crime rate. All models used the same set of predictors related to housing and environment: nox, rm, dis, tax, and lstat.
KNN with K = 3 gave the best performance, with a test error rate of just 6.6%.
Naive Bayes and LDA both performed well with identical test error rates of 15.8%.
Logistic regression also did well, with a slightly higher error of 17.8%, but was easy to interpret and reliable.
While KNN was the most accurate, it is more complex to tune and interpret. Logistic regression, LDA, and Naive Bayes offer a strong balance between simplicity and performance. Overall, this shows that crime in Boston neighborhoods can be predicted with good accuracy using a few key features related to housing and socio-economic conditions.