This question should be answered using the Weekly data set, which […] contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
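Throughout, the code assumes a set-up chunk along these lines (not shown in the original); note that black appears as a Boston predictor later on, so the MASS version of that data set is the one in use here:
library(ISLR)     # Weekly and Auto data sets
library(MASS)     # lda(), qda(), and the Boston data set
library(class)    # knn()
library(e1071)    # naiveBayes()
library(corrplot) # corrplot()
library(dplyr)    # mutate() and %>%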
corrplot(cor(Weekly[,-9]), method = "number")
From the correlation plot, we observe that Year and
Volume are highly correlated, while none of the remaining
variables show a significant linear relationship with one another.
(b) Use the full data set to perform a logistic
regression with Direction as the response and the five lag
variables plus Volume as predictors. Use the
summary() function to print the results. Do any of the
predictors appear to be statistically significant? If so, which
ones?
glm1.fit = glm(Direction ~ Lag1+Lag2+Lag3+Lag4+Lag5+Volume, Weekly, family = binomial)
summary(glm1.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
We observe from the output above that Lag2 is the only
statistically significant predictor of Direction using a
significance level of \(\alpha=.05\).
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
glm1.probs = predict(glm1.fit, type = "response")
glm1.pred = rep("Down", length(glm1.probs))
glm1.pred[glm1.probs >.5] = "Up"
table(glm1.pred, Weekly$Direction)
##
## glm1.pred Down Up
## Down 54 48
## Up 430 557
By taking the correct predictions from the confusion matrix and dividing by the total number of observations, we get an overall accuracy of 56.1%, as shown below. This suggests that, overall, the model predicts only slightly better than random chance.
(54+557)/1089
## [1] 0.5610652
When looking at the model's ability to predict Up and Down respectively, we notice a significant difference between the two. Below are the calculations for each group, showing that the model correctly predicted Up 92.1% of the time, while correctly predicting Down just 11.2% of the time.
557/(557+48) # Up
## [1] 0.9206612
54/(54+430) # Down
## [1] 0.1115702
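These per-class rates can also be computed programmatically. Below is a small helper of my own (not part of the original solution); it assumes the table's rows are predictions and its columns are the true classes, with Up treated as the positive class.
# Overall accuracy plus per-class recall from a 2x2 confusion table
conf_stats = function(tab) {
  c(accuracy = sum(diag(tab)) / sum(tab),
    up_recall = tab["Up", "Up"] / sum(tab[, "Up"]),
    down_recall = tab["Down", "Down"] / sum(tab[, "Down"]))
}
conf_stats(table(glm1.pred, Weekly$Direction))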
(d) Now fit the logistic regression model using a
training data period from 1990 to 2008, with Lag2 as the
only predictor. Compute the confusion matrix and the overall fraction of
correct predictions for the held out data (that is, the data from 2009
and 2010).
w_train = (Weekly$Year < 2009)
w_test = Weekly[!w_train,]
glm2.fit = glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = w_train)
glm2.prob = predict(glm2.fit, w_test, type = "response")
glm2.pred = rep("Down", length(glm2.prob))
glm2.pred[glm2.prob > 0.5] = "Up"
Direction.0910 = Weekly$Direction[!w_train]
table(glm2.pred, Direction.0910)
## Direction.0910
## glm2.pred Down Up
## Down 9 5
## Up 34 56
(9+56)/(9+5+34+56) # Overall
## [1] 0.625
56/(56+5) # Up
## [1] 0.9180328
9/(9+34) # Down
## [1] 0.2093023
In comparison with the model from (b), this model using
Lag2 as the only predictor and a train/test split
improved the overall accuracy from 56.1% to 62.5% (note that the
56.1% figure was computed on the training data, whereas 62.5% is
held-out accuracy). Likewise, the model's ability to correctly predict
Down improved substantially, while its accuracy on Up decreased slightly.
(e) Repeat (d) using LDA.
lda1.fit = lda(Direction ~ Lag2, data = Weekly, subset = w_train)
lda1.pred = predict(lda1.fit, w_test)
table(lda1.pred$class, Direction.0910)
## Direction.0910
## Down Up
## Down 9 5
## Up 34 56
acc_lda <- mean(lda1.pred$class == Direction.0910)
cat("LDA Model Accuracy:", acc_lda, "\n")
## LDA Model Accuracy: 0.625
The LDA model produces identical results to the logistic regression from (d).
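Identical confusion matrices do not by themselves guarantee identical predictions, so it is worth verifying the agreement observation by observation (my addition, reusing the objects fitted above):
# TRUE if logistic regression and LDA classify every held-out week the same way
all(glm2.pred == as.character(lda1.pred$class))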
(f) Repeat (d) using QDA.
qda1.fit = qda(Direction ~ Lag2, data = Weekly, subset = w_train)
qda1.pred = predict(qda1.fit, w_test)
table(qda1.pred$class, Direction.0910)
## Direction.0910
## Down Up
## Down 0 0
## Up 43 61
acc_qda <- mean(qda1.pred$class == Direction.0910)
cat("QDA Model Accuracy:", acc_qda, "\n")
## QDA Model Accuracy: 0.5865385
While the overall accuracy of the QDA model (58.7%) is only slightly below that of the previous two models, the confusion matrix shows that it exclusively predicted "Up". In other words, the model did not learn to discriminate between the classes at all.
(g) Repeat (d) using KNN with \(K=1\).
train.X = as.matrix(Weekly$Lag2[w_train])
test.X = as.matrix(Weekly$Lag2[!w_train])
train.Direction = Weekly$Direction[w_train]
set.seed(1)
knn1.pred = knn(train.X, test.X, train.Direction, k=1)
table(knn1.pred, Direction.0910)
## Direction.0910
## knn1.pred Down Up
## Down 21 30
## Up 22 31
acc_knn <- mean(knn1.pred == Direction.0910)
cat("KNN Model Accuracy:", acc_knn, "\n")
## KNN Model Accuracy: 0.5
The overall accuracy of the KNN model is exactly 50%, i.e., no better than a coin flip.
(h) Repeat (d) using naive Bayes.
bayes1.fit = naiveBayes(Direction ~ Lag2, data = Weekly, subset = w_train)
bayes1.pred = predict(bayes1.fit, w_test)
table(bayes1.pred, Direction.0910)
## Direction.0910
## bayes1.pred Down Up
## Down 0 0
## Up 43 61
acc_bayes <- mean(bayes1.pred == Direction.0910)
cat("Naive Bayes Model Accuracy:", acc_bayes, "\n")
## Naive Bayes Model Accuracy: 0.5865385
Like the QDA, the naive Bayes model exclusively predicted “Up”, making the model rather useless.
(i) Which of these methods appears to provide the best results on this data?
The LDA and the logistic regression model using only Lag2
performed equally well, both yielding an overall accuracy of 62.5%.
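For reference, the held-out accuracies can be collected in one place (assembled from the quantities computed above):
# Side-by-side held-out accuracy of the five classifiers from (d)-(h)
data.frame(model = c("Logistic (Lag2)", "LDA", "QDA", "KNN (K=1)", "Naive Bayes"),
           accuracy = c(mean(glm2.pred == Direction.0910),
                        acc_lda, acc_qda, acc_knn, acc_bayes))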
(j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for \(K\) in the KNN classifier.
Testing a logistic regression with an interaction term between
Lag1 and Lag2:
glm3.fit = glm(Direction ~ Lag1*Lag2, data = Weekly, family = binomial, subset = w_train)
glm3.prob = predict(glm3.fit, w_test, type = "response")
glm3.pred = rep("Down", length(glm3.prob))
glm3.pred[glm3.prob > 0.5] = "Up"
Direction.0910 = Weekly$Direction[!w_train]
table(glm3.pred, Direction.0910)
## Direction.0910
## glm3.pred Down Up
## Down 7 8
## Up 36 53
acc_glm3 <- mean(glm3.pred == Direction.0910)
cat("Accuracy of Logit Model w. Interaction Term:", acc_glm3, "\n")
## Accuracy of Logit Model w. Interaction Term: 0.5769231
Testing an LDA with a polynomial transformation of Lag2:
lda2.fit = lda(Direction ~ poly(Lag2, 2), data = Weekly, subset = w_train)
lda2.pred = predict(lda2.fit, w_test)
table(lda2.pred$class, Direction.0910)
## Direction.0910
## Down Up
## Down 7 4
## Up 36 57
acc_lda2 <- mean(lda2.pred$class == Direction.0910)
cat("Accuracy of LDA Model w. Poly Transformation:", acc_lda2, "\n")
## Accuracy of LDA Model w. Poly Transformation: 0.6153846
Testing a KNN classifier using \(k=5\):
train.X = as.matrix(Weekly$Lag2[w_train])
test.X = as.matrix(Weekly$Lag2[!w_train])
train.Direction = Weekly$Direction[w_train]
set.seed(1)
knn2.pred = knn(train.X, test.X, train.Direction, k=5)
table(knn2.pred, Direction.0910)
## Direction.0910
## knn2.pred Down Up
## Down 16 21
## Up 27 40
acc_knn2 <- mean(knn2.pred == Direction.0910)
cat("Accuracy of KNN Model w. K=5:", acc_knn2, "\n")
## Accuracy of KNN Model w. K=5: 0.5384615
Testing a KNN classifier using \(k=20\):
train.X = as.matrix(Weekly$Lag2[w_train])
test.X = as.matrix(Weekly$Lag2[!w_train])
train.Direction = Weekly$Direction[w_train]
set.seed(1)
knn3.pred = knn(train.X, test.X, train.Direction, k=20)
table(knn3.pred, Direction.0910)
## Direction.0910
## knn3.pred Down Up
## Down 21 21
## Up 22 40
acc_knn3 <- mean(knn3.pred == Direction.0910)
cat("Accuracy of KNN Model w. K=20:", acc_knn3, "\n")
## Accuracy of KNN Model w. K=20: 0.5865385
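Rather than trying values of \(K\) one at a time, a short loop (a sketch, not part of the original write-up) scans a grid and records the held-out accuracy of each:
# Held-out accuracy of KNN for K = 1, ..., 30
set.seed(1)
k_acc = sapply(1:30, function(k)
  mean(knn(train.X, test.X, train.Direction, k = k) == Direction.0910))
which.max(k_acc)  # K with the highest held-out accuracy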
Testing a naive Bayes model with all lags as predictors:
bayes2.fit = naiveBayes(Direction ~ Lag1+Lag2+Lag3+Lag4+Lag5, data = Weekly, subset = w_train)
bayes2.pred = predict(bayes2.fit, w_test)
table(bayes2.pred, Direction.0910)
## Direction.0910
## bayes2.pred Down Up
## Down 10 21
## Up 33 40
acc_bayes2 <- mean(bayes2.pred == Direction.0910)
cat("Accuracy of Naive Bayes Model w. All Lags:", acc_bayes2, "\n")
## Accuracy of Naive Bayes Model w. All Lags: 0.4807692
After testing interaction terms, additional predictors, transformations,
and different values of \(k\), I did not achieve any better results than
before. In conclusion, the original LDA and logistic regression models
using Lag2 as the only predictor remain the best of the classifiers
considered.
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
(a) Create a binary variable, mpg01, that
contains a 1 if mpg contains a value above its median, and
a 0 if mpg contains a value below its median. You can
compute the median using the median()
function.
Auto <- Auto %>%
mutate(mpg01 = as.numeric(mpg > median(mpg)))
…
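As a quick sanity check (not shown in the original output), a median split should produce roughly balanced classes, with observations exactly at the median falling into class 0:
table(Auto$mpg01)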
(b) Explore the data graphically in order to investigate
the association between mpg01 and the other features. Which
of the other features seem most likely to be useful in predicting
mpg01? Scatterplots and boxplots may be useful tools to
answer this question. Describe your findings.
corrplot(cor(Auto[,-c(1, 9)]), method = "number", number.cex = 0.9)
From the correlation plot we observe that mpg01 exhibits
the strongest linear correlations with cylinders,
weight, displacement, and
horsepower, all of which show negative correlations with
mpg01. This suggests that vehicles with more cylinders,
greater weight, larger engine displacements, and more horsepower tend to
have below-median fuel efficiency. To further investigate these
relationships, below are four boxplots illustrating the distribution of
these variables in relation to mpg01.
boxplot(cylinders ~ mpg01, Auto)
boxplot(weight ~ mpg01, Auto)
boxplot(displacement ~ mpg01, Auto)
boxplot(horsepower ~ mpg01, Auto)
As expected, the boxplots reveal distinct differences in the
distributions of cylinders, weight,
displacement, and horsepower when
mpg01 is 0 compared to when it is 1.
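A compact numeric complement to the boxplots (my addition) is the median of each variable within the two mpg01 groups:
# Group medians of the four candidate predictors
aggregate(cbind(cylinders, weight, displacement, horsepower) ~ mpg01,
          data = Auto, FUN = median)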
(c) Split the data into a training set and a test set.
set.seed(1)
index = sample(nrow(Auto), 0.7*nrow(Auto), replace = FALSE)
a_train = Auto[index,]
a_test = Auto[-index,]
The data is split into a training set and test set using a 70/30 split.
(d) Perform LDA on the training data in order to predict
mpg01 using the variables that seemed most associated with
mpg01 in (b). What is the test error of the model obtained?
lda.fit1 = lda(mpg01 ~ cylinders + weight + displacement + horsepower,
data = a_train)
lda.pred1 = predict(lda.fit1, a_test)
table(lda.pred1$class, a_test$mpg01)
##
## 0 1
## 0 50 3
## 1 11 54
error_lda <- mean(lda.pred1$class != a_test$mpg01)
cat("LDA Model Test Error:", error_lda, "\n")
## LDA Model Test Error: 0.1186441
The LDA model results in a test error of 11.86%.
(e) Perform QDA on the training data in order to predict
mpg01 using the variables that seemed most associated with
mpg01 in (b). What is the test error of the model obtained?
qda.fit1 = qda(mpg01 ~ cylinders + weight + displacement + horsepower,
data = a_train)
qda.pred1 = predict(qda.fit1, a_test)
table(qda.pred1$class, a_test$mpg01)
##
## 0 1
## 0 52 5
## 1 9 52
error_qda <- mean(qda.pred1$class != a_test$mpg01)
cat("QDA Model Test Error:", error_qda, "\n")
## QDA Model Test Error: 0.1186441
The QDA model yields exactly the same overall test error as the LDA model (11.86%). A closer look at the confusion matrices, however, reveals differences in the per-class predictions: QDA correctly classifies more of the low-mileage cars (52 vs. 50) but fewer of the high-mileage cars (52 vs. 54). In other words, the two models' per-class precision and recall differ even though their overall error rates coincide.
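This can be made concrete by computing per-class recall from the two tables (my addition): taking column-wise proportions places each class's recall on the diagonal.
# Per-class recall (proportion of each true class predicted correctly)
diag(prop.table(table(lda.pred1$class, a_test$mpg01), margin = 2))  # LDA
diag(prop.table(table(qda.pred1$class, a_test$mpg01), margin = 2))  # QDA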
(f) Perform logistic regression on the training data in
order to predict mpg01 using the variables that seemed most
associated with mpg01 in (b). What is the test error of the
model obtained?
glm.fit1 = glm(mpg01 ~ cylinders + weight + displacement + horsepower,
data = a_train, family = binomial)
glm.prob1 = predict(glm.fit1, a_test, type = "response")
glm.pred1 = rep(0, length(glm.prob1))
glm.pred1[glm.prob1 > 0.5] = 1
table(glm.pred1, a_test$mpg01)
##
## glm.pred1 0 1
## 0 53 3
## 1 8 54
error_glm1 <- mean(glm.pred1 != a_test$mpg01)
cat("Logistic Regression Model Test Error:", error_glm1, "\n")
## Logistic Regression Model Test Error: 0.09322034
The logistic regression model results in a test error of 9.32%, making it the best-performing model so far, ahead of both the LDA and QDA in terms of test error.
(g) Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
bayes.fit1 = naiveBayes(mpg01 ~ cylinders + weight + displacement + horsepower,
data = a_train)
bayes.pred1 = predict(bayes.fit1, a_test)
table(bayes.pred1, a_test$mpg01)
##
## bayes.pred1 0 1
## 0 52 4
## 1 9 53
error_bayes1 <- mean(bayes.pred1 != a_test$mpg01)
cat("Naive Bayes Model Test Error:", error_bayes1, "\n")
## Naive Bayes Model Test Error: 0.1101695
The Naive Bayes model results in a test error of 11.02%.
(h) Perform KNN on the training data, with several values
of K, in order to predict mpg01. Use only the variables
that seemed most associated with mpg01 in (b). What test
errors do you obtain? Which value of K seems to perform the best on this
data set?
train.A <- as.matrix(a_train[, c("displacement", "horsepower", "weight", "cylinders")])
test.A <- as.matrix(a_test[, c("displacement", "horsepower", "weight", "cylinders")])
set.seed(1)
knn.pred1 <- knn(train.A, test.A, a_train$mpg01, k = 3)
table(knn.pred1, a_test$mpg01)
##
## knn.pred1 0 1
## 0 52 4
## 1 9 53
err_knn1 <- mean(knn.pred1 != a_test$mpg01)
cat("KNN Model Test Error:", err_knn1, "\n")
## KNN Model Test Error: 0.1101695
After experimenting with different values of \(k\) ranging from 1 to 10, I found that \(k=3\) yields the lowest test error of 11.02%, matching the naive Bayes model. This slightly outperforms the LDA and QDA models, while the logistic regression model remains superior among the assessed methods in terms of test error.
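The experiment described above can be written as a loop (a sketch of what was done, assuming the train.A and test.A matrices from the previous chunk):
# Held-out error of KNN for K = 1, ..., 10
set.seed(1)
sapply(1:10, function(k)
  mean(knn(train.A, test.A, a_train$mpg01, k = k) != a_test$mpg01))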
Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.
To prepare and examine the data, I will:
* Create the binary variable for crime, crim01
* Split the data into a training set and a testing set using a 70/30 split
* Use the corrplot() function to detect which predictors exhibit the strongest linear relationship with crim01:
Boston <- Boston %>%
mutate(crim01 = as.numeric(crim > median(crim)))
set.seed(1)
index = sample(nrow(Boston), 0.7*nrow(Boston), replace = FALSE)
b_train = Boston[index,]
b_test = Boston[-index,]
corrplot(cor(Boston[,-1]), method = "number", number.cex = 0.65)
The correlation plot reveals that crim01 has the
strongest linear correlations with indus,
nox, age, dis, rad,
and tax. All of these variables are positively correlated
with crim01 except dis, which has a negative
linear relationship with it.
My first model uses logistic regression to predict the response variable from the predictors described above.
my.glm.fit = glm(crim01 ~ indus + nox + age + dis + rad + tax , b_train, family = binomial)
my.glm.prob = predict(my.glm.fit, b_test, type = "response")
my.glm.pred = rep(0, length(my.glm.prob))
my.glm.pred[my.glm.prob > 0.5] = 1
table(my.glm.pred, b_test$crim01)
##
## my.glm.pred 0 1
## 0 58 6
## 1 15 73
acc.my.glm <- mean(my.glm.pred == b_test$crim01)
cat("Accuracy of Logit Model w. Linearly Correlated Predictors:", acc.my.glm, "\n")
## Accuracy of Logit Model w. Linearly Correlated Predictors: 0.8618421
Next, I will assess a model containing all predictors (excluding
crim):
my.glm.fit2 = glm(crim01 ~ .-crim , b_train, family = binomial)
my.glm.prob2 = predict(my.glm.fit2, b_test, type = "response")
my.glm.pred2 = rep(0, length(my.glm.prob2))
my.glm.pred2[my.glm.prob2 > 0.5] = 1
table(my.glm.pred2, b_test$crim01)
##
## my.glm.pred2 0 1
## 0 62 5
## 1 11 74
acc.my.glm2 <- mean(my.glm.pred2 == b_test$crim01)
cat("Accuracy of Logit Model w. All Predictors:", acc.my.glm2, "\n")
## Accuracy of Logit Model w. All Predictors: 0.8947368
Third, I will test a logistic regression model using the predictors retained by stepwise selection under the AIC criterion.
null_model = glm(crim01 ~ 1, data = b_train, family = binomial)
full_model = my.glm.fit2
step.model.AIC = step(null_model, scope = list(upper = full_model),
direction = "both", test = "Chisq", trace = F)
summary(step.model.AIC)
##
## Call:
## glm(formula = crim01 ~ nox + rad + tax + dis + zn + rm + lstat +
## black, family = binomial, data = b_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -37.090639 6.921915 -5.358 8.39e-08 ***
## nox 47.749134 7.891777 6.050 1.44e-09 ***
## rad 0.675929 0.154962 4.362 1.29e-05 ***
## tax -0.007918 0.002798 -2.830 0.00466 **
## dis 0.910433 0.253567 3.591 0.00033 ***
## zn -0.108743 0.040446 -2.689 0.00718 **
## rm 1.467346 0.511761 2.867 0.00414 **
## lstat 0.108594 0.052865 2.054 0.03996 *
## black -0.007586 0.005156 -1.471 0.14121
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 490.65 on 353 degrees of freedom
## Residual deviance: 148.44 on 345 degrees of freedom
## AIC: 166.44
##
## Number of Fisher Scoring iterations: 9
Since black is not statistically significant at the 5% level, I drop it and refit with the remaining selected predictors:
my.glm.fit3 = glm(crim01 ~ nox + rad + tax + dis + zn + rm + lstat, b_train, family = binomial)
my.glm.prob3 = predict(my.glm.fit3, b_test, type = "response")
my.glm.pred3 = rep(0, length(my.glm.prob3))
my.glm.pred3[my.glm.prob3 > 0.5] = 1
table(my.glm.pred3, b_test$crim01)
##
## my.glm.pred3 0 1
## 0 59 8
## 1 14 71
acc.my.glm3 <- mean(my.glm.pred3 == b_test$crim01)
cat("Accuracy of Logit Model w. Selected Predictors:", acc.my.glm3, "\n")
## Accuracy of Logit Model w. Selected Predictors: 0.8552632
When comparing the three logistic regression models assessed above, we observe the greatest accuracy for the model using all predictors (89.5%). However, the model using the linearly correlated predictors and the model using the stepwise-selected predictors perform almost equally well, achieving accuracies of 86.2% and 85.5% respectively. The simplicity of these two models is generally preferable to the full model, especially considering the minor difference in predictive power.
The fourth model I will test is an LDA using the predictors from stepwise selection.
my.lda.fit = lda(crim01 ~ nox + rad + tax + dis + zn +rm +lstat, b_train)
my.lda.pred = predict(my.lda.fit, b_test)
table(my.lda.pred$class, b_test$crim01)
##
## 0 1
## 0 73 20
## 1 0 59
acc.my.lda <- mean(my.lda.pred$class == b_test$crim01)
cat("Accuracy of LDA Model:", acc.my.lda, "\n")
## Accuracy of LDA Model: 0.8684211
The fifth model applies QDA to the selected predictors.
my.qda.fit = qda(crim01 ~ nox + rad + tax + dis + zn +rm +lstat, b_train)
my.qda.pred = predict(my.qda.fit, b_test)
table(my.qda.pred$class, b_test$crim01)
##
## 0 1
## 0 67 8
## 1 6 71
acc.my.qda <- mean(my.qda.pred$class == b_test$crim01)
cat("Accuracy of QDA Model:", acc.my.qda, "\n")
## Accuracy of QDA Model: 0.9078947
Next, we will test a naive Bayes model on the selected predictors.
my.bayes.fit = naiveBayes(crim01 ~ nox + rad + tax + dis + zn +rm +lstat, b_train)
my.bayes.pred = predict(my.bayes.fit, b_test)
table(my.bayes.pred, b_test$crim01)
##
## my.bayes.pred 0 1
## 0 63 20
## 1 10 59
acc.my.bayes <- mean(my.bayes.pred == b_test$crim01)
cat("Accuracy of Naive Bayes Model:", acc.my.bayes, "\n")
## Accuracy of Naive Bayes Model: 0.8026316
The final model will employ KNN on the selected predictors, using \(k=1\) after testing different values of \(k\) for optimal accuracy.
train.B <- as.matrix(b_train[, c("nox", "rad", "tax", "dis", "zn", "rm", "lstat")])
test.B <- as.matrix(b_test[, c("nox", "rad", "tax", "dis", "zn", "rm", "lstat")])
set.seed(1)
my.knn.pred <- knn(train.B, test.B, b_train$crim01, k = 1)
table(my.knn.pred, b_test$crim01)
##
## my.knn.pred 0 1
## 0 67 1
## 1 6 78
acc.my.knn <- mean(my.knn.pred == b_test$crim01)
cat("Accuracy of KNN Model:", acc.my.knn, "\n")
## Accuracy of KNN Model: 0.9539474
After evaluating the seven models above, employing various methodologies and different combinations of predictors for logistic regression, as well as testing a range of values of \(k\) for KNN, the model that exhibited the best performance was the KNN model with \(k=1\). It achieved an accuracy of 95.39%, followed by the QDA model with an accuracy of 90.79%. In conclusion, the KNN model was the best at predicting whether a census tract has a crime rate above or below the median.
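One caveat: KNN is distance-based, so running it on unscaled predictors lets large-scale variables such as tax dominate the distance metric. A variant worth checking (a sketch under the same train/test split, not part of the analysis above) standardizes each column using the training means and standard deviations:
# Standardize the predictors before KNN; distance-based methods are
# sensitive to scale (here tax and rad dwarf variables like nox)
vars = c("nox", "rad", "tax", "dis", "zn", "rm", "lstat")
train.Bs = scale(b_train[, vars])
test.Bs = scale(b_test[, vars],
                center = attr(train.Bs, "scaled:center"),
                scale = attr(train.Bs, "scaled:scale"))
set.seed(1)
mean(knn(train.Bs, test.Bs, b_train$crim01, k = 1) == b_test$crim01)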