library(tidyverse)
library(openintro)
library(ISLR2)
library(stats)
library(MASS)
library(e1071)
library(class)
library(dplyr)This question should be answered using the
Weekly data set, which is part of the ISLR2
package. This data is similar in nature to the Smarket data
from this chapter’s lab, except that it contains 1,089 weekly returns
for 21 years, from the beginning of 1990 to the end of
2010.
#load the data
data(Weekly)Produce some numerical and graphical summaries of the
Weekly data. Do there appear to be any
patterns?
summary(Weekly)## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
plot(Weekly)cor(Weekly[,-9])## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
plot(Weekly$Volume)From the summary statistics, Lag1 through
Lag5 appear to have nearly identical data. From the
scatterplot matrix, there definitely seems to be some kind of
relationship between Year and Volume, which is
confirmed by the correlation matrix. There do not appear to be any other
strong relationships in the correlation matrix. Finally, we can see that
Volume increases with increasing (chronological) index;
that is, over time.
Use the full data set to perform a logistic regression with
Direction as the response and the five lag variables plus
Volume as predictors. Use the summary function to print the
results. Do any of the predictors appear to be statistically
significant? If so, which ones?
#logistic regression model
glm.fits <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Weekly, family = binomial)
summary(glm.fits)##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Lag2 is the only predictor that appears to be
statistically significant, p < 0.5, according to this model.
Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
#prepare confusion matrix
glm.probs = predict(glm.fits, type = 'response')
glm.pred = rep("Down", 1089)
glm.pred[glm.probs > 0.5] = "Up"
writeLines("Confusion Matrix and Overall Fraction of Correct Predictions:")## Confusion Matrix and Overall Fraction of Correct Predictions:
table(glm.pred, Weekly$Direction)##
## glm.pred Down Up
## Down 54 48
## Up 430 557
mean(glm.pred == Weekly$Direction)## [1] 0.5610652
Correct predictions: 0.5610652 or ~56%
This model predicted “Up” correctly
(557/(48+557))*100 = 92.06612 about 92% of the time.
However, it predicted “Down” correctly
(54/(54+430))*100 = 11.15702 only about 11% of the
time.
Now fit the logistic regression model using a training data
period from 1990 to 2008, with Lag2 as the only predictor.
Compute the confusion matrix and the overall fraction of correct
predictions for the held out data (that is, the data from 2009 and
2010).
#define training and testing data sets
train = (Weekly$Year < 2009)
Weekly.train = Weekly[train,]
Weekly.test = Weekly[!train,]
#logistic regression with `Lag2` as only predictor
glm.fitsTrained <- glm(Direction ~ Lag2,
data = Weekly.train, family = binomial)
#prepare confusion matrix
glm.probs = predict(glm.fitsTrained, Weekly.test, type = 'response')
glm.pred = rep("Down", 104)
glm.pred[glm.probs > 0.5] = "Up"
writeLines("Confusion Matrix and Overall Fraction of Correct Predictions:")## Confusion Matrix and Overall Fraction of Correct Predictions:
table(glm.pred, Weekly.test$Direction)##
## glm.pred Down Up
## Down 9 5
## Up 34 56
mean(glm.pred == Weekly.test$Direction)## [1] 0.625
Repeat (d) using LDA.
#linear discriminant analysis model
lda.fit <- lda(Direction ~ Lag2, data = Weekly.train)
lda.fit## Call:
## lda(Direction ~ Lag2, data = Weekly.train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
#prepare confusion matrix
lda.pred <- predict(lda.fit, Weekly.test)
names(lda.pred)## [1] "class" "posterior" "x"
writeLines("\n\nConfusion Matrix and Overall Fraction of Correct Predictions:")##
##
## Confusion Matrix and Overall Fraction of Correct Predictions:
table(lda.pred$class, Weekly.test$Direction)##
## Down Up
## Down 9 5
## Up 34 56
mean(lda.pred$class == Weekly.test$Direction)## [1] 0.625
Repeat (d) using QDA.
#quadratic discriminant analysis model
qda.fit <- qda(Direction ~ Lag2, data = Weekly.train)
qda.fit## Call:
## qda(Direction ~ Lag2, data = Weekly.train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
#prepare confusion matrix
qda.pred <- predict(qda.fit, Weekly.test)
names(lda.pred)## [1] "class" "posterior" "x"
writeLines("\n\nConfusion Matrix and Overall Fraction of Correct Predictions:")##
##
## Confusion Matrix and Overall Fraction of Correct Predictions:
table(qda.pred$class, Weekly.test$Direction)##
## Down Up
## Down 0 0
## Up 43 61
mean(qda.pred$class == Weekly.test$Direction)## [1] 0.5865385
Repeat (d) using KNN with K = 1.
#KNN, K = 1
train.x = cbind(Weekly[train,]$Lag2)
test.x = cbind(Weekly[!train,]$Lag2)
train.Direction = Weekly[train,]$Direction
set.seed(291)
knn.pred <- knn(train.x, test.x, train.Direction, k = 1)
writeLines("\n\nConfusion Matrix and Overall Fraction of Correct Predictions:")##
##
## Confusion Matrix and Overall Fraction of Correct Predictions:
table(knn.pred, Weekly.test$Direction)##
## knn.pred Down Up
## Down 21 30
## Up 22 31
mean(knn.pred == Weekly.test$Direction)## [1] 0.5
Repeat (d) using naive Bayes.
#naive Bayes model
nb.fit <- naiveBayes(Direction ~ Lag2, data = Weekly.train)
nb.fit##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Down Up
## 0.4477157 0.5522843
##
## Conditional probabilities:
## Lag2
## Y [,1] [,2]
## Down -0.03568254 2.199504
## Up 0.26036581 2.317485
writeLines("\n\nConfusion Matrix and Overall Fraction of Correct Predictions:")##
##
## Confusion Matrix and Overall Fraction of Correct Predictions:
nb.class <- predict(nb.fit, Weekly.test)
table(nb.class, Weekly.test$Direction)##
## nb.class Down Up
## Down 0 0
## Up 43 61
mean(nb.class == Weekly.test$Direction)## [1] 0.5865385
Which of these methods appears to provide the best results on this data?
Logistic regression and linear discriminant analysis have had the highest prediction success rate.
Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
In this problem, you will develop a model to predict whether
a given car gets high or low gas mileage based on the Auto
data set.
data(Auto)Create a binary variable, mpg01, that contains a
1 if mpg contains a value above its median, and a 0 if
mpg contains a value below its median. You can compute the
median using the median() function. Note you may find it
helpful to use the data.frame() function to create a single
data set containing both mpg01 and the other
Auto variables.
mpg01 <- c(Auto$mpg > median(Auto$mpg))
Auto1 <- data.frame(Auto, mpg01)Explore the data graphically in order to investigate the
association between mpg01 and the other features. Which of
the other features seem most likely to be useful in predicting
mpg01? Scatterplots and boxplots may be useful tools to
answer this question. Describe your findings.
plot(Auto1)boxplot(Auto1$mpg01 ~ Auto1$cylinders)boxplot(Auto1$mpg01 ~ Auto1$displacement)boxplot(Auto1$mpg01 ~ Auto1$horsepower)boxplot(Auto1$mpg01 ~ Auto1$weight)boxplot(Auto1$mpg01 ~ Auto1$acceleration)boxplot(Auto1$mpg01 ~ Auto1$year)boxplot(Auto1$mpg01 ~ Auto1$origin)
From the scatterplot matrix, it looks like
displacement,
horsepower, weight, and
acceleration might be useful predictors for
mpg01, because the TRUE AND FALSE
values for mpg01 are bunched together in distinct regions
of the plot.
Split the data into a training set and a test set.
#define training and testing data sets by randomly sampling 80% of Auto1
set.seed(191)
Atrain = sample(nrow(Auto1), 314)
Autotrain <- Auto1[Atrain,]
Autotest <- Auto1[!(as.numeric(rownames(Auto1)) %in% Atrain),]Perform LDA on the training data in order to predict
mpg01 using the variables that seemed most associated with
mpg01 in (b). What is the test error of the model
obtained?
#linear discriminant analysis model
lda.Auto <- lda(mpg01 ~ displacement + horsepower + weight + acceleration, data = Autotrain)
lda.Auto## Call:
## lda(mpg01 ~ displacement + horsepower + weight + acceleration,
## data = Autotrain)
##
## Prior probabilities of groups:
## FALSE TRUE
## 0.5191083 0.4808917
##
## Group means:
## displacement horsepower weight acceleration
## FALSE 276.0307 132.0798 3637.454 14.53988
## TRUE 115.7185 78.5894 2328.781 16.49073
##
## Coefficients of linear discriminants:
## LD1
## displacement -0.0078245278
## horsepower 0.0036968317
## weight -0.0009948677
## acceleration -0.0094873543
#prepare confusion matrix
lda.pred <- predict(lda.Auto, Autotest)
names(lda.pred)## [1] "class" "posterior" "x"
writeLines("\n\nConfusion Matrix and Overall Fraction of Incorrect Predictions:")##
##
## Confusion Matrix and Overall Fraction of Incorrect Predictions:
table(lda.pred$class, Autotest$mpg01)##
## FALSE TRUE
## FALSE 29 2
## TRUE 4 48
1 - mean(lda.pred$class == Autotest$mpg01)## [1] 0.07228916
The test error rate is about 7%.
Perform QDA on the training data in order to predict
mpg01 using the variables that seemed most associated with
mpg01 in (b). What is the test error of the model
obtained?
#quadratic discriminant analysis model
qda.Auto <- qda(mpg01 ~ displacement + horsepower + weight + acceleration, data = Autotrain)
qda.Auto## Call:
## qda(mpg01 ~ displacement + horsepower + weight + acceleration,
## data = Autotrain)
##
## Prior probabilities of groups:
## FALSE TRUE
## 0.5191083 0.4808917
##
## Group means:
## displacement horsepower weight acceleration
## FALSE 276.0307 132.0798 3637.454 14.53988
## TRUE 115.7185 78.5894 2328.781 16.49073
#prepare confusion matrix
qda.pred <- predict(qda.Auto, Autotest)
names(qda.pred)## [1] "class" "posterior"
writeLines("\n\nConfusion Matrix and Overall Fraction of Incorrect Predictions:")##
##
## Confusion Matrix and Overall Fraction of Incorrect Predictions:
table(qda.pred$class, Autotest$mpg01)##
## FALSE TRUE
## FALSE 32 2
## TRUE 1 48
1 - mean(qda.pred$class == Autotest$mpg01)## [1] 0.03614458
The test error rate of this model is about 3.6%.
Perform logistic regression on the training data in order to
predict mpg01 using the variables that seemed most
associated with mpg01 in (b). What is the test error of the
model obtained?
#logistic regression model
glm.Auto <- glm(mpg01 ~ displacement + horsepower + weight + acceleration,
data = Autotrain, family = binomial)
glm.Auto##
## Call: glm(formula = mpg01 ~ displacement + horsepower + weight + acceleration,
## family = binomial, data = Autotrain)
##
## Coefficients:
## (Intercept) displacement horsepower weight acceleration
## 12.96911 -0.01260 -0.05740 -0.00129 -0.11074
##
## Degrees of Freedom: 313 Total (i.e. Null); 309 Residual
## Null Deviance: 434.8
## Residual Deviance: 172.2 AIC: 182.2
#prepare confusion matrix
glm.Aprobs = predict(glm.Auto, Autotest, type = 'response')
glm.Apred = rep("High", 83)
glm.Apred[glm.Aprobs > 0.5] = "Low"
#writeLines("Confusion Matrix and Overall Fraction of Correct Predictions:")
table(glm.Apred, Autotest$mpg01)##
## glm.Apred FALSE TRUE
## High 32 5
## Low 1 45
mean(glm.Apred == Autotest$mpg01)## [1] 0
(5/83)*100## [1] 6.024096
The error rate is about 6%.
Perform naive Bayes on the training data in order to predict
mpg01 using the variables that seemed most associated with
mpg01 in (b). What is the test error of the model
obtained?
#naive Bayes model
nb.Auto <- naiveBayes(mpg01 ~ displacement + horsepower + weight + acceleration,
data = Autotrain)
nb.Auto##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## FALSE TRUE
## 0.5191083 0.4808917
##
## Conditional probabilities:
## displacement
## Y [,1] [,2]
## FALSE 276.0307 92.00204
## TRUE 115.7185 41.02814
##
## horsepower
## Y [,1] [,2]
## FALSE 132.0798 39.10574
## TRUE 78.5894 15.60482
##
## weight
## Y [,1] [,2]
## FALSE 3637.454 699.0066
## TRUE 2328.781 410.3934
##
## acceleration
## Y [,1] [,2]
## FALSE 14.53988 2.809981
## TRUE 16.49073 2.524001
writeLines("\n\nConfusion Matrix and Overall Fraction of Inorrect Predictions:")##
##
## Confusion Matrix and Overall Fraction of Inorrect Predictions:
nb.Aclass <- predict(nb.Auto, Autotest)
table(nb.Aclass, Autotest$mpg01)##
## nb.Aclass FALSE TRUE
## FALSE 30 2
## TRUE 3 48
1 - mean(nb.Aclass == Autotest$mpg01)## [1] 0.06024096
The error rate is about 6%.
Perform KNN on the training data, with several values of
K, in order to predict mpg01. Use only the
variables that seemed most associated with mpg01 in (b).
What test errors do you obtain? Which value of K seems to
perform the best on this data set?
Using the Boston data set, fit classification
models in order to predict whether a given census tract has a crime rate
above or below the median. Explore logistic regression, LDA, naive
Bayes, and KNN models using various subsets of the predictors. Describe
your findings.
Hint: You will have to create the response variable yourself,
using the variables that are contained in the Boston data
set.
#prepare the data, create new variable
data(Boston)
high_crime <- c(Boston$crim > median(Boston$crim))
Boston$high_crime = high_crime #preliminary logistic regression
glm.Boston <- glm(high_crime ~ zn + indus + chas + nox + rm + age + dis +
rad + tax + ptratio + black + lstat + medv,
data = Boston, family = binomial)
summary(glm.Boston)##
## Call:
## glm(formula = high_crime ~ zn + indus + chas + nox + rm + age +
## dis + rad + tax + ptratio + black + lstat + medv, family = binomial,
## data = Boston)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3946 -0.1585 -0.0004 0.0023 3.4239
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -34.103704 6.530014 -5.223 1.76e-07 ***
## zn -0.079918 0.033731 -2.369 0.01782 *
## indus -0.059389 0.043722 -1.358 0.17436
## chas 0.785327 0.728930 1.077 0.28132
## nox 48.523782 7.396497 6.560 5.37e-11 ***
## rm -0.425596 0.701104 -0.607 0.54383
## age 0.022172 0.012221 1.814 0.06963 .
## dis 0.691400 0.218308 3.167 0.00154 **
## rad 0.656465 0.152452 4.306 1.66e-05 ***
## tax -0.006412 0.002689 -2.385 0.01709 *
## ptratio 0.368716 0.122136 3.019 0.00254 **
## black -0.013524 0.006536 -2.069 0.03853 *
## lstat 0.043862 0.048981 0.895 0.37052
## medv 0.167130 0.066940 2.497 0.01254 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 701.46 on 505 degrees of freedom
## Residual deviance: 211.93 on 492 degrees of freedom
## AIC: 239.93
##
## Number of Fisher Scoring iterations: 9
#logistic regression with high p-value predictors removed
glm.Boston <- glm(high_crime ~ zn + nox + age + dis + rad + tax + ptratio + black + medv,
data = Boston, family = binomial)
summary(glm.Boston)##
## Call:
## glm(formula = high_crime ~ zn + nox + age + dis + rad + tax +
## ptratio + black + medv, family = binomial, data = Boston)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4197 -0.1840 -0.0004 0.0022 3.4087
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -31.441272 6.048989 -5.198 2.02e-07 ***
## zn -0.082567 0.031424 -2.628 0.00860 **
## nox 43.195824 6.452812 6.694 2.17e-11 ***
## age 0.022851 0.009894 2.310 0.02091 *
## dis 0.634380 0.207634 3.055 0.00225 **
## rad 0.718773 0.142066 5.059 4.21e-07 ***
## tax -0.007676 0.002503 -3.066 0.00217 **
## ptratio 0.303502 0.109255 2.778 0.00547 **
## black -0.012866 0.006334 -2.031 0.04224 *
## medv 0.112882 0.034362 3.285 0.00102 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 701.46 on 505 degrees of freedom
## Residual deviance: 216.22 on 496 degrees of freedom
## AIC: 236.22
##
## Number of Fisher Scoring iterations: 9
#linear discriminant analysis
lda.Boston <- lda(high_crime ~ zn + nox + age + dis + rad + tax + ptratio + black + medv,
data = Boston)#naive Bayes model
nb.Boston <- naiveBayes(high_crime ~ zn + nox + age + dis + rad + tax + ptratio + black + medv,
data = Boston)
nb.Boston##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## FALSE TRUE
## 0.5 0.5
##
## Conditional probabilities:
## zn
## Y [,1] [,2]
## FALSE 21.525692 29.319808
## TRUE 1.201581 4.798611
##
## nox
## Y [,1] [,2]
## FALSE 0.4709711 0.05559789
## TRUE 0.6384190 0.09870365
##
## age
## Y [,1] [,2]
## FALSE 51.31028 25.88190
## TRUE 85.83953 17.87423
##
## dis
## Y [,1] [,2]
## FALSE 5.091596 2.081304
## TRUE 2.498489 1.085521
##
## rad
## Y [,1] [,2]
## FALSE 4.158103 1.659121
## TRUE 14.940711 9.529843
##
## tax
## Y [,1] [,2]
## FALSE 305.7431 87.4837
## TRUE 510.7312 167.8553
##
## ptratio
## Y [,1] [,2]
## FALSE 17.90711 1.811216
## TRUE 19.00395 2.346947
##
## black
## Y [,1] [,2]
## FALSE 388.7061 22.83774
## TRUE 324.6420 118.83084
##
## medv
## Y [,1] [,2]
## FALSE 24.94941 7.232047
## TRUE 20.11621 10.270362