This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data set is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
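We begin by loading the ISLR2 package and producing a numerical summary of the Weekly data:
library(ISLR2)
summary(Weekly)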
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
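The pairwise correlations between the quantitative variables are obtained by excluding Direction, the qualitative ninth column:
cor(Weekly[, -9])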
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
There seems to be little to no correlation between this week’s return (Today) and the returns of the previous weeks (Lag1 through Lag5). However, there is a substantial correlation between Year and Volume (0.84), meaning that the average number of shares traded weekly increased over the period (see plot).
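One way to draw the plot referred to above (an illustrative sketch):
plot(Weekly$Volume, xlab = "Week", ylab = "Volume")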
glm.fits <- glm(
  Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
  data = Weekly, family = binomial
)
summary(glm.fits)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
The smallest p-value, 0.0296, is associated with Lag2; at the 0.05 level, Lag2 is the only statistically significant predictor.
glm.probs <- predict(glm.fits, type = "response")
glm.probs[1:10]
## 1 2 3 4 5 6 7 8
## 0.6086249 0.6010314 0.5875699 0.4816416 0.6169013 0.5684190 0.5786097 0.5151972
## 9 10
## 0.5715200 0.5554287
contrasts(Weekly$Direction)
## Up
## Down 0
## Up 1
glm.pred <- rep("Down", 1089)
glm.pred[glm.probs > .5] = "Up"
table(glm.pred, Weekly$Direction)
##
## glm.pred Down Up
## Down 54 48
## Up 430 557
mean(glm.pred == Weekly$Direction)
## [1] 0.5610652
The logistic regression correctly predicted the movement of the market 56.1% of the time. However, this accuracy is computed on the training set, so the corresponding training error rate (100% - 56.1% = 43.9%) is likely too optimistic: training error tends to underestimate the test error rate.
train <- (Weekly$Year < 2009)
Weekly.2009 <- Weekly[!train, ]
dim(Weekly.2009)
## [1] 104 9
Direction.2009 <- Weekly$Direction[!train]
glm.fits <- glm(
  Direction ~ Lag2, data = Weekly, family = binomial, subset = train
)
glm.probs <- predict(glm.fits, Weekly.2009, type = "response")
glm.pred <- rep("Down", 104)
glm.pred[glm.probs > .5] <- "Up"
table(glm.pred, Direction.2009)
## Direction.2009
## glm.pred Down Up
## Down 9 5
## Up 34 56
mean(glm.pred == Direction.2009)
## [1] 0.625
mean(glm.pred != Direction.2009)
## [1] 0.375
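From the confusion matrix, the model is right 56/61 ≈ 91.8% of the time in weeks when the market actually rose, but only 9/43 ≈ 20.9% of the time when it fell.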
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
lda.fit <- lda(Direction ~ Lag1 + Lag2, data = Weekly, subset = train)
lda.fit
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag1 Lag2
## Down 0.289444444 -0.03568254
## Up -0.009213235 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag1 -0.3013148
## Lag2 0.2982579
plot(lda.fit)
lda.pred <- predict(lda.fit, Weekly.2009)
names(lda.pred)
## [1] "class" "posterior" "x"
lda.class <- lda.pred$class
table(lda.class, Direction.2009)
## Direction.2009
## lda.class Down Up
## Down 7 8
## Up 36 53
mean(lda.class == Direction.2009)
## [1] 0.5769231
sum(lda.class == Direction.2009)
## [1] 60
qda.fit <- qda(Direction ~ Lag1 + Lag2, data = Weekly, subset = train)
qda.fit
## Call:
## qda(Direction ~ Lag1 + Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag1 Lag2
## Down 0.289444444 -0.03568254
## Up -0.009213235 0.26036581
qda.class <- predict(qda.fit, Weekly.2009)$class
table(qda.class, Direction.2009)
## Direction.2009
## qda.class Down Up
## Down 7 10
## Up 36 51
mean(qda.class == Direction.2009)
## [1] 0.5576923
library(class)
train.X <- cbind(Weekly$Lag1, Weekly$Lag2)[train, ]
test.X <- cbind(Weekly$Lag1, Weekly$Lag2)[!train, ]
train.Direction <- Weekly$Direction[train]
set.seed(1)
knn.pred <- knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction.2009)
## Direction.2009
## knn.pred Down Up
## Down 18 29
## Up 25 32
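With K = 1, only (18 + 32)/104 ≈ 48.1% of the held-out observations are correctly classified.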
knn.pred <- knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction.2009)
## Direction.2009
## knn.pred Down Up
## Down 22 29
## Up 21 32
mean(knn.pred == Direction.2009)
## [1] 0.5192308
library(e1071)
nb.fit <- naiveBayes(Direction ~ Lag1 + Lag2, data = Weekly, subset = train)
nb.fit
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Down Up
## 0.4477157 0.5522843
##
## Conditional probabilities:
## Lag1
## Y [,1] [,2]
## Down 0.289444444 2.211721
## Up -0.009213235 2.308387
##
## Lag2
## Y [,1] [,2]
## Down -0.03568254 2.199504
## Up 0.26036581 2.317485
nb.class <- predict(nb.fit, Weekly.2009)
table(nb.class, Direction.2009)
## Direction.2009
## nb.class Down Up
## Down 3 8
## Up 40 53
mean(nb.class == Direction.2009)
## [1] 0.5384615
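Comparing the held-out accuracies on the 2009–2010 data, logistic regression with Lag2 performs best (62.5%), followed by LDA (57.7%), QDA (55.8%), naive Bayes (53.8%), and KNN with K = 3 (51.9%).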
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
attach(Auto)
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
mpg01 <- rep(0, length(mpg))
mpg01[mpg > median(mpg)] <- 1
Auto = data.frame(Auto, mpg01)
median(mpg01)
## [1] 0.5
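A median of 0.5 means the two classes are exactly balanced (196 cars each), which can also be checked directly:
table(mpg01)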
par(mfrow = c(2, 2))
plot(factor(Auto$mpg01), Auto$year, ylab = "Manufacture year", xlab = "Cars")
plot(factor(Auto$mpg01), Auto$weight, ylab = "Weight (lbs)", xlab = "Cars")
plot(factor(Auto$mpg01), Auto$horsepower, ylab = "Horsepower", xlab = "Cars")
plot(factor(Auto$mpg01), Auto$acceleration, ylab = "Time to reach 60 mph (seconds)", xlab = "Cars")
The boxplots show considerable overlap between the two classes in manufacture year and in the time to reach 60 mph, whereas weight and horsepower differ much more clearly between high- and low-mileage cars.
train <- (year %% 2 == 0)
train_auto <- Auto[train, ]
test_auto <- Auto[!train, ]
autolda_fit <- lda(mpg01 ~ displacement + horsepower + weight + year + cylinders + origin, data = train_auto)
autolda_pred <- predict(autolda_fit, Auto[!train, ])
table(autolda_pred$class, Auto[!train, "mpg01"])
##
## 0 1
## 0 171 14
## 1 24 182
1 - mean(autolda_pred$class == Auto[!train, "mpg01"])
## [1] 0.0971867
The test error of the model obtained is 9.72%.
autoqda_fit <- qda(mpg01 ~ cylinders + displacement + horsepower + weight + year + origin, data = train_auto)
autoqda_pred <- predict(autoqda_fit, Auto[!train, ])
table(autoqda_pred$class, Auto[!train, "mpg01"])
##
## 0 1
## 0 175 21
## 1 20 175
1 - mean(autoqda_pred$class == Auto[!train, "mpg01"])
## [1] 0.1048593
The test error of the model obtained is 10.49%.
autoglm_fit <- glm(mpg01 ~ cylinders + displacement + horsepower + weight + year + origin, data = Auto, family = "binomial")
summary(autoglm_fit)
##
## Call:
## glm(formula = mpg01 ~ cylinders + displacement + horsepower +
## weight + year + origin, family = "binomial", data = Auto)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4289 -0.1044 0.0081 0.2125 3.1750
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.680e+01 4.831e+00 -3.477 0.000506 ***
## cylinders -1.636e-01 4.235e-01 -0.386 0.699274
## displacement 1.963e-03 1.199e-02 0.164 0.869905
## horsepower -4.297e-02 1.664e-02 -2.581 0.009839 **
## weight -4.251e-03 9.889e-04 -4.299 1.72e-05 ***
## year 4.286e-01 7.485e-02 5.726 1.03e-08 ***
## origin 4.755e-01 3.615e-01 1.315 0.188440
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 543.43 on 391 degrees of freedom
## Residual deviance: 157.55 on 385 degrees of freedom
## AIC: 171.55
##
## Number of Fisher Scoring iterations: 8
autoglm_probs <- predict(autoglm_fit, Auto[!train, ], type = "response")
autoglm_pred <- rep(0, sum(!train))
autoglm_pred[autoglm_probs > 0.5] <- 1
table(autoglm_pred, Auto[!train, "mpg01"], dnn = c("Predicted", "Actual"))
## Actual
## Predicted 0 1
## 0 172 13
## 1 23 183
1 - mean(autoglm_pred == Auto[!train, "mpg01"])
## [1] 0.09207161
The test error rate is 9.21%.
autonb_fit <- naiveBayes(mpg01 ~ cylinders + displacement + horsepower + weight + year + origin, data = Auto, subset = train)
autonb_fit
autonb_pred <- predict(autonb_fit, Auto[!train, ])
table(autonb_pred, Auto[!train, "mpg01"])
1 - mean(autonb_pred == Auto[!train, "mpg01"])
autoknn_train <- cbind(cylinders, displacement, horsepower, weight)[train, ]
autoknn_test <- cbind(cylinders, displacement, horsepower, weight)[!train, ]
set.seed(1)
autoknn_1 <- knn(autoknn_train, autoknn_test, train_auto$mpg01, k = 1)
table(autoknn_1, test_auto$mpg01)
##
## autoknn_1 0 1
## 0 83 11
## 1 17 71
1 - mean(autoknn_1 == test_auto$mpg01)
## [1] 0.1538462
The test error rate is 15.38% with K=1.
autoknn_10 <- knn(autoknn_train, autoknn_test, train_auto$mpg01, k = 10)
table(autoknn_10, test_auto$mpg01)
##
## autoknn_10 0 1
## 0 77 7
## 1 23 75
1 - mean(autoknn_10 == test_auto$mpg01)
## [1] 0.1648352
The test error rate is 16.48% with K=10.
autoknn_5 <- knn(autoknn_train, autoknn_test, train_auto$mpg01, k = 5)
table(autoknn_5, test_auto$mpg01)
##
## autoknn_5 0 1
## 0 82 9
## 1 18 73
1 - mean(autoknn_5 == test_auto$mpg01)
## [1] 0.1483516
The test error rate is 14.84% with K=5.
autoknn_100 <- knn(autoknn_train, autoknn_test, train_auto$mpg01, k = 100)
table(autoknn_100, test_auto$mpg01)
##
## autoknn_100 0 1
## 0 81 7
## 1 19 75
1 - mean(autoknn_100 == test_auto$mpg01)
## [1] 0.1428571
The test error rate is 14.29% with K=100.
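Among the values tried, K = 100 gives the lowest KNN test error (14.29%), but all of the KNN fits trail LDA and logistic regression. One caveat: KNN is distance-based and these predictors sit on very different scales (weight is in the thousands while cylinders is single digits), so weight dominates the distance computation. Standardizing the predictors, as sketched below with illustrative `_std` names (output not shown), may improve these results.
# Standardize the test set using the training means and standard deviations
autoknn_train_std <- scale(autoknn_train)
autoknn_test_std <- scale(autoknn_test,
                          center = attr(autoknn_train_std, "scaled:center"),
                          scale = attr(autoknn_train_std, "scaled:scale"))
set.seed(1)
autoknn_std5 <- knn(autoknn_train_std, autoknn_test_std, train_auto$mpg01, k = 5)
1 - mean(autoknn_std5 == test_auto$mpg01)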
Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.
Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# Crime rate is the variable "crim" in the data set
attach(Boston)
Boston_crime <- rep(0, length(crim))
Boston_crime[crim > median(crim)] <- 1
Boston = data.frame(Boston, Boston_crime)
train = 1:(dim(Boston)[1]/2)
test = (dim(Boston)[1]/2 + 1):dim(Boston)[1]
Boston_train = Boston[train, ]
Boston_test = Boston[test, ]
Boston_crime_test = Boston_crime[test]
pairs(Boston)
The pairs plot indicates that indus (proportion of non-retail business acres per town), nox (nitrogen oxides concentration in parts per 10 million), age (proportion of owner-occupied units built prior to 1940), dis (weighted mean of distances to five Boston employment centres), rad (index of accessibility to radial highways), and tax (full-value property-tax rate per $10,000) could be correlated with the crime rate.
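As a quick numerical check (output omitted), each candidate predictor can be correlated with the binary response:
cor(Boston[, c("indus", "nox", "age", "dis", "rad", "tax")], Boston$Boston_crime)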
To assess the significance of these predictors, we fit a logistic regression model:
set.seed(1)
Boston_fit <- glm(Boston_crime ~ indus + nox + age + dis + rad + tax, data = Boston_train, family = binomial)
Boston_probs = predict(Boston_fit, Boston_test, type = "response")
Boston_pred = rep(0, length(Boston_probs))
Boston_pred[Boston_probs > 0.5] = 1
table(Boston_pred, Boston_crime_test)
## Boston_crime_test
## Boston_pred 0 1
## 0 75 8
## 1 15 155
mean(Boston_pred != Boston_crime_test)
## [1] 0.09090909
summary(Boston_fit)
##
## Call:
## glm(formula = Boston_crime ~ indus + nox + age + dis + rad +
## tax, family = binomial, data = Boston_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.97810 -0.21406 -0.03454 0.47107 3.04502
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -42.214032 7.617440 -5.542 2.99e-08 ***
## indus -0.213126 0.073236 -2.910 0.00361 **
## nox 80.868029 16.066473 5.033 4.82e-07 ***
## age 0.003397 0.012032 0.282 0.77772
## dis 0.307145 0.190502 1.612 0.10690
## rad 0.847236 0.183767 4.610 4.02e-06 ***
## tax -0.013760 0.004956 -2.777 0.00549 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 329.37 on 252 degrees of freedom
## Residual deviance: 144.44 on 246 degrees of freedom
## AIC: 158.44
##
## Number of Fisher Scoring iterations: 8
Next, we fit an LDA model:
Bostonlda_fit <- lda(Boston_crime ~ indus + nox + age + dis + rad + tax, data = Boston_train)
Bostonlda_pred <- predict(Bostonlda_fit, Boston_test)
table(Bostonlda_pred$class, Boston_crime_test)
## Boston_crime_test
## 0 1
## 0 81 18
## 1 9 145
mean(Bostonlda_pred$class != Boston_crime_test)
## [1] 0.1067194
Next, we fit a naive Bayes model:
Boston_nb_fit <- naiveBayes(Boston_crime ~ indus + nox + age + dis + rad + tax, data = Boston)
Boston_nb_fit
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.5 0.5
##
## Conditional probabilities:
## indus
## Y [,1] [,2]
## 0 7.002292 5.514454
## 1 15.271265 5.439010
##
## nox
## Y [,1] [,2]
## 0 0.4709711 0.05559789
## 1 0.6384190 0.09870365
##
## age
## Y [,1] [,2]
## 0 51.31028 25.88190
## 1 85.83953 17.87423
##
## dis
## Y [,1] [,2]
## 0 5.091596 2.081304
## 1 2.498489 1.085521
##
## rad
## Y [,1] [,2]
## 0 4.158103 1.659121
## 1 14.940711 9.529843
##
## tax
## Y [,1] [,2]
## 0 305.7431 87.4837
## 1 510.7312 167.8553
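Note that this naive Bayes model is fit on the full data set, so it is descriptive only. To obtain a test error comparable to the other methods, one would refit it on the training half and predict the held-out half, roughly as follows (Boston_nb_train_fit is an illustrative name; output not shown):
Boston_nb_train_fit <- naiveBayes(Boston_crime ~ indus + nox + age + dis + rad + tax, data = Boston_train)
Boston_nb_pred <- predict(Boston_nb_train_fit, Boston_test)
table(Boston_nb_pred, Boston_crime_test)
mean(Boston_nb_pred != Boston_crime_test)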
Finally, we fit KNN models. First, K = 1:
Bostonknn_train <- cbind(indus, nox, age, dis, rad, tax)[train, ]
Bostonknn_test <- cbind(indus, nox, age, dis, rad, tax)[test, ]
# knn() takes the training-set labels as its third argument
Boston_crime_train <- Boston_crime[train]
Bostonknn_pred <- knn(Bostonknn_train, Bostonknn_test, Boston_crime_train, k = 1)
table(Bostonknn_pred, Boston_crime_test)
## Boston_crime_test
## Bostonknn_pred 0 1
## 0 31 155
## 1 59 8
mean(Bostonknn_pred != Boston_crime_test)
## [1] 0.8458498
Next, K = 5:
Bostonknn_pred5 <- knn(Bostonknn_train, Bostonknn_test, Boston_crime_train, k = 5)
table(Bostonknn_pred5, Boston_crime_test)
## Boston_crime_test
## Bostonknn_pred5 0 1
## 0 42 20
## 1 48 143
mean(Bostonknn_pred5 != Boston_crime_test)
## [1] 0.2687747
Next, K = 100:
Bostonknn_pred100 <- knn(Bostonknn_train, Bostonknn_test, Boston_crime_train, k = 100)
table(Bostonknn_pred100, Boston_crime_test)
## Boston_crime_test
## Bostonknn_pred100 0 1
## 0 21 7
## 1 69 156
mean(Bostonknn_pred100 != Boston_crime_test)
## [1] 0.3003953
In summary, the logistic regression model had the lowest test error rate (9.09%), followed by LDA (10.67%); the KNN models, as run here, performed considerably worse. In the logistic regression, the statistically significant predictors were indus, nox, rad, and tax.