This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
## Rows: 1,089
## Columns: 9
## $ Year <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990,…
## $ Lag1 <dbl> 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.807, …
## $ Lag2 <dbl> 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, …
## $ Lag3 <dbl> -3.936, 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, …
## $ Lag4 <dbl> -0.229, -3.936, 1.572, 0.816, -0.270, -2.576, 3.514, 0.712,…
## $ Lag5 <dbl> -3.484, -0.229, -3.936, 1.572, 0.816, -0.270, -2.576, 3.514…
## $ Volume <dbl> 0.1549760, 0.1485740, 0.1598375, 0.1616300, 0.1537280, 0.15…
## $ Today <dbl> -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.807, 0.041, …
## $ Direction <fct> Down, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, U…
Weekly is a data frame with 1,089 observations on the following 9 variables.
Year: The year that the observation was recorded
Lag1: Percentage return for previous week
Lag2: Percentage return for 2 weeks previous
Lag3: Percentage return for 3 weeks previous
Lag4: Percentage return for 4 weeks previous
Lag5: Percentage return for 5 weeks previous
Volume: Volume of shares traded (average number of daily shares traded in billions)
Today: Percentage return for this week
Direction: A factor with levels Down and Up indicating whether the market had a positive or negative return on a given week
Produce some numerical and graphical summaries of the Weekly data.
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
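The code that produced the data overview and correlation matrix above is not shown in this extract; a minimal sketch that would reproduce them is below, assuming the packages used throughout this section (ISLR, dplyr, MASS, class, ggplot2, ggthemes) are installed and possibly already loaded earlier in the document.
library(ISLR)      # Weekly, Auto data sets
library(dplyr)     # glimpse(), %>%, group_by(), summarize()
library(MASS)      # lda(), qda(), Boston data set
library(class)     # knn()
library(ggplot2)   # ggplot()
library(ggthemes)  # theme_economist()
glimpse(Weekly)    # structure shown above
cor(Weekly[, -9])  # correlation matrix above; column 9 (Direction) is a factor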
Looking at the correlation matrix for the Weekly data, the only pair of variables with a substantial correlation is Year and Volume (roughly 0.84); the lag variables and Today are essentially uncorrelated with each other. This positive relationship is shown in the chart below, where Volume increases as Year increases.
#Exercise 10-a Year & Volume
Weekly$Week = 1:nrow(Weekly)
year_breaks = Weekly %>%
group_by(Year) %>%
summarize(Week = min(Week))
year_volume_plot = ggplot(data = Weekly, aes(x = Week, y = Volume)) +
geom_line() +
geom_smooth() +
scale_x_continuous(breaks = year_breaks$Week, minor_breaks = NULL, labels = year_breaks$Year) +
theme_economist() +
labs(title = "Average Number of Daily Shares Traded Over Time",
x = "Year")
year_volume_plot
Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors.
#Exercise 10-b
logreg_1 = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume , data = Weekly , family = "binomial")
summary(logreg_1)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = "binomial", data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Lag2 is the only statistically significant predictor, as it is the only one with a p-value (0.0296) below the 0.05 significance level.
#Exercise 10-c
prob_1 = predict(logreg_1, Weekly, type ="response")
preds_1 = rep("Down", 1089) #1089 rows in the data
preds_1[prob_1 > 0.5] = "Up"
table(preds_1, Weekly$Direction)
##
## preds_1 Down Up
## Down 54 48
## Up 430 557
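From the confusion matrix, the overall fraction of correct predictions is (54 + 557) / 1089 ≈ 56.1%, only slightly better than always predicting Up (48 + 557 = 605 Up weeks, about 55.6%). A quick check, as a sketch:
mean(preds_1 == Weekly$Direction)   # overall accuracy, about 0.561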
Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
#Exercise 10-d
training_data = (Weekly$Year <= 2008)
Weekly.20092010 = Weekly[!training_data, ]
Direction.20092010 = Weekly$Direction[!training_data]
logreg_2 = glm(Direction ~ Lag2,
data = Weekly,
family = "binomial",
subset = training_data)
summary(logreg_2)
##
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = Weekly,
## subset = training_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.536 -1.264 1.021 1.091 1.368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
#Exercise 10-d Matrix
probs2 = predict(logreg_2, Weekly.20092010, type = "response")
preds2 = rep("Down", length(probs2))
preds2[probs2 > 0.5] = "Up"
table(preds2, Direction.20092010)
## Direction.20092010
## preds2 Down Up
## Down 9 5
## Up 34 56
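The held-out accuracy is (9 + 56) / 104 ≈ 62.5%, i.e. a test error of about 37.5%. A quick check (sketch):
mean(preds2 == Direction.20092010)   # about 0.625 on the 2009-2010 weeks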
#Exercise 10-e
lda.fit = lda(Direction ~ Lag2, data = Weekly, subset = training_data)
preds3 = predict(lda.fit, Weekly.20092010)
table(preds3$class, Direction.20092010)
## Direction.20092010
## Down Up
## Down 9 5
## Up 34 56
#Exercise 10-f
qda.fit = qda(Direction ~ Lag2, data = Weekly, subset = training_data)
preds4 = predict(qda.fit, Weekly.20092010)
table(preds4$class, Direction.20092010)
## Direction.20092010
## Down Up
## Down 0 0
## Up 43 61
#Exercise 10-g
train.X = as.matrix(Weekly$Lag2[training_data])
test.X = as.matrix(Weekly$Lag2[!training_data])
train.Direction = Weekly$Direction[training_data]
set.seed(1)
pred.knn = knn(train.X, test.X, train.Direction, k = 1)
table(pred.knn, Direction.20092010)
## Direction.20092010
## pred.knn Down Up
## Down 21 30
## Up 22 31
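For comparison across parts (d) through (g), the held-out accuracies can be computed directly from the fitted objects above (a sketch; the values follow from the confusion matrices):
mean(preds2 == Direction.20092010)         # logistic regression (Lag2): (9+56)/104  ~ 0.625
mean(preds3$class == Direction.20092010)   # LDA (Lag2):                 (9+56)/104  ~ 0.625
mean(preds4$class == Direction.20092010)   # QDA (Lag2):                 (0+61)/104  ~ 0.587
mean(pred.knn == Direction.20092010)       # KNN, k = 1:                 (21+31)/104 ~ 0.500
Logistic regression and LDA with Lag2 alone give the highest held-out accuracy here.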
#Exercise 10-i Establish Models
#Logistic Regression Model
logreg_3 = glm(Direction ~ Lag1:Lag2,
data = Weekly,
family = "binomial",
subset = training_data)
#LDA Model
lda.fit_2 = lda(Direction ~ Lag1:Lag2, data = Weekly, subset = training_data)
#QDA Model
qda.fit_2 = qda(Direction ~ Lag1:Lag2, data = Weekly, subset = training_data)
#KNN Model
set.seed(1)
pred.knn_2 = knn(train.X, test.X, train.Direction, k = 12)
# 10-i Logistic Regression Model Results (with interaction Lag1:Lag2)
probs2.1 = predict(logreg_3, Weekly.20092010, type = "response")
preds2.1 = rep("Down", length(probs2.1))
preds2.1[probs2.1 > 0.5] = "Up"
table(preds2.1, Direction.20092010)
## Direction.20092010
## preds2.1 Down Up
## Down 1 1
## Up 42 60
# 10-i LDA Model Results (with interaction Lag1:Lag2)
preds3.1 = predict(lda.fit_2, Weekly.20092010)
table(preds3.1$class, Direction.20092010)
## Direction.20092010
## Down Up
## Down 0 1
## Up 43 60
# 10-i QDA Model Results (with interaction Lag1:Lag2)
preds4.1 = predict(qda.fit_2, Weekly.20092010)
table(preds4.1$class, Direction.20092010)
## Direction.20092010
## Down Up
## Down 16 32
## Up 27 29
# 10-i KNN Model Results (k = 12)
table(pred.knn_2, Direction.20092010)
## Direction.20092010
## pred.knn_2 Down Up
## Down 18 21
## Up 25 40
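The same comparison for the part (i) fits (a sketch; values follow from the confusion matrices above) shows that none of them improves on the Lag2-only logistic regression and LDA models:
mean(preds2.1 == Direction.20092010)         # logistic, Lag1:Lag2: (1+60)/104  ~ 0.587
mean(preds3.1$class == Direction.20092010)   # LDA, Lag1:Lag2:      (0+60)/104  ~ 0.577
mean(preds4.1$class == Direction.20092010)   # QDA, Lag1:Lag2:      (16+29)/104 ~ 0.433
mean(pred.knn_2 == Direction.20092010)       # KNN, k = 12:         (18+40)/104 ~ 0.558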
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set. Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
#Exercise 11-a
mpg01 = rep(0, length(Auto$mpg))
mpg01[Auto$mpg > median(Auto$mpg)] = 1
Auto = data.frame(Auto, mpg01)
table(Auto$mpg01)
##
## 0 1
## 196 196
Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
## Rows: 392
## Columns: 8
## $ cylinders <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4,…
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 3…
## $ horsepower <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 1…
## $ weight <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 38…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5,…
## $ year <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, …
## $ origin <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3,…
## $ mpg01 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,…
## cylinders displacement horsepower weight acceleration
## cylinders 1.0000000 0.9508233 0.8429834 0.8975273 -0.5046834
## displacement 0.9508233 1.0000000 0.8972570 0.9329944 -0.5438005
## horsepower 0.8429834 0.8972570 1.0000000 0.8645377 -0.6891955
## weight 0.8975273 0.9329944 0.8645377 1.0000000 -0.4168392
## acceleration -0.5046834 -0.5438005 -0.6891955 -0.4168392 1.0000000
## year -0.3456474 -0.3698552 -0.4163615 -0.3091199 0.2903161
## origin -0.5689316 -0.6145351 -0.4551715 -0.5850054 0.2127458
## mpg01 -0.7591939 -0.7534766 -0.6670526 -0.7577566 0.3468215
## year origin mpg01
## cylinders -0.3456474 -0.5689316 -0.7591939
## displacement -0.3698552 -0.6145351 -0.7534766
## horsepower -0.4163615 -0.4551715 -0.6670526
## weight -0.3091199 -0.5850054 -0.7577566
## acceleration 0.2903161 0.2127458 0.3468215
## year 1.0000000 0.1815277 0.4299042
## origin 0.1815277 1.0000000 0.5136984
## mpg01 0.4299042 0.5136984 1.0000000
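The correlation matrix above excludes mpg itself and the name column (a factor, which cannot be passed to cor()); a sketch of code that would reproduce it:
cor(subset(Auto, select = -c(mpg, name)))   # numeric features plus mpg01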
#Exercise 11-b Charts
par(mfrow=c(2,2))
boxplot(cylinders ~ mpg01, data = Auto, main = "Cylinders vs mpg01")
boxplot(displacement ~ mpg01, data = Auto, main = "Displacement vs mpg01")
boxplot(horsepower ~ mpg01, data = Auto, main = "Horsepower vs mpg01")
boxplot(weight ~ mpg01, data = Auto, main = "Weight vs mpg01")
The features most strongly associated with mpg01 based on the above analysis (correlation with mpg01 shown):
cylinders: -0.7591939
displacement: -0.7534766
horsepower: -0.6670526
weight: -0.7577566
Split the data into a training set and a test set.
#Exercise 11-c
train = (Auto$year %% 2 == 0)
Auto.train = Auto[train, ]
Auto.test = Auto[!train, ]
mpg01.test = Auto$mpg01[!train]
Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
#Exercise 11-d
#LDA Model
Auto.lda = lda(mpg01 ~ cylinders + displacement + horsepower + weight, data = Auto, subset = train)
pred.lda = predict(Auto.lda, Auto.test)
table(pred.lda$class, mpg01.test)
## mpg01.test
## 0 1
## 0 86 9
## 1 14 73
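From the confusion matrix, the LDA test error is (14 + 9) / 182 ≈ 12.6%. A quick check (sketch):
mean(pred.lda$class != mpg01.test)   # about 0.126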
Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
#Exercise 11-e
#QDA Model
Auto.qda = qda(mpg01 ~ cylinders + displacement + horsepower + weight, data = Auto, subset = train)
pred.qda = predict(Auto.qda, Auto.test)
table(pred.qda$class, mpg01.test)
## mpg01.test
## 0 1
## 0 89 13
## 1 11 69
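The QDA test error is (13 + 11) / 182 ≈ 13.2%, slightly worse than LDA here. A quick check (sketch):
mean(pred.qda$class != mpg01.test)   # about 0.132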
Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
#Exercise 11-f
#Logistic Regression Model
Auto.log = glm(mpg01 ~ cylinders + displacement + horsepower + weight,
data = Auto,
family = "binomial",
subset = train)
Auto.probs = predict(Auto.log, Auto.test, type = "response")
pred.auto = rep(0, length(Auto.probs))
pred.auto[Auto.probs > 0.5] <- 1
table(pred.auto, mpg01.test)
## mpg01.test
## pred.auto 0 1
## 0 89 11
## 1 11 71
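The logistic regression test error is (11 + 11) / 182 ≈ 12.1%, the lowest of the three models so far. A quick check (sketch):
mean(pred.auto != mpg01.test)   # about 0.121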
Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
#Exercise 11-g
attach(Auto)
#KNN Model
Autotrain.X = cbind(cylinders, weight, displacement, horsepower)[train, ]
Autotest.X = cbind(cylinders, weight, displacement, horsepower)[!train, ]
train.mpg01 = mpg01[train]
set.seed(1)
Autopred.knn_1 = knn(Autotrain.X, Autotest.X, train.mpg01, k = 1)
table(Autopred.knn_1, mpg01.test)
## mpg01.test
## Autopred.knn_1 0 1
## 0 83 11
## 1 17 71
Autopred.knn_50 = knn(Autotrain.X, Autotest.X, train.mpg01, k = 50)
table(Autopred.knn_50, mpg01.test)
## mpg01.test
## Autopred.knn_50 0 1
## 0 80 7
## 1 20 75
Autopred.knn_100 = knn(Autotrain.X, Autotest.X, train.mpg01, k = 100)
table(Autopred.knn_100, mpg01.test)
## mpg01.test
## Autopred.knn_100 0 1
## 0 81 7
## 1 19 75
Autopred.knn_150 = knn(Autotrain.X, Autotest.X, train.mpg01, k = 150)
table(Autopred.knn_150, mpg01.test)
## mpg01.test
## Autopred.knn_150 0 1
## 0 72 6
## 1 28 76
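The KNN test errors for each value of K follow from the confusion matrices above (a sketch):
mean(Autopred.knn_1 != mpg01.test)     # k = 1:   (17+11)/182 ~ 0.154
mean(Autopred.knn_50 != mpg01.test)    # k = 50:  (20+7)/182  ~ 0.148
mean(Autopred.knn_100 != mpg01.test)   # k = 100: (19+7)/182  ~ 0.143
mean(Autopred.knn_150 != mpg01.test)   # k = 150: (28+6)/182  ~ 0.187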
Based on the test errors above, the KNN model with k = 100 PERFORMS THE BEST among the values of K tried.
Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.
#Exercise 13
attach(Boston)
crim01 = rep(0, length(crim))
crim01[crim > median(crim)] = 1
Boston = data.frame(Boston, crim01)
table(crim01)
## crim01
## 0 1
## 253 253
#Exercise 13 Split
train = 1:(length(crim) / 2)
test = (length(crim) / 2 + 1):length(crim)
Boston.train = Boston[train, ]
Boston.test = Boston[test, ]
crim01.test = crim01[test]
#Exercise 13 Logistic Regression Model
Boston.glm_1 = glm(crim01 ~ . - crim01 - crim,
data = Boston,
family = "binomial",
subset = train)
Boston.glm_2 = glm(crim01 ~ . - crim01 - crim - chas - nox,
data = Boston,
family = 'binomial',
subset = train)
Boston.probs_1 = predict(Boston.glm_1, Boston.test, type = "response")
Boston.pred_1 = rep(0, length(Boston.probs_1))
Boston.pred_1[Boston.probs_1 > 0.5] = 1
table(Boston.pred_1, crim01.test)
## crim01.test
## Boston.pred_1 0 1
## 0 68 24
## 1 22 139
Boston.probs_2 = predict(Boston.glm_2, Boston.test, type = "response")
Boston.pred_2 = rep(0, length(Boston.probs_2))
Boston.pred_2[Boston.probs_2 > 0.5] = 1
table(Boston.pred_2, crim01.test)
## crim01.test
## Boston.pred_2 0 1
## 0 78 28
## 1 12 135
#Exercise 13 LDA Model
Boston.lda_1 = lda(crim01 ~ . - crim01 - crim, data = Boston, subset = train)
pred.lda_1 = predict(Boston.lda_1, Boston.test)
table(pred.lda_1$class, crim01.test)
## crim01.test
## 0 1
## 0 80 24
## 1 10 139
Boston.lda_2 = lda(crim01 ~ . - crim01 - crim - chas - nox - tax, data = Boston, subset = train)
pred.lda_2 = predict(Boston.lda_2, Boston.test)
table(pred.lda_2$class, crim01.test)
## crim01.test
## 0 1
## 0 83 28
## 1 7 135
#Exercise 13 KNN Model
Boston.train.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[train, ]
Boston.test.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[test, ]
train.crim01 = crim01[train]
set.seed(1)
Boston.pred.knn_1 = knn(Boston.train.X, Boston.test.X, train.crim01, k = 1)
table(Boston.pred.knn_1, crim01.test)
## crim01.test
## Boston.pred.knn_1 0 1
## 0 85 111
## 1 5 52
Boston.pred.knn_50 = knn(Boston.train.X, Boston.test.X, train.crim01, k = 50)
table(Boston.pred.knn_50, crim01.test)
## crim01.test
## Boston.pred.knn_50 0 1
## 0 80 23
## 1 10 140
Boston.pred.knn_150 = knn(Boston.train.X, Boston.test.X, train.crim01, k = 150)
table(Boston.pred.knn_150, crim01.test)
## crim01.test
## Boston.pred.knn_150 0 1
## 0 84 31
## 1 6 132
Boston.glm_1 Test Error = 1 - (68+139) / 253 OR 18.18%
Boston.glm_2 Test Error = 1 - (78+135) / 253 OR 15.81%
Boston.lda_1 Test Error = 1 - (80+139) / 253 OR 13.44%
Boston.lda_2 Test Error = 1 - (83+135) / 253 OR 13.83%
Boston.pred.knn_1 Test Error = 1 - (85+52) / 253 OR 45.85%
Boston.pred.knn_50 Test Error = 1 - (80+140) / 253 OR 13.04% PERFORMS THE BEST
Boston.pred.knn_150 Test Error = 1 - (84+132) / 253 OR 14.62%