This exercise uses the Weekly data set, which is part of the ISLR package. The data is similar in nature to the Smarket data from this chapter's lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
week = Weekly
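For reference, the code in this section leans on several packages; a sketch of the assumed setup (they are presumably attached earlier in the document):
library(ISLR)     # Weekly and Auto data sets
library(MASS)     # lda(), qda(), and the Boston data
library(class)    # knn()
library(dplyr)    # group_by(), summarize()
library(ggplot2)  # plots
library(corrplot) # correlation matrices
library(caTools)  # sample.split()
# caret and scales are called with :: below and only need to be installed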
summary(week)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
The variables are:
Year: The year that the observation was recorded
Lag1: Percentage return for previous week
Lag2: Percentage return for 2 weeks previous
Lag3: Percentage return for 3 weeks previous
Lag4: Percentage return for 4 weeks previous
Lag5: Percentage return for 5 weeks previous
Volume: Volume of shares traded (average number of daily shares traded in billions)
Today: Percentage return for this week
Direction: A factor with levels Down and Up indicating whether the market had a positive or negative return on a given week
Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
plot(week)
corrplot(cor(week[,-9]), method = "number")
I want to plot a more granular breakdown of Volume over time, so I will need to create a Week variable within this dataset.
week$Week = 1:nrow(week)
yrs = week %>%
group_by(Year) %>%
summarize(Week = min(Week)) %>%
ungroup()
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(week, aes(Week, Volume)) +
geom_line() +
geom_smooth(se = FALSE) +
scale_x_continuous(breaks = yrs$Week ,minor_breaks = NULL ,labels = yrs$Year) +
labs(title = "Average Weekly Shares Traded vs Time",
x = "Year") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
There is a clear relationship between Year and Volume: the volume of shares traded increases steadily through the 1990s and 2000s and begins to decline around 2009-2010.
Let's see a breakdown of the variable Direction over time.
ggplot(week, aes(Year, fill = Direction)) +
geom_bar(position = "fill") +
geom_hline(yintercept = 0.5, col = "white") +
scale_x_continuous(breaks = seq(1990, 2010)) +
scale_y_continuous(labels = scales::percent_format()) +
labs(title = "Percentage of Direction vs Time",
x = "Year") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90))
The plot shows only 4 years in which fewer than 50% of the weeks saw a positive return. As the table below shows, we could get a classifier that does better than random chance simply by predicting the return will be positive each week. Let's also investigate the volatility of the market over time by looking at its weekly percentage change.
prop.table(table(week$Direction))
##
## Down Up
## 0.4444444 0.5555556
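As a sanity check, here is a sketch of the always-Up baseline this table implies:
# Accuracy of a naive classifier that predicts "Up" every week
naive.preds = rep("Up", nrow(week))
mean(naive.preds == week$Direction) # ~0.5556, matching the proportion above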
ggplot(week, aes(Week, Today/100)) +
geom_line() +
scale_x_continuous(breaks = yrs$Week, minor_breaks = NULL, labels = yrs$Year) +
scale_y_continuous(labels = scales::percent_format(), breaks = seq(-0.2, 0.2, 0.05)) +
geom_hline(yintercept = 0, col = "blue") +
theme_light() +
labs(title = "Weekly Percentage Return vs Time",
x = "Year",
y = "Percentage Return") +
theme(axis.text.x = element_text(angle = 90))
Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
# Note: `~ . - Today` also includes Year and the derived Week column, not just the five lags and Volume
glm.fit1 = glm(Direction~.-Today, week, family = binomial)
summary(glm.fit1)
##
## Call:
## glm(formula = Direction ~ . - Today, family = binomial, data = week)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7631 -1.2574 0.9678 1.0826 1.4948
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 708.898033 425.223143 1.667 0.0955 .
## Year -0.356148 0.213721 -1.666 0.0956 .
## Lag1 -0.041577 0.026488 -1.570 0.1165
## Lag2 0.059148 0.026986 2.192 0.0284 *
## Lag3 -0.015509 0.026774 -0.579 0.5624
## Lag4 -0.027782 0.026543 -1.047 0.2952
## Lag5 -0.014453 0.026472 -0.546 0.5851
## Volume -0.001679 0.068948 -0.024 0.9806
## Week 0.006698 0.004101 1.633 0.1024
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1483.5 on 1080 degrees of freedom
## AIC: 1501.5
##
## Number of Fisher Scoring iterations: 4
After running the logistic regression on the full data set, only Lag2, the percentage return from two weeks prior, appears statistically significant for the Direction of the weekly returns. It has the smallest p-value (0.028), and its positive coefficient suggests that a higher return two weeks ago raises the odds of an Up week.
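Since Lag2 is the only significant predictor, a quick sketch of its effect size as an odds ratio (using glm.fit1 fit above):
# Convert the Lag2 log-odds coefficient to an odds ratio
exp(coef(glm.fit1)["Lag2"]) # ~1.06: a one-point higher return two weeks ago multiplies the odds of "Up" by about 1.06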
Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
glm.probs = predict(glm.fit1, type = "response")
glm.preds = ifelse(glm.probs>=0.5, "Up", "Down")
caret::confusionMatrix(as.factor(glm.preds), week$Direction)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 73 68
## Up 411 537
##
## Accuracy : 0.5601
## 95% CI : (0.5301, 0.5899)
## No Information Rate : 0.5556
## P-Value [Acc > NIR] : 0.3923
##
## Kappa : 0.0414
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.15083
## Specificity : 0.88760
## Pos Pred Value : 0.51773
## Neg Pred Value : 0.56646
## Prevalence : 0.44444
## Detection Rate : 0.06703
## Detection Prevalence : 0.12948
## Balanced Accuracy : 0.51921
##
## 'Positive' Class : Down
##
mean(glm.preds==week$Direction) # training accuracy
## [1] 0.5601469
mean(glm.preds!=week$Direction) # training error rate (no held-out data yet)
## [1] 0.4398531
We achieve an accuracy of 56.01%, only slightly above the baseline accuracy we calculated earlier (55.56%).
The confusion matrix shows that the model almost always predicts Up: it catches only 73 of the 484 Down weeks (a sensitivity of 15.1% with Down as the positive class) while getting 537 of the 605 Up weeks right. Its mistakes are therefore concentrated in false Up predictions, as the prediction breakdown below confirms.
prop.table(table(glm.preds))
## glm.preds
## Down Up
## 0.1294766 0.8705234
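Note that caret treats the first factor level, Down, as the positive class by default; to read sensitivity and specificity from the Up perspective instead, we can pass the positive argument:
# Same confusion matrix, with "Up" as the positive class
caret::confusionMatrix(as.factor(glm.preds), week$Direction, positive = "Up")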
Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
train = week$Year<=2008
x_test = week[!train,]
x_train = week[train,]
y_test = week$Direction[!train]
glm.fit2=glm(Direction~Lag2, x_train, family = binomial)
glm.probs=predict(glm.fit2, x_test, type="response")
glm.preds = ifelse(glm.probs >= 0.5, "Up", "Down")
caret::confusionMatrix(as.factor(glm.preds), y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.20930
## Specificity : 0.91803
## Pos Pred Value : 0.64286
## Neg Pred Value : 0.62222
## Prevalence : 0.41346
## Detection Rate : 0.08654
## Detection Prevalence : 0.13462
## Balanced Accuracy : 0.56367
##
## 'Positive' Class : Down
##
mean(glm.preds==y_test) # test accuracy
## [1] 0.625
mean(glm.preds!=y_test) # test error rate (what matters most: performance on new data)
## [1] 0.375
We achieve an accuracy of 62.50%.
With the data split by year and only one predictor, the model performs better on new data: the test error rate is 0.375 (37.5%), so the model correctly predicted the held-out data 62.5% of the time.
Repeat (d) using LDA
lda.fit = lda(Direction~Lag2, x_train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = x_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
lda.class = predict(lda.fit, x_test)$class
caret::confusionMatrix(lda.class, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 9 5
## Up 34 56
##
## Accuracy : 0.625
## 95% CI : (0.5247, 0.718)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.2439
##
## Kappa : 0.1414
##
## Mcnemar's Test P-Value : 7.34e-06
##
## Sensitivity : 0.20930
## Specificity : 0.91803
## Pos Pred Value : 0.64286
## Neg Pred Value : 0.62222
## Prevalence : 0.41346
## Detection Rate : 0.08654
## Detection Prevalence : 0.13462
## Balanced Accuracy : 0.56367
##
## 'Positive' Class : Down
##
mean(lda.class==y_test) # test accuracy
## [1] 0.625
mean(lda.class!=y_test) # test error rate (what matters most: performance on new data)
## [1] 0.375
We achieve an accuracy of 62.50%.
In this LDA model, 44.8% of the training observations are weeks the market went down and 55.2% are weeks it went up. The group means show that the average Lag2 return is slightly negative for Down weeks and positive for Up weeks. The confusion matrix matches the results from (d) exactly: 62.5% test accuracy, or a 37.5% test error rate.
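For reference, the class labels above come from thresholding LDA's posterior probabilities at 0.5; a quick look at the first few test weeks:
# Posterior probabilities behind the predicted classes
head(predict(lda.fit, x_test)$posterior)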
Repeat (d) using QDA
qda.fit = qda(Direction~Lag2, x_train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = x_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
qda.class = predict(qda.fit, x_test)$class
caret::confusionMatrix(qda.class, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 0 0
## Up 43 61
##
## Accuracy : 0.5865
## 95% CI : (0.4858, 0.6823)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.5419
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.504e-10
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.5865
## Prevalence : 0.4135
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Down
##
mean(qda.class==y_test) # test accuracy
## [1] 0.5865385
mean(qda.class!=y_test) # test error rate (what matters most: performance on new data)
## [1] 0.4134615
We achieve an accuracy of 58.65%.
Fitting a QDA model, the prior probabilities and group means are identical to those seen in the LDA model. However, this model predicts Up for every test week, so its accuracy simply equals the no-information rate of 58.65%, the same as the naive always-Up baseline and worse than the logistic regression and LDA results in (d).
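We can confirm this degenerate behavior directly: QDA's posterior probability for Down never reaches 0.5 on any test week.
# Largest posterior assigned to "Down" across the test set (stays below 0.5)
max(predict(qda.fit, x_test)$posterior[, "Down"])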
Repeat (d) using KNN with K = 1
set.seed(123)
x.train = as.matrix(week$Lag2[train])
x.test = as.matrix(week$Lag2[!train])
y.train = week$Direction[train]
knn.preds = knn(x.train, x.test, y.train, k=1)
caret::confusionMatrix(knn.preds, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 21 29
## Up 22 32
##
## Accuracy : 0.5096
## 95% CI : (0.4097, 0.609)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.9540
##
## Kappa : 0.0127
##
## Mcnemar's Test P-Value : 0.4008
##
## Sensitivity : 0.4884
## Specificity : 0.5246
## Pos Pred Value : 0.4200
## Neg Pred Value : 0.5926
## Prevalence : 0.4135
## Detection Rate : 0.2019
## Detection Prevalence : 0.4808
## Balanced Accuracy : 0.5065
##
## 'Positive' Class : Down
##
mean(knn.preds == y_test)
## [1] 0.5096154
mean(knn.preds != y_test)
## [1] 0.4903846
We achieve an accuracy of 50.96%.
With KNN (K = 1), both the accuracy and the test error rate worsen: at roughly 51% we are barely beating random chance and fall short of the always-Up baseline. This model does the poorest job so far of predicting new data.
Which of these methods appears to provide the best results on this data? Logistic regression and LDA, each fit on the training years with Lag2 alone, did the best compared to the other models: both correctly predicted 62.5% of the held-out data, a test error rate of 37.5%.
Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
hist(week$Volume)
Because the Volume variable is right-skewed, it may be worth log-transforming it and seeing how this affects the model.
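Before fitting, a quick check that the transform behaves as expected:
# The log transform should pull in Volume's long right tail
hist(log(week$Volume), main = "Histogram of log(Volume)", xlab = "log(Volume)")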
glm.fit3=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+log(Volume), x_train, family = binomial)
glm.probs=predict(glm.fit3, x_test, type="response")
glm.preds = ifelse(glm.probs>=0.5, "Up", "Down")
caret::confusionMatrix(as.factor(glm.preds), y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 23 31
## Up 20 30
##
## Accuracy : 0.5096
## 95% CI : (0.4097, 0.609)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.9540
##
## Kappa : 0.0257
##
## Mcnemar's Test P-Value : 0.1614
##
## Sensitivity : 0.5349
## Specificity : 0.4918
## Pos Pred Value : 0.4259
## Neg Pred Value : 0.6000
## Prevalence : 0.4135
## Detection Rate : 0.2212
## Detection Prevalence : 0.5192
## Balanced Accuracy : 0.5133
##
## 'Positive' Class : Down
##
mean(glm.preds==y_test) # test accuracy
## [1] 0.5096154
mean(glm.preds!=y_test) # test error rate (what matters most: performance on new data)
## [1] 0.4903846
Despite the more symmetric histogram, fitting this model with all five lag predictors plus a log-transformed Volume worsens the test error rate: the model now predicts new data correctly only about 51% of the time, compared with the 62.5% seen in (d).
lda.fit2 = lda(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+log(Volume)+Lag1:Lag2, x_train)
lda.fit2
## Call:
## lda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + log(Volume) +
## Lag1:Lag2, data = x_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag1 Lag2 Lag3 Lag4 Lag5 log(Volume)
## Down 0.289444444 -0.03568254 0.17080045 0.15925624 0.21409297 -0.2386605
## Up -0.009213235 0.26036581 0.08404044 0.09220956 0.04548897 -0.3281402
## Lag1:Lag2
## Down -0.8014495
## Up -0.1393632
##
## Coefficients of linear discriminants:
## LD1
## Lag1 -0.24453155
## Lag2 0.20964391
## Lag3 -0.05951762
## Lag4 -0.13167014
## Lag5 -0.15170339
## log(Volume) -0.45849491
## Lag1:Lag2 0.01148372
lda.class = predict(lda.fit2, x_test)$class
caret::confusionMatrix(lda.class, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 22 30
## Up 21 31
##
## Accuracy : 0.5096
## 95% CI : (0.4097, 0.609)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.9540
##
## Kappa : 0.0192
##
## Mcnemar's Test P-Value : 0.2626
##
## Sensitivity : 0.5116
## Specificity : 0.5082
## Pos Pred Value : 0.4231
## Neg Pred Value : 0.5962
## Prevalence : 0.4135
## Detection Rate : 0.2115
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.5099
##
## 'Positive' Class : Down
##
mean(lda.class==y_test) # test accuracy
## [1] 0.5096154
mean(lda.class!=y_test) # test error rate (what matters most: performance on new data)
## [1] 0.4903846
The LDA model shows similar results to the logistic regression above even after adding the Lag1:Lag2 interaction, correctly predicting only 51% of the test data.
qda.fit = qda(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+log(Volume), x_train)
qda.fit
## Call:
## qda(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + log(Volume),
## data = x_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag1 Lag2 Lag3 Lag4 Lag5 log(Volume)
## Down 0.289444444 -0.03568254 0.17080045 0.15925624 0.21409297 -0.2386605
## Up -0.009213235 0.26036581 0.08404044 0.09220956 0.04548897 -0.3281402
qda.class = predict(qda.fit, x_test)$class
caret::confusionMatrix(qda.class, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 18 25
## Up 25 36
##
## Accuracy : 0.5192
## 95% CI : (0.4191, 0.6183)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.9316
##
## Kappa : 0.0088
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.4186
## Specificity : 0.5902
## Pos Pred Value : 0.4186
## Neg Pred Value : 0.5902
## Prevalence : 0.4135
## Detection Rate : 0.1731
## Detection Prevalence : 0.4135
## Balanced Accuracy : 0.5044
##
## 'Positive' Class : Down
##
mean(qda.class==y_test) # test accuracy
## [1] 0.5192308
mean(qda.class!=y_test) # test error rate (what matters most: performance on new data)
## [1] 0.4807692
qda.fit = qda(Direction~Lag2+log(Volume), x_train)
qda.fit
## Call:
## qda(Direction ~ Lag2 + log(Volume), data = x_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2 log(Volume)
## Down -0.03568254 -0.2386605
## Up 0.26036581 -0.3281402
qda.class = predict(qda.fit, x_test)$class
caret::confusionMatrix(qda.class, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 21 24
## Up 22 37
##
## Accuracy : 0.5577
## 95% CI : (0.457, 0.655)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.7579
##
## Kappa : 0.0943
##
## Mcnemar's Test P-Value : 0.8828
##
## Sensitivity : 0.4884
## Specificity : 0.6066
## Pos Pred Value : 0.4667
## Neg Pred Value : 0.6271
## Prevalence : 0.4135
## Detection Rate : 0.2019
## Detection Prevalence : 0.4327
## Balanced Accuracy : 0.5475
##
## 'Positive' Class : Down
##
mean(qda.class==y_test) # test accuracy
## [1] 0.5576923
mean(qda.class!=y_test) # test error rate (what matters most: performance on new data)
## [1] 0.4423077
qda.fit = qda(Direction~I(Lag2**3)+log(Volume), x_train)
qda.fit
## Call:
## qda(Direction ~ I(Lag2^3) + log(Volume), data = x_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## I(Lag2^3) log(Volume)
## Down -13.71386 -0.2386605
## Up 1.09383 -0.3281402
qda.class = predict(qda.fit, x_test)$class
caret::confusionMatrix(qda.class, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 3 5
## Up 40 56
##
## Accuracy : 0.5673
## 95% CI : (0.4665, 0.6641)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.6921
##
## Kappa : -0.0139
##
## Mcnemar's Test P-Value : 4.011e-07
##
## Sensitivity : 0.06977
## Specificity : 0.91803
## Pos Pred Value : 0.37500
## Neg Pred Value : 0.58333
## Prevalence : 0.41346
## Detection Rate : 0.02885
## Detection Prevalence : 0.07692
## Balanced Accuracy : 0.49390
##
## 'Positive' Class : Down
##
mean(qda.class==y_test) # test accuracy
## [1] 0.5673077
mean(qda.class!=y_test) # test error rate (what matters most: performance on new data)
## [1] 0.4326923
qda.fit = qda(Direction~Lag2, x_train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = x_train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
qda.class = predict(qda.fit, x_test)$class
caret::confusionMatrix(qda.class, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 0 0
## Up 43 61
##
## Accuracy : 0.5865
## 95% CI : (0.4858, 0.6823)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.5419
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.504e-10
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.5865
## Prevalence : 0.4135
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Down
##
mean(qda.class==y_test) # test accuracy
## [1] 0.5865385
mean(qda.class!=y_test) # test error rate (what matters most: performance on new data)
## [1] 0.4134615
The test error rate improves as predictors are removed, the right-skewed Volume variable is log-transformed, and the remaining predictor is cubed; each change brought a small improvement on new data. The final (Lag2-only) QDA fit predicted new data about 58.7% of the time, though only by always predicting Up, and none of these models beat the original model in (d).
Turning back to KNN: increasing the value of k (here to 90, with Lag1 and Lag2 as predictors) lowers the test error rate to about 45%.
set.seed(123)
combo.train = cbind(week$Lag1, week$Lag2)[train,]
combo.test = cbind(week$Lag1, week$Lag2)[!train,]
x.train = as.matrix(combo.train)
x.test = as.matrix(combo.test)
y.train = week$Direction[train]
knn.preds = knn(x.train, x.test, y.train, k=90)
caret::confusionMatrix(knn.preds, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 14 18
## Up 29 43
##
## Accuracy : 0.5481
## 95% CI : (0.4474, 0.6459)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.8152
##
## Kappa : 0.0317
##
## Mcnemar's Test P-Value : 0.1447
##
## Sensitivity : 0.3256
## Specificity : 0.7049
## Pos Pred Value : 0.4375
## Neg Pred Value : 0.5972
## Prevalence : 0.4135
## Detection Rate : 0.1346
## Detection Prevalence : 0.3077
## Balanced Accuracy : 0.5152
##
## 'Positive' Class : Down
##
mean(knn.preds == y_test)
## [1] 0.5480769
mean(knn.preds != y_test)
## [1] 0.4519231
Including 2 predictors and increasing k to 90 raises the accuracy to about 55%, a test error rate of about 45%. That improves on KNN with k = 1 but still falls short of the model in (d).
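Rather than trying values of k one at a time, a small sweep (a sketch reusing the two-predictor matrices above) makes the error-versus-k trend easier to see:
# Test error across a range of k values (Lag1 + Lag2 as predictors)
for (k in c(1, 5, 15, 45, 90)) {
  set.seed(123)
  preds = knn(x.train, x.test, y.train, k = k)
  cat("k =", k, " test error =", round(mean(preds != y_test), 3), "\n")
}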
set.seed(123)
combo.train = cbind(week$Lag1, week$Lag2, week$Lag3, week$Lag4, week$Lag5)[train,]
combo.test = cbind(week$Lag1, week$Lag2, week$Lag3, week$Lag4, week$Lag5)[!train,]
x.train = as.matrix(combo.train)
x.test = as.matrix(combo.test)
y.train = week$Direction[train]
knn.preds = knn(x.train, x.test, y.train, k=3)
caret::confusionMatrix(knn.preds, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 21 24
## Up 22 37
##
## Accuracy : 0.5577
## 95% CI : (0.457, 0.655)
## No Information Rate : 0.5865
## P-Value [Acc > NIR] : 0.7579
##
## Kappa : 0.0943
##
## Mcnemar's Test P-Value : 0.8828
##
## Sensitivity : 0.4884
## Specificity : 0.6066
## Pos Pred Value : 0.4667
## Neg Pred Value : 0.6271
## Prevalence : 0.4135
## Detection Rate : 0.2019
## Detection Prevalence : 0.4327
## Balanced Accuracy : 0.5475
##
## 'Positive' Class : Down
##
mean(knn.preds == y_test)
## [1] 0.5576923
mean(knn.preds != y_test)
## [1] 0.4423077
Using all five lag variables with k = 3 gives 55.8% accuracy, still below the 62.5% from (d). Across all of these experiments, logistic regression and LDA with Lag2 alone remain the best performers on the held-out data.
Next, we develop a model to predict whether a given car gets high or low gas mileage, based on the Auto data set.
cars = Auto
any(is.na(cars))
## [1] FALSE
Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
cars$mpg01 = ifelse(cars$mpg > median(cars$mpg), 1, 0)
median(cars$mpg)
## [1] 22.75
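By construction the two classes should be close to balanced; a quick check:
# Class counts for the new binary variable
table(cars$mpg01)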
Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
corrplot(cor(cars[,-9]), method = "number")
The variables most positively correlated with our binary variable mpg01 are mpg, origin, and year.
The variables most negatively correlated with mpg01 are cylinders, displacement, and weight; these show the strongest correlations overall.
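As the prompt suggests, boxplots make this separation easy to see; a sketch for the three strongest negative correlates:
# Distribution of each predictor split by mpg01 (0 = low mpg, 1 = high mpg)
par(mfrow = c(1, 3))
boxplot(cylinders ~ mpg01, data = cars, main = "cylinders")
boxplot(displacement ~ mpg01, data = cars, main = "displacement")
boxplot(weight ~ mpg01, data = cars, main = "weight")
par(mfrow = c(1, 1))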
Split the data into a training set and a test set.
set.seed(123)
split = sample.split(cars, SplitRatio = 0.50)
x_train = subset(cars, split == TRUE)
x_test = subset(cars, split == FALSE)
y_train = subset(cars$mpg01, split == TRUE)
y_test = subset(cars$mpg01, split == FALSE)
cars$mpg01 = as.factor(cars$mpg01)
Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
lda.fit = lda(mpg01~cylinders+displacement+weight, x_train)
lda.fit
## Call:
## lda(mpg01 ~ cylinders + displacement + weight, data = x_train)
##
## Prior probabilities of groups:
## 0 1
## 0.5102041 0.4897959
##
## Group means:
## cylinders displacement weight
## 0 6.670000 267.4300 3556.360
## 1 4.229167 117.7812 2347.333
##
## Coefficients of linear discriminants:
## LD1
## cylinders -0.4496559276
## displacement -0.0002958369
## weight -0.0008855457
lda.class = predict(lda.fit,x_test)$class
caret::confusionMatrix(lda.class, as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 84 5
## 1 12 95
##
## Accuracy : 0.9133
## 95% CI : (0.8648, 0.9487)
## No Information Rate : 0.5102
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8262
##
## Mcnemar's Test P-Value : 0.1456
##
## Sensitivity : 0.8750
## Specificity : 0.9500
## Pos Pred Value : 0.9438
## Neg Pred Value : 0.8879
## Prevalence : 0.4898
## Detection Rate : 0.4286
## Detection Prevalence : 0.4541
## Balanced Accuracy : 0.9125
##
## 'Positive' Class : 0
##
mean(lda.class!=y_test)
## [1] 0.08673469
1-mean(lda.class!=y_test)
## [1] 0.9132653
This LDA model achieves a test error rate of 0.087, so the model correctly classified about 91.3% of the new data.
Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
qda.fit = qda(mpg01~cylinders+displacement+weight, x_train)
qda.fit
## Call:
## qda(mpg01 ~ cylinders + displacement + weight, data = x_train)
##
## Prior probabilities of groups:
## 0 1
## 0.5102041 0.4897959
##
## Group means:
## cylinders displacement weight
## 0 6.670000 267.4300 3556.360
## 1 4.229167 117.7812 2347.333
qda.class = predict(qda.fit, x_test)$class
caret::confusionMatrix(qda.class, as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 87 7
## 1 9 93
##
## Accuracy : 0.9184
## 95% CI : (0.8708, 0.9526)
## No Information Rate : 0.5102
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8366
##
## Mcnemar's Test P-Value : 0.8026
##
## Sensitivity : 0.9062
## Specificity : 0.9300
## Pos Pred Value : 0.9255
## Neg Pred Value : 0.9118
## Prevalence : 0.4898
## Detection Rate : 0.4439
## Detection Prevalence : 0.4796
## Balanced Accuracy : 0.9181
##
## 'Positive' Class : 0
##
mean(qda.class!=y_test)
## [1] 0.08163265
1-mean(qda.class!=y_test)
## [1] 0.9183673
This QDA model achieves a slightly lower test error rate of 0.082, correctly classifying about 91.8% of the new data.
Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
glm.fit = glm(mpg01~cylinders+displacement+weight, x_train, family = binomial)
glm.probs=predict(glm.fit, x_test, type="response")
glm.preds = ifelse(glm.probs>=0.5, 1,0)
caret::confusionMatrix(as.factor(glm.preds), as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 86 7
## 1 10 93
##
## Accuracy : 0.9133
## 95% CI : (0.8648, 0.9487)
## No Information Rate : 0.5102
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8263
##
## Mcnemar's Test P-Value : 0.6276
##
## Sensitivity : 0.8958
## Specificity : 0.9300
## Pos Pred Value : 0.9247
## Neg Pred Value : 0.9029
## Prevalence : 0.4898
## Detection Rate : 0.4388
## Detection Prevalence : 0.4745
## Balanced Accuracy : 0.9129
##
## 'Positive' Class : 0
##
mean(glm.preds!=y_test)
## [1] 0.08673469
1-mean(glm.preds!=y_test)
## [1] 0.9132653
The logistic regression model performs slightly worse than QDA, with a test error rate of 0.087, matching the LDA model; it correctly classified 91.3% of the new data.
Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
combo.train = cbind(cars$cylinders, cars$displacement, cars$weight)[split,]
combo.test = cbind(cars$cylinders, cars$displacement, cars$weight)[!split,]
x.train = as.matrix(combo.train)
x.test = as.matrix(combo.test)
y.train = cars$mpg01[split]
knn.preds = knn(x.train, x.test, y.train, k=1)
caret::confusionMatrix(knn.preds, as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 82 17
## 1 14 83
##
## Accuracy : 0.8418
## 95% CI : (0.7831, 0.8899)
## No Information Rate : 0.5102
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6837
##
## Mcnemar's Test P-Value : 0.7194
##
## Sensitivity : 0.8542
## Specificity : 0.8300
## Pos Pred Value : 0.8283
## Neg Pred Value : 0.8557
## Prevalence : 0.4898
## Detection Rate : 0.4184
## Detection Prevalence : 0.5051
## Balanced Accuracy : 0.8421
##
## 'Positive' Class : 0
##
mean(knn.preds != y_test)
## [1] 0.1581633
1-mean(knn.preds != y_test)
## [1] 0.8418367
combo.train = cbind(cars$cylinders, cars$displacement, cars$weight)[split,]
combo.test = cbind(cars$cylinders, cars$displacement, cars$weight)[!split,]
x.train = as.matrix(combo.train)
x.test = as.matrix(combo.test)
y.train = cars$mpg01[split]
knn.preds = knn(x.train, x.test, y.train, k=3)
caret::confusionMatrix(knn.preds, as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 85 15
## 1 11 85
##
## Accuracy : 0.8673
## 95% CI : (0.8117, 0.9115)
## No Information Rate : 0.5102
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7348
##
## Mcnemar's Test P-Value : 0.5563
##
## Sensitivity : 0.8854
## Specificity : 0.8500
## Pos Pred Value : 0.8500
## Neg Pred Value : 0.8854
## Prevalence : 0.4898
## Detection Rate : 0.4337
## Detection Prevalence : 0.5102
## Balanced Accuracy : 0.8677
##
## 'Positive' Class : 0
##
mean(knn.preds != y_test)
## [1] 0.1326531
1-mean(knn.preds != y_test)
## [1] 0.8673469
combo.train = cbind(cars$cylinders, cars$displacement, cars$weight)[split,]
combo.test = cbind(cars$cylinders, cars$displacement, cars$weight)[!split,]
x.train = as.matrix(combo.train)
x.test = as.matrix(combo.test)
y.train = cars$mpg01[split]
knn.preds = knn(x.train, x.test, y.train, k=50)
caret::confusionMatrix(knn.preds, as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 86 7
## 1 10 93
##
## Accuracy : 0.9133
## 95% CI : (0.8648, 0.9487)
## No Information Rate : 0.5102
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8263
##
## Mcnemar's Test P-Value : 0.6276
##
## Sensitivity : 0.8958
## Specificity : 0.9300
## Pos Pred Value : 0.9247
## Neg Pred Value : 0.9029
## Prevalence : 0.4898
## Detection Rate : 0.4388
## Detection Prevalence : 0.4745
## Balanced Accuracy : 0.9129
##
## 'Positive' Class : 0
##
mean(knn.preds != y_test)
## [1] 0.08673469
1-mean(knn.preds != y_test)
## [1] 0.9132653
combo.train = cbind(cars$cylinders, cars$displacement, cars$weight)[split,]
combo.test = cbind(cars$cylinders, cars$displacement, cars$weight)[!split,]
x.train = as.matrix(combo.train)
x.test = as.matrix(combo.test)
y.train = cars$mpg01[split]
knn.preds = knn(x.train, x.test, y.train, k=149)
caret::confusionMatrix(knn.preds, as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 80 3
## 1 16 97
##
## Accuracy : 0.9031
## 95% CI : (0.8528, 0.9406)
## No Information Rate : 0.5102
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8055
##
## Mcnemar's Test P-Value : 0.005905
##
## Sensitivity : 0.8333
## Specificity : 0.9700
## Pos Pred Value : 0.9639
## Neg Pred Value : 0.8584
## Prevalence : 0.4898
## Detection Rate : 0.4082
## Detection Prevalence : 0.4235
## Balanced Accuracy : 0.9017
##
## 'Positive' Class : 0
##
mean(knn.preds != y_test)
## [1] 0.09693878
1-mean(knn.preds != y_test)
## [1] 0.9030612
The best KNN model achieves a test error rate of 0.087, correctly predicting 91.3% of the new data, similar to the LDA and logistic regression models above. This was achieved with K = 50; the other values tried (1, 3, and 149) all performed worse.
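One caveat worth noting: these three predictors sit on very different scales (weight is in the thousands while cylinders is single digits), so weight dominates the Euclidean distances KNN uses. A hedged variant that standardizes each column with training-set statistics (not done above):
# Standardize predictors before KNN so no single column dominates the distance
mu = colMeans(x.train); sdev = apply(x.train, 2, sd)
knn.preds.std = knn(scale(x.train, mu, sdev), scale(x.test, mu, sdev), y.train, k = 50)
mean(knn.preds.std != y_test) # test error with standardized predictors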
Finally, using the Boston data set, we fit classification models to predict whether a given suburb has a crime rate above or below the median.
bost = Boston
any(is.na(bost))
## [1] FALSE
# create binary variable
bost$crim.pred = ifelse(bost$crim > median(bost$crim), 1, 0)
median(bost$crim)
## [1] 0.25651
corrplot(cor(bost), method = "number", type = "upper")
Variables that appear to be highly correlated with our binary response variable are: nox, age, dis, rad, indus, and tax
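A quick graphical check of a few of these (crim.pred is still numeric 0/1 at this point):
# Boxplots of three strong correlates split by the crime-rate class
par(mfrow = c(1, 3))
boxplot(nox ~ crim.pred, data = bost, main = "nox")
boxplot(rad ~ crim.pred, data = bost, main = "rad")
boxplot(tax ~ crim.pred, data = bost, main = "tax")
par(mfrow = c(1, 1))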
Now we will explore a few models to see which one best predicts new data points.
# splitting the data
set.seed(1)
split = sample.split(bost, SplitRatio = 0.50)
x_train = subset(bost, split == TRUE)
x_test = subset(bost, split == FALSE)
y_train = subset(bost$crim.pred, split == TRUE)
y_test = subset(bost$crim.pred, split == FALSE)
bost$crim.pred = as.factor(bost$crim.pred)
Logistic Regression:
glm.fit = glm(crim.pred~nox+age+dis+rad+indus+tax, x_train, family = binomial)
glm.probs=predict(glm.fit, x_test, type="response")
glm.preds = ifelse(glm.probs>=0.5, 1, 0)
caret::confusionMatrix(as.factor(glm.preds), as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 115 18
## 1 23 114
##
## Accuracy : 0.8481
## 95% CI : (0.7997, 0.8888)
## No Information Rate : 0.5111
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6964
##
## Mcnemar's Test P-Value : 0.5322
##
## Sensitivity : 0.8333
## Specificity : 0.8636
## Pos Pred Value : 0.8647
## Neg Pred Value : 0.8321
## Prevalence : 0.5111
## Detection Rate : 0.4259
## Detection Prevalence : 0.4926
## Balanced Accuracy : 0.8485
##
## 'Positive' Class : 0
##
mean(glm.preds!=y_test)
## [1] 0.1518519
1-mean(glm.preds!=y_test)
## [1] 0.8481481
Test error rate: 0.152
The model correctly predicted 84.8% of the new data
Linear Discriminant Analysis:
lda.fit = lda(crim.pred~nox+age+dis+rad+indus+tax, x_train)
lda.fit
## Call:
## lda(crim.pred ~ nox + age + dis + rad + indus + tax, data = x_train)
##
## Prior probabilities of groups:
## 0 1
## 0.4872881 0.5127119
##
## Group means:
## nox age dis rad indus tax
## 0 0.4702070 50.53739 5.160057 4.069565 7.049478 310.6696
## 1 0.6380331 84.82231 2.548820 14.595041 15.066529 504.6860
##
## Coefficients of linear discriminants:
## LD1
## nox 7.272485559
## age 0.007580730
## dis -0.081812204
## rad 0.101940253
## indus 0.010295835
## tax -0.002892852
lda.class = predict(lda.fit,x_test)$class
caret::confusionMatrix(lda.class, as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 129 29
## 1 9 103
##
## Accuracy : 0.8593
## 95% CI : (0.812, 0.8984)
## No Information Rate : 0.5111
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7174
##
## Mcnemar's Test P-Value : 0.002055
##
## Sensitivity : 0.9348
## Specificity : 0.7803
## Pos Pred Value : 0.8165
## Neg Pred Value : 0.9196
## Prevalence : 0.5111
## Detection Rate : 0.4778
## Detection Prevalence : 0.5852
## Balanced Accuracy : 0.8575
##
## 'Positive' Class : 0
##
mean(lda.class!=y_test)
## [1] 0.1407407
1-mean(lda.class!=y_test)
## [1] 0.8592593
Test error rate: 0.141
The model correctly predicted 85.9% of the new data
Quadratic Discriminant Analysis:
qda.fit = qda(crim.pred~nox+age+dis+rad+indus+tax, x_train)
qda.fit
## Call:
## qda(crim.pred ~ nox + age + dis + rad + indus + tax, data = x_train)
##
## Prior probabilities of groups:
## 0 1
## 0.4872881 0.5127119
##
## Group means:
## nox age dis rad indus tax
## 0 0.4702070 50.53739 5.160057 4.069565 7.049478 310.6696
## 1 0.6380331 84.82231 2.548820 14.595041 15.066529 504.6860
qda.class = predict(qda.fit, x_test)$class
caret::confusionMatrix(qda.class, as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 133 19
## 1 5 113
##
## Accuracy : 0.9111
## 95% CI : (0.8706, 0.9422)
## No Information Rate : 0.5111
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8217
##
## Mcnemar's Test P-Value : 0.007963
##
## Sensitivity : 0.9638
## Specificity : 0.8561
## Pos Pred Value : 0.8750
## Neg Pred Value : 0.9576
## Prevalence : 0.5111
## Detection Rate : 0.4926
## Detection Prevalence : 0.5630
## Balanced Accuracy : 0.9099
##
## 'Positive' Class : 0
##
mean(qda.class!=y_test)
## [1] 0.08888889
1-mean(qda.class!=y_test)
## [1] 0.9111111
Test error rate: 0.089
The model correctly predicted 91.1% of the new data
K-Nearest Neighbors:
combo.train = cbind(bost$nox, bost$age, bost$dis, bost$rad, bost$indus, bost$tax)[split,]
combo.test = cbind(bost$nox, bost$age, bost$dis, bost$rad, bost$indus, bost$tax)[!split,]
x.train = as.matrix(combo.train)
x.test = as.matrix(combo.test)
y.train = bost$crim.pred[split]
knn.preds = knn(x.train, x.test, y.train, k=1)
caret::confusionMatrix(knn.preds, as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 124 7
## 1 14 125
##
## Accuracy : 0.9222
## 95% CI : (0.8836, 0.9512)
## No Information Rate : 0.5111
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8445
##
## Mcnemar's Test P-Value : 0.1904
##
## Sensitivity : 0.8986
## Specificity : 0.9470
## Pos Pred Value : 0.9466
## Neg Pred Value : 0.8993
## Prevalence : 0.5111
## Detection Rate : 0.4593
## Detection Prevalence : 0.4852
## Balanced Accuracy : 0.9228
##
## 'Positive' Class : 0
##
mean(knn.preds != y_test)
## [1] 0.07777778
1-mean(knn.preds != y_test)
## [1] 0.9222222
Increasing k to 3:
knn.preds = knn(x.train, x.test, y.train, k=3)
caret::confusionMatrix(knn.preds, as.factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 126 9
## 1 12 123
##
## Accuracy : 0.9222
## 95% CI : (0.8836, 0.9512)
## No Information Rate : 0.5111
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8444
##
## Mcnemar's Test P-Value : 0.6625
##
## Sensitivity : 0.9130
## Specificity : 0.9318
## Pos Pred Value : 0.9333
## Neg Pred Value : 0.9111
## Prevalence : 0.5111
## Detection Rate : 0.4667
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9224
##
## 'Positive' Class : 0
##
mean(knn.preds != y_test)
## [1] 0.07777778
1-mean(knn.preds != y_test)
## [1] 0.9222222
Test error rate: 0.078
The model correctly predicted 92.2% of the new data
The KNN model performed the best on this data set, correctly predicting the highest share of new data at 92.2%. This held for both K = 1 and K = 3.