Question 13
(a) Direction is closely tied to Today (median return of -1.33 in Down
weeks versus 1.25 in Up weeks, p < 0.001) and more weakly to Lag1
(p = 0.043) and Lag2 (p = 0.045), as shown in the scatter plot and
summary tables below.
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
| Characteristic | N = 1,089 |
|---|---|
| Year | 2,000.0 (1,995.0, 2,005.0) |
| Lag1 | 0.24 (-1.15, 1.41) |
| Lag2 | 0.24 (-1.15, 1.41) |
| Lag3 | 0.24 (-1.16, 1.41) |
| Lag4 | 0.24 (-1.16, 1.41) |
| Lag5 | 0.23 (-1.17, 1.41) |
| Volume | 1.00 (0.33, 2.05) |
| Today | 0.24 (-1.15, 1.41) |
| Direction | |
| Down | 484 (44%) |
| Up | 605 (56%) |

| Characteristic | Down, N = 484 | Up, N = 605 | p-value |
|---|---|---|---|
| Year | 2,000.0 (1,995.0, 2,005.0) | 2,000.0 (1,995.0, 2,005.0) | 0.5 |
| Lag1 | 0.38 (-0.94, 1.59) | 0.10 (-1.24, 1.31) | 0.043 |
| Lag2 | 0.15 (-1.31, 1.30) | 0.30 (-1.00, 1.46) | 0.045 |
| Lag3 | 0.25 (-1.16, 1.41) | 0.22 (-1.17, 1.42) | 0.7 |
| Lag4 | 0.22 (-1.16, 1.45) | 0.24 (-1.16, 1.35) | 0.6 |
| Lag5 | 0.33 (-1.10, 1.51) | 0.13 (-1.20, 1.34) | 0.2 |
| Volume | 1.07 (0.34, 2.02) | 0.93 (0.33, 2.09) | 0.4 |
| Today | -1.33 (-2.29, -0.59) | 1.25 (0.63, 2.22) | <0.001 |
(b) Lag2 is the only predictor that is statistically significant at
the 5% level in the logistic regression (OR 1.06, 95% CI 1.01 to 1.12,
p = 0.030).
Logistic Regression Odds Ratios
| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| Lag1 | 0.96 | 0.91, 1.01 | 0.12 |
| Lag2 | 1.06 | 1.01, 1.12 | 0.030 |
| Lag3 | 0.98 | 0.93, 1.04 | 0.5 |
| Lag4 | 0.97 | 0.92, 1.02 | 0.3 |
| Lag5 | 0.99 | 0.94, 1.04 | 0.6 |
| Volume | 0.98 | 0.91, 1.05 | 0.5 |

Abbreviations: CI = Confidence Interval, OR = Odds Ratio
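(c) The confusion matrix shows that the glm model predictions are
about 56% accurate on the data used to fit the model. That is barely
above the 55.6% (605/1089) that would come from predicting Up for every
week. The model did much better on Ups than on Downs: it classified 557
of the 605 Up weeks correctly (92%) but only 54 of the 484 Down weeks
(11%). This is less a matter of recognizing Ups than of predicting Up
for almost every week (987 of 1,089). Because the predictors are weak,
the fitted probabilities likely stay close to the overall share of Up
weeks (56%), which is above the 0.5 cutoff, so the class imbalance
pushes nearly all predictions toward Up.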
## Actual
## Predicted Down Up Sum
## Down 54 48 102
## Up 430 557 987
## Sum 484 605 1089
## [1] 0.5610652
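The per-class rates quoted in (c) can be recomputed directly from the confusion matrix counts above. This is only an illustrative sketch; the object names (`cm`, `up_recall`, and so on) are assumptions and not names from the original analysis.

```r
# Confusion matrix counts copied from the output above
# (rows = predicted class, columns = actual class).
cm <- matrix(c(54, 430, 48, 557), nrow = 2,
             dimnames = list(Predicted = c("Down", "Up"),
                             Actual = c("Down", "Up")))

accuracy    <- sum(diag(cm)) / sum(cm)                  # overall fraction correct
up_recall   <- cm["Up", "Up"] / sum(cm[, "Up"])         # actual Ups predicted as Up
down_recall <- cm["Down", "Down"] / sum(cm[, "Down"])   # actual Downs predicted as Down
baseline    <- max(colSums(cm)) / sum(cm)               # always predicting the majority class

round(c(accuracy = accuracy, up_recall = up_recall,
        down_recall = down_recall, baseline = baseline), 3)
```

Comparing `accuracy` with `baseline` is a quick check on whether a classifier is doing more than predicting the majority class.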
(d) See below for GLM table and success rate.
## Actual
## Predicted Down Up Sum
## Down 9 5 14
## Up 34 56 90
## Sum 43 61 104
## [1] 0.625
(e) See below for LDA table and success rate.
## Actual
## Predicted Down Up Sum
## Down 9 5 14
## Up 34 56 90
## Sum 43 61 104
## [1] 0.625
(f) See below for QDA table and success rate.
## Actual
## Predicted Down Up Sum
## Up 43 61 104
## Sum 43 61 104
## [1] 0.5865385
(g) See below for KNN table and success rate for when K=1.
## Actual
## Predicted Down Up Sum
## Down 21 29 50
## Up 22 32 54
## Sum 43 61 104
## [1] 0.5096154
(h) See below for Naive Bayes table and success rate.
## Actual
## Predicted Down Up Sum
## Up 43 61 104
## Sum 43 61 104
## [1] 0.5865385
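(i) The logistic regression and LDA models gave the best results on
the test data, both predicting 62.5% of the weeks correctly. QDA and
naive Bayes reached 58.7% only by predicting Up for every week, and KNN
with K=1 was close to a coin flip at 51.0%.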
(j) Adding the “Today” variable leads the GLM model to 100% accuracy
on the test data. This is not collinearity but leakage: Direction
records whether Today's return was positive or negative (see the Today
row in part (a)), so Today separates the two classes perfectly. That is
also why glm warns that the algorithm did not converge and that fitted
probabilities of 0 or 1 occurred. Removing “Today” drops the accuracy to
51.0%. Including the same variable in the LDA model produced 98%
accurate predictions, and removing it again reduced the accuracy to
51.0%. Since “Today” is not a legitimate predictor of its own week's
direction, I left it out of the remaining models and kept Lag1, Lag2,
Lag5, and Year for QDA (chosen from the p-values in part (a), although
only Lag1 and Lag2 were below 0.05 there). For QDA, the accuracy was
only 48%.
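The perfect score with “Today” is what one would expect whenever a predictor defines the response. Here is a minimal simulated sketch of that situation; the data and the names `ret`, `dir`, and `fit` are made up for illustration and are not taken from the analysis.

```r
set.seed(1)

# Simulated weekly returns and a direction label defined by their sign,
# mimicking how Direction relates to Today.
ret <- rnorm(200)
dir <- factor(ifelse(ret > 0, "Up", "Down"), levels = c("Down", "Up"))

# The classes are perfectly separable, so glm() is expected to warn about
# non-convergence and/or fitted probabilities of 0 or 1; the warnings are
# suppressed here only to keep the demo output short.
fit <- suppressWarnings(glm(dir ~ ret, family = binomial))

# fitted() returns the estimated probability of the second level, "Up".
pred <- ifelse(fitted(fit) > 0.5, "Up", "Down")
mean(pred == dir)  # training accuracy
```

An accuracy at or near 1 together with those warnings is a sign to check whether a predictor is just a restatement of the outcome.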
Using these same predictors, KNN with K=1 reached only 45.2% accuracy.
A common rule of thumb for KNN is to set K near the square root of the
number of observations in the training dataset, so I used K=31. This
still gave an accuracy rate of only 54.8%. Among the models tried in
this part without “Today”, I would select KNN with K=31, although it
still falls short of the 62.5% reached by the logistic regression and
LDA models in (d) and (e).
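The square-root rule of thumb is easy to reproduce. This sketch assumes the training set holds the 1089 - 104 = 985 weeks that are not in the test set; `n_train` and `k_rule` are illustrative names.

```r
n_train <- 1089 - 104             # assumed training size: all weeks minus the 104 test weeks
k_rule  <- floor(sqrt(n_train))   # square-root rule of thumb

# An odd K avoids tied votes in a two-class problem.
if (k_rule %% 2 == 0) k_rule <- k_rule + 1
k_rule
```

The rule is only a starting point; the test accuracies above show that the choice of K still has to be checked against held-out data.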
# Logistic regression with Today included among the predictors
glmmodel2 <- glm(newdirection ~ Year + Lag1 + Lag2 + Lag5 + Today, data = train, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glmprobs2 <- predict(glmmodel2, newdata = test, type = "response")
glmpred_class2 <- ifelse(glmprobs2 > 0.5, "Up", "Down")
glmmatrix2 <- table(Predicted = glmpred_class2, Actual = test$Direction)
addmargins(glmmatrix2)
## Actual
## Predicted Down Up Sum
## Down 43 0 43
## Up 0 61 61
## Sum 43 61 104
print(mean(glmpred_class2 == test$Direction))
## [1] 1
# Same logistic regression with Today removed
glmmodel3 <- glm(newdirection ~ Year + Lag1 + Lag2 + Lag5, data = train, family = binomial)
glmprobs3 <- predict(glmmodel3, newdata = test, type = "response")
glmpred_class3 <- ifelse(glmprobs3 > 0.5, "Up", "Down")
glmmatrix3 <- table(Predicted = glmpred_class3, Actual = test$Direction)
addmargins(glmmatrix3)
## Actual
## Predicted Down Up Sum
## Down 18 26 44
## Up 25 35 60
## Sum 43 61 104
print(mean(glmpred_class3 == test$Direction))
## [1] 0.5096154
LDA with Today, table and success rate:
## Actual
## Predicted Down Up Sum
## Down 41 0 41
## Up 2 61 63
## Sum 43 61 104
## [1] 0.9807692
LDA without Today, table and success rate:
## Actual
## Predicted Down Up Sum
## Down 18 26 44
## Up 25 35 60
## Sum 43 61 104
## [1] 0.5096154
QDA without Today, table and success rate:
## Actual
## Predicted Down Up Sum
## Down 17 28 45
## Up 26 33 59
## Sum 43 61 104
## [1] 0.4807692
KNN with K=1, table and success rate:
## Actual
## Predicted Down Up Sum
## Down 27 41 68
## Up 16 20 36
## Sum 43 61 104
## [1] 0.4519231
KNN with K=31, table and success rate:
## Actual
## Predicted Down Up Sum
## Down 25 29 54
## Up 18 32 50
## Sum 43 61 104
## [1] 0.5480769
Question 14
(a) See below.
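library(ISLR2)  # assumed source of the Auto data (the ISLR package also provides it)
auto <- Auto
# mpg01 is 1 when mpg is above the median and 0 otherwise
median_mpg <- median(auto$mpg)
auto$mpg01 <- ifelse(auto$mpg > median_mpg, 1, 0)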
(c) See below.
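library(caret)  # provides createDataPartition()
set.seed(123)  # fixed seed so the split is reproducible
# Hold out roughly 20% of the rows as a test set
train_index <- createDataPartition(auto$mpg01, p = 0.8, list = FALSE)
autotrain_data <- auto[train_index, ]
autotest_data <- auto[-train_index, ]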
(d) See below for LDA model. The test error is only 14.1%.
## Actual
## Predicted 0 1 Sum
## 0 28 0 28
## 1 11 39 50
## Sum 39 39 78
## [1] 0.1410256
(e) See below for QDA model. The test error was only 10.3% for this
one!
## Actual
## Predicted 0 1 Sum
## 0 33 2 35
## 1 6 37 43
## Sum 39 39 78
## [1] 0.1025641
(f) See below for GLM analysis. The test error is about 12.8%.
## Actual
## Predicted 0 1 Sum
## 0 31 2 33
## 1 8 37 45
## Sum 39 39 78
## [1] 0.1282051
(g) See below for Naive Bayes analysis. The test error for this
model was 16.7%.
## Actual
## Predicted 0 1 Sum
## 0 28 2 30
## 1 11 37 48
## Sum 39 39 78
## [1] 0.1666667
(h) See below for KNN analysis. The error rate is 12.8% when K=1,
14.1% when K=5, 10.3% when K=15, and 11.5% when K=17 and K=21.
Therefore, the K=15 model worked the best, although it beats K=17 and
K=21 by only one misclassified car out of 78.
K=1:
## Actual
## Predicted 0 1 Sum
## 0 31 2 33
## 1 8 37 45
## Sum 39 39 78
## [1] 0.1282051
K=5:
## Actual
## Predicted 0 1 Sum
## 0 31 3 34
## 1 8 36 44
## Sum 39 39 78
## [1] 0.1410256
K=15:
## Actual
## Predicted 0 1 Sum
## 0 32 1 33
## 1 7 38 45
## Sum 39 39 78
## [1] 0.1025641
K=17:
## Actual
## Predicted 0 1 Sum
## 0 32 2 34
## 1 7 37 44
## Sum 39 39 78
## [1] 0.1153846
K=21:
## Actual
## Predicted 0 1 Sum
## 0 32 2 34
## 1 7 37 44
## Sum 39 39 78
## [1] 0.1153846
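One way to run the comparison across K in a single pass is a small loop. The sketch below uses simulated two-class data rather than the Auto split, and it assumes the `class` package (which provides `knn()`) is installed. The helper `knn_error_by_k` and the data are hypothetical.

```r
library(class)  # provides knn(); assumed to be installed

# Test error rate of KNN for each K in `ks` (hypothetical helper).
knn_error_by_k <- function(train_x, train_y, test_x, test_y, ks) {
  sapply(ks, function(k) {
    pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
    mean(pred != test_y)
  })
}

set.seed(123)
# Simulated stand-in data: two numeric predictors, class depends on their sum.
n <- 300
x <- matrix(rnorm(2 * n), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.5) > 0, 1, 0))
train_rows <- sample(n, 0.8 * n)

ks <- c(1, 5, 15, 17, 21)
errs <- knn_error_by_k(x[train_rows, ], y[train_rows],
                       x[-train_rows, ], y[-train_rows], ks)
names(errs) <- paste0("K=", ks)
errs
ks[which.min(errs)]  # K with the lowest test error (first one on ties)
```

Because `knn()` breaks voting ties at random, results can shift slightly with the seed, which is one more reason to treat small gaps between K values cautiously.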
Question 16
Data exploration and basic modeling:

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
| Logistic Model with All Available Predictors | log(OR) | 95% CI | p-value |
|---|---|---|---|
| zn | -0.09 | -0.16, -0.03 | 0.011 |
| indus | -0.05 | -0.14, 0.03 | 0.2 |
| chas | 0.62 | -0.79, 2.1 | 0.4 |
| nox | 48 | 34, 63 | <0.001 |
| rm | -0.27 | -1.6, 1.1 | 0.7 |
| age | 0.02 | 0.00, 0.05 | 0.076 |
| dis | 0.67 | 0.27, 1.1 | 0.002 |
| rad | 0.67 | 0.39, 0.99 | <0.001 |
| tax | -0.01 | -0.01, 0.00 | 0.019 |
| ptratio | 0.33 | 0.10, 0.56 | 0.005 |
| lstat | 0.05 | -0.04, 0.15 | 0.3 |
| medv | 0.15 | 0.03, 0.28 | 0.021 |

Abbreviations: CI = Confidence Interval, OR = Odds Ratio

| Logistic Model with Only Statistically Significant Predictors | log(OR) | 95% CI | p-value |
|---|---|---|---|
| zn | -0.08 | -0.14, -0.02 | 0.010 |
| nox | 44 | 33, 57 | <0.001 |
| dis | 0.47 | 0.11, 0.86 | 0.014 |
| rad | 0.70 | 0.44, 0.98 | <0.001 |
| tax | -0.01 | -0.01, 0.00 | 0.003 |
| ptratio | 0.25 | 0.05, 0.46 | 0.015 |
| medv | 0.08 | 0.02, 0.14 | 0.011 |

Abbreviations: CI = Confidence Interval, OR = Odds Ratio

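The coefficients in these tables are on the log-odds scale, so exponentiating them gives odds ratios. Below is a short sketch using the estimates from the reduced model table. Rescaling nox to a 0.01-unit step is my own choice for readability and was not part of the original analysis.

```r
# log(OR) estimates copied from the reduced model table above.
log_or <- c(zn = -0.08, nox = 44, dis = 0.47, rad = 0.70,
            tax = -0.01, ptratio = 0.25, medv = 0.08)

# Odds ratio per one-unit increase in each predictor.
odds_ratio <- exp(log_or)
round(odds_ratio[c("zn", "dis", "rad", "ptratio", "medv")], 2)

# The nox coefficient is very large per full unit, so a smaller
# step (assumed here to be 0.01) is easier to interpret.
nox_or_step <- exp(log_or[["nox"]] * 0.01)
round(nox_or_step, 2)
```

Read this way, the rad estimate of 0.70 corresponds to roughly a doubling of the odds per unit, which is easier to communicate than the raw log-odds value.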
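Logistic regression analysis and success rate (the table gives 92 of
100 correct, i.e. 0.92; the printed 0.8717949 equals 68/78, so it
appears to come from the Question 14 test set rather than this one):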
## Actual
## Predicted 0 1 Sum
## 0 46 4 50
## 1 4 46 50
## Sum 50 50 100
## [1] 0.8717949
LDA analysis and success rate:
## Actual
## Predicted 0 1 Sum
## 0 50 12 62
## 1 0 38 38
## Sum 50 50 100
## [1] 0.88
QDA analysis and success rate:
## Actual
## Predicted 0 1 Sum
## 0 48 8 56
## 1 2 42 44
## Sum 50 50 100
## [1] 0.9
Naive Bayes analysis and success rate:
## Actual
## Predicted 0 1 Sum
## 0 45 15 60
## 1 5 35 40
## Sum 50 50 100
## [1] 0.8
KNN analysis and success rate for K=1:
## Actual
## Predicted 0 1 Sum
## 0 48 2 50
## 1 2 48 50
## Sum 50 50 100
## [1] 0.96
KNN analysis and success rate for K=11:
## Actual
## Predicted 0 1 Sum
## 0 49 5 54
## 1 1 45 46
## Sum 50 50 100
## [1] 0.94
KNN analysis and success rate for K=19:
## Actual
## Predicted 0 1 Sum
## 0 47 8 55
## 1 3 42 45
## Sum 50 50 100
## [1] 0.89
Findings: On this test set, KNN gave the best predictions. K=1 had the
highest accuracy (96%), but a single-neighbor fit is susceptible to high
variance, so that result should be read with caution. K=11 (94%) is
nearly as accurate and likely balances variance with bias better. The
GLM section was useful for narrowing the predictors down to the seven
statistically significant ones, but it was not the strongest in terms of
predictions. QDA (90%) slightly outperformed LDA (88%), which suggests
that the boundary between the two classes may not be linear.
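The ranking in the findings can be checked by recomputing each accuracy from the confusion matrices above. This is a small sketch with the counts typed in by hand, and the list names are illustrative. Computed this way, the logistic value is 0.92 (from its table) rather than the 0.8717949 printed under it.

```r
# Counts copied from the confusion matrices above, in the order
# c(pred0_act0, pred1_act0, pred0_act1, pred1_act1).
counts <- list(
  logistic    = c(46, 4, 4, 46),
  lda         = c(50, 0, 12, 38),
  qda         = c(48, 2, 8, 42),
  naive_bayes = c(45, 5, 15, 35),
  knn_k1      = c(48, 2, 2, 48),
  knn_k11     = c(49, 1, 5, 45),
  knn_k19     = c(47, 3, 8, 42)
)

# Accuracy = correct predictions (first and last entries) over the total.
accuracy <- sapply(counts, function(v) (v[1] + v[4]) / sum(v))
sort(accuracy, decreasing = TRUE)
```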