Question 13

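(a) The summary tables below show a clear relationship between Direction and Today, and much weaker ones with Lag1 and Lag2. The median of Today is -1.33 in Down weeks versus 1.25 in Up weeks (p < 0.001), while Lag1 and Lag2 differ only slightly between the two groups (p = 0.043 and p = 0.045). None of the other variables differs noticeably by Direction.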

##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
Characteristic   N = 1,089 [1]
Year             2,000.0 (1,995.0, 2,005.0)
Lag1             0.24 (-1.15, 1.41)
Lag2             0.24 (-1.15, 1.41)
Lag3             0.24 (-1.16, 1.41)
Lag4             0.24 (-1.16, 1.41)
Lag5             0.23 (-1.17, 1.41)
Volume           1.00 (0.33, 2.05)
Today            0.24 (-1.15, 1.41)
Direction
    Down         484 (44%)
    Up           605 (56%)
[1] Median (Q1, Q3); n (%)

Characteristic   Down (N = 484)               Up (N = 605)                 p-value
Year             2,000.0 (1,995.0, 2,005.0)   2,000.0 (1,995.0, 2,005.0)   0.5
Lag1             0.38 (-0.94, 1.59)           0.10 (-1.24, 1.31)           0.043
Lag2             0.15 (-1.31, 1.30)           0.30 (-1.00, 1.46)           0.045
Lag3             0.25 (-1.16, 1.41)           0.22 (-1.17, 1.42)           0.7
Lag4             0.22 (-1.16, 1.45)           0.24 (-1.16, 1.35)           0.6
Lag5             0.33 (-1.10, 1.51)           0.13 (-1.20, 1.34)           0.2
Volume           1.07 (0.34, 2.02)            0.93 (0.33, 2.09)            0.4
Today            -1.33 (-2.29, -0.59)         1.25 (0.63, 2.22)            <0.001

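(b) Lag2 is the only predictor that is statistically significant at the 5% level (p = 0.030). Its odds ratio of 1.06 means that each one-unit increase in Lag2 multiplies the odds of an Up week by about 1.06.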

Logistic Regression Odds Ratios

Characteristic   OR     95% CI       p-value
Lag1             0.96   0.91, 1.01   0.12
Lag2             1.06   1.01, 1.12   0.030
Lag3             0.98   0.93, 1.04   0.5
Lag4             0.97   0.92, 1.02   0.3
Lag5             0.99   0.94, 1.04   0.6
Volume           0.98   0.91, 1.05   0.5
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

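(c) The confusion matrix shows that the glm model predictions are about 56% accurate (611 of 1,089 weeks). However, the model did much better at predicting Ups than Downs: it classified 557 of 605 Up weeks correctly (92%) but only 54 of 484 Down weeks (11%). The reason is that it predicts Up for 987 of the 1,089 weeks. The predictors carry little signal, so most fitted probabilities stay close to the overall share of Up weeks (56%), which is above the 0.5 cutoff. The class imbalance therefore pushes the predictions toward Up, the majority class. If, as the totals suggest, the model was evaluated on the same 1,089 weeks it was fitted on, the 56% figure is also likely to be optimistic.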

##          Actual
## Predicted Down   Up  Sum
##      Down   54   48  102
##      Up    430  557  987
##      Sum   484  605 1089
## [1] 0.5610652
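
The imbalance between Up and Down predictions can be read directly off the confusion matrix in (c). A minimal sketch in base R, with the counts typed in from the table (the object name `cm` and the derived variable names are only illustrative):

```r
# Confusion matrix from part (c): rows = predicted, columns = actual.
# matrix() fills column by column, so the first two values are the
# "Actual Down" column and the last two are the "Actual Up" column.
cm <- matrix(c(54, 430, 48, 557), nrow = 2,
             dimnames = list(Predicted = c("Down", "Up"),
                             Actual    = c("Down", "Up")))

accuracy      <- sum(diag(cm)) / sum(cm)                 # share of all weeks classified correctly
recall_up     <- cm["Up", "Up"] / sum(cm[, "Up"])        # correct among actual Up weeks
recall_down   <- cm["Down", "Down"] / sum(cm[, "Down"])  # correct among actual Down weeks
share_pred_up <- sum(cm["Up", ]) / sum(cm)               # how often the model predicts "Up"

round(c(accuracy = accuracy, recall_up = recall_up,
        recall_down = recall_down, share_pred_up = share_pred_up), 3)
```

Reporting the per-class rates next to overall accuracy makes it obvious when a model is mostly predicting the majority class.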

(d) See below for GLM table and success rate.

##          Actual
## Predicted Down  Up Sum
##      Down    9   5  14
##      Up     34  56  90
##      Sum    43  61 104
## [1] 0.625

(e) See below for LDA table and success rate.

##          Actual
## Predicted Down  Up Sum
##      Down    9   5  14
##      Up     34  56  90
##      Sum    43  61 104
## [1] 0.625

(f) See below for QDA table and success rate.

##          Actual
## Predicted Down  Up Sum
##       Up    43  61 104
##       Sum   43  61 104
## [1] 0.5865385
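
The table above has only an Up row because the model never predicted Down, so `table()` dropped that level. A small sketch of how to keep both rows by fixing the factor levels; the vectors are rebuilt from the counts in the table, not taken from the original fitted model:

```r
# Rebuild the part (f) outcome from its counts: 43 actual Down, 61 actual Up,
# and a classifier that predicted "Up" for all 104 test weeks
actual    <- factor(rep(c("Down", "Up"), times = c(43, 61)),
                    levels = c("Down", "Up"))
predicted <- rep("Up", length(actual))

# Forcing both levels keeps the empty "Down" row in the table
predicted <- factor(predicted, levels = c("Down", "Up"))
cm <- table(Predicted = predicted, Actual = actual)

addmargins(cm)
mean(predicted == actual)  # accuracy: 61 / 104
```

With the empty row visible, it is immediately clear that the accuracy equals the share of Up weeks in the test set.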

(g) See below for KNN table and success rate for when K=1.

##          Actual
## Predicted Down  Up Sum
##      Down   21  29  50
##      Up     22  32  54
##      Sum    43  61 104
## [1] 0.5096154

(h) See below for Naive Bayes table and success rate.

##          Actual
## Predicted Down  Up Sum
##       Up    43  61 104
##       Sum   43  61 104
## [1] 0.5865385

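(i) The logistic model and the LDA model performed best: both predicted 62.5% of the test results correctly. QDA and Naive Bayes reached 58.7% by predicting Up for every week, and KNN with K=1 reached 51.0%.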

(j) Using the “Today” variable leads the GLM model to produce 100% accurate results. This is not collinearity but leakage: Direction is determined by the sign of Today, so the two classes are perfectly separated, which is also why glm warns that the algorithm did not converge and that fitted probabilities of 0 or 1 occurred. Removing “Today” reduces the accuracy to only 51.0%. Including that same variable in the LDA model produced 98.1% accurate predictions, and removing it, as with GLM, reduced the accuracy to 51.0%. At this point it was clear that “Today” should not be included in any of the models, so I removed it for QDA and kept just Lag1, Lag2, Lag5, and Year (which had looked like the stronger predictors in terms of p-values). For QDA, the accuracy was only 48.1%.

Using these same predictors, KNN with K=1 reached only 45.2% accuracy. A common rule of thumb for KNN is to set K near the square root of the number of training observations. With about 985 training weeks (1,089 in total minus the 104 test weeks), that is roughly 31.4, so I used K=31. This still only led to an accuracy of 54.8%. Among the models in this part that exclude “Today”, KNN with K=31 did best, so I would select it from this group, although it remains below the 62.5% that logistic regression and LDA achieved in (d) and (e).

glmmodel2 <- glm(newdirection ~ Year + Lag1 + Lag2 + Lag5 + Today, data = train, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glmprobs2 <- predict(glmmodel2, newdata = test, type = "response")

glmpred_class2 <- ifelse(glmprobs2 > 0.5, "Up", "Down")

glmmatrix2 <- table(Predicted = glmpred_class2, Actual = test$Direction)

addmargins(glmmatrix2)
##          Actual
## Predicted Down  Up Sum
##      Down   43   0  43
##      Up      0  61  61
##      Sum    43  61 104
print(mean(glmpred_class2 == test$Direction))
## [1] 1
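
The perfect score and the two glm warnings above are what one would expect when a predictor determines the response exactly. The sketch below reproduces the effect on made-up data (the `today` and `up` vectors are synthetic, not the Weekly data): the response is defined as the sign of the predictor, just as Direction follows the sign of Today.

```r
# Synthetic returns with no values near zero, so the two classes are
# cleanly separated by the sign of `today`
today <- c(-seq(0.5, 3, length.out = 50), seq(0.5, 3, length.out = 50))
up    <- as.integer(today > 0)   # response derived from the predictor itself

# Collect any warnings instead of printing them
msgs <- character(0)
fit <- withCallingHandlers(
  glm(up ~ today, family = binomial),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

pred <- as.integer(predict(fit, type = "response") > 0.5)
acc  <- mean(pred == up)

acc    # share of observations classified correctly
msgs   # warnings raised by glm, if any (typically about convergence or 0/1 probabilities)
```

A perfect fit of this kind says nothing about predictive skill, because the predictor is only known once the outcome is known.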
glmmodel3 <- glm(newdirection ~ Year + Lag1 + Lag2 + Lag5, data = train, family = binomial)

glmprobs3 <- predict(glmmodel3, newdata = test, type = "response")

glmpred_class3 <- ifelse(glmprobs3 > 0.5, "Up", "Down")

glmmatrix3 <- table(Predicted = glmpred_class3, Actual = test$Direction)

addmargins(glmmatrix3)
##          Actual
## Predicted Down  Up Sum
##      Down   18  26  44
##      Up     25  35  60
##      Sum    43  61 104
print(mean(glmpred_class3 == test$Direction))
## [1] 0.5096154

LDA with Today included:

##          Actual
## Predicted Down  Up Sum
##      Down   41   0  41
##      Up      2  61  63
##      Sum    43  61 104
## [1] 0.9807692

LDA without Today:

##          Actual
## Predicted Down  Up Sum
##      Down   18  26  44
##      Up     25  35  60
##      Sum    43  61 104
## [1] 0.5096154

QDA without Today:

##          Actual
## Predicted Down  Up Sum
##      Down   17  28  45
##      Up     26  33  59
##      Sum    43  61 104
## [1] 0.4807692

KNN with K=1:

##          Actual
## Predicted Down  Up Sum
##      Down   27  41  68
##      Up     16  20  36
##      Sum    43  61 104
## [1] 0.4519231

KNN with K=31:

##          Actual
## Predicted Down  Up Sum
##      Down   25  29  54
##      Up     18  32  50
##      Sum    43  61 104
## [1] 0.5480769
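
The square-root rule of thumb for K mentioned in (j) is easy to compute. A minimal sketch, assuming the training set is the 1,089 weeks minus the 104 test weeks; the helper name `sqrt_k` is hypothetical, and nudging K to an odd number is a common convention for two-class problems rather than something the analysis above states:

```r
# Rule-of-thumb K: round sqrt(n), then make it odd to avoid tied votes
sqrt_k <- function(n_train) {
  k <- round(sqrt(n_train))
  if (k %% 2 == 0) k <- k + 1
  k
}

n_train <- 1089 - 104   # assumed training size
sqrt_k(n_train)
```

The heuristic only gives a starting point; comparing several values of K on held-out data is what actually decides.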

Question 14

(a) See below.

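library(ISLR2)  # assumption: Auto is loaded from ISLR2 (the ISLR package also provides it)

auto <- Auto

# mpg01 is 1 when mpg is above the median and 0 otherwise
median_mpg <- median(auto$mpg)

auto$mpg01 <- ifelse(auto$mpg > median_mpg, 1, 0)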

(b) Based on the plots alone, horsepower looks like perhaps the strongest predictor of mpg01 (before doing any formal modeling).

(c) See below.

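library(caret)  # provides createDataPartition()

set.seed(123)  # make the split reproducible
# Keep about 80% of the rows for training and hold out the rest for testing
train_index <- createDataPartition(auto$mpg01, p = 0.8, list = FALSE)

autotrain_data <- auto[train_index, ]
autotest_data  <- auto[-train_index, ]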

(d) See below for the LDA model. The test error is 14.1%.

##          Actual
## Predicted  0  1 Sum
##       0   28  0  28
##       1   11 39  50
##       Sum 39 39  78
## [1] 0.1410256

(e) See below for the QDA model. The test error was only 10.3% for this one!

##          Actual
## Predicted  0  1 Sum
##       0   33  2  35
##       1    6 37  43
##       Sum 39 39  78
## [1] 0.1025641

(f) See below for the GLM analysis. The test error is about 12.8%.

##          Actual
## Predicted  0  1 Sum
##       0   31  2  33
##       1    8 37  45
##       Sum 39 39  78
## [1] 0.1282051

(g) See below for the Naive Bayes analysis. The test error for this model was 16.7%.

##          Actual
## Predicted  0  1 Sum
##       0   28  2  30
##       1   11 37  48
##       Sum 39 39  78
## [1] 0.1666667

(h) See below for the KNN analysis. The test error is 12.8% when K=1, 14.1% when K=5, 10.3% when K=15, and 11.5% for both K=17 and K=21. Therefore, the K=15 model worked the best, matching the 10.3% test error of QDA.

KNN with K=1:

##          Actual
## Predicted  0  1 Sum
##       0   31  2  33
##       1    8 37  45
##       Sum 39 39  78
## [1] 0.1282051

KNN with K=5:

##          Actual
## Predicted  0  1 Sum
##       0   31  3  34
##       1    8 36  44
##       Sum 39 39  78
## [1] 0.1410256

KNN with K=15:

##          Actual
## Predicted  0  1 Sum
##       0   32  1  33
##       1    7 38  45
##       Sum 39 39  78
## [1] 0.1025641

KNN with K=17:

##          Actual
## Predicted  0  1 Sum
##       0   32  2  34
##       1    7 37  44
##       Sum 39 39  78
## [1] 0.1153846

KNN with K=21:

##          Actual
## Predicted  0  1 Sum
##       0   32  2  34
##       1    7 37  44
##       Sum 39 39  78
## [1] 0.1153846
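
A comparison like the one in (h) can be run in a single loop. The sketch below uses simulated data rather than Auto, and the names `x`, `y`, and `ks` are illustrative. It relies on `knn()` from the class package and standardizes the predictors with the training means and standard deviations, since KNN is distance-based and would otherwise be dominated by the variable with the largest scale.

```r
library(class)  # knn()

set.seed(1)
n <- 200
# Two predictors on very different scales
x <- cbind(x1 = rnorm(n, sd = 10), x2 = rnorm(n))
y <- factor(ifelse(x[, "x1"] / 10 + x[, "x2"] + rnorm(n, sd = 0.5) > 0, 1, 0))

train_idx <- 1:150

# Standardize with training statistics only, then apply them to the test rows
x_train <- scale(x[train_idx, ])
x_test  <- scale(x[-train_idx, ],
                 center = attr(x_train, "scaled:center"),
                 scale  = attr(x_train, "scaled:scale"))

# Test error for each candidate K
ks <- c(1, 5, 15, 17, 21)
test_error <- sapply(ks, function(k) {
  pred <- knn(x_train, x_test, cl = y[train_idx], k = k)
  mean(pred != y[-train_idx])
})
names(test_error) <- paste0("K=", ks)

test_error
names(which.min(test_error))  # K with the lowest test error on this simulated split
```

Because differences of one or two test observations can change the ranking, the choice of K is more reliable when averaged over several splits.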

Question 16

Data exploration and basic modeling:

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Logistic Model with All Available Predictors

Characteristic   log(OR)   95% CI          p-value
zn               -0.09     -0.16, -0.03    0.011
indus            -0.05     -0.14, 0.03     0.2
chas             0.62      -0.79, 2.1      0.4
nox              48        34, 63          <0.001
rm               -0.27     -1.6, 1.1       0.7
age              0.02      0.00, 0.05      0.076
dis              0.67      0.27, 1.1       0.002
rad              0.67      0.39, 0.99      <0.001
tax              -0.01     -0.01, 0.00     0.019
ptratio          0.33      0.10, 0.56      0.005
lstat            0.05      -0.04, 0.15     0.3
medv             0.15      0.03, 0.28      0.021
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Logistic Model with Only Statistically Significant Predictors

Characteristic   log(OR)   95% CI          p-value
zn               -0.08     -0.14, -0.02    0.010
nox              44        33, 57          <0.001
dis              0.47      0.11, 0.86      0.014
rad              0.70      0.44, 0.98      <0.001
tax              -0.01     -0.01, 0.00     0.003
ptratio          0.25      0.05, 0.46      0.015
medv             0.08      0.02, 0.14      0.011
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
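
The coefficients in these tables are on the log-odds scale, so exponentiating them gives odds ratios. A short sketch using a few values copied from the second table; rescaling nox to a 0.01-unit change is my own choice for readability, on the assumption that nox varies over a range much narrower than one unit:

```r
# log(OR) estimates copied from the reduced model table
log_or <- c(zn = -0.08, nox = 44, dis = 0.47, rad = 0.70)

# Odds ratio for a one-unit increase in each predictor
odds_ratio <- exp(log_or)

# A one-unit change in nox is very large under the assumption above,
# so also report the odds ratio for a 0.01-unit increase
or_nox_small <- exp(log_or["nox"] * 0.01)

round(odds_ratio[c("zn", "dis", "rad")], 2)
round(unname(or_nox_small), 2)
```

Read this way, the very large nox coefficient reflects the scale of the variable rather than an implausibly strong effect.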

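Logistic regression analysis and success rate. Note that the confusion matrix gives (46 + 46) / 100 = 0.92, whereas the printed 0.8717949 equals 68/78, the logistic success rate from Question 14, so the printed value was probably computed from the wrong objects: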

##          Actual
## Predicted   0   1 Sum
##       0    46   4  50
##       1     4  46  50
##       Sum  50  50 100
## [1] 0.8717949

LDA analysis and success rate:

##          Actual
## Predicted   0   1 Sum
##       0    50  12  62
##       1     0  38  38
##       Sum  50  50 100
## [1] 0.88

QDA analysis and success rate:

##          Actual
## Predicted   0   1 Sum
##       0    48   8  56
##       1     2  42  44
##       Sum  50  50 100
## [1] 0.9

Naive Bayes analysis and success rate:

##          Actual
## Predicted   0   1 Sum
##       0    45  15  60
##       1     5  35  40
##       Sum  50  50 100
## [1] 0.8

KNN analysis and success rate for K=1:

##          Actual
## Predicted   0   1 Sum
##       0    48   2  50
##       1     2  48  50
##       Sum  50  50 100
## [1] 0.96

KNN analysis and success rate for K=11:

##          Actual
## Predicted   0   1 Sum
##       0    49   5  54
##       1     1  45  46
##       Sum  50  50 100
## [1] 0.94

KNN analysis and success rate for K=19:

##          Actual
## Predicted   0   1 Sum
##       0    47   8  55
##       1     3  42  45
##       Sum  50  50 100
## [1] 0.89

Findings: The analysis above suggests that KNN is likely the best method for this dataset. The GLM section was useful for narrowing down the best predictors of the crime-rate indicator, but its predictions (92 of 100 correct according to its confusion matrix) still trailed KNN. QDA (90%) slightly outperformed LDA (88%), which hints that the class boundary may be non-linear, although a two-point gap on 100 test observations is weak evidence. KNN with K=1 predicted best (96%), but K=1 is prone to high variance, so that result should be treated with caution. K=11 is nearly as accurate (94%) and likely strikes a better balance between variance and bias.
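
The ranking in the findings can be checked by recomputing accuracy from each confusion matrix. A minimal sketch with the diagonal counts typed in from the Question 16 tables (each test set has 100 observations); the vector names are illustrative:

```r
# Correct predictions = the two diagonal cells of each confusion matrix
correct <- c(
  Logistic   = 46 + 46,
  LDA        = 50 + 38,
  QDA        = 48 + 42,
  NaiveBayes = 45 + 35,
  KNN_K1     = 48 + 48,
  KNN_K11    = 49 + 45,
  KNN_K19    = 47 + 42
)
n_test <- 100

# Accuracy per model, best first
accuracy <- sort(correct / n_test, decreasing = TRUE)
accuracy
```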