This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
With the scatterplot matrix, it is a little difficult to tell whether there are any patterns, so to further explore relationships between the variables I will also use a correlation matrix.
str(Weekly)
## 'data.frame': 1089 obs. of 9 variables:
## $ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
## $ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
## $ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
## $ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
## $ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
## $ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
## $ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
pairs(Weekly)
Looking at the correlation matrix, there appears to be a strong relationship between Volume and Year, while all other correlations are close to 0.
cor(Weekly[,-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
By looking at the Volume plot below, we see that Volume is increasing over time. In other words, the average number of daily shares traded increased (almost exponentially) from 1990 to 2010.
plot(Volume)
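As a side note, a labeled version of this plot (a quick sketch, assuming the Weekly data frame is attached as above) makes the upward trend easier to read:
# Relabeled plot of weekly trading volume over time
plot(Volume, type = "l", xlab = "Week index (1990-2010)",
     ylab = "Volume (avg. daily shares traded, billions)",
     main = "Weekly Trading Volume")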
(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
The only variable that is statistically significant with Direction is Lag2, with a p-value of 0.0296. With a p-value < 0.05, there is evidence of a possible real association between Lag2 and Direction. The positive coefficient of Lag2 suggests that if the market had a positive return 2 weeks prior, then it is more likely to go up this week.
All other predictors have relatively large p-values, which indicates that there is no clear evidence of a real association between Lag1, Lag3, Lag4, Lag5, or Volume and Direction.
Weekly.glm = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data = Weekly, family = binomial)
summary(Weekly.glm)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
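To make the interpretation of the Lag2 coefficient a bit more concrete, the estimate can be exponentiated to express it as an odds ratio; a small sketch using the fitted Weekly.glm object above:
# Odds ratio for Lag2: a one-unit increase in Lag2 multiplies the
# estimated odds of an "Up" week by exp(0.05844), roughly 1.06
exp(coef(Weekly.glm)["Lag2"])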
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
The confusion matrix below shows that the logistic regression fit on the whole dataset does not predict the outcome variable very well, though it is a little better than random guessing. Since these results were computed on the same data used to fit the model rather than on a held-out test set, the error rate is likely optimistic.
For this logistic regression run, our model correctly predicts that the market would go up in 557 weeks and go down in 54 weeks - in total there are 611 correct predictions. The percentages below are the Accuracy Rate, Misclassification Rate, Sensitivity (true positive rate), and Specificity (true negative rate).
Accuracy Rate - (54+557)/1089 = 56.1%
Misclassification Rate - (48+430)/1089 = 43.9%
Sensitivity - 557/605 = 92%
Specificity - 54/(54+430) = 11%
Weekly.probs = predict(Weekly.glm, type = 'response')
Weekly.pred = rep("Down", 1089)
Weekly.pred[Weekly.probs > 0.5] = "Up"
table(Weekly.pred, Direction)
## Direction
## Weekly.pred Down Up
## Down 54 48
## Up 430 557
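The rates listed above can also be computed directly from the confusion matrix rather than by hand; a minimal sketch using the table just produced (Direction comes from the attached Weekly data):
# Accuracy, sensitivity, and specificity from the confusion matrix
conf = table(Weekly.pred, Direction)
accuracy    = sum(diag(conf)) / sum(conf)                 # (54 + 557) / 1089
sensitivity = conf["Up", "Up"] / sum(conf[, "Up"])        # 557 / 605
specificity = conf["Down", "Down"] / sum(conf[, "Down"])  # 54 / 484
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)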
(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
train = (Year <= 2008)
head(train)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
Weekly.2008 = Weekly[!train,]
dim(Weekly)
## [1] 1089 9
dim(Weekly.2008)
## [1] 104 9
Direction.2008 = Direction[!train]
length(Direction.2008)
## [1] 104
Weekly.glm2 = glm(Direction~Lag2, data = Weekly, subset = train, family = "binomial")
summary(Weekly.glm2)
##
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = Weekly,
## subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.536 -1.264 1.021 1.091 1.368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
Weekly.probs2 = predict(Weekly.glm2, Weekly.2008, type = 'response')
Weekly.pred2 = rep("Down", 104)
Weekly.pred2[Weekly.probs2 > 0.5] = "Up"
table(Weekly.pred2, Direction.2008)
## Direction.2008
## Weekly.pred2 Down Up
## Down 9 5
## Up 34 56
The second iteration of the logistic regression was trained on a subset of the data for years 1990 to 2008 and then tested on the data from 2009 and 2010. When computing the predictions for 2009 and 2010, we can see that the model correctly predicts that the market would go up in 56 weeks and go down in 9 weeks - in total, there are 65 correct predictions. The percentages below are the Accuracy Rate, Misclassification Rate, Sensitivity (true positive rate), and Specificity (true negative rate) for this model.
Accuracy Rate - (56+9)/104 = 62.5%
Misclassification Rate - (34+5)/104 = 37.5%
Sensitivity - 56/61 = 92%
Specificity - 9/(9+34) = 21%
Based on the above calculations, the model is actually performing pretty well and is better than random guessing.
The confusion matrix shows that in weeks when the logistic regression predicts an increase in the market, it is correct 56 out of 90 times (about 62%). This suggests a possible trading strategy of buying in weeks when the model predicts an increasing market and avoiding trades in weeks when a decrease is predicted.
(e) Repeat (d) using LDA.
library(MASS)
Weekly.lda = lda(Direction~Lag2, data = Weekly, subset = train)
Weekly.lda
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
The LDA output indicates that \(\hat{\pi}_1\) = 0.448 and \(\hat{\pi}_2\) = 0.552; in other words, 44.8% of the training observations correspond to weeks in which the market went down and 55.2% correspond to weeks in which the market went up. The LDA output also provides the group mean of each predictor within each class, which LDA uses as the estimate of \(\mu_k\). The group means suggest a tendency for the return from 2 weeks prior (Lag2) to be positive in weeks when the market increases and negative in weeks when the market decreases.
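As a quick check, the priors and group means reported by lda() can be reproduced from the training data itself; a sketch, assuming Weekly is attached and using the train vector defined above:
# LDA's priors are the training class proportions; the group means
# are the class-wise averages of Lag2 on the training data
prop.table(table(Direction[train]))          # ~0.448 Down, ~0.552 Up
tapply(Lag2[train], Direction[train], mean)  # ~ -0.036 for Down, ~0.260 for Up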
plot(Weekly.lda)
lda.pred = predict(Weekly.lda, type = "response", newdata = Weekly.2008)
table(lda.pred$class, Direction.2008)
## Direction.2008
## Down Up
## Down 9 5
## Up 34 56
mean(lda.pred$class==Direction.2008)
## [1] 0.625
mean(lda.pred$class!=Direction.2008)
## [1] 0.375
The confusion matrix created for the LDA model correctly predicts that the market would go up in 56 weeks and go down in 9 weeks - in total the LDA model correctly predicts the market movement in 65 weeks. The mean() function was used to compute the fraction of weeks for which the prediction was and was not correct. In this case, the LDA model correctly predicted the movement of the market 62.5% of the time and incorrectly predicted it 37.5% of the time. The percentages below are the Accuracy Rate, Misclassification Rate, Sensitivity (true positive rate), and Specificity (true negative rate) for this model.
Accuracy Rate - (56+9)/104 = 62.5%
Misclassification Rate - (34+5)/104 = 37.5%
Sensitivity - 56/61 = 92%
Specificity - 9/(9+34) = 21%
Comparing these results with the logistic regression from (d), the two models produce identical confusion matrices and therefore perform the same on the held-out data.
(f) Repeat (d) using QDA.
The QDA output is very similar to the LDA output: it contains the prior probabilities and the group means. It does not, however, contain coefficients of linear discriminants, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors.
The QDA output indicates that \(\hat{\pi}_1\) = 0.448 and \(\hat{\pi}_2\) = 0.552; in other words, 44.8% of the training observations correspond to weeks in which the market went down and 55.2% correspond to weeks in which the market went up. The QDA output also provides the group mean of each predictor within each class, which QDA uses as the estimate of \(\mu_k\). The group means suggest a tendency for the return from 2 weeks prior (Lag2) to be positive in weeks when the market increases and negative in weeks when the market decreases.
Weekly.qda = qda(Direction~Lag2, data = Weekly, subset = train)
Weekly.qda
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
The predict function works in exactly the same fashion as for LDA except it does not return the linear discriminant values.
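This can be verified by comparing the components returned by predict() for the two fits; a short sketch using the Weekly.lda and Weekly.qda objects above:
# predict() on an lda fit returns class, posterior, and the discriminant scores x;
# on a qda fit it returns only class and posterior
names(predict(Weekly.lda, Weekly.2008))  # "class" "posterior" "x"
names(predict(Weekly.qda, Weekly.2008))  # "class" "posterior"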
qda.pred = predict(Weekly.qda, type = "response", newdata = Weekly.2008)
table(qda.pred$class, Direction.2008)
## Direction.2008
## Down Up
## Down 0 0
## Up 43 61
mean(qda.pred$class==Direction.2008)
## [1] 0.5865385
mean(qda.pred$class!=Direction.2008)
## [1] 0.4134615
The confusion matrix created for the QDA model correctly predicts that the market would go up in 61 weeks and go down in 0 weeks - in total the QDA model correctly predicts the market movement in 61 weeks; it predicts Up in every week of the test period. The mean() function was used to compute the fraction of weeks for which the prediction was and was not correct. In this case, the QDA model correctly predicted the movement of the market 58.7% of the time and incorrectly predicted it 41.3% of the time. The percentages below are the Accuracy Rate, Misclassification Rate, Sensitivity (true positive rate), and Specificity (true negative rate) for this model.
Accuracy Rate - (61+0)/104 = 58.7%
Misclassification Rate - (43+0)/104 = 41.3%
Sensitivity - 61/61 = 100%
Specificity - 0/(0+43) = 0%
Reviewing the output for the LDA, logistic regression, and QDA models, we can see that the QDA model does not perform better than the LDA or logistic regression models in terms of accuracy; however, by always predicting Up it does maximize sensitivity (the true positive rate).
(g) Repeat (d) using KNN with K = 1.
library(class)
train.x = as.matrix(Lag2[train])
test.x = as.matrix(Lag2[!train])
train.Direction = Direction[train]
set.seed(1)
knn.pred = knn(train.x, test.x, train.Direction, k = 1)
table(knn.pred, Direction.2008)
## Direction.2008
## knn.pred Down Up
## Down 21 30
## Up 22 31
mean(knn.pred==Direction.2008)
## [1] 0.5
mean(knn.pred!=Direction.2008)
## [1] 0.5
The KNN model with K = 1 does not perform very well - it is essentially random guessing, correctly predicting 50% of the observations. The confusion matrix shows that the KNN model correctly predicted that the market would go up in 31 weeks and go down in 21 weeks, for a total of 52 correct predictions. The mean() function was used to compute the fraction of weeks for which the prediction was correct; in this case, the KNN model correctly predicted the movement of the market 50% of the time.
(h) Which of these methods appears to provide the best results on this data?
If we are solely concerned with the accuracy rate of the model, the logistic regression or LDA model would be best. If we are primarily concerned with maximizing sensitivity, even at the cost of specificity, the QDA model may be the best method. The preferred method depends on the goal of the prediction.
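To make the comparison in one place, the held-out accuracies of the four methods can be tabulated from the prediction objects created above (a small sketch; the numbers simply restate the results already reported):
# Test accuracy of each method on the 2009-2010 hold-out data
data.frame(
  method   = c("Logistic Regression", "LDA", "QDA", "KNN (K = 1)"),
  accuracy = c(mean(Weekly.pred2 == Direction.2008),
               mean(lda.pred$class == Direction.2008),
               mean(qda.pred$class == Direction.2008),
               mean(knn.pred == Direction.2008)))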
(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
Weekly.glm3<-glm(Direction~Lag1+Lag2+(Lag1*Lag2), data=Weekly, family="binomial", subset=train)
summary(Weekly.glm3)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + (Lag1 * Lag2), family = "binomial",
## data = Weekly, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.573 -1.259 1.003 1.086 1.596
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.211419 0.064589 3.273 0.00106 **
## Lag1 -0.051505 0.030727 -1.676 0.09370 .
## Lag2 0.053471 0.029193 1.832 0.06700 .
## Lag1:Lag2 0.001921 0.007460 0.257 0.79680
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1346.9 on 981 degrees of freedom
## AIC: 1354.9
##
## Number of Fisher Scoring iterations: 4
Weekly.probs3 = predict(Weekly.glm3, type="response", newdata=Weekly.2008)
Weekly.pred3 = rep("Down", 104)
Weekly.pred3[Weekly.probs3>0.5]="Up"
table(Weekly.pred3, Direction.2008)
## Direction.2008
## Weekly.pred3 Down Up
## Down 7 8
## Up 36 53
mean(Weekly.pred3==Direction.2008)
## [1] 0.5769231
The logistic regression including Lag1, Lag2, and the interaction between Lag1 and Lag2 reaches 57.7% test accuracy, which is slightly worse than the Lag2-only model from (d) (62.5%).
Weekly.lda2 = lda(Direction~Lag1+Lag2+(Lag1*Lag2), data = Weekly, subset = train)
lda.pred2 = predict(Weekly.lda2, type = "response", newdata = Weekly.2008)
table(lda.pred2$class, Direction.2008)
## Direction.2008
## Down Up
## Down 7 8
## Up 36 53
mean(lda.pred2$class==Direction.2008)
## [1] 0.5769231
The LDA with Lag1, Lag2, and their interaction produces the same confusion matrix and the same 57.7% test accuracy as the corresponding logistic regression.
Weekly.qda2 = qda(Direction~Lag1+Lag2+(Lag1*Lag2), data=Weekly, subset=train)
qda.pred2 = predict(Weekly.qda2, type="response", newdata=Weekly.2008)
table(qda.pred2$class, Direction.2008)
## Direction.2008
## Down Up
## Down 23 36
## Up 20 25
mean(qda.pred2$class==Direction.2008)
## [1] 0.4615385
The QDA with the interaction effect performs worse than random guessing, with 46.2% test accuracy.
Xtrain2=cbind(Weekly$Lag1, Weekly$Lag2, (Weekly$Lag1 * Weekly$Lag2))[train,]
Xtest2=cbind(Weekly$Lag1, Weekly$Lag2, (Weekly$Lag1 * Weekly$Lag2))[!train,]
train.Direction2=Weekly$Direction[train]
test.Direction2=Weekly$Direction[!train]
set.seed(1)
knn.pred2<-knn(data.frame(Xtrain2), data.frame(Xtest2), train.Direction2, k=3)
table(knn.pred2, Direction.2008)
## Direction.2008
## knn.pred2 Down Up
## Down 23 33
## Up 20 28
mean(knn.pred2==Direction.2008)
## [1] 0.4903846
The KNN with K = 3 (49.0% accuracy) does a little better than the QDA model with the interaction effect, but is not as good as the logistic regression or LDA models.
detach(Weekly)
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
library(ISLR)
attach(Auto)
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median()function. Note you may find it helpful to use the data.frame()function to create a single data set containing both mpg01 and the other Auto variables.
mpg01 = as.numeric(ifelse(mpg>median(mpg), "1", "0"))
Auto = data.frame(Auto, mpg01)
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name mpg01
## 1 chevrolet chevelle malibu 0
## 2 buick skylark 320 0
## 3 plymouth satellite 0
## 4 amc rebel sst 0
## 5 ford torino 0
## 6 ford galaxie 500 0
(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
cor(Auto[,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566
## acceleration year origin mpg01
## mpg 0.4233285 0.5805410 0.5652088 0.8369392
## cylinders -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration 1.0000000 0.2903161 0.2127458 0.3468215
## year 0.2903161 1.0000000 0.1815277 0.4299042
## origin 0.2127458 0.1815277 1.0000000 0.5136984
## mpg01 0.3468215 0.4299042 0.5136984 1.0000000
pairs(Auto)
par(mfrow=c(2,3))
boxplot(cylinders~mpg01, data=Auto, main="Cylinders vs MPG01")
boxplot(displacement~mpg01, data=Auto, main="Displacement vs MPG01")
boxplot(horsepower~mpg01, data=Auto, main="Horsepower vs MPG01")
boxplot(weight~mpg01, data=Auto, main="Weight vs MPG01")
boxplot(acceleration~mpg01, data=Auto, main="Acceleration vs MPG01")
boxplot(year~mpg01, data=Auto, main="Year vs MPG01")
The graphs indicate that there is some association between mpg01 and cylinders, weight, horsepower, and displacement. These four variables show the clearest separation between mpg01 = 0 and mpg01 = 1. When mpg01 equals 0, the car has low gas mileage (below the median mpg); when mpg01 equals 1, the car has high gas mileage (above the median mpg).
(c) Split the data into a training set and a test set.
set.seed(343)
autotraining = sample(nrow(Auto), 0.75*nrow(Auto))
autotesting = -autotraining
Autotrain = Auto[autotraining, ]
Autotest = Auto[autotesting, ]
mpg01test = mpg01[autotesting]
dim(Autotrain)
## [1] 294 10
dim(Autotest)
## [1] 98 10
length(mpg01test)
## [1] 98
(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
The LDA output indicates that \(\hat{\pi}_1\) = 0.5170 and \(\hat{\pi}_2\) = 0.4830; in other words, 51.70% of the training observations correspond to automobiles with an mpg value below the median and 48.30% correspond to automobiles with an mpg value above the median.
auto.lda = lda(mpg01~cylinders+weight+horsepower+displacement, data = Autotrain)
auto.lda
## Call:
## lda(mpg01 ~ cylinders + weight + horsepower + displacement, data = Autotrain)
##
## Prior probabilities of groups:
## 0 1
## 0.5170068 0.4829932
##
## Group means:
## cylinders weight horsepower displacement
## 0 6.671053 3583.382 128.63158 269.5066
## 1 4.183099 2354.352 79.71127 117.1725
##
## Coefficients of linear discriminants:
## LD1
## cylinders -0.397456222
## weight -0.001016632
## horsepower 0.005679922
## displacement -0.002107400
plot(auto.lda)
auto.pred = predict(auto.lda, type = "response", newdata = Autotest)
table(auto.pred$class, mpg01test)
## mpg01test
## 0 1
## 0 41 2
## 1 3 52
mean(auto.pred$class == mpg01test)
## [1] 0.9489796
mean(auto.pred$class != mpg01test)
## [1] 0.05102041
The confusion matrix created for the LDA model correctly predicts that an automobile's mpg value is above the median 52 times and below the median 41 times - in total the LDA model correctly predicts whether an automobile is above or below the median 93 times. The mean() function was used to compute the fraction of automobiles for which the prediction was and was not correct. In this case, the LDA model correctly predicted whether an automobile's mpg value is above or below the median 94.90% of the time and incorrectly predicted it 5.10% of the time. The percentages below are the Accuracy Rate, Misclassification Rate, Sensitivity (true positive rate), and Specificity (true negative rate) for this model.
Accuracy Rate - (41+52)/98 = 94.90%
Misclassification Rate - (2+3)/98 = 5.10%
Sensitivity - 52/54 = 96%
Specificity - 41/(41+3) = 93%
Per the question, the test error rate for the LDA model is 5.10%.
(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
The QDA output indicates that \(\hat{\pi}_1\) = 0.5170 and \(\hat{\pi}_2\) = 0.4830; in other words, 51.70% of the training observations correspond to automobiles with an mpg value below the median and 48.30% correspond to automobiles with an mpg value above the median. The output is very similar to the LDA output, just without the linear discriminant coefficients.
auto.qda = qda(mpg01~cylinders+weight+horsepower+displacement, data = Autotrain)
auto.qda
## Call:
## qda(mpg01 ~ cylinders + weight + horsepower + displacement, data = Autotrain)
##
## Prior probabilities of groups:
## 0 1
## 0.5170068 0.4829932
##
## Group means:
## cylinders weight horsepower displacement
## 0 6.671053 3583.382 128.63158 269.5066
## 1 4.183099 2354.352 79.71127 117.1725
auto.pred2 = predict(auto.qda, type = "response", newdata = Autotest)
table(auto.pred2$class, mpg01test)
## mpg01test
## 0 1
## 0 43 5
## 1 1 49
mean(auto.pred2$class == mpg01test)
## [1] 0.9387755
mean(auto.pred2$class != mpg01test)
## [1] 0.06122449
The confusion matrix created for the QDA model correctly predicts that an automobile's mpg value is above the median 49 times and below the median 43 times - in total the QDA model correctly predicts whether an automobile is above or below the median 92 times. The mean() function was used to compute the fraction of automobiles for which the prediction was and was not correct. In this case, the QDA model correctly predicted whether an automobile's mpg value is above or below the median 93.88% of the time and incorrectly predicted it 6.12% of the time. The percentages below are the Accuracy Rate, Misclassification Rate, Sensitivity (true positive rate), and Specificity (true negative rate) for this model.
Accuracy Rate - (43+49)/98 = 93.9%
Misclassification Rate - (1+5)/98 = 6.1%
Sensitivity - 49/54 = 91%
Specificity - 43/(43+1) = 98%
Per the question, the test error rate for the QDA model is 6.12% - which is close but higher than the LDA model.
(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
While the plots suggest that cylinders, weight, horsepower, and displacement are all associated with mpg01, the z-statistics in the logistic regression below do not confirm this for every predictor: only weight and horsepower have statistically significant coefficients (p < 0.05).
auto.glm = glm(mpg01~cylinders+weight+horsepower+displacement, data = Autotrain, family = "binomial")
summary(auto.glm)
##
## Call:
## glm(formula = mpg01 ~ cylinders + weight + horsepower + displacement,
## family = "binomial", data = Autotrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3201 -0.3142 -0.0074 0.4163 3.2431
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 10.5825149 1.8281409 5.789 7.09e-09 ***
## cylinders -0.0249888 0.4034757 -0.062 0.9506
## weight -0.0018611 0.0007296 -2.551 0.0107 *
## horsepower -0.0347333 0.0152534 -2.277 0.0228 *
## displacement -0.0115324 0.0092956 -1.241 0.2147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 407.23 on 293 degrees of freedom
## Residual deviance: 175.16 on 289 degrees of freedom
## AIC: 185.16
##
## Number of Fisher Scoring iterations: 7
auto.probs = predict(auto.glm, type = 'response', newdata = Autotest)
auto.pred3 = rep(0, 98)
auto.pred3[auto.probs > 0.5] = 1
table(auto.pred3, mpg01test)
## mpg01test
## auto.pred3 0 1
## 0 42 3
## 1 2 51
mean(auto.pred3 == mpg01test)
## [1] 0.9489796
mean(auto.pred3 != mpg01test)
## [1] 0.05102041
The confusion matrix created for the logistic regression correctly predicts that an automobile's mpg value is above the median 51 times and below the median 42 times - in total the logistic regression model correctly predicts whether an automobile is above or below the median 93 times. The mean() function was used to compute the fraction of automobiles for which the prediction was and was not correct. In this case, the logistic regression model correctly predicted whether an automobile's mpg value is above or below the median 94.90% of the time and incorrectly predicted it 5.10% of the time. The percentages below are the Accuracy Rate, Misclassification Rate, Sensitivity (true positive rate), and Specificity (true negative rate) for this model.
Accuracy Rate - (42+51)/98 = 94.90%
Misclassification Rate - (3+2)/98 = 5.10%
Sensitivity - 51/54 = 94%
Specificity - 42/(42+2) = 95%
Per the question, the test error rate for the Logistic Regression model is 5.10% which is similar to the LDA model. The key difference between the models can be found in the Sensitivity (true positive rate) and Specificity (true negative rate) rate.
(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
library(class)
auto.train = cbind(cylinders, weight, displacement, horsepower)[autotraining,]
auto.test = cbind(cylinders, weight, displacement, horsepower)[autotesting,]
train.mpg01 = mpg01[autotraining]
#KNN (k=1)
set.seed(343)
auto.knn = knn(auto.train, auto.test, train.mpg01, k = 1)
mean(auto.knn == mpg01test)
## [1] 0.9285714
mean(auto.knn != mpg01test)
## [1] 0.07142857
auto.knn2 = knn(auto.train, auto.test, train.mpg01, k = 10)
mean(auto.knn2 == mpg01test)
## [1] 0.9285714
mean(auto.knn2 != mpg01test)
## [1] 0.07142857
auto.knn3 = knn(auto.train, auto.test, train.mpg01, k = 100)
mean(auto.knn3 == mpg01test)
## [1] 0.9387755
mean(auto.knn3 != mpg01test)
## [1] 0.06122449
The test errors for the KNN models are as follows:
KNN with K = 1 - Test Error Rate: 7.14%
KNN with K = 10 - Test Error Rate: 7.14%
KNN with K = 100 - Test Error Rate: 6.12%
The KNN model with K=100 seems to perform the best for this data set as the Test Error Rate is the lowest.
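Rather than checking only K = 1, 10, and 100, the test error can also be traced over a grid of K values; a sketch under the same split and predictor set (the errors at intermediate K are not reported above, so they would need to be run to confirm):
# Test error over a grid of K values using the same training/test split
set.seed(343)
k.grid = c(1, 3, 5, 10, 25, 50, 100)
test.error = sapply(k.grid, function(k) {
  mean(knn(auto.train, auto.test, train.mpg01, k = k) != mpg01test)
})
data.frame(K = k.grid, test.error = test.error)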
detach(Auto)
Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
crim01 = as.numeric(ifelse(Boston$crim>median(Boston$crim), "1", "0"))
boston = data.frame(Boston, crim01)
head(boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv crim01
## 1 24.0 0
## 2 21.6 0
## 3 34.7 0
## 4 33.4 0
## 5 36.2 0
## 6 28.7 0
dim(boston)
## [1] 506 15
attach(boston)
## The following object is masked _by_ .GlobalEnv:
##
## crim01
cor(boston)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## crim01 0.40939545 -0.43615103 0.60326017 0.070096774 0.72323480
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## crim01 -0.15637178 0.61393992 -0.61634164 0.619786249 0.60874128 0.2535684
## black lstat medv crim01
## crim -0.38506394 0.4556215 -0.3883046 0.40939545
## zn 0.17552032 -0.4129946 0.3604453 -0.43615103
## indus -0.35697654 0.6037997 -0.4837252 0.60326017
## chas 0.04878848 -0.0539293 0.1752602 0.07009677
## nox -0.38005064 0.5908789 -0.4273208 0.72323480
## rm 0.12806864 -0.6138083 0.6953599 -0.15637178
## age -0.27353398 0.6023385 -0.3769546 0.61393992
## dis 0.29151167 -0.4969958 0.2499287 -0.61634164
## rad -0.44441282 0.4886763 -0.3816262 0.61978625
## tax -0.44180801 0.5439934 -0.4685359 0.60874128
## ptratio -0.17738330 0.3740443 -0.5077867 0.25356836
## black 1.00000000 -0.3660869 0.3334608 -0.35121093
## lstat -0.36608690 1.0000000 -0.7376627 0.45326273
## medv 0.33346082 -0.7376627 1.0000000 -0.26301673
## crim01 -0.35121093 0.4532627 -0.2630167 1.00000000
pairs(boston)
First, we split the data into training and test sets using a 75/25 split.
set.seed(343)
bos.training = sample(nrow(boston), 0.75*nrow(boston))
bos.testing = -bos.training
bostontrain = boston[bos.training, ]
bostontest = boston[bos.testing, ]
crim01test = crim01[bos.testing]
dim(bostontrain)
## [1] 379 15
dim(bostontest)
## [1] 127 15
length(crim01test)
## [1] 127
Logistic Regression with all Predictors
Next, we fit a logistic regression on the training data using all predictors except crim.
boston.glm = glm(crim01~. -crim, data = bostontrain, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(boston.glm)
##
## Call:
## glm(formula = crim01 ~ . - crim, family = "binomial", data = bostontrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0888 -0.1653 0.0000 0.0030 3.4141
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -26.518139 9.131172 -2.904 0.00368 **
## zn -0.088450 0.041911 -2.110 0.03482 *
## indus -0.122768 0.056017 -2.192 0.02841 *
## chas 1.529483 0.868505 1.761 0.07823 .
## nox 55.348443 9.422343 5.874 4.25e-09 ***
## rm -0.339865 0.867162 -0.392 0.69511
## age 0.017334 0.014016 1.237 0.21619
## dis 0.705396 0.268295 2.629 0.00856 **
## rad 0.556724 0.178992 3.110 0.00187 **
## tax -0.007124 0.003277 -2.174 0.02970 *
## ptratio 0.403384 0.146321 2.757 0.00584 **
## black -0.040742 0.016720 -2.437 0.01482 *
## lstat 0.049228 0.055767 0.883 0.37737
## medv 0.160242 0.082655 1.939 0.05254 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 525.38 on 378 degrees of freedom
## Residual deviance: 148.31 on 365 degrees of freedom
## AIC: 176.31
##
## Number of Fisher Scoring iterations: 9
boston.probs = predict(boston.glm, type = 'response', newdata = bostontest)
boston.pred = rep(0, 127)
boston.pred[boston.probs > 0.5] = 1
table(boston.pred, crim01test)
## crim01test
## boston.pred 0 1
## 0 56 3
## 1 9 59
mean(boston.pred == crim01test)
## [1] 0.9055118
mean(boston.pred != crim01test)
## [1] 0.09448819
The confusion matrix created for the logistic regression correctly predicts that the crime rate is above the median in 59 suburbs and below the median in 56 suburbs - in total the logistic regression model correctly predicts the crime rate category for 115 suburbs. In this case, the logistic regression model correctly predicted the crime rate category 90.6% of the time and incorrectly predicted it 9.4% of the time. The percentages below are the Accuracy Rate, Misclassification Rate, Sensitivity (true positive rate), and Specificity (true negative rate) for this model.
Accuracy Rate - (56+59)/127 = 90.6%
Misclassification Rate - (9+3)/127 = 9.4%
Sensitivity - 59/62 = 95%
Specificity - 56/(56+9) = 86%
After fitting the logistic regression model on all predictors, I have decided to move forward with only the variables that are significant at the 95% confidence level. Variables with p-values greater than 0.05 will not be included in the next iteration of the logistic regression.
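The subset used in the next model was chosen by reading the coefficient table above; the same selection can be sketched programmatically by keeping the predictors whose Wald-test p-values fall below 0.05:
# Predictors from the full model with p-values below 0.05 (drop the intercept)
coefs = summary(boston.glm)$coefficients
sig = rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05]
setdiff(sig, "(Intercept)")  # zn, indus, nox, dis, rad, tax, ptratio, black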
Logistic Regression with Subset of Predictors
boston.glm2 = glm(crim01~zn+indus+nox+dis+rad+tax+ptratio+black, data = bostontrain, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(boston.glm2)
##
## Call:
## glm(formula = crim01 ~ zn + indus + nox + dis + rad + tax + ptratio +
## black, family = "binomial", data = bostontrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.12396 -0.20733 0.00000 0.00298 3.00921
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.830267 7.256124 -1.768 0.07703 .
## zn -0.072756 0.035479 -2.051 0.04030 *
## indus -0.100537 0.050861 -1.977 0.04808 *
## nox 46.224462 8.098248 5.708 1.14e-08 ***
## dis 0.268721 0.211256 1.272 0.20337
## rad 0.658845 0.162986 4.042 5.29e-05 ***
## tax -0.009344 0.003107 -3.008 0.00263 **
## ptratio 0.167504 0.110080 1.522 0.12809
## black -0.039054 0.015669 -2.492 0.01269 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 525.38 on 378 degrees of freedom
## Residual deviance: 162.63 on 370 degrees of freedom
## AIC: 180.63
##
## Number of Fisher Scoring iterations: 9
boston.probs2 = predict(boston.glm2, type = 'response', newdata = bostontest)
boston.pred2 = rep(0, 127)
boston.pred2[boston.probs2 > 0.5] = 1
table(boston.pred2, crim01test)
## crim01test
## boston.pred2 0 1
## 0 56 2
## 1 9 60
mean(boston.pred2 == crim01test)
## [1] 0.9133858
mean(boston.pred2 != crim01test)
## [1] 0.08661417
The confusion matrix created for the second iteration of the logistic regression correctly predicts that the crime rate is above the median in 60 suburbs and below the median in 56 suburbs - in total the logistic regression model correctly predicts the crime rate category for 116 suburbs. In this case, the logistic regression model correctly predicted the crime rate category 91.33% of the time and incorrectly predicted it 8.66% of the time. The percentages below are the Accuracy Rate, Misclassification Rate, Sensitivity (true positive rate), and Specificity (true negative rate) for this model.
Accuracy Rate - (56+60)/127 = 91.33%
Misclassification Rate - (9+2)/127 = 8.66%
Sensitivity - 60/62 = 97%
Specificity - 56/(56+9) = 86%
In comparison to the previous logistic regression model, the logistic regression run on the subset of variables is only slightly better in prediction accuracy.
LDA Model
The LDA output indicates that \(\hat{\pi}_1\) = 0.496 and \(\hat{\pi}_2\) = 0.504; in other words, 49.6% of the training observations correspond to suburbs with crime rates below the median and 50.4% correspond to suburbs with crime rates above the median. The LDA output also provides the group mean of each predictor within each class, which LDA uses as the estimate of \(\mu_k\).
boston.lda = lda(crim01~zn+indus+nox+dis+rad+tax+ptratio+black, data = bostontrain)
boston.lda
## Call:
## lda(crim01 ~ zn + indus + nox + dis + rad + tax + ptratio + black,
## data = bostontrain)
##
## Prior probabilities of groups:
## 0 1
## 0.4960422 0.5039578
##
## Group means:
## zn indus nox dis rad tax ptratio black
## 0 20.712766 7.011755 0.4704867 5.091370 4.202128 306.7979 17.90213 389.9009
## 1 1.371728 15.095079 0.6393770 2.504081 14.518325 502.9215 18.96283 327.8565
##
## Coefficients of linear discriminants:
## LD1
## zn -0.002577380
## indus -0.004383629
## nox 8.673464785
## dis -0.116448668
## rad 0.079951292
## tax -0.001657112
## ptratio 0.037021741
## black -0.000705304
plot(boston.lda)
boston.pred3 = predict(boston.lda, type = "response", newdata = bostontest)
table(boston.pred3$class, crim01test)
## crim01test
## 0 1
## 0 62 13
## 1 3 49
mean(boston.pred3$class == crim01test)
## [1] 0.8740157
mean(boston.pred3$class != crim01test)
## [1] 0.1259843
The confusion matrix created for the LDA model correctly predicts that 62 suburbs have crime rates below the median and 49 suburbs have crime rates above the median - in total the LDA model correctly predicts the crime rate category for 111 suburbs. The mean() function was used to compute the fraction of suburbs for which the prediction was and was not correct. In this case, the LDA model correctly predicted the crime rate category 87.4% of the time and incorrectly predicted it 12.6% of the time. The percentages below are the Accuracy Rate, Misclassification Rate, Sensitivity (true positive rate), and Specificity (true negative rate) for this model.
Accuracy Rate - (62+49)/127 = 87.4%
Misclassification Rate - (3+13)/127 = 12.6%
Sensitivity - 49/(49+13) = 79%
Specificity - 62/(62+3) = 95%
In comparison to the two iterations of the logistic regression, the LDA underperforms on this data.
KNN Models
All three KNN models (K = 1, 10, and 100) run well on the data. The fits labeled K = 1 and K = 100 report identical accuracy and misclassification rates of 96.9% and 3.15%, respectively - but note that the call labeled K = 100 below actually passes k = 1, so the two fits are the same model (see the sketch after the code). The 96.9% accuracy figure therefore corresponds to K = 1; a true K = 100 fit would need to be rerun, keeping in mind that K values close to 1 tend to give more flexible but less stable models.
library(class)
boston.train = cbind(zn, indus, nox, dis, rad, tax, ptratio, black)[bos.training,]
boston.test = cbind(zn, indus, nox, dis, rad, tax, ptratio, black)[bos.testing,]
train.crim01 = crim01[bos.training]
set.seed(343)
boston.knn1 = knn(boston.train, boston.test, train.crim01, k = 1)
mean(boston.knn1 == crim01test)
## [1] 0.9685039
mean(boston.knn1 != crim01test)
## [1] 0.03149606
set.seed(343)
boston.knn10 = knn(boston.train, boston.test, train.crim01, k = 10)
mean(boston.knn10 == crim01test)
## [1] 0.9212598
mean(boston.knn10 != crim01test)
## [1] 0.07874016
set.seed(343)
boston.knn100 = knn(boston.train, boston.test, train.crim01, k = 1) # note: k is set to 1 here, not 100, so this repeats the K = 1 fit
mean(boston.knn100 == crim01test)
## [1] 0.9685039
mean(boston.knn100 != crim01test)
## [1] 0.03149606
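Since the call labeled K = 100 above was actually run with k = 1, a proper K = 100 fit (and a few intermediate K values) would need to be rerun; a sketch of how that comparison could be done (results not shown here):
# Rerun KNN over a grid of K values, including an actual K = 100 fit
set.seed(343)
for (k in c(1, 5, 10, 50, 100)) {
  pred = knn(boston.train, boston.test, train.crim01, k = k)
  cat("K =", k, " test accuracy =", round(mean(pred == crim01test), 4), "\n")
}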
Final Conclusion
Logistic Regression with ALL Predictors - Accuracy Rate: 90.6%
Logistic Regression with Subset of Predictors - Accuracy Rate: 91.3%
LDA - Accuracy Rate: 87.4%
KNN - Accuracy Rate: 96.9% (from the fit labeled K = 100, which was actually run with k = 1 and therefore corresponds to K = 1)
Based on the models run, KNN is the most accurate model for this data. It appears that KNN may have dominated the LDA and logistic regression models because the decision boundary between the above- and below-median crime rate suburbs is highly non-linear.