Problem 10

This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

With the scatterplot matrix, it is a little difficult to tell whether there are any patterns, so to look for relationships between the variables more directly, I will also use a correlation matrix.

str(Weekly) 
## 'data.frame':    1089 obs. of  9 variables:
##  $ Year     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
##  $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
##  $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
##  $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
##  $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
##  $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
##  $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
pairs(Weekly)

Looking at the correlation matrix, there appears to be a strong positive correlation between Volume and Year (about 0.84), while all other pairwise correlations are close to 0.

cor(Weekly[,-9])
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000

By looking at the Volume plot below, we see that Volume is increasing over time. In other words, the average number of daily shares traded increased (almost exponentially) from 1990 to 2010.

plot(Volume)
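The "almost exponential" description can be checked with a quick sketch: if growth is roughly exponential, Volume plotted on a log scale should look roughly linear.

# Sketch: log(Volume) over time; a roughly linear trend suggests exponential growth
plot(log(Weekly$Volume), type = "l", xlab = "Week index", ylab = "log(Volume)")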

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

The only statistically significant predictor of Direction is Lag2, with a p-value of 0.0296. With a p-value < 0.05, there is some evidence of a real association between Lag2 and Direction. The positive coefficient on Lag2 suggests that if the market had a positive return two weeks ago, it is more likely to go up this week.

All other predictors have relatively large p-values, indicating no clear evidence of a real association between Lag1, Lag3, Lag4, Lag5, or Volume and Direction.

Weekly.glm = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data = Weekly, family = binomial)
summary(Weekly.glm)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4
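To make the Lag2 interpretation above concrete, here is a small sketch (illustrative only) that converts the fitted log-odds into probabilities of an Up week at a few Lag2 values, holding the other predictors at zero.

# Sketch: implied P(Up) at Lag2 = -2, 0, 2, with all other predictors held at 0
b = coef(Weekly.glm)
plogis(b["(Intercept)"] + b["Lag2"] * c(-2, 0, 2))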

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

The confusion matrix below shows that the logistic regression fit on the whole dataset does not predict the response very well, although it is a little better than random guessing. Since these predictions were made on the same data used to fit the model rather than on a held-out test set, the error rate is likely optimistic.

For this logistic regression run, the model correctly predicts that the market would go up in 557 weeks and go down in 54 weeks - 611 correct predictions in total. The percentages below are the Accuracy Rate, Misclassification Rate, Sensitivity (true positive rate), and Specificity (true negative rate).

Accuracy Rate - (54+557)/1089 = 56.1%
Misclassification Rate - (48+430)/1089 = 43.9%
Sensitivity - 557/605 = 92%
Specificity - 54/(54+430) = 11%

Weekly.probs = predict(Weekly.glm, type = 'response')
Weekly.pred = rep("Down", 1089)
Weekly.pred[Weekly.probs > 0.5] = "Up"
table(Weekly.pred, Direction)
##            Direction
## Weekly.pred Down  Up
##        Down   54  48
##        Up    430 557
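As a minimal sketch (reusing Weekly.pred and the attached Direction from above), these rates can also be computed directly from the confusion matrix rather than by hand.

# Sketch: accuracy, sensitivity, and specificity from the confusion matrix
conf = table(Weekly.pred, Direction)
accuracy = sum(diag(conf)) / sum(conf)                    # (54 + 557) / 1089
sensitivity = conf["Up", "Up"] / sum(conf[, "Up"])        # 557 / 605
specificity = conf["Down", "Down"] / sum(conf[, "Down"])  # 54 / 484
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)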

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

train = (Year <= 2008)
head(train)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
Weekly.2008 = Weekly[!train,]
dim(Weekly)
## [1] 1089    9
dim(Weekly.2008)
## [1] 104   9
Direction.2008 = Direction[!train]
length(Direction.2008)
## [1] 104
Weekly.glm2 = glm(Direction~Lag2, data = Weekly, subset = train, family = "binomial")
summary(Weekly.glm2)
## 
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = Weekly, 
##     subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.536  -1.264   1.021   1.091   1.368  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4
Weekly.probs2 = predict(Weekly.glm2, Weekly.2008, type = 'response')
Weekly.pred2 = rep("Down", 104)
Weekly.pred2[Weekly.probs2 > 0.5] = "Up"
table(Weekly.pred2, Direction.2008)
##             Direction.2008
## Weekly.pred2 Down Up
##         Down    9  5
##         Up     34 56

The second logistic regression was trained on the data from 1990 to 2008 and then tested on the data from 2009 and 2010. On the held-out weeks, the model correctly predicts that the market would go up in 56 weeks and go down in 9 weeks - 65 correct predictions in total. The percentages below are the Accuracy Rate, Misclassification Rate, Sensitivity (true positive rate), and Specificity (true negative rate) for this model.

Accuracy Rate - (56+9)/104 = 62.5%
Misclassification Rate - (34+5)/104 = 37.5%
Sensitivity - 56/61 = 92%
Specificity - 9/(9+34) = 21%

Based on the above calculations, the model is actually performing pretty well and is better than random guessing.

The confusion matrix also shows that in weeks when the logistic regression predicts an increase in the market, it is correct 56/(56+34) = 62.2% of the time. This suggests a possible trading strategy of buying in weeks when the model predicts an increasing market and avoiding trades in weeks when a decrease is predicted.

(e) Repeat (d) using LDA.

library(MASS)
Weekly.lda = lda(Direction~Lag2, data = Weekly, subset = train)
Weekly.lda
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162

The LDA output indicates that \(\hat{\pi}_1\) = 0.448 and \(\hat{\pi}_2\) = 0.552; in other words, 44.8% of the training observations correspond to weeks in which the market went down and 55.2% correspond to weeks in which the market went up. The LDA output also provides the group means of the predictor within each class, which LDA uses as estimates of \(\mu_k\). The group means suggest a tendency for the return two weeks earlier (Lag2) to be positive in weeks when the market increases and negative in weeks when the market decreases.
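With a single predictor, LDA assigns an observation to the class with the largest discriminant score \(\delta_k(x) = x\frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2\hat{\sigma}^2} + \log(\hat{\pi}_k)\). As a quick check, the priors and group means reported by lda() can be reproduced directly from the training data (a sketch reusing the train index defined above).

# Sketch: reproduce the LDA priors and group means from the training data
with(Weekly[train, ], {
  print(prop.table(table(Direction)))  # pi-hat: about 0.448 Down, 0.552 Up
  tapply(Lag2, Direction, mean)        # mu-hat for Lag2 within each class
})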

plot(Weekly.lda)

lda.pred = predict(Weekly.lda, type = "response", newdata = Weekly.2008)
table(lda.pred$class, Direction.2008)
##       Direction.2008
##        Down Up
##   Down    9  5
##   Up     34 56
mean(lda.pred$class==Direction.2008)
## [1] 0.625
mean(lda.pred$class!=Direction.2008)
## [1] 0.375

The confusion matrix for the LDA model shows that it correctly predicts the market going up in 56 weeks and going down in 9 weeks - 65 correct predictions in total. The mean() function was used to compute the fraction of weeks for which the prediction was correct and incorrect. In this case, the LDA model correctly predicted the movement of the market 62.5% of the time and incorrectly predicted it 37.5% of the time. The percentages below are the Accuracy Rate, Misclassification Rate, Sensitivity (true positive rate), and Specificity (true negative rate) for this model.

Accuracy Rate - (56+9)/104 = 62.5%
Misclassification Rate - (34+5)/104 = 37.5%
Sensitivity - 56/61 = 92%
Specificity - 9/(9+34) = 21%

Comparing these results with the logistic regression in (d), the two models (logistic regression and LDA) perform identically on the held-out data.

(f) Repeat (d) using QDA.

The QDA output is very similar to the LDA output. It contains the prior probabilities and the group means. But it does not contain the coefficients of the linear discriminants, because the QDA classifier involves a quadratic, rather than a linear function of the predictors.

The QDA output indicates that \(\hat{\pi}_1\) = 0.448 and \(\hat{\pi}_2\) = 0.552; in other words, 44.8% of the training observations correspond to weeks in which the market went down and 55.2% correspond to weeks in which the market went up. The QDA output also provides the group means of the predictor within each class, which QDA uses as estimates of \(\mu_k\). The group means again suggest a tendency for the return two weeks earlier (Lag2) to be positive in weeks when the market increases and negative in weeks when the market decreases.

Weekly.qda = qda(Direction~Lag2, data = Weekly, subset = train)
Weekly.qda
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581

The predict() function works in exactly the same fashion as for LDA, except that it does not return linear discriminant values.

qda.pred = predict(Weekly.qda, type = "response", newdata = Weekly.2008)
table(qda.pred$class, Direction.2008)
##       Direction.2008
##        Down Up
##   Down    0  0
##   Up     43 61
mean(qda.pred$class==Direction.2008)
## [1] 0.5865385
mean(qda.pred$class!=Direction.2008)
## [1] 0.4134615

The confusion matrix for the QDA model shows that it correctly predicts the market going up in 61 weeks and going down in 0 weeks - 61 correct predictions in total; in fact, it predicts Up for every week in the test set. The mean() function was used to compute the fraction of weeks for which the prediction was correct and incorrect. In this case, the QDA model correctly predicted the movement of the market 58.7% of the time and incorrectly predicted it 41.3% of the time. The percentages below are the Accuracy Rate, Misclassification Rate, Sensitivity (true positive rate), and Specificity (true negative rate) for this model.

Accuracy Rate - (61+0)/104 = 58.7%
Misclassification Rate - (43+0)/104 = 41.3%
Sensitivity - 61/61 = 100%
Specificity - 0/(0+43) = 0%

Comparing the LDA, logistic regression, and QDA output, the QDA model does not perform better than LDA or logistic regression in terms of accuracy; however, it does maximize Sensitivity (the true positive rate), at the cost of a Specificity of zero.

(g) Repeat (d) using KNN with K = 1.

library(class)
train.x = as.matrix(Lag2[train])
test.x = as.matrix(Lag2[!train])
train.Direction = Direction[train]
set.seed(1)
knn.pred = knn(train.x, test.x, train.Direction, k = 1)
table(knn.pred, Direction.2008)
##         Direction.2008
## knn.pred Down Up
##     Down   21 30
##     Up     22 31
mean(knn.pred==Direction.2008)
## [1] 0.5
mean(knn.pred!=Direction.2008)
## [1] 0.5

The KNN model with K = 1 does not perform well; it is essentially random guessing, classifying only 50% of the held-out observations correctly. The confusion matrix shows that the KNN model correctly predicted the market going up in 31 weeks and going down in 21 weeks, for a total of 52 correct predictions. The mean() function was used to compute the fraction of weeks for which the prediction was correct; here the KNN model correctly predicted the movement of the market 50% of the time.

(h) Which of these methods appears to provide the best results on this data?

If we are solely concerned with the accuracy rate, the logistic regression or LDA model would be best. If we are primarily concerned with maximizing Sensitivity, even at the expense of Specificity, the QDA model may be the best method. Which method is preferred depends on the goal of the prediction.

(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

Weekly.glm3<-glm(Direction~Lag1+Lag2+(Lag1*Lag2), data=Weekly, family="binomial", subset=train)
summary(Weekly.glm3)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + (Lag1 * Lag2), family = "binomial", 
##     data = Weekly, subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.573  -1.259   1.003   1.086   1.596  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.211419   0.064589   3.273  0.00106 **
## Lag1        -0.051505   0.030727  -1.676  0.09370 . 
## Lag2         0.053471   0.029193   1.832  0.06700 . 
## Lag1:Lag2    0.001921   0.007460   0.257  0.79680   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1346.9  on 981  degrees of freedom
## AIC: 1354.9
## 
## Number of Fisher Scoring iterations: 4
Weekly.probs3 = predict(Weekly.glm3, type="response", newdata=Weekly.2008)
Weekly.pred3 = rep("Down", 104)
Weekly.pred3[Weekly.probs3>0.5]="Up"
table(Weekly.pred3, Direction.2008)
##             Direction.2008
## Weekly.pred3 Down Up
##         Down    7  8
##         Up     36 53
mean(Weekly.pred3==Direction.2008)
## [1] 0.5769231

The logistic regression model with Lag1, Lag2, and the Lag1:Lag2 interaction reaches 57.7% accuracy on the held-out data - better than random guessing, but below the 62.5% of the Lag2-only model from (d).

Weekly.lda2 = lda(Direction~Lag1+Lag2+(Lag1*Lag2), data = Weekly, subset = train) 
lda.pred2 = predict(Weekly.lda2, type = "response", newdata = Weekly.2008)
table(lda.pred2$class, Direction.2008)
##       Direction.2008
##        Down Up
##   Down    7  8
##   Up     36 53
mean(lda.pred2$class==Direction.2008)
## [1] 0.5769231

The LDA model with Lag1, Lag2, and the Lag1:Lag2 interaction produces the same predictions as the corresponding logistic regression (57.7% accuracy).

Weekly.qda2 = qda(Direction~Lag1+Lag2+(Lag1*Lag2), data=Weekly, subset=train)
qda.pred2 = predict(Weekly.qda2, type="response", newdata=Weekly.2008)
table(qda.pred2$class, Direction.2008)
##       Direction.2008
##        Down Up
##   Down   23 36
##   Up     20 25
mean(qda.pred2$class==Direction.2008)
## [1] 0.4615385

The QDA model with the interaction effect performs no better than random guessing (46.2% accuracy, slightly worse than a coin flip).

Xtrain2=cbind(Weekly$Lag1, Weekly$Lag2, (Weekly$Lag1 * Weekly$Lag2))[train,]
Xtest2=cbind(Weekly$Lag1, Weekly$Lag2, (Weekly$Lag1 * Weekly$Lag2))[!train,]
train.Direction2=Weekly$Direction[train]
test.Direction2=Weekly$Direction[!train]
set.seed(1)
knn.pred2<-knn(data.frame(Xtrain2), data.frame(Xtest2), train.Direction2, k=3)
table(knn.pred2, Direction.2008)
##          Direction.2008
## knn.pred2 Down Up
##      Down   23 33
##      Up     20 28
mean(knn.pred2==Direction.2008)
## [1] 0.4903846

KNN with K = 3 does a little better than the QDA model with the interaction effect, but no better than the logistic regression or LDA models. A quick scan over other values of K is sketched below.
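This is a minimal sketch (reusing Xtrain2, Xtest2, train.Direction2, and Direction.2008 from above) that loops over a few candidate values of K; exact accuracies will vary slightly with the seed because ties are broken at random.

# Sketch: evaluate several values of K on the 2009-2010 hold-out data
set.seed(1)
k.vals = c(1, 3, 5, 10, 25, 50)
knn.acc = sapply(k.vals, function(k) {
  pred = knn(data.frame(Xtrain2), data.frame(Xtest2), train.Direction2, k = k)
  mean(pred == Direction.2008)
})
data.frame(K = k.vals, Accuracy = round(knn.acc, 3))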

detach(Weekly)

Problem 11

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

library(ISLR)
attach(Auto)
summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365
str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median()function. Note you may find it helpful to use the data.frame()function to create a single data set containing both mpg01 and the other Auto variables.

mpg01 = as.numeric(ifelse(mpg>median(mpg), "1", "0"))
Auto = data.frame(Auto, mpg01)
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name mpg01
## 1 chevrolet chevelle malibu     0
## 2         buick skylark 320     0
## 3        plymouth satellite     0
## 4             amc rebel sst     0
## 5               ford torino     0
## 6          ford galaxie 500     0

(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

cor(Auto[,-9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
## mpg01         0.8369392 -0.7591939   -0.7534766 -0.6670526 -0.7577566
##              acceleration       year     origin      mpg01
## mpg             0.4233285  0.5805410  0.5652088  0.8369392
## cylinders      -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement   -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower     -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight         -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration    1.0000000  0.2903161  0.2127458  0.3468215
## year            0.2903161  1.0000000  0.1815277  0.4299042
## origin          0.2127458  0.1815277  1.0000000  0.5136984
## mpg01           0.3468215  0.4299042  0.5136984  1.0000000
pairs(Auto)

par(mfrow=c(2,3))
boxplot(cylinders~mpg01, data=Auto, main="Cylinders vs MPG01")
boxplot(displacement~mpg01, data=Auto, main="Displacement vs MPG01")
boxplot(horsepower~mpg01, data=Auto, main="Horsepower vs MPG01")
boxplot(weight~mpg01, data=Auto, main="Weight vs MPG01")
boxplot(acceleration~mpg01, data=Auto, main="Acceleration vs MPG01")
boxplot(year~mpg01, data=Auto, main="Year vs MPG01")

The plots indicate that there is some association between mpg01 and cylinders, weight, horsepower, and displacement. These four variables show the clearest separation between the mpg01 = 0 and mpg01 = 1 groups. When mpg01 equals 0, the car gets low gas mileage (below the median); when mpg01 equals 1, the car gets high gas mileage (above the median). The same ranking shows up in the correlations, as sketched below.
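As a small sketch (reusing the Auto data frame with mpg01 appended), the predictors can be ranked by the absolute value of their correlation with mpg01; aside from mpg itself, cylinders, weight, displacement, and horsepower have the largest correlations with mpg01, matching the boxplots.

# Sketch: rank numeric variables by |correlation| with mpg01
num.cols = sapply(Auto, is.numeric)
sort(abs(cor(Auto[, num.cols])[, "mpg01"]), decreasing = TRUE)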

(c) Split the data into a training set and a test set.

set.seed(343)
autotraining = sample(nrow(Auto), 0.75*nrow(Auto))
autotesting = -autotraining
Autotrain = Auto[autotraining, ]
Autotest = Auto[autotesting, ]
mpg01test = mpg01[autotesting]

dim(Autotrain)
## [1] 294  10
dim(Autotest)
## [1] 98 10
length(mpg01test)
## [1] 98

(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

The LDA output indicates that \(\hat{\pi}_1\) = 0.5170 and \(\hat{\pi}_2\) = 0.4830; in other words, 51.70% of the training observations correspond to automobiles with an mpg value below the median and 48.30% to automobiles with an mpg value above the median.

auto.lda = lda(mpg01~cylinders+weight+horsepower+displacement, data = Autotrain)
auto.lda
## Call:
## lda(mpg01 ~ cylinders + weight + horsepower + displacement, data = Autotrain)
## 
## Prior probabilities of groups:
##         0         1 
## 0.5170068 0.4829932 
## 
## Group means:
##   cylinders   weight horsepower displacement
## 0  6.671053 3583.382  128.63158     269.5066
## 1  4.183099 2354.352   79.71127     117.1725
## 
## Coefficients of linear discriminants:
##                       LD1
## cylinders    -0.397456222
## weight       -0.001016632
## horsepower    0.005679922
## displacement -0.002107400
plot(auto.lda)

auto.pred = predict(auto.lda, type = "response", newdata = Autotest)
table(auto.pred$class, mpg01test)
##    mpg01test
##      0  1
##   0 41  2
##   1  3 52
mean(auto.pred$class == mpg01test)
## [1] 0.9489796
mean(auto.pred$class != mpg01test)
## [1] 0.05102041

The confusion matrix for the LDA model shows that it correctly predicts an automobile's mpg value to be above the median 52 times and below the median 41 times - 93 correct predictions in total. The mean() function was used to compute the fraction of automobiles for which the prediction was correct and incorrect. In this case, the LDA model correctly predicted whether an automobile's mpg value is above or below the median 94.90% of the time and incorrectly predicted it 5.10% of the time. The percentages below are the Accuracy Rate, Misclassification Rate, Sensitivity (true positive rate), and Specificity (true negative rate) for this model.

Accuracy Rate - (41+52)/98 = 94.90%
Misclassification Rate - (2+3)/98 = 5.10%
Sensitivity - 52/54 = 96%
Specificity - 41/(41+3) = 93%

Per the question, the test error rate for the LDA model is 5.10%.

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

The QDA output indicates that \(\hat{\pi}_1\) = 0.5170 and \(\hat{\pi}_2\) = 0.4830; in other words, 51.70% of the training observations correspond to automobiles with an mpg value below the median and 48.30% to automobiles with an mpg value above the median. The output is very similar to the LDA output, except that it does not report linear discriminant coefficients.

auto.qda = qda(mpg01~cylinders+weight+horsepower+displacement, data = Autotrain)
auto.qda
## Call:
## qda(mpg01 ~ cylinders + weight + horsepower + displacement, data = Autotrain)
## 
## Prior probabilities of groups:
##         0         1 
## 0.5170068 0.4829932 
## 
## Group means:
##   cylinders   weight horsepower displacement
## 0  6.671053 3583.382  128.63158     269.5066
## 1  4.183099 2354.352   79.71127     117.1725
auto.pred2 = predict(auto.qda, type = "response", newdata = Autotest)
table(auto.pred2$class, mpg01test)
##    mpg01test
##      0  1
##   0 43  5
##   1  1 49
mean(auto.pred2$class == mpg01test)
## [1] 0.9387755
mean(auto.pred2$class != mpg01test)
## [1] 0.06122449

The confusion matrix for the QDA model shows that it correctly predicts an automobile's mpg value to be above the median 49 times and below the median 43 times - 92 correct predictions in total. The mean() function was used to compute the fraction of automobiles for which the prediction was correct and incorrect. In this case, the QDA model correctly predicted whether an automobile's mpg value is above or below the median 93.88% of the time and incorrectly predicted it 6.12% of the time. The percentages below are the Accuracy Rate, Misclassification Rate, Sensitivity (true positive rate), and Specificity (true negative rate) for this model.

Accuracy Rate - (43+49)/98 = 93.9%
Misclassification Rate - (1+5)/98 = 6.1%
Sensitivity - 49/54 = 91%
Specificity - 43/(43+1) = 98%

Per the question, the test error rate for the QDA model is 6.12% - which is close but higher than the LDA model.

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

While cylinders, weight, horsepower, and displacement all appear associated with mpg01 graphically, the z-statistics in the logistic regression do not confirm this for every predictor. Once the variables are fit together, the only predictors with a statistically significant relationship with mpg01 are weight and horsepower.

auto.glm = glm(mpg01~cylinders+weight+horsepower+displacement, data = Autotrain, family = "binomial")
summary(auto.glm)
## 
## Call:
## glm(formula = mpg01 ~ cylinders + weight + horsepower + displacement, 
##     family = "binomial", data = Autotrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3201  -0.3142  -0.0074   0.4163   3.2431  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  10.5825149  1.8281409   5.789 7.09e-09 ***
## cylinders    -0.0249888  0.4034757  -0.062   0.9506    
## weight       -0.0018611  0.0007296  -2.551   0.0107 *  
## horsepower   -0.0347333  0.0152534  -2.277   0.0228 *  
## displacement -0.0115324  0.0092956  -1.241   0.2147    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 407.23  on 293  degrees of freedom
## Residual deviance: 175.16  on 289  degrees of freedom
## AIC: 185.16
## 
## Number of Fisher Scoring iterations: 7
auto.probs = predict(auto.glm, type = 'response', newdata = Autotest)
auto.pred3 = rep(0, 98)
auto.pred3[auto.probs > 0.5] = 1
table(auto.pred3, mpg01test)
##           mpg01test
## auto.pred3  0  1
##          0 42  3
##          1  2 51
mean(auto.pred3 == mpg01test)
## [1] 0.9489796
mean(auto.pred3 != mpg01test)
## [1] 0.05102041

The confusion matrix for the logistic regression shows that it correctly predicts an automobile's mpg value to be above the median 51 times and below the median 42 times - 93 correct predictions in total. The mean() function was used to compute the fraction of automobiles for which the prediction was correct and incorrect. In this case, the logistic regression model correctly predicted whether an automobile's mpg value is above or below the median 94.90% of the time and incorrectly predicted it 5.10% of the time. The percentages below are the Accuracy Rate, Misclassification Rate, Sensitivity (true positive rate), and Specificity (true negative rate) for this model.

Accuracy Rate - (42+51)/98 = 94.90%
Misclassification Rate - (3+2)/98 = 5.10%
Sensitivity - 51/54 = 94%
Specificity - 42/(42+2) = 95%

Per the question, the test error rate for the logistic regression model is 5.10%, the same as the LDA model. The key difference between the models lies in their Sensitivity (true positive rate) and Specificity (true negative rate).

(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

library(class)
auto.train = cbind(cylinders, weight, displacement, horsepower)[autotraining,]
auto.test = cbind(cylinders, weight, displacement, horsepower)[autotesting,]
train.mpg01 = mpg01[autotraining]

#KNN (k=1)
set.seed(343)
auto.knn = knn(auto.train, auto.test, train.mpg01, k = 1)
mean(auto.knn == mpg01test)
## [1] 0.9285714
mean(auto.knn != mpg01test) 
## [1] 0.07142857
auto.knn2 = knn(auto.train, auto.test, train.mpg01, k = 10)
mean(auto.knn2 == mpg01test)
## [1] 0.9285714
mean(auto.knn2 != mpg01test) 
## [1] 0.07142857
auto.knn3 = knn(auto.train, auto.test, train.mpg01, k = 100)
mean(auto.knn3 == mpg01test)
## [1] 0.9387755
mean(auto.knn3 != mpg01test) 
## [1] 0.06122449

The test errors for the KNN models are as follows:
KNN with K = 1 - Test Error Rate: 7.14%
KNN with K = 10 - Test Error Rate: 7.14%
KNN with K = 100 - Test Error Rate: 6.12%

The KNN model with K = 100 seems to perform best on this data set, as its test error rate is the lowest of the three values tried; a sketch scanning a wider range of K follows.
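This is a small sketch (reusing auto.train, auto.test, train.mpg01, and mpg01test from above) that scans a grid of K values and reports the test error for each; exact values depend on how ties are broken.

# Sketch: test error across a grid of K values for the Auto data
set.seed(343)
k.grid = c(1, 3, 5, 10, 20, 50, 100)
knn.err = sapply(k.grid, function(k) {
  mean(knn(auto.train, auto.test, train.mpg01, k = k) != mpg01test)
})
data.frame(K = k.grid, TestError = round(knn.err, 4))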

detach(Auto)

Problem 13

Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
crim01 = as.numeric(ifelse(Boston$crim>median(Boston$crim), "1", "0"))
boston = data.frame(Boston, crim01)
head(boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv crim01
## 1 24.0      0
## 2 21.6      0
## 3 34.7      0
## 4 33.4      0
## 5 36.2      0
## 6 28.7      0
dim(boston)
## [1] 506  15
attach(boston)
## The following object is masked _by_ .GlobalEnv:
## 
##     crim01
cor(boston)
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
## crim01   0.40939545 -0.43615103  0.60326017  0.070096774  0.72323480
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
## crim01  -0.15637178  0.61393992 -0.61634164  0.619786249  0.60874128  0.2535684
##               black      lstat       medv      crim01
## crim    -0.38506394  0.4556215 -0.3883046  0.40939545
## zn       0.17552032 -0.4129946  0.3604453 -0.43615103
## indus   -0.35697654  0.6037997 -0.4837252  0.60326017
## chas     0.04878848 -0.0539293  0.1752602  0.07009677
## nox     -0.38005064  0.5908789 -0.4273208  0.72323480
## rm       0.12806864 -0.6138083  0.6953599 -0.15637178
## age     -0.27353398  0.6023385 -0.3769546  0.61393992
## dis      0.29151167 -0.4969958  0.2499287 -0.61634164
## rad     -0.44441282  0.4886763 -0.3816262  0.61978625
## tax     -0.44180801  0.5439934 -0.4685359  0.60874128
## ptratio -0.17738330  0.3740443 -0.5077867  0.25356836
## black    1.00000000 -0.3660869  0.3334608 -0.35121093
## lstat   -0.36608690  1.0000000 -0.7376627  0.45326273
## medv     0.33346082 -0.7376627  1.0000000 -0.26301673
## crim01  -0.35121093  0.4532627 -0.2630167  1.00000000
pairs(boston)

First, we split the data into training and test sets using a 75/25 split.

set.seed(343)
bos.training = sample(nrow(boston), 0.75*nrow(boston))
bos.testing = -bos.training
bostontrain = boston[bos.training, ]
bostontest = boston[bos.testing, ]
crim01test = crim01[bos.testing]

dim(bostontrain)
## [1] 379  15
dim(bostontest)
## [1] 127  15
length(crim01test)
## [1] 127

Logistic Regression with all Predictors

Next, we fit a logistic regression on the training data using all predictors except crim.

boston.glm = glm(crim01~. -crim, data = bostontrain, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(boston.glm)
## 
## Call:
## glm(formula = crim01 ~ . - crim, family = "binomial", data = bostontrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0888  -0.1653   0.0000   0.0030   3.4141  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -26.518139   9.131172  -2.904  0.00368 ** 
## zn           -0.088450   0.041911  -2.110  0.03482 *  
## indus        -0.122768   0.056017  -2.192  0.02841 *  
## chas          1.529483   0.868505   1.761  0.07823 .  
## nox          55.348443   9.422343   5.874 4.25e-09 ***
## rm           -0.339865   0.867162  -0.392  0.69511    
## age           0.017334   0.014016   1.237  0.21619    
## dis           0.705396   0.268295   2.629  0.00856 ** 
## rad           0.556724   0.178992   3.110  0.00187 ** 
## tax          -0.007124   0.003277  -2.174  0.02970 *  
## ptratio       0.403384   0.146321   2.757  0.00584 ** 
## black        -0.040742   0.016720  -2.437  0.01482 *  
## lstat         0.049228   0.055767   0.883  0.37737    
## medv          0.160242   0.082655   1.939  0.05254 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 525.38  on 378  degrees of freedom
## Residual deviance: 148.31  on 365  degrees of freedom
## AIC: 176.31
## 
## Number of Fisher Scoring iterations: 9
boston.probs = predict(boston.glm, type = 'response', newdata = bostontest)
boston.pred = rep(0, 127)
boston.pred[boston.probs > 0.5] = 1
table(boston.pred, crim01test)
##            crim01test
## boston.pred  0  1
##           0 56  3
##           1  9 59
mean(boston.pred == crim01test)
## [1] 0.9055118
mean(boston.pred != crim01test)
## [1] 0.09448819

The confusion matrix for this logistic regression shows that it correctly predicts 59 suburbs as having a crime rate above the median and 56 suburbs as having a crime rate below the median - 115 correct predictions in total. In this case, the logistic regression classified suburbs correctly 90.6% of the time and incorrectly 9.4% of the time. The percentages below are the Accuracy Rate, Misclassification Rate, Sensitivity (true positive rate), and Specificity (true negative rate) for this model.

Accuracy Rate - (56+59)/127 = 90.6%
Misclassification Rate - (9+3)/127 = 9.4%
Sensitivity - 59/62 = 95%
Specificity - 56/(56+9) = 86%

After fitting the logistic regression on all predictors, I move forward only with the variables that are significant at the 95% confidence level; predictors with p-values > 0.05 are dropped from the next iteration of the logistic regression. A small sketch of how to extract that subset programmatically follows.
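This is a sketch (using the boston.glm fit above) that pulls out the predictors whose p-values fall below 0.05; the result matches the subset used in the model below.

# Sketch: predictors from the full model with p-value < 0.05
coefs = summary(boston.glm)$coefficients
sig = rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05]
setdiff(sig, "(Intercept)")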

Logistic Regression with Subset of Predictors

boston.glm2 = glm(crim01~zn+indus+nox+dis+rad+tax+ptratio+black, data = bostontrain, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(boston.glm2)
## 
## Call:
## glm(formula = crim01 ~ zn + indus + nox + dis + rad + tax + ptratio + 
##     black, family = "binomial", data = bostontrain)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.12396  -0.20733   0.00000   0.00298   3.00921  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -12.830267   7.256124  -1.768  0.07703 .  
## zn           -0.072756   0.035479  -2.051  0.04030 *  
## indus        -0.100537   0.050861  -1.977  0.04808 *  
## nox          46.224462   8.098248   5.708 1.14e-08 ***
## dis           0.268721   0.211256   1.272  0.20337    
## rad           0.658845   0.162986   4.042 5.29e-05 ***
## tax          -0.009344   0.003107  -3.008  0.00263 ** 
## ptratio       0.167504   0.110080   1.522  0.12809    
## black        -0.039054   0.015669  -2.492  0.01269 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 525.38  on 378  degrees of freedom
## Residual deviance: 162.63  on 370  degrees of freedom
## AIC: 180.63
## 
## Number of Fisher Scoring iterations: 9
boston.probs2 = predict(boston.glm2, type = 'response', newdata = bostontest)
boston.pred2 = rep(0, 127)
boston.pred2[boston.probs2 > 0.5] = 1
table(boston.pred2, crim01test)
##             crim01test
## boston.pred2  0  1
##            0 56  2
##            1  9 60
mean(boston.pred2 == crim01test)
## [1] 0.9133858
mean(boston.pred2 != crim01test)
## [1] 0.08661417

The confusion matrix for the second logistic regression shows that it correctly predicts 60 suburbs as having a crime rate above the median and 56 suburbs as having a crime rate below the median - 116 correct predictions in total. In this case, the model classified suburbs correctly 91.3% of the time and incorrectly 8.7% of the time. The percentages below are the Accuracy Rate, Misclassification Rate, Sensitivity (true positive rate), and Specificity (true negative rate) for this model.

Accuracy Rate - (56+60)/127 = 91.3%
Misclassification Rate - (9+2)/127 = 8.7%
Sensitivity - 60/62 = 97%
Specificity - 56/(56+9) = 86%

Compared with the previous logistic regression, the model fit on the subset of variables is only slightly better in prediction accuracy.

LDA Model

The LDA output indicates that \(\hat{\pi}_1\) = 0.496 and \(\hat{\pi}_2\) = 0.504; in other words, 49.6% of the training observations correspond to suburbs with crime rates below the median and 50.4% correspond to suburbs with crime rates above the median. The LDA output also provides the group means for each predictor within each class, which LDA uses as estimates of \(\mu_k\).

boston.lda = lda(crim01~zn+indus+nox+dis+rad+tax+ptratio+black, data = bostontrain)
boston.lda
## Call:
## lda(crim01 ~ zn + indus + nox + dis + rad + tax + ptratio + black, 
##     data = bostontrain)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4960422 0.5039578 
## 
## Group means:
##          zn     indus       nox      dis       rad      tax  ptratio    black
## 0 20.712766  7.011755 0.4704867 5.091370  4.202128 306.7979 17.90213 389.9009
## 1  1.371728 15.095079 0.6393770 2.504081 14.518325 502.9215 18.96283 327.8565
## 
## Coefficients of linear discriminants:
##                  LD1
## zn      -0.002577380
## indus   -0.004383629
## nox      8.673464785
## dis     -0.116448668
## rad      0.079951292
## tax     -0.001657112
## ptratio  0.037021741
## black   -0.000705304
plot(boston.lda)

boston.pred3 = predict(boston.lda, type = "response", newdata = bostontest)
table(boston.pred3$class, crim01test)
##    crim01test
##      0  1
##   0 62 13
##   1  3 49
mean(boston.pred3$class == crim01test)
## [1] 0.8740157
mean(boston.pred3$class != crim01test)
## [1] 0.1259843

The confusion matrix for the LDA model shows that it correctly predicts 62 suburbs as having crime rates below the median and 49 suburbs as having crime rates above the median - 111 correct predictions in total. The mean() function was used to compute the fraction of suburbs for which the prediction was correct and incorrect. In this case, the LDA model classified suburbs correctly 87.4% of the time and incorrectly 12.6% of the time. The percentages below are the Accuracy Rate, Misclassification Rate, Sensitivity (true positive rate), and Specificity (true negative rate) for this model.

Accuracy Rate - (62+49)/127 = 87.4%
Misclassification Rate - (3+13)/127 = 12.6%
Sensitivity - 49/(49+13) = 79%
Specificity - 62/(62+3) = 95%

In comparison to the two logistic regression fits, the LDA underperforms on this data, mainly because of its lower sensitivity.

KNN Models

All three KNN models (K = 1, 10, and 100) run well on the data. Two of them, K = 1 and K = 100, report the same accuracy and misclassification rate of 96.9% and 3.15%, respectively. For the comparison below, we use the K = 100 model, since very small values of K tend to produce less stable (higher-variance) fits.

library(class)
boston.train = cbind(zn, indus, nox, dis, rad, tax, ptratio, black)[bos.training,]
boston.test = cbind(zn, indus, nox, dis, rad, tax, ptratio, black)[bos.testing,]
train.crim01 = crim01[bos.training]
set.seed(343)
boston.knn1 = knn(boston.train, boston.test, train.crim01, k = 1)
mean(boston.knn1 == crim01test)
## [1] 0.9685039
mean(boston.knn1 != crim01test) 
## [1] 0.03149606
set.seed(343)
boston.knn10 = knn(boston.train, boston.test, train.crim01, k = 10)
mean(boston.knn10 == crim01test)
## [1] 0.9212598
mean(boston.knn10 != crim01test) 
## [1] 0.07874016
set.seed(343)
boston.knn100 = knn(boston.train, boston.test, train.crim01, k = 100)
mean(boston.knn100 == crim01test)
## [1] 0.9685039
mean(boston.knn100 != crim01test) 
## [1] 0.03149606

Final Conclusion

Logistic Regression with All Predictors - Accuracy Rate: 90.6%
Logistic Regression with Subset of Predictors - Accuracy Rate: 91.3%
LDA - Accuracy Rate: 87.4%
KNN with K = 100 - Accuracy Rate: 96.9%

Based on the models run, KNN with K = 100 is the most accurate model for this data. KNN likely dominates the LDA and logistic regression models because the decision boundary between the above- and below-median crime rate suburbs is highly non-linear.