Problem 13

This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

library(ISLR)
attach(Weekly)

(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

names(Weekly)
## [1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
## [7] "Volume"    "Today"     "Direction"
dim(Weekly)
## [1] 1089    9
summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
str(Weekly)
## 'data.frame':    1089 obs. of  9 variables:
##  $ Year     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
##  $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
##  $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
##  $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
##  $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
##  $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
##  $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
##  $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
pairs(Weekly)

cor(Weekly[,-9])
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000

From the matrix of pairwise correlations among the predictors in the Weekly data set, we can see that the lag variables reflecting previous weeks’ returns have little correlation with this week’s return (Today). The only substantial correlation (about 0.84) is between Year and Volume. As the plot below shows, Volume, which reflects the average number of daily shares traded in billions, increases over time.

plot(Volume)
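
To make the time trend explicit, we can also plot Volume against Year rather than against the observation index (a minimal sketch, reusing the attached Weekly columns):

# Volume against Year rather than the row index; the upward drift with
# time mirrors the 0.84 correlation between Year and Volume noted above
plot(Year,Volume,xlab="Year",ylab="Volume (billions of shares)")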

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

glm_fits<-glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Weekly,family=binomial)
summary(glm_fits)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

From the summary above we can see that the Lag2 predictor, the percentage return from two weeks previous, appears to be statistically significant, with a p-value of 0.0296, which is less than 0.05.
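
As a quick interpretation of that coefficient, exponentiating it gives the implied multiplicative change in the odds of an Up week for a one-unit increase in Lag2 (a minimal sketch using the fitted model above):

# odds ratio for a one-unit increase in Lag2: roughly exp(0.058) = 1.06,
# i.e. about a 6% increase in the odds of an "Up" week
exp(coef(glm_fits)["Lag2"])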

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

glm_probs<-predict(glm_fits,type='response')
glm_pred<-rep("Down",1089)
glm_pred[glm_probs>0.5]="Up"
table(glm_pred,Direction)
##         Direction
## glm_pred Down  Up
##     Down   54  48
##     Up    430 557
mean(glm_pred==Direction)
## [1] 0.5610652

From the confusion matrix we can calculate the fraction of weeks for which our logistic model correctly predicted the market’s direction as \(\frac{(557+54)}{1089} = .5611\). The model correctly predicted that the market would go up on 557 weeks and that it would go down on 54 weeks, for a total of 611 correct predictions, so logistic regression correctly predicted the movement of the market 56.11% of the time. Because the model was fit to the full data set, this is a training accuracy. The mistakes are one-sided: the model predicts “Up” for 987 of the 1,089 weeks, so it misses only 48 of the 605 Up weeks but misclassifies 430 of the 484 Down weeks.
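
To quantify the two kinds of mistakes, a short sketch computes the sensitivity and specificity implied by the confusion matrix above:

conf<-table(glm_pred,Direction)
conf["Up","Up"]/sum(conf[,"Up"])       # sensitivity: 557/605 Up weeks correctly called
conf["Down","Down"]/sum(conf[,"Down"]) # specificity: only 54/484 Down weeks correctly called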

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

train<-(Year<2009)
Weekly_2009<-Weekly[!train,]
Direction_2009<-Direction[!train]
glm_fits<-glm(Direction~Lag2,data=Weekly,family=binomial,subset=train)
glm_probs<-predict(glm_fits,Weekly_2009,type="response")
glm_pred=rep('Down',104)
glm_pred[glm_probs>.5]='Up'
table(glm_pred,Direction_2009)
##         Direction_2009
## glm_pred Down Up
##     Down    9  5
##     Up     34 56

From the confusion matrix above we can calculate the overall fraction of correct predictions for the held-out data from 2009 and 2010 as \(\frac{(9+56)}{104} = .625\), or 62.5%. This is an improvement over the 56.11% training-set accuracy of our first logistic regression, which used the entire data set and included all of the lag variables as well as Volume.

mean(glm_pred==Direction_2009)
## [1] 0.625

(e) Repeat (d) using LDA.

library(MASS)
lda_fit<-lda(Direction~Lag2,data=Weekly,subset=train)
lda_pred<-predict(lda_fit,Weekly_2009)
lda_class<-lda_pred$class
table(lda_class,Direction_2009)
##          Direction_2009
## lda_class Down Up
##      Down    9  5
##      Up     34 56
mean(lda_class==Direction_2009)
## [1] 0.625

The model produced using linear discriminant analysis yields an identical confusion matrix to the logistic regression model, with a correct prediction rate of .625, or 62.5%.

(f) Repeat (d) using QDA.

qda_fit<-qda(Direction~Lag2,data=Weekly,subset=train)
qda_class<-predict(qda_fit,Weekly_2009)$class
table(qda_class,Direction_2009)
##          Direction_2009
## qda_class Down Up
##      Down    0  0
##      Up     43 61
mean(qda_class==Direction_2009)
## [1] 0.5865385

The model produced using quadratic discriminant analysis yields a lower correct prediction rate of .5865, or 58.65%. Note from the confusion matrix that QDA predicts “Up” for every week in the held-out period.

(g) Repeat (d) using KNN with K = 1.

library(class)
train_x<-as.matrix(Lag2[train])
test_x<-as.matrix(Lag2[!train])
train_Direction<-Direction[train]
set.seed(1)
knn_pred<-knn(train_x,test_x,train_Direction,k=1)
table(knn_pred,Direction_2009)
##         Direction_2009
## knn_pred Down Up
##     Down   21 30
##     Up     22 31
mean(knn_pred==Direction_2009)
## [1] 0.5

The KNN model yields an accuracy rate of 0.5 or 50%.

(h) Repeat (d) using naive Bayes.

library(e1071)
nb_fit<-naiveBayes(Direction~Lag2,data=Weekly,subset=train)
nb_class<-predict(nb_fit,Weekly_2009)
table(nb_class,Direction_2009)
##         Direction_2009
## nb_class Down Up
##     Down    0  0
##     Up     43 61
mean(nb_class==Direction_2009)
## [1] 0.5865385

Our naive Bayes model produces the same confusion matrix as our quadratic discriminant analysis model, again predicting “Up” for every week and yielding a correct prediction rate of .5865, or 58.65%.

(i) Which of these methods appears to provide the best results on this data?
It appears that both the logistic regression and linear discriminant analysis models are the most accurate, with correct prediction rates of .625, or 62.5%.
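
For a side-by-side comparison, the held-out accuracies of the five classifiers can be collected into a single named vector (a minimal sketch reusing the predictions computed in parts (d) through (h)):

sapply(list(logistic=glm_pred,lda=lda_class,qda=qda_class,
            knn=knn_pred,naive_bayes=nb_class),
       function(pred) mean(pred==Direction_2009))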

(j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
By including the Lag1 variable in our logistic regression we decrease the accuracy of our model to .5769 or 57.69%.

glm_fits2<-glm(Direction~Lag1+Lag2,data=Weekly,family=binomial,subset=train)
glm_probs2<-predict(glm_fits2,Weekly_2009,type="response")
glm_pred2=rep('Down',104)
glm_pred2[glm_probs2>.5]='Up'
table(glm_pred2,Direction_2009)
##          Direction_2009
## glm_pred2 Down Up
##      Down    7  8
##      Up     36 53
mean(glm_pred2==Direction_2009)
## [1] 0.5769231

Including the Lag1 variable in our linear discriminant analysis model also reduces accuracy of the predictions to 0.5769 or 57.69%.

library(MASS)
lda_fit2<-lda(Direction~Lag1+Lag2,data=Weekly,subset=train)
lda_pred2<-predict(lda_fit2,Weekly_2009)
lda_class2<-lda_pred2$class
table(lda_class2,Direction_2009)
##           Direction_2009
## lda_class2 Down Up
##       Down    7  8
##       Up     36 53
mean(lda_class2==Direction_2009)
## [1] 0.5769231

This logistic model includes an interaction between the Lag2 and Volume variables and yields more accurate predictions than the previous models, which included both the Lag1 and Lag2 predictors. The accuracy of this model is .6154, or 61.54%, so it still falls short of the original logistic model that included only Lag2.

glm_fits3<-glm(Direction~Lag2:Volume,data=Weekly,family=binomial,subset=train)
glm_probs3<-predict(glm_fits3,Weekly_2009,type="response")
glm_pred3=rep('Down',104)
glm_pred3[glm_probs3>.5]='Up'
table(glm_pred3,Direction_2009)
##          Direction_2009
## glm_pred3 Down Up
##      Down    9  6
##      Up     34 55
mean(glm_pred3==Direction_2009)
## [1] 0.6153846

Including an interaction between the Lag2 and Volume predictors in the linear discriminant analysis model yields a reduced accuracy of .6058, or 60.58%, compared to the original linear discriminant analysis model’s 62.5%. This model also yields less accurate predictions than the logistic model with the same Lag2:Volume interaction. While it is less accurate than the original linear discriminant analysis model, it does improve on the LDA model that included only the Lag1 and Lag2 predictors.

library(MASS)
lda_fit3<-lda(Direction~Lag2:Volume,data=Weekly,subset=train)
lda_pred3<-predict(lda_fit3,Weekly_2009)
lda_class3<-lda_pred3$class
table(lda_class3,Direction_2009)
##           Direction_2009
## lda_class3 Down Up
##       Down    8  6
##       Up     35 55
mean(lda_class3==Direction_2009)
## [1] 0.6057692

By increasing the value of k in our KNN model from 1 to 3, we improve the accuracy of the model to .548, or 54.81%.

library(class)
train_x<-as.matrix(Lag2[train])
test_x<-as.matrix(Lag2[!train])
train_Direction<-Direction[train]
set.seed(1)
knn_pred2<-knn(train_x,test_x,train_Direction,k=3)
table(knn_pred2,Direction_2009)
##          Direction_2009
## knn_pred2 Down Up
##      Down   16 20
##      Up     27 41
mean(knn_pred2==Direction_2009)
## [1] 0.5480769

By further increasing our value of k in our KNN model to 200, we are again able to improve the correct prediction rate to .6058 or 60.58%.

library(class)
train_x<-as.matrix(Lag2[train])
test_x<-as.matrix(Lag2[!train])
train_Direction<-Direction[train]
set.seed(1)
knn_pred3<-knn(train_x,test_x,train_Direction,k=200)
table(knn_pred3,Direction_2009)
##          Direction_2009
## knn_pred3 Down Up
##      Down    2  0
##      Up     41 61
mean(knn_pred3==Direction_2009)
## [1] 0.6057692

From these experiments with different predictors and interactions, I find that the original logistic regression and linear discriminant analysis models remain the most accurate, with correct prediction rates of 62.5%. The logistic model that included the Lag2:Volume interaction came closest to this level of accuracy, with a correct prediction rate of 61.54%.
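
Rather than trying values of K one at a time, a small sweep over a grid of K values reports the held-out accuracy for each (a sketch, assuming the training and test matrices from part (g) are still in the workspace):

set.seed(1)
k_grid<-c(1,3,5,10,25,50,100,200)
acc<-sapply(k_grid,function(k){
  pred<-knn(train_x,test_x,train_Direction,k=k)
  mean(pred==Direction_2009)
})
data.frame(k=k_grid,accuracy=acc)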

detach(Weekly)

Problem 14

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

library(ISLR)
attach(Auto)
summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

mpg01<-ifelse(mpg>median(mpg),yes=1,no=0)
Auto<-data.frame(Auto,mpg01)
str(Auto)
## 'data.frame':    392 obs. of  10 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  $ mpg01       : num  0 0 0 0 0 0 0 0 0 0 ...

(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatter plots and box plots may be useful tools to answer this question. Describe your findings.

cor(Auto[,-9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
## mpg01         0.8369392 -0.7591939   -0.7534766 -0.6670526 -0.7577566
##              acceleration       year     origin      mpg01
## mpg             0.4233285  0.5805410  0.5652088  0.8369392
## cylinders      -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement   -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower     -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight         -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration    1.0000000  0.2903161  0.2127458  0.3468215
## year            0.2903161  1.0000000  0.1815277  0.4299042
## origin          0.2127458  0.1815277  1.0000000  0.5136984
## mpg01           0.3468215  0.4299042  0.5136984  1.0000000
pairs(Auto)

We can learn more about the available data by creating box plots that contrast the distribution of quantitative variables between cars with above-median mpg (1) and below-median mpg (0).

par(mfrow=c(2,3))
plot(factor(mpg01),cylinders,ylab= "Number of Engine Cylinders")
plot(factor(mpg01),displacement,ylab="Engine Displacement (Cubic Inches)")
plot(factor(mpg01),horsepower,ylab="Horsepower")
plot(factor(mpg01),weight,ylab="Weight (Pounds)")
plot(factor(mpg01),acceleration,ylab="Time to Reach 60MPH (Seconds)")
plot(factor(mpg01),year,ylab="Manufacture Year")
mtext("Box Plots for Cars With Above (1) and Below (0) Median MPG",outer=TRUE,line=-1)

From the box plots we can see that vehicles in the data set with above-median mpg tend to have smaller engines, less horsepower, and lower weight. It is also clear that vehicles with above-median mpg predominantly have four-cylinder engines. The box plots for acceleration and manufacture year do show some separation between vehicles with above- and below-median mpg, but it is not as pronounced as for the first four predictors. Because their box plots overlap the least, the cylinders, displacement, horsepower, and weight variables appear to be the most useful for predicting whether a vehicle has above- or below-median mpg in this data set.
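
The box-plot differences can also be summarized numerically; a minimal sketch computes the median of each candidate predictor within the two mpg01 classes:

# median of each candidate predictor for below-median (0) and above-median (1) mpg cars
aggregate(cbind(cylinders,displacement,horsepower,weight)~mpg01,data=Auto,FUN=median)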

We can take a closer look at the useful variables that we identified from the box plots by plotting them against mpg01 in scatter plots.

par(mfrow = c(3, 2))
plot(cylinders,mpg01,xlab ="Number of Engine Cylinders")
plot(displacement,mpg01,xlab ="Engine Displacement (Cubic Inches)")
plot(horsepower,mpg01,xlab ="Horsepower")
plot(Auto$weight, Auto$mpg01, xlab ="Weight (Pounds)")
mtext("Scatterplots For Cars With Above(1) and Below(0) Median MPG",outer=TRUE,line=-3)

From the scatter plots generated, we can see that the horsepower and weight variables continue to appear useful in providing predictions for mpg01. The scatter plot for horsepower shows a higher concentration of vehicles with lower horsepower classified as above-median mpg. We see a similar concentration in the weight scatter plot, which suggests that more vehicles from the data set with a lighter weight are classified as having above-median mpg. The scatter plots produced for number of engine cylinders and engine displacement against mpg01 do not appear to provide as useful insights on their own for predicting mpg01.

To see these relationships in another light, I have focused on the scatter plots below, which plot these variables against the original mpg variable. From these plots we can more clearly see the apparent relationships between the predictors and mpg. For example, it appears that vehicles from the data set with smaller engines and fewer cylinders tend to have above-median mpg.

par(mfrow=c(2,2))
plot(horsepower,mpg,xlab="Horsepower",ylab="MPG")
abline(h=median(mpg),lwd=2,col="green")
plot(weight,mpg,xlab="Weight (Pounds)",ylab="MPG")
abline(h=median(mpg),lwd=2,col="green")
plot(displacement,mpg,xlab="Engine Displacement (Cubic Inches)",ylab="MPG")
abline(h=median(mpg),lwd=2,col="green")
plot(cylinders,mpg,xlab="Number of Engine Cylinders",ylab="MPG")
abline(h=median(mpg),lwd=2,col="green")

We can also see from the plots below that interesting relationships exist between the year and origin variables when plotted with mpg.

par(mfrow=c(1, 2))
plot(year,mpg,xlab="Year",ylab="MPG")
abline(h=median(mpg),lwd=2,col="green")
origin[origin==1]="American"
origin[origin==2]="European"
origin[origin==3]="Japanese"
origin=as.factor(origin)
plot(origin,mpg,xlab="Origin",ylab="MPG")
abline(h=median(mpg),lwd=2,col="green")

From the scatter plot of mpg vs. year, we can see that newer vehicles in the data set tend to have above-median mpg. The box plots of mpg by origin also suggest that American-made vehicles tend to have below-median mpg.

(c) Split the data into a training set and a test set.

set.seed(1)
idx<-sample(1:nrow(Auto),0.8*nrow(Auto))
train_auto<-Auto[idx, ]
test_auto<-Auto[-idx, ]
y_train<-Auto[idx,]$mpg
str(train_auto)
## 'data.frame':    313 obs. of  10 variables:
##  $ mpg         : num  44.3 23 26 23.9 23.2 16 33.5 13 31.5 17.6 ...
##  $ cylinders   : num  4 4 4 8 4 8 4 8 4 6 ...
##  $ displacement: num  90 140 122 260 156 318 151 350 89 225 ...
##  $ horsepower  : num  48 83 80 90 105 150 90 175 71 85 ...
##  $ weight      : num  2085 2639 2451 3420 2745 ...
##  $ acceleration: num  21.7 17 16.5 22.2 16.7 13 13.2 13 14.9 16.6 ...
##  $ year        : num  80 75 74 79 78 76 79 73 78 81 ...
##  $ origin      : num  2 1 1 1 1 1 1 1 2 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 303 156 156 201 230 108 248 25 291 72 ...
##  $ mpg01       : num  1 1 1 1 1 0 1 0 1 0 ...

(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

From the understanding we have gathered of the variables so far, I will generate a linear discriminant analysis model which includes the cylinders, displacement, horsepower, weight, year, and origin variables.

library(MASS)
lda_fit_auto<-lda(mpg01~cylinders+displacement+horsepower+weight+year+origin,data=train_auto)
lda_fit_auto
## Call:
## lda(mpg01 ~ cylinders + displacement + horsepower + weight + 
##     year + origin, data = train_auto)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4920128 0.5079872 
## 
## Group means:
##   cylinders displacement horsepower   weight     year   origin
## 0  6.753247     269.6558   128.0844 3598.026 74.53247 1.162338
## 1  4.207547     117.5723    79.8239 2350.283 77.64151 1.949686
## 
## Coefficients of linear discriminants:
##                       LD1
## cylinders    -0.403682186
## displacement -0.001118231
## horsepower    0.008840042
## weight       -0.001101703
## year          0.120435532
## origin        0.147062729
lda_pred_auto<-predict(lda_fit_auto,test_auto)$class
table(lda_pred_auto,test_auto$mpg01)
##              
## lda_pred_auto  0  1
##             0 35  0
##             1  7 37

The test error rate for this model is shown to be .0886, or 8.86%.

1- mean(lda_pred_auto==test_auto$mpg01)
## [1] 0.08860759
library(MASS)
lda_fit_auto2<-lda(mpg01~cylinders+displacement+horsepower+weight,data=train_auto)
lda_fit_auto2
## Call:
## lda(mpg01 ~ cylinders + displacement + horsepower + weight, data = train_auto)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4920128 0.5079872 
## 
## Group means:
##   cylinders displacement horsepower   weight
## 0  6.753247     269.6558   128.0844 3598.026
## 1  4.207547     117.5723    79.8239 2350.283
## 
## Coefficients of linear discriminants:
##                        LD1
## cylinders    -0.3683695573
## displacement -0.0034034169
## horsepower    0.0043232816
## weight       -0.0009434883
lda_pred_auto2<-predict(lda_fit_auto2,test_auto)$class
table(lda_pred_auto2,test_auto$mpg01)
##               
## lda_pred_auto2  0  1
##              0 35  0
##              1  7 37
1- mean(lda_pred_auto2==test_auto$mpg01)
## [1] 0.08860759

It appears that removing the year and origin variables from our linear discriminant analysis model has no effect on the test error rate. In both linear discriminant models produced, none of the vehicles that actually had above-median mpg were misclassified.

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

qda_fit_auto<-qda(mpg01~cylinders+displacement+horsepower+weight+year+origin,data=train_auto)
qda_fit_auto
## Call:
## qda(mpg01 ~ cylinders + displacement + horsepower + weight + 
##     year + origin, data = train_auto)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4920128 0.5079872 
## 
## Group means:
##   cylinders displacement horsepower   weight     year   origin
## 0  6.753247     269.6558   128.0844 3598.026 74.53247 1.162338
## 1  4.207547     117.5723    79.8239 2350.283 77.64151 1.949686
qda_pred_auto<-predict(qda_fit_auto,test_auto)$class
table(qda_pred_auto,test_auto$mpg01)
##              
## qda_pred_auto  0  1
##             0 38  3
##             1  4 34

The test error rate for the quadratic discriminant analysis model is shown to be .0886 or 8.86%. This is the same result yielded by both of the linear discriminant analysis models produced.

1- mean(qda_pred_auto==test_auto$mpg01)
## [1] 0.08860759
qda_fit_auto2<-qda(mpg01~cylinders+displacement+horsepower+weight,data=train_auto)
qda_fit_auto2
## Call:
## qda(mpg01 ~ cylinders + displacement + horsepower + weight, data = train_auto)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4920128 0.5079872 
## 
## Group means:
##   cylinders displacement horsepower   weight
## 0  6.753247     269.6558   128.0844 3598.026
## 1  4.207547     117.5723    79.8239 2350.283
qda_pred_auto2<-predict(qda_fit_auto2,test_auto)$class
table(qda_pred_auto2,test_auto$mpg01)
##               
## qda_pred_auto2  0  1
##              0 37  2
##              1  5 35

Removing the year and origin variables from our quadratic discriminant analysis model appears to have no effect on the test error rate. This model again yields a test error rate of .0886, or 8.86%.

1-mean(qda_pred_auto2==test_auto$mpg01)
## [1] 0.08860759

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

glm_fits_auto<-glm(mpg01~cylinders+displacement+horsepower+weight+year+origin,data=train_auto)
glm_fits_auto
## 
## Call:  glm(formula = mpg01 ~ cylinders + displacement + horsepower + 
##     weight + year + origin, data = train_auto)
## 
## Coefficients:
##  (Intercept)     cylinders  displacement    horsepower        weight  
##   -0.6085200    -0.0974107    -0.0002698     0.0021332    -0.0002658  
##         year        origin  
##    0.0290618     0.0354871  
## 
## Degrees of Freedom: 312 Total (i.e. Null);  306 Residual
## Null Deviance:       78.23 
## Residual Deviance: 28.47     AIC: 153.9
glm_probs_auto<-predict(glm_fits_auto,test_auto,type="response")
glm_pred_auto<-rep(0,length(test_auto$mpg01))
glm_pred_auto[glm_probs_auto>0.5]=1
table(glm_pred_auto,test_auto$mpg01)
##              
## glm_pred_auto  0  1
##             0 35  0
##             1  7 37

The test error rate yielded by our logistic regression model which includes cylinders, displacement, horsepower, weight, year, and origin is also .0886 or 8.86%.

1-mean(glm_pred_auto==test_auto$mpg01)
## [1] 0.08860759

From the summary generated below, it appears that, based on individual p-values below 0.05, the significant predictors in our model for mpg01 are cylinders, weight, and year.

summary(glm_fits_auto)
## 
## Call:
## glm(formula = mpg01 ~ cylinders + displacement + horsepower + 
##     weight + year + origin, data = train_auto)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.93565  -0.15579   0.06056   0.20257   0.92104  
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6.085e-01  4.260e-01  -1.428  0.15422    
## cylinders    -9.741e-02  3.375e-02  -2.886  0.00418 ** 
## displacement -2.698e-04  7.810e-04  -0.346  0.72995    
## horsepower    2.133e-03  1.117e-03   1.910  0.05709 .  
## weight       -2.658e-04  5.814e-05  -4.572 7.01e-06 ***
## year          2.906e-02  5.152e-03   5.641 3.85e-08 ***
## origin        3.549e-02  2.832e-02   1.253  0.21108    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.0930394)
## 
##     Null deviance: 78.23  on 312  degrees of freedom
## Residual deviance: 28.47  on 306  degrees of freedom
## AIC: 153.88
## 
## Number of Fisher Scoring iterations: 2
glm_fits_auto2<-glm(mpg01~cylinders+weight+year,data=train_auto)
glm_fits_auto2
## 
## Call:  glm(formula = mpg01 ~ cylinders + weight + year, data = train_auto)
## 
## Coefficients:
## (Intercept)    cylinders       weight         year  
##  -0.2383035   -0.0971438   -0.0002398    0.0261125  
## 
## Degrees of Freedom: 312 Total (i.e. Null);  309 Residual
## Null Deviance:       78.23 
## Residual Deviance: 29.13     AIC: 155
glm_probs_auto2<-predict(glm_fits_auto2,test_auto,type="response")
glm_pred_auto2<-rep(0,length(test_auto$mpg01))
glm_pred_auto2[glm_probs_auto2>0.5]=1
table(glm_pred_auto2,test_auto$mpg01)
##               
## glm_pred_auto2  0  1
##              0 36  0
##              1  6 37

By focusing on the cylinders, weight, and year variables in our logistic regression model, I am able to reduce the test error rate to 0.0759 or 7.59%.

1-mean(glm_pred_auto2== test_auto$mpg01)
## [1] 0.07594937
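
One caveat for part (f): the glm() calls above omit family = binomial, so R fits a linear probability model by least squares (note the gaussian dispersion parameter in the summary output) rather than a true logistic regression. A hedged sketch refits the reduced model as an actual logistic regression; the coefficients and p-values will differ, although the 0.5-threshold class predictions may be similar:

glm_logit_auto<-glm(mpg01~cylinders+weight+year,data=train_auto,family=binomial)
logit_probs<-predict(glm_logit_auto,test_auto,type="response")
logit_pred<-ifelse(logit_probs>0.5,1,0)
1-mean(logit_pred==test_auto$mpg01)   # held-out test error of the binomial fit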

(g) Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

library(e1071)
nb_fit_auto<-naiveBayes(mpg01~cylinders+displacement+horsepower+weight+year+origin,data=train_auto)
nb_class_auto<-predict(nb_fit_auto,test_auto$mpg01)
table(nb_class_auto,test_auto$mpg01)
##              
## nb_class_auto  0  1
##             0  0  0
##             1 42 37
1-mean(nb_class_auto==test_auto$mpg01)
## [1] 0.5316456
library(e1071)
nb_fit_auto2<-naiveBayes(mpg01~cylinders+weight+year,data=train_auto)
nb_class_auto2<-predict(nb_fit_auto2,test_auto$mpg01)
table(nb_class_auto2,test_auto$mpg01)
##               
## nb_class_auto2  0  1
##              0  0  0
##              1 42 37
1-mean(nb_class_auto2==test_auto$mpg01)
## [1] 0.5316456

The naive Bayes models yield the highest test error rates of the models produced so far, .5316 or 53.16%, even when including only the cylinders, weight, and year variables.
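
This very high error rate appears to stem from the predict() calls above, which pass test_auto$mpg01 (the response vector) rather than the test data frame; with no matching predictor columns, naiveBayes appears to fall back on the class priors and labels every test car as 1. A hedged sketch of the corrected call, whose error rate should be roughly in line with the LDA and QDA models:

# predict on the test data frame, not on the response vector
nb_class_auto_fixed<-predict(nb_fit_auto,test_auto)
table(nb_class_auto_fixed,test_auto$mpg01)
1-mean(nb_class_auto_fixed==test_auto$mpg01)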

(h) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

When we include the cylinders, weight, and year variables in our KNN models, the lowest test error rate of .1013, or 10.13%, is obtained with k = 1 and k = 5.

library(class)
train_X_auto=train_auto[,c("cylinders","weight","year")]
test_X_auto=test_auto[,c("cylinders","weight","year")]
train_mpg01 =train_auto[['mpg01']]
set.seed(2)
y_test_auto<-ifelse(test_auto$mpg>=22.75,1,0)
knn_pred_auto=knn(train_X_auto,test_X_auto,train_mpg01,k=1)
table(knn_pred_auto,y_test_auto)
##              y_test_auto
## knn_pred_auto  0  1
##             0 39  5
##             1  3 32
1-mean(knn_pred_auto==y_test_auto)
## [1] 0.1012658
knn_pred_auto2=knn(train_X_auto,test_X_auto,train_mpg01,k=5)
table(knn_pred_auto2,y_test_auto)
##               y_test_auto
## knn_pred_auto2  0  1
##              0 38  4
##              1  4 33
1-mean(knn_pred_auto2==y_test_auto)
## [1] 0.1012658

Increasing the value of k to 10 yields an increased test error rate of .1139 or 11.39%.

knn_pred_auto3=knn(train_X_auto,test_X_auto,train_mpg01,k=10)
table(knn_pred_auto3,y_test_auto)
##               y_test_auto
## knn_pred_auto3  0  1
##              0 37  4
##              1  5 33
1-mean(knn_pred_auto3==y_test_auto)
## [1] 0.1139241

Further increasing the value of k to 50 yields an even higher test error rate of .1266, or 12.66%.

knn_pred_auto4=knn(train_X_auto,test_X_auto,train_mpg01,k=50)
table(knn_pred_auto4,y_test_auto)
##               y_test_auto
## knn_pred_auto4  0  1
##              0 36  4
##              1  6 33
1-mean(knn_pred_auto4==y_test_auto)
## [1] 0.1265823
detach(Auto)
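
One more KNN consideration for part (h): the Euclidean distances here are dominated by weight, which is on the order of thousands while cylinders ranges only from 3 to 8. A hedged sketch standardizes the predictors (using the training means and standard deviations) and sweeps several values of K; the new object names are illustrative:

set.seed(2)
train_X_std<-scale(train_X_auto)
test_X_std<-scale(test_X_auto,center=attr(train_X_std,"scaled:center"),
                  scale=attr(train_X_std,"scaled:scale"))
sapply(c(1,5,10,25,50),function(k){
  pred<-knn(train_X_std,test_X_std,train_mpg01,k=k)
  1-mean(pred==test_auto$mpg01)   # held-out test error at each k
})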

Problem 16

Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.

library(MASS)
attach(Boston)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Create a binary variable, crim01, that contains a 1 if crim contains a value above its median, and a 0 if crim contains a value below its median.

crim01<-ifelse(crim>median(crim),yes=1,no=0)
Boston<-data.frame(Boston,crim01)
str(Boston)
## 'data.frame':    506 obs. of  15 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
##  $ crim01 : num  0 0 0 0 0 0 0 0 0 0 ...

By visualizing the data graphically, we can determine which variables to include in our model based on what appear to be strong relationships with our crim01 variable.

pairs(Boston)

cor(Boston)
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
## crim01   0.40939545 -0.43615103  0.60326017  0.070096774  0.72323480
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
## crim01  -0.15637178  0.61393992 -0.61634164  0.619786249  0.60874128  0.2535684
##               black      lstat       medv      crim01
## crim    -0.38506394  0.4556215 -0.3883046  0.40939545
## zn       0.17552032 -0.4129946  0.3604453 -0.43615103
## indus   -0.35697654  0.6037997 -0.4837252  0.60326017
## chas     0.04878848 -0.0539293  0.1752602  0.07009677
## nox     -0.38005064  0.5908789 -0.4273208  0.72323480
## rm       0.12806864 -0.6138083  0.6953599 -0.15637178
## age     -0.27353398  0.6023385 -0.3769546  0.61393992
## dis      0.29151167 -0.4969958  0.2499287 -0.61634164
## rad     -0.44441282  0.4886763 -0.3816262  0.61978625
## tax     -0.44180801  0.5439934 -0.4685359  0.60874128
## ptratio -0.17738330  0.3740443 -0.5077867  0.25356836
## black    1.00000000 -0.3660869  0.3334608 -0.35121093
## lstat   -0.36608690  1.0000000 -0.7376627  0.45326273
## medv     0.33346082 -0.7376627  1.0000000 -0.26301673
## crim01  -0.35121093  0.4532627 -0.2630167  1.00000000

It appears that the indus, nox, rad, tax, and lstat variables all have relatively strong correlations with crim01. The distributions of these predictors for suburbs with above- and below-median crime rates, shown in the box plots below, make the relationship even more apparent.
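
Before turning to the box plots, a quick numerical check sorts the absolute correlations of every variable with crim01 (a minimal sketch):

# besides crim itself, nox, rad, dis, age, tax and indus show the largest absolute correlations
round(sort(abs(cor(Boston)[,"crim01"]),decreasing=TRUE),2)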

par(mfrow=c(2,3))
plot(factor(crim01),indus,ylab="Proportion of Non-retail Business Acres per Town")
plot(factor(crim01),nox,ylab="Concentration of Nitrogen Oxides (Parts per 10 Million)")
plot(factor(crim01),rad,ylab="Accessibility to Radial Highways")
plot(factor(crim01),tax,ylab="Property Tax Rate (per $10,000)")
plot(factor(crim01),lstat,ylab="Lower Status of Population (Percent)")
mtext("Box Plots for Suburbs With Above (1) and Below (0) Median Crime",outer=TRUE,line=-1)

Split the data into a training set and a test set

set.seed(3)
idx<-sample(1:nrow(Boston),0.8*nrow(Boston))
train_Boston<-Boston[idx, ]
test_Boston<-Boston[-idx, ]
y_train<-Boston[idx,]$crim
str(train_Boston)
## 'data.frame':    404 obs. of  15 variables:
##  $ crim   : num  0.5401 0.0605 0.5445 5.6917 0.0642 ...
##  $ zn     : num  20 0 0 0 0 0 0 0 0 0 ...
##  $ indus  : num  3.97 2.46 21.89 18.1 5.96 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.647 0.488 0.624 0.583 0.499 0.573 0.583 0.77 0.614 0.7 ...
##  $ rm     : num  7.2 6.15 6.15 6.11 5.93 ...
##  $ age    : num  81.8 68.8 97.9 79.8 68.2 89.3 53.2 96.2 96.7 82.5 ...
##  $ dis    : num  2.11 3.28 1.67 3.55 3.36 ...
##  $ rad    : int  5 3 4 24 5 1 24 24 24 24 ...
##  $ tax    : num  264 193 437 666 279 273 666 666 666 666 ...
##  $ ptratio: num  13 17.8 21.2 20.2 19.2 21 20.2 20.2 20.2 20.2 ...
##  $ black  : num  393 387 397 393 397 ...
##  $ lstat  : num  9.59 13.15 18.46 14.98 9.68 ...
##  $ medv   : num  33.8 29.6 17.8 19.1 18.9 22 20.6 20.8 14.6 23.2 ...
##  $ crim01 : num  1 0 1 1 0 0 1 1 1 1 ...

Logistic Regression

glm_fits_Boston<-glm(crim01~indus+nox+rad+tax+lstat,data=train_Boston)
glm_fits_Boston
## 
## Call:  glm(formula = crim01 ~ indus + nox + rad + tax + lstat, data = train_Boston)
## 
## Coefficients:
## (Intercept)        indus          nox          rad          tax        lstat  
##  -0.8129302    0.0108806    2.2817649    0.0246078   -0.0006513   -0.0035658  
## 
## Degrees of Freedom: 403 Total (i.e. Null);  398 Residual
## Null Deviance:       101 
## Residual Deviance: 41.76     AIC: 243.6
glm_probs_Boston<-predict(glm_fits_Boston,test_Boston,type="response")
glm_pred_Boston<-rep(0,length(test_Boston$crim01))
glm_pred_Boston[glm_probs_Boston>0.5]=1
table(glm_pred_Boston,test_Boston$crim01)
##                
## glm_pred_Boston  0  1
##               0 52 17
##               1  2 31

A logistic regression model that includes the indus, nox, rad, tax, and lstat variables yields a test error rate of .1863, or 18.63%.

1-mean(glm_pred_Boston==test_Boston$crim01)
## [1] 0.1862745

From the summary produced below, we can see that the indus, nox, rad, and tax coefficients are all statistically significant as predictors of whether a suburb’s crime rate is above or below the median, while lstat is not.

summary(glm_fits_Boston)
## 
## Call:
## glm(formula = crim01 ~ indus + nox + rad + tax + lstat, data = train_Boston)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.66774  -0.20179  -0.09736   0.16887   0.84107  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.8129302  0.1159436  -7.011 1.02e-11 ***
## indus        0.0108806  0.0043789   2.485   0.0134 *  
## nox          2.2817649  0.2442468   9.342  < 2e-16 ***
## rad          0.0246078  0.0046468   5.296 1.97e-07 ***
## tax         -0.0006513  0.0002771  -2.351   0.0192 *  
## lstat       -0.0035658  0.0030003  -1.188   0.2353    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1049152)
## 
##     Null deviance: 100.978  on 403  degrees of freedom
## Residual deviance:  41.756  on 398  degrees of freedom
## AIC: 243.6
## 
## Number of Fisher Scoring iterations: 2

However, a logistic regression model that includes only the variables identified as significant still yields the same test error rate as the original model, 18.63%.

glm_fits_Boston2<-glm(crim01~indus+nox+rad+tax,data=train_Boston)
glm_fits_Boston2
## 
## Call:  glm(formula = crim01 ~ indus + nox + rad + tax, data = train_Boston)
## 
## Coefficients:
## (Intercept)        indus          nox          rad          tax  
##   -0.803658     0.009883     2.221905     0.024460    -0.000672  
## 
## Degrees of Freedom: 403 Total (i.e. Null);  399 Residual
## Null Deviance:       101 
## Residual Deviance: 41.9  AIC: 243
glm_probs_Boston2<-predict(glm_fits_Boston2,test_Boston,type="response")
glm_pred_Boston2<-rep(0,length(test_Boston$crim01))
glm_pred_Boston2[glm_probs_Boston2>0.5]=1
table(glm_pred_Boston2,test_Boston$crim01)
##                 
## glm_pred_Boston2  0  1
##                0 52 17
##                1  2 31
1-mean(glm_pred_Boston2==test_Boston$crim01)
## [1] 0.1862745

I am able to reduce the test error rate of the logistic regression model to .1667 or 16.67% by excluding the indus and tax variables from the original logistic regression model produced.

glm_fits_Boston3<-glm(crim01~nox+rad+lstat,data=train_Boston)
glm_fits_Boston3
## 
## Call:  glm(formula = crim01 ~ nox + rad + lstat, data = train_Boston)
## 
## Coefficients:
## (Intercept)          nox          rad        lstat  
##   -1.010865     2.507217     0.016072    -0.002826  
## 
## Degrees of Freedom: 403 Total (i.e. Null);  400 Residual
## Null Deviance:       101 
## Residual Deviance: 42.61     AIC: 247.8
glm_probs_Boston3<-predict(glm_fits_Boston3,test_Boston,type="response")
glm_pred_Boston3<-rep(0,length(test_Boston$crim01))
glm_pred_Boston3[glm_probs_Boston3>0.5]=1
table(glm_pred_Boston3,test_Boston$crim01)
##                 
## glm_pred_Boston3  0  1
##                0 52 15
##                1  2 33
1-mean(glm_pred_Boston3==test_Boston$crim01)
## [1] 0.1666667
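
As in Problem 14, these glm() fits omit family = binomial and are therefore linear probability models (the summary above reports a gaussian dispersion parameter). A hedged sketch refits the best-performing subset as a true logistic regression:

glm_logit_Boston<-glm(crim01~nox+rad+lstat,data=train_Boston,family=binomial)
logit_probs_Boston<-predict(glm_logit_Boston,test_Boston,type="response")
logit_pred_Boston<-ifelse(logit_probs_Boston>0.5,1,0)
1-mean(logit_pred_Boston==test_Boston$crim01)   # held-out test error of the binomial fit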

Linear Discriminant Analysis (LDA)

The linear discriminant analysis models produced below mirror the test error results of our logistic regression models. The initial linear discriminant analysis model, which includes the indus, nox, rad, tax, and lstat variables, yields a test error rate of .1863, or 18.63%.

library(MASS)
lda_fit_Boston<-lda(crim01~indus+nox+rad+tax+lstat,data=train_Boston)
lda_fit_Boston
## Call:
## lda(crim01 ~ indus + nox + rad + tax + lstat, data = train_Boston)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4925743 0.5074257 
## 
## Group means:
##       indus       nox       rad      tax     lstat
## 0  6.946432 0.4724869  4.170854 306.9196  9.438693
## 1 15.441756 0.6386829 15.341463 517.4098 15.877707
## 
## Coefficients of linear discriminants:
##                LD1
## indus  0.044083539
## nox    9.244775874
## rad    0.099700634
## tax   -0.002638889
## lstat -0.014447233
lda_pred_Boston<-predict(lda_fit_Boston,test_Boston)$class
table(lda_pred_Boston,test_Boston$crim01)
##                
## lda_pred_Boston  0  1
##               0 52 17
##               1  2 31
1- mean(lda_pred_Boston==test_Boston$crim01)
## [1] 0.1862745

By excluding the indus and tax variables, I am able to reduce the test error rate of the linear discriminant analysis model to .1667 or 16.67%.

library(MASS)
lda_fit_Boston2<-lda(crim01~nox+rad+lstat,data=train_Boston)
lda_fit_Boston2
## Call:
## lda(crim01 ~ nox + rad + lstat, data = train_Boston)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4925743 0.5074257 
## 
## Group means:
##         nox       rad     lstat
## 0 0.4724869  4.170854  9.438693
## 1 0.6386829 15.341463 15.877707
## 
## Coefficients of linear discriminants:
##               LD1
## nox   10.12921220
## rad    0.06493179
## lstat -0.01141820
lda_pred_Boston2<-predict(lda_fit_Boston2,test_Boston)$class
table(lda_pred_Boston2,test_Boston$crim01)
##                 
## lda_pred_Boston2  0  1
##                0 52 15
##                1  2 33
1- mean(lda_pred_Boston2==test_Boston$crim01)
## [1] 0.1666667

K-Nearest Neighbors (KNN)

library(class)
train_X_Boston=train_Boston[,c("nox","rad","lstat")]
test_X_Boston=test_Boston[,c("nox","rad","lstat")]
train_crim01=train_Boston[['crim01']]
set.seed(4)
y_test_Boston<-ifelse(test_Boston$crim>=25.65,1,0)
knn_pred_Boston=knn(train_X_Boston,test_X_Boston,train_crim01,k=1)
table(knn_pred_Boston,y_test_Boston)
##                y_test_Boston
## knn_pred_Boston  0  1
##               0 58  0
##               1 42  2
1-mean(knn_pred_Boston==y_test_Boston)
## [1] 0.4117647
knn_pred_Boston2=knn(train_X_Boston,test_X_Boston,train_crim01,k=50)
table(knn_pred_Boston2,y_test_Boston)
##                 y_test_Boston
## knn_pred_Boston2  0  1
##                0 75  0
##                1 25  2
1-mean(knn_pred_Boston2==y_test_Boston)
## [1] 0.245098

By increasing our value of k to 100 for our KNN model which includes the nox, rad, and lstat variables, we are able to match the test error rate of our least accurate logistic regression and linear discriminant analysis models.

knn_pred_Boston3=knn(train_X_Boston,test_X_Boston,train_crim01,k=100)
table(knn_pred_Boston3,y_test_Boston)
##                 y_test_Boston
## knn_pred_Boston3  0  1
##                0 81  0
##                1 19  2
1-mean(knn_pred_Boston3==y_test_Boston)
## [1] 0.1862745
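
A caution about these KNN error rates: y_test_Boston is constructed with the threshold 25.65, but the median of crim is roughly 0.2565, so the threshold looks like a misplaced decimal point and only two test suburbs end up labeled 1. As a result, the KNN numbers above are not directly comparable to the earlier models. A hedged sketch re-evaluates the same KNN fits against the crim01 labels already stored in the test set:

set.seed(4)
sapply(c(1,50,100),function(k){
  pred<-knn(train_X_Boston,test_X_Boston,train_crim01,k=k)
  1-mean(pred==test_Boston$crim01)   # held-out test error using the stored labels
})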

From the classification models tested, it appears that the logistic regression and linear discriminant analysis models that include only the nox, rad, and lstat variables yield the lowest test error rate, 16.67%.

detach(Boston)