This question should be answered using the Weekly data
set, which is part of the ISLR2 package. This data is
similar in nature to the Smarket data from this chapter’s
lab, except that it contains 1,089 weekly returns for 21 years, from the
beginning of 1990 to the end of 2010.
library(ISLR2)
attach(Weekly)
(a) Produce some numerical and graphical summaries of the
Weekly data. Do there appear to be any
patterns?
names(Weekly)
## [1] "Year" "Lag1" "Lag2" "Lag3" "Lag4" "Lag5"
## [7] "Volume" "Today" "Direction"
dim(Weekly)
## [1] 1089 9
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
str(Weekly)
## 'data.frame': 1089 obs. of 9 variables:
## $ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
## $ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
## $ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
## $ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
## $ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
## $ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
## $ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
## $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
pairs(Weekly)
cor(Weekly[,-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
From the matrix of pairwise correlations among the predictors in
the Weekly data set, we can see that the lag variables
reflecting previous weeks’ returns have essentially no correlation with this
week’s return (Today). The only substantial correlation is
between Year and Volume (approximately 0.84). As the
plot below shows, Volume, which reflects the average number of
shares traded daily (in billions), increases over time.
plot(Volume)
(b) Use the full data set to perform a logistic regression
with Direction as the response and the five lag variables
plus Volume as predictors. Use the summary function to
print the results. Do any of the predictors appear to be statistically
significant? If so, which ones?
glm_fits<-glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Weekly,family=binomial)
summary(glm_fits)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
From the summary above we can see that the Lag2
predictor, which measures the percentage return from two weeks prior, appears to
be statistically significant: its p-value of 0.0296 is below the 0.05
threshold. None of the other predictors, including Volume, reach significance.
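To make the Lag2 estimate easier to interpret, we can put it on the odds scale. The lines below are a hedged sketch added for illustration (they are not part of the original output); glm_fits is the model fit above.
# Sketch: exp(0.05844) is roughly 1.06, so a one-unit increase in Lag2 is associated
# with about a 6% increase in the odds of an Up week, holding the other predictors fixed.
exp(coef(glm_fits)["Lag2"])
exp(confint.default(glm_fits,"Lag2"))   # Wald interval on the odds-ratio scale (not shown)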
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
glm_probs<-predict(glm_fits,type='response')
glm_pred<-rep("Down",1089)
glm_pred[glm_probs>0.5]="Up"
table(glm_pred,Direction)
## Direction
## glm_pred Down Up
## Down 54 48
## Up 430 557
mean(glm_pred==Direction)
## [1] 0.5610652
From the confusion matrix we can calculate the fraction of weeks for which our logistic model correctly predicted the market's direction as \(\frac{(557+54)}{1089} = .5611\): the model correctly predicted that the market would go up on 557 weeks and that it would go down on 54 weeks, for a total of 611 correct predictions, or 56.11% overall. The matrix also shows the types of mistakes the model makes. It predicts "Up" for the vast majority of weeks, so when the market actually went up it was right on 557 of 605 weeks (92.1%), but when the market actually went down it was right on only 54 of 484 weeks (11.2%), misclassifying 430 down weeks as up.
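These class-specific rates can be computed directly from the confusion matrix; the short sketch below is added for illustration (conf_mat is a name introduced here, and the output is not shown).
conf_mat<-table(glm_pred,Direction)
conf_mat["Up","Up"]/sum(conf_mat[,"Up"])       # fraction of Up weeks predicted correctly (557/605)
conf_mat["Down","Down"]/sum(conf_mat[,"Down"]) # fraction of Down weeks predicted correctly (54/484)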
(d) Now fit the logistic regression model using a training
data period from 1990 to 2008, with Lag2 as the only
predictor. Compute the confusion matrix and the overall fraction of
correct predictions for the held out data (that is, the data from 2009
and 2010).
train<-(Year<2009)
Weekly_2009<-Weekly[!train,]
Direction_2009<-Direction[!train]
glm_fits<-glm(Direction~Lag2,data=Weekly,family=binomial,subset=train)
glm_probs<-predict(glm_fits,Weekly_2009,type="response")
glm_pred=rep('Down',104)
glm_pred[glm_probs>.5]='Up'
table(glm_pred,Direction_2009)
## Direction_2009
## glm_pred Down Up
## Down 9 5
## Up 34 56
From the confusion matrix above we can calculate the overall fraction
of correct predictions for the held-out data from 2009 and 2010 as \(\frac{(9+56)}{104} = .625\), or 62.5%. This
is an improvement over the 56.11% accuracy of our first logistic regression,
which included all of the lag variables as well as Volume,
although that earlier figure was a training accuracy computed on the same data used to fit the model.
mean(glm_pred==Direction_2009)
## [1] 0.625
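Parts (e) through (h) repeat this same held-out evaluation, so a small helper can cut down on the copy-and-paste. The following is a hedged sketch added for illustration; test_accuracy is a name introduced here and is not used elsewhere in this write-up.
test_accuracy<-function(pred,truth=Direction_2009){
  print(table(pred,truth))   # confusion matrix on the 2009-2010 data
  mean(pred==truth)          # overall fraction of correct predictions
}
test_accuracy(glm_pred)      # reproduces the 0.625 reported above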
(e) Repeat (d) using LDA.
library(MASS)
lda_fit<-lda(Direction~Lag2,data=Weekly,subset=train)
lda_pred<-predict(lda_fit,Weekly_2009)
lda_class<-lda_pred$class
table(lda_class,Direction_2009)
## Direction_2009
## lda_class Down Up
## Down 9 5
## Up 34 56
mean(lda_class==Direction_2009)
## [1] 0.625
The model produced using linear discriminant analysis yields results identical to those of the logistic regression model on the held-out data: the confusion matrix is the same, and the correct prediction rate is again .625, or 62.5%.
(f) Repeat (d) using QDA.
qda_fit<-qda(Direction~Lag2,data=Weekly,subset=train)
qda_class<-predict(qda_fit,Weekly_2009)$class
table(qda_class,Direction_2009)
## Direction_2009
## qda_class Down Up
## Down 0 0
## Up 43 61
mean(qda_class==Direction_2009)
## [1] 0.5865385
The model produced using quadratic discriminant analysis predicts "Up" for every week in the held-out period, which yields a lower correct prediction rate of .5865, or 58.65%.
(g) Repeat (d) using KNN with K = 1.
library(class)
train_x<-as.matrix(Lag2[train])
test_x<-as.matrix(Lag2[!train])
train_Direction<-Direction[train]
set.seed(1)
knn_pred<-knn(train_x,test_x,train_Direction,k=1)
table(knn_pred,Direction_2009)
## Direction_2009
## knn_pred Down Up
## Down 21 30
## Up 22 31
mean(knn_pred==Direction_2009)
## [1] 0.5
The KNN model with K = 1 yields an accuracy of 0.5, or 50%, which is no better than random guessing.
(h) Repeat (d) using naive Bayes.
library(e1071)
nb_fit<-naiveBayes(Direction~Lag2,data=Weekly,subset=train)
nb_class<-predict(nb_fit,Weekly_2009)
table(nb_class,Direction_2009)
## Direction_2009
## nb_class Down Up
## Down 0 0
## Up 43 61
mean(nb_class==Direction_2009)
## [1] 0.5865385
Our naive Bayes model produces the same results as our quadratic discriminant analysis model, predicting "Up" for every week and yielding a correct prediction rate of .5865, or 58.65%.
(i) Which of these methods appears to provide the best
results on this data?
It appears that both the logistic regression and linear discriminant
analysis models are the most accurate, with correct prediction rates of
.625, or 62.5%.
(j) Experiment with different combinations of predictors,
including possible transformations and interactions, for each of the
methods. Report the variables, method, and associated confusion matrix
that appears to provide the best results on the held out data. Note that
you should also experiment with values for K in the KNN
classifier.
By adding the Lag1 variable alongside Lag2 in our logistic regression,
we decrease the held-out accuracy of the model to .5769, or 57.69%.
glm_fits2<-glm(Direction~Lag1+Lag2,data=Weekly,family=binomial,subset=train)
glm_probs2<-predict(glm_fits2,Weekly_2009,type="response")
glm_pred2=rep('Down',104)
glm_pred2[glm_probs2>.5]='Up'
table(glm_pred2,Direction_2009)
## Direction_2009
## glm_pred2 Down Up
## Down 7 8
## Up 36 53
mean(glm_pred2==Direction_2009)
## [1] 0.5769231
Including the Lag1 variable in our linear discriminant
analysis model also reduces the accuracy of the predictions to 0.5769, or
57.69%.
library(MASS)
lda_fit2<-lda(Direction~Lag1+Lag2,data=Weekly,subset=train)
lda_pred2<-predict(lda_fit2,Weekly_2009)
lda_class2<-lda_pred2$class
table(lda_class2,Direction_2009)
## Direction_2009
## lda_class2 Down Up
## Down 7 8
## Up 36 53
mean(lda_class2==Direction_2009)
## [1] 0.5769231
This logistic model includes only the interaction term between the
Lag2 and Volume variables (no main effects) and yields more
accurate predictions than the previous two models, which used the
Lag1 and Lag2 predictors. The accuracy
of this model is .6154, or 61.54%. It is still less accurate, however,
than the original logistic model that included only
the Lag2 variable.
glm_fits3<-glm(Direction~Lag2:Volume,data=Weekly,family=binomial,subset=train)
glm_probs3<-predict(glm_fits3,Weekly_2009,type="response")
glm_pred3=rep('Down',104)
glm_pred3[glm_probs3>.5]='Up'
table(glm_pred3,Direction_2009)
## Direction_2009
## glm_pred3 Down Up
## Down 9 6
## Up 34 55
mean(glm_pred3==Direction_2009)
## [1] 0.6153846
Including the interaction between the Lag2 and
Volume predictors in the linear discriminant analysis model
yields a reduced accuracy of .6058, or 60.58%, compared to the
original linear discriminant analysis model's accuracy of 62.5%.
It also yields slightly less accurate predictions than the logistic model
with the same interaction between Lag2 and
Volume. While this model is less accurate than the original
linear discriminant analysis model, it does improve on the
LDA model that included only the Lag1
and Lag2 predictors.
library(MASS)
lda_fit3<-lda(Direction~Lag2:Volume,data=Weekly,subset=train)
lda_pred3<-predict(lda_fit3,Weekly_2009)
lda_class3<-lda_pred3$class
table(lda_class3,Direction_2009)
## Direction_2009
## lda_class3 Down Up
## Down 8 6
## Up 35 55
mean(lda_class3==Direction_2009)
## [1] 0.6057692
By increasing the value of k in our KNN model from 1 to 3, we are able to improve the accuracy of the model to .5481, or 54.81%.
library(class)
train_x<-as.matrix(Lag2[train])
test_x<-as.matrix(Lag2[!train])
train_Direction<-Direction[train]
set.seed(1)
knn_pred2<-knn(train_x,test_x,train_Direction,k=3)
table(knn_pred2,Direction_2009)
## Direction_2009
## knn_pred2 Down Up
## Down 16 20
## Up 27 41
mean(knn_pred2==Direction_2009)
## [1] 0.5480769
By further increasing the value of k in our KNN model to 200, we improve the correct prediction rate to .6058, or 60.58%, although at this point the model predicts "Down" for only two weeks in the held-out period.
library(class)
train_x<-as.matrix(Lag2[train])
test_x<-as.matrix(Lag2[!train])
train_Direction<-Direction[train]
set.seed(1)
knn_pred3<-knn(train_x,test_x,train_Direction,k=200)
table(knn_pred3,Direction_2009)
## Direction_2009
## knn_pred3 Down Up
## Down 2 0
## Up 41 61
mean(knn_pred3==Direction_2009)
## [1] 0.6057692
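Rather than trying k = 1, 3, and 200 by hand, we could also sweep a grid of k values and record the held-out accuracy for each. This is a hedged sketch added for illustration; k_grid and knn_acc are names introduced here, and the resulting accuracies are not shown.
set.seed(1)
k_grid<-c(1,3,5,10,25,50,100,200)
knn_acc<-sapply(k_grid,function(k){
  mean(knn(train_x,test_x,train_Direction,k=k)==Direction_2009)
})
data.frame(k=k_grid,accuracy=knn_acc)   # held-out accuracy for each candidate k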
From the experiments with different predictors and interactions, I
find that the original logistic regression and linear discriminant
analysis models, each using only Lag2, remain the most accurate,
with correct prediction rates of 62.5%. The logistic model built on the
interaction between the Lag2 and Volume predictors was the
only experiment that came close to this level of accuracy, with a
correct prediction rate of 61.54%.
detach(Weekly)
In this problem, you will develop a model to predict whether
a given car gets high or low gas mileage based on the Auto
data set.
library(ISLR)
attach(Auto)
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
(a) Create a binary variable, mpg01, that
contains a 1 if mpg contains a value above its median, and
a 0 if mpg contains a value below its median. You can
compute the median using the median() function. Note you
may find it helpful to use the data.frame() function to
create a single data set containing both mpg01 and the
other Auto variables.
mpg01<-ifelse(mpg>median(mpg),yes=1,no=0)
Auto<-data.frame(Auto,mpg01)
str(Auto)
## 'data.frame': 392 obs. of 10 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## $ mpg01 : num 0 0 0 0 0 0 0 0 0 0 ...
(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatter plots and box plots may be useful tools to answer this question. Describe your findings.
cor(Auto[,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566
## acceleration year origin mpg01
## mpg 0.4233285 0.5805410 0.5652088 0.8369392
## cylinders -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration 1.0000000 0.2903161 0.2127458 0.3468215
## year 0.2903161 1.0000000 0.1815277 0.4299042
## origin 0.2127458 0.1815277 1.0000000 0.5136984
## mpg01 0.3468215 0.4299042 0.5136984 1.0000000
pairs(Auto)
We can learn more about the available data by creating box plots that contrast the distribution of quantitative variables between cars with above-median mpg (1) and below-median mpg (0).
par(mfrow=c(2,3))
plot(factor(mpg01),cylinders,ylab= "Number of Engine Cylinders")
plot(factor(mpg01),displacement,ylab="Engine Displacement (Cubic Inches)")
plot(factor(mpg01),horsepower,ylab="Horsepower")
plot(factor(mpg01),weight,ylab="Weight (Pounds)")
plot(factor(mpg01),acceleration,ylab="Time to Reach 60MPH (Seconds)")
plot(factor(mpg01),year,ylab="Manufacture Year")
mtext("Box Plots for Cars With Above (1) and Below (0) Median MPG",outer=TRUE,line=-1)
From the box plots generated we can see that the majority of
vehicles in the data set with above-median mpg have smaller
engines, less horsepower, and lower weight. It is also clear that
vehicles with above-median mpg mostly have four-cylinder
engines. The box plots for acceleration and manufacture year do show
some separation between vehicles with above-median and below-median mpg,
but it is not as pronounced as for the first four predictors mentioned.
Because their box plots overlap the least, the cylinders,
displacement, horsepower, and weight
variables appear to be the most useful for predicting whether a vehicle
in this data set has above- or below-median mpg.
We can take a closer look at the useful variables that we identified
from the box plots by plotting them against mpg01 in
scatter plots.
par(mfrow = c(3, 2))
plot(cylinders,mpg01,xlab ="Number of Engine Cylinders")
plot(displacement,mpg01,xlab ="Engine Displacement (Cubic Inches)")
plot(horsepower,mpg01,xlab ="Horsepower")
plot(weight,mpg01,xlab ="Weight (Pounds)")
mtext("Scatterplots For Cars With Above(1) and Below(0) Median MPG",outer=TRUE,line=-3)
From the scatter plots generated, we can see that the
horsepower and weight variables continue to
appear useful for predicting mpg01. The scatter plot for
horsepower shows that vehicles classified as above-median mpg are
concentrated at lower horsepower values. We see a similar pattern in the
weight scatter plot, which suggests that lighter vehicles in
the data set are more likely to be classified as having above-median mpg.
The scatter plots of number of engine cylinders and engine displacement
against mpg01 do not, on their own, provide as much useful
insight for predicting mpg01.
To see these relationships in another light, the scatter plots below
plot these variables against the original mpg variable, with a
horizontal line marking the median mpg. From these plots we can more
clearly see the correlations between the predictors and mpg. For
example, vehicles in the data set with smaller engines and fewer
cylinders tend to have above-median mpg.
par(mfrow=c(2,2))
plot(horsepower,mpg,xlab="Horsepower",ylab="MPG")
abline(h=median(mpg),lwd=2,col="green")
plot(weight,mpg,xlab="Weight (Pounds)",ylab="MPG")
abline(h=median(mpg),lwd=2,col="green")
plot(displacement,mpg,xlab="Engine Displacement (Cubic Inches)",ylab="MPG")
abline(h=median(mpg),lwd=2,col="green")
plot(cylinders,mpg,xlab="Number of Engine Cylinders",ylab="MPG")
abline(h=median(mpg),lwd=2,col="green")
We can also see from the plots below that interesting relationships
exist between the year and origin variables
when plotted with mpg.
par(mfrow=c(1, 2))
plot(year,mpg,xlab="Year",ylab="MPG")
abline(h=median(mpg),lwd=2,col="green")
origin[origin==1]="American"
origin[origin==2]="European"
origin[origin==3]="Japanese"
origin=as.factor(origin)
plot(origin,mpg,xlab="Origin",ylab="MPG")
abline(h=median(mpg),lwd=2,col="green")
From the scatter plot of mpg vs. year, we can see that newer vehicles in our data set tend to have above-median mpg. The box plots of mpg by vehicle origin also show that American-made vehicles tend to have below-median mpg.
(c) Split the data into a training set and a test set.
set.seed(1)
idx<-sample(1:nrow(Auto),0.8*nrow(Auto))
train_auto<-Auto[idx, ]
test_auto<-Auto[-idx, ]
y_train<-Auto[idx,]$mpg
str(train_auto)
## 'data.frame': 313 obs. of 10 variables:
## $ mpg : num 44.3 23 26 23.9 23.2 16 33.5 13 31.5 17.6 ...
## $ cylinders : num 4 4 4 8 4 8 4 8 4 6 ...
## $ displacement: num 90 140 122 260 156 318 151 350 89 225 ...
## $ horsepower : num 48 83 80 90 105 150 90 175 71 85 ...
## $ weight : num 2085 2639 2451 3420 2745 ...
## $ acceleration: num 21.7 17 16.5 22.2 16.7 13 13.2 13 14.9 16.6 ...
## $ year : num 80 75 74 79 78 76 79 73 78 81 ...
## $ origin : num 2 1 1 1 1 1 1 1 2 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 303 156 156 201 230 108 248 25 291 72 ...
## $ mpg01 : num 1 1 1 1 1 0 1 0 1 0 ...
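Because the 80/20 split is random rather than stratified, it is worth confirming that both classes of mpg01 are reasonably represented in the training and test sets. The two lines below are a hedged sketch added as a sanity check; the counts are not shown here.
table(train_auto$mpg01)   # class balance in the training set
table(test_auto$mpg01)    # class balance in the test set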
(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
From the understanding we have gathered of the variables so far, I
will generate a linear discriminant analysis model which includes the
cylinders, displacement,
horsepower, weight, year, and
origin variables.
library(MASS)
lda_fit_auto<-lda(mpg01~cylinders+displacement+horsepower+weight+year+origin,data=train_auto)
lda_fit_auto
## Call:
## lda(mpg01 ~ cylinders + displacement + horsepower + weight +
## year + origin, data = train_auto)
##
## Prior probabilities of groups:
## 0 1
## 0.4920128 0.5079872
##
## Group means:
## cylinders displacement horsepower weight year origin
## 0 6.753247 269.6558 128.0844 3598.026 74.53247 1.162338
## 1 4.207547 117.5723 79.8239 2350.283 77.64151 1.949686
##
## Coefficients of linear discriminants:
## LD1
## cylinders -0.403682186
## displacement -0.001118231
## horsepower 0.008840042
## weight -0.001101703
## year 0.120435532
## origin 0.147062729
lda_pred_auto<-predict(lda_fit_auto,test_auto)$class
table(lda_pred_auto,test_auto$mpg01)
##
## lda_pred_auto 0 1
## 0 35 0
## 1 7 37
The test error rate for this model is shown to be .0886, or 8.86%.
1- mean(lda_pred_auto==test_auto$mpg01)
## [1] 0.08860759
library(MASS)
lda_fit_auto2<-lda(mpg01~cylinders+displacement+horsepower+weight,data=train_auto)
lda_fit_auto2
## Call:
## lda(mpg01 ~ cylinders + displacement + horsepower + weight, data = train_auto)
##
## Prior probabilities of groups:
## 0 1
## 0.4920128 0.5079872
##
## Group means:
## cylinders displacement horsepower weight
## 0 6.753247 269.6558 128.0844 3598.026
## 1 4.207547 117.5723 79.8239 2350.283
##
## Coefficients of linear discriminants:
## LD1
## cylinders -0.3683695573
## displacement -0.0034034169
## horsepower 0.0043232816
## weight -0.0009434883
lda_pred_auto2<-predict(lda_fit_auto2,test_auto)$class
table(lda_pred_auto2,test_auto$mpg01)
##
## lda_pred_auto2 0 1
## 0 35 0
## 1 7 37
1- mean(lda_pred_auto2==test_auto$mpg01)
## [1] 0.08860759
It appears that removing the year and
origin variables from our linear discriminant analysis
model has no effect on the test error rate. In both linear discriminant
models produced, none of the vehicles that actually had above-median mpg
were misclassified.
(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
qda_fit_auto<-qda(mpg01~cylinders+displacement+horsepower+weight+year+origin,data=train_auto)
qda_fit_auto
## Call:
## qda(mpg01 ~ cylinders + displacement + horsepower + weight +
## year + origin, data = train_auto)
##
## Prior probabilities of groups:
## 0 1
## 0.4920128 0.5079872
##
## Group means:
## cylinders displacement horsepower weight year origin
## 0 6.753247 269.6558 128.0844 3598.026 74.53247 1.162338
## 1 4.207547 117.5723 79.8239 2350.283 77.64151 1.949686
qda_pred_auto<-predict(qda_fit_auto,test_auto)$class
table(qda_pred_auto,test_auto$mpg01)
##
## qda_pred_auto 0 1
## 0 38 3
## 1 4 34
The test error rate for the quadratic discriminant analysis model is also .0886, or 8.86%, the same rate yielded by both linear discriminant analysis models, although the pattern of errors differs: this model misclassifies three vehicles that actually have above-median mpg.
1- mean(qda_pred_auto==test_auto$mpg01)
## [1] 0.08860759
qda_fit_auto2<-qda(mpg01~cylinders+displacement+horsepower+weight,data=train_auto)
qda_fit_auto2
## Call:
## qda(mpg01 ~ cylinders + displacement + horsepower + weight, data = train_auto)
##
## Prior probabilities of groups:
## 0 1
## 0.4920128 0.5079872
##
## Group means:
## cylinders displacement horsepower weight
## 0 6.753247 269.6558 128.0844 3598.026
## 1 4.207547 117.5723 79.8239 2350.283
qda_pred_auto2<-predict(qda_fit_auto2,test_auto)$class
table(qda_pred_auto2,test_auto$mpg01)
##
## qda_pred_auto2 0 1
## 0 37 2
## 1 5 35
Removing the year and origin variables from
our quadratic discriminant analysis model appears to have no effect on
the test error rate. This model again yields a test error rate of .0886,
or 8.86%.
1-mean(qda_pred_auto2==test_auto$mpg01)
## [1] 0.08860759
(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
glm_fits_auto<-glm(mpg01~cylinders+displacement+horsepower+weight+year+origin,data=train_auto)
glm_fits_auto
##
## Call: glm(formula = mpg01 ~ cylinders + displacement + horsepower +
## weight + year + origin, data = train_auto)
##
## Coefficients:
## (Intercept) cylinders displacement horsepower weight
## -0.6085200 -0.0974107 -0.0002698 0.0021332 -0.0002658
## year origin
## 0.0290618 0.0354871
##
## Degrees of Freedom: 312 Total (i.e. Null); 306 Residual
## Null Deviance: 78.23
## Residual Deviance: 28.47 AIC: 153.9
glm_probs_auto<-predict(glm_fits_auto,test_auto,type="response")
glm_pred_auto<-rep(0,length(test_auto$mpg01))
glm_pred_auto[glm_probs_auto>0.5]=1
table(glm_pred_auto,test_auto$mpg01)
##
## glm_pred_auto 0 1
## 0 35 0
## 1 7 37
The test error rate yielded by our logistic regression model which
includes cylinders, displacement,
horsepower, weight, year, and
origin is also .0886 or 8.86%.
1-mean(glm_pred_auto==test_auto$mpg01)
## [1] 0.08860759
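One caveat worth noting: glm() defaults to family = gaussian, so glm_fits_auto above is actually a linear probability model rather than a logistic regression (the dispersion line in the summary that follows confirms the gaussian family). The hedged sketch below shows the explicitly binomial fit; glm_logit_auto is a name introduced here, and its test error, which is not shown, may differ slightly from the 8.86% reported above.
glm_logit_auto<-glm(mpg01~cylinders+displacement+horsepower+weight+year+origin,
                    data=train_auto,family=binomial)
logit_probs_auto<-predict(glm_logit_auto,test_auto,type="response")
logit_pred_auto<-ifelse(logit_probs_auto>0.5,1,0)
1-mean(logit_pred_auto==test_auto$mpg01)   # test error of the binomial fit (not shown)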
From the summary generated below, it appears that, based on their
individual p-values below 0.05, the significant predictors of
mpg01 in this model are cylinders,
weight, and year.
summary(glm_fits_auto)
##
## Call:
## glm(formula = mpg01 ~ cylinders + displacement + horsepower +
## weight + year + origin, data = train_auto)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.93565 -0.15579 0.06056 0.20257 0.92104
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.085e-01 4.260e-01 -1.428 0.15422
## cylinders -9.741e-02 3.375e-02 -2.886 0.00418 **
## displacement -2.698e-04 7.810e-04 -0.346 0.72995
## horsepower 2.133e-03 1.117e-03 1.910 0.05709 .
## weight -2.658e-04 5.814e-05 -4.572 7.01e-06 ***
## year 2.906e-02 5.152e-03 5.641 3.85e-08 ***
## origin 3.549e-02 2.832e-02 1.253 0.21108
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.0930394)
##
## Null deviance: 78.23 on 312 degrees of freedom
## Residual deviance: 28.47 on 306 degrees of freedom
## AIC: 153.88
##
## Number of Fisher Scoring iterations: 2
glm_fits_auto2<-glm(mpg01~cylinders+weight+year,data=train_auto)
glm_fits_auto2
##
## Call: glm(formula = mpg01 ~ cylinders + weight + year, data = train_auto)
##
## Coefficients:
## (Intercept) cylinders weight year
## -0.2383035 -0.0971438 -0.0002398 0.0261125
##
## Degrees of Freedom: 312 Total (i.e. Null); 309 Residual
## Null Deviance: 78.23
## Residual Deviance: 29.13 AIC: 155
glm_probs_auto2<-predict(glm_fits_auto2,test_auto,type="response")
glm_pred_auto2<-rep(0,length(test_auto$mpg01))
glm_pred_auto2[glm_probs_auto2>0.5]=1
table(glm_pred_auto2,test_auto$mpg01)
##
## glm_pred_auto2 0 1
## 0 36 0
## 1 6 37
By focusing on the cylinders, weight, and
year variables in our logistic regression model, I am able
to reduce the test error rate to 0.0759 or 7.59%.
1-mean(glm_pred_auto2== test_auto$mpg01)
## [1] 0.07594937
(g) Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
library(e1071)
nb_fit_auto<-naiveBayes(mpg01~cylinders+displacement+horsepower+weight+year+origin,data=train_auto)
nb_class_auto<-predict(nb_fit_auto,test_auto$mpg01)
table(nb_class_auto,test_auto$mpg01)
##
## nb_class_auto 0 1
## 0 0 0
## 1 42 37
1-mean(nb_class_auto==test_auto$mpg01)
## [1] 0.5316456
library(e1071)
nb_fit_auto2<-naiveBayes(mpg01~cylinders+weight+year,data=train_auto)
nb_class_auto2<-predict(nb_fit_auto2,test_auto$mpg01)
table(nb_class_auto2,test_auto$mpg01)
##
## nb_class_auto2 0 1
## 0 0 0
## 1 42 37
1-mean(nb_class_auto2==test_auto$mpg01)
## [1] 0.5316456
The naive Bayes models appear to yield the highest test error rates so
far, .5317 or 53.17%, even when only the cylinders,
weight, and year variables are included. However,
both predict() calls above pass test_auto$mpg01 as the new
data rather than the test_auto data frame, so the classifier
never sees the test predictors and effectively falls back on the class
priors, assigning every test observation to the slightly more common class.
The 53.17% figure therefore reflects this mistake rather than the naive
Bayes method itself.
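A hedged sketch of the corrected evaluation follows; nb_class_auto_fixed is a name introduced here, and its confusion matrix and error rate are not shown, though they would be expected to fall much closer to the LDA and QDA results above.
nb_class_auto_fixed<-predict(nb_fit_auto,test_auto)   # pass the full test data frame
table(nb_class_auto_fixed,test_auto$mpg01)
1-mean(nb_class_auto_fixed==test_auto$mpg01)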
(h) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
Using the cylinders, weight, and
year variables in our KNN models, we obtain the
lowest test error rate of .1013, or 10.13%, with both k=1 and k=5.
library(class)
train_X_auto=train_auto[,c("cylinders","weight","year")]
test_X_auto=test_auto[,c("cylinders","weight","year")]
train_mpg01 =train_auto[['mpg01']]
set.seed(2)
y_test_auto<-ifelse(test_auto$mpg>=22.75,1,0)
knn_pred_auto=knn(train_X_auto,test_X_auto,train_mpg01,k=1)
table(knn_pred_auto,y_test_auto)
## y_test_auto
## knn_pred_auto 0 1
## 0 39 5
## 1 3 32
1-mean(knn_pred_auto==y_test_auto)
## [1] 0.1012658
knn_pred_auto2=knn(train_X_auto,test_X_auto,train_mpg01,k=5)
table(knn_pred_auto2,y_test_auto)
## y_test_auto
## knn_pred_auto2 0 1
## 0 38 4
## 1 4 33
1-mean(knn_pred_auto2==y_test_auto)
## [1] 0.1012658
Increasing the value of k to 10 yields an increased test error rate of .1139 or 11.39%.
knn_pred_auto3=knn(train_X_auto,test_X_auto,train_mpg01,k=10)
table(knn_pred_auto3,y_test_auto)
## y_test_auto
## knn_pred_auto3 0 1
## 0 37 4
## 1 5 33
1-mean(knn_pred_auto3==y_test_auto)
## [1] 0.1139241
Further increasing the value of k to 50 yields an even higher test error rate of .1266, or 12.66%.
knn_pred_auto4=knn(train_X_auto,test_X_auto,train_mpg01,k=50)
table(knn_pred_auto4,y_test_auto)
## y_test_auto
## knn_pred_auto4 0 1
## 0 36 4
## 1 6 33
1-mean(knn_pred_auto4==y_test_auto)
## [1] 0.1265823
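Because weight is on the order of thousands of pounds while cylinders only ranges from 3 to 8, the Euclidean distances used by KNN above are dominated by weight. The hedged sketch below repeats the k=5 fit with the predictors standardized using the training-set means and standard deviations; the variable names are introduced here, the comparison uses the stored test_auto$mpg01 labels directly, and the resulting error rate is not shown.
train_means<-colMeans(train_X_auto)
train_sds<-apply(train_X_auto,2,sd)
train_X_scaled<-scale(train_X_auto,center=train_means,scale=train_sds)
test_X_scaled<-scale(test_X_auto,center=train_means,scale=train_sds)
set.seed(2)
knn_pred_scaled<-knn(train_X_scaled,test_X_scaled,train_mpg01,k=5)
1-mean(knn_pred_scaled==test_auto$mpg01)   # test error with standardized predictors (not shown)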
detach(Auto)
Using the Boston data set, fit classification
models in order to predict whether a given suburb has a crime rate above
or below the median. Explore logistic regression, LDA, and KNN models
using various subsets of the predictors. Describe your
findings.
library(MASS)
attach(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Create a binary variable, crim01, that contains
a 1 if crim contains a value above its median, and a 0 if
crim contains a value below its median.
crim01<-ifelse(crim>median(crim),yes=1,no=0)
Boston<-data.frame(Boston,crim01)
str(Boston)
## 'data.frame': 506 obs. of 15 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## $ crim01 : num 0 0 0 0 0 0 0 0 0 0 ...
By visualizing the data graphically, we can determine which variables
to include in our model based on what appear to be strong relationships
with our crim01 variable.
pairs(Boston)
cor(Boston)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## crim01 0.40939545 -0.43615103 0.60326017 0.070096774 0.72323480
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## crim01 -0.15637178 0.61393992 -0.61634164 0.619786249 0.60874128 0.2535684
## black lstat medv crim01
## crim -0.38506394 0.4556215 -0.3883046 0.40939545
## zn 0.17552032 -0.4129946 0.3604453 -0.43615103
## indus -0.35697654 0.6037997 -0.4837252 0.60326017
## chas 0.04878848 -0.0539293 0.1752602 0.07009677
## nox -0.38005064 0.5908789 -0.4273208 0.72323480
## rm 0.12806864 -0.6138083 0.6953599 -0.15637178
## age -0.27353398 0.6023385 -0.3769546 0.61393992
## dis 0.29151167 -0.4969958 0.2499287 -0.61634164
## rad -0.44441282 0.4886763 -0.3816262 0.61978625
## tax -0.44180801 0.5439934 -0.4685359 0.60874128
## ptratio -0.17738330 0.3740443 -0.5077867 0.25356836
## black 1.00000000 -0.3660869 0.3334608 -0.35121093
## lstat -0.36608690 1.0000000 -0.7376627 0.45326273
## medv 0.33346082 -0.7376627 1.0000000 -0.26301673
## crim01 -0.35121093 0.4532627 -0.2630167 1.00000000
It appears that the indus, nox,
rad, tax, and lstat variables all
have relatively strong correlations with the crim01 variable. The
box plots below, which show the distribution of each of these predictors
for suburbs with above- and below-median crime rates, make these
relationships even more apparent.
par(mfrow=c(2,3))
plot(factor(crim01),indus,ylab="Proportion of Non-retail Business Acres per Town")
plot(factor(crim01),nox,ylab="Concentration of Nitrogen Oxides (Parts per 10 Million)")
plot(factor(crim01),rad,ylab="Accessibility to Radial Highways")
plot(factor(crim01),tax,ylab="Property Tax Rate (per $10,000)")
plot(factor(crim01),lstat,ylab="Lower Status of Population (Percent)")
mtext("Box Plots for Suburbs With Above (1) and Below (0) Median Crime",outer=TRUE,line=-1)
Split the data into a training set and a test set
set.seed(3)
idx<-sample(1:nrow(Boston),0.8*nrow(Boston))
train_Boston<-Boston[idx, ]
test_Boston<-Boston[-idx, ]
y_train<-Boston[idx,]$crim
str(train_Boston)
## 'data.frame': 404 obs. of 15 variables:
## $ crim : num 0.5401 0.0605 0.5445 5.6917 0.0642 ...
## $ zn : num 20 0 0 0 0 0 0 0 0 0 ...
## $ indus : num 3.97 2.46 21.89 18.1 5.96 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.647 0.488 0.624 0.583 0.499 0.573 0.583 0.77 0.614 0.7 ...
## $ rm : num 7.2 6.15 6.15 6.11 5.93 ...
## $ age : num 81.8 68.8 97.9 79.8 68.2 89.3 53.2 96.2 96.7 82.5 ...
## $ dis : num 2.11 3.28 1.67 3.55 3.36 ...
## $ rad : int 5 3 4 24 5 1 24 24 24 24 ...
## $ tax : num 264 193 437 666 279 273 666 666 666 666 ...
## $ ptratio: num 13 17.8 21.2 20.2 19.2 21 20.2 20.2 20.2 20.2 ...
## $ black : num 393 387 397 393 397 ...
## $ lstat : num 9.59 13.15 18.46 14.98 9.68 ...
## $ medv : num 33.8 29.6 17.8 19.1 18.9 22 20.6 20.8 14.6 23.2 ...
## $ crim01 : num 1 0 1 1 0 0 1 1 1 1 ...
Logistic Regression
glm_fits_Boston<-glm(crim01~indus+nox+rad+tax+lstat,data=train_Boston)
glm_fits_Boston
##
## Call: glm(formula = crim01 ~ indus + nox + rad + tax + lstat, data = train_Boston)
##
## Coefficients:
## (Intercept) indus nox rad tax lstat
## -0.8129302 0.0108806 2.2817649 0.0246078 -0.0006513 -0.0035658
##
## Degrees of Freedom: 403 Total (i.e. Null); 398 Residual
## Null Deviance: 101
## Residual Deviance: 41.76 AIC: 243.6
glm_probs_Boston<-predict(glm_fits_Boston,test_Boston,type="response")
glm_pred_Boston<-rep(0,length(test_Boston$crim01))
glm_pred_Boston[glm_probs_Boston>0.5]=1
table(glm_pred_Boston,test_Boston$crim01)
##
## glm_pred_Boston 0 1
## 0 52 17
## 1 2 31
A logistic regression model that includes the indus,
nox, rad, tax, and
lstat variables yields a test error rate of .1863, or 18.63%.
1-mean(glm_pred_Boston==test_Boston$crim01)
## [1] 0.1862745
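As with the Auto models above, glm() is called here without a family argument, so glm_fits_Boston is fit with the gaussian family (a linear probability model) rather than as a true logistic regression. A hedged sketch of the binomial version follows; glm_logit_Boston is a name introduced here and its test error is not shown.
glm_logit_Boston<-glm(crim01~indus+nox+rad+tax+lstat,
                      data=train_Boston,family=binomial)
logit_probs_Boston<-predict(glm_logit_Boston,test_Boston,type="response")
1-mean(ifelse(logit_probs_Boston>0.5,1,0)==test_Boston$crim01)   # test error (not shown)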
From the summary produced below, we can see that the
indus, nox, rad, and
tax variables are all statistically significant predictors of
whether a given suburb's crime rate is above or below the median, while
lstat is not.
summary(glm_fits_Boston)
##
## Call:
## glm(formula = crim01 ~ indus + nox + rad + tax + lstat, data = train_Boston)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.66774 -0.20179 -0.09736 0.16887 0.84107
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.8129302 0.1159436 -7.011 1.02e-11 ***
## indus 0.0108806 0.0043789 2.485 0.0134 *
## nox 2.2817649 0.2442468 9.342 < 2e-16 ***
## rad 0.0246078 0.0046468 5.296 1.97e-07 ***
## tax -0.0006513 0.0002771 -2.351 0.0192 *
## lstat -0.0035658 0.0030003 -1.188 0.2353
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1049152)
##
## Null deviance: 100.978 on 403 degrees of freedom
## Residual deviance: 41.756 on 398 degrees of freedom
## AIC: 243.6
##
## Number of Fisher Scoring iterations: 2
However, a logistic regression model that includes only the variables identified as significant still yields the same 18.63% test error rate as the original.
glm_fits_Boston2<-glm(crim01~indus+nox+rad+tax,data=train_Boston)
glm_fits_Boston2
##
## Call: glm(formula = crim01 ~ indus + nox + rad + tax, data = train_Boston)
##
## Coefficients:
## (Intercept) indus nox rad tax
## -0.803658 0.009883 2.221905 0.024460 -0.000672
##
## Degrees of Freedom: 403 Total (i.e. Null); 399 Residual
## Null Deviance: 101
## Residual Deviance: 41.9 AIC: 243
glm_probs_Boston2<-predict(glm_fits_Boston2,test_Boston,type="response")
glm_pred_Boston2<-rep(0,length(test_Boston$crim01))
glm_pred_Boston2[glm_probs_Boston2>0.5]=1
table(glm_pred_Boston2,test_Boston$crim01)
##
## glm_pred_Boston2 0 1
## 0 52 17
## 1 2 31
1-mean(glm_pred_Boston2==test_Boston$crim01)
## [1] 0.1862745
I am able to reduce the test error rate of the logistic regression
model to .1667, or 16.67%, by excluding the indus and
tax variables from the original model.
glm_fits_Boston3<-glm(crim01~nox+rad+lstat,data=train_Boston)
glm_fits_Boston3
##
## Call: glm(formula = crim01 ~ nox + rad + lstat, data = train_Boston)
##
## Coefficients:
## (Intercept) nox rad lstat
## -1.010865 2.507217 0.016072 -0.002826
##
## Degrees of Freedom: 403 Total (i.e. Null); 400 Residual
## Null Deviance: 101
## Residual Deviance: 42.61 AIC: 247.8
glm_probs_Boston3<-predict(glm_fits_Boston3,test_Boston,type="response")
glm_pred_Boston3<-rep(0,length(test_Boston$crim01))
glm_pred_Boston3[glm_probs_Boston3>0.5]=1
table(glm_pred_Boston3,test_Boston$crim01)
##
## glm_pred_Boston3 0 1
## 0 52 15
## 1 2 33
1-mean(glm_pred_Boston3==test_Boston$crim01)
## [1] 0.1666667
Linear Discriminant Analysis (LDA)
The linear discriminant analysis models produced below mirror the
test error results of our logistic regression models. The
initial linear discriminant analysis model, which includes the
indus, nox, rad,
tax, and lstat variables, yields a test error rate of
.1863, or 18.63%.
library(MASS)
lda_fit_Boston<-lda(crim01~indus+nox+rad+tax+lstat,data=train_Boston)
lda_fit_Boston
## Call:
## lda(crim01 ~ indus + nox + rad + tax + lstat, data = train_Boston)
##
## Prior probabilities of groups:
## 0 1
## 0.4925743 0.5074257
##
## Group means:
## indus nox rad tax lstat
## 0 6.946432 0.4724869 4.170854 306.9196 9.438693
## 1 15.441756 0.6386829 15.341463 517.4098 15.877707
##
## Coefficients of linear discriminants:
## LD1
## indus 0.044083539
## nox 9.244775874
## rad 0.099700634
## tax -0.002638889
## lstat -0.014447233
lda_pred_Boston<-predict(lda_fit_Boston,test_Boston)$class
table(lda_pred_Boston,test_Boston$crim01)
##
## lda_pred_Boston 0 1
## 0 52 17
## 1 2 31
1- mean(lda_pred_Boston==test_Boston$crim01)
## [1] 0.1862745
By excluding the indus and tax variables, I
am able to reduce the test error rate of the linear discriminant
analysis model to .1667 or 16.67%.
library(MASS)
lda_fit_Boston2<-lda(crim01~nox+rad+lstat,data=train_Boston)
lda_fit_Boston2
## Call:
## lda(crim01 ~ nox + rad + lstat, data = train_Boston)
##
## Prior probabilities of groups:
## 0 1
## 0.4925743 0.5074257
##
## Group means:
## nox rad lstat
## 0 0.4724869 4.170854 9.438693
## 1 0.6386829 15.341463 15.877707
##
## Coefficients of linear discriminants:
## LD1
## nox 10.12921220
## rad 0.06493179
## lstat -0.01141820
lda_pred_Boston2<-predict(lda_fit_Boston2,test_Boston)$class
table(lda_pred_Boston2,test_Boston$crim01)
##
## lda_pred_Boston2 0 1
## 0 52 15
## 1 2 33
1- mean(lda_pred_Boston2==test_Boston$crim01)
## [1] 0.1666667
K-Nearest Neighbors (KNN)
library(class)
train_X_Boston=train_Boston[,c("nox","rad","lstat")]
test_X_Boston=test_Boston[,c("nox","rad","lstat")]
train_crim01=train_Boston[['crim01']]
set.seed(4)
y_test_Boston<-ifelse(test_Boston$crim>=25.65,1,0)
knn_pred_Boston=knn(train_X_Boston,test_X_Boston,train_crim01,k=1)
table(knn_pred_Boston,y_test_Boston)
## y_test_Boston
## knn_pred_Boston 0 1
## 0 58 0
## 1 42 2
1-mean(knn_pred_Boston==y_test_Boston)
## [1] 0.4117647
knn_pred_Boston2=knn(train_X_Boston,test_X_Boston,train_crim01,k=50)
table(knn_pred_Boston2,y_test_Boston)
## y_test_Boston
## knn_pred_Boston2 0 1
## 0 75 0
## 1 25 2
1-mean(knn_pred_Boston2==y_test_Boston)
## [1] 0.245098
By increasing the value of k to 100 for our KNN model, which includes
the nox, rad, and lstat
variables, we appear to match the 18.63% test error rate of our least
accurate logistic regression and linear discriminant analysis models.
Note, however, that the KNN error rates in this section are computed
against y_test_Boston, which recodes the test response as
crim >= 25.65; the median of crim is roughly 0.2565, so this
threshold labels only two test suburbs as 1, and the comparison with the
other models (which use test_Boston$crim01) is not reliable.
knn_pred_Boston3=knn(train_X_Boston,test_X_Boston,train_crim01,k=100)
table(knn_pred_Boston3,y_test_Boston)
## y_test_Boston
## knn_pred_Boston3 0 1
## 0 81 0
## 1 19 2
1-mean(knn_pred_Boston3==y_test_Boston)
## [1] 0.1862745
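A hedged sketch of the corrected evaluation is below, comparing the KNN predictions against the stored test_Boston$crim01 labels instead of the ad hoc recoding; knn_pred_fixed is a name introduced here and the resulting error rates are not shown.
set.seed(4)
knn_pred_fixed<-knn(train_X_Boston,test_X_Boston,train_crim01,k=1)
1-mean(knn_pred_fixed==test_Boston$crim01)                                       # k = 1
1-mean(knn(train_X_Boston,test_X_Boston,train_crim01,k=50)==test_Boston$crim01)  # k = 50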
From the classification models tested, it appears that the logistic
regression and linear discriminant analysis models that include only
the nox, rad, and lstat
variables yield the lowest test error rate, 16.67%. Given the labeling
issue noted above, the KNN results would need to be re-evaluated against
test_Boston$crim01 before drawing a final comparison.
detach(Boston)