library(ISLR2)
library(MASS)
library(class)
This question should be answered using the Weekly data
set, which is part of the ISLR2 package. This data is
similar in nature to the Smarket data from this chapter’s
lab, except that it contains 1,089 weekly returns for 21 years, from
the beginning of 1990 to the end of 2010.
attach(Weekly)
head(Weekly)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 1990 0.816 1.572 -3.936 -0.229 -3.484 0.1549760 -0.270 Down
## 2 1990 -0.270 0.816 1.572 -3.936 -0.229 0.1485740 -2.576 Down
## 3 1990 -2.576 -0.270 0.816 1.572 -3.936 0.1598375 3.514 Up
## 4 1990 3.514 -2.576 -0.270 0.816 1.572 0.1616300 0.712 Up
## 5 1990 0.712 3.514 -2.576 -0.270 0.816 0.1537280 1.178 Up
## 6 1990 1.178 0.712 3.514 -2.576 -0.270 0.1544440 -1.372 Down
(a) Produce some numerical and graphical summaries of the
Weekly data. Do there appear to be any
patterns?
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
pairs(Weekly)
Much like in the Smarket data, the only relationship
that jumps out is that between Year and
Volume.
plot(Year, Volume)
Taking a closer look, we can see a steady year-over-year increase in the volume of trades. Looking below, the growth appears roughly exponential, with a drop-off in the last couple of years (possibly due to the 2008 recession).
plot(Volume)
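If the growth really is roughly exponential, the same plot with a logarithmic y-axis should look close to linear (a quick optional check):
# Index plot of weekly Volume on a log scale; exponential growth
# appears approximately linear here
plot(Volume, log = "y")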
Finally, let’s look at the correlation matrix to see whether any relationships besides the Year-Volume one stand out:
cor(Weekly[,-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
Nothing correlates strongly with Today: the lag variables and Volume all have near-zero correlations with this week’s return.
(b) Use the full data set to perform a logistic regression
with Direction as the response and the five lag variables
plus Volume as predictors. Use the summary function to
print the results. Do any of the predictors appear to be statistically
significant? If so, which ones?
glm_full = glm(Direction ~ . - Year - Today, data = Weekly, family = binomial)
summary(glm_full)
##
## Call:
## glm(formula = Direction ~ . - Year - Today, family = binomial,
## data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Lag2 is the only predictor with a p-value below .05. The next best predictor is Lag1, which still has a p-value of .1181.
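As a convenience, the p-values can also be pulled out of the fit programmatically (a small base-R sketch):
# Extract the Pr(>|z|) column from the coefficient table and sort it
sort(coef(summary(glm_full))[, "Pr(>|z|)"])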
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
glm.probs=predict(glm_full,type='response')
glm.probs[1:10]
## 1 2 3 4 5 6 7 8
## 0.6086249 0.6010314 0.5875699 0.4816416 0.6169013 0.5684190 0.5786097 0.5151972
## 9 10
## 0.5715200 0.5554287
glm.pred=rep("Down",1089)
glm.pred[glm.probs>0.5]="Up"
table(glm.pred, Direction)
## Direction
## glm.pred Down Up
## Down 54 48
## Up 430 557
mean(glm.pred==Direction)
## [1] 0.5610652
(54+557)/1089
## [1] 0.5610652
The correctly predicted Down count of 54 and Up count of 557 give an overall accuracy of 56.11%: our model correctly predicted the movement of the market about 56% of the time. The matrix also shows the kinds of mistakes being made: the model predicts Up for almost every week, so it catches 557 of the 605 Up weeks but misclassifies 430 of the 484 Down weeks as Up. Note too that this accuracy is measured on the training data; we would need a holdout dataset to get a realistic error rate.
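To quantify those mistakes, we can compute the fraction of Up and Down weeks the model catches, straight from the table above (a minimal sketch reusing the counts already shown):
conf = table(glm.pred, Direction)
conf["Up", "Up"] / sum(conf[, "Up"])        # Up weeks caught: 557/605, about 92%
conf["Down", "Down"] / sum(conf[, "Down"])  # Down weeks caught: 54/484, about 11%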
(d) Now fit the logistic regression model using a training
data period from 1990 to 2008, with Lag2 as the only
predictor. Compute the confusion matrix and the overall fraction of
correct predictions for the held out data (that is, the data from 2009
and 2010).
train = (Year<2009)
head(train)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
tail(train)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
Now we can fit our model on the training data and compute the confusion matrix.
Weekly.2009=Weekly[!train,]
dim(Weekly)
## [1] 1089 9
dim(Weekly.2009)
## [1] 104 9
Direction.2009=Direction[!train]
glm.fits=glm(Direction~Lag2, data=Weekly, subset=train, family=binomial)
glm.probs=predict(glm.fits,Weekly.2009,type='response')
glm.pred=rep('Down', 104)
glm.pred[glm.probs>0.5]='Up'
table(glm.pred,Direction.2009)
## Direction.2009
## glm.pred Down Up
## Down 9 5
## Up 34 56
mean(glm.pred==Direction.2009)
## [1] 0.625
(9+56)/104
## [1] 0.625
We changed two things in our model: first, we created a holdout test set (the weeks from 2009 and 2010) and ran the model again; second, we reduced our predictors to the only one with statistical significance, Lag2. Evaluating on held-out data, you would expect the accuracy to drop, but using only Lag2 offset that enough to get us to 62.5% accuracy.
(e) Repeat (d) using LDA.
lda.fit=lda(Direction~Lag2, data=Weekly, subset=train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
lda.pred=predict(lda.fit,Weekly.2009)
names(lda.pred)
## [1] "class" "posterior" "x"
lda.class=lda.pred$class
table(lda.class,Direction.2009)
## Direction.2009
## lda.class Down Up
## Down 9 5
## Up 34 56
mean(lda.class==Direction.2009)
## [1] 0.625
(9+56)/104
## [1] 0.625
There is no change between the confusion matrices in (d) and (e): both models reached the same predictions, correct 62.5% of the time, and the split of correct Up and Down calls was identical.
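Identical confusion matrices do not by themselves prove the two classifiers agree week by week, so it is worth checking directly (a one-line sketch; a value of 1 means the predictions match exactly):
# Compare LDA's class predictions with the logistic regression labels from (d)
mean(as.character(lda.class) == glm.pred)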
(f) Repeat (d) using QDA.
qda.fit = qda(Direction ~ Lag2, data = Weekly, subset = train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
qda.pred = predict(qda.fit, Weekly.2009)
table(qda.pred$class,Weekly.2009$Direction)
##
## Down Up
## Down 0 0
## Up 43 61
qda.class=predict(qda.fit,Weekly.2009)$class
table(qda.class,Direction.2009)
## Direction.2009
## qda.class Down Up
## Down 0 0
## Up 43 61
mean(qda.class==Direction.2009)
## [1] 0.5865385
61/104
## [1] 0.5865385
QDA was accurate 58.7% of the time. However, all of its correct predictions were Up: the model predicted Up for every single week, so the result is the same as if we had naively predicted Up every week.
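We can confirm that benchmark directly; the share of Up weeks in the held-out period equals QDA’s accuracy:
# Accuracy of the naive always-Up rule on the 2009-2010 weeks
mean(Direction.2009 == "Up")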
(g) Repeat (d) using KNN with K = 1.
train.X=data.frame(Weekly[train, ]$Lag2)
test.X=data.frame(Weekly[!train, ]$Lag2)
train.Direction=Weekly[train, ]$Direction
dim(train.X)
## [1] 985 1
dim(test.X)
## [1] 104 1
length(train.Direction)
## [1] 985
set.seed(1)
knn.pred=knn(train.X,test.X,train.Direction,k=1)
table(knn.pred,Direction.2009)
## Direction.2009
## knn.pred Down Up
## Down 21 30
## Up 22 31
(21+31)/104
## [1] 0.5
With K = 1, accuracy dropped to 50%, no better than a coin flip.
(h) Repeat (d) using naive Bayes.
library(e1071)
nb.fit = naiveBayes(Direction ~ Lag2, data = Weekly, subset = train)
nb.fit
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Down Up
## 0.4477157 0.5522843
##
## Conditional probabilities:
## Lag2
## Y [,1] [,2]
## Down -0.03568254 2.199504
## Up 0.26036581 2.317485
nb.class=predict(nb.fit,Weekly.2009)
table(nb.class, Direction.2009)
## Direction.2009
## nb.class Down Up
## Down 0 0
## Up 43 61
61/104
## [1] 0.5865385
The output is the same as QDA’s: naive Bayes also predicts Up for every week in the test period.
(i) Which of these methods appears to provide the best results on this data?
Looking only at prediction accuracy, logistic regression and LDA tie for the best performance, at 62.5%.
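Collecting the held-out accuracies reported above in one place makes the comparison explicit (a summary of the numbers already computed):
accs = c(logistic = 0.6250, LDA = 0.6250, QDA = 0.5865, naiveBayes = 0.5865, KNN1 = 0.5000)
sort(accs, decreasing = TRUE)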
(j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
First, let’s increase K for KNN. Looking at work done by analyst Liam Morgan, two good values for K appear to be 7 and 25. Let’s run it with 7 first.
knn.pred=knn(train.X,test.X,train.Direction,k=7)
table(knn.pred,Direction.2009)
## Direction.2009
## knn.pred Down Up
## Down 16 19
## Up 27 42
(16+42)/104
## [1] 0.5576923
And now with 25.
knn.pred=knn(train.X,test.X,train.Direction,k=25)
table(knn.pred,Direction.2009)
## Direction.2009
## knn.pred Down Up
## Down 20 24
## Up 23 37
(20+37)/104
## [1] 0.5480769
Increasing K improves accuracy over K = 1, but it’s still not close to the other models.
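Rather than trying values one at a time, a small loop makes the K sweep explicit (a sketch; accuracies can vary slightly between runs because knn() breaks distance ties at random):
set.seed(1)
for (k in c(1, 5, 7, 15, 25)) {
  pred.k = knn(train.X, test.X, train.Direction, k = k)
  cat("K =", k, "accuracy =", round(mean(pred.k == Direction.2009), 4), "\n")
}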
Let’s try adding Lag1 and a Lag1:Lag2 interaction to the first two models, since Lag1 and Lag2 originally had the best p-values.
glm.fits2=glm(Direction~Lag1 + Lag2 + Lag1:Lag2, data=Weekly, subset=train, family=binomial)
glm.probs2=predict(glm.fits2,Weekly.2009,type='response')
glm.pred2=rep('Down',104)
glm.pred2[glm.probs2>0.5]='Up'
table(glm.pred2,Direction.2009)
## Direction.2009
## glm.pred2 Down Up
## Down 7 8
## Up 36 53
mean(glm.pred2==Direction.2009)
## [1] 0.5769231
Unfortunately, that didn’t increase accuracy compared to fitting with Lag2 alone. Finally, let’s run the LDA with the same Lag1, Lag2, and Lag1:Lag2 terms.
lda.fit2=lda(Direction~Lag1 + Lag2 + Lag1:Lag2, data=Weekly, subset=train)
lda.pred2=predict(lda.fit2,Weekly.2009)
lda.class2=lda.pred2$class
table(lda.class2,Direction.2009)
## Direction.2009
## lda.class2 Down Up
## Down 7 8
## Up 36 53
mean(lda.class2==Direction.2009)
## [1] 0.5769231
This does not improve performance either.
detach(Weekly)
In this problem, you will develop a model to predict whether a given
car gets high or low gas mileage based on the Auto data
set.
attach(Auto)
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
(a) Create a binary variable, mpg01, that
contains a 1 if mpg contains a value above its median, and
a 0 if mpg contains a value below its median. You can
compute the median using the median() function. Note you may find it
helpful to use the data.frame() function to create a single
data set containing both mpg01 and the other
Auto variables.
In the Auto dataset, origin contains categorical data, but it’s coded with integers. To make regression and model fitting easier, I’m going to replicate work done by user “suugaku”: replace the values in that column with their meanings and convert it to a factor column.
Auto$origin[Auto$origin == 1] = "American"
Auto$origin[Auto$origin == 2] = "European"
Auto$origin[Auto$origin == 3] = "Japanese"
Auto$origin = as.factor(Auto$origin)
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 American
## 2 15 8 350 165 3693 11.5 70 American
## 3 18 8 318 150 3436 11.0 70 American
## 4 16 8 304 150 3433 12.0 70 American
## 5 17 8 302 140 3449 10.5 70 American
## 6 15 8 429 198 4341 10.0 70 American
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
Now we’ll add the mpg01 variable.
mpg01 = rep(0, dim(Auto)[1])
mpg01[Auto$mpg > median(Auto$mpg)] = 1
Auto = data.frame(Auto, mpg01)
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 American
## 2 15 8 350 165 3693 11.5 70 American
## 3 18 8 318 150 3436 11.0 70 American
## 4 16 8 304 150 3433 12.0 70 American
## 5 17 8 302 140 3449 10.5 70 American
## 6 15 8 429 198 4341 10.0 70 American
## name mpg01
## 1 chevrolet chevelle malibu 0
## 2 buick skylark 320 0
## 3 plymouth satellite 0
## 4 amc rebel sst 0
## 5 ford torino 0
## 6 ford galaxie 500 0
table(Auto$mpg01)
##
## 0 1
## 196 196
(b) Explore the data graphically in order to investigate the
association between mpg01 and the other features. Which of
the other features seem most likely to be useful in predicting
mpg01? Scatterplots and boxplots may be useful tools to
answer this question. Describe your findings.
First, let’s start with boxplots, arranged together on one screen.
par(mfrow = c(2,3))
plot(factor(Auto$mpg01), Auto$cylinders, ylab = "cylinders")
plot(factor(mpg01), displacement, ylab = "displacement")
plot(factor(mpg01), horsepower, ylab = "horsepower")
plot(factor(mpg01), weight, ylab = "weight")
plot(factor(mpg01), acceleration, ylab = "acceleration")
plot(factor(mpg01), year, ylab = "year")
While a boxplot may not be the best way to view the relationship between cylinders and mpg01 (cylinders is not continuous, and very few cars have 3 or 5 cylinders), there is still a discernible difference between the 0 and 1 values of mpg01. Displacement, horsepower, and weight also all show noticeable differences.
par(mfrow = c(2,2))
plot(displacement, mpg01, xlab = "displacement")
plot(horsepower, mpg01, xlab = "Horsepower")
plot(weight, mpg01, xlab = "Weight")
plot(acceleration, mpg01, xlab = "acceleration")
As cylinders and year are discrete, we did not make scatterplots for them. Horsepower and weight both look meaningful, as do displacement and acceleration, though to a lesser degree.
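As a numeric companion to the plots, we can also correlate mpg01 with the quantitative columns (a small sketch; the factor and character columns are dropped first):
num = sapply(Auto, is.numeric)         # TRUE for the quantitative columns
round(cor(Auto[, num])[, "mpg01"], 2)  # correlation of each variable with mpg01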
(c) Split the data into a training set and a test
set.
I will split the data based on year, separating the even years from the odd. At first I tried to force the data into train and test sets by simply taking the first and second halves of the rows:
set.seed(1)
Auto.train = Auto[1:196,]
Auto.test = Auto[197:392,]
But it broke down later on.
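One likely reason: the rows are roughly ordered by model year, so a sequential split trains on early years and tests on late ones. A random split would avoid any ordering effects; a minimal sketch of that alternative (not what we use below):
set.seed(1)
idx = sample(nrow(Auto), nrow(Auto) %/% 2)  # random half of the row indices
Auto.train.rand = Auto[idx, ]
Auto.test.rand = Auto[-idx, ]
Instead, we split on even versus odd model years: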
train = (year%%2 == 0) # if the year is even
test = !train
Auto.train = Auto[train, ]
Auto.test = Auto[test, ]
mpg01.test = mpg01[test]
summary(train)
## Mode FALSE TRUE
## logical 182 210
summary(test)
## Mode FALSE TRUE
## logical 210 182
(d) Perform LDA on the training data in order to predict
mpg01 using the variables that seemed most associated with
mpg01 in (b). What is the test error of the model
obtained?
lda.fit = lda(mpg01 ~ cylinders + displacement + horsepower + weight + year + origin, data = Auto, subset = train)
lda.fit
## Call:
## lda(mpg01 ~ cylinders + displacement + horsepower + weight +
## year + origin, data = Auto, subset = train)
##
## Prior probabilities of groups:
## 0 1
## 0.4571429 0.5428571
##
## Group means:
## cylinders displacement horsepower weight year originEuropean
## 0 6.812500 271.7396 133.14583 3604.823 74.10417 0.1041667
## 1 4.070175 111.6623 77.92105 2314.763 77.78947 0.2631579
## originJapanese
## 0 0.0312500
## 1 0.3859649
##
## Coefficients of linear discriminants:
## LD1
## cylinders -0.6220215692
## displacement 0.0006869239
## horsepower 0.0101214580
## weight -0.0012627856
## year 0.0923162589
## originEuropean -0.0156524060
## originJapanese 0.2657932497
And the confusion matrix:
lda.pred=predict(lda.fit, Auto.test[,c('cylinders', 'displacement', 'horsepower', 'weight', 'year', 'origin')])
table(lda.pred$class,Auto.test[,'mpg01'])
##
## 0 1
## 0 87 6
## 1 13 76
mean(lda.pred$class != Auto.test[, 'mpg01'])
## [1] 0.1043956
(87+76)/182
## [1] 0.8956044
We have an error rate of about 10%.
(e) Perform QDA on the training data in order to predict
mpg01 using the variables that seemed most associated with
mpg01 in (b). What is the test error of the model
obtained?
qda.fit=qda(mpg01 ~ cylinders + displacement + horsepower + weight + year + origin, data = Auto, subset = train)
qda.fit
## Call:
## qda(mpg01 ~ cylinders + displacement + horsepower + weight +
## year + origin, data = Auto, subset = train)
##
## Prior probabilities of groups:
## 0 1
## 0.4571429 0.5428571
##
## Group means:
## cylinders displacement horsepower weight year originEuropean
## 0 6.812500 271.7396 133.14583 3604.823 74.10417 0.1041667
## 1 4.070175 111.6623 77.92105 2314.763 77.78947 0.2631579
## originJapanese
## 0 0.0312500
## 1 0.3859649
qda.pred=predict(qda.fit, Auto.test[,c('cylinders', 'displacement', 'horsepower', 'weight', 'year', 'origin')])
table(qda.pred$class,Auto.test[,'mpg01'])
##
## 0 1
## 0 87 12
## 1 13 70
mean(qda.pred$class != Auto.test[, 'mpg01'])
## [1] 0.1373626
(87+70)/182
## [1] 0.8626374
While these values are close, QDA had slightly lower accuracy, with an error rate closer to 14%.
(f) Perform logistic regression on the training data in order
to predict mpg01 using the variables that seemed most
associated with mpg01 in (b). What is the test error of the
model obtained?
logistic.fit=glm(mpg01 ~ cylinders + displacement + horsepower + weight + year + origin, data = Auto, family = binomial, subset = train)
summary(logistic.fit)
##
## Call:
## glm(formula = mpg01 ~ cylinders + displacement + horsepower +
## weight + year + origin, data = Auto, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9669 -0.0747 0.0695 0.1676 0.6478
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.524e-01 4.907e-01 0.514 0.607550
## cylinders -1.418e-01 3.845e-02 -3.688 0.000291 ***
## displacement 1.566e-04 8.454e-04 0.185 0.853227
## horsepower 2.308e-03 1.281e-03 1.801 0.073165 .
## weight -2.879e-04 6.448e-05 -4.465 1.33e-05 ***
## year 2.105e-02 5.815e-03 3.619 0.000373 ***
## originEuropean -3.568e-03 6.284e-02 -0.057 0.954773
## originJapanese 6.060e-02 6.201e-02 0.977 0.329660
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.07577762)
##
## Null deviance: 52.114 on 209 degrees of freedom
## Residual deviance: 15.307 on 202 degrees of freedom
## AIC: 64.008
##
## Number of Fisher Scoring iterations: 2
logistic.pred=predict(logistic.fit, Auto.test[,c('cylinders', 'displacement', 'horsepower', 'weight', 'year', 'origin')], type='response')
logistic.class = ifelse(logistic.pred>0.5,1,0)
table(logistic.class,Auto.test[,'mpg01'])
##
## logistic.class 0 1
## 0 87 6
## 1 13 76
mean(logistic.class != Auto.test[, 'mpg01'])
## [1] 0.1043956
(87+76)/182
## [1] 0.8956044
This gave us the same confusion matrix as LDA.
(g) Perform naive Bayes on the training data in order to
predict mpg01 using the variables that seemed most
associated with mpg01 in (b). What is the test error of the
model obtained?
nb.fit=naiveBayes(mpg01 ~ cylinders + displacement + horsepower + weight + year + origin, data = Auto, subset = train)
nb.fit
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.4571429 0.5428571
##
## Conditional probabilities:
## cylinders
## Y [,1] [,2]
## 0 6.812500 1.4165377
## 1 4.070175 0.3928408
##
## displacement
## Y [,1] [,2]
## 0 271.7396 89.15194
## 1 111.6623 28.27696
##
## horsepower
## Y [,1] [,2]
## 0 133.14583 38.49319
## 1 77.92105 15.19731
##
## weight
## Y [,1] [,2]
## 0 3604.823 624.9159
## 1 2314.763 334.7228
##
## year
## Y [,1] [,2]
## 0 74.10417 3.157211
## 1 77.78947 3.756937
##
## origin
## Y American European Japanese
## 0 0.8645833 0.1041667 0.0312500
## 1 0.3508772 0.2631579 0.3859649
nb.class=predict(nb.fit, Auto.test)
table(nb.class, Auto.test[,'mpg01'])
##
## nb.class 0 1
## 0 88 11
## 1 12 71
mean(nb.class != Auto.test[, 'mpg01'])
## [1] 0.1263736
(88+71)/182
## [1] 0.8736264
This gives a test error of about 12.6%, slightly worse than LDA and logistic regression.
(h) Perform KNN on the training data, with several values of
K, in order to predict mpg01. Use only the variables that
seemed most associated with mpg01 in (b). What test errors
do you obtain? Which value of K seems to perform the best on this data
set?
train.X = cbind(cylinders, displacement, horsepower, weight, year)[train, ]
test.X = cbind(cylinders, displacement, horsepower, weight, year)[test, ]
train.mpg01 = mpg01[train]
set.seed(1)
# KNN(k=1)
knn.pred = knn(train.X, test.X, train.mpg01, k = 1)
mean(knn.pred != mpg01.test)
## [1] 0.1538462
Now we’ll run it for a few different values of K: 5, 9, and 19.
5:
knn.pred = knn(train.X, test.X, train.mpg01, k = 5)
mean(knn.pred != mpg01.test)
## [1] 0.1428571
9:
knn.pred = knn(train.X, test.X, train.mpg01, k = 9)
mean(knn.pred != mpg01.test)
## [1] 0.1538462
19:
knn.pred = knn(train.X, test.X, train.mpg01, k = 19)
mean(knn.pred != mpg01.test)
## [1] 0.1483516
K = 5 gives the lowest error rate, but they are all very close.
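One caveat worth flagging: KNN is distance-based, and these predictors are on very different scales (weight is in the thousands while cylinders is single digits), so weight dominates the distance computation. Standardizing the columns first is a common fix; a sketch under that assumption (the resulting error rates will differ from those above):
# Scale the test set using the training means and standard deviations only
train.X.std = scale(train.X)
test.X.std = scale(test.X, center = attr(train.X.std, "scaled:center"),
                   scale = attr(train.X.std, "scaled:scale"))
set.seed(1)
knn.pred.std = knn(train.X.std, test.X.std, train.mpg01, k = 5)
mean(knn.pred.std != mpg01.test)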
detach(Auto)
Using the Boston data set, fit classification models in order
to predict whether a given census tract has a crime rate above or below
the median. Explore logistic regression, LDA, naive Bayes, and KNN
models using various subsets of the predictors. Describe your
findings.
Hint: You will have to create the response variable yourself,
using the variables that are contained in the Boston data
set.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Here we will create crime01 and give it a value of 1 when crim is above its median. While we’re here, let’s also create our training set out of the first half of the data and our test set out of the second half.
attach(Boston)
crime01=rep(0, length(crim))
crime01[crim > median(crim)] = 1
Boston = data.frame(Boston, crime01)
train=1:(dim(Boston)[1]/2)
test = (dim(Boston)[1]/2 + 1):dim(Boston)[1]
Boston.train = Boston[train, ]
Boston.test = Boston[test, ]
crime01.test = crime01[test]
One potential issue with dividing the data this way is that we could end up with an unusually high share of one class in crime01, depending on how the rows are ordered. Let’s check:
table(crime01.test)
## crime01.test
## 0 1
## 90 163
Not great, but let’s continue and see how our models do with this setup.
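The training half shows the complementary skew: with 253 tracts per half and a 253/253 overall split at the median, the training half must lean toward 0 (163 zeros and 90 ones, which matches the naive Bayes prior probabilities printed below):
table(crime01[train])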
In the next section, we’ll fit logistic regression, LDA, naive Bayes, and then KNN, each time using a different set of variables. Normally we would hold the variables constant to compare the models fairly, but let’s follow the directions instead.
Using GLM with every predictor except the crime variables:
glm.fit = glm(crime01 ~ . - crime01 - crim, data = Boston, family = binomial, subset = train)
glm.probs = predict(glm.fit, Boston.test, type = "response")
glm.pred = rep(0, length(glm.probs))
glm.pred[glm.probs > 0.5] = 1
mean(glm.pred != crime01.test)
## [1] 0.1818182
This gives an error rate of about 18%.
Next we’ll do LDA, and remove the chas and
tax variables as well.
lda.fit = lda(crime01 ~ . - crime01 - crim - chas - tax, data = Boston, subset = train)
lda.pred = predict(lda.fit, Boston.test)
mean(lda.pred$class != crime01.test)
## [1] 0.1225296
This improves the error rate, bringing it down to about 12%.
Now we’ll do Naive Bayes and remove indus as well.
nb.fit=naiveBayes(crime01 ~ . - crime01 - crim - chas - tax - indus, data = Boston, subset = train)
nb.fit
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.6442688 0.3557312
##
## Conditional probabilities:
## zn
## Y [,1] [,2]
## 0 17.4815951 27.406344
## 1 0.2444444 2.319004
##
## nox
## Y [,1] [,2]
## 0 0.4702902 0.05024047
## 1 0.6119889 0.12928520
##
## rm
## Y [,1] [,2]
## 0 6.381067 0.5492511
## 1 6.241967 0.8283310
##
## age
## Y [,1] [,2]
## 0 54.68957 27.21436
## 1 85.67889 19.38265
##
## dis
## Y [,1] [,2]
## 0 4.843857 1.884114
## 1 2.895154 1.199945
##
## rad
## Y [,1] [,2]
## 0 4.239264 1.597971
## 1 5.155556 1.542733
##
## ptratio
## Y [,1] [,2]
## 0 17.82086 1.730712
## 1 18.08778 2.775812
##
## black
## Y [,1] [,2]
## 0 388.3934 26.71517
## 1 355.2777 64.50591
##
## lstat
## Y [,1] [,2]
## 0 9.88589 5.090516
## 1 14.06389 7.568536
##
## medv
## Y [,1] [,2]
## 0 25.17362 7.004847
## 1 22.73889 10.182798
nb.class=predict(nb.fit, Boston.test)
table(nb.class, Boston.test[,'crime01'])
##
## nb.class 0 1
## 0 73 22
## 1 17 141
1-((73+141)/253)
## [1] 0.1541502
The error rate here is 15.4%.
And finally, we’ll run KNN with K = 1, 3, 5, 10, and 19, using the same variables selected by analyst “Princehonest”.
train.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black,
lstat, medv)[train, ]
test.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black,
lstat, medv)[test, ]
train.crime01 = crime01[train]
set.seed(1)
At K = 1
knn.pred = knn(train.X, test.X, train.crime01, k = 1)
mean(knn.pred != crime01.test)
## [1] 0.458498
At K = 3
knn.pred = knn(train.X, test.X, train.crime01, k = 3)
mean(knn.pred != crime01.test)
## [1] 0.2648221
At K = 5
knn.pred = knn(train.X, test.X, train.crime01, k = 5)
mean(knn.pred != crime01.test)
## [1] 0.1699605
At K = 10
knn.pred = knn(train.X, test.X, train.crime01, k = 10)
mean(knn.pred != crime01.test)
## [1] 0.1106719
At K = 19
knn.pred = knn(train.X, test.X, train.crime01, k = 19)
mean(knn.pred != crime01.test)
## [1] 0.1185771
K = 10 had the lowest error rate, about 11%, for this set of variables.
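For the record, the whole K sweep can be written compactly, which makes it easy to try more values (a sketch; exact error rates can vary slightly with the seed, since knn() breaks distance ties at random):
set.seed(1)
ks = c(1, 3, 5, 10, 19)
errs = sapply(ks, function(k) mean(knn(train.X, test.X, train.crime01, k = k) != crime01.test))
data.frame(K = ks, error = round(errs, 4))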