Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.
(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
This is a regression problem because CEO salary is quantitative. It is an inference problem because we want to understand which factors affect CEO salary rather than predict it.
n = 500 (firms in the US)
p = 3: profit, number of employees, industry
(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
This is a classification problem since whether a new product will be a success or a failure is a binary variable. This is also a prediction problem since we’re interested in whether the new product will be a success or a failure.
n = 20 similar products previously launched
p = 13: price charged, marketing budget, competition price, and ten other variables
(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.
This is a regression scenario since % change is quantitative. This is also a prediction problem since “We are interested in predicting the % change in the USD/Euro exchange rate.”
n = 52 weeks of 2012 weekly data
p = 3: % change in US market, % change in British market, % change in German market
What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
A very flexible approach can fit non-linear relationships more closely, which lowers bias. Its disadvantages are that it requires estimating more parameters, it may follow the noise too closely (overfitting), which increases variance, and it is harder to interpret. A more flexible approach is preferred when we are interested in prediction rather than interpretability, particularly when the sample is large and the underlying relationship is strongly non-linear. A less flexible approach is preferred when we are interested in inference and interpretability, or when there are few observations or the noise is high.
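The trade-off can be illustrated with a small simulation. This is only a sketch under assumed settings: the true function sin(2x), the noise level, the sample sizes, and the polynomial degrees are all hypothetical choices, not part of the exercise.

```r
# Simulated data (assumption): y = sin(2x) + noise
set.seed(1)
simulate <- function(n) {
  x <- runif(n, -2, 2)
  data.frame(x = x, y = sin(2 * x) + rnorm(n, sd = 0.3))
}
train <- simulate(100)
test  <- simulate(1000)

# Mean squared error of a fitted model on a given data set
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

# Increasing flexibility: polynomial fits of degree 1, 3 and 10
degrees <- c(1, 3, 10)
fits <- lapply(degrees, function(d) lm(y ~ poly(x, d), data = train))

results <- data.frame(
  degree    = degrees,
  train_mse = sapply(fits, mse, data = train),
  test_mse  = sapply(fits, mse, data = test)
)
print(results)
```

Training error can only fall as the degree grows because the models are nested. Test error need not follow, which is the variance cost of extra flexibility.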
Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?
A parametric approach assumes a functional form for f, which reduces the problem of estimating f to one of estimating a set of parameters. A non-parametric approach does not assume a functional form for f and therefore requires a very large number of observations to estimate f accurately. The advantages of a parametric approach are that modeling f is simplified to estimating a few parameters and that fewer observations are needed than for a non-parametric approach. The disadvantages are that the estimate of f can be poor if the assumed form is wrong, and that choosing a more flexible parametric form to compensate can overfit the observations.
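As a rough illustration, the sketch below fits a parametric model (a straight line via lm) and a non-parametric smoother (loess) to the same simulated data. The sine-shaped truth and the noise level are assumptions chosen so that the linear form is wrong.

```r
# Simulated data (assumption): the true f is sin(x), so a linear form is misspecified
set.seed(1)
x <- seq(0, 2 * pi, length.out = 200)
y <- sin(x) + rnorm(length(x), sd = 0.2)

# Parametric: assumes f(x) = b0 + b1 * x, so only two parameters are estimated
param_fit <- lm(y ~ x)

# Non-parametric: local regression, no global functional form assumed
nonparam_fit <- loess(y ~ x)

n_params     <- length(coef(param_fit))
mse_param    <- mean(residuals(param_fit)^2)
mse_nonparam <- mean(residuals(nonparam_fit)^2)

print(c(parameters = n_params, lm_mse = mse_param, loess_mse = mse_nonparam))
```

The linear model is summarized by two numbers but cannot follow the curve. The smoother tracks the curve but relies on having enough observations near each point.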
This question should be answered using the Carseats data set.
library(ISLR)
attach(Carseats)
(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
fit<-lm(Sales~Price+Urban+US)
summary(fit)
Call:
lm(formula = Sales ~ Price + Urban + US)
Residuals:
Min 1Q Median 3Q Max
-6.9206 -1.6220 -0.0564 1.5786 7.0581
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
Price -0.054459 0.005242 -10.389 < 2e-16 ***
UrbanYes -0.021916 0.271650 -0.081 0.936
USYes 1.200573 0.259042 4.635 4.86e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
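(b) Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!
From the table above, Price and US are statistically significant predictors of Sales. Sales is recorded in thousands of units, so, holding the other predictors fixed, each $1 increase in Price is associated with a decrease of about 54 units sold (0.0545 thousand). Stores in the US sell about 1,201 more units on average than stores outside the US. The UrbanYes coefficient (-0.022) is not statistically significant (p = 0.936), so there is no evidence that Sales differs between urban and rural locations.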
(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
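\(Sales = 13.043469 - 0.054459 \times Price - 0.021916 \times Urban_{Yes} + 1.200573 \times US_{Yes}\), where \(Urban_{Yes}\) is 1 if the store is in an urban location and 0 otherwise, and \(US_{Yes}\) is 1 if the store is in the US and 0 otherwise.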
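To make the role of the indicator variables concrete, the fitted equation can be evaluated by hand. This is a minimal sketch: the function name predict_sales and the example store (Price = 100, urban, in the US) are hypothetical, and the coefficients are copied from the summary output above.

```r
# Evaluate the fitted equation; Sales is in thousands of units
predict_sales <- function(price, urban, us) {
  13.043469 -
    0.054459 * price -
    0.021916 * (urban == "Yes") +  # indicator: 1 if urban, else 0
    1.200573 * (us == "Yes")       # indicator: 1 if in the US, else 0
}

# Hypothetical store: Price = 100, urban, located in the US
example_sales <- predict_sales(100, "Yes", "Yes")
print(example_sales)

# Effect of a $1 price increase, converted from thousands of units to units
units_per_dollar <- -0.054459 * 1000
print(units_per_dollar)
```

Comparing predict_sales(100, "Yes", "Yes") with predict_sales(100, "Yes", "No") recovers the USYes coefficient, which is how a dummy-variable coefficient is read.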
(d) For which of the predictors can you reject the null hypothesis \(H_0: \beta_j = 0\)?
Price and US: both have p-values far below 0.05, while UrbanYes (p = 0.936) does not.
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
fit<-lm(Sales~Price+US)
summary(fit)
Call:
lm(formula = Sales ~ Price + US)
Residuals:
Min 1Q Median 3Q Max
-6.9269 -1.6286 -0.0574 1.5766 7.0515
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
Price -0.05448 0.00523 -10.416 < 2e-16 ***
USYes 1.19964 0.25846 4.641 4.71e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
(f) How well do the models in (a) and (e) fit the data?
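Poorly. Each model explains only about 24% of the variance in Sales (R-squared = 0.2393 in both), and the residual standard error is about 2.47 thousand units. Dropping Urban leaves the fit essentially unchanged while slightly improving the adjusted R-squared (0.2335 to 0.2354).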
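The adjusted R-squared values in the two summaries can be reproduced from the reported R-squared, which shows why the smaller model scores slightly higher. This is a sketch using the standard formula; the helper name adj_r2 is hypothetical, and small discrepancies are expected because the reported R-squared is rounded.

```r
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n = 400 stores; model (a) has p = 3 predictors, model (e) has p = 2
adj_a <- adj_r2(0.2393, n = 400, p = 3)
adj_e <- adj_r2(0.2393, n = 400, p = 2)
print(round(c(model_a = adj_a, model_e = adj_e), 4))
```

With the same R-squared, the model with fewer predictors is penalized less, so its adjusted value is higher.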
(g) Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).
confint(fit)
2.5 % 97.5 %
(Intercept) 11.79032020 14.27126531
Price -0.06475984 -0.04419543
USYes 0.69151957 1.70776632
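For reference, these intervals are the estimate plus or minus a t quantile times the standard error. The sketch below rebuilds them from the rounded values in the summary of the model from (e), so the last digits may differ slightly from the confint output.

```r
# Estimates and standard errors copied from the summary of lm(Sales ~ Price + US)
est <- c("(Intercept)" = 13.03079, Price = -0.05448, USYes = 1.19964)
se  <- c(0.63098, 0.00523, 0.25846)

# 97.5% quantile of the t distribution with 397 residual degrees of freedom
t_crit <- qt(0.975, df = 397)

ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 4))
```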
(h) Is there evidence of outliers or high leverage observations in the model from (e)?
R has built-in functions that help identify influential points using several statistics with a single command. Researchers have suggested various cutoffs for how much influence an observation may have before it is treated as unusual. For example, leverage can be compared with the average leverage \(\frac{(p+1)}{n}\), which for us is \(\frac{(2+1)}{400} = 0.0075\); observations with leverage well above this value are high-leverage points.
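The average-leverage formula can be checked directly with hatvalues. This sketch uses the built-in mtcars data as a stand-in (an assumption, so that it runs without the ISLR package), and the cutoff of three times the average is one common rule of thumb rather than a fixed standard.

```r
# Stand-in model (assumption): two predictors, fit on the built-in mtcars data
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)
h <- hatvalues(fit_demo)

n <- nrow(mtcars)
p <- 2
avg_leverage <- (p + 1) / n

# The leverages always average to (p + 1) / n
print(c(mean_hat = mean(h), formula = avg_leverage))

# Flag observations whose leverage is well above average (rule of thumb: 3x)
high_leverage <- names(h)[h > 3 * avg_leverage]
print(high_leverage)

# The same formula for the Carseats model in (e): p = 2, n = 400
carseats_avg <- (2 + 1) / 400
print(carseats_avg)
```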
par(mfrow=c(2,2))
plot(fit)
summary(influence.measures(fit))
Potentially influential observations of
lm(formula = Sales ~ Price + US) :
dfb.1_ dfb.Pric dfb.USYs dffit cov.r cook.d hat
26 0.24 -0.18 -0.17 0.28_* 0.97_* 0.03 0.01
29 -0.10 0.10 -0.10 -0.18 0.97_* 0.01 0.01
43 -0.11 0.10 0.03 -0.11 1.05_* 0.00 0.04_*
50 -0.10 0.17 -0.17 0.26_* 0.98 0.02 0.01
51 -0.05 0.05 -0.11 -0.18 0.95_* 0.01 0.00
58 -0.05 -0.02 0.16 -0.20 0.97_* 0.01 0.01
69 -0.09 0.10 0.09 0.19 0.96_* 0.01 0.01
126 -0.07 0.06 0.03 -0.07 1.03_* 0.00 0.03_*
160 0.00 0.00 0.00 0.01 1.02_* 0.00 0.02
166 0.21 -0.23 -0.04 -0.24 1.02 0.02 0.03_*
172 0.06 -0.07 0.02 0.08 1.03_* 0.00 0.02
175 0.14 -0.19 0.09 -0.21 1.03_* 0.02 0.03_*
210 -0.14 0.15 -0.10 -0.22 0.97_* 0.02 0.01
270 -0.03 0.05 -0.03 0.06 1.03_* 0.00 0.02
298 -0.06 0.06 -0.09 -0.15 0.97_* 0.01 0.00
314 -0.05 0.04 0.02 -0.05 1.03_* 0.00 0.02_*
353 -0.02 0.03 0.09 0.15 0.97_* 0.01 0.00
357 0.02 -0.02 0.02 -0.03 1.03_* 0.00 0.02
368 0.26 -0.23 -0.11 0.27_* 1.01 0.02 0.02_*
377 0.14 -0.15 0.12 0.24 0.95_* 0.02 0.01
384 0.00 0.00 0.00 0.00 1.02_* 0.00 0.02
387 -0.03 0.04 -0.03 0.05 1.02_* 0.00 0.02
396 -0.05 0.05 0.08 0.14 0.98_* 0.01 0.00
R flags the observations that violate the rule of thumb for at least one influence measure (marked with an asterisk). A common approach is to report the regression with all data included alongside one with the flagged observations removed, and compare the two.
outlying.obs<-c(26,29,43,50,51,58,69,126,160,166,172,175,210,270,298,314,353,357,368,377,384,387,396)
Carseats.small<-Carseats[-outlying.obs,]
fit2<-lm(Sales~Price+US,data=Carseats.small)
summary(fit2)
Call:
lm(formula = Sales ~ Price + US, data = Carseats.small)
Residuals:
Min 1Q Median 3Q Max
-5.263 -1.605 -0.039 1.590 5.428
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.925232 0.665259 19.429 < 2e-16 ***
Price -0.053973 0.005511 -9.794 < 2e-16 ***
USYes 1.255018 0.248856 5.043 7.15e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.29 on 374 degrees of freedom
Multiple R-squared: 0.2387, Adjusted R-squared: 0.2347
F-statistic: 58.64 on 2 and 374 DF, p-value: < 2.2e-16
With these potentially influential observations removed, very little changes relative to the linear model fit to the full data set. The confidence intervals for the coefficients from the full-data fit contain the coefficient estimates from the model with the flagged observations removed. It is reasonable to keep all of the data points in our model.
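The containment claim can be checked mechanically. The sketch below only re-uses numbers printed above: the 95% intervals from confint on the full-data model and the estimates from the refit without the flagged observations.

```r
# 95% confidence intervals from the full-data model (copied from the confint output)
ci_full <- rbind(
  "(Intercept)" = c(11.79032020, 14.27126531),
  Price         = c(-0.06475984, -0.04419543),
  USYes         = c(0.69151957, 1.70776632)
)

# Coefficient estimates from the model refit without the flagged observations
est_trimmed <- c("(Intercept)" = 12.925232, Price = -0.053973, USYes = 1.255018)

# TRUE where the trimmed estimate lies inside the full-data interval
inside <- est_trimmed >= ci_full[, 1] & est_trimmed <= ci_full[, 2]
print(inside)
```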
This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
library(ISLR)
summary(Weekly)
Year Lag1 Lag2 Lag3 Lag4
Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580 1st Qu.: -1.1580
Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410 Median : 0.2380
Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472 Mean : 0.1458
3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090 3rd Qu.: 1.4090
Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
Lag5 Volume Today Direction
Min. :-18.1950 Min. :0.08747 Min. :-18.1950 Down:484
1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540 Up :605
Median : 0.2340 Median :1.00268 Median : 0.2410
Mean : 0.1399 Mean :1.57462 Mean : 0.1499
3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
Max. : 12.0260 Max. :9.32821 Max. : 12.0260
pairs(Weekly)
cor(Weekly[, -9])
Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume
Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923 -0.030519101 0.84194162
Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876 -0.008183096 -0.06495131
Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535 -0.072499482 -0.08551314
Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865 0.060657175 -0.06928771
Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000 -0.075675027 -0.06107462
Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027 1.000000000 -0.05851741
Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617 -0.058517414 1.00000000
Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873 0.011012698 -0.03307778
Today
Year -0.032459894
Lag1 -0.075031842
Lag2 0.059166717
Lag3 -0.071243639
Lag4 -0.007825873
Lag5 0.011012698
Volume -0.033077783
Today 1.000000000
Year and Volume appear to have a strong positive relationship (correlation 0.84). The remaining correlations are all close to zero: the lag variables are not appreciably correlated with each other or with Today, so nothing here points to an obvious predictor of Direction.
(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
attach(Weekly)
The following objects are masked from Weekly (pos = 7):
Direction, Lag1, Lag2, Lag3, Lag4, Lag5, Today, Volume, Year
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(glm.fit)
Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
Volume, family = binomial, data = Weekly)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6949 -1.2565 0.9913 1.0849 1.4579
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.26686 0.08593 3.106 0.0019 **
Lag1 -0.04127 0.02641 -1.563 0.1181
Lag2 0.05844 0.02686 2.175 0.0296 *
Lag3 -0.01606 0.02666 -0.602 0.5469
Lag4 -0.02779 0.02646 -1.050 0.2937
Lag5 -0.01447 0.02638 -0.549 0.5833
Volume -0.02274 0.03690 -0.616 0.5377
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1496.2 on 1088 degrees of freedom
Residual deviance: 1486.4 on 1082 degrees of freedom
AIC: 1500.4
Number of Fisher Scoring iterations: 4
Lag2 appears to have an association with Direction (p-value = 0.03).
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
glm.probs = predict(glm.fit, type = "response")
glm.pred = rep("Down", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Up"
table(glm.pred, Direction)
Direction
glm.pred Down Up
Down 54 48
Up 430 557
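Overall fraction of correct predictions: (54+557)/(54+557+48+430) = 56.1%. For weeks when the market goes up, the model predicts Direction correctly 557/(557+48) = 92.1% of the time. For weeks when the market goes down, it is correct only 54/(430+54) = 11.2% of the time, so it is wrong on 88.8% of down weeks. The model predicts Up far too often. These rates are computed on the same data used to fit the model, so they are training rates and are likely optimistic.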
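The same rates can be computed from the confusion matrix itself. This is a small sketch that re-enters the four counts from the table above; the object names are hypothetical.

```r
# Confusion matrix from part (c): rows = predicted, columns = actual
conf <- matrix(
  c(54, 430, 48, 557),
  nrow = 2,
  dimnames = list(predicted = c("Down", "Up"), actual = c("Down", "Up"))
)

accuracy     <- sum(diag(conf)) / sum(conf)                # overall fraction correct
up_correct   <- conf["Up", "Up"] / sum(conf[, "Up"])       # correct when market goes up
down_correct <- conf["Down", "Down"] / sum(conf[, "Down"]) # correct when market goes down

print(round(c(accuracy = accuracy, up = up_correct, down = down_correct), 3))
```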
(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
train = (Year < 2009)
Weekly.0910 = Weekly[!train, ]
glm.fit = glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
glm.probs = predict(glm.fit, Weekly.0910, type = "response")
glm.pred = rep("Down", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Up"
Direction.0910 = Direction[!train]
table(glm.pred, Direction.0910)
Direction.0910
glm.pred Down Up
Down 9 5
Up 34 56
mean(glm.pred == Direction.0910)
[1] 0.625
(e) Repeat (d) using LDA.
library(MASS)
lda.fit = lda(Direction ~ Lag2, data = Weekly, subset = train)
lda.pred = predict(lda.fit, Weekly.0910)
table(lda.pred$class, Direction.0910)
Direction.0910
Down Up
Down 9 5
Up 34 56
mean(lda.pred$class == Direction.0910)
[1] 0.625
(f) Repeat (d) using QDA.
qda.fit = qda(Direction ~ Lag2, data = Weekly, subset = train)
qda.class = predict(qda.fit, Weekly.0910)$class
table(qda.class, Direction.0910)
Direction.0910
qda.class Down Up
Down 0 0
Up 43 61
mean(qda.class == Direction.0910)
[1] 0.5865385
Notably, QDA predicts Up every week and is still right 58.7% of the time, which is simply the share of Up weeks in the held-out data (61/104).
(g) Repeat (d) using KNN with K = 1.
library(class)
train.X = as.matrix(Lag2[train])
test.X = as.matrix(Lag2[!train])
train.Direction = Direction[train]
set.seed(1)
knn.pred = knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction.0910)
Direction.0910
knn.pred Down Up
Down 21 30
Up 22 31
mean(knn.pred == Direction.0910)
[1] 0.5
(h) Which of these methods appears to provide the best results on this data?
Logistic regression and LDA give the highest accuracy on the held-out data (62.5% each), ahead of QDA (58.7%) and KNN with K = 1 (50%).
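For a side-by-side view, the held-out accuracies can be recomputed from the four confusion matrices in (d) through (g), along with the accuracy of always predicting Up. This sketch only re-enters counts shown above; the column labels are hypothetical shorthand.

```r
# Held-out counts from parts (d)-(g).
# DD = predicted Down, actual Down; DU = predicted Down, actual Up;
# UD = predicted Up, actual Down;   UU = predicted Up, actual Up
counts <- rbind(
  logistic = c(9, 5, 34, 56),
  LDA      = c(9, 5, 34, 56),
  QDA      = c(0, 0, 43, 61),
  KNN_k1   = c(21, 30, 22, 31)
)
colnames(counts) <- c("DD", "DU", "UD", "UU")

accuracy <- (counts[, "DD"] + counts[, "UU"]) / rowSums(counts)

# Baseline: always predict Up (61 of the 104 held-out weeks are Up)
baseline_always_up <- (5 + 56) / 104

print(round(accuracy, 4))
print(round(baseline_always_up, 4))
```

Logistic regression and LDA beat the always-Up baseline by about four percentage points, and KNN with K = 1 falls below it.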
(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
# Logistic regression with Lag2:Lag1
glm.fit = glm(Direction ~ Lag2:Lag1, data = Weekly, family = binomial, subset = train)
glm.probs = predict(glm.fit, Weekly.0910, type = "response")
glm.pred = rep("Down", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Up"
Direction.0910 = Direction[!train]
table(glm.pred, Direction.0910)
Direction.0910
glm.pred Down Up
Down 1 1
Up 42 60
mean(glm.pred == Direction.0910)
[1] 0.5865385
# LDA with Lag2 interaction with Lag1
lda.fit = lda(Direction ~ Lag2:Lag1, data = Weekly, subset = train)
lda.class = predict(lda.fit, Weekly.0910)$class
table(lda.class, Direction.0910)
Direction.0910
lda.class Down Up
Down 0 1
Up 43 60
mean(lda.class == Direction.0910)
[1] 0.5769231
# QDA with sqrt(abs(Lag2))
qda.fit = qda(Direction ~ Lag2 + sqrt(abs(Lag2)), data = Weekly, subset = train)
qda.class = predict(qda.fit, Weekly.0910)$class
table(qda.class, Direction.0910)
Direction.0910
qda.class Down Up
Down 12 13
Up 31 48
mean(qda.class == Direction.0910)
[1] 0.5769231
# KNN k =10
knn.pred = knn(train.X, test.X, train.Direction, k = 10)
table(knn.pred, Direction.0910)
Direction.0910
knn.pred Down Up
Down 17 18
Up 26 43
mean(knn.pred == Direction.0910)
[1] 0.5769231
# KNN k = 100
knn.pred = knn(train.X, test.X, train.Direction, k = 100)
table(knn.pred, Direction.0910)
Direction.0910
knn.pred Down Up
Down 9 12
Up 34 49
mean(knn.pred == Direction.0910)
[1] 0.5576923
The original LDA and logistic regression outperform all of these models with an accuracy of 62.5%.
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
library(ISLR)
summary(Auto)
mpg cylinders displacement horsepower weight acceleration
Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613 Min. : 8.00
1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225 1st Qu.:13.78
Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804 Median :15.50
Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978 Mean :15.54
3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615 3rd Qu.:17.02
Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140 Max. :24.80
year origin name mpg01
Min. :70.00 Min. :1.000 amc matador : 5 Min. :0.0
1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5 1st Qu.:0.0
Median :76.00 Median :1.000 toyota corolla : 5 Median :0.5
Mean :75.98 Mean :1.577 amc gremlin : 4 Mean :0.5
3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4 3rd Qu.:1.0
Max. :82.00 Max. :3.000 chevrolet chevette: 4 Max. :1.0
(Other) :365
attach(Auto)
The following object is masked _by_ .GlobalEnv:
mpg01
The following objects are masked from Auto (pos = 5):
acceleration, cylinders, displacement, horsepower, mpg, name, origin, weight, year
mpg01 = rep(0, length(mpg))
mpg01[mpg > median(mpg)] = 1
Auto = data.frame(Auto, mpg01)
(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
cor(Auto[, -9])
mpg cylinders displacement horsepower weight acceleration year
mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442 0.4233285 0.5805410
cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273 -0.5046834 -0.3456474
displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944 -0.5438005 -0.3698552
horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377 -0.6891955 -0.4163615
weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000 -0.4168392 -0.3091199
acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392 1.0000000 0.2903161
year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199 0.2903161 1.0000000
origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054 0.2127458 0.1815277
mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566 0.3468215 0.4299042
mpg01.1 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566 0.3468215 0.4299042
origin mpg01 mpg01.1
mpg 0.5652088 0.8369392 0.8369392
cylinders -0.5689316 -0.7591939 -0.7591939
displacement -0.6145351 -0.7534766 -0.7534766
horsepower -0.4551715 -0.6670526 -0.6670526
weight -0.5850054 -0.7577566 -0.7577566
acceleration 0.2127458 0.3468215 0.3468215
year 0.1815277 0.4299042 0.4299042
origin 1.0000000 0.5136984 0.5136984
mpg01 0.5136984 1.0000000 1.0000000
mpg01.1 0.5136984 1.0000000 1.0000000
pairs(Auto) # doesn't work well since mpg01 is 0 or 1
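mpg01 is strongly negatively correlated with cylinders (-0.76), weight (-0.76), displacement (-0.75), and horsepower (-0.67), so these four features look most useful for predicting it. It is also strongly correlated with mpg (0.84), as expected, since mpg01 is derived directly from mpg; mpg itself should therefore not be used as a predictor.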
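Since pairs() is hard to read for a 0/1 variable, a per-group comparison is often clearer. The sketch below uses the built-in mtcars data as a stand-in for Auto (an assumption, so that it runs without the ISLR package); its columns cyl, disp, hp, and wt play the roles of cylinders, displacement, horsepower, and weight.

```r
# Stand-in data (assumption): mtcars in place of Auto
cars <- mtcars
cars$mpg01 <- as.integer(cars$mpg > median(cars$mpg))

features <- c("cyl", "disp", "hp", "wt")

# Median of each feature within the low (0) and high (1) mileage groups
group_medians <- sapply(features, function(v) tapply(cars[[v]], cars$mpg01, median))
print(group_medians)

# Correlation of each feature with the binary response
cors <- sapply(features, function(v) cor(cars[[v]], cars$mpg01))
print(round(cors, 2))

# A boxplot shows the same comparison graphically, for example:
# boxplot(wt ~ mpg01, data = cars)
```

On the Auto data the analogous call would be something like boxplot(weight ~ mpg01, data = Auto), repeated for each candidate feature.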
(c) Split the data into a training set and a test set.
train = (year%%2 == 0) # if the year is even
test = !train
Auto.train = Auto[train, ]
Auto.test = Auto[test, ]
mpg01.test = mpg01[test]
(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
# LDA
library(MASS)
lda.fit = lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = Auto, subset = train)
lda.pred = predict(lda.fit, Auto.test)
mean(lda.pred$class != mpg01.test)
[1] 0.1263736
(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
# QDA
qda.fit = qda(mpg01 ~ cylinders + weight + displacement + horsepower, data = Auto, subset = train)
qda.pred = predict(qda.fit, Auto.test)
mean(qda.pred$class != mpg01.test)
[1] 0.1318681
(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
# Logistic regression
glm.fit = glm(mpg01 ~ cylinders + weight + displacement + horsepower, data = Auto,
family = binomial, subset = train)
glm.probs = predict(glm.fit, Auto.test, type = "response")
glm.pred = rep(0, length(glm.probs))
glm.pred[glm.probs > 0.5] = 1
mean(glm.pred != mpg01.test)
[1] 0.1208791
(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
library(class)
train.X = cbind(cylinders, weight, displacement, horsepower)[train, ]
test.X = cbind(cylinders, weight, displacement, horsepower)[test, ]
train.mpg01 = mpg01[train]
set.seed(1)
# KNN(k=1)
knn.pred = knn(train.X, test.X, train.mpg01, k = 1)
mean(knn.pred != mpg01.test)
[1] 0.1538462
# KNN(k=10)
knn.pred = knn(train.X, test.X, train.mpg01, k = 10)
mean(knn.pred != mpg01.test)
[1] 0.1648352
# KNN(k=100)
knn.pred = knn(train.X, test.X, train.mpg01, k = 100)
mean(knn.pred != mpg01.test)
[1] 0.1428571
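K | Test Error Rate |
---|---|
1 | 15.4% |
10 | 16.5% |
100 | 14.3% |
Of the three values tried, K = 100 has the lowest test error (14.3%), though it is still higher than the test errors of LDA, QDA, and logistic regression above.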
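Trying more values of K is easier with a loop than with repeated calls. The sketch below shows one way to do it, again using the built-in mtcars data as a stand-in for Auto (an assumption); the even/odd row split and the K values are arbitrary choices for illustration.

```r
library(class)
set.seed(1)

# Stand-in data (assumption): mtcars in place of Auto
mpg01 <- as.integer(mtcars$mpg > median(mtcars$mpg))
X <- as.matrix(mtcars[, c("cyl", "wt", "disp", "hp")])

# Arbitrary split: even-numbered rows for training, odd-numbered rows for testing
train <- seq_len(nrow(mtcars)) %% 2 == 0

# Test error for each candidate K
k_values <- c(1, 3, 5)
test_error <- sapply(k_values, function(k) {
  pred <- knn(X[train, ], X[!train, ], cl = factor(mpg01[train]), k = k)
  mean(as.character(pred) != as.character(mpg01[!train]))
})

print(data.frame(K = k_values, test_error = test_error))
```

With the Auto data, the same loop would use train.X, test.X, train.mpg01, and mpg01.test from above in place of the stand-in objects.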