knitr::opts_chunk$set(echo = TRUE)
library(MASS)
library(ISLR)
Exercise 2
Carefully explain the differences between the KNN classifier and KNN regression methods.
The KNN classifier assigns a test observation to a qualitative class. It finds the K training observations nearest to the test point, estimates the conditional probability of each class as the fraction of those neighbors that belong to it, and assigns the class with the highest estimated probability.
KNN regression is a non-parametric method that uses the same neighborhood idea, but the response is quantitative. Instead of choosing a class, it predicts the response at the test point as the average of the responses of its K nearest training observations.
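The contrast can be sketched on a toy one-dimensional data set. This is a minimal illustration in base R, not a replacement for a library implementation; the data and the helper names `nearest`, `knn_reg`, and `knn_class` are made up for the example.

```r
# Toy training data: one predictor, a quantitative response and a class label.
train_x <- c(1, 2, 3, 6, 7, 8)
train_y <- c(1.0, 1.5, 2.0, 5.0, 5.5, 6.0)       # quantitative response
train_class <- c("a", "a", "a", "b", "b", "b")   # qualitative response

# Indices of the k training points nearest to x0 (absolute distance in 1-D).
nearest <- function(x0, x, k) order(abs(x - x0))[seq_len(k)]

# KNN regression: average the neighbors' responses.
knn_reg <- function(x0, k) mean(train_y[nearest(x0, train_x, k)])

# KNN classification: estimate class probabilities as neighbor fractions,
# then pick the most probable class.
knn_class <- function(x0, k) {
  probs <- prop.table(table(train_class[nearest(x0, train_x, k)]))
  names(probs)[which.max(probs)]
}

knn_reg(2.4, k = 3)    # mean of the responses 1.0, 1.5, 2.0
knn_class(2.4, k = 3)  # all three neighbors are class "a"
```

Both functions share the same neighbor search; only the final step differs (a majority vote versus a local average).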
Exercise 9
This exercise uses the Auto data set. Produce a scatterplot matrix which includes all of the variables in the data set.
attach(Auto)
pairs(Auto)
Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
auto_num = subset(Auto, select=-name)
cor(auto_num)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.
m1 = lm(mpg~., data=auto_num)
summary(m1)
##
## Call:
## lm(formula = mpg ~ ., data = auto_num)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
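Is there a relationship between the predictors and the response? Yes. The F-statistic of 252.4 (p-value < 2.2e-16) rejects the null hypothesis that all coefficients are zero, and the multiple R-squared indicates that about 82% of the variance of the response can be explained by the predictors.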
Which predictors appear to have a statistically significant relationship to the response? displacement, weight, year, and origin are all significant using a p-value threshold of 0.05.
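What does the coefficient for the year variable suggest? Holding the other predictors fixed, a one-unit increase in year (one model year) is associated with an expected increase in mpg of about 0.75. In other words, newer cars tend to be more fuel efficient.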
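The three answers above can be checked directly against the fitted model. The following is a sketch that refits the same model from the ISLR package's Auto data; the object names `s`, `f`, `pvals`, `car`, and `newer` are introduced here only for illustration.

```r
library(ISLR)  # provides the Auto data set

auto_num <- subset(Auto, select = -name)
m1 <- lm(mpg ~ ., data = auto_num)
s <- summary(m1)

# Overall F-test: the p-value that summary() prints as "< 2.2e-16".
f <- s$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# Predictors whose t-test p-value is below 0.05 (intercept dropped).
pvals <- coef(s)[-1, "Pr(>|t|)"]
names(pvals)[pvals < 0.05]

# Two hypothetical cars identical except for a one-unit difference in year:
# their predicted mpg differs by exactly the year coefficient.
car <- auto_num[1, ]
newer <- transform(car, year = year + 1)
predict(m1, newer) - predict(m1, car)
coef(m1)["year"]
```

The last two values agree because the model is linear in year, which is what "holding the other predictors fixed" means in practice.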
Use the plot() function to produce diagnostic plots of the linear regression fit.
par(mfrow=c(2,2))
plot(m1)
Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
m2 = lm(mpg ~ year+origin+displacement*weight, data=auto_num)
summary(m2)
##
## Call:
## lm(formula = mpg ~ year + origin + displacement * weight, data = auto_num)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6119 -1.7290 -0.0115 1.5609 12.5584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.007e+00 3.798e+00 -2.108 0.0357 *
## year 8.194e-01 4.518e-02 18.136 < 2e-16 ***
## origin 3.567e-01 2.574e-01 1.386 0.1666
## displacement -7.148e-02 9.176e-03 -7.790 6.27e-14 ***
## weight -1.054e-02 6.530e-04 -16.146 < 2e-16 ***
## displacement:weight 2.104e-05 2.214e-06 9.506 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.016 on 386 degrees of freedom
## Multiple R-squared: 0.8526, Adjusted R-squared: 0.8507
## F-statistic: 446.5 on 5 and 386 DF, p-value: < 2.2e-16
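In the output above, the displacement:weight row has a p-value below 2e-16, so this interaction appears statistically significant. One way to confirm this is a partial F-test against the same model without the interaction. The sketch below assumes the ISLR package is installed; the names `m_add`, `m_int`, `cmp`, and `t_int` are illustrative.

```r
library(ISLR)

auto_num <- subset(Auto, select = -name)

# Additive model and the same model with the displacement:weight interaction.
m_add <- lm(mpg ~ year + origin + displacement + weight, data = auto_num)
m_int <- lm(mpg ~ year + origin + displacement * weight, data = auto_num)

# Partial F-test for the single added term.
cmp <- anova(m_add, m_int)
cmp

# With one added coefficient, the F statistic equals the squared t statistic.
t_int <- coef(summary(m_int))["displacement:weight", "t value"]
c(F = cmp$F[2], t_squared = t_int^2)
```

Because only one coefficient is added, the F-test and the t-test in the summary give the same p-value; the anova() form generalizes to comparisons that add several interaction terms at once.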
m3 = lm(mpg ~ log(year)+sqrt(origin)+I(displacement^2)*weight, data=auto_num)
summary(m3)
##
## Call:
## lm(formula = mpg ~ log(year) + sqrt(origin) + I(displacement^2) *
## weight, data = auto_num)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.201 -1.943 0.005 1.559 12.791
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.192e+02 1.517e+01 -14.456 < 2e-16 ***
## log(year) 6.142e+01 3.483e+00 17.638 < 2e-16 ***
## sqrt(origin) 1.414e+00 6.894e-01 2.051 0.0409 *
## I(displacement^2) -1.407e-04 2.299e-05 -6.120 2.30e-09 ***
## weight -8.602e-03 4.483e-04 -19.190 < 2e-16 ***
## I(displacement^2):weight 4.124e-08 5.212e-09 7.913 2.69e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.062 on 386 degrees of freedom
## Multiple R-squared: 0.8481, Adjusted R-squared: 0.8461
## F-statistic: 431 on 5 and 386 DF, p-value: < 2.2e-16
Exercise 10
This exercise uses the Carseats data set. Fit a multiple regression model to predict Sales using Price, Urban, and US.
attach(Carseats)
fit=lm(Sales~Price+Urban+US)
summary(fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Sales = 13.043469 + (-0.054459) × Price + (-0.021916) × UrbanYes + (1.200573) × USYes + ε
For which of the predictors can you reject the null hypothesis H0 : βj = 0? Price and USYes.
fit2=lm(Sales~Price+US)
summary(fit2)
##
## Call:
## lm(formula = Sales ~ Price + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
confint(fit2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
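The intervals reported by confint() can be reproduced by hand as estimate ± t quantile × standard error. The sketch below uses the rounded estimates and standard errors printed in the summary(fit2) output, so it matches confint() only to about four decimal places; the variable names are illustrative.

```r
# Estimates and standard errors copied from the summary(fit2) output above.
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price = 0.00523,  USYes = 0.25846)
df_resid <- 397  # residual degrees of freedom

# 95% interval: estimate +/- t quantile * standard error.
t_crit <- qt(0.975, df_resid)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
round(ci, 4)
```

Neither interval contains zero, which agrees with the small p-values for Price and USYes.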
plot(fit2)
Exercise 12
set.seed(42)
x = rnorm(100)
y = rbinom(100,5,.5)
xyset = lm(x~y+0)
summary(xyset)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.06848 -0.65508 0.07654 0.62229 2.21125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.02513 0.04095 0.614 0.541
##
## Residual standard error: 1.04 on 99 degrees of freedom
## Multiple R-squared: 0.00379, Adjusted R-squared: -0.006272
## F-statistic: 0.3767 on 1 and 99 DF, p-value: 0.5408
yxset = lm(y~x+0)
summary(yxset)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.1239 1.2680 2.0358 3.0913 5.0182
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.1508 0.2458 0.614 0.541
##
## Residual standard error: 2.548 on 99 degrees of freedom
## Multiple R-squared: 0.00379, Adjusted R-squared: -0.006272
## F-statistic: 0.3767 on 1 and 99 DF, p-value: 0.5408
x = 1:100
y = 100:1
xyset2 = lm(x~y+0)
summary(xyset2)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
yxset2 = lm(y~x+0)
summary(yxset2)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
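The two pairs of fits above behave differently because, for a regression through the origin, the slope of the response on the predictor is sum(response * predictor) / sum(predictor^2). The numerator is the same in both directions, so the two slopes coincide exactly when sum(x^2) equals sum(y^2). A quick check in base R, with `slope0` as a helper name made up for this sketch:

```r
# Slope of a regression through the origin of `response` on `predictor`.
slope0 <- function(response, predictor) {
  sum(response * predictor) / sum(predictor^2)
}

# First example: the sums of squares differ, so the two slopes differ.
set.seed(42)
x <- rnorm(100)
y <- rbinom(100, 5, 0.5)
c(sum_x2 = sum(x^2), sum_y2 = sum(y^2))
c(x_on_y = slope0(x, y), y_on_x = slope0(y, x))

# Second example: y is a permutation of x, so the sums of squares match
# and both slopes are equal.
x <- 1:100
y <- 100:1
c(sum_x2 = sum(x^2), sum_y2 = sum(y^2))
c(x_on_y = slope0(x, y), y_on_x = slope0(y, x))
```

Any rearrangement of x, or a sign flip of some of its entries, gives the same sum of squares and therefore equal slopes in both directions.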