Carefully explain the differences between the KNN classifier and KNN regression methods.
The KNN classifier predicts a qualitative response: for a given point it finds the K nearest training observations and assigns the point to the class that is most common among those neighbors (a majority vote). The KNN regression method uses the same neighborhoods but predicts a quantitative response by averaging the response values of the K nearest neighbors. In short, the classification method approximates a decision boundary between classes, while the regression method approximates the regression function \(f(X)\).
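To make the contrast concrete, here is a minimal base-R sketch (not part of the assignment; all variable names are made up for illustration) applying both ideas to one test point with K = 3:
# Toy data: six training points in one dimension, with a class label for
# classification and a numeric response for regression.
train.x  = c(1, 2, 3, 10, 11, 12)
train.cl = c("A", "A", "A", "B", "B", "B")    # class labels (KNN classification)
train.y  = c(1.1, 1.9, 3.2, 9.8, 11.3, 12.1)  # numeric responses (KNN regression)
x0 = 2.5                                      # test observation
nn = order(abs(train.x - x0))[1:3]            # indices of the 3 nearest neighbors
# KNN classifier: majority vote among the neighbors' class labels
names(which.max(table(train.cl[nn])))         # predicts class "A"
# KNN regression: average of the neighbors' response values
mean(train.y[nn])                             # predicts roughly 2.07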
This question involves the use of multiple linear regression on the Auto data set.
a. Produce a scatterplot matrix which includes all of the variables in the data set.
library(ISLR)
pairs(Auto)
b. Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
cor(Auto[, -9])   # column 9 is the qualitative name variable
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
c. Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance: i. Is there a relationship between the predictors and the response? ii. Which predictors appear to have a statistically significant relationship to the response? iii. What does the coefficient for the year variable suggest?
lm.mpg=lm(mpg ~ . - name, data = Auto)
summary(lm.mpg)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
coef(lm.mpg)[7]
## year
## 0.7507727
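Based on the output: (i) the F-statistic of 252.4 with a p-value below 2.2e-16 indicates there is a relationship between the predictors and mpg; (ii) displacement, weight, year, and origin have statistically significant coefficients, while cylinders, horsepower, and acceleration do not; (iii) the year coefficient of about 0.75 suggests that, holding the other predictors fixed, mpg increases by roughly 0.75 for each additional model year, i.e., newer cars tend to be more fuel efficient.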
d. Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow = c(2, 2))
plot(lm.mpg)
One issue with the fit is that the relationship does not appear to be linear: the Residuals vs Fitted plot shows a pronounced curve. There also appears to be one data point, labeled 14 in the Residuals vs Leverage plot, that is an unusually large outlier with unusually high leverage. Additionally, a few other apparent outliers stand out relative to the Cook's distance contours despite not having much leverage.
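As a numeric cross-check of the plots (a sketch, not part of the original output), the studentized residuals and hat values can be inspected directly with base R's rstudent() and hatvalues():
abs.res = abs(rstudent(lm.mpg))
which(abs.res > 3)              # observations with unusually large residuals
which.max(hatvalues(lm.mpg))    # the single highest-leverage observation
max(hatvalues(lm.mpg))          # compare to the average leverage (p + 1)/n = 8/392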
e. Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
lm.compare = lm(mpg ~ . - name + weight:acceleration + year:horsepower + cylinders:displacement + origin:year, data = Auto)
summary(lm.compare)
##
## Call:
## lm(formula = mpg ~ . - name + weight:acceleration + year:horsepower +
## cylinders:displacement + origin:year, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1456 -1.4783 -0.1112 1.3427 11.1754
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.202e+01 1.722e+01 -4.763 2.72e-06 ***
## cylinders -1.119e+00 4.616e-01 -2.425 0.015769 *
## displacement -5.437e-02 1.428e-02 -3.808 0.000163 ***
## horsepower 6.034e-01 1.036e-01 5.824 1.22e-08 ***
## weight 3.921e-04 1.668e-03 0.235 0.814210
## acceleration 7.557e-01 2.604e-01 2.902 0.003929 **
## year 1.578e+00 2.036e-01 7.749 8.56e-14 ***
## origin -1.391e+00 4.610e+00 -0.302 0.763087
## weight:acceleration -2.902e-04 9.204e-05 -3.153 0.001745 **
## horsepower:year -8.976e-03 1.413e-03 -6.351 6.12e-10 ***
## cylinders:displacement 7.515e-03 1.970e-03 3.814 0.000159 ***
## year:origin 2.608e-02 5.919e-02 0.441 0.659691
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.862 on 380 degrees of freedom
## Multiple R-squared: 0.8693, Adjusted R-squared: 0.8655
## F-statistic: 229.8 on 11 and 380 DF, p-value: < 2.2e-16
Three of my four interactions appear to be statistically significant: weight:acceleration, horsepower:year, and cylinders:displacement. This is consistent with part (b), where those variable pairs showed stronger pairwise correlations than year and origin did. With these interactions, the coefficient of determination \(R^2\) increased from 0.8215 to 0.8693, which suggests the interaction terms produce a better-fitting model.
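As a further cross-check (not in the original output), a nested-model F-test comparing the additive fit from part (c) with the interaction fit above can be run with anova(); a small p-value would support keeping the interaction terms:
anova(lm.mpg, lm.compare)   # compares the part (c) model to the model with interactions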
f. Try a few different transformations of the variables, such as \(\log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.
lm.transform = lm(mpg ~ . - name + log(year) + sqrt(weight) + I(acceleration^2) + I(displacement^2), data = Auto)
summary(lm.transform)
##
## Call:
## lm(formula = mpg ~ . - name + log(year) + sqrt(weight) + I(acceleration^2) +
## I(displacement^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6314 -1.4739 0.0377 1.3991 12.3633
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.808e+03 4.634e+02 6.060 3.28e-09 ***
## cylinders 2.366e-01 3.240e-01 0.730 0.465729
## displacement -3.865e-02 1.954e-02 -1.978 0.048619 *
## horsepower -6.050e-02 1.295e-02 -4.673 4.14e-06 ***
## weight 1.730e-02 4.146e-03 4.173 3.73e-05 ***
## acceleration -1.918e+00 5.377e-01 -3.566 0.000408 ***
## year 1.164e+01 1.834e+00 6.347 6.25e-10 ***
## origin 5.127e-01 2.535e-01 2.023 0.043796 *
## log(year) -8.247e+02 1.394e+02 -5.916 7.37e-09 ***
## sqrt(weight) -2.338e+00 4.708e-01 -4.966 1.04e-06 ***
## I(acceleration^2) 5.548e-02 1.563e-02 3.550 0.000434 ***
## I(displacement^2) 6.822e-05 3.344e-05 2.040 0.042038 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.775 on 380 degrees of freedom
## Multiple R-squared: 0.8772, Adjusted R-squared: 0.8736
## F-statistic: 246.7 on 11 and 380 DF, p-value: < 2.2e-16
All four of my transformations are statistically significant at the 0.05 level, and three of the four (log(year), sqrt(weight), and acceleration^2) are significant at the 0.001 level; displacement^2 is significant only at the 0.05 level (p = 0.042). The coefficient of determination has also increased again, to 0.8772.
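For a side-by-side view (a sketch, not in the original output), the adjusted \(R^2\) of the three Auto models can be pulled from their summaries; per the output above these should be roughly 0.8182, 0.8655, and 0.8736:
c(additive     = summary(lm.mpg)$adj.r.squared,
  interactions = summary(lm.compare)$adj.r.squared,
  transformed  = summary(lm.transform)$adj.r.squared)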
a. Fit a multiple regression model to predict Sales using Price, Urban, and US.
lm.sales= lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm.sales)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
b. Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
Price: Holding the other predictors fixed, a $1 increase in price is associated with a decrease in sales of about 0.0545 thousand units, i.e. roughly 54 car seats. This coefficient is highly statistically significant, so we can be confident in this relationship.
Urban: The Urban dummy is not statistically significant, so we cannot conclude that sales differ between urban and rural locations once Price and US are accounted for.
US: The coefficient on the US dummy suggests that, all else equal, stores in the US sell about 1.2 thousand (roughly 1,200) more units than stores outside the US.
c. Write out the model in equation form, being careful to handle the qualitative variables properly.
\(Sales = 13.043469 - 0.054459(Price) - 0.021916(Urban) + 1.200573(US)\)
where Urban = 1 if the store is in an urban area and 0 if rural,
and US = 1 if the store is in the United States and 0 if not.
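To illustrate how the dummy variables enter the equation, here is a small sketch (the inputs are hypothetical, not from the assignment) predicting Sales for a store charging $120 in an urban, US location, using the part (a) fit:
b = coef(lm.sales)   # coefficients of the part (a) model
b["(Intercept)"] + b["Price"]*120 + b["UrbanYes"]*1 + b["USYes"]*1   # about 7.69 thousand units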
d. For which of the predictors can you reject the null hypothesis \(H_{0} : \beta_{j} = 0\)?
We can reject the null hypothesis for Price and US: both have p-values far below 0.05 (p < 2e-16 and p = 4.86e-06, respectively). We cannot reject it for Urban (p = 0.936).
e. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
lm.sales= lm(Sales ~ Price + US, data = Carseats)
summary(lm.sales)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
f. How well do the models in (a) and (e) fit the data?
Neither model fits the data especially well: both have a coefficient of determination of 0.2393, meaning only about 23.93% of the variability in Sales is explained by the predictors. However, the model from part (e) has a slightly higher adjusted \(R^2\) (0.2354 versus 0.2335 for the model from part (a)). Since dropping Urban leaves \(R^2\) essentially unchanged while using one fewer predictor, the smaller model from part (e) is the marginally better fit.
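The fit statistics quoted above can also be pulled directly from the model summary (a sketch; note that lm.sales now holds the part (e) fit, so the part (a) model would need to be stored under a separate name to compare programmatically):
summary(lm.sales)$r.squared       # about 0.2393
summary(lm.sales)$adj.r.squared   # about 0.2354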
g. Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).
confint(lm.sales, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
h. Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow = c(2, 2))
plot(lm.sales)
There is evidence of a few outliers, as well as a few data points with higher leverage. In particular, several points in the Residuals vs Leverage plot have leverage approaching 0.03, and one point exceeds 0.04. On the same plot, a few points also stand out relative to the Cook's distance contours, indicating possible outliers that are not particularly high in leverage.
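A numeric follow-up to the plots (a sketch, not part of the original output): with p = 2 predictors and n = 400 observations the average leverage is (p + 1)/n = 3/400 = 0.0075, so hat values near 0.03-0.04 are several times the average.
sum(hatvalues(lm.sales) > 3 * (3/400))   # how many points exceed three times the average leverage
which.max(hatvalues(lm.sales))           # the single highest-leverage observation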
a. Recall that the coefficient estimate βˆ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
The coefficient estimate for the regression of X onto Y equals the coefficient estimate for the regression of Y onto X when \(\sum_{i} x^2_{i} = \sum_{i} y^2_{i}\).
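This follows from (3.38): the two slope estimates share the same numerator and differ only in their denominators,
\[\hat{\beta}_{y \sim x} = \frac{\sum_{i} x_{i} y_{i}}{\sum_{i} x^2_{i}}, \qquad \hat{\beta}_{x \sim y} = \frac{\sum_{i} x_{i} y_{i}}{\sum_{i} y^2_{i}},\]
so they coincide exactly when the two denominators are equal.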
b. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(1)
x = rnorm(100)                  # 100 observations of X
y = 10*x + rnorm(100, sd = 2)   # Y is a noisy multiple of X, so sum(x^2) != sum(y^2)
data = data.frame(x, y)
lm.x = lm(x ~ y + 0)            # regression of X onto Y, no intercept
lm.y = lm(y ~ x + 0)            # regression of Y onto X, no intercept
summary(lm.x)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42234 -0.09331 0.04202 0.11837 0.35994
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.095811 0.002043 46.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1878 on 99 degrees of freedom
## Multiple R-squared: 0.9569, Adjusted R-squared: 0.9565
## F-statistic: 2200 on 1 and 99 DF, p-value: < 2.2e-16
summary(lm.y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8307 -1.2943 -0.3541 1.0112 4.6218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 9.988 0.213 46.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.917 on 99 degrees of freedom
## Multiple R-squared: 0.9569, Adjusted R-squared: 0.9565
## F-statistic: 2200 on 1 and 99 DF, p-value: < 2.2e-16
The coefficient estimate for X onto Y is 0.095811, while the coefficient estimate for Y onto X is 9.988, showing that the two are not equal.
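The same numbers can be recovered directly from (3.38) (a quick check, not in the original output): both estimates use the numerator sum(x*y), but they divide by different sums of squares.
sum(x*y) / sum(x^2)   # slope of y ~ x + 0, about 9.988
sum(x*y) / sum(y^2)   # slope of x ~ y + 0, about 0.0958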
c. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
set.seed(1)
x = rnorm(100)
y = x                   # identical values, so sum(x^2) == sum(y^2)
data = data.frame(x, y)
lm.x = lm(x ~ y + 0)    # regression of X onto Y, no intercept
lm.y = lm(y ~ x + 0)    # regression of Y onto X, no intercept
summary(lm.x)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.888e-16 -1.689e-17 1.339e-18 3.057e-17 2.552e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 1.000e+00 6.479e-18 1.543e+17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.833e-17 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.382e+34 on 1 and 99 DF, p-value: < 2.2e-16
summary(lm.y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.888e-16 -1.689e-17 1.339e-18 3.057e-17 2.552e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.000e+00 6.479e-18 1.543e+17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.833e-17 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.382e+34 on 1 and 99 DF, p-value: < 2.2e-16
The coefficient estimate for X onto Y is 1.00 and the coefficient estimate for Y onto X is also 1.00, so the two estimates are equal, as expected since \(\sum_{i} x^2_{i} = \sum_{i} y^2_{i}\) when y = x.
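An alternative construction (a sketch beyond the original answer): any y with the same sum of squares as x works, for example a random permutation of x, which makes the two no-intercept slopes equal without y being identical to x.
set.seed(1)
x = rnorm(100)
y = sample(x)           # same values in a different order, so sum(y^2) equals sum(x^2)
coef(lm(y ~ x + 0))     # these two slope estimates...
coef(lm(x ~ y + 0))     # ...should match each other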