2. Carefully explain the differences between the KNN classifier and KNN regression methods.
The KNN classifier is used when the response variable is categorical (qualitative), while KNN regression is used when the response is quantitative. For a given x, the KNN classifier finds the K nearest training observations and assigns x to the class that is most common among their y values. KNN regression instead predicts y for a given x as the average of the y values of those K nearest neighbors.
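The contrast can be sketched in a few lines of base R. This is a minimal illustration and not a production implementation: the training values and the helper names `nearest_k`, `knn_classify`, and `knn_regress` are made up for this example, and distance is measured in one dimension only.

```r
# Toy training data (hypothetical values, for illustration only)
train_x     <- c(1, 2, 3, 6, 7, 8)
train_class <- factor(c("A", "A", "A", "B", "B", "B"))  # qualitative response
train_y     <- c(1.0, 1.5, 2.0, 5.0, 5.5, 6.0)          # quantitative response

# Indices of the k training points closest to x0 (absolute distance in 1-D)
nearest_k <- function(x0, k) order(abs(train_x - x0))[1:k]

# KNN classifier: most common class among the k nearest neighbors
knn_classify <- function(x0, k) {
  votes <- table(train_class[nearest_k(x0, k)])
  names(votes)[which.max(votes)]
}

# KNN regression: mean response of the k nearest neighbors
knn_regress <- function(x0, k) mean(train_y[nearest_k(x0, k)])

pred_class <- knn_classify(2.5, k = 3)  # a class label
pred_value <- knn_regress(2.5, k = 3)   # a number
```

Both functions use the same neighbors; only the way the neighbors' responses are combined differs (a vote versus an average).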
9. This question involves the use of multiple linear regression on the Auto data set.
(a) Produce a scatterplot matrix which includes all of the variables in the data set.
library(ISLR)
pairs(Auto) #scatterplot matrix

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
cor(Auto[1:8]) #correlation matrix
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
cor(Auto[, -9]) # this way also works and produces the same results
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
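A third option is to select the numeric columns by type instead of by position. The sketch below uses a small stand-in data frame with made-up values so that it runs without the ISLR package; for Auto, where the str() output shows name as the only non-numeric column, `cor(Auto[sapply(Auto, is.numeric)])` should give the same matrix as above.

```r
# Small stand-in data frame (hypothetical values) with one qualitative column
cars <- data.frame(
  mpg    = c(18, 15, 24, 31),
  weight = c(3504, 3693, 2372, 1773),
  name   = factor(c("a", "b", "c", "d"))
)

# Keep only the numeric columns, then compute the correlation matrix
numeric_cols <- sapply(cars, is.numeric)
cor_mat <- cor(cars[, numeric_cols])
```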
(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
linear <- lm(mpg ~ . -name + cylinders*horsepower, data=Auto)
summary(linear)
##
## Call:
## lm(formula = mpg ~ . - name + cylinders * horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.2399 -1.6871 -0.0511 1.2858 11.9380
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.7025260 4.9115648 2.383 0.017676 *
## cylinders -4.3060695 0.4580950 -9.400 < 2e-16 ***
## displacement -0.0013925 0.0069110 -0.201 0.840426
## horsepower -0.3156601 0.0306339 -10.304 < 2e-16 ***
## weight -0.0038948 0.0006231 -6.250 1.09e-09 ***
## acceleration -0.1703028 0.0901427 -1.889 0.059612 .
## year 0.7393193 0.0448736 16.476 < 2e-16 ***
## origin 0.9031644 0.2496880 3.617 0.000338 ***
## cylinders:horsepower 0.0402008 0.0037856 10.619 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.929 on 383 degrees of freedom
## Multiple R-squared: 0.8621, Adjusted R-squared: 0.8592
## F-statistic: 299.3 on 8 and 383 DF, p-value: < 2.2e-16
The interaction between cylinders and horsepower is statistically significant (p < 2e-16).
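For reference, the two formula symbols differ as follows: `a*b` expands to `a + b + a:b`, while `a:b` adds only the interaction term. Because the model above already includes every main effect through `.`, writing `cylinders:horsepower` there would fit the same model. A minimal sketch, using the built-in mtcars data as a stand-in for Auto (the variable names `cyl` and `hp` belong to mtcars, not to Auto):

```r
# mtcars ships with base R; it is used here only as a stand-in for Auto
fit_star  <- lm(mpg ~ cyl * hp, data = mtcars)           # main effects + interaction
fit_colon <- lm(mpg ~ cyl + hp + cyl:hp, data = mtcars)  # same model, written out
fit_only  <- lm(mpg ~ cyl:hp, data = mtcars)             # interaction term only

names(coef(fit_star))   # intercept, cyl, hp, cyl:hp
names(coef(fit_only))   # intercept, cyl:hp
```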
10. This question should be answered using the Carseats data set.
(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
linear <- lm(Sales ~ Price+Urban+US, data=Carseats)
summary(linear)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
(b) Interpret the coefficients
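Sales is recorded in thousands of units, so each coefficient is multiplied by 1,000 to express it in units. Holding the other predictors fixed, each one-dollar increase in price is associated with a decrease of about 54.46 units in sales. The UrbanYes coefficient indicates that urban stores sell about 21.92 fewer units than rural stores, although this difference is not statistically significant (p = 0.936). The USYes coefficient shows that US stores sell about 1,201 more units on average than non-US stores.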
(d) For which of the predictors can you reject the null hypothesis?
Price and US are statistically significant (both p-values are far below 0.05), so we can reject the null hypothesis for these two predictors. We cannot reject it for Urban (p = 0.936).
(e) Fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
linear <- lm(Sales ~ Price+US, data=Carseats) #removed the non significant variable Urban
summary(linear)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
(f) How well do the models in (a) and (e) fit the data?
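Both models have an R-squared of 0.2393, indicating that only 23.93% of the variation in Sales is explained by the predictors. The model in (e) has a slightly higher adjusted R-squared (0.2354 versus 0.2335) and a slightly lower residual standard error (2.469 versus 2.472), so dropping Urban does not hurt the fit.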
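The two adjusted R-squared values can be checked by hand from the standard adjustment formula, taking n = 400 (implied by the residual degrees of freedom in the output) and p as the number of predictors. The results are approximate because the R-squared of 0.2393 is itself rounded.

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}

\text{(a), } p = 3:\quad 1 - (1 - 0.2393)\,\frac{399}{396} \approx 0.2335

\text{(e), } p = 2:\quad 1 - (1 - 0.2393)\,\frac{399}{397} \approx 0.235
```

With R-squared unchanged, removing a predictor shrinks the penalty factor, which is why the smaller model has the higher adjusted value.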
(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s)
confint(linear)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
(h) Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow = c(2, 2))
plot(linear)

There appears to be one outlier indicated in the leverage plot.
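The plots can be supplemented with numeric checks using `rstudent()` and `hatvalues()`. The sketch below runs on simulated data so that it is self-contained; the cutoffs (an absolute studentized residual above 3, and a leverage above twice the average leverage (p + 1)/n) are common rules of thumb assumed here, not fixed thresholds. For the model from (e), the same calls would be `rstudent(linear)` and `hatvalues(linear)`.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated data (hypothetical), with one point placed far from the rest in x
x <- c(rnorm(49), 10)
y <- 2 * x + rnorm(50)
fit <- lm(y ~ x)

stud_res <- rstudent(fit)    # studentized residuals
lev      <- hatvalues(fit)   # leverage statistics

p <- length(coef(fit)) - 1   # number of predictors
n <- length(y)

possible_outliers <- which(abs(stud_res) > 3)        # assumed rule of thumb
high_leverage     <- which(lev > 2 * (p + 1) / n)    # assumed rule of thumb
```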
12. This problem involves simple linear regression without an intercept.
(a) Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
They are the same when \[\sum_jx_j^2 = \sum_jy_j^2\]
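This follows from the least-squares estimates for regression without an intercept, written here with the subscript notation Y ∼ X for "Y onto X" (a labeling chosen for this note):

```latex
\hat{\beta}_{Y \sim X} = \frac{\sum_j x_j y_j}{\sum_j x_j^2},
\qquad
\hat{\beta}_{X \sim Y} = \frac{\sum_j x_j y_j}{\sum_j y_j^2}
```

The numerators are identical, so the two estimates agree exactly when the denominators agree (or, trivially, when the shared numerator is zero).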
(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
x = rnorm(100, mean=3, sd=10) #draw 100 observations of x from a normal distribution with rnorm(); the mean and standard deviation are arbitrary
y = rnorm(100, mean=5, sd=13) #draw y with a different mean and standard deviation so that its sum of squares differs from that of x, which makes the two coefficient estimates differ
fit_y <- lm(y ~ x + 0)
fit_x <- lm(x ~ y + 0)
summary(fit_y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.556 -4.033 4.798 13.694 37.592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.009347 0.138520 0.067 0.946
##
## Residual standard error: 14.29 on 99 degrees of freedom
## Multiple R-squared: 4.599e-05, Adjusted R-squared: -0.01005
## F-statistic: 0.004553 on 1 and 99 DF, p-value: 0.9463
summary(fit_x)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.475 -2.939 4.273 10.308 22.774
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.00492 0.07292 0.067 0.946
##
## Residual standard error: 10.37 on 99 degrees of freedom
## Multiple R-squared: 4.599e-05, Adjusted R-squared: -0.01005
## F-statistic: 0.004553 on 1 and 99 DF, p-value: 0.9463
(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
x <- 1:100 #The coefficient estimates are the same when x and y have the same sum of squares, so set x to the numbers 1 to 100 in order
y <- 100:1 #Set y to the same numbers in reverse (100 to 1 in descending order), so sum(y^2) equals sum(x^2)
fit_y <- lm(y ~ x + 0) #Regress y onto x without an intercept
fit_x <- lm(x ~ y + 0) #Regress x onto y without an intercept
summary(fit_y) #Output the results and compare. The coefficient estimate for both is 0.5075
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
summary(fit_x)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
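The match can also be confirmed without lm(). Without an intercept, each least-squares slope is the sum of the products x*y divided by the sum of squares of the predictor, so a quick check in base R (the names `slope_y_on_x` and `slope_x_on_y` are chosen for this sketch) is:

```r
x <- 1:100
y <- 100:1

# Without an intercept, each slope is sum(x * y) over the predictor's sum of squares
slope_y_on_x <- sum(x * y) / sum(x^2)
slope_x_on_y <- sum(x * y) / sum(y^2)

sum(x^2) == sum(y^2)            # y is a reordering of x, so the sums of squares match
c(slope_y_on_x, slope_x_on_y)   # both round to 0.5075
```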