2. Carefully explain the differences between the KNN classifier and KNN regression methods.
The KNN classifier is used for qualitative (classification) problems. It identifies the K training observations closest to x0 and estimates the conditional probability P( Y=j | X=x0 ) for class j as the fraction of points in that neighborhood whose response values equal j; x0 is then assigned to the class with the highest estimated probability. The KNN regression method is used for quantitative problems. It also identifies the K nearest neighbors of x0, but it estimates f(x0) as the average of all the responses in the neighborhood.
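A minimal base-R sketch of the difference, using made-up one-dimensional training data (the values, x0, and K = 3 are assumptions chosen only for illustration):

```r
# Toy training data (hypothetical values, for illustration only)
train_x     <- c(1.0, 1.5, 2.0, 3.0, 3.5, 5.0)
train_class <- c("a", "a", "b", "b", "b", "a")   # qualitative response
train_y     <- c(2.1, 2.9, 4.2, 6.1, 6.8, 10.3)  # quantitative response
x0 <- 2.2
K  <- 3

# Indices of the K training points closest to x0 (shared by both methods)
nbrs <- order(abs(train_x - x0))[1:K]

# KNN classifier: estimate P(Y = j | X = x0) as the fraction of neighbors in class j
class_probs <- table(train_class[nbrs]) / K
knn_class   <- names(which.max(class_probs))

# KNN regression: estimate f(x0) as the average response of the neighbors
knn_reg <- mean(train_y[nbrs])

class_probs
knn_class
knn_reg
```

Both methods use the same neighborhood; only the summary of the neighbors' responses differs (class proportions versus a mean).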
9. This question involves the use of multiple linear regression on the Auto data set.
a) Produce a scatterplot matrix which includes all of the variables in the data set.
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.5.3
pairs(Auto) #the pairs function gives you scatterplots for each variable combination.

b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
names(Auto) #Use the names function to see the variables
## [1] "mpg" "cylinders" "displacement" "horsepower"
## [5] "weight" "acceleration" "year" "origin"
## [9] "name"
cor(Auto[1:8]) #Only find the correlations for column 1 - 8 since we want to exclude the name column (9)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
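Selecting columns 1 to 8 works because name happens to be the last column. A sketch of an alternative that drops qualitative columns by type rather than by position, shown on R's built-in iris data as a stand-in (an assumption made so the example runs without the ISLR package; the same pattern should apply to Auto):

```r
# iris ships with base R; Species is a factor (qualitative), the rest are numeric
numeric_cols <- sapply(iris, is.numeric)   # TRUE for numeric columns only
cor_mat <- cor(iris[, numeric_cols])       # correlation matrix of numeric columns
round(cor_mat, 3)
```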
e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
interaction_fit <- lm(mpg ~ weight*displacement+acceleration:weight, data = Auto) #weight*displacement tells R to fit both main effects plus their interaction; acceleration:weight adds only the interaction between acceleration and weight, without a main effect for acceleration.
summary(interaction_fit)
##
## Call:
## lm(formula = mpg ~ weight * displacement + acceleration:weight,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.0690 -2.5691 -0.3685 1.8236 17.0265
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.587e+01 2.039e+00 27.398 < 2e-16 ***
## weight -1.190e-02 1.273e-03 -9.347 < 2e-16 ***
## displacement -7.795e-02 1.119e-02 -6.966 1.41e-11 ***
## weight:displacement 2.060e-05 2.941e-06 7.005 1.10e-11 ***
## weight:acceleration 1.005e-04 3.240e-05 3.100 0.00207 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.053 on 387 degrees of freedom
## Multiple R-squared: 0.7332, Adjusted R-squared: 0.7304
## F-statistic: 265.8 on 4 and 387 DF, p-value: < 2.2e-16
The interactions between weight and displacement (p = 1.10e-11) and between weight and acceleration (p = 0.00207) are both statistically significant.
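To make the difference between the two symbols concrete, here is a small sketch on R's built-in mtcars data (a stand-in chosen so the example runs without the ISLR package; wt and disp are mtcars columns, not Auto columns):

```r
# a * b is shorthand for a + b + a:b (main effects plus interaction)
fit_star  <- lm(mpg ~ wt * disp, data = mtcars)
fit_colon <- lm(mpg ~ wt + disp + wt:disp, data = mtcars)
all.equal(coef(fit_star), coef(fit_colon))   # the two fits are the same model

# a:b on its own adds only the interaction term, with no main effects
fit_int_only <- lm(mpg ~ wt:disp, data = mtcars)
names(coef(fit_int_only))
```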
10. This question should be answered using the Carseats data set.
a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
fit_a <- lm(Sales ~ Price+Urban+US, data=Carseats)
summary(fit_a)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
b) Interpret the coefficients
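Sales is measured in thousands of units. Holding the other predictors fixed, each one-dollar increase in price is associated with a decrease in sales of about 54.459 units. The UrbanYes coefficient indicates that urban stores sell about 21.916 fewer units than rural stores, but this difference is not statistically significant (p = 0.936). The USYes coefficient shows that US stores sell about 1,201 more units on average than non-US stores.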
c) Write out the model in equation form.
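\[Sales = 13.043469 - 0.054459 \times Price - 0.021916 \times Urban + 1.200573 \times US\]
where Urban = 1 for an urban store and 0 otherwise, and US = 1 for a US store and 0 otherwise.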
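As a worked example of the equation, the sketch below plugs in a hypothetical urban US store charging a price of 100 (the store and its price are assumptions for illustration; the coefficients are copied from the summary(fit_a) output):

```r
# Coefficients copied from the summary(fit_a) output
b0      <- 13.043469
b_price <- -0.054459
b_urban <- -0.021916
b_us    <-  1.200573

# Hypothetical store: Price = 100, Urban = Yes (1), US = Yes (1)
price <- 100
urban <- 1
us    <- 1

# Predicted Sales, in thousands of units
pred_sales <- b0 + b_price * price + b_urban * urban + b_us * us
pred_sales
```

The result is about 8.776, that is, roughly 8,776 units.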
d) For which of the predictors can you reject the null hypothesis?
We can reject the null hypothesis for Price and US because they are both statistically significant (p < 0.001). We cannot reject it for Urban (p = 0.936).
e) Fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
fit_e <- lm(Sales ~ Price+US, data = Carseats) #Price and US were significant in the previous output, so this linear model drops the insignificant predictor (Urban)
summary(fit_e)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
f) How well do the models in (a) and (e) fit the data?
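The R-squared value is 0.2393 in both models, so only about 24% of the variation in Sales is explained by either linear model. The smaller model in (e) fits marginally better once the number of predictors is taken into account (adjusted R-squared 0.2354 vs. 0.2335; residual standard error 2.469 vs. 2.472).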
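One way to compare two nested models like these is to look at R-squared, adjusted R-squared, and an F-test side by side. The sketch below uses simulated data with one useful predictor and one pure-noise predictor (all names, the seed, and the data are assumptions for illustration, not the Carseats fits):

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
n     <- 200
x1    <- rnorm(n)                # predictor that drives the response
noise <- rnorm(n)                # predictor unrelated to the response by construction
y     <- 2 + 1.5 * x1 + rnorm(n)

fit_big   <- lm(y ~ x1 + noise)  # analogous to a model with an uninformative predictor
fit_small <- lm(y ~ x1)          # analogous to the reduced model

# R-squared never decreases when a predictor is added; adjusted R-squared can
c(big = summary(fit_big)$r.squared,     small = summary(fit_small)$r.squared)
c(big = summary(fit_big)$adj.r.squared, small = summary(fit_small)$adj.r.squared)

# F-test of whether the extra predictor improves the fit
anova(fit_small, fit_big)
```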
g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s)
confint(fit_e) #the confint() function gives us the 95% confidence intervals for the coefficients of our model
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
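These intervals can be reproduced by hand as estimate ± t-quantile × standard error. A short sketch using the rounded estimates and standard errors printed by summary(fit_e), with 397 residual degrees of freedom (small differences from confint() come from that rounding):

```r
# Estimates and standard errors copied from the summary(fit_e) output
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price =  0.00523, USYes = 0.25846)
df_resid <- 397

# 97.5% quantile of the t distribution gives a two-sided 95% interval
t_crit <- qt(0.975, df_resid)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci
```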
h) Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow = c(2, 2))
plot(fit_e)

The fourth plot (Residuals vs Leverage) shows that there may be outliers, with standardized residuals less than -2 or greater than 2.
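The plots can be backed up with numbers. The sketch below flags candidate outliers and high-leverage points on simulated data with one planted high-leverage observation (the data, the seed, and the cutoffs are assumptions; |studentized residual| > 3 and leverage > 2(p+1)/n are common rules of thumb, not hard thresholds). The same calls should work on fit_e.

```r
set.seed(2)                      # arbitrary seed, only for reproducibility
n <- 50
x <- rnorm(n)
x[n] <- 20                       # planted point far from the other x values
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

p    <- length(coef(fit)) - 1    # number of predictors
lev  <- hatvalues(fit)           # leverage statistic for each observation
stud <- rstudent(fit)            # studentized residuals

high_lev <- which(lev > 2 * (p + 1) / n)   # leverage well above the average (p+1)/n
outliers <- which(abs(stud) > 3)           # unusually large residuals

high_lev
outliers
```

The average leverage is always (p+1)/n, so observations far above that value deserve a closer look.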
12. This problem involves simple linear regression without an intercept.
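a) Without an intercept, the coefficient estimate for the regression of Y onto X is \(\sum_jx_jy_j / \sum_jx_j^2\), and for the regression of X onto Y it is \(\sum_jx_jy_j / \sum_jy_j^2\). They are the same when \[\sum_jx_j^2 = \sum_jy_j^2\]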
b) Generate an example in R with n=100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
x = rnorm(100, mean=10, sd=10) #Draw 100 observations for x from a normal distribution with rnorm(). Any mean and standard deviation will do.
y = rnorm(100, mean=5, sd=20) #Use a different mean and standard deviation for y so that the sums of squares of x and y differ, which makes the two coefficient estimates differ
fit_y <- lm(y ~ x + 0) #regress y onto x without an intercept
fit_x <- lm(x ~ y + 0) #regress x onto y without an intercept
summary(fit_y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.677 -11.199 2.573 13.402 54.805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x -0.004328 0.141147 -0.031 0.976
##
## Residual standard error: 20.46 on 99 degrees of freedom
## Multiple R-squared: 9.499e-06, Adjusted R-squared: -0.01009
## F-statistic: 0.0009404 on 1 and 99 DF, p-value: 0.9756
summary(fit_x)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.424 1.754 10.577 16.402 35.682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y -0.002195 0.071563 -0.031 0.976
##
## Residual standard error: 14.57 on 99 degrees of freedom
## Multiple R-squared: 9.499e-06, Adjusted R-squared: -0.01009
## F-statistic: 0.0009404 on 1 and 99 DF, p-value: 0.9756
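The two estimates differ because they share a numerator but have different denominators. A quick check of the closed-form expressions against lm(), on a fresh simulated sample (the seed is an assumption added for reproducibility, so the numbers will not match the output above):

```r
set.seed(3)                          # arbitrary seed, only for reproducibility
x <- rnorm(100, mean = 10, sd = 10)
y <- rnorm(100, mean = 5,  sd = 20)

# Closed-form no-intercept least squares estimates
beta_y_on_x <- sum(x * y) / sum(x^2)   # regression of y onto x
beta_x_on_y <- sum(x * y) / sum(y^2)   # regression of x onto y

# Compare with lm() fitted without an intercept
c(closed_form = beta_y_on_x, lm = unname(coef(lm(y ~ x + 0))))
c(closed_form = beta_x_on_y, lm = unname(coef(lm(x ~ y + 0))))

# The denominators differ, so the estimates differ
c(sum_x_sq = sum(x^2), sum_y_sq = sum(y^2))
```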
c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
xc <- 1:100 #Now the coefficient estimates should be the same, so x and y need equal sums of squares. I set x to be the numbers 1 to 100 in order
yc <- 100:1 #y is the same numbers in descending order (100 to 1), so sum(xc^2) equals sum(yc^2)
fit.yc <- lm(yc ~ xc + 0) #regress yc onto xc without an intercept
fit.xc <- lm(xc ~ yc + 0) #regress xc onto yc without an intercept
summary(fit.yc) #Output the results and compare. Now the coefficient estimate for both is 0.5075
##
## Call:
## lm(formula = yc ~ xc + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## xc 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
summary(fit.xc)
##
## Call:
## lm(formula = xc ~ yc + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## yc 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
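Reversing the order is just one way to get equal sums of squares. Any reordering of x has the same sum of squares, so a random permutation should work too. A sketch under that assumption (the seed and the variable names xp and yp are made up for illustration):

```r
set.seed(4)          # arbitrary seed, only for reproducibility
xp <- rnorm(100)
yp <- sample(xp)     # same values as xp in a random order

# Equal sums of squares, up to floating-point rounding
all.equal(sum(xp^2), sum(yp^2))

# The two no-intercept coefficient estimates therefore agree
b_y_on_x <- unname(coef(lm(yp ~ xp + 0)))
b_x_on_y <- unname(coef(lm(xp ~ yp + 0)))
c(y_on_x = b_y_on_x, x_on_y = b_x_on_y)
```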