Chapter 03 (page 120): 2, 9, 10, 12

#2

Carefully explain the differences between the KNN classifier and KNN regression methods.
The KNN classifier predicts a qualitative response: it assigns an observation to the class that is most common among its K nearest neighbors (a majority vote). KNN regression predicts a quantitative response: it estimates \(f(x_0)\) as the average of the responses of the K nearest training observations.
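
A minimal sketch contrasting the two on made-up data (this assumes the class and FNN packages, which are not used elsewhere in this assignment):

library(class)   # provides knn() for classification
library(FNN)     # provides knn.reg() for regression
set.seed(1)
train_x = matrix(rnorm(100), ncol=2)                  # 50 training points
train_cls = factor(ifelse(train_x[,1] > 0, "A", "B")) # a qualitative response
train_y = train_x[,1] + rnorm(50)                     # a quantitative response
test_x = matrix(rnorm(10), ncol=2)                    # 5 test points
knn(train_x, test_x, cl=train_cls, k=5)               # class by majority vote of 5 neighbors
knn.reg(train_x, test_x, y=train_y, k=5)$pred         # average response of 5 neighbors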

#9

This question involves the use of multiple linear regression on the Auto data set.

library(ISLR)
attach(Auto)

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

pairs(Auto)  # scatterplot matrix of all variables

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

auto_noname = Auto[, names(Auto) != "name"]  # drop the qualitative name column
cor(auto_noname)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

fit_auto_mpg=lm(mpg~., data=auto_noname)
summary(fit_auto_mpg)
## 
## Call:
## lm(formula = mpg ~ ., data = auto_noname)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response?
Based on the p-value of the F-statistic (< 2.2e-16), we can conclude that there is a relationship between the predictors and response in the above regression model.

ii. Which predictors appear to have a statistically significant relationship to the response?
Based on the p-values of the t-statistics in the above regression, the displacement, weight, year, and origin of a car all have statistically significant relationships with its mpg.

iii. What does the coefficient for the year variable suggest?
The coefficient on year (0.750773) suggests that each additional model year is associated with an increase of about 0.75 mpg, holding the other predictors fixed.
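
As an illustration (a sketch: the first car in the data duplicated with only year changed), the fitted model's prediction rises by about 0.75 mpg per additional model year:

nd = auto_noname[c(1,1), ]               # two copies of the same car
nd$year = c(76, 77)                      # differing only in model year
diff(predict(fit_auto_mpg, newdata=nd))  # ~0.75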

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow=c(2,2))  # arrange the four diagnostic plots in one 2x2 grid
plot(fit_auto_mpg)

Looking at the residuals-vs-fitted plot, a few observations stand out as potential outliers; the plot flags points 321, 324, and 325 in particular, though there may be others. The leverage plot identifies points 14, 325, and 389 as having particularly high leverage. The Q-Q plot suggests the residuals are approximately normal, aside from the outlying/high-leverage points noted above.
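
A numeric check (a sketch) using studentized residuals and leverage statistics:

which(abs(rstudent(fit_auto_mpg)) > 3)           # candidate outliers by the usual |3| cutoff
lev = hatvalues(fit_auto_mpg)
head(sort(lev, decreasing=TRUE), 3)              # highest-leverage observations
length(coef(fit_auto_mpg)) / nrow(auto_noname)   # average leverage: (p+1)/n = 8/392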

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
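
Note on syntax: in an R formula, a*b expands to a + b + a:b (both main effects plus the interaction), while a:b adds only the interaction term itself. The two calls below fit the same model:

lm(mpg~weight*horsepower, data=auto_noname)
lm(mpg~weight+horsepower+weight:horsepower, data=auto_noname)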

fit_auto_mpg_int1=lm(mpg~year*origin, data=auto_noname)
summary(fit_auto_mpg_int1)
## 
## Call:
## lm(formula = mpg ~ year * origin, data = auto_noname)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3141  -3.7120  -0.6513   3.3621  15.5859 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -83.3809    12.0000  -6.948 1.57e-11 ***
## year          1.3089     0.1576   8.305 1.68e-15 ***
## origin       17.3752     6.8325   2.543   0.0114 *  
## year:origin  -0.1663     0.0889  -1.871   0.0621 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.199 on 388 degrees of freedom
## Multiple R-squared:  0.5596, Adjusted R-squared:  0.5562 
## F-statistic: 164.4 on 3 and 388 DF,  p-value: < 2.2e-16
fit_auto_mpg_int2=lm(mpg~weight*horsepower, data=auto_noname)
summary(fit_auto_mpg_int2)
## 
## Call:
## lm(formula = mpg ~ weight * horsepower, data = auto_noname)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.7725  -2.2074  -0.2708   1.9973  14.7314 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.356e+01  2.343e+00  27.127  < 2e-16 ***
## weight            -1.077e-02  7.738e-04 -13.921  < 2e-16 ***
## horsepower        -2.508e-01  2.728e-02  -9.195  < 2e-16 ***
## weight:horsepower  5.355e-05  6.649e-06   8.054 9.93e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.93 on 388 degrees of freedom
## Multiple R-squared:  0.7484, Adjusted R-squared:  0.7465 
## F-statistic: 384.8 on 3 and 388 DF,  p-value: < 2.2e-16
fit_auto_mpg_int3=lm(mpg~cylinders*weight, data=auto_noname)
summary(fit_auto_mpg_int3)
## 
## Call:
## lm(formula = mpg ~ cylinders * weight, data = auto_noname)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4916  -2.6225  -0.3927   1.7794  16.7087 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      65.3864559  3.7333137  17.514  < 2e-16 ***
## cylinders        -4.2097950  0.7238315  -5.816 1.26e-08 ***
## weight           -0.0128348  0.0013628  -9.418  < 2e-16 ***
## cylinders:weight  0.0010979  0.0002101   5.226 2.83e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.165 on 388 degrees of freedom
## Multiple R-squared:  0.7174, Adjusted R-squared:  0.7152 
## F-statistic: 328.3 on 3 and 388 DF,  p-value: < 2.2e-16
fit_auto_mpg_int4=lm(mpg~cylinders*displacement, data=auto_noname)
summary(fit_auto_mpg_int4)
## 
## Call:
## lm(formula = mpg ~ cylinders * displacement, data = auto_noname)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.0432  -2.4308  -0.2263   2.2048  20.9051 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            48.22040    2.34712  20.545  < 2e-16 ***
## cylinders              -2.41838    0.53456  -4.524 8.08e-06 ***
## displacement           -0.13436    0.01615  -8.321 1.50e-15 ***
## cylinders:displacement  0.01182    0.00207   5.711 2.24e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.454 on 388 degrees of freedom
## Multiple R-squared:  0.6769, Adjusted R-squared:  0.6744 
## F-statistic:   271 on 3 and 388 DF,  p-value: < 2.2e-16

Based on the above models, the interactions between weight and horsepower, cylinders and weight, and cylinders and displacement are all statistically significant, while the year:origin interaction falls just short of significance at the 5% level (p = 0.0621).

(f) Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.

fit_auto_mpg_tran0=lm(mpg~weight+year+origin)
summary(fit_auto_mpg_tran0)
## 
## Call:
## lm(formula = mpg ~ weight + year + origin)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9440 -2.0948 -0.0389  1.7255 13.2722 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.805e+01  4.001e+00  -4.510 8.60e-06 ***
## weight      -5.994e-03  2.541e-04 -23.588  < 2e-16 ***
## year         7.571e-01  4.832e-02  15.668  < 2e-16 ***
## origin       1.150e+00  2.591e-01   4.439 1.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.348 on 388 degrees of freedom
## Multiple R-squared:  0.8175, Adjusted R-squared:  0.816 
## F-statistic: 579.2 on 3 and 388 DF,  p-value: < 2.2e-16
fit_auto_mpg_tran1=lm(mpg~I(weight^2)+year+origin)
summary(fit_auto_mpg_tran1)
## 
## Call:
## lm(formula = mpg ~ I(weight^2) + year + origin)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.8810 -2.2688 -0.0881  1.9049 13.3968 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.824e+01  4.189e+00  -6.743 5.66e-11 ***
## I(weight^2) -8.503e-07  4.191e-08 -20.288  < 2e-16 ***
## year         7.531e-01  5.282e-02  14.259  < 2e-16 ***
## origin       1.661e+00  2.739e-01   6.064 3.15e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.638 on 388 degrees of freedom
## Multiple R-squared:  0.7844, Adjusted R-squared:  0.7827 
## F-statistic: 470.5 on 3 and 388 DF,  p-value: < 2.2e-16
fit_auto_mpg_tran2=lm(mpg~log(weight)+year+origin)
summary(fit_auto_mpg_tran2)
## 
## Call:
## lm(formula = mpg ~ log(weight) + year + origin)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9120 -1.9384 -0.0257  1.5961 13.1033 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 115.76550    7.53529  15.363  < 2e-16 ***
## log(weight) -19.19080    0.72701 -26.397  < 2e-16 ***
## year          0.77969    0.04477  17.417  < 2e-16 ***
## origin        0.75026    0.24722   3.035  0.00257 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.123 on 388 degrees of freedom
## Multiple R-squared:  0.8411, Adjusted R-squared:  0.8398 
## F-statistic: 684.5 on 3 and 388 DF,  p-value: < 2.2e-16
fit_auto_mpg_tran3=lm(mpg~I(weight^.5)+year+origin)
summary(fit_auto_mpg_tran3)
## 
## Call:
## lm(formula = mpg ~ I(weight^0.5) + year + origin)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9430 -2.0520 -0.0312  1.7145 13.1901 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.90095    4.23678   0.213 0.831713    
## I(weight^0.5) -0.68713    0.02734 -25.131  < 2e-16 ***
## year           0.76605    0.04635  16.528  < 2e-16 ***
## origin         0.92951    0.25244   3.682 0.000264 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.222 on 388 degrees of freedom
## Multiple R-squared:  0.8309, Adjusted R-squared:  0.8296 
## F-statistic: 635.6 on 3 and 388 DF,  p-value: < 2.2e-16

After removing the predictors that were not significant in the model with all variables included, mpg was regressed on weight, displacement, year, and origin. Displacement was no longer significant in that fit, so it was removed as well. This left the model lm(mpg~weight+year+origin), which has an R-squared of 0.8175.
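
An F-test comparing this reduced model to the full model from part (c) (a sketch; both fits are defined above) confirms whether the dropped predictors add explanatory power:

anova(fit_auto_mpg_tran0, fit_auto_mpg)   # reduced vs. full model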

Models were then fit with weight transformed as weight^2, log(weight), and weight^.5. The log and square-root transformations both improved on the untransformed model, with the log model lm(mpg~log(weight)+year+origin) fitting best (R-squared = 0.8411); the weight^2 transformation actually fit worse (R-squared = 0.7844).
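
The candidate models can also be compared side by side on adjusted R-squared (a sketch reusing the fits above):

sapply(list(untransformed=fit_auto_mpg_tran0, squared=fit_auto_mpg_tran1,
            log=fit_auto_mpg_tran2, sqrt=fit_auto_mpg_tran3),
       function(m) summary(m)$adj.r.squared)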


#10

This question should be answered using the Carseats data set.

library(ISLR)
attach(Carseats)

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

fit=lm(Sales~Price+Urban+US)
summary(fit)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!
From the table in part (a), Price and US are significant predictors of Sales. Sales is measured in thousands of units, so each $1 increase in price is associated with a decrease of about 54.5 car seats sold (0.0545 thousand units), holding the other predictors fixed. Stores in the US sell about 1,201 more car seats (1.2 thousand units) than stores outside the US at the same price. The UrbanYes coefficient is not statistically significant (p = 0.936), so there is no evidence that urban location affects Sales.

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
\(Sales = 13.043469 - 0.054459 \cdot Price - 0.021916 \cdot Urban_{Yes} + 1.200573 \cdot US_{Yes}\), where \(Urban_{Yes} = 1\) if the store is in an urban area (0 otherwise) and \(US_{Yes} = 1\) if the store is in the US (0 otherwise).
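
The 0/1 dummy coding R uses for the qualitative variables can be inspected directly from the fitted model (a quick sketch):

head(model.matrix(fit))   # UrbanYes and USYes appear as 0/1 indicator columns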

(d) For which of the predictors can you reject the null hypothesis \(H_{0}:\beta_j=0?\)
Price and US.

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

fit2=lm(Sales~Price+US)
summary(fit2)
## 
## Call:
## lm(formula = Sales ~ Price + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data?
Not especially well: each model explains only about 24% of the variance in Sales (R-squared = 0.2393 for both).

(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint(fit2)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow=c(2,2))  # arrange the four diagnostic plots in one 2x2 grid
plot(fit2)

The diagnostic plots show some evidence of outliers and high-leverage points. The residual plots flag observations 69, 377, and 51 as possible outliers, and the leverage plot flags observations 26, 50, and 368 as likely high-leverage points.
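
As in question 9, a numeric check (a sketch) compares each observation's leverage to the average leverage (p+1)/n:

which(abs(rstudent(fit2)) > 3)         # candidate outliers by the usual |3| cutoff
lev2 = hatvalues(fit2)
head(sort(lev2, decreasing=TRUE), 3)   # highest-leverage observations
length(coef(fit2)) / nrow(Carseats)    # average leverage: (p+1)/n = 3/400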

#12

This problem involves simple linear regression without an intercept.
(a) Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

From (3.38), regressing Y onto X without an intercept gives \(\hat{\beta} = \sum_i x_i y_i / \sum_i x_i^2\), while regressing X onto Y gives \(\hat{\beta}' = \sum_i x_i y_i / \sum_i y_i^2\). The two estimates are the same exactly when \(\sum_i x_i^2 = \sum_i y_i^2\), i.e., when the sum of the squares of the x observations equals the sum of the squares of the y observations.

(b) Generate an example in R with n=100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(1)
X = rnorm(100,10,1)
Y = rnorm(100,500,100)
lmY = lm(Y ~ X)
lmX = lm(X ~ Y)
lmY
## 
## Call:
## lm(formula = Y ~ X)
## 
## Coefficients:
## (Intercept)            X  
##     497.291       -0.106
lmX
## 
## Call:
## lm(formula = X ~ Y)
## 
## Coefficients:
## (Intercept)            Y  
##   1.011e+01   -9.324e-06
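
Strictly, this exercise concerns regression without an intercept. Repeating the comparison with the intercept suppressed (a sketch reusing X and Y from above) still yields different slope estimates, since sum(X^2) != sum(Y^2) here:

coef(lm(Y~X+0))   # slope of Y onto X with no intercept
coef(lm(X~Y+0))   # slope of X onto Y with no intercept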

(c) Generate an example in R with n=100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

set.seed(1)
X = rnorm(100)
Y = X
lmY = lm(Y ~ X)
lmX = lm(X ~ Y)
lmY
## 
## Call:
## lm(formula = Y ~ X)
## 
## Coefficients:
## (Intercept)            X  
##   -2.22e-17     1.00e+00
lmX
## 
## Call:
## lm(formula = X ~ Y)
## 
## Coefficients:
## (Intercept)            Y  
##   -2.22e-17     1.00e+00
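
A less trivial construction for (c) (a sketch): any reshuffling of X has the same sum of squares as X, so the two no-intercept slope estimates coincide even though Y is not identical to X:

set.seed(1)
X = rnorm(100)
Y = sample(X)        # same values in a different order, so sum(X^2) == sum(Y^2)
coef(lm(Y~X+0))      # this slope...
coef(lm(X~Y+0))      # ...equals this one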