Chapter 3 Exercises

Question 2 - Carefully explain the differences between the KNN classifier and KNN regression methods

The KNN classifier handles a qualitative response: for a test point x0 it finds the K training observations nearest to x0 and assigns the class that is most common among them (a majority vote). KNN regression handles a quantitative response: it estimates f(x0) as the average of the responses of the K nearest training observations.
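As a concrete illustration, here is a toy sketch contrasting the two (assuming the class and FNN packages are installed; all variable names here are made up):

set.seed(1)
train.x = matrix(rnorm(100), ncol = 2)                 # 50 training points, 2 predictors
class.y = factor(ifelse(train.x[, 1] > 0, "A", "B"))   # qualitative response
num.y   = train.x[, 1] + rnorm(50)                     # quantitative response
test.x  = matrix(rnorm(10), ncol = 2)                  # 5 test points
class::knn(train.x, test.x, cl = class.y, k = 5)       # majority vote among the 5 nearest neighbors
FNN::knn.reg(train.x, test.x, y = num.y, k = 5)$pred   # average response of the 5 nearest neighbors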

Question 9 - This question involves the use of multiple linear regression on the Auto data set

Question 9.a. - Produce a scatterplot matrix which includes all of the variables in the data set.

library(ISLR)  # provides the Auto (and later Carseats) data sets
pairs(Auto)    # scatterplot matrix of all variables

Question 9.b. - Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

cor(Auto[-9])  # column 9 is name, the qualitative variable
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

Question 9.c. - Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results

lm.fit.auto = lm(mpg ~ . - name, data = Auto)  # named to avoid masking the lm() function
summary(lm.fit.auto)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Question 9.c.i. - Is there a relationship between the predictors and the response?

Yes. The F-statistic of 252.4 has a p-value below 2.2e-16, so we can reject the null hypothesis that all of the regression coefficients are zero; there is clearly a relationship. The R^2 value indicates that about 82.15% of the variance in mpg is explained by the predictors.
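The overall F-test p-value can also be read off the summary object directly (a small check using the fitted model above):

f = summary(lm.fit.auto)$fstatistic
pf(f[1], f[2], f[3], lower.tail = FALSE)  # overall F-test p-value, effectively zero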

Question 9.c.ii. - Which predictors appear to have a statistically significant relationship to the response?

Judging by the p-values (at the 0.05 level), displacement, weight, year, and origin have statistically significant relationships with mpg; cylinders, horsepower, and acceleration do not.
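The significant rows can be pulled out of the coefficient table programmatically (a small sketch; the intercept row shows up as well):

coefs = summary(lm.fit.auto)$coefficients
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]  # rows significant at the 0.05 level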

Question 9.c.iii. - What does the coefficient for the year variable suggest?

Holding the other predictors fixed, each additional model year is associated with an increase of about 0.75 mpg; in other words, fuel efficiency improved by roughly 0.75 mpg per year over this period.
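Because the fit is linear, bumping year by one while holding everything else fixed shifts the prediction by exactly the year coefficient. A sketch using a hypothetical "typical" car built from the column medians (output not shown):

typical = data.frame(lapply(Auto[, 1:8], median))             # medians of the numeric columns
bumped = typical
bumped$year = typical$year + 1
predict(lm.fit.auto, bumped) - predict(lm.fit.auto, typical)  # equals the year coefficient, ~0.7508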

Question 9.d. - Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow = c(2, 2))
plot(lm.fit.auto)

The residuals vs. fitted plot shows some curvature, suggesting mild non-linearity in the fit. The Normal Q-Q plot bends upward in the right tail, so the residuals are right-skewed and there are a few unusually large positive residuals (outliers). In the residuals vs. leverage plot, observation 14 stands out as the only point with unusually high leverage.
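The leverage claim can be checked numerically; hatvalues() returns the leverage statistic for each observation:

which.max(hatvalues(lm.fit.auto))  # observation 14 has the largest leverage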

Question 9.e. - Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

lm.interaction.attempt.1 = lm(mpg ~ . - name + cylinders:weight, data = Auto)
summary(lm.interaction.attempt.1)
## 
## Call:
## lm(formula = mpg ~ . - name + cylinders:weight, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9484  -1.7133  -0.1809   1.4530  12.4137 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7.3143478  5.0076737   1.461  0.14494    
## cylinders        -5.0347425  0.5795767  -8.687  < 2e-16 ***
## displacement      0.0156444  0.0068409   2.287  0.02275 *  
## horsepower       -0.0314213  0.0126216  -2.489  0.01322 *  
## weight           -0.0150329  0.0011125 -13.513  < 2e-16 ***
## acceleration      0.1006438  0.0897944   1.121  0.26306    
## year              0.7813453  0.0464139  16.834  < 2e-16 ***
## origin            0.8030154  0.2617333   3.068  0.00231 ** 
## cylinders:weight  0.0015058  0.0001657   9.088  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.022 on 383 degrees of freedom
## Multiple R-squared:  0.8531, Adjusted R-squared:  0.8501 
## F-statistic: 278.1 on 8 and 383 DF,  p-value: < 2.2e-16
lm.interaction.attempt.2 = lm(mpg ~ . - name + displacement:weight, data = Auto)
summary(lm.interaction.attempt.2)
## 
## Call:
## lm(formula = mpg ~ . - name + displacement:weight, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9027 -1.8092 -0.0946  1.5549 12.1687 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -5.389e+00  4.301e+00  -1.253   0.2109    
## cylinders            1.175e-01  2.943e-01   0.399   0.6899    
## displacement        -6.837e-02  1.104e-02  -6.193 1.52e-09 ***
## horsepower          -3.280e-02  1.238e-02  -2.649   0.0084 ** 
## weight              -1.064e-02  7.136e-04 -14.915  < 2e-16 ***
## acceleration         6.724e-02  8.805e-02   0.764   0.4455    
## year                 7.852e-01  4.553e-02  17.246  < 2e-16 ***
## origin               5.610e-01  2.622e-01   2.139   0.0331 *  
## displacement:weight  2.269e-05  2.257e-06  10.054  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.964 on 383 degrees of freedom
## Multiple R-squared:  0.8588, Adjusted R-squared:  0.8558 
## F-statistic: 291.1 on 8 and 383 DF,  p-value: < 2.2e-16
lm.interaction.attempt.3 = lm(mpg ~ . - name + horsepower:weight, data = Auto)
summary(lm.interaction.attempt.3)
## 
## Call:
## lm(formula = mpg ~ . - name + horsepower:weight, data = Auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.589 -1.617 -0.184  1.541 12.001 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.876e+00  4.511e+00   0.638 0.524147    
## cylinders         -2.955e-02  2.881e-01  -0.103 0.918363    
## displacement       5.950e-03  6.750e-03   0.881 0.378610    
## horsepower        -2.313e-01  2.363e-02  -9.791  < 2e-16 ***
## weight            -1.121e-02  7.285e-04 -15.393  < 2e-16 ***
## acceleration      -9.019e-02  8.855e-02  -1.019 0.309081    
## year               7.695e-01  4.494e-02  17.124  < 2e-16 ***
## origin             8.344e-01  2.513e-01   3.320 0.000986 ***
## horsepower:weight  5.529e-05  5.227e-06  10.577  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.931 on 383 degrees of freedom
## Multiple R-squared:  0.8618, Adjusted R-squared:  0.859 
## F-statistic: 298.6 on 8 and 383 DF,  p-value: < 2.2e-16
lm.interaction.attempt.4 = lm(mpg ~ . - name + acceleration:weight, data = Auto)
summary(lm.interaction.attempt.4)
## 
## Call:
## lm(formula = mpg ~ . - name + acceleration:weight, data = Auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.247 -2.048 -0.045  1.619 12.193 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -4.364e+01  5.811e+00  -7.511 4.18e-13 ***
## cylinders           -2.141e-01  3.078e-01  -0.696 0.487117    
## displacement         3.138e-03  7.495e-03   0.419 0.675622    
## horsepower          -4.141e-02  1.348e-02  -3.071 0.002287 ** 
## weight               4.027e-03  1.636e-03   2.462 0.014268 *  
## acceleration         1.629e+00  2.422e-01   6.726 6.36e-11 ***
## year                 7.821e-01  4.833e-02  16.184  < 2e-16 ***
## origin               1.033e+00  2.686e-01   3.846 0.000141 ***
## weight:acceleration -5.826e-04  8.408e-05  -6.928 1.81e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.141 on 383 degrees of freedom
## Multiple R-squared:  0.8414, Adjusted R-squared:  0.838 
## F-statistic: 253.9 on 8 and 383 DF,  p-value: < 2.2e-16

All four interaction terms are statistically significant (p < 2e-16 in each case, except weight:acceleration at 1.81e-11). Of the four models, lm.interaction.attempt.3 (the horsepower:weight interaction) fits best, with the highest adjusted R^2 (0.859) and the lowest residual standard error (2.931).
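A nested-model F-test against the base fit from 9.c makes this comparison explicit (a quick check; output not shown):

anova(lm.fit.auto, lm.interaction.attempt.3)  # tests whether horsepower:weight improves the base model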

Question 9.f. - Try a few different transformations of the variables, such as log(X), √X, X^2.

lm.log.interaction.attempt.1 = lm(mpg ~ . - name + log(cylinders) + cylinders:weight, data = Auto)
summary(lm.log.interaction.attempt.1)
## 
## Call:
## lm(formula = mpg ~ . - name + log(cylinders) + cylinders:weight, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8589 -1.7311 -0.1874  1.5819 12.3569 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -3.622e+00  6.042e+00  -0.599  0.54920    
## cylinders        -1.117e+01  2.027e+00  -5.512 6.54e-08 ***
## displacement      1.416e-02  6.778e-03   2.090  0.03731 *  
## horsepower       -2.361e-02  1.272e-02  -1.856  0.06418 .  
## weight           -1.767e-02  1.382e-03 -12.791  < 2e-16 ***
## acceleration      7.749e-02  8.906e-02   0.870  0.38483    
## year              7.934e-01  4.604e-02  17.234  < 2e-16 ***
## origin            7.995e-01  2.587e-01   3.090  0.00215 ** 
## log(cylinders)    2.626e+01  8.319e+00   3.156  0.00172 ** 
## cylinders:weight  1.954e-03  2.168e-04   9.014  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.987 on 382 degrees of freedom
## Multiple R-squared:  0.8569, Adjusted R-squared:  0.8535 
## F-statistic: 254.1 on 9 and 382 DF,  p-value: < 2.2e-16
lm.log.interaction.attempt.2 = lm(mpg ~ . - name + log(displacement) + displacement:weight, data = Auto)
summary(lm.log.interaction.attempt.2)
## 
## Call:
## lm(formula = mpg ~ . - name + log(displacement) + displacement:weight, 
##     data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.0020  -1.7953  -0.0745   1.5635  12.1431 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -3.107e+00  1.288e+01  -0.241  0.80950    
## cylinders            1.190e-01  2.948e-01   0.404  0.68657    
## displacement        -6.308e-02  3.022e-02  -2.087  0.03752 *  
## horsepower          -3.343e-02  1.285e-02  -2.602  0.00962 ** 
## weight              -1.043e-02  1.356e-03  -7.692 1.24e-13 ***
## acceleration         6.366e-02  9.019e-02   0.706  0.48072    
## year                 7.854e-01  4.560e-02  17.223  < 2e-16 ***
## origin               5.471e-01  2.727e-01   2.007  0.04550 *  
## log(displacement)   -6.546e-01  3.481e+00  -0.188  0.85093    
## displacement:weight  2.197e-05  4.491e-06   4.891 1.48e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.968 on 382 degrees of freedom
## Multiple R-squared:  0.8588, Adjusted R-squared:  0.8554 
## F-statistic: 258.1 on 9 and 382 DF,  p-value: < 2.2e-16
lm.log.interaction.attempt.3 = lm(mpg ~ . - name + log(horsepower) + horsepower:weight, data = Auto)
summary(lm.log.interaction.attempt.3)
## 
## Call:
## lm(formula = mpg ~ . - name + log(horsepower) + horsepower:weight, 
##     data = Auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.509 -1.514 -0.122  1.415 11.931 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.054e+01  1.691e+01   2.397 0.016994 *  
## cylinders         -5.646e-03  2.867e-01  -0.020 0.984298    
## displacement       3.473e-04  7.137e-03   0.049 0.961209    
## horsepower        -7.176e-02  7.296e-02  -0.983 0.326007    
## weight            -8.189e-03  1.497e-03  -5.472 8.08e-08 ***
## acceleration      -2.054e-01  1.012e-01  -2.030 0.043065 *  
## year               7.591e-01  4.491e-02  16.902  < 2e-16 ***
## origin             8.170e-01  2.500e-01   3.268 0.001182 ** 
## log(horsepower)   -1.157e+01  5.011e+00  -2.310 0.021419 *  
## horsepower:weight  3.563e-05  9.972e-06   3.573 0.000398 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.915 on 382 degrees of freedom
## Multiple R-squared:  0.8637, Adjusted R-squared:  0.8605 
## F-statistic: 269.1 on 9 and 382 DF,  p-value: < 2.2e-16
lm.log.interaction.attempt.4 = lm(mpg ~ . - name + log(acceleration) + acceleration:weight, data = Auto)
summary(lm.log.interaction.attempt.4)
## 
## Call:
## lm(formula = mpg ~ . - name + log(acceleration) + acceleration:weight, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5279 -1.8822 -0.0237  1.6775 12.3282 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -5.083e+00  1.671e+01  -0.304 0.761077    
## cylinders           -1.286e-01  3.078e-01  -0.418 0.676212    
## displacement        -1.442e-03  7.675e-03  -0.188 0.851089    
## horsepower          -4.814e-02  1.367e-02  -3.521 0.000482 ***
## weight               3.357e-03  1.648e-03   2.037 0.042302 *  
## acceleration         2.610e+00  4.657e-01   5.604 4.01e-08 ***
## year                 7.811e-01  4.801e-02  16.270  < 2e-16 ***
## origin               1.027e+00  2.669e-01   3.849 0.000139 ***
## log(acceleration)   -1.975e+01  8.031e+00  -2.460 0.014355 *  
## weight:acceleration -5.101e-04  8.858e-05  -5.759 1.74e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.121 on 382 degrees of freedom
## Multiple R-squared:  0.8438, Adjusted R-squared:  0.8402 
## F-statistic: 229.3 on 9 and 382 DF,  p-value: < 2.2e-16
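Among these, the log(horsepower) model again fits best, with an adjusted R^2 of 0.8605 and a significant log term (p = 0.021); by contrast, log(displacement) adds nothing once displacement:weight is in the model (p = 0.851). The question also suggests √X and X^2 transformations, which were not run above; a minimal sketch (hypothetical model names, output not shown):

lm.sqrt.attempt = lm(mpg ~ . - name + sqrt(horsepower), data = Auto)  # square-root transformation
lm.sq.attempt = lm(mpg ~ . - name + I(horsepower^2), data = Auto)     # squared term; I() protects the ^
summary(lm.sqrt.attempt)
summary(lm.sq.attempt)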

Question 10 - This question should be answered using the Carseats data set

Question 10.a. - Fit a multiple regression model to predict Sales using Price, Urban, and US.

lm.fit = lm(Sales ~ Price + Urban + US, data = Carseats)

Question 10.b. - Provide an interpretation of each coefficient in the model.

summary(lm.fit)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
lm.fit
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Coefficients:
## (Intercept)        Price     UrbanYes        USYes  
##    13.04347     -0.05446     -0.02192      1.20057

Price: there is a negative relationship between price and sales; holding the other predictors fixed, a $1 increase in price is associated with a drop of about 54.5 units sold (the coefficient is -0.0545, and Sales is measured in thousands of units). Urban: the p-value of 0.936 provides no evidence of a relationship between an urban location and sales. US: there is a positive relationship; a store in the US sells roughly 1,201 more units, on average, than an otherwise identical store outside the US.
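R encodes the two qualitative predictors as 0/1 dummy variables, which is why the coefficients are labelled UrbanYes and USYes; the coding can be inspected directly (output not shown):

contrasts(Carseats$Urban)  # Yes = 1, No = 0
contrasts(Carseats$US)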

Question 10.c. - Write out the model in equation form, being careful to handle the qualitative variables properly.

Sales = 13.04 - 0.05 * Price - 0.02 * Urban + 1.20 * US

where Urban = 1 if the store is in an urban location (0 otherwise) and US = 1 if the store is in the US (0 otherwise).

Question 10.d. - For which of the predictors can you reject the null hypothesis H0 : βj = 0?

We can reject H0 : βj = 0 for Price and US, whose p-values are essentially zero. We cannot reject it for Urban, whose p-value of 0.936 gives no evidence of an association with Sales.

Question 10.e. - On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome

lm.fit.2 = lm(Sales ~ Price + US, data = Carseats)  # drop Urban
summary(lm.fit.2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

Question 10.f. - How well do the models in (a) and (e) fit the data?

Both models fit the data poorly: each explains only about 24% of the variance in Sales (R^2 = 0.2393). The smaller model is marginally better, with a slightly higher adjusted R^2 (0.2354 vs. 0.2335) and a slightly lower residual standard error (2.469 vs. 2.472).
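The adjusted R^2 values can be pulled from the summary objects directly for comparison:

summary(lm.fit)$adj.r.squared    # 0.2335
summary(lm.fit.2)$adj.r.squared  # 0.2354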

Question 10.g. - Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

confint(lm.fit.2)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

Question 10.h. - Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow = c(2, 2))
plot(lm.fit.2)

Observations 51, 69, and 377 appear to be outliers in the residuals vs. fitted and scale-location plots, and the residuals vs. leverage plot shows a few observations whose leverage is well above the average of (p + 1)/n = 3/400 = 0.0075.
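Both checks can also be made numerically (a sketch, output not shown; rstudent() gives studentized residuals and hatvalues() the leverages):

sort(abs(rstudent(lm.fit.2)), decreasing = TRUE)[1:3]  # the largest studentized residuals
sort(hatvalues(lm.fit.2), decreasing = TRUE)[1:3]      # the largest leverages, vs. the average 0.0075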

Question 12 - This problem involves simple linear regression without an intercept.

Question 12.a. - Recall that the coefficient estimate βˆ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

From (3.38), regressing Y onto X gives the estimate beta_hat = sum(x_i * y_i) / sum(x_j^2), while regressing X onto Y gives beta_hat' = sum(x_i * y_i) / sum(y_j^2). The numerators are identical, so the two estimates are equal exactly when the denominators are equal, i.e. when sum(x_j^2) = sum(y_j^2).

Question 12.b. - Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(12)
x = rnorm(100)
sum(x)
## [1] -3.116866
set.seed(12)           # re-seeding makes the next rnorm(100) reproduce x exactly,
y = 2*x + rnorm(100)   # so y = 2x + x = 3x
sum(y)
## [1] -9.350598
betaX = lm(x ~ y + 0)  # regression of X onto Y, no intercept
betaY = lm(y ~ x + 0)  # regression of Y onto X, no intercept
coef(betaX)
##         y 
## 0.3333333
coef(betaY)
## x 
## 3

Question 12.c. - Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X

set.seed(12)
x.2 = rnorm(100)
y.2 = 1*x.2  # y = x, so sum(x.2^2) equals sum(y.2^2)
sum(x.2)
## [1] -3.116866
sum(y.2)
## [1] -3.116866
betaX2 = lm(x.2 ~ y.2 + 0)
betaY2 = lm(y.2 ~ x.2 + 0)
coef(betaX2)
## y.2 
##   1
coef(betaY2)
## x.2 
##   1
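A less contrived example: any y whose squared values sum to the same total as x's will do, and a random permutation of x has exactly that property (a sketch; output not shown):

set.seed(12)
x.3 = rnorm(100)
y.3 = sample(x.3)        # same values in a different order, so sum(x.3^2) == sum(y.3^2)
coef(lm(y.3 ~ x.3 + 0))  # the two no-intercept estimates agree...
coef(lm(x.3 ~ y.3 + 0))  # ...because they share both numerator and denominator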