Chapter 03 (page 120): 2, 9, 10, 12

2. Carefully explain the differences between the KNN classifier and KNN regression methods.

The KNN classifier assigns a test observation to the most common class among its K nearest neighbors. Formally, given a positive integer K and a test observation x0, the KNN classifier first identifies the K points in the training data that are closest to x0 (denoted N0). It then estimates the conditional probability for class j as the fraction of points in N0 whose response values equal j, and classifies x0 to the class with the highest estimated probability.

Conversely, KNN regression produces a quantitative estimate by averaging the responses of the K nearest neighbors. It is used for regression problems (those with a quantitative response): it again identifies the neighborhood N0 of x0, then estimates f(x0) as the average of the training responses in that neighborhood.
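To make the contrast concrete, here is a minimal sketch in R (an illustration under the assumption that the FNN package, a common KNN implementation not used elsewhere in this assignment, is installed):

library(FNN)  # assumed available; provides knn() for classification and knn.reg() for regression
set.seed(1)
train.X <- matrix(rnorm(100), ncol = 2)                # 50 training points in two dimensions
test.X  <- matrix(rnorm(10), ncol = 2)                 # 5 test points
cl <- factor(sample(c("A", "B"), 50, replace = TRUE))  # qualitative response for classification
y  <- rnorm(50)                                        # quantitative response for regression
knn(train.X, test.X, cl, k = 3)           # classifier: majority vote among the K = 3 neighbors
knn.reg(train.X, test.X, y, k = 3)$pred   # regression: average response of the K = 3 neighbors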

9. This question involves the use of multiple linear regression on the “Auto” data set.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
data(Auto)
9 A. Produce a scatterplot matrix which includes all of the variables in the data set.
pairs(Auto)

9 B. Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the “name” variable, which is qualitative.
names(Auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"       "name"
cor(Auto[1:8])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
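One convenient way to read this matrix is to sort each predictor's correlation with mpg; weight (-0.83) and displacement (-0.81) show the strongest associations, both negative:

sort(cor(Auto[1:8])[, "mpg"])  # correlations with mpg, most negative first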
9 C. Use the lm() function to perform a multiple linear regression with “mpg” as the response and all other variables except “name” as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
9 C (i). Is there a relationship between the predictors and the response?
fit2 <- lm(mpg ~ . - name, data = Auto)
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

The F-statistic is 252.4 with a p-value below 2.2e-16, so we reject the null hypothesis that all coefficients are zero and conclude that there is a relationship between at least one of the predictors and mpg.

9 C (ii). Which predictors appear to have a statistically significant relationship to the response?

Checking the p-values associated with each predictor’s t-statistic, all predictors are statistically significant at the 0.05 level except “cylinders”, “horsepower”, and “acceleration”, whose p-values exceed it.

9 C (iii). What does the coefficient for the “year” variable suggest?

The coefficient for the year variable (0.7508) suggests that, with all other predictors held constant, each additional model year is associated with an average increase of about 0.75 in mpg.
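As a quick check on this interpretation (a sketch; duplicating a row and bumping its year is purely illustrative), the fitted difference between two otherwise-identical cars built one year apart reproduces the coefficient:

newdata <- Auto[c(1, 1), ]              # duplicate the first observation
newdata$year[2] <- newdata$year[1] + 1  # same car, one model year later
diff(predict(fit2, newdata))            # equals coef(fit2)["year"], about 0.7508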

9 D. Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow = c(2,2))
plot(fit2)

In the Normal Q-Q plot, most points fall along the line, but they deviate upward in the right tail, suggesting the residuals are right-skewed; this could violate our normality assumption. Additionally, the plot of residuals versus fitted values indicates some mild non-linearity in the data. The plot of standardized residuals versus leverage flags a few outliers (specifically 327 and 294) and one high-leverage point (point 14).
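These visual impressions can be cross-checked numerically; a sketch using studentized residuals and hat values (the |r| > 3 cutoff is a conventional rule of thumb, not from the text):

rstudent(fit2)[abs(rstudent(fit2)) > 3]            # candidate outliers
head(sort(hatvalues(fit2), decreasing = TRUE), 3)  # highest-leverage observations (point 14 per the plot)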

9 E. Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
fit3 <- lm(formula = mpg ~ . * ., data = Auto[, -9])
summary(fit3)
## 
## Call:
## lm(formula = mpg ~ . * ., data = Auto[, -9])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16

From the p-values, the displacement:year, acceleration:year, and acceleration:origin interactions are statistically significant, as their p-values fall below our significance level of 0.05.
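Rather than scanning the table by eye, the significant rows can be filtered programmatically (column 4 of the coefficient matrix returned by summary() holds the Pr(>|t|) values):

coefs <- summary(fit3)$coefficients
coefs[coefs[, 4] < 0.05, ]  # all terms, main effects and interactions, significant at 0.05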

9 F. Try a few different transformations of the variables, such as log(X), sqrt(X), X^2. Comment on your findings.
fit4 = lm(mpg ~ . - name + log(weight) + sqrt(horsepower) + I(displacement^2) + I(cylinders^2), data = Auto)
summary(fit4)
## 
## Call:
## lm(formula = mpg ~ . - name + log(weight) + sqrt(horsepower) + 
##     I(displacement^2) + I(cylinders^2), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.2216 -1.4972 -0.1142  1.4184 11.9541 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.228e+02  4.381e+01   2.803  0.00532 ** 
## cylinders          3.360e-01  1.451e+00   0.232  0.81697    
## displacement      -3.765e-02  2.175e-02  -1.731  0.08421 .  
## horsepower         2.197e-01  6.684e-02   3.287  0.00111 ** 
## weight             1.181e-03  2.074e-03   0.569  0.56949    
## acceleration      -2.036e-01  1.004e-01  -2.028  0.04323 *  
## year               7.654e-01  4.526e-02  16.911  < 2e-16 ***
## origin             5.497e-01  2.679e-01   2.052  0.04088 *  
## log(weight)       -1.493e+01  6.714e+00  -2.223  0.02678 *  
## sqrt(horsepower)  -5.998e+00  1.493e+00  -4.018 7.06e-05 ***
## I(displacement^2)  6.788e-05  3.773e-05   1.799  0.07279 .  
## I(cylinders^2)    -1.067e-02  1.164e-01  -0.092  0.92702    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.904 on 380 degrees of freedom
## Multiple R-squared:  0.8654, Adjusted R-squared:  0.8615 
## F-statistic: 222.2 on 11 and 380 DF,  p-value: < 2.2e-16

By observing the above output, we see that horsepower, acceleration, year, origin, log(weight), and sqrt(horsepower) all have p-values below our significance level of 0.05, implying a statistically significant relationship with the mpg response. Notably, weight and displacement on their original scales are no longer significant once the transformed terms are included.
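To quantify how much the transformations helped, we can compare the adjusted R-squared of the plain additive model from (c) with this one; per the summaries above it rises from 0.8182 to 0.8615:

c(additive = summary(fit2)$adj.r.squared,     # 0.8182
  transformed = summary(fit4)$adj.r.squared)  # 0.8615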

10. This question should be answered using the Carseats data set.
data(Carseats)
10 A. Fit a multiple regression model to predict Sales using Price, Urban, and US.
fit5 = lm(Sales ~ Price + Urban + US, data = Carseats)
summary(fit5)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
10 B. Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative.
?Carseats
  • Price: Price company charges for car seats at each site.
  • Urban: A factor with levels No and Yes to indicate whether the store is in an urban or rural location.
  • US: A factor with levels No and Yes to indicate whether the store is in the US or not.

Our output above shows that Price and US are statistically significant, with p-values below our significance level. The Price coefficient of -0.054459 implies that, holding the other predictors fixed, Sales fall by about 0.0545 thousand units for each one-unit increase in Price. The USYes coefficient of 1.200573 implies that a US store sells about 1.2 thousand more units than an otherwise-comparable store outside the US.

The Urban variable returned a p-value (0.936) well above our significance level, so we cannot conclude that urban versus rural location has any relationship with our response variable (Sales).

10 C. Write out the model in equation form, being careful to handle the qualitative variables properly.

\[Sales = 13.043469 - 0.054459 \cdot Price - 0.021916 \cdot Urban + 1.200573 \cdot US + \varepsilon\]

where Urban = 1 if the store is in an urban location and 0 otherwise, and US = 1 if the store is in the US and 0 otherwise.
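R’s dummy coding can be confirmed directly: contrasts() shows that for each factor Yes is coded 1 with No as the 0 baseline, which is why the fitted coefficients are labeled UrbanYes and USYes.

contrasts(Carseats$Urban)  # No = 0 (baseline), Yes = 1
contrasts(Carseats$US)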

10 D. For which of the predictors can you reject the null hypothesis H0 : βj = 0?

Price and USYes have p-values below our significance level of 0.05, so for those predictors we can reject the null hypothesis and conclude there is a significant relationship with Sales; we cannot reject it for Urban.

10 E. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
fit6 = lm(Sales ~ Price + US, data = Carseats)
summary(fit6)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
10 F. How well do the models in (a) and (e) fit the data?

Both models have a Multiple R-squared of 0.2393, so each explains about 24% of the variation in Sales. The adjusted R-squared is slightly higher for the model from (e) (0.2354 versus 0.2335) because it drops the uninformative Urban term, making (e) the marginally better fit.
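Because the model in (e) is nested inside the model from (a), an F-test via anova() gives a formal comparison; given Urban’s p-value of 0.936, we would expect dropping it to cost essentially nothing:

anova(fit6, fit5)  # tests whether adding Urban back improves the smaller model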

10 G. Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).
confint(fit6)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
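Equivalently, each interval is the estimate plus or minus a t-quantile times its standard error; a manual sketch for the Price coefficient:

est <- coef(summary(fit6))["Price", "Estimate"]
se  <- coef(summary(fit6))["Price", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = fit6$df.residual) * se  # matches confint(fit6)["Price", ]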
10 H. Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow = c(2, 2))
plot(fit6)

Based on the output above, the residuals and standardized-residuals plots suggest a few possible outliers, and the residuals-versus-leverage plot (with its Cook’s distance contours) flags some observations with unusually high leverage.
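A numeric cross-check of the plots (the |r| > 3 outlier rule and the twice-average-leverage cutoff 2(p + 1)/n are conventional thumb rules, not from the text):

range(rstudent(fit6))                          # studentized residuals beyond +/-3 would suggest outliers
sum(hatvalues(fit6) > 2 * 3 / nrow(Carseats))  # observations above twice the average leverage (p + 1)/n = 3/400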

12 A. This problem involves simple linear regression without an intercept. Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

From (3.38), the coefficient for the regression of Y onto X is \[\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},\] while for the regression of X onto Y the numerator is the same and the denominator becomes the sum of squared y values. The two estimates are therefore equal exactly when \[\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2\]

12 B. Generate an example in R with n=100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(1)
x <- 1:100
sum(x^2)
## [1] 338350
y <- 2 * x + rnorm(100, sd = 0.1)
sum(y^2)
## [1] 1353606
fit.Y <- lm(y ~ x + 0)
fit.X <- lm(x ~ y + 0)
summary(fit.Y)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.223590 -0.062560  0.004426  0.058507  0.230926 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## x 2.0001514  0.0001548   12920   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09005 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.669e+08 on 1 and 99 DF,  p-value: < 2.2e-16
summary(fit.X)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.115418 -0.029231 -0.002186  0.031322  0.111795 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y 5.00e-01   3.87e-05   12920   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04502 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.669e+08 on 1 and 99 DF,  p-value: < 2.2e-16
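As a sanity check tying back to part (a): for regression through the origin, the product of the two slopes equals (sum(x * y))^2 / (sum(x^2) * sum(y^2)), which is exactly the no-intercept R-squared. Here the fit is nearly exact, so the product is essentially 1 (2.0001514 times 0.5 is about 1.00008):

coef(fit.Y) * coef(fit.X)  # ~1, the shared R-squared of the two no-intercept fits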
12 C. Generate an example in R with n=100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
x <- 1:100
sum(x^2)
## [1] 338350
y <- 100:1
sum(y^2)
## [1] 338350
fit.Y <- lm(y ~ x + 0)
fit.X <- lm(x ~ y + 0)
summary(fit.Y)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08
summary(fit.X)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08