Problem 2

Carefully explain the differences between the KNN classifier and KNN regression methods.

KNN Classifier is used for categorical (classification) tasks, while KNN Regression is used for continuous (regression) tasks.

KNN Classifier finds the K closest training points to a given test point and assigns the most frequent class among those neighbors, while KNN Regression finds the K closest training points and calculates the average of their response variables.

KNN Classifier uses majority voting: the class label that appears most often among the K neighbors is assigned to the test point. KNN Regression uses numerical averaging: the predicted value is the mean of the K nearest neighbors’ target values.

KNN Classifier’s output is a discrete category like ‘Yes’ or ‘No’, while KNN Regression’s output is a continuous numerical value.
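The contrast between voting and averaging can be sketched in a few lines of base R. This is only an illustration: `knn_predict()` is a hypothetical helper written for this example, and the toy data are made up. The neighbor search is identical in both cases; only the final aggregation step differs.

```r
# Minimal base-R sketch of KNN for a single query point.
# knn_predict() is a hypothetical helper, not part of any package.
knn_predict <- function(train_x, train_y, query, k, type = c("class", "reg")) {
  type <- match.arg(type)
  # Euclidean distance from the query to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  nn <- order(d)[seq_len(k)]            # indices of the k nearest neighbors
  if (type == "class") {
    votes <- table(train_y[nn])         # count the labels among the neighbors
    names(votes)[which.max(votes)]      # majority vote
  } else {
    mean(train_y[nn])                   # average of the neighbors' responses
  }
}

# Toy training data (made up for the example)
x <- matrix(c(1, 1,
              1, 2,
              2, 1,
              6, 6,
              6, 7,
              7, 6), ncol = 2, byrow = TRUE)
labels <- c("No", "No", "No", "Yes", "Yes", "Yes")
values <- c(1.0, 1.5, 2.0, 9.0, 9.5, 10.0)

class_pred <- knn_predict(x, labels, query = c(1.5, 1.5), k = 3, type = "class")
reg_pred   <- knn_predict(x, values, query = c(1.5, 1.5), k = 3, type = "reg")
print(class_pred)
print(reg_pred)
```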

Problem 9

library(ISLR)
attach(Auto)

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

pairs(Auto)

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

cor(Auto[, sapply(Auto, is.numeric)])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.

model9c <- lm(mpg ~ . -name, data = Auto)
summary(model9c)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response?

Yes. The F-statistic of 252.4 on 7 and 384 degrees of freedom has a p-value below 2.2e-16, so we reject the null hypothesis that all slope coefficients are zero: at least one predictor is related to mpg. This is a statement about the predictors jointly; individually, cylinders, horsepower, and acceleration are not statistically significant at the 0.05 level.
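The overall p-value can be recomputed from the F-statistic and its degrees of freedom. The sketch below copies the three numbers from the summary output; the variable names are my own. As far as I understand, the printed bound of 2.2e-16 is simply the machine epsilon that R uses as a floor when formatting p-values, so the true value is far smaller.

```r
# Values copied from the summary(model9c) output
f_stat <- 252.4
df1 <- 7      # number of predictors
df2 <- 384    # residual degrees of freedom

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
print(p_value)
print(p_value < 2.2e-16)
```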

ii. Which predictors appear to have a statistically significant relationship to the response?

displacement, weight, year, and origin are the predictors with a statistically significant relationship to the response.

iii. What does the coefficient for the year variable suggest?

Holding all other predictors constant, each one-year increase in model year is associated with an average increase of about 0.75 mpg. This likely reflects improvements in fuel efficiency over time.
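To get a sense of how precisely the year effect is estimated, a 95% confidence interval can be built by hand from the estimate and standard error in the coefficient table. This is a small sketch using only numbers from the output; the variable names are illustrative.

```r
# Estimate, standard error, and residual df for year, from summary(model9c)
est <- 0.750773
se  <- 0.050973
df  <- 384

t_crit <- qt(0.975, df)                 # two-sided 95% critical value
ci <- est + c(-1, 1) * t_crit * se      # estimate +/- margin of error
print(round(ci, 3))

# Implied change over a decade of model years, other predictors held fixed
print(round(10 * est, 2))
```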

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow = c(2, 2))
plot(model9c)

par(mfrow = c(1, 1))
  1. The Residuals vs Fitted plot checks linearity and constant variance. The residuals look roughly randomly scattered around zero, which suggests the errors are spread fairly evenly across the fitted values.
  2. The Q-Q plot shows most points along the reference line, with a very slight deviation at the end that doesn’t concern us too much. We can treat the residuals as approximately normally distributed.
  3. The Scale-Location plot shows no clear trend, which suggests the residuals are spread about equally along the range of fitted values.
  4. The Residuals vs Leverage plot shows all observations inside the Cook’s distance contours, which means that no single observation is highly influential.

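Overall I do not see major problems with the fit, with two caveats.

The residual summary in (c) shows a largest residual of 13.06 against a residual standard error of 3.328, which is roughly 3.9 standard errors. By the usual rule that a standardized residual beyond 3 in absolute value is suspect, at least that observation counts as an unusually large outlier.

Leverage is read from the horizontal axis of the Residuals vs Leverage plot, not from the Cook’s distance contours. All observations are inside the contours, so no single point is highly influential, but a point far to the right of the rest would still have high leverage. The benchmark is the average leverage, (p + 1)/n = 8/392 ≈ 0.02.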
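Outliers and leverage can also be checked numerically rather than by eye. The sketch below uses the built-in mtcars data as a stand-in so that it runs without the ISLR package; with the homework objects one would pass model9c instead of `fit`. The cut-offs (3 for studentized residuals, three times the average leverage) are common rules of thumb, not fixed standards.

```r
# Stand-in fit on built-in data; substitute model9c for the Auto analysis
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

n <- nobs(fit)
p <- length(coef(fit))          # number of coefficients, intercept included

stud <- rstudent(fit)           # studentized residuals (outlier check)
lev  <- hatvalues(fit)          # leverage of each observation
cook <- cooks.distance(fit)     # influence of each observation

avg_lev  <- p / n               # leverages always average to p / n
outliers <- names(stud)[abs(stud) > 3]
high_lev <- names(lev)[lev > 3 * avg_lev]

print(outliers)
print(high_lev)
print(round(max(cook), 3))
```

Keeping the three measures separate makes the distinction explicit: a point can have a large residual, high leverage, or both, and only the combination shows up as a large Cook’s distance.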

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

model9e <- lm(mpg ~ . -name + displacement:weight + horsepower:acceleration + year:origin, data = Auto)
summary(model9e)
## 
## Call:
## lm(formula = mpg ~ . - name + displacement:weight + horsepower:acceleration + 
##     year:origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7073 -1.6687  0.0337  1.4242 12.8153 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              5.726e+00  8.643e+00   0.663 0.508026    
## cylinders                3.390e-01  2.915e-01   1.163 0.245533    
## displacement            -7.449e-02  1.076e-02  -6.925 1.86e-11 ***
## horsepower               5.107e-02  2.420e-02   2.110 0.035494 *  
## weight                  -8.580e-03  8.525e-04 -10.065  < 2e-16 ***
## acceleration             5.743e-01  1.545e-01   3.717 0.000231 ***
## year                     5.097e-01  9.887e-02   5.155 4.09e-07 ***
## origin                  -1.228e+01  4.128e+00  -2.975 0.003118 ** 
## displacement:weight      1.948e-05  2.325e-06   8.381 1.03e-15 ***
## horsepower:acceleration -6.668e-03  1.730e-03  -3.854 0.000136 ***
## year:origin              1.639e-01  5.303e-02   3.091 0.002143 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.87 on 381 degrees of freedom
## Multiple R-squared:  0.8682, Adjusted R-squared:  0.8648 
## F-statistic: 251.1 on 10 and 381 DF,  p-value: < 2.2e-16

The interactions displacement:weight, horsepower:acceleration, and year:origin all appear to be statistically significant (each p-value is below 0.01).

Not every interaction is significant, however, as the next model shows.

model9e_2 <- lm(mpg ~ . -name + displacement:horsepower + weight:acceleration + cylinders:origin, data = Auto)
summary(model9e_2)
## 
## Call:
## lm(formula = mpg ~ . - name + displacement:horsepower + weight:acceleration + 
##     cylinders:origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2931 -1.6317 -0.1017  1.4266 12.5900 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -5.724e+00  7.483e+00  -0.765    0.445    
## cylinders                4.400e-01  4.842e-01   0.909    0.364    
## displacement            -7.077e-02  1.173e-02  -6.032 3.84e-09 ***
## horsepower              -1.909e-01  2.264e-02  -8.431 7.13e-16 ***
## weight                  -1.831e-03  1.688e-03  -1.084    0.279    
## acceleration             2.969e-02  3.017e-01   0.098    0.922    
## year                     7.406e-01  4.551e-02  16.275  < 2e-16 ***
## origin                   1.315e-01  1.200e+00   0.110    0.913    
## displacement:horsepower  4.926e-04  6.194e-05   7.953 2.10e-14 ***
## weight:acceleration     -8.485e-05  1.000e-04  -0.848    0.397    
## cylinders:origin         1.319e-01  2.832e-01   0.465    0.642    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.916 on 381 degrees of freedom
## Multiple R-squared:  0.864,  Adjusted R-squared:  0.8604 
## F-statistic:   242 on 10 and 381 DF,  p-value: < 2.2e-16

In this model, the interaction displacement:horsepower is statistically significant, while weight:acceleration and cylinders:origin are not.
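The models above only use the `:` operator. For reference, `a * b` in an R formula is shorthand for `a + b + a:b`, so the two spellings give the same fit. The sketch below shows this on the built-in mtcars data (a stand-in chosen so the block runs without ISLR) and adds a nested F-test as one way to judge whether an interaction is worth keeping.

```r
# Same model written two ways: explicit main effects plus ':' versus '*'
fit_colon <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)
fit_star  <- lm(mpg ~ wt * hp, data = mtcars)
print(all.equal(coef(fit_colon), coef(fit_star)))

# Nested F-test: does adding the interaction improve on main effects only?
fit_main <- lm(mpg ~ wt + hp, data = mtcars)
print(anova(fit_main, fit_star))
```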

(f) Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.

model9f <- lm(mpg ~ cylinders + displacement + I(horsepower^2) + log(weight) + sqrt(acceleration) + year + origin, data = Auto)
summary(model9f)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + I(horsepower^2) + 
##     log(weight) + sqrt(acceleration) + year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3864 -1.9909  0.0413  1.6213 12.8007 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.329e+02  1.106e+01  12.011  < 2e-16 ***
## cylinders          -3.447e-01  3.050e-01  -1.130  0.25909    
## displacement        1.532e-02  7.259e-03   2.111  0.03541 *  
## I(horsepower^2)     4.171e-05  3.947e-05   1.057  0.29135    
## log(weight)        -2.261e+01  1.530e+00 -14.773  < 2e-16 ***
## sqrt(acceleration)  1.569e+00  6.473e-01   2.424  0.01582 *  
## year                8.073e-01  4.723e-02  17.092  < 2e-16 ***
## origin              8.687e-01  2.636e-01   3.295  0.00108 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.095 on 384 degrees of freedom
## Multiple R-squared:  0.8456, Adjusted R-squared:  0.8427 
## F-statistic: 300.3 on 7 and 384 DF,  p-value: < 2.2e-16

The Adjusted R-squared increased from 0.8182 in the untransformed model from (c), which uses the same variables, to 0.8427 in the model with transformations.

RSE dropped from 3.328 to 3.095, which indicates a closer fit with lower residual variability.

log(weight) is highly significant (t = -14.773), confirming that weight has a strong negative effect on mpg.

sqrt(acceleration) is now significant at the 0.05 level (p = 0.016), whereas acceleration was not in (c) (p = 0.415). This suggests that acceleration’s effect on mpg may be non-linear, although the other terms in the model changed as well.

Replacing horsepower with I(horsepower^2) did not make it significant (p = 0.291, versus 0.220 for horsepower in (c)), so a purely quadratic horsepower term does not appear to help here.
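One way to compare transformations more systematically is to fit each candidate on a single predictor and line up the Adjusted R-squared values. The sketch below does this on the built-in mtcars data as a stand-in (so it runs without ISLR); the list of formulas is my own choice for illustration. Note that the quadratic candidate keeps the linear term alongside the squared one, which is the more usual specification than squaring alone.

```r
# Candidate specifications for the effect of weight on mpg (mtcars stand-in)
forms <- list(
  linear = mpg ~ wt,
  log    = mpg ~ log(wt),
  sqrt   = mpg ~ sqrt(wt),
  square = mpg ~ wt + I(wt^2)    # linear and squared terms together
)

# Fit each model and collect its Adjusted R-squared
adj_r2 <- sapply(forms, function(f) summary(lm(f, data = mtcars))$adj.r.squared)
print(round(adj_r2, 4))
```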

Problem 10

library(ISLR)
attach(Carseats)
head(Carseats)
##   Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1  9.50       138     73          11        276   120       Bad  42        17
## 2 11.22       111     48          16        260    83      Good  65        10
## 3 10.06       113     35          10        269    80    Medium  59        12
## 4  7.40       117    100           4        466    97    Medium  55        14
## 5  4.15       141     64           3        340   128       Bad  38        13
## 6 10.81       124    113          13        501    72       Bad  78        16
##   Urban  US
## 1   Yes Yes
## 2   Yes Yes
## 3   Yes Yes
## 4   Yes Yes
## 5   Yes  No
## 6    No Yes
summary(Carseats)
##      Sales          CompPrice       Income        Advertising    
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
##    Population        Price        ShelveLoc        Age          Education   
##  Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0  
##  1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75   1st Qu.:12.0  
##  Median :272.0   Median :117.0   Medium:219   Median :54.50   Median :14.0  
##  Mean   :264.8   Mean   :115.8                Mean   :53.32   Mean   :13.9  
##  3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0  
##  Max.   :509.0   Max.   :191.0                Max.   :80.00   Max.   :18.0  
##  Urban       US     
##  No :118   No :142  
##  Yes:282   Yes:258  
##                     
##                     
##                     
## 

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

model10a <- lm(Sales ~ Price + Urban + US , data = Carseats)
summary(model10a)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
coef(model10a)[2]
##       Price 
## -0.05445885

(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

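Sales is measured in thousands of units, so the coefficient for Price of -0.054459 means that, holding Urban and US fixed, each one-dollar increase in price is associated with a decrease of about 54 units sold on average.

The coefficient for UrbanYes is -0.021916, which means that stores in urban locations sell about 22 fewer units on average than rural stores with the same Price and US status. With a p-value of 0.936, this difference is not statistically distinguishable from zero.

The coefficient for USYes is 1.200573, which means that, holding Price and Urban fixed, US stores sell about 1,200 more units on average than stores outside the US.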

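(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

\(Sales = 13.04 - 0.054 \cdot Price - 0.022 \cdot Urban + 1.20 \cdot US\), where Urban = 1 if the store is in an urban location (0 otherwise), US = 1 if the store is in the US (0 otherwise), and Sales is in thousands of units.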
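Because Urban and US enter as 0/1 dummies, the equation really describes four parallel lines in Price, one per store type. The sketch below evaluates the fitted equation at an arbitrary price of 100 using the coefficients from the summary output; `predict_sales()` is a hypothetical helper written for this illustration.

```r
# Coefficients copied from summary(model10a)
b <- c(intercept = 13.043469, price = -0.054459,
       urban = -0.021916, us = 1.200573)

# Hypothetical helper: urban and us are 0/1 dummy values
predict_sales <- function(price, urban, us) {
  unname(b["intercept"] + b["price"] * price +
           b["urban"] * urban + b["us"] * us)
}

# All four combinations of the two dummies at Price = 100
stores <- expand.grid(urban = c(0, 1), us = c(0, 1))
stores$sales <- predict_sales(price = 100, urban = stores$urban, us = stores$us)
print(stores)   # sales in thousands of units
```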

(d) For which of the predictors can you reject the null hypothesis \(H_0 : \beta_j = 0\)?

Price and US because they both have p-values lower than 0.05, which means we can reject the null hypothesis.

Urban's p-value is higher than 0.05 and actually very high (0.936) which means we can’t reject the null hypothesis.
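The p-value for Urban can be reproduced from the coefficient table: the t-statistic is the estimate divided by its standard error, and the two-sided p-value comes from a t distribution with the residual degrees of freedom. A short sketch using the numbers from the output (variable names are illustrative):

```r
# UrbanYes row of summary(model10a), plus the residual df
est <- -0.021916
se  <- 0.271650
df  <- 396

t_val <- est / se                     # t-statistic
p_val <- 2 * pt(-abs(t_val), df)      # two-sided p-value
print(round(t_val, 3))
print(round(p_val, 3))
```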

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

model10e <- lm(Sales ~ Price + US, data = Carseats)
summary(model10e)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data?

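Not well. The Adjusted R-squared is 0.2335 for the model in (a) and 0.2354 for the model in (e), so each explains only about 24% of the variance in Sales. The residual standard errors (2.472 and 2.469) are also nearly identical, so dropping Urban costs nothing. I’d prefer an Adjusted R-squared above 0.7.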
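Both models report the same Multiple R-squared (0.2393), yet the smaller one has the higher Adjusted R-squared. The adjustment penalizes each extra predictor, which the sketch below makes explicit. `adj_r2()` is a helper defined here for illustration, and small differences from the printed values come from using the rounded R-squared.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 400
adj_a <- adj_r2(0.2393, n, p = 3)   # model (a): Price, Urban, US
adj_e <- adj_r2(0.2393, n, p = 2)   # model (e): Price, US
print(round(c(a = adj_a, e = adj_e), 4))
```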

(g) Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

confint(model10e)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
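These intervals are the estimate plus or minus a t critical value times the standard error. As a check, the sketch below rebuilds the Price and USYes intervals from the rounded values in the summary of the model from (e); agreement with confint() should hold to about four decimal places.

```r
# Estimates and standard errors copied from summary(model10e)
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price = 0.00523,  USYes = 0.25846)

t_crit <- qt(0.975, df = 397)     # 397 residual degrees of freedom
ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
print(round(ci, 4))
```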

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow = c(2, 2))
plot(model10e)

summary(influence.measures(model10e))
## Potentially influential observations of
##   lm(formula = Sales ~ Price + US, data = Carseats) :
## 
##     dfb.1_ dfb.Pric dfb.USYs dffit   cov.r   cook.d hat    
## 26   0.24  -0.18    -0.17     0.28_*  0.97_*  0.03   0.01  
## 29  -0.10   0.10    -0.10    -0.18    0.97_*  0.01   0.01  
## 43  -0.11   0.10     0.03    -0.11    1.05_*  0.00   0.04_*
## 50  -0.10   0.17    -0.17     0.26_*  0.98    0.02   0.01  
## 51  -0.05   0.05    -0.11    -0.18    0.95_*  0.01   0.00  
## 58  -0.05  -0.02     0.16    -0.20    0.97_*  0.01   0.01  
## 69  -0.09   0.10     0.09     0.19    0.96_*  0.01   0.01  
## 126 -0.07   0.06     0.03    -0.07    1.03_*  0.00   0.03_*
## 160  0.00   0.00     0.00     0.01    1.02_*  0.00   0.02  
## 166  0.21  -0.23    -0.04    -0.24    1.02    0.02   0.03_*
## 172  0.06  -0.07     0.02     0.08    1.03_*  0.00   0.02  
## 175  0.14  -0.19     0.09    -0.21    1.03_*  0.02   0.03_*
## 210 -0.14   0.15    -0.10    -0.22    0.97_*  0.02   0.01  
## 270 -0.03   0.05    -0.03     0.06    1.03_*  0.00   0.02  
## 298 -0.06   0.06    -0.09    -0.15    0.97_*  0.01   0.00  
## 314 -0.05   0.04     0.02    -0.05    1.03_*  0.00   0.02_*
## 353 -0.02   0.03     0.09     0.15    0.97_*  0.01   0.00  
## 357  0.02  -0.02     0.02    -0.03    1.03_*  0.00   0.02  
## 368  0.26  -0.23    -0.11     0.27_*  1.01    0.02   0.02_*
## 377  0.14  -0.15     0.12     0.24    0.95_*  0.02   0.01  
## 384  0.00   0.00     0.00     0.00    1.02_*  0.00   0.02  
## 387 -0.03   0.04    -0.03     0.05    1.02_*  0.00   0.02  
## 396 -0.05   0.05     0.08     0.14    0.98_*  0.01   0.00

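No point comes close to the Cook’s distance contours in the Residuals vs Leverage plot; the contours are not even visible within the plotting range. The residuals range from about -6.93 to 7.05 against a residual standard error of 2.469, so none lies beyond roughly three standard errors, and there is no strong evidence of outliers. High-leverage points are present, however: in the influence.measures() output, the hat values of observations 43, 126, 166, 175, 314, and 368 are flagged.

No single observation appears to be highly influential: the largest Cook’s distance is 0.03 and every DFBETAS is at most 0.26 in absolute value. DFFITS is flagged for observations 26, 50, and 368, but the values (0.26 to 0.28) are small.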
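The `_*` flags in the influence.measures() table come from fixed cut-offs that depend on the sample size n and the number of coefficients k. The sketch below computes them for this model (n = 400, k = 3). The formulas are my reading of how R's influence.measures() marks observations and should be treated as an assumption to confirm against the function's documentation.

```r
n <- 400   # observations in Carseats
k <- 3     # coefficients in Sales ~ Price + US (intercept included)

hat_cut    <- 3 * k / n               # leverage flagged above this
dffits_cut <- 3 * sqrt(k / (n - k))   # |DFFITS| flagged above this
covr_cut   <- 3 * k / (n - k)         # |1 - cov.r| flagged above this
cook_cut   <- qf(0.5, k, n - k)       # Cook's distance flagged above this

print(round(c(hat = hat_cut, dffits = dffits_cut,
              cov.r = covr_cut, cook.d = cook_cut), 4))
```

Under these cut-offs the largest Cook’s distance in the table (0.03) is far below its threshold, while hat values of 0.03 and 0.04 exceed the leverage threshold, which matches the pattern of flags in the output.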

Problem 12

(a) Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

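Without an intercept, the estimate for the regression of Y onto X is the sum of the products x_i y_i divided by the sum of the x_i^2, and the estimate for X onto Y is the same numerator divided by the sum of the y_i^2. The two estimates are therefore equal exactly when the sums of squares of X and Y are equal. The points do not need to lie on a 45-degree line through the origin; Y = X and Y = -X are just special cases that satisfy this condition.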
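Written out, the comparison looks as follows. The subscripts on \(\hat{\beta}\) are my own notation for which variable is the response; the formula itself is (3.38) applied in each direction.

\[
\hat{\beta}_{Y \sim X} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},
\qquad
\hat{\beta}_{X \sim Y} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2}
\]

\[
\hat{\beta}_{Y \sim X} = \hat{\beta}_{X \sim Y}
\iff
\sum_{i=1}^{n} x_i y_i \left( \frac{1}{\sum_{i=1}^{n} x_i^2} - \frac{1}{\sum_{i=1}^{n} y_i^2} \right) = 0
\iff
\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2
\ \text{ or } \
\sum_{i=1}^{n} x_i y_i = 0
\]

The second case is degenerate: both estimates are then zero.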

(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(42)
n <- 100
X <- rnorm(n, mean = 0, sd = 2)
Y <- 0.5 * X + rnorm(n, mean = 0, sd = 1)

model12b <- lm(Y ~ 0 + X) #0 is to make sure there is no intercept
summary(model12b)
## 
## Call:
## lm(formula = Y ~ 0 + X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9815 -0.5947 -0.0741  0.4498  2.7669 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## X   0.5122     0.0438    11.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9081 on 99 degrees of freedom
## Multiple R-squared:  0.5801, Adjusted R-squared:  0.5759 
## F-statistic: 136.8 on 1 and 99 DF,  p-value: < 2.2e-16
model12b_2 <- lm(X ~ 0 + Y)
summary(model12b_2)
## 
## Call:
## lm(formula = X ~ 0 + Y)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3644 -0.7187  0.1213  1.1146  2.6722 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## Y  1.13251    0.09683    11.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.35 on 99 degrees of freedom
## Multiple R-squared:  0.5801, Adjusted R-squared:  0.5759 
## F-statistic: 136.8 on 1 and 99 DF,  p-value: < 2.2e-16

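Since sum(X^2) != sum(Y^2), the regression coefficients are different. (Without an intercept, the relevant quantities are the sums of squares rather than the variances.)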

Y on X coefficient = 0.5122

X on Y coefficient = 1.13251
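Both coefficients can be reproduced directly from formula (3.38), which also explains a detail in the output: the product of the two slopes equals the R-squared that both no-intercept fits report. The sketch below regenerates the same data with the same seed; the names `b_yx` and `b_xy` are my own.

```r
set.seed(42)
n <- 100
X <- rnorm(n, mean = 0, sd = 2)
Y <- 0.5 * X + rnorm(n, mean = 0, sd = 1)

# No-intercept slopes from formula (3.38)
b_yx <- sum(X * Y) / sum(X^2)   # Y onto X
b_xy <- sum(X * Y) / sum(Y^2)   # X onto Y

fit_yx <- lm(Y ~ 0 + X)
fit_xy <- lm(X ~ 0 + Y)

print(c(b_yx, b_xy))
print(b_yx * b_xy)                 # product of the two slopes
print(summary(fit_yx)$r.squared)   # R-squared of the no-intercept fit
```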

(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

set.seed(42)
n <- 100
X <- rnorm(n, mean = 0, sd = 1)
Y <- X

model12c <- lm(Y ~ 0 + X)
summary(model12c)
## 
## Call:
## lm(formula = Y ~ 0 + X)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.178e-14 -1.028e-16  1.120e-17  1.112e-16  4.124e-16 
## 
## Coefficients:
##    Estimate Std. Error   t value Pr(>|t|)    
## X 1.000e+00  1.154e-16 8.668e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.196e-15 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.514e+31 on 1 and 99 DF,  p-value: < 2.2e-16
model12c_2 <- lm(X ~ 0 + Y)
summary(model12c_2)
## 
## Call:
## lm(formula = X ~ 0 + Y)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.178e-14 -1.028e-16  1.120e-17  1.112e-16  4.124e-16 
## 
## Coefficients:
##    Estimate Std. Error   t value Pr(>|t|)    
## Y 1.000e+00  1.154e-16 8.668e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.196e-15 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.514e+31 on 1 and 99 DF,  p-value: < 2.2e-16

Since X = Y, their sums of squares are equal, making the coefficients identical (both equal to 1).
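Setting Y equal to X is the simplest example, but the condition only requires equal sums of squares. A less trivial sketch, under the assumption that a random permutation is an acceptable way to build Y: shuffling X keeps the same values (and so the same sum of squares) while breaking the perfect linear relationship.

```r
set.seed(42)
n <- 100
X <- rnorm(n)
Y <- sample(X)    # same values in a different order, so the sums of squares match

b_yx <- unname(coef(lm(Y ~ 0 + X)))   # Y onto X, no intercept
b_xy <- unname(coef(lm(X ~ 0 + Y)))   # X onto Y, no intercept

print(all.equal(sum(X^2), sum(Y^2)))
print(c(b_yx, b_xy))
```

Here the two slopes agree even though the points are nowhere near a 45-degree line.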