Assignment 2

Rudy Martinez

6/14/2021


Libraries

library(MASS)
library(ISLR)

Exercises

Exercise 2

Carefully explain the differences between the KNN classifier and KNN regression methods.

  • KNN Classifier
    • Given a positive integer K and a test observation x0, the KNN classifier first identifies the K points in the training data that are closest to x0, represented by N0. It then estimates the conditional probability for class j as the fraction of points in N0 whose response values equal j, and assigns x0 to the class with the highest estimated probability. The response is qualitative, so the prediction is a class label.
  • KNN Regression
    • Given a value for K and a prediction point x0, KNN regression first identifies the K training observations that are closest to x0, represented by N0. It then estimates f(x0) using the average of all the training responses in N0. The response is quantitative, so the prediction is a number rather than a class label.
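
The contrast between the two methods can be sketched in a few lines of base R. This is only an illustrative sketch on made-up one-dimensional data: the helper names (knn_neighbors, knn_classify, knn_regress) and the toy values are assumptions, not part of the exercise.

```r
# Minimal base-R sketch of KNN on one-dimensional toy data (hypothetical values).
knn_neighbors <- function(train_x, x0, k) {
  # Indices of the k training points closest to x0 (the set N0)
  order(abs(train_x - x0))[seq_len(k)]
}

knn_classify <- function(train_x, train_class, x0, k) {
  n0 <- knn_neighbors(train_x, x0, k)
  # Fraction of neighbors in each class = estimated conditional probability
  probs <- table(train_class[n0]) / k
  names(probs)[which.max(probs)]   # predicted class: the most frequent one in N0
}

knn_regress <- function(train_x, train_y, x0, k) {
  n0 <- knn_neighbors(train_x, x0, k)
  mean(train_y[n0])                # predicted value: average response in N0
}

train_x     <- c(1, 2, 3, 6, 7, 8)
train_class <- c("a", "a", "b", "b", "b", "b")
train_y     <- c(1.0, 2.0, 3.0, 6.0, 7.0, 8.0)

class_pred <- knn_classify(train_x, train_class, x0 = 2.2, k = 3)
reg_pred   <- knn_regress(train_x, train_y, x0 = 2.2, k = 3)
```

Both functions share the same neighbor search; they differ only in the last step, a majority vote versus an average.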


Exercise 9

(a) This question involves the use of multiple linear regression on the Auto data set. Produce a scatterplot matrix which includes all of the variables in the data set.

# Exercise 9-a
pairs(Auto)

names(Auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"       "name"

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

# Exercise 9-b
cor(Auto[ , -9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.

# Exercise 9-c
lm.auto.fit = lm(mpg ~ . - name , data = Auto)
summary (lm.auto.fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
coef(lm.auto.fit)[7]
##      year 
## 0.7507727
  • i. Is there a relationship between the predictors and the response?
    • Because the p-value of the F-statistic for the overall model is below 2.2e-16, we reject the null hypothesis that all of the coefficients are zero and conclude that there is a relationship between the predictors and mpg.
  • ii. Which predictors appear to have a statistically significant relationship to the response?
    • Displacement, weight, year, and origin appear to have a statistically significant relationship with the response (mpg): their p-values fall below the 0.05 significance level (0.00844, < 2e-16, < 2e-16, and 4.67e-07 respectively). Cylinders, horsepower, and acceleration do not.
  • iii. What does the coefficient for the year variable suggest?
    • A coefficient of 0.7507727 means that an increase of 1 in year (with all other predictors held constant) is associated with an increase of about 0.75 in the response (mpg). In other words, newer cars tend to be more fuel efficient.
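
The "< 2.2e-16" shown by summary() is a printing floor rather than the exact p-value. As a hedged sketch, the overall F-test p-value can be recomputed from the stored F-statistic. The built-in mtcars data and the model below are stand-ins chosen so the snippet runs without extra packages; with ISLR loaded, the same calls should apply to lm.auto.fit.

```r
# Stand-in model on built-in data (an assumption; not the Auto fit above)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# summary()$fstatistic is a named vector: value, numdf, dendf
f <- summary(fit)$fstatistic

# Upper-tail probability of the F distribution = p-value of the overall test
p_overall <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
```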


(d) Use the plot() function to produce diagnostic plots of the linear regression fit.

# Exercise 9-d
par(mfrow = c(2,2))
plot(lm.auto.fit)

  • Comment on any problems you see with the fit.
    • Looking at the Normal Q-Q plot, most of the points fall along the reference line. At the extremes of the plot, however, the points curve away from the line, which suggests that the residuals are not normally distributed. The Standardized Residuals plot reinforces this: a considerable number of observations lie above 1.5 on the y-axis.
  • Do the residual plots suggest any unusually large outliers?
    • Yes, both the Residuals and Standardized Residual plots indicate outliers.
  • Does the leverage plot identify any observations with unusually high leverage?
    • Based on the Cook’s Distance plot, the following observations are flagged as unduly influential: 14, 327, and 394.


(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

# Exercise 9-e
lm.auto.fit_interact = lm(formula = mpg ~ . * ., data = Auto[, -9])
summary (lm.auto.fit_interact)
## 
## Call:
## lm(formula = mpg ~ . * ., data = Auto[, -9])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16
  • The following interactions appear to be statistically significant: displacement:year, acceleration:year, and acceleration:origin, based on their p-values below the significance level of 0.05 (0.01352, 0.03033, and 0.00365 respectively).
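
For reference, the difference between the * and : operators can be inspected without fitting anything. The sketch below uses column names from the built-in mtcars data purely as placeholders; terms() only parses the formula.

```r
# a * b expands to both main effects plus their interaction
star_terms  <- attr(terms(mpg ~ wt * hp), "term.labels")

# a:b is the interaction term alone, with no main effects
colon_terms <- attr(terms(mpg ~ wt:hp), "term.labels")

# (a + b + c)^2 gives all main effects and all pairwise interactions,
# which is what `. * .` produces for every column in the data
pair_terms  <- attr(terms(mpg ~ (wt + hp + qsec)^2), "term.labels")

# With 7 predictors there are 7 main effects and choose(7, 2) pairwise
# interactions, matching the 28 numerator degrees of freedom reported above
n_terms_auto <- 7 + choose(7, 2)
```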


(f) Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.

# Exercise 9-f
my_lm_f = lm(mpg ~ . - name + log(weight) + sqrt(origin) + I(displacement^2) + I(year^2), data = Auto)
summary(my_lm_f)
## 
## Call:
## lm(formula = mpg ~ . - name + log(weight) + sqrt(origin) + I(displacement^2) + 
##     I(year^2), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1242 -1.4752  0.1069  1.4306 12.3150 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.040e+02  8.264e+01   7.309 1.60e-12 ***
## cylinders          2.961e-01  3.318e-01   0.892  0.37269    
## displacement      -4.865e-02  2.002e-02  -2.430  0.01555 *  
## horsepower        -5.546e-02  1.317e-02  -4.212 3.16e-05 ***
## weight             4.394e-03  1.972e-03   2.229  0.02641 *  
## acceleration      -2.983e-02  8.474e-02  -0.352  0.72503    
## year              -1.098e+01  1.856e+00  -5.912 7.51e-09 ***
## origin            -5.442e+00  3.338e+00  -1.630  0.10383    
## log(weight)       -2.632e+01  6.264e+00  -4.202 3.30e-05 ***
## sqrt(origin)       1.625e+01  9.164e+00   1.773  0.07704 .  
## I(displacement^2)  1.027e-04  3.215e-05   3.195  0.00152 ** 
## I(year^2)          7.736e-02  1.219e-02   6.345 6.35e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.815 on 380 degrees of freedom
## Multiple R-squared:  0.8735, Adjusted R-squared:  0.8699 
## F-statistic: 238.6 on 11 and 380 DF,  p-value: < 2.2e-16
  • Findings: After adding the transformed terms, the following terms have a statistically significant relationship with the response (mpg) at the 0.05 level.
    • displacement (p-value = 0.01555)
    • horsepower (p-value = 3.16e-05)
    • weight (p-value = 0.02641)
    • year (p-value = 7.51e-09)
    • log(weight) (p-value = 3.30e-05)
    • I(displacement^2) (p-value = 0.00152)
    • I(year^2) (p-value = 6.35e-10)
  • Additionally, the model has an Adjusted R-squared of 0.8699, which is higher than the original model's Adjusted R-squared of 0.8182.
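
Because the transformed model contains every term of the original model, the two are nested, and a partial F-test is one way to check whether the added terms help beyond the change in Adjusted R-squared. The sketch below uses mtcars and an arbitrary pair of transformations as stand-ins (both are assumptions); with ISLR loaded, anova(lm.auto.fit, my_lm_f) would be the analogous call.

```r
# Stand-in nested models on built-in data (not the Auto fits above)
base_fit <- lm(mpg ~ wt + hp, data = mtcars)
ext_fit  <- lm(mpg ~ wt + hp + I(wt^2) + log(hp), data = mtcars)

# Partial F-test: do the added transformed terms reduce the residual
# sum of squares by more than chance would explain?
cmp     <- anova(base_fit, ext_fit)
p_added <- cmp[["Pr(>F)"]][2]

# Adjusted R-squared for both models, side by side
adj_r2 <- c(base     = summary(base_fit)$adj.r.squared,
            extended = summary(ext_fit)$adj.r.squared)
```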


Exercise 10

(a) This question should be answered using the Carseats data set. Fit a multiple regression model to predict Sales using Price, Urban, and US.

# Exercise 10-a
lm.carseats.fit = lm(Sales ~ Price + Urban + US, data = Carseats)
summary (lm.carseats.fit)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16


(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

#Exercise 10-b
?Carseats
  • Price: Price company charges for car seats at each site
    • A coefficient of -0.054459 means that a 1-unit ($1) increase in the Price the company charges for car seats (with all other predictors held constant) is associated with a decrease of 0.054459 in the response (Sales). Because Sales is recorded in thousands of units, this equates to about 54 fewer units sold.
  • Urban: A factor with levels No and Yes to indicate whether the store is in an urban or rural location
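    • The coefficient of -0.021916 suggests that urban stores sell about 22 fewer units than rural stores on average, with all other predictors held constant. However, because the p-value (0.936) is well above the significance level of 0.05, there is not enough evidence for a relationship between the location of the store and the number of sales.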
  • US: A factor with levels No and Yes to indicate whether the store is in the US or not
    • There is a positive relationship between USYes and Sales. On average, unit sales in a US store are about 1,201 units higher (1.200573 thousand) than in a non-US store when all other predictors are held constant.


(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

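  • Sales = 13.043469 + (-0.054459)Price + (-0.021916)UrbanYes + (1.200573)USYes + ε
    • UrbanYes = 1 if the store is in an urban location and 0 otherwise; USYes = 1 if the store is in the US and 0 otherwise.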
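
The 0/1 coding of the qualitative variables can be seen directly in the design matrix that lm() builds. The small data frame below is invented for illustration (its values are not from Carseats); only the column names mirror the model.

```r
# Hypothetical four-store data frame with the same variable names
stores <- data.frame(
  Price = c(100, 120, 90, 110),
  Urban = factor(c("No", "Yes", "Yes", "No")),
  US    = factor(c("Yes", "Yes", "No", "No"))
)

# model.matrix() shows the dummy variables R creates for each factor:
# "No" is the baseline level, and the "Yes" level gets a 0/1 column
X <- model.matrix(~ Price + Urban + US, data = stores)
```

Each factor contributes one column named after the non-baseline level, which is why the fitted coefficients are labelled UrbanYes and USYes.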


(d) For which of the predictors can you reject the null hypothesis H0 : βj = 0?

  • Due to p-values below the significance level of 0.05, we can reject the null hypothesis for Price and USYes.


(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

#Exercise 10-e
lm.carseats.fit_small = lm(Sales ~ Price + US, data = Carseats)
summary (lm.carseats.fit_small)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16


(f) How well do the models in (a) and (e) fit the data?

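  • Based on the models’ Adjusted R-squared, (a) at 0.2335 and (e) at 0.2354, model (e) fits the data slightly better, although only marginally. Neither model fits well: both have a Multiple R-squared of 0.2393, so they explain only about 24% of the variance in Sales.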


(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

#Exercise 10-g
?confint
confint(lm.carseats.fit_small, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
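
As a cross-check on what confint() reports, the same intervals can be rebuilt as estimate ± t critical value × standard error. The sketch below runs on simulated data (an assumption, so that it needs no extra packages) rather than on the Carseats fit.

```r
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

# Estimates and standard errors from the coefficient table
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]

# 97.5% quantile of the t distribution on the residual degrees of freedom
tcrit <- qt(0.975, df = df.residual(fit))

manual_ci  <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
builtin_ci <- confint(fit, level = 0.95)
```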


(h) Is there evidence of outliers or high leverage observations in the model from (e)?

#Exercise 10-h
par(mfrow = c(2, 2))
plot(lm.carseats.fit_small)

  • Do the residual plots suggest any unusually large outliers?
    • The Residuals and Standardized Residuals plots indicate the presence of several outliers.
  • Does the leverage plot identify any observations with unusually high leverage?
    • Based on the Cook’s Distance plot, there are a couple of unduly influential points.
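
The visual reading of the plots can be backed up numerically. The helper below is a sketch: flag_unusual is a hypothetical name, and the cutoffs (studentized residuals beyond ±3, leverage above twice the average value p/n) are common rules of thumb rather than fixed standards. It is demonstrated on mtcars as a stand-in; with ISLR loaded it could be applied to lm.carseats.fit_small.

```r
flag_unusual <- function(fit) {
  n <- nobs(fit)
  p <- length(coef(fit))   # number of coefficients, including the intercept
  list(
    # Observations whose studentized residual exceeds 3 in absolute value
    outliers      = which(abs(rstudent(fit)) > 3),
    # Observations whose leverage is more than twice the average p / n
    high_leverage = which(hatvalues(fit) > 2 * p / n),
    # The three largest Cook's distances
    cooks_top     = head(sort(cooks.distance(fit), decreasing = TRUE), 3)
  )
}

fit   <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in model
flags <- flag_unusual(fit)
```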


Exercise 12

(a) This problem involves simple linear regression without an intercept. Recall that the coefficient estimate βˆ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
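
One way to approach this, sketched from the no-intercept formula the exercise cites as (3.38): swapping the roles of X and Y leaves the numerator unchanged and only changes the denominator.

```latex
\hat{\beta}_{Y \text{ onto } X} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},
\qquad
\hat{\beta}_{X \text{ onto } Y} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2}
```

Assuming the shared numerator is nonzero, the two estimates are therefore equal exactly when \sum x_i^2 = \sum y_i^2.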


(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

#Exercise 12-b
set.seed(2)
x = rnorm(100)
y = 2 * x + rnorm(100, sd = 2)
data = data.frame(x, y)

lm_y_by_x = lm(y ~ x + 0)
lm_x_by_y = lm(x ~ y + 0)

summary(lm_y_by_x)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2442 -1.5840  0.3314  1.4960  4.1668 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   1.9045     0.1705   11.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.968 on 99 degrees of freedom
## Multiple R-squared:  0.5577, Adjusted R-squared:  0.5532 
## F-statistic: 124.8 on 1 and 99 DF,  p-value: < 2.2e-16
summary(lm_x_by_y)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0447 -0.5794 -0.0017  0.5626  1.4081 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y  0.29283    0.02621   11.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7719 on 99 degrees of freedom
## Multiple R-squared:  0.5577, Adjusted R-squared:  0.5532 
## F-statistic: 124.8 on 1 and 99 DF,  p-value: < 2.2e-16


(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

#Exercise 12-c
# No random draws are used here, so no seed is needed.
# y is a permutation of x, so sum(x^2) equals sum(y^2).
x = 1:100
y = 100:1
data = data.frame(x, y)

lm_y_by_x_same = lm(y ~ x + 0)
lm_x_by_y_same = lm(x ~ y + 0)

summary(lm_y_by_x_same)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08
summary(lm_x_by_y_same)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08
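
The matching estimates above can be checked by hand. For regression without an intercept the coefficient is sum(x * y) divided by the sum of squares of the predictor, so the two directions agree whenever sum(x^2) equals sum(y^2). This is a quick sketch of that arithmetic; the variable names are illustrative.

```r
x <- 1:100
y <- 100:1

# No-intercept least squares coefficients computed directly
beta_y_on_x <- sum(x * y) / sum(x^2)
beta_x_on_y <- sum(x * y) / sum(y^2)

# The denominators agree because y is just x in reverse order
same_ss <- sum(x^2) == sum(y^2)

# Cross-check against lm()
beta_lm <- unname(coef(lm(y ~ x + 0)))
```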