Question 2

Carefully explain the differences between the KNN classifier and KNN regression methods.

K-nearest neighbors (KNN) classification and KNN regression are closely related; which one is used depends on the type of outcome variable. When the outcome is categorical, the KNN classifier identifies the \(k\) training observations nearest to the test point and assigns the test observation to the most common class among those neighbors. When the outcome is continuous, KNN regression is used instead: it likewise identifies the \(k\) nearest training observations, but predicts the response at the test point as the average of those neighbors' responses.
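
A minimal sketch of both methods in R on simulated data (the class and FNN packages, the simulated variables, and k = 5 are illustrative assumptions, not part of the original answer):

library(class)  # knn() for classification
library(FNN)    # knn.reg() for regression

set.seed(1)
x.train <- matrix(rnorm(200), ncol = 2)   # 100 training points in 2 dimensions
x.test  <- matrix(rnorm(20),  ncol = 2)   # 10 test points

# Categorical outcome: assign each test point the majority class of its 5 nearest neighbors
y.class <- factor(ifelse(x.train[, 1] + x.train[, 2] > 0, "Yes", "No"))
knn(train = x.train, test = x.test, cl = y.class, k = 5)

# Continuous outcome: predict each test point as the average response of its 5 nearest neighbors
y.cont <- x.train[, 1] + rnorm(100)
knn.reg(train = x.train, test = x.test, y = y.cont, k = 5)$pred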

Question 9

This question involves the use of multiple linear regression on the Auto data set.

  a. Produce a scatterplot matrix which includes all of the variables in the data set.
library(ISLR)
attach(Auto)
pairs(Auto)

  b. Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
library(dplyr)       # %>% and select()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
kable(Auto %>% 
  select(-name) %>% 
  cor(), digits = 2, align = 'c') %>%   
  kable_styling(bootstrap_options = c("striped", "hover"))
| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin |
|---|---|---|---|---|---|---|---|---|
| mpg | 1.00 | -0.78 | -0.81 | -0.78 | -0.83 | 0.42 | 0.58 | 0.57 |
| cylinders | -0.78 | 1.00 | 0.95 | 0.84 | 0.90 | -0.50 | -0.35 | -0.57 |
| displacement | -0.81 | 0.95 | 1.00 | 0.90 | 0.93 | -0.54 | -0.37 | -0.61 |
| horsepower | -0.78 | 0.84 | 0.90 | 1.00 | 0.86 | -0.69 | -0.42 | -0.46 |
| weight | -0.83 | 0.90 | 0.93 | 0.86 | 1.00 | -0.42 | -0.31 | -0.59 |
| acceleration | 0.42 | -0.50 | -0.54 | -0.69 | -0.42 | 1.00 | 0.29 | 0.21 |
| year | 0.58 | -0.35 | -0.37 | -0.42 | -0.31 | 0.29 | 1.00 | 0.18 |
| origin | 0.57 | -0.57 | -0.61 | -0.46 | -0.59 | 0.21 | 0.18 | 1.00 |
  c. Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.
lm.fit <- lm(mpg ~ . -name, data = Auto)
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
  i. Is there a relationship between the predictors and the response?

    Based on the very small p-value of the overall F-statistic (< 2.2e-16), we reject the null hypothesis that all regression coefficients are zero and conclude that there is a relationship between at least one of the predictors and mpg.

  ii. Which predictors appear to have a statistically significant relationship to the response?

    From the table above, displacement, year, and origin have significant positive relationships with mpg, while weight has a significant negative relationship. Cylinders, horsepower, and acceleration show no statistically significant relationship with mpg in this model.

  iii. What does the coefficient for the year variable suggest?

    The coefficient for year suggests that gas mileage improves over time: each additional model year is associated with an expected increase of about 0.75 miles per gallon, holding the other predictors fixed.
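
    As an illustrative calculation (not part of the original answer), the implied effect of a ten-model-year difference is roughly 7.5 mpg, other predictors held fixed:

# Expected change in mpg over 10 model years, holding the other predictors fixed
coef(lm.fit)["year"] * 10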

  d. Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

    Based on the diagnostic plots below, there is some evidence of non-linearity in the data, indicated by the curved red line in the Residuals vs Fitted plot. There is slight skewness in the right tail of the Normal Q-Q plot, but overall the residuals appear approximately normally distributed. Several observations have standardized residuals greater than 3 in absolute value, suggesting they are likely outliers, and the leverage plot identifies one observation with unusually high leverage, observation 14. A quick numeric check of these points follows the plots.

par(mfrow=c(2,2))
plot(lm.fit)
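
    The same points can be checked numerically (an assumed supplement to the plots, not part of the original solution):

# Observations with studentized residuals beyond +/- 3 (candidate outliers)
which(abs(rstudent(lm.fit)) > 3)

# Observation with the largest leverage (hat value)
which.max(hatvalues(lm.fit))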

  e. Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

    Using predictors related to mpg, I added interaction terms between displacement and horsepower and between year and origin. Both interaction terms are statistically significant, and the adjusted \(R^2\) increased to about 84%, indicating a better fit.

m2 <- lm(mpg ~ displacement*horsepower + year*origin)
summary(m2)
## 
## Call:
## lm(formula = mpg ~ displacement * horsepower + year * origin)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0561 -1.8897 -0.3494  1.3947 14.5092 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.405e+01  8.366e+00   1.679   0.0939 .  
## displacement            -9.314e-02  6.146e-03 -15.155  < 2e-16 ***
## horsepower              -2.085e-01  1.571e-02 -13.276  < 2e-16 ***
## year                     4.561e-01  1.067e-01   4.276  2.4e-05 ***
## origin                  -1.055e+01  4.469e+00  -2.362   0.0187 *  
## displacement:horsepower  5.711e-04  4.260e-05  13.406  < 2e-16 ***
## year:origin              1.464e-01  5.743e-02   2.549   0.0112 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.113 on 385 degrees of freedom
## Multiple R-squared:  0.8434, Adjusted R-squared:  0.8409 
## F-statistic: 345.5 on 6 and 385 DF,  p-value: < 2.2e-16
  f. Try a few different transformations of the variables, such as \(\log(x)\), \(\sqrt{x}\), \(x^2\). Comment on your findings.

    Horsepower, acceleration, and cylinders were not significant in the original model. After adding the transformed terms, however, the log of horsepower and the quadratic term for acceleration are significant, and the adjusted \(R^2\) improved to about 86%. A quick comparison of all three fits follows the model output below.

m3 <- lm(mpg~. -name + log(horsepower) + sqrt(weight) + I(acceleration^2)
         + log(cylinders), data=Auto)
summary(m3)
## 
## Call:
## lm(formula = mpg ~ . - name + log(horsepower) + sqrt(weight) + 
##     I(acceleration^2) + log(cylinders), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7179 -1.5400 -0.0032  1.4379 12.5205 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        99.628930  11.064956   9.004  < 2e-16 ***
## cylinders          -1.552560   1.266888  -1.225 0.221149    
## displacement       -0.003561   0.007088  -0.502 0.615727    
## horsepower          0.052162   0.032792   1.591 0.112509    
## weight              0.016610   0.004160   3.993 7.83e-05 ***
## acceleration       -2.161362   0.570295  -3.790 0.000175 ***
## year                0.785725   0.044852  17.518  < 2e-16 ***
## origin              0.650906   0.251257   2.591 0.009950 ** 
## log(horsepower)   -11.548273   3.912917  -2.951 0.003360 ** 
## sqrt(weight)       -2.288328   0.473765  -4.830 1.98e-06 ***
## I(acceleration^2)   0.059192   0.017050   3.472 0.000576 ***
## log(cylinders)      7.763732   6.807244   1.141 0.254792    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.87 on 380 degrees of freedom
## Multiple R-squared:  0.8686, Adjusted R-squared:  0.8648 
## F-statistic: 228.3 on 11 and 380 DF,  p-value: < 2.2e-16
detach(Auto)
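
    The adjusted \(R^2\) values of the three models can also be collected in one call for a side-by-side comparison (an assumed addition, not part of the original write-up):

# Compare adjusted R-squared for the base, interaction, and transformation models
sapply(list(base = lm.fit, interactions = m2, transformed = m3),
       function(m) summary(m)$adj.r.squared)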

Question 10

This question should be answered using the Carseats data set.

  a. Fit a multiple regression model to predict Sales using Price, Urban, and US.
attach(Carseats)
fit <- lm(Sales ~ Price + Urban + US)
summary(fit)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
  b. Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

    From the table above, Price and US are significant predictors of Sales, which is measured in thousands of unit sales. For every $1 increase in price, sales are expected to fall by about 54 units (a coefficient of -0.0545 thousand units), holding the other predictors fixed. Stores located in the US sell about 1,201 more units on average than stores outside the US. Urban is not a statistically significant predictor of Sales.

  c. Write out the model in equation form, being careful to handle the qualitative variables properly.

    \(Sales = 13.043469 - 0.054459\,Price - 0.021916\,Urban_{Yes} + 1.200573\,US_{Yes}\), where \(Urban_{Yes} = 1\) if the store is in an urban location (0 otherwise) and \(US_{Yes} = 1\) if the store is in the US (0 otherwise).
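
    As a quick sketch of how the dummy coding enters the equation (an assumed illustration with a hypothetical store, not part of the original answer), a US urban store charging $100 can be scored either with predict() or by hand:

# Hypothetical store: Price = 100, Urban = "Yes", US = "Yes"
new.store <- data.frame(Price = 100, Urban = "Yes", US = "Yes")
predict(fit, newdata = new.store)

# Same value computed directly from the fitted equation (both dummies equal 1)
13.043469 - 0.054459*100 - 0.021916*1 + 1.200573*1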

  d. For which of the predictors can you reject the null hypothesis \(H_0: \beta_j = 0\)?

    Price and US have p-values well below the 0.05 significance level, so we can reject the null hypothesis for those predictors; we cannot reject it for Urban.

  e. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

fit <- lm(Sales ~ Price + US)
summary(fit)
## 
## Call:
## lm(formula = Sales ~ Price + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
  f. How well do the models in (a) and (e) fit the data?

    Not very well; each model explains only about 24% of the variance in Sales (\(R^2 \approx 0.239\)).

  g. Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint(fit)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
  h. Is there evidence of outliers or high leverage observations in the model from (e)?

    Several observations are flagged by at least one influence measure as potential outliers or high-leverage points. After refitting the model without these observations, however, very little changes relative to the original fit on all observations: the new coefficient estimates fall within the confidence intervals from the full-data model (a side-by-side check appears after the refit output below), so it is reasonable to keep all observations in the model.

par(mfrow=c(2,2))
plot(fit)

# Influential observations
summary(influence.measures(fit))
## Potentially influential observations of
##   lm(formula = Sales ~ Price + US) :
## 
##     dfb.1_ dfb.Pric dfb.USYs dffit   cov.r   cook.d hat    
## 26   0.24  -0.18    -0.17     0.28_*  0.97_*  0.03   0.01  
## 29  -0.10   0.10    -0.10    -0.18    0.97_*  0.01   0.01  
## 43  -0.11   0.10     0.03    -0.11    1.05_*  0.00   0.04_*
## 50  -0.10   0.17    -0.17     0.26_*  0.98    0.02   0.01  
## 51  -0.05   0.05    -0.11    -0.18    0.95_*  0.01   0.00  
## 58  -0.05  -0.02     0.16    -0.20    0.97_*  0.01   0.01  
## 69  -0.09   0.10     0.09     0.19    0.96_*  0.01   0.01  
## 126 -0.07   0.06     0.03    -0.07    1.03_*  0.00   0.03_*
## 160  0.00   0.00     0.00     0.01    1.02_*  0.00   0.02  
## 166  0.21  -0.23    -0.04    -0.24    1.02    0.02   0.03_*
## 172  0.06  -0.07     0.02     0.08    1.03_*  0.00   0.02  
## 175  0.14  -0.19     0.09    -0.21    1.03_*  0.02   0.03_*
## 210 -0.14   0.15    -0.10    -0.22    0.97_*  0.02   0.01  
## 270 -0.03   0.05    -0.03     0.06    1.03_*  0.00   0.02  
## 298 -0.06   0.06    -0.09    -0.15    0.97_*  0.01   0.00  
## 314 -0.05   0.04     0.02    -0.05    1.03_*  0.00   0.02_*
## 353 -0.02   0.03     0.09     0.15    0.97_*  0.01   0.00  
## 357  0.02  -0.02     0.02    -0.03    1.03_*  0.00   0.02  
## 368  0.26  -0.23    -0.11     0.27_*  1.01    0.02   0.02_*
## 377  0.14  -0.15     0.12     0.24    0.95_*  0.02   0.01  
## 384  0.00   0.00     0.00     0.00    1.02_*  0.00   0.02  
## 387 -0.03   0.04    -0.03     0.05    1.02_*  0.00   0.02  
## 396 -0.05   0.05     0.08     0.14    0.98_*  0.01   0.00
outliers <- c(26,29,43,50,51,58,69,126,160,166,172,175,210,270,298,314,353,357,368,377,384,387,396)
Carseats.small <- Carseats[-outliers,]
fit2 <- lm(Sales ~ Price + US, data=Carseats.small)
summary(fit2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats.small)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.263 -1.605 -0.039  1.590  5.428 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.925232   0.665259  19.429  < 2e-16 ***
## Price       -0.053973   0.005511  -9.794  < 2e-16 ***
## USYes        1.255018   0.248856   5.043 7.15e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.29 on 374 degrees of freedom
## Multiple R-squared:  0.2387, Adjusted R-squared:  0.2347 
## F-statistic: 58.64 on 2 and 374 DF,  p-value: < 2.2e-16
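
    As a quick side-by-side check (an assumed addition, not part of the original write-up), the coefficients refit without the flagged observations can be compared with the full-data 95% confidence intervals:

# Full-data 95% CIs next to the coefficients refit without the flagged observations
cbind(confint(fit), fit2 = coef(fit2))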
detach(Carseats)

Question 12

This problem involves simple linear regression without an intercept.

  a. Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of \(Y\) onto \(X\) without an intercept is given by equation (3.38). Under what circumstance is the coefficient estimate for the regression of \(X\) onto \(Y\) the same as the coefficient estimate for the regression of \(Y\) onto \(X\)?

    Without an intercept, the estimate for regressing \(Y\) onto \(X\) is \(\hat{\beta} = \sum_i x_i y_i / \sum_i x_i^2\), while for regressing \(X\) onto \(Y\) it is \(\sum_i x_i y_i / \sum_i y_i^2\). The two share the same numerator, so the coefficient estimates are the same exactly when the sum of squares of the observed \(x\) values equals the sum of squares of the observed \(y\) values, i.e. \(\sum_i x_i^2 = \sum_i y_i^2\).

  b. Generate an example in R with \(n\) = 100 observations in which the coefficient estimate for the regression of \(X\) onto \(Y\) is different from the coefficient estimate for the regression of \(Y\) onto \(X\).

set.seed(1)
x <- rnorm(100) 
y <- x+rnorm(100, mean=50, sd=2) 
fitY <- lm(y~x+0)
fitX <- lm(x~y+0)
# Check coefficients
fitY$coefficients
##        x 
## 7.704626
fitX$coefficients
##          y 
## 0.00249027
  c. Generate an example in R with \(n\) = 100 observations in which the coefficient estimate for the regression of \(X\) onto \(Y\) is the same as the coefficient estimate for the regression of \(Y\) onto \(X\).
set.seed(2)
x <- rnorm(100) 
y <- sample(x)  # a random permutation of x: same set of values, so the sums of squares match
fitY2 <- lm(y~x+0)
fitX2 <- lm(x~y+0)
# Check coefficients
fitY2$coefficients
##          x 
## -0.0230038
fitX2$coefficients
##          y 
## -0.0230038
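
    As a quick check (an assumed addition, not part of the original answer), the condition from part (a) does hold here, since permuting x leaves the sum of squares unchanged:

# y is a permutation of x, so the two sums of squares are identical
c(sum.x.sq = sum(x^2), sum.y.sq = sum(y^2))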