Carefully explain the differences between the KNN classifier and KNN regression methods.
The KNN classifier predicts a qualitative response: for a given point it finds the K nearest training observations and assigns the point to the class that is most common among those neighbors (a majority vote). The KNN regression method uses the same neighborhoods but predicts a quantitative response by averaging the response values of the K nearest neighbors. In short, the classification method approximates a decision boundary between classes, while the regression method approximates the regression function \(f(X)\).
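To make the contrast concrete, here is a minimal base-R sketch (not part of the assignment; all variable names are made up for illustration) applying both ideas to one test point with K = 3:
# Toy data: six training points in one dimension, with a class label for
# classification and a numeric response for regression.
train.x  = c(1, 2, 3, 10, 11, 12)
train.cl = c("A", "A", "A", "B", "B", "B")    # class labels (KNN classification)
train.y  = c(1.1, 1.9, 3.2, 9.8, 11.3, 12.1)  # numeric responses (KNN regression)
x0 = 2.5                                      # test observation
nn = order(abs(train.x - x0))[1:3]            # indices of the 3 nearest neighbors
# KNN classifier: majority vote among the neighbors' class labels
names(which.max(table(train.cl[nn])))         # predicts class "A"
# KNN regression: average of the neighbors' response values
mean(train.y[nn])                             # predicts roughly 2.07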
This question involves the use of multiple linear regression on the Auto data set.
a. Produce a scatterplot matrix which includes all of the variables in the data set.
library(ISLR)
pairs(Auto)
b. Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
cor(Auto[, -9])   # column 9 is the qualitative name variable
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
c. Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance: i. Is there a relationship between the predictors and the response? ii. Which predictors appear to have a statistically significant relationship to the response? iii. What does the coefficient for the year variable suggest?
lm.mpg=lm(mpg ~ . - name, data = Auto)
summary(lm.mpg)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
coef(lm.mpg)[7]
## year
## 0.7507727
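Based on the output: (i) the F-statistic of 252.4 with a p-value below 2.2e-16 indicates there is a relationship between the predictors and mpg; (ii) displacement, weight, year, and origin have statistically significant coefficients, while cylinders, horsepower, and acceleration do not; (iii) the year coefficient of about 0.75 suggests that, holding the other predictors fixed, mpg increases by roughly 0.75 for each additional model year, i.e., newer cars tend to be more fuel efficient.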
d. Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow = c(2, 2))
plot(lm.mpg)
One issue with the fit is that the relationship does not appear to be linear: the Residuals vs Fitted plot shows a pronounced curve. There also appears to be one data point, labeled 14 in the Residuals vs Leverage plot, that is an unusually large outlier with unusually high leverage. Additionally, a few other apparent outliers stand out relative to the Cook's distance contours despite not having much leverage.
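As a numeric cross-check of the plots (a sketch, not part of the original output), the studentized residuals and hat values can be inspected directly with base R's rstudent() and hatvalues():
abs.res = abs(rstudent(lm.mpg))
which(abs.res > 3)              # observations with unusually large residuals
which.max(hatvalues(lm.mpg))    # the single highest-leverage observation
max(hatvalues(lm.mpg))          # compare to the average leverage (p + 1)/n = 8/392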
e. Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
lm.compare = lm(mpg ~ . - name + weight:acceleration + year:horsepower + cylinders:displacement + origin:year, data = Auto)
summary(lm.compare)
##
## Call:
## lm(formula = mpg ~ . - name + weight:acceleration + year:horsepower +
## cylinders:displacement + origin:year, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1456 -1.4783 -0.1112 1.3427 11.1754
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.202e+01 1.722e+01 -4.763 2.72e-06 ***
## cylinders -1.119e+00 4.616e-01 -2.425 0.015769 *
## displacement -5.437e-02 1.428e-02 -3.808 0.000163 ***
## horsepower 6.034e-01 1.036e-01 5.824 1.22e-08 ***
## weight 3.921e-04 1.668e-03 0.235 0.814210
## acceleration 7.557e-01 2.604e-01 2.902 0.003929 **
## year 1.578e+00 2.036e-01 7.749 8.56e-14 ***
## origin -1.391e+00 4.610e+00 -0.302 0.763087
## weight:acceleration -2.902e-04 9.204e-05 -3.153 0.001745 **
## horsepower:year -8.976e-03 1.413e-03 -6.351 6.12e-10 ***
## cylinders:displacement 7.515e-03 1.970e-03 3.814 0.000159 ***
## year:origin 2.608e-02 5.919e-02 0.441 0.659691
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.862 on 380 degrees of freedom
## Multiple R-squared: 0.8693, Adjusted R-squared: 0.8655
## F-statistic: 229.8 on 11 and 380 DF, p-value: < 2.2e-16
Three of my four interactions appear to be statistically significant: weight:acceleration, horsepower:year, and cylinders:displacement. This is consistent with part (b), where those variable pairs showed stronger pairwise correlations than year and origin did. With these interactions, the coefficient of determination \(R^2\) increased from 0.8215 to 0.8693, which suggests the interaction terms produce a better-fitting model.
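As a further cross-check (not in the original output), a nested-model F-test comparing the additive fit from part (c) with the interaction fit above can be run with anova(); a small p-value would support keeping the interaction terms:
anova(lm.mpg, lm.compare)   # compares the part (c) model to the model with interactions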
f. Try a few different transformations of the variables, such as \(\log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.
lm.transform = lm(mpg ~ . - name + log(year) + sqrt(weight) + I(acceleration^2) + I(displacement^2), data = Auto)
summary(lm.transform)
##
## Call:
## lm(formula = mpg ~ . - name + log(year) + sqrt(weight) + I(acceleration^2) +
## I(displacement^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6314 -1.4739 0.0377 1.3991 12.3633
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.808e+03 4.634e+02 6.060 3.28e-09 ***
## cylinders 2.366e-01 3.240e-01 0.730 0.465729
## displacement -3.865e-02 1.954e-02 -1.978 0.048619 *
## horsepower -6.050e-02 1.295e-02 -4.673 4.14e-06 ***
## weight 1.730e-02 4.146e-03 4.173 3.73e-05 ***
## acceleration -1.918e+00 5.377e-01 -3.566 0.000408 ***
## year 1.164e+01 1.834e+00 6.347 6.25e-10 ***
## origin 5.127e-01 2.535e-01 2.023 0.043796 *
## log(year) -8.247e+02 1.394e+02 -5.916 7.37e-09 ***
## sqrt(weight) -2.338e+00 4.708e-01 -4.966 1.04e-06 ***
## I(acceleration^2) 5.548e-02 1.563e-02 3.550 0.000434 ***
## I(displacement^2) 6.822e-05 3.344e-05 2.040 0.042038 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.775 on 380 degrees of freedom
## Multiple R-squared: 0.8772, Adjusted R-squared: 0.8736
## F-statistic: 246.7 on 11 and 380 DF, p-value: < 2.2e-16
All four of my transformations are statistically significant at the 0.05 level, and three of the four (log(year), sqrt(weight), and acceleration^2) are significant at the 0.001 level; displacement^2 is significant only at the 0.05 level (p = 0.042). The coefficient of determination has also increased again, to 0.8772.
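For a side-by-side view (a sketch, not in the original output), the adjusted \(R^2\) of the three Auto models can be pulled from their summaries; per the output above these should be roughly 0.8182, 0.8655, and 0.8736:
c(additive     = summary(lm.mpg)$adj.r.squared,
  interactions = summary(lm.compare)$adj.r.squared,
  transformed  = summary(lm.transform)$adj.r.squared)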
a. Fit a multiple regression model to predict Sales using Price, Urban, and US.
lm.sales= lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm.sales)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
b. Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
Price: Holding the other predictors fixed, a $1 increase in price is associated with a decrease in sales of about 0.0545 thousand units, i.e. roughly 54 car seats. This coefficient is highly statistically significant, so we can be confident in this relationship.
Urban: The Urban dummy is not statistically significant, so we cannot conclude that sales differ between urban and rural locations once Price and US are accounted for.
US: The coefficient on the US dummy suggests that, all else equal, stores in the US sell about 1.2 thousand (roughly 1,200) more units than stores outside the US.
c. Write out the model in equation form, being careful to handle the qualitative variables properly.
\(Sales = 13.043469 - 0.054459(Price) - 0.021916(Urban) + 1.200573(US)\)
where Urban = 1 if the store is in an urban area and 0 if rural,
and US = 1 if the store is in the United States and 0 if not.
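To illustrate how the dummy variables enter the equation, here is a small sketch (the inputs are hypothetical, not from the assignment) predicting Sales for a store charging $120 in an urban, US location, using the part (a) fit:
b = coef(lm.sales)   # coefficients of the part (a) model
b["(Intercept)"] + b["Price"]*120 + b["UrbanYes"]*1 + b["USYes"]*1   # about 7.69 thousand units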
d. For which of the predictors can you reject the null hypothesis \(H_{0} : \beta_{j} = 0\)?
We can reject the null hypothesis for Price and US: both have p-values far below 0.05 (p < 2e-16 and p = 4.86e-06, respectively). We cannot reject it for Urban (p = 0.936).
e. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
lm.sales= lm(Sales ~ Price + US, data = Carseats)
summary(lm.sales)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
f. How well do the models in (a) and (e) fit the data?
Neither model fits the data especially well: both have a coefficient of determination of 0.2393, meaning only about 23.93% of the variability in Sales is explained by the predictors. However, the model from part (e) has a slightly higher adjusted \(R^2\) (0.2354 versus 0.2335 for the model from part (a)). Since dropping Urban leaves \(R^2\) essentially unchanged while using one fewer predictor, the smaller model from part (e) is the marginally better fit.
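The fit statistics quoted above can also be pulled directly from the model summary (a sketch; note that lm.sales now holds the part (e) fit, so the part (a) model would need to be stored under a separate name to compare programmatically):
summary(lm.sales)$r.squared       # about 0.2393
summary(lm.sales)$adj.r.squared   # about 0.2354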
g. Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).
confint(lm.sales, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
h. Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow = c(2, 2))
plot(lm.sales)
There is evidence of a few outliers, as well as a few data points with higher leverage. In particular, several points in the Residuals vs Leverage plot have leverage approaching 0.03, and one point exceeds 0.04. On the same plot, a few points also stand out relative to the Cook's distance contours, indicating possible outliers that are not particularly high in leverage.
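A numeric follow-up to the plots (a sketch, not part of the original output): with p = 2 predictors and n = 400 observations the average leverage is (p + 1)/n = 3/400 = 0.0075, so hat values near 0.03-0.04 are several times the average.
sum(hatvalues(lm.sales) > 3 * (3/400))   # how many points exceed three times the average leverage
which.max(hatvalues(lm.sales))           # the single highest-leverage observation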
a. Recall that the coefficient estimate βˆ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
The coefficient estimate for the regression of X onto Y equals the coefficient estimate for the regression of Y onto X when \(\sum_{i} x^2_{i} = \sum_{i} y^2_{i}\).
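This follows from (3.38): the two slope estimates share the same numerator and differ only in their denominators,
\[\hat{\beta}_{y \sim x} = \frac{\sum_{i} x_{i} y_{i}}{\sum_{i} x^2_{i}}, \qquad \hat{\beta}_{x \sim y} = \frac{\sum_{i} x_{i} y_{i}}{\sum_{i} y^2_{i}},\]
so they coincide exactly when the two denominators are equal.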
b. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(1)
x = rnorm(100)                  # 100 observations of X
y = 10*x + rnorm(100, sd = 2)   # Y is a noisy multiple of X, so sum(x^2) != sum(y^2)
data = data.frame(x, y)
lm.x = lm(x ~ y + 0)            # regression of X onto Y, no intercept
lm.y = lm(y ~ x + 0)            # regression of Y onto X, no intercept
summary(lm.x)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42234 -0.09331 0.04202 0.11837 0.35994
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.095811 0.002043 46.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1878 on 99 degrees of freedom
## Multiple R-squared: 0.9569, Adjusted R-squared: 0.9565
## F-statistic: 2200 on 1 and 99 DF, p-value: < 2.2e-16
summary(lm.y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8307 -1.2943 -0.3541 1.0112 4.6218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 9.988 0.213 46.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.917 on 99 degrees of freedom
## Multiple R-squared: 0.9569, Adjusted R-squared: 0.9565
## F-statistic: 2200 on 1 and 99 DF, p-value: < 2.2e-16
The coefficient estimate for X onto Y is 0.095811, while the coefficient estimate for Y onto X is 9.988, showing that the two are not equal.
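The same numbers can be recovered directly from (3.38) (a quick check, not in the original output): both estimates use the numerator sum(x*y), but they divide by different sums of squares.
sum(x*y) / sum(x^2)   # slope of y ~ x + 0, about 9.988
sum(x*y) / sum(y^2)   # slope of x ~ y + 0, about 0.0958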
c. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
set.seed(1)
x = rnorm(100)
y = x                   # identical values, so sum(x^2) == sum(y^2)
data = data.frame(x, y)
lm.x = lm(x ~ y + 0)    # regression of X onto Y, no intercept
lm.y = lm(y ~ x + 0)    # regression of Y onto X, no intercept
summary(lm.x)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.888e-16 -1.689e-17 1.339e-18 3.057e-17 2.552e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 1.000e+00 6.479e-18 1.543e+17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.833e-17 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.382e+34 on 1 and 99 DF, p-value: < 2.2e-16
summary(lm.y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.888e-16 -1.689e-17 1.339e-18 3.057e-17 2.552e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.000e+00 6.479e-18 1.543e+17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.833e-17 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.382e+34 on 1 and 99 DF, p-value: < 2.2e-16
The coefficient estimate for X onto Y is 1.00 and the coefficient estimate for Y onto X is also 1.00, so the two estimates are equal, as expected since \(\sum_{i} x^2_{i} = \sum_{i} y^2_{i}\) when y = x.
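An alternative construction (a sketch beyond the original answer): any y with the same sum of squares as x works, for example a random permutation of x, which makes the two no-intercept slopes equal without y being identical to x.
set.seed(1)
x = rnorm(100)
y = sample(x)           # same values in a different order, so sum(y^2) equals sum(x^2)
coef(lm(y ~ x + 0))     # these two slope estimates...
coef(lm(x ~ y + 0))     # ...should match each other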