Carefully explain the differences between the KNN classifier and KNN regression methods.
The KNN classifier assigns a test observation to the class that is most common among its K nearest neighbors (a majority vote), producing a qualitative prediction, while KNN regression predicts a quantitative response by averaging the response values of the K nearest neighbors.
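A minimal sketch contrasting the two in R (assumes the class and FNN packages are installed; the data here are made up purely for illustration):
library(class) # knn() for classification
library(FNN)   # knn.reg() for regression
set.seed(1)
train <- matrix(rnorm(40), ncol = 2)         # 20 training points
cls   <- factor(rep(c("A", "B"), each = 10)) # qualitative response
yval  <- rnorm(20)                           # quantitative response
test  <- matrix(rnorm(10), ncol = 2)         # 5 test points
knn(train, test, cls, k = 3)           # classification: majority vote among the 3 nearest neighbors
knn.reg(train, test, yval, k = 3)$pred # regression: average of the 3 nearest neighbors' responses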
This question involves the use of multiple linear regression on the Auto data set.
library(ISLR)
attach(Auto)
(a) Produce a scatterplot matrix which includes all of the variables in the data set.
pairs(Auto[,-9]) # column 9 is the qualitative name variable
(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
cor(Auto[,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
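To read the matrix more easily, one option (a sketch, not part of the assignment output) is to list the variable pairs with the strongest correlations:
cm <- cor(Auto[,-9])
cm[upper.tri(cm, diag = TRUE)] <- NA # keep each pair only once
idx <- which(!is.na(cm), arr.ind = TRUE)
ord <- order(abs(cm[idx]), decreasing = TRUE)
head(data.frame(var1 = rownames(cm)[idx[ord, 1]], var2 = colnames(cm)[idx[ord, 2]], cor = cm[idx][ord]))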
(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
i. Is there a relationship between the predictors and the response?
ii. Which predictors appear to have a statistically significant relationship to the response?
iii. What does the coefficient for the year variable suggest?
There does appear to be a relationship between the predictors and the response variable mpg: the F-statistic of 252.4 has a p-value well below 0.05, so at least one predictor is related to the response. The predictors that appear to be statistically significant are displacement, weight, year, and origin. The coefficient for year suggests that, holding the other predictors fixed, each one-year increase in model year (a later-model car) is associated with an increase of about 0.75 mpg.
lm.fit<-lm(mpg~.-name,data=Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
The Residuals vs Fitted plot shows that the relationship may not be completely linear, as there is a slight curve in the fitted line. The Q-Q plot shows that the residuals are right-skewed, with points pulling away from the dotted line on the right side, which is further evidence of nonlinearity. There also appear to be a few outliers, such as observations 323, 326, and 327, as well as high-leverage points at 14, 327, and 394.
par(mfrow=c(2,2))
plot(lm.fit)
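The specific observations flagged in the plots can also be pulled out numerically; a sketch using base R's rstudent() and hatvalues():
head(sort(abs(rstudent(lm.fit)), decreasing = TRUE), 3) # largest studentized residuals
head(sort(hatvalues(lm.fit), decreasing = TRUE), 3)     # highest-leverage observations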
(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
The interactions that appear to be statistically significant are displacement:year, acceleration:year, and acceleration:origin.
lm.fit2<- lm(mpg~.*.-name*.+.-name,data=Auto) # all main effects plus every pairwise interaction, with name and its interactions removed
summary(lm.fit2)
##
## Call:
## lm(formula = mpg ~ . * . - name * . + . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
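One way to check whether the interaction terms jointly improve on the main-effects fit is a nested-model F-test (a sketch using the two fits above):
anova(lm.fit, lm.fit2) # compares the main-effects model to the interaction model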
(f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.
As seen in the models below, I tried several transformations with all of the predictors excluding name, as well as a few with only the significant predictors identified in part (c). Note that R² values are only directly comparable between models that share the same response transformation. Among the log(mpg) fits, lm.fit10 returned the highest R² value of 0.8825: lm.fit10 <- lm(log(mpg) ~ log(horsepower) + weight + origin + year, data = Auto)
lm.fit3<- lm(log(mpg)~.-name, data = Auto)
summary(lm.fit3)
##
## Call:
## lm(formula = log(mpg) ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40955 -0.06533 0.00079 0.06785 0.33925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.751e+00 1.662e-01 10.533 < 2e-16 ***
## cylinders -2.795e-02 1.157e-02 -2.415 0.01619 *
## displacement 6.362e-04 2.690e-04 2.365 0.01852 *
## horsepower -1.475e-03 4.935e-04 -2.989 0.00298 **
## weight -2.551e-04 2.334e-05 -10.931 < 2e-16 ***
## acceleration -1.348e-03 3.538e-03 -0.381 0.70339
## year 2.958e-02 1.824e-03 16.211 < 2e-16 ***
## origin 4.071e-02 9.955e-03 4.089 5.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1191 on 384 degrees of freedom
## Multiple R-squared: 0.8795, Adjusted R-squared: 0.8773
## F-statistic: 400.4 on 7 and 384 DF, p-value: < 2.2e-16
lm.fit4<- lm((mpg*mpg)~. -name, data = Auto) # squared response: mpg^2
summary(lm.fit4)
##
## Call:
## lm(formula = (mpg * mpg) ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -483.45 -141.87 -19.62 103.58 1042.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.878e+03 2.928e+02 -6.412 4.22e-10 ***
## cylinders -1.436e+01 2.038e+01 -0.704 0.48157
## displacement 1.328e+00 4.738e-01 2.802 0.00534 **
## horsepower -3.587e-01 8.693e-01 -0.413 0.68009
## weight -3.522e-01 4.111e-02 -8.567 2.62e-16 ***
## acceleration 9.278e+00 6.232e+00 1.489 0.13740
## year 4.081e+01 3.214e+00 12.698 < 2e-16 ***
## origin 9.509e+01 1.754e+01 5.422 1.04e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 209.8 on 384 degrees of freedom
## Multiple R-squared: 0.7292, Adjusted R-squared: 0.7243
## F-statistic: 147.8 on 7 and 384 DF, p-value: < 2.2e-16
lm.fit5<- lm(sqrt(mpg)~. -name, data = Auto)
summary(lm.fit5)
##
## Call:
## lm(formula = sqrt(mpg) ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.98891 -0.18946 0.00505 0.16947 1.02581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.075e+00 4.290e-01 2.506 0.0126 *
## cylinders -5.942e-02 2.986e-02 -1.990 0.0474 *
## displacement 1.752e-03 6.942e-04 2.524 0.0120 *
## horsepower -2.512e-03 1.274e-03 -1.972 0.0493 *
## weight -6.367e-04 6.024e-05 -10.570 < 2e-16 ***
## acceleration 2.738e-03 9.131e-03 0.300 0.7644
## year 7.381e-02 4.709e-03 15.675 < 2e-16 ***
## origin 1.217e-01 2.569e-02 4.735 3.09e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3074 on 384 degrees of freedom
## Multiple R-squared: 0.8561, Adjusted R-squared: 0.8535
## F-statistic: 326.3 on 7 and 384 DF, p-value: < 2.2e-16
lm.fit6<- lm(log(mpg)~ displacement + weight + year + origin, data = Auto)
summary(lm.fit6)
##
## Call:
## lm(formula = log(mpg) ~ displacement + weight + year + origin,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42022 -0.06963 0.00737 0.06806 0.38782
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.552e+00 1.458e-01 10.644 < 2e-16 ***
## displacement -1.154e-04 1.726e-04 -0.669 0.50416
## weight -2.795e-04 2.017e-05 -13.857 < 2e-16 ***
## year 3.099e-02 1.804e-03 17.182 < 2e-16 ***
## origin 2.925e-02 9.666e-03 3.025 0.00265 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1212 on 387 degrees of freedom
## Multiple R-squared: 0.8744, Adjusted R-squared: 0.8731
## F-statistic: 673.3 on 4 and 387 DF, p-value: < 2.2e-16
lm.fit7<- lm(log(mpg)~ log(displacement) + weight + year + origin, data = Auto)
summary(lm.fit7)
##
## Call:
## lm(formula = log(mpg) ~ log(displacement) + weight + year + origin,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44002 -0.06678 0.00800 0.06493 0.42379
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.030e+00 2.114e-01 9.599 < 2e-16 ***
## log(displacement) -1.138e-01 3.624e-02 -3.140 0.00182 **
## weight -2.335e-04 2.058e-05 -11.347 < 2e-16 ***
## year 3.055e-02 1.744e-03 17.523 < 2e-16 ***
## origin 1.665e-02 1.031e-02 1.616 0.10692
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1197 on 387 degrees of freedom
## Multiple R-squared: 0.8773, Adjusted R-squared: 0.8761
## F-statistic: 692 on 4 and 387 DF, p-value: < 2.2e-16
lm.fit8<- lm(log(mpg)~ log(displacement) + weight + year , data = Auto)
summary(lm.fit8)
##
## Call:
## lm(formula = log(mpg) ~ log(displacement) + weight + year, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.43457 -0.06538 0.00390 0.06717 0.42625
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.183e+00 1.893e-01 11.53 < 2e-16 ***
## log(displacement) -1.394e-01 3.266e-02 -4.27 2.47e-05 ***
## weight -2.279e-04 2.033e-05 -11.21 < 2e-16 ***
## year 3.039e-02 1.744e-03 17.42 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.12 on 388 degrees of freedom
## Multiple R-squared: 0.8765, Adjusted R-squared: 0.8756
## F-statistic: 918 on 3 and 388 DF, p-value: < 2.2e-16
lm.fit9<- lm(log(mpg)~.-name -acceleration, data = Auto)
summary(lm.fit9)
##
## Call:
## lm(formula = log(mpg) ~ . - name - acceleration, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40671 -0.06629 0.00104 0.06899 0.33853
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.723e+00 1.493e-01 11.539 < 2e-16 ***
## cylinders -2.772e-02 1.154e-02 -2.402 0.01679 *
## displacement 6.466e-04 2.673e-04 2.419 0.01601 *
## horsepower -1.359e-03 3.876e-04 -3.505 0.00051 ***
## weight -2.594e-04 2.044e-05 -12.693 < 2e-16 ***
## year 2.963e-02 1.817e-03 16.309 < 2e-16 ***
## origin 4.067e-02 9.944e-03 4.090 5.25e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.119 on 385 degrees of freedom
## Multiple R-squared: 0.8795, Adjusted R-squared: 0.8776
## F-statistic: 468.2 on 6 and 385 DF, p-value: < 2.2e-16
lm.fit10<- lm(log(mpg) ~ log(horsepower) + weight + origin + year, data = Auto)
summary(lm.fit10)
##
## Call:
## lm(formula = log(mpg) ~ log(horsepower) + weight + origin + year,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.38929 -0.06547 0.00082 0.06773 0.36613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.423e+00 2.192e-01 11.052 < 2e-16 ***
## log(horsepower) -1.956e-01 3.740e-02 -5.230 2.78e-07 ***
## weight -2.235e-04 1.575e-05 -14.189 < 2e-16 ***
## origin 3.460e-02 9.098e-03 3.803 0.000166 ***
## year 2.874e-02 1.760e-03 16.333 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1172 on 387 degrees of freedom
## Multiple R-squared: 0.8825, Adjusted R-squared: 0.8813
## F-statistic: 726.8 on 4 and 387 DF, p-value: < 2.2e-16
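Since R² is only comparable across models with the same response, a compact way to line up the log(mpg) fits side by side (a sketch) is:
fits <- list(lm.fit3 = lm.fit3, lm.fit6 = lm.fit6, lm.fit7 = lm.fit7, lm.fit8 = lm.fit8, lm.fit9 = lm.fit9, lm.fit10 = lm.fit10)
sapply(fits, function(m) summary(m)$adj.r.squared) # lm.fit10 comes out on top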
detach(Auto)
This question should be answered using the Carseats data set.
library(ISLR)
attach(Carseats)
(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
fit<-lm(Sales~Price+Urban+US)
summary(fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
(b) Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!
From the table above, Price and US are significant predictors of Sales. Since Sales is recorded in thousands of units, a $1 increase in price is associated with a decrease of about 54 units sold (0.0545 thousand), holding the other predictors fixed. Stores in the US sell about 1,201 more units (1.2 thousand) than stores outside the US, all else being equal. Urban is not statistically significant, so there is no evidence that urban location affects Sales.
(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
\(Sales = 13.043469 - 0.054459\,Price - 0.021916\,Urban_{Yes} + 1.200573\,US_{Yes}\), where \(Urban_{Yes} = 1\) if the store is in an urban location and 0 otherwise, and \(US_{Yes} = 1\) if the store is in the US and 0 otherwise.
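Expanding the US indicator makes the effect of the qualitative variable explicit (a rearrangement of the fitted equation above, with coefficients rounded):
\(Sales = \begin{cases} 13.04 - 0.054\,Price - 0.022\,Urban_{Yes} & \text{if the store is outside the US} \\ 14.24 - 0.054\,Price - 0.022\,Urban_{Yes} & \text{if the store is in the US} \end{cases}\)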
(d) For which of the predictors can you reject the null hypothesis \(H_0 : \beta_j = 0\)?
Price and US are significant predictors, so for those two coefficients we reject the null hypothesis \(H_0 : \beta_j = 0\). We fail to reject it for Urban.
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
fit2<-lm(Sales~Price+US)
summary(fit2)
##
## Call:
## lm(formula = Sales ~ Price + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
(f) How well do the models in (a) and (e) fit the data?
The models do not fit the data very well: judging by the R² values, only about 23% of the variance in Sales is explained by either model. We would likely need to try some transformations or additional predictors to get a better fit.
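Since the model in (e) is nested inside the model from (a), an F-test compares them directly (a sketch using the fit and fit2 objects above):
anova(fit2, fit) # dropping Urban barely changes the residual sum of squares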
(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(fit2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
(h) Is there evidence of outliers or high leverage observations in the model from (e)?
Looking at the plots, no outliers or leverage points particularly stand out; however, the hatvalues() function suggests a possible high-leverage point at observation 43.
plot(fit2)
plot(hatvalues(fit2))
which.max(hatvalues(fit2))
## 43
## 43
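A common rule of thumb compares each hat value to the average leverage \((p+1)/n\); a sketch (the 2x cutoff is a convention, not a hard rule):
p <- length(coef(fit2)) - 1 # 2 predictors
n <- nrow(Carseats)         # 400 observations
(p + 1) / n                 # average leverage = 3/400 = 0.0075
sum(hatvalues(fit2) > 2 * (p + 1) / n) # observations above twice the average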
This problem involves simple linear regression without an intercept.
(a) Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
From (3.38), the coefficient for regressing Y onto X without an intercept is \(\hat{\beta} = \sum_i x_i y_i / \sum_i x_i^2\), while the coefficient for regressing X onto Y is \(\sum_i x_i y_i / \sum_i y_i^2\). The numerators are identical, so the two estimates are equal exactly when the denominators match, i.e. when \(\sum_i x_i^2 = \sum_i y_i^2\).
(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
x<-rnorm(100,10,2)
y<-rnorm(100,5,3)
lm(y~x)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 5.06486 0.02645
lm(x~y)
##
## Call:
## lm(formula = x ~ y)
##
## Coefficients:
## (Intercept) y
## 10.09780 0.01204
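Strictly speaking, the exercise asks for regressions without an intercept, while the fits above include one. A minimal no-intercept sketch (the exact coefficients depend on the random seed):
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
coef(lm(y ~ x + 0)) # sum(x*y) / sum(x^2)
coef(lm(x ~ y + 0)) # sum(x*y) / sum(y^2), different because sum(x^2) != sum(y^2)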
(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
x<-rnorm(100,10,1)
y<-rnorm(100,10,1)
lm(y~x)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 10.061033 -0.004504
lm(x~y)
##
## Call:
## lm(formula = x ~ y)
##
## Coefficients:
## (Intercept) y
## 10.21722 -0.00324
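Note that the two slopes printed above are close but not identical, so this example does not quite satisfy the condition from (a). Permuting x guarantees \(\sum_i x_i^2 = \sum_i y_i^2\), which forces the two no-intercept slopes to match exactly (a sketch):
set.seed(1)
x <- rnorm(100)
y <- sample(x) # same values in a different order, so sum(x^2) == sum(y^2)
coef(lm(y ~ x + 0)) # sum(x*y) / sum(x^2)
coef(lm(x ~ y + 0)) # identical, since the denominators now agree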