2. Carefully explain the differences between the KNN classifier and KNN regression methods.

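The KNN classifier predicts the class (a qualitative response) of a given test observation by identifying the K training observations that are nearest to it (ISLR, pg. 165) and assigning the class that is most common among them. KNN regression finds the K nearest observations in the same way, but predicts a quantitative value for the test observation by averaging the responses of those neighbors.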
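
To make the contrast concrete, here is a minimal sketch in base R with one predictor. The training values and the helper names (nearest_k, knn_classify, knn_regress) are made up for illustration and are not from ISLR or any package.

```r
# Toy training data (hypothetical values, for illustration only)
train_x     <- c(1.0, 1.5, 2.0, 6.0, 6.5, 7.0)
train_class <- c("A", "A", "A", "B", "B", "B")   # qualitative response
train_value <- c(10, 12, 11, 30, 32, 31)         # quantitative response

# Indices of the k training points closest to x0 (absolute distance)
nearest_k <- function(x0, k) order(abs(train_x - x0))[1:k]

# KNN classifier: majority vote among the k nearest neighbors
knn_classify <- function(x0, k = 3) {
  votes <- table(train_class[nearest_k(x0, k)])
  names(votes)[which.max(votes)]
}

# KNN regression: average response of the k nearest neighbors
knn_regress <- function(x0, k = 3) {
  mean(train_value[nearest_k(x0, k)])
}

knn_classify(1.8)  # the three nearest neighbors are all class "A"
knn_regress(1.8)   # mean of the responses 11, 12 and 10
```

Both functions share the neighbor search; only the last step (vote versus average) differs.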

9. This question involves the use of multiple linear regression on the Auto data set.

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

Auto<-read.table('http://www-bcf.usc.edu/~gareth/ISL/Auto.data', header = TRUE)

pairs(Auto[,1:8])

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

library(corrplot)

## Warning: package 'corrplot' was built under R version 3.5.2

## corrplot 0.84 loaded

Auto[1:8] <- sapply(Auto[1:8],as.numeric)
AutoCorter<- cor(Auto[,1:8])
corrplot(AutoCorter, method ='square')
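
One caveat about the as.numeric() conversion above: the raw Auto.data file marks missing horsepower values with "?", and the 389 residual degrees of freedom reported in part (c) with 7 predictors imply that all 397 rows were kept. Under R 3.5 defaults, read.table() would then likely read horsepower as a factor, and as.numeric() on a factor returns the internal level codes rather than the printed numbers, which would affect the horsepower results. The sketch below reproduces the effect on toy values (hypothetical, not taken from Auto). Reading the file with na.strings = "?" and dropping incomplete rows with na.omit() is one way to avoid it.

```r
# Toy stand-in for a column that uses "?" as a missing-value marker
hp_raw <- factor(c("130", "165", "?", "95"))

# as.numeric() on a factor returns level codes, not the original numbers
hp_codes <- as.numeric(hp_raw)

# Converting through character recovers the numbers; "?" becomes NA
hp_fixed <- suppressWarnings(as.numeric(as.character(hp_raw)))

hp_codes
hp_fixed
```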

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

i. Is there a relationship between the predictors and the response?

lm.fit<-lm(mpg~.-name, data=Auto)
summary(lm.fit)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.629 -2.034 -0.046  1.801 13.010 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.128e+01  4.259e+00  -4.998 8.78e-07 ***
## cylinders    -2.927e-01  3.382e-01  -0.865   0.3874    
## displacement  1.603e-02  7.284e-03   2.201   0.0283 *  
## horsepower    7.942e-03  6.809e-03   1.166   0.2442    
## weight       -6.870e-03  5.799e-04 -11.846  < 2e-16 ***
## acceleration  1.539e-01  7.750e-02   1.986   0.0477 *  
## year          7.734e-01  4.939e-02  15.661  < 2e-16 ***
## origin        1.346e+00  2.691e-01   5.004 8.52e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.331 on 389 degrees of freedom
## Multiple R-squared:  0.822,  Adjusted R-squared:  0.8188 
## F-statistic: 256.7 on 7 and 389 DF,  p-value: < 2.2e-16

Yes. The F-statistic is 256.7 with a p-value < 2.2e-16, so there is strong evidence that at least one of the predictors is related to mpg.

ii. Which predictors appear to have a statistically significant relationship to the response?

Displacement, acceleration, weight, year, and origin all have p-values below .05 and are statistically significant. Cylinders and horsepower are not significant. Together the predictors account for about 82% of the variation in mpg (adjusted R-squared 0.8188).
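
Rather than reading the p-values off the printed table, they can be pulled from the summary object. This is a small sketch on mtcars, which ships with base R and only stands in for Auto here; the choice of predictors is arbitrary.

```r
# mtcars is a stand-in data set (assumption); predictors chosen arbitrarily
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# coef(summary(fit)) is the coefficient table as a matrix; drop the intercept row
coefs <- coef(summary(fit))[-1, , drop = FALSE]
pvals <- coefs[, "Pr(>|t|)"]

# Names of the predictors whose p-value is below 0.05
significant <- names(pvals)[pvals < 0.05]
significant
```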

iii. What does the coefficient for the year variable suggest?

The newer the car, the better the gas mileage. The model suggests that for each one-year increase in model year, mpg increases by about 0.77 (7.734e-01), holding the other predictors fixed.

(d) Use the plot() function to produce diagnostic plots of the linear regression fit.

par(mfrow=c(2,2))

plot(lm.fit)

Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers?

The residuals vs. fitted plot suggests non-linearity: the residuals follow a U-shaped pattern instead of being randomly scattered around zero. The plot shows some outliers at the smaller fitted values and even more at the larger fitted values.

Does the leverage plot identify any observations with unusually high leverage?

No observation falls outside the Cook’s distance contours, so none has unusually high influence. Observation 14 sits at the bottom right of the plot, which means relatively high leverage, but it is still inside the Cook’s distance contours.
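
The leverage plot can be backed up with numbers. The following is a rough sketch, assuming mtcars (base R) as a stand-in for Auto; the cutoffs of 3 times the average leverage and a Cook's distance of 1 are rules of thumb, not formal tests.

```r
# mtcars is a stand-in data set (assumption)
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

lev   <- hatvalues(fit)        # leverage statistic for each observation
cooks <- cooks.distance(fit)   # Cook's distance for each observation

n <- nrow(mtcars)
p <- length(coef(fit)) - 1     # number of predictors

# The average leverage is always (p + 1) / n; flag points well above it
high_lev <- names(lev)[lev > 3 * (p + 1) / n]

# Cook's distance above 1 is a common rule-of-thumb warning sign
high_cook <- names(cooks)[cooks > 1]

high_lev
high_cook
```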

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

lm.fit<-lm(mpg~.-name+cylinders*acceleration, data=Auto)
summary(lm.fit)

## 
## Call:
## lm(formula = mpg ~ . - name + cylinders * acceleration, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6385 -2.0148  0.0181  1.8525 12.5798 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -37.997857   5.777865  -6.576 1.56e-10 ***
## cylinders                2.603860   0.767972   3.391 0.000769 ***
## displacement             0.003954   0.007698   0.514 0.607805    
## horsepower               0.007197   0.006672   1.079 0.281377    
## weight                  -0.006220   0.000589 -10.560  < 2e-16 ***
## acceleration             1.077460   0.233595   4.613 5.41e-06 ***
## year                     0.800046   0.048790  16.398  < 2e-16 ***
## origin                   1.135397   0.268341   4.231 2.90e-05 ***
## cylinders:acceleration  -0.178164   0.042618  -4.181 3.60e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.263 on 388 degrees of freedom
## Multiple R-squared:  0.8297, Adjusted R-squared:  0.8262 
## F-statistic: 236.3 on 8 and 388 DF,  p-value: < 2.2e-16

lm.fit<-lm(mpg~.-name+cylinders*acceleration+displacement*horsepower, data=Auto)
summary(lm.fit)

## 
## Call:
## lm(formula = mpg ~ . - name + cylinders * acceleration + displacement * 
##     horsepower, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9058 -2.0680 -0.0239  1.9590 12.5716 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -3.871e+01  5.838e+00  -6.631 1.13e-10 ***
## cylinders                2.537e+00  7.721e-01   3.285  0.00111 ** 
## displacement             8.395e-03  9.256e-03   0.907  0.36500    
## horsepower               2.266e-02  1.909e-02   1.187  0.23592    
## weight                  -6.222e-03  5.892e-04 -10.560  < 2e-16 ***
## acceleration             1.071e+00  2.338e-01   4.580 6.28e-06 ***
## year                     8.016e-01  4.884e-02  16.413  < 2e-16 ***
## origin                   1.122e+00  2.689e-01   4.173 3.71e-05 ***
## cylinders:acceleration  -1.760e-01  4.270e-02  -4.123 4.59e-05 ***
## displacement:horsepower -8.778e-05  1.015e-04  -0.865  0.38780    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.264 on 387 degrees of freedom
## Multiple R-squared:   0.83,  Adjusted R-squared:  0.8261 
## F-statistic:   210 on 9 and 387 DF,  p-value: < 2.2e-16

cylinders * acceleration

The interaction between cylinders and acceleration is significant, with a p-value of 3.60e-05 < .05. Adding this interaction made the previously significant displacement variable insignificant (p = 0.608) and made cylinders significant (p = 0.000769). Horsepower was not significant before and remains insignificant.

cylinders * acceleration + displacement * horsepower

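The interaction between displacement and horsepower is not significant (p = 0.388). Adding it did not change which of the other terms are significant: the cylinders:acceleration interaction and the individual variables that were significant in the previous model stay significant.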
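
As a quick check of what the two symbols mean: a * b is shorthand for a + b + a:b, so both spellings give the same fit. The sketch below uses mtcars (base R) as a stand-in for Auto, which is an assumption made only to keep the example self-contained.

```r
# mtcars is a stand-in data set (assumption)
fit_star  <- lm(mpg ~ wt * hp, data = mtcars)           # shorthand form
fit_colon <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)   # explicit expansion

# Both formulas describe the same model, so the coefficients agree
all.equal(coef(fit_star), coef(fit_colon))

# p-value of the interaction term alone
coef(summary(fit_star))["wt:hp", "Pr(>|t|)"]
```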

10. This question should be answered using the Carseats data set.

library(ISLR)

## Warning: package 'ISLR' was built under R version 3.5.2

## 
## Attaching package: 'ISLR'

## The following object is masked _by_ '.GlobalEnv':
## 
##     Auto

head(Carseats)

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

lm.10<-lm(Sales~Price+Urban+US, data = Carseats)

summary(lm.10)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

Price is significant with a p-value of < 2e-16. For every one-unit increase in price, sales decrease by about 0.054, holding the other predictors fixed. The UrbanYes categorical variable is not significant, with a p-value of 0.936: whether or not the store is in an urban area appears to be irrelevant to sales. The USYes categorical variable is significant with a p-value of 4.86e-06: for a store in the US, sales are about 1.20 higher than for a store outside the US, all else equal.

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

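Sales = 13.04 - 0.054 x Price - 0.022 x UrbanYes + 1.20 x USYes

where UrbanYes is 1 for an urban location and 0 otherwise, and USYes is 1 for a US location and 0 otherwise.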
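
To see how the qualitative variables enter the model from part (a), the fitted equation can be evaluated by hand. This sketch copies the coefficients from the summary output above; predict_sales is a hypothetical helper and the price of 100 is an arbitrary example value.

```r
# Coefficients copied from the summary(lm.10) output in part (a)
b <- c(intercept = 13.043469, price = -0.054459,
       urban_yes = -0.021916, us_yes = 1.200573)

# Hypothetical helper: each Yes/No factor enters as a 0/1 dummy variable
predict_sales <- function(price, urban, us) {
  unname(b["intercept"] + b["price"] * price +
         b["urban_yes"] * as.numeric(urban == "Yes") +
         b["us_yes"]    * as.numeric(us == "Yes"))
}

# Two urban stores at the same price, differing only in US:
# the gap between them equals the USYes coefficient
gap <- predict_sales(100, "Yes", "Yes") - predict_sales(100, "Yes", "No")
gap
```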

(d) For which of the predictors can you reject the null hypothesis that beta is equal to 0?

I can reject the null hypothesis that beta is equal to 0 for Price and for US (whether or not the store is in the US). I cannot reject it for Urban.

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

 lm.10<-lm(Sales~Price+US, data = Carseats)
 
 summary(lm.10)

## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data?

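Model (a): Adjusted R-squared: 0.2335
Model (e): Adjusted R-squared: 0.2354

Adjusted R-squared is slightly better for (e), but both models explain only about 24% of the variation in Sales (Multiple R-squared 0.2393), so neither fits the data particularly well.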
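
The small gain for (e) follows from the adjusted R-squared formula, which penalizes each extra predictor. A minimal sketch using the rounded values printed in the two summaries (so the results match the reported figures only approximately):

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n  <- 400      # residual df + p + 1 in both summaries (396 + 4 and 397 + 3)
r2 <- 0.2393   # Multiple R-squared as printed (rounded) for both models

adj_a <- adj_r2(r2, n, p = 3)  # model (a): Price, Urban, US
adj_e <- adj_r2(r2, n, p = 2)  # model (e): Price, US

c(a = adj_a, e = adj_e)
```

With R-squared essentially unchanged, dropping the Urban term raises the adjusted value because p is smaller.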

(g) Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

confint(lm.10)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
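
For reference, the same intervals can be rebuilt by hand as estimate plus or minus a t critical value times the standard error. The sketch below copies the (rounded) estimates and standard errors from the summary of model (e), so it agrees with confint() only up to rounding.

```r
# Estimates and standard errors copied from the summary of model (e)
est <- c("(Intercept)" = 13.03079, Price = -0.05448, USYes = 1.19964)
se  <- c("(Intercept)" = 0.63098,  Price = 0.00523,  USYes = 0.25846)

df_resid <- 397                   # residual degrees of freedom from the summary
t_crit   <- qt(0.975, df_resid)   # two-sided 95% critical value

ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci
```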

(h) Is there evidence of outliers or high leverage observations in the model from (e) ?

par(mfrow=c(2,2))

plot(lm.10)

The first three diagnostic plots show no obvious problems, and the fourth plot (residuals vs. leverage) answers the question: there are no points outside the Cook’s distance contours for model (e).

12. This problem involves simple linear regression without an intercept.

(a) Recall that the coefficient estimate Beta for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

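From (3.38), the estimate for the regression of Y onto X is sum(x_i * y_i) / sum(x_i^2), and for X onto Y it is sum(x_i * y_i) / sum(y_i^2). The numerators are identical, so the two estimates are the same when sum(x_i^2) = sum(y_i^2). The case where y is equal to x, so that both estimates are 1, is one example.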

(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

x=rnorm(100)
y=.696969*x+rnorm(100)
coefficients(lm(x~y))

## (Intercept)           y 
##  0.02868183  0.54224052

coefficients(lm(y~x))

## (Intercept)           x 
##  0.07144916  0.71084865
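
The fits above include an intercept, while the question concerns regression without one. A sketch of the no-intercept version follows, assuming an arbitrary seed and slope; it also checks each estimate against the closed form sum(x*y) / sum(x^2).

```r
set.seed(1)                  # arbitrary seed, only for reproducibility
x <- rnorm(100)
y <- 0.7 * x + rnorm(100)    # hypothetical slope

b_yx <- unname(coef(lm(y ~ x + 0)))   # Y onto X, no intercept
b_xy <- unname(coef(lm(x ~ y + 0)))   # X onto Y, no intercept

# Closed forms share a numerator but have different denominators
c(b_yx, sum(x * y) / sum(x^2))
c(b_xy, sum(x * y) / sum(y^2))
```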

(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

x=rnorm(100)
y=1*x
coefficients(lm(x~y))

##  (Intercept)            y 
## 3.330669e-17 1.000000e+00

coefficients(lm(y~x))

##  (Intercept)            x 
## 3.330669e-17 1.000000e+00
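
Setting y equal to x is the simplest case, but for regression without an intercept the two estimates also agree whenever sum(x^2) equals sum(y^2). One way to sketch this, assuming an arbitrary seed, is to make y a random permutation of x:

```r
set.seed(2)                  # arbitrary seed, only for reproducibility
x <- rnorm(100)
y <- sample(x)               # same values as x, in a different order

b_yx <- unname(coef(lm(y ~ x + 0)))   # Y onto X, no intercept
b_xy <- unname(coef(lm(x ~ y + 0)))   # X onto Y, no intercept

all.equal(sum(x^2), sum(y^2))   # equal up to floating-point rounding
all.equal(b_yx, b_xy)           # so the two slopes agree even though y != x
```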

HW3 Linear Regression and Related Stuff

Manuel S. H. Valles

February 15, 2019