Assignment 2: Linear Regression

Question 2

Carefully explain the differences between KNN classifier and KNN regression models
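Overall they are very similar. Given a value for K and a prediction point x0, both methods first identify the K training observations that are closest to the prediction point. The difference is in how those neighbors are used. The KNN classifier handles a qualitative response: it estimates the probability of each class as the fraction of the K neighbors belonging to that class and assigns the prediction point to the most common class. KNN regression handles a quantitative response: it predicts the average of the response values of the K nearest neighbors.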
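
The contrast can be sketched in a few lines of base R. The training data below are made up for illustration, and the helper names (knn_neighbors, knn_classify, knn_regress) are hypothetical, not from any package.

```r
# Toy one-dimensional training data (invented for this sketch)
train_x     <- c(1, 2, 3, 6, 7, 8)
train_class <- c("a", "a", "a", "b", "b", "b")   # qualitative response
train_y     <- c(1.0, 1.5, 2.0, 5.0, 5.5, 6.0)   # quantitative response

# Indices of the k training points closest to x0
knn_neighbors <- function(x0, k) order(abs(train_x - x0))[1:k]

# Classifier: majority vote among the k neighbors
knn_classify <- function(x0, k) {
  votes <- table(train_class[knn_neighbors(x0, k)])
  names(votes)[which.max(votes)]
}

# Regression: average response of the k neighbors
knn_regress <- function(x0, k) mean(train_y[knn_neighbors(x0, k)])

knn_classify(4.4, 3)   # neighbors are x = 3, 6, 2, so the vote is a, b, a
knn_regress(4.4, 3)    # mean of 2.0, 5.0, 1.5
```

Both functions share the same neighbor search; only the final aggregation step differs.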

Question 9

a.) Create a scatter plot matrix using all the variables from the Auto data set

library(ISLR)
pairs(Auto[,1:9], pch = 19)

b.) Compute the matrix of correlations between the variables. Note that the name variable must be removed because it is qualitative.

cor(Auto[1:8])

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

c.) Use the linear models function to perform a multiple linear regression with mpg as the response variable. Remove the name variable from the prediction.
Address the following points:
i.) Is there a relationship between the predictors and the response?
ii.) Which predictors appear to have a statistically significant relationship to the response?
iii.) What does the coefficient for the year variable suggest?

lm.fit=lm(mpg~.-name, data=Auto)
summary(lm.fit)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

i.) Based on the summary of the linear model, the overall F-statistic is 252.4 with a p-value below 2.2e-16, so the model is significant at the 0.05 level. We reject the null hypothesis that all of the coefficients are zero, which means at least one of the predictors has a relationship with the response.
ii.) Looking at the predictors individually, not all of them are statistically significant. The significant ones are displacement, weight, year, and origin, each with a p-value below 0.05. Cylinders, horsepower, and acceleration are not significant once the other predictors are in the model.

iii.) The coefficient for year is 0.750773. This tells us that, holding the other predictors fixed, each additional model year is associated with an increase of about 0.75 in mpg. In other words, newer cars tend to be more fuel efficient.
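
One way to check the reading of the year coefficient is to predict mpg for two cars that differ only in model year. This is a sketch that assumes the ISLR package is installed; car_a and car_b are hypothetical rows built from the first observation.

```r
library(ISLR)  # provides the Auto data set used above

fit <- lm(mpg ~ . - name, data = Auto)

# Two hypothetical cars: copy the first row, then advance the model year by one
car_a <- Auto[1, ]
car_b <- car_a
car_b$year <- car_b$year + 1

# The difference in predicted mpg equals the year coefficient
diff_mpg <- unname(predict(fit, car_b) - predict(fit, car_a))
diff_mpg
coef(fit)["year"]
```

The predictor is what changes by one unit; the response is what moves by the size of the coefficient.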

d.) Use the plot function to produce diagnostic plots of the linear regression fit.
Do the residual plots suggest any unusual outliers?
Does the leverage plot identify any observations with unusually high leverage?

lm.fit=lm(mpg~.-name, data=Auto)
plot(hatvalues(lm.fit))

We can see from this plot of hat values that one observation has much higher leverage than the rest. The plot of standardized residuals against leverage shows the same point.

par(mfrow=c(2,2))
plot(lm.fit)
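
The plots can be backed up with numbers. The sketch below lists observations with high leverage and large studentized residuals. The cutoffs (three times the average leverage, and an absolute studentized residual above 3) are rule-of-thumb assumptions, not formal tests, and the ISLR package is assumed to be installed.

```r
library(ISLR)

fit <- lm(mpg ~ . - name, data = Auto)
h <- hatvalues(fit)

# Average leverage is (p + 1) / n: the number of coefficients over the number of rows
avg_leverage <- length(coef(fit)) / nrow(Auto)

# Assumed cutoff: three times the average leverage
high_leverage <- h[h > 3 * avg_leverage]
sort(high_leverage, decreasing = TRUE)

# Candidate outliers: studentized residuals larger than 3 in absolute value
r <- rstudent(fit)
r[abs(r) > 3]
```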

e.) Fit linear regression models with interaction terms

First we fit a simple model that uses only two predictors and their interaction:

summary(lm(mpg~cylinders*displacement, data = Auto))

## 
## Call:
## lm(formula = mpg ~ cylinders * displacement, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.0432  -2.4308  -0.2263   2.2048  20.9051 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            48.22040    2.34712  20.545  < 2e-16 ***
## cylinders              -2.41838    0.53456  -4.524 8.08e-06 ***
## displacement           -0.13436    0.01615  -8.321 1.50e-15 ***
## cylinders:displacement  0.01182    0.00207   5.711 2.24e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.454 on 388 degrees of freedom
## Multiple R-squared:  0.6769, Adjusted R-squared:  0.6744 
## F-statistic:   271 on 3 and 388 DF,  p-value: < 2.2e-16

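All three terms are significant in this model. The significant interaction term suggests that the effect of displacement on mpg depends on the number of cylinders. Note that the R-squared of 0.6769 is lower than the 0.8215 of the full model in (c), so this smaller model does not fit the data better overall.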
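
To see what the interaction means in practice, the slope of mpg with respect to displacement can be computed at different cylinder counts. This sketch uses the coefficient estimates copied from the output above; the helper name disp_slope is hypothetical.

```r
# Coefficients copied from the summary output above
b_disp  <- -0.13436   # displacement
b_inter <-  0.01182   # cylinders:displacement

# Change in predicted mpg per extra unit of displacement, at a given cylinder count
disp_slope <- function(cylinders) b_disp + b_inter * cylinders

disp_slope(4)   # -0.08708
disp_slope(8)   # -0.0398
```

Under this fit, extra displacement costs roughly twice as much mpg in a 4-cylinder car as in an 8-cylinder car.
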
Now let's include all variables and some interaction terms.

summary(lm(mpg~.-name+cylinders:displacement + weight:acceleration, data=Auto))

## 
## Call:
## lm(formula = mpg ~ . - name + cylinders:displacement + weight:acceleration, 
##     data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3257  -1.8495  -0.0574   1.6634  11.9975 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -2.118e+01  7.263e+00  -2.916 0.003752 ** 
## cylinders              -1.951e+00  4.631e-01  -4.214 3.13e-05 ***
## displacement           -6.061e-02  1.488e-02  -4.073 5.63e-05 ***
## horsepower             -5.288e-02  1.330e-02  -3.976 8.40e-05 ***
## weight                  2.279e-04  1.767e-03   0.129 0.897429    
## acceleration            9.199e-01  2.760e-01   3.333 0.000944 ***
## year                    7.744e-01  4.696e-02  16.492  < 2e-16 ***
## origin                  6.842e-01  2.703e-01   2.531 0.011780 *  
## cylinders:displacement  9.959e-03  2.027e-03   4.912 1.34e-06 ***
## weight:acceleration    -3.214e-04  9.743e-05  -3.299 0.001061 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.05 on 382 degrees of freedom
## Multiple R-squared:  0.8508, Adjusted R-squared:  0.8473 
## F-statistic:   242 on 9 and 382 DF,  p-value: < 2.2e-16

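From this model we can see that every term is significant except the main effect of weight. The weight:acceleration interaction is significant, however, so weight still matters through its interaction with acceleration, and by the hierarchical principle its main effect should stay in the model.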

f.) Try a few different transformations of the variables.

Our first transformation adds the square of displacement to the model.

summary(lm(mpg~displacement+I(displacement^2), data=Auto))

## 
## Call:
## lm(formula = mpg ~ displacement + I(displacement^2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2165  -2.2404  -0.2508   2.1094  20.5158 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.204e+01  1.078e+00  39.011  < 2e-16 ***
## displacement      -1.379e-01  1.113e-02 -12.388  < 2e-16 ***
## I(displacement^2)  1.685e-04  2.366e-05   7.122 5.17e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.365 on 389 degrees of freedom
## Multiple R-squared:  0.6888, Adjusted R-squared:  0.6872 
## F-statistic: 430.5 on 2 and 389 DF,  p-value: < 2.2e-16

This transformation fits the data a little better. The R-squared for mpg regressed on displacement alone was 0.6482. Adding the squared term raises it to 0.6888, and the squared term is itself significant, so the quadratic model fits better.

summary(lm(mpg~log(displacement), data=Auto))

## 
## Call:
## lm(formula = mpg ~ log(displacement), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.1204  -2.5843  -0.4217   2.1979  19.9005 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        85.6906     2.1422   40.00   <2e-16 ***
## log(displacement) -12.1385     0.4155  -29.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.377 on 390 degrees of freedom
## Multiple R-squared:  0.6863, Adjusted R-squared:  0.6855 
## F-statistic: 853.4 on 1 and 390 DF,  p-value: < 2.2e-16

Here we took the log of displacement. R-squared rises from the original 0.6482 to 0.6863, but that is slightly below the 0.6888 of the squared model.

summary(lm(mpg~sqrt(displacement), data=Auto))

## 
## Call:
## lm(formula = mpg ~ sqrt(displacement), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4034  -2.7367  -0.4956   2.3207  19.3499 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        47.11839    0.86246   54.63   <2e-16 ***
## sqrt(displacement) -1.75878    0.06186  -28.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.458 on 390 degrees of freedom
## Multiple R-squared:  0.6746, Adjusted R-squared:  0.6738 
## F-statistic: 808.5 on 1 and 390 DF,  p-value: < 2.2e-16

Finally, taking the square root of displacement gives an R-squared of 0.6746, again better than the original but not as good as the squared or log transformations.
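
The four fits can be compared side by side. Because the squared model has one more coefficient than the others, adjusted R-squared is a fairer yardstick than plain R-squared. The sketch assumes the ISLR package is installed; the list name fits is arbitrary.

```r
library(ISLR)

fits <- list(
  linear  = lm(mpg ~ displacement, data = Auto),
  squared = lm(mpg ~ displacement + I(displacement^2), data = Auto),
  log     = lm(mpg ~ log(displacement), data = Auto),
  sqrt    = lm(mpg ~ sqrt(displacement), data = Auto)
)

# Collect R-squared and adjusted R-squared for each fit (one row per model)
comparison <- t(sapply(fits, function(f) {
  s <- summary(f)
  c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared)
}))
round(comparison, 4)
```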

Question 10

a.) Fit a multiple regression model to predict Sales using Price, Urban, and US.

lm.fit=lm(Sales~Price+Urban+US, data=Carseats)
summary(lm.fit)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

b.) Provide an interpretation of each coefficient

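Each qualitative variable is coded as a dummy variable, so the output shows only one level of each (UrbanYes and USYes), with No as the baseline. Price: holding the other predictors fixed, a one-unit increase in price is associated with a decrease of 0.054459 in sales (sales are recorded in thousands of units, so about 54 units). UrbanYes: on average, urban stores sell 0.021916 thousand fewer units than non-urban stores, but this difference is not statistically significant (p = 0.936). USYes: stores in the US sell about 1.2 thousand more units on average than stores outside the US.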

c.) Write out the model in equation form
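Sales = 13.043469 - 0.054459 Price - 0.021916 UrbanYes + 1.200573 USYes, where UrbanYes is 1 for an urban store and 0 otherwise, and USYes is 1 for a US store and 0 otherwise.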
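
As a worked example, the equation can be evaluated for a couple of hypothetical stores. The coefficients are copied from the summary in part (a); the function name predict_sales and the store settings are made up for illustration.

```r
# Coefficients copied from the summary output in part (a)
predict_sales <- function(price, urban, us) {
  13.043469 - 0.054459 * price -
    0.021916 * (urban == "Yes") + 1.200573 * (us == "Yes")
}

# Hypothetical store: price of 100, urban location, in the US
predict_sales(100, "Yes", "Yes")   # 8.776226
# Same price, non-urban location, outside the US
predict_sales(100, "No", "No")     # 7.597569
```

Sales are in thousands of units, so these correspond to roughly 8,776 and 7,598 car seats.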

d.) For which of the predictors can you reject the null hypothesis?
Based on the regression table, we can reject the null hypothesis that the coefficient is zero for Price and USYes. We cannot reject it for UrbanYes.

e.) Fit a smaller model that uses only the predictors for which there is evidence of association with the outcome.

lm.fit2=lm(Sales~Price+US, data=Carseats)
summary(lm.fit2)

## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

f.) How well do models (a) and (e) fit the data?
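Neither model fits well: both have an R-squared of 0.2393, so they explain only about 24% of the variance in Sales. Model (e) is slightly preferable, with a higher adjusted R-squared (0.2354 vs. 0.2335) and a marginally lower residual standard error (2.469 vs. 2.472) while using one fewer predictor.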

g.) Using the model from (e), obtain 95% confidence intervals for the coefficients.

confint(lm.fit2)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

h.) Is there evidence of outliers or high leverage observations in the model from (e)?

We can use R's diagnostic plots to look for outliers and high-leverage points. The average leverage is always \((p+1)/n\), which here is \((2+1)/400 = 0.0075\), so observations with leverage well above 0.0075 deserve a closer look.

par(mfrow=c(2,2))
plot(lm.fit2)

summary(influence.measures(lm.fit2))

## Potentially influential observations of
##   lm(formula = Sales ~ Price + US, data = Carseats) :
## 
##     dfb.1_ dfb.Pric dfb.USYs dffit   cov.r   cook.d hat    
## 26   0.24  -0.18    -0.17     0.28_*  0.97_*  0.03   0.01  
## 29  -0.10   0.10    -0.10    -0.18    0.97_*  0.01   0.01  
## 43  -0.11   0.10     0.03    -0.11    1.05_*  0.00   0.04_*
## 50  -0.10   0.17    -0.17     0.26_*  0.98    0.02   0.01  
## 51  -0.05   0.05    -0.11    -0.18    0.95_*  0.01   0.00  
## 58  -0.05  -0.02     0.16    -0.20    0.97_*  0.01   0.01  
## 69  -0.09   0.10     0.09     0.19    0.96_*  0.01   0.01  
## 126 -0.07   0.06     0.03    -0.07    1.03_*  0.00   0.03_*
## 160  0.00   0.00     0.00     0.01    1.02_*  0.00   0.02  
## 166  0.21  -0.23    -0.04    -0.24    1.02    0.02   0.03_*
## 172  0.06  -0.07     0.02     0.08    1.03_*  0.00   0.02  
## 175  0.14  -0.19     0.09    -0.21    1.03_*  0.02   0.03_*
## 210 -0.14   0.15    -0.10    -0.22    0.97_*  0.02   0.01  
## 270 -0.03   0.05    -0.03     0.06    1.03_*  0.00   0.02  
## 298 -0.06   0.06    -0.09    -0.15    0.97_*  0.01   0.00  
## 314 -0.05   0.04     0.02    -0.05    1.03_*  0.00   0.02_*
## 353 -0.02   0.03     0.09     0.15    0.97_*  0.01   0.00  
## 357  0.02  -0.02     0.02    -0.03    1.03_*  0.00   0.02  
## 368  0.26  -0.23    -0.11     0.27_*  1.01    0.02   0.02_*
## 377  0.14  -0.15     0.12     0.24    0.95_*  0.02   0.01  
## 384  0.00   0.00     0.00     0.00    1.02_*  0.00   0.02  
## 387 -0.03   0.04    -0.03     0.05    1.02_*  0.00   0.02  
## 396 -0.05   0.05     0.08     0.14    0.98_*  0.01   0.00
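
The hat column in the table above can be reproduced directly from the leverage values. The sketch below assumes a cutoff of three times the average leverage, which is a rule of thumb rather than a formal test, and assumes the ISLR package is installed.

```r
library(ISLR)

fit <- lm(Sales ~ Price + US, data = Carseats)
h <- hatvalues(fit)

avg_leverage <- length(coef(fit)) / nrow(Carseats)   # (2 + 1) / 400 = 0.0075

# Assumed cutoff: three times the average leverage
high_leverage <- which(h > 3 * avg_leverage)
names(high_leverage)

# Largest Cook's distance, to judge whether any single point dominates the fit
max(cooks.distance(fit))
```

The largest Cook's distance reported in the table is 0.03, which suggests that no single observation has much influence on the fitted coefficients.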

Several observations are flagged by at least one influence measure. We can try removing these points to see whether the model improves.

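# Row numbers flagged by influence.measures() above
outlying.obs <- c(26,29,43,50,51,58,69,126,160,166,172,175,210,270,298,314,353,357,368,377,384,387,396)
# Refit the model from (e) without the flagged rows
Small <- Carseats[-outlying.obs, ]
fit2 <- lm(Sales~Price+US, data=Small)
summary(fit2)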

## 
## Call:
## lm(formula = Sales ~ Price + US, data = Small)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.263 -1.605 -0.039  1.590  5.428 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.925232   0.665259  19.429  < 2e-16 ***
## Price       -0.053973   0.005511  -9.794  < 2e-16 ***
## USYes        1.255018   0.248856   5.043 7.15e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.29 on 374 degrees of freedom
## Multiple R-squared:  0.2387, Adjusted R-squared:  0.2347 
## F-statistic: 58.64 on 2 and 374 DF,  p-value: < 2.2e-16

The new regression summary shows no meaningful improvement: R-squared is essentially unchanged (0.2387 vs. 0.2393) and the coefficients change very little, therefore we can keep all of the data in our model.

Question 12

This problem involves linear regression without an intercept.

a.) Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

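Without an intercept, the estimate for the regression of Y onto X is \(\sum x_i y_i / \sum x_i^2\), and the estimate for the regression of X onto Y is \(\sum x_i y_i / \sum y_i^2\). The numerators are identical, so the two estimates are equal when \(\sum x_i^2 = \sum y_i^2\) (or, trivially, when \(\sum x_i y_i = 0\)). Because there is no intercept, the condition is on the raw sums of squares rather than on Var(x) and Var(y). One special case is y = x, where both coefficients are 1.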
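
The closed-form estimates, sum(x*y)/sum(x^2) for Y onto X and sum(x*y)/sum(y^2) for X onto Y, can be checked against lm(). This is a sketch on simulated data; the seed is arbitrary and chosen only so the run is reproducible.

```r
set.seed(1)   # arbitrary seed so the sketch is reproducible
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

# Closed-form no-intercept slopes
beta_y_on_x <- sum(x * y) / sum(x^2)
beta_x_on_y <- sum(x * y) / sum(y^2)

# Both comparisons should report TRUE
all.equal(beta_y_on_x, unname(coef(lm(y ~ x + 0))))
all.equal(beta_x_on_y, unname(coef(lm(x ~ y + 0))))
```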

b.) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

c.) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

Code for (b) and (c). The first example answers (b) and the second answers (c).

x=rnorm(100)
y=0.5*x+rnorm(100)
coefficients(lm(x~y+0))

##         y 
## 0.3712565

coefficients(lm(y~x+0))

##         x 
## 0.5247402

x=rnorm(100)
y=1*x
coefficients(lm(x~y+0))

## y 
## 1

coefficients(lm(y~x+0))

## x 
## 1
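
A less trivial example for (c) is to make y a random permutation of x. The two vectors then share the same sum of squares without being identical, so the two no-intercept coefficients agree. This is a sketch with an arbitrary seed.

```r
set.seed(1)      # arbitrary seed so the sketch is reproducible
x <- rnorm(100)
y <- sample(x)   # a random permutation of x, so sum(y^2) equals sum(x^2)

coef(lm(y ~ x + 0))
coef(lm(x ~ y + 0))
```

Both calls return the same slope, and in general it is not 1, which shows that equality of the coefficients does not require y = x.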

Andrew Mazza