2.

Carefully explain the differences between the KNN classifier and KNN regression methods.

For KNN regression, the algorithm identifies the K training observations closest to a point X and estimates f(X) as the average of the responses of those K neighbors. The classification flavor of KNN is similar: the algorithm again identifies the K training observations closest to X, but instead assigns X the class that is most common among those neighbors (a majority vote). If “Banana” is the most common label among the neighbors, then X gets classified as the yellow fruit. The key difference is the output: regression returns a quantitative prediction (an average), while classification returns a qualitative label (the plurality class), typically along with estimated class probabilities.
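A minimal sketch of the two flavors in R (assuming the class and FNN packages are installed; the toy data below is purely illustrative):

# Toy data: one predictor, a numeric response for regression, a class label for classification
set.seed(1)
x_train <- matrix(rnorm(20), ncol = 1)
y_num   <- 2 * x_train[, 1] + rnorm(20)
y_lab   <- factor(ifelse(x_train[, 1] > 0, "Banana", "Apple"))
x_new   <- matrix(0.5, ncol = 1)

# KNN regression: predict f(X) as the average response of the K nearest neighbors
FNN::knn.reg(train = x_train, test = x_new, y = y_num, k = 3)$pred

# KNN classification: assign X the majority class among the K nearest neighbors
class::knn(train = x_train, test = x_new, cl = y_lab, k = 3)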

9.

This question involves the use of multiple linear regression on the Auto data set.

9-A)

Produce a scatterplot matrix which includes all of the variables in the data set.

#str(Auto)
pairs(Auto)

9-B)

Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

names(Auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"       "name"
cor(Auto[, 1:8])   # exclude column 9, the qualitative name variable
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

9-C)

Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.

lm.fit9C <- lm(mpg ~. -name, data=Auto)
summary(lm.fit9C)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

9-Ci)

Is there a relationship between the predictors and the response?

Yes. The null hypothesis is that the model is not useful, i.e., that all coefficients are equal to zero. The alternative hypothesis is that the model is useful, i.e., that at least one coefficient is not equal to zero. For this model, the p-value of the F-statistic (< 2.2e-16) is well below the typical 0.05 threshold, so we reject the null hypothesis: the model is useful, and at least one coefficient is nonzero. The model explains approximately 82% of the variance observed in mpg.

9-Cii)

Which predictors appear to have a statistically significant relationship to the response?

Having established that the model is useful, we examine the t-tests for the individual coefficients. For each predictor, the null hypothesis is that it has no relationship with the response (its coefficient is zero); the alternative is that the coefficient is nonzero. For this model, the following predictors have p-values below the 0.05 threshold:

displacement, weight, year, and origin.

Each of these has a statistically significant relationship with the response, mpg.
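A quick way to pull these out programmatically from the fitted object in 9-C (a small sketch):

coefs <- summary(lm.fit9C)$coefficients
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]   # terms (including the intercept) with p < 0.05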

9-Ciii)

What does the coefficient for the year variable suggest?

The coefficient for year should be interpreted as follows: holding all other variables constant, mpg increases by about 0.75 for every one-year increase in model year. In other words, fuel efficiency improved by roughly three quarters of a mile per gallon per year over the period covered by the data.

9-D)

Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow = c(2,2))
plot(lm.fit9C, which=1:4)

The Residuals vs Fitted plot shows both curvature and a widening “cheese wedge” spread, suggesting some non-linearity and possible heteroscedasticity. The Q-Q plot tracks the reference line reasonably well, although observations 323, 326, and 327 diverge in the upper tail, so there are a few unusually large residuals. The Cook’s Distance plot flags three observations that may be influential: 14, 327, and 394.
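The question also asks about high-leverage observations, which the four plots above do not show directly; one way to check (a sketch using the same fitted model) is the Residuals vs Leverage panel and the raw hat values:

plot(lm.fit9C, which = 5)                           # Residuals vs Leverage
head(sort(hatvalues(lm.fit9C), decreasing = TRUE))  # largest leverage values
length(coef(lm.fit9C)) / nrow(Auto)                 # rule-of-thumb average leverage (p + 1)/n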

#Setting plots back to standard configuration
par(mfrow = c(1,1))

9-E)

Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

lm.fit9E1 <- lm(mpg ~ . + cylinders*displacement + displacement*weight + horsepower*displacement +
                  weight*displacement + acceleration*horsepower + origin*displacement - name,
                data = Auto)
summary(lm.fit9E1)
## 
## Call:
## lm(formula = mpg ~ . + cylinders * displacement + displacement * 
##     weight + horsepower * displacement + weight * displacement + 
##     acceleration * horsepower + origin * displacement - name, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3250 -1.5778 -0.0658  1.4758 12.4039 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -6.047e+00  6.922e+00  -0.874  0.38291    
## cylinders                7.586e-01  6.331e-01   1.198  0.23159    
## displacement            -9.031e-02  1.916e-02  -4.712 3.44e-06 ***
## horsepower              -7.047e-02  5.853e-02  -1.204  0.22941    
## weight                  -7.088e-03  1.452e-03  -4.882 1.55e-06 ***
## acceleration             2.107e-01  2.316e-01   0.910  0.36354    
## year                     7.593e-01  4.544e-02  16.710  < 2e-16 ***
## origin                  -5.884e-01  9.558e-01  -0.616  0.53856    
## cylinders:displacement  -9.817e-04  2.827e-03  -0.347  0.72862    
## displacement:weight      1.427e-05  4.753e-06   3.002  0.00286 ** 
## displacement:horsepower  2.524e-04  1.074e-04   2.350  0.01930 *  
## horsepower:acceleration -3.540e-03  2.342e-03  -1.512  0.13148    
## displacement:origin      9.991e-03  8.248e-03   1.211  0.22652    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.886 on 379 degrees of freedom
## Multiple R-squared:  0.8675, Adjusted R-squared:  0.8633 
## F-statistic: 206.7 on 12 and 379 DF,  p-value: < 2.2e-16

Using the correlation matrix as a guide, I chose interaction terms among the pairs of predictors that are most strongly correlated with one another. The summary indicates that most of these interaction effects are not statistically significant, i.e., there is no evidence of an interaction for those terms. However, there is statistically significant evidence for an interaction between displacement and weight, and between displacement and horsepower.
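To test just those two interactions, one could refit with only the candidate terms and compare the nested models with an F-test (a sketch; lm.fit9E2 is a name introduced here, and lm.fit9C is the main-effects model from 9-C):

lm.fit9E2 <- lm(mpg ~ . - name + displacement:weight + displacement:horsepower, data = Auto)
anova(lm.fit9C, lm.fit9E2)   # do the two interactions significantly improve the fit?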

9-F)

Try a few different transformations of the variables, such as log(X), sqrt(X), X^2. Comment on your findings.

lm.fit9F <- lm(mpg~ log(cylinders)+displacement+I(horsepower^2)+weight+sqrt(acceleration)+year+origin, data=Auto)
summary(lm.fit9F)
## 
## Call:
## lm(formula = mpg ~ log(cylinders) + displacement + I(horsepower^2) + 
##     weight + sqrt(acceleration) + year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.8647 -2.1472  0.0056  1.7709 13.0454 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -2.267e+01  5.251e+00  -4.317 2.01e-05 ***
## log(cylinders)     -2.724e+00  1.759e+00  -1.549   0.1223    
## displacement        1.410e-02  7.843e-03   1.797   0.0731 .  
## I(horsepower^2)     7.067e-05  4.387e-05   1.611   0.1080    
## weight             -7.054e-03  6.018e-04 -11.722  < 2e-16 ***
## sqrt(acceleration)  1.701e+00  7.096e-01   2.398   0.0170 *  
## year                7.823e-01  5.053e-02  15.480  < 2e-16 ***
## origin              1.198e+00  2.794e-01   4.286 2.30e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.318 on 384 degrees of freedom
## Multiple R-squared:  0.8225, Adjusted R-squared:  0.8193 
## F-statistic: 254.2 on 7 and 384 DF,  p-value: < 2.2e-16

In this case, after applying assorted transformations to variables that had previously not been statistically significant, we see a change in behavior. The square root of acceleration is now statistically significant; however, none of the other transformed variables has become statistically significant.
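Because this model and the one from 9-C share the same response (mpg) and the same observations, an information criterion gives a quick check on whether the transformations bought anything (a sketch):

AIC(lm.fit9C, lm.fit9F)   # lower AIC indicates the better trade-off between fit and complexity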

par(mfrow=c(2,2))
plot(lm.fit9F, which=1:4)

Examination of the diagnostic plots indicates behavior similar to the previous (untransformed) model: there is still some evidence of heteroscedasticity, and a few points with large Cook’s distances may be influential (14, 327, 387).

lm.fit9F2 <- lm(log(mpg)~.-name, data=Auto)
summary(lm.fit9F2)
## 
## Call:
## lm(formula = log(mpg) ~ . - name, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40955 -0.06533  0.00079  0.06785  0.33925 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.751e+00  1.662e-01  10.533  < 2e-16 ***
## cylinders    -2.795e-02  1.157e-02  -2.415  0.01619 *  
## displacement  6.362e-04  2.690e-04   2.365  0.01852 *  
## horsepower   -1.475e-03  4.935e-04  -2.989  0.00298 ** 
## weight       -2.551e-04  2.334e-05 -10.931  < 2e-16 ***
## acceleration -1.348e-03  3.538e-03  -0.381  0.70339    
## year          2.958e-02  1.824e-03  16.211  < 2e-16 ***
## origin        4.071e-02  9.955e-03   4.089 5.28e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1191 on 384 degrees of freedom
## Multiple R-squared:  0.8795, Adjusted R-squared:  0.8773 
## F-statistic: 400.4 on 7 and 384 DF,  p-value: < 2.2e-16

Just for fun, I transformed the response from mpg to log(mpg) and re-ran the original model. Cylinders and horsepower have now become statistically significant. The model explains about 88% of the variance in log(mpg), so that’s an improvement if we’re aiming for predictive power.

par(mfrow=c(2,2))
plot(lm.fit9F2, which=1:4)

The diagnostic plots look somewhat better: the residuals are more evenly scattered, and the Cook’s distances have been knocked down, with only observation 14 worth a look as potentially influential. The Q-Q plot shows some departure at either end of the reference line, but otherwise the data tracks it well.

10.

This question should be answered using the Carseats data set.

10-A)

Fit a multiple regression model to predict Sales using Price, Urban, and US.

# ?Carseats
# str(Carseats)
lm.fit10A <- lm(Sales ~ Price+Urban+US, data = Carseats)
summary(lm.fit10A)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

10-B)

Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

The null hypothesis for the model is that the model is not useful, i.e., that all coefficients are equal to zero. The alternative hypothesis is that the model is useful and that at least one coefficient is not equal to zero. The p-value of the F-statistic is below the typical 0.05 threshold, so we reject the null hypothesis: the model is useful. It explains approximately 24% of the variance in Sales, which is not much for prediction purposes, but it does establish that a relationship exists between some of the predictors and Sales.

The null hypothesis for each coefficient is that the predictor has no relationship with the response, i.e., that the coefficient is zero; the alternative is that a relationship exists. Going down the list: Price is statistically significant, so there is a “price effect” on Sales. Whether the store is in an urban location (Urban) is not statistically significant. Whether the store is in the US (US) is statistically significant, so there is a categorical “US store effect” on Sales.

For the statistically significant variables, we interpret them as follows:

Sales decrease by 0.054 (x1000 units) for every unit increase in Price, holding all other variables constant. On average, car seats sold in US stores have sales 1.2006 (x1000 units) higher than those sold in stores outside the US, holding all other variables constant.
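The qualitative predictors enter the model as 0/1 dummy variables; the coding R used can be inspected directly (a quick check, assuming Urban and US are factors as in the ISLR Carseats data):

contrasts(Carseats$Urban)   # UrbanYes = 1 when Urban == "Yes", 0 otherwise
contrasts(Carseats$US)      # USYes    = 1 when US == "Yes", 0 otherwise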

10-C)

Write out the model in equation form, being careful to handle the qualitative variables properly.

The equations for the model, considering only the statistically significant terms, may be written as follows:

\(Sales\ (\times 1000) = (13.043469 + 1.200573) - 0.054459 \times Price\) for car seats sold in US stores.

\(Sales\ (\times 1000) = 13.043469 - 0.054459 \times Price\) for car seats sold in stores outside the US.

The equation for the full model, with dummy variables UrbanYes (1 for an urban location, 0 otherwise) and USYes (1 for a US store, 0 otherwise), is:

\(Sales\ (\times 1000) = 13.043469 - 0.054459 \times Price - 0.021916 \times UrbanYes + 1.200573 \times USYes\)
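As a sanity check on the equation, predict() should reproduce the same arithmetic (the Price value of 100 is purely illustrative):

# Urban US store with Price = 100: 13.043469 - 0.021916 + 1.200573 - 0.054459 * 100
predict(lm.fit10A, newdata = data.frame(Price = 100, Urban = "Yes", US = "Yes"))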

10-D)

For which of the predictors can you reject the null hypothesis H0 : βj = 0?

I reject the null hypothesis for Price and the categorical USYes predictor. The reasoning behind this is explained in the answer for 10-B.

10-E)

On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

lm.fit10E <- lm(Sales~Price + US, data = Carseats)
summary(lm.fit10E)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

10-F)

How well do the models in (a) and (e) fit the data?

Both models explain about 24% of the variance in Sales (multiple R-squared 0.2393 for each). The smaller model has a slightly higher adjusted R-squared (0.2354 vs 0.2335) and a slightly smaller residual standard error (2.469 vs 2.472), so dropping Urban costs nothing. Its F-statistic is larger simply because the same explained variance is spread over fewer predictors.
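Since the model in (e) is nested within the model from (a), an anova() comparison makes the point formally; given the UrbanYes p-value of 0.936 above, the dropped term should contribute essentially nothing (a sketch):

anova(lm.fit10E, lm.fit10A)   # F-test for the dropped UrbanYes term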

10-G)

Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

confint(lm.fit10E)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

Confidence intervals are displayed above.

10-H)

Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow=c(2,2))
plot(lm.fit10E, which=1:4)

The Residuals vs Fitted plot shows a random scatter, so the errors appear homoscedastic. Three observations sit at the outer edges of the residuals plot (51, 69, 377) and could be considered outliers. The Q-Q plot looks very good, with the data tracking the reference line. The Cook’s Distance plot shows several points above roughly 0.02 (26, 50, and 368); depending on your threshold for influential points, these may warrant a closer look or removal.
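To address the leverage part of the question more directly, the hat values can be compared against the average leverage (p + 1)/n (a sketch):

lev <- hatvalues(lm.fit10E)
head(sort(lev, decreasing = TRUE))          # observations with the largest leverage
length(coef(lm.fit10E)) / nrow(Carseats)    # average leverage (p + 1)/n = 3/400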

12.

This problem involves simple linear regression without an intercept.

12-A)

Recall that the coefficient estimate βˆ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

From (3.38), the coefficient for the regression of Y onto X without an intercept is \(\hat{\beta} = \sum_i x_i y_i / \sum_i x_i^2\), while for the regression of X onto Y it is \(\sum_i x_i y_i / \sum_i y_i^2\). The numerators are identical, so the two estimates are the same exactly when the denominators agree, i.e., when \(\sum_i x_i^2 = \sum_i y_i^2\).

12-B)

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

par(mfrow=c(1,1))
set.seed(42)
x = rnorm(100)
y = 2*x + rnorm(100)
#plot(x,y)
lm.fit12B1 <- lm(y~x+0)
summary(lm.fit12B1)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9815 -0.5947 -0.0741  0.4498  2.7669 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   2.0245     0.0876   23.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9081 on 99 degrees of freedom
## Multiple R-squared:  0.8436, Adjusted R-squared:  0.8421 
## F-statistic: 534.1 on 1 and 99 DF,  p-value: < 2.2e-16
lm.fit12B2 <- lm(x~y+0)
summary(lm.fit12B2)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.56841 -0.21077  0.06774  0.31614  0.83105 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y  0.41671    0.01803   23.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.412 on 99 degrees of freedom
## Multiple R-squared:  0.8436, Adjusted R-squared:  0.8421 
## F-statistic: 534.1 on 1 and 99 DF,  p-value: < 2.2e-16
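The two slope estimates differ because their denominators differ; checking the sums of squares makes this concrete:

sum(x^2)   # denominator for the regression of y onto x
sum(y^2)   # denominator for the regression of x onto y; not equal, so the slopes differ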

12-C)

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

set.seed(42)
x = rnorm(100)
y = 1*x   # y is an exact copy of x, so sum(x^2) equals sum(y^2)

lm.fit12B1 <- lm(y~x+0)
summary(lm.fit12B1)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -7.281e-15 -5.030e-17 -3.400e-18  4.530e-17  3.022e-16 
## 
## Coefficients:
##    Estimate Std. Error   t value Pr(>|t|)    
## x 1.000e+00  7.156e-17 1.397e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.418e-16 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.953e+32 on 1 and 99 DF,  p-value: < 2.2e-16
lm.fit12B2 <- lm(x~y+0)
summary(lm.fit12B2)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -7.281e-15 -5.030e-17 -3.400e-18  4.530e-17  3.022e-16 
## 
## Coefficients:
##    Estimate Std. Error   t value Pr(>|t|)    
## y 1.000e+00  7.156e-17 1.397e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.418e-16 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.953e+32 on 1 and 99 DF,  p-value: < 2.2e-16
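Any y with the same sum of squares as x would also work; for instance, a random permutation of x gives the two regressions identical slope estimates without y being an exact copy of x (a sketch):

set.seed(42)
x <- rnorm(100)
y <- sample(x)          # same values reshuffled, so sum(y^2) == sum(x^2)
coef(lm(y ~ x + 0))
coef(lm(x ~ y + 0))     # identical slope estimates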