Problem 2

Carefully explain the differences between KNN Classifier and KNN regression methods.

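The KNN classifier predicts a qualitative response: it assigns an observation to the class that is most common among its K nearest neighbors. KNN regression predicts a quantitative response: it estimates f(x) as the average of the responses of the K nearest neighbors.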
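The difference can be sketched with a toy one-dimensional example. The helper names (`knn_neighbors`, `knn_classify`, `knn_regress`) and the data are made up for illustration and are not part of the assignment; a real analysis would normally use a package implementation.

```r
# Toy one-dimensional KNN written from scratch, for illustration only.
knn_neighbors <- function(x_train, x0, k) {
  # indices of the k training points closest to the query point x0
  order(abs(x_train - x0))[1:k]
}

knn_classify <- function(x_train, y_train, x0, k) {
  votes <- table(y_train[knn_neighbors(x_train, x0, k)])
  names(votes)[which.max(votes)]  # majority class: qualitative output
}

knn_regress <- function(x_train, y_train, x0, k) {
  mean(y_train[knn_neighbors(x_train, x0, k)])  # average response: quantitative output
}

# Hypothetical training data
x_train <- c(1, 2, 3, 6, 7, 8)
labels  <- c("a", "a", "a", "b", "b", "b")
values  <- c(1.0, 2.0, 3.0, 6.0, 7.0, 8.0)

class_pred <- knn_classify(x_train, labels, x0 = 2.4, k = 3)
reg_pred   <- knn_regress(x_train, values, x0 = 2.4, k = 3)
```

Both functions use the same neighbors; only the final step (vote versus average) differs.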

Problem 9

This question involves the use of multiple linear regression on the Auto data set.

##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

pairs(newAuto)

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

cor(subset(newAuto, select = -name))
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

lm.fit = lm(mpg~.-name, data = newAuto)
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = newAuto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response?

Yes. The F-statistic of 252.4 on 7 and 384 degrees of freedom has a p-value below 2.2e-16, so we reject the null hypothesis that all coefficients are zero. The individual p-values also show that several predictors are significant.
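As a rough check, the overall p-value can be recomputed from the F-statistic and degrees of freedom printed in the summary() output above. The variable names here are only illustrative.

```r
# Values copied from the summary(lm.fit) output above
f_stat   <- 252.4
df_model <- 7
df_resid <- 384

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)
p_value
```

R prints the p-value as "< 2.2e-16" in the summary because the value is smaller than that display threshold.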

ii. Which predictors appear to have a statistically significant relationship to the response?

Displacement, weight, year, and origin have p-values below 0.05 and are statistically significant. Cylinders, horsepower, and acceleration are not significant (p-values of 0.128, 0.220, and 0.415).

iii. What does the coefficient for the year variable suggest?

The coefficient for the year variable suggests that, holding the other predictors fixed, mpg increases by about 0.75 for every additional model year. In other words, newer cars tend to have better mpg than older cars.

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow=c(2,2))
plot(lm.fit)

The fit is not ideal: the diagnostic plots suggest some outliers, and some observations may have high leverage. Observation 14 stands out in the leverage plot, and the check below confirms that it has the largest hat value.

which.max(hatvalues(lm.fit))
## 14 
## 14
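To judge whether a leverage value is unusually high, it can be compared with the average leverage, which equals (p + 1)/n. The sketch below uses simulated data rather than the Auto data, and the 3x cutoff is only a rule of thumb chosen for illustration.

```r
set.seed(1)
n <- 50
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$x1[1] <- 10  # deliberately extreme predictor value
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = sim)
h <- hatvalues(fit)

avg_leverage <- (2 + 1) / n        # (p + 1) / n with p = 2 predictors
high <- which(h > 3 * avg_leverage)  # assumed cutoff, not a formal test
high
```

For the Auto model above, p = 7 and n = 392, so the average leverage would be 8/392.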

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

lm.fit2 = lm(mpg~ (.-name)*(.-name), data = newAuto)
summary(lm.fit2)
## 
## Call:
## lm(formula = mpg ~ (. - name) * (. - name), data = newAuto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16

Several interactions appear to be significant at the 0.05 level: displacement:year, acceleration:year, and acceleration:origin.
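The two formula symbols behave differently: `x1 * x2` expands to both main effects plus their interaction, while `x1:x2` is the interaction term alone. A small sketch on simulated data (not the Auto data) shows this.

```r
set.seed(1)
sim <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
sim$y <- 1 + sim$x1 + sim$x2 + 0.5 * sim$x1 * sim$x2 + rnorm(30)

fit_star  <- lm(y ~ x1 * x2, data = sim)          # main effects + interaction
fit_colon <- lm(y ~ x1 + x2 + x1:x2, data = sim)  # same model, written out
fit_only  <- lm(y ~ x1:x2, data = sim)            # interaction term alone

names(coef(fit_star))
names(coef(fit_only))
```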

(f) Try a few different transformations of the variables, such as log(X), square root of X and X squared. Comment on your findings.

lm.fit3 = lm(mpg~log(weight)+sqrt(horsepower) + (acceleration^2), data = newAuto)
summary(lm.fit3)
## 
## Call:
## lm(formula = mpg ~ log(weight) + sqrt(horsepower) + (acceleration^2), 
##     data = newAuto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2221  -2.5623  -0.3921   2.1635  15.6629 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      164.6132    10.1059  16.289  < 2e-16 ***
## log(weight)      -15.5666     1.7439  -8.927  < 2e-16 ***
## sqrt(horsepower)  -1.5212     0.3503  -4.343  1.8e-05 ***
## acceleration      -0.1261     0.1244  -1.014    0.311    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.04 on 388 degrees of freedom
## Multiple R-squared:  0.7341, Adjusted R-squared:  0.732 
## F-statistic:   357 on 3 and 388 DF,  p-value: < 2.2e-16

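The log(weight) and sqrt(horsepower) terms are statistically significant. The third term is not (p = 0.311), and the output labels it acceleration rather than acceleration^2: inside a formula, ^ is a formula operator, so the squared term would have to be written as I(acceleration^2). This model also has a lower R-squared (0.7341) than the model in (c) (0.8215), though it uses only three predictors.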
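The effect of wrapping a power in `I()` can be seen on simulated data (not the Auto data). Without `I()`, the formula term `(x^2)` is fit as plain `x`.

```r
set.seed(1)
sim <- data.frame(x = runif(40, 1, 5))
sim$y <- 3 + 0.5 * sim$x^2 + rnorm(40, sd = 0.2)

fit_plain   <- lm(y ~ (x^2), data = sim)   # ^ is a formula operator: fits x
fit_squared <- lm(y ~ I(x^2), data = sim)  # I() protects the arithmetic: fits x squared

names(coef(fit_plain))
names(coef(fit_squared))
```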

Problem 10

This question should be answered using the Carseats data set.

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

lm.fit = lm(Sales~Price+Urban+US, data = Carseats)
summary(lm.fit)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Provide an interpretation of each coefficient in the model. Be careful - some of the variables in the model are qualitative!

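Price: the coefficient is -0.054, so each one-unit increase in Price is associated with a decrease of about 0.054 in Sales (roughly 54 units, since Sales is recorded in thousands), holding Urban and US fixed. The p-value is extremely small, so there is strong evidence of this relationship.

UrbanYes: the coefficient is -0.022, meaning urban stores are estimated to sell slightly less than non-urban stores. With a p-value of 0.936, however, there is no evidence of a relationship between Sales and Urban.

USYes: the coefficient is 1.20, so stores in the US sell about 1.2 more (roughly 1,200 units) on average than stores outside the US, holding the other predictors fixed. The p-value is extremely small, so there is strong evidence of this relationship.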

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

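Sales = 13.04 - 0.054(Price) - 0.022(UrbanYes) + 1.20(USYes), where UrbanYes is 1 if the store is in an urban location and 0 otherwise, and USYes is 1 if the store is in the US and 0 otherwise.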
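The equation can be evaluated for each combination of the two qualitative variables. The function name `predict_sales` and the price of 100 are made up for illustration; the coefficients are copied from the summary() output in (a).

```r
# Coefficients copied from summary(lm.fit) in part (a)
predict_sales <- function(price, urban, us) {
  13.043469 - 0.054459 * price -
    0.021916 * (urban == "Yes") +  # dummy variable: 1 if urban, 0 otherwise
    1.200573 * (us == "Yes")       # dummy variable: 1 if in the US, 0 otherwise
}

combos <- expand.grid(urban = c("No", "Yes"), us = c("No", "Yes"),
                      stringsAsFactors = FALSE)
combos$sales <- predict_sales(100, combos$urban, combos$us)
combos
```

The four rows differ only through the dummy variables, which shift the intercept while the Price slope stays the same.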

(d) For which of the predictors can you reject the null hypothesis H0: βj = 0?

We can reject the null hypothesis for the Price and USYes predictors, given their small p-values.

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

lm.fit2 = lm(Sales~ Price + US, data = Carseats)
summary(lm.fit2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data?

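The two models fit almost identically. Both have a multiple R-squared of 0.2393, so each explains only about 24% of the variance in Sales. The smaller model in (e) is marginally better, with a slightly lower residual standard error (2.469 versus 2.472) and a slightly higher adjusted R-squared (0.2354 versus 0.2335).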
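The adjusted R-squared values can be approximately reproduced from the reported R-squared, which shows why dropping Urban helps slightly. This is a sketch using the rounded R-squared of 0.2393 and n = 400 (396 residual degrees of freedom plus 4 coefficients), so the last digit may differ from the printed output.

```r
# Adjusted R-squared penalizes the number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 400
adj_a <- adj_r2(0.2393, n, p = 3)  # model (a): Price, Urban, US
adj_e <- adj_r2(0.2393, n, p = 2)  # model (e): Price, US
c(model_a = adj_a, model_e = adj_e)
```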

(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint(lm.fit2)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
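These intervals are the estimate plus or minus a t critical value times the standard error. The sketch below rebuilds them from the rounded estimates and standard errors in the summary() output of (e), so small differences in the last digits are expected.

```r
# Estimates and standard errors copied from summary(lm.fit2)
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price = 0.00523,  USYes = 0.25846)

t_crit <- qt(0.975, df = 397)  # 397 residual degrees of freedom
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci
```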

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

plot(predict(lm.fit2), rstudent(lm.fit2))

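All studentized residuals appear to fall between -3 and 3, so there is no evidence of outliers. This plot does not show leverage; checking for high leverage observations would additionally require the hat values, for example from hatvalues(lm.fit2).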

Problem 12

(a) Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

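The two estimates are the same when the sum of squares of the observed values of Y is equal to the sum of squares of the observed values of X, because the two estimates share the same numerator (the sum of x_i * y_i) and differ only in the denominator.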
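Written out, the reasoning is as follows. The subscripts on the two estimates are notation introduced here for illustration; the first formula is the no-intercept estimate referred to as (3.38).

```latex
\hat{\beta}_{Y \mid X} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},
\qquad
\hat{\beta}_{X \mid Y} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2}
\quad\Longrightarrow\quad
\hat{\beta}_{Y \mid X} = \hat{\beta}_{X \mid Y}
\iff
\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2
\ \text{ or } \
\sum_{i=1}^{n} x_i y_i = 0 .
```

The second case, where the cross-product sum is zero, is the trivial one in which both estimates are zero.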

(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate of the regression of Y onto X.

set.seed(1)
x = rnorm(100)
y = 2*x
lm.fit = lm(y~x+0)
lm.fit2 = lm(x~y+0)
summary(lm.fit)
## Warning in summary.lm(lm.fit): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.776e-16 -3.378e-17  2.680e-18  6.113e-17  5.105e-16 
## 
## Coefficients:
##    Estimate Std. Error   t value Pr(>|t|)    
## x 2.000e+00  1.296e-17 1.543e+17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.167e-16 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.382e+34 on 1 and 99 DF,  p-value: < 2.2e-16
summary(lm.fit2)
## Warning in summary.lm(lm.fit2): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.888e-16 -1.689e-17  1.339e-18  3.057e-17  2.552e-16 
## 
## Coefficients:
##   Estimate Std. Error   t value Pr(>|t|)    
## y 5.00e-01   3.24e-18 1.543e+17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.833e-17 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.382e+34 on 1 and 99 DF,  p-value: < 2.2e-16

(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

x = 1:100
sum(x^2)
## [1] 338350
y = 1:100
sum(y^2)
## [1] 338350
fit.Y = lm(y ~ x + 0)
fit.X = lm(x ~ y + 0)
summary(fit.Y)
## Warning in summary.lm(fit.Y): essentially perfect fit: summary may be unreliable
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.082e-13 -2.094e-15  2.900e-17  2.218e-15  1.294e-14 
## 
## Coefficients:
##    Estimate Std. Error   t value Pr(>|t|)    
## x 1.000e+00  5.379e-17 1.859e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.129e-14 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 3.457e+32 on 1 and 99 DF,  p-value: < 2.2e-16
summary(fit.X)
## Warning in summary.lm(fit.X): essentially perfect fit: summary may be unreliable
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.082e-13 -2.094e-15  2.900e-17  2.218e-15  1.294e-14 
## 
## Coefficients:
##    Estimate Std. Error   t value Pr(>|t|)    
## y 1.000e+00  5.379e-17 1.859e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.129e-14 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 3.457e+32 on 1 and 99 DF,  p-value: < 2.2e-16
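Setting y equal to x is the simplest case. A less trivial sketch, offered here as an alternative and not part of the original answer, takes y to be a random permutation of x: the sums of squares still match, so the two coefficient estimates agree even though the fit is no longer perfect.

```r
set.seed(1)
x <- 1:100
y <- sample(x)  # same values in a different order, so sum(y^2) equals sum(x^2)

b_yx <- unname(coef(lm(y ~ x + 0)))  # regression of y onto x, no intercept
b_xy <- unname(coef(lm(x ~ y + 0)))  # regression of x onto y, no intercept
c(b_yx = b_yx, b_xy = b_xy)
```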