Carefully explain the differences between KNN Classifier and KNN regression methods.
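A KNN classifier predicts a qualitative response: it assigns the class that is most common among the K nearest training observations. KNN regression predicts a quantitative response: it estimates f(x) as the average of the responses of the K nearest training observations.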
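The contrast can be sketched in a few lines of base R. The helper `knn_predict` and the toy data below are hypothetical and written only for illustration; this is a minimal one-dimensional sketch, not how any particular KNN package implements the method.

```r
# Hypothetical helper: predict at a single point x0 from 1-D training data.
knn_predict = function(x_train, y_train, x0, k) {
  # indices of the k training points closest to x0
  nearest = order(abs(x_train - x0))[1:k]
  neighbours = y_train[nearest]
  if (is.factor(neighbours)) {
    # classification: most common class among the neighbours
    names(which.max(table(neighbours)))
  } else {
    # regression: average response of the neighbours
    mean(neighbours)
  }
}

x_train = c(1, 2, 3, 6, 7, 8)
y_class = factor(c("a", "a", "a", "b", "b", "b"))  # qualitative response
y_num   = c(1, 2, 3, 6, 7, 8)                      # quantitative response

class_pred = knn_predict(x_train, y_class, x0 = 2.5, k = 3)  # a class label
reg_pred   = knn_predict(x_train, y_num,   x0 = 2.5, k = 3)  # a number
```

The same neighbours are found in both calls; only the final step differs (a vote versus an average).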
This question involves the use of multiple linear regression on the Auto data set.
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
(a) Produce a scatterplot matrix which includes all of the variables in the data set.
pairs(newAuto)
(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
cor(subset(newAuto, select = -name))
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
lm.fit = lm(mpg~.-name, data = newAuto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = newAuto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
i. Is there a relationship between the predictors and the response?
Yes. The F-statistic of 252.4 on 7 and 384 degrees of freedom has a p-value below 2.2e-16, so we reject the null hypothesis that all coefficients are zero: at least one predictor is related to mpg.
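The overall p-value is printed by summary() but not stored directly; it can be recomputed from the stored F-statistic. The sketch below uses the built-in mtcars data and an arbitrary set of predictors as a stand-in for Auto, so that it runs without extra packages; both choices are assumptions made for illustration.

```r
# mtcars ships with base R and stands in for the Auto data here.
fit = lm(mpg ~ wt + hp + disp, data = mtcars)

# summary() stores the F-statistic as a named vector: value, numdf, dendf
f = summary(fit)$fstatistic

# upper-tail probability of the F distribution gives the overall p-value
p_value = pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
unname(p_value)
```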
ii. Which predictors appear to have a statistically significant relationship to the response?
Displacement, weight, year, and origin have p-values below 0.05 and are therefore statistically significant. Cylinders, horsepower, and acceleration are not significant at the 0.05 level.
iii. What does the coefficient for the year variable suggest?
The coefficient for year is about 0.75, which suggests that, holding the other predictors fixed, mpg increases by about 0.75 for each additional model year. In other words, newer cars tend to have better mpg than older cars.
(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow=c(2,2))
plot(lm.fit)
The fit is not ideal: the diagnostic plots suggest some outliers and possible high-leverage points. Observation 14 stands out and appears to have high leverage; the check below confirms that it has the largest hat value.
which.max(hatvalues(lm.fit))
## 14
## 14
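Beyond finding the single largest hat value, leverage can be compared with its average, which is always (p + 1)/n for a model with p predictors and an intercept. The sketch below again uses mtcars as a stand-in, and the cutoff of twice the average is a common rule of thumb assumed here, not something taken from the output above.

```r
# mtcars stands in for Auto; the model is arbitrary and only illustrative.
fit = lm(mpg ~ wt + hp + disp, data = mtcars)

h = hatvalues(fit)              # leverage of each observation
p = length(coef(fit)) - 1       # number of predictors (intercept excluded)
n = nrow(mtcars)
avg_leverage = (p + 1) / n      # equals mean(h)

# Assumed rule of thumb: flag observations above twice the average leverage.
high = names(h)[h > 2 * avg_leverage]
high
```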
(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
lm.fit2 = lm(mpg~ (.-name)*(.-name), data = newAuto)
summary(lm.fit2)
##
## Call:
## lm(formula = mpg ~ (. - name) * (. - name), data = newAuto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
Several interactions appear to be statistically significant at the 0.05 level: displacement:year, acceleration:year, and acceleration:origin.
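As a reminder of what the two symbols do in a formula: a:b adds only the interaction term, while a*b is shorthand for a + b + a:b. A minimal sketch on the built-in mtcars data (a stand-in chosen so the example runs without extra packages):

```r
# The two formulas below describe the same model.
fit_star  = lm(mpg ~ wt * hp, data = mtcars)
fit_colon = lm(mpg ~ wt + hp + wt:hp, data = mtcars)

names(coef(fit_star))                        # main effects plus wt:hp
all.equal(coef(fit_star), coef(fit_colon))   # same estimates
```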
(f) Try a few different transformations of the variables, such as log(X), square root of X and X squared. Comment on your findings.
lm.fit3 = lm(mpg~log(weight)+sqrt(horsepower) + (acceleration^2), data = newAuto)
summary(lm.fit3)
##
## Call:
## lm(formula = mpg ~ log(weight) + sqrt(horsepower) + (acceleration^2),
## data = newAuto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2221 -2.5623 -0.3921 2.1635 15.6629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 164.6132 10.1059 16.289 < 2e-16 ***
## log(weight) -15.5666 1.7439 -8.927 < 2e-16 ***
## sqrt(horsepower) -1.5212 0.3503 -4.343 1.8e-05 ***
## acceleration -0.1261 0.1244 -1.014 0.311
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.04 on 388 degrees of freedom
## Multiple R-squared: 0.7341, Adjusted R-squared: 0.732
## F-statistic: 357 on 3 and 388 DF, p-value: < 2.2e-16
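The log(weight) and sqrt(horsepower) terms are statistically significant. Note that (acceleration^2) written without I() is interpreted by the formula interface as plain acceleration, which is why the output reports a coefficient for acceleration rather than its square; that term is not significant (p = 0.311). To fit the squared term, it would need to be written as I(acceleration^2).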
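The effect of I() is easy to check on simulated data. The variables below are made up for illustration; only the coefficient names matter.

```r
set.seed(1)
x = runif(50, 1, 10)
y = 3 + 0.5 * x^2 + rnorm(50)   # simulated response, quadratic in x

fit_plain = lm(y ~ (x^2))    # ^ is a formula operator here: reduces to y ~ x
fit_sq    = lm(y ~ I(x^2))   # I() protects the arithmetic: fits x squared

names(coef(fit_plain))
names(coef(fit_sq))
```

The first model reports a coefficient named x, the second one named I(x^2).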
This question should be answered using the Carseats data set.
(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
lm.fit = lm(Sales~Price+Urban+US, data = Carseats)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
(b) Provide an interpretation of each coefficient in the model. Be careful - some of the variables in the model are qualitative!
Price: the very small p-value indicates a relationship between Sales and Price. The coefficient is negative, so, holding Urban and US fixed, each one-unit increase in Price is associated with a decrease of about 0.054 in Sales (Sales is recorded in thousands of units, so roughly 54 fewer units).
UrbanYes: the high p-value (0.936) means there is no evidence of a relationship between Sales and whether a store is in an urban location.
USYes: the very small p-value indicates a relationship between Sales and US. Holding the other predictors fixed, stores in the US sell about 1.2 (roughly 1,200 units) more than stores outside the US.
(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
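Sales = 13.04 - 0.054(Price) - 0.022(UrbanYes) + 1.20(USYes), where UrbanYes = 1 if the store is in an urban location and 0 otherwise, and USYes = 1 if the store is in the US and 0 otherwise.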
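To see how the indicator variables enter the equation, the fitted coefficients can be plugged in by hand. The helper `predict_sales` and the price of 100 are hypothetical, chosen only for illustration; the coefficients are copied from the summary output in (a).

```r
# Coefficients copied from the summary output of the model in (a).
b = c(intercept = 13.043469, price = -0.054459,
      urban_yes = -0.021916, us_yes = 1.200573)

# Hypothetical helper: predicted Sales for one store.
# (urban == "Yes") is TRUE/FALSE, which R treats as 1/0 in arithmetic.
predict_sales = function(price, urban, us) {
  b[["intercept"]] + b[["price"]] * price +
    b[["urban_yes"]] * (urban == "Yes") + b[["us_yes"]] * (us == "Yes")
}

urban_us     = predict_sales(price = 100, urban = "Yes", us = "Yes")
rural_abroad = predict_sales(price = 100, urban = "No",  us = "No")
urban_us - rural_abroad   # sum of the two indicator coefficients
```

At the same price, the two stores differ only by the UrbanYes and USYes coefficients.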
(d) For which of the predictors can you reject the null hypothesis H0 : βj = 0?
For Price and USYes we can reject the null hypothesis, given their very small p-values. We cannot reject it for UrbanYes (p = 0.936).
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
lm.fit2 = lm(Sales~ Price + US, data = Carseats)
summary(lm.fit2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
(f) How well do the models in (a) and (e) fit in the data?
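The two models fit almost equally well: both have an R-squared of 0.2393, and the residual standard error barely changes (2.472 versus 2.469). The smaller model has a slightly higher adjusted R-squared (0.2354 versus 0.2335). In either case, the model explains only about 24% of the variance in Sales.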
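When one model is nested in the other, as here, the comparison can also be made formally with a partial F-test via anova(). The sketch below uses mtcars and arbitrary predictors as a stand-in, so the numbers it prints say nothing about Carseats.

```r
# mtcars stands in for Carseats; reduced is nested inside full.
full    = lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced = lm(mpg ~ wt + hp, data = mtcars)

# Adjusted R-squared penalises the extra predictor; plain R-squared cannot drop.
c(full = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared)

# Partial F-test: does the extra predictor reduce the RSS significantly?
cmp = anova(reduced, full)
cmp
```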
(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(lm.fit2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
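These intervals are the estimate plus or minus a t critical value times the standard error. As a rough check, the sketch below rebuilds two of them from the rounded values printed by summary(lm.fit2), so small differences in the last digits are expected.

```r
# Estimates and standard errors copied from summary(lm.fit2).
est = c(Price = -0.05448, USYes = 1.19964)
se  = c(Price = 0.00523,  USYes = 0.25846)
df  = 397                      # residual degrees of freedom

t_crit = qt(0.975, df)         # two-sided 95% critical value
ci = cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci
```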
(h) Is there evidence of outliers or high leverage observations in the model from (e)?
plot(predict(lm.fit2), rstudent(lm.fit2))
Since the studentized residuals all appear to fall between -3 and 3, there is no evidence of outliers.
(a) Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
The two coefficients are the same when the sum of squares of the observed Y values equals the sum of squares of the observed X values. Both estimates share the same numerator, the sum of x_i * y_i, and differ only in the denominator: the sum of x_i^2 for Y onto X and the sum of y_i^2 for X onto Y.
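This condition can be checked numerically without requiring y to equal x. In the sketch below, y is a random permutation of x (an assumption made purely for illustration), so the two sums of squares match while the variables themselves differ.

```r
set.seed(1)
x = rnorm(100)
y = sample(x)   # same values in a different order, so sum(y^2) == sum(x^2)

# No-intercept least squares estimates written out directly
b_yx = sum(x * y) / sum(x^2)   # Y onto X
b_xy = sum(x * y) / sum(y^2)   # X onto Y

all.equal(b_yx, b_xy)                           # same coefficient
all.equal(unname(coef(lm(y ~ x + 0))), b_yx)    # agrees with lm()
```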
(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate of the regression of Y onto X.
set.seed(1)
x = rnorm(100)
y = 2*x
lm.fit = lm(y~x+0)
lm.fit2 = lm(x~y+0)
summary(lm.fit)
## Warning in summary.lm(lm.fit): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.776e-16 -3.378e-17 2.680e-18 6.113e-17 5.105e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 2.000e+00 1.296e-17 1.543e+17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.167e-16 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.382e+34 on 1 and 99 DF, p-value: < 2.2e-16
summary(lm.fit2)
## Warning in summary.lm(lm.fit2): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.888e-16 -1.689e-17 1.339e-18 3.057e-17 2.552e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 5.00e-01 3.24e-18 1.543e+17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.833e-17 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.382e+34 on 1 and 99 DF, p-value: < 2.2e-16
(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
x = 1:100
sum(x^2)
## [1] 338350
y = 1:100
sum(y^2)
## [1] 338350
fit.Y = lm(y ~ x + 0)
fit.X = lm(x ~ y + 0)
summary(fit.Y)
## Warning in summary.lm(fit.Y): essentially perfect fit: summary may be unreliable
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.082e-13 -2.094e-15 2.900e-17 2.218e-15 1.294e-14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.000e+00 5.379e-17 1.859e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.129e-14 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.457e+32 on 1 and 99 DF, p-value: < 2.2e-16
summary(fit.X)
## Warning in summary.lm(fit.X): essentially perfect fit: summary may be unreliable
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.082e-13 -2.094e-15 2.900e-17 2.218e-15 1.294e-14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 1.000e+00 5.379e-17 1.859e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.129e-14 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.457e+32 on 1 and 99 DF, p-value: < 2.2e-16