Carefully explain the differences between the KNN classifier and KNN regression methods.
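The KNN classifier is used when the response is qualitative: for a test point it estimates the conditional probability of each class as the fraction of the K nearest neighbors that belong to that class, and it predicts the class with the highest estimated probability (Bayes rule). KNN regression is used when the response is quantitative: it predicts the response at a test point by averaging the responses of the K nearest neighbors.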
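The contrast can be sketched in a few lines of base R. This is only an illustration on made-up data with a single numeric predictor; the helper names knn_neighbors(), knn_classify() and knn_regress() are hypothetical and are not part of any package used in these solutions.

```r
# Minimal base-R sketch of KNN with one numeric predictor (toy data, hypothetical helpers).
knn_neighbors <- function(x_train, x0, k) {
  order(abs(x_train - x0))[seq_len(k)]   # indices of the k closest training points
}

knn_classify <- function(x_train, y_train, x0, k) {
  idx <- knn_neighbors(x_train, x0, k)
  probs <- table(y_train[idx]) / k        # estimated class probabilities among neighbors
  names(probs)[which.max(probs)]          # predict the most probable class
}

knn_regress <- function(x_train, y_train, x0, k) {
  idx <- knn_neighbors(x_train, x0, k)
  mean(y_train[idx])                      # predict the average response of the neighbors
}

x_train     <- c(1, 2, 3, 6, 7, 8)
class_train <- c("low", "low", "low", "high", "high", "high")  # qualitative response
value_train <- c(10, 12, 11, 30, 32, 31)                       # quantitative response

pred_class <- knn_classify(x_train, class_train, x0 = 2.5, k = 3)
pred_value <- knn_regress(x_train, value_train, x0 = 2.5, k = 3)
```

Both functions share the same neighbor search; only the final step differs (a vote over classes versus an average of values).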
This question involves the use of multiple linear regression on the Auto data set.
library(ISLR)
attach(Auto)
pairs(Auto, panel = panel.smooth)
#all columns but name
cor(Auto[,1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
lm.auto <- lm(mpg ~.-name, data = Auto)
summary(lm.auto)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Comment on the output. For instance: i. Is there a relationship between the predictors and the response? ii. Which predictors appear to have a statistically significant relationship to the response? iii. What does the coefficient for the year variable suggest?
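We will use a significance level of 0.05. The response variable Y is mpg, and the predictor variables are every variable except name. The overall F-statistic is 252.4 on 7 and 384 degrees of freedom with a p-value below 2.2e-16, so we reject the hypothesis that all coefficients are zero: there is a relationship between the predictors and the response.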
We check the individual significance by the t-test.
The p-values from the coefficient table are:
cylinders: 0.12780
displacement: 0.00844
horsepower: 0.21963
weight: < 2e-16
acceleration: 0.41548
year: < 2e-16
origin: 4.67e-07
With these p-values we test the following hypotheses for each predictor j:
H0: βj = 0 (no linear relationship with mpg)
Ha: βj ≠ 0 (linear relationship with mpg)
For p-values less than 0.05, we reject the null hypothesis in favor of the alternative. Displacement, weight, year, and origin have a statistically significant linear relationship with mpg; cylinders, horsepower, and acceleration do not.
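As a check on how these p-values arise, the two-sided p-value can be recomputed from a t value and the residual degrees of freedom. The sketch below uses the cylinders row of the summary output (t = -1.526, 384 degrees of freedom); the variable names are illustrative only.

```r
# Two-sided p-value for a t statistic, using values copied from the summary output.
t_cyl    <- -1.526   # t value for cylinders
df_resid <- 384      # residual degrees of freedom

p_cyl <- 2 * pt(-abs(t_cyl), df = df_resid)
round(p_cyl, 4)      # close to the reported 0.12780
p_cyl < 0.05         # FALSE, so we fail to reject H0 for cylinders
```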
The coefficient for year suggests that, with all other predictors held constant, mpg increases by about 0.75 on average for each additional model year. In other words, newer cars tend to be more fuel efficient.
Next we look at R^2. Our model has a multiple R^2 of 0.8215 and an adjusted R^2 of 0.8182. This means that the model with all variables except name explains about 82% of the variation in mpg.
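The adjusted R-squared and the F-statistic reported in the summary can both be recovered from the multiple R-squared, the number of observations, and the number of predictors. The sketch below assumes n = 392 (384 residual degrees of freedom plus 7 predictors plus the intercept) and takes the rounded R-squared of 0.8215 from the output, so the results match the summary only up to rounding.

```r
r2 <- 0.8215   # multiple R-squared from the summary output
n  <- 392      # observations: 384 residual df + 7 predictors + 1 intercept
p  <- 7        # number of predictors

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)   # adjusted R-squared
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))    # overall F-statistic

round(adj_r2, 4)   # 0.8182, as reported
round(f_stat, 1)   # near the reported 252.4; the small gap comes from the rounded R-squared
```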
par(mfrow=c(2,2))
plot(lm.auto)
Our Q-Q plot and Scale-Location plot support the normality assumption. The points in the Q-Q plot fall almost on a straight line, although we do see some outliers in the upper tail. In the Scale-Location plot, the square roots of the absolute standardized residuals all lie between 0.0 and 2.0.
We use the residual plot to check for homoscedasticity (the equal variance assumption on Y at each given X = x). We see no special pattern in the residual plot, so there isn’t strong evidence of unequal variance.
Now we check the Residuals vs Leverage plot. Three points lie at a much greater distance than the other points, and observation 14 has high leverage.
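The visual checks can be complemented with numbers. The sketch below uses the built-in mtcars data as a stand-in (it is not the Auto fit), and the cutoffs of twice the average leverage and an absolute studentized residual above 3 are common rules of thumb assumed here, not thresholds taken from this analysis.

```r
# Stand-in example on the built-in mtcars data, not the Auto model.
fit <- lm(mpg ~ wt + hp, data = mtcars)

lev     <- hatvalues(fit)                       # leverage of each observation
avg_lev <- length(coef(fit)) / nrow(mtcars)     # average leverage, (p + 1) / n
stud    <- rstudent(fit)                        # studentized residuals

high_leverage     <- names(lev)[lev > 2 * avg_lev]   # assumed rule of thumb: 2x average
possible_outliers <- names(stud)[abs(stud) > 3]      # assumed rule of thumb: |residual| > 3
```

The same calls could be applied to lm.auto to list the observations that stand out in the Residuals vs Leverage plot.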
interaction_1 <- lm(mpg ~ . - name + weight*acceleration,data=Auto)
summary(interaction_1)
##
## Call:
## lm(formula = mpg ~ . - name + weight * acceleration, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.247 -2.048 -0.045 1.619 12.193
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.364e+01 5.811e+00 -7.511 4.18e-13 ***
## cylinders -2.141e-01 3.078e-01 -0.696 0.487117
## displacement 3.138e-03 7.495e-03 0.419 0.675622
## horsepower -4.141e-02 1.348e-02 -3.071 0.002287 **
## weight 4.027e-03 1.636e-03 2.462 0.014268 *
## acceleration 1.629e+00 2.422e-01 6.726 6.36e-11 ***
## year 7.821e-01 4.833e-02 16.184 < 2e-16 ***
## origin 1.033e+00 2.686e-01 3.846 0.000141 ***
## weight:acceleration -5.826e-04 8.408e-05 -6.928 1.81e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.141 on 383 degrees of freedom
## Multiple R-squared: 0.8414, Adjusted R-squared: 0.838
## F-statistic: 253.9 on 8 and 383 DF, p-value: < 2.2e-16
The p-value for the interaction term, weight:acceleration, is small (1.81e-11), indicating that there is strong evidence for Ha : β ≠ 0. In other words, the relationship does not appear to be purely additive.
The multiple R^2 for this model is 84.14% (adjusted 83.8%), compared with 82.15% (adjusted 81.82%) for the previous model without an interaction term. This means that (84.14 − 82.15) / (100 − 82.15) = 11% of the variability in mpg that remains after fitting the additive model is explained by the interaction term.
interaction_2 <-lm(mpg ~.-name+displacement:weight, data = Auto)
summary(interaction_2)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9027 -1.8092 -0.0946 1.5549 12.1687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.389e+00 4.301e+00 -1.253 0.2109
## cylinders 1.175e-01 2.943e-01 0.399 0.6899
## displacement -6.837e-02 1.104e-02 -6.193 1.52e-09 ***
## horsepower -3.280e-02 1.238e-02 -2.649 0.0084 **
## weight -1.064e-02 7.136e-04 -14.915 < 2e-16 ***
## acceleration 6.724e-02 8.805e-02 0.764 0.4455
## year 7.852e-01 4.553e-02 17.246 < 2e-16 ***
## origin 5.610e-01 2.622e-01 2.139 0.0331 *
## displacement:weight 2.269e-05 2.257e-06 10.054 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.964 on 383 degrees of freedom
## Multiple R-squared: 0.8588, Adjusted R-squared: 0.8558
## F-statistic: 291.1 on 8 and 383 DF, p-value: < 2.2e-16
The p-value for the interaction term, displacement:weight, is small (< 2e-16), indicating that there is strong evidence for Ha : β ≠ 0. In other words, the relationship does not appear to be purely additive.
The multiple R^2 for this model is 85.88% (adjusted 85.58%), compared with 82.15% (adjusted 81.82%) for the model without an interaction term. This means that (85.88 − 82.15) / (100 − 82.15) = 21% of the variability in mpg that remains after fitting the additive model is explained by the interaction term.
interaction_3 <-lm(mpg ~.-name+year:weight, data = Auto)
summary(interaction_3)
##
## Call:
## lm(formula = mpg ~ . - name + year:weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9995 -1.8495 -0.1559 1.6061 11.7042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.186e+02 1.338e+01 -8.864 < 2e-16 ***
## cylinders -1.218e-01 3.032e-01 -0.402 0.6881
## displacement 1.293e-02 7.019e-03 1.842 0.0663 .
## horsepower -2.877e-02 1.286e-02 -2.236 0.0259 *
## weight 3.044e-02 4.652e-03 6.543 1.94e-10 ***
## acceleration 1.447e-01 9.196e-02 1.574 0.1164
## year 2.084e+00 1.732e-01 12.033 < 2e-16 ***
## origin 1.174e+00 2.597e-01 4.519 8.30e-06 ***
## weight:year -4.879e-04 6.097e-05 -8.002 1.47e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.084 on 383 degrees of freedom
## Multiple R-squared: 0.847, Adjusted R-squared: 0.8439
## F-statistic: 265.1 on 8 and 383 DF, p-value: < 2.2e-16
The p-value for the interaction term, weight:year, is small (1.47e-14), indicating that there is strong evidence for Ha : β ≠ 0. In other words, the relationship does not appear to be purely additive.
The multiple R^2 for this model is 84.70% (adjusted 84.39%), compared with 82.15% (adjusted 81.82%) for the model without an interaction term. This means that (84.70 − 82.15) / (100 − 82.15) = 14% of the variability in mpg that remains after fitting the additive model is explained by the interaction term.
interaction_4 <-lm(mpg ~.-name+horsepower:origin, data = Auto)
summary(interaction_4)
##
## Call:
## lm(formula = mpg ~ . - name + horsepower:origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.277 -1.875 -0.225 1.570 12.080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.196e+01 4.396e+00 -4.996 8.94e-07 ***
## cylinders -5.275e-01 3.028e-01 -1.742 0.0823 .
## displacement -1.486e-03 7.607e-03 -0.195 0.8452
## horsepower 8.173e-02 1.856e-02 4.404 1.38e-05 ***
## weight -4.710e-03 6.555e-04 -7.186 3.52e-12 ***
## acceleration -1.124e-01 9.617e-02 -1.168 0.2434
## year 7.327e-01 4.780e-02 15.328 < 2e-16 ***
## origin 7.695e+00 8.858e-01 8.687 < 2e-16 ***
## horsepower:origin -7.955e-02 1.074e-02 -7.405 8.44e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.116 on 383 degrees of freedom
## Multiple R-squared: 0.8438, Adjusted R-squared: 0.8406
## F-statistic: 258.7 on 8 and 383 DF, p-value: < 2.2e-16
The p-value for the interaction term, horsepower:origin, is small (8.44e-13), indicating that there is strong evidence for Ha : β ≠ 0. In other words, the relationship does not appear to be purely additive.
The multiple R^2 for this model is 84.38% (adjusted 84.06%), compared with 82.15% (adjusted 81.82%) for the model without an interaction term. This means that (84.38 − 82.15) / (100 − 82.15) = 12% of the variability in mpg that remains after fitting the additive model is explained by the interaction term.
log_lm <- lm(mpg ~ . -name + log(acceleration), data=Auto)
summary(log_lm)
##
## Call:
## lm(formula = mpg ~ . - name + log(acceleration), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7931 -2.0052 -0.1279 1.9299 13.1085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.552e+01 1.479e+01 3.077 0.00224 **
## cylinders -2.796e-01 3.193e-01 -0.876 0.38172
## displacement 8.042e-03 7.805e-03 1.030 0.30344
## horsepower -3.434e-02 1.401e-02 -2.450 0.01473 *
## weight -5.343e-03 6.854e-04 -7.795 6.15e-14 ***
## acceleration 2.167e+00 4.782e-01 4.532 7.82e-06 ***
## year 7.560e-01 4.978e-02 15.186 < 2e-16 ***
## origin 1.329e+00 2.724e-01 4.877 1.58e-06 ***
## log(acceleration) -3.513e+01 7.886e+00 -4.455 1.10e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.249 on 383 degrees of freedom
## Multiple R-squared: 0.8303, Adjusted R-squared: 0.8267
## F-statistic: 234.2 on 8 and 383 DF, p-value: < 2.2e-16
The p-value for the log of acceleration is small (1.10e-05), indicating that there is strong evidence for Ha : β ≠ 0. The multiple R^2 for this model is 83.03% (adjusted 82.67%), compared with 82.15% (adjusted 81.82%) for the model without the log term. This means that (83.03 − 82.15) / (100 − 82.15) = 5% of the variability in mpg that remains after fitting the original model is explained by adding the log of acceleration.
# quadratic (squared) term for cylinders
sq_lm <- lm(mpg ~ . -name + I(cylinders^2), data=Auto)
summary(sq_lm)
##
## Call:
## lm(formula = mpg ~ . - name + I(cylinders^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.426 -2.028 -0.161 1.717 12.876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.0178890 5.7290872 -0.352 0.72487
## cylinders -5.8179557 1.2643565 -4.602 5.71e-06 ***
## displacement 0.0197886 0.0073457 2.694 0.00737 **
## horsepower -0.0312646 0.0138721 -2.254 0.02478 *
## weight -0.0062906 0.0006387 -9.848 < 2e-16 ***
## acceleration 0.1048520 0.0967778 1.083 0.27930
## year 0.7453135 0.0498398 14.954 < 2e-16 ***
## origin 1.2279200 0.2756596 4.454 1.11e-05 ***
## I(cylinders^2) 0.4644689 0.1067911 4.349 1.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.253 on 383 degrees of freedom
## Multiple R-squared: 0.8299, Adjusted R-squared: 0.8263
## F-statistic: 233.5 on 8 and 383 DF, p-value: < 2.2e-16
The p-value for cylinders squared is small (1.75e-05), indicating that there is strong evidence for Ha : β ≠ 0.
The multiple R^2 for this model is 82.99% (adjusted 82.63%), compared with 82.15% (adjusted 81.82%) for the model without the squared term. This means that (82.99 − 82.15) / (100 − 82.15) = 5% of the variability in mpg that remains after fitting the original model is explained by the squared term.
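The share of the variability left unexplained by the additive model that each extra term accounts for is (R^2 new − R^2 base) / (1 − R^2 base). The sketch below tabulates this for the six extended models, using the multiple R-squared values copied from the summary outputs; the vector names are illustrative only.

```r
r2_base <- 0.8215   # additive model (all predictors except name), multiple R-squared

# Multiple R-squared of each extended model, copied from the summaries above.
r2_new <- c(
  "weight:acceleration" = 0.8414,
  "displacement:weight" = 0.8588,
  "weight:year"         = 0.8470,
  "horsepower:origin"   = 0.8438,
  "log(acceleration)"   = 0.8303,
  "I(cylinders^2)"      = 0.8299
)

# Fraction of the previously unexplained variability explained by the added term.
share <- (r2_new - r2_base) / (1 - r2_base)
round(100 * share, 1)
```

On this measure the displacement:weight interaction gives the largest improvement and the two single-variable transformations give the smallest.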
This question should be answered using the Carseats data set.
attach(Carseats)
lm.carseats <- lm(Sales~Price+Urban+US, data = Carseats)
summary(lm.carseats)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
The coefficients and their corresponding p-values are:
Price: < 2e-16
UrbanYes: 0.936
USYes: 4.86e-06
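With these p-values we test the following hypotheses for each coefficient:
H0: βj = 0 (no linear relationship with Sales)
Ha: βj ≠ 0 (linear relationship with Sales)
For p-values less than 0.05, we reject the null hypothesis in favor of the alternative.
Price has a significant linear relationship with Sales: holding the other predictors fixed, a one-unit increase in price is associated with an average decrease of about 0.054 in Sales.
For Urban, there is no significant linear relationship between a store being in an urban location and Sales.
For US, there is a significant relationship between a store being located in the US and Sales: holding the other predictors fixed, Sales is about 1.2 higher on average for US stores than for non-US stores.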
Our estimated regression line is:
y hat = 13.043469 - 0.054459 (Price) - 0.021916 (UrbanYes) + 1.200573 (USYes)
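To see how the dummy variables enter the fitted equation, the sketch below evaluates it for two stores that differ only in US location. The coefficients are copied from the equation above; predict_sales() is a hypothetical helper, and the price of 100 is an arbitrary illustrative value.

```r
# Coefficients copied from the estimated regression line.
b <- c(intercept = 13.043469, price = -0.054459,
       urban_yes = -0.021916, us_yes = 1.200573)

# Hypothetical helper: urban and us are TRUE/FALSE, coded 1/0 as R does for Yes/No dummies.
predict_sales <- function(price, urban, us) {
  unname(b["intercept"] + b["price"] * price +
           b["urban_yes"] * as.numeric(urban) + b["us_yes"] * as.numeric(us))
}

us_store     <- predict_sales(price = 100, urban = FALSE, us = TRUE)
non_us_store <- predict_sales(price = 100, urban = FALSE, us = FALSE)

us_store - non_us_store   # equals the USYes coefficient, whatever the price
```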
For which of the predictors can you reject the null hypothesis H0 : βj = 0?
We can reject the null hypothesis for Price and USYes.
On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
lm.carseats2 <- lm(Sales ~ Price + US, data = Carseats)
summary(lm.carseats2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Our estimated regression line from model (a) is:
y hat = 13.043469 - 0.054459 (Price) - 0.021916 (UrbanYes) + 1.200573 (USYes)
Our estimated regression line from model (e) is:
y hat = 13.03079 - 0.05448 (Price) + 1.19964 (USYes)
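The Price slope has changed only slightly from model (a) to model (e). In model (a), average sales decrease by 0.054459 for a one-unit increase in price; in model (e), they decrease by 0.05448. The two estimates are almost equal. Both models have a multiple R^2 of 0.2393, so each explains only about 24% of the variability in Sales; model (e) has a slightly higher adjusted R^2 (0.2354 versus 0.2335) and a slightly lower residual standard error (2.469 versus 2.472).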
confint(lm.carseats2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
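The interval for Price can be reproduced by hand as estimate ± t quantile × standard error. The sketch below uses the rounded estimate and standard error from the summary of the smaller model, so it agrees with confint() only to about four decimal places; the variable names are illustrative.

```r
est      <- -0.05448   # Price estimate from the summary of the smaller model
se       <- 0.00523    # its standard error
df_resid <- 397        # residual degrees of freedom

t_crit   <- qt(0.975, df = df_resid)          # two-sided 95% critical value
ci_price <- est + c(-1, 1) * t_crit * se      # lower and upper limits
round(ci_price, 4)
```

Since the interval excludes zero, it agrees with the t-test that rejects H0 for Price.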
par(mfrow = c(2, 2))
plot(lm.carseats2)
Observing the plots above, we see some evidence of outliers and high-leverage observations. Points that lie far from the center of the cloud in the horizontal direction tend to pull harder on the fitted line, so we call them points with high leverage.
This problem involves simple linear regression without an intercept.
The regression of Y onto X without an intercept gives the prediction \(\hat{y} = \hat{\beta} x\) on the basis of X = x, where
\[\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i'=1}^{n} x_{i'}^2}\]
The regression of X onto Y without an intercept gives \(\hat{x} = \hat{\beta}' y\), where
\[\hat{\beta}' = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i'=1}^{n} y_{i'}^2}\]
The two estimates share the same numerator, so the coefficients are the same if
\[\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2\]
(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(1)
x=rnorm(100)
sum(x^2)
## [1] 81.05509
y <- 3 * x + rnorm(100)
sum(y^2)
## [1] 817.4962
fit_y <- lm(y ~ x + 0)
summary(fit_y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9154 -0.6472 -0.1771 0.5056 2.3109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 2.9939 0.1065 28.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared: 0.8887, Adjusted R-squared: 0.8876
## F-statistic: 790.6 on 1 and 99 DF, p-value: < 2.2e-16
fit_x <- lm(x ~ y + 0)
summary(fit_x)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.63420 -0.16066 0.07099 0.18507 0.59841
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.29684 0.01056 28.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3019 on 99 degrees of freedom
## Multiple R-squared: 0.8887, Adjusted R-squared: 0.8876
## F-statistic: 790.6 on 1 and 99 DF, p-value: < 2.2e-16
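The two estimates can also be checked against the closed-form expressions for regression without an intercept. The sketch below regenerates the same data with the same seed and compares each ratio of sums with the corresponding lm() coefficient; the names beta_y_on_x and beta_x_on_y are illustrative only.

```r
set.seed(1)
x <- rnorm(100)
y <- 3 * x + rnorm(100)

beta_y_on_x <- sum(x * y) / sum(x^2)   # regression of y onto x, no intercept
beta_x_on_y <- sum(x * y) / sum(y^2)   # regression of x onto y, no intercept

# Both comparisons should return TRUE.
all.equal(beta_y_on_x, unname(coef(lm(y ~ x + 0))))
all.equal(beta_x_on_y, unname(coef(lm(x ~ y + 0))))
```

The numerators are identical, so the estimates differ only because sum(x^2) and sum(y^2) differ.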
set.seed(1)
x_2=rnorm(100)
sum((x_2)^2)
## [1] 81.05509
set.seed(1)
y_2=rnorm(100)
sum((y_2)^2)
## [1] 81.05509
fit.Y <- lm(y_2 ~ x_2 + 0)
fit.X <- lm(x_2 ~ y_2 + 0)
coef(fit.Y)
coef(fit.X)
Because the seed is reset before each draw, x_2 and y_2 are identical. The sum of x_2 * y_2 therefore equals both sums of squares (81.05509), so the coefficient estimate is 1 in each regression and the two regressions give the same coefficient estimate.
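The condition that the two sums of squares are equal does not require the two vectors to be identical. As a further sketch, taking y to be a random permutation of x keeps the sum of squares the same while changing the pairing of the observations; the names x_p and y_p are illustrative only.

```r
set.seed(1)
x_p <- rnorm(100)
y_p <- sample(x_p)    # same values in a different order, so sum(y_p^2) matches sum(x_p^2)

b_y_on_x <- unname(coef(lm(y_p ~ x_p + 0)))
b_x_on_y <- unname(coef(lm(x_p ~ y_p + 0)))

all.equal(b_y_on_x, b_x_on_y)   # the two coefficient estimates agree
```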