The KNN classifier finds the K training observations nearest to a test observation and assigns the class that is most common among those neighbors.
KNN regression uses the same neighborhood idea for a quantitative response: to predict Y at a given X, it takes the K training points closest to X and averages their Y values.
In short, the classifier produces a class label for a test observation, while the regression produces a numeric prediction of Y.
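As a minimal sketch of the distinction, the base R code below applies both methods to a small one-dimensional training set. The data, the helper names `nearest`, `knn_classify`, and `knn_regress`, and the choice K = 3 are made up for illustration and are not part of the exercise.

```r
# Toy training data (made up for illustration)
train_x     <- c(1, 2, 3, 6, 7, 8)
train_class <- c("a", "a", "a", "b", "b", "b")   # labels for classification
train_y     <- c(1.0, 2.1, 2.9, 6.2, 6.8, 8.1)   # responses for regression

# Indices of the k training points closest to x0
nearest <- function(x0, k) order(abs(train_x - x0))[1:k]

# KNN classifier: majority vote among the k nearest neighbors
knn_classify <- function(x0, k) {
  votes <- table(train_class[nearest(x0, k)])
  names(votes)[which.max(votes)]
}

# KNN regression: average response of the k nearest neighbors
knn_regress <- function(x0, k) mean(train_y[nearest(x0, k)])

label <- knn_classify(2.5, k = 3)   # a class label
value <- knn_regress(2.5, k = 3)    # a numeric prediction
```

Both functions share the same neighbor search; only the final step (vote versus average) differs.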
See below.
See below.
The residuals-versus-fitted plot looks slightly curved, indicating that the relationship may be non-linear (possibly quadratic). We also see a few large outliers. In the residuals-versus-leverage plot, no point falls outside the Cook's distance contours, so no single observation is strongly distorting the fit.
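One way to back up a visual read of the leverage plot is to compute Cook's distance and leverage directly. The sketch below uses the built-in `mtcars` data as a stand-in so that it runs without the course data. The model formula is an arbitrary choice, and the 0.5 cutoff is one of the contour levels that `plot.lm` draws by default, used here only as a rule of thumb.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in model, not the Auto fit
cd  <- cooks.distance(fit)                # one value per observation
lev <- hatvalues(fit)                     # leverage of each observation

# Observations beyond the 0.5 contour (possibly none)
flagged <- names(cd)[cd > 0.5]

# The three most influential observations by Cook's distance
head(sort(cd, decreasing = TRUE), 3)
```

The same calls would apply to `model1` by replacing `fit`.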
The acceleration:weight, acceleration:displacement, and horsepower:weight interaction terms are all statistically significant.
The square root of horsepower is statistically significant and negative, indicating that increasing horsepower decreases mpg but at a diminishing rate. The same is true of the square root of weight. In the squared displacement model, displacement has a negative coefficient, but the squared term is positive. This indicates that the negative effect of displacement on mpg tapers off as displacement increases.
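The tapering can be made concrete with the two displacement coefficients from the `summary(model7)` output further down. Holding the other predictors fixed, the displacement part of the fit is a parabola, so its slope is the linear coefficient plus twice the quadratic coefficient times displacement. The variable and function names below are only for illustration.

```r
b_disp  <- -1.041e-01   # coefficient on displacement, copied from summary(model7)
b_disp2 <-  2.077e-04   # coefficient on I(displacement^2)

# Marginal effect of displacement on mpg at displacement d
slope_at <- function(d) b_disp + 2 * b_disp2 * d

# Displacement at which the marginal effect reaches zero
turning_point <- -b_disp / (2 * b_disp2)

slope_at(100)   # steeper negative slope for small engines
slope_at(200)   # shallower negative slope for larger engines
turning_point
```

With these estimates the slope reaches zero at a displacement of roughly 251, which is below the largest values in the data (455 appears in the `str` output), so the fitted effect flattens out within the observed range.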
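library(ISLR)  # assumed source of the Auto and Carseats data sets
auto <- Auto
# Drop rows whose horsepower is the missing-value marker "?" (a no-op if the data are already cleaned)
auto <- auto[auto$horsepower != "?", ]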
str(auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
pairs(auto)
cor(auto[, -9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
model1 <- lm(mpg ~ cylinders + displacement + horsepower + weight + acceleration + year + origin, data = auto)
summary(model1)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## acceleration + year + origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
plot(model1)
model2 <- lm(mpg ~ acceleration*weight + cylinders + displacement + year + origin + horsepower, data = auto)
summary(model2)
##
## Call:
## lm(formula = mpg ~ acceleration * weight + cylinders + displacement +
## year + origin + horsepower, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.247 -2.048 -0.045 1.619 12.193
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.364e+01 5.811e+00 -7.511 4.18e-13 ***
## acceleration 1.629e+00 2.422e-01 6.726 6.36e-11 ***
## weight 4.027e-03 1.636e-03 2.462 0.014268 *
## cylinders -2.141e-01 3.078e-01 -0.696 0.487117
## displacement 3.138e-03 7.495e-03 0.419 0.675622
## year 7.821e-01 4.833e-02 16.184 < 2e-16 ***
## origin 1.033e+00 2.686e-01 3.846 0.000141 ***
## horsepower -4.141e-02 1.348e-02 -3.071 0.002287 **
## acceleration:weight -5.826e-04 8.408e-05 -6.928 1.81e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.141 on 383 degrees of freedom
## Multiple R-squared: 0.8414, Adjusted R-squared: 0.838
## F-statistic: 253.9 on 8 and 383 DF, p-value: < 2.2e-16
model3 <- lm(mpg ~ acceleration*displacement + weight + cylinders + year + origin + horsepower, data = auto)
summary(model3)
##
## Call:
## lm(formula = mpg ~ acceleration * displacement + weight + cylinders +
## year + origin + horsepower, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.129 -1.899 -0.135 1.755 12.119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.005e+01 4.737e+00 -6.343 6.36e-10 ***
## acceleration 7.530e-01 1.332e-01 5.653 3.09e-08 ***
## displacement 7.022e-02 1.005e-02 6.989 1.24e-11 ***
## weight -4.211e-03 6.929e-04 -6.077 2.96e-09 ***
## cylinders 2.136e-03 3.125e-01 0.007 0.994550
## year 7.722e-01 4.811e-02 16.051 < 2e-16 ***
## origin 1.057e+00 2.671e-01 3.958 9.01e-05 ***
## horsepower -5.515e-02 1.407e-02 -3.920 0.000105 ***
## acceleration:displacement -4.855e-03 6.879e-04 -7.058 7.99e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.134 on 383 degrees of freedom
## Multiple R-squared: 0.842, Adjusted R-squared: 0.8387
## F-statistic: 255.2 on 8 and 383 DF, p-value: < 2.2e-16
model4 <- lm(mpg ~ horsepower*weight + acceleration + displacement + cylinders + year + origin, data = auto)
summary(model4)
##
## Call:
## lm(formula = mpg ~ horsepower * weight + acceleration + displacement +
## cylinders + year + origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.589 -1.617 -0.184 1.541 12.001
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.876e+00 4.511e+00 0.638 0.524147
## horsepower -2.313e-01 2.363e-02 -9.791 < 2e-16 ***
## weight -1.121e-02 7.285e-04 -15.393 < 2e-16 ***
## acceleration -9.019e-02 8.855e-02 -1.019 0.309081
## displacement 5.950e-03 6.750e-03 0.881 0.378610
## cylinders -2.955e-02 2.881e-01 -0.103 0.918363
## year 7.695e-01 4.494e-02 17.124 < 2e-16 ***
## origin 8.344e-01 2.513e-01 3.320 0.000986 ***
## horsepower:weight 5.529e-05 5.227e-06 10.577 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.931 on 383 degrees of freedom
## Multiple R-squared: 0.8618, Adjusted R-squared: 0.859
## F-statistic: 298.6 on 8 and 383 DF, p-value: < 2.2e-16
model5 <- lm(mpg ~ sqrt(horsepower) + weight + acceleration + displacement + cylinders + year + origin, data = auto)
summary(model5)
##
## Call:
## lm(formula = mpg ~ sqrt(horsepower) + weight + acceleration +
## displacement + cylinders + year + origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5240 -1.9910 -0.1687 1.8181 12.9211
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.0373910 5.5460041 -1.089 0.277012
## sqrt(horsepower) -1.1434906 0.3113771 -3.672 0.000274 ***
## weight -0.0054593 0.0006842 -7.979 1.72e-14 ***
## acceleration -0.1021239 0.1038565 -0.983 0.326070
## displacement 0.0220542 0.0071987 3.064 0.002341 **
## cylinders -0.5222540 0.3166839 -1.649 0.099938 .
## year 0.7240379 0.0501791 14.429 < 2e-16 ***
## origin 1.5173206 0.2703470 5.612 3.83e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.277 on 384 degrees of freedom
## Multiple R-squared: 0.8269, Adjusted R-squared: 0.8237
## F-statistic: 262 on 7 and 384 DF, p-value: < 2.2e-16
model6 <- lm(mpg ~ sqrt(weight) + horsepower + acceleration + displacement + cylinders + year + origin, data = auto)
summary(model6)
##
## Call:
## lm(formula = mpg ~ sqrt(weight) + horsepower + acceleration +
## displacement + cylinders + year + origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4018 -2.0112 0.0246 1.7565 12.8943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.840893 4.486253 0.633 0.52695
## sqrt(weight) -0.794322 0.066906 -11.872 < 2e-16 ***
## horsepower -0.010706 0.013111 -0.817 0.41469
## acceleration 0.131710 0.094051 1.400 0.16220
## displacement 0.021846 0.007134 3.062 0.00235 **
## cylinders -0.430040 0.310000 -1.387 0.16618
## year 0.773764 0.049030 15.781 < 2e-16 ***
## origin 1.210091 0.268519 4.507 8.76e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.191 on 384 degrees of freedom
## Multiple R-squared: 0.8359, Adjusted R-squared: 0.8329
## F-statistic: 279.4 on 7 and 384 DF, p-value: < 2.2e-16
model7 <- lm(mpg ~ I(displacement^2) + displacement + weight + horsepower + acceleration + cylinders + year + origin, data = auto)
summary(model7)
##
## Call:
## lm(formula = mpg ~ I(displacement^2) + displacement + weight +
## horsepower + acceleration + cylinders + year + origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.3584 -1.7222 0.0236 1.5766 11.9298
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.776e+00 4.271e+00 -2.289 0.0226 *
## I(displacement^2) 2.077e-04 2.222e-05 9.345 < 2e-16 ***
## displacement -1.041e-01 1.490e-02 -6.984 1.28e-11 ***
## weight -4.231e-03 6.362e-04 -6.650 1.01e-10 ***
## horsepower -5.848e-02 1.323e-02 -4.421 1.28e-05 ***
## acceleration -2.391e-02 9.001e-02 -0.266 0.7907
## cylinders 7.073e-01 3.191e-01 2.216 0.0273 *
## year 7.622e-01 4.607e-02 16.544 < 2e-16 ***
## origin 4.035e-01 2.741e-01 1.472 0.1419
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.007 on 383 degrees of freedom
## Multiple R-squared: 0.8546, Adjusted R-squared: 0.8516
## F-statistic: 281.4 on 8 and 383 DF, p-value: < 2.2e-16
See below.
Price and US are both statistically significant. Because Sales is recorded in thousands of units, each one-dollar increase in price is associated with a decrease of about 54 units in sales, holding the other predictors fixed. Stores in the US sell about 1,200 more units on average than those abroad. Finally, though the effect is not statistically significant, the Urban coefficient suggests that stores in urban settings sell, on average, about 22 fewer units than those in rural locations.
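A quick check of the conversion from coefficients to units, assuming (per the Carseats documentation) that Sales is measured in thousands of units. The coefficients are copied from the `summary(cmodel1)` output; the variable names are illustrative.

```r
# Coefficients copied from summary(cmodel1); Sales is in thousands of units
coefs <- c(Price = -0.054459, UrbanYes = -0.021916, USYes = 1.200573)

# Change in units sold per one-unit change in each predictor
units <- round(coefs * 1000)
units
```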
\[ \text{Sales} = \beta_0 + \beta_1\,\text{Price} + \beta_2\,\text{Urban}_{\text{yes}} + \beta_3\,\text{US}_{\text{yes}} + \epsilon \]
We can reject the null hypothesis \(H_0: \beta_j = 0\) for the Price and US predictors, but not for Urban (p = 0.936).
See below.
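The first model explains about 24% of the variation in Sales (R-squared 0.2393, adjusted R-squared 0.2335). The second model has the same R-squared, with a slightly higher adjusted R-squared (0.2354) and a slightly lower residual standard error (2.469 versus 2.472), so it fits only marginally better.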
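The small difference between the two models comes from the penalty for the extra predictor in adjusted R-squared. As a rough sketch, the standard formula can be applied to the rounded R-squared printed by `summary()`; the helper name `adj_r2` is made up for this example.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 400                           # 396 residual df + 4 estimated coefficients
adj1 <- adj_r2(0.2393, n, p = 3)   # Price + Urban + US
adj2 <- adj_r2(0.2393, n, p = 2)   # Price + US
c(adj1, adj2)
```

Because the input R-squared is rounded to four decimals, the results agree with the printed adjusted values only to about three decimal places.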
See below.
There are a few potential outliers at both ends of the spectrum. Some observations have high leverage, but not enough to significantly affect the model fit.
carseats <- Carseats
cmodel1 <- lm(Sales ~ Price + Urban + US, data = carseats)
summary(cmodel1)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
cmodel2 <- lm(Sales ~ Price + US, data = carseats)
summary(cmodel2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
confint(cmodel2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
plot(cmodel2)
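The question is confusing, but I think the two coefficient estimates are equal when x and y have the same sum of squares. Without an intercept, regressing y onto x gives \(\hat{\beta} = \sum x_i y_i / \sum x_i^2\), and regressing x onto y gives \(\sum x_i y_i / \sum y_i^2\), so the two match when \(\sum x_i^2 = \sum y_i^2\).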
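The condition can be checked numerically without making the two vectors identical. In this sketch, `y` is a random permutation of `x`, which is just one convenient way to give both vectors the same sum of squares; the seed and names are arbitrary.

```r
set.seed(2)
x <- rnorm(100)
y <- sample(x)   # a permutation of x, so sum(y^2) equals sum(x^2)

b_yx <- coef(lm(y ~ x + 0))   # equals sum(x * y) / sum(x^2)
b_xy <- coef(lm(x ~ y + 0))   # equals sum(x * y) / sum(y^2)

c(b_yx, b_xy)
```

The two estimates agree even though `y` is not equal to `x`, because only the sums of squares in the denominators need to match.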
See below.
See below.
set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)
y <- 3*x + rnorm(100, mean = 0, sd = 5)
yxmodel <- lm(y ~ x + 0)
coef(yxmodel)
## x
## 2.96938
xymodel <- lm(x ~ y + 0)
coef(xymodel)
## y
## 0.08052115
y2 <- x  # y2 is identical to x, so both have the same sum of squares
yxmodel2 <- lm(y2 ~ x + 0)
coef(yxmodel2)
## x
## 1
xymodel2 <- lm(x ~ y2 + 0)
coef(xymodel2)
## y2
## 1