Both methods are non-parametric and base their prediction on the \(K\) training observations closest to a test point \(x_0\) (its neighbourhood \(\mathcal{N}_0\)). They differ in the type of response they predict and in how they combine the neighbours.
The KNN classifier predicts a qualitative (categorical) response. It estimates the conditional probability of each class \(j\) as the fraction of the \(K\) neighbours in that class, \[\Pr(Y = j \mid X = x_0) \approx \frac{1}{K}\sum_{i \in \mathcal{N}_0} I(y_i = j),\] and assigns \(x_0\) to the class with the highest estimated probability (a majority vote). The output is a class label, and performance is judged by the classification error rate.
KNN regression predicts a quantitative response. It estimates the regression function as the average of the responses of the \(K\) neighbours, \[\hat f(x_0) = \frac{1}{K}\sum_{i \in \mathcal{N}_0} y_i,\] producing a numeric prediction; performance is judged by a quantity such as the test MSE.
In short, the classifier estimates \(\Pr(Y=j\mid X)\) and outputs a category by majority vote (approximating the Bayes classifier), whereas the regression method averages the neighbours’ numeric responses to estimate \(E(Y\mid X)\).
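The two rules can be sketched in a few lines of base R. This is a minimal illustration only: the helper names (`knn_neighbours`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, Euclidean distance is assumed, and ties in the vote are broken by taking the first class.

```r
# Hypothetical helpers for illustration; not taken from any package.
knn_neighbours <- function(X_train, x0, K) {
  # Euclidean distance from x0 to every training row
  d <- sqrt(rowSums(sweep(X_train, 2, x0)^2))
  order(d)[seq_len(K)]                 # indices of the K closest observations
}

knn_classify <- function(X_train, y_train, x0, K) {
  nb <- knn_neighbours(X_train, x0, K)
  probs <- table(y_train[nb]) / K      # estimated Pr(Y = j | X = x0)
  names(probs)[which.max(probs)]       # majority vote (first class wins ties)
}

knn_regress <- function(X_train, y_train, x0, K) {
  nb <- knn_neighbours(X_train, x0, K)
  mean(y_train[nb])                    # average of the neighbours' responses
}

# Toy training data (invented for this sketch)
X <- matrix(c(0, 0,
              1, 0,
              0, 1,
              5, 5,
              6, 5,
              5, 6), ncol = 2, byrow = TRUE)
y_class <- c("a", "a", "a", "b", "b", "b")
y_num   <- c(1, 2, 3, 10, 11, 12)

knn_classify(X, y_class, x0 = c(0.5, 0.5), K = 3)  # "a"
knn_regress(X, y_num,   x0 = c(0.5, 0.5), K = 3)   # 2
```

Both functions share the same neighbourhood search; only the final aggregation step (vote versus average) differs.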
Auto
# Assumes the Auto data frame is already available (for example from the ISLR2 package).
Auto <- na.omit(Auto)
dim(Auto)
## [1] 392 9
pairs(Auto[, names(Auto) != "name"])
round(cor(Auto[, names(Auto) != "name"]), 3)
##                 mpg cylinders displacement horsepower weight acceleration
## mpg           1.000    -0.778       -0.805     -0.778 -0.832        0.423
## cylinders    -0.778     1.000        0.951      0.843  0.898       -0.505
## displacement -0.805     0.951        1.000      0.897  0.933       -0.544
## horsepower   -0.778     0.843        0.897      1.000  0.865       -0.689
## weight       -0.832     0.898        0.933      0.865  1.000       -0.417
## acceleration  0.423    -0.505       -0.544     -0.689 -0.417        1.000
## year          0.581    -0.346       -0.370     -0.416 -0.309        0.290
## origin        0.565    -0.569       -0.615     -0.455 -0.585        0.213
##                year origin
## mpg           0.581  0.565
## cylinders    -0.346 -0.569
## displacement -0.370 -0.615
## horsepower   -0.416 -0.455
## weight       -0.309 -0.585
## acceleration  0.290  0.213
## year          1.000  0.182
## origin        0.182  1.000
mpg on all predictors except name
lm.fit <- lm(mpg ~ . - name, data = Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
i. Is there a relationship between the predictors and the response? Yes. The overall \(F\)-statistic is large with a \(p\)-value of essentially zero, so we strongly reject the null hypothesis that all slope coefficients are zero; the predictors jointly explain about \(82\%\) of the variation in mpg.
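ii. Which predictors are statistically significant? At the \(5\%\) level, displacement, weight, year, and origin are significant. cylinders, horsepower, and acceleration are not. For cylinders and horsepower this is most likely due to collinearity: their correlations with displacement and weight lie between \(0.84\) and \(0.95\) in the matrix above. acceleration is only moderately correlated with those two, and most strongly with horsepower (\(r = -0.689\)).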
iii. What does the year coefficient suggest? Its coefficient is about \(+0.75\): holding all other predictors fixed, average fuel economy improves by roughly \(0.75\) mpg per model year, i.e. cars became about \(0.75\) mpg more efficient each year over this period.
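The overall \(F\)-statistic in (i) can be recovered from the printed \(R^2\). The sketch below plugs in the rounded values from the summary (\(R^2 = 0.8215\), \(n = 392\), \(p = 7\)), so it agrees with the reported \(252.4\) only up to rounding.

```r
# Values copied from the printed summary; r2 is rounded to 4 decimals.
r2 <- 0.8215
n  <- 392   # observations
p  <- 7     # predictors (excluding the intercept)

# F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
f_stat                                            # about 252.5

# Upper-tail p-value under the F(p, n - p - 1) distribution
pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)
```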
par(mfrow = c(2, 2))
plot(lm.fit)
The residuals-vs-fitted plot shows a mild U-shape and a funnel (spread increasing with the fitted values), indicating some non-linearity and mild heteroscedasticity. The residuals-vs-leverage plot flags no extreme outliers, but observation 14 has unusually high leverage relative to the average leverage \((p+1)/n\).
summary(lm(mpg ~ horsepower * weight, data = Auto))
##
## Call:
## lm(formula = mpg ~ horsepower * weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.7725 -2.2074 -0.2708 1.9973 14.7314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.356e+01 2.343e+00 27.127 < 2e-16 ***
## horsepower -2.508e-01 2.728e-02 -9.195 < 2e-16 ***
## weight -1.077e-02 7.738e-04 -13.921 < 2e-16 ***
## horsepower:weight 5.355e-05 6.649e-06 8.054 9.93e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.93 on 388 degrees of freedom
## Multiple R-squared: 0.7484, Adjusted R-squared: 0.7465
## F-statistic: 384.8 on 3 and 388 DF, p-value: < 2.2e-16
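The horsepower:weight interaction is highly significant (\(p \approx 10^{-14}\)), so the effect of horsepower on mpg depends on the car’s weight (and vice versa). Because the additive model mpg ~ horsepower + weight is nested in this one, adding the interaction can only raise \(R^2\); here the model reaches \(R^2 \approx 0.75\).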
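As a worked illustration using the rounded coefficients from the summary above, the fitted slope of mpg with respect to horsepower at a given weight is \[\frac{\partial\,\widehat{\text{mpg}}}{\partial\,\text{horsepower}} = -0.2508 + 5.355\times 10^{-5}\,\text{weight}.\] The weights \(2000\) and \(4000\) below are chosen only as examples: \[-0.2508 + 5.355\times 10^{-5}\times 2000 \approx -0.144, \qquad -0.2508 + 5.355\times 10^{-5}\times 4000 \approx -0.037.\] Under this fit, an extra unit of horsepower costs roughly four times as much fuel economy in the lighter car as in the heavier one.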
horsepower
fits <- list(
"mpg ~ horsepower" = lm(mpg ~ horsepower, data = Auto),
"mpg ~ log(horsepower)" = lm(mpg ~ log(horsepower), data = Auto),
"mpg ~ sqrt(horsepower)" = lm(mpg ~ sqrt(horsepower), data = Auto),
"mpg ~ horsepower + hp^2" = lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
)
sapply(fits, function(m) round(summary(m)$r.squared, 4))
## mpg ~ horsepower mpg ~ log(horsepower) mpg ~ sqrt(horsepower)
## 0.6059 0.6683 0.6437
## mpg ~ horsepower + hp^2
## 0.6876
All of the non-linear transformations fit better than the raw linear term, confirming the curvature seen in the diagnostics. The quadratic and the \(\log\) transformations give the largest improvement, consistent with mpg being a decreasing, convex function of horsepower.
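Comparing \(R^2\) values is informal; a nested-model \(F\)-test via `anova()` is one way to check whether a quadratic term adds anything beyond the linear one. The sketch below uses simulated data, not the Auto data: the variable names `hp` and `mpg_sim` and the generating coefficients are invented for illustration.

```r
# Simulated stand-in data: a decreasing, convex relationship plus noise.
set.seed(42)
hp      <- runif(200, min = 50, max = 200)
mpg_sim <- 60 - 0.4 * hp + 0.001 * hp^2 + rnorm(200, sd = 2)

fit_lin  <- lm(mpg_sim ~ hp)
fit_quad <- lm(mpg_sim ~ hp + I(hp^2))

# R^2 can only go up when a term is added to a nested model
c(linear    = summary(fit_lin)$r.squared,
  quadratic = summary(fit_quad)$r.squared)

# F-test of H0: the coefficient on hp^2 is zero
cmp <- anova(fit_lin, fit_quad)
cmp
```

A small p-value in the second row of `cmp` indicates that the curvature is not explained by the linear term alone.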
Carseats data set
# Assumes the Carseats data frame is already available (for example from the ISLR2 package).
dim(Carseats)
## [1] 400 11
Sales ~ Price + Urban + US
lm1 <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm1)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
All effects are “holding the other predictors fixed”. Sales is in thousands of units and Price is in dollars.
Price (\(\approx -0.054\)): each $1 increase in price is associated with a decrease of about \(0.054\) thousand (\(\approx 54\)) unit sales. Negative and highly significant.
Urban[Yes] (\(\approx -0.022\)): urban stores sell about \(0.022\) thousand units less than rural stores, but this is not significant (\(p \approx 0.94\)), so there is no detectable urban/rural difference.
US[Yes] (\(\approx +1.20\)): stores in the US sell about \(1.20\) thousand (\(\approx 1200\)) more units than non-US stores; significant.
Intercept (\(\approx 13.04\)): the predicted sales, in thousands of units, for a rural, non-US store at Price \(= 0\).
With \(\text{Urban}=1\) if the store is urban (else \(0\)) and \(\text{US}=1\) if it is in the US (else \(0\)): \[\widehat{\text{Sales}} = 13.0435 - 0.0545\,\text{Price} - 0.0219\,\text{Urban} + 1.2006\,\text{US}.\]
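As a worked example of the fitted equation, take an urban US store charging a price of $100 (a value chosen only for illustration): \[\widehat{\text{Sales}} = 13.0435 - 0.0545\times 100 - 0.0219\times 1 + 1.2006\times 1 = 8.7722,\] that is, predicted sales of about \(8772\) units. The same store outside the US would be predicted to sell \(8.7722 - 1.2006 = 7.5716\) thousand units.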
round(summary(lm1)$coefficients[, "Pr(>|t|)"], 4)
## (Intercept) Price UrbanYes USYes
## 0.0000 0.0000 0.9357 0.0000
We reject \(H_0\) for Price and US (and the intercept), but not for Urban.
Sales ~ Price + US
lm2 <- lm(Sales ~ Price + US, data = Carseats)
summary(lm2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
data.frame(
model = c("(a) Price+Urban+US", "(e) Price+US"),
R2 = c(summary(lm1)$r.squared, summary(lm2)$r.squared),
adj_R2 = c(summary(lm1)$adj.r.squared, summary(lm2)$adj.r.squared),
RSE = c(summary(lm1)$sigma, summary(lm2)$sigma)
)
## model R2 adj_R2 RSE
## 1 (a) Price+Urban+US 0.2392754 0.2335123 2.472492
## 2 (e) Price+US 0.2392629 0.2354305 2.469397
Both models explain essentially the same \(\approx 23.9\%\) of the variation in Sales (dropping the non-significant Urban term lowers \(R^2\) only in the fifth decimal place). Model (e) has a slightly higher adjusted \(R^2\) and a lower residual standard error, so it is the preferred, more parsimonious model. In absolute terms the fit is modest: about three-quarters of the variability in sales is left unexplained.
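The adjusted \(R^2\) values in the table follow from \(R^2\), the sample size \(n = 400\), and the number of predictors \(p\). A small sketch, with `adj_r2` as a hypothetical helper name and the \(R^2\) values copied from the table above:

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj_a <- adj_r2(0.2392754, n = 400, p = 3)  # model (a): Price + Urban + US
adj_e <- adj_r2(0.2392629, n = 400, p = 2)  # model (e): Price + US
round(c(a = adj_a, e = adj_e), 4)           # 0.2335 and 0.2354
```

The penalty factor \((n-1)/(n-p-1)\) is smaller for model (e), which is why its adjusted \(R^2\) is higher even though its raw \(R^2\) is marginally lower.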
round(confint(lm2), 4)
## 2.5 % 97.5 %
## (Intercept) 11.7903 14.2713
## Price -0.0648 -0.0442
## USYes 0.6915 1.7078
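These intervals are the usual estimate \(\pm\) \(t\)-quantile \(\times\) standard error. The sketch below rebuilds them from the rounded estimates and standard errors printed in the summary of the Price + US model, with \(397\) residual degrees of freedom, so it should match `confint()` up to rounding of the inputs.

```r
# Estimates and standard errors copied from the printed summary (rounded).
est <- c("(Intercept)" = 13.03079, Price = -0.05448, USYes = 1.19964)
se  <- c("(Intercept)" = 0.63098,  Price = 0.00523,  USYes = 0.25846)
df_resid <- 397

t_crit <- qt(0.975, df_resid)   # two-sided 95% critical value
ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
round(ci, 4)
```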
par(mfrow = c(2, 2))
plot(lm2)
n <- nrow(Carseats)
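# rstandard() returns internally studentized (standardized) residuals;
# rstudent() would give the externally studentized version.
studres <- rstandard(lm2)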
lev <- hatvalues(lm2)
cat("max |studentized residual| =", round(max(abs(studres)), 3), "\n")
## max |studentized residual| = 2.865
cat("observations with |stud. resid| > 3 :", sum(abs(studres) > 3), "\n")
## observations with |stud. resid| > 3 : 0
cat("average leverage =", round((length(coef(lm2))) / n, 5),
" max leverage =", round(max(lev), 5), "\n")
## average leverage = 0.0075 max leverage = 0.04334
cat("high-leverage (lev > 3*avg) :", sum(lev > 3 * length(coef(lm2)) / n), "\n")
## high-leverage (lev > 3*avg) : 6
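The largest studentized residual in absolute value is \(2.865\), below the usual threshold of \(3\), so there are no clear outliers. Six observations exceed three times the average leverage of \(0.0075\), and the maximum leverage (\(0.043\)) is nearly six times the average, so there are a few high-leverage points. Since no observation has a studentized residual above \(3\), none of them combines high leverage with an outlying response.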
For regression through the origin, the coefficient of \(Y\) onto \(X\) is \(\hat\beta_{Y\sim X} = \sum_i x_i y_i / \sum_i x_i^2\) and of \(X\) onto \(Y\) is \(\hat\beta_{X\sim Y} = \sum_i x_i y_i / \sum_i y_i^2\).
Since the numerators (\(\sum_i x_i y_i\)) are identical, the two estimates are equal if and only if the denominators are equal, i.e. \[\sum_i x_i^2 = \sum_i y_i^2\] (assuming \(\sum_i x_i y_i \neq 0\)). In general they differ.
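For completeness, the no-intercept estimate comes from minimising the residual sum of squares in \(\beta\): \[\text{RSS}(\beta) = \sum_i (y_i - \beta x_i)^2, \qquad \frac{d\,\text{RSS}}{d\beta} = -2\sum_i x_i (y_i - \beta x_i) = 0 \;\Longrightarrow\; \hat\beta_{Y\sim X} = \frac{\sum_i x_i y_i}{\sum_i x_i^2},\] and swapping the roles of \(x\) and \(y\) gives \(\hat\beta_{X\sim Y}\). Two consequences follow directly: \[\frac{\hat\beta_{Y\sim X}}{\hat\beta_{X\sim Y}} = \frac{\sum_i y_i^2}{\sum_i x_i^2}, \qquad \hat\beta_{Y\sim X}\,\hat\beta_{X\sim Y} = \frac{\left(\sum_i x_i y_i\right)^2}{\sum_i x_i^2 \sum_i y_i^2} \le 1,\] where the inequality is Cauchy–Schwarz. In particular, the two slopes are generally not reciprocals of each other.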
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
c(beta_Y_on_X = coef(lm(y ~ x + 0)),
beta_X_on_Y = coef(lm(x ~ y + 0)))
## beta_Y_on_X.x beta_X_on_Y.y
## 1.9938761 0.3911145
c(sum_x2 = sum(x^2), sum_y2 = sum(y^2))
## sum_x2 sum_y2
## 81.05509 413.21352
The sums of squares differ greatly, so the two coefficients are very different.
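As a quick check, the closed-form expressions can be compared with the `lm()` coefficients on the same simulated data. This is a sketch that regenerates `x` and `y` with the same seed; the names `b_yx` and `b_xy` are introduced here for illustration.

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Closed-form no-intercept slopes
b_yx <- sum(x * y) / sum(x^2)
b_xy <- sum(x * y) / sum(y^2)

# They should agree with lm() up to floating-point error
all.equal(b_yx, unname(coef(lm(y ~ x + 0))))
all.equal(b_xy, unname(coef(lm(x ~ y + 0))))

# The ratio of the slopes equals the ratio of the sums of squares
c(slope_ratio = b_yx / b_xy, ss_ratio = sum(y^2) / sum(x^2))
```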
set.seed(1)
x <- rnorm(100)
y <- sample(x) # a permutation of x forces sum(y^2) == sum(x^2)
c(beta_Y_on_X = coef(lm(y ~ x + 0)),
beta_X_on_Y = coef(lm(x ~ y + 0)))
## beta_Y_on_X.x beta_X_on_Y.y
## -0.07767695 -0.07767695
c(sum_x2 = sum(x^2), sum_y2 = sum(y^2))
## sum_x2 sum_y2
## 81.05509 81.05509
Because \(y\) is a permutation of \(x\), the denominators \(\sum x_i^2\) and \(\sum y_i^2\) are identical, so the two regression coefficients are exactly equal.
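A permutation is not the only construction that works. Any \(y\) can be rescaled so that \(\sum y_i^2 = \sum x_i^2\), which keeps the linear relationship with \(x\) while still equalising the two slopes. A minimal sketch, where `y_scaled` is a name introduced for this example:

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Rescale y so that its sum of squares matches that of x
y_scaled <- y * sqrt(sum(x^2) / sum(y^2))

b1 <- unname(coef(lm(y_scaled ~ x + 0)))
b2 <- unname(coef(lm(x ~ y_scaled + 0)))

all.equal(sum(y_scaled^2), sum(x^2))  # denominators now agree
all.equal(b1, b2)                     # so the two slopes agree
```

Here the common slope equals \(\sum x_i y_i / \sqrt{\sum x_i^2 \sum y_i^2}\), the uncentred correlation between \(x\) and \(y\).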