June 21st 2026
KNN Classifier is used when the response Y is qualitative (categorical). For a new point x0, it finds the K nearest neighbors and assigns the majority class among them. Output is a class label.
KNN Regression is used when the response Y is quantitative (continuous). For a new point x0, it finds the K nearest neighbors and returns the average of their response values. Output is a numeric prediction.
Key difference: The classifier uses majority vote; regression uses the mean of neighbors.
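The contrast between the two methods can be sketched in a few lines of base R. This is a minimal illustration, not a production implementation: the helper names (`knn_neighbors`, `knn_classify`, `knn_regress`), the Euclidean distance, and the toy data are all assumptions made for the example.

```r
# Minimal base-R KNN sketch. Helper names and toy data are hypothetical;
# Euclidean distance is assumed.
knn_neighbors <- function(train_x, x0, k) {
  # Distance from x0 to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, x0)^2))
  order(d)[seq_len(k)]             # indices of the k closest rows
}

knn_classify <- function(train_x, train_y, x0, k) {
  idx <- knn_neighbors(train_x, x0, k)
  votes <- table(train_y[idx])
  names(votes)[which.max(votes)]   # majority class (ties go to the first level)
}

knn_regress <- function(train_x, train_y, x0, k) {
  idx <- knn_neighbors(train_x, x0, k)
  mean(train_y[idx])               # average response of the neighbors
}

# Toy data: four points on a line
train_x <- matrix(c(1, 2, 3, 10), ncol = 1)
class_y <- c("a", "a", "b", "b")
num_y   <- c(1, 2, 3, 10)

pred_class <- knn_classify(train_x, class_y, x0 = 2.2, k = 3)  # neighbors: x = 2, 3, 1
pred_value <- knn_regress(train_x, num_y, x0 = 2.2, k = 3)     # mean of 2, 3, 1
```

Both functions share the same neighbor search; only the final aggregation step (vote versus mean) differs.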
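library(ISLR2)  # assumption: Auto and Carseats come from ISLR2 (the older ISLR package also provides them)
data(Auto)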
pairs(Auto)
cor(Auto[, -which(names(Auto) == "name")])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
lm.fit9 <- lm(mpg ~ . - name, data = Auto)
summary(lm.fit9)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
i. Is there a relationship between the predictors and the response?
Yes. The F-statistic is 252.4 with a p-value < 2.2e-16, essentially zero. We strongly reject the null hypothesis that all coefficients are zero — there is a statistically significant relationship between the predictors and mpg.
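As a cross-check, the overall F-statistic can be recovered from R² alone using F = (R²/p) / ((1 − R²)/(n − p − 1)). The sketch below uses only the rounded values printed in the summary, so it should land close to, but not exactly on, the reported 252.4.

```r
# Values copied from the summary(lm.fit9) output
r2 <- 0.8215   # Multiple R-squared (rounded as printed)
n  <- 392      # observations: 384 residual df + 8 estimated coefficients
p  <- 7        # predictors, excluding the intercept

# Overall F-statistic for H0: all slope coefficients are zero
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

# Upper-tail probability under the F(p, n - p - 1) distribution
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)
```

The small gap between `f_stat` and the printed value comes from rounding R² to four decimals.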
ii. Which predictors have a statistically significant relationship?
The significant predictors (p < 0.05) are: displacement (p = 0.00844), weight (p < 2e-16), year (p < 2e-16), and origin (p = 4.67e-07). Cylinders, horsepower, and acceleration are not significant.
iii. What does the year coefficient suggest?
The coefficient is 0.7508, meaning each additional model year is associated with an increase of about 0.75 mpg, holding all other predictors constant. Fuel efficiency improved steadily over time.
par(mfrow = c(2, 2))
plot(lm.fit9)
par(mfrow = c(1, 1))
The residuals vs fitted plot shows mild non-linearity. There are a few potential outliers with large residuals. No severely influential points are evident from the leverage plot.
lm.fit9e <- lm(mpg ~ . - name + horsepower:weight + acceleration:horsepower,
data = Auto)
summary(lm.fit9e)
##
## Call:
## lm(formula = mpg ~ . - name + horsepower:weight + acceleration:horsepower,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4489 -1.6817 -0.0238 1.3932 11.8609
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.655e+00 5.397e+00 -1.418 0.156907
## cylinders 1.963e-01 2.916e-01 0.673 0.501235
## displacement -5.986e-03 7.504e-03 -0.798 0.425545
## horsepower -1.285e-01 3.786e-02 -3.394 0.000760 ***
## weight -9.289e-03 9.103e-04 -10.204 < 2e-16 ***
## acceleration 3.891e-01 1.643e-01 2.369 0.018336 *
## year 7.694e-01 4.431e-02 17.364 < 2e-16 ***
## origin 7.206e-01 2.500e-01 2.882 0.004172 **
## horsepower:weight 4.752e-05 5.626e-06 8.448 6.3e-16 ***
## horsepower:acceleration -6.123e-03 1.777e-03 -3.445 0.000634 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.891 on 382 degrees of freedom
## Multiple R-squared: 0.866, Adjusted R-squared: 0.8628
## F-statistic: 274.3 on 9 and 382 DF, p-value: < 2.2e-16
Both interaction terms are highly significant: horsepower:weight (p = 6.3e-16) and horsepower:acceleration (p = 0.000634).
R² improved from 0.8215 to 0.866 and RSE dropped from 3.328 to 2.891, which indicates the interactions add real explanatory power.
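Because the main-effects model is nested in the interaction model, the improvement can also be assessed with a partial F-test. The sketch below rebuilds the residual sums of squares from the printed RSE values (RSS = RSE² × residual df), so the result is approximate. The variable names are chosen for this example only.

```r
# Printed values from the two summaries
rse_reduced <- 3.328; df_reduced <- 384   # mpg ~ . - name
rse_full    <- 2.891; df_full    <- 382   # plus the two interaction terms

# Residual sum of squares recovered from RSE
rss_reduced <- rse_reduced^2 * df_reduced
rss_full    <- rse_full^2 * df_full

q <- df_reduced - df_full                 # number of added terms (2)

# Partial F-statistic for H0: both interaction coefficients are zero
f_partial <- ((rss_reduced - rss_full) / q) / (rss_full / df_full)
p_partial <- pf(f_partial, df1 = q, df2 = df_full, lower.tail = FALSE)
```

With the fitted objects available, `anova(lm.fit9, lm.fit9e)` runs the same comparison on unrounded values.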
lm.fit9f_log <- lm(mpg ~ log(horsepower), data = Auto)
lm.fit9f_sqrt <- lm(mpg ~ sqrt(horsepower), data = Auto)
lm.fit9f_sq <- lm(mpg ~ I(horsepower^2), data = Auto)
summary(lm.fit9f_log)
##
## Call:
## lm(formula = mpg ~ log(horsepower), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.2299 -2.7818 -0.2322 2.6661 15.4695
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 108.6997 3.0496 35.64 <2e-16 ***
## log(horsepower) -18.5822 0.6629 -28.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.501 on 390 degrees of freedom
## Multiple R-squared: 0.6683, Adjusted R-squared: 0.6675
## F-statistic: 785.9 on 1 and 390 DF, p-value: < 2.2e-16
summary(lm.fit9f_sqrt)
##
## Call:
## lm(formula = mpg ~ sqrt(horsepower), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9768 -3.2239 -0.2252 2.6881 16.1411
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58.705 1.349 43.52 <2e-16 ***
## sqrt(horsepower) -3.503 0.132 -26.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.665 on 390 degrees of freedom
## Multiple R-squared: 0.6437, Adjusted R-squared: 0.6428
## F-statistic: 704.6 on 1 and 390 DF, p-value: < 2.2e-16
summary(lm.fit9f_sq)
##
## Call:
## lm(formula = mpg ~ I(horsepower^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.529 -3.798 -1.049 3.240 18.528
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.047e+01 4.466e-01 68.22 <2e-16 ***
## I(horsepower^2) -5.665e-04 2.827e-05 -20.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.485 on 390 degrees of freedom
## Multiple R-squared: 0.5074, Adjusted R-squared: 0.5061
## F-statistic: 401.7 on 1 and 390 DF, p-value: < 2.2e-16
par(mfrow = c(1, 3))
plot(Auto$horsepower, Auto$mpg, main = "log(horsepower)", xlab = "horsepower", ylab = "mpg")
lines(sort(Auto$horsepower),
predict(lm.fit9f_log, data.frame(horsepower = sort(Auto$horsepower))),
col = "red", lwd = 2)
plot(Auto$horsepower, Auto$mpg, main = "sqrt(horsepower)", xlab = "horsepower", ylab = "mpg")
lines(sort(Auto$horsepower),
predict(lm.fit9f_sqrt, data.frame(horsepower = sort(Auto$horsepower))),
col = "blue", lwd = 2)
plot(Auto$horsepower, Auto$mpg, main = "horsepower^2", xlab = "horsepower", ylab = "mpg")
lines(sort(Auto$horsepower),
predict(lm.fit9f_sq, data.frame(horsepower = sort(Auto$horsepower))),
col = "green", lwd = 2)
par(mfrow = c(1, 1))
| Transformation | R² | RSE |
|---|---|---|
| log(horsepower) | 0.6683 | 4.501 |
| sqrt(horsepower) | 0.6437 | 4.665 |
| horsepower² | 0.5074 | 5.485 |
log(horsepower) is the best of the three transformations: it has the highest R² and the lowest RSE. The log transformation best captures the curved, flattening relationship between horsepower and mpg, in which the fuel-efficiency penalty for each additional unit of horsepower shrinks as horsepower increases.
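The diminishing penalty can be read directly off the log-model coefficient. The sketch below uses the printed estimate for log(horsepower); the horsepower values 50, 60, 200, and 210 are arbitrary illustration points, not taken from the analysis.

```r
b_log <- -18.5822   # coefficient on log(horsepower) from summary(lm.fit9f_log)

# Change in predicted mpg for a proportional change in horsepower
change_10pct  <- b_log * log(1.10)   # 10% more horsepower
change_double <- b_log * log(2)      # doubling horsepower

# The same +10 hp costs much less mpg at high horsepower than at low horsepower
change_low  <- b_log * (log(60)  - log(50))    # roughly -3.4 mpg
change_high <- b_log * (log(210) - log(200))   # roughly -0.9 mpg
```

In a model that is linear in log(horsepower), equal percentage changes in horsepower give equal changes in predicted mpg, which is why the absolute effect of a fixed 10 hp step fades.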
data(Carseats)
lm.fit10 <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm.fit10)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Sales = 13.0435 − 0.0545(Price) − 0.0219(Urban) + 1.2006(US)
Where Urban = 1 if urban, 0 if rural; US = 1 if in US, 0 if not.
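To make the equation concrete, the sketch below evaluates it for two hypothetical stores at Price = 100. The function name `predict_sales` and the chosen inputs are assumptions for illustration; the coefficients are the printed estimates. In the Carseats data, Sales is recorded in thousands of units.

```r
# Printed coefficient estimates from summary(lm.fit10)
coefs <- c(intercept = 13.043469, price = -0.054459,
           urban_yes = -0.021916, us_yes = 1.200573)

# Hypothetical helper: urban and us are 0/1 dummy values
predict_sales <- function(price, urban, us) {
  unname(coefs["intercept"] + coefs["price"] * price +
           coefs["urban_yes"] * urban + coefs["us_yes"] * us)
}

sales_urban_us    <- predict_sales(price = 100, urban = 1, us = 1)
sales_rural_nonus <- predict_sales(price = 100, urban = 0, us = 0)

# Holding Price and Urban fixed, US stores differ by the USYes coefficient
us_effect <- predict_sales(100, 1, 1) - predict_sales(100, 1, 0)
```

With the fitted object available, `predict(lm.fit10, newdata = ...)` would give the same predictions without retyping coefficients.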
The null hypothesis H₀: βⱼ = 0 can be rejected for Price and US, whose p-values are both far below 0.05. It cannot be rejected for Urban (p = 0.936).
lm.fit10e <- lm(Sales ~ Price + US, data = Carseats)
summary(lm.fit10e)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Both Price and US remain highly significant (Price: p < 2e-16, US: p = 4.71e-06). Dropping Urban had essentially no cost.
| Model | R² | Adjusted R² | RSE |
|---|---|---|---|
| (a) Price + Urban + US | 0.2393 | 0.2335 | 2.472 |
| (e) Price + US | 0.2393 | 0.2354 | 2.469 |
The fits are virtually identical. Model (e) has a slightly higher adjusted R² and a slightly lower RSE despite having one fewer predictor, which supports dropping Urban. An R² of about 0.24 means the model explains only about 24% of the variance in Sales, so predictors outside this model likely matter.
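The reason adjusted R² rises while R² stays flat follows from its formula, 1 − (1 − R²)(n − 1)/(n − p − 1), which penalizes each extra predictor. The sketch below applies it to the printed R² values; because those are rounded, the results agree with the table only to about three decimals. The function name `adj_r2` is chosen for this example.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 400   # 396 residual df + 4 coefficients in model (a)

adj_a <- adj_r2(0.2393, n, p = 3)   # Price + Urban + US
adj_e <- adj_r2(0.2393, n, p = 2)   # Price + US
```

With the same R² and one fewer predictor, the penalty factor (n − 1)/(n − p − 1) is smaller for model (e), so its adjusted R² is higher.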
confint(lm.fit10e)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
Both Price and US intervals exclude zero, confirming significance. The Price interval is entirely negative and the US interval entirely positive.
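For a linear model, these intervals are estimate ± t × standard error, with the t quantile taken at the residual degrees of freedom. The sketch below reproduces them from the printed estimates and standard errors of `lm.fit10e`; small differences in the last digits are expected because the inputs are rounded.

```r
# Printed estimates and standard errors from summary(lm.fit10e)
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price =  0.00523, USYes = 0.25846)

# 97.5% quantile of the t distribution with 397 residual df
t_crit <- qt(0.975, df = 397)

# Two-sided 95% confidence intervals
ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
ci
```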
par(mfrow = c(2, 2))
plot(lm.fit10e)
par(mfrow = c(1, 1))
Outliers: Observations 51, 69, and 377 appear with notably large residuals. Observation 51 sits below −3 on the Q-Q plot, flagging it as a potential outlier.
High leverage: Observation 368 has leverage ~0.04, noticeably higher than most. However it falls within Cook’s distance boundaries, so it is not strongly influencing the regression coefficients. No points cross the Cook’s distance boundary — no highly influential observations.
Overall: The Q-Q plot is very linear (residuals approximately normal) and the Residuals vs Fitted plot shows no strong pattern. The linear model is reasonably appropriate.
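One way to put the leverage of observation 368 in context is to compare it with the average leverage, which always equals (p + 1)/n. The sketch below does that arithmetic for this model, then checks the identity on simulated data. The simulated data frame, the seed, and the 2× cutoff are assumptions for illustration; the cutoff is a common rule of thumb, not a formal test.

```r
# Model (e): p = 2 predictors, n = 400 observations
n <- 400; p <- 2
avg_leverage <- (p + 1) / n          # 0.0075
ratio_368 <- 0.04 / avg_leverage     # leverage of obs. 368 relative to average

# Check on simulated data that mean leverage equals (p + 1) / n
set.seed(42)
sim <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(50)
fit_sim <- lm(y ~ x1 + x2, data = sim)

h <- hatvalues(fit_sim)              # leverage statistic for each observation
flagged <- which(h > 2 * mean(h))    # rule-of-thumb cutoff (an assumption)
```

A leverage more than five times the average marks observation 368 as high-leverage, which matches the reading of the plot, while Cook's distance indicates whether that leverage actually moves the fit.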
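From formula (3.38), the no-intercept estimate for the regression of Y onto X is β̂ = Σ(xᵢyᵢ) / Σ(xᵢ²). Swapping the roles of the variables, the estimate for the regression of X onto Y is β̂ = Σ(xᵢyᵢ) / Σ(yᵢ²).
These are equal when Σ(xᵢ²) = Σ(yᵢ²), i.e., when X and Y have the same sum of squares (or, trivially, when Σ(xᵢyᵢ) = 0).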
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
lm.yx <- lm(y ~ x + 0)
lm.xy <- lm(x ~ y + 0)
cat("Beta(Y~X):", coef(lm.yx), "\n")
## Beta(Y~X): 1.993876
cat("Beta(X~Y):", coef(lm.xy), "\n")
## Beta(X~Y): 0.3911145
cat("sum(x^2):", sum(x^2), " sum(y^2):", sum(y^2), "\n")
## sum(x^2): 81.05509 sum(y^2): 413.2135
The estimates differ because sum(x²) ≠ sum(y²).
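The closed-form expressions can be checked against `lm()` directly. The sketch below regenerates the same data with the same seed and compares Σ(xᵢyᵢ)/Σ(xᵢ²) and Σ(xᵢyᵢ)/Σ(yᵢ²) with the fitted coefficients. The variable names `beta_yx` and `beta_xy` are chosen for this example.

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Closed-form no-intercept slopes
beta_yx <- sum(x * y) / sum(x^2)   # Y onto X
beta_xy <- sum(x * y) / sum(y^2)   # X onto Y

# Compare with lm(); unname() drops the coefficient label
match_yx <- all.equal(unname(coef(lm(y ~ x + 0))), beta_yx)
match_xy <- all.equal(unname(coef(lm(x ~ y + 0))), beta_xy)

# The two slopes differ exactly by the ratio of the sums of squares
slope_ratio <- beta_yx / beta_xy
ss_ratio    <- sum(y^2) / sum(x^2)
```

Since the numerators are identical, the ratio of the two slopes reduces to Σ(yᵢ²)/Σ(xᵢ²), so they coincide only when that ratio is 1.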
set.seed(1)
x2 <- rnorm(100)
y2 <- x2[sample(100)]
lm.yx2 <- lm(y2 ~ x2 + 0)
lm.xy2 <- lm(x2 ~ y2 + 0)
cat("Beta(Y2~X2):", coef(lm.yx2), "\n")
## Beta(Y2~X2): -0.07767695
cat("Beta(X2~Y2):", coef(lm.xy2), "\n")
## Beta(X2~Y2): -0.07767695
cat("sum(x2^2):", sum(x2^2), " sum(y2^2):", sum(y2^2), "\n")
## sum(x2^2): 81.05509 sum(y2^2): 81.05509
The estimates are equal because permuting x gives y the same sum of squares as x.