Carefully explain the differences between the KNN classifier and KNN regression methods.
Both KNN classifier and KNN regression work the same way up to a point — you pick a test observation \(x_0\), find its K nearest neighbors in the training data, and use those neighbors to make a prediction. The difference is in what you’re predicting and how you get there.
With KNN regression, the response is quantitative, so once you have your K neighbors you just take the average of their Y values and that’s your prediction:
\[\hat{f}(x_0) = \frac{1}{K} \sum_{x_i \in \mathcal{N}_0} y_i\]
With KNN classifier, the response is categorical (a class label), so averaging doesn’t make sense. Instead you look at what class shows up most among the K neighbors — basically a majority vote. More formally, you estimate the probability of each class and assign \(x_0\) to whichever class has the highest probability:
\[P(Y = j \mid X = x_0) = \frac{1}{K} \sum_{x_i \in \mathcal{N}_0} \mathbf{1}(y_i = j)\]
So the two methods are really built for different situations. KNN regression gives you a number, KNN classifier gives you a category. Both are non-parametric, meaning neither assumes a specific functional form for the relationship between X and Y. And both are affected by the choice of K in the same way — smaller K means a more flexible, wiggly fit (low bias, high variance) and larger K gives a smoother fit (higher bias, lower variance). They also both struggle when the number of predictors is large because the “nearest” neighbors end up not being very close at all in high-dimensional space.
data(Auto, package = "ISLR2")
Auto <- na.omit(Auto)
cat("Dimensions:", nrow(Auto), "rows x", ncol(Auto), "columns\n")## Dimensions: 392 rows x 9 columns
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
Auto_numeric <- Auto[, sapply(Auto, is.numeric)]
pairs(Auto_numeric,
main = "Scatterplot Matrix — Auto Data Set",
pch = 20,
cex = 0.5,
col = "steelblue")## mpg cylinders displacement horsepower weight acceleration
## mpg 1.000 -0.778 -0.805 -0.778 -0.832 0.423
## cylinders -0.778 1.000 0.951 0.843 0.898 -0.505
## displacement -0.805 0.951 1.000 0.897 0.933 -0.544
## horsepower -0.778 0.843 0.897 1.000 0.865 -0.689
## weight -0.832 0.898 0.933 0.865 1.000 -0.417
## acceleration 0.423 -0.505 -0.544 -0.689 -0.417 1.000
## year 0.581 -0.346 -0.370 -0.416 -0.309 0.290
## origin 0.565 -0.569 -0.615 -0.455 -0.585 0.213
## year origin
## mpg 0.581 0.565
## cylinders -0.346 -0.569
## displacement -0.370 -0.615
## horsepower -0.416 -0.455
## weight -0.309 -0.585
## acceleration 0.290 0.213
## year 1.000 0.182
## origin 0.182 1.000
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
# Compare full model to intercept-only model to test overall significance
model_null <- lm(mpg ~ 1, data = Auto)
anova(model_null, model_full)## Analysis of Variance Table
##
## Model 1: mpg ~ 1
## Model 2: mpg ~ (cylinders + displacement + horsepower + weight + acceleration +
## year + origin + name) - name
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 391 23819.0
## 2 384 4252.2 7 19567 252.43 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
i. Is there a relationship between the predictors and the response?
Yes. The F-statistic from the overall regression is very large and
the p-value is essentially zero. The ANOVA comparing the null model to
the full model backs this up — we can reject the null hypothesis that
all the coefficients are zero. At least some of the predictors are
related to mpg.
ii. Which predictors appear to have a statistically significant relationship to the response?
Looking at the individual p-values, the predictors that are
significant at the 5% level are displacement,
weight, year, and origin. The
others — cylinders, horsepower, and
acceleration — aren’t significant on their own here, which
is probably because they’re correlated with the predictors that are
significant (collinearity).
iii. What does the coefficient for the year
variable suggest?
The year coefficient is positive, meaning newer cars
tend to get better mileage, holding everything else fixed. Roughly
speaking, each additional model year is associated with about 0.75 more
mpg on average. That tracks with the idea that fuel efficiency improved
a lot over the 1970s and early 1980s.
The residuals vs. fitted plot shows a slight curve, which suggests the linear model is missing some non-linearity in the data. There are a few observations with large residuals that could be considered outliers. The leverage plot also flags a handful of high-leverage points, though nothing extreme. Overall the fit isn’t terrible but there’s room to improve it with transformations.
model_int <- lm(mpg ~ cylinders + displacement + horsepower + weight +
acceleration + year + origin +
horsepower:weight + acceleration:year,
data = Auto)
summary(model_int)##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## acceleration + year + origin + horsepower:weight + acceleration:year,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.2162 -1.4673 -0.1164 1.2023 11.2795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.135e+02 1.825e+01 6.218 1.32e-09 ***
## cylinders 2.324e-01 2.780e-01 0.836 0.40375
## displacement -4.619e-03 6.658e-03 -0.694 0.48827
## horsepower -2.385e-01 2.257e-02 -10.567 < 2e-16 ***
## weight -1.058e-02 7.024e-04 -15.064 < 2e-16 ***
## acceleration -7.097e+00 1.127e+00 -6.299 8.27e-10 ***
## year -6.755e-01 2.356e-01 -2.867 0.00438 **
## origin 6.838e-01 2.409e-01 2.838 0.00478 **
## horsepower:weight 5.465e-05 4.987e-06 10.958 < 2e-16 ***
## acceleration:year 9.125e-02 1.463e-02 6.236 1.19e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.796 on 382 degrees of freedom
## Multiple R-squared: 0.8746, Adjusted R-squared: 0.8716
## F-statistic: 296 on 9 and 382 DF, p-value: < 2.2e-16
The horsepower:weight interaction turns out to be
significant, which makes sense — the effect of having a high-powered
engine on fuel economy is probably worse if the car is also heavy. The
acceleration:year interaction is less clear cut.
model_trans <- lm(mpg ~ log(weight) + sqrt(horsepower) + I(displacement^2) +
acceleration + year + origin,
data = Auto)
summary(model_trans)##
## Call:
## lm(formula = mpg ~ log(weight) + sqrt(horsepower) + I(displacement^2) +
## acceleration + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1055 -1.9035 -0.1003 1.7043 12.3349
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.269e+02 1.004e+01 12.645 < 2e-16 ***
## log(weight) -1.974e+01 1.547e+00 -12.758 < 2e-16 ***
## sqrt(horsepower) -9.592e-01 2.965e-01 -3.236 0.00132 **
## I(displacement^2) 4.188e-05 7.847e-06 5.336 1.62e-07 ***
## acceleration 2.184e-02 9.395e-02 0.232 0.81633
## year 7.803e-01 4.647e-02 16.792 < 2e-16 ***
## origin 1.002e+00 2.447e-01 4.097 5.11e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.012 on 385 degrees of freedom
## Multiple R-squared: 0.8533, Adjusted R-squared: 0.851
## F-statistic: 373.3 on 6 and 385 DF, p-value: < 2.2e-16
## Full model (untransformed):
cat(" R2 =", round(summary(model_full)$r.squared, 4),
" Adj R2 =", round(summary(model_full)$adj.r.squared, 4),
" RSE =", round(summary(model_full)$sigma, 4), "\n\n")## R2 = 0.8215 Adj R2 = 0.8182 RSE = 3.3277
## Transformed model:
cat(" R2 =", round(summary(model_trans)$r.squared, 4),
" Adj R2 =", round(summary(model_trans)$adj.r.squared, 4),
" RSE =", round(summary(model_trans)$sigma, 4), "\n")## R2 = 0.8533 Adj R2 = 0.851 RSE = 3.0124
Using log(weight) and sqrt(horsepower)
gives a slightly better R² and lower RSE than the untransformed version.
The residual plot also looks more randomly scattered, with less of the
curve that showed up before. So the log/square-root transformations do
seem to help capture the relationship better.
data(Carseats, package = "ISLR2")
cat("Dimensions:", nrow(Carseats), "rows x", ncol(Carseats), "columns\n")## Dimensions: 400 rows x 11 columns
## 'data.frame': 400 obs. of 4 variables:
## $ Sales: num 9.5 11.22 10.06 7.4 4.15 ...
## $ Price: num 120 83 80 97 128 72 108 120 124 124 ...
## $ Urban: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
## (Intercept) Price UrbanYes USYes
## 13.0435 -0.0545 -0.0219 1.2006
Intercept (~13.04): This is the estimated average Sales (in thousands of units) for a rural store outside the US when Price = 0. A price of 0 isn’t realistic, so this isn’t meaningful on its own — it just sets the baseline for the other coefficients.
Price (~-0.0545): Price is a quantitative variable. The coefficient tells us that for every one-dollar increase in the price of the car seat, Sales go down by about 0.0545 thousand units (roughly 54 units), holding Urban and US constant. The negative sign makes sense — as price goes up, sales tend to go down.
UrbanYes (~-0.0219): Urban is a qualitative variable coded as Yes or No. R created a dummy variable that equals 1 for urban stores and 0 for rural stores. The coefficient means urban stores sell about 0.022 thousand units (roughly 22 units) less than rural stores on average, holding Price and US fixed. But the p-value on this one is very high (~0.936), so this difference isn’t statistically significant — we can’t really conclude urban vs. rural location matters for sales.
USYes (~1.2006): US is also a qualitative Yes/No variable, and the dummy variable equals 1 for stores in the US and 0 for stores outside the US. Stores in the US sell about 1.2 thousand units (roughly 1,200 units) more than stores outside the US on average, holding Price and Urban constant. This one is highly significant.
With UrbanYes = 1 if urban (0 if rural) and USYes = 1 if in the US (0 otherwise):
\[\widehat{\text{Sales}} = 13.04 - 0.0545 \cdot \text{Price} - 0.0219 \cdot \text{UrbanYes} + 1.2006 \cdot \text{USYes}\]
Spelled out for each combination:
coef_table <- summary(model_cs)$coefficients
coef_df <- as.data.frame(round(coef_table, 4))
coef_df$Decision <- ifelse(coef_df[, "Pr(>|t|)"] < 0.05,
"Reject H0", "Fail to reject H0")
coef_df## Estimate Std. Error t value Pr(>|t|) Decision
## (Intercept) 13.0435 0.6510 20.0357 0.0000 Reject H0
## Price -0.0545 0.0052 -10.3892 0.0000 Reject H0
## UrbanYes -0.0219 0.2717 -0.0807 0.9357 Fail to reject H0
## USYes 1.2006 0.2590 4.6347 0.0000 Reject H0
We can reject H₀ for Price and US — both have p-values well below 0.05. We can’t reject H₀ for Urban since its p-value is basically 1.
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
## Model (a) with Price + Urban + US:
cat(" R2 =", round(summary(model_cs)$r.squared, 4),
" Adj R2 =", round(summary(model_cs)$adj.r.squared, 4),
" RSE =", round(summary(model_cs)$sigma, 4), "\n\n")## R2 = 0.2393 Adj R2 = 0.2335 RSE = 2.4725
## Model (e) with Price + US only:
cat(" R2 =", round(summary(model_cs2)$r.squared, 4),
" Adj R2 =", round(summary(model_cs2)$adj.r.squared, 4),
" RSE =", round(summary(model_cs2)$sigma, 4), "\n")## R2 = 0.2393 Adj R2 = 0.2354 RSE = 2.4694
The two models fit about the same. The R² and RSE are nearly
identical, and the adjusted R² actually goes up slightly when we drop
Urban because it penalizes for extra predictors. Since
Urban wasn’t significant anyway and removing it doesn’t hurt fit, the
smaller model in (e) is the better choice.
## 2.5 % 97.5 %
## (Intercept) 11.7903 14.2713
## Price -0.0648 -0.0442
## USYes 0.6915 1.7078
The confidence interval for Price is entirely negative (doesn’t cross zero), so we’re confident higher prices are associated with lower sales. The interval for USYes is entirely positive, so we’re confident US stores sell more on average. Both intervals confirm what the hypothesis tests showed.
n <- nrow(Carseats)
p <- length(coef(model_cs2))
hat_vals <- hatvalues(model_cs2)
std_resids <- rstudent(model_cs2)
cat("Observations with |studentized residual| > 2:", sum(abs(std_resids) > 2), "\n")## Observations with |studentized residual| > 2: 23
cat("High-leverage observations (> 2p/n =", round(2 * p / n, 4), "):",
sum(hat_vals > 2 * p / n), "\n")## High-leverage observations (> 2p/n = 0.015 ): 20
There are a few observations with studentized residuals above 2 in absolute value, which could be outliers. The leverage plot also flags a small number of high-leverage points. Nothing looks severe enough to be a major concern though — no single observation is clearly dominating the fit.
The formula for the coefficient when regressing Y onto X without an intercept (from equation 3.38) is:
\[\hat{\beta}_{Y \to X} = \frac{\sum x_i y_i}{\sum x_i^2}\]
And for the regression of X onto Y without an intercept:
\[\hat{\beta}_{X \to Y} = \frac{\sum x_i y_i}{\sum y_i^2}\]
The numerators are the same, so the two estimates are equal only when the denominators are equal — that is, when \(\sum x_i^2 = \sum y_i^2\). In other words, the coefficient for regressing Y onto X equals the coefficient for regressing X onto Y if and only if X and Y have the same sum of squared values.
set.seed(42)
n <- 100
x_b <- rnorm(n)
y_b <- rnorm(n)
beta_yx_b <- sum(x_b * y_b) / sum(x_b^2)
beta_xy_b <- sum(x_b * y_b) / sum(y_b^2)
cat("Sum of x^2:", round(sum(x_b^2), 4), "\n")## Sum of x^2: 107.4637
## Sum of y^2: 81.7008
## Beta (Y onto X): 0.024486
## Beta (X onto Y): 0.032207
## Equal? FALSE
res_yx_b <- lm(y_b ~ x_b - 1)
res_xy_b <- lm(x_b ~ y_b - 1)
cat("lm: Beta (Y onto X):", round(coef(res_yx_b), 6), "\n")## lm: Beta (Y onto X): 0.024486
## lm: Beta (X onto Y): 0.032207
Since x and y were generated independently, their sums of squares are different, so the two coefficient estimates come out different.
set.seed(7)
n <- 100
x_c <- rnorm(n)
# Scale z so that sum(y_c^2) = sum(x_c^2)
z <- rnorm(n)
y_c <- z * sqrt(sum(x_c^2) / sum(z^2))
beta_yx_c <- sum(x_c * y_c) / sum(x_c^2)
beta_xy_c <- sum(x_c * y_c) / sum(y_c^2)
cat("Sum of x^2:", round(sum(x_c^2), 6), "\n")## Sum of x^2: 92.93545
## Sum of y^2: 92.93545
## Beta (Y onto X): 0.02945
## Beta (X onto Y): 0.02945
## Equal? TRUE
res_yx_c <- lm(y_c ~ x_c - 1)
res_xy_c <- lm(x_c ~ y_c - 1)
cat("lm: Beta (Y onto X):", round(coef(res_yx_c), 6), "\n")## lm: Beta (Y onto X): 0.02945
## lm: Beta (X onto Y): 0.02945
By rescaling z so that its sum of squares matches x_c, we forced \(\sum x_i^2 = \sum y_i^2\), and the two coefficient estimates are exactly equal. This is a direct demonstration of what we showed in part (a).