Question 2: KNN Classifier vs KNN Regression

KNN Classifier is used when the response Y is qualitative (categorical). For a new point x0, it finds the K nearest neighbors and assigns the majority class among them. Output is a class label.

KNN Regression is used when the response Y is quantitative (continuous). For a new point x0, it finds the K nearest neighbors and returns the average of their response values. Output is a numeric prediction.

Key difference: The classifier uses majority vote; regression uses the mean of neighbors.
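
The contrast can be sketched in a few lines of base R. This is only an illustrative sketch: `knn_predict` is a hypothetical helper written for this answer (not a library function), and the toy data are made up.

```r
# Minimal KNN sketch in base R (hypothetical helper, not a library function).
# A factor response triggers a majority vote; a numeric response triggers the mean.
knn_predict <- function(train_x, train_y, x0, k = 3) {
  # Euclidean distance from x0 to every training row
  d  <- sqrt(rowSums(sweep(as.matrix(train_x), 2, x0)^2))
  nn <- order(d)[seq_len(k)]            # indices of the k nearest neighbors
  if (is.factor(train_y)) {
    votes <- table(train_y[nn])
    names(votes)[which.max(votes)]      # classification: most common class
  } else {
    mean(train_y[nn])                   # regression: average response
  }
}

# Toy training data (made up for illustration)
train_x <- data.frame(x1 = c(1, 2, 3, 6, 7, 8), x2 = c(1, 2, 1, 6, 7, 6))
y_class <- factor(c("A", "A", "A", "B", "B", "B"))
y_num   <- c(1.0, 1.5, 2.0, 6.0, 6.5, 7.0)

x0 <- c(2, 1)
pred_class <- knn_predict(train_x, y_class, x0, k = 3)  # "A"
pred_num   <- knn_predict(train_x, y_num,   x0, k = 3)  # 1.5
```

The neighbor search is identical in both cases; only the final aggregation step changes. Ties in the vote are broken here by taking the first class level, which is a simplification.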


Question 9: Multiple Linear Regression on Auto Dataset

(a) Scatterplot Matrix

library(ISLR)  # provides Auto and Carseats; assumed here (ISLR2 ships the same data sets)
data(Auto)
pairs(Auto)

(b) Correlation Matrix (excluding qualitative variable name)

cor(Auto[, -which(names(Auto) == "name")])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

(c) Multiple Linear Regression: mpg ~ all predictors except name

lm.fit9 <- lm(mpg ~ . - name, data = Auto)
summary(lm.fit9)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response?

Yes. The F-statistic is 252.4 with a p-value < 2.2e-16, essentially zero. We strongly reject the null hypothesis that all coefficients are zero — there is a statistically significant relationship between the predictors and mpg.
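
As a rough check, the F-statistic can be recomputed from the R² and degrees of freedom printed in the summary. The numbers below are copied from that output, so the result matches only up to rounding.

```r
# Values copied from the summary(lm.fit9) output above
r2  <- 0.8215   # Multiple R-squared
p   <- 7        # number of predictors
df2 <- 384      # residual degrees of freedom, n - p - 1

# F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
f_stat <- (r2 / p) / ((1 - r2) / df2)

# Upper-tail probability under the F(p, n - p - 1) distribution
p_val <- pf(f_stat, df1 = p, df2 = df2, lower.tail = FALSE)

round(f_stat, 1)   # close to the reported 252.4; the small gap comes from rounding R^2
p_val < 2.2e-16    # TRUE
```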

ii. Which predictors have a statistically significant relationship?

The significant predictors (p < 0.05) are: displacement (p = 0.00844), weight (p < 2e-16), year (p < 2e-16), and origin (p = 4.67e-07). Cylinders, horsepower, and acceleration are not significant.

iii. What does the year coefficient suggest?

The coefficient is 0.7508, meaning each additional model year is associated with an increase of about 0.75 mpg, holding all other predictors constant. In other words, newer cars in this data set tend to be more fuel-efficient than older cars with otherwise similar characteristics.
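
To attach some uncertainty to that estimate, a 95% confidence interval can be rebuilt from the estimate and standard error in the table. This is a sketch using the printed (rounded) values; `confint(lm.fit9)` would give the same interval directly from the fitted object.

```r
# Values copied from the year row of summary(lm.fit9)
est <- 0.750773
se  <- 0.050973
df  <- 384          # residual degrees of freedom

t_stat <- est / se                          # reproduces the reported t value, 14.729
ci <- est + c(-1, 1) * qt(0.975, df) * se   # roughly 0.65 to 0.85 mpg per model year
ci
```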

(d) Diagnostic Plots

par(mfrow = c(2, 2))
plot(lm.fit9)

par(mfrow = c(1, 1))

The residuals vs fitted plot shows mild non-linearity. There are a few potential outliers with large residuals. No severely influential points are evident from the leverage plot.
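
The visual checks can be complemented with numeric ones. The sketch below uses a stand-in model on the built-in `mtcars` data so that it runs without the ISLR package. `flag_points` is a hypothetical helper, and the cutoffs (|studentized residual| > 3, leverage above twice the average) are common rules of thumb rather than fixed standards.

```r
# Stand-in model on the built-in mtcars data (NOT the Auto fit above)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical helper: flag potential outliers and high-leverage points
flag_points <- function(fit, resid_cut = 3, lev_mult = 2) {
  n  <- nobs(fit)
  p1 <- length(coef(fit))        # number of coefficients, p + 1
  h  <- hatvalues(fit)           # leverage of each observation
  r  <- rstudent(fit)            # studentized residuals
  list(
    outliers      = names(r)[abs(r) > resid_cut],
    high_leverage = names(h)[h > lev_mult * p1 / n],  # average leverage is (p + 1) / n
    max_cooks     = max(cooks.distance(fit))
  )
}

flags <- flag_points(fit)
flags
```

Calling `flag_points(lm.fit9)` would apply the same checks to the Auto model.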

(e) Interaction Effects

lm.fit9e <- lm(mpg ~ . - name + horsepower:weight + acceleration:horsepower,
               data = Auto)
summary(lm.fit9e)
## 
## Call:
## lm(formula = mpg ~ . - name + horsepower:weight + acceleration:horsepower, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4489 -1.6817 -0.0238  1.3932 11.8609 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -7.655e+00  5.397e+00  -1.418 0.156907    
## cylinders                1.963e-01  2.916e-01   0.673 0.501235    
## displacement            -5.986e-03  7.504e-03  -0.798 0.425545    
## horsepower              -1.285e-01  3.786e-02  -3.394 0.000760 ***
## weight                  -9.289e-03  9.103e-04 -10.204  < 2e-16 ***
## acceleration             3.891e-01  1.643e-01   2.369 0.018336 *  
## year                     7.694e-01  4.431e-02  17.364  < 2e-16 ***
## origin                   7.206e-01  2.500e-01   2.882 0.004172 ** 
## horsepower:weight        4.752e-05  5.626e-06   8.448  6.3e-16 ***
## horsepower:acceleration -6.123e-03  1.777e-03  -3.445 0.000634 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.891 on 382 degrees of freedom
## Multiple R-squared:  0.866,  Adjusted R-squared:  0.8628 
## F-statistic: 274.3 on 9 and 382 DF,  p-value: < 2.2e-16

Both interaction terms are highly significant:

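  • horsepower:weight (p = 6.3e-16): The effect of horsepower on mpg depends on the weight of the car. The interaction coefficient is positive (4.752e-05) while the horsepower main effect is negative, so the mpg penalty per additional horsepower shrinks as weight increases; equivalently, the penalty per additional pound shrinks as horsepower increases.
  • horsepower:acceleration (p = 0.000634): The effect of horsepower on mpg also depends on acceleration. The interaction coefficient is negative (−6.123e-03), so the mpg penalty per additional horsepower grows as the acceleration value (time to reach 60 mph) increases.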
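
To see what the interaction terms imply numerically, the slope of mpg with respect to horsepower can be evaluated at a few settings. The coefficients are copied from the output above; the weights and acceleration value are illustrative choices within the range of the data, and `hp_slope` is a hypothetical helper.

```r
# Coefficients copied from the summary(lm.fit9e) output above
b_hp     <- -1.285e-01   # horsepower main effect
b_hp_wt  <-  4.752e-05   # horsepower:weight
b_hp_acc <- -6.123e-03   # horsepower:acceleration

# Change in predicted mpg per extra unit of horsepower at a given weight and acceleration
hp_slope <- function(weight, acceleration) {
  b_hp + b_hp_wt * weight + b_hp_acc * acceleration
}

light <- hp_slope(weight = 2000, acceleration = 15)  # about -0.125
heavy <- hp_slope(weight = 4000, acceleration = 15)  # about -0.030
c(light = light, heavy = heavy)
```

At the same acceleration value, the horsepower slope is less negative for the heavier car, which is what a positive horsepower:weight coefficient implies.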

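R² improved from 0.8215 to 0.866 and RSE dropped from 3.328 to 2.891. Because R² can never decrease when terms are added, the more telling comparison is adjusted R², which rose from 0.8182 to 0.8628; together with the lower RSE, this indicates the interactions add real explanatory power.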
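
A partial F-test makes the comparison formal. The sketch below reconstructs the residual sums of squares from the printed RSE values and degrees of freedom, so the statistic is approximate; with the fitted objects available, `anova(lm.fit9, lm.fit9e)` performs the same test exactly.

```r
# RSE and residual df copied from the two summaries above
rse0 <- 3.328; df0 <- 384   # model (c), no interactions
rse1 <- 2.891; df1 <- 382   # model (e), two interaction terms

# RSS = RSE^2 * residual df
rss0 <- rse0^2 * df0
rss1 <- rse1^2 * df1

# Partial F-statistic for the two added terms
f_partial <- ((rss0 - rss1) / (df0 - df1)) / (rss1 / df1)
p_partial <- pf(f_partial, df0 - df1, df1, lower.tail = FALSE)

round(f_partial, 1)   # roughly 63
p_partial             # far below 0.05
```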

(f) Variable Transformations

lm.fit9f_log  <- lm(mpg ~ log(horsepower),  data = Auto)
lm.fit9f_sqrt <- lm(mpg ~ sqrt(horsepower), data = Auto)
lm.fit9f_sq   <- lm(mpg ~ I(horsepower^2), data = Auto)

summary(lm.fit9f_log)
## 
## Call:
## lm(formula = mpg ~ log(horsepower), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2299  -2.7818  -0.2322   2.6661  15.4695 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     108.6997     3.0496   35.64   <2e-16 ***
## log(horsepower) -18.5822     0.6629  -28.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.501 on 390 degrees of freedom
## Multiple R-squared:  0.6683, Adjusted R-squared:  0.6675 
## F-statistic: 785.9 on 1 and 390 DF,  p-value: < 2.2e-16
summary(lm.fit9f_sqrt)
## 
## Call:
## lm(formula = mpg ~ sqrt(horsepower), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.9768  -3.2239  -0.2252   2.6881  16.1411 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        58.705      1.349   43.52   <2e-16 ***
## sqrt(horsepower)   -3.503      0.132  -26.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.665 on 390 degrees of freedom
## Multiple R-squared:  0.6437, Adjusted R-squared:  0.6428 
## F-statistic: 704.6 on 1 and 390 DF,  p-value: < 2.2e-16
summary(lm.fit9f_sq)
## 
## Call:
## lm(formula = mpg ~ I(horsepower^2), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.529  -3.798  -1.049   3.240  18.528 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.047e+01  4.466e-01   68.22   <2e-16 ***
## I(horsepower^2) -5.665e-04  2.827e-05  -20.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.485 on 390 degrees of freedom
## Multiple R-squared:  0.5074, Adjusted R-squared:  0.5061 
## F-statistic: 401.7 on 1 and 390 DF,  p-value: < 2.2e-16
par(mfrow = c(1, 3))
plot(Auto$horsepower, Auto$mpg, main = "log(horsepower)", xlab = "horsepower", ylab = "mpg")
lines(sort(Auto$horsepower),
      predict(lm.fit9f_log, data.frame(horsepower = sort(Auto$horsepower))),
      col = "red", lwd = 2)

plot(Auto$horsepower, Auto$mpg, main = "sqrt(horsepower)", xlab = "horsepower", ylab = "mpg")
lines(sort(Auto$horsepower),
      predict(lm.fit9f_sqrt, data.frame(horsepower = sort(Auto$horsepower))),
      col = "blue", lwd = 2)

plot(Auto$horsepower, Auto$mpg, main = "horsepower^2", xlab = "horsepower", ylab = "mpg")
lines(sort(Auto$horsepower),
      predict(lm.fit9f_sq, data.frame(horsepower = sort(Auto$horsepower))),
      col = "green", lwd = 2)

par(mfrow = c(1, 1))

Transformation      R²      RSE
log(horsepower)     0.6683  4.501
sqrt(horsepower)    0.6437  4.665
horsepower²         0.5074  5.485

log(horsepower) is the best of the three transformations tried: it has the highest R² and the lowest RSE. The log transformation best captures the convex, flattening relationship between horsepower and mpg: the fuel efficiency penalty for each additional unit of horsepower gets smaller as horsepower increases.
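
The flattening can be quantified from the fitted log model. Since the derivative of b1 * log(hp) with respect to hp is b1 / hp, the per-horsepower penalty shrinks in proportion to 1/hp. The coefficients are copied from the output above; `mpg_hat` and `hp_penalty` are hypothetical helper names.

```r
# Coefficients copied from summary(lm.fit9f_log)
b0 <- 108.6997
b1 <- -18.5822

mpg_hat    <- function(hp) b0 + b1 * log(hp)  # fitted mpg at a given horsepower
hp_penalty <- function(hp) b1 / hp            # d(mpg)/d(hp): change in mpg per extra horsepower

low_hp  <- hp_penalty(50)    # about -0.37 mpg per horsepower
high_hp <- hp_penalty(200)   # about -0.09 mpg per horsepower
c(low_hp = low_hp, high_hp = high_hp, fitted_at_100 = mpg_hat(100))
```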


Question 10: Multiple Linear Regression on Carseats Dataset

(a) Fit model: Sales ~ Price + Urban + US

data(Carseats)
lm.fit10 <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm.fit10)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Interpret each coefficient

  • Intercept (13.04): Expected sales, in thousands of units, for a non-urban, non-US store at Price = 0. Not practically meaningful on its own.
  • Price (−0.0545): Each additional dollar in price is associated with a decrease in sales of about 54 units (0.0545 thousand), holding other variables constant. Highly significant (p < 2e-16).
  • UrbanYes (−0.0219): Urban stores are estimated to sell about 22 fewer units than non-urban stores, holding other variables constant, but the effect is not statistically distinguishable from zero (p = 0.936).
  • USYes (1.2006): US stores sell about 1,200 more units than non-US stores, holding other variables constant. Highly significant (p = 4.86e-06).

(c) Model in equation form

Sales = 13.04 − 0.0545(Price) − 0.0219(Urban) + 1.2006(US)

Where Sales is in thousands of units; Urban = 1 if urban, 0 if rural; US = 1 if in US, 0 if not.
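
The equation can be turned into a small prediction function. This is a sketch with the unrounded coefficients copied from the summary output; `predict_sales` is a hypothetical helper, and the price of 100 is an arbitrary illustrative value. With the fitted object, `predict(lm.fit10, newdata = ...)` does the same job.

```r
# Coefficients copied from summary(lm.fit10); urban and us are 0/1 indicators
predict_sales <- function(price, urban, us) {
  13.043469 - 0.054459 * price - 0.021916 * urban + 1.200573 * us
}

s_us_urban <- predict_sales(price = 100, urban = 1, us = 1)   # about 8.78 (thousand units)
price_step <- predict_sales(100, 1, 1) - predict_sales(101, 1, 1)  # 0.054459, i.e. about 54 units
c(s_us_urban = s_us_urban, price_step = price_step)
```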

(d) Which predictors can you reject H₀: βⱼ = 0?

H₀ can be rejected for Price and US; both have p-values far below 0.05. For Urban, H₀ cannot be rejected (p = 0.936).

(e) Smaller model using only significant predictors

lm.fit10e <- lm(Sales ~ Price + US, data = Carseats)
summary(lm.fit10e)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

Both Price and US remain highly significant (Price: p < 2e-16, US: p = 4.71e-06). Dropping Urban had essentially no cost.

(f) How well do models (a) and (e) fit?

Model                    R²      Adjusted R²  RSE
(a) Price + Urban + US   0.2393  0.2335       2.472
(e) Price + US           0.2393  0.2354       2.469

The fits are virtually identical. Model (e) has a slightly higher adjusted R² and lower RSE despite having one fewer predictor, which supports dropping Urban. An R² of about 0.24 means either model explains only about 24% of the variance in Sales, so predictors not included here account for most of the variation.
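
The adjusted R² values follow from R², the sample size, and the number of predictors. The sketch below assumes n = 400 (residual df plus number of coefficients) and uses the rounded R² from the output, so the last digit can differ slightly from the printed value.

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 400                                # 396 residual df + 4 coefficients in model (a)
adj_a <- adj_r2(0.2393, n, p = 3)       # about 0.2335
adj_e <- adj_r2(0.2393, n, p = 2)       # about 0.2355 (reported 0.2354; R^2 is rounded)
c(model_a = adj_a, model_e = adj_e)
```

With the same R², the model with fewer predictors is penalized less, which is why (e) comes out slightly ahead.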

(g) 95% Confidence Intervals for model (e) coefficients

confint(lm.fit10e)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

Both Price and US intervals exclude zero, confirming significance. The Price interval is entirely negative and the US interval entirely positive.
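
These intervals are simply estimate ± t-quantile × standard error. The sketch below rebuilds them from the rounded values printed in the model (e) summary, so agreement with `confint()` holds only to about four decimal places.

```r
# Estimates and standard errors copied from summary(lm.fit10e)
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price =  0.00523, USYes = 0.25846)

tcrit <- qt(0.975, df = 397)            # two-sided 95% critical value
ci <- cbind(lower = est - tcrit * se,
            upper = est + tcrit * se)
ci
```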

(h) Outliers and high leverage observations

par(mfrow = c(2, 2))
plot(lm.fit10e)

par(mfrow = c(1, 1))

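Outliers: Observations 51, 69, and 377 are labeled as having the largest residuals. The residuals range only from −6.93 to 7.05, which is roughly ±2.8 residual standard errors (RSE = 2.469), so the standardized residuals stay within about ±3 and these points are at most mild outliers.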

High leverage: Observation 368 has leverage ~0.04, about five times the average leverage of (p + 1)/n = 3/400 = 0.0075. It nevertheless falls inside the Cook’s distance contours, as do all other points, so no observation is strongly influencing the regression coefficients.

Overall: The Q-Q plot is very linear (residuals approximately normal) and the Residuals vs Fitted plot shows no strong pattern. The linear model is reasonably appropriate.


Question 12: Simple Linear Regression Without an Intercept

(a) When are the estimates equal?

From formula (3.38), the no-intercept estimates for the two regressions are:

  • β̂(Y~X) = Σ(xᵢyᵢ) / Σ(xᵢ²)
  • β̂(X~Y) = Σ(xᵢyᵢ) / Σ(yᵢ²)

These are equal when Σ(xᵢ²) = Σ(yᵢ²), i.e., when X and Y have the same sum of squares. (They are also trivially equal, both zero, when Σ(xᵢyᵢ) = 0.)
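
The closed-form expressions can be checked against `lm()` directly. This is a small sketch on simulated data; the seed, sample size, and the names `xs` and `ys` are arbitrary choices made for illustration.

```r
set.seed(42)                     # arbitrary seed for this illustration
xs <- rnorm(50)
ys <- 3 * xs + rnorm(50)

# Closed-form no-intercept estimates
beta_yx <- sum(xs * ys) / sum(xs^2)
beta_xy <- sum(xs * ys) / sum(ys^2)

# Same quantities from lm() with the intercept removed
fit_yx <- unname(coef(lm(ys ~ xs + 0)))
fit_xy <- unname(coef(lm(xs ~ ys + 0)))

all.equal(beta_yx, fit_yx)       # TRUE
all.equal(beta_xy, fit_xy)       # TRUE

# The two estimates differ by exactly the ratio of the sums of squares
all.equal(beta_yx / beta_xy, sum(ys^2) / sum(xs^2))  # TRUE
```

The last line shows why equal sums of squares force equal estimates: the ratio of the two slopes is Σ(yᵢ²) / Σ(xᵢ²).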

(b) Example where estimates are DIFFERENT

set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

lm.yx <- lm(y ~ x + 0)
lm.xy <- lm(x ~ y + 0)

cat("Beta(Y~X):", coef(lm.yx), "\n")
## Beta(Y~X): 1.993876
cat("Beta(X~Y):", coef(lm.xy), "\n")
## Beta(X~Y): 0.3911145
cat("sum(x^2):", sum(x^2), "  sum(y^2):", sum(y^2), "\n")
## sum(x^2): 81.05509   sum(y^2): 413.2135

The estimates differ because sum(x²) ≠ sum(y²).

(c) Example where estimates are THE SAME

set.seed(1)
x2 <- rnorm(100)
y2 <- x2[sample(100)]

lm.yx2 <- lm(y2 ~ x2 + 0)
lm.xy2 <- lm(x2 ~ y2 + 0)

cat("Beta(Y2~X2):", coef(lm.yx2), "\n")
## Beta(Y2~X2): -0.07767695
cat("Beta(X2~Y2):", coef(lm.xy2), "\n")
## Beta(X2~Y2): -0.07767695
cat("sum(x2^2):", sum(x2^2), "  sum(y2^2):", sum(y2^2), "\n")
## sum(x2^2): 81.05509   sum(y2^2): 81.05509

The estimates are equal because permuting x gives y the same sum of squares as x.