STA 6543 Predictive Modeling

Problem 2

Carefully explain the differences between the KNN classifier and KNN regression methods.

Both KNN classifier and KNN regression work the same way up to a point — you pick a test observation \(x_0\), find its K nearest neighbors in the training data, and use those neighbors to make a prediction. The difference is in what you’re predicting and how you get there.

With KNN regression, the response is quantitative, so once you have your K neighbors you just take the average of their Y values and that’s your prediction:

\[\hat{f}(x_0) = \frac{1}{K} \sum_{x_i \in \mathcal{N}_0} y_i\]

With KNN classifier, the response is categorical (a class label), so averaging doesn’t make sense. Instead you look at what class shows up most among the K neighbors — basically a majority vote. More formally, you estimate the probability of each class and assign \(x_0\) to whichever class has the highest probability:

\[P(Y = j \mid X = x_0) = \frac{1}{K} \sum_{x_i \in \mathcal{N}_0} \mathbf{1}(y_i = j)\]

So the two methods are really built for different situations. KNN regression gives you a number, KNN classifier gives you a category. Both are non-parametric, meaning neither assumes a specific functional form for the relationship between X and Y. And both are affected by the choice of K in the same way — smaller K means a more flexible, wiggly fit (low bias, high variance) and larger K gives a smoother fit (higher bias, lower variance). They also both struggle when the number of predictors is large because the “nearest” neighbors end up not being very close at all in high-dimensional space.

Problem 9 — Multiple Linear Regression on the Auto Data Set

data(Auto, package = "ISLR2")
Auto <- na.omit(Auto)
cat("Dimensions:", nrow(Auto), "rows x", ncol(Auto), "columns\n")

## Dimensions: 392 rows x 9 columns

head(Auto)

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

(a) Scatterplot matrix of all variables

Auto_numeric <- Auto[, sapply(Auto, is.numeric)]
pairs(Auto_numeric,
      main = "Scatterplot Matrix — Auto Data Set",
      pch  = 20,
      cex  = 0.5,
      col  = "steelblue")

(b) Matrix of correlations between the variables

round(cor(Auto_numeric), 3)

##                 mpg cylinders displacement horsepower weight acceleration
## mpg           1.000    -0.778       -0.805     -0.778 -0.832        0.423
## cylinders    -0.778     1.000        0.951      0.843  0.898       -0.505
## displacement -0.805     0.951        1.000      0.897  0.933       -0.544
## horsepower   -0.778     0.843        0.897      1.000  0.865       -0.689
## weight       -0.832     0.898        0.933      0.865  1.000       -0.417
## acceleration  0.423    -0.505       -0.544     -0.689 -0.417        1.000
## year          0.581    -0.346       -0.370     -0.416 -0.309        0.290
## origin        0.565    -0.569       -0.615     -0.455 -0.585        0.213
##                year origin
## mpg           0.581  0.565
## cylinders    -0.346 -0.569
## displacement -0.370 -0.615
## horsepower   -0.416 -0.455
## weight       -0.309 -0.585
## acceleration  0.290  0.213
## year          1.000  0.182
## origin        0.182  1.000

(c) Multiple linear regression: mpg ~ all predictors except name

model_full <- lm(mpg ~ . - name, data = Auto)
summary(model_full)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

# Compare full model to intercept-only model to test overall significance
model_null <- lm(mpg ~ 1, data = Auto)
anova(model_null, model_full)

## Analysis of Variance Table
## 
## Model 1: mpg ~ 1
## Model 2: mpg ~ (cylinders + displacement + horsepower + weight + acceleration + 
##     year + origin + name) - name
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    391 23819.0                                  
## 2    384  4252.2  7     19567 252.43 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

i. Is there a relationship between the predictors and the response?

Yes. The F-statistic from the overall regression is very large and the p-value is essentially zero. The ANOVA comparing the null model to the full model backs this up — we can reject the null hypothesis that all the coefficients are zero. At least some of the predictors are related to mpg.

ii. Which predictors appear to have a statistically significant relationship to the response?

Looking at the individual p-values, the predictors that are significant at the 5% level are displacement, weight, year, and origin. The others — cylinders, horsepower, and acceleration — aren’t significant on their own here, which is probably because they’re correlated with the predictors that are significant (collinearity).

iii. What does the coefficient for the year variable suggest?

The year coefficient is positive, meaning newer cars tend to get better mileage, holding everything else fixed. Roughly speaking, each additional model year is associated with about 0.75 more mpg on average. That tracks with the idea that fuel efficiency improved a lot over the 1970s and early 1980s.

(d) Diagnostic plots

par(mfrow = c(2, 2))
plot(model_full, pch = 20, cex = 0.6)

par(mfrow = c(1, 1))

The residuals vs. fitted plot shows a slight curve, which suggests the linear model is missing some non-linearity in the data. There are a few observations with large residuals that could be considered outliers. The leverage plot also flags a handful of high-leverage points, though nothing extreme. Overall the fit isn’t terrible but there’s room to improve it with transformations.

(e) Fit models with interaction terms

model_int <- lm(mpg ~ cylinders + displacement + horsepower + weight +
                  acceleration + year + origin +
                  horsepower:weight + acceleration:year,
                data = Auto)
summary(model_int)

## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
##     acceleration + year + origin + horsepower:weight + acceleration:year, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.2162 -1.4673 -0.1164  1.2023 11.2795 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.135e+02  1.825e+01   6.218 1.32e-09 ***
## cylinders          2.324e-01  2.780e-01   0.836  0.40375    
## displacement      -4.619e-03  6.658e-03  -0.694  0.48827    
## horsepower        -2.385e-01  2.257e-02 -10.567  < 2e-16 ***
## weight            -1.058e-02  7.024e-04 -15.064  < 2e-16 ***
## acceleration      -7.097e+00  1.127e+00  -6.299 8.27e-10 ***
## year              -6.755e-01  2.356e-01  -2.867  0.00438 ** 
## origin             6.838e-01  2.409e-01   2.838  0.00478 ** 
## horsepower:weight  5.465e-05  4.987e-06  10.958  < 2e-16 ***
## acceleration:year  9.125e-02  1.463e-02   6.236 1.19e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.796 on 382 degrees of freedom
## Multiple R-squared:  0.8746, Adjusted R-squared:  0.8716 
## F-statistic:   296 on 9 and 382 DF,  p-value: < 2.2e-16

The horsepower:weight interaction turns out to be significant, which makes sense — the effect of having a high-powered engine on fuel economy is probably worse if the car is also heavy. The acceleration:year interaction is less clear cut.

(f) Transformations of the variables

model_trans <- lm(mpg ~ log(weight) + sqrt(horsepower) + I(displacement^2) +
                    acceleration + year + origin,
                  data = Auto)
summary(model_trans)

## 
## Call:
## lm(formula = mpg ~ log(weight) + sqrt(horsepower) + I(displacement^2) + 
##     acceleration + year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1055 -1.9035 -0.1003  1.7043 12.3349 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.269e+02  1.004e+01  12.645  < 2e-16 ***
## log(weight)       -1.974e+01  1.547e+00 -12.758  < 2e-16 ***
## sqrt(horsepower)  -9.592e-01  2.965e-01  -3.236  0.00132 ** 
## I(displacement^2)  4.188e-05  7.847e-06   5.336 1.62e-07 ***
## acceleration       2.184e-02  9.395e-02   0.232  0.81633    
## year               7.803e-01  4.647e-02  16.792  < 2e-16 ***
## origin             1.002e+00  2.447e-01   4.097 5.11e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.012 on 385 degrees of freedom
## Multiple R-squared:  0.8533, Adjusted R-squared:  0.851 
## F-statistic: 373.3 on 6 and 385 DF,  p-value: < 2.2e-16

cat("Full model (untransformed):\n")

## Full model (untransformed):

cat("  R2 =", round(summary(model_full)$r.squared, 4),
    " Adj R2 =", round(summary(model_full)$adj.r.squared, 4),
    " RSE =", round(summary(model_full)$sigma, 4), "\n\n")

##   R2 = 0.8215  Adj R2 = 0.8182  RSE = 3.3277

cat("Transformed model:\n")

## Transformed model:

cat("  R2 =", round(summary(model_trans)$r.squared, 4),
    " Adj R2 =", round(summary(model_trans)$adj.r.squared, 4),
    " RSE =", round(summary(model_trans)$sigma, 4), "\n")

##   R2 = 0.8533  Adj R2 = 0.851  RSE = 3.0124

plot(model_trans, which = 1, pch = 20, cex = 0.6,
     main = "Residuals vs Fitted — Transformed Model")

Using log(weight) and sqrt(horsepower) gives a slightly better R² and lower RSE than the untransformed version. The residual plot also looks more randomly scattered, with less of the curve that showed up before. So the log/square-root transformations do seem to help capture the relationship better.

Problem 10 — Multiple Regression on the Carseats Data Set

data(Carseats, package = "ISLR2")
cat("Dimensions:", nrow(Carseats), "rows x", ncol(Carseats), "columns\n")

## Dimensions: 400 rows x 11 columns

str(Carseats[, c("Sales", "Price", "Urban", "US")])

## 'data.frame':    400 obs. of  4 variables:
##  $ Sales: num  9.5 11.22 10.06 7.4 4.15 ...
##  $ Price: num  120 83 80 97 128 72 108 120 124 124 ...
##  $ Urban: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US   : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...

(a) Fit a multiple regression model: Sales ~ Price + Urban + US

model_cs <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model_cs)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Interpretation of each coefficient

round(coef(model_cs), 4)

## (Intercept)       Price    UrbanYes       USYes 
##     13.0435     -0.0545     -0.0219      1.2006

Intercept (~13.04): This is the estimated average Sales (in thousands of units) for a rural store outside the US when Price = 0. A price of 0 isn’t realistic, so this isn’t meaningful on its own — it just sets the baseline for the other coefficients.

Price (~-0.0545): Price is a quantitative variable. The coefficient tells us that for every one-dollar increase in the price of the car seat, Sales go down by about 0.0545 thousand units (roughly 54 units), holding Urban and US constant. The negative sign makes sense — as price goes up, sales tend to go down.

UrbanYes (~-0.0219): Urban is a qualitative variable coded as Yes or No. R created a dummy variable that equals 1 for urban stores and 0 for rural stores. The coefficient means urban stores sell about 0.022 thousand units (roughly 22 units) less than rural stores on average, holding Price and US fixed. But the p-value on this one is very high (~0.936), so this difference isn’t statistically significant — we can’t really conclude urban vs. rural location matters for sales.

USYes (~1.2006): US is also a qualitative Yes/No variable, and the dummy variable equals 1 for stores in the US and 0 for stores outside the US. Stores in the US sell about 1.2 thousand units (roughly 1,200 units) more than stores outside the US on average, holding Price and Urban constant. This one is highly significant.

(c) Model equation

With UrbanYes = 1 if urban (0 if rural) and USYes = 1 if in the US (0 otherwise):

\[\widehat{\text{Sales}} = 13.04 - 0.0545 \cdot \text{Price} - 0.0219 \cdot \text{UrbanYes} + 1.2006 \cdot \text{USYes}\]

Spelled out for each combination:

Rural, not in US: \(13.04 - 0.0545 \times \text{Price}\)
Urban, not in US: \(13.02 - 0.0545 \times \text{Price}\)
Rural, in US: \(14.24 - 0.0545 \times \text{Price}\)
Urban, in US: \(14.22 - 0.0545 \times \text{Price}\)

(d) For which predictors can we reject H₀: βⱼ = 0?

coef_table <- summary(model_cs)$coefficients
coef_df <- as.data.frame(round(coef_table, 4))
coef_df$Decision <- ifelse(coef_df[, "Pr(>|t|)"] < 0.05,
                           "Reject H0", "Fail to reject H0")
coef_df

##             Estimate Std. Error  t value Pr(>|t|)          Decision
## (Intercept)  13.0435     0.6510  20.0357   0.0000         Reject H0
## Price        -0.0545     0.0052 -10.3892   0.0000         Reject H0
## UrbanYes     -0.0219     0.2717  -0.0807   0.9357 Fail to reject H0
## USYes         1.2006     0.2590   4.6347   0.0000         Reject H0

We can reject H₀ for Price and US — both have p-values well below 0.05. We can’t reject H₀ for Urban since its p-value is basically 1.

(e) Fit a smaller model using only significant predictors

model_cs2 <- lm(Sales ~ Price + US, data = Carseats)
summary(model_cs2)

## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the two models fit?

cat("Model (a) with Price + Urban + US:\n")

## Model (a) with Price + Urban + US:

cat("  R2 =", round(summary(model_cs)$r.squared, 4),
    " Adj R2 =", round(summary(model_cs)$adj.r.squared, 4),
    " RSE =", round(summary(model_cs)$sigma, 4), "\n\n")

##   R2 = 0.2393  Adj R2 = 0.2335  RSE = 2.4725

cat("Model (e) with Price + US only:\n")

## Model (e) with Price + US only:

cat("  R2 =", round(summary(model_cs2)$r.squared, 4),
    " Adj R2 =", round(summary(model_cs2)$adj.r.squared, 4),
    " RSE =", round(summary(model_cs2)$sigma, 4), "\n")

##   R2 = 0.2393  Adj R2 = 0.2354  RSE = 2.4694

The two models fit about the same. The R² and RSE are nearly identical, and the adjusted R² actually goes up slightly when we drop Urban because it penalizes for extra predictors. Since Urban wasn’t significant anyway and removing it doesn’t hurt fit, the smaller model in (e) is the better choice.

(g) 95% confidence intervals for the coefficient(s) in model (e)

round(confint(model_cs2, level = 0.95), 4)

##               2.5 %  97.5 %
## (Intercept) 11.7903 14.2713
## Price       -0.0648 -0.0442
## USYes        0.6915  1.7078

The confidence interval for Price is entirely negative (doesn’t cross zero), so we’re confident higher prices are associated with lower sales. The interval for USYes is entirely positive, so we’re confident US stores sell more on average. Both intervals confirm what the hypothesis tests showed.

(h) Evidence of outliers or high leverage observations?

par(mfrow = c(2, 2))
plot(model_cs2, pch = 20, cex = 0.6)

par(mfrow = c(1, 1))

n <- nrow(Carseats)
p <- length(coef(model_cs2))
hat_vals   <- hatvalues(model_cs2)
std_resids <- rstudent(model_cs2)

cat("Observations with |studentized residual| > 2:", sum(abs(std_resids) > 2), "\n")

## Observations with |studentized residual| > 2: 23

cat("High-leverage observations (> 2p/n =", round(2 * p / n, 4), "):",
    sum(hat_vals > 2 * p / n), "\n")

## High-leverage observations (> 2p/n = 0.015 ): 20

There are a few observations with studentized residuals above 2 in absolute value, which could be outliers. The leverage plot also flags a small number of high-leverage points. Nothing looks severe enough to be a major concern though — no single observation is clearly dominating the fit.

Problem 12 — Simple Linear Regression Without an Intercept

(a) When are the two coefficient estimates the same?

The formula for the coefficient when regressing Y onto X without an intercept (from equation 3.38) is:

\[\hat{\beta}_{Y \to X} = \frac{\sum x_i y_i}{\sum x_i^2}\]

And for the regression of X onto Y without an intercept:

\[\hat{\beta}_{X \to Y} = \frac{\sum x_i y_i}{\sum y_i^2}\]

The numerators are the same, so the two estimates are equal only when the denominators are equal — that is, when \(\sum x_i^2 = \sum y_i^2\). In other words, the coefficient for regressing Y onto X equals the coefficient for regressing X onto Y if and only if X and Y have the same sum of squared values.

(b) Example where the two estimates are different

set.seed(42)
n   <- 100
x_b <- rnorm(n)
y_b <- rnorm(n)

beta_yx_b <- sum(x_b * y_b) / sum(x_b^2)
beta_xy_b <- sum(x_b * y_b) / sum(y_b^2)

cat("Sum of x^2:", round(sum(x_b^2), 4), "\n")

## Sum of x^2: 107.4637

cat("Sum of y^2:", round(sum(y_b^2), 4), "\n\n")

## Sum of y^2: 81.7008

cat("Beta (Y onto X):", round(beta_yx_b, 6), "\n")

## Beta (Y onto X): 0.024486

cat("Beta (X onto Y):", round(beta_xy_b, 6), "\n")

## Beta (X onto Y): 0.032207

cat("Equal?", isTRUE(all.equal(beta_yx_b, beta_xy_b)), "\n")

## Equal? FALSE

res_yx_b <- lm(y_b ~ x_b - 1)
res_xy_b <- lm(x_b ~ y_b - 1)
cat("lm: Beta (Y onto X):", round(coef(res_yx_b), 6), "\n")

## lm: Beta (Y onto X): 0.024486

cat("lm: Beta (X onto Y):", round(coef(res_xy_b), 6), "\n")

## lm: Beta (X onto Y): 0.032207

Since x and y were generated independently, their sums of squares are different, so the two coefficient estimates come out different.

(c) Example where the two estimates are the same

set.seed(7)
n   <- 100
x_c <- rnorm(n)

# Scale z so that sum(y_c^2) = sum(x_c^2)
z   <- rnorm(n)
y_c <- z * sqrt(sum(x_c^2) / sum(z^2))

beta_yx_c <- sum(x_c * y_c) / sum(x_c^2)
beta_xy_c <- sum(x_c * y_c) / sum(y_c^2)

cat("Sum of x^2:", round(sum(x_c^2), 6), "\n")

## Sum of x^2: 92.93545

cat("Sum of y^2:", round(sum(y_c^2), 6), "\n\n")

## Sum of y^2: 92.93545

cat("Beta (Y onto X):", round(beta_yx_c, 6), "\n")

## Beta (Y onto X): 0.02945

cat("Beta (X onto Y):", round(beta_xy_c, 6), "\n")

## Beta (X onto Y): 0.02945

cat("Equal?", isTRUE(all.equal(beta_yx_c, beta_xy_c)), "\n")

## Equal? TRUE

res_yx_c <- lm(y_c ~ x_c - 1)
res_xy_c <- lm(x_c ~ y_c - 1)
cat("lm: Beta (Y onto X):", round(coef(res_yx_c), 6), "\n")

## lm: Beta (Y onto X): 0.02945

cat("lm: Beta (X onto Y):", round(coef(res_xy_c), 6), "\n")

## lm: Beta (X onto Y): 0.02945

By rescaling z so that its sum of squares matches x_c, we forced \(\sum x_i^2 = \sum y_i^2\), and the two coefficient estimates are exactly equal. This is a direct demonstration of what we showed in part (a).

STA 6543 Predictive Modeling — Homework 2

Haniya Farhat

June 21, 2026