Assignment 2

STA6543

Author

Stephen Garcia (wqr974)

Published

May 29, 2025

Chapter 3

Linear Regression

Conceptual Exercises

Problem 2

Carefully explain the differences between the KNN classifier and KNN regression methods.

Response:
K-Nearest Neighbors (KNN) is a non-parametric method that predicts for a new data point based on its K closest training examples. In KNN classification, the prediction is the most common class among those neighbors (a majority vote), which suits categorical outcomes; in KNN regression, the prediction is the average of the neighbors' numeric target values, which suits continuous outcomes. The key differences therefore lie in the prediction rule (majority vote vs. mean), the loss being minimized (misclassification error vs. mean squared error), and the evaluation metrics, with classification assessed by accuracy or F1-score and regression by RMSE or R².
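To make the contrast concrete, here is a minimal base-R sketch (not part of the assignment; the toy data and the knn_predict helper are made up for illustration) that uses the same K nearest neighbors for both tasks:

# Same neighbors, two aggregation rules: majority vote vs. mean.
# All objects below (toy data, knn_predict) are hypothetical illustrations.
set.seed(1)
train_x <- matrix(rnorm(40), ncol = 2)                       # 20 training points, 2 features
class_y <- factor(sample(c("A", "B"), 20, replace = TRUE))   # categorical target
num_y   <- rnorm(20, mean = 10)                              # numeric target

knn_predict <- function(query, k = 5) {
  dists <- sqrt(rowSums((train_x - matrix(query, nrow(train_x), 2, byrow = TRUE))^2))
  nn    <- order(dists)[1:k]                  # indices of the k nearest neighbors
  list(
    classification = names(which.max(table(class_y[nn]))),   # majority vote
    regression     = mean(num_y[nn])                          # average of neighbors' values
  )
}

knn_predict(c(0, 0), k = 5)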

Applied Exercises:

Problem 9

This question involves the use of multiple linear regression on the Auto data set.

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

library(ISLR)
data(Auto)
View(Auto)
library(ggplot2)
library(GGally)
# I did not include the 'name' column in the scatterplot matrix
# as it is a character variable and not suitable for numerical analysis.
plot_obj <- ggpairs(
  Auto,
  columns = 1:8, 
  aes(color = factor(cylinders), alpha = 0.5),
  upper = list(continuous = wrap("cor", size = 2.5)),
  lower = list(continuous = wrap("points", alpha = 0.6, size = 1.5)),
  diag = list(continuous = wrap("densityDiag"))
) +
  theme_minimal() +
  labs(
    title = "Scatterplot Matrix of Auto Data",
    subtitle = "Colored by Number of Cylinders",
    color = "Cylinders"
  )

suppressWarnings(print(plot_obj))

(b)
Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

Response:
The scatterplot matrix above already displays the pairwise correlations in its upper panels, but here is the full correlation matrix computed with cor() for clarity.

library(kableExtra)
library(knitr)

cor_matrix <- cor(Auto[, -which(names(Auto) == "name")])

kable(round(cor_matrix, 2), caption = "Correlation Matrix (Rounded)", longtable = FALSE) %>%
  kable_styling(font_size = 8, latex_options = "scale_down")
Correlation Matrix (Rounded)

                mpg  cylinders  displacement  horsepower  weight  acceleration   year  origin
mpg            1.00      -0.78         -0.81       -0.78   -0.83          0.42   0.58    0.57
cylinders     -0.78       1.00          0.95        0.84    0.90         -0.50  -0.35   -0.57
displacement  -0.81       0.95          1.00        0.90    0.93         -0.54  -0.37   -0.61
horsepower    -0.78       0.84          0.90        1.00    0.86         -0.69  -0.42   -0.46
weight        -0.83       0.90          0.93        0.86    1.00         -0.42  -0.31   -0.59
acceleration   0.42      -0.50         -0.54       -0.69   -0.42          1.00   0.29    0.21
year           0.58      -0.35         -0.37       -0.42   -0.31          0.29   1.00    0.18
origin         0.57      -0.57         -0.61       -0.46   -0.59          0.21   0.18    1.00

(c)
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

library(broom)
library(kableExtra)

# Perform multiple linear regression with mpg as the response variable
model <- lm(mpg ~ ., data = Auto[, -which(names(Auto) == "name")])
summary(model)

Call:
lm(formula = mpg ~ ., data = Auto[, -which(names(Auto) == "name")])

Residuals:
    Min      1Q  Median      3Q     Max 
-9.5903 -2.1565 -0.1169  1.8690 13.0604 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
cylinders     -0.493376   0.323282  -1.526  0.12780    
displacement   0.019896   0.007515   2.647  0.00844 ** 
horsepower    -0.016951   0.013787  -1.230  0.21963    
weight        -0.006474   0.000652  -9.929  < 2e-16 ***
acceleration   0.080576   0.098845   0.815  0.41548    
year           0.750773   0.050973  14.729  < 2e-16 ***
origin         1.426141   0.278136   5.127 4.67e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared:  0.8215,    Adjusted R-squared:  0.8182 
F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Response:

  i. Is there a relationship between the predictors and the response? Yes. Several predictors have statistically significant coefficients, and the overall F-statistic (252.4, p < 2.2e-16) confirms that at least one predictor is related to mpg.

  ii. Which predictors appear to have a statistically significant relationship to the response? The predictors with a statistically significant relationship to mpg are displacement, weight, year, and origin. Their p-values are less than 0.05, indicating strong evidence against the null hypothesis that their coefficients are zero.

  iii. What does the coefficient for the year variable suggest? The coefficient for year (about 0.75) suggests that, holding the other predictors fixed, each additional model year is associated with an increase of roughly 0.75 mpg. This implies that newer cars tend to be more fuel-efficient than older models.

Aside:
I would have thought that cylinders and horsepower would also be significant predictors, but they are not. This could be due to multicollinearity or other factors affecting the model. I am going to check for multicollinearity using the Variance Inflation Factor (VIF).

library(car)
kable(round(vif(model), 2), caption = "Variance Inflation Factor (VIF)", longtable = FALSE) %>%
  kable_styling(font_size = 8, latex_options = "scale_down")
Variance Inflation Factor (VIF)

                VIF
cylinders     10.74
displacement  21.84
horsepower     9.94
weight        10.83
acceleration   2.63
year           1.24
origin         1.77

The VIF values for displacement, cylinders, and horsepower are high (well above the common rule-of-thumb cutoff of 5 to 10), indicating potential multicollinearity. These variables are strongly correlated with other predictors in the model, which inflates the standard errors of their coefficients.

I removed these variables one at a time and re-fit the model to see whether the fit improved. Dropping displacement, horsepower, and cylinders improved the model fit, as indicated by a lower AIC value.

model_removed <- lm(
  mpg ~ ., 
  data = Auto[, -which(names(Auto) %in% c("name", "displacement", "horsepower", "cylinders"))]
)
summary(model_removed)

Call:
lm(formula = mpg ~ ., data = Auto[, -which(names(Auto) %in% c("name", 
    "displacement", "horsepower", "cylinders"))])

Residuals:
    Min      1Q  Median      3Q     Max 
-9.7456 -2.1149 -0.0255  1.7654 13.1961 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.879e+01  4.051e+00  -4.639 4.79e-06 ***
weight       -5.893e-03  2.685e-04 -21.951  < 2e-16 ***
acceleration  7.966e-02  6.875e-02   1.159    0.247    
year          7.465e-01  4.917e-02  15.181  < 2e-16 ***
origin        1.163e+00  2.593e-01   4.487 9.54e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.346 on 387 degrees of freedom
Multiple R-squared:  0.8181,    Adjusted R-squared:  0.8162 
F-statistic: 435.1 on 4 and 387 DF,  p-value: < 2.2e-16
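The AIC comparison mentioned above is not shown explicitly; a minimal sketch of how it could be reproduced from the two fitted objects (model from part (c) and model_removed above) is:

# Compare the full and reduced fits on fit-versus-complexity criteria.
# Lower AIC/BIC indicates a better trade-off between fit and number of parameters.
AIC(model, model_removed)
BIC(model, model_removed)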
par(mfrow = c(2, 2))
plot(model_removed)

Checking the Residuals vs. Fitted plot, I see that the residuals are not randomly scattered, which suggests the model is not capturing all of the structure in the data. I will try some transformations to see whether the fit improves.

model_squared_term <- lm(mpg ~ weight + I(weight^2) + year + origin, data = Auto)
summary(model_squared_term)

Call:
lm(formula = mpg ~ weight + I(weight^2) + year + origin, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.8991 -1.5687 -0.1412  1.6235 12.7733 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.422e-01  4.058e+00  -0.109   0.9133    
weight      -2.041e-02  1.537e-03 -13.281   <2e-16 ***
I(weight^2)  2.213e-06  2.332e-07   9.487   <2e-16 ***
year         8.247e-01  4.416e-02  18.675   <2e-16 ***
origin       5.026e-01  2.435e-01   2.064   0.0397 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.019 on 387 degrees of freedom
Multiple R-squared:  0.8519,    Adjusted R-squared:  0.8504 
F-statistic: 556.5 on 4 and 387 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model_squared_term)

model_log <- lm(log(mpg) ~ weight + year + origin, data = Auto)
summary(model_log)

Call:
lm(formula = log(mpg) ~ weight + year + origin, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42915 -0.06814  0.00772  0.06895  0.37959 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.541e+00  1.447e-01  10.647   <2e-16 ***
weight      -2.915e-04  9.190e-06 -31.721   <2e-16 ***
year         3.129e-02  1.748e-03  17.902   <2e-16 ***
origin       3.081e-02  9.372e-03   3.288   0.0011 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1211 on 388 degrees of freedom
Multiple R-squared:  0.8742,    Adjusted R-squared:  0.8732 
F-statistic: 898.8 on 3 and 388 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model_log)

The log transformation of mpg seems to improve the model fit more than the quadratic term for weight: the residuals appear more randomly distributed, and the diagnostics suggest a better fit. The trade-off is interpretability, since the coefficients of the log model are on the log(mpg) scale, which is less intuitive than the original mpg scale.
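One way to soften that interpretability trade-off (a sketch, using the model_log fit above) is to express the log-scale coefficients as approximate percent changes in mpg:

# exp(beta) - 1 gives the multiplicative (percent) change in mpg per one-unit
# increase in each predictor, holding the others fixed.
pct_change <- (exp(coef(model_log)[-1]) - 1) * 100
round(pct_change, 3)
# For example, each additional model year corresponds to roughly a
# 100 * (exp(0.0313) - 1) ~ 3.2% increase in expected mpg.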

# Generate predictions
Auto$pred_quad <- predict(model_squared_term)
Auto$pred_log  <- exp(predict(model_log))  # back-transform log(mpg) to mpg

# Overlay predicted vs actual mpg
plot(Auto$mpg, Auto$pred_quad, col = "blue", pch = 16,
     xlab = "Actual MPG", ylab = "Predicted MPG",
     main = "Overlay: Quadratic vs Log Model")
points(Auto$mpg, Auto$pred_log, col = "darkgreen", pch = 17)
abline(0, 1, lty = 2)  # reference line (perfect prediction)

legend("topleft",
       legend = c("Quadratic Model", "Log Model"),
       col = c("blue", "darkgreen"),
       pch = c(16, 17))

# Root Mean Squared Error
sqrt(mean((Auto$mpg - Auto$pred_quad)^2))  # RMSE for quadratic
[1] 2.999864
sqrt(mean((Auto$mpg - Auto$pred_log)^2))   # RMSE for log model
[1] 2.996936
# Mean Absolute Percentage Error
mean(abs((Auto$mpg - Auto$pred_quad) / Auto$mpg)) * 100  # MAPE for quadratic
[1] 9.603122
mean(abs((Auto$mpg - Auto$pred_log) / Auto$mpg)) * 100   # MAPE for log model
[1] 9.177391

Response:
The Root Mean Squared Error (RMSE) of the quadratic model (3.000) is slightly higher than that of the log model (2.997), and the Mean Absolute Percentage Error (MAPE) is also lower for the log model (9.18% vs. 9.60%), suggesting that the log model provides somewhat more accurate predictions relative to the actual mpg values.

In conclusion, the log transformation of mpg appears to be the best approach for this dataset, as it provides a better fit and more accurate predictions compared to the quadratic term for weight. The log model also maintains interpretability while addressing potential issues with non-linearity in the relationship between the predictors and the response variable.

End Aside

(d)
Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow = c(2, 2))
plot(model)

Response:
The residual plots reveal non-linearity and potential outliers, while the leverage plot identifies observation 14 as highly influential, suggesting that model adjustments or transformations may be necessary.
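As a quick numeric follow-up (a sketch using the full model fitted in part (c)), the leverage and Cook's distance values behind that plot can be inspected directly:

# Observations with the highest leverage and the largest Cook's distances;
# observation numbers refer to rows of the Auto data used to fit `model`.
lev  <- hatvalues(model)
cook <- cooks.distance(model)
head(sort(lev,  decreasing = TRUE), 3)
head(sort(cook, decreasing = TRUE), 3)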

(e)
Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

# Fit a model with interaction effects
model_interaction <- lm(mpg ~ weight * year + origin, data = Auto)
summary(model_interaction)

Call:
lm(formula = mpg ~ weight * year + origin, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.9328 -1.8729 -0.0833  1.5977 12.2103 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.074e+02  1.276e+01  -8.420 7.44e-16 ***
weight       2.594e-02  4.363e-03   5.945 6.18e-09 ***
year         1.961e+00  1.704e-01  11.510  < 2e-16 ***
origin       9.176e-01  2.452e-01   3.742  0.00021 ***
weight:year -4.295e-04  5.860e-05  -7.330 1.36e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.141 on 387 degrees of freedom
Multiple R-squared:  0.8397,    Adjusted R-squared:  0.838 
F-statistic: 506.8 on 4 and 387 DF,  p-value: < 2.2e-16

Response:
The interaction between weight and year is statistically significant, indicating that the relationship between weight and mpg varies by year. This suggests that the impact of weight on fuel efficiency is not constant over time, and newer cars may respond differently to changes in weight compared to older models.
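To see what that interaction implies numerically (a sketch using the model_interaction fit above), the slope of mpg on weight can be evaluated at a few model years:

# With a weight:year interaction, d(mpg)/d(weight) = b_weight + b_weight:year * year.
# Expressed per 1,000 lbs for readability; years 70, 76, and 82 span the data.
b <- coef(model_interaction)
years <- c(70, 76, 82)
slope_per_1000lbs <- 1000 * (b["weight"] + b["weight:year"] * years)
setNames(round(slope_per_1000lbs, 2), paste0("year_", years))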

(f)
Try a few different transformations of the variables, such as log(X), √X, and X². Comment on your findings.

Response:
I explored several transformations of the variables, including log(mpg) and weight^2. The log transformation of mpg provided the best fit, as indicated by lower RMSE and MAPE values. The residuals appeared more randomly distributed, suggesting a better model fit. The quadratic term for weight also improved the model but was less effective than the log transformation. This work is located in the Aside section above.

Problem 10

This question should be answered using the Carseats data set.

(a)
Fit a multiple regression model to predict Sales using Price, Urban, and US.

library(ISLR)
data(Carseats)
# Fit a multiple regression model
model_carseats <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model_carseats)

Call:
lm(formula = Sales ~ Price + Urban + US, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9206 -1.6220 -0.0564  1.5786  7.0581 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
Price       -0.054459   0.005242 -10.389  < 2e-16 ***
UrbanYes    -0.021916   0.271650  -0.081    0.936    
USYes        1.200573   0.259042   4.635 4.86e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared:  0.2393,    Adjusted R-squared:  0.2335 
F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b)
Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

Response:
The model shows that Price has a statistically significant negative effect on Sales: each $1 increase in price is associated with a decrease of about 0.054 units (Sales is recorded in thousands, so roughly 54 car seats). Stores located in the US sell significantly more (about 1.2 thousand additional units) than non-US stores, while the Urban variable has no significant effect. The model is statistically significant overall (F = 41.52, p < 2.2e-16) but explains a modest 24% of the variance in Sales (R² = 0.239), with an RSE of 2.47 indicating the typical prediction error.

(c)
Write out the model in equation form, being careful to handle the qualitative variables properly.

Response:
The multiple regression model can be expressed as:

\[\begin{equation} \text{Sales} = 13.04 - 0.054 \cdot \text{Price} - 0.022 \cdot \text{Urban}_{\text{Yes}} + 1.201 \cdot \text{US}_{\text{Yes}} + \epsilon \end{equation}\]

Where:
\[\begin{equation*} \begin{aligned} \beta_0 &= 13.04 &&\text{: Intercept (Sales when Price = 0, Urban = No, US = No)} \\ \beta_1 &= -0.054 &&\text{: Change in Sales per \$1 increase in Price} \\ \beta_2 &= -0.022 &&\text{: Difference in Sales when Urban = Yes vs No} \\ \beta_3 &= 1.201 &&\text{: Difference in Sales when US = Yes vs No} \end{aligned} \end{equation*}\]
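To illustrate how the dummy variables enter this equation, here is a sketch using the model_carseats fit from part (a) and a hypothetical store at Price = 120 in an urban US location:

# Compute the fitted value by hand from the coefficients, then cross-check
# with predict(); Urban = "Yes" and US = "Yes" switch on the two dummy variables.
b <- coef(model_carseats)
manual <- b["(Intercept)"] + b["Price"] * 120 + b["UrbanYes"] * 1 + b["USYes"] * 1
manual
predict(model_carseats, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))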

(d) For which of the predictors can you reject the null hypothesis H0 : βj = 0?

Response:
\[\begin{equation*} H_0^{\text{Price}}: \beta_{\text{Price}} = 0 \quad \text{(Price has no effect on Sales)} \end{equation*}\]

\[\begin{equation*} H_0^{\text{Urban}}: \beta_{\text{Urban(Yes)}} = 0 \quad \text{(Urban vs. Non-Urban has no effect on Sales)} \end{equation*}\]

\[\begin{equation*} H_0^{\text{US}}: \beta_{\text{US(Yes)}} = 0 \quad \text{(Being in the US has no effect on Sales)} \end{equation*}\]

Only Price and US location are statistically significant predictors of Sales in this model. The Urban variable does not have a meaningful impact.

(e)
On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

# Fit a smaller model using only significant predictors
model_carseats_reduced <- lm(Sales ~ Price + US, data = Carseats)
summary(model_carseats_reduced)

Call:
lm(formula = Sales ~ Price + US, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9269 -1.6286 -0.0574  1.5766  7.0515 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
Price       -0.05448    0.00523 -10.416  < 2e-16 ***
USYes        1.19964    0.25846   4.641 4.71e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared:  0.2393,    Adjusted R-squared:  0.2354 
F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f)
How well do the models in (a) and (e) fit the data?

Response:
We fit two multiple linear regression models:

  • Model e: Sales ~ Price + US
  • Model a: Sales ~ Price + Urban + US

The table below compares their performance:

Metric                    Model e: Price + US      Model a: Price + Urban + US
R-squared                 0.2393                   0.2393
Adjusted R-squared        0.2354                   0.2335
Residual Standard Error   2.469                    2.472
F-statistic (p-value)     62.43 (p < 2.2e-16)      41.52 (p < 2.2e-16)

While both models explain the same amount of variance (R² = 0.2393), Model e has a slightly higher Adjusted R² and a lower Residual Standard Error, indicating slightly better fit and predictive accuracy.

Moreover, the coefficient for the Urban variable in Model a is not statistically significant:

\[\begin{equation*} H_0^{\text{Urban}}: \beta_{\text{Urban(Yes)}} = 0 \quad \text{(p = 0.936)} \end{equation*}\]

This means that adding Urban does not contribute meaningful explanatory power to the model.

Conclusion:
We prefer Model e, as it is more parsimonious and performs equivalently (or slightly better) without including an unnecessary predictor.
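A more formal version of this comparison (a sketch using the two fits above) is a nested-model F test, optionally alongside AIC:

# The F test asks whether adding Urban significantly reduces the residual
# sum of squares; AIC penalizes the extra parameter.
anova(model_carseats_reduced, model_carseats)
AIC(model_carseats_reduced, model_carseats)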

(g) Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

# Obtain 95% confidence intervals for the coefficients of the reduced model
confint(model_carseats_reduced, level = 0.95)
                  2.5 %      97.5 %
(Intercept) 11.79032020 14.27126531
Price       -0.06475984 -0.04419543
USYes        0.69151957  1.70776632

Response:
The table below presents the 95% confidence intervals for each estimated regression coefficient:

Coefficient    2.5 %     97.5 %   Interpretation
(Intercept)   11.79      14.27    When Price = 0 and US = "No", the expected Sales fall between 11.79 and 14.27 units.
Price         -0.0648    -0.0442  Each $1 increase in Price is associated with a decrease in Sales of between 0.044 and 0.065 units.
USYes          0.6915     1.7078  US-based stores are expected to sell 0.69 to 1.71 more units than non-US stores, on average.

Conclusion:
Because none of these confidence intervals include zero, we can conclude that all three coefficients are statistically significant at the 5% level. The direction and magnitude of the effects are also meaningful:

  • Price has a consistently negative effect on Sales.
  • USYes has a consistently positive effect on Sales.

(h)
Is there evidence of outliers or high leverage observations in the model from (e)?

# Diagnostic plots for the reduced model
par(mfrow = c(2, 2))
plot(model_carseats_reduced)

Response:
The diagnostic plots flag a few mild outliers (e.g., observations 377 and 250) and one point with relatively high leverage (observation 380). However, none of these observations exceed the Cook's distance threshold or combine high leverage with a large residual, so there is no strong evidence of influential outliers, and the model appears to satisfy the key assumptions of linear regression.
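The rule-of-thumb checks behind that statement can be made explicit (a sketch; the 2× average leverage and 4/n cutoffs are common conventions, not hard rules):

# High-leverage points: hat values above twice the average hat value.
# Influential points: Cook's distance above 4/n (or, more conservatively, above 1).
n    <- nrow(Carseats)
lev  <- hatvalues(model_carseats_reduced)
cook <- cooks.distance(model_carseats_reduced)
head(which(lev > 2 * mean(lev)))   # a few of the higher-leverage observations
max(cook)                          # compare to 4 / n and to 1
which.max(cook)                    # the single most influential observation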

Problem 12

This problem involves simple linear regression without an intercept.

(a)
Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

Response:
From (3.38), the two no-intercept coefficient estimates share the same numerator and differ only in their denominators:

\[\begin{equation*} \hat{\beta}_{Y \text{ onto } X} = \frac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n X_i^2}, \qquad \hat{\beta}_{X \text{ onto } Y} = \frac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n Y_i^2} \end{equation*}\]

They are therefore equal exactly when

\[\begin{equation*} \sum_{i=1}^n X_i^2 = \sum_{i=1}^n Y_i^2 \end{equation*}\]

That is, the sum of squares of X must equal the sum of squares of Y (for mean-zero variables this is equivalent to X and Y having equal variances).

This is a fairly special circumstance, but it can occur if:

  • X and Y have the same overall magnitude of variation, i.e., equal sums of squares (for example, when the values of Y are a rearrangement or sign-flip of the values of X).
  • Y = X exactly, the trivial case used in part (c) below.

Thus, for the coefficient estimates to be the same, the two variables must have an equal total sum of squares; they need not be perfectly correlated.
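A tiny sketch of this condition (separate from parts (b) and (c) below): if the values of Y are just a reshuffling of the values of X, the sums of squares match even though Y ≠ X, and the two no-intercept slopes coincide.

# sum(x^2) == sum(y^2) by construction, so both regressions give the same slope.
set.seed(1)
x <- rnorm(50)
y <- sample(x)                      # same values as x, different order
c(sum_x2 = sum(x^2), sum_y2 = sum(y^2))
c(coef(lm(y ~ x - 1)), coef(lm(x ~ y - 1)))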

(b)
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

# Set seed for reproducibility
set.seed(42)

# Generate random data for X and Y
n <- 100
X <- rnorm(n)  # 100 random observations from a normal distribution
Y <- 2 * X + rnorm(n)  # Y is linearly dependent on X with some noise

# Perform the regression of Y onto X (without intercept)
model_Y_on_X <- lm(Y ~ X - 1)  # "-1" removes the intercept

# Perform the regression of X onto Y (without intercept)
model_X_on_Y <- lm(X ~ Y - 1)  # "-1" removes the intercept

# Display the coefficients
cat("Coefficient estimate for Y onto X:\n")
Coefficient estimate for Y onto X:
print(coef(model_Y_on_X))
       X 
2.024486 
cat("\nCoefficient estimate for X onto Y:\n")

Coefficient estimate for X onto Y:
print(coef(model_X_on_Y))
        Y 
0.4167146 

(c)
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

# Set seed for reproducibility
set.seed(42)

# Generate random data for X
n <- 100
X <- rnorm(n)  # 100 random observations from a normal distribution

# Create Y to have the same variance as X and a perfect linear relationship
Y <- X  # Y is exactly equal to X, hence perfectly proportional

# Perform the regression of Y onto X (without intercept)
model_Y_on_X <- lm(Y ~ X - 1)  # "-1" removes the intercept

# Perform the regression of X onto Y (without intercept)
model_X_on_Y <- lm(X ~ Y - 1)  # "-1" removes the intercept

# Display the coefficients
cat("Coefficient estimate for Y onto X:\n")
Coefficient estimate for Y onto X:
print(coef(model_Y_on_X))
X 
1 
cat("\nCoefficient estimate for X onto Y:\n")

Coefficient estimate for X onto Y:
print(coef(model_X_on_Y))
Y 
1