KNN classification is used when the response is categorical: it assigns the most common class among the k nearest neighbors. KNN regression is used when the response is continuous: it predicts the average of the responses of the k nearest neighbors.
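A minimal sketch of the two modes in R, assuming the class and FNN packages are available (the toy data below is illustrative, not part of the exercise):

library(FNN)  # provides knn.reg() for KNN regression
set.seed(1)
x_train <- matrix(rnorm(100), ncol = 2)               # 50 training points
x_test  <- matrix(rnorm(10),  ncol = 2)               # 5 new points
cls     <- factor(ifelse(x_train[, 1] > 0, "A", "B")) # categorical response
y       <- 3 * x_train[, 1] + rnorm(50)               # continuous response
class::knn(x_train, x_test, cl = cls, k = 5)          # classification: majority vote
knn.reg(x_train, x_test, y = y, k = 5)$pred           # regression: neighbor average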
A scatterplot matrix visualizes the pairwise relationships among all numerical variables.
library(ISLR)     # Auto and Carseats data sets
library(ggplot2)
library(corrplot) # correlation-matrix plots
data(Auto)
pairs(Auto)       # scatterplot matrix of every pair of variables
mpg is negatively correlated with cylinders, displacement, horsepower, and weight, and positively correlated with year and origin. Weight, horsepower, and displacement are highly correlated, indicating possible multicollinearity.
cor_matrix <- cor(Auto[, !names(Auto) %in% "name"]) # drop the non-numeric name column
corrplot(cor_matrix, method = "circle")
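To back the observations above with numbers, the mpg row of the correlation matrix can be printed directly (a quick supplementary check, not part of the original output):

round(cor_matrix["mpg", ], 2) # sign and strength of each correlation with mpg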
Yes. The F-statistic of 252.4 (p-value < 2.2e-16) indicates a strong overall relationship between the predictors and mpg, and R² = 0.8215 means the predictors explain about 82% of the variance in mpg.
Significant (p < 0.05): displacement, weight, year, origin
Not significant (p > 0.05): cylinders, horsepower, acceleration
The year coefficient is 0.7508: holding the other predictors fixed, each additional model year is associated with an increase of about 0.75 mpg. Newer cars are more fuel-efficient.
model <- lm(mpg ~ . - name, data = Auto) # regress mpg on every predictor except name
summary(model)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
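As a quick sanity check of the year interpretation, predicting mpg for the same hypothetical car in two consecutive model years should reproduce the coefficient (the car's values below are illustrative only):

same_car <- data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                       weight = 2500, acceleration = 15, year = c(76, 77),
                       origin = 1)
diff(predict(model, newdata = same_car)) # equals the year coefficient, ~0.75 mpg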
Residuals vs Fitted: A slight curve suggests non-linearity, meaning the linear model might not be the best fit.
Q-Q Plot: Some points deviate from the line, indicating non-normality in the residuals.
Scale-Location: No strong pattern overall, though there are hints of heteroscedasticity (non-constant variance).
Residuals vs Leverage: Points 327, 394, and 14 have high leverage, meaning they could strongly influence the fit; point 14 looks particularly influential.
par(mfrow = c(2, 2)) # 2x2 grid for the four diagnostic plots
plot(model)
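To identify the flagged observations numerically rather than reading them off the plot (a sketch; row labels can differ across versions of the data):

head(sort(hatvalues(model), decreasing = TRUE), 3)      # highest-leverage rows
head(sort(cooks.distance(model), decreasing = TRUE), 3) # most influential rows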
Yes. Three interactions are statistically significant (p < 0.05): displacement:year, acceleration:year, and acceleration:origin.
interaction_model <- lm(mpg ~ (cylinders + displacement + horsepower + weight + acceleration + year + origin)^2 - name, data = Auto)
summary(interaction_model)
##
## Call:
## lm(formula = mpg ~ (cylinders + displacement + horsepower + weight +
## acceleration + year + origin)^2 - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
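The significant interaction terms can also be pulled out programmatically instead of scanned by eye:

coefs <- summary(interaction_model)$coefficients
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05 & grepl(":", rownames(coefs))]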
The model explains 65.6% of the variation in mpg (R² = 0.6559), which is lower than the earlier models. The squared weight term is highly significant with a strong negative effect on mpg, the squared horsepower term is only borderline significant (p ≈ 0.067), and the squared displacement term is not significant (p ≈ 0.998). Overall, only weight squared is meaningful, and this model does not improve on the earlier ones.
model_squared <- lm(mpg ~ I(horsepower^2) + I(weight^2) + I(displacement^2), data = Auto)
summary(model_squared)
##
## Call:
## lm(formula = mpg ~ I(horsepower^2) + I(weight^2) + I(displacement^2),
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.941 -3.323 -0.771 2.634 17.200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.427e+01 6.017e-01 56.955 <2e-16 ***
## I(horsepower^2) -1.033e-04 5.632e-05 -1.834 0.0674 .
## I(weight^2) -9.953e-07 1.018e-07 -9.778 <2e-16 ***
## I(displacement^2) -3.673e-08 1.483e-05 -0.002 0.9980
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.596 on 388 degrees of freedom
## Multiple R-squared: 0.6559, Adjusted R-squared: 0.6532
## F-statistic: 246.5 on 3 and 388 DF, p-value: < 2.2e-16
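Because the squared-term model and the full model use different predictors (they are not nested), a partial F-test is not appropriate; AIC offers a quick comparison instead, lower being better:

AIC(model, model_squared) # both models are fit to the same 392 observations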
data(Carseats)
model_carseats <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model_carseats)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Price is significant and negatively affects Sales. US is significant: stores in the US have higher Sales. Urban is not significant, so there is no evidence that urban location affects Sales.
Sales = 13.04 − 0.0545(Price) − 0.0219(UrbanYes) + 1.20(USYes)
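Plugging illustrative values into the fitted equation (say Price = 100 for an urban US store, a made-up example) reproduces the arithmetic:

# 13.04 - 0.0545*100 - 0.0219 + 1.20 is approximately 8.78
predict(model_carseats, newdata = data.frame(Price = 100, Urban = "Yes", US = "Yes"))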
The significant predictors (p < 0.05) are Price and US, while Urban is not significant.
significant_predictors <- summary(model_carseats)$coefficients[,4] < 0.05
print(significant_predictors)
## (Intercept) Price UrbanYes USYes
## TRUE TRUE FALSE TRUE
We fit a reduced model using only the significant predictors, Price and US.
reduced_model <- lm(Sales ~ Price + US, data = Carseats)
summary(reduced_model)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Both models have essentially the same R² (0.2393), explaining 23.93% of the variation in Sales, and the adjusted R² actually improves slightly when Urban is dropped (0.2335 → 0.2354). Removing Urban did not hurt the fit, so the reduced model is just as effective but simpler.
AIC(model_carseats, reduced_model)
## df AIC
## model_carseats 5 1865.312
## reduced_model 4 1863.319
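Because the reduced model is nested in the full one, a partial F-test provides a complementary check on whether Urban adds anything:

anova(reduced_model, model_carseats) # F-test for dropping Urban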
confint(reduced_model, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
The diagnostic plots help identify outliers and high-leverage points that could influence the model: standardized residuals well beyond ±2 flag potential outliers, and points far to the right of the Residuals vs Leverage plot flag high leverage. If such extreme points are present, further investigation is needed.
par(mfrow = c(2, 2))
plot(reduced_model)
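Rule-of-thumb numeric checks (a sketch): studentized residuals beyond ±3 suggest outliers, and leverage above 2(p + 1)/n suggests high-leverage points.

sum(abs(rstudent(reduced_model)) > 3)                        # candidate outliers
p <- length(coef(reduced_model)) - 1                         # number of predictors
sum(hatvalues(reduced_model) > 2 * (p + 1) / nrow(Carseats)) # high-leverage rows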
For regression without an intercept, the slope of Y onto X is sum(x_i * y_i) / sum(x_i^2) and the slope of X onto Y is sum(x_i * y_i) / sum(y_i^2). The two coefficient estimates are therefore the same if and only if sum(x_i^2) = sum(y_i^2), i.e., X and Y have the same sum of squares.
The coefficient estimates for Y onto X (1.9829) and X onto Y (0.4972) are different. This occurs because sum(X^2) ≠ sum(Y^2): generating Y = 2X + error puts Y on a different scale than X. Note also that 0.4972 is close to, but not exactly, 1/1.9829; the added noise keeps the two estimates from being exact reciprocals of one another.
set.seed(42)
n <- 100
X <- rnorm(n, mean = 10, sd = 5)
Y <- 2 * X + rnorm(n, mean = 0, sd = 3)
model_xy <- lm(Y ~ X - 1)
model_yx <- lm(X ~ Y - 1)
coef_xy <- coef(model_xy)
coef_yx <- coef(model_yx)
print(coef_xy)
## X
## 1.982863
print(coef_yx)
## Y
## 0.497212
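The closed-form no-intercept slopes confirm these numbers directly:

sum(X * Y) / sum(X^2) # coefficient of lm(Y ~ X - 1)
sum(X * Y) / sum(Y^2) # coefficient of lm(X ~ Y - 1)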
The coefficient estimates for Y onto X and X onto Y are both 1, i.e., identical. This occurs because Y = X with no added noise, so sum(X^2) = sum(Y^2) exactly and both regressions recover the same slope.
set.seed(42)
n <- 100
X <- rnorm(n, mean = 10, sd = 5)
Y <- X # Perfect linear relationship
model_xy_equal <- lm(Y ~ X - 1)
model_yx_equal <- lm(X ~ Y - 1)
coef_xy_equal <- coef(model_xy_equal)
coef_yx_equal <- coef(model_yx_equal)
print(coef_xy_equal)
## X
## 1
print(coef_yx_equal)
## Y
## 1
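A less trivial example, sketched below: any permutation of X has the same sum of squares as X, so the two no-intercept estimates still agree even though Y is not equal to X elementwise.

set.seed(42)
X_perm <- rnorm(100)
Y_perm <- sample(X_perm)      # same values, shuffled: sum(Y_perm^2) == sum(X_perm^2)
coef(lm(Y_perm ~ X_perm - 1))
coef(lm(X_perm ~ Y_perm - 1)) # identical to the line above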