library(ISLR2)
library(MASS)
library(ggplot2)
auto <- Auto
Predictive Modeling Homework 2
Exercise 2
Carefully explain the differences between the KNN classifier and KNN regression methods.
📝 The KNN classifier is used when the target variable is categorical: it predicts a class. KNN regression is used when the target is continuous: it predicts a number. For a test point, the classifier takes a majority vote (the mode) among the labels of the k nearest neighbors, whereas KNN regression averages the response values of the k nearest neighbors. Apart from these differences, both methods rely on finding the k nearest neighbors of the point being predicted.
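The contrast between voting and averaging can be sketched in a few lines of base R. This is a minimal illustration on made-up one-dimensional data, not part of the exercise; the names `train_x`, `train_cls`, `train_num`, `nearest_k`, `knn_classify`, and `knn_regress` are all hypothetical.

```r
# Toy one-dimensional training data (made up for illustration).
train_x   <- c(1, 2, 3, 6, 7, 8)
train_cls <- c("a", "a", "a", "b", "b", "b")  # categorical target
train_num <- c(1.0, 1.5, 2.0, 5.0, 5.5, 6.0)  # continuous target

# Indices of the k training points closest to x0.
nearest_k <- function(x0, k) order(abs(train_x - x0))[seq_len(k)]

# Classification: majority vote among the k neighbours.
knn_classify <- function(x0, k = 3) {
  votes <- table(train_cls[nearest_k(x0, k)])
  names(votes)[which.max(votes)]
}

# Regression: average of the k neighbours' responses.
knn_regress <- function(x0, k = 3) {
  mean(train_num[nearest_k(x0, k)])
}

knn_classify(2.5)  # neighbours at x = 2, 3, 1 are all class "a"
knn_regress(2.5)   # mean of 1.5, 2.0, 1.0 = 1.5
```

The neighbor search is shared by both functions; only the final aggregation step differs.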
Exercise 9
This question involves the use of multiple linear regression on the Auto data set.
str(auto)
'data.frame': 392 obs. of 9 variables:
$ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
$ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
$ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
$ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
$ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
$ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ year : int 70 70 70 70 70 70 70 70 70 70 ...
$ origin : int 1 1 1 1 1 1 1 1 1 1 ...
$ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
- attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
(a) Produce a scatterplot matrix which includes all of the variables in the data set.
plot(Auto)
(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
cor_matrix <- cor(auto[, -9])
cor_matrix
mpg cylinders displacement horsepower weight
mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
acceleration year origin
mpg 0.4233285 0.5805410 0.5652088
cylinders -0.5046834 -0.3456474 -0.5689316
displacement -0.5438005 -0.3698552 -0.6145351
horsepower -0.6891955 -0.4163615 -0.4551715
weight -0.4168392 -0.3091199 -0.5850054
acceleration 1.0000000 0.2903161 0.2127458
year 0.2903161 1.0000000 0.1815277
origin 0.2127458 0.1815277 1.0000000
(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
lm_model <- lm(mpg ~ . - name, data = auto)
summary(lm_model)
Call:
lm(formula = mpg ~ . - name, data = auto)
Residuals:
Min 1Q Median 3Q Max
-9.5903 -2.1565 -0.1169 1.8690 13.0604
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.218435 4.644294 -3.707 0.00024 ***
cylinders -0.493376 0.323282 -1.526 0.12780
displacement 0.019896 0.007515 2.647 0.00844 **
horsepower -0.016951 0.013787 -1.230 0.21963
weight -0.006474 0.000652 -9.929 < 2e-16 ***
acceleration 0.080576 0.098845 0.815 0.41548
year 0.750773 0.050973 14.729 < 2e-16 ***
origin 1.426141 0.278136 5.127 4.67e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
i. Is there a relationship between the predictors and the response?
📝 Yes. The overall F-test is highly significant (F = 252.4, p-value < 2.2e-16), so we reject the null hypothesis that all coefficients are zero: at least one predictor is related to mpg. The individual t-tests point to displacement, weight, year, and origin. Furthermore, \(R^2\) = 0.8215 (adjusted \(R^2\) = 0.8182) indicates that about 82% of the variance in mpg is explained by the predictors in this model.
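The summary output reports a multiple \(R^2\) of 0.8215 and an adjusted \(R^2\) of 0.8182. As a quick check, the adjusted value can be recomputed from the multiple \(R^2\), the sample size, and the number of predictors. The variable names below are illustrative; the numbers are copied from the printed output.

```r
# Values copied from the summary(lm_model) output above.
r2 <- 0.8215   # Multiple R-squared
n  <- 392      # observations in Auto
p  <- 7        # predictors (all variables except mpg and name)

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # 0.8182, matching the printed value
```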
ii. Which predictors appear to have a statistically significant relationship to the response?
📝 Displacement (p = 0.00844), weight (p < 2e-16), year (p < 2e-16), and origin (p = 4.67e-07) have a statistically significant relationship with mpg. Cylinders, horsepower, and acceleration do not (all p > 0.05).
iii. What does the coefficient for the year variable suggest?
📝 Holding all other variables constant, each additional model year is associated with an increase of about 0.75 in expected mpg. In other words, newer cars tend to be more fuel-efficient than older ones.
(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow = c(2,2))
plot(lm_model)
📝 The Residuals vs Fitted plot shows a curved red smoother line, suggesting non-linearity that the model does not capture. Observations 323, 326, and 327 stand out with large residuals. The Q-Q plot shows points deviating from the line in the right tail, which suggests right-skewed residuals. In the Scale-Location plot the red line is curved, so there may be some heteroskedasticity (non-constant variance of the residuals). The Residuals vs Leverage plot shows that observation 14 has high leverage and a large residual, so it may be a highly influential point.
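Reading outliers off a plot can be complemented with a numeric check. The sketch below flags observations by their studentized residuals; `flag_outliers` is a hypothetical helper, the cutoff of 3 is a common rule of thumb rather than a formal test, and the demonstration uses the built-in `mtcars` data only so that it runs without extra packages.

```r
# Flag potential outliers via studentized residuals.
# The cutoff of 3 is a common rule of thumb, not a formal test.
flag_outliers <- function(model, cutoff = 3) {
  r <- rstudent(model)
  r[abs(r) > cutoff]
}

# Demonstration on the built-in mtcars data (an assumption for portability;
# with the fit from the text, call flag_outliers(lm_model) instead).
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
flag_outliers(demo_fit, cutoff = 2)
```

The returned vector is named by observation, so flagged rows can be matched against the labels in the diagnostic plots.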
(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
Fit the * model (main effects + interactions):
# weight * displacement: The effect of weight on mpg may vary by engine size.
# origin * cylinders: The effect of cylinders on mpg may vary by region of origin.
interaction_model <- lm(mpg ~ weight * displacement + origin * cylinders, data = auto)
summary(interaction_model)
Call:
lm(formula = mpg ~ weight * displacement + origin * cylinders,
data = auto)
Residuals:
Min 1Q Median 3Q Max
-12.5173 -2.4761 -0.3331 1.8232 17.9530
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.691e+01 3.543e+00 16.066 < 2e-16 ***
weight -9.151e-03 9.156e-04 -9.994 < 2e-16 ***
displacement -7.672e-02 1.502e-02 -5.108 5.14e-07 ***
origin -2.787e+00 1.712e+00 -1.627 0.1045
cylinders -8.053e-01 6.599e-01 -1.220 0.2231
weight:displacement 1.783e-05 3.113e-06 5.728 2.05e-08 ***
origin:cylinders 7.261e-01 3.980e-01 1.824 0.0689 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.091 on 385 degrees of freedom
Multiple R-squared: 0.7294, Adjusted R-squared: 0.7252
F-statistic: 173 on 6 and 385 DF, p-value: < 2.2e-16
Fit the : model (only the interaction terms):
interaction_model2 <- lm(mpg ~ weight:displacement + origin:cylinders, data = auto)
summary(interaction_model2)
Call:
lm(formula = mpg ~ weight:displacement + origin:cylinders, data = auto)
Residuals:
Min 1Q Median 3Q Max
-11.7675 -3.0862 -0.5903 2.6401 16.5863
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.890e+01 7.841e-01 36.859 < 2e-16 ***
weight:displacement -1.167e-05 4.555e-07 -25.628 < 2e-16 ***
origin:cylinders 2.891e-01 8.377e-02 3.451 0.000619 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.698 on 389 degrees of freedom
Multiple R-squared: 0.6395, Adjusted R-squared: 0.6377
F-statistic: 345.1 on 2 and 389 DF, p-value: < 2.2e-16
📝 I chose weight * displacement to test whether the effect of weight on mpg varies with engine size, and origin * cylinders to test whether the effect of cylinders on mpg varies by region of origin.
In the * model (main effects + interactions), the weight:displacement interaction is statistically significant (p = 2.05e-08): the effect of weight on mpg depends on displacement. The origin:cylinders interaction is not statistically significant at the 0.05 level (p = 0.0689), so there is no strong evidence that the effect of cylinders on mpg depends on origin.
In the : model (only the interaction terms), both weight:displacement and origin:cylinders are statistically significant (p < 2e-16 and p = 0.000619). Because this model omits the main effects, the product terms also absorb those effects, so these p-values are hard to read as evidence of interaction alone.
For prediction, the * model is preferred because it has a higher adjusted \(R^2\) (0.7252 vs. 0.6377).
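Besides comparing adjusted \(R^2\), a model with an interaction can be compared directly against the nested model without it using a partial F-test via `anova()`. The sketch below is an illustration on the built-in `mtcars` data (an assumption made so that it runs without ISLR2); the object names are hypothetical, and the same pattern would apply to the Auto fits.

```r
# Demonstration on the built-in mtcars data (an assumption for portability;
# the same pattern applies to the Auto models in the text).
additive   <- lm(mpg ~ wt + disp, data = mtcars)
with_inter <- lm(mpg ~ wt * disp, data = mtcars)  # adds wt:disp

# Partial F-test: does adding the interaction reduce the RSS significantly?
cmp <- anova(additive, with_inter)
cmp

# With a single added term, the F-test p-value equals the t-test p-value
# reported for wt:disp in summary(with_inter).
p_f <- cmp[["Pr(>F)"]][2]
p_t <- coef(summary(with_inter))["wt:disp", "Pr(>|t|)"]
all.equal(p_f, p_t, tolerance = 1e-6)
```

When several interaction terms are added at once, the partial F-test assesses them jointly, which the individual t-tests do not.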
(f) Try a few different transformations of the variables, such as \(\log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.
📝 In the model from (c), only weight, displacement, year, and origin contributed significantly to the variation in mpg. Here I apply a log transformation to weight and a square-root transformation to displacement.
auto$log_weight <- log(auto$weight)
auto$sqrt_displacement <- sqrt(auto$displacement)
transformed_model <- lm(mpg ~ log_weight + sqrt_displacement +
year + origin, data = auto)
summary(transformed_model)
Call:
lm(formula = mpg ~ log_weight + sqrt_displacement + year + origin,
data = auto)
Residuals:
Min 1Q Median 3Q Max
-9.7704 -1.9265 0.0024 1.6125 13.0769
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.59081 11.44548 10.711 < 2e-16 ***
log_weight -20.32594 1.60637 -12.653 < 2e-16 ***
sqrt_displacement 0.10468 0.13208 0.793 0.42853
year 0.78905 0.04632 17.036 < 2e-16 ***
origin 0.80719 0.25755 3.134 0.00186 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.125 on 387 degrees of freedom
Multiple R-squared: 0.8413, Adjusted R-squared: 0.8397
F-statistic: 513 on 4 and 387 DF, p-value: < 2.2e-16
📝 A log transformation is typically used for right-skewed data with large values. A square-root transformation compresses extreme values less strongly than a log transformation. This is why I chose the log transformation for weight.
📝 The transformed model has a higher adjusted \(R^2\) (0.8397) than the original model without transformations (0.8182), even though it uses fewer predictors. Weight (log_weight) has a highly significant negative effect on mpg (p < 2e-16). Year remains highly significant (p < 2e-16), and origin also remains statistically significant (p = 0.00186). However, sqrt_displacement is not statistically significant (p = 0.42853), suggesting that the square-root transformation did not improve displacement's contribution once log_weight is in the model.
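To compare several transformations of one predictor side by side, a small helper can fit each version and collect the adjusted \(R^2\). This is only a sketch: `compare_transforms` is a hypothetical function, it considers a single predictor at a time, and the demonstration uses the built-in `mtcars` data so that it runs without ISLR2.

```r
# Adjusted R-squared for several transformations of a single predictor.
# mtcars is used only so the sketch runs without extra packages.
compare_transforms <- function(data, response, predictor) {
  forms <- c(
    raw    = sprintf("%s ~ %s", response, predictor),
    log    = sprintf("%s ~ log(%s)", response, predictor),
    sqrt   = sprintf("%s ~ sqrt(%s)", response, predictor),
    square = sprintf("%s ~ I(%s^2)", response, predictor)
  )
  sapply(forms, function(f) {
    summary(lm(as.formula(f), data = data))$adj.r.squared
  })
}

compare_transforms(mtcars, "mpg", "wt")
```

With the data from the text, the equivalent call would be `compare_transforms(auto, "mpg", "weight")`.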
Exercise 10
This question should be answered using the Carseats data set.
(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
carseat <- Carseats
str(Carseats)
'data.frame': 400 obs. of 11 variables:
$ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
$ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
$ Income : num 73 48 35 100 64 113 105 81 110 113 ...
$ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
$ Population : num 276 260 269 466 340 501 45 425 108 131 ...
$ Price : num 120 83 80 97 128 72 108 120 124 124 ...
$ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
$ Age : num 42 65 59 55 38 78 71 67 76 76 ...
$ Education : num 17 10 12 14 13 16 15 10 10 17 ...
$ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
$ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
model_a <- lm(Sales ~ Price + Urban + US, data = carseat)
summary(model_a)
Call:
lm(formula = Sales ~ Price + Urban + US, data = carseat)
Residuals:
Min 1Q Median 3Q Max
-6.9206 -1.6220 -0.0564 1.5786 7.0581
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
Price -0.054459 0.005242 -10.389 < 2e-16 ***
UrbanYes -0.021916 0.271650 -0.081 0.936
USYes 1.200573 0.259042 4.635 4.86e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
📝 Intercept: when Price = 0, Urban = No, and US = No, expected Sales is 13.04. Sales is recorded in thousands of units, so this is about 13,040 car seats.
📝 Price: for every 1-unit increase in price, Sales decreases by 0.0545 (about 54 car seats), holding the other variables constant. The effect of price on sales is statistically significant (p < 2e-16).
📝 UrbanYes: the coefficient is not statistically significant (p = 0.936), so there is no evidence that sales differ between urban and non-urban stores, holding the other variables constant.
📝 USYes: the coefficient is statistically significant (p = 4.86e-06). Stores in the US have Sales about 1.2 higher (about 1,200 car seats) than non-US stores, holding the other variables constant.
(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
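📝 \(\widehat{\text{Sales}} = 13.04 - 0.054 (\text{Price}) - 0.022 (\text{UrbanYes}) + 1.20 (\text{USYes})\), where UrbanYes = 1 if the store is in an urban location and 0 otherwise, and USYes = 1 if the store is in the US and 0 otherwise. The fitted equation has no error term \(\epsilon\); that term belongs to the population model, not to the prediction.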
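To make the handling of the qualitative variables concrete, the fitted equation can be evaluated by hand. The sketch below copies the coefficients from the printed summary and codes "Yes" as 1 and "No" as 0; `predict_sales` and the vector `b` are hypothetical names used only for illustration.

```r
# Coefficients copied from summary(model_a) above.
b <- c(intercept = 13.043469, price = -0.054459,
       urban_yes = -0.021916, us_yes = 1.200573)

# Dummy coding: "Yes" -> 1, "No" -> 0.
predict_sales <- function(price, urban, us) {
  unname(b["intercept"] +
    b["price"] * price +
    b["urban_yes"] * (urban == "Yes") +
    b["us_yes"] * (us == "Yes"))
}

predict_sales(100, urban = "Yes", us = "Yes")  # 8.776226
predict_sales(100, urban = "No",  us = "No")   # 7.597569
```

The difference between the two predictions is the sum of the UrbanYes and USYes coefficients, since Price is held fixed at 100.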
(d) For which of the predictors can you reject the null hypothesis \(H_0: \beta_j = 0\)?
📝 We can reject \(H_0\) for Price and US. Both p-values are below 0.05, so both have a statistically significant association with Sales. We cannot reject \(H_0\) for Urban (p = 0.936).
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
model_e <- lm(Sales ~ Price + US, data = Carseats)
summary(model_e)
Call:
lm(formula = Sales ~ Price + US, data = Carseats)
Residuals:
Min 1Q Median 3Q Max
-6.9269 -1.6286 -0.0574 1.5766 7.0515
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
Price -0.05448 0.00523 -10.416 < 2e-16 ***
USYes 1.19964 0.25846 4.641 4.71e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
(f) How well do the models in (a) and (e) fit the data?
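📝 Both models fit about equally well, and neither fits especially well: each has \(R^2\) = 0.2393, so only about 24% of the variance in Sales is explained. Removing Urban did not reduce explanatory power; the adjusted \(R^2\) rose slightly (0.2335 to 0.2354) and the residual standard error fell slightly (2.472 to 2.469). The F-statistic increased from 41.52 to 62.43, but this reflects the same explained variance being attributed to two predictors instead of three, not a greater proportion of variance explained.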
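The jump in the F-statistic can be reproduced from \(R^2\) alone, which shows that it comes from the change in the number of predictors rather than from a change in explained variance. The function name `f_stat` is illustrative; the inputs are copied from the two summaries, and small differences from the printed values are due to rounding of \(R^2\).

```r
# Overall F-statistic from R-squared, n observations and p predictors.
f_stat <- function(r2, n, p) (r2 / p) / ((1 - r2) / (n - p - 1))

# Both models report Multiple R-squared = 0.2393 on n = 400 stores.
f_stat(0.2393, n = 400, p = 3)  # model (a): about 41.5
f_stat(0.2393, n = 400, p = 2)  # model (e): about 62.4
```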
(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(model_e, level = 0.95)
2.5 % 97.5 %
(Intercept) 11.79032020 14.27126531
Price -0.06475984 -0.04419543
USYes 0.69151957 1.70776632
📝 The 95% confidence intervals for Price and US do not include zero, which confirms that both have a statistically significant association with Sales.
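As a check on what `confint()` computes for a linear model, the intervals can be rebuilt as estimate ± t-critical value × standard error. The estimates and standard errors below are copied from the printed summary of the model in (e), so the results agree with the printed intervals only up to rounding; the variable names are illustrative.

```r
# Estimates and standard errors copied from summary(model_e) above.
est <- c(Intercept = 13.03079, Price = -0.05448, USYes = 1.19964)
se  <- c(Intercept = 0.63098,  Price = 0.00523,  USYes = 0.25846)
df  <- 397  # residual degrees of freedom

t_crit <- qt(0.975, df)  # two-sided 95% critical value
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
round(ci, 4)
```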
(h) Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow = c(2,2))
plot(model_e)
cooks_dist <- cooks.distance(model_e)
influential_points <- which(cooks_dist > (4 / nrow(carseat)))
print(influential_points)
26 29 31 50 51 58 69 107 144 166 175 210 259 273 299 311 317 368 377
26 29 31 50 51 58 69 107 144 166 175 210 259 273 299 311 317 368 377
📝 The Residuals vs. Fitted, Q-Q, and Scale-Location plots label observations 51, 69, and 377 as potential outliers because they have large residuals. Using the rule-of-thumb cutoff of 4/n for Cook's distance, a total of 19 observations are flagged as potentially influential points.
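Cook's distance combines residual size and leverage, so it does not isolate high-leverage points by itself. A separate check on the hat values is one way to address the leverage part of the question. In the sketch below, `flag_high_leverage` is a hypothetical helper, the multiplier is a rule of thumb, and the demonstration uses the built-in `mtcars` data so that it runs without ISLR2.

```r
# Flag observations whose leverage exceeds a multiple of the average.
# Average leverage is (p + 1) / n; the multiplier 3 is a rule of thumb.
flag_high_leverage <- function(model, multiple = 3) {
  h <- hatvalues(model)
  avg <- length(coef(model)) / length(h)
  h[h > multiple * avg]
}

# Demonstration on built-in data (an assumption for portability;
# with the fit from the text, call flag_high_leverage(model_e)).
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
flag_high_leverage(demo_fit, multiple = 2)
```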
Exercise 12
This problem involves simple linear regression without an intercept.
(a) Recall that the coefficient estimate \(\widehat{\beta}\) for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
📝 The coefficient estimate \(\widehat{\beta}\) in simple linear regression without an intercept: \(\widehat{\beta} = \frac{\sum X_i Y_i}{\sum X_i^2}\)
📝 The coefficient \(\widehat{\beta}^*\) in regression of X onto Y: \(\widehat{\beta}^* = \frac{\sum X_i Y_i}{\sum Y_i^2}\)
📝 The two estimates are equal when: \(\sum X_i^2 = \sum Y_i^2\)
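The two closed-form expressions can be checked against `lm()` on simulated data. This is a small sketch with an arbitrary seed and slope; the variable names are illustrative.

```r
set.seed(1)
x <- rnorm(100)
y <- 3 * x + rnorm(100)

beta_y_on_x <- sum(x * y) / sum(x^2)  # closed form, Y onto X
beta_x_on_y <- sum(x * y) / sum(y^2)  # closed form, X onto Y

# Both should agree with lm() fitted without an intercept.
all.equal(beta_y_on_x, unname(coef(lm(y ~ x - 1))))
all.equal(beta_x_on_y, unname(coef(lm(x ~ y - 1))))

# The ratio of the two estimates is the ratio of the sums of squares.
all.equal(beta_y_on_x / beta_x_on_y, sum(y^2) / sum(x^2))
```

Because the numerators are identical, the estimates coincide exactly when the two denominators do.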
(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(123)
X <- rnorm(100)
Y <- 2 * X + rnorm(100)
model_y_on_x <- lm(Y ~ X - 1)
model_x_on_y <- lm(X ~ Y - 1)
summary(model_y_on_x)
Call:
lm(formula = Y ~ X - 1)
Residuals:
Min 1Q Median 3Q Max
-2.0010 -0.7901 -0.1800 0.4693 3.1762
Coefficients:
Estimate Std. Error t value Pr(>|t|)
X 1.9364 0.1064 18.2 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9713 on 99 degrees of freedom
Multiple R-squared: 0.7698, Adjusted R-squared: 0.7675
F-statistic: 331.1 on 1 and 99 DF, p-value: < 2.2e-16
summary(model_x_on_y)
Call:
lm(formula = X ~ Y - 1)
Residuals:
Min 1Q Median 3Q Max
-1.49720 -0.18013 0.07056 0.35235 1.04653
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Y 0.39757 0.02185 18.2 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4401 on 99 degrees of freedom
Multiple R-squared: 0.7698, Adjusted R-squared: 0.7675
F-statistic: 331.1 on 1 and 99 DF, p-value: < 2.2e-16
📝 The regression of Y on X gives a slope of 1.9364, while X on Y gives 0.3976. The two regressions do not produce the same result. This happens because X and Y have different sums of squares.
(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
set.seed(123)
X <- rnorm(100)
Y <- X
model_y_on_x <- lm(Y ~ X - 1)
model_x_on_y <- lm(X ~ Y - 1)
summary(model_y_on_x)
Warning in summary.lm(model_y_on_x): essentially perfect fit: summary may be unreliable
Call:
lm(formula = Y ~ X - 1)
Residuals:
Min 1Q Median 3Q Max
-1.851e-15 -1.582e-17 3.000e-19 1.727e-17 1.167e-16
Coefficients:
Estimate Std. Error t value Pr(>|t|)
X 1.000e+00 2.093e-17 4.778e+16 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.91e-16 on 99 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.283e+33 on 1 and 99 DF, p-value: < 2.2e-16
summary(model_x_on_y)
Warning in summary.lm(model_x_on_y): essentially perfect fit: summary may be unreliable
Call:
lm(formula = X ~ Y - 1)
Residuals:
Min 1Q Median 3Q Max
-1.851e-15 -1.582e-17 3.000e-19 1.727e-17 1.167e-16
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Y 1.000e+00 2.093e-17 4.778e+16 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.91e-16 on 99 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.283e+33 on 1 and 99 DF, p-value: < 2.2e-16
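📝 The coefficient estimates are exactly the same because X and Y are identical here, so \(\sum X_i^2 = \sum Y_i^2\) and both coefficients equal 1. Identical vectors are sufficient but not necessary: any X and Y with equal sums of squares give the same estimate in both directions.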
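A less degenerate example avoids the "essentially perfect fit" warning: if Y is a random permutation of X, the two vectors differ but share the same sum of squares (up to floating-point rounding), so the condition from (a) still holds. The sketch below is an illustration with hypothetical variable names.

```r
set.seed(123)
x <- rnorm(100)
y <- sample(x)  # same values in a different order, so sum(y^2) matches sum(x^2)

b_yx <- unname(coef(lm(y ~ x - 1)))
b_xy <- unname(coef(lm(x ~ y - 1)))

all.equal(b_yx, b_xy)          # TRUE: the estimates coincide
isTRUE(all.equal(x, y))        # checks whether the vectors themselves coincide
```

Here the common estimate is generally not 1, which shows that equal sums of squares, not identical data, is what drives the result.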