Chapter 03 (page 121): 2, 9, 10, 12
Q2: Carefully explain the differences between the KNN classifier and KNN regression methods.
KNN classification predicts a categorical outcome: it finds the K training observations nearest to the query point and assigns the most common class among them. KNN regression, on the other hand, predicts a numerical outcome by averaging the response values of the K nearest neighbors. For example, KNN classification can decide whether a loan should be approved for an applicant, while KNN regression can estimate a house price from similar houses nearby.
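The contrast can be sketched in a few lines of base R. This is a minimal illustration, not a production implementation: the training values and the helper names `nearest_k`, `knn_classify` and `knn_regress` are hypothetical, and distance is plain absolute difference on a single predictor.

```r
# Toy training data (hypothetical values, for illustration only)
train_x   <- c(1, 2, 3, 6, 7, 8)
train_cls <- c("A", "A", "A", "B", "B", "B")   # categorical response
train_num <- c(1.0, 1.5, 2.0, 5.0, 5.5, 6.0)   # numeric response

# Indices of the k training points closest to x0 (absolute distance in 1-D)
nearest_k <- function(x0, k) order(abs(train_x - x0))[1:k]

# KNN classification: majority vote among the k neighbours
knn_classify <- function(x0, k = 3) {
  votes <- table(train_cls[nearest_k(x0, k)])
  names(votes)[which.max(votes)]
}

# KNN regression: mean response of the k neighbours
knn_regress <- function(x0, k = 3) {
  mean(train_num[nearest_k(x0, k)])
}

knn_classify(2.5)  # class label chosen by vote
knn_regress(2.5)   # average of the neighbours' numeric responses
```

The neighbour search is identical in both functions; only the final aggregation step (vote versus mean) differs.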
Q9: This question involves the use of multiple
linear regression on the Auto data set.
library(ISLR2)  # assumption: Auto is taken from ISLR2 (the ISLR package also provides it)
data(Auto)
attach(Auto)
(a) Produce a scatterplot matrix which includes all
of the variables in the data set.
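A minimal sketch of one way to draw the scatterplot matrix with base R's `pairs()`. It assumes Auto is available from the ISLR2 package; the fallback to the built-in `mtcars` data is only there so the sketch runs when ISLR2 is not installed, and the object name `dat` is a placeholder.

```r
# Use Auto when the ISLR2 package is installed; otherwise fall back to the
# built-in mtcars data purely so the sketch still runs (assumption).
dat <- if (requireNamespace("ISLR2", quietly = TRUE)) ISLR2::Auto else mtcars

# pairs() draws one scatterplot for every pair of columns;
# factor columns such as name are shown by their integer codes
pairs(dat, pch = 20, cex = 0.5)
```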
(b) Compute the matrix of correlations between the
variables using the function cor(). You will need to exclude the name
variable, which is qualitative.
cor(Auto[, names(Auto) != "name"])
                    mpg  cylinders displacement horsepower     weight acceleration       year     origin
mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442    0.4233285  0.5805410  0.5652088
cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273   -0.5046834 -0.3456474 -0.5689316
displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944   -0.5438005 -0.3698552 -0.6145351
horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377   -0.6891955 -0.4163615 -0.4551715
weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000   -0.4168392 -0.3091199 -0.5850054
acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392    1.0000000  0.2903161  0.2127458
year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199    0.2903161  1.0000000  0.1815277
origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054    0.2127458  0.1815277  1.0000000
(c) Use the lm() function to perform a multiple
linear regression with mpg as the response and all other variables
except name as the predictors. Use the summary() function to print the
results. Comment on the output.
lm.fit <- lm(mpg ~ .-name, data=Auto)
summary(lm.fit)
Call:
lm(formula = mpg ~ . - name, data = Auto)
Residuals:
Min 1Q Median 3Q Max
-9.5903 -2.1565 -0.1169 1.8690 13.0604
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
cylinders     -0.493376   0.323282  -1.526  0.12780
displacement   0.019896   0.007515   2.647  0.00844 **
horsepower    -0.016951   0.013787  -1.230  0.21963
weight        -0.006474   0.000652  -9.929  < 2e-16 ***
acceleration   0.080576   0.098845   0.815  0.41548
year           0.750773   0.050973  14.729  < 2e-16 ***
origin         1.426141   0.278136   5.127 4.67e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
For instance:
i. Is there a relationship between the predictors and the
response?
Yes. There is a strong relationship between the predictors and the response variable mpg. The F-statistic is 252.4 with a p-value < 2.2e-16, indicating that at least one of the predictors is significantly associated with mpg. Additionally, the model explains about 82% of the variation in mpg (R-squared = 0.8215), suggesting a very good fit.
displacement, weight, year and origin have a statistically significant relationship with the response (p-values below 0.01); cylinders, horsepower and acceleration do not.
The coefficient for year is 0.7508, which suggests that, holding all other variables constant, each additional model year is associated with an average increase of about 0.75 mpg. This indicates that newer cars tend to be more fuel efficient than older cars.
(d) Use the plot() function to produce diagnostic
plots of the linear regression fit. Comment on any problems you see with
the fit. Do the residual plots suggest any unusually large outliers?
Does the leverage plot identify any observations with unusually high
leverage?
par(mfrow = c(2,2))
plot(lm.fit)
The Residuals vs Fitted plot shows a curved, non-linear trend rather than a flat, straight red line. This means the model does not fully capture the relationship between the predictors and mpg. The spread of the points around the red line is also uneven, which suggests heteroscedasticity, and the plot flags a few outliers: observations 327, 323 and 326.
The Q-Q Residuals plot shows most points lying close to the line, but the upper tail bends upward and contains the same outliers.
The Scale-Location plot again indicates that the model violates the assumption of homoscedasticity: the model’s prediction errors become more variable, and so less reliable, at higher predicted mpg values. The same three observations (327, 323 and 326) also sit well above the rest of the data, showing that they have unusually large standardized residuals.
The Residuals vs Leverage plot shows that no single data point has undue influence on the fit. Observation 14 has unusually high leverage compared to the other data, but it follows the model’s pattern. Observations 327 and 394 are far from the model’s predictions, but they do not have enough leverage to change the fitted coefficients materially.
(e) Use the * and : symbols to fit linear regression
models with interaction effects. Do any interactions appear to be
statistically significant?
I will consider the displacement * acceleration interaction because a car with a large engine (high displacement) usually accelerates quickly but burns a lot of fuel to do so. I will also include the horsepower * weight interaction: large, powerful trucks have both high weight and high horsepower yet poor fuel efficiency, so I want to see whether the same pattern appears in this data set.
# name is never added to this formula, so there is no need to subtract it
lm.fit.int <- lm(mpg ~ cylinders + displacement * acceleration + horsepower * weight + year + origin, data = Auto)
summary(lm.fit.int)
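In the model output, higher horsepower and heavier weight are associated with substantially lower mpg. The displacement:acceleration term also suggests that the fuel cost of accelerating quickly is higher with a bigger engine than with a small one. Whether each interaction is statistically significant is read from the p-values of the displacement:acceleration and horsepower:weight rows in the summary. Including the interactions improved the fit over the model in (c).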
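One way to check whether an interaction term earns its place is a nested F-test against the same model without it. The sketch below is an illustration under an assumption: the built-in `mtcars` data stands in for Auto so that it runs without extra packages, with `hp` and `wt` playing the roles of horsepower and weight. The names `fit_main` and `fit_int` are placeholders.

```r
# mtcars (built into R) stands in for Auto here (assumption)
fit_main <- lm(mpg ~ hp + wt, data = mtcars)   # main effects only
fit_int  <- lm(mpg ~ hp * wt, data = mtcars)   # main effects plus hp:wt

# Nested F-test: does adding the interaction reduce the residual sum of squares?
anova(fit_main, fit_int)

# The t-test on the interaction row gives the same answer for a single term
coef(summary(fit_int))["hp:wt", ]
```

Note that `a * b` in a formula expands to `a + b + a:b`, while `a:b` adds the interaction term alone.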
(f) Try a few different transformations of the
variables, such as log(X), √X, X^2. Comment on your findings.
First I pick a predictor variable and examine its relationship with the response variable. Then I try these transformations and see which one best captures that relationship. For this example I use horsepower against mpg, starting with a scatter plot to see what their relationship looks like.
plot(Auto$horsepower, Auto$mpg,
xlab = "Horsepower",
ylab = "MPG",
main = "MPG vs Horsepower",
pch = 1)
lm.fit.log <- lm(mpg ~ log(horsepower), data = Auto)
summary(lm.fit.log)
Call:
lm(formula = mpg ~ log(horsepower), data = Auto)
Residuals:
Min 1Q Median 3Q Max
-14.2299 -2.7818 -0.2322 2.6661 15.4695
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 108.6997 3.0496 35.64 <2e-16 ***
log(horsepower) -18.5822 0.6629 -28.03 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.501 on 390 degrees of freedom
Multiple R-squared: 0.6683, Adjusted R-squared: 0.6675
F-statistic: 785.9 on 1 and 390 DF, p-value: < 2.2e-16
lm.fit.sqrt <- lm(mpg ~ sqrt(horsepower), data = Auto)
summary(lm.fit.sqrt)
Call:
lm(formula = mpg ~ sqrt(horsepower), data = Auto)
Residuals:
Min 1Q Median 3Q Max
-13.9768 -3.2239 -0.2252 2.6881 16.1411
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 58.705 1.349 43.52 <2e-16 ***
sqrt(horsepower) -3.503 0.132 -26.54 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.665 on 390 degrees of freedom
Multiple R-squared: 0.6437, Adjusted R-squared: 0.6428
F-statistic: 704.6 on 1 and 390 DF, p-value: < 2.2e-16
lm.fit.quad <- lm(mpg ~ I(horsepower^2), data = Auto)
summary(lm.fit.quad)
Call:
lm(formula = mpg ~ I(horsepower^2), data = Auto)
Residuals:
Min 1Q Median 3Q Max
-12.529 -3.798 -1.049 3.240 18.528
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.047e+01 4.466e-01 68.22 <2e-16 ***
I(horsepower^2) -5.665e-04 2.827e-05 -20.04 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.485 on 390 degrees of freedom
Multiple R-squared: 0.5074, Adjusted R-squared: 0.5061
F-statistic: 401.7 on 1 and 390 DF, p-value: < 2.2e-16
lm.fit.poly <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(lm.fit.poly)
Call:
lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)
Residuals:
Min 1Q Median 3Q Max
-14.7135 -2.5943 -0.0859 2.2868 15.8961
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.9000997 1.8004268 31.60 <2e-16 ***
horsepower -0.4661896 0.0311246 -14.98 <2e-16 ***
I(horsepower^2) 0.0012305 0.0001221 10.08 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.374 on 389 degrees of freedom
Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
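The quadratic model (horsepower plus horsepower^2) has the best adjusted R-squared at 0.686, ahead of the log (0.6675), square-root (0.6428) and squared-term-only (0.5061) models. Now let's overlay the fitted curves on the earlier scatter plot. Note that the plot below draws a cubic fit in place of the squared-term-only model.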
library(ggplot2)
ggplot(Auto, aes(x = horsepower, y = mpg)) +
geom_point(color = "gray50", alpha = 0.5, size = 2) +
# linewidth replaces size for line widths in ggplot2 >= 3.4.0
stat_smooth(method = "lm", formula = y ~ log(x),
aes(color = "Log"), se = FALSE, linewidth = 1) +
stat_smooth(method = "lm", formula = y ~ sqrt(x),
aes(color = "Sqrt"), se = FALSE, linewidth = 1) +
stat_smooth(method = "lm", formula = y ~ x + I(x^2),
aes(color = "Quad"), se = FALSE, linewidth = 1) +
stat_smooth(method = "lm", formula = y ~ poly(x, 3),
aes(color = "Poly (Cubic)"), se = FALSE, linewidth = 1) +
labs(title = "Comparing Model Transformations: MPG vs Horsepower",
x = "Horsepower",
y = "Miles Per Gallon (MPG)",
color = "Model Type") +
scale_color_manual(values = c("Log" = "red",
"Sqrt" = "blue",
"Quad" = "green3",
"Poly (Cubic)" = "purple")) +
theme_minimal()
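The comparison of transformations in (f) can also be collected in a single table instead of reading each summary separately. This is a sketch under an assumption: the built-in `mtcars` data stands in for Auto, with `hp` in the role of horsepower, and the names `formulas` and `adj_r2` are placeholders.

```r
# mtcars (built into R) stands in for Auto here; hp plays the role of horsepower
formulas <- list(
  linear    = mpg ~ hp,
  log       = mpg ~ log(hp),
  sqrt      = mpg ~ sqrt(hp),
  quadratic = mpg ~ hp + I(hp^2)
)

# Fit each model and pull out its adjusted R-squared
adj_r2 <- sapply(formulas, function(f) summary(lm(f, data = mtcars))$adj.r.squared)

# Highest adjusted R-squared first
sort(adj_r2, decreasing = TRUE)
```

Adjusted R-squared is used rather than plain R-squared because the quadratic model has one more parameter than the others.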
Q10: This question should be answered using the
Carseats data set.
data("Carseats")
attach(Carseats)
(a) Fit a multiple regression model to predict Sales
using Price, Urban, and US.
carseat_lm <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(carseat_lm)
Call:
lm(formula = Sales ~ Price + Urban + US, data = Carseats)
Residuals:
Min 1Q Median 3Q Max
-6.9206 -1.6220 -0.0564 1.5786 7.0581
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
Price       -0.054459   0.005242 -10.389  < 2e-16 ***
UrbanYes    -0.021916   0.271650  -0.081    0.936
USYes        1.200573   0.259042   4.635 4.86e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
(b) Provide an interpretation of each coefficient in
the model. Be careful—some of the variables in the model are
qualitative!
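Intercept (13.043): predicted sales, in thousands of units, for a non-urban, non-US store at a price of $0. It has no practical meaning on its own.
Price (-0.054): holding other variables constant, every $1 increase in price is associated with a decrease in sales of about 0.054 thousand units (roughly 54 units), since Sales is recorded in thousands.
UrbanYes (-0.022): holding other variables constant, stores in urban locations sell about 0.022 thousand (22) fewer units than non-urban stores. This effect is very small and not statistically significant (p = 0.936).
USYes (1.201): holding other variables constant, stores located in the US sell about 1.2 thousand (1,200) more units than non-US stores.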
(c) Write out the model in equation form, being
careful to handle the qualitative variables properly.
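\(\text{Sales} = \beta_0 + \beta_1 \cdot \text{Price} + \beta_2 \cdot \text{UrbanYes} + \beta_3 \cdot \text{USYes} + \epsilon\)
where UrbanYes equals 1 if the store is in an urban location and 0 otherwise, and USYes equals 1 if the store is in the US and 0 otherwise.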
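Plugging the estimates from (a) into the model gives one fitted line per combination of the two qualitative variables. The following is a worked sketch with coefficients rounded from the summary output above; each intercept is the base intercept plus whichever dummy coefficients are switched on.

\[
\widehat{\text{Sales}} =
\begin{cases}
14.222 - 0.0545 \cdot \text{Price} & \text{Urban = Yes, US = Yes} \\
13.022 - 0.0545 \cdot \text{Price} & \text{Urban = Yes, US = No} \\
14.244 - 0.0545 \cdot \text{Price} & \text{Urban = No, US = Yes} \\
13.043 - 0.0545 \cdot \text{Price} & \text{Urban = No, US = No}
\end{cases}
\]

All four lines share the same slope for Price because the model contains no interaction terms.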
(d) For which of the predictors can you reject the
null hypothesis H0 : βj =0?
We can reject the null hypothesis for Price and USYes, whose p-values are far below 0.05. UrbanYes has a p-value of 0.936, so for that predictor we fail to reject the null hypothesis.
(e) On the basis of your response to the previous
question, fit a smaller model that only uses the predictors for which
there is evidence of association with the outcome.
carseat_lm_reduced <- lm(Sales ~ Price + US, data = Carseats)
summary(carseat_lm_reduced)
Call:
lm(formula = Sales ~ Price + US, data = Carseats)
Residuals:
Min 1Q Median 3Q Max
-6.9269 -1.6286 -0.0574 1.5766 7.0515
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
Price       -0.05448    0.00523 -10.416  < 2e-16 ***
USYes        1.19964    0.25846   4.641 4.71e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
(f) How well do the models in (a) and (e) fit the
data?
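Both models fit the data about equally well, and only modestly: each has a Multiple R-squared of 0.2393, so they explain roughly 24% of the variation in Sales. The adjusted R-squared rises slightly in model (e), from 0.2335 to 0.2354, and the residual standard error falls from 2.472 to 2.469, so the smaller model gives a more parsimonious fit with one fewer variable.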
(g) Using the model from (e), obtain 95% confidence
intervals for the coefficient(s).
confint(carseat_lm_reduced)
                  2.5 %      97.5 %
(Intercept) 11.79032020 14.27126531
Price       -0.06475984 -0.04419543
USYes        0.69151957  1.70776632
(h) Is there evidence of outliers or high leverage
observations in the model from (e)?
par(mfrow=c(2,2))
plot(carseat_lm_reduced)
plot(hatvalues(carseat_lm_reduced))
hat <- hatvalues(carseat_lm_reduced)
which.max(hat)
43
43
Observation 43 has the highest hat value, that is, the highest leverage.
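To turn the hat values and residuals into a yes/no screen, leverage can be compared with its average, (p + 1)/n, and studentized residuals with a cutoff. The sketch below uses simulated data with one planted high-leverage point rather than Carseats, and the multiplier of 3 and the residual cutoff of 3 are rule-of-thumb assumptions, not fixed standards. All object names are placeholders.

```r
set.seed(1)                      # arbitrary seed, for reproducibility
n <- 50
x <- c(rnorm(n - 1), 10)         # the last point is far from the others in x
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

p <- length(coef(fit)) - 1       # number of predictors
avg_leverage <- (p + 1) / n      # hat values always average to (p + 1) / n

# Flag leverage well above average (3x is a rule-of-thumb assumption)
high_lev <- which(hatvalues(fit) > 3 * avg_leverage)

# Flag studentized residuals beyond +/- 3 as possible outliers
outliers <- which(abs(rstudent(fit)) > 3)

high_lev
outliers
```

Applied to the model from (e), the same two `which()` calls on `carseat_lm_reduced` would list the observations worth a closer look.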
Q12: This problem involves simple linear regression
without an intercept.
(a) Recall that the coefficient estimate ˆβ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
For regression without an intercept, the coefficient estimates are
\[ \hat{\beta}_{Y|X} = \frac{\sum x_i y_i}{\sum x_i^2} \]
and
\[ \hat{\beta}_{X|Y} = \frac{\sum x_i y_i}{\sum y_i^2} \]
The coefficient estimates will be equal when
\[ \frac{\sum x_i y_i}{\sum x_i^2} = \frac{\sum x_i y_i}{\sum y_i^2} \]
Provided \(\sum x_i y_i \neq 0\), the common numerator cancels and the condition becomes
\[ \sum x_i^2 = \sum y_i^2 \]
The coefficient estimates are therefore the same when X and Y have the same sum of squared values. They also coincide trivially, both being zero, when \(\sum x_i y_i = 0\).
(b) Generate an example in R with n = 100
observations in which the coefficient estimate for the regression of X
onto Y is different from the coefficient estimate for the regression of
Y onto X.
set.seed(45)
x <- rnorm(100)
y <- 2*x + rnorm(100)
coef(lm(y ~ x - 1))
x
1.942817
coef(lm(x ~ y - 1))
y
0.4254618
I generated 100 observations with \(y = 2x + \epsilon\). The estimated coefficient for the regression of Y onto X (1.943) differs from the estimated coefficient for the regression of X onto Y (0.425), demonstrating that the two regressions generally produce different estimates.
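The closed-form expressions from (a) can be checked directly against `lm()`. This is a small sketch reusing the same simulated data; `beta_yx` and `beta_xy` are placeholder names for the two no-intercept estimates.

```r
set.seed(45)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Closed-form no-intercept estimates: sum(x*y) over the regressor's sum of squares
beta_yx <- sum(x * y) / sum(x^2)   # Y onto X
beta_xy <- sum(x * y) / sum(y^2)   # X onto Y

# Both should agree with lm() fitted without an intercept
all.equal(beta_yx, unname(coef(lm(y ~ x - 1))))
all.equal(beta_xy, unname(coef(lm(x ~ y - 1))))
```

The two estimates share a numerator, so they differ exactly when the two sums of squares differ.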
(c) Generate an example in R with n = 100
observations in which the coefficient estimate for the regression of X
onto Y is the same as the coefficient estimate for the regression of Y
onto X.
set.seed(45)
x <- rnorm(100)
y <- sample(c(-1,1), 100, replace = TRUE) * x
sum(x^2)
[1] 127.9706
sum(y^2)
[1] 127.9706
coef(lm(y ~ x - 1))
x
0.2549004
coef(lm(x ~ y - 1))
y
0.2549004
I generated Y by multiplying each value of X by either 1 or -1, so \(y_i^2 = x_i^2\) for every observation. As a result, X and Y have identical sums of squares (127.9706), and the estimated coefficient for the regression of Y onto X equals the estimated coefficient for the regression of X onto Y (0.2549).
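Random sign flips are only one way to meet the condition. Another, sketched here as an alternative construction rather than the approach used above, is to make Y a random permutation of X: the values are the same, so the sums of squares match up to floating-point rounding. The names `b_yx` and `b_xy` are placeholders.

```r
set.seed(45)
x <- rnorm(100)
y <- sample(x)   # same values as x, in a different order

# Sums of squares agree (up to floating-point rounding from summation order)
all.equal(sum(x^2), sum(y^2))

# So the two no-intercept slopes agree as well
b_yx <- unname(coef(lm(y ~ x - 1)))
b_xy <- unname(coef(lm(x ~ y - 1)))
all.equal(b_yx, b_xy)
```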