This document presents answers to several exercises from Chapter 3 (“Linear Regression”) in An Introduction to Statistical Learning. In particular, we address:
For all analyses, we use the ISLR2 package (which
contains these data sets), along with ggplot2 and
dplyr for data manipulation and visualization.
Load the necessary libraries and data sets.
# Load libraries
library(ISLR2) # Contains Auto and Boston data sets
library(ggplot2) # For plotting
library(dplyr) # For data manipulation
# Set a consistent theme for ggplot2
theme_set(theme_minimal())
data(Auto)
# Display the first few rows of the dataset
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
str(Auto) # Display variable types
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
### Question 2
> Carefully explain the differences between the KNN classifier and KNN
> regression methods.
The KNN classifier is categorical and assigns a value based on the most
frequent observed category among $K$ nearest neighbors, whereas KNN regression
assigns a continuous variable, the average of the response variables for
the $K$ nearest neighbors.
This question involves the use of multiple linear regression on the
Autodata set.
- Produce a scatterplot matrix which includes all of the variables in the data set.
pairs(Auto, cex = 0.2)
- Compute the matrix of correlations between the variables using the function
cor(). You will need to exclude the name variable,namewhich is qualitative.
x <- subset(Auto, select = -name)
cor(x)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
- Use the
lm()function to perform a multiple linear regression withmpgas the response and all other variables except name as the predictors. Use thesummary()function to print the results. Comment on the output. For instance:
- Is there a relationship between the predictors and the response?
- Which predictors appear to have a statistically significant relationship to the response?
- What does the coefficient for the
yearvariable suggest?
fit <- lm(mpg ~ ., data = x)
summary(fit)
##
## Call:
## lm(formula = mpg ~ ., data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Yes, there is a relationship between some predictors and response, such as “displacement”, “weight”, and “year”
The coefficient for “year” suggests that “mpg” increases every year, on average.
- Use the
plot()function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow = c(2, 2))
plot(fit, cex = 0.2)
One point has high leverage, the residuals also show a trend with fitted values.
- Use the
*and:symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
summary(lm(mpg ~ . + weight:horsepower, data = x))
##
## Call:
## lm(formula = mpg ~ . + weight:horsepower, data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.589 -1.617 -0.184 1.541 12.001
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.876e+00 4.511e+00 0.638 0.524147
## cylinders -2.955e-02 2.881e-01 -0.103 0.918363
## displacement 5.950e-03 6.750e-03 0.881 0.378610
## horsepower -2.313e-01 2.363e-02 -9.791 < 2e-16 ***
## weight -1.121e-02 7.285e-04 -15.393 < 2e-16 ***
## acceleration -9.019e-02 8.855e-02 -1.019 0.309081
## year 7.695e-01 4.494e-02 17.124 < 2e-16 ***
## origin 8.344e-01 2.513e-01 3.320 0.000986 ***
## horsepower:weight 5.529e-05 5.227e-06 10.577 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.931 on 383 degrees of freedom
## Multiple R-squared: 0.8618, Adjusted R-squared: 0.859
## F-statistic: 298.6 on 8 and 383 DF, p-value: < 2.2e-16
summary(lm(mpg ~ . + acceleration:horsepower, data = x))
##
## Call:
## lm(formula = mpg ~ . + acceleration:horsepower, data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0329 -1.8177 -0.1183 1.7247 12.4870
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -32.499820 4.923380 -6.601 1.36e-10 ***
## cylinders 0.083489 0.316913 0.263 0.792350
## displacement -0.007649 0.008161 -0.937 0.349244
## horsepower 0.127188 0.024746 5.140 4.40e-07 ***
## weight -0.003976 0.000716 -5.552 5.27e-08 ***
## acceleration 0.983282 0.161513 6.088 2.78e-09 ***
## year 0.755919 0.048179 15.690 < 2e-16 ***
## origin 1.035733 0.268962 3.851 0.000138 ***
## horsepower:acceleration -0.012139 0.001772 -6.851 2.93e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.145 on 383 degrees of freedom
## Multiple R-squared: 0.841, Adjusted R-squared: 0.8376
## F-statistic: 253.2 on 8 and 383 DF, p-value: < 2.2e-16
summary(lm(mpg ~ . + cylinders:weight, data = x))
##
## Call:
## lm(formula = mpg ~ . + cylinders:weight, data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9484 -1.7133 -0.1809 1.4530 12.4137
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.3143478 5.0076737 1.461 0.14494
## cylinders -5.0347425 0.5795767 -8.687 < 2e-16 ***
## displacement 0.0156444 0.0068409 2.287 0.02275 *
## horsepower -0.0314213 0.0126216 -2.489 0.01322 *
## weight -0.0150329 0.0011125 -13.513 < 2e-16 ***
## acceleration 0.1006438 0.0897944 1.121 0.26306
## year 0.7813453 0.0464139 16.834 < 2e-16 ***
## origin 0.8030154 0.2617333 3.068 0.00231 **
## cylinders:weight 0.0015058 0.0001657 9.088 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.022 on 383 degrees of freedom
## Multiple R-squared: 0.8531, Adjusted R-squared: 0.8501
## F-statistic: 278.1 on 8 and 383 DF, p-value: < 2.2e-16
There are a few coefficients that appear to be highly significant.
- Try a few different transformations of the variables, such as \(log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.
Below are transformations for horsepower:
par(mfrow = c(2, 2))
plot(Auto$horsepower, Auto$mpg, cex = 0.2)
plot(log(Auto$horsepower), Auto$mpg, cex = 0.2)
plot(sqrt(Auto$horsepower), Auto$mpg, cex = 0.2)
plot(Auto$horsepower^2, Auto$mpg, cex = 0.2)
x <- subset(Auto, select = -name)
x$horsepower <- log(x$horsepower)
fit <- lm(mpg ~ ., data = x)
summary(fit)
##
## Call:
## lm(formula = mpg ~ ., data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3115 -2.0041 -0.1726 1.8393 12.6579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.254005 8.589614 3.173 0.00163 **
## cylinders -0.486206 0.306692 -1.585 0.11372
## displacement 0.019456 0.006876 2.830 0.00491 **
## horsepower -9.506436 1.539619 -6.175 1.69e-09 ***
## weight -0.004266 0.000694 -6.148 1.97e-09 ***
## acceleration -0.292088 0.103804 -2.814 0.00515 **
## year 0.705329 0.048456 14.556 < 2e-16 ***
## origin 1.482435 0.259347 5.716 2.19e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.18 on 384 degrees of freedom
## Multiple R-squared: 0.837, Adjusted R-squared: 0.834
## F-statistic: 281.6 on 7 and 384 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(fit, cex = 0.2)
horsepower appears to give a linear relationship with
mpg.
This question should be answered using the
Carseatsdata set.
- Fit a multiple regression model to predict
SalesusingPrice,Urban, andUS.
fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
- Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
summary(fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
- Write out the model in equation form, being careful to handle the qualitative variables properly.
\[ \textit{Sales} = 13 + -0.054 \times \textit{Price} + \begin{cases} -0.022, & \text{if $\textit{Urban}$ is Yes, $\textit{US}$ is No} \\ 1.20, & \text{if $\textit{Urban}$ is No, $\textit{US}$ is Yes} \\ 1.18, & \text{if $\textit{Urban}$ and $\textit{US}$ is Yes} \\ 0, & \text{Otherwise} \end{cases} \]
- For which of the predictors can you reject the null hypothesis \(H_0 : \beta_j = 0\)?
Price and US
- On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
fit2 <- lm(Sales ~ Price + US, data = Carseats)
- How well do the models in (a) and (e) fit the data?
summary(fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
summary(fit2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
anova(fit, fit2)
## Analysis of Variance Table
##
## Model 1: Sales ~ Price + Urban + US
## Model 2: Sales ~ Price + US
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 396 2420.8
## 2 397 2420.9 -1 -0.03979 0.0065 0.9357
They have similar \(R^2\) and the
model containing the extra variable Urban is
non-significantly better.
- Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(fit2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
- Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow = c(2, 2))
plot(fit2, cex = 0.2)
Yes, there are definitely some outliers, but most of the data is well fitted and grouped.
This problem involves simple linear regression without an intercept.
- Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of \(Y\) onto \(X\) without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of \(X\) onto \(Y\) the same as the coefficient estimate for the regression of \(Y\) onto \(X\)?
\[ \hat\beta = \sum_i x_iy_i / \sum_{i'} x_{i'}^2 \]
The coefficient for the regression of X onto Y swaps the \(x\) and \(y\) variables:
\[ \hat\beta = \sum_i x_iy_i / \sum_{i'} y_{i'}^2 \]
So they are the same when \(\sum_{i} x_{i}^2 = \sum_{i} y_{i}^2\)
- Generate an example in
Rwith \(n = 100\) observations in which the coefficient estimate for the regression of \(X\) onto \(Y\) is different from the coefficient estimate for the regression of \(Y\) onto \(X\).
x <- rnorm(100)
y <- 2 * x + rnorm(100, 0, 0.1)
c(sum(x^2), sum(y^2))
## [1] 110.8061 445.4824
c(coef(lm(y ~ x))[2], coef(lm(x ~ y))[2])
## x y
## 2.0038574 0.4978905
- Generate an example in
Rwith \(n = 100\) observations in which the coefficient estimate for the regression of \(X\) onto \(Y\) is the same as the coefficient estimate for the regression of \(Y\) onto \(X\).
x <- rnorm(100)
y <- x + rnorm(100, 0, 0.1)
c(sum(x^2), sum(y^2))
## [1] 101.2825 102.0189
c(coef(lm(y ~ x))[2], coef(lm(x ~ y))[2])
## x y
## 0.9997227 0.9923036