## Multiple Linear Regression on the Auto Data Set
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.5.3
data(Auto)
Auto <- na.omit(Auto)
pairs(Auto)
cor(Auto[, -9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
auto_lm <- lm(mpg ~ . - name, data = Auto)
summary(auto_lm)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(auto_lm)
par(mfrow = c(1, 1))
The Residuals vs Fitted plot displays a curved pattern rather than a completely random scatter, indicating that the relationship between the predictors and mpg may not be perfectly linear. The Scale-Location plot indicates that the variance of the residuals is not completely constant across fitted values, providing some evidence of heteroscedasticity.
interaction_lm <- lm(mpg ~ (cylinders + displacement + horsepower + weight + acceleration + year + origin)^2, data = Auto)
summary(interaction_lm)
##
## Call:
## lm(formula = mpg ~ (cylinders + displacement + horsepower + weight +
## acceleration + year + origin)^2, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
Several interaction terms appear to be statistically significant. In particular, the interactions between displacement and year, acceleration and year, and acceleration and origin all have p-values below 0.05.Overall, the presence of significant interaction terms indicates that some predictor variables do not act independently when explaining variation in miles per gallon.
transform_lm <- lm(mpg ~ log(displacement) + log(horsepower) + log(weight) +sqrt(acceleration) + year + origin + I(horsepower^2) + I(weight^2), data = Auto)
summary(transform_lm)
##
## Call:
## lm(formula = mpg ~ log(displacement) + log(horsepower) + log(weight) +
## sqrt(acceleration) + year + origin + I(horsepower^2) + I(weight^2),
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1230 -1.5110 -0.1676 1.4112 12.1136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.598e+02 1.919e+01 8.328 1.48e-15 ***
## log(displacement) -8.725e-01 1.018e+00 -0.857 0.3921
## log(horsepower) -1.242e+01 2.075e+00 -5.984 4.99e-09 ***
## log(weight) -1.653e+01 3.579e+00 -4.620 5.26e-06 ***
## sqrt(acceleration) -2.016e+00 7.827e-01 -2.576 0.0104 *
## year 7.679e-01 4.485e-02 17.123 < 2e-16 ***
## origin 6.528e-01 2.598e-01 2.513 0.0124 *
## I(horsepower^2) 2.182e-04 5.148e-05 4.238 2.83e-05 ***
## I(weight^2) 2.614e-07 1.376e-07 1.901 0.0581 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.899 on 383 degrees of freedom
## Multiple R-squared: 0.8649, Adjusted R-squared: 0.8621
## F-statistic: 306.5 on 8 and 383 DF, p-value: < 2.2e-16
Although the transformed model performed well, the interaction model achieved a slightly higher adjusted R², suggesting that interactions among predictors may explain variation in mpg more effectively than the transformations considered here.Overall, the transformed model suggests that allowing nonlinear relationships improves the flexibility of the regression model and provides evidence that some predictors affect mpg in a nonlinear manner.
par(mfrow = c(2, 2))
plot(transform_lm)
par(mfrow = c(1, 1))
library(ISLR2)
data(Carseats)
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
model_full <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model_full)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
The fitted model uses Sales as the response variable and
Price, Urban, and US as
predictors. Since Urban and US are qualitative
variables.
Price = 0, Urban = No, and
US = No.Price represents the expected
change in sales for a one-unit increase in price, holding
Urban and US constant.UrbanYes represents the difference
in sales between urban and non-urban stores, holding price and US status
constant.USYes represents the difference in
sales between stores in the US and stores outside the US, holding price
and urban status constant.coef(model_full)
## (Intercept) Price UrbanYes USYes
## 13.04346894 -0.05445885 -0.02191615 1.20057270
The model can be written as:
\[ \widehat{Sales} = 13.04 - 0.054(Price) - 0.022(UrbanYes) + 1.201(USYes) \]
where:
UrbanYes = 1 if the store is urban, and 0
otherwise.USYes = 1 if the store is in the US, and 0
otherwise.Based on the regression output, we reject \(H_0: \beta_j = 0\) for predictors with small p-values.
summary(model_full)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
The predictors Price and US appear to be
statistically significant. The variable Urban does not
appear to be statistically significant because its p-value is large.
model_small <- lm(Sales ~ Price + US, data = Carseats)
summary(model_small)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
summary(model_full)$r.squared
## [1] 0.2392754
summary(model_full)$adj.r.squared
## [1] 0.2335123
summary(model_small)$r.squared
## [1] 0.2392629
summary(model_small)$adj.r.squared
## [1] 0.2354305
The full model and smaller model fit the data similarly. Since
Urban was not statistically significant, removing it does
not meaningfully reduce model performance. The smaller model is
preferred because it is simpler and uses only predictors with evidence
of association with Sales.
confint(model_small)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
par(mfrow = c(2, 2))
plot(model_small)
par(mfrow = c(1, 1))
# Studentized residuals
stud_res <- rstudent(model_small)
which(abs(stud_res) > 3)
## named integer(0)
# Leverage values
lev <- hatvalues(model_small)
which(lev > 2 * mean(lev))
## 43 126 156 157 160 166 172 175 192 204 209 270 273 314 316 357 366 368 384 387
## 43 126 156 157 160 166 172 175 192 204 209 270 273 314 316 357 366 368 384 387
The diagnostic plots and leverage values can be used to check for unusual observations. Points with very large studentized residuals may be possible outliers, while observations with high leverage may have unusual predictor values. Overall, the diagnostic plots should be examined to determine whether any observations are unusually influential.
For regression of \(Y\) onto \(X\) without an intercept,
\[ \hat{\beta}_{Y|X} = \frac{\sum x_i y_i}{\sum x_i^2} \]
For regression of \(X\) onto \(Y\) without an intercept,
\[ \hat{\beta}_{X|Y} = \frac{\sum x_i y_i}{\sum y_i^2} \]
These two coefficient estimates are equal when:
\[ \sum x_i^2 = \sum y_i^2 \]
In other words, the two estimates are the same when the sum of squared values for \(X\) equals the sum of squared values for \(Y\).
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
# Regression of Y onto X without intercept
fit_y_on_x <- lm(y ~ x + 0)
# Regression of X onto Y without intercept
fit_x_on_y <- lm(x ~ y + 0)
coef(fit_y_on_x)
## x
## 1.993876
coef(fit_x_on_y)
## y
## 0.3911145
The two coefficient estimates are different in this example because \(\sum x_i^2\) is not equal to \(\sum y_i^2\).
sum(x^2)
## [1] 81.05509
sum(y^2)
## [1] 413.2135
set.seed(2)
x <- rnorm(100)
y <- x
# Regression of Y onto X without intercept
fit_y_on_x_same <- lm(y ~ x + 0)
# Regression of X onto Y without intercept
fit_x_on_y_same <- lm(x ~ y + 0)
coef(fit_y_on_x_same)
## x
## 1
coef(fit_x_on_y_same)
## y
## 1
In this example, \(x\) and \(y\) are identical, so:
\[ \sum x_i^2 = \sum y_i^2 \]
Therefore, the coefficient estimate for the regression of \(Y\) onto \(X\) is the same as the coefficient estimate for the regression of \(X\) onto \(Y\).
sum(x^2)
## [1] 133.3521
sum(y^2)
## [1] 133.3521