M. Drew LaMar
February 12, 2020
Definition: Multiple regression extends simple two-variable regression to the case that still has one response but many predictors (denoted \( x_1 \), \( x_2 \), \( x_3 \), …).
The method is motivated by scenarios where many variables may be simultaneously connected to an output.
We're going to look at auction data for the Mario Kart Wii game.
'data.frame': 141 obs. of 5 variables:
$ price : num 51.5 37 45.5 44 71 ...
$ cond : Factor w/ 2 levels "used","new": 2 1 2 2 2 2 1 2 1 1 ...
$ stock_photo: Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
$ duration : int 3 7 3 3 1 3 1 1 3 7 ...
$ wheels : int 1 1 1 1 2 0 0 2 1 1 ...
summary(mdl_cond)
Call:
lm(formula = price ~ cond, data = mario_kart)
Residuals:
Min 1Q Median 3Q Max
-13.8911 -5.8311 0.1289 4.1289 22.1489
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.871 0.814 52.668 < 2e-16 ***
condnew 10.900 1.258 8.662 1.06e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.371 on 139 degrees of freedom
Multiple R-squared: 0.3506, Adjusted R-squared: 0.3459
F-statistic: 75.03 on 1 and 139 DF, p-value: 1.056e-14
A multiple regression model is a linear model with many predictors. In general, we write the model as \[ \hat{y} = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{k}x_{k} \] when there are \( k \) predictors.
mdl_full <- lm(price ~ cond + stock_photo + duration + wheels, data = mario_kart)
summary(mdl_full)
Call:
lm(formula = price ~ cond + stock_photo + duration + wheels,
data = mario_kart)
Residuals:
Min 1Q Median 3Q Max
-11.3788 -2.9854 -0.9654 2.6915 14.0346
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.21097 1.51401 23.917 < 2e-16 ***
condnew 5.13056 1.05112 4.881 2.91e-06 ***
stock_photoyes 1.08031 1.05682 1.022 0.308
duration -0.02681 0.19041 -0.141 0.888
wheels 7.28518 0.55469 13.134 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.901 on 136 degrees of freedom
Multiple R-squared: 0.719, Adjusted R-squared: 0.7108
F-statistic: 87.01 on 4 and 136 DF, p-value: < 2.2e-16
For only one predictor variable:
\[ R^2 = 1- \frac{\textrm{Variability in residuals}}{\textrm{Variability in outcome}} \]
For multiple predictor variables, we use the adjusted R2:
\[ \hat{R}^2 = 1- \frac{\textrm{Variability in residuals}}{\textrm{Variability in outcome}} \times \frac{n-1}{n-k-1} \]
The adjusted R2 value is important for model selection.
Discuss: What is model selection and why is it important?
Four techniques:
Note: No guarantee that forward and backward selection will lead to same final model. In this case, choose model with highest adjusted \( R^2 \) value.
Note: When the sole goal is to improve prediction accuracy, use adjusted \( R^2 \) technique. This is commonly the case in machine learning applications.
Note: When we care about understanding which variables are statistically significant predictors of the response, or if there is interest in producing a simpler model at the potential cost of a little prediction accuracy, then the \( P \)-value approach is preferred.
Let's use backward elimination with the adjusted \( R^2 \) method:
full <- lm(price ~ cond + stock_photo + wheels + duration, data = mario_kart)
summary(full)
Call:
lm(formula = price ~ cond + stock_photo + wheels + duration,
data = mario_kart)
Residuals:
Min 1Q Median 3Q Max
-11.3788 -2.9854 -0.9654 2.6915 14.0346
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.21097 1.51401 23.917 < 2e-16 ***
condnew 5.13056 1.05112 4.881 2.91e-06 ***
stock_photoyes 1.08031 1.05682 1.022 0.308
wheels 7.28518 0.55469 13.134 < 2e-16 ***
duration -0.02681 0.19041 -0.141 0.888
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.901 on 136 degrees of freedom
Multiple R-squared: 0.719, Adjusted R-squared: 0.7108
F-statistic: 87.01 on 4 and 136 DF, p-value: < 2.2e-16
Let's use backward elimination with the adjusted \( R^2 \) method:
First, compute models eliminating each variable individually
full_cond <- lm(price ~ stock_photo + wheels + duration, data = mario_kart)
full_stock_photo <- lm(price ~ cond + wheels + duration, data = mario_kart)
full_wheels <- lm(price ~ cond + stock_photo + duration, data = mario_kart)
full_duration <- lm(price ~ cond + stock_photo + wheels, data = mario_kart)
Second, look at the adjusted \( R^2 \) values for each
Full - cond: 0.66257465806698
Full - stock_photo: 0.71066728367746
Full - wheels: 0.348698623692822
Full - duration: 0.712831545419501
Are any of them larger than \( R^2 \) = 0.7107622 for the full model?
Let's use backward elimination with the adjusted \( R^2 \) method:
“Full - duration” wins, so now compute models eliminating variables from that one
full_duration_cond <- lm(price ~ stock_photo + wheels, data = mario_kart)
full_duration_stock_photo <- lm(price ~ cond + wheels, data = mario_kart)
full_duration_wheels <- lm(price ~ cond + stock_photo, data = mario_kart)
Second, look at the adjusted \( R^2 \) values for each
Full - duration - cond: 0.658720517517627
Full - duration - stock_photo: 0.712409794516323
Full - duration - wheels: 0.341432692541391
Are any of them larger than \( R^2 \) = 0.7128315 for the full model?
Using backward elimination with the adjusted \( R^2 \) method:
winner <- lm(price ~ cond + stock_photo + wheels, data = mario_kart)
summary(winner)
Call:
lm(formula = price ~ cond + stock_photo + wheels, data = mario_kart)
Residuals:
Min 1Q Median 3Q Max
-11.454 -2.959 -0.949 2.712 14.061
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.0483 0.9745 36.990 < 2e-16 ***
condnew 5.1763 0.9961 5.196 7.21e-07 ***
stock_photoyes 1.1177 1.0192 1.097 0.275
wheels 7.2984 0.5448 13.397 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.884 on 137 degrees of freedom
Multiple R-squared: 0.719, Adjusted R-squared: 0.7128
F-statistic: 116.8 on 3 and 137 DF, p-value: < 2.2e-16
These were our steps for fitting a linear model:
Do a qqplot of the residuals:
qqnorm(residuals(winner))
Residuals vs. fitted values:
plot(fitted(winner), residuals(winner))
Residuals vs order of collection (ID):
plot(rownames(mario_kart), residuals(winner))
par(mfrow=c(1,3))
boxplot(residuals(winner) ~ cond, data = mario_kart)
boxplot(residuals(winner) ~ stock_photo, data = mario_kart)
plot(residuals(winner) ~ wheels, data = mario_kart)