M. Drew LaMar
February 11, 2019
Definition: Multiple regression extends simple two-variable regression to the case where there is still one response variable but many predictor variables (denoted \( x_1 \), \( x_2 \), \( x_3 \), …).
The method is motivated by scenarios where many variables may be simultaneously connected to an outcome.
We're going to look at auction data for the Mario Kart Wii game.
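The structure of the data is shown below. As a sketch of how they might be loaded, assuming the mariokart data from the openintro package with two extreme-price outliers removed and five variables kept (the $100 cutoff and the renaming of total_pr to price are assumptions for illustration):
library(openintro)   # assumed source of the mariokart auction data
library(dplyr)
library(ggplot2)
mario_kart <- mariokart %>%
  filter(total_pr < 100) %>%   # drop two outlier auctions (assumed cutoff)
  select(price = total_pr, cond, stock_photo, duration, wheels)
str(mario_kart)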
'data.frame': 141 obs. of 5 variables:
$ price : num 51.5 37 45.5 44 71 ...
$ cond : Factor w/ 2 levels "used","new": 2 1 2 2 2 2 1 2 1 1 ...
$ stock_photo: Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
$ duration : int 3 7 3 3 1 3 1 1 3 7 ...
$ wheels : int 1 1 1 1 2 0 0 2 1 1 ...
mario_kart %>%
  ggplot(aes(x = cond, y = price)) +
  geom_point(position = position_jitter(width = 0.1))
mdl_cond <- lm(price ~ cond, data = mario_kart)
summary(mdl_cond)
Call:
lm(formula = price ~ cond, data = mario_kart)
Residuals:
Min 1Q Median 3Q Max
-13.8911 -5.8311 0.1289 4.1289 22.1489
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.871 0.814 52.668 < 2e-16 ***
condnew 10.900 1.258 8.662 1.06e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.371 on 139 degrees of freedom
Multiple R-squared: 0.3506, Adjusted R-squared: 0.3459
F-statistic: 75.03 on 1 and 139 DF, p-value: 1.056e-14
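Reading off the output, the fitted model is
\[ \widehat{\textrm{price}} = 42.87 + 10.90 \times \textrm{condnew} \]
so new-condition games sold for about $10.90 more, on average, than used ones.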
A multiple regression model is a linear model with many predictors. In general, we write the model as \[ \hat{y} = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{k}x_{k} \] when there are \( k \) predictors.
mdl_full <- lm(price ~ cond + stock_photo + duration + wheels, data = mario_kart)
summary(mdl_full)
Call:
lm(formula = price ~ cond + stock_photo + duration + wheels,
data = mario_kart)
Residuals:
Min 1Q Median 3Q Max
-11.3788 -2.9854 -0.9654 2.6915 14.0346
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.21097 1.51401 23.917 < 2e-16 ***
condnew 5.13056 1.05112 4.881 2.91e-06 ***
stock_photoyes 1.08031 1.05682 1.022 0.308
duration -0.02681 0.19041 -0.141 0.888
wheels 7.28518 0.55469 13.134 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.901 on 136 degrees of freedom
Multiple R-squared: 0.719, Adjusted R-squared: 0.7108
F-statistic: 87.01 on 4 and 136 DF, p-value: < 2.2e-16
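The fitted multiple regression model is therefore
\[ \widehat{\textrm{price}} = 36.21 + 5.13 \times \textrm{condnew} + 1.08 \times \textrm{stock\_photoyes} - 0.03 \times \textrm{duration} + 7.29 \times \textrm{wheels} \]
Note that the coefficient for condnew drops from 10.90 to 5.13 once the other predictors are accounted for.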
For a single predictor variable:
\[ R^2 = 1- \frac{\textrm{Variability in residuals}}{\textrm{Variability in outcome}} \]
For multiple predictor variables, we use the adjusted \( R^2 \):
\[ R^2_{adj} = 1- \frac{\textrm{Variability in residuals}}{\textrm{Variability in outcome}} \times \frac{n-1}{n-k-1} \]
where \( n \) is the number of cases and \( k \) is the number of predictor variables.
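As a check on the formula, here is a quick sketch that computes the adjusted \( R^2 \) for mdl_full by hand; it should reproduce the "Adjusted R-squared: 0.7108" reported by summary(mdl_full) above.
n <- nrow(mario_kart)   # 141 cases
k <- 4                  # cond, stock_photo, duration, wheels
ss_resid <- sum(residuals(mdl_full)^2)                          # variability in residuals
ss_total <- sum((mario_kart$price - mean(mario_kart$price))^2)  # variability in outcome
1 - (ss_resid / ss_total) * (n - 1) / (n - k - 1)               # adjusted R^2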
The adjusted \( R^2 \) value is important for model selection.
Discuss: What is model selection and why is it important?
Four techniques (a backward-elimination sketch follows the notes below):
- Backward elimination using the \( P \)-value approach
- Backward elimination using the adjusted \( R^2 \) approach
- Forward selection using the \( P \)-value approach
- Forward selection using the adjusted \( R^2 \) approach
Note: There is no guarantee that forward selection and backward elimination will lead to the same final model. In that case, choose the model with the highest adjusted \( R^2 \) value.
Note: When the sole goal is to improve prediction accuracy, use the adjusted \( R^2 \) technique. This is commonly the case in machine learning applications.
Note: When we care about understanding which variables are statistically significant predictors of the response, or if there is interest in producing a simpler model at the potential cost of a little prediction accuracy, then the \( P \)-value approach is preferred.
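A minimal sketch of backward elimination using the \( P \)-value approach, starting from mdl_full above: at each step, refit after dropping the predictor with the largest non-significant \( P \)-value.
# Step 1: duration has the largest P-value (0.888), so drop it and refit.
mdl_step1 <- lm(price ~ cond + stock_photo + wheels, data = mario_kart)
# Step 2: stock_photo is still not significant in the refit, so drop it too.
mdl_step2 <- lm(price ~ cond + wheels, data = mario_kart)
summary(mdl_step2)  # the remaining predictors are both significant
R's built-in step() automates a similar search, though it ranks models by AIC rather than \( P \)-values or adjusted \( R^2 \).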