Multiple Linear Regression

M. Drew LaMar
February 8, 2019

Class announcements

  • Reading Assignments
    • OpenStats, Chapter 8: Multiple and logistic regression (8.1-8.3)
    • Anderson, “Model Based Inference in the Life Sciences”
      • Appendix A: Likelihood Theory
      • Chapter 1: Science Hypotheses and Science Philosophy
  • Homework #3 and Lab #4 are up on Blackboard (due Monday, February 18, 11:59pm)

Introduction to Multiple Linear Regression

Definition: Multiple regression extends simple two-variable regression to the case that still has one response but many predictors (denoted \( x_1 \), \( x_2 \), \( x_3 \), …).

The method is motivated by scenarios where many variables may be simultaneously connected to an output.

Assumptions of Multiple Linear Regression

  1. the residuals of the model are nearly normal,
  2. the variability of the residuals is nearly constant,
  3. the residuals are independent, and
  4. each variable is linearly related to the outcome.
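
A minimal sketch of how these four conditions are commonly checked with residual plots. The data here are simulated purely for illustration (the names x1, x2, and y and all numeric values are assumptions, not the Mario Kart auctions), so the plots show roughly what a well-behaved model looks like.

```r
# Simulated data for illustration only (not the Mario Kart auctions)
set.seed(1)
n  <- 100
x1 <- runif(n, 0, 10)        # hypothetical numeric predictor
x2 <- rbinom(n, 1, 0.5)      # hypothetical 0/1 predictor
y  <- 2 + 1.5 * x1 + 3 * x2 + rnorm(n, sd = 2)

fit <- lm(y ~ x1 + x2)
res <- residuals(fit)

# 1. Nearly normal residuals: points should fall close to the reference line
qqnorm(res)
qqline(res)

# 2. Nearly constant variability: no fan shape against the fitted values
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# 3. Independence: no trend when residuals are plotted in order of collection
plot(seq_along(res), res, xlab = "Order of collection", ylab = "Residuals")

# 4. Linearity: no curvature in residuals against each numeric predictor
plot(x1, res, xlab = "x1", ylab = "Residuals")
```

With real data, the same calls can be applied to any fitted lm object; a clear pattern in any of these plots suggests the corresponding condition may not hold.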

Our Example

We're going to look at auction data for the Mario Kart Wii game.

The Data

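# Assumes marioKart comes from the openintro package
# (newer releases of the package may use different names)
library(openintro)
library(dplyr)

# Keep five variables, make "used" the reference level, drop extreme prices
mario_kart <- marioKart %>%
  dplyr::select(price = totalPr,
         cond,
         stock_photo = stockPhoto,
         duration, wheels) %>%
  mutate(cond = forcats::fct_relevel(cond, c("used", "new"))) %>%
  filter(price < 100)
str(mario_kart)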

'data.frame':   141 obs. of  5 variables:
 $ price      : num  51.5 37 45.5 44 71 ...
 $ cond       : Factor w/ 2 levels "used","new": 2 1 2 2 2 2 1 2 1 1 ...
 $ stock_photo: Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
 $ duration   : int  3 7 3 3 1 3 1 1 3 7 ...
 $ wheels     : int  1 1 1 1 2 0 0 2 1 1 ...

Price vs. condition

library(ggplot2)

# Jitter points horizontally so overlapping prices stay visible
mario_kart %>%
  ggplot(aes(x = cond, y = price)) +
  geom_point(position = position_jitter(width = 0.1))

[Figure: jittered scatter plot of price (y-axis) against condition (x-axis: used, new)]

Price vs. condition

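# Simple linear regression of price on condition ("used" is the reference level)
mdl_cond <- lm(price ~ cond, data = mario_kart)
summary(mdl_cond)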

Call:
lm(formula = price ~ cond, data = mario_kart)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.8911  -5.8311   0.1289   4.1289  22.1489 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   42.871      0.814  52.668  < 2e-16 ***
condnew       10.900      1.258   8.662 1.06e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.371 on 139 degrees of freedom
Multiple R-squared:  0.3506,    Adjusted R-squared:  0.3459 
F-statistic: 75.03 on 1 and 139 DF,  p-value: 1.056e-14
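
Reading the coefficient table as a fitted line may help. Because "used" is the reference level, cond_new below is an indicator equal to 1 for a new game and 0 for a used one (the symbol is a label introduced here for illustration):

\[ \widehat{\text{price}} = 42.871 + 10.900 \times \text{cond\_new} \]

The predicted price is therefore about 42.87 for a used game and 42.871 + 10.900 = 53.771, about 53.77, for a new one. The t value is the estimate divided by its standard error:

\[ t = \frac{10.900}{1.258} \approx 8.66 \]

This agrees with the reported 8.662 up to rounding of the displayed values.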

Multiple linear regression

A multiple regression model is a linear model with many predictors. In general, we write the model as \[ \hat{y} = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \cdots + \beta_{k}x_{k} \] when there are \( k \) predictors.
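
One way to see what the equation means in practice is to look at the columns R builds for each predictor. The sketch below uses a small hypothetical data frame (toy, with made-up values) rather than the auction data. It shows that a two-level factor such as cond enters the model as a 0/1 indicator column, and that the fitted values are the intercept plus each coefficient times its column.

```r
# Hypothetical mini data set, for illustration only
toy <- data.frame(
  price  = c(40, 52, 45, 60, 38, 55),
  cond   = factor(c("used", "new", "used", "new", "used", "new"),
                  levels = c("used", "new")),
  wheels = c(0, 1, 1, 2, 0, 2)
)

# Design matrix: a column of 1s for the intercept, a 0/1 indicator
# for cond == "new", and the numeric wheels column
X <- model.matrix(price ~ cond + wheels, data = toy)
print(X)

# Fit the model, then rebuild the fitted values term by term:
# y_hat = b0 * 1 + b1 * condnew + b2 * wheels
fit   <- lm(price ~ cond + wheels, data = toy)
y_hat <- drop(X %*% coef(fit))

# The hand-built values agree with what lm() stores
print(all.equal(unname(y_hat), unname(fitted(fit))))
```

The same idea extends to any number of predictors: each one contributes a column to the design matrix and a coefficient to the sum.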

All predictors included

mdl_full <- lm(price ~ cond + stock_photo + duration + wheels, data = mario_kart)
summary(mdl_full)

Call:
lm(formula = price ~ cond + stock_photo + duration + wheels, 
    data = mario_kart)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.3788  -2.9854  -0.9654   2.6915  14.0346 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    36.21097    1.51401  23.917  < 2e-16 ***
condnew         5.13056    1.05112   4.881 2.91e-06 ***
stock_photoyes  1.08031    1.05682   1.022    0.308    
duration       -0.02681    0.19041  -0.141    0.888    
wheels          7.28518    0.55469  13.134  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.901 on 136 degrees of freedom
Multiple R-squared:  0.719, Adjusted R-squared:  0.7108 
F-statistic: 87.01 on 4 and 136 DF,  p-value: < 2.2e-16
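
As a rough illustration of how to use the fitted model, the sketch below plugs the coefficients from the summary(mdl_full) output into the regression equation for one hypothetical auction (a new game with a stock photo, a 7-day listing, and 2 wheels; this auction is invented for the example). It also recomputes the adjusted R-squared from the reported R-squared, using the 141 observations and 4 predictors shown in the output.

```r
# Coefficients copied from the summary(mdl_full) output
b <- c("(Intercept)"  = 36.21097,
       condnew        = 5.13056,
       stock_photoyes = 1.08031,
       duration       = -0.02681,
       wheels         = 7.28518)

# Hypothetical auction: new game, stock photo, 7-day listing, 2 wheels
x <- c("(Intercept)"  = 1,
       condnew        = 1,
       stock_photoyes = 1,
       duration       = 7,
       wheels         = 2)

# Predicted price: intercept plus each coefficient times its predictor value
price_hat <- sum(b * x)
print(round(price_hat, 2))   # 56.8

# Adjusted R-squared from the reported (rounded) R-squared
n  <- 141     # auctions in mario_kart
k  <- 4       # predictors in mdl_full
r2 <- 0.719
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 4))      # 0.7107
```

The recomputed adjusted R-squared differs from the reported 0.7108 only in the last digit, because the R-squared used here is already rounded. With the model object available, predict(mdl_full, newdata = ...) would give the same kind of prediction without copying coefficients by hand.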