Regression: Not Only a Mystery Thriller

Olesya Volchenko and Anna Shirokanova

April 2018

What do we already know?

Why do we need regression?

Omitted variable effect (1)

Relationship                                                               Third variable
The larger a kid's foot size, the cleverer s/he is                         ?
The taller the person, the shorter their hair                              ?
The larger the school class, the better the average grades                 ?
People who use the Internet daily in Africa are happier                    ?
Ice-cream sales are positively related to the number of people drowning    ?

Omitted variable effect (2)

Relationship                                                               Third variable
The larger a kid's foot size, the cleverer s/he is                         Age
The taller the person, the shorter their hair                              Gender
The larger the school class, the better the average grades                 School size / equipment
People who use the Internet daily in Africa are happier                    Income
Ice-cream sales are positively related to the number of people drowning    Season

Regression modelling

Toy data example

There are two normally distributed variables, x and y, with n = 50 observations:

##           y          x
## -2.2769689 -0.9030778
##  0.2932664 -0.1464296
## -2.1536341 -1.2766415
##  3.3128843  0.9340912
##  6.2678134  2.3101434
##  2.9741792  1.2217376
##        y                  x           
##  Min.   :-8.43173   Min.   :-1.92319  
##  1st Qu.:-2.07282   1st Qu.:-0.68045  
##  Median :-0.03748   Median :-0.11262  
##  Mean   : 0.17194   Mean   : 0.06454  
##  3rd Qu.: 2.36191   3rd Qu.: 0.91913  
##  Max.   : 6.26781   Max.   : 2.31014
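Toy data like this can be simulated in a couple of lines. A minimal sketch, where the seed, slope, and noise level are our assumptions, chosen only to roughly match the summaries above:

set.seed(42)                        # assumption: any seed works
x <- rnorm(50)                      # predictor: standard normal, n = 50
y <- 2.8 * x + rnorm(50, sd = 1.1)  # outcome: linear in x plus random noise
head(cbind(y, x))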

Visualise their relationship

plot(x, y, pch = 16)

Use correlation to estimate the strength of their relationship

cor.test(x, y, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 18.604, df = 48, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8912640 0.9640148
## sample estimates:
##       cor 
## 0.9371278

Estimate the model

model1 <- lm(y ~ x)
summary(model1)

R Output

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9636 -0.4744 -0.0366  0.4992  3.8284 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01119    0.15885   -0.07    0.944    
## x            2.83743    0.15252   18.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.121 on 48 degrees of freedom
## Multiple R-squared:  0.8782, Adjusted R-squared:  0.8757 
## F-statistic: 346.1 on 1 and 48 DF,  p-value: < 2.2e-16

What are the conclusions?

Plot

plot(x, y, pch = 16)
abline(model1, col = "red")
text(0, 2, paste("y = ", round(model1$coefficients[1], 1), "+", round(model1$coefficients[2], 
    1), "* x"))
abline(v = 0, lty = 2, col = "gray")

Coefficient of determination R2
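R² is the share of variance in y that the model explains. As a sanity check, it can be computed by hand from model1:

rss <- sum(residuals(model1)^2)  # residual sum of squares
tss <- sum((y - mean(y))^2)      # total sum of squares
1 - rss / tss                    # reproduces Multiple R-squared: 0.8782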

Other ways to evaluate model fit:
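For example, information criteria and the residual standard error, shown here for model1:

AIC(model1)    # Akaike information criterion: lower means better fit
BIC(model1)    # Bayesian information criterion: stricter penalty for extra parameters
sigma(model1)  # residual standard error, 1.121 in the output above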

Several predictors

Dummy-variables

Example: salary of males and females

200 observations: 100 females and 100 males

##       age         salary       sex    
##  Min.   :14   Min.   : 40.60   F:100  
##  1st Qu.:26   1st Qu.: 85.44   M:100  
##  Median :30   Median : 94.70          
##  Mean   :30   Mean   : 94.98          
##  3rd Qu.:33   3rd Qu.:105.08          
##  Max.   :43   Max.   :133.30
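A sketch of how such data could be simulated; the seed and the exact coefficients are assumptions, picked to mirror the estimates of model2 below:

set.seed(1)                                  # assumption: any seed works
age <- round(rnorm(200, mean = 30, sd = 5))  # ages centred on 30
sex <- factor(rep(c("F", "M"), each = 100))  # 100 females, 100 males
salary <- 1 + 3 * age + 10 * (sex == "M") + rnorm(200, sd = 1)
genderdata <- data.frame(age, salary, sex)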

Example: exploratory plots
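A sketch of typical exploratory plots for these data, using base graphics as elsewhere in these slides:

par(mfrow = c(1, 2))
boxplot(salary ~ sex, data = genderdata)           # do salaries differ by sex?
plot(genderdata$age, genderdata$salary, pch = 16)  # is salary roughly linear in age?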

Example: the model

model2 <- lm(salary ~ age + sex, data = genderdata)
sjt.lm(model2, show.ci = F)
                 salary
                 B             p
(Intercept)      1.00          .080
age              2.97          <.001
sex (M)          10.08         <.001
Observations     200
R2 / adj. R2     .993 / .993

Example: plot
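A sketch of the model plot: points coloured by sex, plus the two parallel lines implied by model2 (the colours are our choice):

b <- coef(model2)
plot(genderdata$age, genderdata$salary, pch = 16,
     col = ifelse(genderdata$sex == "M", "blue", "red"))
abline(a = b["(Intercept)"], b = b["age"], col = "red")              # females (reference group)
abline(a = b["(Intercept)"] + b["sexM"], b = b["age"], col = "blue") # males: same slope, shifted up by 10.08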

What if there are more than 2 categories? (e.g. educational attainment)

##        educ dummy1 dummy2
## 1  tertiary      0      0
## 2 secondary      1      0
## 3   primary      0      1
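In R you rarely build these dummies by hand: passing a factor to lm() creates them automatically. A sketch with model.matrix(), releveling so that tertiary is the reference category, as in the table above:

educ <- relevel(factor(c("tertiary", "secondary", "primary")), ref = "tertiary")
model.matrix(~ educ)  # one 0/1 column per non-reference level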

Interaction effect

Model Comparison

model3 <- lm(salary ~ age + sex, data = genderdata2)
model4 <- lm(salary ~ age * sex, data = genderdata2)
sjt.lm(model3, model4, show.ci = F)
                 salary                     salary
                 B           p              B            p
(Intercept)      4.11        .565           99.07        <.001
age              4.19        <.001          1.07         <.001
sexM             77.86       <.001          -103.23      <.001
age:sexM                                    6.04         <.001
Observations     200                        200
R2 / adj. R2     .884 / .883                .986 / .986

What the Interaction Effect Model Looks Like

What happens if we ignore the interaction effect?

Can we identify a possible interaction effect visually?

Sometimes yes. Heteroskedasticity may indicate a moderation/interaction effect. Let's recall the assumptions first.

Linear Regression Assumptions

- Linearity: the outcome is a linear function of the predictors
- Independence of the observations
- Homoskedasticity: the residuals have constant variance
- Normality of the residuals

Homoskedasticity/Heteroskedasticity

x <- 1:100
y1 <- rnorm(n = 100, mean = x, sd = 10)
y2 <- rnorm(n = 100, mean = x, sd = 0.4*x)
par(mfrow = c(1, 2))
plot(x, y1, pch = 16); abline(lm(y1 ~ x), col = "red")
plot(x, y2, pch = 16); abline(lm(y2 ~ x), col = "red")

The right-hand panel, where the data points 'fan out', suggests that a third variable comes into play at high values of x

Anscombe’s Quartet

Feature                        Value
Mean of x                      9.0
Variance of x                  10.0
Mean of y                      7.5
Variance of y                  3.75
Correlation between x and y    0.816
Fitted regression line         y = 3 + 0.5x
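R ships with these data as anscombe (columns x1-x4 and y1-y4), so the quartet is easy to verify yourself:

data(anscombe)
sapply(anscombe, mean)         # every x averages 9, every y 7.5
cor(anscombe$x1, anscombe$y1)  # 0.816; (almost) identical for the other three pairs
par(mfrow = c(2, 2))           # yet the four scatterplots look nothing alike
for (i in 1:4) plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]], pch = 16)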

Another example

This example originates from the book by Jeremy Miles and Mark Shevlin (2001)1.

Outcome distribution
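The Shapiro-Wilk normality test is a single call; this is the command behind the output below:

shapiro.test(df$grade)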

## 
##  Shapiro-Wilk normality test
## 
## data:  df$grade
## W = 0.93009, p-value = 0.01622

Models: Standardised Coefficients

model3 = lm(grade ~ books + attend, data = df)
summary(model3)
## 
## Call:
## lm(formula = grade ~ books + attend, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.802 -13.374   0.060   9.173  32.295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   37.379      7.745   4.827 2.41e-05 ***
## books          4.037      1.753   2.303   0.0270 *  
## attend         1.284      0.587   2.187   0.0352 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.05 on 37 degrees of freedom
## Multiple R-squared:  0.3287, Adjusted R-squared:  0.2924 
## F-statistic: 9.059 on 2 and 37 DF,  p-value: 0.0006278
model3scaled <- lm(grade ~ scale(as.numeric(books)) + 
                     scale(as.numeric(attend)), data = df)
summary(model3scaled)
## 
## Call:
## lm(formula = grade ~ scale(as.numeric(books)) + scale(as.numeric(attend)), 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.802 -13.374   0.060   9.173  32.295 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 63.550      2.222  28.602   <2e-16 ***
## scale(as.numeric(books))     5.782      2.511   2.303   0.0270 *  
## scale(as.numeric(attend))    5.490      2.511   2.187   0.0352 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.05 on 37 degrees of freedom
## Multiple R-squared:  0.3287, Adjusted R-squared:  0.2924 
## F-statistic: 9.059 on 2 and 37 DF,  p-value: 0.0006278

The standardised effect of books (5.78) is slightly larger than that of attend (5.49).
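Only the predictors were standardised here, not the outcome, so each standardised slope is simply the raw slope multiplied by the predictor's standard deviation. A quick check (a sketch, assuming books and attend are coercible to numeric as in the model call):

coef(model3)["books"] * sd(as.numeric(df$books))    # reproduces 5.782
coef(model3)["attend"] * sd(as.numeric(df$attend))  # reproduces 5.490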

Interaction effect model (1)

model5 = lm(grade ~ attend * books, data = df)
summary(model5)
## 
## Call:
## lm(formula = grade ~ attend * books, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.844 -11.850   1.466   8.163  34.198 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   55.2213    11.2621   4.903 2.02e-05 ***
## attend        -0.1368     0.8782  -0.156   0.8771    
## books         -6.2078     5.1508  -1.205   0.2360    
## attend:books   0.7349     0.3494   2.104   0.0425 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.44 on 36 degrees of freedom
## Multiple R-squared:  0.4022, Adjusted R-squared:  0.3524 
## F-statistic: 8.073 on 3 and 36 DF,  p-value: 0.0003058
anova(model3, model5)
## Analysis of Variance Table
## 
## Model 1: grade ~ books + attend
## Model 2: grade ~ attend * books
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     37 7306.2                              
## 2     36 6506.4  1    799.78 4.4252 0.04246 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction (multiplicative) model explains significantly more variance than the additive model (F = 4.43, p = .042).
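With the interaction, the effect of books is no longer one number: it equals b_books + b_attend:books * attend. A sketch evaluating it at a few attendance levels:

b <- coef(model5)
effect_books <- function(attend) b["books"] + b["attend:books"] * attend
effect_books(c(5, 15, 25))  # about -2.5, 4.8, and 12.2: negative at low attendance, positive at high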

Interaction effect model (2)

model6 = lm(grade ~ lucky * books, data = df)
summary(model6)
## 
## Call:
## lm(formula = grade ~ lucky * books, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.417 -10.952  -1.455  10.156  37.300 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  50.92308    5.76906   8.827 2.57e-10 ***
## lucky1        3.76257    8.78658   0.428   0.6712    
## lucky2        0.07692   11.83961   0.006   0.9949    
## books         6.15385    2.42349   2.539   0.0159 *  
## lucky1:books -3.64642    3.71632  -0.981   0.3334    
## lucky2:books  2.55449    4.55189   0.561   0.5783    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.62 on 34 degrees of freedom
## Multiple R-squared:  0.3322, Adjusted R-squared:  0.2339 
## F-statistic: 3.382 on 5 and 34 DF,  p-value: 0.01381

Luck has no effect on the final grade whatsoever: neither the lucky dummies nor their interactions with books reach significance

Interaction effect visualisation (1)

library(sjPlot)
sjp.int(model5, type = "eff")

sjp.int(model5, type = "eff", show.ci = T)

Interaction effect visualisation (2)

library(rockchalk)
x = plotSlopes(model5, plotx = "books", 
           modx = "attend", interval = "conf")

plot(testSlopes(x), shade = T)
## Values of attend OUTSIDE this interval:
##         lo         hi 
## -141.07291   13.13872 
## cause the slope of (b1 + b2*attend)books to be statistically significant

x = plotSlopes(model5, plotx = "attend", 
           modx = "books", interval = "conf")

plot(testSlopes(x), shade = T)
## Values of books OUTSIDE this interval:
##         lo         hi 
## -47.419893   1.746103 
## cause the slope of (b1 + b2*books)attend to be statistically significant

The simple slope of attend becomes significant once the number of books exceeds about 2, and the slope of books becomes significant once attendance exceeds about 13 classes.

Visualisation (3)

# sjp.int is being replaced by plot_model, so please learn to use the new function:
sjPlot::plot_model(model5, type = "eff", terms = "books")

sjPlot::plot_model(model5, type = "int", show.ci = T, mdrt.values = "all")

# by default, plot_model takes the variable with the smaller range as the moderator, here books

sjPlot::plot_model(model5, type = "int", show.ci = T, terms = "books", mdrt.values = "all")

sjPlot::plot_model(model5, type = "int", show.ci = T, mdrt.values = "meansd")

All models in one table

sjt.lm(model3, model5, model6, show.ci = F, p.numeric = F)
                 grade           grade           grade
                 B               B               B
(Intercept)      37.38 ***       55.22 ***       50.92 ***
books            4.04 *          -6.21           6.15 *
attend           1.28 *          -0.14
attend:books                     0.73 *
lucky1                                           3.76
lucky2                                           0.08
lucky1:books                                     -3.65
lucky2:books                                     2.55
Observations     40              40              40
R2 / adj. R2     .329 / .292     .402 / .352     .332 / .234
Notes            * p<.05   ** p<.01   *** p<.001

Checklist for interaction effect regressions

(Brambor et al. 2005)2, in brief:

- Include all constitutive terms of the interaction in the model
- Do not interpret the constitutive terms as unconditional marginal effects
- Calculate substantively meaningful marginal effects and their standard errors across the range of the moderator

For interpretation, see the link

Summary: nuts and bolts of regression modelling

What’s next?

Reading for the seminar: Chapter 14

Monday’s lab is here


  1. Miles, J., & Shevlin, M. (2001). Applying regression and correlation: A guide for students and researchers. Sage.

  2. Brambor, T., Clark, W. R., & Golder, M. (2005). Understanding interaction models: Improving empirical analyses. Political Analysis, 14(1), 63-82.