Linear Regression with Interaction Effects

Anna Shirokanova, Olesya Volchenko

May 13, 2020

Complete the sentence with the most suitable word:

“Younger people consume more alcohol on weekdays if they are …”

This seminar:

Interaction effects:

Exercise 1. Categorical by continuous

Exercise 2. Categorical by categorical

Exercise 3. Continuous by continuous

Example from the European Social Survey data, Wave 7

Variables involved:

Our research question is: “What predicts weekday alcohol consumption in Europe?”

Warning! This is an exercise to practise fitting and interpreting interaction effect models. It is far from the real research model testing with all of its literature-based hypotheses and care for model quality and reliability.

Get the data:

Download the ESS 7 integrated file, 2.2: http://www.europeansocialsurvey.org/data/download.html?r=7

library(foreign)
ess7 <-  read.spss(choose.files(), to.data.frame = T, use.value.labels = T)
ess7$alcwkdy <- as.numeric(as.character(ess7$alcwkdy))
ess7$eduyrs <- as.numeric(as.character(ess7$eduyrs))
ess7$agea <- as.numeric(as.character(ess7$agea))

Descriptive statistics:

library(dplyr)
ess <- select(ess7, agea, alcwkdy, dshltgp, eduyrs, gndr)
library(psych)
describe(ess)
##          vars     n  mean    sd median trimmed   mad min  max range  skew
## agea        1 40086 49.28 18.74     49   49.15 22.24  14  114   100  0.04
## alcwkdy     2 30064 27.39 44.31     15   19.06 22.24   0 2189  2189  8.95
## dshltgp*    3 40185  1.75  0.44      2    1.81  0.00   1    2     1 -1.13
## eduyrs      4 39828 12.90  3.94     12   12.86  2.97   0   50    50  0.27
## gndr*       5 40163  1.53  0.50      2    1.54  0.00   1    2     1 -0.12
##          kurtosis   se
## agea        -0.93 0.09
## alcwkdy    246.20 0.26
## dshltgp*    -0.73 0.00
## eduyrs       1.95 0.02
## gndr*       -1.99 0.00
library(sjPlot)
sjPlot::sjt.corr(ess[,c(1,2,4)], corr.method = "spearman")
  agea alcwkdy eduyrs
agea   -0.004 -0.218***
alcwkdy -0.004   0.023***
eduyrs -0.218*** 0.023***  
Computed correlation used spearman-method with listwise-deletion.
ess <- na.omit(ess) # 29879 full obs.
library(ggplot2)
ggplot(ess, aes(x = alcwkdy)) +
  geom_density()

The distribution of the outcome is extremely skewed to the right, which is likely to skew the model residuals as well.

In this exercise, we will ignore this fact and treat the variable “as is”, but in your research you could consider log-transforming this variable or filtering the sample prior to analysis.

Exercise 1. Gender and education

m1 <- lm(alcwkdy ~ gndr, data = ess)
m2 <- lm(alcwkdy ~ gndr + eduyrs, ess)
m3 <- lm(alcwkdy ~ gndr * eduyrs, ess)
anova(m1, m2)
## Analysis of Variance Table
## 
## Model 1: alcwkdy ~ gndr
## Model 2: alcwkdy ~ gndr + eduyrs
##   Res.Df      RSS Df Sum of Sq      F Pr(>F)
## 1  29830 57030774                           
## 2  29829 57028278  1    2495.6 1.3054 0.2532

Model 2 is no better than Model 1.

(You can revise how to use anova() for model comparison here: https://bookdown.org/ndphillips/YaRrr/comparing-regression-models-with-anova.html)

anova(m2, m3)
## Analysis of Variance Table
## 
## Model 1: alcwkdy ~ gndr + eduyrs
## Model 2: alcwkdy ~ gndr * eduyrs
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1  29829 57028278                                  
## 2  29828 57001073  1     27206 14.236 0.0001615 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 3 is better than Model 2.

anova(m1, m2, m3)
## Analysis of Variance Table
## 
## Model 1: alcwkdy ~ gndr
## Model 2: alcwkdy ~ gndr + eduyrs
## Model 3: alcwkdy ~ gndr * eduyrs
##   Res.Df      RSS Df Sum of Sq       F    Pr(>F)    
## 1  29830 57030774                                   
## 2  29829 57028278  1    2495.6  1.3059 0.2531412    
## 3  29828 57001073  1   27205.6 14.2364 0.0001615 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing all three models, Model 3 (interaction model) is significantly better than Model 2 (additive).

Conclusion: Model 3 is the best one. Let’s put the models next to each other to compare the coefficients, and then interpret Model 3.

library(sjPlot)
tab_model(m1, m2, m3)
  alcwkdy alcwkdy alcwkdy
Predictors Estimates CI p Estimates CI p Estimates CI p
(Intercept) 34.39 33.70 – 35.09 <0.001 35.39 33.54 – 37.23 <0.001 38.54 36.07 – 41.00 <0.001
gndr [Female] -14.38 -15.37 – -13.39 <0.001 -14.35 -15.35 – -13.36 <0.001 -21.02 -24.62 – -17.42 <0.001
eduyrs -0.08 -0.21 – 0.05 0.253 -0.32 -0.50 – -0.14 0.001
gndr [Female] * eduyrs 0.50 0.24 – 0.76 <0.001
Observations 29832 29832 29832
R2 / R2 adjusted 0.026 / 0.026 0.026 / 0.026 0.027 / 0.027

Interpretation: Education is not related to the outcome by itself, but taking into account the education, better predictions can be made for gender.

#library(sjPlot)
plot_model(m3, type = "int")

Where to look:

  1. What are all the data points in the plot?
  2. Who scores higher, males or females, when a) education years = 0 and b) = 50?
  3. Do confidence intervals for points of the same colour overlap?

What changes if you swap their places in the model?

m1a <- lm(alcwkdy ~ eduyrs, data = ess)
m2a <- lm(alcwkdy ~ eduyrs + gndr, ess)
m3a <- lm(alcwkdy ~ eduyrs * gndr, ess)
anova(m1a, m2a, m3a)
## Analysis of Variance Table
## 
## Model 1: alcwkdy ~ eduyrs
## Model 2: alcwkdy ~ eduyrs + gndr
## Model 3: alcwkdy ~ eduyrs * gndr
##   Res.Df      RSS Df Sum of Sq       F    Pr(>F)    
## 1  29830 58561625                                   
## 2  29829 57028278  1   1533347 802.383 < 2.2e-16 ***
## 3  29828 57001073  1     27206  14.236 0.0001615 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 3 is statistically better than Model 1, while Model 3 is better than Model 2.

Statistically speaking, this model is identical to model m3.

library(sjPlot)
tab_model(m1a, m2a, m3a, 
          dv.labels = "Weekday Alcohol, grams", 
          pred.labels = c("Intercept", "Years of education", "Female", "Education:Female"))
  Weekday Alcohol, grams
Predictors Estimates CI p Estimates CI p Estimates CI p
Intercept 29.43 27.61 – 31.25 <0.001 35.39 33.54 – 37.23 <0.001 38.54 36.07 – 41.00 <0.001
Years of education -0.16 -0.29 – -0.02 0.020 -0.08 -0.21 – 0.05 0.253 -0.32 -0.50 – -0.14 0.001
Female -14.35 -15.35 – -13.36 <0.001 -21.02 -24.62 – -17.42 <0.001
Education:Female 0.50 0.24 – 0.76 <0.001
Observations 29832 29832 29832
R2 / R2 adjusted 0.000 / 0.000 0.026 / 0.026 0.027 / 0.027
plot_model(m3a, type = "int", 
           title = "Predicted values of Weekday Alcohol, grams", 
           legend.title = "Gender", 
           axis.title = c("Years of education", "Grams of alcohol on weekdays"))

Where to look:

  1. Do more educated males score higher than less educated males? Do more educated females score higher than less educated females?
  2. Do the confidence intervals of the two lines overlap? If yes, at what number of education years?

How does the coefficient for gender change when education is taked into account?

We can visualise the changes in the conditional coefficient with the function interplot:

“The function visualizes the changes in the coefficient of one variable in a two-way interaction term conditional on the value of the other included variable. The plot also includes simulated 95% confidential intervals of these coefficients.” https://cran.r-project.org/web/packages/interplot/vignettes/interplot-vignette.html

library(interplot)
interplot(m3a, var1 = "gndr", var2 = "eduyrs") + 
  xlab('Years of Education') +
  ylab('Estimated Coefficient for Gender (female)') +
  geom_hline(yintercept = 0, linetype = "dashed")

The Y-axis demonstrates the marginal effect for all the education years on weekday alcohol consumption for females.

The caption in the bottom right corner shows the confidence intervals of the difference between the conditioned effects of the gender at the minimum and maximum values of education years

Interpretation: Females are predicted to score less than males on the outcome (alcohol consumption), but the higher their education, the smaller the difference from males. When education is 33 years or higher, there is no more statistical difference in the outcome for males and females.

Extra

How can we get the marginal effect for males?

m3a <- lm(alcwkdy ~ eduyrs * relevel(gndr, ref = "Female"), ess)
summary(m3a)
interplot(m3a, var1 = "gndr", var2 = "eduyrs") + 
  xlab('Years of Education') +
  ylab('Estimated Coefficient for Gender')

“Error: unexpected symbol in”interplot(m3a, var1 = “relevel(gndr, ref =”Female"

ess$gndr1 <- relevel(ess$gndr, ref = "Female")
m3a <- lm(alcwkdy ~ eduyrs * gndr1, ess)
summary(m3a)
## 
## Call:
## lm(formula = alcwkdy ~ eduyrs * gndr1, data = ess)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -38.54  -20.18   -9.55    5.08 2155.19 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      17.51963    1.33945  13.080  < 2e-16 ***
## eduyrs            0.18491    0.09569   1.932 0.053310 .  
## gndr1Male        21.01817    1.83725  11.440  < 2e-16 ***
## eduyrs:gndr1Male -0.50019    0.13257  -3.773 0.000162 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.71 on 29828 degrees of freedom
## Multiple R-squared:  0.02682,    Adjusted R-squared:  0.02673 
## F-statistic: 274.1 on 3 and 29828 DF,  p-value: < 2.2e-16
interplot(m3a, var1 = "gndr1", var2 = "eduyrs") + 
  xlab('Years of Education') +
  ylab('Estimated Coefficient for Gender')

The Y-axis demonstrates the marginal (pure) effect for all the education years on weekday alcohol consumption for males.

Exercise 2. Gender by Visit to General Practitioner

m11 <- lm(alcwkdy ~ gndr, ess)
m12 <- lm(alcwkdy ~ gndr + dshltgp, ess)
m13 <- lm(alcwkdy ~ gndr * dshltgp, ess)
anova(m11, m12, m13)
## Analysis of Variance Table
## 
## Model 1: alcwkdy ~ gndr
## Model 2: alcwkdy ~ gndr + dshltgp
## Model 3: alcwkdy ~ gndr * dshltgp
##   Res.Df      RSS Df Sum of Sq       F    Pr(>F)    
## 1  29830 57030774                                   
## 2  29829 56961833  1     68941 36.1071 1.889e-09 ***
## 3  29828 56951855  1      9978  5.2259   0.02226 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion: The model with interaction is better, let’s describe it.

summary(m13)
## 
## Call:
## lm(formula = alcwkdy ~ gndr * dshltgp, data = ess)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -37.68  -19.57   -9.57    4.99 2151.32 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               37.6761     0.6512  57.853  < 2e-16 ***
## gndrFemale               -16.1094     1.0090 -15.966  < 2e-16 ***
## dshltgpMarked             -4.6632     0.7760  -6.009 1.88e-09 ***
## gndrFemale:dshltgpMarked   2.6695     1.1678   2.286   0.0223 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.7 on 29828 degrees of freedom
## Multiple R-squared:  0.02766,    Adjusted R-squared:  0.02757 
## F-statistic: 282.9 on 3 and 29828 DF,  p-value: < 2.2e-16

Interpretation: (R-squared, the intercept, regression coefficients)

plot_model(m13, type = "int",
           title = "Predicted values of Weekday Alcohol, grams", 
           legend.title = "Discussed health with GP last year", 
           axis.title = c("Years of education", "Grams of alcohol on weekdays"))

Where to look:

  1. What are all the data points in the plot?
  2. Who scores higher, males or females, when a) contacted a GP and b) did not ?
  3. Do confidence intervals for points of the same colour overlap?

Exercise 3. Education by Age

m21 <- lm(alcwkdy ~ eduyrs, ess)
m22 <- lm(alcwkdy ~ eduyrs + agea, ess) 
m23 <- lm(alcwkdy ~ eduyrs * agea, ess)
anova(m21, m22, m23)
## Analysis of Variance Table
## 
## Model 1: alcwkdy ~ eduyrs
## Model 2: alcwkdy ~ eduyrs + agea
## Model 3: alcwkdy ~ eduyrs * agea
##   Res.Df      RSS Df Sum of Sq      F  Pr(>F)    
## 1  29830 58561625                                
## 2  29829 58406177  1    155448 79.404 < 2e-16 ***
## 3  29828 58393830  1     12347  6.307 0.01203 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 2 is better than Model 1, while Model 3 has a better fit than Model 2.

tab_model(m21, m22, m23,dv.labels = "Weekday Alcohol, grams")
  Weekday Alcohol, grams
Predictors Estimates CI p Estimates CI p Estimates CI p
(Intercept) 29.43 27.61 – 31.25 <0.001 37.39 34.87 – 39.92 <0.001 43.89 38.23 – 49.56 <0.001
eduyrs -0.16 -0.29 – -0.02 0.020 -0.28 -0.41 – -0.15 <0.001 -0.78 -1.20 – -0.37 <0.001
agea -0.13 -0.16 – -0.10 <0.001 -0.25 -0.35 – -0.15 <0.001
eduyrs * agea 0.01 0.00 – 0.02 0.012
Observations 29832 29832 29832
R2 / R2 adjusted 0.000 / 0.000 0.003 / 0.003 0.003 / 0.003

Here, the sample is huge, so that even a tiny relationship can get statistically significant. Be mindful to see the difference between statistical significant and substantive significance for your research.

Let’s proceed to plotting this interaction:

plot_model(m23, type = "int", mdrt.values = "minmax")

By default, plot_model will use the minimal and maximal values for plotting.

Another popular option is to plot the mean value, mean+1 SD, and mean-1SD:

plot_model(m23, type = "int", mdrt.values = "meansd")

You can see now the linear difference between the ages and the education level (~16 years of education) at which the interaction becomes not significant.

Finally, the following marginal effect plots will show the change in conditional coefficients for both continuous predictors in the model

library(interplot)
p1 <- interplot(m23, var1 = "eduyrs", var2 = "agea") + 
  xlab('age') +
  ylab('Estimated Coefficient for Years of Education') +
  geom_hline(yintercept = 0, linetype = "dashed")
p2 <- interplot(m23, var1 = "agea", var2 = "eduyrs") + 
  xlab('education years') +
  ylab('Estimated Coefficient for Age') +
  geom_hline(yintercept = 0, linetype = "dashed")
library(gridExtra)
grid.arrange(p1, p2, nrow = 1)

The end.