Anna Shirokanova, Olesya Volchenko
May 13, 2020
“Younger people consume more alcohol on weekdays if they are …”
Interaction effects:
Exercise 1. Categorical by continuous
Exercise 2. Categorical by categorical
Exercise 3. Continuous by continuous
Variables involved:
alcwdy : Grams alcohol, last time drinking on a weekday, Monday to Thursday
dshltgp : Discussed health, last 12 months, with a general practitioner “In the last 12 months, that is since [THIS MONTH, LAST YEAR], with which of the health professionals on this card have you discussed your health? General Practitioner”
eduyrs : Years of full-time education completed “About how many years of education have you completed, whether full-time or part-time? Please report these in full-time equivalents and include compulsory years of schooling.”
agea : Age, years
gndr : Gender, levels “Male”, “Female”
Warning! This is an exercise to practise fitting and interpreting interaction effect models. It is far from the real research model testing with all of its literature-based hypotheses and care for model quality and reliability.
Download the ESS 7 integrated file, 2.2: http://www.europeansocialsurvey.org/data/download.html?r=7
library(foreign)
ess7 <- read.spss(choose.files(), to.data.frame = T, use.value.labels = T)
ess7$alcwkdy <- as.numeric(as.character(ess7$alcwkdy))
ess7$eduyrs <- as.numeric(as.character(ess7$eduyrs))
ess7$agea <- as.numeric(as.character(ess7$agea))Descriptive statistics:
library(dplyr)
ess <- select(ess7, agea, alcwkdy, dshltgp, eduyrs, gndr)
library(psych)
describe(ess)## vars n mean sd median trimmed mad min max range skew
## agea 1 40086 49.28 18.74 49 49.15 22.24 14 114 100 0.04
## alcwkdy 2 30064 27.39 44.31 15 19.06 22.24 0 2189 2189 8.95
## dshltgp* 3 40185 1.75 0.44 2 1.81 0.00 1 2 1 -1.13
## eduyrs 4 39828 12.90 3.94 12 12.86 2.97 0 50 50 0.27
## gndr* 5 40163 1.53 0.50 2 1.54 0.00 1 2 1 -0.12
## kurtosis se
## agea -0.93 0.09
## alcwkdy 246.20 0.26
## dshltgp* -0.73 0.00
## eduyrs 1.95 0.02
## gndr* -1.99 0.00
| agea | alcwkdy | eduyrs | |
|---|---|---|---|
| agea | -0.004 | -0.218*** | |
| alcwkdy | -0.004 | 0.023*** | |
| eduyrs | -0.218*** | 0.023*** | |
| Computed correlation used spearman-method with listwise-deletion. | |||
ess <- na.omit(ess) # 29879 full obs.
library(ggplot2)
ggplot(ess, aes(x = alcwkdy)) +
geom_density()The distribution of the outcome is extremely skewed to the right, which is likely to skew the model residuals as well.
In this exercise, we will ignore this fact and treat the variable “as is”, but in your research you could consider log-transforming this variable or filtering the sample prior to analysis.
m1 <- lm(alcwkdy ~ gndr, data = ess)
m2 <- lm(alcwkdy ~ gndr + eduyrs, ess)
m3 <- lm(alcwkdy ~ gndr * eduyrs, ess)
anova(m1, m2)## Analysis of Variance Table
##
## Model 1: alcwkdy ~ gndr
## Model 2: alcwkdy ~ gndr + eduyrs
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29830 57030774
## 2 29829 57028278 1 2495.6 1.3054 0.2532
Model 2 is no better than Model 1.
(You can revise how to use anova() for model comparison here: https://bookdown.org/ndphillips/YaRrr/comparing-regression-models-with-anova.html)
## Analysis of Variance Table
##
## Model 1: alcwkdy ~ gndr + eduyrs
## Model 2: alcwkdy ~ gndr * eduyrs
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29829 57028278
## 2 29828 57001073 1 27206 14.236 0.0001615 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model 3 is better than Model 2.
## Analysis of Variance Table
##
## Model 1: alcwkdy ~ gndr
## Model 2: alcwkdy ~ gndr + eduyrs
## Model 3: alcwkdy ~ gndr * eduyrs
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29830 57030774
## 2 29829 57028278 1 2495.6 1.3059 0.2531412
## 3 29828 57001073 1 27205.6 14.2364 0.0001615 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comparing all three models, Model 3 (interaction model) is significantly better than Model 2 (additive).
Conclusion: Model 3 is the best one. Let’s put the models next to each other to compare the coefficients, and then interpret Model 3.
| alcwkdy | alcwkdy | alcwkdy | |||||||
|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 34.39 | 33.70 – 35.09 | <0.001 | 35.39 | 33.54 – 37.23 | <0.001 | 38.54 | 36.07 – 41.00 | <0.001 |
| gndr [Female] | -14.38 | -15.37 – -13.39 | <0.001 | -14.35 | -15.35 – -13.36 | <0.001 | -21.02 | -24.62 – -17.42 | <0.001 |
| eduyrs | -0.08 | -0.21 – 0.05 | 0.253 | -0.32 | -0.50 – -0.14 | 0.001 | |||
| gndr [Female] * eduyrs | 0.50 | 0.24 – 0.76 | <0.001 | ||||||
| Observations | 29832 | 29832 | 29832 | ||||||
| R2 / R2 adjusted | 0.026 / 0.026 | 0.026 / 0.026 | 0.027 / 0.027 | ||||||
Interpretation: Education is not related to the outcome by itself, but taking into account the education, better predictions can be made for gender.
Where to look:
m1a <- lm(alcwkdy ~ eduyrs, data = ess)
m2a <- lm(alcwkdy ~ eduyrs + gndr, ess)
m3a <- lm(alcwkdy ~ eduyrs * gndr, ess)
anova(m1a, m2a, m3a)## Analysis of Variance Table
##
## Model 1: alcwkdy ~ eduyrs
## Model 2: alcwkdy ~ eduyrs + gndr
## Model 3: alcwkdy ~ eduyrs * gndr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29830 58561625
## 2 29829 57028278 1 1533347 802.383 < 2.2e-16 ***
## 3 29828 57001073 1 27206 14.236 0.0001615 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model 3 is statistically better than Model 1, while Model 3 is better than Model 2.
Statistically speaking, this model is identical to model m3.
library(sjPlot)
tab_model(m1a, m2a, m3a,
dv.labels = "Weekday Alcohol, grams",
pred.labels = c("Intercept", "Years of education", "Female", "Education:Female"))| Weekday Alcohol, grams | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p |
| Intercept | 29.43 | 27.61 – 31.25 | <0.001 | 35.39 | 33.54 – 37.23 | <0.001 | 38.54 | 36.07 – 41.00 | <0.001 |
| Years of education | -0.16 | -0.29 – -0.02 | 0.020 | -0.08 | -0.21 – 0.05 | 0.253 | -0.32 | -0.50 – -0.14 | 0.001 |
| Female | -14.35 | -15.35 – -13.36 | <0.001 | -21.02 | -24.62 – -17.42 | <0.001 | |||
| Education:Female | 0.50 | 0.24 – 0.76 | <0.001 | ||||||
| Observations | 29832 | 29832 | 29832 | ||||||
| R2 / R2 adjusted | 0.000 / 0.000 | 0.026 / 0.026 | 0.027 / 0.027 | ||||||
plot_model(m3a, type = "int",
title = "Predicted values of Weekday Alcohol, grams",
legend.title = "Gender",
axis.title = c("Years of education", "Grams of alcohol on weekdays"))Where to look:
How does the coefficient for gender change when education is taked into account?
We can visualise the changes in the conditional coefficient with the function interplot:
“The function visualizes the changes in the coefficient of one variable in a two-way interaction term conditional on the value of the other included variable. The plot also includes simulated 95% confidential intervals of these coefficients.” https://cran.r-project.org/web/packages/interplot/vignettes/interplot-vignette.html
library(interplot)
interplot(m3a, var1 = "gndr", var2 = "eduyrs") +
xlab('Years of Education') +
ylab('Estimated Coefficient for Gender (female)') +
geom_hline(yintercept = 0, linetype = "dashed")The Y-axis demonstrates the marginal effect for all the education years on weekday alcohol consumption for females.
The caption in the bottom right corner shows the confidence intervals of the difference between the conditioned effects of the gender at the minimum and maximum values of education years
Interpretation: Females are predicted to score less than males on the outcome (alcohol consumption), but the higher their education, the smaller the difference from males. When education is 33 years or higher, there is no more statistical difference in the outcome for males and females.
Extra
How can we get the marginal effect for males?
m3a <- lm(alcwkdy ~ eduyrs * relevel(gndr, ref = "Female"), ess)
summary(m3a)
interplot(m3a, var1 = "gndr", var2 = "eduyrs") +
xlab('Years of Education') +
ylab('Estimated Coefficient for Gender')“Error: unexpected symbol in”interplot(m3a, var1 = “relevel(gndr, ref =”Female"
ess$gndr1 <- relevel(ess$gndr, ref = "Female")
m3a <- lm(alcwkdy ~ eduyrs * gndr1, ess)
summary(m3a)##
## Call:
## lm(formula = alcwkdy ~ eduyrs * gndr1, data = ess)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.54 -20.18 -9.55 5.08 2155.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.51963 1.33945 13.080 < 2e-16 ***
## eduyrs 0.18491 0.09569 1.932 0.053310 .
## gndr1Male 21.01817 1.83725 11.440 < 2e-16 ***
## eduyrs:gndr1Male -0.50019 0.13257 -3.773 0.000162 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.71 on 29828 degrees of freedom
## Multiple R-squared: 0.02682, Adjusted R-squared: 0.02673
## F-statistic: 274.1 on 3 and 29828 DF, p-value: < 2.2e-16
interplot(m3a, var1 = "gndr1", var2 = "eduyrs") +
xlab('Years of Education') +
ylab('Estimated Coefficient for Gender')The Y-axis demonstrates the marginal (pure) effect for all the education years on weekday alcohol consumption for males.
m11 <- lm(alcwkdy ~ gndr, ess)
m12 <- lm(alcwkdy ~ gndr + dshltgp, ess)
m13 <- lm(alcwkdy ~ gndr * dshltgp, ess)
anova(m11, m12, m13)## Analysis of Variance Table
##
## Model 1: alcwkdy ~ gndr
## Model 2: alcwkdy ~ gndr + dshltgp
## Model 3: alcwkdy ~ gndr * dshltgp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29830 57030774
## 2 29829 56961833 1 68941 36.1071 1.889e-09 ***
## 3 29828 56951855 1 9978 5.2259 0.02226 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion: The model with interaction is better, let’s describe it.
##
## Call:
## lm(formula = alcwkdy ~ gndr * dshltgp, data = ess)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.68 -19.57 -9.57 4.99 2151.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.6761 0.6512 57.853 < 2e-16 ***
## gndrFemale -16.1094 1.0090 -15.966 < 2e-16 ***
## dshltgpMarked -4.6632 0.7760 -6.009 1.88e-09 ***
## gndrFemale:dshltgpMarked 2.6695 1.1678 2.286 0.0223 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.7 on 29828 degrees of freedom
## Multiple R-squared: 0.02766, Adjusted R-squared: 0.02757
## F-statistic: 282.9 on 3 and 29828 DF, p-value: < 2.2e-16
Interpretation: (R-squared, the intercept, regression coefficients)
plot_model(m13, type = "int",
title = "Predicted values of Weekday Alcohol, grams",
legend.title = "Discussed health with GP last year",
axis.title = c("Years of education", "Grams of alcohol on weekdays"))Where to look:
m21 <- lm(alcwkdy ~ eduyrs, ess)
m22 <- lm(alcwkdy ~ eduyrs + agea, ess)
m23 <- lm(alcwkdy ~ eduyrs * agea, ess)
anova(m21, m22, m23)## Analysis of Variance Table
##
## Model 1: alcwkdy ~ eduyrs
## Model 2: alcwkdy ~ eduyrs + agea
## Model 3: alcwkdy ~ eduyrs * agea
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29830 58561625
## 2 29829 58406177 1 155448 79.404 < 2e-16 ***
## 3 29828 58393830 1 12347 6.307 0.01203 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model 2 is better than Model 1, while Model 3 has a better fit than Model 2.
| Weekday Alcohol, grams | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 29.43 | 27.61 – 31.25 | <0.001 | 37.39 | 34.87 – 39.92 | <0.001 | 43.89 | 38.23 – 49.56 | <0.001 |
| eduyrs | -0.16 | -0.29 – -0.02 | 0.020 | -0.28 | -0.41 – -0.15 | <0.001 | -0.78 | -1.20 – -0.37 | <0.001 |
| agea | -0.13 | -0.16 – -0.10 | <0.001 | -0.25 | -0.35 – -0.15 | <0.001 | |||
| eduyrs * agea | 0.01 | 0.00 – 0.02 | 0.012 | ||||||
| Observations | 29832 | 29832 | 29832 | ||||||
| R2 / R2 adjusted | 0.000 / 0.000 | 0.003 / 0.003 | 0.003 / 0.003 | ||||||
Here, the sample is huge, so that even a tiny relationship can get statistically significant. Be mindful to see the difference between statistical significant and substantive significance for your research.
Let’s proceed to plotting this interaction:
By default, plot_model will use the minimal and maximal values for plotting.
Another popular option is to plot the mean value, mean+1 SD, and mean-1SD:
You can see now the linear difference between the ages and the education level (~16 years of education) at which the interaction becomes not significant.
Finally, the following marginal effect plots will show the change in conditional coefficients for both continuous predictors in the model
library(interplot)
p1 <- interplot(m23, var1 = "eduyrs", var2 = "agea") +
xlab('age') +
ylab('Estimated Coefficient for Years of Education') +
geom_hline(yintercept = 0, linetype = "dashed")
p2 <- interplot(m23, var1 = "agea", var2 = "eduyrs") +
xlab('education years') +
ylab('Estimated Coefficient for Age') +
geom_hline(yintercept = 0, linetype = "dashed")
library(gridExtra)
grid.arrange(p1, p2, nrow = 1)