Welcome to this R demo session! Today, we’ll delve deeper into
regression models that include interaction terms. We’ll be using the
readwrite2.csv
dataset that contains students’ reading and
writing test scores, as well as their proficiency levels. Our main
question is: does the relationship between reading and writing scores
differ across proficiency levels?
# Setting Up the Environment
# Before we start, let's load the necessary packages and the dataset we'll be working with.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
data_rw <- read.csv("readwrite2.csv")
Last week, we noticed that the slopes between reading and writing scores are different across different proficiency levels. Here’s how you can visualize this trend:
ggplot(data_rw, aes(x=reading, y=writing, color=as.factor(level))) +
geom_point(position = position_jitter(), alpha = .3, size = 2) +
geom_smooth(method='lm') +
labs(title="Reading vs Writing Scores by Proficiency Level")
## `geom_smooth()` using formula = 'y ~ x'
To begin, let’s explore a regression model that includes a dichotomous variable, specifically focusing on proficiency levels 1 and 2 (Novice and Intermediate Low). Below is the regression model without any interaction terms.
data_rw1 <- data_rw %>%
filter(level == 1 | level == 2) %>%
mutate(level = case_when(level == 1 ~ 0,
TRUE ~ 1))
# Build the regression model without interaction
output1 <- lm(writing ~ reading + level, data = data_rw1)
summary(output1)
##
## Call:
## lm(formula = writing ~ reading + level, data = data_rw1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7614 -1.5082 -0.0017 1.2750 5.5986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.02763 0.44791 26.853 < 2e-16 ***
## reading -0.12662 0.03197 -3.961 8.42e-05 ***
## level 3.27309 0.20205 16.199 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.137 on 565 degrees of freedom
## Multiple R-squared: 0.3253, Adjusted R-squared: 0.3229
## F-statistic: 136.2 on 2 and 565 DF, p-value: < 2.2e-16
Now let’s add the interaction.
output2 <- lm(writing ~ reading + level + reading*level, data = data_rw1)
summary(output2)
##
## Call:
## lm(formula = writing ~ reading + level + reading * level, data = data_rw1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3056 -1.3777 -0.0165 1.4884 5.4787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.18536 0.62295 16.350 < 2e-16 ***
## reading 0.01202 0.04568 0.263 0.792
## level 7.12242 0.93966 7.580 1.43e-13 ***
## reading:level -0.26445 0.06309 -4.192 3.21e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.106 on 564 degrees of freedom
## Multiple R-squared: 0.3457, Adjusted R-squared: 0.3422
## F-statistic: 99.34 on 3 and 564 DF, p-value: < 2.2e-16
There is a shortcut- if you put in two variables with a
*
, lm
interprets that as including all the
main effects and the interaction. If there were three variables, such as
reading * level * gender
, all main effects and all 2-way
interactions and the 3-way interaction would be included
automatically.
output3 <- lm(writing ~ reading*level, data = data_rw1)
summary(output3)
##
## Call:
## lm(formula = writing ~ reading * level, data = data_rw1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3056 -1.3777 -0.0165 1.4884 5.4787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.18536 0.62295 16.350 < 2e-16 ***
## reading 0.01202 0.04568 0.263 0.792
## level 7.12242 0.93966 7.580 1.43e-13 ***
## reading:level -0.26445 0.06309 -4.192 3.21e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.106 on 564 degrees of freedom
## Multiple R-squared: 0.3457, Adjusted R-squared: 0.3422
## F-statistic: 99.34 on 3 and 564 DF, p-value: < 2.2e-16
If you want the interaction but not the main effect, use a colon
:
. But including an interaction without a main effect
usually makes no sense.
output4 <- lm(writing ~ reading:level, data = data_rw1)
summary(output4)
##
## Call:
## lm(formula = writing ~ reading:level, data = data_rw1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7379 -1.6108 0.2003 1.3892 6.5130
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.61081 0.14820 71.60 <2e-16 ***
## reading:level 0.15635 0.01181 13.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.272 on 566 degrees of freedom
## Multiple R-squared: 0.2363, Adjusted R-squared: 0.235
## F-statistic: 175.2 on 1 and 566 DF, p-value: < 2.2e-16
We have the results, but what do they mean? Start with a plot with the interaction
ggplot(data_rw1, aes(x=reading, y=writing, color=as.factor(level))) +
geom_point(position = position_jitter(), alpha = .3, size = 2) +
geom_smooth(method='lm') +
labs(title="Reading vs Writing Scores by Proficiency Level")
## `geom_smooth()` using formula = 'y ~ x'
Main Effects and Interactions
reading
) on writing scores; higher values of reading are
associated with higher values of writing.level
); moving from one level to the next is associated
with a substantial increase in writing scores.reading:level
).Detailed Interpretation
The interaction term (reading:level
) suggests that the
relationship between reading and writing scores is not consistent across
different proficiency levels. Specifically, the effect of reading scores
on writing scores varies depending on the proficiency level of the
individual.
For a more concrete understanding, let’s plug in some numbers:
The coefficient for the categorical variable (7.122) should be interpreted as the difference in writing scores between students at two different proficiency levels, specifically when their reading scores are zero.
The main effect of reading is represented by its coefficient, 0.012 . In this case, for students whose proficiency level = 0, a one-unit increase in reading score is associated with a 0.012-unit increase in writing score.
To understand the effect of reading when level is 1, we need to consider both the main effect of reading and the interaction term (reading:level = -0.264). So, the overall effect is 0.012+(-0.264) = -0.252. For individuals at proficiency level 1, a one-unit difference in reading is associated with a -0.252-unit difference in writing.
Thus, the model suggests that the impact of reading on writing is different for students at different proficiency level. While for novice students, the effect of reading scores on writing scores is negligible (0.012), for intermediate low students, the effect is negative (-0.252).
Interactions between two continuous predictors are harder to
understand but quite common and important. Let’s add an additional
continuous variable, motivation
to the dataset. Note that
the motivation scores have been standardized.
existing_error_term <- residuals(lm(writing ~ reading, data=data_rw1))
data_rw1$motivation <- scale((data_rw1$writing - 0.3 * data_rw1$reading) / (0.2 * data_rw1$reading))
output5 <- lm(writing ~ reading * motivation, data = data_rw1)
summary(output5)
##
## Call:
## lm(formula = writing ~ reading * motivation, data = data_rw1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.049e-13 1.700e-16 5.700e-16 1.240e-15 1.808e-13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.408e-14 6.330e-15 3.805e+00 0.000157 ***
## reading 8.564e-01 4.277e-16 2.002e+15 < 2e-16 ***
## motivation 5.505e-15 1.769e-15 3.112e+00 0.001952 **
## reading:motivation 3.310e-01 1.766e-16 1.874e+15 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.296e-14 on 564 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.419e+30 on 3 and 564 DF, p-value: < 2.2e-16
Main Effects and Interactions
reading
) on writing is not
significant, with a higher reading score associated with a higher
writing score indicated by the positive coefficient.motivation
) on writing
is not significant at the conventional levels.Detailed Interpretation
In the context of interaction effects, it’s often useful to
understand how the impact of one variable changes at different levels of
another. Considering the effect at the mean and one standard deviation
above and below the mean of the moderating variable
(motivation
in this case) provides a nuanced understanding.
This approach gives us a sense of how reading impacts writing for
individuals who are at average levels of motivation, as well as for
those who are exceptionally motivated or less so.
At Mean Level of Motivation
For motivation at its mean value, which is zero due to standardization, the interaction term will have no effect (motivation * reading = 0). Therefore, the regression coefficient for reading is its main effect, 0.856. This means that when motivation is at the mean, a one-unit increase in reading is associated with a 0.856-unit increase in writing.
Interpreting at Specific Values: Mean + 1SD and Mean - 1SD
For motivation at one standard deviation above the mean (which would be 1 in a standardized variable), the effect of the interaction term would be 0.331. Therefore, the overall effect of reading on writing is 0.856+0.331=1.187. At one standard deviation above the mean in motivation, a one-unit increase in reading would result in a 1.187-unit increase in writing.
Conversely, for motivation at one standard deviation below the mean (which would be -1 in a standardized variable), the interaction term would subtract from the main effect, 0.856−0.331=0.525 At one standard deviation below the mean in motivation, a one-unit increase in reading would result in a 0.525-unit increase in writing.
The key point in all of this: the effect of reading scores on writing scores depends on the level of motivation This is the standard way to understand an interaction.
Interactions can work the other way too. Instead of the effect of reading scores on writing scores depending on the level of motivation, we can consider the effect of motivation on writing scores depending on the level of reading scores This is completely mathematically equivalent. Pick the one that offers the best scientific interpretation.
The package interactions
offers a lot of ways to better
understand interactions.
# install.packages("interactions")
library(interactions)
interact_plot(output5,pred = reading,modx = motivation)
# pred: The name of the predictor variable involved in the interaction.
# modx: The name of the moderator variable involved in the interaction.
interact_plot(output5,pred = reading,modx = motivation,plot.points = TRUE)
Looking at specific values of one variable in the context of an
interaction is a form of what is called simple slopes analysis. The
sim_slopes
function in the interactions
package does this automatically.
sim_slopes(output5,pred = reading,modx = motivation,johnson_neyman = FALSE) # omit the Johnson-Neyman regions of significance.
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(model): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(model): essentially perfect fit: summary may be
## unreliable
## SIMPLE SLOPES ANALYSIS
##
## Slope of reading when motivation = -1.00000e+00 (- 1 SD):
##
## Est. S.E. t val. p
## ------ ------ --------------------- ------
## 0.53 0.00 6433019635787841.00 0.00
##
## Slope of reading when motivation = -1.07074e-15 (Mean):
##
## Est. S.E. t val. p
## ------ ------ --------------------- ------
## 0.86 0.00 1957938854142108.00 0.00
##
## Slope of reading when motivation = 1.00000e+00 (+ 1 SD):
##
## Est. S.E. t val. p
## ------ ------ --------------------- ------
## 1.19 0.00 1210467956874302.00 0.00
You may observe that the mean and SD for motivation are exactly the same as what you had above.
The sim_slopes
function also calculated the statistical
significance of the coefficient on reading scores for specific values of
motivation Is that awesome? Maybe. Maybe not. You can decide if it’s a
valuable way to help you interpret an interaction.
The default in the sim_slopes
function is exactly what
we did: mean, mean+1SD, mean-1SD. We can also control specific values of
motivation to use.
sim_slopes(output5,pred = reading,modx = motivation,
johnson_neyman = TRUE)
## Warning in summary.lm(object, ...): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(model): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(model): essentially perfect fit: summary may be
## unreliable
## JOHNSON-NEYMAN INTERVAL
##
## When motivation is OUTSIDE the interval [-2.59, -2.59], the slope of
## reading is p < .05.
##
## Note: The range of observed values of motivation is [-1.58, 14.03]
##
## SIMPLE SLOPES ANALYSIS
##
## Slope of reading when motivation = -1.00000e+00 (- 1 SD):
##
## Est. S.E. t val. p
## ------ ------ --------------------- ------
## 0.53 0.00 6433019635787841.00 0.00
##
## Slope of reading when motivation = -1.07074e-15 (Mean):
##
## Est. S.E. t val. p
## ------ ------ --------------------- ------
## 0.86 0.00 1957938854142108.00 0.00
##
## Slope of reading when motivation = 1.00000e+00 (+ 1 SD):
##
## Est. S.E. t val. p
## ------ ------ --------------------- ------
## 1.19 0.00 1210467956874302.00 0.00
If the interaction term is not significant, you can also ask for
Johnson-Neyman interval, which are the values for
movitation
for which the regression coefficient of reading
scores (reading
) is significant (in our case this is not
relevant because the interaction term is already significant).
I don’t love this analysis because it puts too much emphasis on statistical significance, but here it is.