Example: Last week, we examined a model with multiple continuous predictors. However, we did not investigate whether there was an interaction effect between any of the continuous predictors. In other words, we have not yet examined whether the relationship between predictor 1 and Y varies depending on the level of predictor 2.
The data set we’re using in today’s lab contains the following variables:
Let’s say a researcher is interested in whether the relationship between conscientiousness and self-reported health scores varies depending on the number of days people exercise per week.
This research question corresponds to the following model comparison:
\[ModelA: Health_i = \beta_0 + \beta_1*Conscientiousness + \beta_2*Exercise + \beta_3*(ConscientiousnessXExercise) + \epsilon_i\]
\[Model C: Health_i = \beta_0 + \beta_1*Conscientiousness + \beta_2*Exercise + \epsilon_i\]
The null hypothesis corresponding to this model comparison is:
\[H_0: \beta_3\ = 0\]
data <- import("health.csv")
First, let’s check the measure type that each variable was imported as.
str(data)
## 'data.frame': 60 obs. of 6 variables:
## $ row : int 1 2 3 4 5 6 7 8 9 10 ...
## $ pid : int 1 2 3 4 5 6 7 8 9 10 ...
## $ consc : num 2.66 5.02 3.85 1.73 3.45 ...
## $ exercise : int 0 6 3 1 3 4 2 6 3 1 ...
## $ soc_supp : int 4 4 3 2 2 5 4 2 3 1 ...
## $ sr_health: num 1.91 4.49 3.4 2.05 4.21 ...
Convert the variable’s measures types as needed.
data <- data %>%
mutate(row = as.factor(row),
pid = as.factor(pid))
Let’s construct a descriptive statistics table that provides the mean and standard deviation for the key variables of interest in this study.
descriptives_table <- data %>%
summarise(M_Consc = mean(consc),
SD_Consc = sd(consc),
M_Exercise = mean(exercise),
SD_Exercise = sd(exercise),
M_Health = mean(sr_health),
SD_Health = sd(sr_health))
# raw table with no formatting
descriptives_table
## M_Consc SD_Consc M_Exercise SD_Exercise M_Health SD_Health
## 1 2.847374 0.9702901 2.4 1.509293 3.053342 0.9985299
# adding table formatting
descriptives_table %>%
knitr::kable(digits = 2, col.names = c("Mean Conscientiousness", "SD Conscientiousness", "Mean Exercise", "SD Exercise", "Mean Health", "SD Health"), caption = "Descriptive Statistics for Conscientiousness and Health", format.args = list(nsmall = 2))
| Mean Conscientiousness | SD Conscientiousness | Mean Exercise | SD Exercise | Mean Health | SD Health |
|---|---|---|---|---|---|
| 2.85 | 0.97 | 2.40 | 1.51 | 3.05 | 1.00 |
We can create a scatterplot matrix using the
pairs.panels() function to visualize the relationship
between each pair of variables in the study.
data %>%
dplyr::select(consc, exercise, sr_health) %>%
pairs.panels(lm = TRUE)
The relationship between each predictor, conscientiousness & exercise, and the outcome variable, self-reported health, appears to follow the pattern of a straight line.
Now that we’re including a continuous X continuous interaction term in our model, it’s very important that we center the continuous predictors prior to including them in the linear model.
Centering the continuous predictors prior to running the model that includes the interaction effect between them will reduce the multicollinearity between them and the interaction term.
Let’s center the predictors, conscientiousness & exercise, below.
data$consc_c <- c(scale(data$consc, center = TRUE, scale = FALSE))
data$exercise_c <- c(scale(data$exercise, center = TRUE, scale = FALSE))
Side note: We’re putting the scale() function side of the c() function to make sure the resulting variables are numeric.
Fit a linear model predicting sr_health from
consc_c, exercise_c, and their interaction,
consc_c*exercise_c. Call the object
model.
model <- lm(sr_health ~ consc_c + exercise_c + consc_c*exercise_c, data = data)
Let’s assess the multicollinearity of our model’s predictors using
the ols_vif_tol() function from the olsrr
package. This is important to investigate when it’s possible that the
predictors in your model could be correlated.
ols_vif_tol(model)
## Variables Tolerance VIF
## 1 consc_c 0.7903154 1.265318
## 2 exercise_c 0.7853021 1.273395
## 3 consc_c:exercise_c 0.8913557 1.121887
Either a low tolerance (below 0.20 is one rule of thumb) or a high VIF (above 5 or 10) is an indication of a problem with multicollinearity. Looks like multicollinearity does not pose an extreme problem with the current set of predictors!
What if we hadn’t centered our continuous predictors before including them in a model that includes their interaction effect? Let’s see.
Re-run the analysis, but this time predict sr_health
from non-centered scores on consc, exercise,
and the interaction between the two, consc*exercise. Call
the resulting object model_uncentered.
model_uncentered <- lm(sr_health ~ consc + exercise + consc*exercise, data = data)
Let’s re-examine the multicollinearity diagnostics for this uncentered model:
ols_vif_tol(model_uncentered)
## Variables Tolerance VIF
## 1 consc 0.19967441 5.008153
## 2 exercise 0.07648468 13.074514
## 3 consc:exercise 0.04467484 22.383964
Multicollinearity now poses a substantial issue! The presence of this multicollinearity would make it more difficult to parse apart the unique contribution of each predictor in the model.
We’ll stick with the version of the model that uses the centered predictors.
Next, let’s check whether the assumptions of 1) normally distributed errors, 2) independence of errors, and 3) homogeneity of variances were met by the current model.
# storing table containing residuals
augment_model <- augment(model)
augment_model
## # A tibble: 60 × 9
## sr_health consc_c exercise_c .fitted .resid .hat .sigma .cooksd
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.91 -0.188 -2.4 1.78 0.134 0.0656 0.543 0.00116
## 2 4.49 2.17 3.6 4.50 -0.00600 0.480 0.543 0.0000552
## 3 3.40 1.00 0.6 3.56 -0.153 0.0363 0.543 0.000787
## 4 2.05 -1.11 -1.4 2.00 0.0459 0.0646 0.543 0.000134
## 5 4.21 0.606 0.6 3.51 0.698 0.0263 0.535 0.0117
## 6 5.11 1.16 1.6 3.96 1.15 0.0507 0.519 0.0639
## 7 3.09 -0.0301 -0.4 2.91 0.181 0.0206 0.543 0.000604
## 8 4.58 1.06 3.6 4.77 -0.192 0.160 0.542 0.00720
## 9 3.90 0.255 0.6 3.47 0.432 0.0235 0.540 0.00397
## 10 2.64 0.192 -1.4 2.46 0.180 0.0430 0.543 0.00131
## # ℹ 50 more rows
## # ℹ 1 more variable: .std.resid <dbl>
# plotting histogram of residuals
ggplot(data = augment_model, aes(x = .resid)) +
geom_density(fill = "purple") + # histogram of the residuals
stat_function(linetype = 2, # a normal distribution overlaid on top for reference
fun = dnorm,
args = list(mean = mean(augment_model$.resid),
sd = sd(augment_model$.resid)))
ggplot(model) +
geom_abline() +
stat_qq(aes(sample = .stdresid))
# add the ID column to the table containing the residuals
augment_model$pid <- data$pid
# plot the ID numbers against the residuals
ggplot(data = augment_model, aes(x = pid, y = .resid)) +
geom_point() +
geom_smooth(se = F) +
geom_hline(yintercept = 0)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
plot(model, 1)
Recall that we can examine the model output using different functions:
Let’s start with summary() and
confint().
summary(model)
##
## Call:
## lm(formula = sr_health ~ consc_c + exercise_c + consc_c * exercise_c,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.08743 -0.32427 -0.09771 0.27442 1.14760
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.12643 0.07675 40.737 < 2e-16 ***
## consc_c 0.18556 0.08125 2.284 0.0262 *
## exercise_c 0.52607 0.05240 10.040 3.96e-14 ***
## consc_c:exercise_c -0.11806 0.05261 -2.244 0.0288 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5383 on 56 degrees of freedom
## Multiple R-squared: 0.7242, Adjusted R-squared: 0.7094
## F-statistic: 49 on 3 and 56 DF, p-value: 1.131e-15
confint(model)
## 2.5 % 97.5 %
## (Intercept) 2.97268467 3.28016802
## consc_c 0.02280769 0.34831706
## exercise_c 0.42110698 0.63103639
## consc_c:exercise_c -0.22344427 -0.01267783
The best way to interpret the meaning of each parameter estimate is to rearrange the full model formula to express the “simple” relationship between one of the predictors and the outcome variable at different levels of the second predictor.
The full estimate of the linear model equation is:
\[Health' = 3.13 + 0.19*Consc_C + 0.53*Exercise_C - 0.12*Consc_CxExercise_C\] * b0 = 3.13 * b1 = 0.19 * b2 = 0.53 * b3 = -0.12
Let’s rearrange this formula to represent the simple relationship between conscientiousness and self-reported health at different levels of exercise.
\[Health' = (3.13 + 0.53*Exercise_C) + (0.19 - 0.12*Exercise_C)*Consc_C\]
b0 = 3.13, the y-intercept for the relationship between conscientiousness and self-reported health when exercise_c = 0 (so at the mean of exercise)
b2 = 0.53, the change in the y-intercept for the relationship between conscientiousness and self-reported health for every 1-unit increase in exercise
b1 = 0.19, the slope for the relationship between conscientiousness and self-reported when when exercise_c = 0 (so at the mean of exercise)
b3 = -0.12, the change in the slope for the relationship between conscientiousness and self-reported health for every 1-unit increase in exercise
Now, you practice rearranging the full model to represent the simple relationship between exercise and self-reported health at different levels of conscientiousness.
Write the equation for the simple relationship between exercise and self-reported health by rearranging the terms in the full model equation below:
\[Health' = (3.13 + 0.19*Consc_C) + (0.53 - 0.12*Consc_C)*Exercise_C\]
Interpret the meaning of each of the parameter estimates in the context of this rearranged version of the full equation:
b0 = 3.13, the y-intercept for the relationship between exercise and self-reported health when consc_c = 0 (so at the mean of conscientiousness)
b2 = 0.19, the change in the y-intercept for the relationship between exercise and self-reported health for every 1-unit increase in conscientiousness
b1 = 0.53, the slope for the relationship between exercise and self-reported when when consc_c = 0 (so at the mean of conscientiousness)
b3 = -0.12, the change in the slope for the relationship between exercise and self-reported health for every 1-unit increase in conscientiousness
Question: Was there a significant interaction effect between conscientiousness and exercise? What does this mean?
Yes, the interaction effect between conscientiousness and exercise was significant, t(56) = -2.24, p = .029. This means that the relationship between conscientiousness and self-reported health significantly differs depending on the level of exercise at which we examine this simple relationship.
Next, let’s examine the Anova() output:
Anova(model, type = 3)
## Anova Table (Type III tests)
##
## Response: sr_health
## Sum Sq Df F value Pr(>F)
## (Intercept) 480.88 1 1659.5085 < 2.2e-16 ***
## consc_c 1.51 1 5.2165 0.02619 *
## exercise_c 29.21 1 100.8021 3.955e-14 ***
## consc_c:exercise_c 1.46 1 5.0366 0.02879 *
## Residuals 16.23 56
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We should come to the same conclusion about the significance of the
interaction effect based on the Anova(model, type = 3)
output as we did using the summary() output.
Question: Does Model A, which includes the interaction effect, make a significant improvement to the variance accounted for in self-reported health scores compared to Model C, which leaves out the interaction effect?
Yes, the inclusion of the interaction between conscientiousness and exercise significantly improved the variance accounted for in self-reported health scores, F(1, 56) = 5.04, p = .029.
By how much did Model A improve the variance accounted for compared to Model C? Let’s see by calculating the effect size.
We can get two measures of effect size, \(\eta^2\) and \(\eta^2_{Partial}\) by passing our model to
the etaSquared() function.
Here’s a reminder of the formulas that are used to calculate each measure of effect size:
\[\eta^2 = SSR/SS_{Total}\]
\[\eta^2_{Partial} = SSR/SSE(C)\]
etaSquared(model, type = 3)
## eta.sq eta.sq.part
## consc_c 0.02569575 0.08521394
## exercise_c 0.49653717 0.64286191
## consc_c:exercise_c 0.02480955 0.08251756
Calculating the eta-squared values “from scratch”:
ssr_consc <- 1.51
ssr_exer <- 29.21
ssr_int <- 1.46
ss_total <- var(data$sr_health)*(nrow(data)-1)
ssr_consc/ss_total # .0257
## [1] 0.02566864
ssr_exer/ss_total # .4965
## [1] 0.4965436
ssr_int/ss_total # .0248
## [1] 0.02481868
Calculating the partial eta-squared values “from scratch”:
model <- lm(sr_health ~ consc_c + exercise_c + consc_c*exercise_c, data = data)
Anova(model, type = 3)
## Anova Table (Type III tests)
##
## Response: sr_health
## Sum Sq Df F value Pr(>F)
## (Intercept) 480.88 1 1659.5085 < 2.2e-16 ***
## consc_c 1.51 1 5.2165 0.02619 *
## exercise_c 29.21 1 100.8021 3.955e-14 ***
## consc_c:exercise_c 1.46 1 5.0366 0.02879 *
## Residuals 16.23 56
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
sse_c_consc <- 16.23 + 1.51 # SSE(A) + SSR
ssr_consc/sse_c_consc # .085
## [1] 0.08511838
sse_c_exer <- 16.23 + 29.21 # SSE(A) + SSR
ssr_exer/sse_c_exer # .643
## [1] 0.6428257
sse_c_int <- 16.23 + 1.46 # SSE(A) + SSR
ssr_int/sse_c_int # .083
## [1] 0.0825325
We can unpack the nature of the conscientiousness by exercise
interaction effect using the plot_model() function from the
sjPlot package.
When examining the relationship between a predictor (e.g., X1) and an outcome, Y, across different levels of a second predictor (e.g., X2), the second predictor variable is called a moderator.
Since the moderator variable is continuous, the researcher has to choose values of X2 at which to examine the relationship between X1 and Y. The most typical values that are chosen are +/- 1SD on X2.
Let’s plot the interaction effect between conscientiousness and self-reported health at the mean, +1 SD, and -1 SD on exercise.
plot_model(model = model,
type = "int", # interaction
mdrt.values = "meansd") # which values of the moderator variable to use
Question: What appears to be the nature of the interaction effect?
There appears to be a stronger, positive relationship between conscientiousness and self-reported health scores at low (-1SD) levels of exercise than at high (+1SD) levels of exercise. In other words, for people who exercise 1SD less than average, there is a positive relationship between their conscientiousness levels and self-reported health scores. For people who exercise 1SD more than average, there appears to be almost no relationship between their conscientiousness levels and their self-reported health scores.
In addition to visualization the interaction effect, we may also wish to know at which levels of exercise is there a significant relationship between conscientiousness and self-reported health.
Simple slopes analysis refers to analyzing the slope of the relationship between the first predictor (e.g., X1) and Y at specific values of the second predictor (e.g., X2).
We can perform a simple slopes analysis to test which of the slopes
in our plot of the interaction effect are significantly, or
non-significantly, different from zero using the
simple_slopes() function from the reghelper
package.
simple_slopes(model)
## consc_c exercise_c Test Estimate Std. Error t value df Pr(>|t|) Sig.
## 1 -0.97029 sstest 0.6406 0.0798 8.0232 56 7.081e-11 ***
## 2 0 sstest 0.5261 0.0524 10.0400 56 3.955e-14 ***
## 3 0.97029 sstest 0.4115 0.0658 6.2564 56 5.799e-08 ***
## 4 sstest -1.509293 0.3638 0.1231 2.9543 56 0.004575 **
## 5 sstest 0 0.1856 0.0812 2.2840 56 0.026187 *
## 6 sstest 1.509293 0.0074 0.1032 0.0714 56 0.943294
sstest means
that the simple slope for this variable is being tested.Tesst Estimate column provides the simple slope,
b, for the predictor marked sstestt-statistic,
df, SE, and p-value corresponding
to a test of the significance of the simple slope on that row.Question: Describe the significance and direction of the relationship between conscientiousness and self-reported health at each level of exercise that was tested (mean, -1SD, and +1SD levels of exercise).
At low levels of exercise (-1SD), there was a significant, positive relationship between conscientiousness and self-reported health, b = 0.36, t(56) = 2.95, p = .005.
At average levels of exercise, there was a significant, positive relationship between conscientiousness and self-reported health, b = 0.08, t(56) = 2.28, p = .026.
At high levels of exercise (+1SD), there was no significant relationship between conscientiousness and self-reported health and the simple slope was close to zero, b = 0.01, t(56) = 0.07, p = .943.