Research Scenario

Example: Last week, we examined a model with multiple continuous predictors. However, we did not investigate whether there was an interaction effect between any of the continuous predictors. In other words, we have not yet examined whether the relationship between predictor 1 and Y varies depending on the level of predictor 2.

The data set we’re using in today’s lab contains the following variables:

row = Row number
pid = Participant ID number
consc = Participant’s conscientiousness
exercise = The number of days a week the participant reported that they exercise
soc_supp = The participant’s perception of the amount of social support they have from the people around them from 1 (none at all) to 5 (a great deal)
sr_health = Participant’s self-reported health

Let’s say a researcher is interested in whether the relationship between conscientiousness and self-reported health scores varies depending on the number of days people exercise per week.

Model Comparison

This research question corresponds to the following model comparison:

\[ModelA: Health_i = \beta_0 + \beta_1*Conscientiousness + \beta_2*Exercise + \beta_3*(ConscientiousnessXExercise) + \epsilon_i\]

\[Model C: Health_i = \beta_0 + \beta_1*Conscientiousness + \beta_2*Exercise + \epsilon_i\]

Null Hypothesis

The null hypothesis corresponding to this model comparison is:

\[H_0: \beta_3\ = 0\]

Import Data

data <- import("health.csv")

Data Cleaning & Wrangling

First, let’s check the measure type that each variable was imported as.

str(data)

## 'data.frame':    60 obs. of  6 variables:
##  $ row      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ pid      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ consc    : num  2.66 5.02 3.85 1.73 3.45 ...
##  $ exercise : int  0 6 3 1 3 4 2 6 3 1 ...
##  $ soc_supp : int  4 4 3 2 2 5 4 2 3 1 ...
##  $ sr_health: num  1.91 4.49 3.4 2.05 4.21 ...

Convert the variable’s measures types as needed.

data <- data %>%
  mutate(row = as.factor(row),
         pid = as.factor(pid))

Descriptive Statistics

Let’s construct a descriptive statistics table that provides the mean and standard deviation for the key variables of interest in this study.

descriptives_table <- data %>%
  summarise(M_Consc = mean(consc),
            SD_Consc = sd(consc),
            M_Exercise = mean(exercise),
            SD_Exercise = sd(exercise),
            M_Health = mean(sr_health),
            SD_Health = sd(sr_health))

# raw table with no formatting
descriptives_table

##    M_Consc  SD_Consc M_Exercise SD_Exercise M_Health SD_Health
## 1 2.847374 0.9702901        2.4    1.509293 3.053342 0.9985299

# adding table formatting
descriptives_table %>%
  knitr::kable(digits = 2, col.names = c("Mean Conscientiousness", "SD Conscientiousness", "Mean Exercise", "SD Exercise", "Mean Health", "SD Health"), caption = "Descriptive Statistics for Conscientiousness and Health", format.args = list(nsmall = 2))

Descriptive Statistics for Conscientiousness and Health
Mean Conscientiousness	SD Conscientiousness	Mean Exercise	SD Exercise	Mean Health	SD Health
2.85	0.97	2.40	1.51	3.05	1.00

Data Visualization

We can create a scatterplot matrix using the pairs.panels() function to visualize the relationship between each pair of variables in the study.

data %>%
  dplyr::select(consc, exercise, sr_health) %>%
  pairs.panels(lm = TRUE)

The relationship between each predictor, conscientiousness & exercise, and the outcome variable, self-reported health, appears to follow the pattern of a straight line.

Centering the Continuous Predictors

Now that we’re including a continuous X continuous interaction term in our model, it’s very important that we center the continuous predictors prior to including them in the linear model.

Centering the continuous predictors prior to running the model that includes the interaction effect between them will reduce the multicollinearity between them and the interaction term.

Let’s center the predictors, conscientiousness & exercise, below.

data$consc_c <- c(scale(data$consc, center = TRUE, scale = FALSE))
data$exercise_c <- c(scale(data$exercise, center = TRUE, scale = FALSE))

Side note: We’re putting the scale() function side of the c() function to make sure the resulting variables are numeric.

Fit the Model

Fit a linear model predicting sr_health from consc_c, exercise_c, and their interaction, consc_c*exercise_c. Call the object model.

model <- lm(sr_health ~ consc_c + exercise_c + consc_c*exercise_c, data = data)

Assessing Multicollinearity

Let’s assess the multicollinearity of our model’s predictors using the ols_vif_tol() function from the olsrr package. This is important to investigate when it’s possible that the predictors in your model could be correlated.

ols_vif_tol(model)

##            Variables Tolerance      VIF
## 1            consc_c 0.7903154 1.265318
## 2         exercise_c 0.7853021 1.273395
## 3 consc_c:exercise_c 0.8913557 1.121887

Either a low tolerance (below 0.20 is one rule of thumb) or a high VIF (above 5 or 10) is an indication of a problem with multicollinearity. Looks like multicollinearity does not pose an extreme problem with the current set of predictors!

What if we hadn’t centered the predictors?

What if we hadn’t centered our continuous predictors before including them in a model that includes their interaction effect? Let’s see.

Re-run the analysis, but this time predict sr_health from non-centered scores on consc, exercise, and the interaction between the two, consc*exercise. Call the resulting object model_uncentered.

model_uncentered <- lm(sr_health ~ consc + exercise + consc*exercise, data = data)

Let’s re-examine the multicollinearity diagnostics for this uncentered model:

ols_vif_tol(model_uncentered)

##        Variables  Tolerance       VIF
## 1          consc 0.19967441  5.008153
## 2       exercise 0.07648468 13.074514
## 3 consc:exercise 0.04467484 22.383964

Multicollinearity now poses a substantial issue! The presence of this multicollinearity would make it more difficult to parse apart the unique contribution of each predictor in the model.

We’ll stick with the version of the model that uses the centered predictors.

Check whether model assumptions were met

Next, let’s check whether the assumptions of 1) normally distributed errors, 2) independence of errors, and 3) homogeneity of variances were met by the current model.

Checking whether errors are normally distributed

Distribution of the Residuals

# storing table containing residuals
augment_model <- augment(model)
augment_model

## # A tibble: 60 × 9
##    sr_health consc_c exercise_c .fitted   .resid   .hat .sigma   .cooksd
##        <dbl>   <dbl>      <dbl>   <dbl>    <dbl>  <dbl>  <dbl>     <dbl>
##  1      1.91 -0.188        -2.4    1.78  0.134   0.0656  0.543 0.00116  
##  2      4.49  2.17          3.6    4.50 -0.00600 0.480   0.543 0.0000552
##  3      3.40  1.00          0.6    3.56 -0.153   0.0363  0.543 0.000787 
##  4      2.05 -1.11         -1.4    2.00  0.0459  0.0646  0.543 0.000134 
##  5      4.21  0.606         0.6    3.51  0.698   0.0263  0.535 0.0117   
##  6      5.11  1.16          1.6    3.96  1.15    0.0507  0.519 0.0639   
##  7      3.09 -0.0301       -0.4    2.91  0.181   0.0206  0.543 0.000604 
##  8      4.58  1.06          3.6    4.77 -0.192   0.160   0.542 0.00720  
##  9      3.90  0.255         0.6    3.47  0.432   0.0235  0.540 0.00397  
## 10      2.64  0.192        -1.4    2.46  0.180   0.0430  0.543 0.00131  
## # ℹ 50 more rows
## # ℹ 1 more variable: .std.resid <dbl>

# plotting histogram of residuals
ggplot(data = augment_model, aes(x = .resid)) + 
  geom_density(fill = "purple") + # histogram of the residuals
  stat_function(linetype = 2, # a normal distribution overlaid on top for reference
                fun = dnorm, 
                args = list(mean = mean(augment_model$.resid), 
                                sd   =   sd(augment_model$.resid)))

Q-Q Plot

ggplot(model) +
  geom_abline() + 
  stat_qq(aes(sample = .stdresid))

Checking independence of errors

# add the ID column to the table containing the residuals
augment_model$pid <- data$pid

# plot the ID numbers against the residuals
ggplot(data = augment_model, aes(x = pid, y = .resid)) + 
  geom_point() +  
  geom_smooth(se = F) +
  geom_hline(yintercept = 0)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Checking homogeneity of variance

plot(model, 1)

Interpreting the Model Output

Recall that we can examine the model output using different functions:

summary() & confint()
anova()

Let’s start with summary() and confint().

summary(model)

## 
## Call:
## lm(formula = sr_health ~ consc_c + exercise_c + consc_c * exercise_c, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.08743 -0.32427 -0.09771  0.27442  1.14760 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.12643    0.07675  40.737  < 2e-16 ***
## consc_c             0.18556    0.08125   2.284   0.0262 *  
## exercise_c          0.52607    0.05240  10.040 3.96e-14 ***
## consc_c:exercise_c -0.11806    0.05261  -2.244   0.0288 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5383 on 56 degrees of freedom
## Multiple R-squared:  0.7242, Adjusted R-squared:  0.7094 
## F-statistic:    49 on 3 and 56 DF,  p-value: 1.131e-15

confint(model)

##                          2.5 %      97.5 %
## (Intercept)         2.97268467  3.28016802
## consc_c             0.02280769  0.34831706
## exercise_c          0.42110698  0.63103639
## consc_c:exercise_c -0.22344427 -0.01267783

Interpreting Parameter Estimates in Models with Continuous Interaction Effects

The best way to interpret the meaning of each parameter estimate is to rearrange the full model formula to express the “simple” relationship between one of the predictors and the outcome variable at different levels of the second predictor.

The full estimate of the linear model equation is:

\[Health' = 3.13 + 0.19*Consc_C + 0.53*Exercise_C - 0.12*Consc_CxExercise_C\] * b0 = 3.13 * b1 = 0.19 * b2 = 0.53 * b3 = -0.12

Simple Relationship Between Predictor 1 and Y at Different Levels of Predictor 2

Let’s rearrange this formula to represent the simple relationship between conscientiousness and self-reported health at different levels of exercise.

\[Health' = (3.13 + 0.53*Exercise_C) + (0.19 - 0.12*Exercise_C)*Consc_C\]

b0 = 3.13, the y-intercept for the relationship between conscientiousness and self-reported health when exercise_c = 0 (so at the mean of exercise)
b2 = 0.53, the change in the y-intercept for the relationship between conscientiousness and self-reported health for every 1-unit increase in exercise
b1 = 0.19, the slope for the relationship between conscientiousness and self-reported when when exercise_c = 0 (so at the mean of exercise)
b3 = -0.12, the change in the slope for the relationship between conscientiousness and self-reported health for every 1-unit increase in exercise

Simple Relationship Between Predictor 2 and Y at Different Levels of Predictor 1

Now, you practice rearranging the full model to represent the simple relationship between exercise and self-reported health at different levels of conscientiousness.

Write the equation for the simple relationship between exercise and self-reported health by rearranging the terms in the full model equation below:

\[Health' = (3.13 + 0.19*Consc_C) + (0.53 - 0.12*Consc_C)*Exercise_C\]

Interpret the meaning of each of the parameter estimates in the context of this rearranged version of the full equation:

b0 = 3.13, the y-intercept for the relationship between exercise and self-reported health when consc_c = 0 (so at the mean of conscientiousness)
b2 = 0.19, the change in the y-intercept for the relationship between exercise and self-reported health for every 1-unit increase in conscientiousness
b1 = 0.53, the slope for the relationship between exercise and self-reported when when consc_c = 0 (so at the mean of conscientiousness)
b3 = -0.12, the change in the slope for the relationship between exercise and self-reported health for every 1-unit increase in conscientiousness

Question: Was there a significant interaction effect between conscientiousness and exercise? What does this mean?

Yes, the interaction effect between conscientiousness and exercise was significant, t(56) = -2.24, p = .029. This means that the relationship between conscientiousness and self-reported health significantly differs depending on the level of exercise at which we examine this simple relationship.

Next, let’s examine the Anova() output:

Anova(model, type = 3)

## Anova Table (Type III tests)
## 
## Response: sr_health
##                    Sum Sq Df   F value    Pr(>F)    
## (Intercept)        480.88  1 1659.5085 < 2.2e-16 ***
## consc_c              1.51  1    5.2165   0.02619 *  
## exercise_c          29.21  1  100.8021 3.955e-14 ***
## consc_c:exercise_c   1.46  1    5.0366   0.02879 *  
## Residuals           16.23 56                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We should come to the same conclusion about the significance of the interaction effect based on the Anova(model, type = 3) output as we did using the summary() output.

Question: Does Model A, which includes the interaction effect, make a significant improvement to the variance accounted for in self-reported health scores compared to Model C, which leaves out the interaction effect?

Yes, the inclusion of the interaction between conscientiousness and exercise significantly improved the variance accounted for in self-reported health scores, F(1, 56) = 5.04, p = .029.

By how much did Model A improve the variance accounted for compared to Model C? Let’s see by calculating the effect size.

Effect Size

We can get two measures of effect size, \(\eta^2\) and \(\eta^2_{Partial}\) by passing our model to the etaSquared() function.

Here’s a reminder of the formulas that are used to calculate each measure of effect size:

\[\eta^2 = SSR/SS_{Total}\]

\[\eta^2_{Partial} = SSR/SSE(C)\]

etaSquared(model, type = 3)

##                        eta.sq eta.sq.part
## consc_c            0.02569575  0.08521394
## exercise_c         0.49653717  0.64286191
## consc_c:exercise_c 0.02480955  0.08251756

Calculating the eta-squared values “from scratch”:

ssr_consc <- 1.51
ssr_exer <- 29.21
ssr_int <- 1.46

ss_total <- var(data$sr_health)*(nrow(data)-1)

ssr_consc/ss_total # .0257

## [1] 0.02566864

ssr_exer/ss_total # .4965

## [1] 0.4965436

ssr_int/ss_total # .0248

## [1] 0.02481868

Calculating the partial eta-squared values “from scratch”:

model <- lm(sr_health ~ consc_c + exercise_c + consc_c*exercise_c, data = data)
Anova(model, type = 3)

## Anova Table (Type III tests)
## 
## Response: sr_health
##                    Sum Sq Df   F value    Pr(>F)    
## (Intercept)        480.88  1 1659.5085 < 2.2e-16 ***
## consc_c              1.51  1    5.2165   0.02619 *  
## exercise_c          29.21  1  100.8021 3.955e-14 ***
## consc_c:exercise_c   1.46  1    5.0366   0.02879 *  
## Residuals           16.23 56                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

sse_c_consc <- 16.23 + 1.51 # SSE(A) + SSR
ssr_consc/sse_c_consc # .085

## [1] 0.08511838

sse_c_exer <- 16.23 + 29.21 # SSE(A) + SSR
ssr_exer/sse_c_exer # .643

## [1] 0.6428257

sse_c_int <- 16.23 + 1.46 # SSE(A) + SSR
ssr_int/sse_c_int # .083

## [1] 0.0825325

Unpacking the Interaction Effect

Plotting the Interaction

We can unpack the nature of the conscientiousness by exercise interaction effect using the plot_model() function from the sjPlot package.

When examining the relationship between a predictor (e.g., X1) and an outcome, Y, across different levels of a second predictor (e.g., X2), the second predictor variable is called a moderator.

Since the moderator variable is continuous, the researcher has to choose values of X2 at which to examine the relationship between X1 and Y. The most typical values that are chosen are +/- 1SD on X2.

Let’s plot the interaction effect between conscientiousness and self-reported health at the mean, +1 SD, and -1 SD on exercise.

plot_model(model = model,
           type = "int", # interaction
           mdrt.values = "meansd") # which values of the moderator variable to use

Question: What appears to be the nature of the interaction effect?

There appears to be a stronger, positive relationship between conscientiousness and self-reported health scores at low (-1SD) levels of exercise than at high (+1SD) levels of exercise. In other words, for people who exercise 1SD less than average, there is a positive relationship between their conscientiousness levels and self-reported health scores. For people who exercise 1SD more than average, there appears to be almost no relationship between their conscientiousness levels and their self-reported health scores.

Simple Slopes Analysis

In addition to visualization the interaction effect, we may also wish to know at which levels of exercise is there a significant relationship between conscientiousness and self-reported health.

Simple slopes analysis refers to analyzing the slope of the relationship between the first predictor (e.g., X1) and Y at specific values of the second predictor (e.g., X2).

We can perform a simple slopes analysis to test which of the slopes in our plot of the interaction effect are significantly, or non-significantly, different from zero using the simple_slopes() function from the reghelper package.

simple_slopes(model)

##    consc_c exercise_c Test Estimate Std. Error t value df  Pr(>|t|) Sig.
## 1 -0.97029     sstest        0.6406     0.0798  8.0232 56 7.081e-11  ***
## 2        0     sstest        0.5261     0.0524 10.0400 56 3.955e-14  ***
## 3  0.97029     sstest        0.4115     0.0658  6.2564 56 5.799e-08  ***
## 4   sstest  -1.509293        0.3638     0.1231  2.9543 56  0.004575   **
## 5   sstest          0        0.1856     0.0812  2.2840 56  0.026187    *
## 6   sstest   1.509293        0.0074     0.1032  0.0714 56  0.943294

The first two columns indicate at which level of each predictor the slope of the other predictor is being tested. sstest means that the simple slope for this variable is being tested.
The Tesst Estimate column provides the simple slope, b, for the predictor marked sstest
The remaining columns provide the t-statistic, df, SE, and p-value corresponding to a test of the significance of the simple slope on that row.

Question: Describe the significance and direction of the relationship between conscientiousness and self-reported health at each level of exercise that was tested (mean, -1SD, and +1SD levels of exercise).

At low levels of exercise (-1SD), there was a significant, positive relationship between conscientiousness and self-reported health, b = 0.36, t(56) = 2.95, p = .005.

At average levels of exercise, there was a significant, positive relationship between conscientiousness and self-reported health, b = 0.08, t(56) = 2.28, p = .026.

At high levels of exercise (+1SD), there was no significant relationship between conscientiousness and self-reported health and the simple slope was close to zero, b = 0.01, t(56) = 0.07, p = .943.

Lab 4 - Linear Regression with Continuous by Continuous Interactions