This week, we’ll be examining models that include both continuous and categorical predictors (but not their interactions - we’ll save that for next week). A model that examines the relationship between a categorical predictor and an outcome while controlling for a continuous covariate has traditionally been called an ANCOVA (Analysis of Covariance). We can think of these models more generally, though, as linear models that include both continuous and categorical predictors.
Example: A group of clinical psychologists are interested in testing the effectiveness of therapy on patients’ coping skills. The researchers recruit participants who are currently undergoing no therapy (control condition) or cognitive behavioral therapy. The psychologists measure the participants’ coping skills using a survey with items like, “I look for creative ways to alter difficult situations.” Since the researchers did not randomly assign participants to treatment conditions, it’s possible that the participants will differ on variables other than whether they are currently undergoing therapy or not. This means the two groups could differ on other variables - like general resilience, which is the ability to bounce back from hard events - that could be having their own effects on participants’ coping skills. To control for general resilience, the researchers also administer a measure of general resilience that uses items like, “I tend to bounce back quickly after hard times.”
The data set we’re using in today’s lab contains the following variables:
The researchers are particularly interested in whether
treatment significantly predicts coping_skills
when controlling for the covariate resilience.
data <- import("coping_skills.csv")
First, let’s check the measure type that each variable was imported as.
str(data)
## 'data.frame': 40 obs. of 3 variables:
## $ treatment : chr "Control" "Control" "Control" "Control" ...
## $ resilience : int 3 6 3 5 6 5 4 4 4 2 ...
## $ coping_skills: int 1 2 1 0 2 1 1 1 0 0 ...
Convert the variable’s measures types as needed.
data <- data %>%
mutate(treatment = factor(treatment, levels = c("Control","CBT"))) # we're using the levels argument to specify the order we'd like the groups to be in (*not* to be confused with the labels argument)
Let’s report the mean and standard deviation for coping skills and resilience across the two treatment groups in a table.
descriptives <- data %>%
group_by(treatment) %>%
summarize(n = n(),
M_Resilience = mean(resilience, na.rm = TRUE),
SD_Resilience = sd(resilience, na.rm = TRUE),
M_Coping = mean(coping_skills, na.rm = TRUE),
SD_Coping = sd(coping_skills, na.rm = TRUE))
descriptives
## # A tibble: 2 × 6
## treatment n M_Resilience SD_Resilience M_Coping SD_Coping
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Control 20 4.1 1.17 1 0.725
## 2 CBT 20 4.6 1.23 14.5 1.15
# Making the table look a little nicer
descriptives %>%
knitr::kable(digits = 2, col.names = c("Treatment", "n", "Mean Resilience", "SD Resilience", "Mean Coping Skills", "SD Coping Skills"), caption = "Descriptive Statistics", format.args = list(nsmall = 2))
| Treatment | n | Mean Resilience | SD Resilience | Mean Coping Skills | SD Coping Skills |
|---|---|---|---|---|---|
| Control | 20.00 | 4.10 | 1.17 | 1.00 | 0.73 |
| CBT | 20.00 | 4.60 | 1.23 | 14.50 | 1.15 |
Recall from PSY 611 that, in order to represent a categorical
predictor in the model, we must code it first. Let’s use contrast codes
to represent the treatment variable in the model.
After contrast coding a categorical predictor, it’s always a good idea to look at the resulting codes so you are aware of which numerical value has been assigned to each group.
TreatmentCC <- c(-1/2, 1/2)
contrasts(data$treatment) <- TreatmentCC
contrasts(data$treatment)
## [,1]
## Control -0.5
## CBT 0.5
We are also going to stick with centering the continuous predictor. Remember that the strengths of centering the continuous predictor include 1) a more meaningful y-intercept value, and 2) reduced multicollinearity between predictors and any continuous by continuous interactions composed from them.
The second strength doesn’t apply in this example, but we’ll center resilience so that we have a more meaningful y-intercept value.
data$resilience_c <- scale(data$resilience, center = TRUE, scale = FALSE)
This research question corresponds to the following model comparison:
\[ModelA: CopingSkills_i = \beta_0 + \beta_1*TreatmentCC + \beta_2*ResilienceC + \epsilon_i\]
\[ModelA: CopingSkills_i = \beta_0 + \beta_2*ResilienceC + \epsilon_i\]
The null hypothesis corresponding to this model comparison is:
\[H_0: \beta_1\ = 0\]
Fit a linear model predicting coping_skills from
treatment (the categorical predictor which we assigned
contrast codes to) and resilience_c (the centered
continuous predictor). Call the object model.
model <- lm(coping_skills ~ treatment + resilience_c, data = data)
Next, let’s check whether the assumptions of 1) normally distributed errors, 2) independence of errors, and 3) homogeneity of variances were met by the current model.
# storing table containing residuals
augment_model <- augment(model)
augment_model
## # A tibble: 40 × 9
## coping_skills treatment resilience_c[,1] .fitted .resid .hat .sigma
## <int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Control -1.35 0.577 0.423 0.0722 0.862
## 2 2 Control 1.65 1.73 0.269 0.116 0.863
## 3 1 Control -1.35 0.577 0.423 0.0722 0.862
## 4 0 Control 0.65 1.35 -1.35 0.0648 0.833
## 5 2 Control 1.65 1.73 0.269 0.116 0.863
## 6 1 Control 0.65 1.35 -0.346 0.0648 0.863
## 7 1 Control -0.350 0.962 0.0385 0.0502 0.865
## 8 1 Control -0.350 0.962 0.0385 0.0502 0.865
## 9 0 Control -0.350 0.962 -0.962 0.0502 0.849
## 10 0 Control -2.35 0.192 -0.192 0.131 0.864
## # ℹ 30 more rows
## # ℹ 2 more variables: .cooksd <dbl>, .std.resid <dbl>
# plotting histogram of residuals
ggplot(data = augment_model, aes(x = .resid)) +
geom_density(fill = "purple") + # histogram of the residuals
stat_function(linetype = 2, # a normal distribution overlaid on top for reference
fun = dnorm,
args = list(mean = mean(augment_model$.resid),
sd = sd(augment_model$.resid)))
ggplot(model) +
geom_abline() +
stat_qq(aes(sample = .stdresid))
# add ID column to the data set
data$ID <- c(1:nrow(data))
# add the ID column to the table containing the residuals
augment_model$ID <- data$ID
# plot the ID numbers against the residuals
ggplot(data = augment_model, aes(x = ID, y = .resid)) +
geom_point() +
geom_smooth(se = F) +
geom_hline(yintercept = 0)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
plot(model, 1)
Recall that when no interaction effect is included in the model, an assumption is being made that there is no interaction between the predictors. The traditional ANCOVA analysis makes this an explicit assumption that has to be tested. That is, an ANCOVA analysis assumes that the relationship between the continuous covariate (e.g., resilience) and outcome variable (e.g., coping skills) is the same across all levels of the categorical predictor (e.g., treatment).
To test this assumption, we need to test the significance of the continuous by categorical interaction effect.
model_int <- lm(coping_skills ~ treatment*resilience_c, data = data)
summary(model_int)
##
## Call:
## lm(formula = coping_skills ~ treatment * resilience_c, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2292 -0.4264 0.3194 0.4809 1.3194
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.7323 0.1391 55.575 < 2e-16 ***
## treatment1 13.3096 0.2783 47.831 < 2e-16 ***
## resilience_c 0.3807 0.1166 3.265 0.00241 **
## treatment1:resilience_c 0.1413 0.2332 0.606 0.54840
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8604 on 36 degrees of freedom
## Multiple R-squared: 0.9857, Adjusted R-squared: 0.9845
## F-statistic: 824.4 on 3 and 36 DF, p-value: < 2.2e-16
Anova(model_int, type = 3)
## Anova Table (Type III tests)
##
## Response: coping_skills
## Sum Sq Df F value Pr(>F)
## (Intercept) 2286.55 1 3088.6173 < 2.2e-16 ***
## treatment 1693.68 1 2287.7831 < 2.2e-16 ***
## resilience_c 7.89 1 10.6587 0.002407 **
## treatment:resilience_c 0.27 1 0.3671 0.548403
## Residuals 26.65 36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Question: Is the interaction between treatment and resilience significant?
No, F(1, 36) = 0.37, p = .548.
Question: Has the assumption of homogeneity of regression been met?
Yes, because the continuousXcategorical interaction effect is non-significant.
Recall that we can examine the model output using different functions:
Let’s start with summary() and
confint().
summary(model)
##
## Call:
## lm(formula = coping_skills ~ treatment + resilience_c, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2692 -0.4038 0.2692 0.5192 1.3461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.7500 0.1349 57.461 < 2e-16 ***
## treatment1 13.3077 0.2759 48.241 < 2e-16 ***
## resilience_c 0.3846 0.1154 3.332 0.00197 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.853 on 37 degrees of freedom
## Multiple R-squared: 0.9855, Adjusted R-squared: 0.9847
## F-statistic: 1258 on 2 and 37 DF, p-value: < 2.2e-16
confint(model)
## 2.5 % 97.5 %
## (Intercept) 7.476717 8.0232827
## treatment1 12.748754 13.8666307
## resilience_c 0.150707 0.6185238
The parameter estimates are calculated using adjusted means on coping skills. That is, the average coping skills for each condition of treatment (control and CBT) are calculated adjusting for the relationship between resilience and coping skills. These adjusted means are calculated using the following formula:
\[\bar{Y}'_k = \bar{Y}_k - b_z(\bar{Z}_k - \bar{Z})\]
Let’s interpret the meaning of each parameter estimate:
descriptives
## # A tibble: 2 × 6
## treatment n M_Resilience SD_Resilience M_Coping SD_Coping
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Control 20 4.1 1.17 1 0.725
## 2 CBT 20 4.6 1.23 14.5 1.15
(14.5+1)/2 # the average of the Control and CBT condition's means on coping skills
## [1] 7.75
descriptives
## # A tibble: 2 × 6
## treatment n M_Resilience SD_Resilience M_Coping SD_Coping
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Control 20 4.1 1.17 1 0.725
## 2 CBT 20 4.6 1.23 14.5 1.15
# The mean of the group means on resilience = 4.35
(4.6+4.1)/2
## [1] 4.35
# Adjusted mean for CBT condition
cbt_adj <- 14.5 - 0.3846*(4.6-4.35)
cbt_adj
## [1] 14.40385
# Adjusted mean for Control condition
control_adj <- 1 - 0.3846*(4.1-4.35)
control_adj
## [1] 1.09615
cbt_adj - control_adj
## [1] 13.3077
Let’s also examine the Anova() and
etaSquared() output:
Anova(model, type = 3)
## Anova Table (Type III tests)
##
## Response: coping_skills
## Sum Sq Df F value Pr(>F)
## (Intercept) 2402.50 1 3301.7 < 2.2e-16 ***
## treatment 1693.41 1 2327.2 < 2.2e-16 ***
## resilience_c 8.08 1 11.1 0.001967 **
## Residuals 26.92 37
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
etaSquared(model, type = 3)
## eta.sq eta.sq.part
## treatment 0.911660658 0.9843501
## resilience_c 0.004348276 0.2307692
Question: Was treatment a significant predictor of coping skills when controlling for resilience?
Yes, people receiving CBT therapy (M = 14.50, SD = 1.15, \(M_{Adj}\) = 13.31) scored significantly higher on coping skills than people receiving no therapy (M = 1.00, SD = 1.17, \(M_{Adj}\) = 1.10) even when controlling for resilience, \(b_1\) = 13.31, 95%CI[12.75, 13.89], t(37) = 48.24, p < .001, \(\eta^2\) = 0.91, \(\eta^2_{Partial}\) = 0.98.
or
Yes, people receiving CBT therapy (M = 14.50, SD = 1.15, \(M_{Adj}\) = 13.31) scored significantly higher on coping skills than people receiving no therapy (M = 1.00, SD = 1.17, \(M_{Adj}\) = 1.10) even when controlling for resilience, \(b_1\) = 13.31, 95%CI[12.75, 13.89], F(1, 37) = 2327.23, p < .001, \(\eta^2\) = 0.91, \(\eta^2_{Partial}\) = 0.98.
Note: The t-statistic and F-statistic are redundant, so you can choose which you would prefer to report between the two.
Adjusted means are also called estimated marginal
means. Rather than having to calculate the adjusted means for
each group by hand, we can use the emmeans() function to
get the adjusted means for each group when controlling for
resilience.
emmeans(model, specs = c("treatment","resilience_c"))
## treatment resilience_c emmean SE df lower.CL upper.CL
## Control 3.55e-16 1.1 0.193 37 0.705 1.49
## CBT 3.55e-16 14.4 0.193 37 14.013 14.79
##
## Confidence level used: 0.95
Question: Why are estimated marginal means important?
In our model output, the row corresponding to our categorical
predictor, treatment, in the summary() and
Anova() output corresponds to a test of the significance of
the difference between the adjusted means for the
Control versus CBT group when controlling for resilience.
Adjusted means (or estimated marginal means) illustrate what we would have expected the average coping skills of participants in the control and CBT conditions to be equal to if levels of resilience were equal across groups.
We can plot the adjusted means on coping skills for the
Control and CBT conditions using the
emmip() function from the emmeans()
package.
emmip(model, ~ treatment, CIs = TRUE, xlab = "Treatment", ylab = "Adjusted Means on Coping Skills")
# If you want to overlay the data on top, add geom_point()
emmip(model, ~ treatment, CIs = TRUE, xlab = "Treatment", ylab = "Adjusted Means on Coping Skills") + geom_point(data = data, aes(x = treatment, y = coping_skills))