This week, we’ll be examining models that include both continuous and categorical predictors (but not their interactions - we’ll save that for next week). A model that examines the relationship between a categorical predictor and an outcome while controlling for a continuous covariate has traditionally been called an ANCOVA (Analysis of Covariance). We can think of these models more generally, though, as linear models that include both continuous and categorical predictors.

Research Scenario

Example: A group of clinical psychologists are interested in testing the effectiveness of therapy on patients’ coping skills. The researchers recruit participants who are currently undergoing no therapy (control condition) or cognitive behavioral therapy. The psychologists measure the participants’ coping skills using a survey with items like, “I look for creative ways to alter difficult situations.” Since the researchers did not randomly assign participants to treatment conditions, it’s possible that the participants will differ on variables other than whether they are currently undergoing therapy or not. This means the two groups could differ on other variables - like general resilience, which is the ability to bounce back from hard events - that could be having their own effects on participants’ coping skills. To control for general resilience, the researchers also administer a measure of general resilience that uses items like, “I tend to bounce back quickly after hard times.”

The data set we’re using in today’s lab contains the following variables:

treatment = Which type of therapy participants received
- Control
- Cognitive Behavioral Therapy (CBT)
resilience = Scores on a measure of general resilience
- Sample item: “I tend to bounce back quickly after hard times” from 1 (strongly disagree) to 3 (strongly agree)
coping_skills = Scores on a measure of coping skills
- Sample item: “I look for creative ways to alter difficult situations.” from 0 (does not describe me at all) to 20 (describes me extremely well)

The researchers are particularly interested in whether treatment significantly predicts coping_skills when controlling for the covariate resilience.

Import Data

data <- import("coping_skills.csv")

Data Cleaning & Wrangling

First, let’s check the measure type that each variable was imported as.

str(data)

## 'data.frame':    40 obs. of  3 variables:
##  $ treatment    : chr  "Control" "Control" "Control" "Control" ...
##  $ resilience   : int  3 6 3 5 6 5 4 4 4 2 ...
##  $ coping_skills: int  1 2 1 0 2 1 1 1 0 0 ...

Convert the variable’s measures types as needed.

data <- data %>%
  mutate(treatment = factor(treatment, levels = c("Control","CBT"))) # we're using the levels argument to specify the order we'd like the groups to be in (*not* to be confused with the labels argument)

Descriptive Statistics

Let’s report the mean and standard deviation for coping skills and resilience across the two treatment groups in a table.

descriptives <- data %>%
  group_by(treatment) %>%
  summarize(n = n(),
            M_Resilience = mean(resilience, na.rm = TRUE),
            SD_Resilience = sd(resilience, na.rm = TRUE),
            M_Coping = mean(coping_skills, na.rm = TRUE),
            SD_Coping = sd(coping_skills, na.rm = TRUE))

descriptives

## # A tibble: 2 × 6
##   treatment     n M_Resilience SD_Resilience M_Coping SD_Coping
##   <fct>     <int>        <dbl>         <dbl>    <dbl>     <dbl>
## 1 Control      20          4.1          1.17      1       0.725
## 2 CBT          20          4.6          1.23     14.5     1.15

# Making the table look a little nicer
descriptives %>%
  knitr::kable(digits = 2, col.names = c("Treatment", "n", "Mean Resilience", "SD Resilience", "Mean Coping Skills", "SD Coping Skills"), caption = "Descriptive Statistics", format.args = list(nsmall = 2))

Descriptive Statistics
Treatment	n	Mean Resilience	SD Resilience	Mean Coping Skills	SD Coping Skills
Control	20.00	4.10	1.17	1.00	0.73
CBT	20.00	4.60	1.23	14.50	1.15

Coding the Categorical Predictor

Recall from PSY 611 that, in order to represent a categorical predictor in the model, we must code it first. Let’s use contrast codes to represent the treatment variable in the model.

After contrast coding a categorical predictor, it’s always a good idea to look at the resulting codes so you are aware of which numerical value has been assigned to each group.

TreatmentCC <- c(-1/2, 1/2)

contrasts(data$treatment) <- TreatmentCC

contrasts(data$treatment)

##         [,1]
## Control -0.5
## CBT      0.5

Centering the Continuous Predictor

We are also going to stick with centering the continuous predictor. Remember that the strengths of centering the continuous predictor include 1) a more meaningful y-intercept value, and 2) reduced multicollinearity between predictors and any continuous by continuous interactions composed from them.

The second strength doesn’t apply in this example, but we’ll center resilience so that we have a more meaningful y-intercept value.

data$resilience_c <- scale(data$resilience, center = TRUE, scale = FALSE)

Model Comparison

This research question corresponds to the following model comparison:

\[ModelA: CopingSkills_i = \beta_0 + \beta_1*TreatmentCC + \beta_2*ResilienceC + \epsilon_i\]

\[ModelA: CopingSkills_i = \beta_0 + \beta_2*ResilienceC + \epsilon_i\]

Null Hypothesis

The null hypothesis corresponding to this model comparison is:

\[H_0: \beta_1\ = 0\]

Fit the Model

Fit a linear model predicting coping_skills from treatment (the categorical predictor which we assigned contrast codes to) and resilience_c (the centered continuous predictor). Call the object model.

model <- lm(coping_skills ~ treatment + resilience_c, data = data)

Check whether model assumptions were met

Next, let’s check whether the assumptions of 1) normally distributed errors, 2) independence of errors, and 3) homogeneity of variances were met by the current model.

Checking whether errors are normally distributed

Distribution of the Residuals

# storing table containing residuals
augment_model <- augment(model)
augment_model

## # A tibble: 40 × 9
##    coping_skills treatment resilience_c[,1] .fitted  .resid   .hat .sigma
##            <int> <fct>                <dbl>   <dbl>   <dbl>  <dbl>  <dbl>
##  1             1 Control             -1.35    0.577  0.423  0.0722  0.862
##  2             2 Control              1.65    1.73   0.269  0.116   0.863
##  3             1 Control             -1.35    0.577  0.423  0.0722  0.862
##  4             0 Control              0.65    1.35  -1.35   0.0648  0.833
##  5             2 Control              1.65    1.73   0.269  0.116   0.863
##  6             1 Control              0.65    1.35  -0.346  0.0648  0.863
##  7             1 Control             -0.350   0.962  0.0385 0.0502  0.865
##  8             1 Control             -0.350   0.962  0.0385 0.0502  0.865
##  9             0 Control             -0.350   0.962 -0.962  0.0502  0.849
## 10             0 Control             -2.35    0.192 -0.192  0.131   0.864
## # ℹ 30 more rows
## # ℹ 2 more variables: .cooksd <dbl>, .std.resid <dbl>

# plotting histogram of residuals
ggplot(data = augment_model, aes(x = .resid)) + 
  geom_density(fill = "purple") + # histogram of the residuals
  stat_function(linetype = 2, # a normal distribution overlaid on top for reference
                fun = dnorm, 
                args = list(mean = mean(augment_model$.resid), 
                                sd   =   sd(augment_model$.resid)))

Q-Q Plot

ggplot(model) +
  geom_abline() + 
  stat_qq(aes(sample = .stdresid))

Checking independence of errors

# add ID column to the data set
data$ID <- c(1:nrow(data))

# add the ID column to the table containing the residuals
augment_model$ID <- data$ID

# plot the ID numbers against the residuals
ggplot(data = augment_model, aes(x = ID, y = .resid)) + 
  geom_point() +  
  geom_smooth(se = F) +
  geom_hline(yintercept = 0)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Checking homogeneity of variance

plot(model, 1)

Checking the Assumption of Homogeneity of Regression

Recall that when no interaction effect is included in the model, an assumption is being made that there is no interaction between the predictors. The traditional ANCOVA analysis makes this an explicit assumption that has to be tested. That is, an ANCOVA analysis assumes that the relationship between the continuous covariate (e.g., resilience) and outcome variable (e.g., coping skills) is the same across all levels of the categorical predictor (e.g., treatment).

To test this assumption, we need to test the significance of the continuous by categorical interaction effect.

model_int <- lm(coping_skills ~ treatment*resilience_c, data = data)

summary(model_int)

## 
## Call:
## lm(formula = coping_skills ~ treatment * resilience_c, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2292 -0.4264  0.3194  0.4809  1.3194 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               7.7323     0.1391  55.575  < 2e-16 ***
## treatment1               13.3096     0.2783  47.831  < 2e-16 ***
## resilience_c              0.3807     0.1166   3.265  0.00241 ** 
## treatment1:resilience_c   0.1413     0.2332   0.606  0.54840    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8604 on 36 degrees of freedom
## Multiple R-squared:  0.9857, Adjusted R-squared:  0.9845 
## F-statistic: 824.4 on 3 and 36 DF,  p-value: < 2.2e-16

Anova(model_int, type = 3)

## Anova Table (Type III tests)
## 
## Response: coping_skills
##                         Sum Sq Df   F value    Pr(>F)    
## (Intercept)            2286.55  1 3088.6173 < 2.2e-16 ***
## treatment              1693.68  1 2287.7831 < 2.2e-16 ***
## resilience_c              7.89  1   10.6587  0.002407 ** 
## treatment:resilience_c    0.27  1    0.3671  0.548403    
## Residuals                26.65 36                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Question: Is the interaction between treatment and resilience significant?

No, F(1, 36) = 0.37, p = .548.

Question: Has the assumption of homogeneity of regression been met?

Yes, because the continuousXcategorical interaction effect is non-significant.

Interpreting the Model Output

Recall that we can examine the model output using different functions:

summary()
confint()
Anova()
etaSquared()

Let’s start with summary() and confint().

summary(model)

## 
## Call:
## lm(formula = coping_skills ~ treatment + resilience_c, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2692 -0.4038  0.2692  0.5192  1.3461 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.7500     0.1349  57.461  < 2e-16 ***
## treatment1    13.3077     0.2759  48.241  < 2e-16 ***
## resilience_c   0.3846     0.1154   3.332  0.00197 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.853 on 37 degrees of freedom
## Multiple R-squared:  0.9855, Adjusted R-squared:  0.9847 
## F-statistic:  1258 on 2 and 37 DF,  p-value: < 2.2e-16

confint(model)

##                  2.5 %     97.5 %
## (Intercept)   7.476717  8.0232827
## treatment1   12.748754 13.8666307
## resilience_c  0.150707  0.6185238

The parameter estimates are calculated using adjusted means on coping skills. That is, the average coping skills for each condition of treatment (control and CBT) are calculated adjusting for the relationship between resilience and coping skills. These adjusted means are calculated using the following formula:

\[\bar{Y}'_k = \bar{Y}_k - b_z(\bar{Z}_k - \bar{Z})\]

\(\bar{Y}'_k\) = adjusted mean on the outcome variable for condition k of the categorical predictor
\(\bar{Y}_k\) = original mean on the outcome variable for condition k of the categorical predictor
\(b_z\) = the parameter estimate corresponding to the continuous covariate from the model output
\(\bar{Z}_k\) = the mean on the continuous covariate for condition k of the categorical predictor
\(\bar{Z}\) = the mean of the group’s means on the continuous covariate

Let’s interpret the meaning of each parameter estimate:

b0 = 7.75, the y-intercept is equal to the mean of the group means on the outcome variable, coping skills, because we contrast coded the categorical predictor and mean-centered the continuous covariate

descriptives

## # A tibble: 2 × 6
##   treatment     n M_Resilience SD_Resilience M_Coping SD_Coping
##   <fct>     <int>        <dbl>         <dbl>    <dbl>     <dbl>
## 1 Control      20          4.1          1.17      1       0.725
## 2 CBT          20          4.6          1.23     14.5     1.15

(14.5+1)/2 # the average of the Control and CBT condition's means on coping skills

## [1] 7.75

b1 = 13.31, the adjusted means on coping skills controlling for resilience for the CBT condition minus the Control condition

descriptives

## # A tibble: 2 × 6
##   treatment     n M_Resilience SD_Resilience M_Coping SD_Coping
##   <fct>     <int>        <dbl>         <dbl>    <dbl>     <dbl>
## 1 Control      20          4.1          1.17      1       0.725
## 2 CBT          20          4.6          1.23     14.5     1.15

# The mean of the group means on resilience = 4.35
(4.6+4.1)/2

## [1] 4.35

# Adjusted mean for CBT condition 
cbt_adj <- 14.5 - 0.3846*(4.6-4.35)
cbt_adj

## [1] 14.40385

# Adjusted mean for Control condition 
control_adj <- 1 - 0.3846*(4.1-4.35)
control_adj

## [1] 1.09615

cbt_adj - control_adj

## [1] 13.3077

b2 = 0.38, the predicted increase in coping skills per 1-unit increase in resilience when controlling for treatment

Let’s also examine the Anova() and etaSquared() output:

Anova(model, type = 3)

## Anova Table (Type III tests)
## 
## Response: coping_skills
##               Sum Sq Df F value    Pr(>F)    
## (Intercept)  2402.50  1  3301.7 < 2.2e-16 ***
## treatment    1693.41  1  2327.2 < 2.2e-16 ***
## resilience_c    8.08  1    11.1  0.001967 ** 
## Residuals      26.92 37                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

etaSquared(model, type = 3)

##                   eta.sq eta.sq.part
## treatment    0.911660658   0.9843501
## resilience_c 0.004348276   0.2307692

Question: Was treatment a significant predictor of coping skills when controlling for resilience?

Yes, people receiving CBT therapy (M = 14.50, SD = 1.15, \(M_{Adj}\) = 13.31) scored significantly higher on coping skills than people receiving no therapy (M = 1.00, SD = 1.17, \(M_{Adj}\) = 1.10) even when controlling for resilience, \(b_1\) = 13.31, 95%CI[12.75, 13.89], t(37) = 48.24, p < .001, \(\eta^2\) = 0.91, \(\eta^2_{Partial}\) = 0.98.

Yes, people receiving CBT therapy (M = 14.50, SD = 1.15, \(M_{Adj}\) = 13.31) scored significantly higher on coping skills than people receiving no therapy (M = 1.00, SD = 1.17, \(M_{Adj}\) = 1.10) even when controlling for resilience, \(b_1\) = 13.31, 95%CI[12.75, 13.89], F(1, 37) = 2327.23, p < .001, \(\eta^2\) = 0.91, \(\eta^2_{Partial}\) = 0.98.

Note: The t-statistic and F-statistic are redundant, so you can choose which you would prefer to report between the two.

Adjusted Means (aka, Estimated Marginal Means)

Adjusted means are also called estimated marginal means. Rather than having to calculate the adjusted means for each group by hand, we can use the emmeans() function to get the adjusted means for each group when controlling for resilience.

emmeans(model, specs = c("treatment","resilience_c"))

##  treatment resilience_c emmean    SE df lower.CL upper.CL
##  Control       3.55e-16    1.1 0.193 37    0.705     1.49
##  CBT           3.55e-16   14.4 0.193 37   14.013    14.79
## 
## Confidence level used: 0.95

Question: Why are estimated marginal means important?

In our model output, the row corresponding to our categorical predictor, treatment, in the summary() and Anova() output corresponds to a test of the significance of the difference between the adjusted means for the Control versus CBT group when controlling for resilience.

Adjusted means (or estimated marginal means) illustrate what we would have expected the average coping skills of participants in the control and CBT conditions to be equal to if levels of resilience were equal across groups.

Plotting the Adjusted Means

We can plot the adjusted means on coping skills for the Control and CBT conditions using the emmip() function from the emmeans() package.

emmip(model, ~ treatment, CIs = TRUE, xlab = "Treatment", ylab = "Adjusted Means on Coping Skills")

# If you want to overlay the data on top, add geom_point() 
emmip(model, ~ treatment, CIs = TRUE, xlab = "Treatment", ylab = "Adjusted Means on Coping Skills") + geom_point(data = data, aes(x = treatment, y = coping_skills))

Lab 6 - Linear Regression with Continuous and Categorical Predictors