This lesson describes how to employ categorical variables in a multiplie linear regression analysis. When the effect of a predictor differs among the levels of a categorical predictor, the predictors “interact”. Creating interaction terms enables tests of whether interaction is present and the effect of the interation.

Categorical Predictors

A study compared the effectiveness of three treatments (A, B, and C) for severe depression. Data set depr (depression.txt) is a random sample of n = 36 severely depressed individuals:

y = effectiveness of the treatment for subject
age = age (in years) of subject
a = 1 if individual received treatment A and 0, if not
b = 1 if individual received treatment B and 0, if not.

library(readr)
library(dplyr)

depr <- read_tsv(file = url("https://newonlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/depression/index.txt"))

names(depr)[names(depr) == 'x2'] <- 'a'
names(depr)[names(depr) == 'x3'] <- 'b'

depr$a <- factor(depr$a)
depr$b <- factor(depr$b)
depr$TRT <- factor(depr$TRT)

glimpse(depr)

## Observations: 36
## Variables: 5
## $ y   <int> 56, 41, 40, 28, 55, 25, 46, 71, 48, 63, 52, 62, 50, 45, 58...
## $ age <int> 21, 23, 30, 19, 28, 23, 33, 67, 42, 33, 33, 56, 45, 43, 38...
## $ a   <fct> 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0...
## $ b   <fct> 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1...
## $ TRT <fct> A, B, B, C, A, C, B, C, B, A, A, C, C, B, A, C, B, C, A, B...

A plot of the response variable against age grouped by the treatment levels does not produce three parallel lines, so it is appropriate to model interaction effects between age and the treatments. The effect of the individual’s age on the treatment’s mean effectiveness depends on the treatment, or put another way, the effect of treatment on the treatment’s mean effectiveness depends on the individual’s age.

library(ggplot2)

ggplot(depr, aes(y = y, x = age, color = TRT)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("y vs age")

Formulate the model

Formulate a second-order multiple regression model with interaction terms:

\[y = \beta_0 + \beta_1 age + \beta_2 a + \beta_3 b + \beta_{12} age*a + \beta{13} age*b + \epsilon\] where the independent error terms \(\epsilon\) follow a normal distribution with mean 0 and equal variance \(\sigma^2\).

Estimate the model

depr_lm <- lm(y ~ age + a + b + age*a + age*b, data = depr)

summary(depr_lm)

## 
## Call:
## lm(formula = y ~ age + a + b + age * a + age * b, data = depr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4366 -2.7637  0.1887  2.9075  6.5634 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.21138    3.34964   1.854 0.073545 .  
## age          1.03339    0.07233  14.288 6.34e-15 ***
## a1          41.30421    5.08453   8.124 4.56e-09 ***
## b1          22.70682    5.09097   4.460 0.000106 ***
## age:a1      -0.70288    0.10896  -6.451 3.98e-07 ***
## age:b1      -0.50971    0.11039  -4.617 6.85e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.925 on 30 degrees of freedom
## Multiple R-squared:  0.9143, Adjusted R-squared:  0.9001 
## F-statistic: 64.04 on 5 and 30 DF,  p-value: 4.264e-15

By applying the dummy variable values, you can break regression equation into three treatment groups:

Treatment A: \(y = (6.2 + 41.3) + (1.03 - 0.70) age\)

Treatment B: \(y = (6.2 + 22.7) + (1.03 - 0.51) age\)

Treatment C: \(y = 6.2 + 1.03 age\)

The interpretation of these results is that for patients in this study receiving treatment A, the effectiveness of the treatment is predicted to increase 0.33 (1.03 - 0.70) units for every additional year in age, and so on for B, and C.

Evaluate the model

The residuals vs fits plot exhibits all of the “good” behavior, suggesting that the model fits well, there are no obvious outliers, and the error variances are indeed constant. (Use the standardized residual, the residual divided by its standard deviation.)

library(broom)  # for augment()

depr_aug <- augment(depr_lm)
ggplot(data = depr_aug, aes(y = .std.resid, x = .fitted)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_hline(yintercept = -2) +
  geom_hline(yintercept = 2) +
  ggtitle("Standardized Residuals vs Fits")

The normal probability plot exhibits linear trend suggesting that the error terms are indeed normally distributed.

qqnorm(depr_aug$.resid)
qqline(depr_aug$.resid)

For a more quantitative analysis, use the Anderson-Darling normality test. The test p-value is the probability of calculating the test statistic if the distribution is normal. The p-value is .3345, so do not reject \(H_0\).

library(nortest)
ad.test(depr_aug$.resid)

## 
##  Anderson-Darling normality test
## 
## data:  depr_aug$.resid
## A = 0.40575, p-value = 0.3345

Using the model

For every age, is there a difference in the mean effectiveness for the three treatments? Test the null hypothesis \(H_0 : \beta_2 = \beta_3 = \beta_{12} = \beta_{13} = 0\) against the alternative \(H_A\) : at least one of these slope parameters is not 0.

summary(aov(depr_lm))

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## age          1   3424    3424 222.295 2.06e-15 ***
## a            1    804     804  52.178 4.86e-08 ***
## b            1      1       1   0.077    0.783    
## age:a        1    375     375  24.343 2.81e-05 ***
## age:b        1    328     328  21.319 6.85e-05 ***
## Residuals   30    462      15                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# check why the following code does not work.
# see notes here: 
# https://mcfromnz.wordpress.com/2011/03/02/anova-type-iiiiii-ss-explained/
# https://stats.stackexchange.com/questions/4639/interpreting-the-drop1-output-in-r
options(contrasts = c("contr.sum","contr.poly"))
lm2 <- lm(y ~ age + a + b + age*a + age*b, data = depr)
drop1(lm2, .~., test = "F")

## Single term deletions
## 
## Model:
## y ~ age + a + b + age * a + age * b
##        Df Sum of Sq     RSS    AIC F value    Pr(>F)    
## <none>               462.15 103.89                      
## age     1    826.72 1288.86 138.81  53.666 3.694e-08 ***
## a       1   1016.59 1478.74 143.75  65.991 4.555e-09 ***
## b       1    306.46  768.61 120.20  19.894 0.0001064 ***
## age:a   1    641.06 1103.21 133.21  41.614 3.982e-07 ***
## age:b   1    328.42  790.57 121.21  21.319 6.850e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F test statistic is \[F(4, 30) = \frac{(804 + 1 + 375 + 328) / 4}{462 / 30} = 24.49\]

The probability observing an F-statistic of F = 24.49 with 4 numerator df and 30 denominator df is 4.436433\times 10^{-9}. Reject our null hypothesis. There is sufficient evidence at the \(\alpha = 0.05\) level to conclude that there is a significant difference in the mean effectiveness for the three treatments.

Does the effect of age on the treatment’s effectiveness depend on treatment? Test whether the two interaction parameters \(\beta_{age:a}\) and \(\beta_{age:b}\) are significant, \(H0 : \beta_{age:a} = \beta_{age:b} = 0\) against the alternative \(H_A\) : at least one of the interaction parameters is not 0.

The F test statistic is \[F(4, 30) = \frac{(375 + 328) / 2}{462 / 30} = 22.84\]

The probability observing an F-statistic of F = 24.49 with 4 numerator df and 30 denominator df is 9.3778217\times 10^{-7}. Reject our null hypothesis. There is sufficient evidence at the \(\alpha = 0.05\) level to conclude that the effect of age on the treatment’s effectiveness depends on the treatment.

References

PSU STAT501.

Multiple Linear Regression: Categorical Predictors