This lesson describes how to employ categorical variables in a multiplie linear regression analysis. When the effect of a predictor differs among the levels of a categorical predictor, the predictors “interact”. Creating interaction terms enables tests of whether interaction is present and the effect of the interation.
A study compared the effectiveness of three treatments (A, B, and C) for severe depression. Data set depr
(depression.txt) is a random sample of n = 36
severely depressed individuals:
y
= effectiveness of the treatment for subjectage
= age (in years) of subjecta
= 1 if individual received treatment A and 0, if notb
= 1 if individual received treatment B and 0, if not.library(readr)
library(dplyr)
depr <- read_tsv(file = url("https://newonlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/depression/index.txt"))
names(depr)[names(depr) == 'x2'] <- 'a'
names(depr)[names(depr) == 'x3'] <- 'b'
depr$a <- factor(depr$a)
depr$b <- factor(depr$b)
depr$TRT <- factor(depr$TRT)
glimpse(depr)
## Observations: 36
## Variables: 5
## $ y <int> 56, 41, 40, 28, 55, 25, 46, 71, 48, 63, 52, 62, 50, 45, 58...
## $ age <int> 21, 23, 30, 19, 28, 23, 33, 67, 42, 33, 33, 56, 45, 43, 38...
## $ a <fct> 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0...
## $ b <fct> 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1...
## $ TRT <fct> A, B, B, C, A, C, B, C, B, A, A, C, C, B, A, C, B, C, A, B...
A plot of the response variable against age
grouped by the treatment levels does not produce three parallel lines, so it is appropriate to model interaction effects between age and the treatments. The effect of the individual’s age on the treatment’s mean effectiveness depends on the treatment, or put another way, the effect of treatment on the treatment’s mean effectiveness depends on the individual’s age.
library(ggplot2)
ggplot(depr, aes(y = y, x = age, color = TRT)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("y vs age")
Formulate a second-order multiple regression model with interaction terms:
\[y = \beta_0 + \beta_1 age + \beta_2 a + \beta_3 b + \beta_{12} age*a + \beta{13} age*b + \epsilon\] where the independent error terms \(\epsilon\) follow a normal distribution with mean 0 and equal variance \(\sigma^2\).
depr_lm <- lm(y ~ age + a + b + age*a + age*b, data = depr)
summary(depr_lm)
##
## Call:
## lm(formula = y ~ age + a + b + age * a + age * b, data = depr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4366 -2.7637 0.1887 2.9075 6.5634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.21138 3.34964 1.854 0.073545 .
## age 1.03339 0.07233 14.288 6.34e-15 ***
## a1 41.30421 5.08453 8.124 4.56e-09 ***
## b1 22.70682 5.09097 4.460 0.000106 ***
## age:a1 -0.70288 0.10896 -6.451 3.98e-07 ***
## age:b1 -0.50971 0.11039 -4.617 6.85e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.925 on 30 degrees of freedom
## Multiple R-squared: 0.9143, Adjusted R-squared: 0.9001
## F-statistic: 64.04 on 5 and 30 DF, p-value: 4.264e-15
By applying the dummy variable values, you can break regression equation into three treatment groups:
Treatment A: \(y = (6.2 + 41.3) + (1.03 - 0.70) age\)
Treatment B: \(y = (6.2 + 22.7) + (1.03 - 0.51) age\)
Treatment C: \(y = 6.2 + 1.03 age\)
The interpretation of these results is that for patients in this study receiving treatment A, the effectiveness of the treatment is predicted to increase 0.33 (1.03 - 0.70) units for every additional year in age, and so on for B, and C.
The residuals vs fits plot exhibits all of the “good” behavior, suggesting that the model fits well, there are no obvious outliers, and the error variances are indeed constant. (Use the standardized residual, the residual divided by its standard deviation.)
library(broom) # for augment()
depr_aug <- augment(depr_lm)
ggplot(data = depr_aug, aes(y = .std.resid, x = .fitted)) +
geom_point() +
geom_hline(yintercept = 0) +
geom_hline(yintercept = -2) +
geom_hline(yintercept = 2) +
ggtitle("Standardized Residuals vs Fits")
The normal probability plot exhibits linear trend suggesting that the error terms are indeed normally distributed.
qqnorm(depr_aug$.resid)
qqline(depr_aug$.resid)
For a more quantitative analysis, use the Anderson-Darling normality test. The test p-value is the probability of calculating the test statistic if the distribution is normal. The p-value is .3345, so do not reject \(H_0\).
library(nortest)
ad.test(depr_aug$.resid)
##
## Anderson-Darling normality test
##
## data: depr_aug$.resid
## A = 0.40575, p-value = 0.3345
For every age, is there a difference in the mean effectiveness for the three treatments? Test the null hypothesis \(H_0 : \beta_2 = \beta_3 = \beta_{12} = \beta_{13} = 0\) against the alternative \(H_A\) : at least one of these slope parameters is not 0.
summary(aov(depr_lm))
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 3424 3424 222.295 2.06e-15 ***
## a 1 804 804 52.178 4.86e-08 ***
## b 1 1 1 0.077 0.783
## age:a 1 375 375 24.343 2.81e-05 ***
## age:b 1 328 328 21.319 6.85e-05 ***
## Residuals 30 462 15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# check why the following code does not work.
# see notes here:
# https://mcfromnz.wordpress.com/2011/03/02/anova-type-iiiiii-ss-explained/
# https://stats.stackexchange.com/questions/4639/interpreting-the-drop1-output-in-r
options(contrasts = c("contr.sum","contr.poly"))
lm2 <- lm(y ~ age + a + b + age*a + age*b, data = depr)
drop1(lm2, .~., test = "F")
## Single term deletions
##
## Model:
## y ~ age + a + b + age * a + age * b
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 462.15 103.89
## age 1 826.72 1288.86 138.81 53.666 3.694e-08 ***
## a 1 1016.59 1478.74 143.75 65.991 4.555e-09 ***
## b 1 306.46 768.61 120.20 19.894 0.0001064 ***
## age:a 1 641.06 1103.21 133.21 41.614 3.982e-07 ***
## age:b 1 328.42 790.57 121.21 21.319 6.850e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F test statistic is \[F(4, 30) = \frac{(804 + 1 + 375 + 328) / 4}{462 / 30} = 24.49\]
The probability observing an F-statistic of F = 24.49 with 4 numerator df and 30 denominator df is 4.436433\times 10^{-9}
. Reject our null hypothesis. There is sufficient evidence at the \(\alpha = 0.05\) level to conclude that there is a significant difference in the mean effectiveness for the three treatments.
Does the effect of age on the treatment’s effectiveness depend on treatment? Test whether the two interaction parameters \(\beta_{age:a}\) and \(\beta_{age:b}\) are significant, \(H0 : \beta_{age:a} = \beta_{age:b} = 0\) against the alternative \(H_A\) : at least one of the interaction parameters is not 0.
The F test statistic is \[F(4, 30) = \frac{(375 + 328) / 2}{462 / 30} = 22.84\]
The probability observing an F-statistic of F = 24.49 with 4 numerator df and 30 denominator df is 9.3778217\times 10^{-7}
. Reject our null hypothesis. There is sufficient evidence at the \(\alpha = 0.05\) level to conclude that the effect of age on the treatment’s effectiveness depends on the treatment.