This handout expands on Bailey, Chapter 6.4, to provide more examples and sample R code to help explain the concept of Interaction Terms. Interaction terms are a useful but often overlooked tool in the analyst’s kit; that they are “often overlooked”, however, does not mean that they are an add-on or an option. Rather, it means that by overlooking them, people often get the wrong answer.
An interaction term allows us to capture, quite literally, interactions between factors. Until now, we’ve modeled outcomes as the result of linear additive processes: we keep adding factors, each of which contributes some constant slope to a linear regression. Technically, a model with an interaction term is still linear and additive, but it allows the factors we measure with variables to have different effects depending on the values of other factors.
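Concretely, with a generic outcome \(Y\) and two factors \(X_1\) and \(X_2\), the interacted model is \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon \] so the effect of a one-unit change in \(X_1\) is \(\beta_1 + \beta_3 X_2\): it depends on the value of \(X_2\).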
Let’s imagine that we’re interested in whether IQ is affected by environmental or genetic causes. The file kidiq.dta (available on Moodle) contains data on maternal and child IQ, as well as other covariates. Let’s read the data into memory and take a look at it.
library(foreign)
kidiq <- read.dta("kidiq.dta")
names(kidiq)
## [1] "kid_score" "mom_hs" "mom_iq" "mom_work" "mom_age"
library(pastecs)
## Loading required package: boot
options(scipen=100) ## Don't worry about what these options() commands are doing for now
options(digits=2)
stat.desc(kidiq,norm=FALSE)
## kid_score mom_hs mom_iq mom_work mom_age
## nbr.val 434.00 434.000 434.00 434.000 434.00
## nbr.null 0.00 93.000 0.00 0.000 0.00
## nbr.na 0.00 0.000 0.00 0.000 0.00
## min 20.00 0.000 71.04 1.000 17.00
## max 144.00 1.000 138.89 4.000 29.00
## range 124.00 1.000 67.86 3.000 12.00
## sum 37670.00 341.000 43400.00 1257.000 9889.00
## median 90.00 1.000 97.92 3.000 23.00
## mean 86.80 0.786 100.00 2.896 22.79
## SE.mean 0.98 0.020 0.72 0.057 0.13
## CI.mean.0.95 1.93 0.039 1.42 0.111 0.25
## var 416.60 0.169 225.00 1.396 7.30
## std.dev 20.41 0.411 15.00 1.181 2.70
## coef.var 0.24 0.523 0.15 0.408 0.12
plot(kidiq$kid_score~kidiq$mom_iq,
main="Child IQ by Mother's IQ",
xlab="Maternal IQ",
ylab="Child IQ",
pch = 16)
And we can fit a basic OLS regression:
kidiq.lm1 <- lm(kid_score~mom_iq,data=kidiq)
summary(kidiq.lm1)
##
## Call:
## lm(formula = kid_score ~ mom_iq, data = kidiq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.75 -12.07 2.22 11.71 47.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.7998 5.9174 4.36 0.000016 ***
## mom_iq 0.6100 0.0585 10.42 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18 on 432 degrees of freedom
## Multiple R-squared: 0.201, Adjusted R-squared: 0.199
## F-statistic: 109 on 1 and 432 DF, p-value: <0.0000000000000002
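If you want interval estimates to go with these point estimates and p-values, base R’s confint() works on any fitted lm object (output not shown here; try it yourself):
confint(kidiq.lm1) ## 95% confidence intervals for the intercept and slope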
plot(kidiq$kid_score~kidiq$mom_iq,
main="Child IQ by Mother's IQ",
xlab="Maternal IQ",
ylab="Child IQ",
pch = 16)
abline(kidiq.lm1, col="red")
What do these results from the bivariate regression suggest? Are these results likely to be good estimates of the underlying causal relationships? What sources of endogeneity exist?
Suppose that you, as a researcher, had a good idea: what if there are social processes external to the genotype that lead to variations in expressed intelligence? In other words, that old hoary dorm-room staple: what if nurture matters alongside nature?
One bit of evidence near at hand involves whether the mother graduated high school. This should be exogenous to the DNA in the mother’s genes, or at least pretty nearly so: humans don’t have a gene for finishing high school. But it is a fairly good indicator of social status and economic potential: social status because high-status people rarely fail to graduate high school, and economic potential because those who don’t graduate high school earn substantially less than those who do. If IQ is just a matter of genetics, this piece of evidence shouldn’t be correlated with the child’s IQ at all, which gives us a handy null hypothesis.
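One quick, informal way to probe that null hypothesis before doing anything fancier (a sketch of my own, not something the text requires) is a difference-of-means test of child IQ by mom_hs:
## Do children of high school graduates score differently, on average?
## Under a pure-genetics null (and exogenous mom_hs), we'd expect no gap.
t.test(kid_score ~ mom_hs, data = kidiq)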
So, first, let’s run regressions stratified by mom_hs, where 1 indicates that the mother finished high school and 0 that she did not.
kidiq.lm2 <- lm(kid_score~mom_iq,data=subset(kidiq,mom_hs==1))
kidiq.lm3 <- lm(kid_score~mom_iq,data=subset(kidiq,mom_hs==0))
Here are the results in tabular form:
*Dependent variable: Child’s IQ (kid_score)*

|                     | Mom Graduated           | Mom Didn’t             |
|---------------------|-------------------------|------------------------|
|                     | (1)                     | (2)                    |
| mom_iq              | 0.480***                | 0.970***               |
|                     | (0.065)                 | (0.160)                |
| Constant            | 40.000***               | -11.000                |
|                     | (6.700)                 | (15.000)               |
| Observations        | 341                     | 93                     |
| R2                  | 0.140                   | 0.290                  |
| Adjusted R2         | 0.140                   | 0.290                  |
| Residual Std. Error | 18.000 (df = 339)       | 19.000 (df = 91)       |
| F Statistic         | 56.000*** (df = 1; 339) | 38.000*** (df = 1; 91) |
| Note:               | \*p<0.1; \*\*p<0.05; \*\*\*p<0.01 |              |
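A side-by-side table like the one above can be produced with the stargazer package (an assumption on my part; the handout doesn’t say which table package it used):
library(stargazer)
stargazer(kidiq.lm2, kidiq.lm3,
          type = "text",
          column.labels = c("Mom Graduated", "Mom Didn't"),
          dep.var.labels = "Child's IQ")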
What do these results from the two bivariate regressions suggest?
That’s … interesting. Let’s look at the data again, but this time with more color:
kidiq$color <- NA
kidiq[kidiq$mom_hs==1,]$color <- "gray60"
kidiq[kidiq$mom_hs==0,]$color <- "darkmagenta"
plot(kidiq$kid_score~kidiq$mom_iq,
main="Child IQ by Mother's IQ",
xlab="Maternal IQ",
ylab="Child IQ",
pch = 16,
col= kidiq$color)
abline(kidiq.lm2,col="black",lty="dashed",lwd=3)
abline(kidiq.lm3,col="darkmagenta",lty="dotted",lwd=3)
What does
kidiq$color <- NA
kidiq[kidiq$mom_hs==1,]$color <- "gray60"
kidiq[kidiq$mom_hs==0,]$color <- "darkmagenta"
do? Why is it useful?
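For comparison, the same color assignment can be written in a single line with ifelse(), which avoids the row-subsetting:
kidiq$color <- ifelse(kidiq$mom_hs == 1, "gray60", "darkmagenta")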
The results seem to suggest that there is a different relationship between maternal IQ and child IQ for mothers who graduated high school and those who didn’t. Specifically, child IQ rises more steeply with maternal IQ for non-high-school grads than for those who graduated high school. This is not just an additive difference (shifting the line up or down); it is a difference in the slope itself.
In other words, there seems to be an interaction between an environmental factor and a genetic one.
We should model this, then, not as \[ ChildIQ = \beta_0 + \beta_1 MomIQ + \beta_2 MomHS + \epsilon \] but instead as \[ ChildIQ = \beta_0 + \beta_1 MomIQ + \beta_2 MomHS + \beta_3 MomIQ \times MomHS + \epsilon \]
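Because MomHS only takes the values 0 and 1, this single equation nests two separate lines: \[ MomHS = 0: \quad ChildIQ = \beta_0 + \beta_1 MomIQ + \epsilon \] \[ MomHS = 1: \quad ChildIQ = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) MomIQ + \epsilon \] so \(\beta_3\) measures how much the slope on MomIQ differs between the two groups.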
Why do we set MomHS to be a dummy variable of 1 or 0? Why not 1 or 2? Come up with some answers.
Estimating this equation turns out to be simple:
kidiq.lm4 <- lm(kid_score~mom_iq+mom_hs+mom_iq*mom_hs,data=kidiq)
summary(kidiq.lm4)
##
## Call:
## lm(formula = kid_score ~ mom_iq + mom_hs + mom_iq * mom_hs, data = kidiq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.09 -11.33 2.07 11.66 43.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.482 13.758 -0.83 0.4044
## mom_iq 0.969 0.148 6.53 0.00000000018 ***
## mom_hs 51.268 15.338 3.34 0.0009 ***
## mom_iq:mom_hs -0.484 0.162 -2.99 0.0030 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18 on 430 degrees of freedom
## Multiple R-squared: 0.23, Adjusted R-squared: 0.225
## F-statistic: 42.8 on 3 and 430 DF, p-value: <0.0000000000000002
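One note on R’s formula syntax: mom_iq*mom_hs already expands to the two main effects plus their interaction (written mom_iq:mom_hs), so the extra terms in the call above are redundant, and the shorter version below fits an identical model (kidiq.lm4b is just an illustrative name):
kidiq.lm4b <- lm(kid_score ~ mom_iq*mom_hs, data = kidiq) ## same fit as kidiq.lm4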
*Dependent variable: Child’s IQ (kid_score)*

|                     | Mom Graduated           | Mom Didn’t             | Interaction             |
|---------------------|-------------------------|------------------------|-------------------------|
|                     | (1)                     | (2)                    | (3)                     |
| mom_iq              | 0.480***                | 0.970***               | 0.970***                |
|                     | (0.065)                 | (0.160)                | (0.150)                 |
| mom_hs              |                         |                        | 51.000***               |
|                     |                         |                        | (15.000)                |
| mom_iq:mom_hs       |                         |                        | -0.480***               |
|                     |                         |                        | (0.160)                 |
| Constant            | 40.000***               | -11.000                | -11.000                 |
|                     | (6.700)                 | (15.000)               | (14.000)                |
| Observations        | 341                     | 93                     | 434                     |
| R2                  | 0.140                   | 0.290                  | 0.230                   |
| Adjusted R2         | 0.140                   | 0.290                  | 0.230                   |
| Residual Std. Error | 18.000 (df = 339)       | 19.000 (df = 91)       | 18.000 (df = 430)       |
| F Statistic         | 56.000*** (df = 1; 339) | 38.000*** (df = 1; 91) | 43.000*** (df = 3; 430) |
| Note:               | \*p<0.1; \*\*p<0.05; \*\*\*p<0.01 |              |                         |
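To see what the interaction model implies, we can recover its two fitted lines from the coefficients of kidiq.lm4 and overlay them on the colored scatterplot from before (a sketch; run it after the plot() call above):
cf <- coef(kidiq.lm4)
## Line for mothers who did not finish high school (mom_hs = 0)
abline(a = cf["(Intercept)"], b = cf["mom_iq"],
       col = "darkmagenta", lty = "dotted", lwd = 3)
## Line for mothers who finished high school (mom_hs = 1):
## intercept and slope each shift by the mom_hs and interaction terms
abline(a = cf["(Intercept)"] + cf["mom_hs"],
       b = cf["mom_iq"] + cf["mom_iq:mom_hs"],
       col = "black", lty = "dashed", lwd = 3)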
How should we interpret the relationship between a one-point change in MomIQ and child IQ in Column 1? In Column 2? In Column 3?
What other factors might we want to control for? How would you expect them to affect the estimate of \(\beta_1\)?
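As one way to explore, the remaining covariates in kidiq can be added directly (a sketch; whether mom_work belongs in the model as a factor or as a number depends on how it is coded):
kidiq.lm5 <- lm(kid_score ~ mom_iq*mom_hs + mom_age + as.factor(mom_work),
                data = kidiq)
summary(kidiq.lm5)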