Introduction

This handout expands on Bailey, Chapter 6.4, to provide more examples and sample R code to help explain the concept of interaction terms. Interaction terms are a useful but often overlooked tool in the analyst’s kit. That they are “often overlooked,” however, does not mean that they are an add-on or an option. Instead, it means that analysts who overlook them are often getting the wrong answer.

Concept

An interaction term allows us to capture the potential for, quite literally, interactions between factors. Until now, we’ve modeled outcomes as the result of linear additive processes: we keep adding factors that each contribute some constant slope to a linear regression. Technically, a model with an interaction term is still linear and additive in its coefficients, but it allows the factors that we measure with variables to have different effects depending on the values of other factors.

Motivation

Let’s imagine that we’re interested in whether IQ is affected by environmental or genetic causes. The file kidiq.dta (available on Moodle) contains data on maternal and child IQ, as well as other covariates. Let’s read the data into memory and take a look at it.

library(foreign)
kidiq <- read.dta("kidiq.dta")
names(kidiq)
## [1] "kid_score" "mom_hs"    "mom_iq"    "mom_work"  "mom_age"
library(pastecs)
## Loading required package: boot
options(scipen=100) ## Don't worry about what these options() commands are doing for now
options(digits=2) 
stat.desc(kidiq,norm=FALSE)
##              kid_score  mom_hs   mom_iq mom_work mom_age
## nbr.val         434.00 434.000   434.00  434.000  434.00
## nbr.null          0.00  93.000     0.00    0.000    0.00
## nbr.na            0.00   0.000     0.00    0.000    0.00
## min              20.00   0.000    71.04    1.000   17.00
## max             144.00   1.000   138.89    4.000   29.00
## range           124.00   1.000    67.86    3.000   12.00
## sum           37670.00 341.000 43400.00 1257.000 9889.00
## median           90.00   1.000    97.92    3.000   23.00
## mean             86.80   0.786   100.00    2.896   22.79
## SE.mean           0.98   0.020     0.72    0.057    0.13
## CI.mean.0.95      1.93   0.039     1.42    0.111    0.25
## var             416.60   0.169   225.00    1.396    7.30
## std.dev          20.41   0.411    15.00    1.181    2.70
## coef.var          0.24   0.523     0.15    0.408    0.12
plot(kidiq$kid_score~kidiq$mom_iq,
     main="Child IQ by Mother's IQ",
     xlab="Maternal IQ",
     ylab="Child IQ",
     pch = 16)

And we can fit a basic OLS regression:

kidiq.lm1 <- lm(kid_score~mom_iq,data=kidiq)
summary(kidiq.lm1)
## 
## Call:
## lm(formula = kid_score ~ mom_iq, data = kidiq)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -56.75 -12.07   2.22  11.71  47.69 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  25.7998     5.9174    4.36             0.000016 ***
## mom_iq        0.6100     0.0585   10.42 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18 on 432 degrees of freedom
## Multiple R-squared:  0.201,  Adjusted R-squared:  0.199 
## F-statistic:  109 on 1 and 432 DF,  p-value: <0.0000000000000002
plot(kidiq$kid_score~kidiq$mom_iq,
     main="Child IQ by Mother's IQ",
     xlab="Maternal IQ",
     ylab="Child IQ",
     pch = 16)
abline(kidiq.lm1, col="red")


Quick Exercise 1

What do these results from the bivariate regression suggest? Are these results likely to be good estimates of the underlying causal relationships? What sources of endogeneity exist?


Suppose that you, as a researcher, had a good idea: what if there are social processes external to the genotype that lead to variations in expressed intelligence? In other words, that old hoary dorm-room staple: what if nurture matters alongside nature?

One bit of evidence near at hand involves whether the mother has graduated high school. This should be exogenous to the mother’s genes, or at least pretty nearly so: humans don’t have a gene for finishing high school. But it is a fairly good indicator of social status and economic potential: social status because high-status people rarely fail to graduate high school, and economic potential because those who don’t graduate high school earn substantially less than those who do. If IQ is just a matter of genetics, this piece of evidence shouldn’t be correlated with anything in the child’s IQ, which is a handy null hypothesis.

So, first, let’s run regressions stratified by mom_hs, where 1 signifies the mother finished high school and 0 says she did not.

kidiq.lm2 <- lm(kid_score~mom_iq,data=subset(kidiq,mom_hs==1))

kidiq.lm3 <- lm(kid_score~mom_iq,data=subset(kidiq,mom_hs==0))

Here are the results in tabular form:

                     Child’s IQ
                     kid_score
                     Mom Graduated            Mom Didn’t
                     (1)                      (2)
mom_iq               0.480***                 0.970***
                     (0.065)                  (0.160)
Constant             40.000***                -11.000
                     (6.700)                  (15.000)
Observations         341                      93
R2                   0.140                    0.290
Adjusted R2          0.140                    0.290
Residual Std. Error  18.000 (df = 339)        19.000 (df = 91)
F Statistic          56.000*** (df = 1; 339)  38.000*** (df = 1; 91)
Note: *p<0.1; **p<0.05; ***p<0.01

Quick Exercise 2

What do these results from the two bivariate regressions suggest?


That’s … interesting. Let’s look at the data again, but this time with more color:

kidiq$color <- NA 
kidiq[kidiq$mom_hs==1,]$color <- "gray60"
kidiq[kidiq$mom_hs==0,]$color <- "darkmagenta"
plot(kidiq$kid_score~kidiq$mom_iq,
     main="Child IQ by Mother's IQ",
     xlab="Maternal IQ",
     ylab="Child IQ",
     pch = 16,
     col= kidiq$color)
abline(kidiq.lm2,col="black",lty="dashed",lwd=3)
abline(kidiq.lm3,col="darkmagenta",lty="dotted",lwd=3)


Quick Exercise 3

What does

kidiq$color <- NA 
kidiq[kidiq$mom_hs==1,]$color <- "gray60"
kidiq[kidiq$mom_hs==0,]$color <- "darkmagenta"

do? Why is it useful?


The results seem to suggest that the relationship between maternal IQ and child IQ differs between mothers who graduated high school and those who didn’t. Specifically, each additional point of maternal IQ is associated with a larger increase in child IQ among mothers who did not graduate high school (about 0.97 points) than among those who did (about 0.48 points). This is not just an additive difference that shifts the line up or down; it is a difference in the slope itself.

In other words, there seems to be an interaction between an environmental factor and a genetic one.

We should model this, then, not as \[ ChildIQ = \beta_0 + \beta_1 MomIQ + \beta_2 MomHS + \epsilon \] but instead as \[ ChildIQ = \beta_0 + \beta_1 MomIQ + \beta_2 MomHS + \beta_3 MomIQ \times MomHS + \epsilon \]
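To see why the interaction term changes the slope rather than just shifting the line, it may help to substitute the two possible values of MomHS into the second equation. This is only a short derivation using the same symbols as above, and it assumes MomHS is coded 0 or 1:

```latex
\[
\begin{aligned}
MomHS = 0: &\quad ChildIQ = \beta_0 + \beta_1 MomIQ + \epsilon \\
MomHS = 1: &\quad ChildIQ = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) MomIQ + \epsilon
\end{aligned}
\]
```

Read this way, \(\beta_2\) is the difference in intercepts between the two groups and \(\beta_3\) is the difference in slopes. If \(\beta_3 = 0\), the second model collapses to the first one, with two parallel lines.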


Quick Exercise 4

Why do we set MomHS to be a dummy variable of 1 or 0? Why not 1 or 2? Come up with some answers.


Estimating this equation turns out to be simple:

kidiq.lm4 <- lm(kid_score~mom_iq+mom_hs+mom_iq*mom_hs,data=kidiq)
summary(kidiq.lm4)
## 
## Call:
## lm(formula = kid_score ~ mom_iq + mom_hs + mom_iq * mom_hs, data = kidiq)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52.09 -11.33   2.07  11.66  43.88 
## 
## Coefficients:
##               Estimate Std. Error t value      Pr(>|t|)    
## (Intercept)    -11.482     13.758   -0.83        0.4044    
## mom_iq           0.969      0.148    6.53 0.00000000018 ***
## mom_hs          51.268     15.338    3.34        0.0009 ***
## mom_iq:mom_hs   -0.484      0.162   -2.99        0.0030 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18 on 430 degrees of freedom
## Multiple R-squared:  0.23,   Adjusted R-squared:  0.225 
## F-statistic: 42.8 on 3 and 430 DF,  p-value: <0.0000000000000002
                     Dependent variable:
                     Child’s IQ
                     Mom Graduated            Mom Didn’t              Interaction
                     (1)                      (2)                     (3)
mom_iq               0.480***                 0.970***                0.970***
                     (0.065)                  (0.160)                 (0.150)
mom_hs                                                                51.000***
                                                                      (15.000)
mom_iq:mom_hs                                                         -0.480***
                                                                      (0.160)
Constant             40.000***                -11.000                 -11.000
                     (6.700)                  (15.000)                (14.000)
Observations         341                      93                      434
R2                   0.140                    0.290                   0.230
Adjusted R2          0.140                    0.290                   0.230
Residual Std. Error  18.000 (df = 339)        19.000 (df = 91)        18.000 (df = 430)
F Statistic          56.000*** (df = 1; 339)  38.000*** (df = 1; 91)  43.000*** (df = 3; 430)
Note: *p<0.1; **p<0.05; ***p<0.01
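One way to see how the three columns relate is to check that a model with a dummy variable and its interaction reproduces the two stratified fits. The sketch below is illustrative only: it uses simulated data (the data frame `sim` and every number in it are made up, not taken from kidiq.dta) so that it runs without the course file. It writes the product term alone as `mom_iq:mom_hs`, and it also fits the `mom_iq*mom_hs` shorthand to show that R expands it to the same model.

```r
# Simulated stand-in for kidiq; all values here are hypothetical
set.seed(42)
n <- 400
sim <- data.frame(
  mom_iq = rnorm(n, mean = 100, sd = 15),
  mom_hs = rbinom(n, size = 1, prob = 0.8)
)
# Built-in slopes: 1.0 when mom_hs == 0, 0.5 when mom_hs == 1
sim$kid_score <- -10 + 1.0 * sim$mom_iq + 50 * sim$mom_hs -
  0.5 * sim$mom_iq * sim$mom_hs + rnorm(n, sd = 18)

# Stratified fits, mirroring kidiq.lm2 and kidiq.lm3
fit_hs   <- lm(kid_score ~ mom_iq, data = subset(sim, mom_hs == 1))
fit_nohs <- lm(kid_score ~ mom_iq, data = subset(sim, mom_hs == 0))

# Pooled fit with the interaction, mirroring kidiq.lm4
fit_int <- lm(kid_score ~ mom_iq + mom_hs + mom_iq:mom_hs, data = sim)
b <- coef(fit_int)

# Intercept and slope implied for each group by the pooled fit
implied_nohs <- c(b[["(Intercept)"]], b[["mom_iq"]])
implied_hs   <- c(b[["(Intercept)"]] + b[["mom_hs"]],
                  b[["mom_iq"]] + b[["mom_iq:mom_hs"]])

# Each pair of rows should agree up to floating-point error
print(rbind(stratified = coef(fit_nohs), implied = implied_nohs))
print(rbind(stratified = coef(fit_hs), implied = implied_hs))

# The * shorthand expands to both main effects plus the interaction
fit_star <- lm(kid_score ~ mom_iq * mom_hs, data = sim)
print(all.equal(coef(fit_int), coef(fit_star)))
```

The same arithmetic applies to the kidiq.lm4 output above: 0.969 - 0.484 is about 0.485, close to the slope in Column 1, and -11.482 + 51.268 is about 39.8, close to its constant. The standard errors need not match across columns, because the pooled model estimates a single residual variance for both groups.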

Quick Exercise 5

How should we interpret the effect of a one-point change in MomIQ on child IQ in Column 1? In Column 2? In Column 3?

What other factors might we want to control for? How would you expect them to affect the estimate of \(\beta_1\)?