Lecture 11: Interaction Effects

Slides for Week 11 is available here

Before adding interaction effects to the model, test the model for potential violations of OLS assumptions. To correct for violations: center age, add a squared age term, and log transform income variable in a new temporary dataset.

Basic Set Up and install and load all necessary libraries

options(scipen=999, digits = 3)
library(ggplot2)
library(ggthemes)
library(ggfortify)
library(car)
# install.packages(c("sjPlot", "sjmisc"))
library(sjPlot)
library(sjmisc)
# install.packages("stargazer")
library(stargazer)

Step 0. Look at the Data Documentation Check the CPS2015_Description.pdf in your project folder.

Variables in the dataset:

  • female: 1 if female; 0 if male
  • year: Year
  • ahe: Average Hourly Earnings
  • bachelor: 1 if worker has a bachelor’s degree; 0 if worker has a high school degree
  • age: Age of the individuals
cps_new <- read.csv("cps_lab7.csv",header=TRUE, sep=",")

cps_new1 <- cps_new # cps1 is now the spare copy shall we "mess up" our `cps`!

# Now in your global environment, you should see 'cps' and `cps1` stored. 
# It has 7098 observations (i.e., 7098 rows) and 5 variables (i.e., 5 columns). 

Centering and Adding a Quadratic Term

Two Steps: create a new age variable centered around 0, then square it

  1. Centering the Age variable by deducting the mean from the value
summary(cps_new$age) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    25.0    27.0    30.0    29.6    32.0    34.0
# mean of age = 29.63, 29 or 30

The mean value for age is about 29.63, but the age variable is a discrete variable (i.e. age is measured in whole year here). So, you can center it around the year that is closest to mean value, i.e., 30 years old, or the exact mean value.

cps_new$ageC <- (cps_new$age - 30)
summary(cps_new$ageC)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -5.00   -3.00    0.00   -0.37    2.00    4.00
# mean becomes your zero for ageC

hist(cps_new$ageC)

  1. Squaring the centered continuous variable ageC to create ageC2
cps_new$ageC2 <- (cps_new$ageC^2) 

summary(cps_new$ageC2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    4.00    8.41   16.00   25.00
cor(cps_new$ageC, cps_new$ageC2)
## [1] -0.337

Run Polynomial Regressions with an Interaction term

\[\hat{ahe}= \alpha + \beta_1{ageC} + \beta_2{ageC2} + \beta_3{female} + \beta_4{bachelor} + \beta_5{female \cdot bachelor}\]

Notice that \(\beta_5{female \cdot bachelor}\) is the interaction term.

One way is to create separate interaction terms stored in the dataset. However, plotting might be difficult. So, I don’t recommend this way if you want to visualize the interaction effects down the road.

# Tips: do not make dummy to factor here, otherwise can't create an interaction. 
cps_new$bachelor_female <- cps_new$bachelor * cps_new$female
cps_new$ageC_female <- cps_new$ageC * cps_new$female

# remove them for now
cps_new$bachelor_female <- NULL
cps_new$ageC_female <- NULL

Alternatively, we can include interaction terms in the regression model directly – see the following regression models.


Simple model (recall lm2 we did in previous lab)

lm2 <- lm(ahe ~ female + age + bachelor, data = cps_new)
summary(lm2)
## 
## Call:
## lm(formula = ahe ~ female + age + bachelor, data = cps_new)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27.91  -6.65  -1.86   4.25  83.91 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   2.0342     1.3548     1.5                0.13    
## female       -4.1424     0.2659   -15.6 <0.0000000000000002 ***
## age           0.5316     0.0451    11.8 <0.0000000000000002 ***
## bachelor      9.8471     0.2624    37.5 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.9 on 7094 degrees of freedom
## Multiple R-squared:  0.19,   Adjusted R-squared:  0.189 
## F-statistic:  553 on 3 and 7094 DF,  p-value: <0.0000000000000002
# Check residual plots
autoplot(lm2, which = 1:4, ncol = 2, label.size = 3)

# check multicollinearity
vif(lm2)
##   female      age bachelor 
##     1.02     1.00     1.02

Visualization using sjPlot and sjmisc packages

See the package documentations here for more explanations on syntaxes: http://www.strengejacke.de/sjPlot/reference/plot_model.html

# simple forest plot for coefficients
plot_model(lm2)

# fit model with prediction
plot_model(lm2, type = "pred", terms = c("age"))

# plot with no interaction: same slopes are parallel to each other!  
plot_model(lm2, type = "pred", terms = c("bachelor","female")) 

Reproduce the log-linear and log-log regression models from Lab 7

lm5 <- lm(log_ahe1 ~ female + ageC + ageC2 + bachelor, data = cps_new)
summary(lm5)
## 
## Call:
## lm(formula = log_ahe1 ~ female + ageC + ageC2 + bachelor, data = cps_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7147 -0.2862  0.0129  0.3044  2.0602 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  2.767359   0.011066  250.07 <0.0000000000000002 ***
## female      -0.176939   0.011657  -15.18 <0.0000000000000002 ***
## ageC         0.022591   0.002098   10.77 <0.0000000000000002 ***
## ageC2       -0.001875   0.000776   -2.42               0.016 *  
## bachelor     0.462164   0.011504   40.18 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.479 on 7093 degrees of freedom
## Multiple R-squared:  0.209,  Adjusted R-squared:  0.208 
## F-statistic:  467 on 4 and 7093 DF,  p-value: <0.0000000000000002
# check residuals: Any improvement? 
autoplot(lm5, which = 1:4, ncol = 2, label.size = 3)

# check multicollinearity
vif(lm5)
##   female     ageC    ageC2 bachelor 
##     1.02     1.13     1.13     1.02
lm6 <- lm(log_ahe1 ~ female + log_age + bachelor, data = cps_new)
summary(lm5)
## 
## Call:
## lm(formula = log_ahe1 ~ female + ageC + ageC2 + bachelor, data = cps_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7147 -0.2862  0.0129  0.3044  2.0602 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  2.767359   0.011066  250.07 <0.0000000000000002 ***
## female      -0.176939   0.011657  -15.18 <0.0000000000000002 ***
## ageC         0.022591   0.002098   10.77 <0.0000000000000002 ***
## ageC2       -0.001875   0.000776   -2.42               0.016 *  
## bachelor     0.462164   0.011504   40.18 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.479 on 7093 degrees of freedom
## Multiple R-squared:  0.209,  Adjusted R-squared:  0.208 
## F-statistic:  467 on 4 and 7093 DF,  p-value: <0.0000000000000002
# Check residual plots
autoplot(lm6, which = 1:4, ncol = 2, label.size = 3)

# check multicollinearity
vif(lm6)
##   female  log_age bachelor 
##     1.02     1.00     1.02

Interaction: Categorical x Categorical

In regression analysis, an interaction effect occurs when the effect of one independent variable on the dependent variable depends on the level of another independent variable. Including an interaction term in a regression model allows us to account for the fact that the relationship between the independent variables and the dependent variable may not be constant across all levels of the other independent variable. It can also help us to better understand the complex relationships between variables in our data.

For lm7, we want to examine if the relationship between bachelor and ahe depends on female (or the effect of bachelor on ahe significantly vary by sex). Let’s check if the interaction term is significant at the 0.05 level!

lm7 <- lm(ahe ~  ageC + ageC2 +  bachelor * female, data = cps_new)
summary(lm7)
## 
## Call:
## lm(formula = ahe ~ ageC + ageC2 + bachelor * female, data = cps_new)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27.94  -6.69  -1.90   4.31  83.30 
## 
## Coefficients:
##                 Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)      17.8930     0.2725   65.67 <0.0000000000000002 ***
## ageC              0.5049     0.0479   10.55 <0.0000000000000002 ***
## ageC2            -0.0246     0.0177   -1.39              0.1654    
## bachelor         10.4596     0.3402   30.74 <0.0000000000000002 ***
## female           -3.3071     0.3972   -8.33 <0.0000000000000002 ***
## bachelor:female  -1.5067     0.5346   -2.82              0.0048 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.9 on 7092 degrees of freedom
## Multiple R-squared:  0.191,  Adjusted R-squared:  0.19 
## F-statistic:  334 on 5 and 7092 DF,  p-value: <0.0000000000000002
# Check residual plots
autoplot(lm7, which = 1:4, ncol = 2, label.size = 3)

# check multicollinearity
vif(lm7, type = "predictor")
## GVIFs computed for predictors
##          GVIF Df GVIF^(1/(2*Df)) Interacts With        Other Predictors
## ageC     1.13  1            1.06           --   ageC2, bachelor, female
## ageC2    1.13  1            1.06           --    ageC, bachelor, female
## bachelor 1.00  3            1.00         female             ageC, ageC2
## female   1.00  3            1.00       bachelor             ageC, ageC2

Controlling other variables constant, the results show that the interaction term (female x bachelor) is statistically significant at the 0.05 level, meaning that there is a significant interaction effect between female and bachelor, i.e., the association between bachelor and average hourly earning significantly differs between female and male workers.

For female workers (female=1) who had bachelor’s degree, the coefficient (=slope) of bachelor for average hourly earning decreases by $1.51, compared to the coefficients of bachelor for male workers who had bachelor’s degree.

In short, for female workers (female=1), having a bachelor’s degree is associated with an increase in the average hourly earning of $8.95 (i.e., 10.46 - 1.51 = 8.95), holding other variables constant. Whereas, for male workers (female=0 which is the reference group), having a bachelor’s degree is associated with $10.46 increase in the average hourly earning, holding other variables constant.

Visualize interaction effect

# simple forest plot for coefficients
plot_model(lm7)

# fit model with interaction
plot_model(lm7, type = "int", terms = c("bachelor", "female (0, 1)"),
           title= "Interaction Effect of Female and Bachelor on Average Hourly Earnings",
           axis.title = c("Bachelor's degree", "Average Hourly Earnings"))

Interaction: Categorical x Continuous

# the order of your interaction terms will determine the x-axis and y-axis in your plot
# ageC = x-axis ; female = y-axis

lm8 <- lm(ahe ~ ageC * female + ageC2 + bachelor, data = cps_new)
summary(lm8)
## 
## Call:
## lm(formula = ahe ~ ageC * female + ageC2 + bachelor, data = cps_new)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27.96  -6.73  -1.97   4.33  83.55 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  18.2282     0.2532   71.99 <0.0000000000000002 ***
## ageC          0.5930     0.0611    9.71 <0.0000000000000002 ***
## female       -4.2168     0.2681  -15.73 <0.0000000000000002 ***
## ageC2        -0.0267     0.0177   -1.51               0.132    
## bachelor      9.8291     0.2625   37.44 <0.0000000000000002 ***
## ageC:female  -0.2030     0.0913   -2.22               0.026 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.9 on 7092 degrees of freedom
## Multiple R-squared:  0.19,   Adjusted R-squared:  0.19 
## F-statistic:  334 on 5 and 7092 DF,  p-value: <0.0000000000000002
# Check residual plots
autoplot(lm8, which = 1:4, ncol = 2, label.size = 3)

# check multicollinearity
vif(lm8, type = "predictor")
## GVIFs computed for predictors
##          GVIF Df GVIF^(1/(2*Df)) Interacts With       Other Predictors
## ageC     1.16  3            1.02         female        ageC2, bachelor
## female   1.16  3            1.02           ageC        ageC2, bachelor
## ageC2    1.13  1            1.06           --   ageC, female, bachelor
## bachelor 1.02  1            1.01           --      ageC, female, ageC2

Visualize interaction effect

# simple forest plot for coefficients
plot_model(lm8)

# fit model with interaction
plot_model(lm8, type = "int", terms = c("female (0,1)", "ageC (-5: 4"),
           title= "Interaction Effect of Female and Age on Average Hourly Earnings",
           axis.title = c("Age (centered at the mean)", "Average Hourly Earnings"))

Export Regression Tables using Stargazer package

For the full documentation: https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf

Summary Statistics

cps_new_Mini <- (cps_new[c("ahe","age","bachelor","female")])

# install.packages("stargazer")
library(stargazer)

stargazer(cps_new_Mini, type= "html", out = "summary table.html",
          title="Table 1. Summary Statistics", align=TRUE)
Table 1. Summary Statistics
Statistic N Mean St. Dev. Min Max
ahe 7,098 21.200 12.100 0.000 106.000
age 7,098 29.600 2.880 25 34
bachelor 7,098 0.526 0.499 0 1
female 7,098 0.417 0.493 0 1

Correlation Matrix

# Create a matrix by selecting some of the variables
cps_new$bachelor <- as.numeric(cps_new$bachelor)

# Correlation can't deal with factor structure, keep dummy in numerical form 
cor.matrix <- cor(cps_new[c("ahe","age","bachelor","female")])

stargazer(cor.matrix, type= "html", out = "cor.table.html",
          title="Table 2. Correlation Matrix")
Table 2. Correlation Matrix
ahe age bachelor female
ahe 1 0.131 0.380 -0.112
age 0.131 1 -0.001 -0.032
bachelor 0.380 -0.001 1 0.149
female -0.112 -0.032 0.149 1
# Remove dummy from your correlation matrix 
# since correlation is only useful for continuous/numeric variables
cor.matrix1 <- cor(cps_new[c("ahe","age")])

stargazer(cor.matrix1, type= "html", out = "cor.table1.html",
          title="Table 2. Correlation Matrix")
Table 2. Correlation Matrix
ahe age
ahe 1 0.131
age 0.131 1

Regression Tables

For more settings, see https://www.jakeruss.com/cheatsheets/stargazer/

lm2$AIC <- AIC(lm2)
lm5$AIC <- AIC(lm5)
lm6$AIC <- AIC(lm6)
lm7$AIC <- AIC(lm7)
lm8$AIC <- AIC(lm8)

lm2$BIC <- BIC(lm2)
lm5$BIC <- BIC(lm5)
lm6$BIC <- BIC(lm6)
lm7$BIC <- BIC(lm7)
lm8$BIC <- BIC(lm8)

stargazer(lm2, lm5, lm6, lm7, lm8, type= "html", out = "regression.table.html",
          dep.var.labels = "Average Hourly Earnings", 
          column.labels = c("Main Effects","Log-linear Model", "Log-Log Model",
                            "Int: Bachelor x Female ", "Int: Age x Female"), 
          intercept.bottom = FALSE, 
          digits  = 2 , digits.extra = 2,
          single.row= FALSE,
          keep.stat = c("n","adj.rsq","aic","bic"),
          notes.append = FALSE,
          star.char = c("+", "*", "**", "***"),
          star.cutoffs = c(0.1, 0.05, 0.01, 0.001),
          notes = c("(+ p<0.1; * p<0.05; ** p<0.01; *** p<0.001)"), 
          title="Table 3. Regression Results", align=TRUE, no.space=FALSE)
Table 3. Regression Results
Dependent variable:
Average Hourly Earnings log_ahe1 ahe
Main Effects Log-linear Model Log-Log Model Int: Bachelor x Female Int: Age x Female
(1) (2) (3) (4) (5)
Constant 2.03 2.77*** 0.31 17.90*** 18.20***
(1.35) (0.01) (0.20) (0.27) (0.25)
female -4.14*** -0.18*** -0.18*** -3.31*** -4.22***
(0.27) (0.01) (0.01) (0.40) (0.27)
age 0.53***
(0.05)
bachelor:female -1.51**
(0.53)
ageC 0.02*** 0.50*** 0.59***
(0.002) (0.05) (0.06)
ageC2 -0.002* -0.02 -0.03
(0.001) (0.02) (0.02)
log_age 0.72***
(0.06)
bachelor 9.85*** 0.46*** 0.46*** 10.50*** 9.83***
(0.26) (0.01) (0.01) (0.34) (0.26)
ageC:female -0.20*
(0.09)
Observations 7,098 7,098 7,098 7,098 7,098
Adjusted R2 0.19 0.21 0.21 0.19 0.19
Akaike Inf. Crit. 54,083.00 9,688.00 9,690.00 54,077.00 54,080.00
Bayesian Inf. Crit. 54,118.00 9,729.00 9,724.00 54,125.00 54,128.00
Note: (+ p<0.1; * p<0.05; ** p<0.01; *** p<0.001)

If you use the stargazer package in your research publications, please remember to include the following citation: Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2.2. https://CRAN.R-project.org/package=stargazer