The slides for Lab 10 session are now available to review here.
options(scipen=999, digits = 2)
# install.packages("multilevel")
# install.packages("mediation")

Create a clean data frame with the selected variables first

good <- read.csv("good.csv")
#recoding Race as we did before
# -.5 = black
# .5 = white
good$race <- ifelse(good$CHRACE == 1, .5, ifelse(good$CHRACE == 2, -.5, NA))

# Log and/or Centering Transformations of `AGE97` and `faminc97`
good$ageC <- good$AGE97 - 8
good$ageC2 <- good$ageC^2

good$faminc97New <- ifelse(good$faminc97 <= 1, 1, good$faminc97)
summary(good$faminc97New)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1   20176   39118   49882   64495  784611
good$loginc <- log(good$faminc97New)

goodMini <- na.omit(good[,c("CHID","ageC","ageC2","loginc",
                            "HOME97","race","read02","WICpreg")])
attach(goodMini)
str(goodMini)
## 'data.frame':    1455 obs. of  8 variables:
##  $ CHID   : int  4041 5032 7039 7040 7045 10034 10035 14032 14030 14031 ...
##  $ ageC   : num  -4 4 1 -1 -4 -2 -4 -4 3 0 ...
##  $ ageC2  : num  16 16 1 1 16 4 16 16 9 0 ...
##  $ loginc : num  9.04 11.4 9.46 9.46 11.09 ...
##  $ HOME97 : num  15.1 21 18.6 16.6 16 25.4 20.7 21.7 20.6 20.7 ...
##  $ race   : num  0.5 0.5 0.5 0.5 0.5 -0.5 -0.5 0.5 0.5 0.5 ...
##  $ read02 : int  45 88 52 64 53 75 54 74 82 77 ...
##  $ WICpreg: int  1 1 1 1 1 1 1 0 0 0 ...
##  - attr(*, "na.action")= 'omit' Named int  1 2 4 5 6 7 8 10 11 12 ...
##   ..- attr(*, "names")= chr  "1" "2" "4" "5" ...

10 Mediation Analysis

Mediation refers to the effect of an independent variable (X) on a dependent variable (Y) is transmitted through a third intervening/mediating variable (M). In other words, an indirect effect is when X affects M, which in turn affects Y, other things being equal. Hence, we say M mediates the effect of x1 on y. Note: M is just another independent variable in your model.

For mediation analysis, besides the average direct effects (ADE), we are interested in an average of the mediation effect, denoted as the average causal mediation effects (ACME).

Total Effect = Average Causal Mediation Effects (ACME) + Average Direct Effect (ADE)
Basic Mediation Model

Basic Mediation Model

Note: Mediation analysis itself does not imply causal relationships unless the research data is based on experimental design.


10.1 Procedures of Estimating the Indirect Effect following Baron & Kenny (1986)

Here are the basic steps for mediation analysis suggested by Baron & Kenny (1986).

A mediation analysis is comprised of three sets of regression:

  1. \(X → Y\)

  2. \(X → M\), and

  3. \(X + M → Y\)


10.2 Mediation Analysis With only X and M as Predictors

Estimating indirect and direct effects of loginc (X) on read02 (Y) via HOME97 (M), assuming we have theoretical/empirical reasons to suspect HOME97 mediates the effect of family income on reading scores.

Step 1: Find out the original effect/coefficient of loginc (X) on read02 (Y)

\[{read02}= \alpha + \beta_1 {loginc} + \epsilon_i\]

m1<-lm(read02~loginc, data=goodMini)
summary(m1)    # b1 = 3.15
## 
## Call:
## lm(formula = read02 ~ loginc, data = goodMini)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -53.26  -7.42   1.27   8.59  44.57 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   39.427      2.703    14.6 <0.0000000000000002 ***
## loginc         3.152      0.257    12.3 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12 on 1453 degrees of freedom
## Multiple R-squared:  0.0939, Adjusted R-squared:  0.0933 
## F-statistic:  151 on 1 and 1453 DF,  p-value: <0.0000000000000002

Is the coefficient (b1) significant? If not, we do not need to go further to explore mediating effect at all.


Step 2: Find out the coefficient of loginc (X) on HOME97 (M)

If X and M have no relationship, M is just a third variable that may or may not be associated with Y. A mediation makes sense only if X affects M.

\[{HOME97}= \alpha + \beta_2 {loginc} + \epsilon_i\]

m1.M<-lm(HOME97~loginc, data=goodMini)
summary(m1.M)    # b2 = 1.16
## 
## Call:
## lm(formula = HOME97 ~ loginc, data = goodMini)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.176  -1.755   0.227   1.840  12.271 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   8.1294     0.6359    12.8 <0.0000000000000002 ***
## loginc        1.1639     0.0604    19.3 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.8 on 1453 degrees of freedom
## Multiple R-squared:  0.203,  Adjusted R-squared:  0.203 
## F-statistic:  371 on 1 and 1453 DF,  p-value: <0.0000000000000002

Is the coefficient of loginc on HOME97 (b2) significant?


Step 3: Find out the new coefficient for loginc (X) on read02 (Y), with HOME97 included

\[{read02}= \alpha + \beta_3 {loginc} + \beta_4{HOME97} + \epsilon_i\]

If a mediation effect exists, the effect of X on Y will disappear (or at least weaken) when M is included in the regression, compared to lm1.

m1.Y<-lm(read02~loginc + HOME97, data=goodMini)
summary(m1.Y)    # b3 = 1.37  ;  b4 = 1.53
## 
## Call:
## lm(formula = read02 ~ loginc + HOME97, data = goodMini)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -53.50  -6.67   1.24   7.72  33.66 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   26.962      2.660   10.14 < 0.0000000000000002 ***
## loginc         1.368      0.269    5.09            0.0000004 ***
## HOME97         1.533      0.104   14.74 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11 on 1452 degrees of freedom
## Multiple R-squared:  0.212,  Adjusted R-squared:  0.211 
## F-statistic:  195 on 2 and 1452 DF,  p-value: <0.0000000000000002

How does the new coefficient of loginc (b3) change, compared to b1? Does it decrease or disappear completely?

If the effect of X on Y completely disappears, M fully mediates between X and Y (full mediation) which rarely happens, however. If the effect of X on Y still exists, but in a smaller magnitude, M partially mediates between X and Y (partial mediation).


Step 4: Tease out the Indirect and direct effect

  1. Indirect path: loginc –> HOME97 –> read02
  2. Direct path: loginc –> read02
  • Indirect path: loginc –> HOME97 (b2 = 1.16) –> read02 (b4 = 1.53) 1.16*1.53 = 1.8

  • Direct path: loginc –> read02 (b3 = 1.37) 1.37 # almost half of original value (i.e., b1 = 3.15)

  • Total effect: 1.8 + 1.37 = 3.2

Summary Table using stargazer

library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
stargazer(m1,m1.M,m1.Y, type = "text", title = "Mediation Analysis With only X and M as Predictors",
          star.cutoffs = c(0.05, 0.01, 0.001), digits = 2)
## 
## Mediation Analysis With only X and M as Predictors
## ==============================================================================================
##                                                Dependent variable:                            
##                     --------------------------------------------------------------------------
##                              read02                   HOME97                   read02         
##                               (1)                      (2)                      (3)           
## ----------------------------------------------------------------------------------------------
## loginc                      3.10***                  1.20***                  1.40***         
##                              (0.26)                   (0.06)                   (0.27)         
##                                                                                               
## HOME97                                                                        1.50***         
##                                                                                (0.10)         
##                                                                                               
## Constant                    39.00***                 8.10***                  27.00***        
##                              (2.70)                   (0.64)                   (2.70)         
##                                                                                               
## ----------------------------------------------------------------------------------------------
## Observations                 1,455                    1,455                    1,455          
## R2                            0.09                     0.20                     0.21          
## Adjusted R2                   0.09                     0.20                     0.21          
## Residual Std. Error    12.00 (df = 1453)         2.80 (df = 1453)        11.00 (df = 1452)    
## F Statistic         151.00*** (df = 1; 1453) 371.00*** (df = 1; 1453) 195.00*** (df = 2; 1452)
## ==============================================================================================
## Note:                                                            *p<0.05; **p<0.01; ***p<0.001

10.3 Mediation Analysis with Control Variables

Now, we estimate the indirect effect of loginc (X) on read02 (Y) via HOME97 (M), controlling for ageC and race, using the same procedure.

Step 1: Find out the original effect/coefficient of loginc (X) on read02 (Y)

\[read02= \alpha + \beta_1 {loginc} + \beta_p {ageC} + \beta_q {race} + \epsilon_i\]

m2<-lm(read02~ ageC + race + loginc, data= goodMini)
summary(m2)  # b1 = 1.61
## 
## Call:
## lm(formula = read02 ~ ageC + race + loginc, data = goodMini)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51.56  -4.64   0.98   6.01  28.05 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  57.0036     2.3658   24.10 < 0.0000000000000002 ***
## ageC          2.2907     0.0867   26.43 < 0.0000000000000002 ***
## race          6.6986     0.5402   12.40 < 0.0000000000000002 ***
## loginc        1.6112     0.2250    7.16      0.0000000000013 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.4 on 1451 degrees of freedom
## Multiple R-squared:  0.421,  Adjusted R-squared:  0.419 
## F-statistic:  351 on 3 and 1451 DF,  p-value: <0.0000000000000002
# Is the coefficient (b1) significant? 
# If not, not need to go further to explore mediating effect at all. 

Step 2: Find out the coefficient of loginc (X) on HOME97 (M)

\[HOME97= \alpha + \beta_2 {loginc} + \beta_p {ageC} + \beta_q {race} + \epsilon_i\]

m2.M<-lm(HOME97~ageC + race + loginc, data= goodMini)
summary(m2.M)   # b2 = 0.76
## 
## Call:
## lm(formula = HOME97 ~ ageC + race + loginc, data = goodMini)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.814 -1.500  0.126  1.711  8.800 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  12.4897     0.6314   19.78 <0.0000000000000002 ***
## ageC          0.2203     0.0231    9.52 <0.0000000000000002 ***
## race          2.2190     0.1442   15.39 <0.0000000000000002 ***
## loginc        0.7561     0.0601   12.59 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.5 on 1451 degrees of freedom
## Multiple R-squared:  0.344,  Adjusted R-squared:  0.343 
## F-statistic:  254 on 3 and 1451 DF,  p-value: <0.0000000000000002

Step 3: Find out the new coefficient for loginc (X) on read02 (Y), with HOME97 and controls included

\[read02= \alpha + \beta_3 {loginc} + \beta_4{HOME97} + \beta_p {ageC} + \beta_q {race} + \epsilon_i\]

# If a mediation effect exists, the effect of X on Y will disappear (or at least weaken)
# when M is included in the regression, compared to lm1. 

m2.Y<-lm(read02~ageC + race + loginc + HOME97, data= goodMini)
summary(m2.Y)   # b3 =  1.01  ;  b4 = 0.79
## 
## Call:
## lm(formula = read02 ~ ageC + race + loginc + HOME97, data = goodMini)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51.30  -4.54   0.71   6.24  25.79 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  47.0935     2.6062   18.07 < 0.0000000000000002 ***
## ageC          2.1159     0.0873   24.23 < 0.0000000000000002 ***
## race          4.9379     0.5696    8.67 < 0.0000000000000002 ***
## loginc        1.0113     0.2317    4.36  0.00001366217904877 ***
## HOME97        0.7935     0.0962    8.25  0.00000000000000035 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.2 on 1450 degrees of freedom
## Multiple R-squared:  0.447,  Adjusted R-squared:  0.445 
## F-statistic:  292 on 4 and 1450 DF,  p-value: <0.0000000000000002

how does the new coefficient of loginc (b3) change, compared to b1? Does it decrease or disappear completely? ***

Step 4: Tease out the Indirect and direct effect

  1. Indirect path: loginc –> HOME97 –> read02

  2. Direct path 1: loginc –> read02

  3. Direct path 2: race –> read02

  4. Direct path 3: ageC –> read02

  • Indirect path: loginc –> HOME97 (b2 = 0.76) –> read02 (b4 = 0.79) 0.76*0.79 = 0.59

  • Direct path 1: loginc –> read02 (b3 = 1.01) 1.01

  • Total effect of loginc –> read02: 0.59 + 1.01 = 1.6
    (compared to the coefficient of loginc in m2 (b1 = 1.61))

Summary Table using stargazer

library(stargazer)
stargazer(m2,m2.M,m2.Y, type = "text", title = "Mediation Analysis with Control Variables",
          star.cutoffs = c(0.05, 0.01, 0.001), digits = 2)
## 
## Mediation Analysis with Control Variables
## ==============================================================================================
##                                                Dependent variable:                            
##                     --------------------------------------------------------------------------
##                              read02                   HOME97                   read02         
##                               (1)                      (2)                      (3)           
## ----------------------------------------------------------------------------------------------
## ageC                        2.30***                  0.22***                  2.10***         
##                              (0.09)                   (0.02)                   (0.09)         
##                                                                                               
## race                        6.70***                  2.20***                  4.90***         
##                              (0.54)                   (0.14)                   (0.57)         
##                                                                                               
## loginc                      1.60***                  0.76***                  1.00***         
##                              (0.23)                   (0.06)                   (0.23)         
##                                                                                               
## HOME97                                                                        0.79***         
##                                                                                (0.10)         
##                                                                                               
## Constant                    57.00***                 12.00***                 47.00***        
##                              (2.40)                   (0.63)                   (2.60)         
##                                                                                               
## ----------------------------------------------------------------------------------------------
## Observations                 1,455                    1,455                    1,455          
## R2                            0.42                     0.34                     0.45          
## Adjusted R2                   0.42                     0.34                     0.44          
## Residual Std. Error     9.40 (df = 1451)         2.50 (df = 1451)         9.20 (df = 1450)    
## F Statistic         351.00*** (df = 3; 1451) 254.00*** (df = 3; 1451) 292.00*** (df = 4; 1450)
## ==============================================================================================
## Note:                                                            *p<0.05; **p<0.01; ***p<0.001

10.4 Testing the Significance of Mediation

There are two primary methods for formally testing the significance of the indirect test: the Sobel test & bootstrapping. For more information, see also http://davidakenny.net/cm/mediate.htm#WIM

1 - Sobel Test

A test, first proposed by Sobel (1982), was initially often used. The Sobel test provides an approximate estimate of the standard error of ab. However, the Sobel test is very conservative (MacKinnon, Warsi, & Dwyer, 1995), and so it has very low power. Bootstrapping has replaced the more conservative Sobel test in recent practices.

# install.packages("multilevel")
library(multilevel)
## Loading required package: nlme
## Loading required package: MASS
sobel(goodMini$loginc, goodMini$HOME97, goodMini$read02)
## $`Mod1: Y~X`
##             Estimate Std. Error t value
## (Intercept)     39.4       2.70      15
## pred             3.2       0.26      12
##                                                     Pr(>|t|)
## (Intercept) 0.0000000000000000000000000000000000000000000044
## pred        0.0000000000000000000000000000000053155072010804
## 
## $`Mod2: Y~X+M`
##             Estimate Std. Error t value
## (Intercept)     27.0       2.66    10.1
## pred             1.4       0.27     5.1
## med              1.5       0.10    14.7
##                                                      Pr(>|t|)
## (Intercept) 0.00000000000000000000002219236370632509845355826
## pred        0.00000039841588700485895090108723337607443681918
## med         0.00000000000000000000000000000000000000000000064
## 
## $`Mod3: M~X`
##             Estimate Std. Error t value
## (Intercept)      8.1       0.64      13
## pred             1.2       0.06      19
##                                                                                 Pr(>|t|)
## (Intercept) 0.00000000000000000000000000000000001527345899890333983869861178561677661492
## pred        0.00000000000000000000000000000000000000000000000000000000000000000000000009
## 
## $Indirect.Effect
## [1] 1.8
## 
## $SE
## [1] 0.15
## 
## $z.value
## [1] 12
## 
## $N
## [1] 1455
# the results resembles section 10.2

2 - Bootstrapping

Bootstrapping is a non-parametric method based on resampling with replacement which is done many times, e.g., 5000 times. From each of these samples the indirect effect is computed and a sampling distribution can be empirically generated. Because the mean of the bootstrapped distribution will not exactly equal the indirect effect a correction for bias can be made. With the distribution, a confidence interval, a p value, or a standard error can be determined. Very typically a confidence interval is computed and it is checked to determine if zero is in the interval. If zero is not in the interval, then the researcher can be confident that the indirect effect is different from zero. Also a Z value can determined by dividing the bootstrapped estimate by its standard error, but bootstrapped standard errors suffer the same problem as the Sobel standard errors and are not recommended. (Bootstrapping does not require the assumption that a and b are uncorrelated.)

# install.packages("mediation")
library(mediation)
## Loading required package: Matrix
## Loading required package: mvtnorm
## Loading required package: sandwich
## mediation: Causal Mediation Analysis
## Version: 4.4.7
results <- mediate(m2.M, m2.Y, treat='loginc', mediator='HOME97',
                   boot=TRUE, sims=500)
summary(results)
## 
## Causal Mediation Analysis 
## 
## Nonparametric Bootstrap Confidence Intervals with the Percentile Method
## 
##                Estimate 95% CI Lower 95% CI Upper             p-value    
## ACME              0.600        0.417         0.87 <0.0000000000000002 ***
## ADE               1.011        0.573         1.57 <0.0000000000000002 ***
## Total Effect      1.611        1.071         2.27 <0.0000000000000002 ***
## Prop. Mediated    0.372        0.253         0.53 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sample Size Used: 1455 
## 
## 
## Simulations: 500
plot(results, effect.type = c("indirect","direct","total"),
     title = "Mediation Analysis using Bootstrapping (sims = 500)", 
     xlab = "", ylab = "", 
     main = NULL, lwd = 1.5, cex = .85,
     col = "black")

The mediate function gives us the Average Causal Mediation Effects (ACME), Average Direct Effects (ADE), the combined indirect and direct effects (Total Effect), and the ratio of these estimates (Prop. Mediated). Again, the ACME here is the indirect effect of M (total effect - average direct effect) the estimate of ACME tells us if the mediation effect is significant.