options(scipen=999, digits = 2)
# install.packages("multilevel")
# install.packages("mediation")
First, create a clean data frame with the selected variables.
good <- read.csv("good.csv")
#recoding Race as we did before
# -.5 = black
# .5 = white
good$race <- ifelse(good$CHRACE == 1, .5, ifelse(good$CHRACE == 2, -.5, NA))
# Log and/or Centering Transformations of `AGE97` and `faminc97`
good$ageC <- good$AGE97 - 8
good$ageC2 <- good$ageC^2
good$faminc97New <- ifelse(good$faminc97 <= 1, 1, good$faminc97)
summary(good$faminc97New)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 20176 39118 49882 64495 784611
good$loginc <- log(good$faminc97New)
goodMini <- na.omit(good[,c("CHID","ageC","ageC2","loginc",
"HOME97","race","read02","WICpreg")])
attach(goodMini)
str(goodMini)
## 'data.frame': 1455 obs. of 8 variables:
## $ CHID : int 4041 5032 7039 7040 7045 10034 10035 14032 14030 14031 ...
## $ ageC : num -4 4 1 -1 -4 -2 -4 -4 3 0 ...
## $ ageC2 : num 16 16 1 1 16 4 16 16 9 0 ...
## $ loginc : num 9.04 11.4 9.46 9.46 11.09 ...
## $ HOME97 : num 15.1 21 18.6 16.6 16 25.4 20.7 21.7 20.6 20.7 ...
## $ race : num 0.5 0.5 0.5 0.5 0.5 -0.5 -0.5 0.5 0.5 0.5 ...
## $ read02 : int 45 88 52 64 53 75 54 74 82 77 ...
## $ WICpreg: int 1 1 1 1 1 1 1 0 0 0 ...
## - attr(*, "na.action")= 'omit' Named int 1 2 4 5 6 7 8 10 11 12 ...
## ..- attr(*, "names")= chr "1" "2" "4" "5" ...
Mediation occurs when the effect of an independent variable (X) on a dependent variable (Y) is transmitted through a third, intervening/mediating variable (M). In other words, there is an indirect effect when X affects M, which in turn affects Y, other things being equal. Hence, we say M mediates the effect of X on Y. Note: M is just another independent variable in your model.
For mediation analysis, besides the average direct effect (ADE), we are also interested in the average mediation effect, denoted the average causal mediation effect (ACME).
Basic Mediation Model
Note: Mediation analysis itself does not imply causal relationships unless the data come from an experimental design.
Here are the basic steps for mediation analysis suggested by Baron & Kenny (1986).
A mediation analysis consists of three regressions:
\(X \rightarrow Y\)
\(X \rightarrow M\), and
\(X + M \rightarrow Y\)
We estimate the indirect and direct effects of loginc (X) on read02 (Y) via HOME97 (M), assuming we have theoretical or empirical reasons to suspect that HOME97 mediates the effect of family income on reading scores.
\[{read02}= \alpha + \beta_1 {loginc} + \epsilon_i\]
m1<-lm(read02~loginc, data=goodMini)
summary(m1) # b1 = 3.15
##
## Call:
## lm(formula = read02 ~ loginc, data = goodMini)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.26 -7.42 1.27 8.59 44.57
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.427 2.703 14.6 <0.0000000000000002 ***
## loginc 3.152 0.257 12.3 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12 on 1453 degrees of freedom
## Multiple R-squared: 0.0939, Adjusted R-squared: 0.0933
## F-statistic: 151 on 1 and 1453 DF, p-value: <0.0000000000000002
Is the coefficient (b1) significant? If not, there is no need to go further to explore a mediating effect.
If X and M have no relationship, M is just a third variable that may or may not be associated with Y. Mediation makes sense only if X affects M.
\[{HOME97}= \alpha + \beta_2 {loginc} + \epsilon_i\]
m1.M<-lm(HOME97~loginc, data=goodMini)
summary(m1.M) # b2 = 1.16
##
## Call:
## lm(formula = HOME97 ~ loginc, data = goodMini)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.176 -1.755 0.227 1.840 12.271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.1294 0.6359 12.8 <0.0000000000000002 ***
## loginc 1.1639 0.0604 19.3 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.8 on 1453 degrees of freedom
## Multiple R-squared: 0.203, Adjusted R-squared: 0.203
## F-statistic: 371 on 1 and 1453 DF, p-value: <0.0000000000000002
Is the coefficient of loginc on HOME97 (b2) significant?
\[{read02}= \alpha + \beta_3 {loginc} + \beta_4{HOME97} + \epsilon_i\]
If a mediation effect exists, the effect of X on Y will disappear (or at least weaken) when M is included in the regression, compared to m1.
m1.Y<-lm(read02~loginc + HOME97, data=goodMini)
summary(m1.Y) # b3 = 1.37 ; b4 = 1.53
##
## Call:
## lm(formula = read02 ~ loginc + HOME97, data = goodMini)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.50 -6.67 1.24 7.72 33.66
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.962 2.660 10.14 < 0.0000000000000002 ***
## loginc 1.368 0.269 5.09 0.0000004 ***
## HOME97 1.533 0.104 14.74 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11 on 1452 degrees of freedom
## Multiple R-squared: 0.212, Adjusted R-squared: 0.211
## F-statistic: 195 on 2 and 1452 DF, p-value: <0.0000000000000002
How does the new coefficient of loginc (b3) change, compared to b1? Does it decrease or disappear completely?
If the effect of X on Y disappears completely, M fully mediates between X and Y (full mediation), which rarely happens. If the effect of X on Y still exists but is smaller in magnitude, M partially mediates between X and Y (partial mediation).
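In a linear model with no treatment-by-mediator interaction, these pieces satisfy an exact decomposition of the total effect into direct and indirect parts, in the notation of the three equations above:
\[\beta_1 = \beta_3 + \beta_2 \times \beta_4\]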
Indirect path: loginc → HOME97 (b2 = 1.16) → read02 (b4 = 1.53); indirect effect = 1.16 × 1.53 ≈ 1.8
Direct path: loginc → read02 (b3 = 1.37), roughly half of the original value (b1 = 3.15)
Total effect: 1.8 + 1.37 ≈ 3.2, which recovers the total effect from m1 (b1 = 3.15) up to rounding
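As a quick check, the same decomposition can be computed directly from the fitted models. This is a minimal sketch using the m1, m1.M, and m1.Y objects above; the names b1 through b4 are just local labels for the extracted coefficients.
b1 <- coef(m1)["loginc"]     # total effect of loginc on read02
b2 <- coef(m1.M)["loginc"]   # effect of loginc on HOME97 (a path)
b3 <- coef(m1.Y)["loginc"]   # direct effect of loginc, holding HOME97 constant
b4 <- coef(m1.Y)["HOME97"]   # effect of HOME97 on read02, holding loginc constant
indirect <- b2 * b4          # indirect (mediated) effect
c(indirect = unname(indirect),
  direct   = unname(b3),
  total    = unname(b3 + indirect),  # equals b1 exactly in OLS
  b1       = unname(b1),
  prop.mediated = unname(indirect / (b3 + indirect)))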
Summary Table using stargazer
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
stargazer(m1,m1.M,m1.Y, type = "text", title = "Mediation Analysis With only X and M as Predictors",
star.cutoffs = c(0.05, 0.01, 0.001), digits = 2)
##
## Mediation Analysis With only X and M as Predictors
## ==============================================================================================
## Dependent variable:
## --------------------------------------------------------------------------
## read02 HOME97 read02
## (1) (2) (3)
## ----------------------------------------------------------------------------------------------
## loginc 3.10*** 1.20*** 1.40***
## (0.26) (0.06) (0.27)
##
## HOME97 1.50***
## (0.10)
##
## Constant 39.00*** 8.10*** 27.00***
## (2.70) (0.64) (2.70)
##
## ----------------------------------------------------------------------------------------------
## Observations 1,455 1,455 1,455
## R2 0.09 0.20 0.21
## Adjusted R2 0.09 0.20 0.21
## Residual Std. Error 12.00 (df = 1453) 2.80 (df = 1453) 11.00 (df = 1452)
## F Statistic 151.00*** (df = 1; 1453) 371.00*** (df = 1; 1453) 195.00*** (df = 2; 1452)
## ==============================================================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
Now, we estimate the indirect effect of loginc (X) on read02 (Y) via HOME97 (M), controlling for ageC and race, using the same procedure.
\[read02= \alpha + \beta_1 {loginc} + \beta_p {ageC} + \beta_q {race} + \epsilon_i\]
m2<-lm(read02~ ageC + race + loginc, data= goodMini)
summary(m2) # b1 = 1.61
##
## Call:
## lm(formula = read02 ~ ageC + race + loginc, data = goodMini)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.56 -4.64 0.98 6.01 28.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.0036 2.3658 24.10 < 0.0000000000000002 ***
## ageC 2.2907 0.0867 26.43 < 0.0000000000000002 ***
## race 6.6986 0.5402 12.40 < 0.0000000000000002 ***
## loginc 1.6112 0.2250 7.16 0.0000000000013 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.4 on 1451 degrees of freedom
## Multiple R-squared: 0.421, Adjusted R-squared: 0.419
## F-statistic: 351 on 3 and 1451 DF, p-value: <0.0000000000000002
# Is the coefficient (b1) significant?
# If not, there is no need to go further to explore a mediating effect.
\[HOME97= \alpha + \beta_2 {loginc} + \beta_p {ageC} + \beta_q {race} + \epsilon_i\]
m2.M<-lm(HOME97~ageC + race + loginc, data= goodMini)
summary(m2.M) # b2 = 0.76
##
## Call:
## lm(formula = HOME97 ~ ageC + race + loginc, data = goodMini)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.814 -1.500 0.126 1.711 8.800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.4897 0.6314 19.78 <0.0000000000000002 ***
## ageC 0.2203 0.0231 9.52 <0.0000000000000002 ***
## race 2.2190 0.1442 15.39 <0.0000000000000002 ***
## loginc 0.7561 0.0601 12.59 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.5 on 1451 degrees of freedom
## Multiple R-squared: 0.344, Adjusted R-squared: 0.343
## F-statistic: 254 on 3 and 1451 DF, p-value: <0.0000000000000002
\[read02= \alpha + \beta_3 {loginc} + \beta_4{HOME97} + \beta_p {ageC} + \beta_q {race} + \epsilon_i\]
# If a mediation effect exists, the effect of X on Y will disappear (or at least weaken)
# when M is included in the regression, compared to m2.
m2.Y<-lm(read02~ageC + race + loginc + HOME97, data= goodMini)
summary(m2.Y) # b3 = 1.01 ; b4 = 0.79
##
## Call:
## lm(formula = read02 ~ ageC + race + loginc + HOME97, data = goodMini)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.30 -4.54 0.71 6.24 25.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.0935 2.6062 18.07 < 0.0000000000000002 ***
## ageC 2.1159 0.0873 24.23 < 0.0000000000000002 ***
## race 4.9379 0.5696 8.67 < 0.0000000000000002 ***
## loginc 1.0113 0.2317 4.36 0.00001366217904877 ***
## HOME97 0.7935 0.0962 8.25 0.00000000000000035 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.2 on 1450 degrees of freedom
## Multiple R-squared: 0.447, Adjusted R-squared: 0.445
## F-statistic: 292 on 4 and 1450 DF, p-value: <0.0000000000000002
How does the new coefficient of loginc (b3) change, compared to b1? Does it decrease or disappear completely?
Indirect path: loginc → HOME97 → read02
Direct path 1: loginc → read02
Direct path 2: race → read02
Direct path 3: ageC → read02
Indirect path: loginc → HOME97 (b2 = 0.76) → read02 (b4 = 0.79); indirect effect = 0.76 × 0.79 ≈ 0.60
Direct path 1: loginc → read02 (b3 = 1.01)
Total effect of loginc → read02: 0.60 + 1.01 = 1.61
(which matches the coefficient of loginc in m2, b1 = 1.61)
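The same identity can be verified for the adjusted models. A short sketch, reusing the approach above with m2, m2.M, and m2.Y:
indirect2 <- coef(m2.M)["loginc"] * coef(m2.Y)["HOME97"]  # 0.76 * 0.79
direct2   <- coef(m2.Y)["loginc"]                         # 1.01
unname(direct2 + indirect2)                 # total effect; equals coef(m2)["loginc"] = 1.61
unname(indirect2 / (direct2 + indirect2))   # proportion mediated, about 0.37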
Summary Table using stargazer
library(stargazer)
stargazer(m2,m2.M,m2.Y, type = "text", title = "Mediation Analysis with Control Variables",
star.cutoffs = c(0.05, 0.01, 0.001), digits = 2)
##
## Mediation Analysis with Control Variables
## ==============================================================================================
## Dependent variable:
## --------------------------------------------------------------------------
## read02 HOME97 read02
## (1) (2) (3)
## ----------------------------------------------------------------------------------------------
## ageC 2.30*** 0.22*** 2.10***
## (0.09) (0.02) (0.09)
##
## race 6.70*** 2.20*** 4.90***
## (0.54) (0.14) (0.57)
##
## loginc 1.60*** 0.76*** 1.00***
## (0.23) (0.06) (0.23)
##
## HOME97 0.79***
## (0.10)
##
## Constant 57.00*** 12.00*** 47.00***
## (2.40) (0.63) (2.60)
##
## ----------------------------------------------------------------------------------------------
## Observations 1,455 1,455 1,455
## R2 0.42 0.34 0.45
## Adjusted R2 0.42 0.34 0.44
## Residual Std. Error 9.40 (df = 1451) 2.50 (df = 1451) 9.20 (df = 1450)
## F Statistic 351.00*** (df = 3; 1451) 254.00*** (df = 3; 1451) 292.00*** (df = 4; 1450)
## ==============================================================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
There are two primary methods for formally testing the significance of the indirect effect: the Sobel test and bootstrapping. For more information, see http://davidakenny.net/cm/mediate.htm#WIM
The Sobel test (Sobel, 1982) was for a long time the standard approach. It provides an approximate estimate of the standard error of the indirect effect ab. However, the Sobel test is very conservative (MacKinnon, Warsi, & Dwyer, 1995), and so it has very low power. In recent practice, bootstrapping has replaced the more conservative Sobel test.
# install.packages("multilevel")
library(multilevel)
## Loading required package: nlme
## Loading required package: MASS
sobel(goodMini$loginc, goodMini$HOME97, goodMini$read02)
## $`Mod1: Y~X`
## Estimate Std. Error t value
## (Intercept) 39.4 2.70 15
## pred 3.2 0.26 12
## Pr(>|t|)
## (Intercept) 0.0000000000000000000000000000000000000000000044
## pred 0.0000000000000000000000000000000053155072010804
##
## $`Mod2: Y~X+M`
## Estimate Std. Error t value
## (Intercept) 27.0 2.66 10.1
## pred 1.4 0.27 5.1
## med 1.5 0.10 14.7
## Pr(>|t|)
## (Intercept) 0.00000000000000000000002219236370632509845355826
## pred 0.00000039841588700485895090108723337607443681918
## med 0.00000000000000000000000000000000000000000000064
##
## $`Mod3: M~X`
## Estimate Std. Error t value
## (Intercept) 8.1 0.64 13
## pred 1.2 0.06 19
## Pr(>|t|)
## (Intercept) 0.00000000000000000000000000000000001527345899890333983869861178561677661492
## pred 0.00000000000000000000000000000000000000000000000000000000000000000000000009
##
## $Indirect.Effect
## [1] 1.8
##
## $SE
## [1] 0.15
##
## $z.value
## [1] 12
##
## $N
## [1] 1455
# the results resemble those in section 10.2
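As a check on the output above, the first-order Sobel standard error can be computed by hand from m1.M and m1.Y. This is a minimal sketch of the usual formula \(SE_{ab} = \sqrt{b^2\,SE_a^2 + a^2\,SE_b^2}\); some implementations add a third term, so small differences are possible.
a    <- coef(m1.M)["loginc"]                         # a path: loginc -> HOME97
se.a <- coef(summary(m1.M))["loginc", "Std. Error"]
b    <- coef(m1.Y)["HOME97"]                         # b path: HOME97 -> read02
se.b <- coef(summary(m1.Y))["HOME97", "Std. Error"]
se.ab <- sqrt(b^2 * se.a^2 + a^2 * se.b^2)           # first-order Sobel SE
c(indirect = unname(a * b), SE = unname(se.ab), z = unname(a * b / se.ab))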
Bootstrapping is a non-parametric method based on resampling with replacement, repeated many times (e.g., 5,000 times). From each of these samples the indirect effect is computed, and a sampling distribution is generated empirically. Because the mean of the bootstrapped distribution will not exactly equal the indirect effect, a correction for bias can be made. From this distribution, a confidence interval, a p-value, or a standard error can be determined. Most typically, a confidence interval is computed and checked to see whether zero falls inside it; if zero is not in the interval, the researcher can be confident that the indirect effect differs from zero. A Z value can also be determined by dividing the bootstrapped estimate by its standard error, but bootstrapped standard errors suffer from the same problem as the Sobel standard errors and are not recommended. (Bootstrapping does not require the assumption that a and b are uncorrelated.)
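Before turning to the mediation package, here is a minimal hand-rolled percentile bootstrap of the indirect effect for the adjusted models. It is for illustration only; 500 resamples are used for speed (5,000 is more typical), and the seed 123 is arbitrary.
set.seed(123)                                    # arbitrary seed for reproducibility
B  <- 500                                        # number of bootstrap resamples
ab <- numeric(B)
for (i in 1:B) {
  idx <- sample(nrow(goodMini), replace = TRUE)  # resample rows with replacement
  d   <- goodMini[idx, ]
  a   <- coef(lm(HOME97 ~ ageC + race + loginc, data = d))["loginc"]
  b   <- coef(lm(read02 ~ ageC + race + loginc + HOME97, data = d))["HOME97"]
  ab[i] <- a * b                                 # indirect effect in this resample
}
quantile(ab, c(.025, .975))                      # percentile 95% CI; should exclude zero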
# install.packages("mediation")
library(mediation)
## Loading required package: Matrix
## Loading required package: mvtnorm
## Loading required package: sandwich
## mediation: Causal Mediation Analysis
## Version: 4.4.7
results <- mediate(m2.M, m2.Y, treat='loginc', mediator='HOME97',
boot=TRUE, sims=500)
summary(results)
##
## Causal Mediation Analysis
##
## Nonparametric Bootstrap Confidence Intervals with the Percentile Method
##
## Estimate 95% CI Lower 95% CI Upper p-value
## ACME 0.600 0.417 0.87 <0.0000000000000002 ***
## ADE 1.011 0.573 1.57 <0.0000000000000002 ***
## Total Effect 1.611 1.071 2.27 <0.0000000000000002 ***
## Prop. Mediated 0.372 0.253 0.53 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Sample Size Used: 1455
##
##
## Simulations: 500
plot(results, effect.type = c("indirect","direct","total"),
title = "Mediation Analysis using Bootstrapping (sims = 500)",
xlab = "", ylab = "",
main = NULL, lwd = 1.5, cex = .85,
col = "black")
The mediate function gives us the average causal mediation effect (ACME), the average direct effect (ADE), their sum (Total Effect), and the proportion of the total effect that is mediated (Prop. Mediated = ACME / Total Effect). Again, the ACME is the indirect effect through M (total effect minus average direct effect); its estimate and confidence interval tell us whether the mediation effect is significant.
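The point estimates printed by summary(results) can also be pulled out of the results object for further use. The component names below are an assumption based on recent versions of the mediation package; check str(results) if your version differs.
results$d.avg      # ACME (averaged over treatment levels)
results$z.avg      # ADE
results$tau.coef   # total effect
results$n.avg      # proportion mediated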