Mediation Tutorial

Notes I made from the very useful “Getting Started with Mediation” article, located at: https://towardsdatascience.com/doing-and-reporting-your-first-mediation-analysis-in-r-2fe423b92171

Basic concepts:

mediators must be endogenous: can’t be a treatment or study condition. It must be dependent on these exogenous variables (IVs).
mediators must reveal more about how the IV affects the DV. Can’t be just random.

Starting our r work:

df=iris
set.seed(12334) #this makes random effects similar to the ones used for tutorial purposes. Run this or you'll get different numbers.

Our model:

Sepal Length (iv) -> Pollination likelihood (dv): c
Sepal Length (iv) -> Bee Attraction (mediator): a
Bee Attraction (mediator) -> Pollination Likelihood (dv): b

To analyze this, let’s create simulated mediation variable.

df$random1=runif(nrow(df),min=min(df$Sepal.Length),max=max(df$Sepal.Length))
df$mediator=df$Sepal.Length*0.35+df$random1*0.65

Now let’s create a simulated dv, pollenation likelihood.

df$random2=runif(nrow(df),min=min(df$mediator),max=max(df$mediator))
df$dv=df$mediator*0.35+df$random2*.65

steps:

test total effect. Does sepal length have an effect on pollination likelihood?
test effect of iv on mediator. this must exist as a prereq for mediation being possible at all
test effects of mediator and iv on dv simultaneously. this is the core of our analysis.
estimate various quantities for causal mediation analysis, comparing direct to indirect effect.

hopefully this gives us some insight into what’s going on in the data!

Step 1:

Must a relationship between the IV and DV exist?

old school camp: there must be a significant relationship betwen the two. otherwise, we can’t say for sure that changing the iv has any impact on dv. However, this school of thouhgt is slowly losing ground to…

newer ideas: there doesn’t necessarily have to be a sig relationship. Correlation doesn’t prove causation, but lack of correlation doesn’t fully disprove causation, either. (Bollen, 1989; Hayes, 2018)

for step 1, let’s do a simple linear reg of iv onto final dv.

fittotal = lm(dv ~ Sepal.Length, df)
summary(fittotal)

## 
## Call:
## lm(formula = dv ~ Sepal.Length, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.15930 -0.45815 -0.01242  0.44662  1.20905 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.29106    0.32791  16.136   <2e-16 ***
## Sepal.Length  0.12984    0.05557   2.337   0.0208 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5616 on 148 degrees of freedom
## Multiple R-squared:  0.03558,    Adjusted R-squared:  0.02907 
## F-statistic:  5.46 on 1 and 148 DF,  p-value: 0.02079

These coeffs look good, very close to the real pattern we simulated into the data

(if the coeff is not around .129, re-run all the code, especially the set seed command)

Step 2:

Another linear reg, this time iv on mediator plus any other covariates (which we don’t have any of here)

fitmed = lm(mediator~Sepal.Length,df)
summary(fitmed)

## 
## Call:
## lm(formula = mediator ~ Sepal.Length, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2494 -0.5082  0.0123  0.5483  1.0799 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.32300    0.37901  11.406  < 2e-16 ***
## Sepal.Length  0.30429    0.06422   4.738 5.02e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6492 on 148 degrees of freedom
## Multiple R-squared:  0.1317, Adjusted R-squared:  0.1258 
## F-statistic: 22.45 on 1 and 148 DF,  p-value: 5.019e-06

These coeffs are again close to the 35% simulation parameter again.

Step 3:

Let’s confirm that mediator affects dv, while controlling iv. For a mediation to be true, the mediator must explain more or at least other variance in the dv than the iv itself. Analysis is still simple, though: another linear reg.

fitdv = lm(dv~Sepal.Length+mediator,df)
summary(fitdv)

## 
## Call:
## lm(formula = dv ~ Sepal.Length + mediator, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90734 -0.46316  0.00764  0.39751  0.87026 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.68314    0.40721   9.045 8.17e-16 ***
## Sepal.Length  0.01667    0.05402   0.309    0.758    
## mediator      0.37194    0.06443   5.773 4.46e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5088 on 147 degrees of freedom
## Multiple R-squared:  0.2138, Adjusted R-squared:  0.2031 
## F-statistic: 19.99 on 2 and 147 DF,  p-value: 2.092e-08

So here we see evidence of the mediation. Sig coeffs, again close to the true sim value of 35%.

Note how sepal length is no longer significant! This is a “complete mediation”, where the base effect is illusory and the mediator explains all of it.

If the IV was still sig, it would be an incomplete mediation, which is still fine; just a more complex relationship.

Step 4:

We have all the info we need to show there’s a mediation effect, but for completeness and usability in the modern world, let’s dig deeper. We could use a Structural Equation Model (SEM) or an R package that approximates it (“lavaan”), but that’s too hard. Instead we’ll use the “mediation” package, which is more user-friendly.

#install.packages("mediation")
library(mediation)

## Loading required package: MASS

## Loading required package: Matrix

## Loading required package: mvtnorm

## Loading required package: sandwich

## mediation: Causal Mediation Analysis
## Version: 4.5.0

Time to get some results.

results= mediate(fitmed, fitdv, treat='Sepal.Length', mediator = 'mediator', boot=T)

## Running nonparametric bootstrap

summary(results)

## 
## Causal Mediation Analysis 
## 
## Nonparametric Bootstrap Confidence Intervals with the Percentile Method
## 
##                Estimate 95% CI Lower 95% CI Upper p-value    
## ACME             0.1132       0.0575         0.17  <2e-16 ***
## ADE              0.0167      -0.0904         0.12   0.752    
## Total Effect     0.1298       0.0152         0.24   0.018 *  
## Prop. Mediated   0.8716       0.3796         3.68   0.018 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sample Size Used: 150 
## 
## 
## Simulations: 1000

What we’ve just done is told R to take our mediation fit and dv fit models, use the sepal length as the original IV, identified that our mediator is in fact variable mediator, and told it to use bootstrapping to get percentile confidence intervals.

Results:

Wow, that’s a lot of lame acronyms, huh? here’s how to interpret them:

ACME: not wil-e-coyote. Stands for Average Causal Mediation Effects. This is the indirect effect of the iv on the dv that goes through the mediator. (note that .11, the estimate it gives, is roughly .30 * 37, which you should remember were the coeffs in the regressions we ran earlier.) ACME gives us both a nice CI and a p-value, which I hear journals like to see; a one-stop shop to say our mediation is significant.
ADE: this stands for Average Direct Effects. It describes the straight line effect of the iv on the dv. This is also not new, since we did this in step 3. It is more useful in Incomplete Mediation scenarios.
Total Effect: the full value of both direct and indirect effects of hte iv on the dv. Also something we already calculated, this time in step 1. You can also get it by adding ACME and ADE: .1132+.0167=0.1299
Prop Mediated: This is the proportion of the effect of the iv on the dv that goes through the mediator. It divides teh ACME by the total effect (.113/.13)=.87 It is usually filed under “nice to know”, as it may not be a main effect you are interested in.

Reporting Results:

This is an important factor that wasn’t really covered in class.

Here’s the text that the tutorial uses to describe the mediation effect, in a publication style.

The effect of sepal length on likelihood of pollination was fully mediated via the attractiveness of the bloom. As Figure 1 illustrates, the regression coefficient between sepal length and likelihood to be pollinated and the regression coefficient between attractiveness and likelihood of pollination was significant. The indirect effect was (.3)*(.37) = -.11. We tested the significance of this indirect effect using bootstrapping procedures. Unstandardized indirect effects were computed for each of 1000 bootstrapped samples, and the 95% confidence interval was computed by determining the indirect effects at the 2.5th and 97.5th percentiles. The bootstrapped unstandardized indirect effect was .11, and the 95% confidenceinterval ranged from .06 to .17. Thus, the indirect effect was statistically significant (p<.001).

(Do I fully understand all this? No. But it’s there for you to dissect and copy at your leisure.)

Now, as the text suggests, Figure 1 is an important factor here.

These analyses are best understood via path diagrams. Fortunately, there are ways of getting R to draw them for you, though as usual they can be a bit complicated.

library(ggplot2)
library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

## The following object is masked from 'package:mediation':
## 
##     mediate

library(lavaan)

## This is lavaan 0.6-7

## lavaan is BETA software! Please report any bugs.

## 
## Attaching package: 'lavaan'

## The following object is masked from 'package:psych':
## 
##     cor2cov

require(MASS)
library(diagram)

## Loading required package: shape

Copying Yrian’s code from the homework…

ModelVis <- c(0,        "'.372***'",    0,
              0,        0,           0,
              "'305***'", "'.130*\n(.017)'",   0)

This line makes a Matrix (whoa, I know kung fu) from the numbers we put in. It’s a 3x3 grid that will tell the software a series of 3 lines of 3 entries, that it can index.

The numbers are the coefficients from our analyses earlier, though the order can be tricky. Think about where the lines are in the diagram.

The first row, 0, number, 0, winds up being ‘b’ in our diagram, from the mediator to the dv. The second row is empty, as we don’t really have any variables there. The third row has the ‘a’ quantity, from IV to mediator, and the ‘c’ quantity from IV to DV. c’ (c after mediation) is shown in parentheses.

note: \n is code for “New Line Here”, which helps a lot in readability.

M2<- matrix (nrow=3, ncol=3, byrow = TRUE, data=ModelVis)

This puts the above code into a format R understands, a Matrix variable called M2.

plot<- plotmat(M2, pos=c(1,2),
               name= c( "Bee \n Attraction","Sepal \n Length", "Pollenation \n Likelihood"),
               box.type = "rect", box.size = 0.14, box.prop=0.55, curve=0, box.col = "gray"
               )

Finally, using the plotmat function, we actually draw our graph. This function is complex, with tons of optional add-ons. Check out the help function for it for a list of what they are and wha they do, and how to work with them.

#?plotmat

Hope that helps, and that at least one other person reads this. ;)