How do meta-analyses work?

To perform a meta-analysis we compute an effect size and variance for each study, and then compute a weighted mean of these effect sizes.

To compute the weighted mean we generally assign more weight to the more precise studies (typically: larger), but the rules for assigning weights depend on our assumptions about the distribution of true effects.

The effect size could represent the impact of an intervention, such as the impact of medical treatment on risk of infection, the impact of a teaching method on test scores, or the impact of a new protocol on the number of salmon successfully returning upstream.

The effect size is not limited to the impact of interventions, but could represent any relationship between two variables, such as the difference in test scores for males versus females, the difference in cancer rates for persons exposed or not exposed to second-hand smoke, or the difference in cardiac events for persons with two distinct personality types. In fact, what we generally call an effect size could refer simply to the estimate of a single value.

Meta-analysis vs systematic (“narrative”) review

Since the narrative review is based on discrete reports from a series of studies, it provides no real mechanism for synthesizing the data. To borrow a phrase from Abelson, it involves doing arithmetic with words. And, when the words are based on p-values the words are the wrong words.

systematic reviews often rely on “vote counting” (counting how many studies reached statistical significance)
- vote counting is the process of counting the number of studies that are statistically significant and comparing this with the number that are not statistically significant
vote counting treats a nonsignificant p-value as evidence that an effect is absent. In fact, though, small, moderate, and even large effect sizes may yield a nonsignificant p-value due to inadequate statistical power. Therefore, vote counting is never a valid approach.
- example: if in 10 studies with small sample size (low power) the expected effect was present but did not reach statistical significance, then vote counting would tell us that there is no effect at all. If we had a method to analyse these results “together”, the conclusion would different.

By contrast, in a meta-analysis we introduce two fundamental changes. First, we work directly with the effect size from each study rather than the p-value. Second, we include all of the effects in a single statistical synthesis. This is critically important for the goal of computing (and testing) a summary effect. Meta-analysis also allows us to assess the dispersion of effects, and distinguish between real dispersion and spurious dispersion.

Choosing the right measure of effect size

Effect sizes based on means
- Raw (unstandardized) mean difference (\(D\))
  - Based on studies with independent groups
  - Based on studies with matched groups or pre-post designs
- Standardized mean difference (\(d\) or \(g\))
  - Based on studies with independent groups
  - Based on studies with matched groups or pre-post designs
- Response ratios (\(R\))
  - Based on studies with independent groups
Effect sizes based on binary data
- Risk ratio (\(RR\))
  - Based on studies with independent groups
- Odds ratio (\(OR\))
  - Based on studies with independent groups
- Risk difference (\(RD\))
  - Based on studies with independent groups
Effect sizes based on correlational data
- Correlation (\(r\))
  - Based on studies with one group

Note: there are formulas to convert from one effect size to another!

Effect size: Cohen’s \(d\)

Cohen proposed a measure of effect size which is independent of scale and units used to measure the phenomenon:

\[ d = \frac{\bar{X_1} - \bar{X_2}}{S_{within}} \]

\(\bar{X_1} - \bar{X_2}\) - the difference between sample means

In this equation \(S_{within}\) is a pooled standard deviation computed as weighted average from two samples:

\[ S_{within} = \sqrt{\frac{(n_1 -1)S^2_1 + (n_2 - 1)S^2_2}{n_1 + n_2 - 2}} \]

Cohen’s \(d\) is can be understood as a difference between sample means expressed in standard deviations. That’s why we can compare it between different studies!

Cohen’s \(d\) sampling variance

Cohen’s \(d\) sampling variance is given by the following equation:

\[ V_d = \frac{n_1 + n_2}{n_1 n_2} + \frac{d^2}{2(n_1 +n_2)} \]

How to interpret it? We can think of this variance in different manners. We can see that the variance is directly related to the sample size. The more observations we have, the smaller is the sampling variance. For the purposes of meta-analysis this is useful, because we can interpet the sampling variance as a measure of how precise was the study.

Using Cohen’s \(d\) sampling variance we can compute its standard error (\(SE_d = \sqrt{V_d}\)) and confidence intervals (\(d \pm 1.96 \times SE_d\))

Cohen’s \(d\) and its sampling variance - example

dat.besson2016 contains results from 46 studies synthesising maternal nutritional effects on coping styles in rodents. Let’s compute Cohen’s \(d\) and its sampling variance by hand for the first study.

library(metafor)
exp_mean <- dat.besson2016[1, "exp_mean"] # mean in the experimental condition
con_mean <- dat.besson2016[1, "con_mean"] # mean in the control condition

exp_n <- dat.besson2016[1, "exp_n"] # sample size in the experimental condition 
con_n <- dat.besson2016[1, "con_n"] # sample size in the control condition

exp_sd <- dat.besson2016[1, "exp_se"] * sqrt(exp_n) # std. deviation in the experimental condition
con_sd <- dat.besson2016[1, "con_se"] * sqrt(con_n) # std. deviation in the control condition

s_within = sqrt(((exp_n - 1) * exp_sd**2 + (con_n - 1) * con_sd**2) /
                  (exp_n + con_n - 2)) # pooled SD

d <- (exp_mean - con_mean) / s_within # Cohen's d
v_d <- ((exp_n + con_n) / (exp_n * con_n)) + ((d**2)/ (2*(exp_n + con_n))) # Cohen's d sampling variance


# Checking if we did that correctly...
dat <- dat.besson2016
dat$sd_c <- with(dat, con_se * sqrt(con_n))
dat$sd_e <- with(dat, exp_se * sqrt(exp_n))

### compute standardized mean differences and corresponding sampling variances
dat <- escalc(measure="SMD", m1i=exp_mean, m2i=con_mean, sd1i=sd_e, sd2i=sd_c,
              n1i=exp_n, n2i=con_n, data=dat, add.measure=TRUE)

Meta-analysis: fixed-effects model

In this simplest model we assume, that there is a specific value corresponding to the “true” effect size in the population and that is it estimated by effect sizes from analyzed studies. We can thus ascribe a weight to the effect size from each study:

\[ W_i = \frac{1}{V_d} \]

and compute weighted average from all studies: \[ M = \frac{\sum_{i=1}^{k} W_i Y_i}{\sum_{i=1}^{k} W_i} \]

The variance of overall effect size is given by the formula:

\[ V_M = \frac{1}{\sum_{i=1}^{k} W_i} \]

Fixed-effect model: example

The dat.konstantopoulos2011 dataset (taken from Konstantopoulos, 2011 but related to the research of Cooper et al., 2003) contains the results from 56 studies, each comparing the level of academic achievement in a group of students following a modified school calendar (that do away with the long summer break) with that of a group of students following a more traditional school calendar.

The difference between the two groups was quantified in terms of a standardized mean difference (with positive values indicating a higher mean level of achievement in the group following the modified school calendar).

Fixed-effect model: example

Forsts plot is a most popular way to represent the effect sizes of many studies.

The effect size for each study is represented by a square, with the location of the square representing both the direction and magnitude of the effect.

In the plot, the effect size for each study is bounded by a confidence interval, reflecting the precision with which the effect size has been estimated in that study

Sometimes in the forests plots for each study the p-value for a test of the null is shown. metafor library does not allow for that, but there is a necessary correspondence between the p-value and the confidence interval, such that the p-value will fall under 0.05 if and only if the 95% confidence interval does not include the null value. Therefore, by scanning the confidence intervals we can easily identify the statistically significant studies.

On the plot the summary effect is shown on the bottom line. The summary effect is nothing more than the weighted mean of the individual effects. However, the mechanism used to assign the weights (and therefore the meaning of the summary effect) depends on our assumptions about the distribution of effect sizes from which the studies were sampled.

The summary effect is represented by a diamond. The location of the diamond represents the effect size while its width reflects the precision of the estimate (95% confidence interval).

library(metafor)
fit <- rma(yi, vi, data = dat.konstantopoulos2011, method = "FE")
forest(fit)

Fixed-effect model: example

The p-value for the summary effect is < 0.0001. This p-value reflects both the magnitude of the summary effect size and also the volume of information on which the estimate is based. Note that the p-value for the summary effect is substantially more compelling than that of any single study.

summary(fit)

## 
## Fixed-Effects Model (k = 56)
## 
##    logLik   deviance        AIC        BIC       AICc   
## -217.1038   578.8640   436.2075   438.2329   436.2816   
## 
## I^2 (total heterogeneity / total variability):   90.50%
## H^2 (total variability / sampling variability):  10.52
## 
## Test for Heterogeneity:
## Q(df = 55) = 578.8640, p-val < .0001
## 
## Model Results:
## 
## estimate      se    zval    pval   ci.lb   ci.ub      
##   0.0464  0.0092  5.0499  <.0001  0.0284  0.0644  *** 
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Meta-analysis: random-effects model vs fixed-effects model

Under the fixed-effect model we assume that there is one true effect size (hence the term fixed effect) which underlies all the studies in the analysis, and that all differences in observed effects are due to sampling error.

By contrast, under the random-effects model we allow that the true effect could vary from study to study. For example, the effect size might be higher (or lower) in studies where the participants are older, or more educated, or healthier than in others, or when a more intensive variant of an intervention is used, and so on. Because studies will differ in the mixes of participants and in the implementations of interventions, among other reasons, there may be different effect sizes underlying different studies.

If it were possible to perform an infinite number of studies (based on the inclusion criteria for our analysis), the true effect sizes for these studies would be distributed about some mean.

The effect sizes in the studies that actually were performed are assumed to represent a random sample of these effect sizes (hence the term random effects). Here, we use the plural (effects) since there is an array of true effects. We thus assume that the effect size is a random variable and effect sizes ale following some distribution that has some parameters.

Meta-analysis: random-effects model

fit <- rma(yi, vi, data = dat.konstantopoulos2011, method = "REML")
forest(fit)

Meta-analysis: random-effects model

Compared to the fixed-effect model we obtained estimated amount of total heterogeneity (\(\tau^2\)) and related statistics. Note that compared to fixed-effects model the p-value is larger and the confidence interval is slightly wider.

summary(fit)

## 
## Random-Effects Model (k = 56; tau^2 estimator: REML)
## 
##   logLik  deviance       AIC       BIC      AICc   
## -16.8455   33.6910   37.6910   41.7057   37.9218   
## 
## tau^2 (estimated amount of total heterogeneity): 0.0884 (SE = 0.0202)
## tau (square root of estimated tau^2 value):      0.2974
## I^2 (total heterogeneity / total variability):   94.70%
## H^2 (total variability / sampling variability):  18.89
## 
## Test for Heterogeneity:
## Q(df = 55) = 578.8640, p-val < .0001
## 
## Model Results:
## 
## estimate      se    zval    pval   ci.lb   ci.ub     
##   0.1279  0.0439  2.9161  0.0035  0.0419  0.2139  ** 
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The sources of variability

When we speak about dispersion in effect sizes from study to study we are usually concerned with the dispersion in true effect sizes, but the observed dispersion includes both true variance and random error.

The mechanism used to isolate the true variance is to compare the observed dispersion with the amount we would expect to see if all studies shared a common effect size. The excess portion is assumed to reflect real differences among studies. This portion of the variance is then used to create several measures of heterogeneity

The sources of variability

We want to answer the following questions:

Is there evidence of heterogeneity in true effect sizes?
What is the variance of the true effects?
What are the substantive implications of this heterogeneity?
What proportion of the observed dispersion is real?

Tools to answer these question:

\(Q\) is the weighted sum of squares (\(WSS\)) on a standardized scale. As a standard score it can be compared with the expected \(WSS\) (on the assumption that all studies share a common effect) to yield a test of the null (= all variability is due to chance) and also an estimate of the excess variance.
\(I^2\) is the proportion of observed dispersion that is real, rather than spurious. It is not dependent on the scale, and is expressed as a ratio with a range of \(0\%\) to \(100\%\).

Publication bias

What is publication bias?
- Several lines of research (reviewed by Dickersin, 2005) have established that studies with statistically significant results are more likely to find their way into the published literature than studies that report results that are not statistically significant.
Why is publication bias present?
What are the consequences of publication bias?
- It is possible that the studies in a meta-analysis may overestimate the true effect size because they are based on a biased sample of the target population of studies.
How can we detect the presence of publication bias?
- funnel plot and statistical tests for asymmetry
- p-curve analysis

Publication bias: funnel plots

Traditionally, the funnel plot was plotted with effect size on the X axis and the sample size or variance on the Y axis. Large studies appear toward the top of the graph and generally cluster around the mean effect size. Smaller studies appear toward the bottom of the graph, and (since smaller studies have more sampling error variation in effect sizes) tend to be spread across a broad range of values.

The use of the standard error (rather than sample size or variance) on the Y axis has the advantage of spreading out the points on the bottom half of the scale, where the smaller studies are plotted. This could make it easier to identify asymmetry.

In the absence of publication bias, the studies will be distributed symmetrically about the mean effect size, since the sampling error is random. In the presence of publication bias the studies are expected to follow the model, with symmetry at the top, a few studies missing in the middle, and more studies missing near the bottom. If the direction of the effect is toward the right, then near the bottom of the plot we expect a gap on the left, where the nonsignificant studies would have been if we had been able to locate them.

funnel(fit)

Estimating the impact of publication bias: Trim and Fill

The trim and fill method is a nonparametric (rank-based) data augmentation technique proposed by Duval and Tweedie (2000a, 2000b; see also Duval, 2005).

The method can be used to estimate the number of studies missing from a meta-analysis due to suppression of the most extreme results on one side of the funnel plot.

Trim and Fill uses an iterative procedure to remove the most extreme small studies from the positive side of the funnel plot, re-computing the effect size at each iteration until the funnel plot is symmetric about the (new) effect size. In theory, this will yield an unbiased estimate of the effect size.

While this trimming yields the adjusted effect size, it also reduces the variance of the effects, yielding a too narrow confidence interval. Therefore, the algorithm then adds the original studies back into the analysis, and imputes a mirror image for each. This fill has no impact on the point estimate but serves to correct the variance (Duval and Tweedie, 2000a, 2000b).

Estimating the impact of publication bias: Trim and Fill

fit_tm <- trimfill(fit)
summary(fit_tm)

## 
## Estimated number of missing studies on the right side: 11 (SE = 4.9414)
## 
## Random-Effects Model (k = 67; tau^2 estimator: REML)
## 
##   logLik  deviance       AIC       BIC      AICc   
## -26.5436   53.0871   57.0871   61.4664   57.2776   
## 
## tau^2 (estimated amount of total heterogeneity): 0.1066 (SE = 0.0220)
## tau (square root of estimated tau^2 value):      0.3265
## I^2 (total heterogeneity / total variability):   95.31%
## H^2 (total variability / sampling variability):  21.30
## 
## Test for Heterogeneity:
## Q(df = 66) = 916.0735, p-val < .0001
## 
## Model Results:
## 
## estimate      se    zval    pval   ci.lb   ci.ub      
##   0.2061  0.0439  4.6993  <.0001  0.1201  0.2921  *** 
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

funnel(fit_tm)

Estimating the impact of publication bias: Trim and Fill

Another example:

dat <- escalc(measure="RR", ai=tpos, bi=tneg, ci=cpos, di=cneg, data=dat.bcg)
res <- rma(yi, vi, data=dat, method="FE")
res.tf <- trimfill(res)
res.tf

## 
## Estimated number of missing studies on the right side: 4 (SE = 2.3853)
## 
## Fixed-Effects Model (k = 17)
## 
## I^2 (total heterogeneity / total variability):   93.91%
## H^2 (total variability / sampling variability):  16.42
## 
## Test for Heterogeneity:
## Q(df = 16) = 262.7316, p-val < .0001
## 
## Model Results:
## 
## estimate      se     zval    pval    ci.lb    ci.ub      
##  -0.2910  0.0383  -7.6057  <.0001  -0.3660  -0.2160  *** 
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

funnel(res.tf)

Publication bias and p-hacking: p-curve analysis

P-curve analysis (Simonsohn, Simmons & Nelson, 2014, 2015) has been proposed as a method to detect p-hacking and publication bias in meta-analyses.

P-Curve assumes that publication bias is not only generated because researchers do not publish non-significant results, but also because analysts “play” around with their data (“p-hacking”; e.g., selectively removing outliers, choosing different outcomes, controlling for different variables) until a non-significant finding becomes significant (i.e., \(p<0.05\)).

The method assumes that for a specific research question, p-values smaller \(0.05\) of included studies should follow a right-skewed distribution if a true effect exists, even when the power in single studies was (relatively) low.

Conversely, a left-skewed p-value distribution indicates the presence of p-hacking and absence of a true underlying effect.

Publication bias and p-hacking: p-curve analysis

The distribution of p-values if true effect > 0:

set.seed(42)
pvals <- replicate(1000, { # 1000 studies
  t.test(rnorm(35, 20, 15), # random sample (n=35) from the population with mean = 20
         rnorm(35, 15, 15) # random sample (n=35) from the population with mean = 15 
         )$p.value # p-val from the t-test
})
hist(pvals[pvals < 0.05]) # histogram of significant (p < 0.05) p-values when true effect exists

Publication bias and p-hacking: p-curve analysis

The distribution of p-values if true effect = 0:

set.seed(42)
pvals <- replicate(1000, { # 1000 studies
  t.test(rnorm(35, 15, 15), # random sample (n=35) from the population with mean = 20
         rnorm(35, 15, 15) # random sample (n=35) from the population with mean = 15 
         )$p.value # p-val from the t-test
})
hist(pvals[pvals < 0.05]) # histogram of significant (p < 0.05) p-values when true effect does not exist

Publication bias and p-hacking: p-curve analysis

The distribution of p-values if true effect = 0 and p-hacking was present:

set.seed(42)
pvals <- replicate(1000, { # 1000 studies
  pval <- t.test(rnorm(35, 15, 15), # random sample (n=35) from the population with mean = 20
  rnorm(35, 15, 15) # random sample (n=35) from the population with mean = 15 
  )$p.value # p-val from the t-test
  if (pval > 0.05 & pval < 0.08){
    pval <- runif(1, 0.04, 0.05) # p-hacking for "marginally significant" p-values
  }
  pval
})
hist(pvals[pvals < 0.05]) # histogram of significant (p < 0.05) p-values when true effect does not exist

Publication bias and p-hacking: p-curve analysis

library(dmetar)
dat.konstantopoulos2011$TE <- dat.konstantopoulos2011$yi
dat.konstantopoulos2011$seTE <- sqrt(dat.konstantopoulos2011$vi)
dat.konstantopoulos2011$studlab <- dat.konstantopoulos2011$study
pcurve(dat.konstantopoulos2011)

## P-curve analysis 
##  ----------------------- 
## - Total number of provided studies: k = 56 
## - Total number of p<0.05 studies included into the analysis: k = 21 (37.5%) 
## - Total number of studies with p<0.025: k = 14 (25%) 
##    
## Results 
##  ----------------------- 
##                     pBinomial   zFull pFull   zHalf pHalf
## Right-skewness test     0.095 -10.792     0 -13.775     0
## Flatness test           0.393   8.061     1  13.514     1
## Note: p-values of 0 or 1 correspond to p<0.001 and p>0.999, respectively.   
## Power Estimate: 98% (95.6%-99%)
##    
## Evidential value 
##  ----------------------- 
## - Evidential value present: yes 
## - Evidential value absent/inadequate: no

Publication bias and p-hacking: p-curve analysis

Another example

dat$TE <- dat$yi
dat$seTE <- sqrt(dat$vi)
dat$studlab <- dat$author
pcurve(dat)

## P-curve analysis 
##  ----------------------- 
## - Total number of provided studies: k = 13 
## - Total number of p<0.05 studies included into the analysis: k = 8 (61.54%) 
## - Total number of studies with p<0.025: k = 6 (46.15%) 
##    
## Results 
##  ----------------------- 
##                     pBinomial  zFull pFull   zHalf pHalf
## Right-skewness test     0.145 -8.434     0 -10.218     0
## Flatness test           0.716  6.993     1   9.995     1
## Note: p-values of 0 or 1 correspond to p<0.001 and p>0.999, respectively.   
## Power Estimate: 99% (98.2%-99%)
##    
## Evidential value 
##  ----------------------- 
## - Evidential value present: yes 
## - Evidential value absent/inadequate: no

Metaanalysis in R

Advanced statistical methods and models in experimental design

What can metaanalysis tell us?

How do meta-analyses work?

Meta-analysis vs systematic (“narrative”) review

Choosing the right measure of effect size

Effect size: Cohen’s \(d\)

Cohen’s \(d\) sampling variance

Cohen’s \(d\) and its sampling variance - example

Meta-analysis: fixed-effects model

Fixed-effect model: example

Fixed-effect model: example

Fixed-effect model: example

Meta-analysis: random-effects model vs fixed-effects model

Meta-analysis: random-effects model

Meta-analysis: random-effects model

The sources of variability

The sources of variability

Publication bias

Publication bias: funnel plots

Estimating the impact of publication bias: Trim and Fill

Estimating the impact of publication bias: Trim and Fill

Estimating the impact of publication bias: Trim and Fill

Publication bias and p-hacking: p-curve analysis

Publication bias and p-hacking: p-curve analysis

Publication bias and p-hacking: p-curve analysis

Publication bias and p-hacking: p-curve analysis

Publication bias and p-hacking: p-curve analysis

Publication bias and p-hacking: p-curve analysis