1-What is known
Have you ever wanted to run a multi-center research study retrospectively? It’s possible with a statistical method called meta-analysis. Meta-analysis is a way of combining the results of multiple studies into a single pooled result (1). By evaluating the variance within and between study results, meta-analysis determines how best to combine them. Study results with high variance are down-weighted relative to results with low variance (2). It’s also important to evaluate the heterogeneity of the study results, because it only makes sense to combine studies that are saying the same thing. If the direction and magnitude of the study findings differ, combining the results doesn’t make sense. Heterogeneity testing asks whether the variation seen in the study results is beyond the level expected by chance.
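To make the down-weighting concrete, here is a minimal sketch of inverse-variance pooling, the usual fixed-effect way of combining results. The three effect sizes and variances are invented for illustration and do not come from any real study.

```python
def pool_inverse_variance(effects, variances):
    """Fixed-effect pooled estimate using inverse-variance weights."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    pooled_variance = 1.0 / total
    shares = [w / total for w in weights]  # fraction of total weight per study
    return pooled, pooled_variance, shares

# Hypothetical data: the third study is much noisier than the first two.
effects = [0.30, 0.25, 0.90]
variances = [0.01, 0.02, 0.20]

pooled, pooled_variance, shares = pool_inverse_variance(effects, variances)
print(f"pooled effect = {pooled:.3f}, variance = {pooled_variance:.4f}")
print("weight shares:", [round(s, 3) for s in shares])
```

With these made-up numbers the noisy third study receives only about 3% of the total weight, so its large effect barely moves the pooled estimate.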
Heterogeneity tests evaluate whether the between-study variation observed in the effect sizes is larger than would be expected from sampling variability (within-study variation) alone. The most common heterogeneity test is Cochran’s Q. This test is similar to an ANOVA (Analysis of Variance) in that you are evaluating whether the variation observed between groups could be due to chance alone (3). In meta-analysis, Q is the weighted sum of squared deviations of each study’s effect from the pooled effect, and it is compared against a chi-square distribution with k - 1 degrees of freedom, where k is the number of studies. It should not be confused with the Cochran’s Q test for matched samples (3), which is a non-parametric test for binary outcomes across three or more related groups. ANOVA, by contrast, assumes that the data is normally distributed and is typically used with three or more groups (4).
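The Q statistic described above can be sketched in a few lines. This is an illustrative implementation using only the standard library, not the code used for this project; the function names `chi2_sf` and `cochran_q` and the example data are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square with integer df."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Start from the closed form for df = 2 (even) or df = 1 (odd) ...
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # ... then step up: Q(a + 1, y) = Q(a, y) + y^a * e^-y / Gamma(a + 1).
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q

def cochran_q(effects, variances):
    """Return (Q, degrees of freedom, p-value) for the heterogeneity test."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    return q, df, chi2_sf(q, df)

# Hypothetical effect sizes and within-study variances for four studies.
q, df, p = cochran_q([0.10, 0.45, 0.80, 0.30], [0.02, 0.03, 0.02, 0.05])
print(f"Q = {q:.2f} on {df} df, p = {p:.4f}")
```

A small p-value suggests more between-study variation than sampling error alone would produce. Note that nothing in the calculation requires binary outcomes or a minimum of three studies; two studies give a test on 1 degree of freedom.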
Simulations are an invaluable way to understand a statistical method (2). By simulating data for a statistical test, it’s possible to see how the test performs under a wide range of conditions. Simulations also allow you to solve statistical problems that can’t be solved analytically. The drivers of power and heterogeneity have been determined through simulation.
2-What is unknown
What is unknown is how far the heterogeneity testing used in meta-analysis can be extended to other applications. Some applications might involve ranges of values that wouldn’t be found in meta-analysis, making this approach inappropriate.
3-How and why should this gap be filled?
Historically, surgeons haven’t received high-quality feedback on their surgical outcomes. Our surgical center created an online dashboard to give providers feedback on their surgical outcomes (5, 6). The dashboard gives surgeons risk-adjusted rates for outcomes and covariates selected by the service. Surgeons can stratify their outcome rates by patient subset and time frame of interest. This interactivity helps providers understand their outcome rates better and improves learning. With all the outcomes and options available to the surgeons, there are hundreds of possible graphs to review. When people are presented with too much information and too many options, they may choose to take no action, a phenomenon known as ‘decision paralysis’ (7, 8). A visual cue might help providers narrow in on the graphs worth focusing on. Our plan is to add a flag to each graph that reports the result of a heterogeneity test. This statistical test will let providers know whether the variation they are seeing on a graph is more than chance alone would explain, and so may reflect real differences between surgeons or some other driver.
Where are heterogeneity tests appropriate for evaluating the differences between surgeon rates? Are there rates where the heterogeneity test should not be applied? Two cases of particular interest are low case counts with a large absolute difference and high case counts with a small absolute difference.
4-Methods: what was done
I’m using non-public surgical outcome datasets and creating simulated datasets based on them to evaluate my research question. My simulated data will be equivalent to random-effects studies, where we assume that the underlying relationship between the surgical outcome and the surgeon may differ because of moderators (covariates) that drive the heterogeneity we observe. Often in meta-analysis, the beta values of these moderators are unknown and a range of values may need to be employed. We can use the ‘metafor’ R package to help generate our simulated datasets.
Normally, at this point in the paper there would be descriptive statistics describing the data. I have set that aside for now because the data I’m running my tests on is simulated and I’m more interested in meaningful ranges of values. I have built simulated data but haven’t yet added the moderator, because the magnitude of the covariates is related to the underlying data. I don’t want to simulate complete data sets, just the effect size, the number of cases per surgeon, and the within-study and between-study variance. The issue that I’m still resolving is calculating the moderator. I don’t believe that I can use the covariate betas from my surgeon outcome models because they don’t correlate with my simulated data. Here is an example model:
• γ = β1 * age + β2 * diabetic + β3 * surgeon_smith + intercept
• γ = 0.1 * age + 2 * diabetic + 3 * surgeon_smith + 5
In this model, it looks like diabetic status has a larger impact on the outcome than age because the coefficient for diabetic status is 2 while the coefficient for age is 0.1. That would be true if this were a resection for pediatric patients, whose ages are small. Because the average age for this type of surgical resection is 70, ‘age’ contributes much more (70 * 0.1 = 7) to a patient’s outcome than diabetic status does (2). Because I don’t want to simulate complete data sets, I may need to take another approach to calculating meaningful moderator value ranges. There is a method that uses the model R-squared value to estimate the moderators. The R-squared value isn’t something that I currently save when I build my surgeon outcome models, but I could add it.
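The point about coefficient size versus covariate scale can be checked with a few lines of arithmetic. The coefficients come from the example model; the two patients are hypothetical.

```python
# Coefficients taken from the example model above.
coefficients = {"age": 0.1, "diabetic": 2.0}

def contributions(coefs, patient):
    """Each covariate's contribution to the linear predictor: beta * value."""
    return {name: coefs[name] * patient[name] for name in coefs}

# Hypothetical patients: a typical 70-year-old and a 5-year-old, both diabetic.
adult = {"age": 70, "diabetic": 1}
child = {"age": 5, "diabetic": 1}

for label, patient in (("adult", adult), ("child", child)):
    parts = contributions(coefficients, patient)
    print(label, {name: round(value, 2) for name, value in parts.items()})
```

For the adult, age contributes 7 against 2 for diabetic status; for the child, the order reverses. This is why the raw betas alone do not give meaningful moderator ranges.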
I’m using a general Monte Carlo simulation for the power analysis, following the method suggested in the literature (4).
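As a rough illustration of what a Monte Carlo power analysis for the heterogeneity test involves, here is a minimal sketch. It is not the exact procedure from the cited source or from this project. The data-generating choices are assumptions made for the sketch: case counts drawn as `nmin` plus a Poisson count, a within-surgeon variance of 2/n, and normally distributed effects. The parameter names mirror the columns of the results tables.

```python
import math
import random

# Chi-square 0.95 critical values for df = 1..14 (k = 2..15 surgeons), alpha = 0.05.
CHI2_CRIT_05 = [3.841, 5.991, 7.815, 9.488, 11.070, 12.592, 14.067,
                15.507, 16.919, 18.307, 19.675, 21.026, 22.362, 23.685]

def poisson(lam, rng):
    """Knuth's Poisson sampler; adequate for the moderate means used here."""
    limit, count, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return count
        count += 1

def simulate_power(k, mu, tau2, navg, nmin, nsim, seed=None):
    """Share of simulated datasets in which Q exceeds the alpha = 0.05 cutoff."""
    rng = random.Random(seed)
    crit = CHI2_CRIT_05[k - 2]
    rejections = 0
    for _ in range(nsim):
        effects, variances = [], []
        for _ in range(k):
            n = nmin + poisson(navg - nmin, rng)    # assumed case-count model
            v = 2.0 / n                             # assumed within-surgeon variance
            theta = rng.gauss(mu, math.sqrt(tau2))  # surgeon's true effect
            effects.append(rng.gauss(theta, math.sqrt(v)))
            variances.append(v)
        weights = [1.0 / v for v in variances]
        pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
        q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
        if q > crit:
            rejections += 1
    return rejections / nsim

print(simulate_power(k=5, mu=0.3, tau2=0.3, navg=20, nmin=20, nsim=5000, seed=1))
```

Setting `tau2=0` turns the same function into a check of the type I error rate, which should sit near 0.05 under these assumptions. Results will differ from the project's own simulations to the extent that the assumptions differ.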
5-Results
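First, we look at whether the number of simulations matters. The literature recommends 10,000 simulations to obtain stable estimates of power. We varied the number of simulations from 10 to 10,000 (10, 50, 100, 250, 500, 1000, 2500, 5000, 7500, 10000). The estimate appears to stabilize by 5000 simulations: from that point on it stays between 0.676 and 0.687.
Here are the results, which help in understanding both the simulation and the heterogeneity test. In the tables, k is the number of surgeons, mu is the mean effect size, tau2 is the between-surgeon variance (the heterogeneity level), navg and nmin are the average and minimum number of cases per surgeon, and nsim is the number of simulations.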
k | mu | tau2 | navg | nmin | nsim | power |
---|---|---|---|---|---|---|
5 | 0.3 | 0.3 | 20 | 20 | 10 | 0.9000 |
5 | 0.3 | 0.3 | 20 | 20 | 50 | 0.6800 |
5 | 0.3 | 0.3 | 20 | 20 | 100 | 0.7200 |
5 | 0.3 | 0.3 | 20 | 20 | 250 | 0.6680 |
5 | 0.3 | 0.3 | 20 | 20 | 500 | 0.6640 |
5 | 0.3 | 0.3 | 20 | 20 | 1000 | 0.6790 |
5 | 0.3 | 0.3 | 20 | 20 | 2500 | 0.6800 |
5 | 0.3 | 0.3 | 20 | 20 | 5000 | 0.6874 |
5 | 0.3 | 0.3 | 20 | 20 | 7500 | 0.6845 |
5 | 0.3 | 0.3 | 20 | 20 | 10000 | 0.6758 |
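How much a power estimate is expected to wobble at a given number of simulations can be approximated with the binomial standard error, sqrt(p(1 - p) / nsim). The sketch below assumes a true power of about 0.68, a value read off the table for illustration rather than a known quantity.

```python
import math

def mc_standard_error(power, nsim):
    """Binomial standard error of a simulated power estimate."""
    return math.sqrt(power * (1.0 - power) / nsim)

# Assumed true power of 0.68, taken from the table for illustration only.
for nsim in (10, 100, 1000, 5000, 10000):
    se = mc_standard_error(0.68, nsim)
    print(f"nsim={nsim:>6}: SE ~ {se:.4f}, 95% half-width ~ {1.96 * se:.4f}")
```

At 5000 simulations the standard error is about 0.007, so run-to-run differences of around one percentage point are expected. Doubling to 10,000 simulations only shrinks it to about 0.005.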
We ran 5000 simulations per case count level (20, 50, 100, 200, 300, 400, 500); below is the proportion of p-values below our alpha of 0.05 for the heterogeneity test. The power of the heterogeneity test plateaus once the average exceeds 100 cases per surgeon, rising only from 0.97 to just under 1.00 between 100 and 500 cases.
k | mu | tau2 | navg | nmin | nsim | power |
---|---|---|---|---|---|---|
5 | 0.3 | 0.3 | 20 | 20 | 5000 | 0.6812 |
5 | 0.3 | 0.3 | 50 | 20 | 5000 | 0.8956 |
5 | 0.3 | 0.3 | 100 | 20 | 5000 | 0.9692 |
5 | 0.3 | 0.3 | 200 | 20 | 5000 | 0.9894 |
5 | 0.3 | 0.3 | 300 | 20 | 5000 | 0.9936 |
5 | 0.3 | 0.3 | 400 | 20 | 5000 | 0.9970 |
5 | 0.3 | 0.3 | 500 | 20 | 5000 | 0.9984 |
Our surgical QA application allows a graph to be generated with a minimum of 2 surgeons. A graph may contain as many as 15 surgeons, so we simulated data sets across that range (2, 3, 4, 5, 10, 15). We ran 5000 simulations for each number of surgeons, holding the other parameters constant (mu: 0.3, tau2: 0.3, cases: 100, alpha: 0.05). The power of the heterogeneity test is very sensitive to the number of surgeons across this real-world range, rising from 0.62 with 2 surgeons to 1.00 with 15.
k | mu | tau2 | navg | nmin | nsim | power |
---|---|---|---|---|---|---|
2 | 0.3 | 0.3 | 100 | 20 | 5000 | 0.6218 |
3 | 0.3 | 0.3 | 100 | 20 | 5000 | 0.8344 |
4 | 0.3 | 0.3 | 100 | 20 | 5000 | 0.9230 |
5 | 0.3 | 0.3 | 100 | 20 | 5000 | 0.9616 |
10 | 0.3 | 0.3 | 100 | 20 | 5000 | 0.9994 |
15 | 0.3 | 0.3 | 100 | 20 | 5000 | 1.0000 |
The power of the heterogeneity test is largely insensitive to the effect size over the range 0.01 to 0.7, holding all the other parameters constant (surgeons: 5, tau2: 0.3, cases: 100, alpha: 0.05). Power stays between 0.96 and 0.97 throughout.
k | mu | tau2 | navg | nmin | nsim | power |
---|---|---|---|---|---|---|
5 | 0.01 | 0.3 | 100 | 20 | 5000 | 0.9616 |
5 | 0.05 | 0.3 | 100 | 20 | 5000 | 0.9638 |
5 | 0.10 | 0.3 | 100 | 20 | 5000 | 0.9622 |
5 | 0.20 | 0.3 | 100 | 20 | 5000 | 0.9612 |
5 | 0.30 | 0.3 | 100 | 20 | 5000 | 0.9628 |
5 | 0.40 | 0.3 | 100 | 20 | 5000 | 0.9676 |
5 | 0.50 | 0.3 | 100 | 20 | 5000 | 0.9634 |
5 | 0.60 | 0.3 | 100 | 20 | 5000 | 0.9634 |
5 | 0.70 | 0.3 | 100 | 20 | 5000 | 0.9622 |
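One way to see why the mean effect drops out is to work through an idealized case in which every surgeon has the same known within-surgeon variance v and the effects are normally distributed. The derivation below is a sketch under those assumptions, not a description of the simulation code. The value v = 0.02 is assumed purely to produce a worked number, and c denotes the upper-alpha chi-square critical value.

```latex
\begin{align}
y_i &\sim \mathcal{N}\left(\mu,\; v + \tau^2\right), \qquad i = 1, \dots, k \\
Q &= \sum_{i=1}^{k} \frac{(y_i - \bar{y})^2}{v}
   = \left(1 + \frac{\tau^2}{v}\right)
     \underbrace{\sum_{i=1}^{k} \frac{(y_i - \bar{y})^2}{v + \tau^2}}_{\sim\, \chi^2_{k-1}} \\
\text{power} &= P\left(\chi^2_{k-1} > \frac{c_{k-1,\,\alpha}}{1 + \tau^2 / v}\right) \\
k = 5,\ \tau^2 = 0.3,\ v = 0.02:\quad
\text{power} &= P\left(\chi^2_4 > \frac{9.488}{16}\right)
   = e^{-0.2965}\,(1 + 0.2965) \approx 0.964
\end{align}
```

Because mu cancels in each deviation from the mean, power in this idealized case depends only on k, tau2, and v. The worked value of about 0.964 is in the same range as the power values in the table above.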
We reran the above simulations while also varying the heterogeneity level (0.1, 0.15, 0.2, 0.25, 0.3). The power values stratify by heterogeneity level, from about 0.80 to 0.82 at tau2 = 0.10 up to about 0.96 to 0.97 at tau2 = 0.30, with little dependence on the effect size.
k | mu | tau2 | navg | nmin | nsim | power |
---|---|---|---|---|---|---|
5 | 0.01 | 0.10 | 100 | 20 | 5000 | 0.8060 |
5 | 0.01 | 0.15 | 100 | 20 | 5000 | 0.8910 |
5 | 0.01 | 0.20 | 100 | 20 | 5000 | 0.9266 |
5 | 0.01 | 0.25 | 100 | 20 | 5000 | 0.9470 |
5 | 0.01 | 0.30 | 100 | 20 | 5000 | 0.9628 |
5 | 0.05 | 0.10 | 100 | 20 | 5000 | 0.8222 |
5 | 0.05 | 0.15 | 100 | 20 | 5000 | 0.8858 |
5 | 0.05 | 0.20 | 100 | 20 | 5000 | 0.9294 |
5 | 0.05 | 0.25 | 100 | 20 | 5000 | 0.9492 |
5 | 0.05 | 0.30 | 100 | 20 | 5000 | 0.9604 |
5 | 0.10 | 0.10 | 100 | 20 | 5000 | 0.8048 |
5 | 0.10 | 0.15 | 100 | 20 | 5000 | 0.8876 |
5 | 0.10 | 0.20 | 100 | 20 | 5000 | 0.9346 |
5 | 0.10 | 0.25 | 100 | 20 | 5000 | 0.9490 |
5 | 0.10 | 0.30 | 100 | 20 | 5000 | 0.9664 |
5 | 0.20 | 0.10 | 100 | 20 | 5000 | 0.8052 |
5 | 0.20 | 0.15 | 100 | 20 | 5000 | 0.8894 |
5 | 0.20 | 0.20 | 100 | 20 | 5000 | 0.9338 |
5 | 0.20 | 0.25 | 100 | 20 | 5000 | 0.9474 |
5 | 0.20 | 0.30 | 100 | 20 | 5000 | 0.9670 |
5 | 0.30 | 0.10 | 100 | 20 | 5000 | 0.8086 |
5 | 0.30 | 0.15 | 100 | 20 | 5000 | 0.8914 |
5 | 0.30 | 0.20 | 100 | 20 | 5000 | 0.9342 |
5 | 0.30 | 0.25 | 100 | 20 | 5000 | 0.9470 |
5 | 0.30 | 0.30 | 100 | 20 | 5000 | 0.9636 |
5 | 0.40 | 0.10 | 100 | 20 | 5000 | 0.8118 |
5 | 0.40 | 0.15 | 100 | 20 | 5000 | 0.8944 |
5 | 0.40 | 0.20 | 100 | 20 | 5000 | 0.9334 |
5 | 0.40 | 0.25 | 100 | 20 | 5000 | 0.9490 |
5 | 0.40 | 0.30 | 100 | 20 | 5000 | 0.9658 |
5 | 0.50 | 0.10 | 100 | 20 | 5000 | 0.8050 |
5 | 0.50 | 0.15 | 100 | 20 | 5000 | 0.8940 |
5 | 0.50 | 0.20 | 100 | 20 | 5000 | 0.9294 |
5 | 0.50 | 0.25 | 100 | 20 | 5000 | 0.9482 |
5 | 0.50 | 0.30 | 100 | 20 | 5000 | 0.9650 |
5 | 0.60 | 0.10 | 100 | 20 | 5000 | 0.8118 |
5 | 0.60 | 0.15 | 100 | 20 | 5000 | 0.8918 |
5 | 0.60 | 0.20 | 100 | 20 | 5000 | 0.9340 |
5 | 0.60 | 0.25 | 100 | 20 | 5000 | 0.9496 |
5 | 0.60 | 0.30 | 100 | 20 | 5000 | 0.9640 |
5 | 0.70 | 0.10 | 100 | 20 | 5000 | 0.8080 |
5 | 0.70 | 0.15 | 100 | 20 | 5000 | 0.8968 |
5 | 0.70 | 0.20 | 100 | 20 | 5000 | 0.9224 |
5 | 0.70 | 0.25 | 100 | 20 | 5000 | 0.9530 |
5 | 0.70 | 0.30 | 100 | 20 | 5000 | 0.9660 |
1 Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-analysis. John Wiley & Sons. https://doi.org/10.1002/9780470743386
2 Gambarota, F., & Altoè, G. (2024). Understanding Meta-Analysis through data simulation with applications to power analysis. Advances in Methods and Practices in Psychological Science, 7(1). https://doi.org/10.1177/25152459231209330
3 Cochran, W. G. (1950). The comparison of percentages in matched samples. Biometrika, 37(3/4), 256–266. https://doi.org/10.1093/biomet/37.3-4.256
4 Qualtrics. (2024, March 7). What is ANOVA and what can I use it for? https://www.qualtrics.com/en-au/experience-management/research/anova/
5 Somers, J. (2018, June 13). Making the cut - backchannel - medium. Medium. https://medium.com/backchannel/should-surgeons-keep-score-8b3f890a7d4c
6 Vickers, A. J., Sjoberg, D., Basch, E., Sculli, F., Shouery, M., Laudone, V., Touijer, K., Eastham, J., & Scardino, P. T. (2012). How Do You Know If You Are Any Good? A Surgeon Performance Feedback System for the Outcomes of Radical Prostatectomy. European Urology, 61(2), 284–289. https://doi.org/10.1016/j.eururo.2011.10.039
7 Arnold, M., Goldschmitt, M., & Rigotti, T. (2023). Dealing with information overload: A comprehensive review. Frontiers in Psychology, 14, 1122200. https://doi.org/10.3389/fpsyg.2023.1122200
8 Patel, A., & Stern, L. (2014). The Indecisive Shopper: Incorporating Choice Paralysis into the Multinomial Logit Model. https://www.stern.nyu.edu/sites/default/files/assets/documents/Anisha%20Patel_Thesis_Honors%202014.pdf