Recent Advances in AI for Meta-Analyses

Damien Beillouin

Current Bottlenecks in Evidence Synthesis

#1. Limited Scalability

  • Even with massive human efforts:

    • IPCC (2021): ~14,000 articles
    • IPBES (2019): ~15,000 articles
  • Reviews become obsolete quickly due to exponential publication growth.

  • → Undermines the quality of evidence for decision-making

Adoption of Automation Tools: Where Do We Stand?

  • Surveys (pre-LLM era): low adoption of automation tools
  • Main barriers:
    • Lack of trust
    • Complexity of setup
    • Required statistical expertise
    • Substantial time needed to fine-tune models
    • Poor integration into workflows
  • However, recent LLM advances are changing the landscape:
    • More intuitive interfaces (e.g., chat-based assistants)
    • Higher performance on zero-shot tasks
    • Opportunities for fully automated pipelines

Al-Zubidy et al. (2017). Vision for SLR tooling infrastructure: prioritizing value-added requirements. Information and Software Technology, 91, pp. 72–81.

Degrees of Automation in Evidence Synthesis

| Level | Description | Examples |
|---|---|---|
| 4 | Tools fully perform tasks, eliminating the need for human participation. | Fully automated relevance screening |
| 3 | Tools perform tasks automatically but unreliably, requiring human supervision or override. | Duplicate detection, plagiarism checking |
| 2 | Tools prioritize workflow, accelerating the process but not reducing total human workload. | Abstract prioritization, relevance ranking |
| 1 | Tools assist with file and reference management. | Citation databases, SR management platforms |

Most tools are still between Level 2 and 3.

O’Connor et al. (2019). A question of trust: can we build an evidence base to gain trust in systematic review automation technologies? Systematic Reviews, 8, pp. 1–8.

Screening

Which screening tool to choose? (1)

| Tool | Platform | Decision labels | License |
|---|---|---|---|
| Abstrackr | Web-based | Relevant, borderline, irrelevant | Free |
| Covidence | Web-based | Include, maybe, exclude | $240–$635 |
| ASReview | Terminal/Python | Relevant, irrelevant | Free |
| RevMan Web / 5 | Web/Desktop | – | $73–$120+ |
| Rayyan | Web-based | Include, undecided, exclude | Free + ($48–$99+) |
| EPPI-Reviewer | Web-based | Include, exclude, 2nd opinion | $145–$506+ |
| DistillerSR | Web-based | Yes, no, can’t tell | $239–$3636 |
| Excel | Web/Desktop | – | Free |
| Sysrev | Web-based | – | $0–$120+ |
| SWIFT-AS | Web-based | Include, exclude | Not listed |
| CADIMA | Web-based | Criteria or comment | Free |
| ReLiS | Web-based | Include, exclude | Free |
| Rayyan AI | Web-based | Include, undecided, exclude | Free + ($100+) |
| Screening.ai | Web-based | Include, exclude | Not listed |
| RobotReviewer | Web-based | Include, exclude | Not listed |
| SRDR+ | Web-based | Include, exclude, quality | Not free, contact |
| Cochrane Crowd | Web-based | Relevant, irrelevant | Free |
| Machine Learning Screening Tool | Web-based | Yes, no, maybe | Free |
| Review Manager | Desktop software | Exclude, include, 2nd opinion | $120+ |

The original comparison also rates each tool on bulk import/export, team screening, blinded screening, and title/abstract vs. full-text screening; see Zhang and Neitzel (2024) for those columns.

Zhang and Neitzel (2024). Choosing the right tool for the job: screening tools for systematic reviews in education. Journal of Research on Educational Effectiveness, 17(3), pp. 513–539.

Which screening tool to choose? (2)

Time savings for the screening step?

→ Reported time savings of 35 to 99%.

Edwards et al. (2024). ADVISE: accelerating the creation of evidence syntheses for global development using natural language processing-supported human-artificial intelligence collaboration. Journal of Mechanical Design, 146(5).
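
Much of this speed-up comes from active-learning prioritization (Level 2 in the table above): a model is trained on the abstracts screened so far and ranks the remainder so likely-relevant records surface first. Below is a minimal sketch of the idea, assuming scikit-learn; it is illustrative only, not the internal implementation of any tool listed above.

```python
# Minimal active-learning prioritization loop (illustrative sketch).
# Assumes `abstracts` is a list of strings and `labels` holds the 1/0
# relevance decisions for the first few human-screened records.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

abstracts = [
    "Meta-analysis of cover crops and soil carbon sequestration ...",
    "A study of consumer preferences for smartphone colors ...",
    "Agroforestry effects on maize yield: a randomized field trial ...",
]
labels = [1, 0]  # human decisions for the first len(labels) abstracts

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts)

n_labeled = len(labels)
model = LogisticRegression(max_iter=1000)
model.fit(X[:n_labeled], labels)

# Rank the unscreened abstracts by predicted relevance so reviewers
# read the most promising records first.
scores = model.predict_proba(X[n_labeled:])[:, 1]
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.2f}  {abstracts[n_labeled + idx][:60]}")
```

In real workflows the model is refit after each batch of human decisions, and screening stops once few new relevant records appear.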

Data Extraction

PICO extraction

  • The PICO (Population, Intervention, Comparison, Outcome) framework is vital for systematic reviews and meta-analyses.
  • Manually recognizing these elements is resource-intensive.
  • BioBERT, SciBERT, and GPT models have significantly enhanced the automation of PICO extraction.

BioBERT & SciBERT for PICO extraction

BioBERT

  • BERT further pre-trained on biomedical corpora (PubMed abstracts, PMC full texts) for biomedical text mining.
    • Applications: Named Entity Recognition (NER) for Population, Intervention, Comparison, Outcome.
    • Performance: Achieved state-of-the-art results for biomedical NER tasks, with F1-scores over 90% on several biomedical benchmarks.
    • Key Paper: Lee et al. (2019), BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics.
    • BioBERT Repository

SciBERT

  • BERT variant pre-trained from scratch on scientific literature across disciplines.
    • Applications: NER for scientific text extraction, including PICO elements, applicable to broader scientific research (beyond biomedicine).
    • Performance: Outperforms traditional models for scientific text classification, achieving F1-scores up to 85% on scientific text benchmarks.
    • Key Paper: Beltagy et al. (2019), SciBERT: A pretrained language model for scientific text. EMNLP.
    • SciBERT Repository
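
Either model is typically applied through the Hugging Face transformers token-classification pipeline once fine-tuned on PICO-annotated data (e.g., the EBM-NLP corpus). A minimal sketch follows; the checkpoint name is hypothetical, standing in for a real PICO-fine-tuned model from the Hub.

```python
# Sketch: PICO entity tagging with a fine-tuned BERT-family model.
# "your-org/scibert-pico-ner" is a HYPOTHETICAL checkpoint name;
# substitute a real PICO-fine-tuned model from the Hugging Face Hub.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="your-org/scibert-pico-ner",  # placeholder checkpoint
    aggregation_strategy="simple",      # merge sub-word tokens into spans
)

abstract = (
    "In 120 smallholder maize farms, intercropping with legumes "
    "was compared to monoculture; grain yield increased by 14%."
)
for entity in ner(abstract):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))
```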

Application to Agricultural Field

  • Fine-tuning: Requires specialized agricultural text corpus.
  • Semantics/Ontology: Uses AGROVOC for better term extraction.

Prediction Performance of BERT Models

| Language Model | Precision (P) | Recall (R) | F1 Score (F1) |
|---|---|---|---|
| Agriculture-BERT | 85.28 | 77.22 | 80.60 |
| Sci-BERT | 83.89 | 75.83 | 79.12 |
| RoBERTa | 83.66 | 75.06 | 78.07 |
| Vanilla BERT | 83.62 | 73.86 | 77.61 |

Panoutsopoulos et al. (2024). Investigating the effect of different fine-tuning configuration scenarios on agricultural term extraction using BERT. Computers and Electronics in Agriculture, 225, p. 109268.

Large Language Models (LLMs)

  • Can generate high-quality PICO identifications when fine-tuned or prompted appropriately.

  • Challenges:

    • Prompt Generation: Requires expert knowledge to craft effective prompts, especially for specific tasks like PICO extraction.
    • Cost: High computational costs for large-scale extraction tasks.
  • Example Study: Automated Mass Extraction of Over 680,000 PICOs from Clinical Study Abstracts

    • 682,667 abstracts processed in < 3 hours, 395,992,770 tokens processed, costing $3390.
    • Accuracy: 98% of PICO elements extracted correctly.

Reason et al. (2024). Automated Mass Extraction of Over 680,000 PICOs from Clinical Study Abstracts Using Generative AI: A Proof-of-Concept Study. Pharmaceutical Medicine.
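
A hedged sketch of this prompting approach with the OpenAI Python client; the model name, prompt, and JSON keys below are illustrative assumptions, not the exact setup of Reason et al. (2024).

```python
# Sketch: zero-shot PICO extraction with an LLM. The model name is a
# placeholder; the published study's exact prompt is not reproduced here.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

abstract = (
    "In 120 smallholder maize farms, intercropping with legumes was "
    "compared to monoculture; grain yield increased by 14%."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system",
         "content": "Extract PICO elements. Reply with JSON keys "
                    "population, intervention, comparison, outcome."},
        {"role": "user", "content": abstract},
    ],
    response_format={"type": "json_object"},  # force valid JSON output
)
pico = json.loads(response.choices[0].message.content)
print(pico["intervention"])
```

Batching abstracts through such a loop is what makes the per-study cost (here, about half a cent per abstract) so low compared with manual extraction.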

2. Structured Data Extraction

Semi-Automated extraction for bias analyses

  • Extract study design, bias domains, experimental design, and other study characteristics.
    Examples: RobotReviewer, DataSeer

Thomason et al. (2020). RobotReviewer: a tool for automated risk of bias assessment in systematic reviews. Cochrane Database of Systematic Reviews.
Jonnalagedda et al. (2021). DataSeer: a semi-automated system for the extraction of quantitative data from scientific articles. BMC Bioinformatics.

  • Uses of LLMs for risk-of-bias assessments. Example: agreement between ChatGPT and expert reviewers was low, with a weighted kappa ranging from 0.11 to 0.29.

Pitre et al. (2023). ChatGPT for assessing risk of bias of randomized trials using the RoB 2.0 tool: a methods study. medRxiv preprint.

Semi-Automated Extraction of GPS Coordinates & Geoparsing

  • Geoparsing & Geolocation via NLP
    Automated identification and extraction of geographical entities (e.g., country, region, site, GPS coordinates).

  • Example Tools
    Tools such as GeoQuery, Perdido, and GeoParser are available.

Lieberman et al. (2010). Geotagging with local lexicons to build indexes for textually-specified spatial data. ICDE.
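
When coordinates are reported as plain decimal degrees in the Methods section, even a simple regular expression recovers many of them before a full geoparser is needed. A minimal sketch (decimal "lat, lon" pairs only; DMS notation and place names require tools like those above):

```python
# Sketch: pull decimal "lat, lon" pairs out of free text.
# Handles only simple decimal-degree pairs; DMS strings (e.g. 43°37'N)
# and place names need a dedicated geoparser.
import re

COORD = re.compile(r"(-?\d{1,2}\.\d+)\s*[,;]\s*(-?\d{1,3}\.\d+)")

text = "Trials were located at 43.61, 3.87 and at -12.05, -77.04."
for lat_s, lon_s in COORD.findall(text):
    lat, lon = float(lat_s), float(lon_s)
    if -90 <= lat <= 90 and -180 <= lon <= 180:  # sanity filter
        print(lat, lon)
```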

Linking Text-Derived Locations with Gridded Environmental Data

  • From Text to Coordinates to Context
    Once coordinates or place names are extracted, they can be automatically linked to environmental data sources.

  • Gridded Data Sources

    • Climate: WorldClim, CHELSA

    • Soil: SoilGrids, ISRIC

    • Biodiversity: GBIF, Map of Life, PREDICTS

  • Application
    Enables meta-regressions or subgroup analyses based on site-level environmental gradients, without manual georeferencing (see the sketch below).
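
A minimal sketch of that linkage with rasterio, assuming a WorldClim GeoTIFF has been downloaded locally (the file name below is a placeholder):

```python
# Sketch: sample a gridded climate layer at extracted study coordinates.
# "wc2.1_10m_bio_1.tif" is a placeholder for a locally downloaded
# WorldClim raster (annual mean temperature); any GeoTIFF works.
import rasterio

sites = [(3.87, 43.61), (-77.04, -12.05)]  # (lon, lat) pairs

with rasterio.open("wc2.1_10m_bio_1.tif") as src:
    for (lon, lat), values in zip(sites, src.sample(sites)):
        print(f"site ({lat}, {lon}): annual mean temp = {values[0]}")
```

The sampled values can then be attached as moderator columns in a meta-regression dataset.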

✅ Intermediate Conclusion: Scalable Study Characterization

AI tools are now mature enough to extract key study descriptors:

→ Ideal for fast and structured evidence maps.
→ Works best for descriptive metadata.

Tip: Combine LLMs + XML parsers + domain ontologies for flexible, transparent pipelines.
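
For the XML-parser piece of such a pipeline, a minimal sketch with lxml, assuming full texts are available as JATS XML (the file name is a placeholder):

```python
# Sketch: extract the abstract from a JATS XML full text with lxml,
# ready to feed into an LLM or ontology matcher. "article.xml" is a
# placeholder file name; JATS elements are typically un-namespaced.
from lxml import etree

tree = etree.parse("article.xml")
fragments = tree.xpath("//abstract//text()")
abstract = " ".join(" ".join(fragments).split())  # collapse whitespace
print(abstract[:300])
```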

3. Extraction from Text, Tables, and Figures

  • WebPlotDigitizer, PlotDigitizer, metagear, … to extract numerical data points from figures.
    Lajeunesse (2016). Facilitating systematic reviews, data extraction and meta-analysis with the metagear package for R. Methods in Ecology and Evolution, 7(3), pp. 323–330.

  • Tabula, pdftools, Camelot to convert embedded tables into structured formats (CSV, JSON); see the sketch after this list.
    Deng et al. (2025). An automatic selective PDF table-extraction method for collecting materials data from literature. Advances in Engineering Software, 204, p. 10389.

  • LLMs + prompt engineering: output quality remains uncertain.
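
As one example of the table route, a minimal sketch with camelot-py, assuming the PDF contains ruled ("lattice") tables; the file name is a placeholder:

```python
# Sketch: convert embedded PDF tables to CSV with camelot-py.
# "paper.pdf" is a placeholder; flavor="lattice" expects ruled tables,
# use flavor="stream" for whitespace-separated layouts.
import camelot

tables = camelot.read_pdf("paper.pdf", pages="1-end", flavor="lattice")
print(f"found {tables.n} tables")
for i, table in enumerate(tables):
    table.df.to_csv(f"table_{i}.csv", index=False)  # table.df is a DataFrame
```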

⚠️ Intermediate Conclusion: Limited Automation for Quantitative Data

  • AI can assist, but it often struggles with heterogeneous formats, ambiguous context, and supplementary files.
  • Robust extraction still requires manual validation and domain expertise.

Toward Living Evidence Systems

(Semi-)automating data updates is key to avoiding obsolescence in fast-evolving fields.

Emerging Platforms
- MetaDataset (link): Open, machine-readable, updateable datasets.
- Impact4soil: Platform for climate-related living reviews. https://www.impact4soil.com/

Beyond AI

  • Interoperability, e.g. ontologies and metadata standards (AGROVOC, DATA4C+, ERA)

  • Data sharing

  • Standardization across research teams and domains

  • References:
    Fujisaki et al. (2022). Semantics about soil organic carbon storage: DATA4C+, a comprehensive thesaurus and classification of management practices in agriculture and forestry. EGUsphere.
    Rosenstock et al. (2024). Evidence for Resilient Agriculture Dataset. Alliance of Bioversity International and CIAT.
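
As a concrete interoperability example, AGROVOC can be queried programmatically over SPARQL. A sketch with SPARQLWrapper; the endpoint URL and query shape are assumptions to check against the current AGROVOC documentation:

```python
# Sketch: look up AGROVOC concepts whose English preferred label
# mentions "agroforestry". The endpoint URL is an assumption; verify
# it against the current AGROVOC documentation before relying on it.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://agrovoc.fao.org/sparql")
sparql.setQuery("""
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?concept ?label WHERE {
      ?concept skos:prefLabel ?label .
      FILTER (lang(?label) = "en" && CONTAINS(LCASE(STR(?label)), "agroforestry"))
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["concept"]["value"], "->", row["label"]["value"])
```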

Compilation of Databases of Meta-analyses

  • Compilation of 42 meta-analyses on agroforestry (Cook-Patton et al., in progress).