Scaling Observational Analytics Across Europe

Case Study — Formation Bio Senior Data Engineer, RWD

Adam Black

2026-05-05

Darwin EU — a real-world data network for regulators

Problem statement: Build a scalable data system to provide the European Medicines Agency with timely and accurate health data across Europe.

Agenda

  • Architecture decisions (CDMConnector & new tools)

  • Consensus building across teams (Oxford & Erasmus)

  • The package ecosystem we shipped (omopverse → 25+ packages)

  • AI tooling we layered on top

  • What I learned

https://www.darwin-eu.org/

~16 engineer-weeks for the foundation; full project timeline 2022–2026.

The Darwin EU real-world data network

40+ databases across 12 countries.

Each partner runs studies behind their firewall.

Patient-level data never leaves the source.

Common substrate: OMOP CDM.

Distributed analytics: study code travels to the data, results come back.
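
The distributed pattern can be sketched in a few lines of base R. This is a toy illustration, not Darwin EU's execution code: the site names, the `run_study()` function, and the data are all invented, and a real study would add safeguards such as small-cell suppression before results leave a site.

```r
# Toy patient-level data held at three hypothetical sites (never pooled).
sites <- list(
  site_a = data.frame(person_id = 1:5, exposed = c(TRUE, TRUE, FALSE, TRUE, FALSE)),
  site_b = data.frame(person_id = 1:3, exposed = c(FALSE, TRUE, FALSE)),
  site_c = data.frame(person_id = 1:4, exposed = c(TRUE, TRUE, TRUE, FALSE))
)

# Hypothetical study code: runs locally and returns aggregates only.
run_study <- function(patients) {
  data.frame(n_persons = nrow(patients), n_exposed = sum(patients$exposed))
}

# "Study code travels to the data": the same function runs at every site.
site_results <- lapply(sites, run_study)

# "Results come back": the coordinator only ever sees one row per site.
results <- do.call(rbind, site_results)
results$database <- rownames(results)
pooled <- colSums(results[, c("n_persons", "n_exposed")])

print(results)
print(pooled)
```

The coordinator never touches `sites` directly; it only receives what `run_study()` returns, which is the property that keeps patient-level data behind each firewall.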

Scale targets:

  • 40–50 studies per year
  • 6–10 simultaneous studies in flight
  • Regulatory-grade reproducibility
  • Data partner staff are scientists, not platform engineers

What Darwin EU delivers

A fork in the road · summer 2022

Architectural decision #1 · Functional SQL via dbplyr

# Before — OHDSI-SQL: string templating + dialect translation
sql <- "SELECT person_id, MIN(drug_exposure_start_date) AS first_exposure
        FROM @cdm_schema.drug_exposure
        WHERE drug_concept_id IN (@drug_ids)
        GROUP BY person_id;"
rendered <- SqlRender::render(sql, cdm_schema = "main", drug_ids = drug_ids)
translated <- SqlRender::translate(rendered, targetDialect = "snowflake")

# After — dbplyr functional SQL
cdm$drug_exposure |>
  filter(drug_concept_id %in% local(drug_ids)) |>
  group_by(person_id) |>
  summarise(first_exposure = min(drug_exposure_start_date, na.rm = TRUE))

Tradeoff accepted: occasional dbplyr translation gaps in exchange for composability, R-level type checking, and unit-testability on DuckDB.
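
A minimal sketch of the composability argument, assuming only that dplyr and dbplyr are installed. The `first_exposure()` helper and the concept IDs are illustrative, not part of any Darwin EU package. dbplyr's simulated connections let the same pipeline be rendered for two dialects without touching a database.

```r
library(dplyr)
library(dbplyr)

# Hypothetical reusable step: first exposure date per person.
first_exposure <- function(drug_exposure, drug_ids) {
  drug_exposure |>
    filter(drug_concept_id %in% !!drug_ids) |>
    group_by(person_id) |>
    summarise(first_exposure = min(drug_exposure_start_date, na.rm = TRUE))
}

# Column layout only; no rows and no database connection are needed.
template <- data.frame(
  person_id = integer(),
  drug_concept_id = integer(),
  drug_exposure_start_date = as.Date(character())
)

drug_ids <- c(1503297L, 40164929L)  # illustrative concept IDs

# Render the same R pipeline for two simulated backends.
sql_postgres <- tbl_lazy(template, con = simulate_postgres()) |>
  first_exposure(drug_ids) |>
  sql_render()

sql_mssql <- tbl_lazy(template, con = simulate_mssql()) |>
  first_exposure(drug_ids) |>
  sql_render()

print(sql_postgres)
print(sql_mssql)
```

Because `first_exposure()` is an ordinary R function over a lazy table, it can be piped into further steps or unit tested, which is hard to do with a templated SQL string.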

Architectural decision #2 · CDMConnector as the platform contract

A single small package that:

  • Connects to a CDM in any backend (Postgres, Snowflake, DuckDB, BigQuery, Spark)
  • Exposes tables as lazy dbplyr frames
  • Carries CDM metadata (vocab version, snapshot, source name)
  • Supports subsetting to a “small CDM” for speed and portability
  • Ensures downstream tools and studies work on all SQL backends

library(CDMConnector)

cdm <- cdmFromCon(con, cdmSchema = "main", writeSchema = "main")

# Function 'verbs' that transform a whole CDM (arguments omitted for brevity)
cdm_small <- cdm |>
  cdmSelect() |>   # subset tables
  cdmSubset() |>   # subset persons
  cdmFlatten() |>  # denormalize the data model into a single table
  collect()        # pull data from the SQL backend into R

https://darwin-eu.github.io/CDMConnector/

CI tests on supported SQL backends: https://github.com/darwin-eu/CDMConnector/actions
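
To make the idea of a platform contract concrete, here is a toy stand-in for a CDM reference. It is a sketch under stated assumptions, not how CDMConnector is implemented: `toy_cdm_from_con()` and the `toy_cdm` class are invented for illustration. It shows the two properties listed above, lazy tables and metadata that travels with the object.

```r
library(DBI)
library(dplyr)

# Toy CDM reference (NOT the CDMConnector implementation):
# a named list of lazy tables with metadata carried as attributes.
toy_cdm_from_con <- function(con, table_names, cdm_name) {
  tables <- lapply(table_names, function(nm) tbl(con, nm))
  names(tables) <- table_names
  structure(tables, cdm_name = cdm_name, class = "toy_cdm")
}

con <- dbConnect(duckdb::duckdb())  # in-memory DuckDB
dbWriteTable(con, "person",
             data.frame(person_id = 1:3, year_of_birth = c(1980L, 1990L, 1975L)))
dbWriteTable(con, "observation_period", data.frame(person_id = 1:3))

cdm <- toy_cdm_from_con(con, c("person", "observation_period"), cdm_name = "toy")

# Tables are lazy: the filter and count run as SQL inside DuckDB.
n_born_before_1985 <- cdm$person |>
  filter(year_of_birth < 1985) |>
  count() |>
  pull(n)

cdm_name <- attr(cdm, "cdm_name")
dbDisconnect(con, shutdown = TRUE)
```

Downstream code only needs to agree on the table names and the metadata accessors, which is what lets one object work across backends.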

Architectural decision #3 · Synthetic CDMs as first-class fixtures

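# Eunomia ships a synthetic OMOP CDM you can run real queries against
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = CDMConnector::eunomiaDir())

# Hand-crafted mock CDM: sub-second tests against tiny fixtures
# (omopgenerics expects both person and observation_period tables)
mock <- omopgenerics::cdmFromTables(
  tables = list(
    person = dplyr::tibble(person_id = 1L,
                           gender_concept_id = 8507L,
                           year_of_birth = 1980L,
                           race_concept_id = 0L,
                           ethnicity_concept_id = 0L),
    observation_period = dplyr::tibble(observation_period_id = 1L,
                                       person_id = 1L,
                                       observation_period_start_date = as.Date("2000-01-01"),
                                       observation_period_end_date = as.Date("2020-12-31"),
                                       period_type_concept_id = 0L)
  ),
  cdmName = "test_cdm"
)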

Synthetic data has been critical for testing analytic code before sending it out to the network for execution.

CDMConnector CDMs: https://darwin-eu.github.io/CDMConnector/reference/exampleDatasets.html

TestGenerator: https://darwin-eu-dev.github.io/TestGenerator/

omock: https://ohdsi.github.io/omock/
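
A minimal sketch of what a fixture-driven test can look like, assuming DBI, dplyr, dbplyr, and duckdb are installed. The `adults_in_year()` function and the fixture are hypothetical; the point is that each hand-crafted row exists to exercise one edge case and the expected answer is known in advance.

```r
library(DBI)
library(dplyr)

# Hypothetical analytic step under test: persons aged 18+ in a given year.
adults_in_year <- function(person, index_year) {
  person |>
    filter(!!index_year - year_of_birth >= 18L) |>
    select(person_id)
}

# Hand-crafted fixture: adult, exactly 18, 17, and missing year of birth.
fixture <- data.frame(
  person_id     = 1:4,
  year_of_birth = c(1980L, 2002L, 2003L, NA)
)

con <- dbConnect(duckdb::duckdb())  # in-memory DuckDB
dbWriteTable(con, "person", fixture)

# The filter runs as SQL, so NULL handling is tested the way it will run in production.
adult_ids <- tbl(con, "person") |>
  adults_in_year(index_year = 2020L) |>
  arrange(person_id) |>
  pull(person_id)

dbDisconnect(con, shutdown = TRUE)
print(adult_ids)
```

The person with a missing year of birth drops out because the comparison evaluates to NULL in SQL, which is exactly the kind of behavior worth pinning down before code goes out to the network.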

The package ecosystem (omopverse)

                       omopgenerics (contract definitions)
                            │
                       CDMConnector (SQL database compatibility)
                            │
        ┌───────────────────┼───────────────────┐    (cohort building)
   CodelistGenerator    CohortConstructor   PatientProfiles
                            │
                            │
                            │
       ┌────────┬───────────┼──────────┬────────────┐    (analytics)
  Incidence-    Drug-       Cohort-    Treatment-   Cohort-
  Prevalence    Utilisation Survival   Patterns     Characteristics

20+ packages, all sharing the omopgenerics class definitions. A bug fix in CDMConnector propagates to every downstream tool and study.

https://ohdsi.github.io/Tidy-R-programming-with-OMOP/

The data behind the platform · claims, EHR, registry

OMOP is the lingua franca. One cohort definition runs against all sources — but the pitfalls differ.

| Source | Strengths | Limitations | Darwin EU examples |
| --- | --- | --- | --- |
| Claims | Large scale, good medication utilization, population-wide coverage | Often lacks the clinical depth some studies need | INGEF (Germany), PharMetrics (US) |
| Primary Care EHR | Rich longitudinal primary care data, labs, notes; good for chronic conditions; population coverage | Hospital and acute care missing | CPRD GOLD (UK), IPCI (NL) |
| Hospital EHR | Rich inpatient and specialty care data, labs, notes; good for acute care | Limited out-of-hospital and follow-up data | CDW Bordeaux (FR), FinOMOP-HUS (Finland) |
| Registry/Biobanks | Clinical depth, validated outcomes | Narrow scope, smaller N, sometimes missing history | NCR (NL), Harmony (EU), Estonian Biobank |

The platform’s job is to make the same analytic code valid across all types of RWD.

Building a cohort · new users of metformin with T2DM

library(CDMConnector)
library(CodelistGenerator)
library(CohortConstructor)

con <- DBI::dbConnect(duckdb::duckdb(), eunomiaDir("synpuf-110k"))
cdm <- cdmFromCon(con, "main", "main")

# 1. Codelists — concept sets with vocabulary descendants
metformin <- getDrugIngredientCodes(cdm, name = "metformin")
# search codes
t2dm <- getCandidateCodes(cdm, keywords = "type 2 diabetes", domains = "Condition", includeDescendants = TRUE)
t2dmCodelist <- asCodelist(t2dm)

# 2. Cohort pipeline — composable, inspectable, backend-agnostic
cdm$metformin_new_users <- cdm |>
  conceptCohort(conceptSet = metformin, name = "metformin_new_users") |>
  requireIsFirstEntry() |>
  requirePriorObservation(minPriorObservation = 365) |>
  requireDemographics(ageRange = list(c(18, 100))) |>
  requireConceptIntersect(conceptSet = t2dmCodelist, window = c(-Inf, 0))

cdm$metformin_new_users
#> # Source:   table<metformin_new_users> [?? x 4]
#> # Database: DuckDB 1.5.2 [root@Darwin 25.4.0:R 4.5.1//private/var/folders/2j/8z0yfn1j69q8sxjc7vj9yhz40000gp/T/RtmpzH6MGs/file11acd7dca52d9.duckdb]
#>    cohort_definition_id subject_id cohort_start_date cohort_end_date
#>                   <int>      <int> <date>            <date>         
#>  1                    1     112989 2010-01-13        2010-04-12     
#>  2                    1      81257 2009-05-07        2009-06-05     
#>  3                    1     112807 2009-01-22        2009-03-02     
#>  4                    1      41720 2009-06-20        2009-07-19     
#>  5                    1      83797 2009-04-16        2009-05-15     
#>  6                    1      63760 2009-06-10        2009-07-09     
#>  7                    1      11203 2009-06-25        2009-07-24     
#>  8                    1      13093 2009-06-30        2009-07-29     
#>  9                    1      68137 2009-07-07        2009-08-05     
#> 10                    1      78858 2009-05-11        2009-06-09     
#> # ℹ more rows

# 3. Inspect attrition at every step — the audit trail clinicians review
attrition(cdm$metformin_new_users)
#> # A tibble: 11 × 7
#>    cohort_definition_id number_records number_subjects reason_id reason         
#>                   <int>          <int>           <int>     <int> <chr>          
#>  1                    1          86263           37781         1 Initial qualif…
#>  2                    1          86263           37781         2 Record in obse…
#>  3                    1          86263           37781         3 Not missing re…
#>  4                    1          78358           37781         4 Merge overlapp…
#>  5                    1          37781           37781         5 Restricted to …
#>  6                    1          15495           15495         6 Prior observat…
#>  7                    1          15481           15481         7 Age requiremen…
#>  8                    1          15481           15481         8 Sex requiremen…
#>  9                    1          15481           15481         9 Prior observat…
#> 10                    1          15481           15481        10 Future observa…
#> 11                    1          11320           11320        11 Concept candid…
#> # ℹ 2 more variables: excluded_records <int>, excluded_subjects <int>

Same pipeline runs on DuckDB, Snowflake, Postgres, BigQuery, Redshift, SQL Server, Databricks — no rewrites.
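
The attrition table is essentially an audit trail that grows by one row per filtering step. A toy sketch of that bookkeeping follows, using dplyr on local data. This is not how CohortConstructor implements it: the `apply_step()` helper and the data are invented for illustration.

```r
library(dplyr)

# Toy cohort entries (invented data): one row per candidate record.
cohort <- tibble(
  subject_id        = c(1, 1, 2, 3, 4),
  cohort_start_date = as.Date(c("2010-01-01", "2011-05-01", "2010-03-15",
                                "2009-07-01", "2010-09-09")),
  prior_obs_days    = c(400, 886, 120, 500, 365),
  age               = c(45, 46, 60, 17, 30)
)

# Hypothetical helper: apply one step, then append a row to the audit trail.
apply_step <- function(state, reason, step) {
  kept <- step(state$cohort)
  row <- tibble(reason = reason,
                number_records  = nrow(kept),
                number_subjects = n_distinct(kept$subject_id))
  list(cohort = kept, attrition = bind_rows(state$attrition, row))
}

first_entry <- function(x) {
  x |> group_by(subject_id) |> slice_min(cohort_start_date) |> ungroup()
}

state <- list(cohort = cohort, attrition = NULL) |>
  apply_step("Initial qualifying events", identity) |>
  apply_step("Restricted to first entry", first_entry) |>
  apply_step("Prior observation >= 365 days", function(x) filter(x, prior_obs_days >= 365)) |>
  apply_step("Age 18 to 100", function(x) filter(x, age >= 18, age <= 100))

print(state$attrition)
```

Keeping the counts next to the reason for each step is what lets a clinician see where patients were lost and challenge any single criterion.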

Cohort Evaluation

The validation loop that makes a cohort believable

  1. Build the cohort in Atlas (UI) or R code
  2. Run a diagnostics package
  3. Review with Principal Investigator (PI)
  4. Iterate on codelists and cohort logic until the PI signs off

Things we look for when performing cohort evaluation

  • Index-date misclassification: e.g., treatment before index
  • Specificity: do characterizations look reasonable for this patient population?
  • Missing codes: are there codes in the data we should be including?
  • Plausibility: incidence rates, person counts, data-date cutoffs

library(PatientProfiles)

# Add context columns to every cohort record; the work runs as SQL on the backend.
# Assumes a "comorbidities" cohort table and a vector of concept IDs,
# hba1c_codes, already exist.
profiles <- cdm$metformin_new_users |>
  addDemographics() |>
  addCohortIntersectFlag(targetCohortTable = "comorbidities", window = c(-Inf, 0)) |>
  addConceptIntersectCount(conceptSet = list(hba1c = hba1c_codes), window = c(-365, 0))
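
The index-date misclassification check from the list above can be sketched on local data. This is a toy example with invented records, not a Darwin EU diagnostic: it flags supposed new users who have an exposure strictly before their index date.

```r
library(dplyr)

# Invented toy data: cohort index dates and all exposure records.
cohort <- tibble(
  subject_id        = c(1, 2, 3),
  cohort_start_date = as.Date(c("2010-01-13", "2009-05-07", "2009-01-22"))
)
exposures <- tibble(
  subject_id = c(1, 2, 2, 3),
  drug_exposure_start_date = as.Date(c("2010-01-13", "2008-11-02",
                                       "2009-05-07", "2009-01-22"))
)

# A "new user" with any exposure before index is misclassified.
misclassified <- cohort |>
  inner_join(exposures, by = "subject_id") |>
  filter(drug_exposure_start_date < cohort_start_date) |>
  distinct(subject_id)

print(misclassified)
```

A non-empty result is a prompt to revisit the codelist or the first-entry logic with the PI, not proof that the cohort is wrong.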

From cohort to evidence · one definition, many studies

        
                        cdm$metformin_new_users  ─┐
                                                  │
    ┌─────────────────────┬───────────────────────┼─────────────────┬─────────────────┐
    ▼                     ▼                       ▼                 ▼                 ▼
CohortCharacteristics     IncidencePrevalence     DrugUtilisation   CohortSurvival    TreatmentPatterns
(table 1, comorbidities)  (background rates)      (adherence)       (time-to-event)   (sequences of events)

  • One cohort, many downstream questions: characterisation, drug utilisation, comparative effectiveness, safety signals
  • Every downstream package consumes the same cohort_table contract from omopgenerics
  • A bug fix in cohort logic propagates to every study

The compounding effect: we significantly lowered the marginal cost per study.
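
A shared contract is cheap to enforce at package boundaries. The sketch below is a hypothetical validator in base R, not the omopgenerics implementation. It checks the four cohort columns shown in the printed cohort table earlier, plus one basic date rule.

```r
# Columns every cohort table carries, per the printed output earlier.
required_cols <- c("cohort_definition_id", "subject_id",
                   "cohort_start_date", "cohort_end_date")

# Hypothetical validator (not the omopgenerics implementation).
validate_cohort <- function(x) {
  missing_cols <- setdiff(required_cols, names(x))
  if (length(missing_cols) > 0) {
    stop("Missing cohort columns: ", paste(missing_cols, collapse = ", "))
  }
  if (any(x$cohort_end_date < x$cohort_start_date)) {
    stop("cohort_end_date must not precede cohort_start_date")
  }
  invisible(x)
}

good <- data.frame(cohort_definition_id = 1L, subject_id = 112989L,
                   cohort_start_date = as.Date("2010-01-13"),
                   cohort_end_date   = as.Date("2010-04-12"))
bad <- good[, c("subject_id", "cohort_start_date")]

validate_cohort(good)  # passes silently
bad_message <- tryCatch(validate_cohort(bad), error = conditionMessage)
print(bad_message)
```

When every analytic package validates its input this way, a malformed cohort fails at the boundary with a clear message instead of producing a wrong estimate downstream.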

How we found allies and achieved consensus

This wasn’t a top-down process. It was a bottom-up coalition:

  • Found early allies and adopters at Oxford
  • Gave multiple presentations showcasing the benefits of the cdm_reference ORM and “functional” SQL
  • Co-authored Software Requirements Specifications and quickly built prototypes
  • Early adopters became contributors and evangelists

https://cran.r-project.org/web/packages/CDMConnector/index.html

Lesson: a functional, cross-platform query language gave us much more flexibility and speed than the existing OHDSI tools.

CDMConnector now has >20 reverse dependency packages and is used in all Darwin-EU studies.

Dev practices that kept us honest

Cadence

  • Weekly development meetings to sync stakeholders and highlight issues
  • Extensive use of GitHub for observability, collaboration, issue tracking, and CI pipelines
  • Monthly release cadence with predictable timelines
  • Public roadmap on GitHub Projects

Quality bars

  • Test coverage: ≥80% line coverage on synthetic CDM
  • Every release: CI against SQL backends (DuckDB, Postgres, Snowflake, Spark)
  • Evolution: clear function lifecycle communication and deprecation warnings
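
A deprecation warning can be sketched with base R alone. The function names below are hypothetical, and `.Deprecated()` is used here as a stand-in; the `lifecycle` package is a common alternative that offers staged deprecation.

```r
# Hypothetical API: the snake_case name is retired in favour of camelCase.
buildCohort <- function(ids) {
  sort(unique(ids))
}

build_cohort <- function(ids) {
  .Deprecated("buildCohort")  # base R: warns and names the replacement
  buildCohort(ids)
}

# Old code keeps working but surfaces a deprecation warning.
warning_text <- NULL
result <- withCallingHandlers(
  build_cohort(c(3, 1, 3)),
  warning = function(w) {
    warning_text <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)

print(result)
print(warning_text)
```

Keeping the old entry point alive for a release or two gives study authors time to migrate without breaking studies already in flight.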

DARWIN package CI dashboard

Supporting Darwin-EU data scientists

Darwin-EU studies are delivered as R packages.

My role as platform lead is to:

  • Work alongside data scientists and implement studies myself, so I understand the real issues
  • Develop tools my team relies on:
    • CDMConnector (SQL translation)
    • Posit Connect on Azure (app deployment platform)
    • Arachne Execution Engine
  • Train new data scientists
  • Develop new tools as needs arise

Containerized study execution · Arachne

Study packages ship with

  • Dockerfile

  • renv.lock (R dependency file)

  • GitHub Actions workflow to build the study Docker image and push to Azure Container Registry

  • A Shiny app for reviewing results

Minimal example: https://github.com/darwin-eu-dev/ExampleStudy/

Arachne loads these studies from the registry and facilitates containerized execution and results review.

Arachne · study repository

Arachne · study editor and run log

AI layer · three places it earned its place

1. Test patient generation

LLM-driven generator that produces OMOP-CDM patients matching a clinical narrative. Replaces hand-written fixtures for test cases.

2. Hecate vocabulary search

Vector embeddings over OMOP concepts power a semantic search that finds T2DM (concept 201826) even when the analyst types “adult-onset diabetes.” It also runs as an MCP server and supports multilingual vocabulary search.

https://hecate.pantheon-hds.com/

https://hecate.pantheon-hds.com/openapi/
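
The core of embedding-based search is ranking by vector similarity. The sketch below is a toy in base R and says nothing about how Hecate is actually built: the three-dimensional vectors are invented, the concept IDs other than 201826 are illustrative, and a real system would use embeddings from a trained language model plus an approximate nearest-neighbour index.

```r
# Invented 3-dimensional "embeddings"; real ones come from a trained model.
concepts <- data.frame(
  concept_id   = c(201826L, 201254L, 316866L),
  concept_name = c("Type 2 diabetes mellitus", "Type 1 diabetes mellitus",
                   "Hypertensive disorder")
)
embeddings <- rbind(
  c(0.9,  0.4, 0.1),
  c(0.7, -0.6, 0.1),
  c(0.1,  0.1, 0.9)
)

cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hypothetical search: rank concepts by cosine similarity to the query vector.
search_concepts <- function(query_vec, top_n = 1) {
  scores <- apply(embeddings, 1, cosine, b = query_vec)
  ranked <- order(scores, decreasing = TRUE)[seq_len(top_n)]
  cbind(concepts[ranked, ], score = round(scores[ranked], 3))
}

# Pretend this vector is the embedding of "adult-onset diabetes".
query <- c(0.8, 0.5, 0.0)
top_hit <- search_concepts(query)
print(top_hit)
```

The query never has to share a word with the concept name; it only has to land near it in the embedding space, which is why synonyms and other languages can resolve to the same concept.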

3. AI assistant

Chatbot with access to omopverse package documentation, CDM metadata, the Darwin network metadata, and the OMOP Vocabulary. It can automatically build concept sets and look up record counts across the network.

AI test-patient generator · narrative to OMOP

AI test-patient generator · timeline view

Darwin Study Design Assistant

Impact

Darwin EU has been a clear success. No comparable network in the US or elsewhere delivers near-real-time evidence for regulatory decision-making with the same scientific rigor.

| Metric | Value |
| --- | --- |
| Studies delivered to EMA | 110 |
| Time to run a new “off the shelf” study | ~3 months |
| Study capacity per year | 40+ |
| Patients covered | 248,621,000 |

These numbers illustrate the platform’s compounding effect, which should keep growing over time.

Highlighted DARWIN EU studies

What I learned

  • Silos are dangerous — cross-team open communication is key
  • Empower your team to build what they know they need
  • Aggressively remove unnecessary steps — adapt as needs change
  • There is no substitute for spending time together in person

What I’d carry to Formation Bio

Human-centered design

  • Spend time with consumers to truly understand needs
  • Iterate quickly and test solutions
  • Focus on getting interfaces right first, then optimize implementation
  • Technology and data access as a tool for impact, not the end goal

“The impediment to action advances action.
What stands in the way becomes the way.”

— Marcus Aurelius, Meditations

Thanks for listening. Let’s build the future together.