Scaling Observational Analytics Across Europe

Case Study — Formation Bio Senior Data Engineer, RWD

Adam Black

2026-05-05

Darwin EU — a real-world data network for regulators

Problem statement: Build a scalable data system to provide the European Medicines Agency with timely and accurate health data across Europe.

Agenda

Architecture decisions (CDMConnector & new tools)
Consensus building across teams (Oxford & Erasmus)
The package ecosystem we shipped (omopverse → 25+ packages)
AI tooling we layered on top
What I learned

https://www.darwin-eu.org/

~16 engineer-weeks for the foundation; full project timeline 2022–2026.

The Darwin EU real-world data network

40+ databases across 12 countries.

Each partner runs studies behind their firewall.

Patient-level data never leaves the source.

Common substrate: OMOP CDM.

Distributed analytics: study code travels to the data, results come back.
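
A minimal sketch of the "code travels, results come back" pattern, in base R. The two site tables and the `run_study()` function are hypothetical stand-ins, not Darwin EU code; the point is only that each partner runs the same function locally and returns aggregates, never person-level rows.

```r
# Toy stand-ins for two data partners' local person tables (hypothetical data).
sites <- list(
  site_a = data.frame(person_id = 1:4, year_of_birth = c(1950, 1962, 1980, 1991)),
  site_b = data.frame(person_id = 1:3, year_of_birth = c(1948, 1975, 2001))
)

# The "study code": runs next to the data and returns aggregate counts only.
run_study <- function(person) {
  decade <- floor(person$year_of_birth / 10) * 10
  counts <- table(decade)
  data.frame(
    birth_decade = as.integer(names(counts)),
    n_persons = as.integer(counts)
  )
}

# Each partner executes the same function locally; only summaries are pooled.
results <- lapply(sites, run_study)
pooled <- do.call(
  rbind,
  Map(function(res, site) cbind(site = site, res), results, names(results))
)
rownames(pooled) <- NULL
print(pooled)
```

The pooled table carries a site label and counts but no `person_id` column, which is the property the network depends on.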

Scale targets:

40–50 studies per year
6–10 simultaneous studies in flight
Regulatory-grade reproducibility
Data partner staff are scientists, not platform engineers

What Darwin EU delivers

A fork in the road · summer 2022

Architectural decision #1 · Functional SQL via dbplyr

# Before — OHDSI-SQL: string templating + dialect translation
sql <- "SELECT person_id, MIN(drug_exposure_start_date) AS first_exposure
        FROM @cdm_schema.drug_exposure
        WHERE drug_concept_id IN (@drug_ids)
        GROUP BY person_id;"
rendered <- SqlRender::render(sql, cdm_schema = "main", drug_ids = drug_ids)
translated <- SqlRender::translate(rendered, targetDialect = "snowflake")

# After — dbplyr functional SQL
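# drug_ids is an integer vector of drug concept IDs defined elsewhere in the study code
library(dplyr)

cdm$drug_exposure |>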
  filter(drug_concept_id %in% local(drug_ids)) |>
  group_by(person_id) |>
  summarise(first_exposure = min(drug_exposure_start_date, na.rm = TRUE))

Tradeoff accepted: occasional dbplyr translation gaps in exchange for composability, R-level type checking, and unit-testability on DuckDB.
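
To illustrate the unit-testability point, here is a sketch that wraps the dbplyr query in a function and checks it against an in-memory DuckDB table. The three-row `drug_exposure` fixture and the `first_exposure()` helper are assumptions made for this example; it needs the DBI, duckdb, dplyr, and dbplyr packages.

```r
library(DBI)
library(dplyr)

# In-memory DuckDB standing in for a real CDM backend (toy fixture).
con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "drug_exposure", data.frame(
  person_id = c(1L, 1L, 2L),
  drug_concept_id = c(10L, 10L, 20L),
  drug_exposure_start_date = as.Date(c("2020-03-01", "2020-01-15", "2021-06-30"))
))

# The query as an ordinary R function: it takes a lazy table and returns a lazy table.
first_exposure <- function(drug_exposure, drug_ids) {
  drug_exposure |>
    filter(drug_concept_id %in% local(drug_ids)) |>
    group_by(person_id) |>
    summarise(first_exposure = min(drug_exposure_start_date, na.rm = TRUE))
}

# Run it on the fixture and pull the small result into R.
result <- tbl(con, "drug_exposure") |>
  first_exposure(drug_ids = c(10L, 11L)) |>
  collect()

dbDisconnect(con, shutdown = TRUE)
print(result)
```

Because the function only sees a lazy table, the same test body can in principle be pointed at another backend by swapping the connection.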

Architectural decision #2 · `CDMConnector` as the platform contract

A single small package that:

Connects to a CDM in any backend (Postgres, Snowflake, DuckDB, BigQuery, Spark)
Exposes tables as lazy dbplyr frames
Carries CDM metadata (vocab version, snapshot, source name)
Supports subsetting to a “small CDM” for speed and portability
Ensures downstream tools and studies work on all SQL backends

cdm <- cdmFromCon(con, cdmSchema = "main", writeSchema = "main")

# function 'verbs' that transform a whole cdm
cdm_small <- cdm |>
  cdmSelect() |> # subset tables
  cdmSubset() |> # subset persons
  cdmFlatten() |> # denormalize data model into a single table
  collect() # Pull data from SQL backend into R

https://darwin-eu.github.io/CDMConnector/

CI tests on supported SQL backends: https://github.com/darwin-eu/CDMConnector/actions

Architectural decision #3 · Synthetic CDMs as first-class fixtures

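library(dplyr) # provides tibble()

# Eunomia ships a synthetic OMOP CDM you can run real queries against
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = CDMConnector::eunomiaDir())

# Hand-crafted mock CDM: sub-second tests against tiny fixtures.
# omopgenerics requires both a person and an observation_period table.
mock <- omopgenerics::cdmFromTables(
  tables = list(
    person = tibble(person_id = 1L,
                    gender_concept_id = 8507L,
                    year_of_birth = 1980L,
                    race_concept_id = 0L,
                    ethnicity_concept_id = 0L),
    observation_period = tibble(observation_period_id = 1L,
                                person_id = 1L,
                                observation_period_start_date = as.Date("2000-01-01"),
                                observation_period_end_date = as.Date("2023-12-31"),
                                period_type_concept_id = 0L)
  ),
  cdmName = "test_cdm"
)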

Synthetic data has been critical for testing analytic code before sending it out to the network for execution.

CDMConnector CDMs: https://darwin-eu.github.io/CDMConnector/reference/exampleDatasets.html

TestGenerator: https://darwin-eu-dev.github.io/TestGenerator/

omock: https://ohdsi.github.io/omock/

The package ecosystem (omopverse)

                       omopgenerics (contract definitions)
                            │
                       CDMConnector (SQL database compatibility)
                            │
        ┌───────────────────┼───────────────────┐    (cohort building)
   CodelistGenerator    CohortConstructor   PatientProfiles
                            │
                            │
                            │
       ┌────────┬───────────┼──────────┬────────────┐    (analytics)
  Incidence-    Drug-       Cohort-    Treatment-   Cohort-
  Prevalence    Utilisation Survival   Patterns     Characteristics

20+ packages, all sharing the omopgenerics class definitions. A bug fix in CDMConnector propagates to every downstream tool and study.
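
One way to picture a shared class contract is a small S3 constructor that validates its input once, so downstream functions only check the class. This is a base-R sketch under that assumption; `new_cohort_table()`, `count_subjects()`, and the `toy_cohort_table` class are hypothetical names, not the omopgenerics implementation.

```r
# Hypothetical constructor: validates the four standard OMOP cohort columns
# and tags the object with a class that downstream functions can check.
new_cohort_table <- function(x) {
  required <- c("cohort_definition_id", "subject_id",
                "cohort_start_date", "cohort_end_date")
  missing_cols <- setdiff(required, names(x))
  if (length(missing_cols) > 0) {
    stop("cohort table is missing columns: ",
         paste(missing_cols, collapse = ", "))
  }
  if (any(x$cohort_end_date < x$cohort_start_date)) {
    stop("cohort_end_date must not precede cohort_start_date")
  }
  class(x) <- c("toy_cohort_table", class(x))
  x
}

# A downstream "analytic package" trusts the class instead of re-validating.
count_subjects <- function(cohort) {
  stopifnot(inherits(cohort, "toy_cohort_table"))
  length(unique(cohort$subject_id))
}

cohort <- new_cohort_table(data.frame(
  cohort_definition_id = 1L,
  subject_id = c(1L, 2L, 2L),
  cohort_start_date = as.Date(c("2020-01-01", "2020-02-01", "2021-02-01")),
  cohort_end_date = as.Date(c("2020-06-30", "2020-03-01", "2021-03-01"))
))
n_subjects <- count_subjects(cohort)
print(n_subjects)
```

Validation lives in one place, so tightening a rule in the constructor reaches every function that accepts the class.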

https://ohdsi.github.io/Tidy-R-programming-with-OMOP/

The data behind the platform · claims, EHR, registry

OMOP is the lingua franca. One cohort definition runs against all sources — but the pitfalls differ.

Source	Strengths	Limitations	Darwin EU examples
Claims	Large scale, good medication utilization, population-wide coverage	Often lacks clinical depth for some studies	INGEF (Germany), PharMetrics (US)
Primary Care EHR	Rich primary care longitudinal data, labs, notes, good for chronic conditions, population coverage	Hospital & acute care missing	CPRD GOLD (UK), IPCI (NL)
Hospital EHR	Rich inpatient & specialty care representation, labs, notes, good for acute care	Limited out-of-hospital and follow-up data	CDW Bordeaux (FR), FinOMOP-HUS (Finland)
Registry/Biobanks	Clinical depth, validated outcomes	Narrow scope, smaller N, sometimes missing history	NCR (NL), Harmony (EU), Estonian Biobank

The platform’s job is to make the same analytic code valid across all types of RWD.

Building a cohort · new users of metformin with T2DM

library(CDMConnector)
library(CodelistGenerator)
library(CohortConstructor)

con <- DBI::dbConnect(duckdb::duckdb(), eunomiaDir("synpuf-110k"))
cdm <- cdmFromCon(con, "main", "main")

# 1. Codelists — concept sets with vocabulary descendants
metformin <- getDrugIngredientCodes(cdm, name = "metformin")
# search codes
t2dm <- getCandidateCodes(cdm, keywords = "type 2 diabetes", domains = "Condition", includeDescendants = TRUE)
t2dmCodelist <- asCodelist(t2dm)

# 2. Cohort pipeline — composable, inspectable, backend-agnostic
cdm$metformin_new_users <- cdm |>
  conceptCohort(conceptSet = metformin, name = "metformin_new_users") |>
  requireIsFirstEntry() |>
  requirePriorObservation(minPriorObservation = 365) |>
  requireDemographics(ageRange = list(c(18, 100))) |>
  requireConceptIntersect(conceptSet = t2dmCodelist, window = c(-Inf, 0))

cdm$metformin_new_users
#> # Source:   table<metformin_new_users> [?? x 4]
#> # Database: DuckDB 1.5.2 [root@Darwin 25.4.0:R 4.5.1//private/var/folders/2j/8z0yfn1j69q8sxjc7vj9yhz40000gp/T/RtmpzH6MGs/file11acd7dca52d9.duckdb]
#>    cohort_definition_id subject_id cohort_start_date cohort_end_date
#>                   <int>      <int> <date>            <date>         
#>  1                    1     112989 2010-01-13        2010-04-12     
#>  2                    1      81257 2009-05-07        2009-06-05     
#>  3                    1     112807 2009-01-22        2009-03-02     
#>  4                    1      41720 2009-06-20        2009-07-19     
#>  5                    1      83797 2009-04-16        2009-05-15     
#>  6                    1      63760 2009-06-10        2009-07-09     
#>  7                    1      11203 2009-06-25        2009-07-24     
#>  8                    1      13093 2009-06-30        2009-07-29     
#>  9                    1      68137 2009-07-07        2009-08-05     
#> 10                    1      78858 2009-05-11        2009-06-09     
#> # ℹ more rows

# 3. Inspect attrition at every step — the audit trail clinicians review
attrition(cdm$metformin_new_users)
#> # A tibble: 11 × 7
#>    cohort_definition_id number_records number_subjects reason_id reason         
#>                   <int>          <int>           <int>     <int> <chr>          
#>  1                    1          86263           37781         1 Initial qualif…
#>  2                    1          86263           37781         2 Record in obse…
#>  3                    1          86263           37781         3 Not missing re…
#>  4                    1          78358           37781         4 Merge overlapp…
#>  5                    1          37781           37781         5 Restricted to …
#>  6                    1          15495           15495         6 Prior observat…
#>  7                    1          15481           15481         7 Age requiremen…
#>  8                    1          15481           15481         8 Sex requiremen…
#>  9                    1          15481           15481         9 Prior observat…
#> 10                    1          15481           15481        10 Future observa…
#> 11                    1          11320           11320        11 Concept candid…
#> # ℹ 2 more variables: excluded_records <int>, excluded_subjects <int>

Same pipeline runs on DuckDB, Snowflake, Postgres, BigQuery, Redshift, SQL Server, Databricks — no rewrites.
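
The attrition table above can be thought of as a log that each requirement step appends to. The base-R sketch below shows that idea on a four-row toy cohort; `apply_requirement()` is a hypothetical helper written for this example and is not how CohortConstructor is implemented.

```r
# Hypothetical helper: apply a row filter and append one row to an
# attrition log stored as an attribute on the cohort.
apply_requirement <- function(cohort, keep, reason) {
  log <- attr(cohort, "attrition")          # read the log before subsetting
  out <- cohort[keep, , drop = FALSE]
  attr(out, "attrition") <- rbind(log, data.frame(
    reason = reason,
    number_records = nrow(out),
    number_subjects = length(unique(out$subject_id)),
    excluded_records = nrow(cohort) - nrow(out)
  ))
  out
}

cohort <- data.frame(
  subject_id = c(1L, 1L, 2L, 3L),
  cohort_start_date = as.Date(c("2020-01-10", "2020-05-01",
                                "2019-03-15", "2021-07-01")),
  age = c(45L, 45L, 16L, 70L)
)

cohort <- apply_requirement(cohort, rep(TRUE, nrow(cohort)),
                            "Initial qualifying events")

# Keep each subject's earliest record only.
start_num <- as.numeric(cohort$cohort_start_date)
is_first <- start_num == ave(start_num, cohort$subject_id, FUN = min)
cohort <- apply_requirement(cohort, is_first, "Restricted to first entry")

cohort <- apply_requirement(cohort, cohort$age >= 18, "Age requirement: 18 and over")

attrition_log <- attr(cohort, "attrition")
print(attrition_log)
```

Each row records what remained and what was dropped at that step, which is the shape a reviewer needs to audit a cohort definition.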

Cohort Evaluation

The validation loop that makes a cohort believable

Build the cohort in Atlas (UI) or R code
Run diagnostics package
Review with Principal Investigator (PI)
Iterate on codelists and cohort logic until the PI signs off

Things we look for when performing cohort evaluation

Index-date misclassification: e.g., treatment before index
Specificity: do characterizations look reasonable for this patient population?
Missing codes: are there codes in the data we should be including?
Plausibility: incidence rates, person counts, data-date cutoffs
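
As an example of the first check, here is a base-R sketch that flags subjects in a "new user" cohort who have an exposure recorded before their index date. The two toy tables are hypothetical; on a real CDM the same join would run in SQL against the cohort and `drug_exposure` tables.

```r
# Toy cohort and exposure records (hypothetical data).
cohort <- data.frame(
  subject_id = c(1L, 2L, 3L),
  cohort_start_date = as.Date(c("2020-01-10", "2019-03-15", "2021-07-01"))
)
drug_exposure <- data.frame(
  person_id = c(1L, 2L, 2L, 3L),
  drug_exposure_start_date = as.Date(c("2020-01-10", "2018-11-02",
                                       "2019-03-15", "2021-08-01"))
)

# Join exposures to the cohort and measure days between exposure and index.
joined <- merge(cohort, drug_exposure,
                by.x = "subject_id", by.y = "person_id")
joined$days_before_index <- as.integer(
  joined$cohort_start_date - joined$drug_exposure_start_date
)

# Any exposure strictly before index suggests the index date is misplaced.
flagged <- unique(joined$subject_id[joined$days_before_index > 0])
print(flagged)
```

Subject 2 is flagged because an exposure precedes the cohort start date, which is the kind of record to raise with the PI.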

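library(PatientProfiles)

# Add rich context to every person in the cohort; each call adds columns to the cohort table (computed in SQL).
# Assumes cdm$comorbidities is an existing cohort table and hba1c_codes is a vector of HbA1c concept IDs.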
profiles <- cdm$metformin_new_users |>
  addDemographics() |>
  addCohortIntersectFlag(targetCohortTable = "comorbidities", window = c(-Inf, 0)) |>
  addConceptIntersectCount(conceptSet = list(hba1c = hba1c_codes), window = c(-365, 0))

From cohort to evidence · one definition, many studies

        
                        cdm$metformin_new_users  ─┐
                                                  │
    ┌─────────────────────┬───────────────────────┼─────────────────┬─────────────────┐
    ▼                     ▼                       ▼                 ▼                 ▼
CohortCharacteristics     IncidencePrevalence     DrugUtilisation   CohortSurvival    TreatmentPatterns
(table 1, comorbidities)  (background rates)      (adherence)       (time-to-event)   (sequences of events)

One cohort, many downstream questions — characterisation, drug utilisation, comparative effectiveness, safety signals
Every downstream package consumes the same cohort_table contract from omopgenerics
A bug fix in cohort logic propagates to every study

The compounding effect: we significantly lowered the marginal cost per study.

How we found allies and achieved consensus

This wasn’t a top-down process. It was a bottom-up coalition:

Found early allies and adopters at Oxford
Gave multiple presentations showcasing the benefits of the cdm_reference ORM and “functional” SQL
Co-authored Software Requirements Specifications and quickly built prototypes
Early adopters became contributors and evangelists

https://cran.r-project.org/web/packages/CDMConnector/index.html

Lesson: A functional, cross-platform query language provided far more flexibility and speed than the existing OHDSI tools.

CDMConnector now has >20 reverse dependency packages and is used in all Darwin-EU studies.

Dev practices that kept us honest

Cadence

Weekly development meetings to sync stakeholders and highlight issues
Extensive use of GitHub for observability, collaboration, issue tracking, and CI pipelines
Monthly release cadence with predictable timelines
Public roadmap on GitHub Projects

Quality bars

Test coverage: ≥80% line coverage on synthetic CDM
Every release: CI against SQL backends (DuckDB, Postgres, Snowflake, Spark)
Evolution: clear function lifecycle communication and deprecation warnings
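
A sketch of what a deprecation warning can look like, using only base R's `.Deprecated()`. The function names `count_persons()` and `countPersons()` are hypothetical; packages in practice may use the lifecycle package instead, which is not shown here.

```r
# New, preferred name (hypothetical function used only for illustration).
countPersons <- function(person) {
  length(unique(person$person_id))
}

# Old name kept for a deprecation period: it still works but warns.
count_persons <- function(person) {
  .Deprecated(new = "countPersons", old = "count_persons")
  countPersons(person)
}

person <- data.frame(person_id = c(1L, 2L, 2L))

# Call the old name, surface the warning text, and keep the result.
n <- withCallingHandlers(
  count_persons(person),
  warning = function(w) {
    message("Caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(n)
```

Study code written against the old name keeps running for a release or two while the warning tells authors what to switch to.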

DARWIN package CI dashboard

Supporting Darwin-EU data scientists

Darwin-EU studies are delivered as R packages.

My role as platform lead is to:

Work alongside data scientists by implementing studies myself (to understand the real issues)
Develop tools my team relies on:
- CDMConnector (SQL translation)
- Posit Connect on Azure (app deployment platform)
- Arachne Execution Engine
Train new data scientists
Develop new tools as needs arise

Containerized study execution · Arachne

Study packages ship with

Dockerfile
renv.lock (R dependency file)
GitHub Actions workflow to build the study Docker image and push to Azure Container Registry
A Shiny app for reviewing results

Minimal example: https://github.com/darwin-eu-dev/ExampleStudy/

Arachne loads these studies from the registry and facilitates containerized execution and results review.

Arachne · study repository

Arachne · study editor and run log

AI layer · three places it earned its place

1. Test patient generation

LLM-driven generator that produces OMOP-CDM patients matching a clinical narrative. Replaces hand-written fixtures for test cases.

2. Hecate vocabulary search

Vector embeddings over OMOP concepts; semantic search that finds “T2DM” → 201826 even when the analyst types “adult-onset diabetes.” Plays nicely as an MCP server and supports multilingual vocabulary search.

https://hecate.pantheon-hds.com/

https://hecate.pantheon-hds.com/openapi/
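
The core of embedding-based search is a nearest-neighbour lookup by cosine similarity. The base-R sketch below uses made-up three-dimensional vectors and a made-up query vector, so it shows the ranking step only and is not how Hecate is implemented. Concept IDs other than 201826 are included for illustration and should be checked against the vocabulary.

```r
# Toy concept table; embeddings are hypothetical 3-d vectors. A real system
# would obtain much longer vectors from an embedding model.
concepts <- data.frame(
  concept_id = c(201826L, 201254L, 316866L),
  concept_name = c("Type 2 diabetes mellitus",
                   "Type 1 diabetes mellitus",
                   "Hypertensive disorder")
)
embeddings <- rbind(
  c(0.9, 0.1, 0.0),
  c(0.6, 0.7, 0.1),
  c(0.0, 0.1, 0.9)
)

# Rank concepts by cosine similarity to the query vector and keep the top k.
cosine_search <- function(query_vec, embeddings, concepts, k = 2) {
  norms <- sqrt(rowSums(embeddings^2))
  scores <- as.vector(embeddings %*% query_vec) /
    (norms * sqrt(sum(query_vec^2)))
  top <- order(scores, decreasing = TRUE)[seq_len(min(k, length(scores)))]
  cbind(concepts[top, ], score = round(scores[top], 3))
}

# Pretend embedding of the query text "adult-onset diabetes".
query_vec <- c(0.8, 0.2, 0.1)
hits <- cosine_search(query_vec, embeddings, concepts)
print(hits)
```

The query never contains the string "type 2", yet the closest vector is still the T2DM concept, which is the behaviour the slide describes.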

3. AI assistant

Chatbot with access to omopverse package documentation, CDM metadata, the Darwin network metadata, and the OMOP Vocabulary. It can automatically build concept sets and look up record counts across the network.

AI test-patient generator · narrative to OMOP

AI test-patient generator · timeline view

Darwin Study Design Assistant

Impact

Darwin EU has been a major success: to my knowledge, no other network in the US or elsewhere matches it for near-real-time regulatory decision-making with scientific rigor.

Metric	Value
Studies delivered to EMA	110 studies
Time to run a new “off the shelf” study	~ 3 months
Capacity of studies completed per year	40+
Patients covered	248,621,000

These numbers illustrate the platform’s compounding effect, which should continue to grow over time.

Highlighted DARWIN EU studies

What I learned

Silos are dangerous — cross-team open communication is key
Empower your team to build what they know they need
Aggressively remove unnecessary steps — adapt as needs change
There is no substitute for spending time together in person

What I’d carry to Formation Bio

Human-centered design

Spend time with consumers to truly understand needs
Iterate quickly and test solutions
Focus on getting interfaces right first, then optimize implementation
Technology and data access as a tool for impact, not the end goal

“The impediment to action advances action.
What stands in the way becomes the way.”

— Marcus Aurelius, Meditations

Thanks for listening. Let’s build the future together.
