In the pharmaceutical industry, the process of bringing experimental clinical products to market requires the submission of electronic data, computer programs, and relevant documentation to health authority agencies worldwide. Traditionally, these submissions have been based on the SAS language. However, there is a growing trend towards utilizing open-source programming languages, particularly R, in clinical trial submissions. This blog post explores the advantages of moving into open-source R clinical trial submissions and why it is a positive development for the pharmaceutical industry.
Embracing Open Source in Pharma: In recent years, the adoption of open-source languages such as R has grown in both pharmaceutical companies and research institutions. Health authorities now accept submissions based on open-source programming languages. However, sponsors have been hesitant to utilize open-source languages due to a lack of working examples. The future of pharma lies in open source, as it has the potential to transform research and reporting processes.
Driving Clinical Trials Forward: Open source is driving innovation and accelerating the pace of clinical trials. It enables faster go/no-go decision-making by making innovation the standard at every checkpoint. Through the use of Shiny applications and real-time dashboards, clinical teams can analyze trial data interactively and monitor trial status. Leveraging open-source packages designed specifically for pharma streamlines clinical workflows, ultimately leading to faster identification of leading molecules.
Efficient Data Manipulation: Clinical trials generate numerous outputs, and open-source tools can automate and streamline the process of manipulating data. Collaboration between pharmaceutical companies has resulted in the development of packages that facilitate the automation of tasks. Open source allows for quicker generation of SDTM/ADaM datasets, and it also enhances validation and quality control processes.
Confident Submissions to Regulatory Agencies: Open source can and should be used in regulatory submissions. Pharmaceutical companies are collaborating with regulatory agencies like the FDA, ensuring guidance and feedback on the use of open-source tools. Open source provides transparency and reproducibility, which are crucial for upholding the integrity of clinical trial results. By embracing open source, companies can confidently submit their findings to regulatory bodies.
Collaboration and Innovation: The adoption of open-source tools in the pharmaceutical industry has fostered cross-collaboration and sharing of knowledge among organizations. Previously, competition and closed software hindered collaboration. However, the open-source philosophy has inspired companies to partner with others facing similar challenges. By working collaboratively, the pace of innovation in open-source software surpasses that of proprietary software, enabling faster development, bug fixing, and feature enhancements.
Open-source R clinical trial submissions offer numerous benefits for the pharmaceutical industry. From faster decision-making to efficient data manipulation and confident regulatory submissions, open-source tools are revolutionizing drug development. The collaboration and innovation fostered by open source are creating a positive impact on the industry as a whole. Embracing open-source R clinical trial submissions is a step towards a more transparent, efficient, and transformative future in pharmaceutical research and development.
Recap of the Clinical Trial Phases
Preclinical Trials: Before human testing, drugs are tested on animals (e.g., rodents or non-human primates).
Phase 1: First administration in humans, usually in a small group of healthy volunteers, focusing on safety, tolerability, and dosing.
Phase 2: A larger group of patients with the target condition, assessing efficacy and side effects.
Phase 3: Large-scale trials in patients confirming efficacy and monitoring adverse reactions, typically against a placebo or standard of care.
Study Protocol
The study protocol serves as the blueprint of a clinical trial and should be detailed enough for replication by other investigators. It typically includes:
Objectives: What the trial aims to test (e.g., efficacy/safety of a drug).
Background: Brief information on the medical condition and previous use of the drug.
Methods:
The Estimand Framework
This framework systematically defines the treatment effect being investigated and includes five components:
Treatment: Description of the intervention and control groups (e.g., placebo vs. 25 mg of Courseramab every two weeks, both self-administered).
Population: Defined by inclusion and exclusion criteria.
Variable of Interest: The key outcome measure influenced by the treatment (e.g., tumor volume in cancer studies).
Population-Level Summary: How the results are summarized/computed across the population (e.g., average reduction in tumor volume).
Handling of Intercurrent Events: Strategy for managing unexpected events that occur during the trial, such as a patient dropping out.
Statistical Analysis Plan (SAP)
The SAP is a detailed guide prepared by statisticians, defining how the data will be analyzed.
It includes:
Timing: Written after study initiation, but before any data is accessed.
Purpose:
The course section introduces how data is collected during clinical trials, distinguishing between two main types:
Clinical trial data collection involves multiple stakeholders:
The sponsor is responsible for initiating, managing, or funding the trial.
CRF data is collected at clinical sites and is managed through structured forms:
Sponsors do not have direct access to patient medical records, so they design CRFs for site staff to complete.
CRFs collect essential information, including:
CRFs often take the form of questionnaires:
Some questions have predefined answer options (e.g., “Yes”/“No”)
Other questions allow for free-text entries (e.g., type of vaccine received).
All entries are mapped to specific variable names, which support further processing in data models.
Non-CRF data is collected outside the CRF process and typically involves third-party vendors:
Examples of non-CRF sources:
Examples of non-CRF data:
These data are:
This standardization helps ensure that results are consistent across all sites and regions.
In this section, the course discusses standard data formats used in clinical trials and why they are beneficial.
The Clinical Data Interchange Standards Consortium (CDISC) sets globally accepted data standards for the pharmaceutical research industry.
Two key data models defined by CDISC are commonly encountered:
Study Data Tabulation Model (SDTM)
Analysis Data Model (ADaM)
Standardization benefits both pharmaceutical companies and regulatory authorities:
For companies:
For regulators:
Data traceability is another key benefit:
Adherence to data standards also signals a commitment to data quality:
In summary, data standards are a win-win:
In this section, the course explains what the FDA submission package is, what it contains, and how to ensure the submission process goes smoothly.
There are several types of FDA submissions, including:
To bring a drug to market in the United States, it must go through the US Food and Drug Administration (FDA).
For new drugs, the main route is through a New Drug Application (NDA).
The submission must demonstrate that the drug is:
The key objectives of a submission package are to demonstrate that the drug meets regulatory standards in terms of:
The submission package is structured around the electronic Common Technical Document (eCTD), a standardized format used globally.
The Electronic Common Technical Document (eCTD) is organized into five modules. The modules are grouped by content type rather than by individual clinical studies, so no module corresponds to a single Phase III study.
In detail:
Module 1 – Administrative Information
Module 2 – Summaries
Module 3 – Quality
Module 4 – Non-Clinical Study Reports
Module 5 – Clinical Study Reports
Module 1.5 – Region-Specific Documents
Module 1.6 – Residual Information
Each module is organized according to international regulatory guidelines, making review easier for agencies like the FDA.
As a data scientist or statistical programmer, involvement is primarily focused on Module 5, which includes the clinical data package:
To ensure a successful submission, five key strategies are emphasized:
Thorough Quality Control
Teamwork
Engage with the FDA
Clarity and Structure
Be Proactive
The SDTM (Study Data Tabulation Model) plays a central role in organizing clinical trial data within the broader clinical data pipeline.
The data flow begins with raw data collection, which comes from:
Once collected, raw data is transformed into SDTM datasets for each domain (e.g., demographics, adverse events, labs):
SDTM datasets are then transformed into ADaM datasets (Analysis Data Model):
After generating outputs, results are shared with stakeholders and regulatory authorities.
Why SDTM is important:
CDISC (Clinical Data Interchange Standards Consortium) maintains SDTM and other standards:
CDISC is a global nonprofit that defines and maintains:
These standards provide a unified framework for organizing and documenting clinical trial data.
The SDTMIG (SDTM Implementation Guide):
Created by CDISC, it provides in-depth guidance for implementing SDTM:
How CRF and non-CRF data are mapped into SDTM:
Data Availability:
Preprocessing Steps:
Data Extraction:
Data Cleaning and Transformation:
Data Mapping:
Data Validation:
Metadata Documentation:
SDTM (Study Data Tabulation Model) is a standardized framework that structures all data collected during clinical trials into organized datasets for regulatory submission and analysis.
What SDTM contains:
Examples of SDTM domains:
Mapping from CRF to SDTM:
Example 1: Adverse Events (AE Domain)
CRF fields and corresponding SDTM variables:
AE Term → AETERM
Start Date/Time → AESTDTC
End Date/Time → AEENDTC
Severity → AESEV
These are mapped directly into the AE dataset.
Each row represents one adverse event per subject, allowing multiple observations for the same patient.
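As an illustration of this mapping, here is a minimal dplyr sketch. The raw CRF field names (SUBJECT, AE_TERM, START_DATE, and so on) and the study identifier are invented for the example; real mappings are driven by the study's annotated CRF and the SDTMIG.

```r
library(dplyr)

# Hypothetical raw CRF extract; field names are invented for this sketch
raw_ae <- tibble::tribble(
  ~SUBJECT, ~AE_TERM,   ~START_DATE,  ~END_DATE,    ~SEVERITY,
  "001",    "Headache", "2023-04-02", "2023-04-04", "MILD",
  "001",    "Nausea",   "2023-05-10", NA,           "MODERATE"
)

# Map CRF fields to SDTM AE variables: one row per adverse event per subject
ae <- raw_ae %>%
  transmute(
    STUDYID = "STUDY01",                            # assumed study identifier
    DOMAIN  = "AE",
    USUBJID = paste("STUDY01", SUBJECT, sep = "-"),
    AETERM  = AE_TERM,
    AESTDTC = START_DATE,                           # ISO 8601 start date/time
    AEENDTC = END_DATE,                             # ISO 8601 end date/time
    AESEV   = SEVERITY
  )
```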
Example 2: Visit Date Across Domains
Example 3: Vital Signs – Pulse Data (VS Domain)
Data is stored in long format – multiple observations per patient.
Key SDTM variables:
VSTESTCD = PULSE – test code identifier
VSORRES – original result (e.g., 75)
VSORRESU – result unit (e.g., bpm)
To extract pulse values, one would filter by VSTESTCD = "PULSE".
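A small dplyr sketch of that filter, on invented values:

```r
library(dplyr)

# Long-format vital signs: multiple observations per subject (made-up values)
vs <- tibble::tribble(
  ~USUBJID, ~VSTESTCD, ~VSORRES, ~VSORRESU,
  "01-001", "PULSE",   75,       "bpm",
  "01-001", "SYSBP",   121,      "mmHg",
  "01-002", "PULSE",   68,       "bpm"
)

# Keep only the pulse records
pulse <- vs %>% filter(VSTESTCD == "PULSE")
```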
Annotated CRFs (ACRFs)
Non-CRF Data Examples
Questionnaire Data (e.g., EQ-5D):
Often collected via apps or vendors.
Each question appears as a row in the QS domain (Questionnaires).
Key variables:
QSCAT – questionnaire category (e.g., EQ-5D)
QSTESTCD – short test/item code (e.g., Q1, Q2)
QSTEST – description of the question
QSORRES – original response (e.g., score or text)
QSSTRESC – standardized response (e.g., numerical transformation)
Imaging Data (mapped to IM domain):
Laboratory Results:
Notes on Non-CRF Data Mapping
Source datasets often use non-standard variable names, defined by vendor agreements.
Before mapping, ensure consistency in structure and terminology.
Mapping ensures:
For example, EQ-5D source data:
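The source data itself is not reproduced in these notes, so the sketch below invents a small vendor-style extract (the column names SUBJECT, ITEM_CODE, ITEM_TEXT, and RESPONSE are assumptions) and maps it to the QS variables listed above.

```r
library(dplyr)

# Hypothetical vendor extract for EQ-5D; column names are invented
eq5d_raw <- tibble::tribble(
  ~SUBJECT, ~ITEM_CODE, ~ITEM_TEXT,  ~RESPONSE,
  "01-001", "Q1",       "Mobility",  "2",
  "01-001", "Q2",       "Self-care", "1"
)

# Map vendor fields to SDTM QS variables
qs <- eq5d_raw %>%
  transmute(
    USUBJID  = SUBJECT,
    QSCAT    = "EQ-5D",     # questionnaire category
    QSTESTCD = ITEM_CODE,   # short item code
    QSTEST   = ITEM_TEXT,   # question description
    QSORRES  = RESPONSE,    # original response
    QSSTRESC = RESPONSE     # standardized response (identical here)
  )
```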
This session provides important context about how SDTM programming is currently done in the industry and how it is starting to change.
Current industry standard: SAS
SAS is the dominant tool used throughout the clinical data science pipeline:
It is a powerful and specialized commercial tool, accepted widely by regulatory agencies.
Companies invest significant resources into customizing SAS to fit their internal workflows.
The separation between SAS developers and SAS users can make the system feel like a black box to end-users.
A major limitation is that SAS requires a license, creating barriers for:
Challenges in open-source adoption:
There is no widely adopted open-source solution yet for SDTM mapping.
Why?
Why we need open-source solutions:
Many companies face the same challenges in mapping to SDTM.
However, each builds its own proprietary tools in isolation.
Open-source collaboration would allow for:
In both SAS and R, the core task is writing code that maps raw/source data to SDTM structure.
By applying modern software development practices, such as:
Introducing the sdtm.oak R package:
sdtm.oak is an open-source R package for SDTM creation.
Its main features:
This package aims to enable the pharmaceutical programming community to build SDTM datasets collaboratively in R.
Though still under development, you can explore the project on its GitHub page (linked in the course’s further reading).
In the pharmaceutical industry, CDISC provides standardized data formats to ensure a common structure across companies and studies. These standards improve efficiency, data review, and regulatory compliance.
The CDISC model for analysis datasets is called ADaM (Analysis Data Model).
ADaM includes three key dataset structures, each serving a different purpose in statistical analysis:
1. ADSL – Subject-Level Analysis Dataset
Contains one record per subject.
Includes:
Used to support:
ADSL is the first ADaM dataset created, and is often merged into other ADaM datasets to add subject-level context.
Example structure:
USUBJID | AGE | SEX | TRTSDT | TRTEDT | ITTFL | SAFFL
2. OCCDS – Occurrence Data Structure
Contains one record per event (not per subject).
Used for:
OCCDS datasets are event-based and, unlike BDS, do not use analysis parameters.
Derived variables may include:
Example: ADAE dataset
USUBJID | AEDECOD | ASTDT | AENDT | TRTSDT | TRTEMFL | AGE | SEX
A subject may have multiple adverse events, each with its own record.
Treatment-emergent events are identified when ASTDT ≥ TRTSDT.
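A minimal dplyr sketch of that rule, on made-up dates; production ADaM code would typically use dedicated derivation functions from open-source packages such as admiral (introduced later in these notes), but the logic is the same.

```r
library(dplyr)

# Made-up records with AE analysis start date (ASTDT) and treatment start (TRTSDT)
adae <- tibble::tribble(
  ~USUBJID, ~AEDECOD,   ~ASTDT,       ~TRTSDT,
  "01-001", "HEADACHE", "2023-04-02", "2023-03-15",
  "01-001", "NAUSEA",   "2023-03-01", "2023-03-15"
) %>%
  mutate(across(c(ASTDT, TRTSDT), as.Date))

# Flag treatment-emergent adverse events: start on or after treatment start
adae <- adae %>%
  mutate(TRTEMFL = if_else(ASTDT >= TRTSDT, "Y", NA_character_))
```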
3. BDS – Basic Data Structure
Contains one or more records per subject per parameter per time point.
Known for its vertical layout (long format).
Used for:
Key BDS variables:
PARAM – Analysis parameter (e.g., WEIGHT, HEIGHT)
AVAL – Analysis value
AVISIT – Visit name
ADT – Date
BASE, CHG, ABLFL – Derived analysis variables
DTYPE – Derivation type
USUBJID, AGE, SEX – Pulled in from ADSL
BDS datasets support:
Example structure:
USUBJID | PARAM | AVISIT | AVAL | BASE | CHG | ABLFL | DTYPE
BMI is a derived parameter, flagged with DTYPE to indicate it is derived.
Baseline and changes are indicated using BASE, CHG, and ABLFL.
Example: If baseline weight = 65 kg, and at Visit 1 it is 62 kg, then:
CHG = -3
ABLFL = Y for the baseline visit row (a code sketch of this derivation follows the summary below)
In summary:
ADaM datasets support clear, structured, and regulatory-compliant clinical trial analysis.
Each dataset type has a specific role:
ADaM builds upon SDTM to support traceability, reproducibility, and efficient reporting.
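To tie the BDS pieces together, here is a small dplyr sketch of the weight example above (baseline 65 kg, 62 kg at Visit 1, giving CHG = -3). The data and grouping are invented for illustration; production ADaM code would normally use standard derivation tooling.

```r
library(dplyr)

# BDS-style records for one subject and one parameter (made-up data)
advs <- tibble::tribble(
  ~USUBJID, ~PARAM,        ~AVISIT,    ~AVAL,
  "01-001", "Weight (kg)", "Baseline", 65,
  "01-001", "Weight (kg)", "Visit 1",  62
)

advs <- advs %>%
  group_by(USUBJID, PARAM) %>%
  mutate(
    ABLFL = if_else(AVISIT == "Baseline", "Y", NA_character_),  # baseline flag
    BASE  = AVAL[AVISIT == "Baseline"][1],                      # baseline value
    CHG   = AVAL - BASE                                         # change from baseline
  ) %>%
  ungroup()
```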
In this lesson, we explore Admiral, an open-source R package designed to streamline the creation of ADaM datasets in the pharmaceutical industry. Admiral is part of a larger suite of tools called pharmaverse, aimed at enabling end-to-end clinical reporting using R.
Background and Motivation for Admiral
The pharmaceutical industry has long lacked a unified, collaborative approach to ADaM programming.
Historically, each company built its own internal workflows using SAS, leading to:
The increasing shift toward open-source languages like R and Python provides an opportunity to break these silos and enable collaborative, reusable solutions.
Admiral Overview
Admiral = ADaM in R Asset Library
Initial collaboration between Roche and GSK (2021), now expanded to partners like:
Goal: Harmonize and modernize ADaM development across companies using open, modular, and reusable R code
Key Benefits of Admiral
Open Source
Modularized Toolbox
Community-Driven Development
Function Pipelining
Executable and Editable Workflows
Admiral as Part of Pharmaverse
Admiral is just one part of pharmaverse — a growing collection of curated open-source R packages for clinical reporting
Pharmaverse supports the entire data lifecycle in five stages:
Supporting Packages in Pharmaverse for ADaM
metacore
Creates a central R object to store metadata
Supports tasks like:
metatools
Extends metacore’s capabilities to:
Key functions:
xportr
Used for regulatory submission
Converts R data frames to SAS v5 XPT format accepted by authorities (e.g., FDA)
Capabilities: export datasets to the .xpt format
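A minimal, hedged example of that step; the dataset here is a made-up stand-in for a finalized ADaM data frame, and the xportr API should be checked against the package documentation for your version.

```r
library(xportr)

# Made-up stand-in for a finalized ADaM dataset
adsl <- data.frame(
  USUBJID = c("01-001", "01-002"),
  AGE     = c(54, 61)
)

# Write a SAS v5 transport (.xpt) file for submission
xportr_write(adsl, path = "adsl.xpt")
```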
Summary
Admiral represents a shift toward collaborative, open, and modern programming in clinical research.
Through Admiral and the broader pharmaverse, companies can:
The metacore package is part of the open-source pharmaverse project and was designed to simplify and standardize metadata handling in clinical reporting.
Worked examples are available in the metacore GitHub repository.
Specification formats vary between companies, and metacore is designed to accommodate them.
Step 1: Understand the metacore schema
The metacore object includes seven core metadata tables:
ds_spec (dataset-level info)
ds_vars (dataset-variable relationships)
var_spec (variable metadata)
value_spec (value-level metadata)
derivations (derivation rules)
codelist (controlled terminology)
supp_info (optional supplementary data)
The schema diagram uses white lines to show how tables are linked:
ds_spec → ds_vars via dataset
ds_vars → var_spec via variable
Step 2: Determine the specification type
Use spec_type() to identify how the Excel file is structured.
In this example the type is "by_type", meaning each Excel tab corresponds to a different metadata table.
Step 3: Read all specification sheets
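A sketch of Steps 2 and 3, assuming a hypothetical spec file path; the function names follow the lesson and the metacore vignette, and exact signatures may differ between package versions.

```r
library(metacore)

spec_path <- "specs/adam_specs.xlsx"  # hypothetical specification workbook

# Step 2: check how the workbook is organised (here it reports "by_type")
spec_type(spec_path)

# Step 3: read every tab of the workbook into a named list of data frames
doc <- read_all_sheets(spec_path)
```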
The resulting object (doc) is a named list of data frames, one for each tab.
Step 4: Create individual metadata tables
ds_spec – Dataset Specifications
Columns: dataset, structure, label
Use spec_type_to_ds_spec() with the cols argument to manually specify the correct columns due to naming conflicts (e.g., Label vs Description)
ds_vars – Dataset Variables
Columns: dataset, variable, key_seq, order, keep, core, supp_flag
Info is split across the Variables tab and the Domains tab
Use spec_type_to_ds_vars() and specify:
cols for variable/order
key_seq_cols for dataset and key variable
sheet to pull from both tabs
var_spec – Variable Specifications
Columns: variable, label, type, length, format, common
Use spec_type_to_var_spec()
Clean the format column using a dplyr filter, since formats end in "." and code lists don't
value_spec – Value-Level Metadata
Use spec_type_to_value_spec()
Set where_sep_sheet = FALSE since the info is not in a separate sheet
derivations – Derivation Logic
Columns: derivation_id, derivation
Use spec_type_to_derivation() and set both comment and predecessor to "Comment" since no separate predecessor column exists
codelist – Controlled Terminology
Only code lists are present (no dictionaries)
Use spec_type_to_codelist() and set dictionary_cols = NULL
Step 5: Build the metacore object
Once all component tables are created, they can be used to construct the metacore object.
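A hedged sketch of this final assembly step, assuming the component tables built in Step 4 are in scope; the argument names follow the table names above and may differ slightly across metacore versions.

```r
library(metacore)

# Combine the component metadata tables into a single metacore object
# (argument names are assumptions based on the table names above)
spec <- metacore(
  ds_spec     = ds_spec,
  ds_vars     = ds_vars,
  var_spec    = var_spec,
  value_spec  = value_spec,
  derivations = derivations,
  codelist    = codelist
)
```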
Console may show warnings, such as:
These can be resolved by adjusting the specs or acknowledged as acceptable depending on use case.
Key Takeaways
The metacore object acts as a central metadata hub in R.
It simplifies:
Creating the object requires:
What Are Standard MedDRA Queries (SMQs) and Custom Queries (CQs)?
In clinical trials, especially when analyzing adverse events (AEs), it is common to group certain events into medically meaningful categories. These groupings are essential for identifying potential safety signals related to the study drug.
Standard MedDRA Queries (SMQs)
Definition: SMQs are predefined, standardized groupings of MedDRA-coded terms developed by the MedDRA Maintenance and Support Services Organization (MSSO).
Purpose: Used to facilitate the retrieval and analysis of safety data and to support pharmacovigilance and clinical review.
Characteristics:
Based on expert consensus, literature review, and validation.
Widely accepted by regulatory agencies (e.g., FDA, EMA).
Each SMQ consists of terms classified as either narrow scope (highly specific to the condition of interest) or broad scope (more sensitive, capturing a wider range of possibly related terms).
Example – SMQ: Lactic Acidosis
Custom Queries (CQs)
Definition: CQs are study-specific groupings of AEs defined by the clinical study team.
Purpose: To address safety concerns or groupings that are not covered by existing SMQs, especially for novel compounds or indications.
Characteristics:
Example – CQ
How Are SMQs and CQs Used in R (Admiral package)?
Use the function derive_var_query_flag() from the admiral or admiraldev package.
This function:
The output includes variables like SMQ01FL or CQ01FL to indicate matches.
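Because the exact admiral call depends on the package version, here is a package-free dplyr illustration of what such a flag means; the CQ term list and the data are invented.

```r
library(dplyr)

# Made-up adverse event records
adae <- tibble::tribble(
  ~USUBJID, ~AEDECOD,
  "01-001", "NAUSEA",
  "01-001", "HEADACHE",
  "01-002", "VOMITING"
)

# Hypothetical custom query definition agreed by the study team
cq01_terms <- c("NAUSEA", "VOMITING", "DIARRHOEA")

# Flag records whose dictionary-derived term falls in the custom query
adae <- adae %>%
  mutate(CQ01FL = if_else(AEDECOD %in% cq01_terms, "Y", NA_character_))
```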
Comparison Table: SMQ vs CQ
Feature | SMQ (Standard MedDRA Query) | CQ (Custom Query) |
---|---|---|
Source | MedDRA / MSSO | Study team / Sponsor defined |
Validation | Yes, industry-standard | No, study-specific |
Reusability | High (cross-study) | Limited to a specific study |
Regulatory Use | Widely accepted | Depends on study justification |
Customizable | No | Yes |
Analyzing data from clinical trials involves multiple forms of presenting data. These fall broadly into two categories:
1. Static outputs: Used in regulatory submissions and clinical study reports (CSRs). These include:
2. Exploratory outputs: Used for internal review and hypothesis generation, often built as interactive web applications using tools like R Shiny. This will be covered in the next module.
Deciding Which TLGs to Create
There is a structured process involving close collaboration between multiple stakeholders, especially with biostatistics, to determine which TLGs are needed for a trial. This process ensures that:
The decision-making is based on two key documents:
Statistical Analysis Plan (SAP): A formal document that outlines:
Output Specifications / List of Planned Outputs
Once the SAP and specifications are finalized, the programming team uses ADaM datasets to produce the outputs. These datasets are designed to contain all required variables and metadata needed for regulatory-grade reporting.
Common TLG Topics
Standard TLGs typically cover core areas such as:
These outputs are consistent across trials and may be published in CSRs or public registries (e.g., ClinicalTrials.gov).
TLG Development Workflow
1. Environment Setup: Ensure access to required R packages and datasets (typically ADaM and metadata).
2. Data Input and Preprocessing
3. Generate Intermediate Dataset: This may involve merging datasets, creating derived variables, or reshaping data.
4. Produce the Output
5. Format the Output
NEST is a collection of open-source R packages designed to enable faster and more efficient insights generation in clinical research, supporting both exploratory and regulatory use cases.
The vision for NEST is to deliver a complete R-based solution for clinical trial reporting and insights generation.
Core Functionality
NEST provides a streamlined ecosystem of interoperable packages that assist in generating regulatory-grade outputs. These packages fall into two main categories:
Foundational Packages for Output Generation:
rtables – the core table-building engine
tern – an extension of rtables that adds statistical analysis functions for clinical reporting.
Supporting Tools:
TLG Catalog (TLGC)
NEST includes a catalog of commonly used TLGs (Tables, Listings, Graphs), known as TLGC:
Its examples are built with rtables, rlistings, and tern (a minimal layout sketch appears at the end of this section).
NEST integrates multiple packages to support:
These tools are freely available as open-source packages on CRAN and can be directly integrated into your R-based clinical reporting pipeline.
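To make the rtables workflow concrete, here is a minimal layout on invented data (the arm names, ages, and format string are made up for illustration); tern adds richer clinical analysis functions on top of the same pattern.

```r
library(rtables)
library(magrittr)

# Invented subject-level data for illustration
adsl <- data.frame(
  USUBJID = sprintf("01-%03d", 1:6),
  ARM     = rep(c("Placebo", "Active"), each = 3),
  AGE     = c(54, 61, 47, 58, 66, 52)
)

# Layout: one column per treatment arm, mean (SD) of age in each column
lyt <- basic_table() %>%
  split_cols_by("ARM") %>%
  analyze("AGE", afun = function(x) {
    in_rows("Mean (SD)" = rcell(c(mean(x), sd(x)), format = "xx.x (xx.x)"))
  })

build_table(lyt, adsl)
```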
What is TEAL?
Why Use TEAL?
Core Components
TEAL App Layout Overview
Key Benefits
Traditional Shiny Apps vs. Teal Apps
Feature | Traditional Shiny App | Teal App |
---|---|---|
Data Handling | Developer manually gathers all input data | Data is loaded into separate “containers” in a modular structure |
Development Style | Centralized, monolithic code (long UI + server files) | Modularized plug-and-play components |
Analogy | Like a cargo truck carrying a single large load | Like a cargo ship with containers (modules) |
Data Flexibility | Often tightly coupled to specific datasets | Data-agnostic: easily switches input data types |
UI and Server Logic | Manually defined in detail by developer | Abstracted by framework – no need to build UI/server from scratch |
Modularity | Low – one large codebase | High – modular components can be added/removed independently |
Team Collaboration | Harder – conflicts more likely | Easier – different team members can work on different modules concurrently |
Teal App Development Workflow
Discuss Analysis Plan: Understand both planned and exploratory analysis needs with stakeholders.
Navigate the Teal Gallery: Browse the Teal module catalog for appropriate analysis modules.
Configure Modules: Use a plug-and-play approach to select and configure modules.
Build the App: Assemble modules without coding the full UI/server logic (see the sketch after this list).
Deploy & Share: Deploy the app for clinical teams, stakeholders, or regulatory review.
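A minimal sketch of a Teal app, based on the pattern in the teal package documentation; the example module and the iris data are placeholders, and the exact API may differ between teal versions.

```r
library(shiny)
library(teal)

# Assemble a Teal app from a data container and a plug-and-play module
app <- init(
  data    = teal_data(iris = iris),
  modules = modules(example_module())
)

# Run the app locally
shinyApp(app$ui, app$server)
```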
Two Roles for Working with Teal
Teal App Developer (ideal for R/Shiny beginners)
Teal Module Developer (for advanced R/Shiny users)
https://posit.co/blog/open-source-in-pharma-from-five-perspectives/
https://posit.co/solutions/pharma/
https://posit.co/blog/the-state-of-pharma/
https://posit.co/blog/celebrating-5-years-of-r-pharma/
https://posit.co/blog/roche-shifting-to-an-open-source-backbone-in-clinical-trials/
Estimands—A Basic Element for Clinical Trials: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8962508/
Find the vaccination history example here: https://www.cdisc.org/kb/examples/vaccination-history-acrf-75268882
CDISC eCRF Portal: https://www.cdisc.org/kb/ecrf
CDISC Guideline SDTM: https://www.cdisc.org/standards/foundational/sdtmig/sdtmig-v3-3/html
CDISC ADaMs page: https://www.cdisc.org/standards/foundational/adam
Sharing Clinical Research Data: https://www.ncbi.nlm.nih.gov/books/NBK137818/