Conducting Reproducible Research using R Packages

April 16, 2019

Overview

About Me
Reproducible Data Science and Software
Example 1: Decentralised Water Quality Monitoring
Example 2: Detecting Pathways that Drive Cancer
Conclusion

About Me

Academic Interests

Research

High-Dimensional Statistics (\(p >> n\))
Computational Statistics
Bayesian Statistics
Spatial Statistics
Data Science
"Big" EHR Data

Curiosities

Pedagogy
Health and Science Ethics
Leadership Development
Health Economics
Infectious Disease Modelling
Apologetics and Patristics
Film Music

Vitae: http://rpubs.com/gabrielodom/curriculum_vitae

GitHub: https://github.com/gabrielodom

Academic Timeline

Teaching

Courses Taught

Computational Statistics
High-Dimensional Statistics
R for Data Science and Development
Special Topics in Research
Statistics for Business I & II
Statistics for Health Sciences

Courses I Can Teach

Bayesian Methods
Bayesian Theory
Spatial Statistics
Multivariate Statistics
Applied Regression
Introduction to Machine Learning

Statistician or Data Scientist?

Six Divisions of "Greater" Data Science (Donoho, 2017):

Data Exploration and Preparation: exploratory data analysis and data cleaning
Data Representation and Transformation: mathematical transformations, querying databases, and data formatting
Computing with Data: programming languages and code packaging
Data Modeling: statistical modelling and machine learning
Data Visualization and Presentation: from plots to interactive websites
Science about Data Science: meta-analysis about the utility of statistical and computational tools

* Traditional "statistical" topics in bold.

My Philosophy

When you use something, leave it better than when you found it.

– My mother

Investing in people and for people drives my work:

My students should be better scientists, better collaborators, and better people after my classes / mentoring
The PIs I collaborate with and their staff should feel that I added lasting value
My research should be easy for other scientists to replicate and build upon

Reproducible Data Science and Software

The Reproducibility Crisis

Published bio-science is largely not reproducible:

Oncology: 53 published articles tested, six successes (11%) (Nature, 2012)
Psychology:
- 100 published articles, 39 successes (Nature News, 2015)
- 71 published articles tested, 92 replication attempts, 35 successes (38%). The PsychFileDrawer project is ongoing.
Pharmacology: 67 published models tested, 14 successes (21%) (Nature Reviews, 2011)

Reproducibility: if we repeat a study, the repetition should agree with the original results, or—at minimum—not refute the study's conclusions.

The Ioannidis crusade

John Ioannidis, a physician-scientist, speaks harshly against the lack of replicability in science.

Reproducible Data Science

Before reproducibility must come preproducibility.

"Instead of arguing about whether results hold up, let’s push to provide enough information for others to repeat the experiments … In computational science, ‘reproducible’ often means that enough information is provided to allow a dedicated reader to repeat the calculations in the paper for herself."

– Philip B. Stark, Professor of Statistics, UC Berkeley

Forensic Bioinformatics

Baggerly and Coombes (2009) founded the field of Forensic Bioinformatics. They wrote this paper while waging war against fabricated data:

Dr. Potti's falsified research had made it to the clinical trial phase before it was stopped.
Dr. Potti was found to have "engaged in research misconduct by including false research data".
He was fired from the Duke University School of Medicine.
Links: Dr. Baggerly's slides on their process; HHS Office of Research Integrity statement on Anil Potti; Fostering Integrity in Research, Appendix D

Potti's mistakes could have been corrected—or avoided entirely—if his team had built a software package to document their data science.

Software Packages

Software Package: a self-contained suite of programs necessary to accomplish a set of related tasks.

Functions and scripts written in one or more programming languages
Example data
Code and data documentation
Metadata about the package development process, team, and timeline
A users' guide interweaving motivation, documentation, code, output, and analysis, known as a vignette

Software packages for a published paper document and organize the code necessary to replicate every aspect of the data analysis shown in the paper.

Why Packages

Software packages:

make your entire analysis reproducible,
make your research process ethical and transparent,
follow the spirit of the Literate Programming principle (Knuth, 1984), and
enable the next generation of scientists to "stand on your shoulders".

Reproducible Data Science is publishing the software package necessary to transmute our raw data into our published results.

Example 1: Monitoring Water Quality with `mvMonitoring`

Motivating Example

Water conservation is a growing concern around the world (and currently in the western U.S.)
Lack of sanitation affects 35% of the world's population (65% of East Asia, 33% South Asia, 31% Sub-Saharan Africa) (CDC, 2016; WHO, 2008):
- Over 1900 children die each day due to sanitation-preventable diarrheal diseases
- "Almost one tenth of the global disease burden could be prevented by improving water supply, sanitation, hygiene and management of water resources"
- Communities without sanitation are less likely to provide education to female students after puberty
Decentralized wastewater reclamation is a modern strategy to provide access to potable water and curb water waste

Decentralized Water Treatment

Most global communities do not have the infrastructure for centralized wastewater treatment or reclamation
Decentralized wastewater treatment processes are cheaper to build and maintain, and require minimal human interaction
Decentralized treatment requires sophisticated automatic process monitoring
These processes:
1. are adaptive (exhibiting change over weeks, months, and years relative to the local population),
2. are dynamic (exhibiting strong temporal dependence),
3. have multiple steady-states, and
4. use multiple redundant and highly-correlated sensors (features)

Example Problem

We care about monitoring the state of the system to ensure that the water quality stays within proper limits and that the system does not collapse.
A sequencing-batch membrane bioreactor is a tightly-controlled hybrid bio-mechanical purification system.
The internal biological ecosystem is very sensitive; if it crashes, the entire treatment facility can be inoperable for months
One such bio-system crash shut down a decentralized plant for four months

Control Charts

Control charts detect when dynamic processes violate normal operating conditions (Shewhart, 1926). Example (toy data):
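A minimal sketch of a classical univariate control chart, assuming the common "mean plus or minus three standard deviations" limits. The function names and the sensor readings are made up for illustration and are not taken from `mvMonitoring`.

```python
from statistics import mean, stdev

def shewhart_limits(training, k=3.0):
    """Return (lower, center, upper) limits estimated from in-control data."""
    center = mean(training)
    spread = stdev(training)
    return center - k * spread, center, center + k * spread

def flag_out_of_control(observations, limits):
    """Return the indices of observations that fall outside the limits."""
    lower, _, upper = limits
    return [i for i, x in enumerate(observations) if x < lower or x > upper]

# Toy data (hypothetical): a stable sensor, followed by one faulty reading.
training = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0, 10.1, 9.9]
new_obs = [10.0, 10.2, 9.8, 12.5, 10.1]

limits = shewhart_limits(training)
alarms = flag_out_of_control(new_obs, limits)
print(limits, alarms)
```

Only the fourth new reading (index 3) falls outside the limits. A chart like this treats one sensor at a time, which is the limitation the multivariate approach addresses.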

We created a new control chart that addresses the four process-monitoring concerns listed previously.

Multi-state, Adaptive, Dynamic PCA

Control charts monitor independent, univariate sensors. We need to monitor many dependent sensors simultaneously. We need PCA (principal components analysis) to:

extract independent composite features (orthogonal linear combinations) from the original sensors, and
explain the most information with fewest features.
Benefits:
- reduced number of charts to monitor
- charts are now independent (under mild assumptions)
- account for correlated sensors.

Multi-state, Adaptive, Dynamic PCA combines the following three algorithmic components: Dynamic PCA, Adaptive PCA, and Multi-state PCA.

PCA Formulation

Let \(\textbf{X} \in \mathbb{R}^{(N - \ell) \times p}\) be the observed process data (including up to \(\ell\) previous time points as predictors).
Let \(\textbf{P}_d \in \mathbb{R}^{p \times d}\), for \(d < p\), be the projection matrix whose columns are the \(d\) eigenvectors corresponding to the largest \(d\) eigenvalues of the sample covariance matrix of \(\textbf{X}\).
The principal components matrix, \(\textbf{Y} = \textbf{X}\textbf{P}_d \in \mathbb{R}^{(N - \ell) \times d}\), is the transformation of the original \(p\) features into the \(d\)-dimensional orthogonal subspace that preserves the most information.
Multi-state, adaptive PCA estimates a different \(\textbf{P}_d\) for each state and updates these estimates regularly.
Instead of monitoring a new observation \(\textbf{x}_{\text{new}}\), we monitor \(\textbf{y}_{\text{new}} = \textbf{x}_{\text{new}}\textbf{P}_d\).
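The projection can be sketched in a few lines. This sketch assumes the data are column-centered before projecting, and it uses power iteration with deflation as a dependency-free stand-in for a library eigendecomposition. The helper names and the toy data are hypothetical and are not the `mvMonitoring` implementation.

```python
import math

def center(X):
    """Subtract each column mean from the rows of X."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def covariance(Xc):
    """Sample covariance matrix (p x p) of column-centered data."""
    n, p = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_eigenpairs(S, d, iters=200):
    """Largest d eigenvalues and eigenvectors of a symmetric matrix S."""
    p = len(S)
    S = [row[:] for row in S]  # work on a copy; deflation modifies it
    values, vectors = [], []
    for _ in range(d):
        v = [float(i + 1) for i in range(p)]  # arbitrary non-zero start
        for _ in range(iters):
            w = [sum(S[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * S[i][j] * v[j] for i in range(p) for j in range(p))
        values.append(lam)
        vectors.append(v)
        for i in range(p):  # deflate: remove the component just found
            for j in range(p):
                S[i][j] -= lam * v[i] * v[j]
    return values, vectors

def project(Xc, vectors):
    """Y = Xc P_d, where the columns of P_d are the eigenvectors."""
    return [[sum(x * vk for x, vk in zip(row, v)) for v in vectors]
            for row in Xc]

# Toy data (hypothetical): two nearly redundant sensors and one flat sensor.
X = [[2.0, 1.9, 0.5], [4.0, 4.1, 0.4], [6.0, 6.2, 0.6],
     [8.0, 7.8, 0.5], [10.0, 10.0, 0.5]]
Xc = center(X)
S = covariance(Xc)
values, vectors = leading_eigenpairs(S, d=1)
Y = project(Xc, vectors)
print(values[0] / sum(S[i][i] for i in range(3)))  # share of variance kept
```

With these toy sensors a single component keeps nearly all of the variance, because the first two columns carry almost the same signal.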

Adaptive and Dynamic PCA

Dynamic PCA

Include any relevant sensor data from previous time points
Benefits: how much water people used at 6PM yesterday is a great predictor for how much they will use at 6PM today
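A sketch of how lagged sensor readings might be appended as extra columns, assuming equally spaced observations. The function name `add_lags` and the toy readings are hypothetical.

```python
def add_lags(X, n_lags):
    """Append the previous n_lags observations to each row of X.

    X is a list of N rows of sensor readings. The result has N - n_lags
    rows; each row holds the current readings first, then lag 1, lag 2, ...
    """
    if n_lags < 0 or n_lags >= len(X):
        raise ValueError("n_lags must be between 0 and len(X) - 1")
    lagged = []
    for t in range(n_lags, len(X)):
        row = []
        for lag in range(n_lags + 1):
            row.extend(X[t - lag])
        lagged.append(row)
    return lagged

# Toy data (hypothetical): two sensors observed at four time points.
X = [[1, 10], [2, 20], [3, 30], [4, 40]]
X_lagged = add_lags(X, n_lags=1)
print(X_lagged)
```

The first row is lost because it has no predecessor, which is why the data matrix has \(N - \ell\) rows when \(\ell\) lags are included.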

Adaptive PCA

Update the projection at fixed times:
- each hour, day, week, etc., "learn" the most recent observations and "forget" the oldest observations
- re-estimate the principal components for the next time period.
Benefits: account for the fact that people and communities change unpredictably over time
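The "learn and forget" schedule can be sketched as a moving training window. This sketch only shows the windowing; the re-estimation step is reduced to column means as a stand-in for re-estimating the principal components. The window length, update interval, and function names are assumptions for illustration.

```python
from collections import deque

def rolling_windows(stream, window, step):
    """Yield the current training window after every `step` new rows."""
    buffer = deque(maxlen=window)  # the oldest rows are forgotten automatically
    for count, row in enumerate(stream, start=1):
        buffer.append(row)
        if len(buffer) == window and (count - window) % step == 0:
            yield list(buffer)

def column_means(rows):
    """Stand-in for the model re-estimation performed on each window."""
    p = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(p)]

# Toy stream (hypothetical): two drifting sensors over ten time points.
stream = [[float(t), 2.0 * t] for t in range(10)]
updates = [column_means(w) for w in rolling_windows(stream, window=4, step=3)]
print(updates)
```

Each update uses only the four most recent rows, so the estimates follow the drift in the stream instead of averaging over the full history.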

Multi-state PCA

Different steps in water-treatment process \(\Longrightarrow\) different relationships between correlated sensors
Estimate the principal components for each state in the process independently
Benefits: within-state variance is smaller than between-state variance
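A small numerical illustration of why states are modeled separately, using one sensor and two hypothetical state labels ("fill" and "aerate"). The values are made up; the point is only that the variance within each state is far smaller than the variance of the pooled data.

```python
from collections import defaultdict
from statistics import variance

def split_by_state(values, states):
    """Group sensor values by the process state in which they were recorded."""
    groups = defaultdict(list)
    for value, state in zip(values, states):
        groups[state].append(value)
    return dict(groups)

# Toy data (hypothetical): the sensor level differs sharply between states.
values = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
states = ["fill"] * 4 + ["aerate"] * 4

pooled = variance(values)
within = {s: variance(v) for s, v in split_by_state(values, states).items()}
print(pooled, within)
```

A single model fit to the pooled data would treat the jump between states as ordinary variation, which makes real faults inside a state harder to detect.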

Monitoring Statistics: Hotelling's \(T^2\)

Hotelling's \(T^2 = \textbf{y}_{\text{new}} \Lambda^{-1}_d \textbf{y}^T_{\text{new}}\), where \(\textbf{y}_{\text{new}}\) is a new observation to check, and \(\Lambda_d\) is the diagonal matrix of the first \(d\) eigenvalues (variances).

\(T^2\) is the squared Mahalanobis distance (for centered data, measured from the origin) of \(\textbf{y}_{\text{new}}\), the image of \(\textbf{x}_{\text{new}}\) after mapping from the original \(p\)-space into the \(d\)-dimensional PCA subspace.
\(T^2\) measures deviations from expectation in the lower subspace.
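Because \(\Lambda_d\) is diagonal, the statistic reduces to a sum of squared scores, each divided by its eigenvalue. A minimal sketch with made-up eigenvalues and scores:

```python
def hotelling_t2(y_new, eigenvalues):
    """T^2 = y Lambda^{-1} y^T for diagonal Lambda: sum of y_k^2 / lambda_k."""
    if len(y_new) != len(eigenvalues):
        raise ValueError("need one eigenvalue per principal component score")
    return sum(y * y / lam for y, lam in zip(y_new, eigenvalues))

# Hypothetical variances of the first d = 2 principal components.
eigenvalues = [4.0, 1.0]

t2_typical = hotelling_t2([2.0, 1.0], eigenvalues)  # 4/4 + 1/1
t2_extreme = hotelling_t2([6.0, 3.0], eigenvalues)  # 36/4 + 9/1
print(t2_typical, t2_extreme)
```

The second observation sits three standard deviations out on both components, so its statistic is nine times larger than the first.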

Monitoring Statistics: Squared Prediction Error

First, let \(\textbf{e}_{\text{new}} := \textbf{x}_{\text{new}} - \textbf{y}_{\text{new}}\textbf{P}^T_d\). Then, SPE \(= \textbf{e}_{\text{new}}\textbf{e}_{\text{new}}^T\).

SPE is the squared distance between the original sensor vector and reduced-dimension approximation of this vector.
SPE measures the goodness-of-fit of the \(d\)-dimensional model.
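A sketch of the SPE calculation for a single observation, assuming a loading matrix \(\textbf{P}_d\) with orthonormal columns. The matrix and observations below are hypothetical (\(p = 2\), \(d = 1\)).

```python
import math

def project(x, P):
    """y = x P_d, with P stored as p rows of d loadings."""
    d = len(P[0])
    return [sum(x[i] * P[i][k] for i in range(len(x))) for k in range(d)]

def reconstruct(y, P):
    """x_hat = y P_d^T, the approximation of x in the original space."""
    return [sum(y[k] * P[i][k] for k in range(len(y))) for i in range(len(P))]

def spe(x, P):
    """Squared prediction error: e e^T with e = x - y P_d^T."""
    x_hat = reconstruct(project(x, P), P)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

# Hypothetical loadings: the single component points along the line x1 = x2.
P = [[1 / math.sqrt(2)], [1 / math.sqrt(2)]]

spe_on_model = spe([3.0, 3.0], P)   # lies on the component, so no error
spe_off_model = spe([3.0, 1.0], P)  # reconstructed as [2, 2], error [1, -1]
print(spe_on_model, spe_off_model)
```

The second observation breaks the usual relationship between the two sensors. That shows up in SPE even though its score on the retained component is unremarkable.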

Performance

Publication

Study Replication

Our code and examples are available at https://gabrielodom.github.io/mvMonitoring/index.html
The simulation study and data analysis are completely documented and repeatable through the users' guide on this page.
This method and corresponding software are currently in use at the Mines Park Decentralized Wastewater Treatment facility in Golden, CO.
This project is part of a pilot program for "sustainable clean water and sanitation".

Example 2: Interrogating Biological Pathways with `pathwayPCA`

Motivation

Each year, 18 million new cancer cases are diagnosed, and nearly 10 million people die from cancer (WHO, 2018)
A person dies of cancer every 3.3 seconds.
Cancer is the second leading cause of death in the US (CDC, 2017)
Different cancers cause disparities in mortality for (NCI, 2019):
- women
- minorities
- the indigent
- the elderly
Mortality is affected by society, but incidence is driven by genetics

Genetics and Cancer

The Central Dogma of molecular biology states that DNA (genes) encodes RNA, RNA encodes proteins, and proteins govern the behavior of the cell (thereby governing the tissue) (Clancy et al., 2008)
Cancers are primarily caused by multiple mutations in genes (Knudson hypothesis) belonging to certain biological processes, such as apoptosis (programmed cell death) or proliferation (ACS, 2014)
Many cancers are caused by multiple mutations of multiple genes, all working in concert to advance the disease state (Sugimura et al., 1992)

Challenges

While discovering single-gene cancer drivers is important, such as TP53 (NCBI, 2011), this approach has a few challenges:

Cancers are often caused by concurrent abnormalities in multiple genes
Gene knockdown experiments only test for single genes, not multiple genes
Drug trials often find redundancy in cancer-driving genes
Single-gene testing of 20,000 genes has very low statistical power after controlling for the false discovery rate
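The power problem in the last point can be illustrated with the Benjamini-Hochberg procedure, assumed here as the false discovery rate control. The p-values are made up: one moderately strong signal among otherwise unremarkable tests.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank  # largest rank whose p-value passes its threshold
    return sorted(order[:cutoff])

# Hypothetical results: the same signal (p = 1e-5) tested among
# 20,000 genes and among 1,000 pathways.
gene_p = [1e-5] + [0.5] * 19_999
pathway_p = [1e-5] + [0.5] * 999

gene_hits = benjamini_hochberg(gene_p)
pathway_hits = benjamini_hochberg(pathway_p)
print(gene_hits, pathway_hits)
```

The same p-value is rejected among 1,000 tests but not among 20,000, because the threshold for the smallest p-value shrinks in proportion to the number of tests.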

Solutions

To overcome these challenges:

Group genes by their biological pathways (NIH NHGRI, 2015).
- Depending on the grouping, there are anywhere from 50 to 5,000 pathways to consider.
- In cancer research, we usually care about
  - The C2 Canonical Pathways collection (Broad Institute) in the Molecular Signatures Database (1,329 pathways), or
  - The WikiPathways collection (approximately 500 pathways) (Slenter et al., 2018).
For each of the pathways selected, test a summary of the pathway for the presence of a statistically-significant relationship with some outcome (survival time, tumor size, or cancer subtype)

Methods to Summarize Pathways

SuperPCA

Supervised PCA (SuperPCA; Chen et al., 2008; Chen et al., 2010):

ranks each feature in pathway \(i\) by its univariate relationship with the outcome of interest (survival time, tumor size, cancer subtype, etc.), then
extracts principal components from the most relevant features
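A sketch of the screening step only, assuming a continuous outcome and using absolute Pearson correlation as the univariate score. Both choices are assumptions for illustration; survival outcomes would need a different univariate score. The function names and expression values are hypothetical and are not the `pathwayPCA` API.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def screen_features(X, outcome, keep):
    """Rank columns of X by |correlation| with the outcome; keep the top few."""
    p = len(X[0])
    scores = [abs(pearson([row[j] for row in X], outcome)) for j in range(p)]
    ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
    return ranked[:keep], scores

# Toy pathway (hypothetical): 5 samples by 3 genes, and a continuous outcome.
X = [[1.0, 5.0, 9.0],
     [2.0, 3.0, 8.1],
     [3.0, 6.0, 6.9],
     [4.0, 2.0, 6.2],
     [5.0, 5.5, 5.0]]
outcome = [1.1, 2.0, 2.9, 4.2, 5.0]

selected, scores = screen_features(X, outcome, keep=2)
print(selected, scores)
```

Genes 0 and 2 are kept and the uninformative gene 1 is dropped. Principal components would then be extracted from the selected columns only.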

AES-PCA

Adaptive, Elastic-net, Sparse PCA (AES-PCA; Chen, 2011) combines into a single objective function the following methods:

Elastic-Net (Zou and Hastie, 2005)
Adaptive Lasso (Zou, 2006)
Sparse PCA (Zou, Hastie, and Tibshirani, 2006)

AES-PCA extracts principal components from pathway \(i\) which minimize this composite objective function

Pathway Associations with Cancer Survival

Many cancers have pronounced survival disparities by gender (Dorak and Karpuzoglu, 2012)
Renal cancers have a known gender effect (ACS, 2017)
We found a potential association between survival outcomes and the interaction of gender and the first principal component of pathway WP1559.
This pathway measures transcription factors related to cardiac hypertrophy (thickening of the heart muscle).
A recent paper in Cardiorenal Medicine shows a strong relationship between kidney diseases and cardiac hypertrophy (Di Lullo et al., 2015).
Our Cox Proportional Hazards model was

\[ h(t) = h_0(t)\exp\left[\beta_1\text{PC}_1 + \beta_2\text{male} + \beta_3(\text{PC}_1\times\text{male})\right] \]
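One way to read the interaction term, as a sketch that follows directly from the model above and assumes \(\text{male}\) is a 0/1 indicator: for a one-unit increase in \(\text{PC}_1\) with gender held fixed, the baseline hazard cancels, leaving

\[ \frac{h(t \mid \text{PC}_1 + 1, \text{male})}{h(t \mid \text{PC}_1, \text{male})} = \exp\left(\beta_1 + \beta_3\,\text{male}\right) \]

The hazard ratio per unit of \(\text{PC}_1\) is therefore \(\exp(\beta_1)\) for female patients and \(\exp(\beta_1 + \beta_3)\) for male patients. A non-zero \(\beta_3\) means the pathway's association with survival differs by gender.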

Kidney Cancer Survival

Execution

Conclusion

Review

Published research in biomedical and social sciences is largely irreproducible.
Lack of documentation during the data cleaning, visualizing, modelling, and analyzing steps is partly to blame.
When you publish methodological or data analytical research, build a software package to share your code, data, and reports.
Software packages are valuable, self-contained, research apparatuses that greatly increase the chance that your published research is replicable by the scientific community.

Next Steps

Tools for reproducible data science, bioinformatics, and biostatistics I am currently collaborating on or recently completed:

Moonlight: A multi-omics tool to interpret pathways and indict cancer-driver genes (with Antonio Colaprico, Steven Chen, et al.).
coMethDMR: An unsupervised approach for identifying differentially-methylated regions in epigenome-wide association studies (with Lissette Gomez & Lily Wang).
DMRcompare: An evaluation of supervised methods for identifying differentially-methylated regions in Illumina methylation arrays (with Saurav Mallik, Lily Wang, & Steven Chen).
regionPredictR: Predict clinical outcomes using CpGs from differentially-methylated regions of the genome (with Lizhong Liu & Lily Wang).
rnaEditR: An unsupervised approach to cluster regions of co-edited RNA (with Jenny Zhang & Lily Wang).

Acknowledgements

Chen and Wang Translational Bio Lab: Steven Chen, Lily Wang, Lizhong Liu, Antonio Colaprico, James Ban, Jenny Zhang, Zhen Gao, Lissette Gomez, and Shirley Sun
- NIH/NCI R01 CA158472; NIH/NCI R01 CA200987; NIH/NCI U24 CA210954
Hering and Cath Engineering Lab: Amanda Hering, Tzahi Cath, Kate Newhart, Ben Barnard, Karen Kazor, and Melissa Johnson
- NSF 1632227; KAUST OSR-2015-CRG4-2582
My mentors: Dean Young, Amanda Hering, James Stamey, Steven Chen, and Lily Wang

Thank You!

Questions?
