April 16, 2019

Overview

  • About Me
  • Reproducible Data Science and Software
  • Example 1: Decentralised Water Quality Monitoring
  • Example 2: Detecting Pathways that Drive Cancer
  • Conclusion

About Me

Academic Interests

Research

  • High-Dimensional Statistics (\(p \gg n\))
  • Computational Statistics
  • Bayesian Statistics
  • Spatial Statistics
  • Data Science
  • "Big" EHR Data


Curiosities

  • Pedagogy
  • Health and Science Ethics
  • Leadership Development
  • Health Economics
  • Infectious Disease Modelling
  • Apologetics and Patristics
  • Film Music

Vitae: http://rpubs.com/gabrielodom/curriculum_vitae

GitHub: https://github.com/gabrielodom

Academic Timeline

Teaching

Courses Taught

  • Computational Statistics
  • High-Dimensional Statistics
  • R for Data Science and Development
  • Special Topics in Research
  • Statistics for Business I & II
  • Statistics for Health Sciences

Courses I Can Teach

  • Bayesian Methods
  • Bayesian Theory
  • Spatial Statistics
  • Multivariate Statistics
  • Applied Regression
  • Introduction to Machine Learning

Statistician or Data Scientist?

Six Divisions of "Greater" Data Science (Donoho, 2017):

  1. Data Exploration and Preparation: exploratory data analysis and data cleaning
  2. Data Representation and Transformation: mathematical transformations, querying databases, and data formatting
  3. Computing with Data: programming languages and code packaging
  4. Data Modeling: statistical modelling and machine learning
  5. Data Visualization and Presentation: from plots to interactive websites
  6. Science about Data Science: meta-analysis about the utility of statistical and computational tools

* Traditional "statistical" topics in bold.

My Philosophy

When you use something, leave it better than when you found it.

– My mother

Investing in people and for people drives my work:

  1. My students should be better scientists, better collaborators, and better people after my classes / mentoring
  2. The PIs I collaborate with and their staff should feel that I added lasting value
  3. My research should be easy for other scientists to replicate and build upon

Reproducible Data Science and Software

The Reproducibility Crisis

Published bio-science is largely not reproducible:

  • Oncology: 53 published articles tested, six successes (11%) (Nature, 2012)
  • Psychology:
    • 100 published articles, 39 successes (Nature News, 2015)
    • 71 published articles tested, 92 replication attempts, 35 successes (38%). The PsychFileDrawer project is ongoing.
  • Pharmacology: 67 published models tested, 14 successes (21%) (Nature Reviews, 2011)

Reproducibility: if we repeat a study, the repetition should agree with the original results, or—at minimum—not refute the study's conclusions.

The Ioannidis crusade

Reproducible Data Science

Before reproducibility must come preproducibility.

"Instead of arguing about whether results hold up, let’s push to provide enough information for others to repeat the experiments … In computational science, ‘reproducible’ often means that enough information is provided to allow a dedicated reader to repeat the calculations in the paper for herself."

Philip B. Stark, Professor of Statistics, UC Berkeley

Forensic Bioinformatics

Baggerly and Coombes (2009) founded the field of Forensic Bioinformatics. They wrote this paper while waging war against fabricated data:

  • Dr. Potti's falsified research had made it to the clinical trial phase before it was stopped.
  • Dr. Potti was found to have "engaged in research misconduct by including false research data".
  • He was fired from the Duke University School of Medicine.
  • Links: Dr. Baggerly's slides on their process; HHS Office of Research Integrity statement on Anil Potti; Fostering Integrity in Research, Appendix D

Potti's mistakes could have been corrected—or avoided entirely—if his team had built a software package to document their data science.

Software Packages

Software Package: a self-contained suite of programs necessary to accomplish a set of related tasks.

  • Functions and scripts written in one or more programming languages
  • Example data
  • Code and data documentation
  • Metadata about the package development process, team, and timeline
  • A users' guide interweaving motivation, documentation, code, output, and analysis, known as a vignette

Software packages for a published paper document and organize the code necessary to replicate every aspect of the data analysis shown in the paper.

Why Packages

Software packages:

  1. make your entire analysis reproducible,
  2. make your research process ethical and transparent,
  3. follow the spirit of the Literate Programming principle (Knuth, 1984), and
  4. enable the next generation of scientists to "stand on your shoulders".

Reproducible Data Science is publishing the software package necessary to transmute our raw data into our published results.

Example 1: Monitoring Water Quality with mvMonitoring

Motivating Example

  • Water conservation is a growing concern around the world (and currently in the western U.S.)
  • Lack of sanitation affects 35% of the world's population (65% of East Asia, 33% South Asia, 31% Sub-Saharan Africa) (CDC, 2016; WHO, 2008):
    • Over 1900 children die each day due to sanitation-preventable diarrheal diseases
    • "Almost one tenth of the global disease burden could be prevented by improving water supply, sanitation, hygiene and management of water resources"
    • Communities without sanitation are less likely to provide education to female students after puberty
  • Decentralized wastewater reclamation is a modern strategy to provide access to potable water and curb water waste

Decentralized Water Treatment

  • Most global communities do not have the infrastructure for centralized wastewater treatment or reclamation
  • Decentralized wastewater treatment processes are cheaper to build and maintain, and require minimal human interaction
  • Decentralized treatment requires sophisticated automatic process monitoring
  • These processes:
    1. are adaptive (exhibiting change over weeks, months, and years relative to the local population),
    2. are dynamic (exhibiting strong temporal dependence),
    3. have multiple steady-states, and
    4. use multiple redundant and highly-correlated sensors (features)

Example Problem

  • We care about monitoring the state of the system to ensure that the water quality stays within proper limits and that the system does not collapse.
  • A sequencing-batch membrane bioreactor is a tightly-controlled hybrid bio-mechanical purification system.
  • The internal biological ecosystem is very sensitive; if it crashes, the entire treatment facility can be inoperable for months
  • One such bio-system crash shut down a decentralized plant for four months

Control Charts

Control charts detect when dynamic processes violate normal operating conditions (Shewhart, 1926). Example (toy data):
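
A minimal sketch of such a chart on made-up toy data, assuming the simplest individuals-style rule (flag any observation more than three sample standard deviations from the in-control mean). The function names and the sensor values are hypothetical and are not taken from mvMonitoring.

```python
from statistics import mean, stdev

def shewhart_limits(training, k=3.0):
    """Return (lower, center, upper) limits estimated from in-control data."""
    center = mean(training)
    spread = stdev(training)  # sample standard deviation
    return center - k * spread, center, center + k * spread

def flag_violations(observations, lower, upper):
    """Return the indices of observations that fall outside the limits."""
    return [i for i, x in enumerate(observations) if x < lower or x > upper]

# Toy data (made up for illustration): one sensor hovering near 10.
training = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0, 10.1, 9.9]
lower, center, upper = shewhart_limits(training)

# New readings to check against the limits; the fourth one is a spike.
new_obs = [10.0, 10.2, 9.8, 11.5, 10.1]
alarms = flag_violations(new_obs, lower, upper)
print(alarms)
```

Only the spike at index 3 lies outside the limits, so it is the only reading that raises an alarm.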

We created a new control chart that addresses the four process-monitoring concerns listed previously.

Multi-state, Adaptive, Dynamic PCA

Control charts monitor independent, univariate sensors. We need to monitor many dependent sensors simultaneously. We need PCA (principal components analysis) to:

  • extract independent composite features (orthogonal linear combinations) from the original sensors, and
  • explain the most information with fewest features.
  • Benefits:
    • reduced number of charts to monitor
    • charts are now independent (under mild assumptions)
    • account for correlated sensors.

Multi-state, Adaptive, Dynamic PCA combines the following three algorithmic components: Dynamic PCA, Adaptive PCA, and Multi-state PCA.

PCA Formulation

  • Let \(\textbf{X} \in \mathbb{R}^{(N - \ell) \times p}\) be the observed process data (including up to \(\ell\) previous time points as predictors).
  • Let \(\textbf{P}_d \in \mathbb{R}^{p \times d}\), for \(d < p\), be the projection matrix of the \(d\) eigenvectors corresponding to the largest \(d\) eigenvalues of the sample covariance (or correlation) matrix of \(\textbf{X}\).
  • The principal components matrix, \(\textbf{Y} = \textbf{X}\textbf{P}_d \in \mathbb{R}^{(N - \ell) \times d}\), is the transformation of the original \(p\) features into the \(d\)-dimensional orthogonal subspace that preserves the most information.
  • Multi-state, adaptive PCA estimates a different \(\textbf{P}_d\) for each state and updates these estimates regularly.
  • Instead of monitoring a new observation \(\textbf{x}_{\text{new}}\), we monitor \(\textbf{y}_{\text{new}} = \textbf{x}_{\text{new}}\textbf{P}_d\).
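
The projection can be sketched in a few lines for the smallest interesting case, \(p = 2\) sensors reduced to \(d = 1\) component. This sketch assumes the eigenvectors are taken from the sample covariance matrix of the centered data and uses the closed-form eigen-decomposition of a symmetric 2 x 2 matrix; the data and function names are hypothetical.

```python
import math

def center_columns(rows):
    """Subtract each column mean so every feature has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance_2x2(centered):
    """Entries (a, b, c) of the sample covariance matrix [[a, b], [b, c]]."""
    n = len(centered)
    a = sum(r[0] * r[0] for r in centered) / (n - 1)
    b = sum(r[0] * r[1] for r in centered) / (n - 1)
    c = sum(r[1] * r[1] for r in centered) / (n - 1)
    return a, b, c

def leading_eigenpair(a, b, c):
    """Largest eigenvalue and unit eigenvector of [[a, b], [b, c]]."""
    half_trace = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    lam = half_trace + radius
    if b == 0.0:
        # Already diagonal: the leading axis is the higher-variance feature.
        vec = (1.0, 0.0) if a >= c else (0.0, 1.0)
    else:
        norm = math.hypot(b, lam - a)
        vec = (b / norm, (lam - a) / norm)
    return lam, vec

# Toy data (made up): two highly correlated sensors at five time points.
X = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.8]]
centered = center_columns(X)
lam, p1 = leading_eigenpair(*covariance_2x2(centered))

# Y = X P_d with d = 1: one composite score per time point.
Y = [row[0] * p1[0] + row[1] * p1[1] for row in centered]
```

The sample variance of the scores in `Y` equals the leading eigenvalue `lam`, which is why the eigenvalues reappear as variances in the monitoring statistics.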

Adaptive and Dynamic PCA

Dynamic PCA

  • Include any relevant sensor data from previous time points
  • Benefits: how much water people used at 6PM yesterday is a great predictor for how much they will use at 6PM today
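
One way to picture the dynamic step is as a lagged feature matrix: each row holds the current sensor readings followed by the readings from the previous \(\ell\) time points. The helper below is a hypothetical sketch of that layout, not the mvMonitoring implementation.

```python
def add_lags(series, n_lags):
    """Build rows of [obs at t, obs at t-1, ..., obs at t-n_lags].

    `series` is a list of per-time-point sensor readings. The first
    `n_lags` time points have no complete history, so the result has
    len(series) - n_lags rows.
    """
    rows = []
    for t in range(n_lags, len(series)):
        row = []
        for lag in range(n_lags + 1):
            row.extend(series[t - lag])
        rows.append(row)
    return rows

# Toy data (made up): two sensors observed at four time points.
series = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
lagged = add_lags(series, n_lags=1)
for row in lagged:
    print(row)
```

With four time points and one lag, the lagged matrix has three rows and four columns.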

Adaptive PCA

  • Update the projection at fixed times:
    • each hour, day, week, etc., "learn" the most recent observations and "forget" the oldest observations
    • re-estimate the principal components for the next time period.
  • Benefits: account for the fact that people and communities change unpredictably over time
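
The "learn and forget" schedule can be sketched as a sliding window over observation indices. The function name and parameters here are hypothetical, and the re-estimation of the principal components on each training window is deliberately left out.

```python
def adaptive_windows(n_obs, train_size, update_every):
    """Yield (train, monitor) index ranges for a sliding-window scheme.

    Each training window holds the `train_size` most recent observations.
    After `update_every` new observations have been monitored, the window
    slides forward: it learns the newest points and forgets the oldest.
    """
    start = 0
    while start + train_size < n_obs:
        train = range(start, start + train_size)
        stop = min(start + train_size + update_every, n_obs)
        monitor = range(start + train_size, stop)
        yield train, monitor
        start += update_every

# Ten observations, train on four, re-estimate after every three new ones.
schedule = list(adaptive_windows(n_obs=10, train_size=4, update_every=3))
for train, monitor in schedule:
    print(list(train), "->", list(monitor))
```

In this toy schedule the second training window has dropped observations 0 to 2 and picked up observations 4 to 6.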

Multi-state PCA

  • Different steps in water-treatment process \(\Longrightarrow\) different relationships between correlated sensors
  • Estimate the principal components for each state in the process independently
  • Benefits: within-state variance is smaller than between-state variance
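
The benefit of splitting by state shows up even in a single made-up sensor: the variance within each state is far smaller than the variance of the pooled readings that ignore state. The state labels and values below are hypothetical.

```python
from collections import defaultdict
from statistics import variance

def split_by_state(values, states):
    """Group sensor readings by the process state in which they were taken."""
    groups = defaultdict(list)
    for value, state in zip(values, states):
        groups[state].append(value)
    return dict(groups)

# Toy data (made up): one sensor observed under two process states.
values = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
states = ["fill"] * 4 + ["aerate"] * 4

pooled = variance(values)  # ignores the state labels
within = {s: variance(v) for s, v in split_by_state(values, states).items()}
print(pooled, within)
```

A single model fit to the pooled data would have to treat the shift between states as ordinary variation, which makes real faults harder to see.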

Monitoring Statistics: Hotelling's \(T^2\)

Hotelling's \(T^2 = \textbf{y}_{\text{new}} \Lambda^{-1}_d \textbf{y}^T_{\text{new}}\), where \(\textbf{y}_{\text{new}}\) is a new observation to check, and \(\Lambda_d\) is the diagonal matrix of the first \(d\) eigenvalues (variances).

  • \(T^2\) is the squared Mahalanobis distance from the origin to \(\textbf{y}_{\text{new}}\), the new observation mapped from the original \(p\)-space into the \(d\)-dimensional PCA subspace (assuming centered data).
  • \(T^2\) measures deviations from expectation in the lower subspace.
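
Because \(\Lambda_d\) is diagonal, the statistic reduces to a sum of squared scores, each divided by its eigenvalue. A minimal sketch, with hypothetical scores and eigenvalues chosen so the arithmetic is easy to follow:

```python
def hotelling_t2(y_new, eigenvalues):
    """T^2 = sum_j y_j^2 / lambda_j for d scores and d retained eigenvalues."""
    if len(y_new) != len(eigenvalues):
        raise ValueError("need one eigenvalue per principal component score")
    return sum(y * y / lam for y, lam in zip(y_new, eigenvalues))

# Hypothetical values: d = 2 scores and the variances of those components.
y_new = [3.0, -1.0]
eigenvalues = [4.0, 0.5]

# 3^2 / 4 + (-1)^2 / 0.5 = 2.25 + 2.0
t2 = hotelling_t2(y_new, eigenvalues)
print(t2)
```

The second score is smaller in absolute value but contributes almost as much to \(T^2\) as the first, because its component has much less variance.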

Monitoring Statistics: Squared Prediction Error

First, let \(\textbf{e}_{\text{new}} := \textbf{x}_{\text{new}} - \textbf{y}_{\text{new}}\textbf{P}^T_d\). Then, SPE \(= \textbf{e}_{\text{new}}\textbf{e}_{\text{new}}^T\).

  • SPE is the squared distance between the original sensor vector and reduced-dimension approximation of this vector.
  • SPE measures the goodness-of-fit of the \(d\)-dimensional model.
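
The SPE calculation can be sketched directly from the definition: project, map back, and sum the squared residuals. The loading matrix below is a hypothetical \(\textbf{P}_d\) with \(p = 3\) and \(d = 1\), and the observation is assumed to be centered already.

```python
def project(x, loadings):
    """y = x P_d, where `loadings` has p rows and d columns."""
    d = len(loadings[0])
    return [sum(x[i] * loadings[i][j] for i in range(len(x))) for j in range(d)]

def reconstruct(y, loadings):
    """x_hat = y P_d^T, the approximation of x in the original p-space."""
    return [sum(y[j] * row[j] for j in range(len(y))) for row in loadings]

def squared_prediction_error(x, loadings):
    """SPE = e e^T, where e = x - y P_d^T."""
    x_hat = reconstruct(project(x, loadings), loadings)
    return sum((xi - hi) ** 2 for xi, hi in zip(x, x_hat))

# Hypothetical loadings: a single unit-length column.
P_d = [[0.6], [0.8], [0.0]]
x_new = [1.0, 2.0, 0.5]

# y = 2.2, x_hat = [1.32, 1.76, 0.0], e = [-0.32, 0.24, 0.5]
spe = squared_prediction_error(x_new, P_d)
print(spe)
```

Most of the SPE here comes from the third sensor, which the one-component model ignores entirely.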

Performance

Publication

Study Replication

  • Our code and examples are available at https://gabrielodom.github.io/mvMonitoring/index.html
  • The simulation study and data analysis are completely documented and repeatable through the users' guide on this page.
  • This method and corresponding software are currently in use at the Mines Park Decentralized Wastewater Treatment facility in Golden, CO.
  • This project is part of a pilot program for "sustainable clean water and sanitation".

Example 2: Interrogating Biological Pathways with pathwayPCA

Motivation

  • Each year, 18 million new cancer cases are diagnosed, and nearly 10 million people die from cancer (WHO, 2018)
  • A person dies of cancer every 3.3 seconds.
  • Cancer is the second leading cause of death in the US (CDC, 2017)
  • Different cancers cause disparities in mortality for (NCI, 2019):
    • women
    • minorities
    • the indigent
    • the elderly
  • Mortality is affected by society, but incidence is driven by genetics

Genetics and Cancer

  • The Central Dogma of molecular biology states that DNA (genes) encodes RNA, RNA encodes proteins, and proteins govern the behavior of the cell (thereby governing the tissue) (Clancy et al., 2008)
  • Cancers are primarily caused by multiple mutations in genes (Knudson hypothesis) belonging to certain biological processes, such as apoptosis (programmed cell death) or proliferation (ACS, 2014)
  • Many cancers are caused by multiple mutations of multiple genes, all working in concert to advance the disease state (Sugimura et al., 1992)

Challenges

While discovering single-gene cancer drivers is important, such as TP53 (NCBI, 2011), this approach has a few challenges:

  • Cancers are often caused by concurrent abnormalities in multiple genes
  • Gene knockdown experiments only test for single genes, not multiple genes
  • Drug trials often find redundancy in cancer-driving genes
  • Single-gene testing of 20,000 genes has very low statistical power after controlling for the false discovery rate
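
To make the power argument concrete, the sketch below applies a false discovery rate correction to the same small p-value twice: once among 20,000 single-gene tests and once among 1,329 pathway tests. It assumes the Benjamini-Hochberg step-up procedure, which the slides do not name, and all p-values are made up.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p-value is below
        # its own threshold, rank * fdr / m.
        if p_values[idx] <= rank * fdr / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

# Made-up p-values: one real signal (p = 1e-5) among otherwise null tests.
signal = 1e-5
gene_level = [signal] + [0.5] * 19999     # 20,000 single-gene tests
pathway_level = [signal] + [0.5] * 1328   # 1,329 pathway tests

rejected_genes = benjamini_hochberg(gene_level)
rejected_pathways = benjamini_hochberg(pathway_level)
print(rejected_genes, rejected_pathways)
```

The same evidence survives the correction at the pathway level but not at the gene level, because the smallest threshold is 0.05 / 1,329 rather than 0.05 / 20,000.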

Solutions

To overcome these challenges:

  1. Group genes by their biological pathways (NIH NHGRI, 2015).
    • Depending on the grouping, there are anywhere from 50-5000 pathways to consider.
    • In cancer research, we usually care about
      • The C2 Canonical Pathways collection (Broad Institute) in the Molecular Signatures Database (1,329 pathways), or
      • The WikiPathways collection (approximately 500 pathways) (Slenter et al., 2018).
  2. For each of the pathways selected, test a summary of the pathway for the presence of a statistically-significant relationship with some outcome (survival time, tumor size, or cancer subtype)

Methods to Summarize Pathways

SuperPCA

Supervised PCA (SuperPCA; Chen et al., 2008; Chen et al., 2010):

  • ranks each feature in pathway \(i\) by its univariate relationship with the outcome of interest (survival time, tumor size, cancer subtype, etc.), then
  • extracts principal components from the most relevant features
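
The two steps above can be sketched on made-up data. This sketch assumes a continuous outcome, ranks features by absolute Pearson correlation, and extracts the first principal component by power iteration; these choices are stand-ins for illustration and are not the pathwayPCA implementation.

```python
import math

def column_mean(col):
    return sum(col) / len(col)

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = column_mean(x), column_mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def top_features(columns, outcome, keep):
    """Indices of the `keep` features most associated with the outcome."""
    strength = [abs(pearson(col, outcome)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: strength[j], reverse=True)
    return sorted(ranked[:keep])

def first_pc_scores(columns, n_iter=200):
    """First principal component scores of the features, by power iteration."""
    n = len(columns[0])
    centered = []
    for col in columns:
        m = column_mean(col)
        centered.append([v - m for v in col])
    k = len(centered)
    cov = [[sum(a * b for a, b in zip(centered[i], centered[j])) / (n - 1)
            for j in range(k)] for i in range(k)]
    # Start from a constant unit vector; fine for this toy data, though
    # not a safe starting point in general.
    vec = [1.0 / math.sqrt(k)] * k
    for _ in range(n_iter):
        nxt = [sum(cov[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return [sum(vec[j] * centered[j][t] for j in range(k)) for t in range(n)]

# Toy pathway (made up): four genes measured on six samples.
genes = [
    [1.1, 2.0, 2.9, 4.2, 5.1, 5.8],    # tracks the outcome
    [0.5, -0.2, 0.3, 0.1, -0.4, 0.2],  # noise
    [2.2, 3.9, 6.1, 8.0, 9.8, 12.1],   # tracks the outcome
    [3.0, 1.0, 2.0, 3.0, 1.0, 2.0],    # weakly related
]
outcome = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

selected = top_features(genes, outcome, keep=2)
scores = first_pc_scores([genes[j] for j in selected])
print(selected)
```

The resulting `scores` vector is the pathway summary that would then be tested against the outcome.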

AES-PCA

Pathway Associations with Cancer Survival

  • Many cancers have pronounced survival disparities by gender (Dorak and Karpuzoglu, 2012)
  • Renal cancers have a known gender effect (ACS, 2017)
  • We found a potential association between survival outcomes and the interaction of gender and the first principal component of pathway WP1559.
  • This pathway measures transcription factors related to cardiac hypertrophy (thickening of the heart muscle).
  • A recent paper in Cardiorenal Medicine shows a strong relationship between kidney diseases and cardiac hypertrophy (Di Lullo et al., 2015).
  • Our Cox Proportional Hazards model was

\[ h(t) = h_0(t)\exp\left[\beta_1\text{PC}_1 + \beta_2\text{male} + \beta_3(\text{PC}_1\times\text{male})\right] \]
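
As a reading aid for the interaction term (a worked consequence of the model above, not a fitted result), the hazard ratio for a one-unit increase in \(\text{PC}_1\) depends on gender. This assumes male is coded as 1 for men and 0 for women, which the slide does not state.

\[ \text{HR}_{\text{female}} = \frac{h(t \mid \text{PC}_1 + 1, \text{male} = 0)}{h(t \mid \text{PC}_1, \text{male} = 0)} = \exp(\beta_1), \qquad \text{HR}_{\text{male}} = \frac{h(t \mid \text{PC}_1 + 1, \text{male} = 1)}{h(t \mid \text{PC}_1, \text{male} = 1)} = \exp(\beta_1 + \beta_3) \]

The baseline hazard \(h_0(t)\) and the \(\beta_2\) term cancel in each ratio, so \(\beta_3 \neq 0\) means the pathway's first principal component is associated with survival differently for men and women.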

Kidney Cancer Survival

Execution

Conclusion

Review

  • Published research in biomedical and social sciences is largely irreproducible.
  • Lack of documentation during the data cleaning, visualizing, modelling, and analyzing steps is partly to blame.
  • When you publish methodological or data analytical research, build a software package to share your code, data, and reports.
  • Software packages are valuable, self-contained, research apparatuses that greatly increase the chance that your published research is replicable by the scientific community.

Next Steps

Tools for reproducible data science, bioinformatics, and biostatistics I am currently collaborating on or recently completed:

  • Moonlight: A multi-omics tool to interpret pathways and indict cancer-driver genes (with Antonio Colaprico, Steven Chen, et al.).
  • coMethDMR: An unsupervised approach for identifying differentially-methylated regions in epigenome-wide association studies (with Lissette Gomez & Lily Wang).
  • DMRcompare: An evaluation of supervised methods for identifying differentially-methylated regions in Illumina methylation arrays (with Saurav Mallik, Lily Wang, & Steven Chen).
  • regionPredictR: Predict clinical outcomes using CpGs from differentially-methylated regions of the genome (with Lizhong Liu & Lily Wang).
  • rnaEditR: An unsupervised approach to cluster regions of co-edited RNA (with Jenny Zhang & Lily Wang).

Acknowledgements

  • Chen and Wang Translational Bio Lab: Steven Chen, Lily Wang, Lizhong Liu, Antonio Colaprico, James Ban, Jenny Zhang, Zhen Gao, Lissette Gomez, and Shirley Sun
    • NIH/NCI R01 CA158472; NIH/NCI R01 CA200987; NIH/NCI U24 CA210954
  • Hering and Cath Engineering Lab: Amanda Hering, Tzahi Cath, Kate Newhart, Ben Barnard, Karen Kazor, and Melissa Johnson
    • NSF 1632227; KAUST OSR-2015-CRG4-2582
  • My mentors: Dean Young, Amanda Hering, James Stamey, Steven Chen, and Lily Wang

Thank You!

Questions?