April 16, 2019
Vitae: http://rpubs.com/gabrielodom/curriculum_vitae
GitHub: https://github.com/gabrielodom
Six Divisions of "Greater" Data Science (Donoho, 2017):
* Traditional "statistical" topics in bold.
When you use something, leave it better than when you found it.
– My mother
Investing in people and for people drives my work:
Published bio-science is largely not reproducible:
Reproducibility: if we repeat a study, the repetition should agree with the original results, or—at minimum—not refute the study's conclusions.
John Ioannidis, physician scientist, speaks harshly against the lack of replicability in science:
Before reproducibility must come preproducibility.
"Instead of arguing about whether results hold up, let’s push to provide enough information for others to repeat the experiments … In computational science, ‘reproducible’ often means that enough information is provided to allow a dedicated reader to repeat the calculations in the paper for herself."
Baggerly and Coombes (2009) founded the field of Forensic Bioinformatics. They wrote this paper while waging war against fabricated data:
Potti's mistakes could have been corrected—or avoided entirely—if his team had built a software package to document their data science.
Software Package: a self-contained suite of programs necessary to accomplish a set of related tasks.
A software package that accompanies a published paper documents and organizes the code necessary to replicate every aspect of the data analysis shown in the paper.
Software packages:
Reproducible Data Science is publishing the software package necessary to transmute our raw data into our published results.
mvMonitoring
Control charts detect when dynamic processes violate normal operating conditions (Shewhart, 1926). Example (toy data):
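As a rough illustration of the idea, the sketch below builds a simple 3-sigma control chart in Python. It is not the chart from the talk: the function names, the toy numbers, and the use of the sample standard deviation to set the limits are all assumptions made for this example.

```python
from statistics import mean, stdev


def shewhart_limits(baseline, k=3.0):
    """Return (lower, center, upper) limits estimated from in-control data.

    Assumption: limits are center +/- k sample standard deviations.
    """
    center = mean(baseline)
    spread = stdev(baseline)
    return center - k * spread, center, center + k * spread


def flag_out_of_control(observations, limits):
    """Return the indices of observations that fall outside the limits."""
    lower, _, upper = limits
    return [i for i, x in enumerate(observations) if x < lower or x > upper]


# Hypothetical toy data: readings taken under normal operating conditions.
baseline = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9]
limits = shewhart_limits(baseline)

# New readings to monitor; the third one drifts well above the baseline.
new_obs = [10.0, 10.1, 11.5, 9.9]
alarms = flag_out_of_control(new_obs, limits)
```

Here `alarms` holds the positions of the readings that violate the limits. A chart like this treats each sensor on its own and assumes independent readings.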
We created a new control chart that addresses the four process-monitoring concerns listed previously.
Control charts monitor independent, univariate sensors. We need to monitor many dependent sensors simultaneously. We need PCA (principal components analysis) to:
Multi-state, Adaptive, Dynamic PCA combines the following three algorithmic components: Dynamic PCA, Adaptive PCA, and Multi-state PCA.
Hotelling's \(T^2 = \textbf{y}_{\text{new}} \Lambda^{-1}_d \textbf{y}^T_{\text{new}}\), where \(\textbf{y}_{\text{new}}\) is a new observation to check, and \(\Lambda_d\) is the diagonal matrix of the first \(d\) eigenvalues (variances).
First, let \(\textbf{e}_{\text{new}} := \textbf{x}_{\text{new}} - \textbf{y}_{\text{new}}\textbf{P}^T_d\). Then, SPE \(= \textbf{e}_{\text{new}}\textbf{e}_{\text{new}}^T\).
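The two monitoring statistics above can be sketched in plain Python. This is a minimal illustration, not the mvMonitoring implementation. It assumes that \(\textbf{x}_{\text{new}}\) is a centred row vector, that \(\textbf{P}_d\) holds the first \(d\) orthonormal loading vectors as columns, and that the scores are \(\textbf{y}_{\text{new}} = \textbf{x}_{\text{new}}\textbf{P}_d\). The function name and toy inputs are hypothetical.

```python
def pca_monitoring_stats(x_new, loadings, eigenvalues):
    """Return (T2, SPE) for one new observation.

    x_new:       list of p floats (assumed already centred)
    loadings:    p x d nested list, the matrix P_d (columns orthonormal)
    eigenvalues: list of the d leading eigenvalues (diagonal of Lambda_d)
    """
    p, d = len(loadings), len(eigenvalues)

    # Scores: y_new = x_new P_d  (assumed definition of y_new)
    scores = [sum(x_new[i] * loadings[i][j] for i in range(p)) for j in range(d)]

    # Hotelling's T^2 = y_new Lambda_d^{-1} y_new^T; Lambda_d is diagonal
    t2 = sum(scores[j] ** 2 / eigenvalues[j] for j in range(d))

    # Residual: e_new = x_new - y_new P_d^T, then SPE = e_new e_new^T
    recon = [sum(scores[j] * loadings[i][j] for j in range(d)) for i in range(p)]
    spe = sum((x_new[i] - recon[i]) ** 2 for i in range(p))
    return t2, spe


# Toy example: p = 3 features, d = 1 retained component along the first axis.
P_d = [[1.0], [0.0], [0.0]]
Lambda_d = [4.0]
t2, spe = pca_monitoring_stats([2.0, 1.0, -1.0], P_d, Lambda_d)
```

In this toy case \(T^2\) measures how extreme the observation is within the retained component, while SPE measures how much of it falls outside that component.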
pathwayPCA
While discovering single-gene cancer drivers, such as TP53 (NCBI, 2011), is important, this approach has a few challenges:
To overcome these challenges:
Supervised PCA (SuperPCA; Chen et al., 2008; Chen et al., 2010):
Adaptive, Elastic-net, Sparse PCA (AES-PCA; Chen, 2011) combines into a single objective function the following methods:
AES-PCA extracts principal components from pathway \(i\) which minimize this composite objective function
\[ h(t) = h_0(t)\exp\left[\beta_1\text{PC}_1 + \beta_2\text{male} + \beta_3(\text{PC}_1\times\text{male})\right] \]
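To see what the interaction term does in this model, the sketch below evaluates the hazard relative to baseline, \(h(t)/h_0(t)\), for given covariates. The coefficient values are made up for illustration only, and the function names are hypothetical; nothing here is an estimate from the talk.

```python
import math


def relative_hazard(pc1, male, beta1, beta2, beta3):
    """h(t) / h0(t) = exp(beta1*PC1 + beta2*male + beta3*PC1*male)."""
    return math.exp(beta1 * pc1 + beta2 * male + beta3 * pc1 * male)


def pc1_hazard_ratio(male, beta1, beta3):
    """Hazard ratio for a one-unit increase in PC1 within one sex.

    The beta2 term cancels, leaving exp(beta1 + beta3 * male).
    """
    return math.exp(beta1 + beta3 * male)


# Hypothetical coefficients, chosen only to illustrate the arithmetic.
b1, b2, b3 = 0.5, 0.2, -0.3

hr_female = pc1_hazard_ratio(0, b1, b3)  # exp(beta1)
hr_male = pc1_hazard_ratio(1, b1, b3)    # exp(beta1 + beta3)

# Cross-check: the ratio of two relative hazards one PC1 unit apart.
check_male = relative_hazard(1.0, 1, b1, b2, b3) / relative_hazard(0.0, 1, b1, b2, b3)
```

A nonzero \(\beta_3\) means the effect of the pathway's first principal component on the hazard differs between males and females.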
Tools for reproducible data science, bioinformatics, and biostatistics I am currently collaborating on or recently completed:
Moonlight
: A multi-omics tool to interpret pathways and indict cancer-driver genes (with Antonio Colaprico, Steven Chen, et al.).
coMethDMR
: An unsupervised approach for identifying differentially-methylated regions in epigenome-wide association studies (with Lissette Gomez & Lily Wang).
DMRcompare
: An evaluation of supervised methods for identifying differentially-methylated regions in Illumina methylation arrays (with Saurav Mallik, Lily Wang, & Steven Chen).
regionPredictR
: Predict clinical outcomes using CpGs from differentially-methylated regions of the genome (with Lizhong Liu & Lily Wang).
rnaEditR
: An unsupervised approach to cluster regions of co-edited RNA (with Jenny Zhang & Lily Wang).
Thank You!
Questions?