emrselect Large detailed data set + Complex disease +
Messy diagnostic codes + Free text fields
=
Who are true disease cases?
Liao et al. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ 2015;350:h1885
Turn text notes into variables:
S. Murphy et al., Instrumenting the health care enterprise for discovery research in the genomic era, Genome Research, 19(9): 1675–1681 (2009). Copyright 2009 by Cold Spring Harbor Laboratory Press
## Observations: 100
## Variables: 37
## $ patient_num <int> 31, 19, 5, 13, 17, 2,...
## $ RA_GoldStandard <chr> "NULL", "N", "NULL", ...
## $ patient_gender <int> 1, 1, 1, 0, 0, 0, 1, ...
## $ HgA1C <int> 0, 1, 31, 10, 0, 0, 0...
## $ HgA1C_2 <int> 0, 0, 1, 1, 0, 0, 0, ...
## $ RA_COD_DX_Lupus <int> 0, 0, 0, 0, 0, 0, 0, ...
## $ RA_COD_DX_Psoriaticarthritis <int> 0, 0, 0, 0, 155, 8, 7...
## $ RA_COD_DX_RheumatoidArthritis <int> 0, 0, 0, 0, 3, 2, 4, ...
## $ RA_COD_LAB_antiCCP <int> 0, 0, 0, 0, 0, 0, 0, ...
## $ RA_COD_LAB_RF <int> 0, 6, 0, 0, 1, 0, 0, ...
## $ RA_COD_MED_antiTNF <int> 0, 0, 0, 0, 5, 0, 0, ...
## $ RA_COD_MED_methotrexate <int> 0, 0, 0, 0, 7, 0, 0, ...
## $ RA_NLP_analgesics <int> 52, 584, 63, 52, 466,...
## $ RA_NLP_antibodies <int> 6, 69, 3, 1, 13, 1, 4...
## $ RA_NLP_antiinflammatorydrugs <int> 51, 547, 60, 44, 336,...
## $ RA_NLP_antimalarialagents <int> 0, 0, 0, 0, 0, 0, 0, ...
## $ RA_NLP_antimicrobialagents <int> 1, 2, 0, 0, 125, 3, 2...
## $ RA_NLP_antirheumaticdrug <int> 51, 504, 60, 41, 533,...
## $ RA_NLP_biologicalagents <int> 14, 123, 12, 46, 98, ...
## $ RA_NLP_corticosteroids <int> 14, 371, 0, 0, 37, 29...
## $ RA_NLP_folicacid <int> 6, 0, 0, 1, 47, 14, 2...
## $ RA_NLP_folicacidantagonist <int> 0, 0, 0, 0, 75, 3, 14...
## $ RA_NLP_glucocorticoids <int> 14, 332, 0, 0, 28, 29...
## $ RA_NLP_immunologicalfactors <int> 6, 40, 3, 2, 32, 3, 5...
## $ RA_NLP_immunomodulators <int> 14, 371, 1, 0, 239, 3...
## $ RA_NLP_immunosuppressiveagents <int> 14, 371, 1, 0, 241, 3...
## $ RA_NLP_inflammatoryarthritis <int> 1, 44, 0, 0, 275, 77,...
## $ RA_NLP_jointspain <int> 0, 33, 0, 0, 13, 3, 5...
## $ RA_NLP_monoclonalantibody <int> 6, 15, 0, 0, 8, 1, 3,...
## $ RA_NLP_morningstiffness <int> 0, 10, 0, 0, 0, 13, 2...
## $ RA_NLP_naproxen <int> 0, 36, 0, 3, 27, 3, 0...
## $ RA_NLP_nonsteroidalantiinflammatorydrugs <int> 31, 170, 60, 39, 286,...
## $ RA_NLP_prednisone <int> 14, 300, 0, 0, 13, 26...
## $ RA_NLP_proteins <int> 17, 91, 53, 25, 61, 1...
## $ RA_NLP_rheumatoidarthritis <int> 0, 9, 0, 0, 16, 0, 5,...
## $ RA_NLP_synovitis <int> 0, 0, 0, 0, 12, 0, 3,...
## $ RA_NLP_tumornecrosisfactoralphablockers <int> 6, 0, 0, 0, 8, 1, 0, ...
emrselect R package for the statistical method:
We know \(\ Y\) = disease status = 0 or 1
on a training and validation set
We have other data \(\ \boldsymbol{X}\) = covariates
e.g. ICD codes, clinical variables, NLP features
Goal: Estimate \(\ \pi = P(Y=1 | \boldsymbol{X})\)
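As a minimal sketch of this supervised setting, the base R code below fits an ordinary logistic regression on simulated labeled data and returns an estimate of \(\ \pi\) for each patient. The data, the variable names (`icd_count`, `nlp_count`), and the coefficients are hypothetical stand-ins, not the RA data shown earlier.

```r
# Simulated stand-in for labeled EMR data; all names and values are hypothetical.
set.seed(1)
n <- 200
X <- data.frame(
  icd_count = rpois(n, lambda = 2),  # e.g. number of ICD codes for the disease
  nlp_count = rpois(n, lambda = 5)   # e.g. number of NLP mentions in notes
)

# Generate a 0/1 disease status Y from a known logistic model
eta <- -3 + 0.8 * X$icd_count + 0.2 * X$nlp_count
Y <- rbinom(n, size = 1, prob = plogis(eta))

# Fit logistic regression on the labeled data
fit <- glm(Y ~ icd_count + nlp_count, data = X, family = binomial())

# Estimated pi = P(Y = 1 | X) for each patient
pi_hat <- predict(fit, newdata = X, type = "response")
head(round(pi_hat, 3))
```

This only works because every patient has a label Y, which is what the surrogate-based approach tries to avoid.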
We do not know \(\ Y\) = disease status = 0 or 1
(except maybe on small subset)
We do know \(\ \boldsymbol{S} = S_1, S_2, \ldots\) = surrogate outcomes
e.g. ICD-9 code for RA, number of times RA is mentioned in notes (NLP feature count)
Goal: predict \(\ Y\) using \(\ \boldsymbol{S}\) and \(\ \boldsymbol{X}\)
Estimate \(\ \pi_S = P(Y=1 | \boldsymbol{S})\)
Assume two-component (Gaussian) mixture model:
\[\ \boldsymbol{S} \sim \tau\cdot f_1(\boldsymbol{s}) + (1-\tau)\cdot f_0 (\boldsymbol{s})\]
where \(\ \tau = P(Y=1)\)
Extends to multiple surrogates \(\ \boldsymbol{S}\) via multivariate mixture modeling
MathWorks Documentation https://www.mathworks.com/help/examples/stats/
mclust package
emrselect::ProbD.S() estimates \(\ \pi_S\) from mclust output with Bayes' rule:
\[\ \hat{\pi}_S = \frac{\hat{\tau}\hat{f}_1(s)}{\hat{\tau}\hat{f}_1(s) + (1-\hat{\tau}) \hat{f}_0(s)} \]
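To make the mixture step concrete, here is a minimal base R sketch for a single surrogate. It is not the `ProbD.S()` implementation: a hand-written EM loop stands in for the mclust fit, the data are simulated, and the helper name `posterior_case_prob` is hypothetical.

```r
# Simulate one surrogate S from a two-component Gaussian mixture (hypothetical values).
set.seed(42)
n <- 500
y <- rbinom(n, size = 1, prob = 0.3)               # latent disease status, unknown in practice
s <- rnorm(n, mean = ifelse(y == 1, 4, 0), sd = 1) # e.g. a transformed code count

# Bayes' rule: P(Y = 1 | S = s) from the mixture weight and component densities
posterior_case_prob <- function(s, tau, mu0, sd0, mu1, sd1) {
  num <- tau * dnorm(s, mean = mu1, sd = sd1)
  num / (num + (1 - tau) * dnorm(s, mean = mu0, sd = sd0))
}

# EM for a univariate two-component Gaussian mixture.
# Component 1 starts at the larger mean so it plays the role of "cases".
tau <- 0.5
mu0 <- min(s); mu1 <- max(s)
sd0 <- sd(s);  sd1 <- sd(s)
for (iter in 1:200) {
  z   <- posterior_case_prob(s, tau, mu0, sd0, mu1, sd1)  # E-step
  tau <- mean(z)                                          # M-step updates
  mu1 <- sum(z * s) / sum(z)
  mu0 <- sum((1 - z) * s) / sum(1 - z)
  sd1 <- sqrt(sum(z * (s - mu1)^2) / sum(z))
  sd0 <- sqrt(sum((1 - z) * (s - mu0)^2) / sum(1 - z))
}

# Surrogate-based probability of being a case, one value per patient
pi_S_hat <- posterior_case_prob(s, tau, mu0, sd0, mu1, sd1)
```

With several surrogates, the same Bayes' rule applies with multivariate densities in place of `dnorm`.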
See sections 6.8, 8.5 in The Elements of Statistical Learning (Hastie, Tibshirani, Friedman 2009)
Usual logistic negative log-likelihood objective function (glm):
\[\ n^{-1} \sum_{i=1}^{n} \left[ -y_i(\alpha + \boldsymbol{\beta}^{T}x_i) + \log(1+\exp(\alpha + \boldsymbol{\beta}^{T} x_i))\right]\]
Adaptive LASSO logistic regression (glmpath, glmnet):
\[\ n^{-1} \sum_{i=1}^{n} \left[ -y_i(\alpha + \boldsymbol{\beta}^{T}x_i) + \log(1+\exp(\alpha + \boldsymbol{\beta}^{T} x_i)) \right] + \color{red}{\lambda_n \sum_{j=1}^p |\beta_j|/|\tilde{\beta}_j|} \]
Our objective function (emrselect::Est.ALASSO.GLM):
\[\ n^{-1} \sum_{i=1}^{n}\left[ -\color{red}{ \hat{\pi}_{Si}}(\alpha + \boldsymbol{\beta}^{T}x_i) + \log(1+\exp(\alpha + \boldsymbol{\beta}^{T} x_i))\right] + \lambda_n \sum_{j=1}^p |\beta_j|/|\tilde{\beta}_j| \]
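The sketch below evaluates an objective of this form in base R, assuming the loss is the negative log-likelihood to be minimized, so the soft label \(\ \hat{\pi}_{Si}\) multiplies the full linear predictor with a minus sign. The function name `alasso_objective`, the simulated data, and the initial estimates `beta_init` (playing the role of \(\ \tilde{\beta}\)) are hypothetical; this is not the code inside `Est.ALASSO.GLM`.

```r
# Adaptive-LASSO-penalized logistic loss with soft labels pi_hat in place of y.
alasso_objective <- function(alpha, beta, X, pi_hat, lambda, beta_init) {
  eta <- alpha + as.vector(X %*% beta)             # linear predictor
  nll <- mean(-pi_hat * eta + log1p(exp(eta)))     # soft-label negative log-likelihood
  penalty <- lambda * sum(abs(beta) / abs(beta_init))  # adaptive LASSO weights
  nll + penalty
}

# Toy data: only the first covariate drives the surrogate-based probability.
set.seed(7)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
pi_hat <- plogis(1.5 * X[, 1])    # stand-in for the mixture-model probabilities
beta_init <- c(1.2, 0.1, 0.1)     # hypothetical initial estimates

obj_null  <- alasso_objective(0, c(0, 0, 0),   X, pi_hat, lambda = 0, beta_init)
obj_truth <- alasso_objective(0, c(1.5, 0, 0), X, pi_hat, lambda = 0, beta_init)
c(null = obj_null, truth = obj_truth)
```

Because the weights are \(\ 1/|\tilde{\beta}_j|\), coefficients with small initial estimates are penalized more heavily, which is what drives the feature selection.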
glmpath for fitting the regression:
\[\ \color{red}{ \hat{Y}_0^* = \mbox{logit}^{-1}(\hat{\alpha}+\hat{\boldsymbol{\gamma}}^{*T}\boldsymbol{S}_0 + \hat{\boldsymbol{\beta}}^{*T} \boldsymbol{X}_0)}\]
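The prediction step is an inverse logit applied to a linear predictor in the surrogates and the selected covariates. A minimal base R sketch follows, with a hypothetical helper name and made-up coefficient values rather than fitted glmpath output.

```r
# Predicted probability of being a case: logit^{-1}(alpha + gamma' S0 + beta' X0)
predict_case_prob <- function(alpha, gamma, beta, S0, X0) {
  plogis(alpha + as.vector(S0 %*% gamma) + as.vector(X0 %*% beta))
}

# Two hypothetical patients, one surrogate and two covariates
S0 <- matrix(c(0, 2), nrow = 2)      # surrogate values
X0 <- matrix(0, nrow = 2, ncol = 2)  # covariate values (all zero here)
y0_hat <- predict_case_prob(alpha = 0, gamma = 1, beta = c(0, 0), S0 = S0, X0 = X0)
y0_hat
```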
See sections 3.4, 4.4 in The Elements of Statistical Learning (Hastie, Tibshirani, Friedman 2009)
Goal:
prediction accuracy of the automated model \(\ \geq\) that of a model trained with labeled data
\(\ \Rightarrow\) reduce or avoid labeling!
\(\ \boldsymbol{X}\) = 81 predictors
Comparison: supervised model trained on labeled data with Y vs “automated feature selection” with S
With > 200 labels, the AUC of adaptive LASSO logistic regression \(\ \approx\) the AUC with automated feature selection.
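For reference, the AUC used in this kind of comparison can be computed from ranks (the Mann-Whitney form). The base R helper below is a minimal sketch with a hypothetical name and toy inputs; it is not taken from the emrselect package.

```r
# AUC = probability that a random case scores higher than a random non-case
auc <- function(score, label) {
  r  <- rank(score)            # average ranks, so ties count as 1/2
  n1 <- sum(label == 1)        # number of cases
  n0 <- sum(label == 0)        # number of non-cases
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy example: 3 of the 4 case/non-case pairs are ordered correctly
auc(score = c(0.1, 0.4, 0.35, 0.8), label = c(0, 0, 1, 1))  # 0.75
```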
Minnier, Gronsbell, Yu, Liao, Cai (in progress)
Jessica Gronsbell, Harvard Biostatistics
Sheng Yu, Tsinghua University, Beijing, China
Katherine Liao, Brigham and Women’s Hospital, Boston
Tianxi Cai, Harvard Biostatistics
“Automated Feature Selection of Predictors in Electronic Medical Records Data” in progress
R package emrselect also in progress: https://github.com/jminnier/emrselect