emrselect
Automated Feature Selection of Predictors in Electronic Medical Records Data


Jessica Minnier; minnier@ohsu.edu

Tuesday, January 10, 2017

https://github.com/jminnier/emrselect
Slides available at http://bit.ly/wwc-emrselect

EMR Research Challenge

Large detailed data set + Complex disease +
Messy diagnostic codes + Free text fields

=

Who are true disease cases?

EMR Research Challenge

Liao et al. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ 2015;350:h1885

EMR Research Data

Liao et al. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ 2015;350:h1885

Natural Language Processing

Turn text notes into variables:

S. Murphy et al., Instrumenting the health care enterprise for discovery research in the genomic era, Genome Research, 19(9): 1675–1681 (2009). Copyright 2009 by Cold Spring Harbor Laboratory Press

Example

  • Rheumatoid Arthritis (RA)
  • Partners Healthcare System => 46k potential RA subjects
  • Goal: genetic study of RA patients, recruit a cohort of cases
  • Clues:
    • ICD-9 code for RA (false positive rate of ~30%, specificity 55%; Singh et al., 2004)
    • NLP features: mentions of “rheumatoid” or phrases such as “morning stiffness” in doctor’s notes
    • ICD code for related autoimmune disease
    • medications for RA
  • “Gold standard” diagnosis = team of clinicians read notes and score patients \(\ \Rightarrow\) small training data

Statistical Questions

  • Prediction or classification of diseased vs not-diseased
    • Diagnostic and prescription codes
    • Natural Language Processing (NLP)
    • clinical variables
    • best model? regularized regression?
  • Surrogate outcome
    • Diagnoses not precise (mismeasured), may also diagnose patients from notes (text)
    • “true” outcome is laborious/expensive/limited

Data!

## Observations: 100
## Variables: 37
## $ patient_num                              <int> 31, 19, 5, 13, 17, 2,...
## $ RA_GoldStandard                          <chr> "NULL", "N", "NULL", ...
## $ patient_gender                           <int> 1, 1, 1, 0, 0, 0, 1, ...
## $ HgA1C                                    <int> 0, 1, 31, 10, 0, 0, 0...
## $ HgA1C_2                                  <int> 0, 0, 1, 1, 0, 0, 0, ...
## $ RA_COD_DX_Lupus                          <int> 0, 0, 0, 0, 0, 0, 0, ...
## $ RA_COD_DX_Psoriaticarthritis             <int> 0, 0, 0, 0, 155, 8, 7...
## $ RA_COD_DX_RheumatoidArthritis            <int> 0, 0, 0, 0, 3, 2, 4, ...
## $ RA_COD_LAB_antiCCP                       <int> 0, 0, 0, 0, 0, 0, 0, ...
## $ RA_COD_LAB_RF                            <int> 0, 6, 0, 0, 1, 0, 0, ...
## $ RA_COD_MED_antiTNF                       <int> 0, 0, 0, 0, 5, 0, 0, ...
## $ RA_COD_MED_methotrexate                  <int> 0, 0, 0, 0, 7, 0, 0, ...
## $ RA_NLP_analgesics                        <int> 52, 584, 63, 52, 466,...
## $ RA_NLP_antibodies                        <int> 6, 69, 3, 1, 13, 1, 4...
## $ RA_NLP_antiinflammatorydrugs             <int> 51, 547, 60, 44, 336,...
## $ RA_NLP_antimalarialagents                <int> 0, 0, 0, 0, 0, 0, 0, ...
## $ RA_NLP_antimicrobialagents               <int> 1, 2, 0, 0, 125, 3, 2...
## $ RA_NLP_antirheumaticdrug                 <int> 51, 504, 60, 41, 533,...
## $ RA_NLP_biologicalagents                  <int> 14, 123, 12, 46, 98, ...
## $ RA_NLP_corticosteroids                   <int> 14, 371, 0, 0, 37, 29...
## $ RA_NLP_folicacid                         <int> 6, 0, 0, 1, 47, 14, 2...
## $ RA_NLP_folicacidantagonist               <int> 0, 0, 0, 0, 75, 3, 14...
## $ RA_NLP_glucocorticoids                   <int> 14, 332, 0, 0, 28, 29...
## $ RA_NLP_immunologicalfactors              <int> 6, 40, 3, 2, 32, 3, 5...
## $ RA_NLP_immunomodulators                  <int> 14, 371, 1, 0, 239, 3...
## $ RA_NLP_immunosuppressiveagents           <int> 14, 371, 1, 0, 241, 3...
## $ RA_NLP_inflammatoryarthritis             <int> 1, 44, 0, 0, 275, 77,...
## $ RA_NLP_jointspain                        <int> 0, 33, 0, 0, 13, 3, 5...
## $ RA_NLP_monoclonalantibody                <int> 6, 15, 0, 0, 8, 1, 3,...
## $ RA_NLP_morningstiffness                  <int> 0, 10, 0, 0, 0, 13, 2...
## $ RA_NLP_naproxen                          <int> 0, 36, 0, 3, 27, 3, 0...
## $ RA_NLP_nonsteroidalantiinflammatorydrugs <int> 31, 170, 60, 39, 286,...
## $ RA_NLP_prednisone                        <int> 14, 300, 0, 0, 13, 26...
## $ RA_NLP_proteins                          <int> 17, 91, 53, 25, 61, 1...
## $ RA_NLP_rheumatoidarthritis               <int> 0, 9, 0, 0, 16, 0, 5,...
## $ RA_NLP_synovitis                         <int> 0, 0, 0, 0, 12, 0, 3,...
## $ RA_NLP_tumornecrosisfactoralphablockers  <int> 6, 0, 0, 0, 8, 1, 0, ...

emrselect

R package for statistical method:

  1. use surrogate outcome(s) to estimate
    \(\ \pi_S\)=P(diseased | surrogates)
    • mixture model clustering
  2. build prediction model with \(\ \pi_S\) as outcome
    • logistic regression: regularized (adaptive LASSO), misspecified (\(\ \pi_S\) outcome)
  3. gold standard labels? re-fit regression on smaller data set

https://github.com/jminnier/emrselect

Prediction and Classification - Normal Setting

We know \(\ Y\) = disease status = 0 or 1
on a training and validation set

We have other data \(\ \boldsymbol{X}\) = covariates
i.e. ICD codes, clinical variables, NLP features

Goal: Estimate \(\ \pi = P(Y=1 | \boldsymbol{X})\)

Prediction with Surrogate Outcomes

We do not know \(\ Y\) = disease status = 0 or 1
(except maybe on small subset)

We do know \(\ \boldsymbol{S} = S_1, S_2, \ldots\) = surrogate outcomes
i.e. ICD-9 code for RA, # of times RA mentioned in note (NLP feature count)

Goal: predict \(\ Y\) using \(\ \boldsymbol{S}\) and \(\ \boldsymbol{X}\)

  1. estimate \(\ \pi_S = P(Y=1 | \boldsymbol{S})\)
  2. use \(\ \hat{\pi}_S\) as outcome in prediction model with \(\ \boldsymbol{X}\) as covariates
  3. estimate \(\ \pi = P(Y=1 | \boldsymbol{X}, \boldsymbol{S})\)

Mixture Model

Estimate \(\ \pi_S = P(Y=1 | \boldsymbol{S})\)

Assume two-component (Gaussian) mixture model:

\[\ \boldsymbol{S} \sim \tau\cdot f_1(\boldsymbol{s}) + (1-\tau)\cdot f_0 (\boldsymbol{s})\]

where \(\ \tau = P(Y=1)\)

Mixture Model

extend to multiple \(\ \boldsymbol{S}\) surrogates = multivariate mixture modeling

MathWorks Documentation https://www.mathworks.com/help/examples/stats/

Mixture Model

  • solve with Expectation-Maximization algorithm
    • Gaussian mixture modeling with the mclust package
    • estimates mean and variance of the two Normal distributions, and \(\ P(Y=1)\)
  • emrselect::ProbD.S() estimates \(\ \pi_S\) from mclust output with Bayes Rule

\[\ \hat{\pi}_S = \frac{\hat{\tau}\hat{f}_1(s)}{\hat{\tau}\hat{f}_1(s) + (1-\hat{\tau}) \hat{f}_0(s)} \]
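The EM fit and the Bayes rule step can be illustrated with a small sketch for a single surrogate. This is not the mclust or emrselect::ProbD.S() implementation: it is a hand-written univariate EM in Python, run on simulated data, and all function and variable names (`fit_mixture_em`, `posterior_prob`, `surrogate`) are made up for illustration.

```python
import math
import random


def normal_pdf(s, mean, sd):
    """Density of a Normal(mean, sd^2) distribution evaluated at s."""
    z = (s - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_prob(s, tau, mean1, sd1, mean0, sd0):
    """Bayes rule: P(Y=1 | S=s) under a two-component Gaussian mixture."""
    num = tau * normal_pdf(s, mean1, sd1)
    den = num + (1.0 - tau) * normal_pdf(s, mean0, sd0)
    return num / den


def fit_mixture_em(values, n_iter=200):
    """Estimate tau, means and sds by EM; component 1 starts at the largest value."""
    n = len(values)
    center = sum(values) / n
    spread = math.sqrt(sum((v - center) ** 2 for v in values) / n)
    tau, mean0, mean1 = 0.5, min(values), max(values)
    sd0 = sd1 = spread
    for _ in range(n_iter):
        # E-step: current posterior probability of belonging to component 1
        resp = [posterior_prob(v, tau, mean1, sd1, mean0, sd0) for v in values]
        # M-step: weighted proportion, means and standard deviations
        n1 = sum(resp)
        n0 = n - n1
        tau = n1 / n
        mean1 = sum(r * v for r, v in zip(resp, values)) / n1
        mean0 = sum((1.0 - r) * v for r, v in zip(resp, values)) / n0
        var1 = sum(r * (v - mean1) ** 2 for r, v in zip(resp, values)) / n1
        var0 = sum((1.0 - r) * (v - mean0) ** 2 for r, v in zip(resp, values)) / n0
        sd1 = max(math.sqrt(var1), 1e-6)  # floor avoids a collapsed component
        sd0 = max(math.sqrt(var0), 1e-6)
    return tau, mean1, sd1, mean0, sd0


# Simulated surrogate (assumption): 300 non-cases near 0, 100 cases near 4
random.seed(1)
surrogate = [random.gauss(0.0, 1.0) for _ in range(300)]
surrogate += [random.gauss(4.0, 1.0) for _ in range(100)]

tau, mean1, sd1, mean0, sd0 = fit_mixture_em(surrogate)
pi_s = [posterior_prob(s, tau, mean1, sd1, mean0, sd0) for s in surrogate]
print("estimated tau:", round(tau, 3))
```

Each entry of `pi_s` plays the role of \(\ \hat{\pi}_S\) for one patient and can be passed on as the outcome of the prediction model.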


See sections 6.8, 8.5 in The Elements of Statistical Learning (Hastie, Tibshirani, Friedman 2009)

Prediction Model

Usual logistic negative log-likelihood objective function (glm):

\[\ n^{-1} \sum_{i=1}^{n} \left[ -y_i(\alpha + \boldsymbol{\beta}^{T}x_i) + \log(1+\exp(\alpha + \boldsymbol{\beta}^{T} x_i))\right]\]

Adaptive LASSO logistic regression (glmpath, glmnet):

\[\ n^{-1} \sum_{i=1}^{n} \left[ -y_i(\alpha + \boldsymbol{\beta}^{T}x_i) + \log(1+\exp(\alpha + \boldsymbol{\beta}^{T} x_i)) \right] + \color{red}{\lambda_n \sum_{j=1}^p |\beta_j|/|\tilde{\beta}_j|} \]

Our objective function (emrselect::Est.ALASSO.GLM):

\[\ n^{-1} \sum_{i=1}^{n}\left[ -\color{red}{ \hat{\pi}_{Si}}(\alpha + \boldsymbol{\beta}^{T}x_i) + \log(1+\exp(\alpha + \boldsymbol{\beta}^{T} x_i))\right] + \lambda_n \sum_{j=1}^p |\beta_j|/|\tilde{\beta}_j| \]
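To make the last objective concrete, here is a minimal Python sketch that evaluates it for given coefficients. It assumes the standard negative log-likelihood form, with per-observation loss \(\ -\hat{\pi}_{Si}\,\eta_i + \log(1+\exp(\eta_i))\) and \(\ \eta_i = \alpha + \boldsymbol{\beta}^{T}x_i\), and takes \(\ \tilde{\beta}_j\) to be nonzero initial estimates, as is usual for the adaptive LASSO. The function name `alasso_objective` and the toy numbers are hypothetical; this is not the code of emrselect::Est.ALASSO.GLM.

```python
import math


def log1p_exp(eta):
    """Numerically stable log(1 + exp(eta))."""
    return max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))


def alasso_objective(alpha, beta, x_rows, pi_s, beta_init, lam):
    """Average soft-label logistic loss plus the adaptive LASSO penalty."""
    n = len(x_rows)
    loss = 0.0
    for x_i, p_i in zip(x_rows, pi_s):
        eta = alpha + sum(b * x for b, x in zip(beta, x_i))
        loss += -p_i * eta + log1p_exp(eta)
    # Each coefficient is penalized in inverse proportion to its initial estimate
    penalty = lam * sum(abs(b) / abs(b0) for b, b0 in zip(beta, beta_init))
    return loss / n + penalty


# Toy inputs (assumptions): two patients, two predictors
x_rows = [[1.0, 0.0], [0.0, 1.0]]
pi_s = [0.9, 0.2]        # soft outcomes from the mixture model step
beta_init = [1.0, 0.5]   # initial estimates that define the adaptive weights

obj_null = alasso_objective(0.0, [0.0, 0.0], x_rows, pi_s, beta_init, lam=0.1)
obj_fit = alasso_objective(0.0, [1.0, -0.5], x_rows, pi_s, beta_init, lam=0.1)
obj_fit_unpenalized = alasso_objective(0.0, [1.0, -0.5], x_rows, pi_s, beta_init, lam=0.0)
print(obj_null, obj_fit)
```

Replacing `pi_s` with 0/1 labels gives the ordinary adaptive LASSO logistic objective, which is the only difference between the second and third formulas.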

Prediction Model

  • Quasi-logistic regression with \(\ \hat{\pi}_S\) response, \(\ \boldsymbol{X}\) predictors
  • regularized regression to select important \(\ \boldsymbol{X}\) variables
    • minimize adaptive LASSO logistic objective func. w.r.t. \(\ \boldsymbol{\beta}\)
    • Package glmpath for fitting regression
    • Select tuning parameter \(\lambda_n\) with BIC
  • Statistics
    • Theory says we can select important \(\ \boldsymbol{X}\) well
    • Obtain estimate of \(\ E(\pi_S|\boldsymbol{X})\)
  • Prediction: \[\ \color{red}{ \hat{Y_0} = \mbox{logit}^{-1}(\hat{\alpha}+\hat{\boldsymbol{\beta}}^{T} \boldsymbol{X_0})}\]
  • Prediction with labels (refit regression on labeled data)

\[\ \color{red}{ \hat{Y_0}^* = \mbox{logit}^{-1}(\hat{\alpha}+\hat{\boldsymbol{\gamma}}^{*T}\boldsymbol{S}_0 + \hat{\boldsymbol{\beta}}^{*T} \boldsymbol{X_0})}\]
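The fitting and prediction steps above can be sketched in a few lines of Python. Two caveats: the sketch swaps in plain proximal gradient descent (soft-thresholding) where the talk uses the glmpath package, and it fixes \(\ \lambda_n\) by hand instead of selecting it with BIC. The data are simulated, the refit with gold standard labels is not shown, and the names `fit_quasi_logistic` and `predict_prob` are hypothetical.

```python
import math
import random


def sigmoid(eta):
    """Numerically stable inverse logit."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_quasi_logistic(x_rows, pi_s, lam=0.0, weights=None, step=1.0, n_iter=500):
    """Proximal gradient descent for the soft-label logistic loss with a weighted L1 penalty."""
    n, p = len(x_rows), len(x_rows[0])
    weights = weights or [1.0] * p
    alpha, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        resid = [sigmoid(alpha + sum(b * x for b, x in zip(beta, row))) - target
                 for row, target in zip(x_rows, pi_s)]
        grad_alpha = sum(resid) / n
        grad_beta = [sum(r * row[j] for r, row in zip(resid, x_rows)) / n
                     for j in range(p)]
        alpha -= step * grad_alpha  # the intercept is not penalized
        beta = [soft_threshold(b - step * g, step * lam * w)
                for b, g, w in zip(beta, grad_beta, weights)]
    return alpha, beta


def predict_prob(alpha, beta, x0):
    """Predicted probability logit^{-1}(alpha + beta^T x0) for a new patient."""
    return sigmoid(alpha + sum(b * x for b, x in zip(beta, x0)))


# Simulated data (assumption): the soft outcome depends on column 0 only
random.seed(7)
x_rows = [[random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(200)]
pi_s = [min(1.0, max(0.0, sigmoid(-0.5 + 1.5 * row[0]) + random.gauss(0.0, 0.05)))
        for row in x_rows]

# Step 1: unpenalized fit gives the initial estimates for the adaptive weights
_, beta_init = fit_quasi_logistic(x_rows, pi_s)
weights = [1.0 / max(abs(b), 1e-8) for b in beta_init]

# Step 2: adaptive LASSO fit with a hand-picked lambda
alpha_hat, beta_hat = fit_quasi_logistic(x_rows, pi_s, lam=0.05, weights=weights)
print("selected coefficients:", beta_hat)
print("prediction for x0 = (1, 0):", predict_prob(alpha_hat, beta_hat, [1.0, 0.0]))
```

In this toy setting the noise column receives a large adaptive weight and its coefficient is set exactly to zero, which is the variable selection behavior the slide refers to.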


See sections 3.4, 4.4 in The Elements of Statistical Learning (Hastie, Tibshirani, Friedman 2009)

How well does the model work?

  • Measures of prediction accuracy on “labeled” data (validation data)
    • AUC = area under the Receiver Operating Characteristic (ROC) curve
  • Model size, variables selected
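For reference, the AUC on a labeled validation set can be computed directly from its pairwise definition: the proportion of (case, non-case) pairs in which the case receives the higher predicted score, with ties counted as one half. The Python sketch below uses made-up labels and scores, and the function name `auc` is only illustrative.

```python
def auc(labels, scores):
    """AUC as the probability that a random case outscores a random non-case."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up gold standard labels and predicted probabilities
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.6]
auc_value = auc(labels, scores)
print(round(auc_value, 4))
```

The double loop is quadratic in the number of labeled patients, which is fine for a few hundred labels; rank-based formulas give the same value faster on larger sets.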

Goal:
prediction accuracy of automated model \(\ \geq\) accuracy of model fit w/ labeled data
\(\ \Rightarrow\) reduce or avoid labeling!

Classify RA patients

  • Partners Healthcare System
    • 46k potential RA subjects (at least one ICD-9 code for RA or related diseases, or had received a common RA diagnostic test)
    • 435 gold standard labels by a team of rheumatologists
  • Surrogates: counts of NLP mentions of RA in records, RA ICD-9 code
  • \(\ \boldsymbol{X}\) = 81 predictors

  • adaptive LASSO \(\ \leftrightarrow\) 32 predictors
    • NLP: morning stiffness, methotrexate, ultrasound, MRI
  • labeled data with Y vs “automated feature selection” with S

Results

With > 200 labels, AUC of adaptive LASSO logistic regression fit on labeled data \(\ \approx\) AUC with automated feature selection.

Results

Minnier, Gronsbell, Yu, Liao, Cai (in progress)

Future

  • Improve code: efficiency, speed, commenting and documentation
  • Add simulation and analysis code, with example data set
  • Other uses?
    • Risk prediction of related diseases?
    • Select cohorts with high risk for prospective studies?
    • Clustering and prediction with other data?
  • Other machine learning methods for prediction?

Explore EMR data (i2b2.org)

Thank You

Jessica Gronsbell, Harvard Biostatistics
Sheng Yu, Tsinghua University, Beijing, China
Katherine Liao, Brigham and Women’s Hospital, Boston
Tianxi Cai, Harvard Biostatistics

“Automated Feature Selection of Predictors in Electronic Medical Records Data” in progress

R package emrselect also in progress: https://github.com/jminnier/emrselect