`emrselect`
Automated Feature Selection of Predictors in Electronic Medical Records Data

Jessica Minnier; minnier@ohsu.edu

Tuesday, January 10, 2017

https://github.com/jminnier/emrselect
Slides available at http://bit.ly/wwc-emrselect

EMR Research Challenge

Large detailed data set + Complex disease +
Messy diagnostic codes + Free text fields

Who are true disease cases?

EMR Research Challenge

Liao et al. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ 2015;350:h1885

EMR Research Data

Liao et al. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ 2015;350:h1885

Natural Language Processing

Turn text notes into variables:

Example

Rheumatoid Arthritis (RA)
Partners Healthcare System => 46k potential RA subjects
Goal: genetic study of RA patients, recruit a cohort of cases
Clues:
- ICD-9 code for RA (false positive rate of ~30%, specificity 55%; Singh et al, 2004)
- NLP features: mentions of “rheumatoid” or phrases such as “morning stiffness” in doctor’s notes
- ICD code for related autoimmune disease
- medications for RA
“Gold standard” diagnosis = team of clinicians read notes and score patients \(\ \Rightarrow\) small training data

Statistical Questions

Prediction or classification of diseased vs not-diseased
- Diagnostic and prescription codes
- Natural Language Processing (NLP)
- clinical variables
- best model? regularized regression?
Surrogate outcome
- Diagnoses not precise (mismeasured), may also diagnose patients from notes (text)
- “true” outcome is laborious/expensive/limited

Data!

## Observations: 100
## Variables: 37
## $ patient_num                              <int> 31, 19, 5, 13, 17, 2,...
## $ RA_GoldStandard                          <chr> "NULL", "N", "NULL", ...
## $ patient_gender                           <int> 1, 1, 1, 0, 0, 0, 1, ...
## $ HgA1C                                    <int> 0, 1, 31, 10, 0, 0, 0...
## $ HgA1C_2                                  <int> 0, 0, 1, 1, 0, 0, 0, ...
## $ RA_COD_DX_Lupus                          <int> 0, 0, 0, 0, 0, 0, 0, ...
## $ RA_COD_DX_Psoriaticarthritis             <int> 0, 0, 0, 0, 155, 8, 7...
## $ RA_COD_DX_RheumatoidArthritis            <int> 0, 0, 0, 0, 3, 2, 4, ...
## $ RA_COD_LAB_antiCCP                       <int> 0, 0, 0, 0, 0, 0, 0, ...
## $ RA_COD_LAB_RF                            <int> 0, 6, 0, 0, 1, 0, 0, ...
## $ RA_COD_MED_antiTNF                       <int> 0, 0, 0, 0, 5, 0, 0, ...
## $ RA_COD_MED_methotrexate                  <int> 0, 0, 0, 0, 7, 0, 0, ...
## $ RA_NLP_analgesics                        <int> 52, 584, 63, 52, 466,...
## $ RA_NLP_antibodies                        <int> 6, 69, 3, 1, 13, 1, 4...
## $ RA_NLP_antiinflammatorydrugs             <int> 51, 547, 60, 44, 336,...
## $ RA_NLP_antimalarialagents                <int> 0, 0, 0, 0, 0, 0, 0, ...
## $ RA_NLP_antimicrobialagents               <int> 1, 2, 0, 0, 125, 3, 2...
## $ RA_NLP_antirheumaticdrug                 <int> 51, 504, 60, 41, 533,...
## $ RA_NLP_biologicalagents                  <int> 14, 123, 12, 46, 98, ...
## $ RA_NLP_corticosteroids                   <int> 14, 371, 0, 0, 37, 29...
## $ RA_NLP_folicacid                         <int> 6, 0, 0, 1, 47, 14, 2...
## $ RA_NLP_folicacidantagonist               <int> 0, 0, 0, 0, 75, 3, 14...
## $ RA_NLP_glucocorticoids                   <int> 14, 332, 0, 0, 28, 29...
## $ RA_NLP_immunologicalfactors              <int> 6, 40, 3, 2, 32, 3, 5...
## $ RA_NLP_immunomodulators                  <int> 14, 371, 1, 0, 239, 3...
## $ RA_NLP_immunosuppressiveagents           <int> 14, 371, 1, 0, 241, 3...
## $ RA_NLP_inflammatoryarthritis             <int> 1, 44, 0, 0, 275, 77,...
## $ RA_NLP_jointspain                        <int> 0, 33, 0, 0, 13, 3, 5...
## $ RA_NLP_monoclonalantibody                <int> 6, 15, 0, 0, 8, 1, 3,...
## $ RA_NLP_morningstiffness                  <int> 0, 10, 0, 0, 0, 13, 2...
## $ RA_NLP_naproxen                          <int> 0, 36, 0, 3, 27, 3, 0...
## $ RA_NLP_nonsteroidalantiinflammatorydrugs <int> 31, 170, 60, 39, 286,...
## $ RA_NLP_prednisone                        <int> 14, 300, 0, 0, 13, 26...
## $ RA_NLP_proteins                          <int> 17, 91, 53, 25, 61, 1...
## $ RA_NLP_rheumatoidarthritis               <int> 0, 9, 0, 0, 16, 0, 5,...
## $ RA_NLP_synovitis                         <int> 0, 0, 0, 0, 12, 0, 3,...
## $ RA_NLP_tumornecrosisfactoralphablockers  <int> 6, 0, 0, 0, 8, 1, 0, ...

`emrselect`

R package for statistical method:

use surrogate outcome(s) to estimate
\(\ \pi_S\)=P(diseased | surrogates)
- mixture model clustering
build prediction model with \(\ \pi_S\) as outcome
- logistic regression: regularized (adaptive LASSO), misspecified (\(\ \pi_S\) outcome)
gold standard labels? re-fit regression on smaller data set

https://github.com/jminnier/emrselect

Prediction and Classification - Normal Setting

We know \(\ Y\) = disease status = 0 or 1
on a training and validation set

We have other data \(\ \boldsymbol{X}\) = covariates
e.g. ICD codes, clinical variables, NLP features

Goal: Estimate \(\ \pi = P(Y=1 | \boldsymbol{X})\)

Prediction with Surrogate Outcomes

We do not know \(\ Y\) = disease status = 0 or 1
(except maybe on small subset)

We do know \(\ \boldsymbol{S} = S_1, S_2, \ldots\) = surrogate outcomes
e.g. ICD-9 code for RA, # of times RA mentioned in notes (NLP feature count)

Goal: predict \(\ Y\) using \(\ \boldsymbol{S}\) and \(\ \boldsymbol{X}\)

estimate \(\ \pi_S = P(Y=1 | \boldsymbol{S})\)
use \(\ \hat{\pi}_S\) as outcome in prediction model with \(\ \boldsymbol{X}\) as covariates
estimate \(\ \pi = P(Y=1 | \boldsymbol{X}, \boldsymbol{S})\)

Mixture Model

Estimate \(\ \pi_S = P(Y=1 | \boldsymbol{S})\)

Assume two-component (Gaussian) mixture model:

\[\ \boldsymbol{S} \sim \tau\cdot f_1(\boldsymbol{s}) + (1-\tau)\cdot f_0 (\boldsymbol{s})\]

where \(\ \tau = P(Y=1)\)
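A minimal Python sketch of the mixture density above for a single surrogate, assuming Gaussian components. The parameter values and function names are hypothetical, chosen only for illustration; they are not estimates from the RA data and not part of `emrselect`.

```python
import math

def normal_pdf(s, mean, sd):
    """Density of a Normal(mean, sd^2) distribution at s."""
    z = (s - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_density(s, tau, mean1, sd1, mean0, sd0):
    """tau * f_1(s) + (1 - tau) * f_0(s) for one surrogate value s."""
    return tau * normal_pdf(s, mean1, sd1) + (1.0 - tau) * normal_pdf(s, mean0, sd0)

# Hypothetical parameters: 30% of patients diseased (tau = 0.3),
# diseased surrogates centred at 3.0, non-diseased at 0.5.
TAU, MEAN1, SD1, MEAN0, SD0 = 0.3, 3.0, 1.0, 0.5, 0.5

for s in (0.5, 2.0, 3.0):
    print(s, round(mixture_density(s, TAU, MEAN1, SD1, MEAN0, SD0), 4))
```

The density has one peak per component; how high each peak is depends on both the component's spread and its weight (tau or 1 - tau).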

Mixture Model

extend to multiple \(\ \boldsymbol{S}\) surrogates = multivariate mixture modeling

MathWorks Documentation https://www.mathworks.com/help/examples/stats/

Mixture Model

solve with Expectation-Maximization algorithm
Gaussian mixture modeling with the mclust package
- estimates mean and variance of the two Normal distributions, and \(\ P(Y=1)\)
emrselect::ProbD.S() estimates \(\ \pi_S\) from mclust output with Bayes Rule

\[\ \hat{\pi}_S = \frac{\hat{\tau}\hat{f}_1(s)}{\hat{\tau}\hat{f}_1(s) + (1-\hat{\tau}) \hat{f}_0(s)} \]

See sections 6.8, 8.5 in The Elements of Statistical Learning (Hastie, Tibshirani, Friedman 2009)
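The two steps on this slide (EM fit, then Bayes rule) can be sketched in plain Python for one surrogate. This is a hand-written illustration on simulated data, not the mclust or `emrselect::ProbD.S()` implementation; the function names, starting values, and simulated surrogate are all assumptions of the sketch.

```python
import math
import random

def normal_pdf(s, mean, sd):
    """Density of a Normal(mean, sd^2) distribution at s."""
    z = (s - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior(s, tau, mean1, sd1, mean0, sd0):
    """Bayes rule: P(Y=1 | S=s) under a two-component Gaussian mixture."""
    num = tau * normal_pdf(s, mean1, sd1)
    return num / (num + (1.0 - tau) * normal_pdf(s, mean0, sd0))

def fit_em(values, n_iter=200, min_sd=1e-3):
    """EM for a univariate two-component Gaussian mixture.

    Component 1 starts near the top of the data range, so it plays the
    role of the 'diseased' cluster (an assumption of this sketch).
    """
    lo, hi = min(values), max(values)
    spread = (hi - lo) / 4.0
    tau, mean1, sd1, mean0, sd0 = 0.5, hi - spread, spread, lo + spread, spread
    n = len(values)
    for _ in range(n_iter):
        # E-step: posterior probability that each value came from component 1
        resp = [posterior(s, tau, mean1, sd1, mean0, sd0) for s in values]
        # M-step: weighted proportion, means, and standard deviations
        w1 = sum(resp)
        w0 = n - w1
        tau = w1 / n
        mean1 = sum(r * s for r, s in zip(resp, values)) / w1
        mean0 = sum((1.0 - r) * s for r, s in zip(resp, values)) / w0
        var1 = sum(r * (s - mean1) ** 2 for r, s in zip(resp, values)) / w1
        var0 = sum((1.0 - r) * (s - mean0) ** 2 for r, s in zip(resp, values)) / w0
        sd1 = max(min_sd, math.sqrt(var1))
        sd0 = max(min_sd, math.sqrt(var0))
    return tau, mean1, sd1, mean0, sd0

# Simulated surrogate (hypothetical scale): 300 non-cases, then 100 cases.
random.seed(1)
surrogate = ([random.gauss(0.5, 0.5) for _ in range(300)]
             + [random.gauss(3.0, 0.7) for _ in range(100)])

tau_hat, mean1_hat, sd1_hat, mean0_hat, sd0_hat = fit_em(surrogate)
# Estimated pi_S for every patient, usable as a soft outcome downstream
pi_s_hat = [posterior(s, tau_hat, mean1_hat, sd1_hat, mean0_hat, sd0_hat)
            for s in surrogate]
print(round(tau_hat, 3), round(mean1_hat, 2), round(mean0_hat, 2))
```

Which component counts as "diseased" is not identified by the mixture model itself; here it is fixed by the starting values, and in practice it would be fixed by knowing that higher surrogate counts indicate disease.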

Prediction Model

Usual logistic likelihood objective function (glm), written as a negative log-likelihood to be minimized:

\[\ n^{-1} \sum_{i=1}^{n} \left[ -y_i(\alpha + \boldsymbol{\beta}^{T}x_i) + \log(1+\exp(\alpha + \boldsymbol{\beta}^{T} x_i))\right]\]

Adaptive LASSO logistic regression (glmpath, glmnet):

\[\ n^{-1} \sum_{i=1}^{n} \left[ -y_i(\alpha + \boldsymbol{\beta}^{T}x_i) + \log(1+\exp(\alpha + \boldsymbol{\beta}^{T} x_i)) \right] + \color{red}{\lambda_n \sum_{j=1}^p |\beta_j|/|\tilde{\beta}_j|} \]

Our objective function (emrselect::Est.ALASSO.GLM):

\[\ n^{-1} \sum_{i=1}^{n}\left[ -\color{red}{ \hat{\pi}_{Si}}(\alpha + \boldsymbol{\beta}^{T}x_i) + \log(1+\exp(\alpha + \boldsymbol{\beta}^{T} x_i))\right] + \lambda_n \sum_{j=1}^p |\beta_j|/|\tilde{\beta}_j| \]
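A minimal Python sketch of evaluating this kind of objective, assuming it is written as a penalized negative quasi-log-likelihood (so smaller is better), with linear predictor eta_i = alpha + beta^T x_i and the soft outcome pi_S in place of y. The function name and the toy numbers are hypothetical; this is not the code inside `Est.ALASSO.GLM`.

```python
import math

def softplus(eta):
    """Numerically stable log(1 + exp(eta))."""
    return max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))

def alasso_objective(alpha, beta, X, pi_s, beta_init, lam):
    """Average of [-pi_s_i * eta_i + log(1 + exp(eta_i))]
    plus lam * sum_j |beta_j| / |beta_init_j| (adaptive LASSO penalty)."""
    loss = 0.0
    for x_i, p_i in zip(X, pi_s):
        eta = alpha + sum(b * x for b, x in zip(beta, x_i))
        loss += -p_i * eta + softplus(eta)
    penalty = lam * sum(abs(b) / abs(b0) for b, b0 in zip(beta, beta_init))
    return loss / len(X) + penalty

# Toy inputs (hypothetical): three patients, two predictors, soft outcomes.
X = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
pi_s = [0.9, 0.2, 0.95]
beta_init = [2.0, 1.0]  # initial estimates that set the adaptive weights

print(alasso_objective(0.0, [1.0, -0.5], X, pi_s, beta_init, lam=0.1))
```

Dividing each |beta_j| by an initial estimate means coefficients that start out small are penalized more heavily, which is what makes the LASSO "adaptive".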

Prediction Model

Quasi-logistic regression with \(\ \hat{\pi}_S\) response, \(\ \boldsymbol{X}\) predictors
regularized regression to select important \(\ \boldsymbol{X}\) variables
- minimize adaptive LASSO logistic objective func. w.r.t. \(\ \boldsymbol{\beta}\)
- Package glmpath for fitting regression
- Select tuning parameter \(\lambda_n\) with BIC
Statistics
- Theory says we can select important \(\ \boldsymbol{X}\) well
- Obtain estimate of \(\ E(\pi_S|\boldsymbol{X})\)
Prediction: \[\ \color{red}{ \hat{Y_0} = \mbox{logit}^{-1}(\hat{\alpha}+\hat{\boldsymbol{\beta}}^{T} \boldsymbol{X_0})}\]
Prediction with labels (refit regression on labeled data)

\[\ \color{red}{ \hat{Y_0}^* = \mbox{logit}^{-1}(\hat{\alpha}+\hat{\boldsymbol{\gamma}}^{*T}\boldsymbol{S}_0 + \hat{\boldsymbol{\beta}}^{*T} \boldsymbol{X_0})}\]

See sections 3.4, 4.4 in The Elements of Statistical Learning (Hastie, Tibshirani, Friedman 2009)
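To make the fitting and prediction steps concrete, here is a small Python sketch of a quasi-logistic fit with soft outcomes and an adaptive LASSO penalty. It swaps in a hand-written proximal gradient loop with a fixed lambda, rather than the glmpath path algorithm and BIC tuning named on the slide; the function names, toy data, step size, and lambda value are all assumptions for illustration.

```python
import math

def inv_logit(eta):
    """Numerically stable inverse logit."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def soft_threshold(value, threshold):
    """Proximal operator of the absolute-value penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_alasso(X, pi_s, beta_init, lam, step=0.1, n_iter=2000):
    """Proximal gradient descent for the penalized quasi-logistic objective.

    pi_s holds soft outcomes in [0, 1]; beta_init supplies the adaptive
    weights 1/|beta_init_j|. The intercept alpha is not penalized.
    """
    n, p = len(X), len(X[0])
    weights = [1.0 / max(abs(b0), 1e-8) for b0 in beta_init]
    alpha, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals: fitted probability minus soft outcome
        resid = [inv_logit(alpha + sum(b * x for b, x in zip(beta, x_i))) - y
                 for x_i, y in zip(X, pi_s)]
        grad_beta = [sum(r * x_i[j] for r, x_i in zip(resid, X)) / n
                     for j in range(p)]
        alpha -= step * sum(resid) / n
        # Gradient step followed by soft-thresholding of each coefficient
        beta = [soft_threshold(b - step * g, step * lam * w)
                for b, g, w in zip(beta, grad_beta, weights)]
    return alpha, beta

def predict(alpha, beta, x0):
    """Predicted probability logit^{-1}(alpha + beta^T x0)."""
    return inv_logit(alpha + sum(b * x for b, x in zip(beta, x0)))

# Toy data (hypothetical): x1 drives the soft outcome, x2 is unrelated.
X = [[-2.0 + 4.0 * i / 39.0, 1.0 if i % 2 == 0 else -1.0] for i in range(40)]
pi_s = [inv_logit(2.0 * x_i[0]) for x_i in X]

# An initial unpenalized fit (lam = 0) gives the adaptive weights.
_, beta_tilde = fit_alasso(X, pi_s, beta_init=[1.0, 1.0], lam=0.0)
alpha_hat, beta_hat = fit_alasso(X, pi_s, beta_init=beta_tilde, lam=0.02)
print(beta_hat, round(predict(alpha_hat, beta_hat, [2.0, 1.0]), 3))
```

In this toy setup the unrelated predictor has a near-zero initial estimate, so its adaptive weight is very large and the penalized fit sets its coefficient exactly to zero, while the informative predictor is only mildly shrunk.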

How well does the model work?

Measures of prediction accuracy on “labeled” data (validation data)
- AUC = area under the Receiver Operating Characteristic (ROC) curve
Model size, variables selected
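As an illustration of the accuracy measure, AUC can be computed directly as the proportion of (case, non-case) pairs in which the case receives the higher predicted score, with ties counted as one half. The Python sketch below uses hypothetical labels and scores; the function name is an assumption and not part of `emrselect`.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of
    cases (label 1) against non-cases (label 0)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for s_pos in pos:
        for s_neg in neg:
            if s_pos > s_neg:
                wins += 1.0
            elif s_pos == s_neg:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical gold-standard labels and predicted probabilities
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 3 of the 4 case/non-case pairs are ordered correctly: 0.75
```

The pairwise form is quadratic in the number of labeled patients, which is fine for a validation set of a few hundred gold-standard labels.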

Goal:
prediction accuracy of automated model \(\ \geq\) accuracy of model fit with labeled data
\(\ \Rightarrow\) reduce or avoid labeling!

Classify RA patients

Partners Healthcare System
- 46k potential RA subjects (at least one ICD-9 code for RA and related diseases, or had received common RA diagnosis test)
- 435 gold standard labels by a team of rheumatologists
Surrogates: counts of NLP mentions of RA in records, RA ICD-9 code
\(\ \boldsymbol{X}\) = 81 predictors
adaptive LASSO \(\ \Rightarrow\) 32 predictors selected
- NLP: morning stiffness, methotrexate, ultrasound, MRI
labeled data with Y vs “automated feature selection” with S

Results

With > 200 labels, AUC of adaptive LASSO logistic regression fit on labeled data \(\ \approx\) AUC of automated feature selection.

Results

Minnier, Gronsbell, Yu, Liao, Cai (in progress)

Future

Improve code: efficiency, speed, commenting and documentation
Add simulation and analysis code, with example data set
Other uses?
- Risk prediction of related diseases?
- Select cohorts with high risk for prospective studies?
- Clustering and prediction with other data?
Other machine learning methods for prediction?

Explore EMR data (i2b2.org)

Thank You

Jessica Gronsbell, Harvard Biostatistics
Sheng Yu, Tsinghua University, Beijing, China
Katherine Liao, Brigham and Women’s Hospital, Boston
Tianxi Cai, Harvard Biostatistics

“Automated Feature Selection of Predictors in Electronic Medical Records Data” in progress

R package emrselect also in progress: https://github.com/jminnier/emrselect
