Holding pen for some submissions I made while working through courses in the Clinical Data Science Specialization offered by the University of Colorado. My plan is to revisit these projects and notes at a later date, then recompile and reformat them.

Clinical Natural Language Processing

August 2021: R, RStudio, R Markdown, Tidyverse, Regex.

library(tidyverse)
htmltools::includeHTML("c4w5_Clinical_DS_Project_V2.html")
Identifying Patients with Diabetic Complications

Saturday, July 31, 2021

Background

This project combines the tools and techniques covered in the course and applies them to the problem of identifying patients with diabetic complications from clinical text notes.

Specifically, a corpus of anonymized clinical notes from patients who have diabetes is used to identify which notes indicate diabetic complications of neuropathy, nephropathy, and/or retinopathy, and flag these complications accordingly.

We assess the text processing algorithm performance by identifying how many notes were correctly and incorrectly classified based on a reference key.

Load Both Diabetes Datasets

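# load both diabetes datasets
# bigquery() is exported by the bigrquery package, so it is called with its namespace here
con <- DBI::dbConnect(drv = bigrquery::bigquery(),
  project = "learnclinicaldatascience")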
diabetes_notes <- 
  tbl(con, "course4_data.diabetes_notes") %>%
  collect()
diabetes_goldstandard <- 
  tbl(con, "course4_data.diabetes_goldstandard") %>%
  collect()

Quick View of Diabetes Notes Data

summary(diabetes_notes)
##     NOTE_ID     NOTE_TYPE             TEXT          
##  Min.   :  1   Length:141         Length:141        
##  1st Qu.: 36   Class :character   Class :character  
##  Median : 71   Mode  :character   Mode  :character  
##  Mean   : 71                                        
##  3rd Qu.:106                                        
##  Max.   :141

Quick View of Diabetes Notes Data (cont’)

options(width = 70)
head(diabetes_notes)
## # A tibble: 6 x 3
##   NOTE_ID NOTE_TYPE        TEXT                                       
##     <int> <chr>            <chr>                                      
## 1       1 History and Phy~ "CHIEF COMPLAINT:  Dog bite to his right l~
## 2       2 History and Phy~ "CHIEF COMPLAINT:    Left hip pain.\n\nHIS~
## 3       3 Discharge Summa~ "CC: Dysarthria\n\nHX: This 52y/o RHF was ~
## 4       4 Operative Note   "PRE-OP DIAGNOSIS:  Osteoporosis, patholog~
## 5       5 Discharge Summa~ "CC: Left hemibody numbness.\n\nHX: This 4~
## 6       6 Operative Note   "PREOPERATIVE DIAGNOSIS:  Left renal mass,~

Quick View of Reference Key

str(diabetes_goldstandard)
## tibble [141 x 5] (S3: tbl_df/tbl/data.frame)
##  $ NOTE_ID                  : int [1:141] 1 2 3 4 5 6 7 8 9 10 ...
##  $ ANY_DIABETIC_COMPLICATION: int [1:141] 0 0 0 0 0 1 1 0 0 0 ...
##  $ DIABETIC_NEUROPATHY      : int [1:141] 0 0 0 0 0 0 1 0 0 0 ...
##  $ DIABETIC_NEPHROPATHY     : int [1:141] 0 0 0 0 0 1 0 0 0 0 ...
##  $ DIABETIC_RETINOPATHY     : int [1:141] 0 0 0 0 0 0 0 0 0 0 ...

Quick View of Reference Key (cont’)

options(width = 70)
head(diabetes_goldstandard) 
## # A tibble: 6 x 5
##   NOTE_ID ANY_DIABETIC_COMPLICA~ DIABETIC_NEUROPAT~ DIABETIC_NEPHROPA~
##     <int>                  <int>              <int>              <int>
## 1       1                      0                  0                  0
## 2       2                      0                  0                  0
## 3       3                      0                  0                  0
## 4       4                      0                  0                  0
## 5       5                      0                  0                  0
## 6       6                      1                  0                  1
## # ... with 1 more variable: DIABETIC_RETINOPATHY <int>

Approach

Based on initial tests, we chose the delimiter method to split each note into sections so that selected sections can be excluded. To limit false positives, we remove sections such as Family History and Social History, along with one section with a blank header whose text contains family medical history.

Section headers to ignore include:

FAMILY AND SOCIAL HISTORY, FAMILY HISTORY, FAMILY HISTORY CONSISTENT WITH..., FAMILY HX, FAMILY MEDICAL HISTORY, FH, FHX, HX, SHX, SOCH, SOCIAL HISTORY, SOCIAL HISTORY THE PATIENT CURRENT SMOKES, SOCIAL HX

Once those sections are removed, we flag each remaining record as a diabetes false positive where applicable, and as indicating or ruling out neuropathy, nephropathy, or retinopathy.

Process Step 1

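Split diabetes_notes into one row per section, separate each section into a header and its text, and trim both fields. Then create an additional TEXT column: for poorly delimited sections, where the header is longer than 27 characters because it has absorbed the section text, TEXT combines the header with the section text; otherwise TEXT is the section text alone.
# the pipeline result is referenced as diabetes_notes_output in Step 5
diabetes_notes_output <- diabetes_notes %>%
  # the separator was lost when this page was rendered; "\n\n" (a blank line between sections) is an assumption based on the raw TEXT shown above
  separate_rows(TEXT, sep = "\n\n") %>%
  separate(TEXT, into = c("SECTION_HEADER", "SECTION_TEXT"), sep = ":", remove = TRUE, extra = "merge") %>%
  mutate(SECTION_HEADER = str_trim(SECTION_HEADER),
    SECTION_TEXT = ifelse(is.na(str_trim(SECTION_TEXT)), " ", str_trim(SECTION_TEXT)),
    TEXT = ifelse(str_length(SECTION_HEADER) > 27, paste(str_trim(SECTION_HEADER), str_trim(SECTION_TEXT)), str_trim(SECTION_TEXT))) %>%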

## tibble [2,121 x 5] (S3: tbl_df/tbl/data.frame)
##  $ NOTE_ID       : int [1:2121] 1 1 1 1 1 1 1 1 1 1 ...
##  $ NOTE_TYPE     : chr [1:2121] "History and Physical" "History and Physical" "History and Physical" "History and Physical" ...
##  $ SECTION_HEADER: chr [1:2121] "CHIEF COMPLAINT" "HISTORY OF PRESENT ILLNESS" "PAST MEDICAL HISTORY (PMH)" "ALLERGIES" ...
##  $ SECTION_TEXT  : chr [1:2121] "Dog bite to his right lower leg." "This 50-year-old white male earlier this afternoon was attempting to adjust a cable that a dog was tied to.  Do"| __truncated__ "Significant for history of pulmonary fibrosis and atrial fibrillation.  He is status post bilateral lung transp"| __truncated__ "There are no known allergies." ...
##  $ TEXT          : chr [1:2121] "Dog bite to his right lower leg." "This 50-year-old white male earlier this afternoon was attempting to adjust a cable that a dog was tied to.  Do"| __truncated__ "Significant for history of pulmonary fibrosis and atrial fibrillation.  He is status post bilateral lung transp"| __truncated__ "There are no known allergies." ...
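
As a rough illustration of the delimiter method in Step 1, the sketch below runs the same split on a single made-up note. The note text, the `toy_` names, the blank-line separator `"\n\n"`, and the `fill = "right"` argument (added only to avoid a warning for sections without a colon) are assumptions for this example, not part of the original submission.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical note: two well-delimited sections and one without a colon
toy_notes <- tibble(
  NOTE_ID = 1L,
  TEXT = paste(
    "CHIEF COMPLAINT:  Left hip pain.",
    "FAMILY HISTORY: Mother had diabetes.",
    "PAST MEDICAL HISTORY    Diabetes and high blood pressure.",
    sep = "\n\n"
  )
)

toy_sections <- toy_notes %>%
  # one row per section, assuming sections are separated by a blank line
  separate_rows(TEXT, sep = "\n\n") %>%
  # split at the first colon only; text after later colons stays in SECTION_TEXT
  separate(TEXT, into = c("SECTION_HEADER", "SECTION_TEXT"),
           sep = ":", extra = "merge", fill = "right") %>%
  mutate(
    SECTION_HEADER = str_trim(SECTION_HEADER),
    SECTION_TEXT = ifelse(is.na(SECTION_TEXT), " ", str_trim(SECTION_TEXT)),
    # long "headers" are treated as header plus text that was never delimited
    TEXT = ifelse(str_length(SECTION_HEADER) > 27,
                  paste(SECTION_HEADER, str_trim(SECTION_TEXT)),
                  str_trim(SECTION_TEXT))
  )

print(toy_sections)
```

The third section has no colon, so its whole line lands in SECTION_HEADER; because that header exceeds 27 characters, it is carried into TEXT and stays searchable in later steps.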

Process Step 2

Delete family and social history sections; remove one blank section in Note 82 which contains family history in the text.
filter(!str_detect(string = SECTION_HEADER, pattern = regex("^FAMILY|FH|^[F|S]?HX|^SOC|^$", ignore_case = TRUE))) %>%

## tibble [1,955 x 5] (S3: tbl_df/tbl/data.frame)
##  $ NOTE_ID       : int [1:1955] 1 1 1 1 1 1 1 1 1 1 ...
##  $ NOTE_TYPE     : chr [1:1955] "History and Physical" "History and Physical" "History and Physical" "History and Physical" ...
##  $ SECTION_HEADER: chr [1:1955] "CHIEF COMPLAINT" "HISTORY OF PRESENT ILLNESS" "PAST MEDICAL HISTORY (PMH)" "ALLERGIES" ...
##  $ SECTION_TEXT  : chr [1:1955] "Dog bite to his right lower leg." "This 50-year-old white male earlier this afternoon was attempting to adjust a cable that a dog was tied to.  Do"| __truncated__ "Significant for history of pulmonary fibrosis and atrial fibrillation.  He is status post bilateral lung transp"| __truncated__ "There are no known allergies." ...
##  $ TEXT          : chr [1:1955] "Dog bite to his right lower leg." "This 50-year-old white male earlier this afternoon was attempting to adjust a cable that a dog was tied to.  Do"| __truncated__ "Significant for history of pulmonary fibrosis and atrial fibrillation.  He is status post bilateral lung transp"| __truncated__ "There are no known allergies." ...
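
A quick way to sanity-check the Step 2 header filter is to run its pattern against headers that should and should not be dropped. This is only a sketch: the excluded headers are taken from the list in the Approach section, the kept headers from the Step 1 output, and the variable names are made up for the example.

```r
library(stringr)

# Header-exclusion pattern copied from Step 2
header_exclude <- regex("^FAMILY|FH|^[F|S]?HX|^SOC|^$", ignore_case = TRUE)

# Headers the Approach section lists for removal
excluded_headers <- c(
  "FAMILY AND SOCIAL HISTORY", "FAMILY HISTORY", "FAMILY HX",
  "FAMILY MEDICAL HISTORY", "FH", "FHX", "HX", "SHX", "SOCH",
  "SOCIAL HISTORY", "SOCIAL HX"
)

# Headers that appear in the Step 1 output and should survive the filter
kept_headers <- c(
  "CHIEF COMPLAINT", "HISTORY OF PRESENT ILLNESS",
  "PAST MEDICAL HISTORY (PMH)", "ALLERGIES"
)

excluded_hits <- str_detect(excluded_headers, header_exclude)
kept_hits <- str_detect(kept_headers, header_exclude)

print(all(excluded_hits))  # every listed header is caught by the pattern
print(any(kept_hits))      # none of the clinical headers is caught

# The ^$ alternative targets the blank header mentioned for Note 82
print(str_detect("", header_exclude))
```

Two properties of the pattern are worth keeping in mind when reworking it: `[F|S]` is a character class, so it also admits a literal `|` (`[FS]` would be the tighter spelling), and the `FH` alternative is unanchored, so it matches those two letters anywhere in a header.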

Process Step 3

Filter diabetes sections, then flag neuropathy, nephropathy and retinopathy. Adjust for false positives.
filter(str_detect(string = TEXT, pattern = regex("diabet(ic|es)", ignore_case = TRUE))) %>%
mutate(NO_DIAB = case_when(str_detect(string = TEXT, pattern = regex("gestational diabetes|No history of diabetes", ignore_case = TRUE)) ~ 1, TRUE ~ 0),
  DIAB_NEUROPATHY = case_when(str_detect(string = TEXT, pattern = regex("denies any comorbid complications of the diabetes including", ignore_case = TRUE)) ~ 0,
    str_detect(string = TEXT, pattern = regex("((diabetic|peripheral) neuropathy)|(Neuropath(ic|y))|(diabetic nerve pain)|(pain secondary to diabetes)", ignore_case = TRUE)) ~ 1, TRUE ~ 0),
  DIAB_NEPHROPATHY = case_when(str_detect(string = TEXT, pattern = regex("(denies any comorbid complications of the diabetes including)|(herbs which can cause chronic interstitial nephritis)", ignore_case = TRUE)) ~ 0,
    str_detect(string = TEXT, pattern = regex("((diabetic|peripheral|reflux) nephropathy)|(Nephropathy 2/2 Diabetes)|(glomerulonephropathy)|(end-stage renal disease)|(ESRD)|(chronic renal insufficiency)|(nephropath(ic|y))", ignore_case = TRUE)) ~ 1, TRUE ~ 0),

Process Step 3 (cont’)

Filter diabetes sections, then flag neuropathy, nephropathy and retinopathy. Adjust for false positives.
  DIAB_RETINOPATHY = case_when(str_detect(string = TEXT, pattern = regex("(denies any comorbid complications of the diabetes including)|(No retinopathy)|(does not show any evidence of diabetic retinopathy)", ignore_case = TRUE)) ~ 0,
    str_detect(string = TEXT, pattern = regex("((diabetic|peripheral|reflux) retinopathy)|(diabetic complications of retinopathy)|(optic nerve damage)", ignore_case = TRUE)) ~ 1, TRUE ~ 0)) %>%

Process Step 3 (cont’)

## tibble [190 x 9] (S3: tbl_df/tbl/data.frame)
##  $ NOTE_ID         : int [1:190] 2 4 6 7 8 8 9 10 11 12 ...
##  $ NOTE_TYPE       : chr [1:190] "History and Physical" "Operative Note" "Operative Note" "Operative Note" ...
##  $ SECTION_HEADER  : chr [1:190] "PAST MEDICAL HISTORY    Diabetes and high blood pressure." "INDICATIONS" "BRIEF HISTORY" "S - An 84-year-old diabetic female, 5'7-1/2\" tall, 148 pounds, history of hypertension and diabetes.  She pres"| __truncated__ ...
##  $ SECTION_TEXT    : chr [1:190] " " "Mrs. Smith is a 75-year-old female who has had severe back pain that began approximately three months ago and i"| __truncated__ "The patient is a 54-year-old female with history of diabetic nephropathy, diabetes, hypertension, left BKA, who"| __truncated__ " " ...
##  $ TEXT            : chr [1:190] "PAST MEDICAL HISTORY    Diabetes and high blood pressure. " "Mrs. Smith is a 75-year-old female who has had severe back pain that began approximately three months ago and i"| __truncated__ "The patient is a 54-year-old female with history of diabetic nephropathy, diabetes, hypertension, left BKA, who"| __truncated__ "S - An 84-year-old diabetic female, 5'7-1/2\" tall, 148 pounds, history of hypertension and diabetes.  She pres"| __truncated__ ...
##  $ NO_DIAB         : num [1:190] 0 0 0 0 0 0 1 0 1 0 ...
##  $ DIAB_NEUROPATHY : num [1:190] 0 0 0 1 0 0 0 0 0 0 ...
##  $ DIAB_NEPHROPATHY: num [1:190] 0 0 1 0 0 0 0 0 0 1 ...
##  $ DIAB_RETINOPATHY: num [1:190] 0 0 0 0 0 0 0 0 0 0 ...
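
The flags in Step 3 depend on the order of the `case_when()` branches: the exclusion pattern is tested first, so a section that rules out complications gets 0 even if it also contains an inclusion keyword. The sketch below shows this for the neuropathy flag only; the three sentences and the `toy_` names are invented for illustration, while the two patterns are copied from Step 3.

```r
library(dplyr)
library(stringr)

# Patterns copied from the neuropathy branch of Step 3
neuro_exclude <- regex(
  "denies any comorbid complications of the diabetes including",
  ignore_case = TRUE
)
neuro_include <- regex(
  "((diabetic|peripheral) neuropathy)|(Neuropath(ic|y))|(diabetic nerve pain)|(pain secondary to diabetes)",
  ignore_case = TRUE
)

# Hypothetical section texts
toy_text <- tibble(
  TEXT = c(
    "History of diabetes with peripheral neuropathy.",
    "He denies any comorbid complications of the diabetes including kidney disease or neuropathy.",
    "Diabetes, well controlled on metformin."
  )
)

toy_flags <- toy_text %>%
  mutate(DIAB_NEUROPATHY = case_when(
    str_detect(TEXT, neuro_exclude) ~ 0,  # checked first: explicit denial wins
    str_detect(TEXT, neuro_include) ~ 1,  # reached only when no denial was found
    TRUE ~ 0                              # no mention at all
  ))

print(toy_flags)
```

Swapping the first two branches would flag the second sentence as neuropathy, which is the kind of false positive the exclusion patterns are meant to prevent.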

Process Step 4

Summarize to note level, sum up and remap flags for neuropathy, nephropathy, retinopathy and create a flag for “any complication”.
group_by(NOTE_ID) %>%
summarize(DIABETIC_NO = sum(NO_DIAB),
  DIABETIC_NEUROPATHY = sum(DIAB_NEUROPATHY),
  DIABETIC_NEPHROPATHY = sum(DIAB_NEPHROPATHY),
  DIABETIC_RETINOPATHY = sum(DIAB_RETINOPATHY),
  NUM_ROWS = n()) %>%
ungroup() %>%
mutate(DIABETIC_NO = case_when(DIABETIC_NO >= 1 ~ 1, TRUE ~ 0),
  DIABETIC_NEUROPATHY = case_when(DIABETIC_NEUROPATHY >= 1 ~ 1, TRUE ~ 0),
  DIABETIC_NEPHROPATHY = case_when(DIABETIC_NEPHROPATHY >= 1 ~ 1, TRUE ~ 0),
  DIABETIC_RETINOPATHY = case_when(DIABETIC_RETINOPATHY >= 1 ~ 1, TRUE ~ 0),
  ANY_DIABETIC_COMPLICATION = case_when(DIABETIC_NEUROPATHY == 1 ~ 1,
    DIABETIC_NEPHROPATHY == 1 ~ 1,
    DIABETIC_RETINOPATHY == 1 ~ 1,
    TRUE ~ 0)) %>%
relocate(NOTE_ID, ANY_DIABETIC_COMPLICATION, DIABETIC_NEUROPATHY, DIABETIC_NEPHROPATHY, DIABETIC_RETINOPATHY)

Process Step 4 (cont’)

## tibble [120 x 7] (S3: tbl_df/tbl/data.frame)
##  $ NOTE_ID                  : int [1:120] 2 4 6 7 8 9 10 11 12 13 ...
##  $ ANY_DIABETIC_COMPLICATION: num [1:120] 0 0 1 1 0 0 0 0 1 1 ...
##  $ DIABETIC_NEUROPATHY      : num [1:120] 0 0 0 1 0 0 0 0 0 0 ...
##  $ DIABETIC_NEPHROPATHY     : num [1:120] 0 0 1 0 0 0 0 0 1 1 ...
##  $ DIABETIC_RETINOPATHY     : num [1:120] 0 0 0 0 0 0 0 0 0 0 ...
##  $ DIABETIC_NO              : num [1:120] 0 0 0 0 0 1 0 1 0 0 ...
##  $ NUM_ROWS                 : int [1:120] 1 1 1 1 2 1 1 1 1 1 ...
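
The note-level rollup in Step 4 amounts to "a note is flagged if any of its sections is flagged". A compact way to express the same idea is sketched below on made-up section flags; it uses `across()` rather than the explicit sum-then-recode of the original, and the `toy_` names and values are assumptions for the example.

```r
library(dplyr)

# Hypothetical section-level flags: note 1 has two sections, note 3 has two
toy_section_flags <- tibble(
  NOTE_ID          = c(1L, 1L, 2L, 3L, 3L),
  DIAB_NEUROPATHY  = c(1, 1, 0, 0, 0),
  DIAB_NEPHROPATHY = c(0, 0, 0, 1, 0),
  DIAB_RETINOPATHY = c(0, 0, 0, 0, 0)
)

toy_note_flags <- toy_section_flags %>%
  group_by(NOTE_ID) %>%
  # sum the section flags per note, then cap the result at 1
  summarize(across(starts_with("DIAB_"), ~ as.numeric(sum(.x) >= 1)),
            .groups = "drop") %>%
  # a note has "any complication" when at least one of the three flags is set
  mutate(ANY_DIABETIC_COMPLICATION =
           as.numeric(DIAB_NEUROPATHY + DIAB_NEPHROPATHY + DIAB_RETINOPATHY >= 1))

print(toy_note_flags)
```

Capping at 1 matters because a complication mentioned in several sections of the same note would otherwise produce counts above 1, which could not be compared directly with the 0/1 reference key.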

Process Step 5

Merge the note-level summary back onto the original notes table, together with the reference key, to validate accuracy.
diabetes_notes_final <- diabetes_notes %>%
  left_join(diabetes_goldstandard, by = "NOTE_ID", suffix = c("", "_gold")) %>%
  left_join(diabetes_notes_output, by = "NOTE_ID", suffix = c("", "_calc")) %>%
  select(-NUM_ROWS) %>%
  # notes with no diabetes section have no row in diabetes_notes_output, so their flags default to 0
  replace_na(list(DIABETIC_NO = 0, DIABETIC_NEUROPATHY_calc = 0,
    DIABETIC_NEPHROPATHY_calc = 0, DIABETIC_RETINOPATHY_calc = 0,
    ANY_DIABETIC_COMPLICATION_calc = 0)) %>%
  mutate(MATCHED = case_when(((DIABETIC_NEUROPATHY == DIABETIC_NEUROPATHY_calc) &
    (DIABETIC_NEPHROPATHY == DIABETIC_NEPHROPATHY_calc) &
    (DIABETIC_RETINOPATHY == DIABETIC_RETINOPATHY_calc) &
    (ANY_DIABETIC_COMPLICATION == ANY_DIABETIC_COMPLICATION_calc)) ~ 1, TRUE ~ 0))
diabetes_notes_final %>% str()

Process Step 5 (cont’)

## tibble [141 x 13] (S3: tbl_df/tbl/data.frame)
##  $ NOTE_ID                       : int [1:141] 1 2 3 4 5 6 7 8 9 10 ...
##  $ NOTE_TYPE                     : chr [1:141] "History and Physical" "History and Physical" "Discharge Summary" "Operative Note" ...
##  $ TEXT                          : chr [1:141] "CHIEF COMPLAINT:  Dog bite to his right lower leg.\n\nHISTORY OF PRESENT ILLNESS:  This 50-year-old white male "| __truncated__ "CHIEF COMPLAINT:    Left hip pain.\n\nHISTORY OF PRESENT ILLNESS  The patient is a 32-year-old male seen by Dr."| __truncated__ "CC: Dysarthria\n\nHX: This 52y/o RHF was transferred from a local hospital to UIHC on 10/28/94 with a history o"| __truncated__ "PRE-OP DIAGNOSIS:  Osteoporosis, pathologic fractures T12- L2 with severe kyphosis. POST-OP DIAGNOSIS:  Osteopo"| __truncated__ ...
##  $ ANY_DIABETIC_COMPLICATION     : int [1:141] 0 0 0 0 0 1 1 0 0 0 ...
##  $ DIABETIC_NEUROPATHY           : int [1:141] 0 0 0 0 0 0 1 0 0 0 ...
##  $ DIABETIC_NEPHROPATHY          : int [1:141] 0 0 0 0 0 1 0 0 0 0 ...
##  $ DIABETIC_RETINOPATHY          : int [1:141] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ANY_DIABETIC_COMPLICATION_calc: num [1:141] 0 0 0 0 0 1 1 0 0 0 ...
##  $ DIABETIC_NEUROPATHY_calc      : num [1:141] 0 0 0 0 0 0 1 0 0 0 ...
##  $ DIABETIC_NEPHROPATHY_calc     : num [1:141] 0 0 0 0 0 1 0 0 0 0 ...
##  $ DIABETIC_RETINOPATHY_calc     : num [1:141] 0 0 0 0 0 0 0 0 0 0 ...
##  $ DIABETIC_NO                   : num [1:141] 0 0 0 0 0 0 0 0 1 0 ...
##  $ MATCHED                       : num [1:141] 1 1 1 1 1 1 1 1 1 1 ...
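
The validation join in Step 5 can be pictured on a three-note toy example. The sketch below keeps a single label for brevity; the `toy_` tables and their values are invented, and only the join, `replace_na()`, and comparison pattern mirror the original.

```r
library(dplyr)
library(tidyr)

# Hypothetical reference key for three notes
toy_gold <- tibble(
  NOTE_ID = 1:3,
  DIABETIC_NEUROPATHY = c(0L, 1L, 0L)
)

# Hypothetical algorithm output: note 1 never mentioned diabetes, so it has no row
toy_calc <- tibble(
  NOTE_ID = c(2L, 3L),
  DIABETIC_NEUROPATHY = c(1, 1)
)

toy_final <- toy_gold %>%
  # the shared column name gets "_calc" appended on the algorithm side
  left_join(toy_calc, by = "NOTE_ID", suffix = c("", "_calc")) %>%
  # a missing row means "not flagged", so NA becomes 0
  replace_na(list(DIABETIC_NEUROPATHY_calc = 0)) %>%
  mutate(MATCHED = as.numeric(DIABETIC_NEUROPATHY == DIABETIC_NEUROPATHY_calc))

print(toy_final)
```

Without the `replace_na()` step, the comparison for note 1 would yield `NA` rather than a match, so notes that never mention diabetes would silently drop out of the accuracy count.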

Regular Expression (Regex)

- Family or Social History Sections, or blank (for removal): "^FAMILY|FH|^[F|S]?HX|^SOC|^$"
- Flag mentions of diabetes that can be excluded: "gestational diabetes|No history of diabetes"
- Flag mentions of neuropathy that can be excluded: "denies any comorbid complications of the diabetes including"
- Flag mentions of neuropathy that should be included: "((diabetic|peripheral) neuropathy)|(Neuropath(ic|y))|(diabetic nerve pain)|(pain secondary to diabetes)"
- Flag mentions of nephropathy that can be excluded: "(denies any comorbid complications of the diabetes including)|(herbs which can cause chronic interstitial nephritis)"
- Flag mentions of nephropathy that should be included: "((diabetic|peripheral|reflux) nephropathy)|(Nephropathy 2/2 Diabetes)|(glomerulonephropathy)|(end-stage renal disease)|(ESRD)|(chronic renal insufficiency)|(nephropath(ic|y))"
- Flag mentions of retinopathy that can be excluded: "(denies any comorbid complications of the diabetes including)|(No retinopathy)|(does not show any evidence of diabetic retinopathy)"
- Flag mentions of retinopathy that should be included: "((diabetic|peripheral|reflux) retinopathy)|(diabetic complications of retinopathy)|(optic nerve damage)"
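
Some alternatives in the inclusion patterns overlap. In the nephropathy pattern, for example, any text matching "diabetic nephropathy", "Nephropathy 2/2 Diabetes", or "glomerulonephropathy" also matches the final `nephropath(ic|y)` alternative. The sketch below compares the listed pattern with a shorter candidate on a few invented phrases; `nephro_short` and the sample strings are hypothetical and were not part of the original submission.

```r
library(stringr)

# Inclusion pattern for nephropathy, copied from the list above
nephro_full <- regex(
  "((diabetic|peripheral|reflux) nephropathy)|(Nephropathy 2/2 Diabetes)|(glomerulonephropathy)|(end-stage renal disease)|(ESRD)|(chronic renal insufficiency)|(nephropath(ic|y))",
  ignore_case = TRUE
)

# Hypothetical shorter pattern that drops the alternatives already covered
# by nephropath(ic|y)
nephro_short <- regex(
  "nephropath(ic|y)|end-stage renal disease|ESRD|chronic renal insufficiency",
  ignore_case = TRUE
)

# Invented sample phrases, one per alternative plus a negative case
samples <- c(
  "history of diabetic nephropathy",
  "Nephropathy 2/2 Diabetes",
  "glomerulonephropathy",
  "ESRD on hemodialysis",
  "chronic renal insufficiency",
  "nephropathic changes",
  "no renal complaints"
)

full_hits <- str_detect(samples, nephro_full)
short_hits <- str_detect(samples, nephro_short)

print(data.frame(samples, full_hits, short_hits))
print(identical(full_hits, short_hits))
```

Keeping the longer pattern is harmless and documents which phrasings were seen in the corpus; the shorter form is merely easier to maintain if the patterns are reworked later.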

Performance

- I was able to evaluate all notes without using text from Family and Social History sections.
- 120 notes were identified as having some reference to diabetes (excluding text from Family and Social History sections).
- Eight of these 120 notes contained language that either stated that there was no history of diabetes or referred to gestational diabetes.
- One note misspelled the word “neuropathic” as “neuopathic”, but we were able to capture that note using other keywords.
- After much testing and revision, only 2 mismatches resulted from the validation (details are given in the Mismatches section).
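
The point about the misspelled "neuopathic" can be illustrated with the neuropathy inclusion pattern: the misspelling itself does not match, so the section is flagged only if another keyword is present. Both phrases below are invented for the example and are not quotes from the corpus.

```r
library(stringr)

# Neuropathy inclusion pattern, copied from the Regular Expression section
neuro_include <- regex(
  "((diabetic|peripheral) neuropathy)|(Neuropath(ic|y))|(diabetic nerve pain)|(pain secondary to diabetes)",
  ignore_case = TRUE
)

# Hypothetical phrases containing the misspelling
misspelled_only <- "neuopathic pain in both feet"
misspelled_with_keyword <- "neuopathic pain secondary to diabetes"

hit_misspelled_only <- str_detect(misspelled_only, neuro_include)
hit_with_keyword <- str_detect(misspelled_with_keyword, neuro_include)

print(hit_misspelled_only)  # FALSE: "neuopathic" lacks the "r" the pattern expects
print(hit_with_keyword)     # TRUE: caught by "pain secondary to diabetes"
```

A pattern that tolerates the missing letter, such as `neu?r?opath`, would be one option for a rework, at the cost of a looser match.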

Mismatches

- NOTE_ID 16: Reference key identifies retinopathy, but my code and review did not.
- NOTE_ID 86: Reference key identifies retinopathy which my code also finds, but my code also identifies neuropathy and flags it accordingly.
… %>% filter(DIABETIC_RETINOPATHY != DIABETIC_RETINOPATHY_calc) %>% select(NOTE_ID, DIABETIC_RETINOPATHY, DIABETIC_RETINOPATHY_calc)
… %>% filter(DIABETIC_NEUROPATHY != DIABETIC_NEUROPATHY_calc) %>% select(NOTE_ID, DIABETIC_NEUROPATHY, DIABETIC_NEUROPATHY_calc)

## # A tibble: 1 x 3
##   NOTE_ID DIABETIC_RETINOPATHY DIABETIC_RETINOPATHY_calc
##     <int>                <int>                     <dbl>
## 1      16                    1                         0
## # A tibble: 1 x 3
##   NOTE_ID DIABETIC_NEUROPATHY DIABETIC_NEUROPATHY_calc
##     <int>               <int>                    <dbl>
## 1      86                   0                        1
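
Besides listing individual mismatches, the reference key and the calculated flags can be cross-tabulated per label to see how many notes fall into each agree/disagree cell. The sketch below does this for retinopathy on a made-up six-note table; `toy_final` and its values are hypothetical and do not reproduce the project's results.

```r
library(dplyr)

# Hypothetical validation table for one label
toy_final <- tibble(
  NOTE_ID = 1:6,
  DIABETIC_RETINOPATHY      = c(0L, 1L, 0L, 1L, 0L, 0L),  # reference key
  DIABETIC_RETINOPATHY_calc = c(0, 1, 0, 0, 0, 1)         # algorithm output
)

# Count notes in each combination of reference value and calculated value
confusion <- toy_final %>%
  count(DIABETIC_RETINOPATHY, DIABETIC_RETINOPATHY_calc, name = "N_NOTES")

# Same filter as used above to list the disagreeing notes
mismatches <- toy_final %>%
  filter(DIABETIC_RETINOPATHY != DIABETIC_RETINOPATHY_calc) %>%
  select(NOTE_ID, DIABETIC_RETINOPATHY, DIABETIC_RETINOPATHY_calc)

print(confusion)
print(mismatches)
```

Rows where the two columns differ separate missed complications (reference 1, calculated 0) from extra flags (reference 0, calculated 1), which is the distinction drawn between NOTE_ID 16 and NOTE_ID 86.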

Reflection

- Using the delimiter method appears to have been the right choice for this specific use case.
- The 2 mismatches against the reference key, out of the 120 notes that mention diabetes, are justifiable based on manual review of the notes. If that review is accepted, the 1.7% mismatch rate (2 of 120) drops to zero.
- This manual use of regex is geared specifically for this corpus, and may not apply very broadly.
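
For reference, the mismatch rate quoted above can be reproduced with simple arithmetic. The denominator of 120 follows the text; the second figure, using all 141 notes in the validation table, is an added comparison rather than a number reported in the original.

```r
mismatches <- 2
notes_with_diabetes <- 120  # notes with some reference to diabetes
all_notes <- 141            # all notes in the reference key

# Percentage of mismatched notes under each denominator, to one decimal place
rate_diabetes_notes <- round(100 * mismatches / notes_with_diabetes, 1)
rate_all_notes <- round(100 * mismatches / all_notes, 1)

print(rate_diabetes_notes)  # 1.7
print(rate_all_notes)       # 1.4
```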

TO BE REWORKED AT A FUTURE DATE