Analysis of Literal Text from the Death Certificate

Setup
Dataset
Summary Statistics
Representation of Literal Text in Machine Learning
Unigram Analyses
Bigram Analyses
Network Visualizations

The intention of this project is to perform an analysis on a representative sample of death records from public data available for purchase from the Washington State Department of Health prior to performing a analysis based on the full set of data for all United States records from 2003-2017. The focus of the project will be examining records pertaining to drug overdose deaths so we will limit our analysis to a subset of ICD-10 codes.

The following table is provided as a reference for ICD-10 codes pertaining to drug overdose.

ICD-10 Code	Description
X40	Accidental poisoning by and exposure to nonopioid analgesics, antipyretics and antirheumatics
X41	Accidental poisoning by and exposure to antiepileptic, sedative-hypnotic, antiparkinsonism and psychotropic drugs, not elsewhere classified
X42	Accidental poisoning by and exposure to narcotics and psychodysleptics [hallucinogens], not elsewhere classified
X43	Accidental poisoning by and exposure to other drugs acting on the autonomic nervous system
X44	Accidental poisoning by and exposure to other and unspecified drugs, medicaments and biological substances
X60	Intentional self-poisoning by and exposure to nonopioid analgesics, antipyretics and antirheumatics
X61	Intentional self-poisoning by and exposure to antiepileptic, sedative-hypnotic, antiparkinsonism and psychotropic drugs, not elsewhere classified
X62	Intentional self-poisoning by and exposure to narcotics and psychodysleptics [hallucinogens], not elsewhere classified
X63	Intentional self-poisoning by and exposure to other drugs acting on the autonomic nervous system
X64	Intentional self-poisoning by and exposure to other and unspecified drugs, medicaments and biological substances
X85	Assault by drugs, medicaments and biological substances
Y10	Poisoning by and exposure to nonopioid analgesics, antipyretics and antirheumatics, undetermined intent
Y11	Poisoning by and exposure to antiepileptic, sedative-hypnotic, antiparkinsonism and psychotropic drugs, not elsewhere classified, undetermined intent
Y12	Poisoning by and exposure to narcotics and psychodysleptics [hallucinogens], not elsewhere classified, undetermined intent
Y13	Poisoning by and exposure to other drugs acting on the autonomic nervous system, undetermined intent
Y14	Poisoning by and exposure to other and unspecified drugs, medicaments and biological substances, undetermined intent

Setup

Note: Code folding is enabled and set to hide code blocks by default. To see setup code, click on the ‘Code’ button to expand code blocks.

library(data.table) # read in dataset with speed

library(tidyverse)   # for dplyr, tidyr, and ggplot functions
library(tidytext)    # for text mining tokenization and td_idf calculations

data("stop_words")   # Load stop_words from tidytext package

library(scales)      # for custom scale formatting in ggplot

library(igraph)      # to create graphs of term relationships
library(ggraph)      # visualize term relationship directed graphs



#####################################################
# NCHS Data Briefs color palette for ggplot figures #
#####################################################

nchs_db = c("#022a5c", "#93a1c0", "#6d9e34","#c7d6b1","#666666","#022a5c", "#93a1c0", "#6d9e34","#c7d6b1","#666666","#022a5c", "#93a1c0", "#6d9e34","#c7d6b1","#666666")


###################################
#  Lookup Table for ICD-10 Codes  #
###################################

lookup_table = cbind(c("X40","X41","X42","X43","X44","X60","X61","X62","X63","X64","X85","Y10","Y11","Y12","Y13","Y14"),
c("Accidental poisoning by and exposure to nonopioid analgesics, antipyretics and antirheumatics",
"Accidental poisoning by and exposure to antiepileptic, sedative-hypnotic, antiparkinsonism and psychotropic drugs, not elsewhere classified",
"Accidental poisoning by and exposure to narcotics and psychodysleptics [hallucinogens], not elsewhere classified",
"Accidental poisoning by and exposure to other drugs acting on the autonomic nervous system",
"Accidental poisoning by and exposure to other and unspecified drugs, medicaments and biological substances",
"Intentional self-poisoning by and exposure to nonopioid analgesics, antipyretics and antirheumatics",
"Intentional self-poisoning by and exposure to antiepileptic, sedative-hypnotic, antiparkinsonism and psychotropic drugs, not elsewhere classified",
"Intentional self-poisoning by and exposure to narcotics and psychodysleptics [hallucinogens], not elsewhere classified",
"Intentional self-poisoning by and exposure to other drugs acting on the autonomic nervous system",
"Intentional self-poisoning by and exposure to other and unspecified drugs, medicaments and biological substances",
"Assault by drugs, medicaments and biological substances",
"Poisoning by and exposure to nonopioid analgesics, antipyretics and antirheumatics, undetermined intent",
"Poisoning by and exposure to antiepileptic, sedative-hypnotic, antiparkinsonism and psychotropic drugs, not elsewhere classified, undetermined intent",
"Poisoning by and exposure to narcotics and psychodysleptics [hallucinogens], not elsewhere classified, undetermined intent",
"Poisoning by and exposure to other drugs acting on the autonomic nervous system, undetermined intent",
"Poisoning by and exposure to other and unspecified drugs, medicaments and biological substances, undetermined intent"),
c("Accidental","Accidental","Accidental","Accidental","Accidental","Intentional","Intentional","Intentional","Intentional","Intentional","Assault","Undetermined","Undetermined","Undetermined","Undetermined","Undetermined"),
c("Nonopioid analgesics, antipyretics and antirheumatics",
  "Antiepileptic, sedative-hypnotic, antiparkinsonism and psychotropic drugs, not elsewhere classified",
  "Narcotics and psychodysleptics [hallucinogens], not elsewhere classified",
  "Other drugs acting on the autonomic nervous system",
  "Other and unspecified drugs, medicaments and biological substances",
  "Nonopioid analgesics, antipyretics and antirheumatics",
  "Antiepileptic, sedative-hypnotic, antiparkinsonism and psychotropic drugs, not elsewhere classified",
  "Narcotics and psychodysleptics [hallucinogens], not elsewhere classified",
  "Other drugs acting on the autonomic nervous system",
  "Other and unspecified drugs, medicaments and biological substances",
  "Medicaments and biological substances",
  "Nonopioid analgesics, antipyretics and antirheumatics",
  "Antiepileptic, sedative-hypnotic, antiparkinsonism and psychotropic drugs, not elsewhere classified",
  "Narcotics and psychodysleptics [hallucinogens], not elsewhere classified",
  "Other drugs acting on the autonomic nervous system",
  "Other and unspecified drugs, medicaments and biological substances"))

colnames(lookup_table) = c("ICD-10 Code","Description", "Intent","Mechanism")

lookup_table = as_tibble(lookup_table)

Dataset

The dataset contains 56,069 records from Washington State consisting of literal text fields from Part I (the chain of events leading to death), Part II (Other significant conditions that contributed to cause of death), and Box 43 (How the injury occurred) of the standard certificate of death. The file contains records for all deaths in Washington State for 2016 and includes 3 character ICD-10 Codes and a flag variable indicating whether the underlying cause of death was drug overdose.

The following code loads a CSV text file and creates a tidy data frame including the drug overdose flag, ICD-10 code, record number, and unigram within a given record. This format will facilitate various text mining and visualization tasks of the literal text.

literals = fread("literals.txt")

tidy_literals = literals %>%
  mutate(Overdose = as.factor(Overdose)) %>%
  rowid_to_column("record") %>%
  unite("text", `Cause of Death - Line A`,`Cause of Death - Line B`,`Cause of Death - Line C`,`Cause of Death - Line D`,`Other Significant Conditions`,`How Injury Occurred`, sep=" ") %>%
  unnest_tokens(word, text) %>%
  as_tibble()

levels(tidy_literals$Overdose) = c("Not Overdose", "Overdose")

tidy_literals

## # A tibble: 561,244 x 4
##    record Overdose     `Underlying COD` word       
##     <int> <fct>        <chr>            <chr>      
##  1      1 Not Overdose J84              respiratory
##  2      1 Not Overdose J84              failure    
##  3      1 Not Overdose J84              community  
##  4      1 Not Overdose J84              acquired   
##  5      1 Not Overdose J84              pneumonia  
##  6      1 Not Overdose J84              end        
##  7      1 Not Overdose J84              stage      
##  8      1 Not Overdose J84              pulmonary  
##  9      1 Not Overdose J84              fibrosis   
## 10      2 Not Overdose C50              metastatic 
## # … with 561,234 more rows

Summary Statistics

First, we would like to examine some summary statistics for the sample data file to determine the proportion of records with an underlying cause of drug overdose. In 2016, Washington State had 1,134 deaths with an underlying cause of drug overdose, comprising 2.07 percent of deaths.

tidy_literals %>%
  select(Overdose, record) %>%
  distinct(Overdose, record) %>%
  count(Overdose) %>%
  ggplot(aes(Overdose, n, fill=Overdose)) +
  geom_bar(stat="identity", show.legend = FALSE) +
  geom_text(aes(x=Overdose, y=n, label=n), vjust=-0.75, hjust=0.5) +
  scale_fill_manual(values=nchs_db) +
  scale_y_continuous(labels = comma) + 
  labs(title="Number of overdose and non-overdose deaths",
       subtitle="Data are for Washington State, 2016",
       x=NULL, 
       y="Number of deaths") +
  theme_classic()

Second, we would like to examine the number of drug overdose deaths by ICD-10 code. Only codes with at least 1 record for 2016 are shown. The ICD-10 codes for drug overdose are primarily categorized by intent and mechanism. Codes pertaining to “other and unspecified drugs, medicaments and biological substances”" (X44, X64, Y14) appear to have the largest counts with respect to mechanism, whereas those pertaining to “Accidental poisoning” (X40-44) had the largest counts with respect to intent.

tidy_literals %>%
  select(Overdose, `Underlying COD`, record) %>%
  filter(Overdose == 'Overdose') %>%
  count(`Underlying COD`) %>%
  left_join(lookup_table[,c(1,3,4)], by=c("Underlying COD" = "ICD-10 Code")) %>%
  ggplot() +
  geom_col(aes(y = n, x = `Underlying COD`, fill=Intent), show.legend=FALSE) +
  geom_text(aes(x=`Underlying COD`, y=n, label=n), hjust=0.5, vjust=-0.75) +
  scale_fill_manual(values=nchs_db) +
  labs(title="Number of drug overdose deaths, by ICD-10 code",
       subtitle="Data are for Washington State, 2016",
       x=NULL, 
       y="Number of deaths") +
  theme_classic()

tidy_literals %>%
  select(Overdose, `Underlying COD`, record) %>%
  filter(Overdose == 'Overdose') %>%
  count(`Underlying COD`) %>%
  left_join(lookup_table[,c(1,3,4)], by=c("Underlying COD" = "ICD-10 Code")) %>%
  group_by(Intent) %>%
  summarise(n=sum(n)) %>% 
  ggplot(aes(Intent, n, fill=Intent)) +
  geom_col(show.legend=FALSE) +
  geom_text(aes(x=Intent, y=n, label=n), vjust=-0.75, hjust=0.5) +
  scale_fill_manual(values=nchs_db) +
  labs(title="Number of drug overdose deaths, by intent",
       subtitle="Data are for Washington State, 2016",
       x=NULL, 
       y="Number of deaths") +
  theme_classic()

tidy_literals %>%
  select(Overdose, `Underlying COD`, record) %>%
  filter(Overdose == 'Overdose') %>%
  count(`Underlying COD`) %>%
  left_join(lookup_table[,c(1,3,4)], by=c("Underlying COD" = "ICD-10 Code")) %>%
  group_by(Mechanism) %>%
  summarise(n=sum(n)) %>% 
  mutate(Mechanism = paste(substr(Mechanism, start = 1, stop = 40), "...")) %>%
  ggplot(aes(Mechanism, n, fill=Mechanism)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(x=Mechanism, y=n, label=n), vjust=0.5, hjust="inward") +
  scale_fill_manual(values=nchs_db) +
  labs(title="Number of drug overdose deaths, by mechanism",
       subtitle="Data are for Washington State, 2016",
       x=NULL, 
       y="Number of deaths") +
  theme_classic() +
  coord_flip()

Representation of Literal Text in Machine Learning

One area of interest with literal text data from the death certificate is the use of machine learning techniques to accurately identify death records with an underlying cause of drug overdose and assign the appropriate ICD-10 code (e.g. X40-44, X60-64, X85, Y10-Y14).

A common data representation that might be used with machine learning models are term frequency matrices, where rows represent individual records and columns represent the absence or presence of specific terms. Terms may include unigrams (single words), bigrams (adjacent word pairs), and n-grams (adjacent words within a window of size n).

Terms that provide no discriminating power are referred to as stop-words and are often excluded from the model. Examples include commonly used words like “and”, “the”, and “but”. Additional stop words may be identified based on the domain/subject matter. This process often involves a trial and error approach of evaluating the performance of manually curated lists versus standard stop word lists against a machine learning model. However, it may be useful to look at metrics such as counts, term frequency (TF), and term frequency-inverse document frequency (TF-IDF) to identify unimportant words.

Unigram Analyses

The following chart provides a listing of the top 25 words among all records in the sample dataset, excluding commonly used stop words. Words such as disease, failure, chronic, and acute stand out as words with high counts in the dataset. However, it is unclear whether they would provide useful information in identifying a record as a drug overdose or classifying a record into distinct ICD-10 codes.

tidy_literals %>%
  select(word) %>%
  anti_join(stop_words) %>%
  count(word) %>%
  top_n(25, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col(fill=nchs_db[1]) +
  labs(title="Top 25 words, all records",
       subtitle="Data are for Washington State, 2016",
       x=NULL,
       y="Count") +
  theme_light() + 
  coord_flip()

## Joining, by = "word"

Let’s examine the top words by overdose status (i.e. “Overdose”, “Not Overdose”).

Not Overdose - disease, failure, acute, chronic, as well as a number of terms relating to physiology
Overdose - references to intoxication and prescription or illegal drugs, although disease still remains in the top 25 terms.

It is difficult to compare the importance of common words like disease from results of the two groups due to the imbalance in the number of records (i.e., 545,581 Not Overdose, 15,663 Overdose). Therefore, it is useful to characterize the data with a metric that is calculated relative to the number of records or words.

top_words_by_status = tidy_literals %>%
  select(Overdose, word) %>%
  anti_join(stop_words) %>%
  count(Overdose, word) %>%
  ungroup() %>%
  group_by(Overdose) %>%
  top_n(25, n) %>%
  ungroup() %>%
  arrange(Overdose, n) %>%
  mutate(order = row_number())

## Joining, by = "word"

ggplot(top_words_by_status, aes(order, n, fill=Overdose)) +
  geom_bar(stat="identity", show.legend=FALSE) +
  facet_wrap(~Overdose, ncol = 2, scales = "free") +  
  labs(title = "Top 25 words, by overdose status",
       subtitle = "Data are for Washington State, 2016",
       x=NULL, 
       y="Count") +
  scale_x_continuous(
    breaks = top_words_by_status$order,
    labels = top_words_by_status$word,
    expand = c(0,0)
  ) +  
  scale_fill_manual(values=nchs_db) +
  theme_light() + 
  theme(plot.title = element_text(hjust=0.5), plot.subtitle = element_text(hjust = 0.5)) + 
  coord_flip()

Term frequency (TF) is a metric that may be used to consider the importance of a word within a single document by calculating the relative proportion of a particular term with respect to all terms in that document. It is calculated as follows:

\[TF(t,d) = \frac{\text{Number of times term t appears in document d}}{\text{Number of terms in document d}}\]

Since we would like to compare the use of terms in ‘Not Overdose’ and ‘Overdose’ groups we will consider all records within each group as a single document. We can see from the results that the metrics for term frequency are on roughly the same scale, allowing for comparison of word use in the two groups. Some relationships that we can observe include:

‘disease’ has a TF of 0.045 in the ‘Not Overdose’ group, whereas it only has a 0.01 TF in the ‘Overdose’ group.
‘acute’ has a high TF of 0.05 in the ‘Overdose’ group when compared to the TF of 0.021 in the ‘Not Overdose’ group
‘chronic’ has a TF of roughly 0.025 in the ‘Not Overdose’ group and a considerably lower TF of 0.006 in the ‘Overdose’ group
In general, records in the ‘Not Overdose’ group are more likely to have a reference to chronic than acute, while the opposite is true of the ‘Overdose’ group.

top_tf_words = tidy_literals %>% 
  count(Overdose, word) %>%
  bind_tf_idf(word, Overdose, n) %>%
  anti_join(stop_words) %>%
  group_by(Overdose) %>%
  top_n(25, tf) %>%
  ungroup %>%
  arrange(Overdose, tf) %>%
  mutate(order = row_number())

## Joining, by = "word"

ggplot(top_tf_words, aes(order, tf, fill=Overdose)) +
  geom_col(show.legend=FALSE) +
  facet_wrap(~Overdose, ncol = 2, scales = "free") +
  labs(title = "Top 25 words based on term frequency (TF), by overdose status",
       subtitle = "Data are for Washington State, 2016",
       x=NULL, 
       y="Term Frequency") +
  scale_fill_manual(values=nchs_db) + 
  scale_x_continuous(
    breaks = top_tf_words$order,
    labels = top_tf_words$word,
    expand = c(0,0)
  ) +  
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust=0.5)) +
  coord_flip()

As discussed, the lists of top words using TF include words that occur in both documents. Although this gives us a relative measure that is useful for comparing the frequency of words in each document, it might not provide the best list of words for discriminating between the two groups. A second measure, called term frequency - inverse document frequency (TF-IDF) introduces the inverse document frequency (IDF) measure as a weighting or penalty term. IDF is calculated as follows

\[ IDF(t, D) = log_{10}(\frac {\text{Number of documents in D}}{\text{Number of documents with term t in D}}) \] TF-IDF is then calculated as follows:

\[ \text{TF-IDF} = TF(t,d) * IDF(t,D) \]

Notice that for the prior example where only two documents are considered, if the term is in both documents, then IDF will be zero and therefore the TF-IDF will resolve to zero. Terms that are left include words with the highest term frequencies that are unique to the document. For the ‘Overdose’ group, this primarily consists of various drug names, while the ‘Not Overdose’ group contains various physiological terms and descriptors that are unique to the ‘Not Overdose’ group. Provided that the terms have a minimum count or frequency in a collection of records, these may prove to be useful predictors of Overdose status.

We can also see some need for cleanup in preprocessing with variants of terms such as alzheimer’s and alzheimers, as well as numbers such as 47700 which should likely be stripped out. It is unclear whether numbers such as 2 provide important context when paired with surrounding words. In this case it may be useful to analyze the words in the context of bigrams or word pairings.

top_tf_idf_words = tidy_literals %>% 
  count(Overdose, word) %>%
  bind_tf_idf(word, Overdose, n) %>%
  group_by(Overdose) %>%
  top_n(25, tf_idf) %>%
  ungroup %>%
  arrange(Overdose, tf_idf) %>%
  mutate(order = row_number())

ggplot(top_tf_idf_words, aes(order, tf_idf, fill=Overdose)) +
  geom_col(show.legend=FALSE) +
  facet_wrap(~Overdose, ncol = 2, scales = "free") +
  labs(title = "Top 25 words based on TF-IDF, by overdose status",
       subtitle = "Data are for Washington State, 2016",
       x=NULL, 
       y="Term Frequency-Inverse Document Frequency (TF-IDF)") +
  scale_fill_manual(values=nchs_db) + 
  scale_x_continuous(
    breaks = top_tf_idf_words$order,
    labels = top_tf_idf_words$word,
    expand = c(0,0)
  ) +  
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust=0.5)) +
  coord_flip()

We can also consider the collection of records by ICD-10 code as individual documents, allowing us to pull out the most important terms for each category.

top_tf_idf_words = tidy_literals %>% 
  filter(Overdose == "Overdose") %>%
  count(`Underlying COD`, word) %>%
  bind_tf_idf(word, `Underlying COD`, n) %>%
  group_by(`Underlying COD`) %>%
  top_n(10, tf_idf) %>%
  ungroup %>%
  arrange(`Underlying COD`, tf_idf) %>%
  mutate(order = row_number())

ggplot(top_tf_idf_words, aes(order, tf_idf, fill=`Underlying COD`)) +
  geom_col(show.legend=FALSE) +
  facet_wrap(~`Underlying COD`, ncol = 3, scales = "free") +
  labs(title = "Top 10 words based on TF-IDF, by ICD-10 code",
       subtitle = "Data are for Washington State, 2016",
       x=NULL, 
       y="Term Frequency-Inverse Document Frequency (TF-IDF)") +
  scale_fill_manual(values=nchs_db) + 
  scale_x_continuous(
    breaks = top_tf_idf_words$order,
    labels = top_tf_idf_words$word,
    expand = c(0,0)
  ) +  
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust=0.5)) +
  coord_flip()

We can also examine the collection of documents by intent and mechanism:

top_tf_idf_words = tidy_literals %>%
  select(Overdose, `Underlying COD`, record, word) %>%
  filter(Overdose == 'Overdose') %>%
  left_join(lookup_table[,c(1,3)], by=c("Underlying COD" = "ICD-10 Code")) %>%
  count(Intent, word) %>%
  bind_tf_idf(word, Intent, n) %>%
  group_by(Intent) %>%
  top_n(5, tf_idf) %>%
  ungroup %>%
  arrange(Intent, tf_idf) %>%
  mutate(order = row_number())
  
ggplot(top_tf_idf_words, aes(order, tf_idf, fill=Intent)) +
  geom_col(show.legend=FALSE) +
  facet_wrap(~Intent, ncol = 2, scales = "free") +
  labs(title = "Top 5 words based on TF-IDF, by Intent",
       subtitle = "Data are for Washington State, 2016",
       x=NULL, 
       y="Term Frequency-Inverse Document Frequency (TF-IDF)") +
  scale_fill_manual(values=nchs_db) + 
  scale_x_continuous(
    breaks = top_tf_idf_words$order,
    labels = top_tf_idf_words$word,
    expand = c(0,0)
  ) +  
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust=0.5)) +
  coord_flip()

top_tf_idf_words = tidy_literals %>%
  select(Overdose, `Underlying COD`, record, word) %>%
  filter(Overdose == 'Overdose') %>%
  left_join(lookup_table[,c(1,4)], by=c("Underlying COD" = "ICD-10 Code")) %>%
  mutate(Mechanism = paste(substr(Mechanism, start = 1, stop = 25), "...")) %>%  
  count(Mechanism, word) %>%
  bind_tf_idf(word, Mechanism, n) %>%
  group_by(Mechanism) %>%
  top_n(10, tf_idf) %>%
  ungroup %>%
  arrange(Mechanism, tf_idf) %>%
  mutate(order = row_number())

ggplot(top_tf_idf_words, aes(order, tf_idf, fill=Mechanism)) +
  geom_col(show.legend=FALSE) +
  facet_wrap(~Mechanism, ncol = 2, scales = "free") +
  labs(title = "Top 10 words based on TF-IDF, by Mechanism",
       subtitle = "Data are for Washington State, 2016",
       x=NULL, 
       y="Term Frequency-Inverse Document Frequency (TF-IDF)") +
  scale_fill_manual(values=nchs_db) + 
  scale_x_continuous(
    breaks = top_tf_idf_words$order,
    labels = top_tf_idf_words$word,
    expand = c(0,0)
  ) +  
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust=0.5)) +
  coord_flip()

In identifying stop-words, we can search for terms with sufficiently large counts that minimize TD-IDF or IDF. Since IDF for a word is the same irrespective of the record that it comes from, we will search for words with a count larger than 10 with the lowest IDF values, excluding any stop words. This will produce a list of potential stop words that may be evaluated before being added to a curated list.

Based on the top 25 potential stop words for this criteria, a reasonable assumption might be to include failure and disease on a list of stop words. When developing a term-frequency matrix as a representation of the literal text data, these words would be excluded as features for consideration. Performance of the machine learning algorithm with and without the custom stop word lists may then be compared.

potential_stop_words = tidy_literals %>% 
  count(record, word) %>%
  bind_tf_idf(word, record, n) %>%
  group_by(word) %>%
  summarize(min_idf = min(idf), 
            count=sum(n)) %>%
  ungroup() %>%
  arrange(min_idf) %>%
  filter(count > 10) %>%
  anti_join(stop_words) %>%
  mutate(order = row_number()) %>%
  filter(order <= 25)

## Joining, by = "word"

ggplot(potential_stop_words, aes(order, min_idf)) +
  geom_col(show.legend=FALSE) +
  labs(title = "Top 25 potential stop words",
       subtitle = "Data are for Washington State, 2016",
       x=NULL, 
       y="Inverse Document Frequency (IDF)") +
  scale_fill_manual(values=nchs_db) + 
  scale_x_continuous(
    breaks = rev(potential_stop_words$order),
    labels = rev(potential_stop_words$word),
    expand = c(0,0)
  ) +  
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust=0.5)) +
  coord_flip()

Bigram Analyses

Bigrams consider adjacent word pairings and may be analyzed in a similar manner to unigrams using count, term frequency, and TF-IDF metrics. A listing of bigrams may be obtained by using the n-gram option of unnest_tokens:

tidy_literals_bigrams = literals %>%
  as_tibble() %>%
  mutate(Overdose = as.factor(Overdose)) %>%
  rowid_to_column("record") %>%
  gather("line", "text", `Cause of Death - Line A`,`Cause of Death - Line B`,`Cause of Death - Line C`,`Cause of Death - Line D`,`Other Significant Conditions`,`How Injury Occurred`, na.rm = TRUE) %>%
  unnest_tokens(bigram, text, token = "ngrams", format="text", n=2, collapse=F) %>%
  filter(!is.na(bigram))

levels(tidy_literals_bigrams$Overdose) = c("Not Overdose", "Overdose")

Returning to the idea of stop words, if a machine learning model considers bigrams as features, even 2 or 3 words could result in the removal of hundreds of potential features. For example, removing disease and failure would result in 340 less features in a model considering bigrams with a minimum frequency of 5.

tidy_literals_bigrams %>% 
  count(bigram) %>% 
  filter(n > 5,
         str_detect(bigram, "disease|failure"))

## # A tibble: 340 x 2
##    bigram                       n
##    <chr>                    <int>
##  1 addison's disease            9
##  2 adult failure              312
##  3 airway disease              21
##  4 alzheimer disease           61
##  5 alzheimer's disease        804
##  6 alzheimers disease         158
##  7 and failure                 26
##  8 arterial disease           132
##  9 arteries disease            20
## 10 arteriosclerotic disease    16
## # … with 330 more rows

Selected charts for top bigrams according to raw counts and TF-IDF are provided below in a tabbed format to conserve space.

Top Bigrams (Counts)

tidy_literals_bigrams %>%
  select(bigram) %>%
  count(bigram) %>%
  top_n(25, n) %>%
  mutate(word = reorder(bigram, n)) %>%
  ggplot(aes(word, n)) +
  geom_col(fill=nchs_db[1]) +
  labs(title="Top 25 bigrams",
       subtitle="Data are for Washington State, 2016",
       x=NULL,
       y="Count") +
  theme_light() + 
  coord_flip()

By Overdose Status (TF-IDF)

top_tf_idf_bigrams = tidy_literals_bigrams %>% 
  count(Overdose, bigram) %>%
  bind_tf_idf(bigram, Overdose, n) %>%
  group_by(Overdose) %>%
  top_n(25, tf_idf) %>%
  ungroup() %>%
  arrange(Overdose, tf_idf) %>%
  mutate(order = row_number())

ggplot(top_tf_idf_bigrams, aes(order, tf_idf, fill=Overdose)) +
  geom_col(show.legend=FALSE) +
  facet_wrap(~Overdose, ncol = 2, scales = "free") +
  labs(title = "Top 25 bigrams based on TF-IDF, by overdose status",
       subtitle = "Data are for Washington State, 2016",
       x=NULL, 
       y="Term Frequency-Inverse Document Frequency (TF-IDF)") +
  scale_fill_manual(values=nchs_db) + 
  scale_x_continuous(
    breaks = top_tf_idf_bigrams$order,
    labels = top_tf_idf_bigrams$bigram,
    expand = c(0,0)
  ) +  
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust=0.5)) +
  coord_flip()

By Intent (TF-IDF)

top_tf_idf_bigrams = tidy_literals_bigrams %>%
  select(Overdose, `Underlying COD`, record, bigram) %>%
  filter(Overdose == 'Overdose') %>%
  left_join(lookup_table[,c(1,3)], by=c("Underlying COD" = "ICD-10 Code")) %>%
  count(Intent, bigram) %>%
  bind_tf_idf(bigram, Intent, n) %>%
  group_by(Intent) %>%
  top_n(5, tf_idf) %>%
  ungroup %>%
  arrange(Intent, tf_idf) %>%
  mutate(order = row_number())

ggplot(top_tf_idf_bigrams, aes(order, tf_idf, fill=Intent)) +
  geom_col(show.legend=FALSE) +
  facet_wrap(~Intent, ncol = 2, scales = "free") +
  labs(title = "Top 5 bigrams based on TF-IDF, by Intent",
       subtitle = "Data are for Washington State, 2016",
       x=NULL, 
       y="Term Frequency-Inverse Document Frequency (TF-IDF)") +
  scale_fill_manual(values=nchs_db) + 
  scale_x_continuous(
    breaks = top_tf_idf_bigrams$order,
    labels = top_tf_idf_bigrams$bigram,
    expand = c(0,0)
  ) +  
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust=0.5)) +
  coord_flip()

By Mechanism (TF-IDF)

top_tf_idf_bigrams = tidy_literals_bigrams %>%
  select(Overdose, `Underlying COD`, record, bigram) %>%
  filter(Overdose == 'Overdose') %>%
  left_join(lookup_table[,c(1,4)], by=c("Underlying COD" = "ICD-10 Code")) %>%
  count(Mechanism, bigram) %>%
  mutate(Mechanism = paste(substr(Mechanism, start = 1, stop = 25), "...")) %>% 
  bind_tf_idf(bigram, Mechanism, n) %>%
  group_by(Mechanism) %>%
  top_n(5, tf_idf) %>%
  ungroup %>%
  arrange(Mechanism, tf_idf) %>%
  mutate(order = row_number())

ggplot(top_tf_idf_bigrams, aes(order, tf_idf, fill=Mechanism)) +
  geom_col(show.legend=FALSE) +
  facet_wrap(~Mechanism, ncol = 2, scales = "free") +
  labs(title = "Top 5 bigrams based on TF-IDF, by Mechanism",
       subtitle = "Data are for Washington State, 2016",
       x=NULL, 
       y="Term Frequency-Inverse Document Frequency (TF-IDF)") +
  scale_fill_manual(values=nchs_db) + 
  scale_x_continuous(
    breaks = top_tf_idf_bigrams$order,
    labels = top_tf_idf_bigrams$bigram,
    expand = c(0,0)
  ) +  
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust=0.5)) +
  coord_flip()

Network Visualizations

Using the ggraph package, we can extend the functionality of ggplot2 to create network graphs using bigram counts, term frequencies, or TF-IDF values.

In brief, this is accomplished by taking our tokenized bigrams, separating them into individual words, setting those words as nodes within a graph, and using a metric such as count or TF-IDF as the weight for a directed edge between those nodes.

The following example uses bigram counts to define edge weights. The igraph package in R is used to create the initial graph object from a data frame. Only the top 100 bigrams according to account are retained to maintain a clean visualization.

bigram_graph = tidy_literals_bigrams %>%
  count(bigram) %>%
  select(bigram, n) %>%
  separate(bigram, c("word1","word2"), sep = " ") %>%
  distinct() %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word,
         !is.na(word1)) %>%
  top_n(100, n) %>%
  arrange(desc(n)) %>%
  graph_from_data_frame()

This object is then passed to the ggraph function to create the network visualization. Darker edges indicate higher edge weights or counts in this case. Arrows indicate direction in which the words appear, which illustrate not only common bigrams, but n-grams.

We can observe from the visualization several clusters of bigrams associated with disease, failure, cancer, and respiratory. By following the paths we can see longer n-grams such as ‘chronic obstructive pulmonary disease’.

We can also see that ‘2’ which previously appeared in the list of top words/unigrams according to TD-IDF are included in the visualization within the context of ‘type 2 diabetes’. A second variant of ‘type ii diabetes’ is also present. It may be useful to standardize this terminology in a pre-processing stage before feeding into a machine learning model.

set.seed(2018)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), 
                 show.legend = FALSE,
                 arrow = a, 
                 end_cap = circle(.07, 'inches')) +
  geom_node_point(color = nchs_db[4], 
                  size = 5) +
  geom_node_text(aes(label = name), 
                 vjust = 1, 
                 hjust = 1) +
  theme_void()

Using TF-IDF as the value for edge weights results in a network visualization where the most important relationships are maintained. The following example first calculates TF-IDF treating Overdose and Non-Overdose groups as separate documents, filters the bigrams to include only those from the Overdose group, and then displays the network visualization for drug overdose deaths.

We can observe from this visualization several clusters of bigrams associated with intoxication, toxicity, overdose, and various drug names (e.g. morphine, heroin, methamphetamine). Knowing this, we may be able to search for terms in a record that are similar according to string/edit distances or cosine similarity measures if using vector space models such as word2vec.

bigram_graph = tidy_literals_bigrams %>% 
  count(Overdose, bigram) %>%
  bind_tf_idf(bigram, Overdose, n) %>%
  filter(Overdose == "Overdose") %>%
  select(bigram, tf_idf) %>%
  separate(bigram, c("word1","word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word,
         !is.na(word1)) %>%
  top_n(100, tf_idf) %>%
  graph_from_data_frame()

set.seed(2018)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = tf_idf), 
                 show.legend = FALSE,
                 arrow = a, 
                 end_cap = circle(.07, 'inches')) +
  geom_node_point(color = nchs_db[4], 
                  size = 5) +
  geom_node_text(aes(label = name), 
                 vjust = 1, 
                 hjust = 1) +
  theme_void()