The purpose of this demo is to illustrate the concept of word embeddings using textual data.
Word embeddings are a form of language modeling that uses natural language processing and machine learning techniques to map each word in a selected vocabulary to a vector of numbers. For example:
“DRUG” = [-3.2030731e-02, 2.7105689e-01, -4.9149340e-01, 6.1396444e-01, … ]
By representing words with vectors, we can map words to points in a high-dimensional geometric space. The goal of word embedding models is to place words in this space such that words with high semantic similarity are close together. How these vectors are constructed depends on the model or algorithm used. Within the context of this demo, we will focus on the Word2Vec and FastText models.
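As a toy illustration of this idea, the sketch below maps a few words to hand-made three-dimensional vectors and measures their proximity with Euclidean distance; the vectors are invented for demonstration only (the rest of this demo uses learned vectors and cosine similarity).
import numpy as np
# Hand-made "embeddings" for illustration only; real models learn
# vectors with hundreds of dimensions from a training corpus.
toy_vectors = {
    "HEROIN":    np.array([0.9, 0.1, 0.0]),
    "FENTANYL":  np.array([0.8, 0.2, 0.1]),
    "PNEUMONIA": np.array([0.0, 0.9, 0.7]),
}
def euclidean(a, b):
    '''Straight-line distance between two word vectors'''
    return np.linalg.norm(a - b)
# Semantically similar words should be closer together than unrelated ones
print(euclidean(toy_vectors["HEROIN"], toy_vectors["FENTANYL"]))   # small distance
print(euclidean(toy_vectors["HEROIN"], toy_vectors["PNEUMONIA"]))  # larger distance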
# Load and manipulate arrays
import pandas as pd
import numpy as np
from time import time
# Tokenize lines of text
from nltk.tokenize import RegexpTokenizer
# Word2Vec models and word similarity
from gensim import models, similarities
# For FastText model
from gensim.models import FastText
# Dimensionality reduction to reduce to 2 dimensions for visualization
from sklearn.manifold import TSNE
# For plotting
import matplotlib.pyplot as plt
The dataset contains 56,069 records from Washington State for data year 2016 and consists of literal text fields from Part I (the chain of events leading to death), Part II (Other significant conditions that contributed to cause of death), and Box 43 (How the injury occurred) of the standard certificate of death, as well as a 3-character ICD-10 code for underlying cause of death and a flag variable indicating whether the underlying cause was drug overdose (i.e. X40-X44, X60-X64, X85, or Y10-Y14).
df = pd.read_csv("literals.txt")
df.head()
## Overdose ... How Injury Occurred
## 0 0 ...
## 1 0 ...
## 2 0 ...
## 3 0 ...
## 4 0 ...
##
## [5 rows x 8 columns]
Although literal text data may be used for a variety of purposes, this demo will limit the focus to analysis related to drug overdose deaths. The dataset comprises 1,134 overdose deaths and 54,935 non-overdose deaths.
counts = df.groupby('Overdose').size().rename({0: 'Non-overdose', 1: 'Overdose'}, axis='index').rename('Counts')
print(counts)
## Overdose
## Non-overdose 54935
## Overdose 1134
## Name: Counts, dtype: int64
fig = plt.figure(figsize=(9,9))
ax = counts.plot.pie(colors=["#6d9e34","#c7d6b1"], startangle=90, autopct="%.1f%%", labels=['',''])
ax.legend(loc=1, labels=counts.index)
ax.set_ylabel('')
ax.invert_xaxis()
_ = ax.set_title("Proportion of Overdose Deaths")
Word embedding models are trained using a corpus, or collection of text, from which the model extracts the various contexts in which each word appears. For the purposes of training a word embedding model, we would like to reformat the dataset to include only the fields containing raw text and to stack all text fields into a single column. The original dataset will be reformatted as follows:
df = pd.melt(df,
             id_vars = ['Overdose', 'Underlying COD'],
             value_vars = ['Cause of Death - Line A',
                           'Cause of Death - Line B',
                           'Cause of Death - Line C',
                           'Cause of Death - Line D',
                           'How Injury Occurred',
                           'Other Significant Conditions'],
             var_name = "Line", value_name = "Text")
df = df['Text'].str.strip().replace('', np.nan)
df.dropna(inplace = True)
df.head()
## 0 RESPIRATORY FAILURE
## 1 METASTATIC BREAST CARCINOMA
## 2 END STAGE LIVER DISEASE
## 3 SARCOID HEART AND LUNG DISEASE
## 4 ACUTE ON CHRONIC RESPIRATORY FAILURE POSSIBLY ...
## Name: Text, dtype: object
The GenSim python package provides two popular word embedding algorithms, word2vec and fastText, which make it easy to construct a mapping from the words in a given vocabulary to numerical word vectors. Both take as input a parameter called sentences, which consists of a python list or other iterable containing a list of tokens for each document or record.
If the corpus fits into memory, as is the case with this demo, a python list may be provided for the sentences parameter. However, if the corpus is particularly large, a generator may be passed that streams the documents line by line to the algorithm for training. For more information, see the GenSim tutorial Corpus Streaming - One Document at a Time.
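For illustration, a minimal streaming corpus might look like the sketch below. The file name literal_lines.txt is hypothetical and assumes one record per line; a class with an __iter__ method is used so the corpus can be iterated over repeatedly during training.
class StreamingCorpus:
    '''Yield one tokenized record at a time so the full corpus never needs to fit in memory'''
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path) as infile:
            for line in infile:
                # Simple whitespace tokenization; a real pipeline would apply
                # the same preprocessing used for the in-memory corpus below
                yield line.strip().upper().split()
# Hypothetical usage: the iterable is passed directly as the sentences parameter
# sentences = StreamingCorpus("literal_lines.txt")
# model = models.Word2Vec(sentences, min_count=2, size=200, window=5)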
In order to generate a list of lists of tokens, we pass each line through a preprocessing function to tokenize it, splitting on non-word characters. Select tokens are then removed from the list before it is returned. Numbers are removed since we are not interested in the semantic similarity between numbers and words. The token ‘S’ is removed because it is left over from possessive terms ending in apostrophe-s, such as “ALZHEIMER’S” and “CROHN’S”.
def cleanText(line):
    '''Tokenize splitting on non-word characters and remove if token = S or contains a number'''
    tokens = RegexpTokenizer(r'\w+').tokenize(line)
    return [token for token in tokens if token != 'S' and token.isalpha()]
token_corpus = [cleanText(line) for line in df]
print(token_corpus[:5])
## [['RESPIRATORY', 'FAILURE'], ['METASTATIC', 'BREAST', 'CARCINOMA'], ['END', 'STAGE', 'LIVER', 'DISEASE'], ['SARCOID', 'HEART', 'AND', 'LUNG', 'DISEASE'], ['ACUTE', 'ON', 'CHRONIC', 'RESPIRATORY', 'FAILURE', 'POSSIBLY', 'COMPLICATED', 'BY', 'ASPIRATION', 'PNEUMONIA']]
A complete review of the word2vec and fastText algorithms is outside the scope of this demo. For detailed information, see the original articles by Mikolov et al. (2013) and Bojanowski et al. (2017). A brief overview pertaining to the content of this demo is provided below.
word2vec comes in two flavors, using either a continuous bag of words (CBOW) or a skip-gram approach. For the purpose of this demo, the CBOW approach is used.
CBOW trains a neural network model by passing as input the one-hot vectors for the words surrounding a center word in a given context/window, passing them through a hidden layer, and attempting to predict the probability of the center word using a softmax classifier. The network is optimized to minimize the loss in predicting the center word in the softmax layer. Once this loss is minimized, values are obtained from the hidden layer, also known as the embedding layer, to construct an embedding matrix of word vectors. The resulting word vectors can be used to compare semantic similarity using various distance measures.
The algorithm itself may be tweaked using a number of hyperparameters. The word2vec API page contains a full listing of available parameters. This demo considers the following hyperparameters: min_count (the minimum number of occurrences required for a word to be included in the vocabulary), size (the dimensionality of the word vectors), and window (the maximum distance between the center word and the surrounding context words).
The skip-gram approach works in a similar manner to CBOW, but flips it around: the center word is passed in as input and the surrounding context words are predicted in the output layer. Words that are closer to the center word within the context/window are also weighted to have more importance in the construction of the word embeddings/vectors.
Diagram of word embedding model
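As an aside, gensim's Word2Vec exposes the choice between the two approaches through the sg parameter; a minimal sketch is shown below (the names cbow_model and skipgram_model are illustrative and not used elsewhere in this demo).
# sg=0 (the default) trains a CBOW model; sg=1 trains a skip-gram model
cbow_model = models.Word2Vec(token_corpus, sg=0, min_count=2, size=200, window=5)
skipgram_model = models.Word2Vec(token_corpus, sg=1, min_count=2, size=200, window=5)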
The fastText model is a state-of-the-art word embedding algorithm that extends word2vec by considering not only individual words but also character n-grams within word boundaries. Word vectors for complete words may then be constructed by summing the vectors for the character n-grams within the word boundary.
The advantages of fastText are that the model may create more meaningful word vectors for rare words and can produce word vectors for out-of-vocabulary words. For example, drug names ending in ‘CILLIN’ are likely to fall into the category of penicillin antibiotics (e.g. amoxicillin, ampicillin, dicloxacillin, nafcillin, oxacillin). By using character n-grams, these words are likely to be judged semantically similar, despite low counts or being out-of-vocabulary words.
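To make the character n-gram idea concrete, the sketch below enumerates the n-grams that would be considered for a single word, using the angle-bracket boundary markers described by Bojanowski et al. (2017) and the default sizes of 3 to 6. This is a simplified illustration, not the library's internal implementation.
def char_ngrams(word, min_n = 3, max_n = 6):
    '''Enumerate the character n-grams of a word, including boundary markers'''
    padded = "<" + word + ">"
    return [padded[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)]
print(char_ngrams("AMOXICILLIN")[:8])
# ['<AM', 'AMO', 'MOX', 'OXI', 'XIC', 'ICI', 'CIL', 'ILL']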
The disadvantage of fastText is that it takes significantly longer to train and store the model, given the inclusion of character n-grams. If similarity for out-of-vocabulary words is not a requirement, storage requirements are similar to word2vec.
The hyperparameters of the fastText model are similar to word2vec and are detailed on the fastText API page. The only additional hyperparameters are the min and max size for character n-grams, which are set by default to 3 and 6, respectively.
The chart below shows the time in seconds to train each model using the corpus of 56,069 records from Washington State. Note that due to the relatively small size of the dataset, training time is measured in seconds. However, datasets in practice may often be gigabytes in size and require hours to process.
t0 = time()
wv_model = models.Word2Vec(token_corpus, min_count=2, size=200, window=5)
w2v_time = time() - t0
t0 = time()
ft_model = FastText(token_corpus, min_count=2, size=200, window=5)
ft_time = time() - t0
times = pd.DataFrame(data = {"Models": ["word2vec","fastText"], "Time": [w2v_time, ft_time]})
ax = times.plot.barh(x="Models", y="Time", color = ["#022a5c", "#93a1c0"])
ax.get_legend().remove()
ax.set_title("Time to train model (seconds)")
## Text(0.5, 1.0, 'Time to train model (seconds)')
ax.invert_yaxis()
plt.gcf().set_size_inches(9,6)
Once word vectors are computed, they may be compared using a metric called cosine similarity. This measure essentially determines the similarity by measuring the angle between two vectors and returns a value between -1 and 1. Words with high semantic similarity have a score close to 1, words with no relationship have a score close to 0, and words with a score close to -1 are opposites in meaning.
Cosine Similarity examples
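As a quick worked example with made-up two-dimensional vectors (for illustration only), cosine similarity is the dot product of two vectors divided by the product of their lengths:
def cosine_similarity(a, b):
    '''Cosine of the angle between two vectors: a.b / (|a||b|)'''
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
a = np.array([1.0, 2.0])
print(cosine_similarity(a, np.array([2.0, 4.0])))    # 1.0, same direction
print(cosine_similarity(a, np.array([-2.0, 1.0])))   # 0.0, orthogonal / unrelated
print(cosine_similarity(a, np.array([-1.0, -2.0])))  # -1.0, opposite direction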
Using the most_similar function of gensim, we can obtain a rank-ordered list of the words that are most similar to a specified word.
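For example, a direct call returns a list of (word, cosine similarity) tuples; the actual neighbors depend on the trained model and are not reproduced here.
# Top 5 nearest neighbors of DRUG in the word2vec embedding space
print(wv_model.wv.most_similar("DRUG", topn=5))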
The function below obtains this rank-ordered list and visualizes it using a horizontal bar chart. The x-axis is shown on a log scale because the rank-ordered cosine similarities are so close in value. A bar chart of the counts for each word is also provided.
Results for both the word2vec and fastText models are obtained for comparison.
def plot_Similar_Words(model, word):
    # Color by model type so word2vec and fastText charts are distinguishable
    if(type(model).__name__ == 'Word2Vec'):
        color = "#022a5c"
    else:
        color = "#93a1c0"
    # Rank-ordered list of (word, cosine similarity) tuples plus vocabulary counts
    df = pd.DataFrame(model.wv.most_similar(word), columns = ["Word", "Similarity"])
    df['Counts'] = [model.wv.vocab[w].count for w in df['Word']]
    fig = plt.figure(figsize = (12,4))
    ax1 = fig.add_subplot(121)
    df.plot.barh(x="Word", y="Similarity", ax=ax1, color = color)
    ax1.set_title(type(model).__name__ + ", Words Most Similar to " + word)
    ax1.set_xscale('log')
    ax1.invert_yaxis()
    ax2 = fig.add_subplot(122)
    df.plot.barh(x="Word", y="Counts", ax = ax2, color = color)
    ax2.set_title(type(model).__name__ + ", Counts for each word")
    ax2.invert_yaxis()
    plt.tight_layout()
    plt.show()
The first word queried is DECEDENT, which occurs 698 times in the dataset. Despite the small dataset, word2vec identifies several words that roughly carry the semantic meaning of an individual, including HIS, SUBJECT, and COMPANIONS. fastText does not perform quite as well; only the top-ranked word, DECEASED, is semantically similar. Other selections appear to be chosen based on common character n-grams, such as ‘DEC’ in ‘DECEMBER’ and ‘DECIDED’.
print('DECEDENT:',wv_model.wv.vocab['DECEDENT'].count)
## DECEDENT: 698
plot_Similar_Words(wv_model, "DECEDENT")
plot_Similar_Words(ft_model, "DECEDENT")
The next query word is DRUG, which appears in the dataset 808 times. Both the word2vec and fastText models return drug names or words semantically similar to the concept of drug (e.g. SUBSTANCE, POLYSUBSTANCE). fastText again appears to prioritize words with similar character n-grams, such as RUG, DRUGS, and DR.
print('DRUG:',wv_model.wv.vocab['DRUG'].count)
## DRUG: 808
plot_Similar_Words(wv_model, "DRUG")
plot_Similar_Words(ft_model, "DRUG")
The next query word is FENTANYL, which appears in the dataset 93 times. Despite the relatively low frequency in the corpus, the top ranked words in the word2vec model are all drug names, with HYDROCODONE, METHADONE, TRAMADOL, and HYDROMORPHONE belonging to the same drug class. The majority of fastText matches are also drug names, but only MORPHINE belongs to the same drug class. The typo FENTAYL is returned as a top match.
From this and the prior queries, a general observation might be that, for this sample dataset, word2vec is more effective at identifying semantically similar words, while fastText is useful for identifying rare words or typos.
print('FENTANYL:',wv_model.wv.vocab['FENTANYL'].count)
## FENTANYL: 93
plot_Similar_Words(wv_model, "FENTANYL")
plot_Similar_Words(ft_model, "FENTANYL")
Using the measure of cosine similarity, we can flag terms of interest in a record, such as a line of text from a death certificate. One example might be to flag all potential drug mentions within a death record. This can be done by tokenizing a record and comparing the cosine similarity of each word to a reference term such as DRUG. Those words that pass a threshold of similarity (e.g. 0.7) are flagged as potential drug mentions.
The function below accepts a record and provides a horizontal bar chart to visualize the rank-ordered cosine similarity between each word and the reference term DRUG. Success of this method depends on the quality of the word embedding model and careful selection of the reference term and of the threshold for flagging a word as a drug mention.
def detectDrugMentions(record):
    record_tokenized = cleanText(record)
    results = []
    for token in record_tokenized:
        # Skip tokens not in the model vocabulary to avoid a KeyError
        if token in wv_model.wv.vocab:
            results.append([token, wv_model.wv.similarity("DRUG", token)])
    results = pd.DataFrame(results, columns=['word','similarity'])
    results = results.sort_values(by=['similarity'], ascending = False)
    bars = plt.barh(y = results['word'], width = results['similarity'])
    # Color bars green if they pass the 0.7 similarity threshold, red otherwise
    for bar in bars:
        if bar.get_width() > 0.7:
            bar.set_color('g')
        else:
            bar.set_color('r')
    # Dashed vertical line marking the threshold
    line = plt.axvline(0.7)
    line.set_linestyle("dashed")
    line.set_color("black")
    plt.title("Potential drug mentions")
    plt.show()
record = "HYPOXIC ENCEPHALOPATHY,CARDIOPULMONARY ARREST, HEROIN TOXICITY: 6 ACETYL, CODEINE AND MORPHINE DETECTED"
detectDrugMentions(record)
record = "DIASTOLIC CONGESTIVE HEART FAILURE, END-STAGE"
detectDrugMentions(record)
record = "METASTATIC BREAST CARCINOMA"
detectDrugMentions(record)
t-SNE, like PCA or LDA, is a technique for dimensionality reduction. It is advantageous for visualization in that it retains local structure when projecting down to lower dimensions, so semantically similar words that are close in the high-dimensional vector space should also be close together in the 2D projection. This is useful for illustrating the general concept of word vectors and the close proximity of semantically similar words.
The function below uses the TSNE implementation in scikit-learn to transform the word vectors obtained from the word2vec embedding matrix into a 2-dimensional representation. Due to the size of the vocabulary, a min_count parameter is provided to restrict the number of points and labels in the scatterplot. A parameter for a target word is also provided to decrease the transparency of all words except those with a cosine similarity of 0.7 or greater to the target.
def tsne_plot(model, min_count = 50, target_word = "DRUG"):
    labels = []
    tokens = []
    # Collect word vectors and labels for words above the count threshold
    for word in model.wv.vocab:
        if model.wv.vocab[word].count > min_count:
            tokens.append(model.wv[word])
            labels.append(word)
    # Reduce the word vectors to 2 dimensions for plotting
    reduce_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = reduce_model.fit_transform(tokens)
    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])
    plt.figure(figsize=(16, 16))
    for i in range(len(x)):
        plt.scatter(x[i], y[i])
        text_label = plt.annotate(labels[i],
                                  xy=(x[i], y[i]),
                                  xytext=(5, 2),
                                  textcoords='offset points',
                                  ha='right',
                                  va='bottom')
        # Fade labels for words that are not similar to the target word
        if model.wv.similarity(target_word, labels[i]) > 0.7:
            text_label.set_alpha(1)
        else:
            text_label.set_alpha(0.2)
    plt.show()
The first variation of the visualization focuses on DRUG as the target word and sets only words with a cosine similarity of 0.7 or higher to be fully opaque. We can see at the bottom a cluster of words including terms such as INTRAVENOUS, INTOXICATION, METHAMPHETAMINE, HEROIN, METHADONE, ALPRAZOLAM, and so forth.
tsne_plot(model = wv_model, min_count = 70)
The second variation of the visualization focuses on HEMATOMA as the target word, which occurs in the upper center portion of the scatterplot. It can be seen that closely related words such as HEMORRHAGE and SUBDURAL are in close proximity to the target word.
tsne_plot(model = wv_model, min_count = 50, target_word = "HEMATOMA")
Using tools such as TensorBoard, embeddings may be loaded and displayed within an embedding projector, which plots the data in an interactive 3-D visualization. Data may be projected using methods such as PCA or t-SNE, or using custom projections by specifying word vectors as the axes for the visualization (e.g. projecting word vectors from left to right using the words ACUTE and CHRONIC). Entering a word in the search box also produces a rank-ordered list of words according to cosine similarity.
A shareable public link to this tensorboard visualization is available at:
Embedding projector screenshot from tensorflow
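As a sketch of how the trained vectors might be exported for the projector, the snippet below writes tab-separated vector and metadata files; the file names are illustrative, and gensim also ships a word2vec2tensor conversion script that serves the same purpose.
# Write word vectors and their labels to TSV files that the embedding
# projector can load interactively
with open("vectors.tsv", "w") as vec_file, open("metadata.tsv", "w") as meta_file:
    for word in wv_model.wv.vocab:
        vec_file.write("\t".join(str(x) for x in wv_model.wv[word]) + "\n")
        meta_file.write(word + "\n")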