Data 624 - Advanced Exploration and Visualization in Health

Instructor: Zahra Shakeri – Winter 2021

Handout #3-1 Natural Language Processing (Part I)

Natural Language Processing (NLP) is a form of artificial intelligence that enables computer programs to process and analyze unstructured data, such as free-text physician notes written in an EHR. Any organization seeking to leverage its data to improve outcomes, reduce cost, and further medical research needs to consider the wealth of insight stored in text and how it will create value from that data using NLP. In this handout, we review the basic concepts of text mining, including building a corpus, cleaning and pre-processing text, stemming, constructing a document-term matrix, and basic frequency, word-cloud, and word-association analyses.

In our second lecture on NLP, we will cover semantic analysis and supervised/unsupervised information extraction topics.

Text mining in R

In R, text mining packages are not installed by default, so one has to install them manually. The relevant packages are:

  • tm – the text mining package.
  • SnowballC – required for stemming.
  • wordcloud – which is self-explanatory.

(Warning for Windows users: R is case-sensitive so Wordcloud != wordcloud)
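
If these packages are not yet on your machine, a one-time install along the following lines will do (a minimal sketch, assuming an internet connection; afterwards each package is loaded with library()):

#one-time installation of the text mining packages used in this handout
install.packages(c("tm", "SnowballC", "wordcloud"))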

About the dataset: The dataset for this handout is a revised and modified version of the dataset you will use for Datathon #2. It was collected by Gräßer, Felix et al. and provides patient reviews of specific drugs, along with the related conditions and a 10-star patient rating reflecting overall satisfaction. The data were obtained by crawling online pharmaceutical review sites and were originally published in a study on sentiment analysis of drug experience over multiple facets. Data of this kind can be housed in various databases, including pharmacy management systems, financial systems, category product systems, and supply chain systems.

library(tm)
## Loading required package: NLP
#When you load the 'tm' package, its dependencies are loaded automatically – in this case, the NLP (natural language processing) package.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
setwd("/Users/zahrashakeri/Library/Mobile Documents/com~apple~CloudDocs/Teaching/2021/Data 624/Lectures/Lecture 03/Handout#3")

df <- read.csv("DrugInfo.csv")
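
Before building the corpus, it is worth taking a quick look at the imported data frame (the exact column names depend on your copy of DrugInfo.csv; in what follows we only rely on the review column):

#quick structural overview of the imported data
str(df)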

Corpus in R

Next, we need to create a collection of documents (technically referred to as a Corpus) in the R environment. This basically involves loading all the reviews into a Corpus object. The tm package provides the Corpus() function to do this. In a nutshell, Corpus() can read from various sources; here we build the corpus from the review column of the data frame we loaded from the CSV file:

#Create Corpus
reviews <- Corpus(VectorSource(df$review)) #Since we are passing character values, we cannot call Corpus(df$review) directly; we need to wrap the column in the VectorSource() function
reviews
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 10115

Data Cleaning and Pre-Processing

Data cleansing, though tedious, is perhaps the most important step in text analysis. As we will see, dirty data can play havoc with the results. Data cleaning is also an iterative process, as there are always problems that are overlooked the first time around – sometimes you may need more than 100 iterations!

getTransformations()
## [1] "removeNumbers"     "removePunctuation" "removeWords"      
## [4] "stemDocument"      "stripWhitespace"

There are a few preliminary clean-up steps we need to do before we use these powerful transformations. If you inspect some reviews in the corpus, you will notice that some reviews have some quirks in their writing. For example, they often use colons and hyphens without spaces between the words separated by them. Using the removePunctuation transform without fixing this will cause the two words on either side of the symbols to be combined. Clearly, we need to fix this prior to using the transformations.
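
For example, you can print one of the raw reviews to see these quirks for yourself (the document index here is arbitrary; the content depends on your copy of the data):

#inspect a raw review before any cleaning
writeLines(as.character(reviews[[1]]))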

To fix the above, we need to create a custom transformation. The tm package provides the ability to do this via the content_transformer function.

Here is the R code to build the content transformer, which we will call toSpace:

#create the toSpace content transformer
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})

Now we can use this content transformer to replace colons and hyphens with spaces, like so:

reviews <- tm_map(reviews, toSpace, "-")
reviews <- tm_map(reviews, toSpace, ":")

If it all looks good, we can now apply the removePunctuation transformation. This is done as follows:

#Remove punctuation – removePunctuation deletes punctuation marks (which is why we replaced hyphens and colons with spaces first)
reviews <- tm_map(reviews, removePunctuation)

If all is well, you can move to the next step, which is to convert the text to lower case.

Here’s the relevant code:

#Transform to lower case (need to wrap in content_transformer)
reviews <- tm_map(reviews,content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(reviews, content_transformer(tolower)):
## transformation drops documents

Text analysts are typically NOT interested in numbers, since these do not usually contribute to the meaning of the text. This is not always the case, however – for example, you might want to count how many times a particular year appears in a corpus. Removing numbers does not need to be wrapped in content_transformer, as removeNumbers is a standard transformation in tm.

#Strip digits (std transformation, so no need for content_transformer)
reviews <- tm_map(reviews, removeNumbers)
## Warning in tm_map.SimpleCorpus(reviews, removeNumbers): transformation drops
## documents

The next step is to remove common words from the text. These include articles (a, an, the), conjunctions (and, or, but, etc.), common verbs (is), and qualifiers (yet, however, etc.). The tm package includes a standard list of such words, which are referred to as stop words. We remove them using the standard removeWords transformation like so:

#remove stopwords using the standard list in tm
reviews <- tm_map(reviews, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(reviews, removeWords, stopwords("english")):
## transformation drops documents
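
If you are curious about which words are on the standard list, you can peek at it like this (a quick check; the list ships with tm and contains well over a hundred entries):

#peek at the first few entries of tm's built-in English stopword list
head(stopwords("english"), 20)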

Finally, we remove all extra whitespaces that have been generated during data cleaning, using the stripWhitespace transformation:

#Strip whitespace (cosmetic?)
reviews <- tm_map(reviews, stripWhitespace)
## Warning in tm_map.SimpleCorpus(reviews, stripWhitespace): transformation drops
## documents

Stemming

In textual datasets, a large corpus will contain many words with a common root – for example: visit, visited and visiting.

  • Stemming is the process of reducing relevant words to their common root, which in this case would be the word visit.

Simple stemming algorithms (such as the one in tm) work by chopping off the ends of words. This can cause problems: for example, the words dies and died might both be reduced to di instead of die. However, the overall benefit gained from stemming more than makes up for the downside of such special cases.
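
To get a feel for this behaviour, you can stem a few words directly with SnowballC's wordStem() function (a quick sketch; the exact results depend on the stemmer and language setting):

#stem a handful of words directly to see how the algorithm behaves
library(SnowballC)
wordStem(c("visit", "visited", "visiting", "dies", "died"))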

To see what stemming does, let’s take a look at the last few lines of the corpus before and after stemming:

writeLines(as.character(reviews[[55]]))
##  mother died lung cancer last hope medication within couple months gone

Now let’s stem the corpus and reinspect it.

 #load library
 library(SnowballC)
 #Stem document
 reviews <- tm_map(reviews,stemDocument)
## Warning in tm_map.SimpleCorpus(reviews, stemDocument): transformation drops
## documents
 writeLines(as.character(reviews[[55]]))
## mother die lung cancer last hope medic within coupl month gone

Document Term Matrix (DTM)

The DTM is a matrix that records the occurrences of words in the corpus, document by document. In the DTM, documents are represented by rows and terms (words) by columns. Each matrix entry is the number of times the corresponding term occurs in the corresponding document: 0 if the word does not appear at all, 1 if it appears once, 2 if it appears twice, and so on.
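
As a toy illustration (separate from our review corpus), here is the DTM of a two-document corpus:

#toy example: two tiny documents and their document-term matrix
toy <- Corpus(VectorSource(c("pain relief pain", "mild relief")))
inspect(DocumentTermMatrix(toy))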

The DTM for our review corpus would look like:

dtm <- DocumentTermMatrix(reviews)

This creates a document-term matrix from the corpus and stores the result in the variable dtm.

dtm
## <<DocumentTermMatrix (documents: 10115, terms: 12843)>>
## Non-/sparse entries: 351788/129555157
## Sparsity           : 100%
## Maximal term length: 37
## Weighting          : term frequency (tf)

This is a 10115 x 12843 matrix in which almost all entries are zero – only about 0.3% of the cells are non-zero, which is why the sparsity is reported as (rounding to) 100%.
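
The reported sparsity follows directly from the counts printed above (a quick sanity check using those numbers):

#fraction of matrix cells that are zero
1 - 351788 / (10115 * 12843) #roughly 0.997, i.e. about 99.7% of the entries are zero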

Let’s inspect a part of our DTM:

inspect(dtm[1:2,900:910])
## <<DocumentTermMatrix (documents: 2, terms: 11)>>
## Non-/sparse entries: 0/22
## Sparsity           : 100%
## Maximal term length: 8
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs instant nervous outsid patienc patient place present question quotment
##    1       0       0      0       0       0     0       0        0        0
##    2       0       0      0       0       0     0       0        0        0
##     Terms
## Docs quotsexu quotthi
##    1        0       0
##    2        0       0

This command displays terms 900 through 910 in the first two rows of the DTM.

Mining the Corpus

Notice that in constructing the DTM, we have converted a corpus of text into a mathematical object that can be analyzed using the quantitative techniques of matrix algebra. The DTM is therefore the starting point for quantitative text analysis.

For example, to get the frequency of occurrence of each word in the corpus, we simply sum over all rows to give column sums:

freq <- colSums(as.matrix(dtm))

Here we have first converted the DTM into a mathematical matrix using the as.matrix() function.

Next, we sort freq in descending order of term count:

#create sort order (descending)
dec_ord <- order(freq,decreasing=TRUE)

Then list the most and least frequently occurring terms:

#inspect most frequently occurring terms
freq[head(dec_ord)]
##    day   take  month   year effect   work 
##   6148   5994   4407   4155   3966   3934
#inspect least frequently occurring terms
freq[tail(dec_ord)] 
##          mtv          xan       absenc      atelvia choiceespeci       litemi 
##            1            1            1            1            1            1
write.csv(freq[dec_ord],"freq.csv")

Checking the frequency file reveals a few problems. First, some of the frequently occurring words are not content-carrying in this context and should be removed (e.g. can, use, now, day, also, …). Second, words like got and gotten are variants of the same stem got and should clearly be merged. These (and other errors of their ilk) can and should be fixed before proceeding. This is easily done using gsub() wrapped in content_transformer. Here is the code to clean up these and a few other issues that I found:

reviews <- tm_map(reviews, content_transformer(gsub), pattern = "gotten", replacement = "got")
## Warning in tm_map.SimpleCorpus(reviews, content_transformer(gsub), pattern =
## "gotten", : transformation drops documents
reviews <- tm_map(reviews, content_transformer(gsub), pattern = "gotta", replacement = "got")
## Warning in tm_map.SimpleCorpus(reviews, content_transformer(gsub), pattern =
## "gotta", : transformation drops documents
reviews <- tm_map(reviews, content_transformer(gsub), pattern = "felt", replacement = "feel")
## Warning in tm_map.SimpleCorpus(reviews, content_transformer(gsub), pattern =
## "felt", : transformation drops documents
#defining context-sensitive stopwords
DrugStopwords <- c("take", "get", "now", "ive", "use", "one", "also", "will", "can", "put")
reviews <- tm_map(reviews, removeWords, DrugStopwords)
## Warning in tm_map.SimpleCorpus(reviews, removeWords, DrugStopwords):
## transformation drops documents

Words like “he” and “im” give us no information about the subject matter of the documents in which they occur. They can, therefore, be eliminated without loss. Indeed, they ought to have been eliminated by the stopword removal we did earlier. However, since such words occur very frequently – virtually in all documents – we can remove them by enforcing bounds when creating the DTM, like so:

dtmr <-DocumentTermMatrix(reviews, control=list(wordLengths=c(3, 20),bounds = list(global = c(3,30))))

Here we have told R to include in the DTM only those words that occur in 3 to 30 documents. We have also enforced lower and upper limits on the length of the included words (between 3 and 20 characters).

dtmr
## <<DocumentTermMatrix (documents: 10115, terms: 3491)>>
## Non-/sparse entries: 31569/35279896
## Sparsity           : 100%
## Maximal term length: 18
## Weighting          : term frequency (tf)

We apply the process of calculating the frequency on the new DTM (dtmr):

freqr <- colSums(as.matrix(dtmr))

#create sort order
ordr <- order(freqr,decreasing=TRUE)
#inspect most frequently occurring terms
freqr[head(ordr)]
## irsquom    apri  viagra bactrim    nail saxenda 
##      49      47      43      41      40      39
#inspect least frequently occurring terms
freqr[tail(ordr)]
##       asham      pylera      revert      reader      adderr guaifenesin 
##           3           3           3           3           3           3
findFreqTerms(dtmr,lowfreq=80)
## character(0)

Here findFreqTerms() returns no terms: with the tight document-frequency bounds imposed above, the most frequent term in dtmr occurs only 49 times, which is below the lowfreq threshold of 80.

Before visualizing the word/frequency data, we need to convert it to a data frame:

wf=data.frame(term=names(freqr), occurrences=freqr)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
p <- ggplot(subset(wf, occurrences>35), aes(term, occurrences))
p <- p + geom_bar(stat="identity", fill="orange")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p

Finally, let’s create a wordcloud to visualize the most frequent words of our corpus. The code for this is:

#wordcloud
library(wordcloud)
## Loading required package: RColorBrewer
#setting the same seed each time ensures consistent look across clouds
set.seed(42)
#limit words by specifying min frequency
wordcloud(names(freqr),freqr,min.freq=60,colors=brewer.pal(6,"Dark2"))

Now that we have the most frequently occurring terms, we can check for correlations between some of these and other terms that occur in the corpus. In this context, correlation is a quantitative measure of the co-occurrence of words in multiple documents.

The tm package provides the findAssocs() function to do this. One needs to specify the DTM, the term of interest and the correlation limit.

Here are the results of running findAssocs() on some of the frequently occurring terms (pain, feel).

findAssocs(dtm,"pain",0.1)
## $pain
##       chronic        relief         joint          back         muscl 
##          0.21          0.18          0.18          0.16          0.15 
##         sever          neck        insert       surgeri        narcot 
##          0.14          0.14          0.14          0.14          0.14 
##          disc  fibromyalgia       vicodin       abdomin           hip 
##          0.14          0.13          0.13          0.13          0.13 
##          knee      dilaudid         manag      tramadol         sharp 
##          0.13          0.13          0.12          0.12          0.12 
##     ibuprofen          nerv       excruci         cramp        reliev 
##          0.12          0.12          0.12          0.11          0.11 
##       morphin         norco        killer      oxycodon      arthriti 
##          0.11          0.11          0.11          0.11          0.11 
##      methadon      shoulder differenceday       goneday         upper 
##          0.11          0.11          0.11          0.11          0.10 
##        cervix       herniat        epidur 
##          0.10          0.10          0.10
findAssocs(dtm,"feel",0.1)
## $feel
##               like             better               felt            anxieti 
##               0.30               0.20               0.16               0.15 
##            depress               take              start            curbsid 
##               0.14               0.13               0.13               0.13 
##             enlong            freqent frustrationquotand        illnessquot 
##               0.13               0.13               0.13               0.13 
##   increaseddecreas              virtu               much               just 
##               0.13               0.13               0.12               0.12 
##               dont               sick        aripiprazol                day 
##               0.12               0.12               0.12               0.11 
##               make               tire            anxious 
##               0.11               0.11               0.11

Try to interpret these associations and link them to other columns of the dataset. Any interesting findings?
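
As a starting point (a rough sketch only; the condition column name is assumed from the Datathon dataset description and may differ in your copy of DrugInfo.csv), you could check which conditions dominate the reviews that mention pain:

#sketch: relate reviews mentioning "pain" back to another column of the original data frame
pain_idx <- grepl("pain", df$review, ignore.case = TRUE)
head(sort(table(df$condition[pain_idx]), decreasing = TRUE))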

Enjoy Text Analysis!