In handout #3, we treated words as individual units and considered their relationship to documents. However, many interesting text analyses are based on the relationships between words, whether examining which words tend to follow others immediately or which tend to co-occur within the same documents.
In this handout, we will explore some of the methods that the tidytext library offers for calculating and visualizing relationships between words in your text dataset. This includes the token = "ngrams" argument, which tokenizes by pairs of adjacent words rather than by individual ones. We'll also introduce a new package: ggraph, which extends ggplot2 to construct network plots.
library(udpipe)
library(SnowballC) # for text stemming
library(hunspell) # for spell checking
library(ggplot2)
library(igraph)
library(ggraph)
library(widyr)
library(dplyr)
library(tm)
After importing and cleaning the dataset, we will use the unnest_tokens() function to tokenize reviews into consecutive sequences of words, called n-grams. By seeing how often word X is followed by word Y, we can then build a model of the relationships between them.
#-------------------------------------
#Changing the working directory
setwd("/Users/zahrashakeri/Library/Mobile Documents/com~apple~CloudDocs/Teaching/2021/Data 624/Lectures/Lecture 04/HandoutR")
drug_info<- read.csv("DrugInfo.csv", header=TRUE, sep=",", fileEncoding = "UTF-8") #importing the textual data
drug_info$review <- lapply(drug_info$review, as.character) # the original format was 'factor'; check with class(drug_info$review)
drug_info$review <- as.character(unlist(drug_info$review)) # unlist back to a plain character vector
drug_info$review <- iconv(drug_info$review, "UTF-8", "ASCII", sub="") # remove all special (non-ASCII) characters from the text
drug_info$review <-stemDocument(drug_info$review) #stemming the document
Let’s explore the most common bigrams using dplyr’s count() function:
bigrams <- drug_info %>% tidytext::unnest_tokens(bigram, review, token = "ngrams", n = 2)
bigrams %>% count(bigram, sort = TRUE)
As one might expect, a lot of the most common bigrams are pairs of common (uninteresting) words, such as "i am" and "i have": what we call "stop words". To address this, we use tidyr's separate() function, which splits a column into multiple columns based on a delimiter. This lets us separate the bigrams into two columns, "word1" and "word2", at which point we can remove cases where either word is a stop word or irrelevant in our context.
#-------------Optional extra cleanup-------------
drug_info$review <- tolower(drug_info$review)
drug_info$review <- gsub(pattern = "pattern", replacement = " ", drug_info$review) # replace "pattern" with any custom string you want stripped out
library(tidyr)
bigrams_separated <- bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
context_w1 <- c("ve", "pill", "ani") # context-specific words to remove when they appear as the first word of a bigram
context_w2 <- c("told", "pill", "bit", "notice", "ani") # and when they appear as the second word
library(tidytext)
library(stringr)
data(stop_words)
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)%>%
filter(!str_detect(word1, "^[0-9]*$")) %>% #removing numbers from word1
filter(!str_detect(word2, "^[0-9]*$")) %>% #removing numbers from word2
filter(!word1 %in% context_w1) %>%
filter(!word2 %in% context_w2)
A variety of methods and dictionaries exist for evaluating the opinion or emotion in text. The tidytext package provides access to several sentiment lexicons. Three general-purpose lexicons are AFINN, Bing, and NRC:
library(tidytext)
#Install the textdata package to access the following datasets.
get_sentiments("afinn")
## # A tibble: 2,477 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # … with 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # … with 13,891 more rows
With data in a tidy format, sentiment analysis can be done as an inner join:
library(SnowballC)
tidy <- drug_info %>% tidytext::unnest_tokens(word, review) %>%
mutate(word_stem = wordStem(word, language="english"))
Inner joins vs. outer joins: an inner join keeps only the words that appear in both our tidy data and the sentiment lexicon, whereas an outer (or anti) join would also keep the unmatched rows. Here we use an inner join to tag each word with its Bing sentiment and count the word/sentiment pairs:
bing_word_counts <- tidy %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE)
## Joining, by = "word"
bing_word_counts
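By contrast, an anti join keeps the review words that have no match in the lexicon, which is a quick way to see what the sentiment tagging ignores. A minimal sketch, reusing the tidy data frame from above:
#Words in the reviews with no match in the Bing lexicon
#(an anti join keeps the rows without a match, the opposite of the inner join above)
tidy %>%
  anti_join(get_sentiments("bing"), by = "word") %>%
  count(word, sort = TRUE)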
The bing_word_counts result can be shown visually as:
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
## Selecting by n
To use the comparison.cloud() function, you may need to turn the data frame into a matrix with reshape2's acast(). Let's do the sentiment analysis to tag positive and negative words using an inner join, then find the most common positive and negative words. Because our data is in a tidy format, everything up to the point where we hand the matrix to comparison.cloud() can be done with joins, piping, and dplyr.
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(wordcloud)
## Loading required package: RColorBrewer
#acast() converts a long-format data frame into a wide-format vector/matrix/array.
tidy %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("#DC143C", "#66CDAA"),
max.words = 100)
## Joining, by = "word"
Now that we have the data organized into bigrams, it’s easy to tell how often words are preceded by a word like “not”:
bigrams_separated %>%
filter(word1 == "not") %>%
count(word1, word2, sort = TRUE)
By performing sentiment analysis on the bigram data, we can examine how often sentiment-associated words are preceded by “not” or other negating words. We could use this to ignore or even reverse their contribution to the sentiment score.
AFINN <- get_sentiments("afinn")
Let’s examine the most frequent words that were preceded by “not” and were associated with a sentiment.
not_words <- bigrams_separated %>%
filter(word1 == "not") %>%
inner_join(AFINN, by = c(word2 = "word")) %>%
count(word2, value, sort = TRUE)
not_words
## word2 value n
## 1 worth 2 39
## 2 recommend 2 31
## 3 help 2 22
## 4 like 2 17
## 5 want 1 15
## 6 bad -3 11
## 7 stop -1 8
## 8 gain 2 7
## 9 good 3 6
## 10 better 2 4
## 11 care 2 4
## 12 fun 4 4
## 13 severe -2 4
## 14 allow 1 3
## 15 bother -2 3
## 16 perfect 3 3
## 17 clean 2 2
## 18 depressed -2 2
## 19 great 3 2
## 20 lost -3 2
## 21 overweight -1 2
## 22 pain -2 2
## 23 panic -3 2
## 24 pleasant 3 2
## 25 scary -2 2
## 26 sore -1 2
## 27 suffer -2 2
## 28 terrible -3 2
## 29 warn -2 2
## 30 accept 1 1
## 31 admit -1 1
## 32 alone -2 1
## 33 anxious -2 1
## 34 big 1 1
## 35 blame -2 1
## 36 breakthrough 3 1
## 37 clear 1 1
## 38 complain -2 1
## 39 cut -1 1
## 40 dream 1 1
## 41 drop -1 1
## 42 enjoy 2 1
## 43 euphoric 4 1
## 44 expand 1 1
## 45 extend 1 1
## 46 fear -2 1
## 47 focused 2 1
## 48 gag -2 1
## 49 happy 3 1
## 50 hard -1 1
## 51 harm -2 1
## 52 healthy 2 1
## 53 horrible -3 1
## 54 hurt -2 1
## 55 impress 3 1
## 56 impressed 3 1
## 57 improve 2 1
## 58 inhibit -1 1
## 59 irritate -3 1
## 60 losing -3 1
## 61 luck 3 1
## 62 miss -2 1
## 63 moody -1 1
## 64 pay -1 1
## 65 prepared 1 1
## 66 pretty 1 1
## 67 recommended 2 1
## 68 regret -2 1
## 69 sad -2 1
## 70 safe 1 1
## 71 scare -2 1
## 72 trust 1 1
## 73 wish 1 1
## 74 worse -3 1
## 75 worsened -3 1
For example, the most common sentiment-associated word to follow “not” was “worth”, which would normally have a (positive) score of 2.
It’s worth asking which words contributed the most in the “wrong” direction. To compute that, we can multiply their value by the number of times they appear (so that a word with a value of +3 occurring 10 times has as much impact as a word with a sentiment value of +1 occurring 30 times). We visualize the result with a bar plot:
library(ggplot2)
not_words %>%
mutate(contribution = n * value) %>%
arrange(desc(abs(contribution))) %>%
head(20) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(word2, n * value, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
xlab("Words preceded by \"not\"") +
ylab("Sentiment value * number of occurrences") +
coord_flip()
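As hinted at earlier, one option is to flip the sign of the AFINN value whenever a sentiment word is preceded by "not", so that negated words no longer pull the score in the wrong direction. A minimal sketch (not a full negation-handling scheme), reusing bigrams_separated and AFINN from above:
#Flip the AFINN value of word2 when word1 is "not"
adjusted_scores <- bigrams_separated %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  mutate(adjusted_value = ifelse(word1 == "not", -value, value))
#Compare the raw and the negation-adjusted total sentiment
adjusted_scores %>%
  summarise(raw = sum(value), adjusted = sum(adjusted_value))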
“Not” isn’t the only term that provides some context for the following word. We could pick four common words (or more) that negate the subsequent term, and use the same joining and counting approach to examine all of them at once.
negation_words <- c("not", "no", "never", "without")
negated_words <- bigrams_separated %>%
filter(word1 %in% negation_words) %>%
inner_join(AFINN, by = c(word2 = "word")) %>%
count(word1, word2, value, sort = TRUE)
negated_words
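To examine all four negation words at once visually, we can facet the earlier contribution plot by negation word. A sketch, reusing reorder_within() and scale_x_reordered() from tidytext (also used later in this handout):
negated_words %>%
  mutate(contribution = n * value) %>%
  group_by(word1) %>%
  top_n(10, abs(contribution)) %>%
  ungroup() %>%
  mutate(word2 = reorder_within(word2, contribution, word1)) %>%
  ggplot(aes(word2, contribution, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ word1, scales = "free") +
  scale_x_reordered() +
  xlab("Words preceded by a negation term") +
  ylab("Sentiment value * number of occurrences") +
  coord_flip()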
We may be interested in visualizing all of the relationships among words simultaneously, rather than just the top few at a time. As one common visualization, we can arrange the words into a network, or "graph." Here we'll be referring to a "graph" not in the sense of a visualization, but as a combination of connected nodes. A graph can be constructed from a tidy object since it has three variables: from (the node an edge comes from), to (the node an edge goes toward), and weight (a numeric value associated with each edge, in our case the bigram count n).
The igraph package has many powerful functions for manipulating and analyzing networks. One way to create an igraph object from tidy data is the graph_from_data_frame() function, which takes a data frame of edges with columns for “from”, “to”, and edge attributes (in this case n):
#------------------Generating bi-grams--------------
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
View(bigram_counts[1:250,])
library(igraph)
#---------------------Generating Graphs---------------------------
library(ggraph)
library(ggplot2)
set.seed(2017)
bigram_counts %>%
filter(n > 15) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_width = n), edge_alpha = 0.5, edge_color = "cadetblue3") + # constant alpha belongs outside aes(); only n is mapped to edge width
geom_node_point(color = "deeppink4", size = 2, alpha=0.65) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)+
theme_void()
ggsave("Graph.pdf", width = 6, height = 6)
#--------------------------------------------------------------------
set.seed(2016)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
bigram_counts %>%
filter(n > 15) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n*2), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "deeppink4", size = 3) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
ggsave("Graph-Directed.pdf", width = 10, height =6)
We can use topic modeling to model each document (review) as a mixture of topics and each topic as a mixture of words. We will use latent Dirichlet allocation (LDA) for our topic modeling; however, there are other possible approaches for topic modeling.
In essence, LDA is a technique that facilitates the automatic discovery of themes in a collection of documents.
The basic assumption behind LDA is that each document in a collection consists of a mixture of collection-wide topics. However, in reality we observe only documents and words, not topics; the latter are part of the hidden (or latent) structure of the documents. The aim is to infer the latent topic structure given the words and documents. LDA does this by iteratively recreating the documents in the corpus, adjusting the relative importance of topics within documents and of words within topics.
To do the topic modeling as implemented here, we need to make a DocumentTermMatrix, a special kind of matrix from the tm package (Review Handout 3-1 for more information on DTMs).
Let’s first import the data and clean up the text a bit:
library(tm)
#When you install and load the 'tm' package, dependent packages are loaded automatically; in this case, the dependency is the NLP (natural language processing) package.
library(dplyr)
df <- read.csv("DrugInfo.csv")
#Create Corpus
reviews <- Corpus(VectorSource(df$review)) #Since we are passing a character vector, we cannot use Corpus(df$review) directly; we need to wrap the column in the VectorSource() function
reviews
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 3324
#create the toSpace content transformer
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})
reviews <- tm_map(reviews, toSpace, "-")
reviews <- tm_map(reviews, toSpace, ":")
#Remove punctuation: replace punctuation marks with " "
reviews <- tm_map(reviews, removePunctuation)
#Transform to lower case (need to wrap in content_transformer)
reviews <- tm_map(reviews,content_transformer(tolower))
#Strip digits (std transformation, so no need for content_transformer)
reviews <- tm_map(reviews, removeNumbers)
#remove stopwords using the standard list in tm
reviews <- tm_map(reviews, removeWords, stopwords("english"))
#Strip whitespace (cosmetic?)
reviews <- tm_map(reviews, stripWhitespace)
#load library
library(SnowballC)
#Stem document
reviews <- tm_map(reviews,stemDocument)
dtm <- DocumentTermMatrix(reviews)
Now we are ready to apply the LDA function:
library(topicmodels)
ap_lda <- LDA(dtm, k = 2, control = list(seed = 1234))
ap_lda
## A LDA_VEM topic model with 2 topics.
To extract the per-topic-per-word probabilities ("beta"), we use the tidy() function from the tidytext package with matrix = "beta":
library(tidytext)
ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
## # A tibble: 14,764 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 bystol 0.0000160
## 2 2 bystol 0.0000287
## 3 1 combin 0.000271
## 4 2 combin 0.000400
## 5 1 effect 0.00712
## 6 2 effect 0.0128
## 7 1 fish 0.000110
## 8 2 fish 0.00000925
## 9 1 oil 0.000140
## 10 2 oil 0.0000987
## # … with 14,754 more rows
We could use dplyr’s top_n() to find the 10 terms that are most common within each topic. As a tidy data frame, this lends itself well to a ggplot2 visualization.
library(ggplot2)
library(dplyr)
ap_top_terms <- ap_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
ap_top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()
This visualization lets us understand the two topics that were extracted from the reviews.
We need to try different values of k (the number of topics) and make a choice by inspecting the results.
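One rough, quantitative way to compare candidate values of k is the perplexity() function from the topicmodels package (lower is generally better); it is only a heuristic, so the topics themselves should still be inspected. A minimal sketch, assuming the dtm built above (fitting several models can take a while):
#Compare a few candidate numbers of topics by perplexity
candidate_k <- c(2, 4, 6, 8)
perplexities <- sapply(candidate_k, function(k) {
  perplexity(LDA(dtm, k = k, control = list(seed = 1234)))
})
data.frame(k = candidate_k, perplexity = perplexities)
Whatever value of k we settle on, the inspection workflow below stays the same; here we refit with k = 4: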
reviews_lda <- LDA(dtm, k = 4, control = list(seed = 1234))
reviews_lda
## A LDA_VEM topic model with 4 topics.
reviews_topics <- tidy(reviews_lda, matrix = "beta")
reviews_topics
## # A tibble: 29,528 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 bystol 0.0000204
## 2 2 bystol 0.0000442
## 3 3 bystol 0.0000101
## 4 4 bystol 0.0000147
## 5 1 combin 0.000295
## 6 2 combin 0.000420
## 7 3 combin 0.000114
## 8 4 combin 0.000513
## 9 1 effect 0.0100
## 10 2 effect 0.0185
## # … with 29,518 more rows
We could use dplyr’s top_n() to find the top 5 terms within each topic.
top_terms <- reviews_topics %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms
## # A tibble: 20 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 day 0.0216
## 2 1 take 0.0164
## 3 1 year 0.0159
## 4 1 start 0.0150
## 5 1 first 0.0146
## 6 2 effect 0.0185
## 7 2 period 0.0167
## 8 2 work 0.0136
## 9 2 month 0.0135
## 10 2 now 0.0131
## 11 3 take 0.0239
## 12 3 day 0.0215
## 13 3 just 0.0156
## 14 3 work 0.0120
## 15 3 week 0.0114
## 16 4 month 0.0207
## 17 4 pain 0.0203
## 18 4 year 0.0168
## 19 4 side 0.0127
## 20 4 start 0.0123
library(ggplot2)
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()
The LDA algorithm returns an object that contains a lot of information. Of particular interest to us are the document-to-topic assignments, the top terms in each topic, and the probabilities associated with each of those terms. Take some time to examine the output and find the topics that can be extracted based on their corresponding top terms.
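For instance, the document-to-topic assignments can be pulled out the same way as beta, using matrix = "gamma". A quick sketch with the k = 4 model fitted above:
#Per-document-per-topic probabilities ("gamma"): how much of each review is estimated to come from each topic
reviews_gamma <- tidy(reviews_lda, matrix = "gamma")
#The most likely topic for each review
reviews_gamma %>%
  group_by(document) %>%
  top_n(1, gamma) %>%
  ungroup() %>%
  arrange(document)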
Reference: Silge, J. and Robinson, D., Text Mining with R: A Tidy Approach (O'Reilly). This is a great reference for unsupervised information extraction from textual datasets.
Enjoy Language and Topic Modeling!