Text Analytics

First, you need to install and load all the required packages.

#install.packages("tm")
#install.packages("tokenizers")
#install.packages("SnowballC")
#install.packages("textstem")
#install.packages('gutenbergr')

# Add packages to Library
library(tm)
library(tokenizers)
library(SnowballC) # used by stemDocument below
library(textstem)
library(gutenbergr)

Basics of Text

Corpus and Corpora of Text

The tm package allows you to create a corpus, which is a collection of text documents. For text analysis we first need to convert the text documents into a corpus.

text <- c("Now, I truly understand that because and Because it's an election campaigning in the middle of the year",
          "expectations for what we will achieve this year are low.",
          "But, Mister Speaker, I appreciate the constructive approach",
          "that you and other leaders took at the end of last year",
          "to pass a budget and make tax cuts permanent for working",
          "families. So I hope we can work together this year on some",
          "bipartisan priorities like criminal justice reform and",
          "helping people who are battling prescription drug abuse",
          "and heroin abuse. So, who knows, we might surprise the",
          "cynics again")

docs <- Corpus(VectorSource(text))
print(docs[[1]]$content)
## [1] "Now, I truly understand that because and Because it's an election campaigning in the middle of the year"

Tokenization

Tokenization is the task of separating the text into terms called tokens. The tokenizers library has a built-in function called tokenize_words that allows you to do that.

words_token <- tokenize_words(docs$content)
print(words_token[1])
## [[1]]
##  [1] "now"         "i"           "truly"       "understand"  "that"       
##  [6] "because"     "and"         "because"     "it's"        "an"         
## [11] "election"    "campaigning" "in"          "the"         "middle"     
## [16] "of"          "the"         "year"

Once we have a corpus, we typically want to modify the documents, e.g., by stemming, stopword removal, etc. To see the effect of such transformations we will use the tm_map() function, which applies (maps) a function to all elements of the corpus.

Convert to Lower Case

One common step is to convert the terms to lower case. If the terms are not converted to lower case, the same word written with different combinations of lower-case and capital letters will be treated as different terms:

docs2 <- tm_map(docs, content_transformer(tolower))
print(docs[[1]]$content)
## [1] "Now, I truly understand that because and Because it's an election campaigning in the middle of the year"
print(docs2[[1]]$content)
## [1] "now, i truly understand that because and because it's an election campaigning in the middle of the year"

Remove Stopwords

Another common step is to remove stopwords. Stopwords are a set of commonly used words in a language; examples of stopwords in English are terms like “a”, “the”, etc. Stopword removal is commonly used in text mining and natural language processing (NLP) to eliminate words that are so widely used that they carry very little useful information. We can do this with tm_map as below:

docs2 <- tm_map(docs2, removeWords, stopwords("english"))
print(docs[[1]]$content)
## [1] "Now, I truly understand that because and Because it's an election campaigning in the middle of the year"
print(docs2[[1]]$content)
## [1] "now,  truly understand       election campaigning   middle   year"

Stemming and Lemmatization

Stemming and lemmatization are term transformations that help group similar tokens together. Let’s see how a sentence changes with stemming:

text_stem <- tm_map(docs2, stemDocument)
print(text_stem [[1]]$content)
## [1] "now, truli understand elect campaign middl year"
print(docs[[1]]$content)
## [1] "Now, I truly understand that because and Because it's an election campaigning in the middle of the year"

I personally prefer lemmatization in my analyses. Let’s see the effect of lemmatization on the same sentence.

text_lemma <- tm_map(docs2, lemmatize_strings)
print(text_lemma[[1]]$content)
## [1] "now, truly understand election campaign middle year"
print(docs[[1]]$content)
## [1] "Now, I truly understand that because and Because it's an election campaigning in the middle of the year"

Cleaning & Preprocessing: Regular Expressions (Regex).

Data cleaning, such as the removal of numbers, symbols, or text from different languages, is a major part of text analytics. Regex functions can be applied to such tasks. For example, grep and grepl can be used to match patterns in string vectors.

sentences = c("I like statistics", "I like bananas", "Estates and statues are expensive")
grep("stat", sentences)
## [1] 1 3
grepl("stat", sentences)
## [1]  TRUE FALSE  TRUE

The functions sub and gsub can be used to replace matches in string vectors. The sub function replaces only the first match in each string element, while gsub replaces all matches.

sentences = c("I like statistics and I study a lot", "I like bananas", "Estates and statues are expensive")
sub(pattern = "I", replacement = "You", sentences)
## [1] "You like statistics and I study a lot"
## [2] "You like bananas"                     
## [3] "Estates and statues are expensive"
gsub(pattern = " ", replacement = "_", sentences)
## [1] "I_like_statistics_and_I_study_a_lot" "I_like_bananas"                     
## [3] "Estates_and_statues_are_expensive"

We can also use regular expressions (regex) with the tm_map function, as below:

ReplaceCharacters <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
sentences = c("I like statistics and I @study a lot", "I like /bananas")
docs <- Corpus(VectorSource(sentences))

docs <- tm_map(docs, ReplaceCharacters, "/")
docs <- tm_map(docs, ReplaceCharacters, "@")
print(docs[[1]]$content)
## [1] "I like statistics and I  study a lot"
print(docs[[2]]$content)
## [1] "I like  bananas"

Document Term Matrix – TF and TF-IDF

A document-term matrix (DTM) describes the frequency of terms in a collection of documents. Each row corresponds to a document in the collection and each column to a term.

DTM - Term Frequency

Each cell contains a simple count of how often a term occurs in a document. TF = (Number of times the word occurs in the text) / (Total number of words in the text)
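
Before moving to a larger text, here is a minimal sketch of what a DTM looks like for the small corpus created earlier (docs2, already lower-cased and with stopwords removed); inspect() prints the counts for a few documents and terms:

dtm_small <- DocumentTermMatrix(docs2)
inspect(dtm_small[1:3, 1:6])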

A DTM works best when there is a large amount of text, so we will use a free book from Project Gutenberg (http://www.gutenberg.org/) as an example. In particular, we will use “The Republic” by Plato.

Plato <- gutenberg_download(150)
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
colnames(Plato) <- c('doc_id','text')
platocorpus <- SimpleCorpus(VectorSource(Plato$text))

The DTM is generated together with several other preprocessing tasks that help group similar words. Essentially, punctuation, numbers, and stopwords are removed, and the text is converted to lowercase. The DTM and these steps are produced through the DocumentTermMatrix function. Obviously, if you have already done the preprocessing with functions such as tm_map, it is wise not to run the same steps twice.

dtm <- DocumentTermMatrix(platocorpus, control = list(
  removePunctuation = TRUE, removeNumbers = TRUE, 
  stopwords =  TRUE, tolower = TRUE, 
  wordLengths=c(1,Inf)))
dtm
## <<DocumentTermMatrix (documents: 16560, terms: 8102)>>
## Non-/sparse entries: 52795/134116325
## Sparsity           : 100%
## Maximal term length: 23
## Weighting          : term frequency (tf)
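
To peek at a corner of this matrix you can inspect a small slice (each document here is a single line of the book, so expect many zeros):

inspect(dtm[1:5, 1:8])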

We can now find the most commonly used terms. Let’s select all the terms that are used at least 200 times.

findFreqTerms(dtm,200)
##  [1] "one"       "may"       "certainly" "state"     "like"      "also"     
##  [7] "must"      "said"      "can"       "good"      "another"   "men"      
## [13] "justice"   "just"      "man"       "now"       "us"        "soul"     
## [19] "true"      "will"      "say"       "let"       "replied"   "yes"

Additionally, we can find the most associated terms. These are the terms that are most commonly found together with our term of interest. In the case of philosophy these would be terms such as disciples and study.

findAssocs(dtm, "philosophy", corlimit = 0.1) 
## $philosophy
##      disciples          study           muse        connect        english 
##           0.13           0.13           0.13           0.11           0.11 
##         traced       idealism     beginnings        efforts         errors 
##           0.11           0.11           0.11           0.11           0.11 
##        infancy      uncertain       ascribed      giddiness         feeder 
##           0.11           0.11           0.11           0.11           0.11 
##        student    uncivilized        deeming       impostor      distracts 
##           0.11           0.11           0.11           0.11           0.11 
##        captive accompaniments     incomplete           rite      sanctuary 
##           0.11           0.11           0.11           0.11           0.11 
##      illhealth    degenerates     persisting       divinely     originates 
##           0.11           0.11           0.11           0.11           0.11 
##     pretenders     disbelieve   undeservedly       votaries 
##           0.11           0.11           0.11           0.10

Finally, most of the terms may be used sparsely, perhaps only once or twice in the whole text. We may be interested only in the terms that are frequently used. Sparsely used terms can be removed using the removeSparseTerms function; a threshold of 0.99 means we drop terms that occur in less than 1 percent of the documents.

dtms <- removeSparseTerms(dtm, 0.99)
dtms
## <<DocumentTermMatrix (documents: 16560, terms: 28)>>
## Non-/sparse entries: 9620/454060
## Sparsity           : 98%
## Maximal term length: 9
## Weighting          : term frequency (tf)
findFreqTerms(dtms)
##  [1] "one"       "may"       "certainly" "state"     "life"      "like"     
##  [7] "also"      "must"      "said"      "can"       "good"      "another"  
## [13] "men"       "nature"    "justice"   "just"      "man"       "now"      
## [19] "whether"   "us"        "soul"      "true"      "will"      "say"      
## [25] "let"       "shall"     "replied"   "yes"

WordCloud

A word cloud is a visual representation of the text in which words appear larger the more often they are mentioned. Word clouds are great for visualizing unstructured text data and getting insights on trends and patterns. We can create one as follows:

library("wordcloud")
freq = data.frame(sort(colSums(as.matrix(dtms)), decreasing=TRUE))
wordcloud(rownames(freq), freq[,1], max.words=50, colors=brewer.pal(1, "Dark2"))
## Warning in brewer.pal(1, "Dark2"): minimal value for n is 3, returning requested palette with 3 different levels
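
The warning appears because brewer.pal() requires n to be at least 3; asking for the full Dark2 palette (8 colours) avoids it, for example:

wordcloud(rownames(freq), freq[,1], max.words=50, colors=brewer.pal(8, "Dark2"))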

DTM - Term Frequency–Inverse Document Frequency

Compared to the simple TF (term frequency) weight, TF-IDF still calculates the frequency of a term in each document, but it also takes into consideration how many documents that term appears in. The formulas are below (a toy hand computation follows the list):

  • TF = (Number of times the word occurs in the text) / (Total number of words in the text)
  • IDF = log(Total number of documents / Number of documents that contain the word)
  • TF-IDF = TF x IDF
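
As a rough illustration of these formulas, here is a minimal hand computation on a toy two-document corpus (note that weightTfIdf, used below, normalises TF by document length and uses a log2-scaled IDF, so its numbers will differ slightly):

toy <- c("the cat sat on the mat", "the dog sat")
toy_dtm <- as.matrix(DocumentTermMatrix(Corpus(VectorSource(toy))))
tf <- toy_dtm / rowSums(toy_dtm)                 # TF: counts divided by document length
idf <- log(nrow(toy_dtm) / colSums(toy_dtm > 0)) # IDF: log of (documents / documents containing the term)
tfidf <- sweep(tf, 2, idf, "*")                  # TF-IDF = TF x IDF
round(tfidf, 3)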

Let’s see how to use TF-IDF weighting and how it changes the previous wordcloud.

dtmtfidf <- DocumentTermMatrix(platocorpus, control = list(
  weighting = weightTfIdf, removePunctuation = TRUE, removeNumbers = TRUE,
  stopwords = TRUE, tolower = TRUE, wordLengths = c(1, Inf)))

freq = data.frame(sort(colSums(as.matrix(dtmtfidf)), decreasing=TRUE))
wordcloud(rownames(freq), freq[,1], max.words=50, colors=brewer.pal(1, "Dark2"))

Second Part: Topic Modelling on News Articles

You will now learn how to perform topic modelling using Latent Dirichlet Allocation (LDA). The goal is to generate a list of the topics covered by the documents and to group the documents by the topics that are found. In this example, you will use a dataset of news headlines from the Australian Broadcasting Corporation. You can find the data on Minerva.

You will use the LDA function from the topicmodels package along with several other packages. First, install all of the following packages if you have not done so already and add them to your R library.

# Install packages that are needed.

#install.packages('dplyr')
#install.packages('stringr')
#install.packages('RColorBrewer')
#install.packages('topicmodels')
#install.packages('ldatuning')
#install.packages('ggplot2')
#install.packages('LDAvis')
#install.packages('servr')


library(dplyr) # basic data manipulation

library(tm) # text mining package
library(stringr) # package for dealing with strings
library(RColorBrewer) # package for colour palettes
library(wordcloud) # package to create wordclouds

library(topicmodels) # package for topic modelling

library(ggplot2) # basic data visualization
library(LDAvis) # LDA specific visualization 
library(servr) # interactive support for LDA visualization

set.seed(1234) 

News Headline Dataset

The dataset provided today is called ‘A Million News Headlines’. It contains news headlines published over a period of eighteen years, sourced from the Australian Broadcasting Corporation (http://www.abc.net.au). The dataset is a csv file (‘abcnews-date-text-sample.csv’) with the following two columns:

  • publish_date: date the article was published, in yyyyMMdd format
  • headline_text: text of the headline (ASCII, English, lowercase)

You will create a corpus from the news headline text. First, we will use the str_conv function to convert the text to UTF-8 encoding, as some of the characters in the text cannot be handled by the tm package.

data <- read.csv(file = "abcnews-date-text-sample.csv", header = TRUE)

# You can skip the line below if your computer has at least 16 GB of RAM. If there is a memory issue, keep fewer articles.
data <- data[1:10000,] # Use only the first 10000 news articles

news <- stringr::str_conv(data$headline_text, "UTF-8")
# Create Corpus
docs <- Corpus(VectorSource(news))

Create DTM with term frequency

In this step, you create a term-frequency document-term matrix. This will be used as input for the LDA model, which learns which words are frequently found together in a document so that it can model the topics. In terms of data cleaning, you will conduct lemmatization, remove punctuation, numbers and stopwords, and finally lowercase all tokens.

As a result of this cleaning, it is possible that some documents would have all tokens removed. Therefore, we also remove them from the dtm in the code below.

dtmdocs <- DocumentTermMatrix(docs,
              control = list(lemma=TRUE,removePunctuation = TRUE,
              removeNumbers = TRUE, stopwords = TRUE,
              tolower = TRUE))
raw.sum=apply(dtmdocs,1,FUN=sum)
dtmdocs=dtmdocs[raw.sum!=0,]

Next, we will create a frequency table that shows how often each word is used in the whole text. This will be used for the wordcloud visualization in the next part and for the LDA at the end.

dtm.new <- as.matrix(dtmdocs)
frequency <- colSums(dtm.new)
frequency <- sort(frequency, decreasing=TRUE)
doc_length <- rowSums(dtm.new)

frequency[1:10] #Example of the output
##     police        man        new       says    council      court  interview 
##        305        261        257        192        150        134        128 
##       fire australian  australia 
##        127        127        123

Interestingly, “police” is the most frequently used word, along with “man” and “new”.

Wordcloud

As before, to create the wordcloud in R you use the wordcloud function. The input is the words and their associated frequencies. In this case we will select only the top 100 most frequent words in the news headlines.

The parameters inside the function are: rot.per = 0.15 (15% chance to rotate a word), random.order and random.color set to FALSE (the frequency of the words determines both order and colour), and colors = brewer.pal(8, "Dark2") from the RColorBrewer package (the palette is ‘Dark2’ and we want 8 colours in total).

library(wordcloud)
words <- names(frequency)# get back the word
 
wordcloud(words[1:100], frequency[1:100], rot.per=0.15, 
          random.order = FALSE, scale=c(4,0.5),
          random.color = FALSE, colors=brewer.pal(8,"Dark2"))

Essentially, you have just turned the previous frequency table into a more intuitive graphical presentation.

Topic Modelling with LDA

Topic modelling models the probability that a word belongs to a topic using the co-occurrence of words in the documents. In R, you can use the LDA function in the topicmodels package to perform LDA topic modelling on the text.

Determining number of topics

The first step after the creation of the corpus and the pre-processing of the terms is the selection of the number of topics (k) that LDA will create. We will do this using a package called ldatuning, which offers several algorithms that may help us decide on the number of topics. I should highlight that there are also many alternatives, such as semantic coherence and perplexity, but for the purpose of this seminar we will use three measures from ldatuning.

Specifically, we want to minimize the criteria Arun2010 and CaoJuan2009 and maximize Griffiths2004. We choose a range of possible numbers of topics (5 to 20) and will select the optimal number based on these criteria.

library(ldatuning)
result <- FindTopicsNumber(
  dtm.new,
  topics = seq(from = 5, to = 20, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010"),
  method = "Gibbs",
  control = list(seed = 77),
  mc.cores = 2L,
  verbose = TRUE
)
## fit models... done.
## calculate metrics:
##   Griffiths2004... done.
##   CaoJuan2009... done.
##   Arun2010... done.
FindTopicsNumber_plot(result)

Based on the plot, a good candidate solution is to choose k = 13 topics for the LDA.
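
If you prefer a numeric check over reading the plot by eye, you can also look up which k optimises each metric in the data frame returned by FindTopicsNumber (a minimal sketch, assuming the default column names):

result$topics[which.max(result$Griffiths2004)] # k that maximises Griffiths2004
result$topics[which.min(result$CaoJuan2009)]   # k that minimises CaoJuan2009
result$topics[which.min(result$Arun2010)]      # k that minimises Arun2010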

Topic Modelling: Latent Dirichlet Allocation

Topic modelling is the process of generating a list of the topics covered by the documents and of grouping the documents by the topics that were found. In this case, we are using Latent Dirichlet Allocation (LDA). In R, you use the LDA function in the topicmodels package to analyze the text.

iter tells LDA how many sampling iterations to run when fitting the model. phi is the distribution over terms for each topic and theta is the probability distribution over topics for each document. You will use them for the visualization of the LDA solution at the end.

ldaOut <-LDA(dtmdocs,13, method="Gibbs", 
             control=list(iter=1000,seed=1000))
phi <- posterior(ldaOut)$terms %>% as.matrix 
#matrix, with each row containing the distribution over terms for a topic,
theta <- posterior(ldaOut)$topics %>% as.matrix 
#matrix, with each row containing the probability distribution over topics for a document,

To generate a list of the topics covered by the documents, use the terms() command with the output of LDA. Below, you extract the top 10 terms of each topic.

# Which terms have the highest weight in each topic
ldaOut.terms <- as.matrix(terms(ldaOut, 10))
Topic 1   Topic 2      Topic 3     Topic 4      Topic 5
man       new          police      death        govt
crash     sydney       found       new          win
car       interview    probe       australian   court
charged   china        decision    interview    qld
dies      may          mayor       first        country
murder    open         community   league       will
woman     queensland   free        melbourne    labor
road      park         end         nsw          review
missing   hits         urges       day          record
jailed    case         case        years        hour

Looking at some example topics, you can see that the first topic is mainly about car accidents. Other topics focus on day-to-day crime and on government decision making.

To group the documents by the topics that were found:

# Which topic is each headline in (highest probability)
ldaOut.topics <- data.frame(topics(ldaOut))
ldaOut.topics$index <- as.numeric(row.names(ldaOut.topics))
data$index <- as.numeric(row.names(data))
datawithtopic <- merge(data, ldaOut.topics, by='index',all.x=TRUE)
datawithtopic <- datawithtopic[order(datawithtopic$index), ]
datawithtopic[0:10,]
index   publish_date   headline_text                                                    topics.ldaOut.
1       20100505       man jailed over nightclub glassing                               1
2       20160215       messi and suarez combine in cheeky penalty double act            7
3       20120824       wind farms spinning money                                        8
4       20100128       solar systems continues buy out talks                            6
5       20201216       qld wet summer ahead la nina similar strength system yasi        3
6       20050320       tasmanian protesters rally against iraq war                      4
7       20040830       coles myer jobs set to be outsourced                             7
8       20180103       police questioning four boys about deliberately lit fire         5
9       20200827       afl tigers end eagles winning run bombers comeback beats hawks   11
10      20170411       who is lucy gichuhi from kenya to possible sa senator            1

To see how closely each document is associated with every topic, and not just the most prominent one:

# For each headline, how closely it is associated with each topic
topicProbabilities <- as.data.frame(ldaOut@gamma)
topicProbabilities[0:10,1:5]
V1          V2          V3          V4          V5
0.1082621   0.0712251   0.0712251   0.0712251   0.0897436
0.0850202   0.0674764   0.0674764   0.0674764   0.0850202
0.0897436   0.0712251   0.0712251   0.0712251   0.0712251
0.0699301   0.0699301   0.0699301   0.0699301   0.0699301
0.0651890   0.0821382   0.0990874   0.0821382   0.0651890
0.0699301   0.0699301   0.0699301   0.0881119   0.0881119
0.0881119   0.0699301   0.0699301   0.0699301   0.0699301
0.0674764   0.0850202   0.0850202   0.0674764   0.1025641
0.0641026   0.0807692   0.0807692   0.0641026   0.0974359
0.0881119   0.0881119   0.0699301   0.0881119   0.0699301

You can see that for the first news article, V1 (topic 1) has the highest probability of the topics shown, which matches the topic assignment in the previous table.
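
As a sanity check, the most probable topic for each document can be read straight off these probabilities, and it should agree with the topics() assignment used earlier; a minimal sketch for the first ten headlines:

apply(topicProbabilities, 1, which.max)[1:10]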

Visualizing with LDAVis

Finally, you can also create an interactive visualization. To do so, you will use the createJSON and serVis functions from the LDAvis package.

The createJSON function creates the JSON object that feeds the visualization template; serVis then views and/or shares the LDAvis output in a browser.

vocab <- colnames(phi) #vocab list in DTM

# create the JSON object to feed the visualization in LDAvis:
json_lda <- createJSON(phi = phi, theta = theta, 
                       vocab = vocab, doc.length = doc_length, 
                       term.frequency = frequency)


serVis(json_lda, out.dir = 'vis', open.browser = TRUE)

The visualization shows the total number of topics and how each of them relates to the others using the marginal topic distribution. The bar chart on the right shows how strongly each word is represented in the selected topic and in the overall set of documents.