Natural Language Processing (NLP) is a branch of artificial intelligence that is steadily growing both in terms of research and market value1. The ultimate objective of NLP is to read, decipher, understand, and make sense of human languages in a manner that is valuable2. There are many applications of NLP in various industries, such as:
On this occasion, we will learn about Topic Modelling and its application in a real case. Before we start the journey, let's consider a simple example.
Suppose that we have the following word cloud. Can you guess what these words have in common?
The interpretation may differ from one person to another, but most of you would agree that the word cloud has a common theme or topic. Perhaps you might say that it is related to economics, or politics, or business. The real theme of the words is unknown, but we as the observers are giving the group of words a meaningful and understandable topic. This activity is what we call Topic Modelling.
In text mining, we often have collections of documents, such as blog posts or news articles, that we’d like to divide into natural groups so that we can understand them separately. Topic modeling is a method for unsupervised classification of such documents, similar to clustering on numeric data, which finds natural groups of items even when we’re not sure what we’re looking for.
There are many applications of Topic Modelling, even outside the field of NLP. Some applications of Topic Modelling derived from Boyd-Graber et al.3 and Blei et al.4 include:
Below is another example of topic modeling from Blei et al., where the top words for each topic (arts, budgets, children, and education) are shown. The colored text on the lower part of the figure illustrates that a single document is a collection of words with various topics.
The most popular algorithm for Topic Modelling is Latent Dirichlet Allocation (LDA), developed by Blei et al. This algorithm can be understood through these two simple properties5:
The objective of this article is as follows:
Below are the packages required to reproduce the code in this article.
# Data Wrangling
library(tidyverse)
# Text Processing
library(tm)
library(corpus)
library(tidytext)
library(textclean)
library(lubridate)
library(hunspell)
library(SnowballC)
library(textmineR)
library(scales)
# Visualization
library(ggwordcloud)
# Modeling and Evaluation
library(randomForest)
library(e1071)
library(yardstick)
options(scipen = 999)
This section illustrates the first principle and the workflow of Topic Modelling with LDA.
LDA is a generative probabilistic model of a corpus. Compared to other topic modelling methods such as the unigram model, Latent Semantic Analysis (LSA), and Probabilistic Latent Semantic Analysis (pLSA), the advantages and disadvantages of LDA are as follows:
Advantages
Disadvantages
We will break down the concept one step at a time. As stated in the previous section, the main principle of LDA consists of these two concepts:
Let's start with the first concept. LDA does not give a clear answer on whether a document belongs to a certain topic, because a document is considered a mixture of topics. In a single document, we may find several topics. A news article about the financial crisis of 2008 may consist of economics, politics, and social topics, blended in a single article that discusses the market value, the government response, and the impact of the high unemployment rate. Thus, as an example, a document may consist of 30% economics, 60% politics, and 10% social topics. The percentage of each topic represents the probability that the document belongs to that topic. Therefore, we can imagine the mixture of topics as a probability distribution such as the following table:
set.seed(123)
data.frame(news = 1:5,
economics = runif(5, max = 0.5),
politics = runif(5, max = 0.5)) %>%
mutate(social = 1 - (economics + politics))
The first news item has a 14% chance of being economics news, 2% of being politics news, and 83% of being social news. The second has a 39% chance of being economics news, 26% politics, and 34% social news, and so on. A higher probability means that the document is better represented by that topic.
The word Latent in Latent Dirichlet Allocation refers to the latent or hidden structure inside a document. As we have seen in the previous section, we intuitively think that a group of words can have a central theme or topic. Therefore, according to the second principle, a topic is a mixture of words, where certain words have a strong association with a certain topic. For example, the word President is a strong signal that a document belongs to the topic of politics, while the words Loan and Interest are strong signals toward an economics topic. Just as a document has a probability distribution over topics, a topic also has a probability distribution over words/terms. LDA assumes that a document is a bag of words, so we do not care about word order.
An LDA model is built based on these two principles and has the following graphical structure.
Notation:
LDA belongs to the family of hierarchical Bayesian models. A Bayesian model has a prior and a posterior. A prior is the probability distribution of something before we see the data. For example, we may believe that the outcome of a die roll is uniformly distributed, with every value having the same probability of appearing. We hold this assumption before we roll the die. Meanwhile, the posterior probability reflects the probability distribution after we see the data, that is, after we roll the die. For example, after we roll the die 10,000 times, the number 5 appears more than 5,000 times. This may change our belief that each side is equally likely, since a single side has a very high chance of appearing, and we might suspect that the die is not fair. We can calculate the posterior probability using Bayes' Theorem6, which we will not discuss in this article. The concepts of prior and posterior probability are important for understanding the LDA process.
The LDA model consists of two different prior distributions: the probability distribution of topics over documents (topic-document probability) and the probability distribution of words over topics (word-topic probability). The prior for both is not a uniform or normal distribution, but a Dirichlet distribution, with the \(\alpha\) and \(\beta\) parameters as well as the number of topics as its inputs.
The following figure is an illustration of the Dirichlet distribution of topics over documents. Suppose we have a collection of documents and we are confident that the corpus contains 3 different topics. Each dot represents a single document and each side of the triangle represents a single topic. The position of each document represents its probability distribution over the topics. We randomly assign the position of each document based on its probability distribution. We can see that document A is in close proximity to the topic Science. This indicates that document A has a high probability of belonging to the topic Science and a low probability of belonging to Politics or Economy. Meanwhile, document B has a high probability of belonging to the topic Economy. Document C is located between Science and Politics, so it has a roughly equal probability for those two topics and a low probability for the topic Economy. Document D is located at the center of the triangle, which means that it has an equal probability for all three topics.
The goal of LDA is to position each document as close as possible to a certain topic, although this may not always be achieved for every document. If the graphic above represents our prior probability for each document over the topics, the following graphic illustrates the posterior probability, where every document has a strong relation with a certain topic.
We may translate the probability distribution into the following table:
data.frame(document = c("A","B","C","D"),
sciences = c(0.8, 0.1, 0.1, 0.1),
politics = c(0.1, 0.05, 0.85, 0.8),
economy = c(0.1, 0.85, 0.05, 0.1))
The second prior distribution is the probability of words belonging to certain topics: the word-topic probability. This prior is concerned with which terms represent which topic. Are there any terms that have a strong connection with a certain topic? Let's say we have 4 terms: President, Planet, Market, and Energy. Each term will act as an edge of the shape. However, since we have more than 3 terms, the shape of the distribution is no longer a triangle, but a tetrahedron. The following figures illustrate a random assignment of the Dirichlet distribution between topics and words.
If we translate the position of each topic into a probabilistic value, we get the following table:
data.frame(topic = c("Politics", "Science", "Economy"),
president = c(0.8, 0.15, 0.1),
planet = c(0.05, 0.5, 0.05),
energy = c(0.1, 0.3, 0.15),
market = c(0.05, 0.05, 0.7))
Based on the table, the term President has a strong association with the topic Politics, Planet with Science, and Market with Economy.
These prior distributions will be used to generate word samples in order to calculate the posterior probability, the estimate of the true probability.
Let's go back to the LDA structure.
The generative process is the process of generating a document by randomly selecting words from each topic. A rough description of the process is as follows, with a toy sketch shown after the list:
1. Randomly choose a distribution over topics.
2. For each word in the document:
2.1 Randomly choose a topic from the distribution over topics chosen in step 1.
2.2 Randomly choose a word from the corresponding topic's distribution over words.
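To make the generative story concrete, below is a toy sketch of this process (purely illustrative and not part of the actual model fitting); the vocabulary, the phi matrix, and the alpha values are made-up numbers.

# Toy sketch of LDA's generative process (illustration only; all numbers are made up)
set.seed(123)

vocab <- c("president", "election", "market", "loan", "planet", "energy")

# assumed word-topic probabilities (phi): one row per topic, each row sums to 1
phi <- rbind(
  politics  = c(0.40, 0.35, 0.05, 0.05, 0.05, 0.10),
  economics = c(0.05, 0.05, 0.40, 0.35, 0.05, 0.10),
  science   = c(0.05, 0.05, 0.05, 0.05, 0.45, 0.35)
)
colnames(phi) <- vocab

# 1. Randomly choose a distribution over topics for the document
#    (normalized gamma draws are equivalent to a sample from a Dirichlet distribution)
alpha <- c(0.1, 0.1, 0.1)
theta <- rgamma(length(alpha), shape = alpha)
theta <- theta / sum(theta)

# 2. For each word: choose a topic from theta, then choose a word from that topic
n_words <- 10
topic_draw <- sample(rownames(phi), n_words, replace = TRUE, prob = theta)
word_draw <- sapply(topic_draw, function(k) sample(vocab, 1, prob = phi[k, ]), USE.NAMES = FALSE)

data.frame(topic = topic_draw, word = word_draw)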
In order to estimate the posterior probability of the topic-document and word-topic distributions, we need to use Gibbs sampling or Variational Expectation Maximization (VEM). Both are approximate inference methods, each with its own characteristics.
Gibbs sampling is one of many methods for sampling data in Bayesian statistics. The idea of Gibbs sampling in topic modeling is that we assume all documents and all terms already have topic assignments, except for a single word, for which we calculate the posterior probability of belonging to each topic. The algorithm of Gibbs sampling is as follows:
1.1 For each document \(d\) and word \(n\) currently assigned to \(z_{old}\) :
1.1.1 Decrement \(n_{d,z_{old}}\) and \(v_{z_{old}, w_{d,n}}\)
1.1.2 Sample \(z_{new} = k\) with probability proportional to \(\frac{n_{d,k} + \alpha_k}{\sum_{i=1}^{K} (n_{d,i} + \alpha_i)} \times \frac{v_{k,w_{d,n}} + \lambda_{w_{d,n}}}{\sum_{i} (v_{k,i} + \lambda_i)}\)
1.1.3 Increment \(n_{d,z_{new}}\) and \(v_{z_{new}, w_{d,n}}\)
Notation :
\(z_{old}\) : prior/early topic assignment
\(z_{new}\) : updated topic assignment
\(n_{d,k}\) : Number of times document \(d\) uses topic \(k\)
\(v_{k, w_{d,n}}\) : Number of times topic \(k\) uses word type \(w_{d,n}\)
\(\alpha_k\) : Dirichlet parameter for document to topic distribution
\(\lambda_{w_{d,n}}\) : Dirichlet parameter for topic to word distribution
Let's try an example. Suppose we have a corpus of documents, one of which is as follows:
The government put curfew during pandemic
Suppose that the document has a pre-determined topic assignment:
data.frame(
topic = c(1, 2, 2, 3, 1, 3),
term = c("the", "government", "put", "curfew", "during", "pandemic")
) %>%
t() %>%
as.data.frame()
The following is the count of words for each topic across all documents in the corpus:
data.frame(
term = c("the", "government", "put", "curfew", "during", "pandemic"),
topic_1 = c(200, 10, 30 ,20, 5, 8),
topic_2 = c(40, 30, 10, 5, 3, 2),
topic_3 = c(10, 2, 1, 8, 17, 30)
)
For the first iteration, we would like to update the topic assignment of the word government.
data.frame(
topic = c(1, "?", 2, 3, 1, 3),
document = c("The", "government", "put", "curfew", "during", "pandemic")
) %>%
t() %>%
as.data.frame()
1.1.1 Decrement \(n_{d,z_{old}}\) and \(v_{z_{old}, w_{d,n}}\)
The word government was previously assigned to topic 2, so we omit that assignment and reduce the count of the word government in topic 2 by 1 (from 30 to 29).
data.frame(
term = c("the", "government", "put", "curfew", "during", "pandemic"),
topic_1 = c(200, 10, 30 ,20, 5, 8),
topic_2 = c(40, 29, 10, 5, 3, 2),
topic_3 = c(10, 2, 1, 8, 17, 30)
)
1.1.2 Sample \(z_{new} = k\) with probability proportional to \(\frac{n_{d,k} + \alpha_k}{\sum_{i=1}^{K} (n_{d,i} + \alpha_i)} \times \frac{v_{k,w_{d,n}} + \lambda_{w_{d,n}}}{\sum_{i} (v_{k,i} + \lambda_i)}\)
Now we update the topic assignment using the formula above. We compute the probability for each topic (k) and then sample the new topic. A higher probability means that the word is more likely to be assigned to that topic.
\[P(z_{new} = k) \propto \frac{n_{d,k} + \alpha_k}{\sum_{i=1}^{K} (n_{d,i} + \alpha_i)} \times \frac{v_{k,w_{d,n}} + \lambda_{w_{d,n}}}{\sum_{i} (v_{k,i} + \lambda_i)}\]
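To make this update concrete, here is a rough calculation of the sampling probabilities for the word government using the toy counts above. The alpha and lambda values here are assumptions chosen only for illustration.

# Sampling probabilities for "government" using the toy counts above
# (the alpha and lambda values below are assumed for illustration)
alpha  <- 0.1   # document-topic Dirichlet parameter
lambda <- 0.05  # topic-word Dirichlet parameter
n_vocab <- 6    # number of terms in this toy vocabulary

# topic counts in the document after removing "government"
# (remaining assignments: the = 1, put = 2, curfew = 3, during = 1, pandemic = 3)
n_dk <- c(topic_1 = 2, topic_2 = 1, topic_3 = 2)

# counts of the word "government" under each topic after the decrement
v_k_gov <- c(topic_1 = 10, topic_2 = 29, topic_3 = 2)

# total word counts per topic (column sums of the corpus count table)
v_k_total <- c(topic_1 = 200 + 10 + 30 + 20 + 5 + 8,
               topic_2 = 40 + 29 + 10 + 5 + 3 + 2,
               topic_3 = 10 + 2 + 1 + 8 + 17 + 30)

# probability of each candidate topic: (document part) x (word part)
prob_k <- (n_dk + alpha) / sum(n_dk + alpha) *
  (v_k_gov + lambda) / (v_k_total + n_vocab * lambda)
prob_k / sum(prob_k)

# the new topic for "government" is then drawn in proportion to prob_k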
Long story short, after we draw the sample, we update the topic assignment for the word. For example, suppose the sampled topic assignment for the word government is topic 1.
data.frame(
topic = c(1, 1, 2, 3, 1, 3),
document = c("The", "government", "put", "curfew", "during", "pandemic")
) %>%
t() %>%
as.data.frame()
1.1.3 Increment \(n_{d,z_{new}}\) and \(v_{z_{new}, w_{d,n}}\)
We increment the counts of the word for the newly assigned topic and then continue to update the topic for the next word. The count of the word government for topic 1 is increased from 10 to 11.
data.frame(
term = c("the", "government", "put", "curfew", "during", "pandemic"),
topic_1 = c(200, 11, 30 ,20, 5, 8),
topic_2 = c(40, 29, 10, 5, 3, 2),
topic_3 = c(10, 2, 1, 8, 17, 30)
)
We will now try document modelling to find different topics in CBC News COVID-19 articles. The dataset is acquired from Kaggle. The context of the dataset: has the news media been overreacting or under-reacting during the development of COVID-19? What are the media's main focuses?
We will try to answer these questions by finding the latent information inside the corpus using Latent Dirichlet Allocation.
The dataset consists of 2,725 articles with 7 columns. To read 'big' data quickly, we can use the fread() function from the data.table package.
covid_news <- data.table::fread("data/covid_news.csv", header = T, encoding = "Latin-1")
glimpse(covid_news)
## Rows: 2,725
## Columns: 7
## $ V1 <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17…
## $ authors <chr> "['Cbc News']", "['Cbc News']", "['The Associated Press'…
## $ title <chr> "Coronavirus a 'wake-up call' for Canada's prescription …
## $ publish_date <chr> "2020-03-27 08:00:00", "2020-03-27 01:45:00", "2020-03-2…
## $ description <chr> "Canadian pharmacies are limiting how much medication ca…
## $ text <chr> "Canadian pharmacies are limiting how much medication ca…
## $ url <chr> "https://www.cbc.ca/news/health/covid-19-drug-supply-1.5…
Data description:
Below is the sample of the data.
Based on the titles of the sampled news, some articles report on the virus spread and transmission, while others talk about what the government is doing in response to COVID-19. We will explore this further using topic modelling.
We will clean the text first before proceeding further. The text cleansing process includes:
- removing non-ASCII characters and HTML tags
- removing numbers and replacing - with white space
- converting the text to lowercase
- removing common words such as coronavirus, covid 19, and covid
- replacing symbols, contractions, and word elongations
- removing punctuation and extra white space
covid_clean <- covid_news %>%
mutate(text_clean = text %>%
replace_non_ascii() %>%
replace_html(symbol = F) %>% # remove html tag
str_replace_all("[0-9]", " ") %>%
str_replace_all("[-|]", " ") %>% # replace "-" with space
tolower() %>% #lowercase
str_remove_all("coronavirus|covid 19|covid|canadian|canadians") %>% # remove common words
replace_symbol() %>%
replace_contraction() %>%
replace_word_elongation() %>% # lengthen shortened word
str_replace_all("[[:punct:]]", " ") %>% # remove punctuation
str_replace_all(" dr ", " doctor ") %>%
make_plural() %>%
str_replace_all(" s ", " ") %>%
str_squish() %>% # remove double whitespace
str_trim() # remove whitespace at the start and end of the text
)
Let's see the number of words remaining in each document of our corpus.
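The chunk that produces the summary below is not shown; one possible way to compute it is to count the white-space separated tokens in the cleaned text:

# count the number of words in each cleaned document and summarise
covid_clean %>%
  pull(text_clean) %>%
  str_count("\\S+") %>%
  summary()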
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.0 431.0 632.0 713.1 896.0 5100.0
The shortest document has 14 words while the longest has 5,100 words. LDA works better if the input text contains many words. We will keep only documents that consist of at least 100 words.
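A possible filtering step (the original chunk is not shown) could look like this:

# keep only documents whose cleaned text contains at least 100 words
covid_clean <- covid_clean %>%
  filter(str_count(text_clean, "\\S+") >= 100)

dim(covid_clean)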
## [1] 2710 8
We have 2,710 documents remaining.
The next step is to tokenize the text and create a Document-Term Matrix (DTM) from our text data. We will remove stop words such as the or is, since they are irrelevant for this problem. We will also stem words into their basic form, such as from walking into walk. To get better stemming, we also change all positive terms into positives. The stemming function is created manually with hunspell7 stemming at its core. hunspell stemming gives better results compared to basic stemming algorithms such as the Porter stemming algorithm8.
stem_hunspell <- function(term) {
# look up the term in the dictionary
stems <- hunspell_stem(term)[[1]]
if (length(stems) == 0) { # if there are no stems, use the original term
stem <- term
} else { # if there are multiple stems, use the last one
stem <- stems[[length(stems)]]
}
return(stem)
}
news_term <- covid_clean %>%
unnest_tokens(output = "word", input = text_clean) %>%
anti_join(stop_words) %>%
mutate(word = ifelse(word == "positive", "positives", word),
word = text_tokens(word, stemmer = stem_hunspell) %>% as.character() ) %>%
drop_na(word) %>%
count(V1, word)
Next, we transform the data into a document-term matrix (DTM). The value inside the matrix represents the term frequency, i.e. the number of times a term appears in each document.
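The chunk that builds the DTM is not shown in the original; a minimal sketch using tidytext's cast_dtm() on the word counts created above could look like this:

# cast the per-document word counts into a document-term matrix
dtm_news <- news_term %>%
  cast_dtm(document = V1, term = word, value = n)

inspect(dtm_news)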
## <<DocumentTermMatrix (documents: 2710, terms: 25191)>>
## Non-/sparse entries: 511876/67755734
## Sparsity : 99%
## Maximal term length: 29
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs canada day govern heal home people province public test virus
## 106 27 22 13 26 10 33 26 5 20 13
## 141 18 12 9 28 4 26 21 16 19 16
## 172 23 14 10 25 14 20 18 11 25 16
## 18 8 7 2 79 16 28 42 24 41 14
## 197 15 11 7 23 16 24 15 17 16 15
## 212 16 13 10 23 13 36 21 8 26 12
## 305 25 17 15 23 14 20 13 16 2 7
## 35 12 19 12 25 12 35 15 8 11 8
## 49 18 11 9 27 20 34 14 11 15 5
## 56 19 21 10 27 19 22 17 10 22 10
We have 2,710 documents with a total of more than 25,000 terms. We will remove rare words (total frequency below 5) as well as overly common words (total frequency above 90% of the number of documents). This is intended to give us a collection of terms that are common enough to be shared by several documents, indicating shared topics/latent information, but also distinctive enough that they are not spread across all documents.
word_freq <- findFreqTerms(dtm_news,
lowfreq = 5,
highfreq = nrow(dtm_news)*0.9
)
dtm_news <- dtm_news[ , word_freq]
dtm_news
## <<DocumentTermMatrix (documents: 2710, terms: 9138)>>
## Non-/sparse entries: 442433/24321547
## Sparsity : 98%
## Maximal term length: 19
## Weighting : term frequency (tf)
The number of terms drops drastically, from around 25,000 to around 9,000 terms. We will use this data to train the LDA model.
We will create an LDA model with k = 3 topics. The choice of the number of topics is arbitrary for now; we will show how to find the optimal number of topics later. We will use Gibbs sampling to estimate the parameters, with 5,000 sampling iterations and 4,000 burn-in iterations. Burn-in means that we only start collecting samples from iteration 4,000 onward, since earlier iterations are still unstable and may not reflect the actual distribution of the data.
Since the computation is quite long (around 30-60 minutes), I have prepared the pre-trained model in the next chunk.
dtm_lda <- Matrix::Matrix(as.matrix(dtm_news), sparse = T)
set.seed(123)
lda_news <- FitLdaModel(dtm = dtm_lda,
k = 3,
iterations = 5000,
burnin = 4000,
calc_coherence = T
)
The details of the FitLdaModel() parameters are as follows:
## List of 9
## $ phi : num [1:3, 1:9138] 0.000000166 0.000000256 0.000020914 0.001365947 0.001274729 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "t_1" "t_2" "t_3"
## .. ..$ : chr [1:9138] "acetaminophen" "act" "actual" "america" ...
## $ theta : num [1:2710, 1:3] 0.547362 0.997833 0.000236 0.043015 0.456297 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2710] "0" "1" "2" "3" ...
## .. ..$ : chr [1:3] "t_1" "t_2" "t_3"
## $ gamma : num [1:3, 1:9138] 0.012 0.0121 0.9759 0.5079 0.3081 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "t_1" "t_2" "t_3"
## .. ..$ : chr [1:9138] "acetaminophen" "act" "actual" "america" ...
## $ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. ..@ i : int [1:442433] 0 17 1311 0 2 4 12 17 18 19 ...
## .. ..@ p : int [1:9139] 0 3 531 570 725 929 943 1337 1479 1502 ...
## .. ..@ Dim : int [1:2] 2710 9138
## .. ..@ Dimnames:List of 2
## .. ..@ x : num [1:442433] 1 1 3 4 1 3 12 7 1 3 ...
## .. ..@ factors : list()
## $ alpha : Named num [1:3] 0.1 0.1 0.1
## ..- attr(*, "names")= chr [1:3] "t_1" "t_2" "t_3"
## $ beta : Named num [1:9138] 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ...
## ..- attr(*, "names")= chr [1:9138] "acetaminophen" "act" "actual" "america" ...
## $ log_likelihood:'data.frame': 500 obs. of 2 variables:
## ..$ iteration : num [1:500] 0 10 20 30 40 50 60 70 80 90 ...
## ..$ log_likelihood: num [1:500] -5815798 -5770673 -5741072 -5732850 -5729403 ...
## $ coherence : Named num [1:3] 0.108 0.11 0.144
## ..- attr(*, "names")= chr [1:3] "t_1" "t_2" "t_3"
## $ labels : chr [1:3, 1:2] "school" "dollar" "quarantine" "social" ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "t_1" "t_2" "t_3"
## .. ..$ : chr [1:2] "label_1" "label_2"
## - attr(*, "class")= chr "lda_topic_model"
Below are some important attributes of the LDA model:
theta contains the per-document topic probabilities. If a document has a high theta value for a topic, the document has a high probability of being generated from that topic. This also indicates that the document has a strong association with that topic.
lda_news$theta %>%
head() %>%
as.data.frame() %>%
set_names(paste("Topic", 1:3)) %>%
rownames_to_column("document")
Remember that LDA assumes that a topic is a mixture of words. The posterior probability for per-topic-per-word assignment is represented by the phi value. The sum of all phi for a topic is 1.
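We can check this by summing the phi values of each topic (a quick check that is not part of the original chunk):

# each row of phi (one row per topic) should sum to 1
rowSums(lda_news$phi)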
## t_1 t_2 t_3
## 1 1 1
To get the top terms for each topic, we can use the GetTopTerms() function.
As we stated earlier, LDA merely gives us the hidden/latent structure inside the corpus. It is our job as users to interpret the latent information and assign a label to each generated topic.
Below are the top words for each topic. LDA does not explicitly tell us what each topic is about; by looking at the representative words of each topic, we as humans give meaning to each topic. The top terms in the first topic seem to tell us about the impact of the virus outbreak, as indicated by the words school, social distance, store, and student. The second topic talks about politics and the economy around COVID-19, since we see the presence of words such as cent, federal, minister, and company. The third topic tells us about the spread and transmission of COVID-19, indicated by the words patient, dr (doctor), quarantine, and flight.
As we can see, even though we do not have true labels or classes, LDA can generate associations between words and topics by assigning probabilities. Using k = 3, we get 3 topics that are easily interpreted, since they are quite different from each other.
news_word_topic <- GetTopTerms(lda_news$phi, 30) %>%
as.data.frame() %>%
set_names(paste("Topic", 1:3))
news_word_topic
We can also present the top words in each topic using visualization. Here, we will visualize the top 30 terms in each topic using a word cloud.
news_word_topic %>%
rownames_to_column("id") %>%
mutate(id = as.numeric(id)) %>%
pivot_longer(-id, names_to = "topic", values_to = "term") %>%
ggplot(aes(label = term, size = rev(id), color = topic, alpha = rev(id))) +
geom_text_wordcloud(seed = 123) +
facet_wrap(~topic, scales = "free") +
scale_alpha_continuous(range = c(0.4, 1)) +
scale_color_manual(values = c( "dodgerblue4", "firebrick4", "darkgreen")) +
theme_minimal() +
theme(strip.background = element_rect(fill = "firebrick"),
strip.text.x = element_text(colour = "white"))
We can also acquire the probability of a document belonging to certain topics. We will use this to check whether our guess about the interpretation of each topic makes sense and whether the topics are subjectively different enough from each other.
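The document-topic probabilities are stored in the theta matrix. Below is a sketch of how the news_doc_topic table used in the next chunks may be built (the original chunk is not shown), followed by the top 10 news titles with the highest probability of belonging to the topic of social issues (t_1).

# document-topic probabilities as a data frame with the document id as a column
news_doc_topic <- lda_news$theta %>%
  as.data.frame() %>%
  rownames_to_column("id")

# top 10 news titles with the highest probability for the Social Issues topic (t_1)
news_doc_topic %>%
  arrange(desc(t_1)) %>%
  left_join(covid_clean %>%
              mutate(V1 = as.character(V1)) %>%
              select(V1, title),
            by = c("id" = "V1")) %>%
  column_to_rownames("id") %>%
  select(title, everything()) %>%
  head(10)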
The following table shows the top 10 news titles with the highest probability of belonging to the topic of politics and economy (t_2).
news_doc_topic %>%
arrange(desc(t_2)) %>%
left_join(covid_clean %>%
mutate(V1 = as.character(V1)) %>%
select(V1, title),
by = c("id" = "V1")) %>%
column_to_rownames("id") %>%
select(title, everything()) %>%
head(10)
The following table shows the top 10 news titles with the highest probability of belonging to the topic of virus transmission (t_3).
news_doc_topic %>%
arrange(desc(t_3)) %>%
left_join(covid_clean %>%
mutate(V1 = as.character(V1)) %>%
select(V1, title),
by = c("id" = "V1")) %>%
column_to_rownames("id") %>%
select(title, everything()) %>%
head(10)
We will now take a broader view of the topics in the data over time. Let's see the range of dates in which the articles were published.
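One way to obtain the range below (the original chunk is not shown):

# range of publication dates across all articles
range(ymd_hms(covid_clean$publish_date))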
## [1] "2019-12-22 18:36:00" "2020-03-27 08:30:00"
The first article was published at the end of December 2019 and the latest in March 2020. We will group the data into weekly intervals and see the proportion of each topic across the weeks.
news_doc_topic %>%
left_join(covid_clean %>%
mutate(V1 = as.character(V1)) %>%
select(V1, title, publish_date),
by = c("id" = "V1")) %>%
select(-id) %>%
select(title, everything()) %>%
pivot_longer(c(t_1, t_2, t_3), names_to = "topic", values_to = "theta") %>%
mutate(topic = case_when( topic == "t_1" ~ "Social Issues",
topic == "t_2" ~ "Politics and Economy",
TRUE ~ "Virus Transmission") %>%
factor(levels = c("Virus Transmission", "Social Issues", "Politics and Economy")),
publish_date = ymd_hms(publish_date),
time = floor_date(publish_date, unit = "week") %>% as.Date()
) %>%
group_by(time, topic) %>%
summarise(theta = mean(theta)) %>%
ggplot(aes(time, theta, fill = topic, color = topic)) +
geom_line() +
geom_point(show.legend = F) +
theme_minimal() +
theme(legend.position = "top") +
scale_x_date(date_breaks = "1 weeks",
labels = date_format(format = "%d\n%b")) +
scale_y_continuous() +
scale_fill_manual(values = c("firebrick", "orange", "dodgerblue3")) +
labs(x = NULL, y = expression(theta), color = NULL,
title = "Topic Proportions Over Time on Weekly Interval")
As we can see for the COVID-19 case (late 2019 to 2020), as time goes on, more articles report on the social issues around the coronavirus. There is a one-week gap between late December and early January in which no news was collected. News regarding the source of the virus or new cases is reported less in March, since almost all countries have been affected by the virus, and thus people and governments are more concerned about their survival and well-being.
However, LDA cannot accurately track changes within a topic over time, since LDA assumes that the order of the documents does not matter. This assumption may be unrealistic when analyzing long-running collections that span years or centuries, since a topic may change from time to time. For example, the topic Virus Transmission may contain mostly words about transmission and new cases in early 2020 and more about the virus source and characteristics in a later period. This kind of change within a topic is not detected by LDA. A more advanced and improved version of LDA that can accommodate this is the Dynamic Topic Model9, a model that respects the ordering of the documents and gives a richer posterior topical structure than LDA.
Although LDA is an unsupervised learning method, we can still measure some aspects of its performance. Traditionally, and still for many practical applications, implicit knowledge and "eyeballing" are used to evaluate whether "the correct thing" has been learned about the corpus. Ideally, we would like to capture this information in a single metric that can be maximized and compared.
One way to evaluate a topic model is to look at the content directly, such as the top-n words, as we did previously. We can then decide whether the collection of words inside each topic makes sense or shares a certain similarity.
One of the most popular metrics for evaluating a topic model is topic coherence. Topic coherence measures the degree of semantic similarity between the top words in a single topic. The textmineR package implements a topic coherence measure based on probability theory: probabilistic coherence measures how associated the words in a topic are, controlling for statistical independence.
“Suppose you have a corpus of articles from the sports section of a newspaper. A topic with the words {sport, sports, ball, fan, athlete} would look great if you look at correlation, without correcting for independence. But we actually know that it’s a terrible topic because the words are so frequent in this corpus as to be meaningless. In other words, they are highly correlated with each other but they are statistically-independent of each other.”
The intuition of probabilistic coherence is that it measures how much more probable it is for a pair of top words to co-occur in the same document than would be expected if the words were independent. For example, if the top 3 words of a topic are apple, banana, and cheese, we can calculate the topic coherence by averaging the following quantities:

\(P(banana|apple) - P(banana)\)

\(P(cheese|apple) - P(cheese)\)

\(P(cheese|banana) - P(cheese)\)
Description: \(P(banana|apple)\) is the probability of observing banana in a document given that the document contains apple, while \(P(banana)\) is the probability of observing banana in any document.
You can get the coherence for each topic by calling the coherence element of the LDA model. By default, the topic coherence only looks at the top 5 words of each topic.
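For example:

# probabilistic coherence of the top 5 words in each topic
lda_news$coherence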
## t_1 t_2 t_3
## 0.1082933 0.1103960 0.1437167
We will try to find the optimal number of topics by computing the average probabilistic coherence for several numbers of topics, ranging from k = 10 to k = 100 in intervals of 10. To speed up the computation, we will only use 500 sampling iterations with a burn-in of 200 for the sake of illustration, since a higher number of iterations can run for hours or even days.
dtm_lda <- Matrix::Matrix(as.matrix(dtm_news), sparse = T)
k_list <- seq(10, 100, by = 10)
model_list <- TmParallelApply(X = k_list, FUN = function(k){
m <- FitLdaModel(dtm = dtm_lda,
k = k,
iterations = 500,
burnin = 200,
calc_coherence = TRUE)
m <- mean(m$coherence)
return(m)
},
cpus = 4
)
k_list <- seq(10, 100, by = 10)
model_list <- read_rds("coherence result.Rds")
iter_k <- data.frame(
k = k_list,
coherence = model_list %>% unlist()
)
iter_k %>%
mutate(max_k = which(coherence == max(coherence)) * 10) %>%
ggplot(aes(k, coherence)) +
geom_vline(aes(xintercept = max_k), alpha = 0.5, lty = "dashed") +
geom_line(color = "skyblue4") +
geom_point() +
scale_x_continuous(breaks = seq(0, 200, 20)) +
labs(x = "Number of Topics", y = "Coherence", title = "Coherence Score over Number of Topics") +
theme_minimal() +
theme(panel.grid.minor = element_blank())
The optimal number of topics can be chosen by picking the number of topics that give the highest average coherence.
There are also other methods to evaluate a topic model; discussing them all is beyond the scope of this article. You can visit Julia Silge's blog post10 to see some of the evaluation metrics.
Another approach is to measure model performance on a specific task, such as text classification. If the topics are regarded as features for a classification model, we can use accuracy or any other classification metric to check whether the topic model is good enough to do the job.
LDA can also be treated as a dimensionality reduction technique: each document can be reduced to its associated topic probabilities (theta). Dimensionality reduction is crucial in text mining, since using each word as a feature results in a large dataset and long computation times. In this section, we will compare the performance of text classification under different dimensionality reduction treatments.
We will import data about workplace reviews. The data consists of 60,672 observations with the review text and its sentiment.
## Rows: 60,672
## Columns: 2
## $ Sentiment <chr> "positives", "negatives", "positives", "negatives", "positi…
## $ Review <chr> "People are smart and friendly", "Bureaucracy is slowing th…
We will do the usual data cleansing:
df_clean <- df %>%
mutate(text_clean = Review %>%
tolower() %>%
replace_html() %>%
replace_word_elongation() %>%
str_replace_all("-", " ") %>%
str_remove_all("[[:punct:]]") %>%
str_remove_all("[[0-9]]") %>%
str_squish() %>%
str_trim())
df_clean %>% head()
Next, we inspect the summary of the length of each document. The maximum number of words in a document is 2161 terms.
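A sketch of how the summary below may be produced (the original chunk is not shown):

# number of words in each cleaned review
df_clean %>%
  pull(text_clean) %>%
  str_count("\\S+") %>%
  summary()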
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 7.0 14.0 25.7 27.0 2161.0
We will only take documents with more than 50 terms/words.
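A possible filtering step:

# keep only reviews that contain more than 50 words
df_clean <- df_clean %>%
  filter(str_count(text_clean, "\\S+") > 50)

dim(df_clean)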
## [1] 6271 3
We split the data into the training set (80%) and the testing set (20%). We will also check the class proportion of the target variable.
set.seed(123)
index <- sample(nrow(df_clean), nrow(df_clean)*0.8)
data_train <- df_clean[index, ]
data_test <- df_clean[-index, ]
table(df_clean$Sentiment) %>% prop.table()
##
## negatives positives
## 0.6451922 0.3548078
As we can see, there is a class imbalance between the negative and positive sentiments, so we will upsample the minority class first.
library(caret)
set.seed(123)
data_train <- upSample(x = data_train %>% select(-Sentiment),
y = as.factor(data_train$Sentiment), yname = "sentiment")
glimpse(data_train)
## Rows: 6,474
## Columns: 3
## $ Review <chr> "As it has grown big, Google has become miserably stagnant…
## $ text_clean <chr> "as it has grown big google has become miserably stagnant …
## $ sentiment <fct> negatives, negatives, negatives, negatives, negatives, neg…
Next, we create the document-term matrix (DTM) for each document. The terms will be a combination of unigrams (1-gram) and bigrams (2-gram) for each document.
stem_hunspell <- function(term) {
# look up the term in the dictionary
stems <- hunspell_stem(term)[[1]]
if (length(stems) == 0) { # if there are no stems, use the original term
stem <- term
} else { # if there are multiple stems, use the last one
stem <- stems[[length(stems)]]
}
return(stem)
}
train_term <- data_train %>%
rownames_to_column("id") %>%
unnest_tokens(output = "word", input = text_clean) %>%
anti_join(stop_words) %>%
mutate(word = text_tokens(word, stemmer = stem_hunspell) %>% as.character()) %>%
drop_na(word) %>%
count(id, word)
train_bigram <- data_train %>%
rownames_to_column("id") %>%
unnest_tokens(output = "word", input = text_clean, token = "ngrams", n = 2) %>%
drop_na(word) %>%
count(id, word)
test_term <- data_test %>%
rownames_to_column("id") %>%
unnest_tokens(output = "word", input = text_clean) %>%
anti_join(stop_words) %>%
mutate(word = text_tokens(word, stemmer = stem_hunspell) %>% as.character()) %>%
drop_na(word) %>%
count(id, word)
test_bigram <- data_test %>%
rownames_to_column("id") %>%
unnest_tokens(output = "word", input = text_clean, token = "ngrams", n = 2) %>%
drop_na(word) %>%
count(id, word)
Here is the resulting DTM from the corpus of text data.
dtm_train_review <- train_term %>%
bind_rows(train_bigram) %>%
cast_dtm(document = id, term = word, value = n)
dtm_test <- test_term %>%
bind_rows(test_bigram) %>%
cast_dtm(document = id, term = word, value = n)
inspect(dtm_train_review)
## <<DocumentTermMatrix (documents: 6474, terms: 204309)>>
## Non-/sparse entries: 885025/1321811441
## Sparsity : 100%
## Maximal term length: 48
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs amazon company employee if you in the manage of the people team time
## 1016 8 8 3 3 2 11 4 6 3 2
## 1117 20 7 8 5 2 3 3 4 0 7
## 1611 0 0 0 6 3 2 2 9 0 1
## 2298 26 7 6 0 8 29 8 6 32 6
## 2536 6 0 1 3 6 3 5 5 5 4
## 3347 15 5 0 1 2 3 6 9 12 10
## 3355 8 0 2 8 3 3 2 0 0 13
## 4261 2 10 4 4 1 1 1 10 4 5
## 4510 8 0 2 8 3 3 2 0 0 13
## 694 0 1 3 0 1 1 4 5 0 4
We will continue to reduce the number of terms by choosing only terms with a total frequency of at least 5 and at most 80% of the number of documents. We end up with around 26,000 terms across roughly 6,500 documents.
word_freq <- findFreqTerms(dtm_train_review, lowfreq = 5, highfreq = nrow(dtm_train_review)*0.8)
dtm_train <- dtm_train_review[ , word_freq ]
dtm_train
## <<DocumentTermMatrix (documents: 6474, terms: 25908)>>
## Non-/sparse entries: 621283/167107109
## Sparsity : 100%
## Maximal term length: 30
## Weighting : term frequency (tf)
LDA
We will build the LDA topic model for this document-term matrix, using k = 50 topics with 5,000 iterations and 4,000 burn-in iterations. Since the process is relatively long, we have also saved the previously trained topic model for the next chunk.
The topic distribution for each document (\(\theta\)) will be used as the features for the machine learning model. Using only 50 topics, we expect a 99.8% dimensionality reduction.
\[Dimensionality\ reduction = 1 - \frac{50}{25908} = 0.998 = 99.8\%\]
dtm_lda <- Matrix::Matrix(as.matrix(dtm_train), sparse = T)
set.seed(123)
lda_review <- FitLdaModel(dtm = dtm_lda,
k = 50,
iterations = 5000,
burnin = 4000
)
Finally, we prepare the features and the target variable of the training set for model fitting.
train_y <- data_train$sentiment[ rownames(lda_review$theta) %>% as.numeric() ]
train_x <- lda_review$theta
Bernoulli Conversion
For the conventional Naive Bayes, we will convert the numerical values (the frequency of each term in each document) into categorical values indicating whether the term is present in the document or not.
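The conversion chunk is not shown in the original; below is a minimal sketch that produces the train_bn and test_bn objects used later in the baseline model. The helper name bernoulli_conv and the present/absent labels are assumptions.

# convert term frequencies into presence/absence labels for Bernoulli Naive Bayes
# (bernoulli_conv and its labels are an assumed implementation)
bernoulli_conv <- function(x) {
  ifelse(x > 0, "present", "absent")
}

train_bn <- apply(as.matrix(dtm_train), MARGIN = 2, FUN = bernoulli_conv)
test_bn  <- apply(as.matrix(dtm_test[, intersect(word_freq, Terms(dtm_test))]),
                  MARGIN = 2, FUN = bernoulli_conv)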
We will use Random Forest and Naive Bayes to fit the data from LDA. We will also compare the performance of the model with a baseline model of Naive Bayes without the dimensionality reduction from LDA.
The random forest model will be trained using 500 trees and an mtry parameter of 2. The error rate from the Out-of-Bag (OOB) observations is around 6.3%, which corresponds to an accuracy of about 93.7%.
library(randomForest)
set.seed(123)
rf_lda <- randomForest(x = train_x,
y = train_y,
ntree = 500,
mtry = 2)
rf_lda
##
## Call:
## randomForest(x = train_x, y = train_y, ntree = 500, mtry = 2)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 6.27%
## Confusion matrix:
## negatives positives class.error
## negatives 3135 102 0.03151066
## positives 304 2933 0.09391412
Next, we prepare the testing dataset. To get the topic probability distribution features for each document, we run the topic model on the DTM of the testing set using only 100 iterations and a burn-in of 80.
dtm_lda_test <- Matrix::Matrix(as.matrix(dtm_test), sparse = T)
# Get the topic probabilities for each document
set.seed(123)
test_x <- predict(lda_review,
newdata = dtm_lda_test,
iterations = 100,
burnin = 80
)
Next, we predict the testing set using the trained model and see the performance via confusion matrix.
set.seed(123)
pred_test <- predict(rf_lda, test_x)
pred_prob <- predict(rf_lda, test_x, type = "prob")
test_y <- data_test$Sentiment[ rownames(dtm_test) %>% as.numeric() ]
pred_lda <- data.frame(predicted = factor(pred_test, levels = c("positives", "negatives")),
actual = factor(test_y, levels = c("positives", "negatives"))
)
conf_mat(pred_lda,
truth = actual,
estimate = predicted)
## Truth
## Prediction positives negatives
## positives 271 16
## negatives 175 793
We then translate the confusion matrix into several evaluation metrics, such as accuracy, recall/sensitivity, precision, and the F1 measure. We also calculate the area under the curve (AUC) to check the model's sensitivity toward changes in the classification threshold.
result_lda_rf <- data.frame(
accuracy = accuracy_vec( truth = pred_lda$actual,
estimate = pred_lda$predicted),
recall = sens_vec( truth = pred_lda$actual,
estimate = pred_lda$predicted),
precision = precision_vec( truth = pred_lda$actual,
estimate = pred_lda$predicted),
F1 = f_meas_vec(truth = pred_lda$actual,
estimate = pred_lda$predicted),
AUC = roc_auc_vec(truth = pred_lda$actual,
estimate = pred_prob[, 2])
) %>%
mutate_all(scales::percent, accuracy = 0.01)
result_lda_rf
We will feed the same LDA features to a Naive Bayes model as a comparison.
naive_lda <- naiveBayes(x = train_x,
y = train_y)
pred_test <- predict(naive_lda, test_x)
pred_prob <- predict(naive_lda, test_x, type = "raw")
pred_lda_bayes <- data.frame(predicted = factor(pred_test, levels = c("positives", "negatives")),
actual = factor(test_y, levels = c("positives", "negatives"))
)
conf_mat(pred_lda_bayes,
truth = actual,
estimate = predicted)
## Truth
## Prediction positives negatives
## positives 316 24
## negatives 130 785
Here are the evaluation metrics for the Naive Bayes model on LDA dataset.
result_lda_bayes <- data.frame(
accuracy = accuracy_vec( truth = pred_lda_bayes$actual,
estimate = pred_lda_bayes$predicted),
recall = sens_vec( truth = pred_lda_bayes$actual,
estimate = pred_lda_bayes$predicted),
precision = precision_vec( truth = pred_lda_bayes$actual,
estimate = pred_lda_bayes$predicted),
F1 = f_meas_vec(truth = pred_lda_bayes$actual,
estimate = pred_lda_bayes$predicted),
AUC = roc_auc_vec(truth = pred_lda_bayes$actual,
estimate = pred_prob[, 2])
) %>%
mutate_all(scales::percent, accuracy = 0.01)
result_lda_bayes
Lastly, we will train a Naive Bayes model on the original document-term matrix, which consists of 25,000+ terms, as a baseline or benchmark model. Since the prediction process of Naive Bayes takes too much time, we've prepared the prediction result in Rds format.
naive_gram <- naiveBayes(x = train_bn,
y = train_y)
pred_test_gram <- predict(naive_gram, test_bn)
pred_prob_gram <- predict(naive_gram, test_bn, type = "raw")
pred_test_gram <- read_rds("pred_test_gram.Rds")
pred_prob_gram <- read_rds("pred_prob_gram.Rds")
pred_gram_bayes <- data.frame(predicted = factor(pred_test_gram, levels = c("positives", "negatives")),
actual = factor(test_y, levels = c("positives", "negatives"))
)
conf_mat(pred_gram_bayes,
truth = actual,
estimate = predicted)
## Truth
## Prediction positives negatives
## positives 420 88
## negatives 26 721
Here are the evaluation metrics for the Naive Bayes model on the original dataset.
result_gram_bayes <- data.frame(
accuracy = accuracy_vec( truth = pred_gram_bayes$actual,
estimate = pred_gram_bayes$predicted),
recall = sens_vec( truth = pred_gram_bayes$actual,
estimate = pred_gram_bayes$predicted),
precision = precision_vec( truth = pred_gram_bayes$actual,
estimate = pred_gram_bayes$predicted),
F1 = f_meas_vec(truth = pred_gram_bayes$actual,
estimate = pred_gram_bayes$predicted),
AUC = roc_auc_vec(truth = pred_gram_bayes$actual,
estimate = pred_prob_gram[, 2])
) %>%
mutate_all(scales::percent, accuracy = 0.01)
result_gram_bayes
This is the recap of the performance of all trained models. Using only 50 features (a 99.8% dimensionality reduction) extracted from the original DTM via the topic model, there are some interesting findings. The recall/sensitivity of the LDA models, both Random Forest and Naive Bayes, is far lower than that of the Naive Bayes model using the original DTM. However, the LDA models have better precision, with more than 93% precision on the testing dataset. The trade-off between dimensionality reduction and model performance is a worthy one for this case, since the dimension is heavily reduced (again, a 99.8% reduction) while the models still achieve acceptable performance. The dimensionality reduction also results in faster computation, which is especially useful because the prediction process of Naive Bayes with many features takes too much time.
result_lda_rf %>%
bind_rows(result_lda_bayes, result_gram_bayes) %>%
mutate(
model = c("Random Forest", "Naive Bayes", "Naive Bayes"),
method = c("LDA", "LDA", "n-Gram"),
`n features` = c( 50, 50, ncol(dtm_train) )
) %>%
select(method, model, everything()) %>%
rename_all(str_to_title)