Natural Language Processing covers a wide range of topics and applications; one of them is topic modeling. A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovering hidden semantic structures in a text body. For example, "dog", "bone", and "obedient" will appear more often in documents about dogs, while "cute", "evil", and "home owner" will appear more often in documents about cats. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
In this article, we will learn to do topic modeling using the tidytext and textmineR packages. With tidytext we can easily build a topic model and visualize it in tidy style, but the package itself does not focus on topic modeling alone, so its topic-modeling functionality is not as comprehensive as textmineR's. With textmineR, besides building a topic model, we can also calculate the model's goodness of fit. We will use both packages to build topic models with the Latent Dirichlet Allocation (LDA) algorithm.
What is Topic Modeling
Topic modeling is how a machine groups the words within a collection of documents into 'topics': clusters of words with similar dependencies. Topic models (whether you say topic modeling or topic model, it is the same idea) help us organize, understand, and summarize large collections of textual information.
In a business setting, topic modeling's power to discover hidden topics can help an organization better understand its customer feedback, so it can concentrate on the issues customers are facing. It can also summarize text for company meetings: a high-quality meeting summary enables users to recall the meeting content efficiently. Topic tracking and detection can also be used to build recommender systems.
There are many techniques used to obtain topic models, namely: Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), Correlated Topic Models (CTM), and TextRank. In this study we will focus on implementing the LDA algorithm to build topic models with the tidytext and textmineR packages. Besides building the models, we will also evaluate their goodness of fit using metrics like R-squared and log likelihood, and measure the quality of the topics themselves with metrics like coherence and prevalence.
From the introduction above we know that there are several ways to build a topic model; in this study we will use the LDA algorithm. LDA is a mathematical model used to find the mixture of words associated with each topic, and to determine the mixture of topics that describes each document. LDA answers the following principles of topic modeling:
- Every document is a mixture of topics; the per-document topic distribution is called theta.
- Every topic is a mixture of words; the per-topic word distribution is called phi.
We will use two packages: tidytext (alongside the tidymodels family of packages) and textmineR. The tidytext package builds topic models easily and provides a method for extracting the per-topic-per-word probabilities, called \(\beta\) ("beta"), from the model. But it does not provide metrics for the goodness of the model the way textmineR does.
LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics. Plate notation (picture below) is a concise way of visually representing the dependencies among the model parameters.
LDA is a generative process. LDA assumes that new documents are created in the following way:
1. Determine number of words in document
2. Choose a topic mixture for the document over a fixed set of topics (example: 20% topic A, 50% topic B, 30% topic C)
3. Generate the words in the document by:
- pick a topic based on the document’s multinomial distribution (\(z_{m,n} \sim {\sf Multinomial(\theta_{m})}\))
- pick a word based on topic’s multinomial distribution (\(w_{m,n} \sim {\sf Multinomial(\varphi_{z_{mn}})}\)) (where \(\varphi_{z_{mn}}\) is the word distribution for topic z)
4. Repeat the process for a number of iterations until the word distribution in the topics meets the topic mixture chosen in step 2 (a toy simulation of this process is sketched below)
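To make the generative story concrete, here is a toy sketch (not part of the original analysis) that simulates steps 1-3 in base R; vocab, phi, and theta are made-up illustrative values.
# toy simulation of LDA's generative process (illustrative values only)
set.seed(1502)
vocab <- c("guitar", "string", "sound", "cheap", "great", "shipping")
phi   <- rbind(t1 = c(0.40, 0.30, 0.20, 0.04, 0.04, 0.02),  # word distribution of topic 1
               t2 = c(0.05, 0.05, 0.10, 0.40, 0.10, 0.30))  # word distribution of topic 2
theta <- c(t1 = 0.7, t2 = 0.3)  # step 2: the document's topic mixture
n_words <- 10                   # step 1: number of words in the document
doc <- replicate(n_words, {
  z <- sample(names(theta), 1, prob = theta)  # step 3a: pick a topic
  sample(vocab, 1, prob = phi[z, ])           # step 3b: pick a word from that topic
})
doc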
The data is from this Kaggle dataset. It contains customer feedback on Amazon musical instruments; every row represents one feedback item from one user. There are several columns, but we only need reviewText, which contains the text of the review; overall, the product rating from 1-5 given by the user; and reviewTime, which contains the time the review was given.
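The library calls and the code that loads the data are not shown in this extract; a minimal sketch that should reproduce them is below. The file name is an assumption based on the Kaggle dataset, and the warning shown after it was emitted when the raw file was read.
# packages used throughout this article
library(dplyr)        # data wrangling
library(stringr)      # string helpers
library(lubridate)    # mdy() date parsing
library(textclean)    # replace_* cleaning functions
library(tm)           # corpus / document term matrix
library(topicmodels)  # LDA()
library(tidytext)     # tidy(), reorder_within(), scale_x_reordered()
library(tidyr)        # pivot_longer()
library(ggplot2)      # plotting
library(textmineR)    # CreateDtm(), FitLdaModel() and friends

# load the review data (file name assumed from the Kaggle dataset)
data <- read.csv("Musical_instruments_reviews.csv", stringsAsFactors = FALSE)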
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
## EOF within quoted string
Clean the reviewTime column and convert it to date format. Also drop the unnecessary columns.
data <- data %>%
mutate(overall = as.factor(overall),
reviewTime = str_replace_all(reviewTime, pattern = " ",replacement = "-"),
reviewTime = str_replace(reviewTime, pattern = ",",replacement = ""),
reviewTime = mdy(reviewTime)) %>%
select(reviewText, overall,reviewTime)
head(data)
So the objective of this project is to discover what users are talking about for each rating. This will help the organization better understand its customer feedback, so it can concentrate on the issues customers are facing.
tidytext
Before we feed the text to the LDA model, we need to clean it. We will build a textcleaner function using several functions from the tm, textclean, and stringr packages. We also need to convert the text to Document Term Matrix (DTM) format, because the LDA() function (from the topicmodels package, used alongside tidytext) needs the dtm format.
# build textcleaner function
# requires dplyr, stringr, textclean, and tm (loaded above)
textcleaner <- function(x){
  x <- as.character(x)
  x <- x %>%
    str_to_lower() %>%                 # convert all the strings to lower case
    replace_contraction() %>%          # replace contractions with their multi-word forms
    replace_internet_slang() %>%       # replace internet slang with normal words
    replace_emoji() %>%                # replace emoji with words
    replace_emoticon() %>%             # replace emoticons with words
    replace_hash(replacement = "") %>% # remove hashtags
    replace_word_elongation() %>%      # replace informal writing with known semantic replacements
    replace_number(remove = T) %>%     # remove numbers
    replace_date(replacement = "") %>% # remove dates
    replace_time(replacement = "") %>% # remove times
    str_remove_all(pattern = "[[:punct:]]") %>%         # remove punctuation
    str_remove_all(pattern = "[^\\s]*[0-9][^\\s]*") %>% # remove strings mixed with numbers
    str_squish() %>%                   # reduce repeated whitespace inside a string
    str_trim()                         # remove whitespace from start and end of string

  # build a corpus and remove english stopwords
  xdtm <- VCorpus(VectorSource(x)) %>%
    tm_map(removeWords, stopwords("en"))

  # convert corpus to document term matrix
  return(DocumentTermMatrix(xdtm))
}
Because we want to know the topics for each rating, we split the data by rating.
data_1 <- data %>% filter(overall == 1)
data_2 <- data %>% filter(overall == 2)
data_3 <- data %>% filter(overall == 3)
data_4 <- data %>% filter(overall == 4)
data_5 <- data %>% filter(overall == 5)
table(data$overall)
##
## 1 2 3 4 5
## 14 21 77 245 735
From the table above we know that most of the feedback has the highest rating. Because the distributions are different, each rating will get a different treatment, especially when choosing the minimum term frequency. I'll make sure we use at least 700-1000 words per rating in the analysis.
# apply textcleaner function to the review text
dtm_5 <- textcleaner(data_5$reviewText)
# find the most frequent terms. I choose terms that occur at least 50 times
freqterm_5 <- findFreqTerms(dtm_5, 50)
# we have 981 words. subset the dtm to keep only those selected words
dtm_5 <- dtm_5[,freqterm_5]
# keep only documents (rows) that still contain at least one selected term
rownum_5 <- apply(dtm_5, 1, sum)
dtm_5 <- dtm_5[rownum_5 > 0,]
# fit LDA (topicmodels package). set k = 6, meaning we want to build 6 topics
lda_5 <- LDA(dtm_5, k = 6, control = list(seed = 1502))
# tidy the model, extracting beta, the per-topic-per-word probabilities
topic_5 <- tidy(lda_5, matrix = "beta")
# choose 15 words with highest beta from each topic
top_terms_5 <- topic_5 %>%
group_by(topic) %>%
top_n(15,beta) %>%
ungroup() %>%
arrange(topic,-beta)
# plot the topic and words for easy interpretation
plot_topic_5 <- top_terms_5 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()
plot_topic_5
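As an aside (not in the original flow), tidytext can also extract the per-document-per-topic probabilities, called \(\gamma\) ("gamma"), from the same fitted model; this is the document-level counterpart (theta) of the beta matrix used above. A minimal sketch:
# per-document-per-topic probabilities from the same model
doc_topic_5 <- tidy(lda_5, matrix = "gamma")
# the most dominant topic for each review
doc_topic_5 %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup() %>%
  head()
The same tidytext pipeline is then repeated for ratings 4 down to 1.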
dtm_4 <- textcleaner(data_4$reviewText)
freqterm_4 <- findFreqTerms(dtm_4,20)
dtm_4 <- dtm_4[,freqterm_4]
rownum_4 <- apply(dtm_4,1,sum)
dtm_4 <- dtm_4[rownum_4>0,]
lda_4 <- LDA(dtm_4,k = 6,control = list(seed = 1502))
topic_4 <- tidy(lda_4,matrix = "beta")
top_terms_4 <- topic_4 %>%
group_by(topic) %>%
top_n(15,beta) %>%
ungroup() %>%
arrange(topic,-beta)
plot_topic_4 <- top_terms_4 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()
plot_topic_4
dtm_3 <- textcleaner(data_3$reviewText)
freqterm_3 <- findFreqTerms(dtm_3,10)
dtm_3 <- dtm_3[,freqterm_3]
rownum_3 <- apply(dtm_3,1,sum)
dtm_3 <- dtm_3[rownum_3>0,]
lda_3 <- LDA(dtm_3,k = 6,control = list(seed = 1502))
topic_3 <- tidy(lda_3,matrix = "beta")
top_terms_3 <- topic_3 %>%
group_by(topic) %>%
top_n(15,beta) %>%
ungroup() %>%
arrange(topic,-beta)
plot_topic_3 <- top_terms_3 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()
plot_topic_3
dtm_2 <- textcleaner(data_2$reviewText)
freqterm_2 <- findFreqTerms(dtm_2,5)
dtm_2 <- dtm_2[,freqterm_2]
rownum_2 <- apply(dtm_2,1,sum)
dtm_2 <- dtm_2[rownum_2>0,]
lda_2 <- LDA(dtm_2,k = 6,control = list(seed = 1502))
topic_2 <- tidy(lda_2,matrix = "beta")
top_terms_2 <- topic_2 %>%
group_by(topic) %>%
top_n(15,beta) %>%
ungroup() %>%
arrange(topic,-beta)
plot_topic_2 <- top_terms_2 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()
plot_topic_2
dtm_1 <- textcleaner(data_1$reviewText)
freqterm_1 <- findFreqTerms(dtm_1,5)
dtm_1 <- dtm_1[,freqterm_1]
rownum_1 <- apply(dtm_1,1,sum)
dtm_1 <- dtm_1[rownum_1>0,]
lda_1 <- LDA(dtm_1,k = 6,control = list(seed = 1502))
topic_1 <- tidy(lda_1,matrix = "beta")
top_terms_1 <- topic_1 %>%
group_by(topic) %>%
top_n(15,beta) %>%
ungroup() %>%
arrange(topic,-beta)
plot_topic_1 <- top_terms_1 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()
plot_topic_1
textmineR
Just like the previous text-cleaning step, we will build a text cleaner function to automate the cleaning process. The difference is that we don't need to convert the text to dtm format ourselves: the textmineR package has its own dtm converter, CreateDtm(), and fitting an LDA model with textmineR needs a dtm made by that function. CreateDtm() also lets us set the n-gram window and remove punctuation, stopwords, and other simple text-cleaning targets.
textcleaner_2 <- function(x){
  x <- as.character(x)
  x <- x %>%
    str_to_lower() %>%                 # convert all the strings to lower case
    replace_contraction() %>%          # replace contractions with their multi-word forms
    replace_internet_slang() %>%       # replace internet slang with normal words
    replace_emoji() %>%                # replace emoji with words
    replace_emoticon() %>%             # replace emoticons with words
    replace_hash(replacement = "") %>% # remove hashtags
    replace_word_elongation() %>%      # replace informal writing with known semantic replacements
    replace_number(remove = T) %>%     # remove numbers
    replace_date(replacement = "") %>% # remove dates
    replace_time(replacement = "") %>% # remove times
    str_remove_all(pattern = "[[:punct:]]") %>%         # remove punctuation
    str_remove_all(pattern = "[^\\s]*[0-9][^\\s]*") %>% # remove strings mixed with numbers
    str_squish() %>%                   # reduce repeated whitespace inside a string
    str_trim()                         # remove whitespace from start and end of string

  # return a data frame; CreateDtm() will handle the dtm conversion
  return(as.data.frame(x))
}
# apply textcleaner function. note: we only clean the text without converting it to dtm
clean_5 <- textcleaner_2(data_5$reviewText)
clean_5 <- clean_5 %>% mutate(id = rownames(clean_5))
# create dtm
set.seed(1502)
dtm_r_5 <- CreateDtm(doc_vec = clean_5$x,
doc_names = clean_5$id,
ngram_window = c(1,2),
stopword_vec = stopwords("en"),
verbose = F)
dtm_r_5 <- dtm_r_5[,colSums(dtm_r_5)>2] # keep only terms that appear more than twice in the corpus
Create the LDA model using textmineR. Here we will build 20 topics. The reason we build so many topics is that textmineR has metrics to calculate topic quality, so afterwards we can pick the topics with the best quality.
set.seed(1502)
mod_lda_5 <- FitLdaModel(dtm = dtm_r_5,
k = 20, # number of topic
iterations = 500,
burnin = 180,
alpha = 0.1,beta = 0.05,
optimize_alpha = T,
calc_likelihood = T,
calc_coherence = T,
calc_r2 = T)
Once we have created a model, we need to evaluate it. For overall goodness of fit, textmineR has R-squared and log likelihood. R-squared is interpretable as the proportion of variability in the data explained by the model, as with linear regression.
mod_lda_5$r2
## [1] 0.2183867
The primary goodness-of-fit measures in topic modeling are likelihood methods. Likelihoods, generally the log likelihood, are naturally obtained from probabilistic topic models. The log likelihood here is \(P(tokens|topics)\) at each iteration.
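Since we set calc_likelihood = TRUE above, the fitted model stores the log likelihood at each iteration, and plotting the trace is a quick way to check that the sampler has converged. A sketch, assuming the log_likelihood element is a data frame with iteration and log_likelihood columns as textmineR returns:
# plot the log likelihood trace of the rating-5 model
plot(mod_lda_5$log_likelihood$iteration,
     mod_lda_5$log_likelihood$log_likelihood,
     type = "l",
     xlab = "Iteration", ylab = "Log likelihood",
     main = "Log likelihood per iteration, rating-5 model")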
Get the 15 top terms with the highest phi for each topic. Phi represents the distribution of words over topics; words with high phi have the highest frequency in a topic.
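The extraction step itself is not shown in this extract; with textmineR it would look like this sketch, whose result feeds the summary table built later:
# take the top 15 terms by phi for each topic
mod_lda_5$top_terms <- GetTopTerms(phi = mod_lda_5$phi, M = 15)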
Let's see the coherence value for each topic. Topic coherence scores a single topic by measuring the degree of semantic similarity between its high-scoring words. These measurements help distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference. For each pair of words \(\{a,b\}\), probabilistic coherence calculates \(P(b|a) - P(b)\), where \(a\) is more probable than \(b\) in the topic. In simple words, coherence tells us how associated the words in a topic are.
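Coherence is already stored on the model because we set calc_coherence = TRUE; for reference, it could equivalently be recomputed with textmineR's CalcProbCoherence() over, say, the top 5 words per topic (a sketch):
# probabilistic coherence over the top 5 words of each topic
CalcProbCoherence(phi = mod_lda_5$phi, dtm = dtm_r_5, M = 5)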
mod_lda_5$coherence
## t_1 t_2 t_3 t_4 t_5 t_6 t_7
## 0.12140404 0.08349523 0.05510456 0.11607445 0.16397834 0.05472121 0.09739406
## t_8 t_9 t_10 t_11 t_12 t_13 t_14
## 0.14221823 0.24856426 0.79310008 0.28175270 0.10231907 0.58667185 0.05449207
## t_15 t_16 t_17 t_18 t_19 t_20
## 0.09204392 0.10147505 0.07949897 0.04519463 0.13664781 0.21586105
We also want to look at the prevalence value. Prevalence tells us the most frequent topics in the corpus: it is the probability of each topic's distribution across the whole set of documents.
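Prevalence is likewise stored on the model; assuming textmineR's definition (column sums of the document-topic matrix theta, scaled to percent), it can be reproduced up to rounding:
# prevalence: share of each topic across all documents, in percent
colSums(mod_lda_5$theta) / sum(mod_lda_5$theta) * 100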
mod_lda_5$prevalence
## t_1 t_2 t_3 t_4 t_5 t_6 t_7 t_8
## 5.514614 5.296280 4.868778 7.484032 9.360072 2.748069 4.269445 4.195638
## t_9 t_10 t_11 t_12 t_13 t_14 t_15 t_16
## 5.380414 3.541380 5.807442 5.305865 3.243890 4.657203 5.488087 2.738993
## t_17 t_18 t_19 t_20
## 4.821128 4.035630 7.385820 3.857221
Now we have the top terms in each topic, the goodness of the model via r2 and log likelihood, and the quality of the topics via coherence and prevalence. Let's compile them in a summary.
mod_lda_5$summary <- data.frame(topic = rownames(mod_lda_5$phi),
coherence = round(mod_lda_5$coherence,3),
prevalence = round(mod_lda_5$prevalence,3),
top_terms = apply(mod_lda_5$top_terms,2,function(x){paste(x,collapse = ", ")}))
modsum_5 <- mod_lda_5$summary %>%
`rownames<-`(NULL)
modsum_5
We know that the quality of the model can be described with the coherence and prevalence values. Let's build a plot to identify which topics have the best quality.
modsum_5 %>% pivot_longer(cols = c(coherence,prevalence)) %>%
ggplot(aes(x = factor(topic,levels = unique(topic)), y = value, group = 1)) +
geom_point() + geom_line() +
facet_wrap(~name,scales = "free_y",nrow = 2) +
theme_minimal() +
labs(title = "Best topics by coherence and prevalence score",
subtitle = "Text review with 5 rating",
x = "Topics", y = "Value")
From the graph above we know that topic 10 has the highest quality, meaning the words in that topic are strongly associated with each other. But in terms of the probability of the topic's distribution across the whole set of documents (prevalence), topic 10 has a low score, meaning reviews rarely use the combination of words in topic 10 even though the words inside that topic support each other.
We can see whether topics can be grouped together using a dendrogram. The dendrogram uses Hellinger distance (a distance between two probability vectors) to decide whether topics are closely related. For instance, the dendrogram below suggests that there is greater similarity between topics 10 and 13.
mod_lda_5$linguistic <- CalcHellingerDist(mod_lda_5$phi)
mod_lda_5$hclust <- hclust(as.dist(mod_lda_5$linguistic),"ward.D")
# create a short label for each topic (needed by the paste() below)
mod_lda_5$labels <- LabelTopics(assignments = mod_lda_5$theta > 0.05, dtm = dtm_r_5, M = 1)
mod_lda_5$hclust$labels <- paste(mod_lda_5$hclust$labels, mod_lda_5$labels[,1])
plot(mod_lda_5$hclust)
Now that we have built the topic model for rating 5 and interpreted it, let's apply the same steps to every rating and see the differences in what people are talking about.
# apply textcleaner function. note: we only clean the text without converting it to dtm
clean_4 <- textcleaner_2(data_4$reviewText)
clean_4 <- clean_4 %>% mutate(id = rownames(clean_4))
# create dtm
set.seed(1502)
dtm_r_4 <- CreateDtm(doc_vec = clean_4$x,
doc_names = clean_4$id,
ngram_window = c(1,2),
stopword_vec = stopwords("en"),
verbose = F)
dtm_r_4 <- dtm_r_4[,colSums(dtm_r_4)>2]
set.seed(1502)
mod_lda_4 <- FitLdaModel(dtm = dtm_r_4,
k = 20, # number of topic
iterations = 500,
burnin = 180,
alpha = 0.1,beta = 0.05,
optimize_alpha = T,
calc_likelihood = T,
calc_coherence = T,
calc_r2 = T)
mod_lda_4$r2
## [1] 0.3154615
mod_lda_4$prevalence
## t_1 t_2 t_3 t_4 t_5 t_6 t_7 t_8
## 3.229516 4.122641 6.962079 3.157732 4.363310 5.211399 7.630494 5.013937
## t_9 t_10 t_11 t_12 t_13 t_14 t_15 t_16
## 5.091175 2.149174 5.166844 6.165504 5.431888 4.945937 8.144601 4.778653
## t_17 t_18 t_19 t_20
## 6.224335 4.276829 4.245946 3.688005
mod_lda_4$coherence
## t_1 t_2 t_3 t_4 t_5 t_6 t_7
## 0.22021122 0.05410031 0.13539539 0.27863946 0.07316558 0.78545018 0.19169473
## t_8 t_9 t_10 t_11 t_12 t_13 t_14
## 0.09110965 0.27837147 0.12226709 0.07872343 0.06980018 0.07813782 0.06988493
## t_15 t_16 t_17 t_18 t_19 t_20
## 0.17171459 0.09475681 0.06690737 0.19262703 0.05562603 0.44882875
mod_lda_4$summary <- data.frame(topic = rownames(mod_lda_4$phi),
coherence = round(mod_lda_4$coherence,3),
prevalence = round(mod_lda_4$prevalence,3),
top_terms = apply(mod_lda_4$top_terms,2,function(x){paste(x,collapse = ", ")}))
modsum_4 <- mod_lda_4$summary %>%
`rownames<-`(NULL)
modsum_4
modsum_4 %>% pivot_longer(cols = c(coherence,prevalence)) %>%
ggplot(aes(x = factor(topic,levels = unique(topic)), y = value, group = 1)) +
geom_point() + geom_line() +
facet_wrap(~name,scales = "free_y",nrow = 2) +
theme_minimal() +
labs(title = "Best topics by coherence and prevalence score",
subtitle = "Text review with 4 rating",
x = "Topics", y = "Value")
mod_lda_4$linguistic <- CalcHellingerDist(mod_lda_4$phi)
mod_lda_4$hclust <- hclust(as.dist(mod_lda_4$linguistic),"ward.D")
mod_lda_4$labels <- LabelTopics(assignments = mod_lda_4$theta > 0.05, dtm = dtm_r_4, M = 1)
mod_lda_4$hclust$labels <- paste(mod_lda_4$hclust$labels, mod_lda_4$labels[,1])
plot(mod_lda_4$hclust)
# apply textcleaner function. note: we only clean the text without converting it to dtm
clean_3 <- textcleaner_2(data_3$reviewText)
clean_3 <- clean_3 %>% mutate(id = rownames(clean_3))
# create dtm
set.seed(1502)
dtm_r_3 <- CreateDtm(doc_vec = clean_3$x,
doc_names = clean_3$id,
ngram_window = c(1,2),
stopword_vec = stopwords("en"),
verbose = F)
dtm_r_3 <- dtm_r_3[,colSums(dtm_r_3)>2]
set.seed(1502)
mod_lda_3 <- FitLdaModel(dtm = dtm_r_3,
k = 20, # number of topic
iterations = 500,
burnin = 180,
alpha = 0.1,beta = 0.05,
optimize_alpha = T,
calc_likelihood = T,
calc_coherence = T,
calc_r2 = T)
mod_lda_3$r2
## [1] 0.5157191
mod_lda_3$prevalence
## t_1 t_2 t_3 t_4 t_5 t_6 t_7 t_8
## 10.673904 7.279445 3.918509 5.551189 2.904855 3.990948 3.357977 7.441973
## t_9 t_10 t_11 t_12 t_13 t_14 t_15 t_16
## 4.062656 4.307141 4.099807 5.938272 4.749125 3.453146 6.155349 5.029986
## t_17 t_18 t_19 t_20
## 3.581739 4.039552 5.802925 3.661501
mod_lda_3$coherence
## t_1 t_2 t_3 t_4 t_5 t_6 t_7 t_8
## 0.1383092 0.1852647 0.1778427 0.2641414 0.2559441 0.3155844 0.3456169 0.1493240
## t_9 t_10 t_11 t_12 t_13 t_14 t_15 t_16
## 0.2597739 0.4354978 0.1198223 0.1774170 0.3179221 0.3509740 0.6501623 0.1703680
## t_17 t_18 t_19 t_20
## 0.3290501 0.3553571 0.5694805 0.2866017
mod_lda_3$summary <- data.frame(topic = rownames(mod_lda_3$phi),
coherence = round(mod_lda_3$coherence,3),
prevalence = round(mod_lda_3$prevalence,3),
top_terms = apply(mod_lda_3$top_terms,2,function(x){paste(x,collapse = ", ")}))
modsum_3 <- mod_lda_3$summary %>%
`rownames<-`(NULL)
modsum_3
modsum_3 %>% pivot_longer(cols = c(coherence,prevalence)) %>%
ggplot(aes(x = factor(topic,levels = unique(topic)), y = value, group = 1)) +
geom_point() + geom_line() +
facet_wrap(~name,scales = "free_y",nrow = 2) +
theme_minimal() +
labs(title = "Best topics by coherence and prevalence score",
subtitle = "Text review with 3 rating",
x = "Topics", y = "Value")
mod_lda_3$linguistic <- CalcHellingerDist(mod_lda_3$phi)
mod_lda_3$hclust <- hclust(as.dist(mod_lda_3$linguistic),"ward.D")
mod_lda_3$labels <- LabelTopics(assignments = mod_lda_3$theta > 0.05, dtm = dtm_r_3, M = 1)
mod_lda_3$hclust$labels <- paste(mod_lda_3$hclust$labels, mod_lda_3$labels[,1])
plot(mod_lda_3$hclust)
Reviews with ratings 2 and 1 have the fewest documents, so I will reduce the number of topics to 10 instead of 20.
# apply textcleaner function. note: we only clean the text without converting it to dtm
clean_2 <- textcleaner_2(data_2$reviewText)
clean_2 <- clean_2 %>% mutate(id = rownames(clean_2))
# create dtm
set.seed(1502)
dtm_r_2 <- CreateDtm(doc_vec = clean_2$x,
doc_names = clean_2$id,
ngram_window = c(1,2),
stopword_vec = stopwords("en"),
verbose = F)
dtm_r_2 <- dtm_r_2[,colSums(dtm_r_2)>2]
set.seed(1502)
mod_lda_2 <- FitLdaModel(dtm = dtm_r_2,
k = 10, # number of topic
iterations = 500,
burnin = 180,
alpha = 0.1,beta = 0.05,
optimize_alpha = T,
calc_likelihood = T,
calc_coherence = T,
calc_r2 = T)
mod_lda_2$r2
## [1] 0.7483237
mod_lda_2$prevalence
## t_1 t_2 t_3 t_4 t_5 t_6 t_7 t_8
## 2.234913 10.458201 13.543898 8.361944 19.279786 7.277687 4.267728 10.041225
## t_9 t_10
## 19.758664 4.775955
mod_lda_2$coherence
## t_1 t_2 t_3 t_4 t_5 t_6 t_7 t_8
## 0.8880952 0.3904762 0.4785714 0.6000000 0.1851190 0.2059524 0.5666667 0.4316667
## t_9 t_10
## 0.3079121 0.6880952
mod_lda_2$summary <- data.frame(topic = rownames(mod_lda_2$phi),
coherence = round(mod_lda_2$coherence,3),
prevalence = round(mod_lda_2$prevalence,3),
top_terms = apply(mod_lda_2$top_terms,2,function(x){paste(x,collapse = ", ")}))
modsum_2 <- mod_lda_2$summary %>%
`rownames<-`(NULL)
modsum_2
modsum_2 %>% pivot_longer(cols = c(coherence,prevalence)) %>%
ggplot(aes(x = factor(topic,levels = unique(topic)), y = value, group = 1)) +
geom_point() + geom_line() +
facet_wrap(~name,scales = "free_y",nrow = 2) +
theme_minimal() +
labs(title = "Best topics by coherence and prevalence score",
subtitle = "Text review with 2 rating",
x = "Topics", y = "Value")
mod_lda_2$linguistic <- CalcHellingerDist(mod_lda_2$phi)
mod_lda_2$hclust <- hclust(as.dist(mod_lda_2$linguistic),"ward.D")
mod_lda_2$labels <- LabelTopics(assignments = mod_lda_2$theta > 0.05, dtm = dtm_r_2, M = 1)
mod_lda_2$hclust$labels <- paste(mod_lda_2$hclust$labels, mod_lda_2$labels[,1])
plot(mod_lda_2$hclust)
As noted above, rating 1 also has few reviews, so again I will use 10 topics instead of 20.
# apply textcleaner function. note: we only clean the text without converting it to dtm
clean_1 <- textcleaner_2(data_1$reviewText)
clean_1 <- clean_1 %>% mutate(id = rownames(clean_1))
# create dtm
set.seed(1502)
dtm_r_1 <- CreateDtm(doc_vec = clean_1$x,
doc_names = clean_1$id,
ngram_window = c(1,2),
stopword_vec = stopwords("en"),
verbose = F)
dtm_r_1 <- dtm_r_1[,colSums(dtm_r_1)>2]
set.seed(1502)
mod_lda_1 <- FitLdaModel(dtm = dtm_r_1,
k = 10, # number of topic
iterations = 500,
burnin = 180,
alpha = 0.1,beta = 0.05,
optimize_alpha = T,
calc_likelihood = T,
calc_coherence = T,
calc_r2 = T)
mod_lda_1$r2
## [1] 0.371047
mod_lda_1$prevalence
## t_1 t_2 t_3 t_4 t_5 t_6 t_7 t_8
## 18.731861 4.397404 4.397404 4.946855 4.397404 14.438267 18.885716 4.397404
## t_9 t_10
## 13.096263 12.311420
mod_lda_1$coherence
## t_1 t_2 t_3 t_4 t_5 t_6 t_7
## 0.34523810 0.07380952 0.07380952 0.07380952 0.07380952 0.31071429 0.23928571
## t_8 t_9 t_10
## 0.07380952 0.17809524 0.03904762
mod_lda_1$summary <- data.frame(topic = rownames(mod_lda_1$phi),
coherence = round(mod_lda_1$coherence,3),
prevalence = round(mod_lda_1$prevalence,3),
top_terms = apply(mod_lda_1$top_terms,2,function(x){paste(x,collapse = ", ")}))
modsum_1 <- mod_lda_1$summary %>%
`rownames<-`(NULL)
modsum_1
modsum_1 %>% pivot_longer(cols = c(coherence,prevalence)) %>%
ggplot(aes(x = factor(topic,levels = unique(topic)), y = value, group = 1)) +
geom_point() + geom_line() +
facet_wrap(~name,scales = "free_y",nrow = 2) +
theme_minimal() +
labs(title = "Best topics by coherence and prevalence score",
subtitle = "Text review with 1 rating",
x = "Topics", y = "Value")
mod_lda_1$linguistic <- CalcHellingerDist(mod_lda_1$phi)
mod_lda_1$hclust <- hclust(as.dist(mod_lda_1$linguistic),"ward.D")
mod_lda_1$labels <- LabelTopics(assignments = mod_lda_1$theta > 0.05, dtm = dtm_r_1, M = 1)
mod_lda_1$hclust$labels <- paste(mod_lda_1$hclust$labels, mod_lda_1$labels[,1])
plot(mod_lda_1$hclust)
We've gone through the whole topic-modeling process, from cleaning the text to interpretation and analysis. Finally, let's see what people are talking about for each rating. We will choose 5 different topics with the highest quality (coherence). Each topic will have the 15 words with the highest phi (the distribution of words over topics).
Topics 10 and 13, which have the highest coherence scores, contain lots of 'sticking' and 'tongue' words. Maybe it's a phrase for a specific instrument, though it may also be an artifact of the cleaning step: textclean's replace_emoticon() maps emoticons like ':P' to phrases such as 'sticking tongue out', which then enter the vocabulary as ordinary tokens. The similar words push the coherence score up, but the low prevalence means the combination is rarely used in other reviews. In topic 11 and others, people talk about how good the product is; for example, words like 'good', 'accurate', 'clean', 'easy', 'recommend', and 'great' indicate positive sentiment.
As before, the topic with the highest coherence score is filled with 'sticking' and 'tongue' terms. In this rating people still praise the product, but not as much as in rating 5. Keep in mind that the dtm was built using bigrams: two-word terms like solid_state or e_tongue are captured and counted just like single words. With that information we know that all the words shown here have their own phi value and actually represent the reviews.
It looks like 'stick' and 'tongue' words are everywhere. Topic 15 has high coherence and prevalence values in rating 3, meaning lots of reviews in this rating talk about those terms. On the other hand, positive words are barely seen in this rating; most of the topics are filled with guitar- or string-related words.
In this rating people are complaining. Words like 'big_muff' and 'fuzz' may look like negative sentiment, but they most likely refer to the Big Muff fuzz pedal; a word like 'cheap' is more telling. In my opinion the model is somewhat poor for this rating, and the lack of documents probably affects the topic modeling.
In the worst rating people complain heavily. Words like 'junk', 'cheap', 'just', and 'back' are everywhere; there's a lot of difference compared with rating 5.
Overall, keep in mind that this dataset is a combination of products, so it is no surprise that some topics are filled with noise. Still, for every rating we were able to build topics separated by instrument, with their positive or negative reviews, which shows that LDA can build topics out of semantically related words. It would be better to do topic modeling on a specific product only, to discover the problems to remove and the strengths to keep. That would surely help an organization better understand its customer feedback, so it can concentrate on the issues customers are facing, especially for products with lots of reviews to analyze.