0. INTRODUCTION
With the rapid increase in digitization over the past 20 years, the field of text mining has quickly moved to the forefront of data science as a method for better understanding patterns and trends in society. Text mining is often the first step in developing intricate methods for more advanced processes such as social network analysis or natural language processing. For models to learn, they must experience everything (or as much as possible) that is and is not what they are attempting to predict or mimic. Traditionally, this has meant incredibly large volumes of text have been required to feed these learning demands.
A challenge to this process is how human information tendencies are changing. While we continue to create more and more data, it’s presenting in smaller and smaller chunks. Short-form text is quickly becoming the primary means by which people consume daily, practical, and time-sensitive information. From news, to social media, to research abstracts, texts of 200 words or fewer are gaining popularity for carrying the critical messages in our daily lives. As these shorter formats become increasingly important for analyzing and understanding social trends, it’s important for data science tools to adapt as the information environment continuously evolves (Jipeng et al., 2019).
The context for this project is provided by data from the website Towards Data Science (TDS), which I’ve used often to augment my studies on learning analytics. TDS is a Medium web publication where authors share concepts, ideas, and coding tips through self-published articles. This data experiment will explore what short-form text modeling can reveal about the changes in research focus over the last few years as charted through TDS articles published from 2018-2021.
This research project aims to answer the following questions:
Can short texts provide enough context to adequately inform topic modeling?
Can short text metadata be used to augment the accuracy of these topic models?
Can short texts provide enough fidelity to measure topic prevalence over time?
Data scientists, students, or any professionals researching this field could benefit by understanding how research/publication trends are linked to shifts in technology. Understanding these shifts could influence how quickly people adopt or advance new technologies in the data science field. Additionally, this information could be useful to understand technologies ripe for study in higher education leading to how students select focus areas, majors, or certificate programs.
2. WRANGLE
For this case study, the data will be parsed into tokens for initial exploration to identify potential trends and context over time. Three topic modeling techniques will be employed, each with its own specific data format requirements. To meet these, the data will be tidied and stemmed, and the stop words will be removed to better focus on topic-specific terminology.
2.1 Import Data
The raw data was downloaded into a comma separated values (.csv) file that will be imported into RStudio:
tds_raw <- read_csv("data/towards_data_science.csv")
With the data frame created, the data set will be further manipulated into a format fit for exploratory analysis. Wrangling will consist of the following steps:
Combine title and tagline columns to create a single ‘text’ variable.
Mutate the date column into separate ‘year’ and ‘month’ variables to enable time-based exploration.
Create a corpus through the tidytext package to enable additional text mining tools.
Stem the tidytext corpus and remove stop words.
Tokenize tidytext corpus into unigrams, bigrams, and trigrams to enable term frequency analysis.
Create a document term matrix (DTM) to explore the LDA modeling technique.
2.2 Create Single Text Variable
The current data frame contains the target text in two separate variables, title and tagline. As the pair of variables represents a single document (an article in this case), combining the text into a single character variable lets the models work from just one column.
tds_raw$text <- paste(tds_raw$title,tds_raw$tagline,sep=" ")
# Rename ID column
colnames(tds_raw)[1] <- "doc_id"
2.3 Create Year and Month Variables
To explore topic trending over time, the date variable is being separated into year and month variables.
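# Parse the date string, then derive year and month explicitly
# (funs() is deprecated in current dplyr releases)
tds_dates <- tds_raw %>%
  mutate(date = mdy(date),
         year = year(date),
         month = month(date))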
2.4 Create tidytext Corpus
This process decomposes the long text string from the text variable into single terms, while maintaining their tie to the source document (doc_id) and its metadata (date, year, month).
tds_tidy <- tds_dates %>%
unnest_tokens(output = word, input = text) %>%
anti_join(stop_words, by = "word")
# Remove numbers
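# grepl() keeps every row when nothing matches; negative indexing with grep() would drop them all
tds_tidy <- tds_tidy[!grepl("\\b\\d+\\b", tds_tidy$word), ]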
tidy_top_tokens <- tds_tidy %>%
count(word, sort = TRUE) %>%
top_n(10)
## Selecting by n
tidy_top_tokens
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 data 13884
## 2 learning 7084
## 3 python 6457
## 4 machine 4155
## 5 science 3930
## 6 model 2498
## 7 ai 2151
## 8 analysis 2096
## 9 guide 2096
## 10 deep 1986
The above code created a tidy version of the corpus at the single word (unigram) level while also removing stop words and numbers.
2.5 Stemming
Stemming reduces the feature size of a corpus by transforming terms to their base stem. Stemming reduces the chances of redundancy in terms and phrases as the various topic modeling techniques are explored.
tds_tidy <- tds_tidy %>%
mutate(word = wordStem(word))
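To illustrate what stemming does to terms that dominate this corpus, the short sketch below applies wordStem() from the SnowballC package (assumed here to be the source of the wordStem() call above, using its default Porter stemmer) to a handful of words. The word list is hand-picked for illustration and is not pulled from the data frame.

```r
library(SnowballC)

# Illustrative terms chosen by hand (not read from tds_tidy)
example_terms <- c("learning", "machine", "science", "series", "analysis")

# wordStem() uses the Porter stemmer by default
example_stems <- wordStem(example_terms)

# Side-by-side view of each term and its stem
data.frame(term = example_terms, stem = example_stems)
```

The stems are truncated roots rather than dictionary words, which is why forms such as machin, scienc, and seri appear in the frequency tables that follow.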
2.6 Cast a Document Term Matrix
The LDA model requires the text in the form of a document-term matrix (DTM): one row per document, one column per term, and each cell holding the count of that term in that document. In this case, the title will act as the unique document identifier.
tidy_tds_DTM <- tds_tidy %>%
count(title, word) %>%
cast_dtm(title, word, n)
tidy_tds_DTM
## <<DocumentTermMatrix (documents: 30128, terms: 14604)>>
## Non-/sparse entries: 294029/439695283
## Sparsity : 100%
## Maximal term length: 34
## Weighting : term frequency (tf)
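The sparsity figure in this printout can be checked by hand, which also shows how little text each document contributes. The sketch below uses only the numbers printed above; the variable names are introduced here for illustration.

```r
# Figures copied from the DocumentTermMatrix printout above
n_docs     <- 30128
n_terms    <- 14604
non_sparse <- 294029

# Total number of cells in the matrix and how many of them are zero
total_cells  <- n_docs * n_terms
sparse_cells <- total_cells - non_sparse

# Share of empty cells; this is what the printout rounds to 100%
sparsity <- sparse_cells / total_cells

# Average number of distinct terms per document
avg_terms_per_doc <- non_sparse / n_docs

round(100 * sparsity, 2)
round(avg_terms_per_doc, 2)
```

With fewer than ten distinct terms per title-and-tagline document on average, nearly every cell in the matrix is empty, which is the core difficulty short texts pose for document-level models such as LDA.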
2.7 Tokenization
Lastly, wrangling will end with the tokenization of the original data to enable further term frequency analysis at the bigram (word pair) and trigram (3-word set) levels. For these iterations, stop word removal and stemming have been incorporated:
tds_bigrams <- tds_dates %>%
unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)
tds_bigrams <- tds_bigrams %>%
separate(bigram, into = c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
mutate(word1 = wordStem(word1)) %>%
mutate(word2 = wordStem(word2)) %>%
unite(bigram, c(word1, word2), sep = " ")
bigram_top_tokens <- tds_bigrams %>%
count(bigram, sort = TRUE) %>%
top_n(10)
bigram_top_tokens
## # A tibble: 10 × 2
## bigram n
## <chr> <int>
## 1 machin learn 3905
## 2 data scienc 3727
## 3 data scientist 1654
## 4 deep learn 1419
## 5 neural network 1366
## 6 learn model 765
## 7 time seri 678
## 8 data analysi 583
## 9 covid 19 560
## 10 reinforc learn 458
tds_trigrams <- tds_dates %>%
unnest_tokens(output = trigram, input = text, token = "ngrams", n = 3)
tds_trigrams <- tds_trigrams %>%
separate(trigram, into = c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!word3 %in% stop_words$word) %>%
mutate(word1 = wordStem(word1)) %>%
mutate(word2 = wordStem(word2)) %>%
mutate(word3 = wordStem(word3)) %>%
unite(trigram, c(word1, word2, word3), sep = " ")
trigram_top_tokens <- tds_trigrams %>%
count(trigram, sort = TRUE) %>%
top_n(10)
trigram_top_tokens
## # A tibble: 10 × 2
## trigram n
## <chr> <int>
## 1 machin learn model 602
## 2 data scienc project 341
## 3 natur languag process 289
## 4 convolut neural network 233
## 5 exploratori data analysi 230
## 6 machin learn algorithm 201
## 7 deep learn model 138
## 8 time seri forecast 134
## 9 machin learn project 133
## 10 learn data scienc 132
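The sliding-window idea behind the bigram and trigram steps above can be sketched in base R. This is not the tidytext implementation; make_ngrams(), the example title, and the tiny stop word list are all hypothetical and exist only to show why an n-gram is dropped when any of its words is a stop word.

```r
# Build word n-grams from one string with a sliding window (base R only)
make_ngrams <- function(text, n = 2) {
  words <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical title, not taken from the TDS data
title <- "Time Series Forecasting with Python"

bigrams  <- make_ngrams(title, n = 2)
trigrams <- make_ngrams(title, n = 3)

# Tiny illustrative stop list (the analysis above uses tidytext's stop_words)
toy_stop_words <- c("with", "the", "a", "of")

# An n-gram is kept only if none of its words is a stop word
has_stop <- function(ngram) any(strsplit(ngram, " ")[[1]] %in% toy_stop_words)
bigrams_clean <- bigrams[!vapply(bigrams, has_stop, logical(1))]

bigrams_clean
```

Filtering after the n-grams are built, as the pipeline above does, removes any pair that contains a stop word rather than bridging across it.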
3. EXPLORATORY ANALYSIS
3.1 Published Article Counts
tds_dates %>%
ggplot(aes(x = date, color = factor(month))) +
geom_bar(show.legend = FALSE) +
labs(x = "Date",
y = "Article Counts",
title = "Towards Data Science Articles",
subtitle = "Published from 2018 - 2021")

This visual depicts the spread of roughly 30k articles over the four years. TDS had a great year in 2020, with over 70 articles published per day in the mid-year period. Any chance that was due to scientists being cooped up during the heart of the COVID-19 pandemic? Writing is a great way to pass the time! The next couple of sections will explore word frequencies to attempt to identify patterns within the most used words or word combinations.
3.2 Word Counts by Year
tds_tidy %>%
group_by(year) %>%
count(word, sort = TRUE) %>%
top_n(10) %>%
ungroup %>%
mutate(word = reorder_within(word, n, year)) %>%
ggplot(aes(x = word, y = n, fill = word)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ year, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(expand = c(0,0)) +
labs(y = "Count",
x = "Unique words",
title = "Most frequent words found in TDS article titles & taglines",
subtitle = "Stop words removed from the list")

When diagramming the top ten unigrams by year, many terms are repeated. Since a data science blogging site is the focus for this project, it’s no surprise that words such as data, science, machine, learning, etc. would appear at the top. The chart below indicates how this changed when the graph was expanded to include the top 20 terms:
tds_tidy %>%
group_by(year) %>%
count(word, sort = TRUE) %>%
top_n(20) %>%
ungroup %>%
mutate(word = reorder_within(word, n, year)) %>%
ggplot(aes(x = word, y = n, fill = word)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ year, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(expand = c(0,0)) +
labs(y = "Count",
x = "Unique words",
title = "Most frequent words found in TDS article titles & taglines",
subtitle = "Stop words removed from the list")

This second attempt does begin to reveal some unique terms by year. Another way to achieve this is by adding words common to all years to the list of stop words. In the end, I decided not to focus too much energy on individual terms as in this community, many of the key topics are described by multiple terms, such as ‘machine learning,’ as opposed to treating those words as separate and distinct entities. To that end, I repeated the above visuals, but for multi-word groupings.
3.3 Bigram Counts
tds_bigrams %>%
group_by(year) %>%
count(bigram, sort = TRUE) %>%
top_n(20) %>%
ungroup %>%
mutate(bigram = reorder_within(bigram, n, year)) %>%
ggplot(aes(x = bigram, y = n, fill = bigram)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ year, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(expand = c(0,0)) +
labs(y = "Count",
x = "Unique Bigrams",
title = "Most frequent bigrams found in article titles & taglines",
subtitle = "Stop words removed from the list")

Expanding to 2-word phrases reveals unique topics in each year. Some interesting examples are random forest (2018), object detect (2019), covid 19 (2020), and python code (2021).
3.4 Trigram Counts
tds_trigrams %>%
group_by(year) %>%
count(trigram, sort = TRUE) %>%
top_n(10) %>%
ungroup %>%
mutate(trigram = reorder_within(trigram, n, year)) %>%
ggplot(aes(x = trigram, y = n, fill = trigram)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ year, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(expand = c(0,0)) +
labs(y = "Count",
x = "Unique Trigrams",
title = "Most frequent trigrams found in article titles & taglines",
subtitle = "Stop words removed from the list")

Trigrams repeat quite a bit in the top ten, though you start seeing more complete ideas emerge about specific activities. Examples include data science job and time seri data.
3.5 Exploratory Analysis Insights
Word frequency analysis reveals that unique words and phrases do emerge from year to year. This implies that research topics are changing throughout the corpus of TDS articles between 2018 and 2021. Some other interesting features of this data set include:
single words do little to distinguish between topics and/or years
trigram phrases become repetitive beyond the top ten, so bigram phrases look to be the optimal focus for short text topic modeling
4. MODEL
To address the potential for topic generation in short-form text, I will compare various qualitative model results to determine the differences, if any, in how they identify unique topics using shorter data observations. This analysis will examine three models:
Latent Dirichlet Allocation (LDA). LDA works under the premise that every document contains a mixture of topics, and every topic is composed of a mixture of words.
Structural Topic Model (STM). STM employs document metadata to improve the assignment of words to topics in a corpus, and it can also be used to examine relationships between covariates and documents.
Biterm Topic Model (BTM). BTM was designed specifically for a short-format corpus and identifies topics by explicitly modeling word-word co-occurrences in a specified window of text.
4.1 Determining K
Each model is predicated on having a reasonable approximation for K, the number of topics to be identified. If K is too small, the corpus is divided into a few very generic topics. If K is too large, the collection is broken into too many topics that either overlap or are barely decipherable. Before the models can be applied, a common K value should be determined so the results can be compared consistently. A challenge to this process is that each model has its own method for determining K based on how the algorithm treats the text. As LDA and STM were designed for larger bodies of text, it will be interesting to see how these models treat the short-form data in this experiment. If a consistent K value cannot be determined explicitly, the alternative will be a trial-and-error approach to find a value that can be applied to all three models. As the main goal is to compare results between them, this alternative will meet the intent of the project. How an optimal K should be selected depends on factors unique to each type of topic model.
FindTopicsNumber() Function in the LDA Model
For the LDA model, four metrics were extracted, then plotted to visualize the maximum or minimum K value of each metric:
k_metrics <- FindTopicsNumber(
tidy_tds_DTM,
topics = seq(5, 50, by = 5),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = NA,
return_models = FALSE,
verbose = FALSE,
libpath = NULL
)
FindTopicsNumber_plot(k_metrics)

The key is to identify the visible bend, or inflection point, on each of the lines. This is where a line transitions from quickly increasing or decreasing to a flatter trajectory. For each of the lines, that inflection point falls between 20 and 30 topics. The one metric with a single, easily recognizable optimum is CaoJuan2009 (a metric that ldatuning treats as one to minimize), which points to 20 topics.
searchK() Function in STM
The stm package has a useful function called searchK which requires a specific range of values for K and outputs multiple goodness-of-fit measures:
findingk <- searchK(docs,
vocab,
K = c(5:30),
data = meta,
verbose=FALSE)
plot(findingk)

Based on the feedback from the LDA model, K was tested over the range of 5 to 30 topics. Similar to the LDA results, the curves hit their inflection point between 25 and 30 topics. (Note: this trial required ~ 8 hours to run on a fairly high-end MacBook Pro. Your mileage may vary.)
Determining K for BTM
Unlike the previous two models, there is no function built into the BTM package in R for estimating K. As this is a comparative study, using similar K values for each model should provide consistent data for this purpose. Therefore, the BTM model was executed assuming an optimum K value between 20 and 30 topics. The models will use these values while also adding a lower value of 10 to verify what happens if a K value is selected that is too low.
4.2 Latent Dirichlet Allocation (LDA) Model
LDA is a mathematical method for estimating the mixture of words that is associated with each topic, while also determining the mixture of topics that describes each document (Silge & Robinson, 2017).
tds_lda <- LDA(tidy_tds_DTM,
k = 20,
control = list(seed = 588))
terms(tds_lda, 5)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7
## [1,] "learn" "python" "data" "learn" "data" "data" "learn"
## [2,] "model" "function" "scienc" "data" "scienc" "python" "data"
## [3,] "data" "step" "sql" "tip" "python" "build" "featur"
## [4,] "understand" "learn" "top" "python" "learn" "neural" "python"
## [5,] "ai" "analysi" "databas" "model" "step" "network" "machin"
## Topic 8 Topic 9 Topic 10 Topic 11 Topic 12 Topic 13 Topic 14
## [1,] "data" "learn" "model" "code" "data" "learn" "model"
## [2,] "python" "deep" "data" "python" "scientist" "machin" "learn"
## [3,] "scienc" "data" "network" "perform" "learn" "model" "machin"
## [4,] "model" "intellig" "detect" "run" "it’" "ai" "data"
## [5,] "build" "artifici" "python" "chang" "machin" "data" "ai"
## Topic 15 Topic 16 Topic 17 Topic 18 Topic 19 Topic 20
## [1,] "introduct" "data" "data" "panda" "data" "build"
## [2,] "learn" "scienc" "learn" "guid" "scienc" "time"
## [3,] "machin" "time" "model" "simpl" "project" "imag"
## [4,] "data" "build" "network" "python" "creat" "analysi"
## [5,] "model" "seri" "machin" "data" "start" "detect"
For K = 20, the topics look relatively similar across the board, though there are a couple that stand out as unique. In this format, however, recognizing those topics is not easy. The faceted plot below provides a much more informative visual:
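# Extract the per-topic word probabilities (beta) from the fitted model first
top_terms_lda <- tidy(tds_lda, matrix = "beta") %>%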
group_by(topic) %>%
slice_max(beta, n = 5, with_ties = FALSE) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms_lda %>%
mutate(term = reorder_within(term, beta, topic)) %>%
group_by(topic, term) %>%
arrange(desc(beta)) %>%
ungroup() %>%
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 5 terms in each LDA topic",
x = expression(beta), y = NULL) +
facet_wrap(~ topic, ncol = 4, scales = "free")

I’ve highlighted three topics (11, 18, and 20) whose top five terms were outside the range of the majority of the other topics.
Topic 11 references running or changing Python code.
Topic 18 mentions a specific library in Python, called Pandas.
Topic 20 speaks to building models for image analysis/detection.
While three topics presented themselves as unique, that is a relatively small number within 20 topics. To explore how the results change based on the chosen number of topics, I ran the same model using K=10 and K=30.
For K=10, I had difficulty distinguishing between any of the topics. They were very generic, with repeated terms of data, machin, learn, model, and scienc, which are what one would expect to see on a website that published data science articles.

Using K=30 yields results similar to the image from K=20. While topic 3 is unique compared to the previous run, topics 18 and 20 look almost identical to their counterparts from the earlier trial.

4.3 Structural Topic Model (STM)
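The stm package in R requires the documents, metadata, and “vocab” (the full list of words used in the documents) to be stored in separate objects (see code below). The first call, textProcessor, lowercases the text, strips stop words, numbers, and punctuation, drops words shorter than three characters, and stems what remains. Eliminating both extremely common terms and extremely rare terms is common practice in topic modeling, since such terms make word-topic assignment much more difficult (Bail, 2019); with the settings used here, textProcessor handles the common terms, while pruning rare terms would need a separate step such as stm's prepDocuments function.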
temp <- textProcessor(tds_raw$text,
metadata = tds_raw,
lowercase=TRUE,
removestopwords=TRUE,
removenumbers=TRUE,
removepunctuation=TRUE,
wordLengths=c(3,Inf),
stem=TRUE,
onlycharacter= FALSE,
striphtml=TRUE,
customstopwords=NULL)
docs <- temp$documents
meta <- temp$meta
vocab <- temp$vocab
tds_stm <- stm(documents=docs,
data=meta,
vocab=vocab,
prevalence =~ date,
K=20,
max.em.its=25,
verbose = FALSE,
gamma.prior='L1')

This first iteration of the STM model (K=20) shows a much more diverse range of terms than LDA. For this data set, the only additional information available for distinguishing topics is the publication date of each article. In this linear format, however, it is still difficult to see how each topic differs from its neighbors. A more visual and useful method is enabled by the toLDAvis function.
toLDAvis(mod = tds_stm, docs = docs)

This tool explores how each topic relates to the others spatially. In a perfect environment, the model would have identified 20 topics separate and distinct from each other. The diagram would’ve shown 20 circles with no overlap. While it’s doubtful a model could achieve that level of precision, this diagram shows that with K=20, there is relatively good topic separation. There are just four areas where topics overlap, meaning they share a larger proportion of similar terms.
Topic 2 is highlighted as it is large and maintains good spacing from its nearest neighbors. The size indicates the share of the corpus's tokens assigned to this topic. In this case, the key term is ‘data,’ which is not surprising, but its companion terms are what separate it from its neighbors. The lone bubble indicates that this topic is distinct in its term composition. Combining the top 10 terms reveals the topic describes articles written about the skills required to become a data scientist. For comparison across topic values, this model was repeated for K=10 and K=30:

These views add to the inference that K=20 is pretty close to optimum. For K=10, eight of the topics overlap with a neighbor. Likewise for K=30, there are several groupings of overlapping topics.
4.4 Biterm Topic Model (BTM)
The Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrence patterns (i.e., biterms) (Wijffels, 2021).
A biterm consists of two words co-occurring in the same context, for example, in the same short text window. In this workflow, that window is described by two parameters: skipgram (an argument of udpipe's cooccurrence function) and window (an argument of the BTM function).
Skipgram defines how far ahead of each word to look for a co-occurring partner when the biterms are built explicitly, while window defines the size of the text window BTM uses when it extracts biterms itself. For this project, a skipgram value of 5 was used and the window was kept at the default value of 15.
BTM models the biterm occurrences across a complete corpus (unlike LDA models which model the word occurrences in a single document).
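To make the idea of a biterm concrete, the base-R sketch below enumerates the unordered word pairs in one short document that fall within a given window. It is an illustration only: make_biterms() and the example tokens are hypothetical, and the exact boundary convention (whether a distance equal to the window counts) is an assumption that may differ from the BTM and udpipe implementations.

```r
# Enumerate unordered word pairs whose positions are at most `window` apart.
# Assumption: a distance equal to `window` still counts as co-occurring.
make_biterms <- function(words, window = 15) {
  n <- length(words)
  if (n < 2) {
    return(data.frame(term1 = character(0), term2 = character(0),
                      stringsAsFactors = FALSE))
  }
  pairs <- t(combn(n, 2))                      # every position pair with i < j
  keep  <- (pairs[, 2] - pairs[, 1]) <= window # pairs close enough together
  data.frame(term1 = words[pairs[keep, 1]],
             term2 = words[pairs[keep, 2]],
             stringsAsFactors = FALSE)
}

# Hypothetical, already-cleaned tokens from one title
doc <- c("deep", "learning", "image", "detection")

all_pairs      <- make_biterms(doc, window = 15)  # window covers the whole title
adjacent_pairs <- make_biterms(doc, window = 1)   # only neighbouring words

all_pairs
```

For a title-length document, a window of 15 covers every word pair, so each short text contributes all of its pairs to the corpus-wide biterm set.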
Based on the first two iterations, the BTM trial focused on training a model to identify 20 topics:
# Tag parts of speech
anno <- udpipe(tds_raw, "english", trace = 10)
biterms <- as.data.table(anno)
biterms <- biterms[, cooccurrence(x = lemma,
relevant = upos %in% c("NOUN",
"ADJ",
"PROPN"),
skipgram = 5),
by = list(doc_id)]
# Build BTM
set.seed(588)
traindata <- subset(anno, upos %in% c("NOUN", "ADJ", "PROPN"))
traindata <- traindata[, c("doc_id", "lemma")]
model <- BTM(traindata, k = 20,
beta = 0.01,
iter = 500,
biterms = biterms,
trace = 100)
# Plot Model Results (do not run when knitting)
#library(ggraph)
#plot(model,
# top_n = 10,
# title = "BTM model",
# subtitle = "K = 20, 500 Training Iterations",
# labels = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
# "10", "11", "12", "13", "14", "15", "16", "17",
# "18", "19"))

5. COMMUNICATE
5.1 Topics Discoverability in Short-form Text
LDA. The LDA model did little to distinguish between topics beyond a generic summary of what one could expect to read in any article on data science. These results qualitatively confirm the inability of the model to consistently recognize unique topics in short-form text. Treating an article’s title and tagline as a complete “document” gives the model too little data to inform latent topic discovery.
STM. The STM model distinguished between topics much more effectively than the LDA model. These results reinforce that the optimum number of topics is around 20 and suggest that adding metadata increases the likelihood of distinguishing between unique topics.
BTM.
| Topic | Top ten terms |
|---|---|
| Topic 0 | language, text, model, NLP, Natural, word, processing, python, image, classification |
| Topic 1 | part, game, python, Carlo, Monte, Reinforcement, simulation, Markov, player, League |
| Topic 2 | Analysis, data, sentiment, python, customer, analysis, Twitter, recommendation, review, topic |
| Topic 3 | detection, image, object, model, vision, python, recognition, deep, face, computer |
| Topic 4 | data, ai, article, science, machine, week, ML, knowledge, paper, research |
| Topic 5 | python, Pandas, data, function, Sql, use, database, power, sql, file |
| Topic 6 | data, python, guide, step, Analysis, tutorial, Pandas, beginner, part, introduction |
| Topic 7 | data, model, time, machine, analysis, dataset, real, python, series, simple |
| Topic 8 | time, Series, python, Analysis, part, series, model, data, price, Forecasting |
| Topic 9 | python, code, Jupyter, line, notebook, data, Julia, programming, new, best |
| Topic 10 | feature, data, model, selection, machine, python, reduction, value, different, missing |
| Topic 11 | data, science, scientist, Science, project, interview, machine, Scientists, skill, job |
| Topic 12 | Neural, Network, Networks, deep, network, image, neural, Convolutional, PyTorch, part |
| Topic 13 | python, data, visualization, chart, plot, map, Plotly, create, Matplotlib, using |
| Topic 14 | Regression, model, python, Linear, machine, decision, regression, algorithm, random, Gradient |
| Topic 15 | data, Covid, Analysis, case, price, python, analysis, using, science, use |
| Topic 16 | data, Google, python, AWS, cloud, machine, model, web, step, Azure |
| Topic 17 | machine, learning, deep, learn, model, end, Reinforcement, part, models, Classification |
| Topic 18 | data, python, machine, science, testing, probability, common, learning, concept, introduction |
| Topic 19 | ai, data, Artificial, Intelligence, machine, be, science, intelligence, human, learning |
This list provides the top ten terms in each of the 20 topics as described by the BTM model. These topic phrases were, by far, the most distinctive of the three models. In addition, the visual showed the strength of ties between terms, depicting which relationships were most influential in each topic. From a qualitative perspective, BTM discovered the most distinct latent topics of the three models and looks to be an optimum choice for short texts.
5.2 Topic Proportion Over Time
While there are ways to visualize LDA topics over time, in this corpus that model did little to distinguish between relevant topics. Therefore, it was eliminated from topic proportion comparison over time. The BTM package in R is not designed to compare topics by metadata as those covariates are removed in building the final model. While I believe it’s possible to do, there needs to be a way to re-join the final BTM model data with the metadata. That left STM as the only model that was built to understand the metadata’s impact on the topic results. The STM exercise used date as a covariate leading to the plot below for each topic and how its prevalence changed over four years.

Some examples with large shifts over time include:
Deep Learning (1st topic, 2nd row) - significant decrease in popularity
How-to Guides on Statistics (4th topic, 2nd row) - decreased as articles on visualizations and coding became more popular than those on basic stats
Data Science Project Research (4th topic, 1st row) - gained in popularity as did topic on Python (5th topic in both 1st and 2nd rows)
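The re-join described above for BTM could be sketched as follows. This is an assumption-laden illustration rather than a tested extension of the project: it assumes a document-by-topic probability matrix whose row names are doc_id values (the shape that predict() on a BTM model is understood to return), and it uses a small made-up matrix and made-up metadata so the join and the yearly average can be run on their own.

```r
# Hypothetical document-topic probabilities; rows are doc_id values.
# In practice this matrix would come from the fitted BTM model.
doc_topics <- matrix(
  c(0.7, 0.3,
    0.2, 0.8,
    0.6, 0.4,
    0.1, 0.9),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("1", "2", "3", "4"), c("topic_0", "topic_1"))
)

# Hypothetical article metadata keyed by the same doc_id
article_meta <- data.frame(
  doc_id = c("1", "2", "3", "4"),
  year   = c(2018, 2018, 2021, 2021),
  stringsAsFactors = FALSE
)

# Move the row names into a doc_id column so the two tables can be merged
scores <- data.frame(doc_id = rownames(doc_topics), doc_topics,
                     stringsAsFactors = FALSE)
joined <- merge(article_meta, scores, by = "doc_id")

# Mean topic probability per year approximates topic prevalence over time
prevalence <- aggregate(cbind(topic_0, topic_1) ~ year, data = joined, FUN = mean)
prevalence
```

Keeping doc_id as the key through annotation and model fitting is what would make this join possible, since the BTM training data carries only doc_id and lemma.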
5.3 Summary Findings
This project was designed to answer three questions about using topic modeling to identify latent themes in a body of short-form text. Specifically:
- Can short texts provide enough context to adequately inform topic modeling?
Both the LDA and STM trials struggled to differentiate between unique topics. Even with metadata added, the STM model improved little over its LDA counterpart. The BTM model, however, discovered numerous unique topics missed by the other two models. The results showed both separate topics and the relationships between the terms within those topics. Therefore, BTM is a sound choice to inform short text topic modeling.
- Can short text metadata be used to augment the accuracy of these topic models?
This is still somewhat unknown. The STM model used metadata but showed little improvement over LDA, which only used document-term frequencies. As the metadata is stripped from the BTM model, it remains to be seen if those two elements can be re-joined for comparison.
- Can short texts provide enough fidelity to measure topic prevalence over time?
The STM model data was plotted over time and showed how topic prevalence changed. So while it is possible to do this with short-form text data, this experiment was unable to do this with the BTM model.
This project was able to demonstrate qualitatively that short texts can be used to discover latent topics. This could save enormous amounts of time and processing resources as shown by how quickly most of the models ran against a corpus of over 30K articles.
5.4 Limitations
Qualitative vs Quantitative. This study compared qualitative metrics on differences between terms, topics, and proportions over time. Quantitative metrics were not explored as it would have significantly increased the scope of the project. Metrics such as model efficiency, precision, and topic coherence are measures that could provide additional comparative insights for these models.
K determination for BTM. While mathematical methods exist to determine K values for the BTM model, they were too advanced for this project and are not built into the R package in a way that makes them simple to extract. As a result, the LDA and STM K values were compared to derive the optimum K.
Additional short text models. There are other models (such as stLDA-C) to derive latent topics from short texts, but they don’t have specific packages that work in the R environment.
Metadata. The only additional variable captured to distinguish between articles was publication date. Other variables available are author, length of article, number of comments, and number of likes. These variables could be used in future experiments to add more nuanced context to each article.
---
title: 'Short Text Topic Modeling: Article Titles & Taglines'
subtitle: 'Final Project for ECI 588, Text Mining in Education'
author: "James Hardaway"
date: "May 1, 2022"
output: 
  html_document:
    toc: true
    toc_depth: 3
    toc_float: yes
    code_folding: hide
    code_download: TRUE
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## 0. INTRODUCTION

With the rapid increase in digitization over the past 20 years, the
field of text mining has quickly moved to the forefront of data science
as a method for better understanding patterns and trends in society.
Text mining is often the first step in developing intricate methods for
more advanced processes such as social network analysis or natural
language processing. For models to learn, they must experience
everything (or as much as possible) that is and is not what they are
attempting to predict or mimic. Traditionally, this has meant incredibly
large volumes of text have been required to feed these learning demands.

A challenge to this process is how human information tendencies are
changing. While we continue to create more and more data, it's
presenting in smaller and smaller chunks. Short-form text is quickly
becoming the primary means by which people consume daily, practical, and
time-sensitive information. From news, to social media, to research
abstracts, texts of 200 words or fewer are gaining popularity for
carrying the critical messages in our daily lives. As these shorter
formats become increasingly important for analyzing and understanding
social trends, it's important for data science tools to adapt as the
information environment continuously evolves ([Jipeng et al.,
2019](https://doi.org/10.48550/arXiv.1904.07695)).

The context for this project is provided by data from the website
[Towards Data Science](https://towardsdatascience.com/) (TDS), which
I've used often to augment my studies on learning analytics. TDS is a
[Medium](https://medium.com/) web publication where authors share
concepts, ideas, and coding tips through self-published articles. This
data experiment will explore what short-form text modeling can reveal
about the changes in research focus over the last few years as
charted through TDS articles published from 2018-2021.

This research project aims to answer the following questions:

-   ***Can short texts provide enough context to adequately inform topic
    modeling?***

-   ***Can short text metadata be used to augment the accuracy of these
    topic models?***

-   ***Can short texts provide enough fidelity to measure topic
    prevalence over time?***

Data scientists, students, or any professionals researching this field
could benefit by understanding how research/publication trends are
linked to shifts in technology. Understanding these shifts could
influence how quickly people adopt or advance new technologies in the
data science field. Additionally, this information could be useful to
understand technologies ripe for study in higher education leading to
how students select focus areas, majors, or certificate programs.

------------------------------------------------------------------------

## 1. PREPARE

### 1.1 Data Source

My raw data was initially developed by [Johannes
Hötter](https://www.kaggle.com/johoetter) earlier this year and posted
to
[Kaggle](https://www.kaggle.com/datasets/johoetter/towards-data-science).
The data set contains the titles and taglines for over 30,000 TDS
articles. While the titles are self-explanatory, taglines are the
subtitles that tend to give a brief glimpse into the core purpose of the
article. The raw data is organized into five variables:

1.  **doc_id**: unique numerical identifier for each article

2.  **title**: article title

3.  **tagline**: article subtitle

4.  **url**: article web address

5.  **date**: publication date

### 1.2 R Package Set Up

The following packages were installed and/or loaded to prepare for this
project:

```{r load-packages, message=FALSE}
library(tidyverse)
library(tidytext)
library(SnowballC)
library(topicmodels)
library(stm)
library(ldatuning)
library(knitr)
library(tm)
library(lubridate)
library(kableExtra)
library(BTM)
library(textplot)
library(concaveman)
library(udpipe)
library(data.table)
library(stopwords)
```

------------------------------------------------------------------------

## 2. WRANGLE

For this case study, the data will be parsed into tokens for initial
exploration to identify potential trends and context over time. Three
topic modeling techniques will be employed, each with their own specific
data format requirements. To meet these, the data will be tidied and
stemmed and the stop words will be removed to better focus on
topic-specific terminology.

### 2.1 Import Data

The raw data was downloaded into a comma separated values (.csv) file
that will be imported into RStudio:

```{r titles import, message=FALSE}
tds_raw <- read_csv("data/towards_data_science.csv")
```

With the data frame created, the data set will be further manipulated
into a format fit for exploratory analysis. Wrangling will consist of
the following steps:

1.  Combine title and tagline columns to create a single **'text'**
    variable.

2.  Mutate the date column into separate **'year'** and **'month'**
    variables to enable time-based exploration.

3.  Create a corpus through the **`tidytext`** package to enable
    additional text mining tools.

4.  Stem the **`tidytext`** corpus and remove stop words.

5.  Tokenize **`tidytext`** corpus into unigrams, bigrams, and trigrams
    to enable term frequency analysis.

6.  Create a document term matrix (DTM) to explore the LDA modeling
    technique.

### 2.2 Create Single Text Variable

The current data frame contains the target text in two separate
variables, **title** and **tagline**. As the pair of variables
represents a single document (an article in this case), combining the
text into a single character variable lets each model work from just
one column.

```{r combine text}
tds_raw$text <- paste(tds_raw$title,tds_raw$tagline,sep=" ")

# Rename ID column
colnames(tds_raw)[1]  <- "doc_id"
```
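
One thing to watch for when pasting the two columns: `paste()` converts
a missing value into the literal string "NA", which would later be
tokenized as a word. Whether the real data contains missing taglines is
an assumption worth checking. The sketch below uses a small made-up data
frame (`toy` and its contents are hypothetical) to show one way to guard
against it:

```r
library(dplyr)

# Hypothetical two-row example; the second article has no tagline
toy <- tibble(
  title   = c("Intro to Pandas", "Random Forests Explained"),
  tagline = c("A quick tour for beginners", NA)
)

# paste() turns a missing tagline into the string "NA"
naive <- paste(toy$title, toy$tagline, sep = " ")

# Swap missing taglines for an empty string, then trim the trailing space
toy <- toy %>%
  mutate(text = trimws(paste(title, coalesce(tagline, ""), sep = " ")))

naive[2]     # "Random Forests Explained NA"
toy$text[2]  # "Random Forests Explained"
```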

### 2.3 Create Year and Month Variables

To explore topic trending over time, the **date** variable is being
separated into **year** and **month** variables.

```{r mutate date, warning=FALSE}
tds_dates <- tds_raw %>% 
  mutate(date = mdy(date),
         # funs() is deprecated in dplyr, so derive both columns directly
         year = year(date),
         month = month(date))
```
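
As a quick illustration of what the date step produces, the sketch below
parses two made-up date strings. It assumes the raw dates are written in
month-day-year order, as the use of `mdy()` implies; the exact format in
the Kaggle file is not shown here.

```r
library(lubridate)

# mdy() accepts several month-day-year spellings
d <- mdy(c("May 1, 2022", "12/31/2019"))

year(d)   # 2022 2019
month(d)  # 5 12
```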

### 2.4 Create **`tidytext`** Corpus

This process decomposes the long text string from the **text** variable
into single terms, while maintaining their tie to the source document
(**doc_id**) and its metadata (**date**, **year,** **month**).

```{r tidytext corpus}
tds_tidy <- tds_dates %>%
  unnest_tokens(output = word, input = text) %>%
  anti_join(stop_words, by = "word")

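# Remove numbers (filter() stays safe even if no token matches,
# whereas negative indexing on an empty grep() result drops every row)
tds_tidy <- tds_tidy %>% 
  filter(!str_detect(word, "\\b\\d+\\b"))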

tidy_top_tokens <- tds_tidy %>% 
  count(word, sort = TRUE) %>% 
  top_n(10)

tidy_top_tokens
```

The above code created a tidy version of the corpus at the single word
(unigram) level while also removing stop words and numbers.
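
To see what each step does to a single title, here is a minimal sketch
on two made-up titles (the `toy` data frame is hypothetical). It follows
the same sequence of tokenizing, dropping stop words, and dropping
numeric tokens:

```r
library(dplyr)
library(tidytext)
library(stringr)

toy <- tibble(doc_id = 1:2,
              text = c("A Guide to 5 Python Libraries",
                       "Why the Random Forest Works in 2021"))

toy_tidy <- toy %>%
  unnest_tokens(output = word, input = text) %>%  # lowercase, one word per row
  anti_join(stop_words, by = "word") %>%          # drop common English words
  filter(!str_detect(word, "\\b\\d+\\b"))         # drop numeric tokens

toy_tidy
```

Each surviving row still carries its `doc_id`, which is what keeps every
term tied to its source article.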

### 2.5 Stemming

Stemming reduces the feature size of a corpus by transforming terms to
their base stem. Stemming reduces the chances of redundancy in terms and
phrases as the various topic modeling techniques are explored.

```{r stem each corpus, warning=FALSE}
tds_tidy <- tds_tidy %>% 
  mutate(word = wordStem(word))
```
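
For a sense of what the stemmer returns, the short example below runs
`wordStem()` (Porter stemmer by default) on a few words typical of this
corpus. Note that stems are not always dictionary words, which is why
terms such as *machin* and *scienc* show up in later results.

```r
library(SnowballC)

stems <- wordStem(c("learning", "learned", "machine", "science", "series"))
stems
# "learn" "learn" "machin" "scienc" "seri"
```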

### 2.6 Cast a Document Term Matrix

The LDA model requires the text in the form of a document-term matrix
(DTM), in which each row is a document, each column is a term, and each
cell holds the number of times that term appears in that document. In
this case, the title acts as the document identifier.

```{r tidytext DTM}
tidy_tds_DTM <- tds_tidy %>%
  count(title, word) %>%
  cast_dtm(title, word, n)

tidy_tds_DTM
```
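
Because `cast_dtm()` builds one row per distinct document value,
articles that happen to share a title would be merged into a single row.
Whether duplicate titles exist in this data set is not established here;
the sketch below uses hypothetical data to show how the row count
differs when casting by **title** versus **doc_id**:

```r
library(dplyr)
library(tidytext)

# Hypothetical tidy corpus: articles 1 and 2 share the same title
toy_tidy <- tibble(
  doc_id = c(1, 1, 2, 2, 3, 3),
  title  = c("Intro to NLP", "Intro to NLP", "Intro to NLP", "Intro to NLP",
             "Time Series 101", "Time Series 101"),
  word   = c("intro", "nlp", "intro", "nlp", "time", "seri")
)

dtm_by_title <- toy_tidy %>% count(title, word) %>% cast_dtm(title, word, n)
dtm_by_id    <- toy_tidy %>% count(doc_id, word) %>% cast_dtm(doc_id, word, n)

dim(dtm_by_title)  # 2 documents x 4 terms
dim(dtm_by_id)     # 3 documents x 4 terms
```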

### 2.7 Tokenization

Lastly, wrangling will end with the tokenization of the original data to
enable further term frequency analysis at the *bigram* (word pair) and
*trigram* (3-word set) levels. For these iterations, stop word removal
and stemming have been incorporated:

```{r tokenize bigrams, message=FALSE}
tds_bigrams <- tds_dates %>%   
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)

tds_bigrams <- tds_bigrams %>% 
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  mutate(word1 = wordStem(word1)) %>% 
  mutate(word2 = wordStem(word2)) %>% 
  unite(bigram, c(word1, word2), sep = " ")

bigram_top_tokens <- tds_bigrams %>% 
  count(bigram, sort = TRUE) %>% 
  top_n(10)

bigram_top_tokens
```

```{r tokenize trigrams, message=FALSE}
tds_trigrams <- tds_dates %>%   
  unnest_tokens(output = trigram, input = text, token = "ngrams", n = 3)

tds_trigrams <- tds_trigrams %>% 
  separate(trigram, into = c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>% 
  filter(!word3 %in% stop_words$word) %>%
  mutate(word1 = wordStem(word1)) %>% 
  mutate(word2 = wordStem(word2)) %>% 
  mutate(word3 = wordStem(word3)) %>% 
  unite(trigram, c(word1, word2, word3), sep = " ")

trigram_top_tokens <- tds_trigrams %>% 
  count(trigram, sort = TRUE) %>% 
  top_n(10)

trigram_top_tokens
```

------------------------------------------------------------------------

## 3. EXPLORATORY ANALYSIS

### 3.1 Published Article Counts

```{r Article counts}
tds_dates %>% 
  ggplot(aes(x = date, color = factor(month))) +
  geom_bar(show.legend = FALSE) +
  labs(x = "Date",
     y = "Article Counts",
     title = "Towards Data Science Articles",
     subtitle = "Published from 2018 - 2021")
```

This visual depicts how the roughly 30,000 articles are spread over the
past four years. TDS had a great year in 2020, with the bars (one per
publication date) topping 70 articles in the mid-year period. Any chance
that was due to scientists being cooped up during the heart of the
COVID-19 pandemic? Writing is a great way to pass the time! The next
couple of sections will explore word frequencies to attempt to identify
patterns within the most used words or word combinations.
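
Because `geom_bar()` on the raw **date** column counts articles per
publication date, monthly totals have to be aggregated separately. A
minimal sketch of that aggregation on made-up dates (`toy_dates` is
hypothetical):

```r
library(dplyr)
library(lubridate)

toy_dates <- tibble(date = ymd(c("2020-05-01", "2020-05-01", "2020-05-17",
                                 "2020-06-02", "2020-06-30")))

monthly <- toy_dates %>%
  mutate(month_start = floor_date(date, unit = "month")) %>%  # snap to 1st of month
  count(month_start, name = "articles")                       # articles per month

monthly
# 2020-05-01: 3 articles, 2020-06-01: 2 articles
```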

### 3.2 Word Counts by Year

```{r Unigram counts, message=FALSE}
tds_tidy %>%
  group_by(year) %>%
  count(word, sort = TRUE) %>%
  top_n(10) %>%
  ungroup %>%
  mutate(word = reorder_within(word, n, year)) %>%
  ggplot(aes(x = word, y = n, fill = word)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ year, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0,0)) +
  labs(y = "Count",
       x = "Unique words",
       title = "Most frequent words found in TDS article titles & taglines",
       subtitle = "Stop words removed from the list")
```

When diagramming the top ten unigrams by year, many terms are repeated.
Since a data science blogging site is the focus for this project, it's
no surprise that words such as data, science, machine, learning, etc.
would appear at the top. The chart below indicates how this changed when
the graph was expanded to include the top 20 terms:

```{r Unigram top 20, message=FALSE}
tds_tidy %>%
  group_by(year) %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
  ungroup %>%
  mutate(word = reorder_within(word, n, year)) %>%
  ggplot(aes(x = word, y = n, fill = word)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ year, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0,0)) +
  labs(y = "Count",
       x = "Unique words",
       title = "Most frequent words found in TDS article titles & taglines",
       subtitle = "Stop words removed from the list")
```

This second attempt does begin to reveal some unique terms by year.
Another way to achieve this is by adding words common to all years to
the list of stop words. In the end, I decided not to focus too much
energy on individual terms as in this community, many of the key topics
are described by multiple terms, such as 'machine learning,' as opposed
to treating those words as separate and distinct entities. To that end,
I repeated the above visuals, but for multi-word groupings.
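
For reference, the stop-word alternative mentioned above could look
something like the sketch below. It is not part of this analysis; the
data are made up, and on the real corpus one would likely restrict the
check to each year's most frequent words rather than every word.

```r
library(dplyr)

# Hypothetical stemmed tokens tagged by year
toy_tidy <- tibble(
  year = c(2018, 2018, 2018, 2019, 2019, 2019),
  word = c("data", "learn", "forest", "data", "learn", "detect")
)

n_years <- n_distinct(toy_tidy$year)

# Words that appear in every year act as corpus-specific stop words
common_words <- toy_tidy %>%
  distinct(year, word) %>%
  count(word, name = "years_present") %>%
  filter(years_present == n_years) %>%
  select(word)

toy_unique <- toy_tidy %>%
  anti_join(common_words, by = "word")

toy_unique$word  # "forest" "detect"
```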

### 3.3 Bigram Counts

```{r Bigram counts, message=FALSE}
tds_bigrams %>%
  group_by(year) %>%
  count(bigram, sort = TRUE) %>%
  top_n(20) %>%
  ungroup %>%
  mutate(bigram = reorder_within(bigram, n, year)) %>%
  ggplot(aes(x = bigram, y = n, fill = bigram)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ year, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0,0)) +
  labs(y = "Count",
       x = "Unique Bigrams",
       title = "Most frequent bigrams found in article titles & taglines",
       subtitle = "Stop words removed from the list")
```

Expanding to 2-word phrases reveals unique topics in each year. Some
interesting examples are **random forest** (2018), **object detect**
(2019), **covid 19** (2020), and **python code** (2021).

### 3.4 Trigram Counts

```{r Trigram counts, message=FALSE}
tds_trigrams %>%
  group_by(year) %>%
  count(trigram, sort = TRUE) %>%
  top_n(10) %>%
  ungroup %>%
  mutate(trigram = reorder_within(trigram, n, year)) %>%
  ggplot(aes(x = trigram, y = n, fill = trigram)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ year, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0,0)) +
  labs(y = "Count",
       x = "Unique Trigrams",
       title = "Most frequent trigrams found in article titles & taglines",
       subtitle = "Stop words removed from the list")
```

Trigrams repeat quite a bit in the top ten, though you start seeing more
complete ideas emerge about specific activities. Examples include **data
science job** and **time seri data**.

### 3.5 Exploratory Analysis Insights

Word frequency analysis reveals that unique words and phrases do emerge
from year to year. This implies that research topics are changing
throughout the corpus of TDS articles between 2018 and 2021. Some other
interesting features of this data set include:

-   single words do little to distinguish between topics and/or years

-   trigram phrases become repetitive beyond the top ten, so bigram
    phrases look to be the optimal focus for short text topic modeling

------------------------------------------------------------------------

## 4. MODEL

To address the potential for topic generation in short-form text, I will
compare various qualitative model results to determine the differences,
if any, in how they identify unique topics using shorter data
observations. This analysis will examine three models:

-   **Latent Dirichlet Allocation** (LDA). LDA works under the premise
    that every document contains a mixture of topics, and every topic is
    composed of a mixture of words.

-   **Structural Topic Model** (STM). STM employs metadata to improve
    the assignment of words to topics in a corpus and that can be used
    to examine relationships between covariates and documents.

-   **Biterm Topic Model** (BTM). BTM was designed specifically for a
    short-format corpus and identifies topics by explicitly modeling
    word-word co-occurrences in a specified window of text.

### 4.1 Determining K

Each model is predicated on having a reasonable approximation for *K*,
the number of topics to be identified. If *K* is too small, the corpus
is divided into a few very generic topics. If *K* is too large, the
collection is broken into so many topics that they either overlap or
are hardly decipherable. Before the models can be applied, a common *K*
value should be determined so the results can be compared more
consistently. A challenge to this process is that each model has its
own method for determining *K*, based on how the algorithm treats the
text. As LDA and STM were designed for larger bodies of text, it will
be interesting to see how these models treat the short-form data in
this experiment. If a consistent *K* value cannot be determined
explicitly, the alternative is a trial-and-error approach to find a
value that can be applied to all three models. As the main goal is to
compare results between them, this alternative meets the intent of the
project.

#### FindTopicsNumber() Function in the LDA Model

For the LDA model, four metrics were extracted, then plotted to find
the *K* value at which each metric reaches its optimum (a maximum for
Griffiths2004 and Deveaud2014, a minimum for CaoJuan2009 and Arun2010):

```{r LDA Method, eval=FALSE}
k_metrics <- FindTopicsNumber(
  tidy_tds_DTM,
  topics = seq(5, 50, by = 5),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 77),
  mc.cores = NA,
  return_models = FALSE,
  verbose = FALSE,
  libpath = NULL
)
FindTopicsNumber_plot(k_metrics)
```

![](images/bdbac67d-8a28-4864-8f3d-22e949a5647e.png)

The key is to identify the visible bend, or inflection point, on each of
the lines. This is where they transition from quickly
increasing/decreasing to a flatter trajectory. In each of the lines,
that inflection point is between ***20*** and ***30*** topics. The one
metric with a single, easily recognizable optimum is the
***CaoJuan2009*** line, which points to ***20*** topics.
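
Reading the plot by eye can be cross-checked numerically. The sketch
below assumes a data frame shaped like the output of
`FindTopicsNumber()` (a `topics` column plus one column per metric) and
picks the best *K* per metric. The numbers in `k_metrics_toy` are made
up for illustration and are not the results of this project.

```r
# Hypothetical values in the shape returned by FindTopicsNumber():
# one row per candidate K, one column per metric
k_metrics_toy <- data.frame(
  topics        = c(10, 20, 30, 40),
  Griffiths2004 = c(-5200, -5050, -5000, -5020),
  CaoJuan2009   = c(0.21, 0.12, 0.15, 0.19),
  Arun2010      = c(95, 70, 62, 66),
  Deveaud2014   = c(1.9, 2.4, 2.2, 2.0)
)

maximize <- c("Griffiths2004", "Deveaud2014")  # higher is better
minimize <- c("CaoJuan2009", "Arun2010")       # lower is better

best_k <- c(
  sapply(maximize, function(m) k_metrics_toy$topics[which.max(k_metrics_toy[[m]])]),
  sapply(minimize, function(m) k_metrics_toy$topics[which.min(k_metrics_toy[[m]])])
)

best_k
```

When the metrics disagree, as they do in this toy example, the spread of
suggested values gives a range of *K* to try rather than a single answer.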

#### searchK() Function in STM

The **`stm`** package has a useful function called *searchK* which
requires a specific range of values for *K* and outputs multiple
goodness-of-fit measures:

```{r STM Method, eval=FALSE}
findingk <- searchK(docs,
                    vocab,
                    K = c(5:30),
                    data = meta,
                    verbose=FALSE)

plot(findingk)
```

![](images/STM_Kplot.jpeg)

Based on the feedback from the LDA model, the range was tested between 5
and 30 topics. Similar to the LDA results, the curves hit their
inflection point between ***25*** and ***30*** topics. *(Note: this
trial required \~ 8 hours to run on a fairly high-end MacBook Pro. Your
mileage may vary.)*

#### Determining K for BTM

Unlike the previous two models, there is no function built into the
**`BTM`** package in R for estimating *K*. As this is a comparative
study, using similar K values for each model should provide consistent
data for this purpose. Therefore, the BTM model was executed assuming an
optimum *K* value between ***20*** and ***30*** topics. The models will
use these values while also adding a lower value of 10 to verify what
happens if a *K* value is selected that is too low.

### 4.2 Latent Dirichlet Allocation (LDA) Model

LDA is a mathematical method for estimating the mixture of words that is
associated with each topic, while also determining the mixture of topics
that describes each document ([Silge & Robinson,
2017](https://www.tidytextmining.com/topicmodeling.html)).
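
To make the two mixtures concrete, the sketch below fits a two-topic
model to the `AssociatedPress` document-term matrix that ships with
**`topicmodels`** (not the TDS data) and extracts both sets of
probabilities with `tidy()`. The seed and *K* are arbitrary choices for
illustration.

```r
library(topicmodels)
library(tidytext)
library(dplyr)

data("AssociatedPress", package = "topicmodels")

ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 588))

# beta: probability of each word within a topic
ap_beta <- tidy(ap_lda, matrix = "beta")
# gamma: proportion of each document assigned to a topic
ap_gamma <- tidy(ap_lda, matrix = "gamma")

# Each topic's word probabilities sum to 1
beta_totals <- ap_beta %>%
  group_by(topic) %>%
  summarise(total = sum(beta))

# Each document's topic proportions sum to 1
gamma_totals <- ap_gamma %>%
  group_by(document) %>%
  summarise(total = sum(gamma))

beta_totals
head(gamma_totals, 3)
```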

```{r LDA Model}
tds_lda <- LDA(tidy_tds_DTM, 
                  k = 20, 
                  control = list(seed = 588))

terms(tds_lda, 5)
```

For ***K = 20***, the topics look relatively similar across the
board, though there are a couple that stand out as unique. In this
format, however, recognizing those topics is not easy. The faceted plot
below provides a much more informative visual:

```{r LDA Facet Plot, eval=FALSE}
top_terms_lda <- tidy(tds_lda, matrix = "beta") %>%  # per-topic word probabilities
  group_by(topic) %>%
  slice_max(beta, n = 5, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms_lda %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%    
  arrange(desc(beta)) %>%  
  ungroup() %>%
  ggplot(aes(beta, term, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  labs(title = "Top 5 terms in each LDA topic",
       x = expression(beta), y = NULL) +
  facet_wrap(~ topic, ncol = 4, scales = "free")
```

![](images/LDA_20.png)

I've highlighted three topics (11, 18, and 20) whose top five terms were
outside the range of the majority of the other topics.

-   Topic 11 references running or changing Python code.

-   Topic 18 mentions a specific library in Python, called Pandas.

-   Topic 20 speaks to building models for image analysis/detection.

While three topics presented themselves as unique, that is a relatively
small number within 20 topics. To explore how the results change based
on the chosen number of topics, I ran the same model using ***K=10***
and ***K=30***.

For *K*=10, I had difficulty distinguishing between any of the topics.
They were very generic, with repeated terms of ***data, machin, learn,
model,*** and ***scienc***, which are what one would expect to see on a
website that publishes data science articles.

![](images/LDA_10.png)

Using ***K=30*** yields results similar to the image from *K*=20. While
topic **3** is unique compared to the previous run, topics **18** and
**20** look almost identical to their counterparts from the earlier
trial.

![](images/LDA_30.png)

#### Performance Summary

With few exceptions, the LDA model did little to distinguish between
topics beyond a generic summary of what one could expect to read in any
article on data science. These results qualitatively confirm the
inability of the model to consistently recognize unique topics in
short-form text. Treating an article's title and tagline as a complete
"document" gives the model too little data to inform latent topic
discovery. The next model should provide somewhat different results as
the analysis includes article metadata to assist in determining
potential topics.

### 4.3 Structural Topic Model (STM)

The `stm` package in R requires the documents, metadata, and "vocab"
(the total list of words described in the documents) to be stored in
separate objects (see code below). The `textProcessor()` call lowercases
and stems the text and removes stop words, numbers, punctuation, and
words shorter than three characters. Trimming extremely common and
extremely rare terms is common practice in topic modeling, since such
terms make word-topic assignment much more difficult ([Bail,
2019](https://sicss.io/2019/materials/day3-text-analysis/topic-modeling/rmarkdown/Topic_Modeling.html)).

```{r STM Model, message=FALSE, results='hide'}
temp <- textProcessor(tds_raw$text, 
                      metadata = tds_raw,  
                      lowercase=TRUE, 
                      removestopwords=TRUE, 
                      removenumbers=TRUE,  
                      removepunctuation=TRUE, 
                      wordLengths=c(3,Inf),
                      stem=TRUE,
                      onlycharacter= FALSE, 
                      striphtml=TRUE, 
                      customstopwords=NULL)

docs <- temp$documents 
meta <- temp$meta 
vocab <- temp$vocab 

tds_stm <- stm(documents=docs, 
               data=meta,
               vocab=vocab, 
               prevalence =~ date,
               K=20,
               max.em.its=25,
               verbose = FALSE,
               gamma.prior='L1')
```

![](images/STM_20.png)

This first iteration of the STM model (*K*=20) shows a much more diverse
range of terms than LDA. For this data set, the only additional
information available for distinguishing topics is the publication date
of each article. This list format, however, still makes it difficult to
see how each topic differs from its neighbors. A more visual and
useful method is enabled by the **`toLDAvis`** function.

```{r LDAvis Explorer, message=FALSE}
toLDAvis(mod = tds_stm, docs = docs)
```

![](images/STM_vis20-01.png)

This tool explores how each topic relates to the others spatially. In a
perfect environment, the model would have identified 20 topics separate
and distinct from each other. The diagram would've shown 20 circles with
no overlap. While it's doubtful a model could predict that level of
precision, this diagram shows that with ***K=20***, there is relatively
good topic separation. There are just four areas where topics overlap,
meaning they share a larger proportion of similar terms.

**Topic 2** is highlighted as it is large and maintains good spacing
from its nearest neighbors. The size of a bubble reflects how prevalent
the topic is across the corpus. In this case, the key term is 'data,' which
is not surprising, but its companion terms are what separates it from
its neighbors. The lone bubble indicates that this topic is distinct in
its term composition. Combining the top 10 terms reveals the topic
describes articles written about the skills required to become a data
scientist. For comparison across topic values, this model was repeated
for ***K=10*** and ***K=30***:

![](images/Picture1.png)

These views add to the inference that K=20 is pretty close to optimum.
For K=10, eight of the topics overlap with a neighbor. Likewise for
K=30, there are several groupings of overlapping topics.

#### Performance Summary

Qualitatively, the STM model distinguished between topics much more
effectively than the LDA model. The separation of topic terms and the
medium to large-sized bubbles reinforce that the optimum number of
topics is around 20 and that adding metadata increases the likelihood
of distinguishing between unique topics.

### 4.4 Biterm Topic Model (BTM)

The Biterm Topic Model (BTM) is a word co-occurrence based topic model
that learns topics by modeling word-word co-occurrence patterns (i.e.,
biterms) ([Wijffels,
2021](https://cran.r-project.org/web/packages/BTM/BTM.pdf)).

-   A biterm consists of two words co-occurring in the same context, for
    example, in the same short text window. This window is controlled by
    the *skipgram* argument of `cooccurrence()` (from **`udpipe`**) and
    the *window* argument of `BTM()`.

-   Skipgram sets how far apart two words may be and still count as
    co-occurring, while window sets the window size used for biterm
    extraction when biterms are not supplied explicitly. For this
    project, a skipgram value of 5 was used and the window was kept at
    the default value of 15.

-   BTM models the biterm occurrences across a complete corpus (unlike
    LDA models which model the word occurrences in a single document).
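
To make the idea of a biterm concrete, here is a small base R sketch
that lists every unordered word pair in one short, already-cleaned
title. It is an illustration only, not the package's extraction routine,
and it assumes the whole title fits inside a single window so that every
pair counts.

```r
# A short text after stop word removal (made-up example)
words <- c("random", "forest", "python", "tutorial")

# All unordered word pairs (biterms) within the text
biterms_toy <- t(combn(words, 2))
colnames(biterms_toy) <- c("term1", "term2")

biterms_toy
nrow(biterms_toy)  # 6 pairs from 4 words
```

Even a four-word title yields six biterms, which is how BTM gathers
more co-occurrence evidence from short texts than a per-document word
count would.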

Based on the first two iterations, the BTM trial focused on training a
model to identify 20 topics:

```{r BTM Model, results='hide'}
# Tag parts of speech
anno    <- udpipe(tds_raw, "english", trace = 10)
biterms <- as.data.table(anno)
biterms <- biterms[, cooccurrence(x = lemma,
                                  relevant = upos %in% c("NOUN",
                                                         "ADJ",
                                                         "PROPN"),
                                  skipgram = 5),
                   by = list(doc_id)]

# Build BTM
set.seed(588)
traindata <- subset(anno, upos %in% c("NOUN", "ADJ", "PROPN"))
traindata <- traindata[, c("doc_id", "lemma")]
model <- BTM(traindata, k = 20, 
             beta = 0.01, 
             iter = 500,
             biterms = biterms, 
             trace = 100)

# Plot Model Results (do not run when knitting)
#library(ggraph)
#plot(model,
#     top_n = 10,
#     title = "BTM model",
#     subtitle = "K = 20, 500 Training Iterations",
#     labels = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
#                "10", "11", "12", "13",  "14", "15", "16", "17", 
#                "18", "19"))
```

![](images/BTM_20_labels.png)

#### Performance Summary

The visual results of the BTM with K set to 20 topics depict a few
unique features:

1.  None of the grouped topics repeat. They are all fairly unique in
    terminology.
2.  Word size is a product of how important (or common) that word is in
    that particular topic. Words in smaller font had a lower
    probability (***phi***) of appearing in that topic.
3.  The line weights within the topics are indicative of tie strength
    between words. Thicker lines indicate stronger relationships between
    terms.

While this visualization resembles word clouds, the algorithm behind the
graphic is much more complex than simple term frequencies and is a
product of the co-occurrence of terms.

------------------------------------------------------------------------

## 5. COMMUNICATE

### 5.1 Topic Discoverability in Short-form Text

**LDA.** The LDA model did little to distinguish between topics beyond a
generic summary of what one could expect to read in any article on data
science. These results qualitatively confirm the inability of the model
to consistently recognize unique topics in short-form text. Treating an
article's title and tagline as a complete "document" gives the model too
little data to inform latent topic discovery.

**STM.** The STM model distinguished between topics much more
effectively than the LDA model. These results reinforce that the
optimum number of topics is around 20 and that adding metadata
increases the likelihood of distinguishing between unique topics.

**BTM.**

```{r List BTM Terms by Topic, echo=FALSE}
topicterms <- terms(model, top_n = 10)
btm_terms <- as.data.frame(topicterms)

# Build BTM topic table: drop the probability columns, then label the
# remaining token columns "Topic 0" through "Topic 19"
btm_topics <- btm_terms %>% 
  select(-starts_with('p'))
colnames(btm_topics) <- paste("Topic", seq_along(btm_topics) - 1)
  
# View BTM Topic Table
btm_topics %>% 
  kbl() %>% 
  kable_styling() %>% 
  scroll_box(width = "800px", height = "500px")
```

This list provides the top ten terms in each of the 20 topics as
described by the BTM model. These topic phrases were, by far, the most
unique of the three models. In addition, the visual showed the strength
of ties between terms, indicating which relationships were most
influential in each topic. From a qualitative perspective, BTM discovered the most
unique latent topics of the three models and looks to be an optimum
choice for short texts.

### 5.2 Topic Proportion Over Time

While there are ways to visualize LDA topics over time, in this corpus
that model did little to distinguish between relevant topics. Therefore,
it was eliminated from topic proportion comparison over time. The BTM
package in R is not designed to compare topics by metadata as those
covariates are removed in building the final model. While I believe it's
possible to do, there needs to be a way to re-join the final BTM model
data with the metadata. That left STM as the only model that was built
to understand the metadata's impact on the topic results. The STM
exercise used *date* as a covariate leading to the plot below for each
topic and how its prevalence changed over four years.

![](images/STM_20_time.png)

Some examples with large shifts over time include:

-   **Deep Learning** (1st topic, 2nd row) - significant decrease in
    popularity

-   **How-to Guides on Statistics** (4th topic, 2nd row) - decreased as
    articles on visualizations and coding became more popular than
    those on basic statistics

-   **Data Science Project Research** (4th topic, 1st row) - gained in
    popularity, as did the topics on Python (5th topic in both the 1st
    and 2nd rows)
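
On re-joining BTM results with metadata: one possible route, offered
here as an untested assumption for this corpus, is to score each
document with the package's `predict()` method, which returns a
document-by-topic probability matrix, and then join those scores back to
the dates by **doc_id**. The sketch below uses a small made-up matrix in
place of real `predict()` output; `scores`, `meta_toy`, and the topic
names are all hypothetical.

```r
library(dplyr)
library(tidyr)

# Stand-in for predict(model, newdata = traindata): rows are documents,
# columns are topics, values are P(topic | document). Numbers are made up.
scores <- matrix(c(0.8, 0.2,
                   0.6, 0.4,
                   0.1, 0.9),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("1", "2", "3"), c("topic_1", "topic_2")))

# Hypothetical metadata keyed by the same doc_id
meta_toy <- tibble(doc_id = c("1", "2", "3"),
                   year   = c(2018, 2018, 2019))

topic_by_year <- as.data.frame(scores) %>%
  mutate(doc_id = rownames(scores)) %>%
  inner_join(meta_toy, by = "doc_id") %>%
  pivot_longer(starts_with("topic_"),
               names_to = "topic", values_to = "prob") %>%
  group_by(year, topic) %>%
  summarise(mean_prob = mean(prob), .groups = "drop")

topic_by_year
```

The resulting table has one row per year and topic, which is the shape
needed to plot topic proportion over time.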

### 5.3 Summary Findings

This project was designed to answer three questions about using topic
modeling to identify latent themes in a body of short-form text.
Specifically:

1.  ***Can short texts provide enough context to adequately inform topic
    modeling?***

Both the LDA and STM trials struggled to differentiate between unique
topics. Even with metadata added, the STM model improved little over its
LDA counterpart. The BTM model, however, discovered numerous unique
topics missed by the other two models. The results showed both separate
topics and the relationships between the terms within those topics.
Therefore, BTM is a sound choice to inform short text topic modeling.

2.  ***Can short text metadata be used to augment the accuracy of these
    topic models?***

This is still somewhat unknown. The STM model used metadata but showed
little improvement over LDA, which only used document-term frequencies.
As the metadata is stripped from the BTM model, it remains to be seen if
those two elements can be re-joined for comparison.

3.  ***Can short texts provide enough fidelity to measure topic
    prevalence over time?***

The STM model data was plotted over time and showed how topic prevalence
changed. So while it is possible to do this with short-form text data,
this experiment was unable to do this with the BTM model.

This project was able to demonstrate qualitatively that short texts can
be used to discover latent topics. This could save enormous amounts of
time and processing resources as shown by how quickly most of the models
ran against a corpus of over 30K articles.

### 5.4 Limitations

**Qualitative vs Quantitative.** This study compared the models
qualitatively, looking at differences between terms, topics, and
proportions over time. Quantitative metrics were not explored, as doing
so would have significantly increased the scope of the project. Measures
such as model efficiency, precision, and topic coherence could provide
additional comparative insights for these models.
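
As an illustration of what one such quantitative measure might look
like, the sketch below computes a UMass-style topic coherence score in
base R. It was not part of this study. The function name
`umass_coherence`, the toy matrix `toy_dtm`, and the choice of the UMass
variant are all assumptions made for the example; a real comparison
would use each model's top terms and the project's own document-term
matrix.

```r
# UMass-style coherence for one topic (hypothetical helper, not from the study).
# dtm:       numeric matrix, rows = documents, columns = terms (counts or 0/1)
# top_terms: a topic's top terms, most probable first; each term is assumed
#            to occur in at least one document
umass_coherence <- function(dtm, top_terms, eps = 1) {
  present  <- (dtm[, top_terms, drop = FALSE] > 0) * 1
  doc_freq <- colSums(present)    # documents containing each term
  co_freq  <- crossprod(present)  # documents containing both terms of a pair
  score <- 0
  for (m in seq_along(top_terms)[-1]) {
    for (l in seq_len(m - 1)) {
      # log of smoothed co-document frequency over the earlier term's frequency
      score <- score + log((co_freq[m, l] + eps) / doc_freq[[l]])
    }
  }
  score
}

# Toy corpus of four short "documents" (made-up data for illustration only)
toy_dtm <- matrix(
  c(1, 1, 0, 0,
    1, 1, 0, 0,
    1, 0, 1, 0,
    0, 0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4),
                  c("python", "pandas", "neural", "network"))
)

coherent   <- umass_coherence(toy_dtm, c("python", "pandas"))
incoherent <- umass_coherence(toy_dtm, c("python", "network"))
print(c(coherent = coherent, incoherent = incoherent))
```

Scores closer to zero indicate terms that tend to appear in the same
documents. In the toy data, "python" and "pandas" co-occur and score
higher than "python" and "network", which never share a document.
Averaging such scores across each model's topics would give one number
per model to set beside the qualitative comparison.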

**K determination for BTM.** While mathematical methods exist to
determine K values for the BTM model, they were too advanced for this
project and are not built into the R package in a way that makes them
simple to extract. As a result, the LDA and STM K values were compared
to derive the optimum K.

**Additional short text models.** Other models (such as stLDA-C) can
derive latent topics from short texts, but they lack dedicated packages
that work in the R environment.

**Metadata.** The only additional variable captured to distinguish
between articles was publication date. Other variables available are
author, length of article, number of comments, and number of likes.
These variables could be used in future experiments to add more nuanced
context to each article.

------------------------------------------------------------------------

## 6. REFERENCES

1.  Bail, C. (2019). *Topic Modeling*. Text as Data Course. Retrieved
    April 22, 2022, from
    <https://sicss.io/2019/materials/day3-text-analysis/topic-modeling/rmarkdown/Topic_Modeling.html>
2.  Wijffels, J. (2021). *BTM: Biterm topic models for short text*.
    CRAN - Package BTM. Retrieved April 27, 2022, from
    <https://cran.r-project.org/web/packages/BTM/BTM.pdf>
3.  Jipeng, Q., Zhenyu, Q., Yun, L., Yunhao, Y., & Xindong, W. (2019).
    *Short text topic modeling techniques, applications, and
    performance: A survey*. arXiv.org. Retrieved April 21, 2022, from
    <https://doi.org/10.48550/arXiv.1904.07695>
4.  Silge, J., & Robinson, D. (2017). *Topic modeling*. Text Mining with
    R. Retrieved April 30, 2022, from
    <https://www.tidytextmining.com/topicmodeling.html>
5.  Yan, X., Guo, J., Lan, Y., & Cheng, X. (2013). *A biterm topic model
    for short texts*. Xiaohui Yan's Homepage. Retrieved April 27, 2022,
    from <http://xiaohuiyan.github.io/paper/BTM-WWW13.pdf>
