1. INTRODUCTION
With the rapid increase in digitization over the past 20 years, the field of text mining has quickly moved to the forefront of data science as a method for better understanding patterns and trends in society. Text mining is often the first step in developing intricate methods for more advanced processes such as social network analysis or natural language processing. For models to learn, they must experience everything (or as much as possible) that is and is not what they are attempting to predict or mimic. Traditionally, this has meant incredibly large volumes of text have been required to feed these learning demands.
A challenge to this process is how human information tendencies are changing. While we continue to create more and more data, it is arriving in smaller and smaller chunks. Short-form text is quickly becoming the primary means by which people consume daily, practical, and time-sensitive information. From news, to social media, to research abstracts, texts of 200 words or fewer are gaining popularity for carrying the critical messages in our daily lives. As these shorter formats become increasingly important for analyzing and understanding social trends, data science tools must adapt as the information environment continuously evolves (Jipeng et al., 2019).
The context for this project is provided by data from the website Towards Data Science (TDS), which I’ve used often to augment my studies on learning analytics. TDS is a Medium web publication where authors share concepts, ideas, and coding tips through self-published articles. This data experiment will explore what short-form text modeling can reveal about the changes in research focus over the last few years, as charted through TDS articles published from 2018 to 2021.
This research project aims to answer the following questions:
Can short texts provide enough context to adequately inform topic modeling?
Can short text metadata be used to augment the accuracy of these topic models?
Can short texts provide enough fidelity to measure topic prevalence over time?
Data scientists, students, or any professionals researching this field could benefit by understanding how research/publication trends are linked to shifts in technology. Understanding these shifts could influence how quickly people adopt or advance new technologies in the data science field. Additionally, this information could be useful to understand technologies ripe for study in higher education leading to how students select focus areas, majors, or certificate programs.
2. WRANGLE
For this case study, the data will be parsed into tokens for initial exploration to identify potential trends and context over time. Three topic modeling techniques will be employed, each with their own specific data format requirements. To meet these, the data will be tidied and stemmed and the stop words will be removed to better focus on topic-specific terminology.
2.1 Import Data
The raw data was downloaded into a comma separated values (.csv) file that will be imported into RStudio:
library(tidyverse)  # provides read_csv(), the dplyr verbs, and ggplot2 used throughout
tds_raw <- read_csv("data/towards_data_science.csv")
With the data frame created, the data set will be further manipulated into a format fit for exploratory analysis. Wrangling will consist of the following steps:
Combine title and tagline columns to create a single ‘text’ variable.
Mutate the date column into separate ‘year’ and ‘month’ variables to enable time-based exploration.
Create a corpus through the tidytext package to enable additional text mining tools.
Stem the tidytext corpus and remove stop words.
Tokenize tidytext corpus into unigrams, bigrams, and trigrams to enable term frequency analysis.
Create a document term matrix (DTM) to explore the LDA modeling technique.
2.2 Create Single Text Variable
The current data frame contains the target text in two separate variables, title and tagline. As the pair of variables represent a single document (article in this case), combining the text into a single character variable will enable the models to focus on just one column, simplifying the work of the models.
tds_raw$text <- paste(tds_raw$title,tds_raw$tagline,sep=" ")
# Rename ID column
colnames(tds_raw)[1] <- "doc_id"
2.3 Create Year and Month Variables
To explore topic trending over time, the date variable is being separated into year and month variables.
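tds_dates <- tds_raw %>%
  mutate(date  = mdy(date),     # parse month-day-year strings (lubridate)
         year  = year(date),    # extract the year
         month = month(date))   # extract the month number (funs() is deprecated in dplyr)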
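As a minimal sketch of the date split, assuming the raw dates are strings in month-day-year order (which the use of mdy() implies), the toy data frame below shows the columns that result. The rows and the name toy_raw are invented for illustration and are not part of the TDS data.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for tds_raw; these rows are invented for illustration.
toy_raw <- tibble(
  doc_id = 1:3,
  date   = c("01/15/2018", "06/30/2020", "12/01/2021")
)

toy_dates <- toy_raw %>%
  mutate(
    date  = mdy(date),    # parse month-day-year strings into Date objects
    year  = year(date),   # calendar year
    month = month(date)   # month number, 1 to 12
  )

toy_dates
```

Keeping the parsed date alongside year and month lets later steps group by year while still plotting against the full date.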
2.4 Create tidytext Corpus
This process decomposes the long text string from the text variable into single terms, while maintaining their tie to the source document (doc_id) and its metadata (date, year, month).
tds_tidy <- tds_dates %>%
unnest_tokens(output = word, input = text) %>%
anti_join(stop_words, by = "word")
# Remove numbers
tds_tidy <- tds_tidy[!grepl("\\b\\d+\\b", tds_tidy$word), ]  # grepl() stays correct even if nothing matches
tidy_top_tokens <- tds_tidy %>%
count(word, sort = TRUE) %>%
top_n(10)
## Selecting by n
tidy_top_tokens
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 data 13884
## 2 learning 7084
## 3 python 6457
## 4 machine 4155
## 5 science 3930
## 6 model 2498
## 7 ai 2151
## 8 analysis 2096
## 9 guide 2096
## 10 deep 1986
The above code created a tidy version of the corpus at the single word (unigram) level while also removing stop words and numbers.
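The same three steps (tokenize, drop stop words, drop numbers) can be sketched on a two-document toy corpus. The titles below are invented, and the digits-only pattern is a simplification of the regular expression used above.

```r
library(dplyr)
library(tidytext)

# Invented example documents; doc_id plays the same role as in the real data.
toy <- tibble(
  doc_id = c(1, 2),
  text   = c("10 Python Guide for Data Analysis",
             "A Guide to Deep Learning in 2021")
)

toy_tidy <- toy %>%
  unnest_tokens(output = word, input = text) %>%  # one lowercase token per row
  anti_join(stop_words, by = "word") %>%          # drop common English stop words
  filter(!grepl("^\\d+$", word))                  # drop tokens made only of digits

toy_tidy
```

Each surviving row still carries its doc_id, which is what ties a token back to its source article.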
2.5 Stemming
Stemming reduces the feature size of a corpus by transforming terms to their base stem. Stemming reduces the chances of redundancy in terms and phrases as the various topic modeling techniques are explored.
tds_tidy <- tds_tidy %>%
mutate(word = wordStem(word))
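To make the effect of stemming concrete, the short sketch below stems a handful of terms that appear in this report's frequency tables. It assumes the SnowballC package with its default Porter stemmer, which is what wordStem() uses when no language is given.

```r
library(SnowballC)

# Terms taken from the frequency tables in this report.
terms <- c("machine", "learning", "science", "scientist", "series", "analysis")
stems <- wordStem(terms, language = "porter")

data.frame(term = terms, stem = stems)
```

The stems are not always dictionary words (for example "machin" or "seri"), which is why the bigram and trigram tables in this report show truncated terms.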
2.6 Cast a Document Term Matrix
The LDA model requires the text in the form of a document-term matrix (DTM), in which each row is a document, each column is a term, and each cell holds the number of times that term appears in that document. In this case, the title will act as the unique document identifier.
tidy_tds_DTM <- tds_tidy %>%
count(title, word) %>%
cast_dtm(title, word, n)
tidy_tds_DTM
## <<DocumentTermMatrix (documents: 30128, terms: 14604)>>
## Non-/sparse entries: 294029/439695283
## Sparsity : 100%
## Maximal term length: 34
## Weighting : term frequency (tf)
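The summary above can be checked with a little arithmetic. The sketch below uses only the figures printed in the DTM summary; the variable names are mine.

```r
# Figures copied from the DocumentTermMatrix summary above.
n_docs     <- 30128
n_terms    <- 14604
non_sparse <- 294029
sparse     <- 439695283

total_cells <- n_docs * n_terms      # every document-term combination
sparsity    <- sparse / total_cells  # share of cells that are zero

total_cells == non_sparse + sparse   # TRUE: the two counts cover the whole matrix
round(sparsity * 100, 2)             # 99.93, which the summary rounds to 100%

mean_terms_per_doc <- non_sparse / n_docs
round(mean_terms_per_doc, 1)         # about 9.8 distinct terms per document
```

Fewer than ten distinct terms per document is a compact way to see why short texts give a topic model so little to work with.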
2.7 Tokenization
Lastly, wrangling will end with the tokenization of the original data to enable further term frequency analysis at the bigram (word pair) and trigram (3-word set) levels. For these iterations, stop word removal and stemming have been incorporated:
tds_bigrams <- tds_dates %>%
unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)
tds_bigrams <- tds_bigrams %>%
separate(bigram, into = c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
mutate(word1 = wordStem(word1)) %>%
mutate(word2 = wordStem(word2)) %>%
unite(bigram, c(word1, word2), sep = " ")
bigram_top_tokens <- tds_bigrams %>%
count(bigram, sort = TRUE) %>%
top_n(10)
bigram_top_tokens
## # A tibble: 10 × 2
## bigram n
## <chr> <int>
## 1 machin learn 3905
## 2 data scienc 3727
## 3 data scientist 1654
## 4 deep learn 1419
## 5 neural network 1366
## 6 learn model 765
## 7 time seri 678
## 8 data analysi 583
## 9 covid 19 560
## 10 reinforc learn 458
tds_trigrams <- tds_dates %>%
unnest_tokens(output = trigram, input = text, token = "ngrams", n = 3)
tds_trigrams <- tds_trigrams %>%
separate(trigram, into = c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!word3 %in% stop_words$word) %>%
mutate(word1 = wordStem(word1)) %>%
mutate(word2 = wordStem(word2)) %>%
mutate(word3 = wordStem(word3)) %>%
unite(trigram, c(word1, word2, word3), sep = " ")
trigram_top_tokens <- tds_trigrams %>%
count(trigram, sort = TRUE) %>%
top_n(10)
trigram_top_tokens
## # A tibble: 10 × 2
## trigram n
## <chr> <int>
## 1 machin learn model 602
## 2 data scienc project 341
## 3 natur languag process 289
## 4 convolut neural network 233
## 5 exploratori data analysi 230
## 6 machin learn algorithm 201
## 7 deep learn model 138
## 8 time seri forecast 134
## 9 machin learn project 133
## 10 learn data scienc 132
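A one-title sketch shows what the n-gram pipeline keeps and drops. The title is invented; the steps mirror the bigram code above, with across() used as a shorthand for the two stemming calls.

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(SnowballC)

# Invented title used only for illustration.
toy <- tibble(doc_id = 1, text = "Machine learning models for time series")

toy_bigrams <- toy %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,      # drop the pair if either word
         !word2 %in% stop_words$word) %>%  # is a stop word
  mutate(across(c(word1, word2), wordStem)) %>%
  unite(bigram, word1, word2, sep = " ")

toy_bigrams$bigram
```

Because stop words are filtered after the pairs are formed, "models for" and "for time" are removed outright; the pipeline never creates a pair that bridges across the dropped stop word.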
3. EXPLORATORY ANALYSIS
3.1 Published Article Counts
tds_dates %>%
ggplot(aes(x = date, color = factor(month))) +
geom_bar(show.legend = FALSE) +
labs(x = "Date",
y = "Article Counts",
title = "Towards Data Science Articles",
subtitle = "Published from 2018 - 2021")

This visual depicts how the roughly 30k articles are spread over the four years. Because the bars count articles per publication date, the chart shows that TDS had a great year in 2020, with daily counts topping 70 articles in the mid-year period. Any chance that was due to scientists being cooped up during the heart of the COVID-19 pandemic? Writing is a great way to pass the time! The next couple of sections will explore word frequencies to attempt to identify patterns within the most used words or word combinations.
3.2 Word Counts by Year
tds_tidy %>%
group_by(year) %>%
count(word, sort = TRUE) %>%
top_n(10) %>%
ungroup %>%
mutate(word = reorder_within(word, n, year)) %>%
ggplot(aes(x = word, y = n, fill = word)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ year, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(expand = c(0,0)) +
labs(y = "Count",
x = "Unique words",
title = "Most frequent words found in TDS article titles & taglines",
subtitle = "Stop words removed from the list")

When diagramming the top ten unigrams by year, many terms are repeated. Since a data science blogging site is the focus for this project, it’s no surprise that words such as data, science, machine, learning, etc. would appear at the top. The chart below indicates how this changed when the graph was expanded to include the top 20 terms:
tds_tidy %>%
group_by(year) %>%
count(word, sort = TRUE) %>%
top_n(20) %>%
ungroup %>%
mutate(word = reorder_within(word, n, year)) %>%
ggplot(aes(x = word, y = n, fill = word)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ year, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(expand = c(0,0)) +
labs(y = "Count",
x = "Unique words",
title = "Most frequent words found in TDS article titles & taglines",
subtitle = "Stop words removed from the list")

This second attempt does begin to reveal some unique terms by year. Another way to achieve this is by adding words common to all years to the list of stop words. In the end, I decided not to focus too much energy on individual terms as in this community, many of the key topics are described by multiple terms, such as ‘machine learning,’ as opposed to treating those words as separate and distinct entities. To that end, I repeated the above visuals, but for multi-word groupings.
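The stop-word alternative mentioned above was not pursued in this project, but a rough sketch of it might look like the following. The toy tokens and all object names are invented, and in practice one might restrict the check to the top-N words per year instead of every word.

```r
library(dplyr)

# Invented tokens standing in for tds_tidy (one row per token occurrence).
toy_tidy <- tibble(
  year = c(2018, 2018, 2018, 2019, 2019, 2019),
  word = c("data", "forest", "data", "data", "detect", "learn")
)

n_years <- n_distinct(toy_tidy$year)

# Words that occur in every year carry no year-specific signal.
common_words <- toy_tidy %>%
  distinct(year, word) %>%
  count(word, name = "years_present") %>%
  filter(years_present == n_years) %>%
  select(word)

# Treat them as extra stop words before counting per year.
toy_filtered <- toy_tidy %>%
  anti_join(common_words, by = "word")

toy_filtered
```

Here only "data" appears in both years, so it is removed and the year-specific terms remain.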
3.3 Bigram Counts
tds_bigrams %>%
group_by(year) %>%
count(bigram, sort = TRUE) %>%
top_n(20) %>%
ungroup %>%
mutate(bigram = reorder_within(bigram, n, year)) %>%
ggplot(aes(x = bigram, y = n, fill = bigram)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ year, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(expand = c(0,0)) +
labs(y = "Count",
x = "Unique Bigrams",
title = "Most frequent bigrams found in article titles & taglines",
subtitle = "Stop words removed from the list")

Expanding to 2-word phrases reveals unique topics in each year. Some interesting examples are random forest (2018), object detect (2019), covid 19 (2020), and python code (2021).
3.4 Trigram Counts
tds_trigrams %>%
group_by(year) %>%
count(trigram, sort = TRUE) %>%
top_n(10) %>%
ungroup %>%
mutate(trigram = reorder_within(trigram, n, year)) %>%
ggplot(aes(x = trigram, y = n, fill = trigram)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ year, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(expand = c(0,0)) +
labs(y = "Count",
x = "Unique Trigrams",
title = "Most frequent trigrams found in article titles & taglines",
subtitle = "Stop words removed from the list")

Trigrams repeat quite a bit in the top ten, though you start seeing more complete ideas emerge about specific activities. Examples include data science job and time seri data.
3.5 Exploratory Analysis Insights
Word frequency analysis reveals that unique words and phrases do emerge from year to year. This implies that research topics are changing throughout the corpus of TDS articles between 2018 and 2021. Some other interesting features of this data set include:
single words do little to distinguish between topics and/or years
trigram phrases become repetitive beyond the top ten, so bigram phrases look to be the optimal focus for short text topic modeling
4. MODEL
To address the potential for topic generation in short-form text, I will compare various qualitative model results to determine the differences, if any, in how they identify unique topics using shorter data observations. This analysis will examine three models:
Latent Dirichlet Allocation (LDA). LDA works under the premise that every document contains a mixture of topics, and every topic is composed of a mixture of words.
Structural Topic Model (STM). STM employs document metadata to improve the assignment of words to topics in a corpus, and it can also be used to examine relationships between those covariates and topic prevalence.
Biterm Topic Model (BTM). BTM was designed specifically for a short-format corpus and identifies topics by explicitly modeling word-word co-occurrences in a specified window of text.
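The LDA premise in the list above (documents as mixtures of topics, topics as mixtures of words) can be sketched numerically in base R. All probabilities below are made up; beta and theta follow the usual LDA naming for the topic-word and document-topic distributions.

```r
# Made-up probabilities for 2 topics, 3 terms, and 2 documents.
beta <- rbind(                      # per-topic word distributions (rows sum to 1)
  topic1 = c(python = 0.7, model = 0.2, image = 0.1),
  topic2 = c(python = 0.1, model = 0.3, image = 0.6)
)
theta <- rbind(                     # per-document topic mixtures (rows sum to 1)
  doc1 = c(topic1 = 0.9, topic2 = 0.1),
  doc2 = c(topic1 = 0.5, topic2 = 0.5)
)

# Expected word distribution of each document: a topic-weighted blend of beta.
doc_word_probs <- theta %*% beta
doc_word_probs
```

Fitting LDA runs this logic in reverse: it observes only the word counts and estimates the beta and theta that best explain them.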
4.1 Determining K
Each model depends on a reasonable approximation of K, the number of topics to be identified. If K is too small, the corpus is divided into a few very generic topics. If K is too large, the collection is broken into so many topics that they either overlap or are barely decipherable. Before the models can be applied, a common K value should be determined so the results can be compared consistently. A challenge to this process is that each model has its own method for determining K, based on how the algorithm treats the text, so the factors behind an optimal K are unique to each type of topic model. As LDA and STM were designed for larger bodies of text, it will be interesting to see how these models treat the short-form data in this experiment. If a consistent K value cannot be determined explicitly, the alternative will be a trial-and-error approach to find a value that can be applied to all three models. As our main goal is to compare results between them, this alternative will meet the intent of the project.
FindTopicsNumber() Function in the LDA Model
For the LDA model, four metrics were extracted, then plotted to visualize the maximum or minimum K value of each metric:
k_metrics <- FindTopicsNumber(
tidy_tds_DTM,
topics = seq(5, 50, by = 5),
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 77),
mc.cores = NA,
return_models = FALSE,
verbose = FALSE,
libpath = NULL
)
FindTopicsNumber_plot(k_metrics)

The key is to identify the visible bend, or inflection point, on each of the lines. This is where they transition from quickly increasing/decreasing to a flatter trajectory. In each of the lines, that inflection point is between 20 and 30 topics. The one metric with a single, easily recognizable optimum is CaoJuan2009, which is a metric to be minimized and points to 20 topics.
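Reading the plot depends on knowing which metrics should be low and which should be high: in ldatuning, CaoJuan2009 and Arun2010 are minimized, while Griffiths2004 and Deveaud2014 are maximized. The sketch below applies that rule to a small table shaped like the output of FindTopicsNumber(). The metric values are invented purely to show the mechanics and are not results from this study.

```r
# Invented metric values, shaped like the data frame FindTopicsNumber() returns
# (a `topics` column plus one column per metric).
k_toy <- data.frame(
  topics        = c(10, 20, 30),
  Griffiths2004 = c(-950, -900, -905),
  CaoJuan2009   = c(0.20, 0.12, 0.15),
  Arun2010      = c(400, 310, 300),
  Deveaud2014   = c(1.1, 1.4, 1.3)
)

minimize <- c("CaoJuan2009", "Arun2010")       # lower is better
maximize <- c("Griffiths2004", "Deveaud2014")  # higher is better

best_k <- c(
  sapply(k_toy[minimize], function(m) k_toy$topics[which.min(m)]),
  sapply(k_toy[maximize], function(m) k_toy$topics[which.max(m)])
)
best_k
```

When the metrics disagree, as they often do, the range they jointly suggest is a more defensible choice than any single number.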
searchK() Function in STM
The stm package has a useful function called searchK, which takes a range of candidate values for K and outputs multiple goodness-of-fit measures. It relies on the docs, vocab, and meta objects that are built with textProcessor in Section 4.3:
findingk <- searchK(docs,
vocab,
K = c(5:30),
data = meta,
verbose=FALSE)
plot(findingk)

Based on the feedback from the LDA model, the range was tested between 5 and 30 topics. Similar to the LDA results, the curves hit their inflection point between 25 and 30 topics. (Note: this trial required ~8 hours to run on a fairly high-end MacBook Pro. Your mileage may vary.)
Determining K for BTM
Unlike the previous two models, there is no function built into the BTM package in R for estimating K. As this is a comparative study, using similar K values for each model should provide consistent data for this purpose. Therefore, the BTM model was executed assuming an optimum K value between 20 and 30 topics. The models will use these values while also adding a lower value of 10 to verify what happens if a K value is selected that is too low.
4.2 Latent Dirichlet Allocation (LDA) Model
LDA is a mathematical method for estimating the mixture of words that is associated with each topic, while also determining the mixture of topics that describes each document (Silge & Robinson, 2017).
tds_lda <- LDA(tidy_tds_DTM,
k = 20,
control = list(seed = 588))
terms(tds_lda, 5)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7
## [1,] "learn" "python" "data" "learn" "data" "data" "learn"
## [2,] "model" "function" "scienc" "data" "scienc" "python" "data"
## [3,] "data" "step" "sql" "tip" "python" "build" "featur"
## [4,] "understand" "learn" "top" "python" "learn" "neural" "python"
## [5,] "ai" "analysi" "databas" "model" "step" "network" "machin"
## Topic 8 Topic 9 Topic 10 Topic 11 Topic 12 Topic 13 Topic 14
## [1,] "data" "learn" "model" "code" "data" "learn" "model"
## [2,] "python" "deep" "data" "python" "scientist" "machin" "learn"
## [3,] "scienc" "data" "network" "perform" "learn" "model" "machin"
## [4,] "model" "intellig" "detect" "run" "it’" "ai" "data"
## [5,] "build" "artifici" "python" "chang" "machin" "data" "ai"
## Topic 15 Topic 16 Topic 17 Topic 18 Topic 19 Topic 20
## [1,] "introduct" "data" "data" "panda" "data" "build"
## [2,] "learn" "scienc" "learn" "guid" "scienc" "time"
## [3,] "machin" "time" "model" "simpl" "project" "imag"
## [4,] "data" "build" "network" "python" "creat" "analysi"
## [5,] "model" "seri" "machin" "data" "start" "detect"
For K = 20, the topics look relatively similar across the board, though there are a couple that stand out as unique. In this format, however, recognizing those topics is not easy. The faceted plot below provides a much more informative visual:
# tidy() turns the fitted model into one row per topic-term pair with its beta value
top_terms_lda <- tidy(tds_lda, matrix = "beta") %>%
group_by(topic) %>%
slice_max(beta, n = 5, with_ties = FALSE) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms_lda %>%
mutate(term = reorder_within(term, beta, topic)) %>%
group_by(topic, term) %>%
arrange(desc(beta)) %>%
ungroup() %>%
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 5 terms in each LDA topic",
x = expression(beta), y = NULL) +
facet_wrap(~ topic, ncol = 4, scales = "free")

I’ve highlighted three topics (11, 18, and 20) whose top five terms were outside the range of the majority of the other topics.
Topic 11 references running or changing Python code.
Topic 18 mentions a specific library in Python, called Pandas.
Topic 20 speaks to building models for image analysis/detection.
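One rough way to put a number on how much two topics resemble each other is the Jaccard overlap of their top terms. This is an illustrative add-on, not a metric used in the study; the term lists are copied from the terms(tds_lda, 5) output above.

```r
# Top five terms copied from the terms(tds_lda, 5) output above.
top_terms <- list(
  topic1  = c("learn", "model", "data", "understand", "ai"),
  topic13 = c("learn", "machin", "model", "ai", "data"),
  topic18 = c("panda", "guid", "simpl", "python", "data")
)

# Jaccard similarity: shared terms divided by all distinct terms in the pair.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

sim_1_13 <- jaccard(top_terms$topic1, top_terms$topic13)  # 4 shared of 6 distinct
sim_1_18 <- jaccard(top_terms$topic1, top_terms$topic18)  # 1 shared of 9 distinct

round(c(topic1_vs_13 = sim_1_13, topic1_vs_18 = sim_1_18), 2)
```

Topics 1 and 13 share two thirds of their combined top terms, while Topic 18 shares only "data" with Topic 1, which matches the visual impression that Topic 18 stands apart.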
While three topics presented themselves as unique, that is a relatively small number within 20 topics. To explore how the results change based on the chosen number of topics, I ran the same model using K=10 and K=30.
For K=10, I had difficulty distinguishing between any of the topics. The results were very generic, with repeated terms of data, machin, learn, model, and scienc, which are what one would expect to see on a website that publishes data science articles.

Using K=30 yields results similar to the image from K=20. While topic 3 is unique compared to the previous run, topics 18 and 20 look almost identical to their counterparts from the earlier trial.

4.3 Structural Topic Model (STM)
The stm package in R requires the documents, metadata, and “vocab” (the total list of words found in the documents) to be stored in separate objects (see code below). The textProcessor call lowercases and stems the text and strips stop words, numbers, punctuation, and HTML. Removing such uninformative terms is common practice in topic modeling, since they make word-topic assignment much more difficult (Bail, 2019).
temp <- textProcessor(tds_raw$text,
metadata = tds_raw,
lowercase=TRUE,
removestopwords=TRUE,
removenumbers=TRUE,
removepunctuation=TRUE,
wordLengths=c(3,Inf),
stem=TRUE,
onlycharacter= FALSE,
striphtml=TRUE,
customstopwords=NULL)
docs <- temp$documents
meta <- temp$meta
vocab <- temp$vocab
tds_stm <- stm(documents=docs,
data=meta,
vocab=vocab,
prevalence =~ date,
K=20,
max.em.its=25,
verbose = FALSE,
gamma.prior='L1')

This first iteration of the STM model (K=20) shows a much more diverse range of terms than LDA. For this data set, the only metadata used to help distinguish topics is the publication date of each article. In this linear format, however, it is still difficult to see how each topic differs from its neighbors. A more visual and useful method is enabled by the toLDAvis function.
toLDAvis(mod = tds_stm, docs = docs)

This tool explores how each topic relates to the others spatially. In a perfect environment, the model would have identified 20 topics separate and distinct from each other. The diagram would’ve shown 20 circles with no overlap. While it’s doubtful a model could achieve that level of precision, this diagram shows that with K=20, there is relatively good topic separation. There are just four areas where topics overlap, meaning they share a larger proportion of similar terms.
Topic 2 is highlighted as it is large and maintains good spacing from its nearest neighbors. The size of a bubble indicates the topic's overall prevalence in the corpus. In this case, the key term is ‘data,’ which is not surprising, but its companion terms are what separate it from its neighbors. The lone bubble indicates that this topic is distinct in its term composition. Combining the top 10 terms reveals the topic describes articles written about the skills required to become a data scientist. For comparison across topic values, this model was repeated for K=10 and K=30:

These views add to the inference that K=20 is pretty close to optimum. For K=10, eight of the topics overlap with a neighbor. Likewise for K=30, there are several groupings of overlapping topics.
4.4 Biterm Topic Model (BTM)
The Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrence patterns (i.e., biterms) (Wijffels, 2021).
A biterm consists of two words co-occurring in the same context, for example, in the same short text window. In this workflow that window is described by two parameters: skipgram, which is passed to cooccurrence() when the biterms are built, and window, which is an argument of BTM().
Skipgram defines how far into a word's neighbourhood the biterm search looks for a partner word, while window sets the size of the text window used for biterm extraction. For this project, a skipgram value of 5 was used and the window was kept at the default value of 15.
BTM models the biterm occurrences across a complete corpus (unlike LDA models which model the word occurrences in a single document).
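To make the idea of a biterm concrete, the base R sketch below enumerates the unordered word pairs that fall inside a sliding window. The function make_biterms() is hypothetical and written only for illustration; how the BTM and udpipe packages count window positions internally may differ.

```r
# Enumerate unordered word pairs (biterms) whose positions are at most
# `window - 1` apart. Illustration only; not the BTM package's internal code.
# Assumes window >= 1.
make_biterms <- function(tokens, window = 15) {
  n <- length(tokens)
  pairs <- list()
  for (i in seq_len(max(n - 1, 0))) {
    last <- min(n, i + window - 1)       # last position still inside the window
    for (j in seq_len(last - i) + i) {   # positions i + 1 .. last
      pairs[[length(pairs) + 1]] <- sort(c(tokens[i], tokens[j]))
    }
  }
  if (length(pairs) == 0) {
    return(data.frame(term1 = character(0), term2 = character(0)))
  }
  out <- do.call(rbind, pairs)
  data.frame(term1 = out[, 1], term2 = out[, 2])
}

tokens <- c("time", "series", "forecast", "python")
make_biterms(tokens, window = 2)  # adjacent pairs only: 3 biterms
make_biterms(tokens, window = 4)  # every pair in the title: 6 biterms
```

With titles and taglines this short, a window of 15 will often span the whole document, so nearly every pair of retained words becomes a biterm.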
Based on the first two iterations, the BTM trial focused on training a model to identify 20 topics:
# Tag parts of speech
anno <- udpipe(tds_raw, "english", trace = 10)
biterms <- as.data.table(anno)
biterms <- biterms[, cooccurrence(x = lemma,
relevant = upos %in% c("NOUN",
"ADJ",
"PROPN"),
skipgram = 5),
by = list(doc_id)]
# Build BTM
set.seed(588)
traindata <- subset(anno, upos %in% c("NOUN", "ADJ", "PROPN"))
traindata <- traindata[, c("doc_id", "lemma")]
model <- BTM(traindata, k = 20,
beta = 0.01,
iter = 500,
biterms = biterms,
trace = 100)
# Plot Model Results (do not run when knitting)
#library(ggraph)
#plot(model,
# top_n = 10,
# title = "BTM model",
# subtitle = "K = 20, 500 Training Iterations",
# labels = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
# "10", "11", "12", "13", "14", "15", "16", "17",
# "18", "19"))

5. COMMUNICATE
5.1 Topics Discoverability in Short-form Text
LDA. The LDA model did little to distinguish between topics beyond a generic summary of what one could expect to read in any article on data science. These results qualitatively confirm the inability of the model to consistently recognize unique topics in short-form text. Treating an article’s title and tagline as a complete “document” gives the model too little data to inform latent topic discovery.
STM. The STM model distinguished between topics much more effectively than the LDA model. These results reinforce the finding that the optimum number of topics is around 20, and suggest that adding metadata increases the likelihood of distinguishing between unique topics.
BTM.
| Topic | Top ten terms (highest ranked first) |
|---|---|
| Topic 0 | language, text, model, NLP, Natural, word, processing, python, image, classification |
| Topic 1 | part, game, python, Carlo, Monte, Reinforcement, simulation, Markov, player, League |
| Topic 2 | Analysis, data, sentiment, python, customer, analysis, Twitter, recommendation, review, topic |
| Topic 3 | detection, image, object, model, vision, python, recognition, deep, face, computer |
| Topic 4 | data, ai, article, science, machine, week, ML, knowledge, paper, research |
| Topic 5 | python, Pandas, data, function, Sql, use, database, power, sql, file |
| Topic 6 | data, python, guide, step, Analysis, tutorial, Pandas, beginner, part, introduction |
| Topic 7 | data, model, time, machine, analysis, dataset, real, python, series, simple |
| Topic 8 | time, Series, python, Analysis, part, series, model, data, price, Forecasting |
| Topic 9 | python, code, Jupyter, line, notebook, data, Julia, programming, new, best |
| Topic 10 | feature, data, model, selection, machine, python, reduction, value, different, missing |
| Topic 11 | data, science, scientist, Science, project, interview, machine, Scientists, skill, job |
| Topic 12 | Neural, Network, Networks, deep, network, image, neural, Convolutional, PyTorch, part |
| Topic 13 | python, data, visualization, chart, plot, map, Plotly, create, Matplotlib, using |
| Topic 14 | Regression, model, python, Linear, machine, decision, regression, algorithm, random, Gradient |
| Topic 15 | data, Covid, Analysis, case, price, python, analysis, using, science, use |
| Topic 16 | data, Google, python, AWS, cloud, machine, model, web, step, Azure |
| Topic 17 | machine, learning, deep, learn, model, end, Reinforcement, part, models, Classification |
| Topic 18 | data, python, machine, science, testing, probability, common, learning, concept, introduction |
| Topic 19 | ai, data, Artificial, Intelligence, machine, be, science, intelligence, human, learning |
This list provides the top ten terms in each of the 20 topics identified by the BTM model. These topic phrases were, by far, the most distinctive of the three models. In addition, the BTM plot showed the strength of the ties between terms, indicating which relationships were most influential in each topic. From a qualitative perspective, BTM discovered the most distinct latent topics of the three models and looks to be a strong choice for short texts.
5.2 Topic Proportion Over Time
While there are ways to visualize LDA topics over time, in this corpus that model did little to distinguish between relevant topics. Therefore, it was eliminated from topic proportion comparison over time. The BTM package in R is not designed to compare topics by metadata as those covariates are removed in building the final model. While I believe it’s possible to do, there needs to be a way to re-join the final BTM model data with the metadata. That left STM as the only model that was built to understand the metadata’s impact on the topic results. The STM exercise used date as a covariate leading to the plot below for each topic and how its prevalence changed over four years.

Some examples with large shifts over time include:
Deep Learning (1st topic, 2nd row) - significant decrease in popularity
How-to Guides on Statistics (4th topic, 2nd row) - decreased as articles on visualizations and coding became more popular than those on basic stats
Data Science Project Research (4th topic, 1st row) - gained in popularity, as did the topics on Python (5th topic in both 1st and 2nd rows)
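This section notes that BTM output would need to be re-joined with the metadata before its topics could be tracked over time. The sketch below shows one way that join could work in base R, assuming a document-topic score matrix with doc_id row names can be obtained from the fitted model (the BTM package documents a predict() method for this purpose, but that step was not run in this project). The scores, metadata, and object names are all invented.

```r
# Hypothetical document-topic scores: one row per document, one column per
# topic, with doc_id values as row names. All numbers are invented.
scores <- rbind(
  "1" = c(0.8, 0.2),
  "2" = c(0.3, 0.7),
  "3" = c(0.1, 0.9),
  "4" = c(0.2, 0.8)
)
colnames(scores) <- c("topic0", "topic1")

# Toy metadata keyed by the same doc_id.
meta_toy <- data.frame(doc_id = c("1", "2", "3", "4"),
                       year   = c(2018, 2018, 2019, 2019))

# Label each document with its highest-scoring topic, then re-attach the metadata.
doc_topics <- data.frame(
  doc_id = rownames(scores),
  topic  = colnames(scores)[max.col(scores, ties.method = "first")]
)
joined <- merge(doc_topics, meta_toy, by = "doc_id")

# Share of each topic within each year (rows sum to 1).
topic_by_year <- prop.table(table(joined$year, joined$topic), margin = 1)
topic_by_year
```

Assigning each document to a single dominant topic is a simplification; averaging the full score rows by year would preserve more of the mixture.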
5.3 Summary Findings
This project was designed to answer three questions about using topic modeling to identify latent themes in a body of short-form text. Specifically:
- Can short texts provide enough context to adequately inform topic modeling?
Both the LDA and STM trials struggled to differentiate between unique topics. Even with metadata added, the STM model improved little over its LDA counterpart. The BTM model, however, discovered numerous unique topics missed by the other two models. The results showed both separate topics and the relationships between the terms within those topics. Therefore, BTM is a sound choice to inform short text topic modeling.
- Can short text metadata be used to augment the accuracy of these topic models?
This is still somewhat unknown. The STM model used metadata but showed little improvement over LDA, which only used document-term frequencies. As the metadata is stripped from the BTM model, it remains to be seen if those two elements can be re-joined for comparison.
- Can short texts provide enough fidelity to measure topic prevalence over time?
The STM model data was plotted over time and showed how topic prevalence changed. So while it is possible to do this with short-form text data, this experiment was unable to do this with the BTM model.
This project was able to demonstrate qualitatively that short texts can be used to discover latent topics. This could save enormous amounts of time and processing resources as shown by how quickly most of the models ran against a corpus of over 30K articles.
5.4 Limitations
Qualitative vs Quantitative. This study compared qualitative metrics on differences between terms, topics, and proportions over time. Quantitative metrics were not explored as it would have significantly increased the scope of the project. Metrics such as model efficiency, precision, and topic coherence are measures that could provide additional comparative insights for these models.
K determination for BTM. While mathematical methods exist to determine K values for the BTM model, they were too advanced for this project and are not built into the R package in a way that makes them simple to apply. As a result, the LDA and STM K values were compared to derive the optimum K.
Additional short text models. There are other models (such as stLDA-C) to derive latent topics from short texts, but they don’t have specific packages that work in the R environment.
Metadata. The only additional variable captured to distinguish between articles was publication date. Other variables available are author, length of article, number of comments, and number of likes. These variables could be used in future experiments to add more nuanced context to each article.
