1. INTRODUCTION
I’ve often used the website Towards Data Science to augment my studies in learning analytics, and I’m curious what text mining can reveal about changes in research focus over the last four years. My initial research questions are:
- Can article metadata be used to identify trends in data science research topics?
- Does research topic frequency/popularity change over time or remain consistent year to year?
- Can data science research trends identify shifts or advances in technology?
Data scientists, students, and other professionals researching this field could benefit from understanding how research and publication trends are linked to shifts in technology. Understanding these shifts could influence how quickly people adopt or advance new technologies in data science. This information could also help identify technologies ripe for study in higher education, which in turn could inform how students select focus areas, majors, or minors.
2. WRANGLE
For this case study, I plan to tokenize the text variables for analysis at the n-gram level. Context is important because the dataset consists only of article titles and taglines. I will need to experiment with how stop word removal and stemming affect the trends. As I’m not interested in sentiment, I don’t plan on employing dictionary or subject-specific lexicons for analysis.
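To illustrate what tokenizing at different n-gram levels means, here is a minimal sketch on a made-up title. The `toy` data frame and its contents are assumptions for illustration only, not rows from the dataset.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row data frame standing in for the real titles
toy <- tibble(text = "Machine learning with Python")

# Unigrams: one row per word, lower-cased by default
toy_unigrams <- toy %>%
  unnest_tokens(output = word, input = text)

# Bigrams: one row per overlapping pair of adjacent words
toy_bigrams <- toy %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)

toy_unigrams$word
toy_bigrams$bigram
```

The bigram version keeps "machine learning" together as a single token, which is the kind of context a unigram split loses.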
Import Article Titles and Taglines
My raw data was initially developed by Johannes Hötter earlier this year and posted to Kaggle.
# Requires tidyverse, lubridate, tidytext, tm, and SnowballC to be loaded
tds_raw <- read_csv("data/towards_data_science.csv")
The dataset contains the titles and taglines for over 30,000 articles. My plan is to use only that information to experiment with topic modeling as opposed to the traditional method of using the complete article text. Now that I’ve successfully created a data frame, I can continue to manipulate the data into a format fit for exploratory analysis. Wrangling will consist of the following:
- Combine title and tagline columns to create a single ‘text’ variable.
- Mutate the date column into separate ‘year’ and ‘month’ variables to enable time-based exploration.
- Create corpora through both the tidytext and tm packages to enable additional text mining tools.
- Stem the tm corpus to compare sparsity levels between stemmed and unstemmed text.
- Tokenize the tidytext corpus into unigrams, bigrams, and trigrams to enable term frequency analysis.
- Create document term matrices (DTMs) from each corpus to compare various topic modeling techniques.
Create Single Text Variable
tds_raw$text <- paste(tds_raw$title, tds_raw$tagline, sep = " ")
Create Year and Month Variables
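tds_dates <- tds_raw %>%
  # Parse the date string, then derive year and month columns
  # (replaces the deprecated mutate_at()/funs() idiom)
  mutate(date = mdy(date),
         year = year(date),
         month = month(date))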
Create tidytext Corpus
tds_tidy <- tds_dates %>%
unnest_tokens(output = word, input = text) %>%
anti_join(stop_words, by = "word")
# Remove numbers
tds_tidy <- tds_tidy %>%
  filter(!grepl("\\b\\d+\\b", word))  # safe even when no rows match
tidy_top_tokens <- tds_tidy %>%
count(word, sort = TRUE) %>%
top_n(10)
## Selecting by n
tidy_top_tokens
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 data 13884
## 2 learning 7084
## 3 python 6457
## 4 machine 4155
## 5 science 3930
## 6 model 2498
## 7 ai 2151
## 8 analysis 2096
## 9 guide 2096
## 10 deep 1986
The above code created a tidy version of the corpus at the single word (unigram) level while also removing stop words and numbers.
Create tm Corpus
tds_corpus <- Corpus(VectorSource(tds_raw$text))
# Remove punctuation and numbers
tds_corpus <- tm_map(tds_corpus, content_transformer(removePunctuation))
tds_corpus <- tm_map(tds_corpus, content_transformer(removeNumbers))
# Transform corpus to all lower case
tds_corpus <- tm_map(tds_corpus, content_transformer(tolower))
Stemming
Stemming reduces the feature size of a corpus by reducing terms to their base stem. I expect stemming to limit redundancy in terms and phrases as I attempt various topic modeling techniques.
# Stem tm corpus
tds_corpus <- tm_map(tds_corpus, content_transformer(stemDocument), language = "english")
# Stem tidytext corpus
tds_tidy <- tds_tidy %>%
mutate(word = wordStem(word))
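As a small illustration of how stemming collapses related word forms, the sketch below runs `wordStem()` on a few hand-picked words. The word list is an assumption chosen for illustration and is not drawn from the dataset.

```r
library(SnowballC)

# Hypothetical word forms; wordStem() defaults to the Porter stemmer
forms <- c("scientist", "scientists", "learning", "learns")
stems <- wordStem(forms)

# Pair each original form with its stem
data.frame(form = forms, stem = stems)

# Number of distinct terms before and after stemming
length(unique(forms))
length(unique(stems))
```

Four distinct forms reduce to two stems here, which is the kind of feature reduction I expect in the full corpus.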
Cast a Document Term Matrix
tds_DTM <- DocumentTermMatrix(tds_corpus, control = list(wordLengths = c(2, Inf)))
tds_DTM
## <<DocumentTermMatrix (documents: 30665, terms: 28771)>>
## Non-/sparse entries: 486830/881775885
## Sparsity : 100%
## Maximal term length: 45
## Weighting : term frequency (tf)
tidy_tds_DTM <- tds_tidy %>%
count(title, word) %>%
cast_dtm(title, word, n)
tidy_tds_DTM
## <<DocumentTermMatrix (documents: 30128, terms: 14604)>>
## Non-/sparse entries: 294029/439695283
## Sparsity : 100%
## Maximal term length: 34
## Weighting : term frequency (tf)
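Both summaries report 100% sparsity, which is a rounded figure. As a quick check using only the counts printed above for the tm DTM, sparsity can be computed as the share of zero cells in the matrix. The variable names below are my own.

```r
# Counts copied from the tm DTM summary printed above
non_sparse <- 486830
sparse     <- 881775885

# Total cells should equal documents x terms
total_cells <- 30665 * 28771
non_sparse + sparse == total_cells

# Share of zero cells, as a percentage; the printed summary shows this
# rounded to a whole percent
sparsity <- sparse / total_cells
round(100 * sparsity, 2)
```

So the matrix is not literally empty: a little over 0.05% of the cells hold a count, which is typical for short documents such as titles and taglines.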
With the two basic DTMs built, I’d also like to compare their results with those from a less sparse DTM to see whether it produces a different outcome:
sparse_tds_DTM <- removeSparseTerms(tds_DTM, 0.99)
sparse_tds_DTM
## <<DocumentTermMatrix (documents: 30665, terms: 248)>>
## Non-/sparse entries: 289174/7315746
## Sparsity : 96%
## Maximal term length: 13
## Weighting : term frequency (tf)
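This created a DTM with far fewer terms (248 instead of 28,771). A threshold of 0.99 drops any term that is absent from 99% or more of the documents, so only terms appearing in more than about 1% of titles and taglines remain. This should produce a much different selection of topics once I get to the modeling phase.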
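To make the threshold concrete, here is a minimal sketch on a four-document toy corpus. The documents are made up, and the rule in the comment reflects my understanding of how `removeSparseTerms()` behaves: a term is kept only if it appears in more than `(1 - sparse)` of the documents.

```r
library(tm)

# Hypothetical four-document corpus standing in for the real titles
toy_docs <- c("data science", "data model", "data python", "python guide")
toy_dtm  <- DocumentTermMatrix(VCorpus(VectorSource(toy_docs)))

# With sparse = 0.5, a term is kept only if it appears in more than
# 4 * (1 - 0.5) = 2 documents
toy_reduced <- removeSparseTerms(toy_dtm, 0.5)

Terms(toy_dtm)      # all five terms
Terms(toy_reduced)  # only "data", which appears in three documents
```

Note that "python" appears in exactly two documents and is still dropped, because the comparison is strict.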
Tokenization
Finally, I’ll complete the tokenization of the original data to enable further term frequency analysis of bigrams and trigrams. For these iterations, I’ve incorporated stop word removal and stemming:
tds_bigrams <- tds_dates %>%
unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)
tds_bigrams <- tds_bigrams %>%
separate(bigram, into = c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
mutate(word1 = wordStem(word1)) %>%
mutate(word2 = wordStem(word2)) %>%
unite(bigram, c(word1, word2), sep = " ")
bigram_top_tokens <- tds_bigrams %>%
count(bigram, sort = TRUE) %>%
top_n(10)
bigram_top_tokens
## # A tibble: 10 × 2
## bigram n
## <chr> <int>
## 1 machin learn 3905
## 2 data scienc 3727
## 3 data scientist 1654
## 4 deep learn 1419
## 5 neural network 1366
## 6 learn model 765
## 7 time seri 678
## 8 data analysi 583
## 9 covid 19 560
## 10 reinforc learn 458
tds_trigrams <- tds_dates %>%
unnest_tokens(output = trigram, input = text, token = "ngrams", n = 3)
tds_trigrams <- tds_trigrams %>%
separate(trigram, into = c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!word3 %in% stop_words$word) %>%
mutate(word1 = wordStem(word1)) %>%
mutate(word2 = wordStem(word2)) %>%
mutate(word3 = wordStem(word3)) %>%
unite(trigram, c(word1, word2, word3), sep = " ")
trigram_top_tokens <- tds_trigrams %>%
count(trigram, sort = TRUE) %>%
top_n(10)
trigram_top_tokens
## # A tibble: 10 × 2
## trigram n
## <chr> <int>
## 1 machin learn model 602
## 2 data scienc project 341
## 3 natur languag process 289
## 4 convolut neural network 233
## 5 exploratori data analysi 230
## 6 machin learn algorithm 201
## 7 deep learn model 138
## 8 time seri forecast 134
## 9 machin learn project 133
## 10 learn data scienc 132
3. Exploratory Analysis
Published Article Counts
tds_dates %>%
  ggplot(aes(x = date, color = factor(month))) +
  geom_bar() +
  # Axis labels match the mapping: dates on x, counts on y
  labs(x = "Date",
       y = "Article Counts",
       title = "Towards Data Science Articles",
       subtitle = "Published from 2018 - 2021")

Towards Data Science peaked in 2021, with more than 70 articles published per day in the mid-year period (each bar in the chart counts the articles published on a single date).
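Because `geom_bar()` with `x = date` draws one bar per publication date, monthly totals have to be aggregated separately. A minimal sketch of that aggregation follows, using made-up dates. `toy_dates` and the `month_start` and `articles` column names are assumptions for illustration.

```r
library(dplyr)
library(lubridate)

# Hypothetical publication dates standing in for the real date column
toy_dates <- tibble(
  date = mdy(c("01/03/2021", "01/15/2021", "01/15/2021", "02/01/2021"))
)

# Round each date down to the first of its month, then count rows per month
monthly_counts <- toy_dates %>%
  mutate(month_start = floor_date(date, unit = "month")) %>%
  count(month_start, name = "articles")

monthly_counts
```

The same pipeline applied to `tds_dates` would give one row per calendar month, which could then be plotted with `geom_col()`.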
Word Counts by Year
tds_tidy %>%
group_by(year) %>%
count(word, sort = TRUE) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder_within(word, n, year)) %>%
ggplot(aes(x = word, y = n, fill = word)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ year, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(expand = c(0,0)) +
labs(y = "Count",
x = "Unique words",
title = "Most frequent words found in TDS article titles & taglines",
subtitle = "Stop words removed from the list")

The first thing that stands out when I graph the top unigrams by year is that many terms repeat each year. Since I’m analyzing a data science blogging site, it’s expected that terms such as data, science, machine, and learning would appear at the top. I was curious to see how this changed if I expanded the graph to include the top 20 terms:
tds_tidy %>%
group_by(year) %>%
count(word, sort = TRUE) %>%
top_n(20) %>%
ungroup() %>%
mutate(word = reorder_within(word, n, year)) %>%
ggplot(aes(x = word, y = n, fill = word)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ year, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(expand = c(0,0)) +
labs(y = "Count",
x = "Unique words",
title = "Most frequent words found in TDS article titles & taglines",
subtitle = "Stop words removed from the list")

This second attempt does begin to reveal some terms unique to each year. Another way to achieve this is to add words common to all years to the list of stop words. In the end, I decided not to focus too much on individual terms because, in this community, many key topics are described by multi-word phrases such as ‘machine learning’, and treating those words as separate entities loses that meaning. To that end, I repeated the above visuals for multi-word groupings.
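The idea of treating words common to all years as extra stop words could be sketched as follows. This is a minimal example on made-up counts, not something I ran on the full corpus, and `toy_counts`, `common_words`, and `year_specific` are hypothetical names.

```r
library(dplyr)

# Hypothetical per-year word counts standing in for the real tidy corpus
toy_counts <- tibble(
  year = c(2018, 2018, 2018, 2019, 2019, 2019),
  word = c("data", "learn", "forest", "data", "learn", "detect"),
  n    = c(50, 30, 5, 60, 35, 7)
)

n_years <- n_distinct(toy_counts$year)

# Words that occur in every year act as custom stop words
common_words <- toy_counts %>%
  group_by(word) %>%
  summarise(years_present = n_distinct(year), .groups = "drop") %>%
  filter(years_present == n_years) %>%
  select(word)

# Dropping them leaves the year-specific terms
year_specific <- toy_counts %>%
  anti_join(common_words, by = "word")

year_specific
```

In practice the filter would likely be applied to each year's top terms rather than to every word, since rare words can also happen to appear in all years.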
Bigram Counts
tds_bigrams %>%
group_by(year) %>%
count(bigram, sort = TRUE) %>%
top_n(20) %>%
ungroup() %>%
mutate(bigram = reorder_within(bigram, n, year)) %>%
ggplot(aes(x = bigram, y = n, fill = bigram)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ year, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(expand = c(0,0)) +
labs(y = "Count",
x = "Unique Bigrams",
title = "Most frequent bigrams found in article titles & taglines",
subtitle = "Stop words removed from the list")

Expanding to two-word phrases reveals topics unique to each year. Some interesting examples are ‘random forest’ (2018), ‘object detect’ (2019), ‘covid 19’ (2020), and ‘python code’ (2021).
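Raw counts per year depend on how many articles were published that year, so one way to compare a phrase across years is its share of that year's total. The sketch below shows the calculation on made-up numbers. The counts and the `share` column are assumptions for illustration, not results from the dataset.

```r
library(dplyr)

# Hypothetical bigram counts per year, not figures from the dataset
toy_bigrams <- tibble(
  year   = c(2018, 2018, 2019, 2019),
  bigram = c("machin learn", "random forest", "machin learn", "object detect"),
  n      = c(40, 10, 90, 10)
)

# Share of each year's bigram total, so years with more articles
# do not dominate the comparison
bigram_share <- toy_bigrams %>%
  group_by(year) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

bigram_share
```

A phrase whose share rises from one year to the next is gaining ground relative to other topics, even if every raw count grows.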
Trigram Counts
tds_trigrams %>%
group_by(year) %>%
count(trigram, sort = TRUE) %>%
top_n(10) %>%
ungroup() %>%
mutate(trigram = reorder_within(trigram, n, year)) %>%
ggplot(aes(x = trigram, y = n, fill = trigram)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ year, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_y_continuous(expand = c(0,0)) +
labs(y = "Count",
x = "Unique Trigrams",
title = "Most frequent trigrams found in article titles & taglines",
subtitle = "Stop words removed from the list")

Trigrams repeat quite a bit in the top 10, though more complete ideas about specific activities start to emerge. Examples include ‘data science job’ and ‘time seri data’.
4. Preliminary Findings
This project is designed to answer three key questions about using topic modeling to identify latent themes in a body of short-form text. Specifically:
- Can article metadata be used to identify trends in data science research topics?
In this dataset, the only real metadata is the article publish date. Other metadata that may be available but is not yet part of my dataset includes author, article length, and number of reader comments.
- Does research topic frequency/popularity change over time or remain consistent year to year?
Exploratory analysis has revealed some unique patterns over time. While some overarching topics are consistently written about every year, there are also unique terms in each individual year. This leads me to believe topic modeling has a good chance to reveal unique technology topics by time period.
- Can data science research trends identify shifts or advances in technology?
So far, exploratory analysis of term frequency has shown changes in article topics from year to year. This leads me to believe that shifts in technology will be visible as latent topics emerge from the planned models.
Stemming provided another insight regarding the potential for duplicate topics. I initially ran term frequency counts before stemming, and the results were quite different. Before stemming, ‘data scientist’ and ‘data scientists’ were treated as distinct phrases and occupied two separate rows. They are obviously very similar, so stemming combined them, which freed up space in the top-term lists and let me see more terms and potentially identify more topics.
Lastly, context appears to be very important for distinguishing topics. Bigram and trigram analysis yielded much better differentiation among topics across time frames than single words did.
