This tutorial walks you through performing Latent Dirichlet Allocation (LDA) topic modeling on a collection of tweets in R.
Install the following packages if you do not have them in your R environment.
- tidyverse is a collection of R packages for data science
- tidytext is used to preprocess data for text mining
- topicmodels is used to perform Latent Dirichlet Allocation (LDA) topic modeling analysis
- reshape2 is a dependency that may need to be installed manually
- LDAvis is used to interactively visualize topic modeling results using a web-based viewer

Note that the result of your analysis may differ based on the tidytext version. Specifically, the tokenizing logic from version 0.4.0 (released on December 20th, 2022) and above has been updated: the token = "tweets" option has been deprecated and will throw an error. This notebook assumes that you’re using tidytext version >= 0.4.0.
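If tidytext is already installed, you can confirm which version you have with base R's packageVersion() (a quick optional check):
# check the installed tidytext version; it should print 0.4.0 or higher
packageVersion("tidytext")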
# uncomment and run the lines below if you need to install these packages
# install.packages("tidyverse")
# install.packages("tidytext")
# install.packages("topicmodels")
# install.packages("reshape2")
# install.packages("LDAvis")
# install.packages("servr")
Load packages.
library(tidyverse)
library(tidytext)
library(topicmodels)
library(reshape2)
library(LDAvis)
library(servr)
Read the tweet data from the CSV file.
df_tweets <- read_csv('Coinbase-tweets.csv')
## Rows: 654 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): username, text
## dbl (1): id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_tweets %>% head()
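As the message above suggests, the column specification message can be silenced by setting show_col_types = FALSE (optional):
# optional: re-read the CSV without the column specification message
df_tweets <- read_csv('Coinbase-tweets.csv', show_col_types = FALSE)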
Print out the number of rows.
nrow(df_tweets)
## [1] 654
Create a new column named row_num with unique values in
each row. This allows us to group by each tweet after we tokenize the
tweet text.
df_tweets$row_num <- seq_len(nrow(df_tweets))
df_tweets %>% head() %>% select(-text)
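Equivalently, the same column can be added inside a dplyr pipeline with row_number() (an alternative sketch, not the code used above):
# alternative: add row_num with mutate() and row_number()
df_tweets <- df_tweets %>%
  mutate(row_num = row_number())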
Many tweets contain URL strings in the form of “https://t.co/somestring”. Remove these URL strings using a regular expression.
df_tweets$text <- df_tweets$text %>%
  str_remove_all("https://t.co/\\w+")
df_tweets %>%
  select(text) %>%
  head()
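To see what the pattern removes in isolation, here is a quick check on a made-up string (hypothetical example, not taken from the dataset):
# the t.co URL is stripped; the remaining text is left untouched
str_remove_all("Check this out https://t.co/abc123 #coinbase", "https://t.co/\\w+")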
Tokenize tweet texts and normalize the tokens.
# tokenize using unnest_tokens()
# the default word tokenizer also normalizes the tokens (lowercase, punctuation removed)
# note: the token = "tweets" option, which preserved usernames and hashtags, is deprecated in tidytext >= 0.4.0
df_tokens <- df_tweets %>%
  tidytext::unnest_tokens(input = text, output = word)
df_tokens %>% head(n = 10)
Stop words are words that are commonly used and likely unimportant. Examples include “is”, “by”, “the”, “a”, etc.
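tidytext bundles a stop_words data frame that combines several lexicons; peeking at it is optional but shows what the anti_join below removes:
# stop_words has one row per (word, lexicon) pair
tidytext::stop_words %>% count(lexicon)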
# remove stop words using anti_join
# and remove tokens with only 1 or 2 characters
df_tokens <- df_tokens %>%
  anti_join(tidytext::stop_words, by = "word") %>%
  filter(nchar(word) >= 3)
df_tokens %>% head(n = 10)
# inspect the most frequent remaining tokens
df_tokens %>% count(word) %>% arrange(desc(n))
# build a document-term matrix: one row per tweet (row_num), one column per token, values are counts
dtm <- df_tokens %>%
  count(row_num, word) %>%
  cast_dtm(document = row_num, term = word, value = n)
dtm
## <<DocumentTermMatrix (documents: 654, terms: 2880)>>
## Non-/sparse entries: 7789/1875731
## Sparsity : 100%
## Maximal term length: 20
## Weighting : term frequency (tf)
Fit an LDA model with 3 topics (k = 3). Setting a seed in control makes the result reproducible.
tweets_lda <- topicmodels::LDA(dtm, k = 3, control = list(seed = 12))
tweets_lda
## A LDA_VEM topic model with 3 topics.
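The choice of k = 3 here is a modeling decision. One rough way to compare candidate values (an optional sketch, not part of the original analysis, and slow because each k refits a model) is to compare perplexity on the same document-term matrix; lower is better:
# optional: compare a few candidate topic counts by perplexity
ks <- c(2, 3, 4, 5)
perp <- sapply(ks, function(k) {
  m <- topicmodels::LDA(dtm, k = k, control = list(seed = 12))
  topicmodels::perplexity(m, dtm)
})
tibble(k = ks, perplexity = perp)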
Print out per-topic-per-word probabilities.
The beta values are the probabilities of words in each topic.
tweet_topics <- tidytext::tidy(tweets_lda, matrix = "beta")
top_terms <- tweet_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 20) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms
top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  theme_minimal() +
  ggtitle("Top terms by topic") +
  facet_wrap(~topic, scales = "free") +
  scale_y_reordered()
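The tidy() interface also exposes per-document-per-topic probabilities (gamma), which is one way to see which topic dominates each tweet (optional extra step, not in the original analysis):
# per-document topic probabilities; keep the most likely topic per tweet
tweet_gamma <- tidytext::tidy(tweets_lda, matrix = "gamma")
tweet_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup() %>%
  head()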
The intertopic distance map will not be shown in the knitted HTML file. Run the R notebook to see the intertopic distance map.
# extract posterior probabilities: per-topic word distributions and per-document topic distributions
post <- topicmodels::posterior(tweets_lda)
# word assignments from the fitted model
mat <- tweets_lda@wordassignments
json <- LDAvis::createJSON(
  phi = post$terms,
  theta = post$topics,
  vocab = colnames(post$terms),
  doc.length = slam::row_sums(mat, na.rm = TRUE),
  term.frequency = slam::col_sums(mat, na.rm = TRUE)
)
serVis(json)
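serVis() serves the page locally by default. To keep a standalone copy of the visualization (for example, next to the knitted HTML), you can write the files to a directory instead; the directory name below is arbitrary:
# optional: write the LDAvis files to a folder without opening a browser
serVis(json, out.dir = "ldavis_output", open.browser = FALSE)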