Info about the lab

Learning aim

The aim of this lab is to learn topic modelling with Latent Dirichlet Allocation.

Objectives

By the end of this lab session, students should be able to

  1. Convert long data to wide data and vice versa

  2. Summarize data

  3. Train an LDA model in R

Mode

Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. There are questions that either require a short answer (type your answer right in this document) or require modifying R code (modify the code here). In either case, you can discuss your work with the lab instructor.

Libraries and data

First, we load the libraries required for this lab:

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter()     masks stats::filter()
## x dplyr::group_rows() masks kableExtra::group_rows()
## x dplyr::lag()        masks stats::lag()
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
## Loading required package: RColorBrewer
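
The console messages above come from attaching the packages used in this lab. The original library chunk is not shown here; below is a sketch of the calls that would produce such output (the exact package list, in particular ISLR as the source of the Default dataset, is an assumption based on the functions used later):

library(kableExtra)   # table formatting (mentioned in the conflict messages above)
library(tidyverse)    # dplyr, tidyr, ggplot2, ... for data manipulation and plotting
library(ISLR)         # assumed source of the Default dataset
library(tm)           # corpora, DocumentTermMatrix, termFreq, stopwords (loads NLP)
library(wordcloud)    # word clouds (loads RColorBrewer)
library(topicmodels)  # the LDA() function
library(tidytext)     # tidy() for LDA models, reorder_within(), scale_y_reordered()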

Data manipulation

Summarizing data

We will use the dataset of credit card defaults as an example:

head(Default)

Suppose that we want the total number of records, the mean balance, and the variance of balance for each possible value of the variable student. We do it by combining two functions, group_by and summarise as follows:

Default %>%
  group_by(student) %>%
  summarise(N = n(), `Mean Balance` = mean(balance), `Var(Balance)` = var(balance))

Question 1

Modify the previous example to get a table that shows, for each possible value of the variable student, the total number of records, the number of records with default = "Yes", the smallest balance, and the smallest balance among records with default = "Yes".

Default %>%
  group_by(student) %>%
  summarise(N = n(), `N defaults` = sum(default == "Yes"), 
            `Min balance` = min(balance),
            `Min balance of default` = min(balance[default == "Yes"]))

Wide and long data

Here we will learn a basic but very common operation for data manipulation — converting long data to wide data and vice versa. For example, here is a dataset of NASDAQ industry indices:

nasdaq <- read.csv("NASDAQ.csv")
head(nasdaq)

This is a so-called wide format. Usually, to work with such a dataset in R (for example, to plot it), we need to convert it to a long format. This is done with the R function pivot_longer as follows:

nasdaq_long <- nasdaq %>%
  pivot_longer(Energy:Utilities, names_to = "Sector", values_to = "Price")

nasdaq_long

And now we can plot it:

nasdaq_long %>%
  mutate(date = as.Date(date)) %>%
  ggplot(aes(x = date, y = Price, color = Sector, group = Sector)) +
  geom_line()

Sometimes we need the inverse operation, i.e., convert a long dataset to a wide dataset. This is done with the R function pivot_wider.
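
As a small illustration (the toy tibble below is made up for this example and is not part of the lab data), pivot_wider spreads the values of one column into new columns named after another column:

toy_long <- tibble(day = c(1, 1, 2, 2),
                   sector = c("Energy", "Banks", "Energy", "Banks"),
                   price = c(100, 200, 101, 199))

toy_long %>%
  pivot_wider(names_from = sector, values_from = price)

The result has one row per day and one column per sector.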

Question 2

Read the manual about pivot_wider and convert our long dataset back to a wide dataset.

nasdaq_wide <- nasdaq_long %>%
  pivot_wider(names_from = Sector, values_from = Price)

nasdaq_wide

LDA

Data

Now we load the data and do some simple cleaning.

set.seed(142)

clean_tweets <- function(x) {
  x %>% 
    tolower %>%
    gsub('[^a-z0-9]', ' ', . ) 
}

t <- read.csv("stockerbot-export.csv", stringsAsFactors = FALSE) %>%
  as_tibble %>%
  mutate(clean_text = clean_tweets(text))


visualize_text <- function(x) {
  # x is a character vector
  # the function extracts the most frequent non-stopword terms and draws a word cloud
  frequent_words <- termFreq(x)
  frequent_words <- frequent_words[!(names(frequent_words) %in% stopwords())]
  wordcloud(words = names(frequent_words), 
            freq = frequent_words, min.freq = 0,
            max.words = 50, random.order=FALSE, rot.per=0.35, 
            colors=brewer.pal(8, "Dark2"))

}

visualize_text(t$clean_text)

DTM

Let us remove stopwords and infrequent terms from the document-term matrix (DTM). Below is the number of terms we will remove and a sample:

word_freq <- termFreq(t$clean_text)
terms_to_remove = c(stopwords() , 
                    names(word_freq)[word_freq <= 5])

length(terms_to_remove)
## [1] 58913
sample(terms_to_remove, 10)
##  [1] "6xiwl"               "regulator"           "1019602543926829000"
##  [4] "seek"                "bki"                 "ttypf9jado"         
##  [7] "53a9qg4nqo"          "mgmljpuhi8"          "1019176415571464200"
## [10] "1019415183079374800"

Now we produce the DTM and drop those terms from it:

corpus <- VCorpus(VectorSource(t$clean_text))
DTM <- DocumentTermMatrix(corpus)
DTM <- DTM[ , !colnames(DTM) %in% terms_to_remove]
dim(DTM)
## [1] 16107  6499
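
If you want to peek at the matrix itself, tm's inspect() prints a sparsity summary together with a sample of the counts; the row and column ranges below are arbitrary:

inspect(DTM[1:5, 1:8])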

LDA only works if every document contains at least one word, so after deleting infrequent terms we may have to remove documents that are left with no words at all. Below we compute the number of remaining words in each document:

n_remaining_words <- apply(DTM, 1, sum)   # row sums: how many words are left in each tweet
head(n_remaining_words)
##  1  2  3  4  5  6 
##  6 11  4  9  6 10
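
Note that apply() converts the sparse document-term matrix to a dense one before summing. For larger corpora, a more memory-friendly alternative (a suggestion, not part of the original lab) is slam::row_sums(), which works directly on the sparse representation used by tm:

head(slam::row_sums(DTM))   # same numbers as above, computed without densifying the matrix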

And here is the number of documents we need to remove:

sum(n_remaining_words == 0)
## [1] 50

Now we subset both the tweet table and the DTM, keeping only documents that still contain at least one word:

t <- t %>% filter(n_remaining_words > 0)
DTM <- DTM[n_remaining_words > 0 , ]
dim(DTM)
## [1] 16057  6499

Topic modelling

Let us now fit an LDA model with, say, 5 topics. This operation may take a while.

mod_lda <- LDA(DTM, k = 5, control = list(seed = 1234))
mod_lda
## A LDA_VEM topic model with 5 topics.
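
Before digging into the full probability tables, the topicmodels package also offers a quick summary: terms() lists the highest-probability words for each topic, which is a handy first sanity check:

terms(mod_lda, 10)   # top 10 terms for each of the 5 topics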

Word probabilities

Each topic is characterized by its own probabilities for all the words that we have. Below we extract these probabilities and print a sample:

lda_topics <- tidy(mod_lda, matrix = "beta")
lda_topics %>%
  arrange(term)

Now, for each topic, let us choose the 10 most probable words in that topic:

top_terms <- lda_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms

And here is the plot:

top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

Note that this is probably not very informative. For instance, the most common term is always “https”: terms that are frequent in the entire dataset tend to get high probabilities within every topic. Instead, for each term, let us calculate \[ \log\frac{\beta_1}{\beta_2}, \] where \(\beta_1\) is the highest of the 5 probabilities that the 5 topics assign to this term and \(\beta_2\) is the second highest. A high value of this metric tells us that the term is strongly associated with just one topic.

For example, let’s do this for the term “https”. Below are probabilities for each topic:

df_https <- lda_topics %>%
  filter(term == "https") %>%
  arrange(-beta)

df_https

We see that the two largest probabilities are almost equal, suggesting that the term “https” is equally strongly associated with topics 2 and 3. The log-ratio metric is

log(df_https$beta[1] / df_https$beta[2])
## [1] 0.0190391

Question 3

For each term, find the topic the term is most strongly associated with and the log-ratio of this association. Then, for each topic, plot the terms with the highest log-ratio.

Hint: you can use the function nth.
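
If you have not used nth() before, here is a tiny illustration on a made-up vector: with order_by, nth(x, k, order_by = x) returns the k-th smallest value, so with 5 topics the 5th value is the largest beta and the 4th is the second largest.

x <- c(0.3, 0.05, 0.4, 0.15, 0.1)
nth(x, 5, order_by = x)   # 0.4 -- the largest of the five values
nth(x, 4, order_by = x)   # 0.3 -- the second largest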

Answer

First, for each term, we find the topic it is most strongly associated with and the corresponding log-ratio:

topic_logratio <- lda_topics %>%
  group_by(term) %>%
  summarise(topic = topic[which.max(beta)],
            # with 5 topics, nth(beta, 5) is the largest beta and nth(beta, 4) the second largest
            Logratio = log(nth(beta, 5, order_by = beta) / nth(beta, 4, order_by = beta)))

topic_logratio

And here is the plot:

top_logratio_terms <- topic_logratio %>%
  group_by(topic) %>%
  top_n(10, Logratio) %>%
  ungroup() %>%
  arrange(topic, -Logratio)

top_logratio_terms %>%
  mutate(term = reorder_within(term, Logratio, topic)) %>%
  ggplot(aes(Logratio, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

Mixture of topics in documents

Below we compute probabilities of topics in each document:

lda_docs <- tidy(mod_lda, matrix = "gamma") %>%
  mutate(document = as.numeric(document)) # this is needed because by default it is character

lda_docs %>%
  arrange(document) %>%
  head(n = 20)
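
As a quick sanity check (not part of the original lab), the topic probabilities within each document should sum to one:

lda_docs %>%
  group_by(document) %>%
  summarise(total = sum(gamma)) %>%
  head()   # every value of `total` should be numerically equal to 1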

Question 4

For each document, identify the topic it is mostly associated with and find the probability of this topic. For each topic, print the number of documents most strongly associated with this topic.

docs_with_topic <- lda_docs %>%
  group_by(document) %>%
  summarise(Topic = topic[which.max(gamma)], 
            Prob = max(gamma))

docs_with_topic
table(docs_with_topic$Topic)
## 
##    1    2    3    4    5 
##  681 5273 6615 2236 1252
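
An equivalent, pipe-friendly way to get the same counts is dplyr's count():

docs_with_topic %>% count(Topic)   # one row per topic with the number of documents n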

Question 5

Convert the lda_docs data to wide format, with one column per topic.

doc_wide_topic <- lda_docs %>% 
  pivot_wider(names_from = topic, values_from = gamma)

doc_wide_topic
