Info about the lab

Learning aim

The aim of this lab is to acquire basic familiarity with modern word embedding techniques, such as GloVe.

We will train GloVe on a collection of financial tweets. Note that usually, for real applications, one would train GloVe on a very large corpus of different tweets (not just financial tweets) first and then use it to process tweets of interest.

However, in this lab we will work with the data that we have.

Objectives

By the end of this lab session, students should be able to

  1. Visualize text as a word cloud

  2. Replace loops with the functions apply, sapply, and lapply.

  3. Train GloVe in R

  4. Apply the continuous bag of words (CBOW) approach to represent entire tweets

Mode

Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. There will be questions that either require a short answer, which you type right in this document, or a modification of the R code, which you also make in this document. In either case, you can discuss your work with the lab instructor.

Libraries and data

Below we load the required libraries and read in the dataset (the file stockerbot-export.csv, which has already been downloaded):

library(tidyverse) # for manipulation with data
library(text2vec) # for GloVe
library(tm) # for text processing
library(wordcloud) # for visualizations
library(caret)

set.seed(142)

t <- read_csv("stockerbot-export.csv")
## Warning: 598 parsing failures.
## row  col           expected actual                    file
## 230 text delimiter or quote        'stockerbot-export.csv'
## 230 text delimiter or quote   B    'stockerbot-export.csv'
## 230 text delimiter or quote        'stockerbot-export.csv'
## 230 text delimiter or quote   B    'stockerbot-export.csv'
## 230 text delimiter or quote   <e2> 'stockerbot-export.csv'
## ... .... .................. ...... .......................
## See problems(...) for more details.
head(t)

We will first do some very basic cleaning by changing all characters to lower case and replacing everything other than letters and digits with a space.

clean_tweets <- function(x) {
  x %>% 
    tolower %>%
    gsub('[^a-z0-9]', ' ', . ) 
}

t <- t %>%
  mutate(clean_text = clean_tweets(text))

t %>%
  select(text, clean_text) %>%
  sample_n(10)

Here is the word cloud:

visualize_text <- function(x) {
  # x is a character vector
  # the function extracts word frequencies, removes stopwords, and plots a word cloud of the 50 most frequent words
  frequent_words <- termFreq(x)
  frequent_words <- frequent_words[!(names(frequent_words) %in% stopwords())]
  wordcloud(words = names(frequent_words), 
            freq = frequent_words, min.freq = 0,
            max.words = 50, random.order=FALSE, rot.per=0.35, 
            colors=brewer.pal(8, "Dark2"))

}

visualize_text(t$clean_text)

Functions in R

R is a functional programming language (google “functional programming” if you are interested). The syntax for defining a function is very similar to the syntax for defining a numeric variable or a data frame:

square_function <- function(x) {
  x^2
}

square_function(14)
## [1] 196

Thus a function is, in fact, a variable of class “function”:

class(square_function)
## [1] "function"

If we need to calculate the value of some numeric expression, say, \(\frac{2020}{42} - 2^3\), we do not need to define variables and assign constant values to them - we can simply write the expression in R:

2020 / 42 - 2^3
## [1] 40.09524

In the same way, in order to use a function in an expression, we do not have to create a variable of class “function”:

(function(x) x^2)(14)
## [1] 196

Apply family

In R, we very seldom use explicit loops - they are slow and often hard to read, especially nested loops. Instead, we use functions of the apply family. The simplest of them is lapply: if we have a list or a vector \[ a_1,a_2,\dots,a_n \] and a function \(f\) that we want to apply to every element to create a new list \[ f(a_1), f(a_2),\dots, f(a_n), \] we call lapply. Below is an example:

v <- 1:10
cat("Our vector v is (", v, ")\n")
## Our vector v is ( 1 2 3 4 5 6 7 8 9 10 )
cat("And now we will apply the square function to every its entry\n")
## And now we will apply the square function to every its entry
lapply(v, square_function)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 9
## 
## [[4]]
## [1] 16
## 
## [[5]]
## [1] 25
## 
## [[6]]
## [1] 36
## 
## [[7]]
## [1] 49
## 
## [[8]]
## [1] 64
## 
## [[9]]
## [1] 81
## 
## [[10]]
## [1] 100

Note that lapply returns a list. However, it is usually more convenient to work with vectors or matrices. The function sapply works like lapply, but it is smarter in that it tries to convert its output to a vector or a matrix:

sapply(v, square_function)
##  [1]   1   4   9  16  25  36  49  64  81 100
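
If the function returns a vector of the same length for every input, sapply assembles the results into a matrix, one column per input element. Below is a small illustration (not part of the lab data):

# each call returns a vector of length 3, so sapply assembles the results
# into a 3 x 4 matrix: column j contains j, j^2 and j^3
sapply(1:4, function(x) c(x, x^2, x^3))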

Question 1

Here, we look at the dataset mtcars, which is available in base R:

head(mtcars)

Use the functions sapply and class to produce the vector with the type of each variable in the dataset mtcars. Note that this will be a named vector.

# Write your code below
sapply(mtcars, class)
##       mpg       cyl      disp        hp      drat        wt      qsec        vs 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##        am      gear      carb 
## "numeric" "numeric" "numeric"

Or with the pipe operator:

mtcars %>% sapply(class)
##       mpg       cyl      disp        hp      drat        wt      qsec        vs 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##        am      gear      carb 
## "numeric" "numeric" "numeric"

Term co-occurrence matrix

Creating the matrix

To understand the term co-occurrence matrix (TCM), let us explore it with a simple example.

Below we create a simple character vector, then compute and print its TCM \(X\). Note that the default print method for the TCM

  1. doesn’t show column names;

  2. shows dots instead of zeroes;

  3. shows dots for \(X_{ij}\) when \(i>j\), even though we should have \(X_{ij}=X_{ji}\).

simple_text <- "if life were predictable it would cease to be life and be without flavour"

tokens <- space_tokenizer(simple_text)
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it) 
vectorizer <- vocab_vectorizer(vocab)

tcm <- create_tcm(it, vectorizer, 
                  skip_grams_window = 3)

tcm
## 12 x 12 sparse Matrix of class "dgTMatrix"
##    [[ suppressing 12 column names 'and', 'cease', 'flavour' ... ]]
##                                                                                
## and         . . 0.3333333 . .   .         0.3333333 .   0.5 .         1.5000000
## cease       . . .         . 0.5 0.3333333 1.0000000 .   .   1.0000000 0.5000000
## flavour     . . .         . .   .         .         .   1.0 .         0.5000000
## if          . . .         . .   0.3333333 .         0.5 .   .         .        
## it          . . .         . .   1.0000000 0.3333333 0.5 .   1.0000000 .        
## predictable . . .         . .   .         .         1.0 .   0.5000000 .        
## to          . . .         . .   .         .         .   .   0.5000000 1.0000000
## were        . . .         . .   .         .         .   .   0.3333333 .        
## without     . . .         . .   .         .         .   .   .         1.0000000
## would       . . .         . .   .         .         .   .   .         0.3333333
## be          . . .         . .   .         .         .   .   .         0.3333333
## life        . . .         . .   .         .         .   .   .         .        
##                      
## and         1.0000000
## cease       0.3333333
## flavour     .        
## if          1.0000000
## it          0.3333333
## predictable 0.5000000
## to          0.5000000
## were        1.0000000
## without     0.3333333
## would       .        
## be          1.5000000
## life        .

To print the TCM with column names and explicit zeroes, we convert it to a dense matrix. Below we also symmetrize it and select two columns (note that symmetrizing in this way doubles the diagonal entries, for example the entry of “be” with itself):

(t(as.matrix(tcm))+as.matrix(tcm))[ , c("be", "and")]
##                    be       and
## and         1.5000000 0.0000000
## cease       0.5000000 0.0000000
## flavour     0.5000000 0.3333333
## if          0.0000000 0.0000000
## it          0.0000000 0.0000000
## predictable 0.0000000 0.0000000
## to          1.0000000 0.3333333
## were        0.0000000 0.0000000
## without     1.0000000 0.5000000
## would       0.3333333 0.0000000
## be          0.6666667 1.5000000
## life        1.5000000 1.0000000

Question 2

Print the full TCM matrix, including the lower-triangular part.

as.matrix(tcm) + t(as.matrix(tcm))
##                   and     cease   flavour        if        it predictable
## and         0.0000000 0.0000000 0.3333333 0.0000000 0.0000000   0.0000000
## cease       0.0000000 0.0000000 0.0000000 0.0000000 0.5000000   0.3333333
## flavour     0.3333333 0.0000000 0.0000000 0.0000000 0.0000000   0.0000000
## if          0.0000000 0.0000000 0.0000000 0.0000000 0.0000000   0.3333333
## it          0.0000000 0.5000000 0.0000000 0.0000000 0.0000000   1.0000000
## predictable 0.0000000 0.3333333 0.0000000 0.3333333 1.0000000   0.0000000
## to          0.3333333 1.0000000 0.0000000 0.0000000 0.3333333   0.0000000
## were        0.0000000 0.0000000 0.0000000 0.5000000 0.5000000   1.0000000
## without     0.5000000 0.0000000 1.0000000 0.0000000 0.0000000   0.0000000
## would       0.0000000 1.0000000 0.0000000 0.0000000 1.0000000   0.5000000
## be          1.5000000 0.5000000 0.5000000 0.0000000 0.0000000   0.0000000
## life        1.0000000 0.3333333 0.0000000 1.0000000 0.3333333   0.5000000
##                    to      were   without     would        be      life
## and         0.3333333 0.0000000 0.5000000 0.0000000 1.5000000 1.0000000
## cease       1.0000000 0.0000000 0.0000000 1.0000000 0.5000000 0.3333333
## flavour     0.0000000 0.0000000 1.0000000 0.0000000 0.5000000 0.0000000
## if          0.0000000 0.5000000 0.0000000 0.0000000 0.0000000 1.0000000
## it          0.3333333 0.5000000 0.0000000 1.0000000 0.0000000 0.3333333
## predictable 0.0000000 1.0000000 0.0000000 0.5000000 0.0000000 0.5000000
## to          0.0000000 0.0000000 0.0000000 0.5000000 1.0000000 0.5000000
## were        0.0000000 0.0000000 0.0000000 0.3333333 0.0000000 1.0000000
## without     0.0000000 0.0000000 0.0000000 0.0000000 1.0000000 0.3333333
## would       0.5000000 0.3333333 0.0000000 0.0000000 0.3333333 0.0000000
## be          1.0000000 0.0000000 1.0000000 0.3333333 0.6666667 1.5000000
## life        0.5000000 1.0000000 0.3333333 0.0000000 1.5000000 0.0000000

TCM controls

In the lecture, \(X_{ij}\) was defined to be the number of times the word \(j\) appears in the context of word \(i\), i.e., at a distance of at most \(s\) from \(i\) in a sentence. Thus the computation of the TCM is controlled by the window size \(s\) (a hyperparameter of GloVe).

Below we compute and print TCM for \(s=3\):

tcm <- create_tcm(it, vectorizer, 
                  skip_grams_window = 3)
tcm
## 12 x 12 sparse Matrix of class "dgTMatrix"
##    [[ suppressing 12 column names 'and', 'cease', 'flavour' ... ]]
##                                                                                
## and         . . 0.3333333 . .   .         0.3333333 .   0.5 .         1.5000000
## cease       . . .         . 0.5 0.3333333 1.0000000 .   .   1.0000000 0.5000000
## flavour     . . .         . .   .         .         .   1.0 .         0.5000000
## if          . . .         . .   0.3333333 .         0.5 .   .         .        
## it          . . .         . .   1.0000000 0.3333333 0.5 .   1.0000000 .        
## predictable . . .         . .   .         .         1.0 .   0.5000000 .        
## to          . . .         . .   .         .         .   .   0.5000000 1.0000000
## were        . . .         . .   .         .         .   .   0.3333333 .        
## without     . . .         . .   .         .         .   .   .         1.0000000
## would       . . .         . .   .         .         .   .   .         0.3333333
## be          . . .         . .   .         .         .   .   .         0.3333333
## life        . . .         . .   .         .         .   .   .         .        
##                      
## and         1.0000000
## cease       0.3333333
## flavour     .        
## if          1.0000000
## it          0.3333333
## predictable 0.5000000
## to          0.5000000
## were        1.0000000
## without     0.3333333
## would       .        
## be          1.5000000
## life        .

Note that the entries of the TCM are fractional. This happens because, by default, occurrences are counted with weights: if the word \(j\) appears at distance \(k\) from the word \(i\) in a sentence, this occurrence contributes weight \(1/k\) to \(X_{ij}\).
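
For example, in our sentence "be" occurs at positions 9 and 12 and "life" at positions 2 and 10. With \(s=3\), the only pairs within the window are ("be" at 9, "life" at 10) at distance 1 and ("be" at 12, "life" at 10) at distance 2, so \[ X_{\mbox{be},\,\mbox{life}} = 1 + \frac{1}{2} = 1.5, \] which is exactly the value printed above.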

The weights are controlled by the weights argument of create_tcm (the default is \((1,1/2,\dots,1/s)\)):

tcm <- create_tcm(it, vectorizer, 
                  skip_grams_window = 3,
                  weights = c(1, 1, 1))

tcm
## 12 x 12 sparse Matrix of class "dgTMatrix"
##    [[ suppressing 12 column names 'and', 'cease', 'flavour' ... ]]
##                                    
## and         . . 1 . . . 1 . 1 . 2 1
## cease       . . . . 1 1 1 . . 1 1 1
## flavour     . . . . . . . . 1 . 1 .
## if          . . . . . 1 . 1 . . . 1
## it          . . . . . 1 1 1 . 1 . 1
## predictable . . . . . . . 1 . 1 . 1
## to          . . . . . . . . . 1 1 1
## were        . . . . . . . . . 1 . 1
## without     . . . . . . . . . . 1 1
## would       . . . . . . . . . . 1 .
## be          . . . . . . . . . . 1 2
## life        . . . . . . . . . . . .
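
Similarly, to see the effect of the window size itself, we can recompute the TCM with a narrower window. Below is a quick sketch (with \(s=1\), only immediate neighbours are counted):

# narrower window: only adjacent words count as co-occurrences
tcm_narrow <- create_tcm(it, vectorizer, 
                         skip_grams_window = 1)
tcm_narrow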

GloVe

Computing TCM

First, we will compute the TCM for the collection of cleaned financial tweets. We will prune the vocabulary to keep only words that appear at least 5 times (otherwise the TCM would be too large).

skip_gram_window_size <- 3
tokens <- space_tokenizer(t$clean_text)
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it) %>% prune_vocabulary(5)
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer, 
                  skip_grams_window = skip_gram_window_size)

cat("Dimensions of tcm =", dim(tcm), "\n")
## Dimensions of tcm = 8263 8263

Note that the TCM is huge, but it is a sparse matrix, i.e., most of its entries are zero. Below is the fraction of nonzero entries in the TCM:

sum(tcm > 0) / (length(tcm))
## [1] 0.003480798

Model training

GloVe is trained by minimizing the loss function \[ L(w,\tilde{w}, b,\tilde{b})=\sum_{i,k=1}^{V}f(X_{ik})\left(w_i^{T} \tilde{w}_k + b_i + \tilde{b}_k - \log X_{ik} \right)^2 \]

The process of training is controlled by the following hyperparameters:

  • \(d\), dimension of word and context vectors, i.e., we have \[ w\in\mathbb{R}^d,\qquad \tilde{w}\in\mathbb{R}^d \]

  • The maximum number of iterations (epochs) used to minimize the loss function.

  • Convergence tolerance \(\varepsilon\) — we stop our iterations if the change of the loss function is below \(\varepsilon\).

  • The threshold \(x_{max}\) controlling the weights \(f(X_{ij})\). Specifically, we have \(f(x)=1\) whenever \(x\ge x_{max}\) (see the sketch of \(f\) right after this list).
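
The weighting function \(f\) itself is not something we have to code - text2vec applies it internally during training - but a small sketch may help to see its shape. The exponent 0.75 below is the value used in the original GloVe paper; it is an assumption for illustration, not something we set in this lab:

# sketch of the GloVe weighting function f (for illustration only)
# alpha = 0.75 is the exponent from the original GloVe paper
glove_weight <- function(x, x_max = 10, alpha = 0.75) {
  ifelse(x < x_max, (x / x_max)^alpha, 1)
}

# rare co-occurrences get small weights, frequent ones are capped at 1
glove_weight(c(1, 5, 10, 50), x_max = 10)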

Now we will train GloVe:

dim_word_emb <- 100

# Stage 2 (Stage 1 was computing the TCM above): train GloVe itself
glove = GlobalVectors$new(rank = dim_word_emb, x_max = 10)

wv_main = glove$fit_transform(tcm, n_iter = 50, convergence_tol = 0.01)
## INFO  [14:43:06.215] epoch 1, loss 0.5520 
## INFO  [14:43:06.457] epoch 2, loss 0.2605 
## INFO  [14:43:06.694] epoch 3, loss 0.1412 
## INFO  [14:43:06.907] epoch 4, loss 0.1029 
## INFO  [14:43:07.162] epoch 5, loss 0.0836 
## INFO  [14:43:07.464] epoch 6, loss 0.0703 
## INFO  [14:43:07.744] epoch 7, loss 0.0605 
## INFO  [14:43:08.029] epoch 8, loss 0.0530 
## INFO  [14:43:08.321] epoch 9, loss 0.0470 
## INFO  [14:43:08.612] epoch 10, loss 0.0422 
## INFO  [14:43:08.880] epoch 11, loss 0.0382 
## INFO  [14:43:09.161] epoch 12, loss 0.0349 
## INFO  [14:43:09.435] epoch 13, loss 0.0321 
## INFO  [14:43:09.730] epoch 14, loss 0.0297 
## INFO  [14:43:10.000] epoch 15, loss 0.0276 
## INFO  [14:43:10.302] epoch 16, loss 0.0257 
## INFO  [14:43:10.625] epoch 17, loss 0.0241 
## INFO  [14:43:10.922] epoch 18, loss 0.0227 
## INFO  [14:43:11.215] epoch 19, loss 0.0214 
## INFO  [14:43:11.498] epoch 20, loss 0.0203 
## INFO  [14:43:11.816] epoch 21, loss 0.0193 
## INFO  [14:43:12.111] epoch 22, loss 0.0184 
## INFO  [14:43:12.396] epoch 23, loss 0.0175 
## INFO  [14:43:12.684] epoch 24, loss 0.0168 
## INFO  [14:43:12.946] epoch 25, loss 0.0160 
## INFO  [14:43:13.245] epoch 26, loss 0.0154 
## INFO  [14:43:13.514] epoch 27, loss 0.0148 
## INFO  [14:43:13.810] epoch 28, loss 0.0142 
## INFO  [14:43:14.064] epoch 29, loss 0.0137 
## INFO  [14:43:14.374] epoch 30, loss 0.0132 
## INFO  [14:43:14.662] epoch 31, loss 0.0128 
## INFO  [14:43:14.953] epoch 32, loss 0.0124 
## INFO  [14:43:15.224] epoch 33, loss 0.0120 
## INFO  [14:43:15.535] epoch 34, loss 0.0116 
## INFO  [14:43:15.830] epoch 35, loss 0.0113 
## INFO  [14:43:16.127] epoch 36, loss 0.0109 
## INFO  [14:43:16.420] epoch 37, loss 0.0106 
## INFO  [14:43:16.704] epoch 38, loss 0.0103 
## INFO  [14:43:16.990] epoch 39, loss 0.0101 
## INFO  [14:43:17.290] epoch 40, loss 0.0098 
## INFO  [14:43:17.565] epoch 41, loss 0.0096 
## INFO  [14:43:17.853] epoch 42, loss 0.0093 
## INFO  [14:43:18.236] epoch 43, loss 0.0091 
## INFO  [14:43:18.665] epoch 44, loss 0.0089 
## INFO  [14:43:19.033] epoch 45, loss 0.0087 
## INFO  [14:43:19.341] epoch 46, loss 0.0085 
## INFO  [14:43:19.632] epoch 47, loss 0.0083 
## INFO  [14:43:19.932] epoch 48, loss 0.0081 
## INFO  [14:43:20.228] epoch 49, loss 0.0079 
## INFO  [14:43:20.540] epoch 50, loss 0.0078
wv_context = glove$components            # context word vectors (one column per word)
word_vectors = wv_main + t(wv_context)   # combine main and context vectors

cat("Dim of word vector matrix =", dim(word_vectors))
## Dim of word vector matrix = 8263 100

Exploration

The output of GloVe is a matrix whose rows represent word vectors. Below are the first 5 words:

row.names(word_vectors)[1:5]
## [1] "0001140361" "0003"       "005925"     "024199"     "032"

And here are entries 1-10 of word vectors for “share”, “bitcoin”, and “oil”:

word_vectors[c("share", "bitcoin", "oil") , 1:10] %>% round(3)
##           [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  [,10]
## share   -0.035  0.510  0.756 -0.570  0.156  0.771 -0.014 -0.114  0.187 -0.464
## bitcoin -0.211 -0.904 -0.125  0.276 -0.170 -0.316 -0.615 -0.083 -0.723  0.356
## oil      0.649  0.178 -0.299  0.724  0.700 -0.156 -0.357 -0.029  0.372 -0.256

Recall that the cosine similarity between vectors \(x\) and \(y\) is \[ \cos\alpha = \frac{\sum_{i=1}^{d}x_iy_i}{\sqrt{\sum_{i=1}^{d}x_i^2}\times \sqrt{\sum_{i=1}^{d}y_i^2}} \] Below is the cosine similarity between the words “share” and “stock”:

cos_sim <- function(x, y) {
  # computes cosine similarity between vectors x and y
  sum(x*y) / sqrt(sum(x^2) * sum(y^2))
}

cos_sim(word_vectors["share" , ], word_vectors["stock" , ])
## [1] 0.1907162

Is it high or low?

Question 3

Find the cosine similarity between “share” and each of the other words in the vocabulary, and print the 10 words most similar to “share”.

cos_sim_with_share <- function(x) cos_sim(word_vectors["share" , ], x)


word_vectors %>%
  apply(1, cos_sim_with_share) %>%
  sort(decreasing = TRUE) %>%
  head(10)
##     share       per     value         0     price      rose        as  declined 
## 1.0000000 0.7803328 0.4976375 0.4580223 0.4553215 0.4518101 0.4323855 0.3997106 
##  earnings        69 
## 0.3821200 0.3806704

CBOW

To assign a vector to an entire document, we sum the vectors of the words it contains: \[ v(\mbox{document})=\sum_{x\in\mbox{document}}v(x) \] In matrix form, this means that our data matrix is \[ \mbox{DTM}\cdot \mbox{GLOVE}, \] where DTM is the document-term matrix and GLOVE is the matrix of word vectors. First, we need to create the DTM:

dtm <- create_dtm(it, vectorizer)
cat("DTM dimensions =", dim(dtm), "\n")
## DTM dimensions = 10350 8263

Now we prepare the CBOW data. Below is a sample:

cbow_data <- as.matrix(dtm %*% word_vectors)
cbow_data[1:7, 1:5]
##           [,1]       [,2]     [,3]       [,4]       [,5]
## 1 -0.004207416 -1.1614794 1.296112  0.8672684 -0.5383059
## 2 -0.104987032  0.7476035 3.358613  1.7254446 11.6405303
## 3 -0.404584743  0.4814405 3.068303  0.5457207  1.4859654
## 4 -0.444753915 -6.2681936 1.106205  2.0391905  3.1810907
## 5 -0.630572541  1.9021055 4.792348 -0.9225555 -0.0770353
## 6  1.528577519 -1.0096101 4.635813  1.4376962 -0.1481463
## 7  0.221412420 -2.5750533 1.682572  2.6312903 -1.4227330
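
As a quick sanity check (a sketch, not required for the lab), a row of cbow_data should equal the sum of the GloVe vectors of the in-vocabulary words of the corresponding tweet. Assuming the first tweet contains at least one word from the pruned vocabulary:

# words of the first cleaned tweet that survived vocabulary pruning
words_1 <- space_tokenizer(t$clean_text[1])[[1]]
words_1 <- words_1[words_1 %in% rownames(word_vectors)]

# summing their word vectors (with multiplicity) should reproduce row 1 of cbow_data
manual_cbow <- colSums(word_vectors[words_1, , drop = FALSE])
max(abs(manual_cbow - cbow_data[1, ]))   # essentially zero, up to floating point error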

Finding information in the data

Now if we have some statement or a question, we can look for documents, i.e., tweets, that have the largest cosine similarity to our statement. For example, consider the statement: “Biotech is going to raise”. Let us first compute a mini-dtm for it (1 row):

simple_text <- "biotech is going to raise"

new_dtm <- simple_text %>%
  space_tokenizer() %>%
  itoken(progressbar = FALSE) %>%
  create_dtm(vectorizer)

new_dtm %>%
  as.matrix %>%
  as.data.frame %>%
  select(biotech, going, raise, stock, price)

Then, for every tweet, we compute the cosine similarity between its CBOW vector and the CBOW vector of our statement, sort by decreasing similarity, and print the 10 most similar tweets:

vec_statement <- as.matrix(new_dtm %*% word_vectors)
t_with_sim <- t %>%
  mutate(similarity = apply(cbow_data, 1, function(x) cos_sim(x, vec_statement))) %>%
  arrange(-similarity)

t_with_sim %>%
  select(text, similarity) %>%
  head(n = 10)

Answers