The aim of this lab is to acquire basic familiarity with modern word embedding techniques, such as GloVe.
We will train GloVe on a collection of financial tweets. Note that usually, for real applications, one would train GloVe on a very large corpus of different tweets (not just financial tweets) first and then use it to process tweets of interest.
However, we will work with whatever data we have.
By the end of this lab session, students should be able to
Visualize text as a word cloud
Replace loops with the functions apply, sapply, and lapply
Train GloVe in R
Apply continuous bag of words to represent entire tweets
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. There will be questions that either require a short answer - in which case you type your answer right in this document - or require modifying the R code - in which case you modify the code here. In either case, you can discuss your work with the lab instructor.
The dataset (stockerbot-export.csv) has already been downloaded; we start by loading the required libraries and reading it in:
library(tidyverse) # for manipulation with data
library(text2vec) # for GloVe
library(tm) # for text processing
library(wordcloud) # for visualizations
library(caret) # for machine learning utilities
set.seed(142)
t <- read_csv("stockerbot-export.csv")
## Warning: 598 parsing failures.
## row col expected actual file
## 230 text delimiter or quote 'stockerbot-export.csv'
## 230 text delimiter or quote B 'stockerbot-export.csv'
## 230 text delimiter or quote 'stockerbot-export.csv'
## 230 text delimiter or quote B 'stockerbot-export.csv'
## 230 text delimiter or quote <e2> 'stockerbot-export.csv'
## ... .... .................. ...... .......................
## See problems(...) for more details.
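This warning is expected: some tweets contain characters that confuse the parser. If we wanted to inspect the rows that failed to parse, we could (optionally) look at them with readr's problems() function:
problems(t) %>% head() # rows and columns where parsing failed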
head(t)
We will first do some very basic cleaning by changing all characters to lower case and replacing everything except letters and digits with a space.
clean_tweets <- function(x) {
x %>%
tolower %>%
gsub('[^a-z0-9]', ' ', . )
}
t <- t %>%
mutate(clean_text = clean_tweets(text))
t %>%
select(text, clean_text) %>%
sample_n(10)
Here is a word cloud of the 50 most frequent words (with stop words removed):
visualize_text <- function(x) {
# x is a character vector of documents
# the function extracts word frequencies, drops stop words,
# and plots a word cloud of the 50 most frequent words
frequent_words <- termFreq(x)
frequent_words <- frequent_words[!(names(frequent_words) %in% stopwords())]
wordcloud(words = names(frequent_words),
freq = frequent_words, min.freq = 0,
max.words = 50, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
}
visualize_text(t$clean_text)
R is a functional programming language (google “functional programming” if you are interested). The syntax for defining a function is very similar to the syntax for defining a numeric variable or a data frame:
square_function <- function(x) {
x^2
}
square_function(14)
## [1] 196
Thus a function is, in fact, a variable of class “function”:
class(square_function)
## [1] "function"
If we need to calculate the value of some numeric expression, say, \(\frac{2020}{42} - 2^3\), we do not need to define variables and assign constant values to them - we can simply write the expression in R:
2020 / 42 - 2^3
## [1] 40.09524
In the same way, in order to use a function in an expression, we do not have to create a variable of class “function”:
(function(x) x^2)(14)
## [1] 196
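If your R version is 4.1 or newer, there is also a shorthand "lambda" syntax for anonymous functions that does the same thing:
(\(x) x^2)(14) # \(x) is shorthand for function(x)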
In R, we very seldom use explicit loops - they are slow and often hard to read, especially when nested. Instead, we use functions of the apply family. The simplest of them is lapply. If we have a list or a vector \[
a_1, a_2, \dots, a_n
\] and a function \(f\) that we want to apply to every element to create a new list \[
f(a_1), f(a_2), \dots, f(a_n),
\] we use the R function lapply. Below is an example:
v <- 1:10
cat("Our vector v is (", v, ")\n")
## Our vector v is ( 1 2 3 4 5 6 7 8 9 10 )
cat("And now we will apply the square function to each of its entries\n")
## And now we will apply the square function to each of its entries
lapply(v, square_function)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 9
##
## [[4]]
## [1] 16
##
## [[5]]
## [1] 25
##
## [[6]]
## [1] 36
##
## [[7]]
## [1] 49
##
## [[8]]
## [1] 64
##
## [[9]]
## [1] 81
##
## [[10]]
## [1] 100
Note that lapply returns a list. However, it is usually more convenient to work with vectors or matrices. The function sapply works like lapply, but it is smarter in that it tries to convert its output to a vector or a matrix:
sapply(v, square_function)
## [1] 1 4 9 16 25 36 49 64 81 100
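When the function returns a vector of the same length for every element, sapply assembles the results into a matrix. Here is a small sketch (the names square and cube are our own):
# each call returns a named vector of length 2,
# so sapply simplifies the output to a 2 x 10 matrix
sapply(v, function(x) c(square = x^2, cube = x^3))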
Here, we look at the dataset mtcars, available in base R:
head(mtcars)
Use the functions sapply and class to produce a vector with the type of each variable in the dataset mtcars. Note that this will be a named vector.
# Write your code below
sapply(mtcars, class)
## mpg cyl disp hp drat wt qsec vs
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## am gear carb
## "numeric" "numeric" "numeric"
Or with the pipe operator:
mtcars %>% sapply(class)
## mpg cyl disp hp drat wt qsec vs
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## am gear carb
## "numeric" "numeric" "numeric"
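For comparison, here is the squaring example written as an explicit loop; it illustrates the kind of code that sapply replaces (a sketch):
# explicit loop: pre-allocate the result, then fill it element by element
squares_loop <- numeric(length(v))
for (i in seq_along(v)) {
  squares_loop[i] <- square_function(v[i])
}
squares_loop # the same values as sapply(v, square_function)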
To understand the term co-occurrence matrix (TCM), let us explore it on a simple example.
Below we create a simple character vector, then compute and print its TCM \(X\). Note that the default method of printing a TCM
doesn’t show column names;
shows dots instead of zeroes;
shows dots for \(X_{ij}\) when \(i>j\), even though we should have \(X_{ij}=X_{ji}\).
simple_text <- "if life were predictable it would cease to be life and be without flavour"
tokens <- space_tokenizer(simple_text)
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer,
skip_grams_window = 3)
tcm
## 12 x 12 sparse Matrix of class "dgTMatrix"
## [[ suppressing 12 column names 'and', 'cease', 'flavour' ... ]]
##
## and . . 0.3333333 . . . 0.3333333 . 0.5 . 1.5000000
## cease . . . . 0.5 0.3333333 1.0000000 . . 1.0000000 0.5000000
## flavour . . . . . . . . 1.0 . 0.5000000
## if . . . . . 0.3333333 . 0.5 . . .
## it . . . . . 1.0000000 0.3333333 0.5 . 1.0000000 .
## predictable . . . . . . . 1.0 . 0.5000000 .
## to . . . . . . . . . 0.5000000 1.0000000
## were . . . . . . . . . 0.3333333 .
## without . . . . . . . . . . 1.0000000
## would . . . . . . . . . . 0.3333333
## be . . . . . . . . . . 0.3333333
## life . . . . . . . . . . .
##
## and 1.0000000
## cease 0.3333333
## flavour .
## if 1.0000000
## it 0.3333333
## predictable 0.5000000
## to 0.5000000
## were 1.0000000
## without 0.3333333
## would .
## be 1.5000000
## life .
To print the TCM with column names and zeroes, we convert it to a dense matrix; adding the matrix to its transpose also fills in the lower-triangular part. Here are the columns for “be” and “and”:
(t(as.matrix(tcm))+as.matrix(tcm))[ , c("be", "and")]
## be and
## and 1.5000000 0.0000000
## cease 0.5000000 0.0000000
## flavour 0.5000000 0.3333333
## if 0.0000000 0.0000000
## it 0.0000000 0.0000000
## predictable 0.0000000 0.0000000
## to 1.0000000 0.3333333
## were 0.0000000 0.0000000
## without 1.0000000 0.5000000
## would 0.3333333 0.0000000
## be 0.6666667 1.5000000
## life 1.5000000 1.0000000
Print the full TCM matrix, including the lower-triangular part.
as.matrix(tcm) + t(as.matrix(tcm))
## and cease flavour if it predictable
## and 0.0000000 0.0000000 0.3333333 0.0000000 0.0000000 0.0000000
## cease 0.0000000 0.0000000 0.0000000 0.0000000 0.5000000 0.3333333
## flavour 0.3333333 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## if 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.3333333
## it 0.0000000 0.5000000 0.0000000 0.0000000 0.0000000 1.0000000
## predictable 0.0000000 0.3333333 0.0000000 0.3333333 1.0000000 0.0000000
## to 0.3333333 1.0000000 0.0000000 0.0000000 0.3333333 0.0000000
## were 0.0000000 0.0000000 0.0000000 0.5000000 0.5000000 1.0000000
## without 0.5000000 0.0000000 1.0000000 0.0000000 0.0000000 0.0000000
## would 0.0000000 1.0000000 0.0000000 0.0000000 1.0000000 0.5000000
## be 1.5000000 0.5000000 0.5000000 0.0000000 0.0000000 0.0000000
## life 1.0000000 0.3333333 0.0000000 1.0000000 0.3333333 0.5000000
## to were without would be life
## and 0.3333333 0.0000000 0.5000000 0.0000000 1.5000000 1.0000000
## cease 1.0000000 0.0000000 0.0000000 1.0000000 0.5000000 0.3333333
## flavour 0.0000000 0.0000000 1.0000000 0.0000000 0.5000000 0.0000000
## if 0.0000000 0.5000000 0.0000000 0.0000000 0.0000000 1.0000000
## it 0.3333333 0.5000000 0.0000000 1.0000000 0.0000000 0.3333333
## predictable 0.0000000 1.0000000 0.0000000 0.5000000 0.0000000 0.5000000
## to 0.0000000 0.0000000 0.0000000 0.5000000 1.0000000 0.5000000
## were 0.0000000 0.0000000 0.0000000 0.3333333 0.0000000 1.0000000
## without 0.0000000 0.0000000 0.0000000 0.0000000 1.0000000 0.3333333
## would 0.5000000 0.3333333 0.0000000 0.0000000 0.3333333 0.0000000
## be 1.0000000 0.0000000 1.0000000 0.3333333 0.6666667 1.5000000
## life 0.5000000 1.0000000 0.3333333 0.0000000 1.5000000 0.0000000
In the lecture, \(X_{ij}\) was defined to be the number of times the word \(j\) appears in a context of word \(i\), i.e., the distance from \(j\) to \(i\) in a sentence is at most \(s\). Thus the process of calculating the TCM is controlled by the window size \(s\) (it is a hyperparameter of GloVe).
Below we compute and print TCM for \(s=3\):
tcm <- create_tcm(it, vectorizer,
skip_grams_window = 3)
tcm
## 12 x 12 sparse Matrix of class "dgTMatrix"
## [[ suppressing 12 column names 'and', 'cease', 'flavour' ... ]]
##
## and . . 0.3333333 . . . 0.3333333 . 0.5 . 1.5000000
## cease . . . . 0.5 0.3333333 1.0000000 . . 1.0000000 0.5000000
## flavour . . . . . . . . 1.0 . 0.5000000
## if . . . . . 0.3333333 . 0.5 . . .
## it . . . . . 1.0000000 0.3333333 0.5 . 1.0000000 .
## predictable . . . . . . . 1.0 . 0.5000000 .
## to . . . . . . . . . 0.5000000 1.0000000
## were . . . . . . . . . 0.3333333 .
## without . . . . . . . . . . 1.0000000
## would . . . . . . . . . . 0.3333333
## be . . . . . . . . . . 0.3333333
## life . . . . . . . . . . .
##
## and 1.0000000
## cease 0.3333333
## flavour .
## if 1.0000000
## it 0.3333333
## predictable 0.5000000
## to 0.5000000
## were 1.0000000
## without 0.3333333
## would .
## be 1.5000000
## life .
Note that the entries of the TCM are fractional. This happens because, by default, the TCM counts the number of times a word \(j\) appears in the context of word \(i\) with weights: if the distance from \(j\) to \(i\) in a sentence is \(k\), then that occurrence of \(j\) contributes weight \(1/k\) to \(X_{ij}\).
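For example, take the pair (“be”, “life”) in our sentence: the second occurrence of “life” is at distance 1 from the first “be” and at distance 2 from the second “be” (the first occurrence of “life” lies outside the window), so \(X_{\text{be},\text{life}} = 1 + 1/2 = 1.5\), which is exactly the value printed above.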
The weights are controlled by the weights argument (the default is \((1, 1/2, \dots, 1/s)\)):
tcm <- create_tcm(it, vectorizer,
skip_grams_window = 3,
weights = c(1, 1, 1))
tcm
## 12 x 12 sparse Matrix of class "dgTMatrix"
## [[ suppressing 12 column names 'and', 'cease', 'flavour' ... ]]
##
## and . . 1 . . . 1 . 1 . 2 1
## cease . . . . 1 1 1 . . 1 1 1
## flavour . . . . . . . . 1 . 1 .
## if . . . . . 1 . 1 . . . 1
## it . . . . . 1 1 1 . 1 . 1
## predictable . . . . . . . 1 . 1 . 1
## to . . . . . . . . . 1 1 1
## were . . . . . . . . . 1 . 1
## without . . . . . . . . . . 1 1
## would . . . . . . . . . . 1 .
## be . . . . . . . . . . 1 2
## life . . . . . . . . . . . .
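To see the effect of the window size, we could also shrink it to \(s = 1\), so that only immediate neighbours are counted (a sketch; we will not use this TCM later):
tcm_s1 <- create_tcm(it, vectorizer,
                     skip_grams_window = 1)
as.matrix(tcm_s1) + t(as.matrix(tcm_s1)) # full symmetric TCM for s = 1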
First, we compute the TCM for the collection of cleaned financial tweets. We trim the vocabulary to keep only words that appear at least 5 times (otherwise the TCM would be too large).
skip_gram_window_size <- 3
tokens <- space_tokenizer(t$clean_text)
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it) %>% prune_vocabulary(5)
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer,
skip_grams_window = skip_gram_window_size)
cat("Dimensions of tcm =", dim(tcm), "\n")
## Dimensions of tcm = 8263 8263
Note that the TCM is huge, but it is a sparse matrix, i.e., most of its entries are zero. Below is the fraction of nonzero entries in the TCM:
sum(tcm > 0) / (length(tcm))
## [1] 0.003480798
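One way to appreciate the sparsity is to compare the memory footprint of the sparse TCM with that of a dense copy (a sketch; materializing the dense matrix itself takes a substantial amount of memory):
print(object.size(tcm), units = "Mb")            # sparse representation
print(object.size(as.matrix(tcm)), units = "Mb") # dense representation is much larger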
GloVe is trained by minimizing the loss function \[ L(w,\tilde{w}, b,\tilde{b})=\sum_{i,k=1}^{V}f(X_{ik})\left(w_i^{T} \tilde{w}_k + b_i + \tilde{b}_k - \log X_{ik} \right)^2 \]
The process of training is controlled by the following hyperparameters:
\(d\), dimension of word and context vectors, i.e., we have \[ w\in\mathbb{R}^d,\qquad \tilde{w}\in\mathbb{R}^d \]
The maximum number of iterations (epochs) used to decrease the loss function.
Convergence tolerance \(\varepsilon\) — we stop our iterations if the change of the loss function is below \(\varepsilon\).
The threshold \(x_{max}\) controlling the weights \(f(X_{ij})\). Specifically, we will have \(f(x)=1\) whenever \(x\ge x_{max}\).
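For reference, the weighting function proposed in the original GloVe paper has the form \[ f(x)=\begin{cases}(x/x_{max})^{\alpha}, & x< x_{max},\\ 1, & x\ge x_{max},\end{cases} \] with \(\alpha = 3/4\); the x_max argument passed to the model below corresponds to this threshold.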
Now we will train GloVe:
dim_word_emb <- 100
# Stage 2 - training GloVe itself
glove = GlobalVectors$new(rank = dim_word_emb, x_max = 10)
wv_main = glove$fit_transform(tcm, n_iter = 50, convergence_tol = 0.01)
## INFO [14:43:06.215] epoch 1, loss 0.5520
## INFO [14:43:06.457] epoch 2, loss 0.2605
## INFO [14:43:06.694] epoch 3, loss 0.1412
## INFO [14:43:06.907] epoch 4, loss 0.1029
## INFO [14:43:07.162] epoch 5, loss 0.0836
## INFO [14:43:07.464] epoch 6, loss 0.0703
## INFO [14:43:07.744] epoch 7, loss 0.0605
## INFO [14:43:08.029] epoch 8, loss 0.0530
## INFO [14:43:08.321] epoch 9, loss 0.0470
## INFO [14:43:08.612] epoch 10, loss 0.0422
## INFO [14:43:08.880] epoch 11, loss 0.0382
## INFO [14:43:09.161] epoch 12, loss 0.0349
## INFO [14:43:09.435] epoch 13, loss 0.0321
## INFO [14:43:09.730] epoch 14, loss 0.0297
## INFO [14:43:10.000] epoch 15, loss 0.0276
## INFO [14:43:10.302] epoch 16, loss 0.0257
## INFO [14:43:10.625] epoch 17, loss 0.0241
## INFO [14:43:10.922] epoch 18, loss 0.0227
## INFO [14:43:11.215] epoch 19, loss 0.0214
## INFO [14:43:11.498] epoch 20, loss 0.0203
## INFO [14:43:11.816] epoch 21, loss 0.0193
## INFO [14:43:12.111] epoch 22, loss 0.0184
## INFO [14:43:12.396] epoch 23, loss 0.0175
## INFO [14:43:12.684] epoch 24, loss 0.0168
## INFO [14:43:12.946] epoch 25, loss 0.0160
## INFO [14:43:13.245] epoch 26, loss 0.0154
## INFO [14:43:13.514] epoch 27, loss 0.0148
## INFO [14:43:13.810] epoch 28, loss 0.0142
## INFO [14:43:14.064] epoch 29, loss 0.0137
## INFO [14:43:14.374] epoch 30, loss 0.0132
## INFO [14:43:14.662] epoch 31, loss 0.0128
## INFO [14:43:14.953] epoch 32, loss 0.0124
## INFO [14:43:15.224] epoch 33, loss 0.0120
## INFO [14:43:15.535] epoch 34, loss 0.0116
## INFO [14:43:15.830] epoch 35, loss 0.0113
## INFO [14:43:16.127] epoch 36, loss 0.0109
## INFO [14:43:16.420] epoch 37, loss 0.0106
## INFO [14:43:16.704] epoch 38, loss 0.0103
## INFO [14:43:16.990] epoch 39, loss 0.0101
## INFO [14:43:17.290] epoch 40, loss 0.0098
## INFO [14:43:17.565] epoch 41, loss 0.0096
## INFO [14:43:17.853] epoch 42, loss 0.0093
## INFO [14:43:18.236] epoch 43, loss 0.0091
## INFO [14:43:18.665] epoch 44, loss 0.0089
## INFO [14:43:19.033] epoch 45, loss 0.0087
## INFO [14:43:19.341] epoch 46, loss 0.0085
## INFO [14:43:19.632] epoch 47, loss 0.0083
## INFO [14:43:19.932] epoch 48, loss 0.0081
## INFO [14:43:20.228] epoch 49, loss 0.0079
## INFO [14:43:20.540] epoch 50, loss 0.0078
wv_context = glove$components
word_vectors = wv_main + t(wv_context)
cat("Dim of word vector matrix =", dim(word_vectors))
## Dim of word vector matrix = 8263 100
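Summing the main and context vectors, as we do above, is the convention suggested in the original GloVe paper; using wv_main alone would also give usable word vectors.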
The output of GloVe is a matrix whose rows represent word vectors. Below are the first 5 words:
row.names(word_vectors)[1:5]
## [1] "0001140361" "0003" "005925" "024199" "032"
And here are entries 1-10 of word vectors for “share”, “bitcoin”, and “oil”:
word_vectors[c("share", "bitcoin", "oil") , 1:10] %>% round(3)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## share -0.035 0.510 0.756 -0.570 0.156 0.771 -0.014 -0.114 0.187 -0.464
## bitcoin -0.211 -0.904 -0.125 0.276 -0.170 -0.316 -0.615 -0.083 -0.723 0.356
## oil 0.649 0.178 -0.299 0.724 0.700 -0.156 -0.357 -0.029 0.372 -0.256
Recall that the cosine similarity between vectors \(x\) and \(y\) is \[ \cos\alpha = \frac{\sum_{i=1}^{d}x_iy_i}{\sqrt{\sum_{i=1}^{d}x_i^2}\times \sqrt{\sum_{i=1}^{d}y_i^2}} \] Below is the cosine similarity between words “share” and “stock”
cos_sim <- function(x, y) {
# computes cosine similarity between vectors x and y
sum(x*y) / sqrt(sum(x^2) * sum(y^2))
}
cos_sim(word_vectors["share" , ], word_vectors["stock" , ])
## [1] 0.1907162
Is it high or low?
Find the cosine similarity between “share” and the rest of the words in the vocabulary and print the 10 words most similar to “share”.
cos_sim_with_share <- function(x) cos_sim(word_vectors["share" , ], x)
word_vectors %>%
apply(1, cos_sim_with_share) %>%
sort(decreasing = TRUE) %>%
head(10)
## share per value 0 price rose as declined
## 1.0000000 0.7803328 0.4976375 0.4580223 0.4553215 0.4518101 0.4323855 0.3997106
## earnings 69
## 0.3821200 0.3806704
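The same recipe can be wrapped into a small helper that works for any word in the vocabulary (a sketch; most_similar is our own name, not a text2vec function):
most_similar <- function(word, n = 10) {
  # cosine similarity of every word vector with the vector of `word`
  sims <- apply(word_vectors, 1, function(x) cos_sim(x, word_vectors[word, ]))
  head(sort(sims, decreasing = TRUE), n)
}
most_similar("bitcoin")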
To assign vectors to entire documents, we do the following: \[ v(\mbox{document})=\sum_{x\in\mbox{document}}v(x) \] This means that our data matrix is the product \[ \mbox{DTM}\cdot \mbox{GLOVE}, \] where DTM is the document-term matrix and GLOVE is the matrix of word vectors. First, we need to create the DTM:
dtm <- create_dtm(it, vectorizer)
cat("DTM dimensions =", dim(dtm), "\n")
## DTM dimensions = 10350 8263
Now we prepare the CBOW data. Below is a sample:
cbow_data <- as.matrix(dtm %*% word_vectors)
cbow_data[1:7, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## 1 -0.004207416 -1.1614794 1.296112 0.8672684 -0.5383059
## 2 -0.104987032 0.7476035 3.358613 1.7254446 11.6405303
## 3 -0.404584743 0.4814405 3.068303 0.5457207 1.4859654
## 4 -0.444753915 -6.2681936 1.106205 2.0391905 3.1810907
## 5 -0.630572541 1.9021055 4.792348 -0.9225555 -0.0770353
## 6 1.528577519 -1.0096101 4.635813 1.4376962 -0.1481463
## 7 0.221412420 -2.5750533 1.682572 2.6312903 -1.4227330
Now if we have some statement or a question, we can look for documents, i.e., tweets, that have the largest cosine similarity to our statement. For example, consider the statement: “Biotech is going to raise”. Let us first compute a mini-dtm for it (1 row):
simple_text <- "biotech is going to raise"
new_dtm <- simple_text %>%
space_tokenizer() %>%
itoken(progressbar = FALSE) %>%
create_dtm(vectorizer)
new_dtm %>%
as.matrix %>%
as.data.frame %>%
select(biotech, going, raise, stock, price)
Then, for every tweet (i.e., every row of our CBOW matrix), we compute its cosine similarity with the statement, sort by decreasing similarity, and print the first 10 documents:
vec_statement <- as.matrix(new_dtm %*% word_vectors)
t_with_sim <- t %>%
mutate(similarity = apply(cbow_data, 1, function(x) cos_sim(x, vec_statement))) %>%
arrange(-similarity)
t_with_sim %>%
select(text, similarity) %>%
head(n = 10)
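Finally, the whole retrieval pipeline can be packaged into one function, which makes it easy to try other queries (a sketch; search_tweets is our own name, and we additionally run the query through clean_tweets so that it matches the cleaned tweets):
search_tweets <- function(query, n = 10) {
  # clean and tokenize the query in the same way as the tweets,
  # then turn it into a 1-row DTM
  query_dtm <- query %>%
    clean_tweets() %>%
    space_tokenizer() %>%
    itoken(progressbar = FALSE) %>%
    create_dtm(vectorizer)
  # CBOW vector of the query
  query_vec <- as.matrix(query_dtm %*% word_vectors)
  # rank all tweets by cosine similarity to the query vector
  t %>%
    mutate(similarity = apply(cbow_data, 1, function(x) cos_sim(x, query_vec))) %>%
    arrange(-similarity) %>%
    select(text, similarity) %>%
    head(n)
}
search_tweets("oil prices are falling")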