The aim of this lab is to acquire basic familiarity with modern word embedding techniques, such as GloVe.
We will train GloVe on Donald Trump’s tweets and use it for two downstream tasks: clustering and regression. Note that for real applications one would usually train GloVe on a very large corpus of diverse tweets (not just Trump’s) first and then use the resulting word vectors to process Trump’s tweets.
However, we will work with the data we have.
By the end of this lab session, students should be able to:
Train GloVe in R
Integrate GloVe with clustering
Integrate GloVe with supervised learning
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. There will be questions that either require a short answer (type your answer right in this document) or ask you to modify the R code (modify it here). In either case, you can discuss your work with the lab instructor.
First we will load the libraries, the data, and do some simple preprocessing.
library(tidyverse) # for manipulation with data
library(text2vec) # for GloVe
library(tm) # for text processing
library(wordcloud) # for visualizations
library(caret) # for supervised learning (train/test split, cross-validation, RMSE)
set.seed(142)
t <- read.csv("trump_twitter.csv", stringsAsFactors = FALSE) %>%
as_tibble %>%
filter(nchar(text) <= 280) %>%
filter(!is_retweet == "true") %>%
select(-is_retweet) %>%
mutate(favorite_count = as.numeric(favorite_count)) %>%
filter(!is.na(favorite_count))
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
t %>%
select(text, created_at, retweet_count) %>%
sample_n(10)
We will first do some very basic cleaning: changing all characters to lower case and replacing everything except letters and digits with a space.
clean_tweets <- function(x) {
x %>%
gsub('[^A-Za-z0-9]', ' ', . ) %>%
tolower
}
t <- t %>%
mutate(clean_text = clean_tweets(text))
t %>%
select(text, clean_text) %>%
sample_n(10)
Here is the word cloud for Trump’s tweets:
visualize_text <- function(x) {
# x is a character vector
# the function computes word frequencies, removes stop words,
# and plots a word cloud of the most frequent words
frequent_words <- termFreq(x)
frequent_words <- frequent_words[!(names(frequent_words) %in% stopwords())]
wordcloud(words = names(frequent_words),
freq = frequent_words, min.freq = 0,
max.words = 50, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
}
visualize_text(t$clean_text)
## Warning in wordcloud(words = names(frequent_words), freq = frequent_words, :
## realdonaldtrump could not be fit on page. It will not be plotted.
To understand the term co-occurrence matrix (TCM), let us explore it on a simple example.
Below we create a simple character vector, compute its TCM \(X\), and print it. Note that the default method of printing a TCM:
does not show column names;
shows dots instead of zeroes;
shows dots for \(X_{ij}\) when \(i>j\) (only the upper triangle is stored), even though \(X_{ij}=X_{ji}\).
simple_text <- c("face masks are driving me crazy",
"only crazy people are not wearing face masks",
"face masks are better than face shields")
simple_text <- "just because you fail once does not mean you are going to fail at everything"
simple_text <- "if life were predictable it would cease to be life and be without flavour"
tokens <- space_tokenizer(simple_text)
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer,
skip_grams_window = 3)
tcm
## 12 x 12 sparse Matrix of class "dgTMatrix"
## [[ suppressing 12 column names 'and', 'cease', 'flavour' ... ]]
##
## and . . 0.3333333 . . . 0.3333333 . 0.5 . 1.5000000
## cease . . . . 0.5 0.3333333 1.0000000 . . 1.0000000 0.5000000
## flavour . . . . . . . . 1.0 . 0.5000000
## if . . . . . 0.3333333 . 0.5 . . .
## it . . . . . 1.0000000 0.3333333 0.5 . 1.0000000 .
## predictable . . . . . . . 1.0 . 0.5000000 .
## to . . . . . . . . . 0.5000000 1.0000000
## were . . . . . . . . . 0.3333333 .
## without . . . . . . . . . . 1.0000000
## would . . . . . . . . . . 0.3333333
## be . . . . . . . . . . 0.3333333
## life . . . . . . . . . . .
##
## and 1.0000000
## cease 0.3333333
## flavour .
## if 1.0000000
## it 0.3333333
## predictable 0.5000000
## to 0.5000000
## were 1.0000000
## without 0.3333333
## would .
## be 1.5000000
## life .
To print the TCM with column names and zeroes, we convert it to an ordinary (dense) matrix; below we also symmetrize it by adding its transpose and show the columns for "be" and "and":
(t(as.matrix(tcm))+as.matrix(tcm))[ , c("be", "and")]
## be and
## and 1.5000000 0.0000000
## cease 0.5000000 0.0000000
## flavour 0.5000000 0.3333333
## if 0.0000000 0.0000000
## it 0.0000000 0.0000000
## predictable 0.0000000 0.0000000
## to 1.0000000 0.3333333
## were 0.0000000 0.0000000
## without 1.0000000 0.5000000
## would 0.3333333 0.0000000
## be 0.6666667 1.5000000
## life 1.5000000 1.0000000
Print the full TCM matrix, including the lower-triangular part.
as.matrix(tcm) + t(as.matrix(tcm))
## and cease flavour if it predictable
## and 0.0000000 0.0000000 0.3333333 0.0000000 0.0000000 0.0000000
## cease 0.0000000 0.0000000 0.0000000 0.0000000 0.5000000 0.3333333
## flavour 0.3333333 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## if 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.3333333
## it 0.0000000 0.5000000 0.0000000 0.0000000 0.0000000 1.0000000
## predictable 0.0000000 0.3333333 0.0000000 0.3333333 1.0000000 0.0000000
## to 0.3333333 1.0000000 0.0000000 0.0000000 0.3333333 0.0000000
## were 0.0000000 0.0000000 0.0000000 0.5000000 0.5000000 1.0000000
## without 0.5000000 0.0000000 1.0000000 0.0000000 0.0000000 0.0000000
## would 0.0000000 1.0000000 0.0000000 0.0000000 1.0000000 0.5000000
## be 1.5000000 0.5000000 0.5000000 0.0000000 0.0000000 0.0000000
## life 1.0000000 0.3333333 0.0000000 1.0000000 0.3333333 0.5000000
## to were without would be life
## and 0.3333333 0.0000000 0.5000000 0.0000000 1.5000000 1.0000000
## cease 1.0000000 0.0000000 0.0000000 1.0000000 0.5000000 0.3333333
## flavour 0.0000000 0.0000000 1.0000000 0.0000000 0.5000000 0.0000000
## if 0.0000000 0.5000000 0.0000000 0.0000000 0.0000000 1.0000000
## it 0.3333333 0.5000000 0.0000000 1.0000000 0.0000000 0.3333333
## predictable 0.0000000 1.0000000 0.0000000 0.5000000 0.0000000 0.5000000
## to 0.0000000 0.0000000 0.0000000 0.5000000 1.0000000 0.5000000
## were 0.0000000 0.0000000 0.0000000 0.3333333 0.0000000 1.0000000
## without 0.0000000 0.0000000 0.0000000 0.0000000 1.0000000 0.3333333
## would 0.5000000 0.3333333 0.0000000 0.0000000 0.3333333 0.0000000
## be 1.0000000 0.0000000 1.0000000 0.3333333 0.6666667 1.5000000
## life 0.5000000 1.0000000 0.3333333 0.0000000 1.5000000 0.0000000
In the lecture, \(X_{ij}\) was defined as the number of times word \(j\) appears in the context of word \(i\), i.e., within distance \(s\) of word \(i\) in a sentence. The computation of the TCM is therefore controlled by the window size \(s\), a hyperparameter of GloVe.
Below we compute and print the TCM for \(s=3\):
tcm <- create_tcm(it, vectorizer,
skip_grams_window = 3)
tcm
## 12 x 12 sparse Matrix of class "dgTMatrix"
## [[ suppressing 12 column names 'and', 'cease', 'flavour' ... ]]
##
## and . . 0.3333333 . . . 0.3333333 . 0.5 . 1.5000000
## cease . . . . 0.5 0.3333333 1.0000000 . . 1.0000000 0.5000000
## flavour . . . . . . . . 1.0 . 0.5000000
## if . . . . . 0.3333333 . 0.5 . . .
## it . . . . . 1.0000000 0.3333333 0.5 . 1.0000000 .
## predictable . . . . . . . 1.0 . 0.5000000 .
## to . . . . . . . . . 0.5000000 1.0000000
## were . . . . . . . . . 0.3333333 .
## without . . . . . . . . . . 1.0000000
## would . . . . . . . . . . 0.3333333
## be . . . . . . . . . . 0.3333333
## life . . . . . . . . . . .
##
## and 1.0000000
## cease 0.3333333
## flavour .
## if 1.0000000
## it 0.3333333
## predictable 0.5000000
## to 0.5000000
## were 1.0000000
## without 0.3333333
## would .
## be 1.5000000
## life .
Note that the entries of the TCM are fractional. This happens because, by default, occurrences are counted with weights: if the distance from word \(j\) to word \(i\) in a sentence is \(k\), then that occurrence contributes weight \(1/k\) to \(X_{ij}\). For example, in the sentence above the two occurrences of "be" are at distances 1 and 2 from the second "life", so the entry for the pair ("be", "life") is \(1+1/2=1.5\).
The weights are controlled by the weights argument (the default is \((1,1/2,\dots,1/s)\)); setting them all to 1 gives plain co-occurrence counts:
tcm <- create_tcm(it, vectorizer,
skip_grams_window = 3,
weights = c(1, 1, 1))
tcm
## 12 x 12 sparse Matrix of class "dgTMatrix"
## [[ suppressing 12 column names 'and', 'cease', 'flavour' ... ]]
##
## and . . 1 . . . 1 . 1 . 2 1
## cease . . . . 1 1 1 . . 1 1 1
## flavour . . . . . . . . 1 . 1 .
## if . . . . . 1 . 1 . . . 1
## it . . . . . 1 1 1 . 1 . 1
## predictable . . . . . . . 1 . 1 . 1
## to . . . . . . . . . 1 1 1
## were . . . . . . . . . 1 . 1
## without . . . . . . . . . . 1 1
## would . . . . . . . . . . 1 .
## be . . . . . . . . . . 1 2
## life . . . . . . . . . . . .
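To see where these counts come from, we can verify one entry by brute force. Below is a minimal sketch; it reuses tokens from the small example above and counts, for each occurrence of "be", the occurrences of "life" within the window:
# brute-force check of the unweighted co-occurrence count for the pair ("be", "life")
words <- tokens[[1]]  # tokens of the example sentence
s <- 3                # window size used above
count <- 0
for (i in seq_along(words)) {
  for (j in seq_along(words)) {
    if (words[i] == "be" && words[j] == "life" && abs(i - j) %in% 1:s) {
      count <- count + 1  # each pair within the window contributes 1
    }
  }
}
count  # should equal 2, matching the ("be", "life") entry of the unweighted TCM above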
First, we compute the TCM for the collection of cleaned Trump tweets. We prune the vocabulary to keep only words that appear at least 5 times (otherwise the TCM would be too large).
skip_gram_window_size <- 3
tokens <- space_tokenizer(t$clean_text)
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it) %>% prune_vocabulary(5)
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer,
skip_grams_window = skip_gram_window_size)
cat("Dimensions of tcm =", dim(tcm), "\n")
## Dimensions of tcm = 5371 5371
Note that the TCM is huge, but it is a sparse matrix, i.e., most of its entries are zero. Below is the fraction of nonzero entries in the TCM:
sum(tcm > 0) / (length(tcm))
## [1] 0.008919585
GloVe is trained by gradient descent minimizing the loss function \[ L(w,\tilde{w}, b,\tilde{b})=\sum_{i,k=1}^{V}f(X_{ik})\left(w_i^{T} \tilde{w}_k + b_i + \tilde{b}_k - \log X_{ik} \right)^2 \]
The gradient descent is controlled by the following parameters:
\(d\), dimension of word and context vectors, i.e., we have \[ w\in\mathbb{R}^d,\qquad \tilde{w}\in\mathbb{R}^d \]
The maximum number of iterations after which we stop gradient descent
Convergence tolerance \(\varepsilon\) — we stop gradient descent if the change of the loss function is below \(\varepsilon\).
The threshold \(x_{max}\) controlling the weights \(f(X_{ij})\). Specifically, we will have \(f(x)=1\) whenever \(x\ge x_{max}\); a sketch of \(f\) is shown below.
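For reference, the weighting function used in the original GloVe paper is \(f(x)=(x/x_{max})^{3/4}\) for \(x<x_{max}\) and \(f(x)=1\) otherwise; in text2vec the exponent corresponds to the alpha argument of GlobalVectors$new(). A minimal sketch of \(f\):
# the GloVe weighting function f from the loss above (alpha = 3/4 as in the GloVe paper)
glove_weight <- function(x, x_max = 10, alpha = 0.75) {
  ifelse(x < x_max, (x / x_max)^alpha, 1)
}
glove_weight(c(0.5, 1, 5, 10, 100))  # rare co-occurrences are down-weighted; f equals 1 at and above x_max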
Now we will train GloVe on Trump’s tweets.
dim_word_emb <- 100
# stage 1 (building the TCM) was done above; stage 2 is training GloVe itself
glove = GlobalVectors$new(rank = dim_word_emb, x_max = 10)
wv_main = glove$fit_transform(tcm, n_iter = 50, convergence_tol = 0.01)
## INFO [20:45:28.279] epoch 1, loss 0.2681
## INFO [20:45:28.635] epoch 2, loss 0.1384
## INFO [20:45:28.994] epoch 3, loss 0.0981
## INFO [20:45:29.333] epoch 4, loss 0.0781
## INFO [20:45:29.664] epoch 5, loss 0.0654
## INFO [20:45:29.999] epoch 6, loss 0.0562
## INFO [20:45:30.325] epoch 7, loss 0.0492
## INFO [20:45:30.651] epoch 8, loss 0.0437
## INFO [20:45:30.980] epoch 9, loss 0.0392
## INFO [20:45:31.305] epoch 10, loss 0.0357
## INFO [20:45:31.631] epoch 11, loss 0.0327
## INFO [20:45:31.973] epoch 12, loss 0.0302
## INFO [20:45:32.299] epoch 13, loss 0.0280
## INFO [20:45:32.626] epoch 14, loss 0.0262
## INFO [20:45:32.951] epoch 15, loss 0.0246
## INFO [20:45:33.276] epoch 16, loss 0.0232
## INFO [20:45:33.603] epoch 17, loss 0.0220
## INFO [20:45:33.927] epoch 18, loss 0.0209
## INFO [20:45:34.250] epoch 19, loss 0.0199
## INFO [20:45:34.576] epoch 20, loss 0.0190
## INFO [20:45:34.904] epoch 21, loss 0.0182
## INFO [20:45:35.230] epoch 22, loss 0.0175
## INFO [20:45:35.555] epoch 23, loss 0.0169
## INFO [20:45:35.880] epoch 24, loss 0.0163
## INFO [20:45:36.204] epoch 25, loss 0.0158
## INFO [20:45:36.526] epoch 26, loss 0.0153
## INFO [20:45:36.858] epoch 27, loss 0.0148
## INFO [20:45:37.179] epoch 28, loss 0.0144
## INFO [20:45:37.505] epoch 29, loss 0.0140
## INFO [20:45:37.833] epoch 30, loss 0.0136
## INFO [20:45:38.158] epoch 31, loss 0.0133
## INFO [20:45:38.484] epoch 32, loss 0.0130
## INFO [20:45:38.807] epoch 33, loss 0.0127
## INFO [20:45:39.131] epoch 34, loss 0.0124
## INFO [20:45:39.455] epoch 35, loss 0.0121
## INFO [20:45:39.790] epoch 36, loss 0.0119
## INFO [20:45:40.119] epoch 37, loss 0.0116
## INFO [20:45:40.444] epoch 38, loss 0.0114
## INFO [20:45:40.771] epoch 39, loss 0.0112
## INFO [20:45:41.097] epoch 40, loss 0.0110
## INFO [20:45:41.423] epoch 41, loss 0.0108
## INFO [20:45:41.746] epoch 42, loss 0.0106
## INFO [20:45:42.069] epoch 43, loss 0.0104
## INFO [20:45:42.397] epoch 44, loss 0.0103
## INFO [20:45:42.726] epoch 45, loss 0.0101
## INFO [20:45:43.054] epoch 46, loss 0.0100
## INFO [20:45:43.386] epoch 47, loss 0.0098
## INFO [20:45:43.707] epoch 48, loss 0.0097
## INFO [20:45:44.032] epoch 49, loss 0.0096
## INFO [20:45:44.357] epoch 50, loss 0.0094
wv_context = glove$components
word_vectors = wv_main + t(wv_context)
cat("Dim of word vector matrix =", dim(word_vectors))
## Dim of word vector matrix = 5371 100
The output of GloVe is a matrix whose rows represent word vectors. Below are the first 5 words:
row.names(word_vectors)[1:5]
## [1] "000" "03" "06" "09" "16th"
And here are entries 1-10 of word vectors for “obama”, “democrats”, and “hillary”:
word_vectors[c("obama", "democrats", "hillary") , 1:10] %>% round(3)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## obama -0.548 0.529 -0.254 0.145 -0.260 -0.854 0.721 0.428 -0.335 -0.337
## democrats -0.311 -0.185 0.565 -1.290 -0.146 -1.206 0.528 0.394 -0.328 -0.768
## hillary -0.223 -0.092 -0.304 -0.240 -0.661 0.528 -0.707 -0.184 0.164 -0.870
Recall that the cosine similarity between vectors \(x\) and \(y\) is \[ \cos\alpha = \frac{\sum_{i=1}^{d}x_iy_i}{\sqrt{\sum_{i=1}^{d}x_i^2}\times \sqrt{\sum_{i=1}^{d}y_i^2}} \] Below is the cosine similarity between words “obama” and “hillary”
cos_sim <- function(x, y) {
# computes cosine similarity between vectors x and y
sum(x*y) / sqrt(sum(x^2) * sum(y^2))
}
cos_sim(word_vectors["obama" , ], word_vectors["hillary" , ])
## [1] 0.3425258
Is it high or low?
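One way to judge is to compare this value with the typical cosine similarity between randomly chosen pairs of words from the same vocabulary. A rough sketch (the exact numbers will vary from run to run):
# distribution of cosine similarities for 1000 random word pairs
random_pair_sims <- replicate(1000, {
  pair <- sample(rownames(word_vectors), 2)
  cos_sim(word_vectors[pair[1], ], word_vectors[pair[2], ])
})
summary(random_pair_sims)  # compare 0.34 for "obama"/"hillary" with this baseline distribution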
Find the cosine similarity between "hillary" and the rest of the words in Trump’s vocabulary and print the 10 words most similar to "hillary":
cos_sim_with_hillary <- function(x) cos_sim(word_vectors["hillary" , ], x)
word_vectors %>%
apply(1, cos_sim_with_hillary) %>%
sort(decreasing = TRUE) %>%
head(10)
## hillary crooked clinton said she that
## 1.0000000 0.8267952 0.7831464 0.4326855 0.4311580 0.4216134
## beat bill her barackobama
## 0.4090511 0.4040966 0.3885555 0.3698914
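The same ranking can also be obtained with text2vec’s sim2() function, which computes cosine similarities between all rows of two matrices in one operation:
# cosine similarity of every word vector with the vector for "hillary"
sim_to_hillary <- sim2(word_vectors,
                       word_vectors["hillary", , drop = FALSE],
                       method = "cosine", norm = "l2")
sim_to_hillary[, 1] %>% sort(decreasing = TRUE) %>% head(10)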
The first step in combining GloVe with downstream tasks is to represent each document by the sum of the vectors of its words (a continuous bag of words, CBOW): \[ v(\mbox{document})=\sum_{x\in\mbox{document}}v(x) \] In matrix form, the resulting data matrix is \[ \mbox{DTM}\cdot W, \] where DTM is the document-term matrix and \(W\) is the matrix of word vectors. First, we need to create the DTM:
dtm <- create_dtm(it, vectorizer)
cat("DTM dimensions =", dim(dtm), "\n")
## DTM dimensions = 20112 5371
Now we prepare the CBOW data. Below is a sample:
cbow_data <- as.matrix(dtm %*% word_vectors)
cbow_data[1:7, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## 1 -4.3766615 -3.4506032 -4.9521222 -4.9147031 -2.7269021
## 2 -0.5802461 0.6961936 -4.1253416 -2.4991573 -0.3017326
## 3 -2.2474253 1.9739904 -0.7344964 -1.0040018 -2.7986620
## 4 -2.3832234 1.5288176 -2.1396670 -2.7830770 -0.4994253
## 5 0.3931639 0.4732510 0.1472064 -4.5483078 -2.9581171
## 6 -8.6829010 -5.8134600 -1.6795864 -7.5044877 -8.6582221
## 7 -4.6194824 -4.1906928 -6.7806423 -0.1340697 -2.7793685
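As a sanity check, the CBOW vector of a tweet should equal the sum of the vectors of its in-vocabulary words. A minimal sketch for the first tweet:
toks_1 <- space_tokenizer(t$clean_text[1])[[1]]       # same tokenization as for the DTM
toks_1 <- toks_1[toks_1 %in% rownames(word_vectors)]  # keep only in-vocabulary words
manual_cbow_1 <- colSums(word_vectors[toks_1, , drop = FALSE])
max(abs(manual_cbow_1 - cbow_data[1, ]))              # should be (numerically) zero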
Let us cluster Trump’s tweets using \(K\)-means with \(K=10\). This table shows the number of tweets in each cluster:
trum_tw_clust <-
kmeans(cbow_data, 10)
table(trum_tw_clust$cluster)
##
## 1 2 3 4 5 6 7 8 9 10
## 880 3998 1580 1622 1198 2380 1829 2614 1658 2134
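The choice \(K=10\) is arbitrary. One common heuristic for exploring other values of \(K\) is the "elbow" plot of the total within-cluster sum of squares; a sketch (this can be slow on ~20,000 tweets):
# total within-cluster sum of squares for K = 2, ..., 15
wss <- sapply(2:15, function(k) kmeans(cbow_data, centers = k, nstart = 5)$tot.withinss)
plot(2:15, wss, type = "b", xlab = "K", ylab = "total within-cluster sum of squares")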
And here are word clouds by cluster:
trump_cloud_by_cluster <- function(k) {
t %>%
slice(which(trum_tw_clust$cluster == k)) %>%
pull(clean_text) %>%
visualize_text()
}
1:10 %>% lapply(trump_cloud_by_cluster) %>% invisible()
Write an R function that prints a sample of 5 tweets from a given cluster and apply it to the list of clusters so that it prints 5 tweets from each cluster.
trump_tweets_by_cluster <- function(k) {
t %>%
slice(which(trum_tw_clust$cluster == k)) %>%
sample_n(5) %>%
pull(clean_text)
}
1:10 %>% lapply(trump_tweets_by_cluster) %>% print
## [[1]]
## [1] "thank you brandon https t co vfzwvhyiru"
## [2] "rep alexandria ocasio cortez is correct the va is not broken it is doing great but that is only because of the trump administration we got veterans choice amp accountability passed president trump deserves a lot of credit dan caldwell concerned veterans of america"
## [3] "rt drudgefeed woodward investigate fbicia handling of garbage dossier https t co py5awzprfy"
## [4] "rt gopchairwoman the rnc smashed another fundraising record last month between realdonaldtrump s accomplishments and our grassroots in "
## [5] " top of the fbi brennan started this entire debacle about president trump we now know that brennan had detailed knowledge of the phony dossier he knows about the dossier he denies knowledge of the dossier he briefs the gang of 8 on the hill about the dossier which "
##
## [[2]]
## [1] "senator luther strange has gone up a lot in the polls since i endorsed him a month ago now a close runoff he will be great in d c "
## [2] " democratic operative caught on camera hillary personally ordered donald duck troll campaign that broke the law https t co strehafyuh"
## [3] " ny conservative barbmuenchen crippledamerica is an excellent resource for the 2016 race teamtrump trump2016 makeamericagreatagain"
## [4] " staciann realdonaldtrump go leezagibbons classiest lady to ever grace the show "
## [5] " mittromney has shown last week that he will campaign aggressively against barackobama i am confident he can defeat barackobama "
##
## [[3]]
## [1] " wino911 trump2016 we know better than to trust the rnc https t co emcgtk25fe so cute "
## [2] "buy american amp hire american are the principles at the core of my agenda which is jobs jobs jobs thank you exxonmobil "
## [3] "does anybody think that cnbc will get their fictitious polling numbers corrected sometime prior to the start of the debate sad "
## [4] "the reason ed schultz said nice things about me is that i m the only repub who won t cut social security etc i ll make america rich again "
## [5] "i can t resist hitting lightweight dannyzuker verbally when he starts up because he is just so pathetic and easy stupid "
##
## [[4]]
## [1] " moudalena realdonaldtrump nypost fud31 godbless you mister realdonaldtrump"
## [2] " akd72 realdonaldtrump i bet you will make hell of a president the world needs presidents like you"
## [3] " 05fxdli realdonaldtrump as usual great pinpointed analysis of the world situation i enjoy your frankness and honesty thanks "
## [4] " jwhurter realdonaldtrump are your ties available in south africa you can order them online macys com"
## [5] "scott gottlieb who has done an absolutely terrific job as commissioner of the fda plans to leave government service sometime next month "
##
## [[5]]
## [1] " pjwj316 realdonaldtrump yes america needs realdonaldtrump for president can t wait for america to be great again "
## [2] "rt jessebwatters tomorrow senator lindseygrahamsc splashes into wattersworld to discuss his relationship with president realdonaldtru "
## [3] "with the great vote on cutting taxes this could be a big day for the stock market and you "
## [4] " trufacts 101 realdonaldtrump fabiolasellsnj anything with the trump name is epic thanks "
## [5] "rt mike pence america needs 4 more years of prosperity 4 more years of security and 4 more years of president realdonaldtrump keepam "
##
## [[6]]
## [1] " ryan died on a winning mission according to general mattis not a failure time for the u s to get smart and start winning again "
## [2] " tomuchpolitics realdonaldtrump i would like to hear the answer to that johnny football question with a little good guidance great "
## [3] "remember as a senator obama did not vote for increasing the debt ceiling http t co wtq96itg i guess things change when president "
## [4] "general john kelly is doing a fantastic job as chief of staff there is tremendous spirit and talent in the w h don t believe the fake news"
## [5] "rick perry a good man a great family and a patriot "
##
## [[7]]
## [1] "major story that the dems are making up phony polls in order to suppress the the trump we are going to win "
## [2] " donnieboysmith realdonaldtrump in contrast to rubio and cruz you look like a giant they look terribly weak thank you "
## [3] "the tonight show begins in 5 minutes enjoy "
## [4] "the washington times presidential debate poll trump 77 18290 clinton 17 4100 draintheswamp debate https t co wsgsf5nv6h"
## [5] "i love watching these poor pathetic people pundits on television working so hard and so seriously to try and figure me out they can t "
##
## [[8]]
## [1] "americans who can afford to buy enough food is now at a 3 year low is this barackobama s recovery http t co laei0sfa"
## [2] " carisa01huston realdonaldtrump you rock ditch the haters and keep speaking the truth they just don t like what they hear "
## [3] " tx shaun realdonaldtrump happy birthday donald thanks for never being afraid to tell it like it is"
## [4] " much as it pays to emphasize the positive there are times when the only choice is confrontation the art of the deal"
## [5] "is the boston killer eligible for obama care to bring him back to health "
##
## [[9]]
## [1] "have a great game today usarmy and usnavy i will be watching we love our u s military on behalf of an entire nation thank you for your sacrifice and service armynavygame usa https t co 8m1w9rfwih"
## [2] "via pvpatch by paige austin trump to donate 12 acres for conservation in palos verdes http t co csif8rp7qe"
## [3] " when you can t make them see the light make them feel the heat ronald reagan"
## [4] "i m with you i will work hard and never let you down make america great again https t co vzzz6m5k9t"
## [5] " stevieboy 63 talkernewyorker 5gentexan hi katie let s get donald trump in the wh he s the man to get this country back in order "
##
## [[10]]
## [1] " killianzane i ll be the vp for realdonaldtrump when he is president you would be better than what we have now "
## [2] " cashmoneybonas realdonaldtrump i wish i was watching your presidency campaign instead of the oscars"
## [3] "i really enjoyed being at the iowa state fair the crowds love and enthusiasm is something i will never forget "
## [4] "thank you albany new york makeamericagreatagain trump2016https t co iaousy5vdc https t co wcituuxej8"
## [5] "the best investors are visionaries they look beyond the present "
Let us try to predict the retweet count from the text of a tweet. We will use our CBOW matrix as the feature matrix and fit a LASSO model. The response variable will be the logarithm of the retweet count, because the distribution of the raw counts is highly skewed (as shown below).
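To see the skewness that motivates the log transform, we can glance at the raw counts; a quick sketch:
summary(t$retweet_count)                  # raw retweet counts are heavily right-skewed
hist(log(t$retweet_count + 1), breaks = 50,
     main = "log(1 + retweet_count)", xlab = "")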
We now prepare the data and split it into training and test sets:
set.seed(42)
all_data <- cbow_data %>%
as_tibble %>%
mutate(Y = log(t$retweet_count + 1))
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
ind <- sample(1:nrow(all_data), size = 15000)
train_data <- all_data %>% slice(ind)
test_data <- all_data %>% slice(-ind)
cat("Train data dim =", dim(train_data), "\n")
## Train data dim = 15000 101
cat("Test data dim =", dim(test_data), "\n")
## Test data dim = 5112 101
Now we will train a LASSO model:
lambda <- 10^seq(-4, 0 , length = 20)
lasso <- train(
Y ~., data = train_data, method = "glmnet",
trControl = trainControl("cv", number = 5),
tuneGrid = expand.grid(alpha = 1, lambda = lambda),
preProcess = c("scale")
)
lasso
## glmnet
##
## 15000 samples
## 100 predictor
##
## Pre-processing: scaled (100)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 12000, 12000, 12001, 11999, 12000
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0001000000 2.275811 0.4572173 1.859601
## 0.0001623777 2.275811 0.4572173 1.859601
## 0.0002636651 2.275811 0.4572173 1.859601
## 0.0004281332 2.275783 0.4572285 1.859635
## 0.0006951928 2.275729 0.4572493 1.859807
## 0.0011288379 2.275766 0.4572265 1.860243
## 0.0018329807 2.275975 0.4571229 1.861082
## 0.0029763514 2.276485 0.4568798 1.862437
## 0.0048329302 2.277759 0.4562926 1.864911
## 0.0078475997 2.280578 0.4550295 1.869938
## 0.0127427499 2.285699 0.4527852 1.878251
## 0.0206913808 2.295684 0.4484331 1.892225
## 0.0335981829 2.310761 0.4421048 1.913860
## 0.0545559478 2.337410 0.4308059 1.950286
## 0.0885866790 2.380668 0.4120822 2.007097
## 0.1438449888 2.439634 0.3868277 2.079145
## 0.2335721469 2.528904 0.3452911 2.176738
## 0.3792690191 2.615447 0.3098429 2.267878
## 0.6158482111 2.730466 0.2630414 2.396190
## 1.0000000000 2.857540 0.2509167 2.531932
##
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.0006951928.
The test RMSE is
lasso %>%
predict(test_data) %>%
RMSE(test_data$Y)
## [1] 2.300238
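For context, here is a sketch of the RMSE of a trivial baseline that always predicts the mean of \(Y\) on the training set:
baseline_pred <- rep(mean(train_data$Y), nrow(test_data))
RMSE(baseline_pred, test_data$Y)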
Finally, a natural next step (not covered in detail in this lab) is to load a pre-trained GloVe model, trained on a much larger corpus, and build a neural network on top of it.
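A minimal sketch of the loading step, assuming a plain-text file of pre-trained vectors (the file name glove.6B.100d.txt is an assumption; it is one of the files distributed by the Stanford GloVe project, with each line containing a word followed by its vector):
# read pre-trained GloVe vectors from a plain-text file into a matrix with words as row names
read_glove <- function(path) {
  lines <- readLines(path)
  parts <- strsplit(lines, " ", fixed = TRUE)
  words <- vapply(parts, function(p) p[1], character(1))
  vecs  <- t(vapply(parts, function(p) as.numeric(p[-1]),
                    numeric(length(parts[[1]]) - 1)))
  rownames(vecs) <- words
  vecs
}
# pretrained_wv <- read_glove("glove.6B.100d.txt")
# pretrained_wv["obama", 1:10]
The resulting matrix could then replace word_vectors in the CBOW step above, after which any model (including a neural network) can be fit on the CBOW features.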