The aim of this lab is to acquire basic familiarity with modern word embedding techniques, such as GloVe.
We will train GloVe on Donald Trump’s tweets and use the resulting embeddings for downstream tasks: clustering and regression. Note that for real applications one would usually train GloVe on a very large corpus of tweets (not just Trump’s) first and then use the embeddings to process Trump’s tweets.
However, we will work with the data that we have.
By the end of this lab session, students should be able to:
Train GloVe in R
Integrate GloVe with clustering
Integrate GloVe with supervised learning
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. There will be questions that either require a short answer (type your answer right in this document) or a modification of the R code (modify the code here). In either case, you can discuss your work with the lab instructor.
First we will load the libraries, the data, and do some simple preprocessing.
library(tidyverse) # for manipulation with data
library(tidytext) # for tidyverse-style text tokenization
library(broom) # for tidying data
library(text2vec) # for GloVe
library(tm) # for text processing
library(ggwordcloud) # for visualizations
library(patchwork) # for arranging several plots
library(caret) # for training ML models
library(janitor) # for helper functions
library(gt) # for nicely formatted tables
set.seed(1729)
### Here we load the data and do some basic cleaning
t <- read_csv("trump_twitter.csv") %>%
filter(nchar(text) <= 280) %>%
filter(is_retweet != "true") %>%
select(-is_retweet) %>%
mutate(favorite_count = as.numeric(favorite_count)) %>%
drop_na(favorite_count) %>%
mutate(
cleaned_text = text %>%
str_replace_all("[^A-Za-z@#]", " ") %>%
str_squish() %>%
str_to_lower()
)
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
t %>%
select(text, created_at, retweet_count, cleaned_text) %>%
sample_n(10)
Here is the word cloud for Trump’s tweets.
# We will remove the following words since they are too common
# and not very informative
delete_words <- c("rt", "realdonaldtrump", "https", "http")
visualize_text <- function(x) {
tibble(text = x) %>%
unnest_tokens(word, text) %>%
count(word, sort = TRUE) %>%
filter(!word %in% stop_words$word, !word %in% delete_words) %>%
slice_max(n, n = 50) %>%
ggplot(aes(label = word, size = n, color = n)) +
geom_text_wordcloud_area() +
scale_size_area(max_size = 12) +
scale_color_viridis_c() +
theme_void()
}
visualize_text(t$cleaned_text)
To understand the term co-occurrence matrix (TCM), let us explore it on a simple example.
Below we create a simple character vector, compute and print its TCM \(X\). Note that the default method of printing TCM
doesn’t show column names;
shows dots instead of zeroes;
shows dots for \(X_{ij}\) when \(i>j\), even though we should have \(X_{ij}=X_{ji}\).
simple_text <- c("face masks are driving me crazy",
"only crazy people are not wearing face masks",
"face masks are better than face shields")
simple_text <- "just because you fail once does not mean you are going to fail at everything"
simple_text <- "if life were predictable it would cease to be life and be without flavour"
tokens <- space_tokenizer(simple_text)
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer,
skip_grams_window = 3)
tcm
## 12 x 12 sparse Matrix of class "dgTMatrix"
## [[ suppressing 12 column names 'and', 'cease', 'flavour' ... ]]
##
## and . . 0.3333333 . . . 0.3333333 . 0.5 . 1.5000000
## cease . . . . 0.5 0.3333333 1.0000000 . . 1.0000000 0.5000000
## flavour . . . . . . . . 1.0 . 0.5000000
## if . . . . . 0.3333333 . 0.5 . . .
## it . . . . . 1.0000000 0.3333333 0.5 . 1.0000000 .
## predictable . . . . . . . 1.0 . 0.5000000 .
## to . . . . . . . . . 0.5000000 1.0000000
## were . . . . . . . . . 0.3333333 .
## without . . . . . . . . . . 1.0000000
## would . . . . . . . . . . 0.3333333
## be . . . . . . . . . . 0.3333333
## life . . . . . . . . . . .
##
## and 1.0000000
## cease 0.3333333
## flavour .
## if 1.0000000
## it 0.3333333
## predictable 0.5000000
## to 0.5000000
## were 1.0000000
## without 0.3333333
## would .
## be 1.5000000
## life .
To print TCM with column names and zeroes, we convert it to matrix:
(t(as.matrix(tcm))+as.matrix(tcm))[ , c("be", "and")]
## be and
## and 1.5000000 0.0000000
## cease 0.5000000 0.0000000
## flavour 0.5000000 0.3333333
## if 0.0000000 0.0000000
## it 0.0000000 0.0000000
## predictable 0.0000000 0.0000000
## to 1.0000000 0.3333333
## were 0.0000000 0.0000000
## without 1.0000000 0.5000000
## would 0.3333333 0.0000000
## be 0.6666667 1.5000000
## life 1.5000000 1.0000000
Print the full TCM matrix, including the lower-triangular part.
A <- as.matrix(tcm) + t(as.matrix(tcm))
A %>%
round(2) %>%
as_tibble(rownames = "word") %>%
gt()
| word | and | cease | flavour | if | it | predictable | to | were | without | would | be | life |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| and | 0.00 | 0.00 | 0.33 | 0.00 | 0.00 | 0.00 | 0.33 | 0.00 | 0.50 | 0.00 | 1.50 | 1.00 |
| cease | 0.00 | 0.00 | 0.00 | 0.00 | 0.50 | 0.33 | 1.00 | 0.00 | 0.00 | 1.00 | 0.50 | 0.33 |
| flavour | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.50 | 0.00 |
| if | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.00 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 |
| it | 0.00 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 0.33 | 0.50 | 0.00 | 1.00 | 0.00 | 0.33 |
| predictable | 0.00 | 0.33 | 0.00 | 0.33 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.50 | 0.00 | 0.50 |
| to | 0.33 | 1.00 | 0.00 | 0.00 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 0.50 | 1.00 | 0.50 |
| were | 0.00 | 0.00 | 0.00 | 0.50 | 0.50 | 1.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.00 | 1.00 |
| without | 0.50 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.33 |
| would | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 0.50 | 0.50 | 0.33 | 0.00 | 0.00 | 0.33 | 0.00 |
| be | 1.50 | 0.50 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.33 | 0.67 | 1.50 |
| life | 1.00 | 0.33 | 0.00 | 1.00 | 0.33 | 0.50 | 0.50 | 1.00 | 0.33 | 0.00 | 1.50 | 0.00 |
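As a quick check (added for illustration), the symmetrized matrix indeed satisfies \(X_{ij}=X_{ji}\), as claimed above:
isSymmetric(A)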
In the lecture, \(X_{ij}\) was defined to be the number of times the word \(j\) appears in a context of word \(i\), i.e., the distance from \(j\) to \(i\) in a sentence is at most \(s\). Thus the process of calculating the TCM is controlled by the window size \(s\) (it is a hyperparameter of GloVe).
Below we compute and print the TCM for \(s=3\) (the same call as above, repeated here for reference):
tcm <- create_tcm(it, vectorizer,
skip_grams_window = 3)
tcm
## 12 x 12 sparse Matrix of class "dgTMatrix"
## [[ suppressing 12 column names 'and', 'cease', 'flavour' ... ]]
##
## and . . 0.3333333 . . . 0.3333333 . 0.5 . 1.5000000
## cease . . . . 0.5 0.3333333 1.0000000 . . 1.0000000 0.5000000
## flavour . . . . . . . . 1.0 . 0.5000000
## if . . . . . 0.3333333 . 0.5 . . .
## it . . . . . 1.0000000 0.3333333 0.5 . 1.0000000 .
## predictable . . . . . . . 1.0 . 0.5000000 .
## to . . . . . . . . . 0.5000000 1.0000000
## were . . . . . . . . . 0.3333333 .
## without . . . . . . . . . . 1.0000000
## would . . . . . . . . . . 0.3333333
## be . . . . . . . . . . 0.3333333
## life . . . . . . . . . . .
##
## and 1.0000000
## cease 0.3333333
## flavour .
## if 1.0000000
## it 0.3333333
## predictable 0.5000000
## to 0.5000000
## were 1.0000000
## without 0.3333333
## would .
## be 1.5000000
## life .
Note that the entries of the TCM are fractional. This is because, by default, the TCM counts the occurrences of a word \(j\) in the context of a word \(i\) with weights: if the distance from \(j\) to \(i\) in a sentence is \(k\), then this occurrence contributes weight \(1/k\) to \(X_{ij}\).
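As a quick worked check of this weighting (an illustration, not part of the original output): in the sentence above, “be” occurs at positions 9 and 12 and “life” at positions 2 and 10; within a window of \(s=3\), the co-occurring pairs are (9, 10) at distance 1 and (12, 10) at distance 2. The weighted count is therefore
1/1 + 1/2  # = 1.5, matching the "be"/"life" entry of the TCM above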
The weights are controlled by the weights argument (the default is \((1, 1/2, \dots, 1/s)\)):
tcm <- create_tcm(it, vectorizer,
skip_grams_window = 3,
weights = c(1, 1, 1))
tcm
## 12 x 12 sparse Matrix of class "dgTMatrix"
## [[ suppressing 12 column names 'and', 'cease', 'flavour' ... ]]
##
## and . . 1 . . . 1 . 1 . 2 1
## cease . . . . 1 1 1 . . 1 1 1
## flavour . . . . . . . . 1 . 1 .
## if . . . . . 1 . 1 . . . 1
## it . . . . . 1 1 1 . 1 . 1
## predictable . . . . . . . 1 . 1 . 1
## to . . . . . . . . . 1 1 1
## were . . . . . . . . . 1 . 1
## without . . . . . . . . . . 1 1
## would . . . . . . . . . . 1 .
## be . . . . . . . . . . 1 2
## life . . . . . . . . . . . .
First, we will compute the TCM for the collection of Trump’s cleaned tweets. We will trim the vocabulary to keep only words that appear at least 5 times (otherwise the TCM would be too large).
skip_gram_window_size <- 3
tokens <- space_tokenizer(t$cleaned_text)
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5)
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer,
skip_grams_window = skip_gram_window_size)
cat("Dimensions of tcm =", dim(tcm), "\n")
## Dimensions of tcm = 7149 7149
Note that the TCM is huge, but it is a sparse matrix, i.e., most of its entries are zero. Below is the fraction of nonzero entries in the TCM:
sum(tcm > 0) / (length(tcm))
## [1] 0.00828957
Note that
mean(tcm > 0)
## Warning in mean.default(tcm > 0): argument is not numeric or logical: returning
## NA
## [1] NA
doesn’t work because tcm is a sparse matrix, not an ordinary matrix:
class(tcm)
## [1] "dgTMatrix"
## attr(,"package")
## [1] "Matrix"
Because of this, we will have to be careful later when converting it to a data frame.
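As an aside (not in the original lab), the Matrix package behind the dgTMatrix class provides a sparse-aware count of nonzero entries; it gives the same fraction as above without building an intermediate logical matrix:
Matrix::nnzero(tcm) / length(tcm)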
GloVe is trained by gradient descent minimizing the loss function \[ L(w,\tilde{w}, b,\tilde{b})=\sum_{i,k=1}^{V}f(X_{ik})\left(w_i^{T} \tilde{w}_k + b_i + \tilde{b}_k - \log X_{ik} \right)^2 \]
The gradient descent is controlled by the following parameters:
\(d\), dimension of word and context vectors, i.e., we have \[ w\in\mathbb{R}^d,\qquad \tilde{w}\in\mathbb{R}^d \]
The maximum number of iterations after which we stop gradient descent
Convergence tolerance \(\varepsilon\) — we stop gradient descent if the change of the loss function is below \(\varepsilon\).
The threshold \(x_{max}\) controlling the weights \(f(X_{ij})\). Specifically, we will have \(f(x)=1\) whenever \(x\ge x_{max}\).
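For reference, the weighting function from the original GloVe paper (text2vec follows it and exposes the exponent as the alpha argument of GlobalVectors$new, with default \(\alpha = 3/4\)) is \[ f(x) = \begin{cases} (x/x_{max})^{\alpha}, & x < x_{max},\\ 1, & x \ge x_{max}. \end{cases} \]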
Now we will train GloVe on Trump’s tweets.
dim_word_emb <- 100
# Stage 2 - training GloVe itself (stage 1, building the TCM, was done above)
glove = GlobalVectors$new(rank = dim_word_emb, x_max = 10)
wv_main = glove$fit_transform(tcm, n_iter = 50, convergence_tol = 0.01)
wv_context = glove$components
word_vectors = wv_main + t(wv_context)
cat("Dim of word vector matrix =", dim(word_vectors))
## Dim of word vector matrix = 7149 100
The output of GloVe is a matrix whose rows represent word vectors. Below are the first 5 words:
row.names(word_vectors)[1:5]
## [1] "#caucusfortrump" "#cnndebate" "#electionday" "#freeourmarine"
## [5] "#gutfeld"
And here are entries 1-10 of word vectors for “obama”, “democrats”, and “hillary”:
word_vectors[c("obama", "democrats", "hillary") , 1:10] %>% round(3)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## obama 0.219 0.157 -0.030 -0.729 -0.445 -0.646 0.393 -0.608 -0.403 0.855
## democrats -0.626 0.087 -0.046 -1.053 -0.207 -1.214 -0.109 -1.201 0.343 0.603
## hillary 0.219 0.985 0.932 -0.589 -0.845 -0.332 -0.393 0.788 0.051 -0.292
Recall that the cosine similarity between vectors \(x\) and \(y\) is \[ \cos\alpha = \frac{\sum_{i=1}^{d}x_iy_i}{\sqrt{\sum_{i=1}^{d}x_i^2}\times \sqrt{\sum_{i=1}^{d}y_i^2}} \] Below is the cosine similarity between words “obama” and “hillary”
cos_sim <- function(x, y) {
# computes cosine similarity between vectors x and y
sum(x*y) / sqrt(sum(x^2) * sum(y^2))
}
cos_sim(word_vectors["obama" , ], word_vectors["hillary" , ])
## [1] 0.3408355
Is it high or low?
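To get a rough reference point (a suggestion, not part of the original lab), you can compare it with the similarities between “obama” and a few randomly chosen vocabulary words:
random_words <- sample(rownames(word_vectors), 5)
sapply(random_words, function(w) cos_sim(word_vectors["obama", ], word_vectors[w, ]))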
Find the cosine similarity between “hillary” and the rest of the words in Trump’s vocabulary and print the top 10 words most similar to “hillary”.
### Type your solution here
## First method (more efficient)
## Pre-compute norms and dot products
squared_norms <- rowSums(word_vectors^2)
dot_with_hillary <- word_vectors["hillary" , ] %*% t(word_vectors)
cos_sim_with_hillary <- function(w) {
dot_with_hillary[1, w] / sqrt(squared_norms["hillary"] * squared_norms[w])
}
tick <- Sys.time()
word_vectors %>%
rownames() %>%
cos_sim_with_hillary() %>%
sort(decreasing = TRUE) %>%
head(10) %>%
enframe()
tock <- Sys.time()
cat("(First method) ")
print(tock-tick)
## Second method (less efficient)
tick <- Sys.time()
cos_sim_with_hillary_2 <- function(w) {
cos_sim(word_vectors["hillary" , ], word_vectors[w,]) %>%
set_names(w) %>%
enframe()
}
word_vectors %>%
rownames() %>%
map_df(cos_sim_with_hillary_2) %>%
arrange(desc(value)) %>%
head(10)
tock <- Sys.time()
cat("(Second method) ")
print(tock-tick)
## (First method) Time difference of 0.0119741 secs
## (Second method) Time difference of 0.5021391 secs
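A third option (shown for comparison; it should produce the same top-10 list): text2vec itself provides sim2(), which computes all pairwise cosine similarities in one vectorized call.
sim_hillary <- sim2(x = word_vectors,
y = word_vectors["hillary", , drop = FALSE],
method = "cosine", norm = "l2")
sim_hillary[ , 1] %>% sort(decreasing = TRUE) %>% head(10)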
The first step in combining GloVe with downstream tasks is generating features (independent variables). The easiest method is as follows: \[ v(\mbox{document})=\sum_{x\in\mbox{document}}v(x) \] It means that our data matrix is \[ \mbox{DTM}\cdot \mbox{W}, \] where \(W\) is the matrix of word vectors (output of the GloVe model).
First, we need to create DTM:
dtm <- create_dtm(it, vectorizer)
cat("DTM dimensions =", dim(dtm), "\n")
## DTM dimensions = 29703 7149
Now we prepare the data. Below is a sample:
X_glove <- as.matrix(dtm %*% word_vectors)
X_glove[1:7, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## 1 1.5570154 2.3081109 7.532421 -3.1264927 -4.930246
## 2 -2.1042141 -3.1898832 1.014820 -0.7824533 -2.371435
## 3 1.5157911 1.7185869 2.007983 0.6883422 1.236334
## 4 2.0028285 0.5804169 2.135268 -3.0459215 -4.849228
## 5 1.1139806 0.7651401 5.411681 -5.3702478 -8.252705
## 6 0.4072635 1.2618059 1.293127 -4.5669505 -6.095733
## 7 0.7031197 1.0490637 3.597556 -0.9120027 0.283200
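As an optional sanity check (not part of the original lab), we can confirm for a single tweet that the corresponding row of \(\mbox{DTM}\cdot W\) equals the sum of the word vectors of its in-vocabulary tokens (tokens pruned from the vocabulary are simply dropped):
i <- 3
tokens_i <- space_tokenizer(t$cleaned_text[i])[[1]]
tokens_i <- tokens_i[tokens_i %in% rownames(word_vectors)]
manual_vec <- colSums(word_vectors[tokens_i, , drop = FALSE])
all.equal(unname(X_glove[i, ]), unname(manual_vec))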
Let us cluster Trump’s tweets using \(K\)-means with \(K=10\). This table shows the number of tweets in each cluster:
trum_tw_clust <- kmeans(X_glove, 10)
tabyl(trum_tw_clust$cluster)
And here are word clouds by cluster:
trum_tw_clust %>%
augment(t) %>%
select(.cluster, cleaned_text) %>%
unnest_tokens(word, cleaned_text) %>%
filter(!word %in% stop_words$word, !word %in% delete_words) %>%
count(.cluster, word, sort = TRUE) %>%
group_by(.cluster) %>%
mutate(word_rank = rank(-n)) %>%
filter(word_rank <= 30) %>%
ggplot(aes(label = word, size = n, color = n)) +
geom_text_wordcloud_area(seed = 123, rm_outside = TRUE) +
scale_size_area(max_size = 12) +
scale_color_viridis_c(option = "mako") +
facet_wrap(vars(.cluster)) +
theme_minimal(base_size = 13)
## Warning in wordcloud_boxes(data_points = points_valid_first, boxes = boxes, :
## Some words could not fit on page. They have been removed.
Write an R function that prints a sample of 5 tweets from a given cluster and apply it to the list of clusters so that it prints 5 tweets from each cluster.
trump_tweets_by_cluster <- function(k) {
t %>%
slice(which(trum_tw_clust$cluster == k)) %>%
sample_n(5) %>%
pull(cleaned_text)
}
1:10 %>% map(trump_tweets_by_cluster) %>% print
## [[1]]
## [1] "yesterday was veterans day i hope our armed service members felt appropriately honored this nation loves and respects all of you"
## [2] "@gmar #celebapprentice @realdonaldtrump nice to have this footage of joan"
## [3] "john mccain never had any intention of voting for this bill which his governor loves he campaigned on repeal amp replace let arizona down"
## [4] "if obama keeps pushing wind turbines our country will go down the tubes economically environmentally amp aesthetically"
## [5] "just out neera tanden hillary clinton adviser said israel is depressing i think israel is inspiring"
##
## [[2]]
## [1] "i cannot imagine that congress would dare to leave washington without a beautiful new healthcare bill fully approved and ready to go"
## [2] "the rally in cincinnati is on media put out false reports that it was cancelled will be great love you ohio"
## [3] "@cityblindsmib @realdonaldtrump with mr trump as top man in the world everyone would shake hands and listen to the man true leadership"
## [4] "@mileycyrus don t worry about liam you can do much better and you have plenty of time remain strong"
## [5] "@velvet @realdonaldtrump @kristinkgl @megynkelly i ll vote for trump in a heart beat not a politician but a very smart man true"
##
## [[3]]
## [1] "the lawyer for the whistleblower takes away all credibility from this big impeachment scam it should be ended and the whistleblower his lawyer and corrupt politician schiff should be investigared for fraud"
## [2] "i believe in spending what you have to but i also believe in not spending more than you should the art of the deal"
## [3] "if you strike out nobody is going to help you not your friends not the government you have to look to look out for yourself think big"
## [4] "rt @omastaskip @dbongino as a retired ff i know that the majority of the rank and file are conservative or moderate and do not support lib"
## [5] "can you believe it i m at approval in the republican party and have three stooges running against me one is mr appalachian trail who was actually in argentina for bad reasons"
##
## [[4]]
## [1] "it is disgraceful what the do nothing democrats are doing the impeachment scam but it is also disgraceful what they are not doing namely the usmca vote prescription drug price reduction gun safety infrastructure and much more"
## [2] "as i predicted jay powell and the federal reserve have allowed the dollar to get so strong especially relative to all other currencies that our manufacturers are being negatively affected fed rate too high they are their own worst enemies they don t have a clue pathetic"
## [3] "chairman kim realizes perhaps better than anyone else that without nuclear weapons his country could fast become one of the great economic powers anywhere in the world because of its location and people and him it has more potential for rapid growth than any other nation"
## [4] "very sad to hear the news on the passing of my friend senator thad cochran he was a real senator with incredible values even flew back to senate from mississippi for important healthcare vote when he was desperately ill thad never let our country or me down"
## [5] "the @uscg @fema and all others along with other countries that have been helping have been asked to move people in the badly hit sections of the bahamas to other sections of the bahamas"
##
## [[5]]
## [1] "eric s sept th event will be held at trump national golf club westchester http www trump com golf clubs westchester westchester asp"
## [2] "thank you america #makeamericagreatagain #trump https t co x qaqjjmj"
## [3] "rt @vp tune in tonight as i join @loudobbs https t co jokvcty bm"
## [4] "hillary s health care policies#draintheswamp #debate https t co t sr dk xi"
## [5] "coming soon to pennsylvania avenue http t co g ynukhuae"
##
## [[6]]
## [1] "rt @gopleader @jim jordan to chairman schiff of the members of congress you are the only member who knows who the whistleblower is"
## [2] "had a great meeting at cia headquarters yesterday packed house paid great respect to wall long standing ovations amazing people win"
## [3] "the washington times ukraine envoy blows massive hole into democrat accusations republicans at hearing find no trump pressure the ukrainian president also strongly stated that no pressure was put on him case closed"
## [4] "rt @housegop d s bill on the floor today has nothing to do w pre existing conditions but it will prevent states from reducing the pric"
## [5] "we are way over the fiscal cliff and with obama care being fully implemented in less than months it may be too late"
##
## [[7]]
## [1] "@mattwiemeri thanks"
## [2] "nyc is under constant threat from jihadists amp violent criminals stop amp frisk keeps streets amp subways safe stand strong ray kelly"
## [3] "@prday wilson but he probably won t get it"
## [4] "@mllemariani i agree"
## [5] "@meemawbellefleu thank you i did"
##
## [[8]]
## [1] "michaels cohen s attorney clarified the record saying his client does not know if president trump knew about the trump tower meeting out of which came nothing the answer is that i did not know about the meeting just another phony story by the fake news media"
## [2] "we have not seen china s demands yet which should be few in that previous u s administrations have done so poorly in negotiating china has seen our demands there has been no folding as the media would love people to believe the meetings"
## [3] "donald trump was being framed he fought back that is not obstruction @jessebwatters i had the right to end the whole witch hunt if i wanted i could have fired everyone including mueller if i wanted i chose not to i had the right to use executive privilege i didn t"
## [4] "the radical left tolerates no dissent it permits no opposition it accepts no compromise and it has absolutely no respect for the will of the american people they are coming after me because i am fighting for you #trumprallydallas #kag https t co othkws bud"
## [5] "the new york times is at its lowest point in its long and storied history not only is it losing a lot of money but it is a journalistic disaster being laughed at even in the most liberal of enclaves it has become a very sad joke all all over the world witch hunt hurt them"
##
## [[9]]
## [1] "during my trip to saudi arabia i spoke to the leaders of more than arab amp muslim nations about the need to confront our shared enemies https t co oeaqfiex"
## [2] "throughout american history the men and women of our armed forces have selflessly served our country making tremendous sacrifices to defend our liberty on national pow mia recognition day we honor all american prisoners of war https t co zgiqdybats https t co oeetrlzgcj"
## [3] "if you want to stop the drug smugglers human traffickers and vicious ms gang members from threatening our communities and poisoning our youth you have only one choice you must elect more republicans #kag https t co l nebv seo"
## [4] "time to start building in our country with american workers amp with american iron aluminum amp steel it is time to https t co rgzcryecmz"
## [5] "you have all been waiting the response has been amazing watch my announcement now press release to follow at http t co fxrjjtdp"
##
## [[10]]
## [1] "wow thank you pensacola fl see you friday at pm join me https t co jco za tv https t co lympryy hq"
## [2] "my @squawkcnbc interview discussing the @gop convention @barackobama s sealed records amp @senatorreid s tax claimhttp t co fd wysze"
## [3] "my @foxandfriends interview discussing how @barackobama is running a hateful campaign amp the @rnc convention surprise http t co t cc p"
## [4] "via @dmregister by @briannedmr trump bring back jobs from overseas http t co hlregvffjc let s make america great again"
## [5] "my @foxandfriends interview discussing @newsday s endorsement of @mittromney tomorrow s election and sandy s victims http t co j j mjhx"
Let us try to predict the retweet count from the text of Trump’s tweets. We will use our matrix of features and fit a LASSO model. The response variable will be the logarithm of the retweet count (because the distribution of the raw retweet count is highly skewed).
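To see why the log transform helps (a quick sketch, not part of the original lab; it reuses ggplot2 and patchwork, which are already loaded), compare the histograms of the raw and log-transformed retweet counts:
p_raw <- ggplot(t, aes(retweet_count)) +
geom_histogram(bins = 50) +
labs(title = "Raw retweet count")
p_log <- ggplot(t, aes(log(retweet_count + 1))) +
geom_histogram(bins = 50) +
labs(title = "log(1 + retweet count)")
p_raw + p_log  # patchwork arranges the two plots side by side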
First, we prepare the data and split it into training and test sets.
all_data <- X_glove %>%
as_tibble %>%
mutate(Y = log(t$retweet_count + 1))
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
ind <- sample(1:nrow(all_data), size = 15000)
train_data <- all_data %>% slice(ind)
test_data <- all_data %>% slice(-ind)
cat("Train data dim =", dim(train_data), "\n")
## Train data dim = 15000 101
cat("Test data dim =", dim(test_data), "\n")
## Test data dim = 14703 101
Now we will train a LASSO model.
lambda <- 10^seq(-4, 0 , length = 20)
lasso <- train(
Y ~., data = train_data, method = "glmnet",
trControl = trainControl("cv", number = 5),
tuneGrid = expand.grid(alpha = 1, lambda = lambda),
preProcess = c("scale")
)
lasso
## glmnet
##
## 15000 samples
## 100 predictor
##
## Pre-processing: scaled (100)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 12000, 12001, 12001, 11999, 11999
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0001000000 2.220044 0.4960033 1.810622
## 0.0001623777 2.220044 0.4960033 1.810622
## 0.0002636651 2.220044 0.4960033 1.810622
## 0.0004281332 2.220049 0.4960014 1.810631
## 0.0006951928 2.220064 0.4959905 1.810706
## 0.0011288379 2.220111 0.4959591 1.810993
## 0.0018329807 2.220213 0.4959023 1.811487
## 0.0029763514 2.220596 0.4957229 1.812504
## 0.0048329302 2.221804 0.4951843 1.814650
## 0.0078475997 2.224536 0.4940078 1.818894
## 0.0127427499 2.229393 0.4920019 1.826218
## 0.0206913808 2.238040 0.4885441 1.838684
## 0.0335981829 2.250836 0.4837747 1.856025
## 0.0545559478 2.275376 0.4746768 1.886933
## 0.0885866790 2.324103 0.4555216 1.943885
## 0.1438449888 2.404365 0.4216573 2.031840
## 0.2335721469 2.512686 0.3727982 2.143897
## 0.3792690191 2.634395 0.3148065 2.266472
## 0.6158482111 2.725622 0.2912087 2.373401
## 1.0000000000 2.882790 0.2558439 2.554210
##
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.0002636651.
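As a side note (a sketch using the glmnet object that caret stores in lasso$finalModel), you can check how many of the 100 GloVe dimensions receive a nonzero coefficient at the selected lambda:
betas <- as.matrix(coef(lasso$finalModel, s = lasso$bestTune$lambda))
sum(betas[-1, ] != 0)  # number of nonzero coefficients, excluding the intercept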
The test RMSE is
lasso %>%
predict(test_data) %>%
RMSE(test_data$Y)
## [1] 2.200734
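For context (a rough baseline, not part of the original lab), here is the test RMSE of the trivial model that always predicts the training-set mean of \(Y\):
baseline_pred <- rep(mean(train_data$Y), nrow(test_data))
RMSE(baseline_pred, test_data$Y)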