Info about the lab

Learning aim

The aim of this lab is to acquire basic familiarity with modern word embedding techniques, such as GloVe.

We will train GloVe on Donald Trump’s tweets and do downstream tasks - clustering and regression. Note that usually, for real applications, one would train GloVe on a very large corpus of different tweets (not just Trump’s tweets) first and then use it to process Trump’s tweets.

However, we will work with whatever data we have.

Objectives

By the end of this lab session, students should be able to

  1. Train GloVe in R

  2. Integrate GloVe with clustering

  3. Integrate GloVe with supervised learning

Mode

Please run the R chunks one by one, look at the output and make sure that you understand how it is produced. There will be questions that either require a short answer (type your answer right in this document) or require modifying R code (modify the R code here). In either case, you can discuss your work with the lab instructor.

Libraries and data

First we will load the libraries, the data, and do some simple preprocessing.

library(tidyverse) # for manipulation with data
library(text2vec) # for GloVe
library(tm) # for text processing
library(wordcloud) # for visualizations
library(caret)

set.seed(142)

t <- read.csv("trump_twitter.csv", stringsAsFactors = FALSE) %>%
  as_tibble %>%
  filter(nchar(text) <= 280) %>%       # keep tweets of at most 280 characters
  filter(!is_retweet == "true") %>%    # drop retweets
  select(-is_retweet) %>%
  mutate(favorite_count = as.numeric(favorite_count)) %>%  # non-numeric values become NA
  filter(!is.na(favorite_count))                           # and are dropped here
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
t %>%
  select(text, created_at, retweet_count) %>%
  sample_n(10)

We will first do some very basic cleaning: replacing everything except letters and numerals with a space and changing all characters to lower case.

clean_tweets <- function(x) {
  # replace everything except letters and digits with a space,
  # then convert to lower case
  x %>% 
    gsub('[^A-Za-z0-9]', ' ', . ) %>%
    tolower
}

t <- t %>%
  mutate(clean_text = clean_tweets(text))

t %>%
  select(text, clean_text) %>%
  sample_n(10)

Here is the word cloud for Trump’s tweets:

visualize_text <- function(x) {
  # x is a character vector
  # the function extracts word frequencies, drops stop words,
  # and plots a word cloud of the most frequent words
  frequent_words <- termFreq(x)
  frequent_words <- frequent_words[!(names(frequent_words) %in% stopwords())]
  wordcloud(words = names(frequent_words), 
            freq = frequent_words, min.freq = 0,
            max.words = 50, random.order=FALSE, rot.per=0.35, 
            colors=brewer.pal(8, "Dark2"))

}

visualize_text(t$clean_text)
## Warning in wordcloud(words = names(frequent_words), freq = frequent_words, :
## realdonaldtrump could not be fit on page. It will not be plotted.

Term co-occurrence matrix

Creating the matrix

To understand the term co-occurrence matrix (TCM), let us explore it on a simple example.

Below we create a simple character vector, compute and print its TCM \(X\). Note that the default method of printing TCM

  1. doesn’t show column names;

  2. shows dots instead of zeroes;

  3. shows dots for \(X_{ij}\) when \(i>j\), even though we should have \(X_{ij}=X_{ji}\).

simple_text <- c("face masks are driving me crazy",
                 "only crazy people are not wearing face masks",
                 "face masks are better than face shields")

# the assignments below overwrite simple_text;
# only the last example is actually used in what follows
simple_text <- "just because you fail once does not mean you are going to fail at everything"

simple_text <- "if life were predictable it would cease to be life and be without flavour"

tokens <- space_tokenizer(simple_text)
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it) 
vectorizer <- vocab_vectorizer(vocab)

tcm <- create_tcm(it, vectorizer, 
                  skip_grams_window = 3)

tcm
## 12 x 12 sparse Matrix of class "dgTMatrix"
##    [[ suppressing 12 column names 'and', 'cease', 'flavour' ... ]]
##                                                                                
## and         . . 0.3333333 . .   .         0.3333333 .   0.5 .         1.5000000
## cease       . . .         . 0.5 0.3333333 1.0000000 .   .   1.0000000 0.5000000
## flavour     . . .         . .   .         .         .   1.0 .         0.5000000
## if          . . .         . .   0.3333333 .         0.5 .   .         .        
## it          . . .         . .   1.0000000 0.3333333 0.5 .   1.0000000 .        
## predictable . . .         . .   .         .         1.0 .   0.5000000 .        
## to          . . .         . .   .         .         .   .   0.5000000 1.0000000
## were        . . .         . .   .         .         .   .   0.3333333 .        
## without     . . .         . .   .         .         .   .   .         1.0000000
## would       . . .         . .   .         .         .   .   .         0.3333333
## be          . . .         . .   .         .         .   .   .         0.3333333
## life        . . .         . .   .         .         .   .   .         .        
##                      
## and         1.0000000
## cease       0.3333333
## flavour     .        
## if          1.0000000
## it          0.3333333
## predictable 0.5000000
## to          0.5000000
## were        1.0000000
## without     0.3333333
## would       .        
## be          1.5000000
## life        .

To print the TCM with column names and zeroes, we convert it to a dense matrix and add it to its transpose to recover the symmetric form (note that this doubles the diagonal entries):

(t(as.matrix(tcm))+as.matrix(tcm))[ , c("be", "and")]
##                    be       and
## and         1.5000000 0.0000000
## cease       0.5000000 0.0000000
## flavour     0.5000000 0.3333333
## if          0.0000000 0.0000000
## it          0.0000000 0.0000000
## predictable 0.0000000 0.0000000
## to          1.0000000 0.3333333
## were        0.0000000 0.0000000
## without     1.0000000 0.5000000
## would       0.3333333 0.0000000
## be          0.6666667 1.5000000
## life        1.5000000 1.0000000

Question 1

Print the full TCM matrix, including the lower-triangular part.

as.matrix(tcm) + t(as.matrix(tcm))
##                   and     cease   flavour        if        it predictable
## and         0.0000000 0.0000000 0.3333333 0.0000000 0.0000000   0.0000000
## cease       0.0000000 0.0000000 0.0000000 0.0000000 0.5000000   0.3333333
## flavour     0.3333333 0.0000000 0.0000000 0.0000000 0.0000000   0.0000000
## if          0.0000000 0.0000000 0.0000000 0.0000000 0.0000000   0.3333333
## it          0.0000000 0.5000000 0.0000000 0.0000000 0.0000000   1.0000000
## predictable 0.0000000 0.3333333 0.0000000 0.3333333 1.0000000   0.0000000
## to          0.3333333 1.0000000 0.0000000 0.0000000 0.3333333   0.0000000
## were        0.0000000 0.0000000 0.0000000 0.5000000 0.5000000   1.0000000
## without     0.5000000 0.0000000 1.0000000 0.0000000 0.0000000   0.0000000
## would       0.0000000 1.0000000 0.0000000 0.0000000 1.0000000   0.5000000
## be          1.5000000 0.5000000 0.5000000 0.0000000 0.0000000   0.0000000
## life        1.0000000 0.3333333 0.0000000 1.0000000 0.3333333   0.5000000
##                    to      were   without     would        be      life
## and         0.3333333 0.0000000 0.5000000 0.0000000 1.5000000 1.0000000
## cease       1.0000000 0.0000000 0.0000000 1.0000000 0.5000000 0.3333333
## flavour     0.0000000 0.0000000 1.0000000 0.0000000 0.5000000 0.0000000
## if          0.0000000 0.5000000 0.0000000 0.0000000 0.0000000 1.0000000
## it          0.3333333 0.5000000 0.0000000 1.0000000 0.0000000 0.3333333
## predictable 0.0000000 1.0000000 0.0000000 0.5000000 0.0000000 0.5000000
## to          0.0000000 0.0000000 0.0000000 0.5000000 1.0000000 0.5000000
## were        0.0000000 0.0000000 0.0000000 0.3333333 0.0000000 1.0000000
## without     0.0000000 0.0000000 0.0000000 0.0000000 1.0000000 0.3333333
## would       0.5000000 0.3333333 0.0000000 0.0000000 0.3333333 0.0000000
## be          1.0000000 0.0000000 1.0000000 0.3333333 0.6666667 1.5000000
## life        0.5000000 1.0000000 0.3333333 0.0000000 1.5000000 0.0000000

TCM controls

In the lecture, \(X_{ij}\) was defined to be the number of times the word \(j\) appears in a context of word \(i\), i.e., the distance from \(j\) to \(i\) in a sentence is at most \(s\). Thus the process of calculating the TCM is controlled by the window size \(s\) (it is a hyperparameter of GloVe).

Below we compute and print TCM for \(s=3\):

tcm <- create_tcm(it, vectorizer, 
                  skip_grams_window = 3)
tcm
## 12 x 12 sparse Matrix of class "dgTMatrix"
##    [[ suppressing 12 column names 'and', 'cease', 'flavour' ... ]]
##                                                                                
## and         . . 0.3333333 . .   .         0.3333333 .   0.5 .         1.5000000
## cease       . . .         . 0.5 0.3333333 1.0000000 .   .   1.0000000 0.5000000
## flavour     . . .         . .   .         .         .   1.0 .         0.5000000
## if          . . .         . .   0.3333333 .         0.5 .   .         .        
## it          . . .         . .   1.0000000 0.3333333 0.5 .   1.0000000 .        
## predictable . . .         . .   .         .         1.0 .   0.5000000 .        
## to          . . .         . .   .         .         .   .   0.5000000 1.0000000
## were        . . .         . .   .         .         .   .   0.3333333 .        
## without     . . .         . .   .         .         .   .   .         1.0000000
## would       . . .         . .   .         .         .   .   .         0.3333333
## be          . . .         . .   .         .         .   .   .         0.3333333
## life        . . .         . .   .         .         .   .   .         .        
##                      
## and         1.0000000
## cease       0.3333333
## flavour     .        
## if          1.0000000
## it          0.3333333
## predictable 0.5000000
## to          0.5000000
## were        1.0000000
## without     0.3333333
## would       .        
## be          1.5000000
## life        .

Note that the entries of the TCM are fractional. This is because, by default, the TCM counts co-occurrences with weights: if the distance from word \(j\) to word \(i\) in a sentence is \(k\), then this occurrence of \(j\) in the context of \(i\) contributes weight \(1/k\) to \(X_{ij}\).
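
For example, in the sentence used above, "life" occurs at positions 2 and 10 and "be" at positions 9 and 12. With \(s=3\), only the pairs ("be" at position 9, "life" at position 10), at distance 1, and ("be" at position 12, "life" at position 10), at distance 2, fall within the window, so \[ X_{\mbox{be},\,\mbox{life}} = 1 + \tfrac{1}{2} = 1.5, \] which is exactly the value printed above.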

The weights are controlled by the weights argument (the default is \((1, 1/2, \dots, 1/s)\)):

tcm <- create_tcm(it, vectorizer, 
                  skip_grams_window = 3,
                  weights = c(1, 1, 1))

tcm
## 12 x 12 sparse Matrix of class "dgTMatrix"
##    [[ suppressing 12 column names 'and', 'cease', 'flavour' ... ]]
##                                    
## and         . . 1 . . . 1 . 1 . 2 1
## cease       . . . . 1 1 1 . . 1 1 1
## flavour     . . . . . . . . 1 . 1 .
## if          . . . . . 1 . 1 . . . 1
## it          . . . . . 1 1 1 . 1 . 1
## predictable . . . . . . . 1 . 1 . 1
## to          . . . . . . . . . 1 1 1
## were        . . . . . . . . . 1 . 1
## without     . . . . . . . . . . 1 1
## would       . . . . . . . . . . 1 .
## be          . . . . . . . . . . 1 2
## life        . . . . . . . . . . . .
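
The window size can be varied in the same way. Below is a minimal sketch (reusing the it and vectorizer objects from above) that recomputes the TCM with \(s=1\), so that only immediately adjacent words co-occur; tcm_s1 is just an illustrative name.

tcm_s1 <- create_tcm(it, vectorizer, 
                     skip_grams_window = 1)
tcm_s1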

GloVe

Computing TCM

First, we will compute the TCM for the collection of cleaned Trump tweets. We will trim the vocabulary to keep only words that appear at least 5 times (otherwise the TCM would be too large).

skip_gram_window_size <- 3
tokens <- space_tokenizer(t$clean_text)
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it) %>% prune_vocabulary(5)
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer, 
                  skip_grams_window = skip_gram_window_size)

cat("Dimensions of tcm =", dim(tcm), "\n")
## Dimensions of tcm = 5371 5371

Note that the TCM is huge, but it is a sparse matrix, i.e., most of its entries are zero. Below is the fraction of nonzero entries in the TCM:

sum(tcm > 0) / (length(tcm))
## [1] 0.008919585
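
To appreciate the savings from the sparse representation, one can compare object sizes. This is a rough sketch; note that materializing the dense version of a 5371 x 5371 matrix takes a couple of hundred MB of RAM.

format(object.size(tcm), units = "Mb")            # sparse representation
format(object.size(as.matrix(tcm)), units = "Mb") # dense representation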

Model training

GloVe is trained by gradient descent minimizing the loss function \[ L(w,\tilde{w}, b,\tilde{b})=\sum_{i,k=1}^{V}f(X_{ik})\left(w_i^{T} \tilde{w}_k + b_i + \tilde{b}_k - \log X_{ik} \right)^2 \]

The gradient descent is controlled by the following parameters:

  • \(d\), dimension of word and context vectors, i.e., we have \[ w\in\mathbb{R}^d,\qquad \tilde{w}\in\mathbb{R}^d \]

  • The maximum number of iterations after which we stop gradient descent

  • Convergence tolerance \(\varepsilon\) — we stop gradient descent if the change of the loss function is below \(\varepsilon\).

  • The threshold \(x_{max}\) controlling the weights \(f(X_{ij})\). Specifically, we will have \(f(x)=1\) whenever \(x\ge x_{max}\).
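
For reference, the weighting function used in the original GloVe paper is \[ f(x)=\begin{cases}\left(x/x_{max}\right)^{3/4}, & x < x_{max},\\ 1, & x\ge x_{max},\end{cases} \] so rare co-occurrences are down-weighted and very frequent ones are capped. In the code below, the threshold is set via the x_max argument.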

Now we will train GloVe on Trump’s tweets:

dim_word_emb <- 100

# Stage 1 was computing the TCM above; Stage 2 is training GloVe itself
glove = GlobalVectors$new(rank = dim_word_emb, x_max = 10)

wv_main = glove$fit_transform(tcm, n_iter = 50, convergence_tol = 0.01)
## INFO  [20:45:28.279] epoch 1, loss 0.2681 
## INFO  [20:45:28.635] epoch 2, loss 0.1384 
## INFO  [20:45:28.994] epoch 3, loss 0.0981 
## INFO  [20:45:29.333] epoch 4, loss 0.0781 
## INFO  [20:45:29.664] epoch 5, loss 0.0654 
## INFO  [20:45:29.999] epoch 6, loss 0.0562 
## INFO  [20:45:30.325] epoch 7, loss 0.0492 
## INFO  [20:45:30.651] epoch 8, loss 0.0437 
## INFO  [20:45:30.980] epoch 9, loss 0.0392 
## INFO  [20:45:31.305] epoch 10, loss 0.0357 
## INFO  [20:45:31.631] epoch 11, loss 0.0327 
## INFO  [20:45:31.973] epoch 12, loss 0.0302 
## INFO  [20:45:32.299] epoch 13, loss 0.0280 
## INFO  [20:45:32.626] epoch 14, loss 0.0262 
## INFO  [20:45:32.951] epoch 15, loss 0.0246 
## INFO  [20:45:33.276] epoch 16, loss 0.0232 
## INFO  [20:45:33.603] epoch 17, loss 0.0220 
## INFO  [20:45:33.927] epoch 18, loss 0.0209 
## INFO  [20:45:34.250] epoch 19, loss 0.0199 
## INFO  [20:45:34.576] epoch 20, loss 0.0190 
## INFO  [20:45:34.904] epoch 21, loss 0.0182 
## INFO  [20:45:35.230] epoch 22, loss 0.0175 
## INFO  [20:45:35.555] epoch 23, loss 0.0169 
## INFO  [20:45:35.880] epoch 24, loss 0.0163 
## INFO  [20:45:36.204] epoch 25, loss 0.0158 
## INFO  [20:45:36.526] epoch 26, loss 0.0153 
## INFO  [20:45:36.858] epoch 27, loss 0.0148 
## INFO  [20:45:37.179] epoch 28, loss 0.0144 
## INFO  [20:45:37.505] epoch 29, loss 0.0140 
## INFO  [20:45:37.833] epoch 30, loss 0.0136 
## INFO  [20:45:38.158] epoch 31, loss 0.0133 
## INFO  [20:45:38.484] epoch 32, loss 0.0130 
## INFO  [20:45:38.807] epoch 33, loss 0.0127 
## INFO  [20:45:39.131] epoch 34, loss 0.0124 
## INFO  [20:45:39.455] epoch 35, loss 0.0121 
## INFO  [20:45:39.790] epoch 36, loss 0.0119 
## INFO  [20:45:40.119] epoch 37, loss 0.0116 
## INFO  [20:45:40.444] epoch 38, loss 0.0114 
## INFO  [20:45:40.771] epoch 39, loss 0.0112 
## INFO  [20:45:41.097] epoch 40, loss 0.0110 
## INFO  [20:45:41.423] epoch 41, loss 0.0108 
## INFO  [20:45:41.746] epoch 42, loss 0.0106 
## INFO  [20:45:42.069] epoch 43, loss 0.0104 
## INFO  [20:45:42.397] epoch 44, loss 0.0103 
## INFO  [20:45:42.726] epoch 45, loss 0.0101 
## INFO  [20:45:43.054] epoch 46, loss 0.0100 
## INFO  [20:45:43.386] epoch 47, loss 0.0098 
## INFO  [20:45:43.707] epoch 48, loss 0.0097 
## INFO  [20:45:44.032] epoch 49, loss 0.0096 
## INFO  [20:45:44.357] epoch 50, loss 0.0094
wv_context = glove$components
# GloVe produces two sets of vectors (main and context);
# a common choice is to use their sum as the final word vectors
word_vectors = wv_main + t(wv_context)

cat("Dim of word vector matrix =", dim(word_vectors))
## Dim of word vector matrix = 5371 100

Exploration

The output of GloVe is a matrix whose rows represent word vectors. Below are the first 5 words:

row.names(word_vectors)[1:5]
## [1] "000"  "03"   "06"   "09"   "16th"

And here are entries 1-10 of word vectors for “obama”, “democrats”, and “hillary”:

word_vectors[c("obama", "democrats", "hillary") , 1:10] %>% round(3)
##             [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  [,10]
## obama     -0.548  0.529 -0.254  0.145 -0.260 -0.854  0.721  0.428 -0.335 -0.337
## democrats -0.311 -0.185  0.565 -1.290 -0.146 -1.206  0.528  0.394 -0.328 -0.768
## hillary   -0.223 -0.092 -0.304 -0.240 -0.661  0.528 -0.707 -0.184  0.164 -0.870

Recall that the cosine similarity between vectors \(x\) and \(y\) is \[ \cos\alpha = \frac{\sum_{i=1}^{d}x_iy_i}{\sqrt{\sum_{i=1}^{d}x_i^2}\times \sqrt{\sum_{i=1}^{d}y_i^2}} \] Below is the cosine similarity between words “obama” and “hillary”

cos_sim <- function(x, y) {
  # computes cosine similarity between vectors x and y
  sum(x*y) / sqrt(sum(x^2) * sum(y^2))
}

cos_sim(word_vectors["obama" , ], word_vectors["hillary" , ])
## [1] 0.3425258

Is it high or low?
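
To put this number in context, one could compute the similarity for a few other pairs. Here is a small sketch (it assumes that all of these words made it into the pruned vocabulary):

cos_sim(word_vectors["obama" , ], word_vectors["democrats" , ])
cos_sim(word_vectors["hillary" , ], word_vectors["democrats" , ])
cos_sim(word_vectors["obama" , ], word_vectors["great" , ])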

Question 2

Find the cosine similarity between “hillary” and the rest of the words in Trump’s vocabulary and print the top 10 words most similar to “hillary”.

cos_sim_with_hillary <- function(x) cos_sim(word_vectors["hillary" , ], x)


word_vectors %>%
  apply(1, cos_sim_with_hillary) %>%
  sort(decreasing = TRUE) %>%
  head(10)
##     hillary     crooked     clinton        said         she        that 
##   1.0000000   0.8267952   0.7831464   0.4326855   0.4311580   0.4216134 
##        beat        bill         her barackobama 
##   0.4090511   0.4040966   0.3885555   0.3698914
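
An alternative to apply() is text2vec's sim2() function, which computes all cosine similarities in one matrix operation. A sketch (the name hillary_sims is just for illustration):

hillary_sims <- sim2(word_vectors,
                     word_vectors["hillary" , , drop = FALSE],
                     method = "cosine", norm = "l2")
sort(as.matrix(hillary_sims)[ , 1], decreasing = TRUE) %>% head(10)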

Combining GloVe with downstream tasks

CBOW

The first step in combining GloVe with downstream tasks is applying Continuous Bag of Words (CBOW) as follows: \[ v(\mbox{document})=\sum_{x\in\mbox{document}}v(x) \] In matrix form, this means that our data matrix is \[ \mbox{DTM}\cdot W, \] where DTM is the document-term matrix and \(W\) is the matrix of word vectors. First, we need to create the DTM:

dtm <- create_dtm(it, vectorizer)
cat("DTM dimensions =", dim(dtm), "\n")
## DTM dimensions = 20112 5371

Now we prepare the CBOW data. Below is a sample:

cbow_data <- as.matrix(dtm %*% word_vectors)
cbow_data[1:7, 1:5]
##         [,1]       [,2]       [,3]       [,4]       [,5]
## 1 -4.3766615 -3.4506032 -4.9521222 -4.9147031 -2.7269021
## 2 -0.5802461  0.6961936 -4.1253416 -2.4991573 -0.3017326
## 3 -2.2474253  1.9739904 -0.7344964 -1.0040018 -2.7986620
## 4 -2.3832234  1.5288176 -2.1396670 -2.7830770 -0.4994253
## 5  0.3931639  0.4732510  0.1472064 -4.5483078 -2.9581171
## 6 -8.6829010 -5.8134600 -1.6795864 -7.5044877 -8.6582221
## 7 -4.6194824 -4.1906928 -6.7806423 -0.1340697 -2.7793685
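
As a quick sanity check, the CBOW matrix should have one row per tweet and one column per embedding dimension:

cat("CBOW data dimensions =", dim(cbow_data), "\n")  # expected: 20112 100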

Clustering

Let us cluster Trump’s tweets using \(K\)-means with \(K=10\). This table shows the number of tweets in each cluster:

trum_tw_clust <-
  kmeans(cbow_data, 10)

table(trum_tw_clust$cluster)
## 
##    1    2    3    4    5    6    7    8    9   10 
##  880 3998 1580 1622 1198 2380 1829 2614 1658 2134

And here are word clouds by cluster:

trump_cloud_by_cluster <- function(k) {
  t %>%
    slice(which(trum_tw_clust$cluster == k)) %>%
    pull(clean_text) %>%
    visualize_text()
}

1:10 %>% lapply(trump_cloud_by_cluster) %>% invisible()

Question 3

Write an R function that prints a sample of 5 tweets from a given cluster and apply it to the list of clusters so that it prints 5 tweets from each cluster.

trump_tweets_by_cluster <- function(k) {
  t %>%
    slice(which(trum_tw_clust$cluster == k)) %>%
    sample_n(5) %>%
    pull(clean_text) 
}

1:10 %>% lapply(trump_tweets_by_cluster) %>% print
## [[1]]
## [1] "thank you brandon  https   t co vfzwvhyiru"                                                                                                                                                                                                                                           
## [2] "rep  alexandria ocasio cortez is correct the va is not broken it is doing great  but that is only because of the trump administration  we got veterans choice  amp  accountability passed   president trump deserves a lot of credit   dan caldwell concerned veterans of america"    
## [3] "rt  drudgefeed  woodward  investigate fbicia handling of  garbage  dossier    https   t co py5awzprfy"                                                                                                                                                                                
## [4] "rt  gopchairwoman  the rnc smashed another fundraising record last month between  realdonaldtrump s accomplishments and our grassroots in "                                                                                                                                           
## [5] "    top of the fbi  brennan started this entire debacle about president trump  we now know that brennan had detailed knowledge of the  phony  dossier   he knows about the dossier he denies knowledge of the dossier he briefs the gang of 8 on the hill about the dossier which    "
## 
## [[2]]
## [1] "senator luther strange has gone up a lot in the polls since i endorsed him a month ago  now a close runoff  he will be great in d c "      
## [2] " democratic operative caught on camera  hillary personally ordered  donald duck  troll campaign that broke the law https   t co strehafyuh"
## [3] " ny conservative    barbmuenchen  crippledamerica is an excellent resource for the 2016 race  teamtrump  trump2016  makeamericagreatagain" 
## [4] " staciann   realdonaldtrump go  leezagibbons   classiest lady to ever grace the show   "                                                   
## [5] " mittromney has shown last week that he will campaign aggressively against  barackobama  i am confident he can defeat  barackobama "       
## 
## [[3]]
## [1] " wino911    trump2016 we know better than to trust the rnc https   t co emcgtk25fe   so cute "                                              
## [2] "buy american  amp  hire american are the principles at the core of my agenda which is  jobs jobs jobs  thank you  exxonmobil "              
## [3] "does anybody think that  cnbc will get their fictitious  polling numbers corrected sometime prior to the start of the debate  sad "         
## [4] "the reason ed schultz said nice things about me is that i m the only repub who won t cut social security etc  i ll make america rich again "
## [5] "i can t resist hitting lightweight  dannyzuker verbally when he starts up because he is just so pathetic and easy  stupid  "                
## 
## [[4]]
## [1] " moudalena   realdonaldtrump  nypost  fud31 godbless you mister  realdonaldtrump"                                                           
## [2] " akd72   realdonaldtrump i bet you will make hell of a president  the world needs presidents like you"                                      
## [3] " 05fxdli   realdonaldtrump as usual great pinpointed analysis of the world situation  i enjoy your frankness and honesty   thanks "         
## [4] " jwhurter                 realdonaldtrump are your ties available in south africa   you can order them online  macys com"                   
## [5] "scott gottlieb who has done an absolutely terrific job as commissioner of the fda plans to leave government service sometime next month    "
## 
## [[5]]
## [1] " pjwj316   realdonaldtrump yes           america needs  realdonaldtrump for president    can t wait for america to be great again    "       
## [2] "rt  jessebwatters  tomorrow  senator  lindseygrahamsc splashes into  wattersworld to discuss his relationship with president  realdonaldtru "
## [3] "with the great vote on cutting taxes this could be a big day for the stock market   and you "                                                
## [4] " trufacts 101   realdonaldtrump  fabiolasellsnj anything with the  trump name is epic  thanks "                                              
## [5] "rt  mike pence  america needs 4 more years of prosperity 4 more years of security and 4 more years of president  realdonaldtrump   keepam "  
## 
## [[6]]
## [1] "  ryan died on a winning mission   according to general mattis  not a failure  time for the u s  to get smart and start winning again "      
## [2] " tomuchpolitics   realdonaldtrump i would like to hear the answer to that johnny football question   with a little good guidance great "     
## [3] "remember as a senator obama did not vote for increasing the debt ceiling http   t co wtq96itg  i guess things change when president  "       
## [4] "general john kelly is doing a fantastic job as chief of staff  there is tremendous spirit and talent in the w h  don t believe the fake news"
## [5] "rick perry  a good man a great family and a patriot "                                                                                        
## 
## [[7]]
## [1] "major story that the dems are making up phony polls in order to suppress the the trump   we are going to win "                            
## [2] " donnieboysmith   realdonaldtrump in contrast to rubio and cruz you look like a giant  they look terribly weak thank you "                
## [3] "the tonight show begins in 5 minutes  enjoy "                                                                                             
## [4] "the washington times presidential debate poll trump 77   18290 clinton 17   4100  draintheswamp  debate https   t co wsgsf5nv6h"          
## [5] "i love watching these poor pathetic people  pundits  on television working so hard and so seriously to try and figure me out  they can t "
## 
## [[8]]
## [1] "americans who can afford to buy enough food is now at a 3 year low  is this  barackobama s  recovery   http   t co laei0sfa"     
## [2] " carisa01huston   realdonaldtrump you rock    ditch the haters and keep speaking the truth  they just don t like what they hear "
## [3] " tx shaun   realdonaldtrump happy birthday donald  thanks for never being afraid to tell it like it is"                          
## [4] " much as it pays to emphasize the positive there are times when the only choice is confrontation     the art of the deal"        
## [5] "is the boston killer eligible for obama care to bring him back to health "                                                       
## 
## [[9]]
## [1] "have a great game today  usarmy and  usnavy   i will be watching  we love our u s  military  on behalf of an entire nation thank you for your sacrifice and service   armynavygame  usa   https   t co 8m1w9rfwih"
## [2] "via  pvpatch by paige austin   trump to donate 12 acres for conservation in palos verdes  http   t co csif8rp7qe"                                                                                                 
## [3] " when you can t make them see the light make them feel the heat     ronald reagan"                                                                                                                                
## [4] "i m with you  i will work hard and never let you down  make america great again  https   t co vzzz6m5k9t"                                                                                                         
## [5] " stevieboy 63  talkernewyorker   5gentexan  hi katie let s get donald trump in the wh  he s the man to get this country back in order "                                                                           
## 
## [[10]]
## [1] " killianzane  i ll be the vp for  realdonaldtrump when he is president you would be better than what we have now "
## [2] " cashmoneybonas   realdonaldtrump i wish i was watching your presidency campaign instead of the oscars"           
## [3] "i really enjoyed being at the iowa state fair  the crowds love and enthusiasm is something i will never forget "  
## [4] "thank you albany new york  makeamericagreatagain  trump2016https   t co iaousy5vdc https   t co wcituuxej8"       
## [5] "the best investors are visionaries they look beyond the present "

Regression

Let us try to predict the retweet count from the text of Trump’s tweets. We will use our CBOW matrix as the matrix of features and will fit a LASSO model. The response variable will be the logarithm of the retweet count (because the distribution of the raw retweet count is highly skewed).
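
To see the skew that motivates the log transform, one can compare the histograms of the raw and log-transformed retweet counts; a small sketch using base graphics:

par(mfrow = c(1, 2))
hist(t$retweet_count, main = "retweet_count", xlab = "")
hist(log(t$retweet_count + 1), main = "log(retweet_count + 1)", xlab = "")
par(mfrow = c(1, 1))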

First, we prepare the data and split it into training and test sets:

set.seed(42)
all_data <- cbow_data %>%
  as_tibble %>%
  mutate(Y = log(t$retweet_count + 1))
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
ind <- sample(1:nrow(all_data), size = 15000)

train_data <- all_data %>% slice(ind)
test_data <- all_data %>% slice(-ind)

cat("Train data dim =", dim(train_data), "\n")
## Train data dim = 15000 101
cat("Test data dim =", dim(test_data), "\n")
## Test data dim = 5112 101

Now we will train a LASSO model:

lambda <- 10^seq(-4, 0 , length = 20)

lasso <- train(
  Y ~., data = train_data, method = "glmnet",
  trControl = trainControl("cv", number = 5),
  tuneGrid = expand.grid(alpha = 1, lambda = lambda),
  preProcess = c("scale")
)

lasso
## glmnet 
## 
## 15000 samples
##   100 predictor
## 
## Pre-processing: scaled (100) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 12000, 12000, 12001, 11999, 12000 
## Resampling results across tuning parameters:
## 
##   lambda        RMSE      Rsquared   MAE     
##   0.0001000000  2.275811  0.4572173  1.859601
##   0.0001623777  2.275811  0.4572173  1.859601
##   0.0002636651  2.275811  0.4572173  1.859601
##   0.0004281332  2.275783  0.4572285  1.859635
##   0.0006951928  2.275729  0.4572493  1.859807
##   0.0011288379  2.275766  0.4572265  1.860243
##   0.0018329807  2.275975  0.4571229  1.861082
##   0.0029763514  2.276485  0.4568798  1.862437
##   0.0048329302  2.277759  0.4562926  1.864911
##   0.0078475997  2.280578  0.4550295  1.869938
##   0.0127427499  2.285699  0.4527852  1.878251
##   0.0206913808  2.295684  0.4484331  1.892225
##   0.0335981829  2.310761  0.4421048  1.913860
##   0.0545559478  2.337410  0.4308059  1.950286
##   0.0885866790  2.380668  0.4120822  2.007097
##   0.1438449888  2.439634  0.3868277  2.079145
##   0.2335721469  2.528904  0.3452911  2.176738
##   0.3792690191  2.615447  0.3098429  2.267878
##   0.6158482111  2.730466  0.2630414  2.396190
##   1.0000000000  2.857540  0.2509167  2.531932
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.0006951928.
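
To see which model was selected and how the CBOW dimensions are weighted, one can inspect the best tuning parameters and the fitted coefficients. A sketch using standard caret/glmnet accessors:

lasso$bestTune
coef(lasso$finalModel, s = lasso$bestTune$lambda)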

The test RMSE is

lasso %>%
  predict(test_data) %>%
  RMSE(test_data$Y)
## [1] 2.300238
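
For a rough baseline, one can compare this with the RMSE obtained by always predicting the mean of the training responses (a sketch using caret's RMSE()):

RMSE(rep(mean(train_data$Y), nrow(test_data)), test_data$Y)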

Survey

There is a link to a simple survey after lab 9.

Answers

Here are the answers: