Info about the lab

Learning aim

The aim of this lab is to learn how to construct clusters for documents based on GloVe vector representations.

Objectives

By the end of this lab session, students should be able to

  1. Fit \(K\)-means clustering models in R

  2. Understand how to prepare the matrix of document vector representations so that the resulting clusters are meaningful.

Mode

Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. Some questions require a short answer (type it directly in this document); others require modifying the R code (edit the code here). In either case, you can discuss your work with the lab instructor.

Libraries and data

We load the libraries and read the dataset (stockerbot-export.csv), which has been downloaded in advance:

library(tidyverse) # for data manipulation
library(text2vec)  # for GloVe
library(tm)        # for text processing
library(wordcloud) # for visualizations
library(caret)     # for machine learning utilities

set.seed(142)

t <- read_csv("stockerbot-export.csv")
## Warning: 598 parsing failures.
## row  col           expected actual                    file
## 230 text delimiter or quote        'stockerbot-export.csv'
## 230 text delimiter or quote   B    'stockerbot-export.csv'
## 230 text delimiter or quote        'stockerbot-export.csv'
## 230 text delimiter or quote   B    'stockerbot-export.csv'
## 230 text delimiter or quote   <e2> 'stockerbot-export.csv'
## ... .... .................. ...... .......................
## See problems(...) for more details.
head(t)

We will first do some very basic cleaning by changing all characters to lower case and replacing everything except letters and digits with a space.

clean_tweets <- function(x) {
  x %>% 
    tolower %>%
    gsub('[^a-z0-9]', ' ', . ) 
}

t <- t %>%
  mutate(clean_text = clean_tweets(text))

t %>%
  select(text, clean_text) %>%
  sample_n(10)

Here is a word cloud of the cleaned tweets:

visualize_text <- function(x) {
  # x is a character vector of documents;
  # the function computes word frequencies, drops stop words,
  # and draws a word cloud of the 50 most frequent words
  frequent_words <- termFreq(x)
  frequent_words <- frequent_words[!(names(frequent_words) %in% stopwords())]
  wordcloud(words = names(frequent_words), 
            freq = frequent_words, min.freq = 0,
            max.words = 50, random.order=FALSE, rot.per=0.35, 
            colors=brewer.pal(8, "Dark2"))

}

visualize_text(t$clean_text)

Now we will train GloVe on our dataset. We tokenize the tweets, build the vocabulary (keeping only terms that occur at least 5 times), and compute the term co-occurrence matrix (TCM):

dim_word_emb <- 100          # dimension of the word embeddings
skip_gram_window_size <- 3   # context window for co-occurrence counts
tokens <- space_tokenizer(t$clean_text)

# Stage 1 - building the vocabulary and the term co-occurrence matrix
it <- itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it) %>% prune_vocabulary(5) # keep terms occurring at least 5 times
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer, 
                  skip_grams_window = skip_gram_window_size)

cat("Dimensions of tcm =", dim(tcm), "\n")
## Dimensions of tcm = 8263 8263
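# optional sanity check: the most frequent terms that survived pruning
# (create_vocabulary returns columns term, term_count and doc_count)
head(vocab[order(-vocab$term_count), ], 10)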
# Stage 2 - training GloVe itself
glove <- GlobalVectors$new(rank = dim_word_emb, x_max = 10)

wv_main <- glove$fit_transform(tcm, n_iter = 50, convergence_tol = 0.01)
## INFO  [18:09:23.055] epoch 1, loss 0.5520 
## INFO  [18:09:23.450] epoch 2, loss 0.2606 
## INFO  [18:09:23.760] epoch 3, loss 0.1413 
## INFO  [18:09:24.060] epoch 4, loss 0.1031 
## INFO  [18:09:24.390] epoch 5, loss 0.0837 
## INFO  [18:09:24.670] epoch 6, loss 0.0704 
## INFO  [18:09:24.956] epoch 7, loss 0.0606 
## INFO  [18:09:25.255] epoch 8, loss 0.0531 
## INFO  [18:09:25.568] epoch 9, loss 0.0471 
## INFO  [18:09:25.852] epoch 10, loss 0.0423 
## INFO  [18:09:26.134] epoch 11, loss 0.0383 
## INFO  [18:09:26.460] epoch 12, loss 0.0349 
## INFO  [18:09:26.741] epoch 13, loss 0.0321 
## INFO  [18:09:27.018] epoch 14, loss 0.0297 
## INFO  [18:09:27.302] epoch 15, loss 0.0276 
## INFO  [18:09:27.641] epoch 16, loss 0.0258 
## INFO  [18:09:27.911] epoch 17, loss 0.0242 
## INFO  [18:09:28.196] epoch 18, loss 0.0227 
## INFO  [18:09:28.496] epoch 19, loss 0.0215 
## INFO  [18:09:28.815] epoch 20, loss 0.0203 
## INFO  [18:09:29.091] epoch 21, loss 0.0193 
## INFO  [18:09:29.377] epoch 22, loss 0.0184 
## INFO  [18:09:29.669] epoch 23, loss 0.0175 
## INFO  [18:09:29.979] epoch 24, loss 0.0168 
## INFO  [18:09:30.258] epoch 25, loss 0.0161 
## INFO  [18:09:30.538] epoch 26, loss 0.0154 
## INFO  [18:09:30.838] epoch 27, loss 0.0148 
## INFO  [18:09:31.148] epoch 28, loss 0.0143 
## INFO  [18:09:31.427] epoch 29, loss 0.0137 
## INFO  [18:09:31.708] epoch 30, loss 0.0133 
## INFO  [18:09:32.044] epoch 31, loss 0.0128 
## INFO  [18:09:32.347] epoch 32, loss 0.0124 
## INFO  [18:09:32.644] epoch 33, loss 0.0120 
## INFO  [18:09:32.953] epoch 34, loss 0.0116 
## INFO  [18:09:33.286] epoch 35, loss 0.0113 
## INFO  [18:09:33.566] epoch 36, loss 0.0110 
## INFO  [18:09:33.838] epoch 37, loss 0.0107 
## INFO  [18:09:34.153] epoch 38, loss 0.0104 
## INFO  [18:09:34.449] epoch 39, loss 0.0101 
## INFO  [18:09:34.724] epoch 40, loss 0.0098 
## INFO  [18:09:35.000] epoch 41, loss 0.0096 
## INFO  [18:09:35.315] epoch 42, loss 0.0093 
## INFO  [18:09:35.603] epoch 43, loss 0.0091 
## INFO  [18:09:35.877] epoch 44, loss 0.0089 
## INFO  [18:09:36.169] epoch 45, loss 0.0087 
## INFO  [18:09:36.503] epoch 46, loss 0.0085 
## INFO  [18:09:36.782] epoch 47, loss 0.0083 
## INFO  [18:09:37.055] epoch 48, loss 0.0081 
## INFO  [18:09:37.351] epoch 49, loss 0.0080 
## INFO  [18:09:37.655] epoch 50, loss 0.0078
# GloVe learns two sets of vectors, main and context; following the
# original GloVe paper, we take their sum as the final word vectors
wv_context <- glove$components
word_vectors <- wv_main + t(wv_context)

cat("Dim of word vector matrix =", dim(word_vectors))
## Dim of word vector matrix = 8263 100
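
Before moving on, it is worth sanity-checking the embeddings by looking at the nearest neighbours of a word under cosine similarity. Here is a minimal sketch using text2vec's sim2(); the query word "stock" is an assumption, so substitute any term that survived vocabulary pruning:

query <- word_vectors["stock", , drop = FALSE]  # "stock" assumed to be in the vocabulary
cos_sim <- sim2(word_vectors, query, method = "cosine", norm = "l2")
head(sort(cos_sim[, 1], decreasing = TRUE), 10) # ten nearest neighbours (including the word itself)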

And now we will compute vectors representing entire tweets by summing the vectors of their words, weighted by word counts (a continuous-bag-of-words representation):

dtm <- create_dtm(it, vectorizer)
cbow_data <- as.matrix(dtm %*% word_vectors)
cat("CBOW Data dimensions =", dim(cbow_data), "\n")
## CBOW Data dimensions = 10350 100
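
Each row of dtm holds the word counts of one tweet, so dtm %*% word_vectors produces the count-weighted sum of each tweet's word vectors. A minimal check of this construction for a single tweet (it assumes the row names of word_vectors are the vocabulary terms, which is how text2vec labels them):

i <- 1  # any tweet index works
toks <- tokens[[i]][tokens[[i]] %in% rownames(word_vectors)] # drop pruned words
manual_sum <- colSums(word_vectors[toks, , drop = FALSE])    # repeated words give repeated rows
all.equal(unname(cbow_data[i, ]), unname(manual_sum))        # should be TRUE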

Clustering

Let us cluster our tweets using \(K\)-means with \(K=10\). This table shows the number of tweets in each cluster:

set.seed(25)
tw_clust <- kmeans(cbow_data, 10)
table(tw_clust$cluster)
## 
##    1    2    3    4    5    6    7    8    9   10 
## 3381    1  268 1005 1574 1761 1458  898    3    1

Note how unbalanced the clusters are: three of them contain three or fewer tweets. Here are the word clouds by cluster:

cloud_by_cluster <- function(k) {
  t %>%
    slice(which(tw_clust$cluster == k)) %>%
    pull(clean_text) %>%
    visualize_text()
}

1:10 %>% lapply(cloud_by_cluster) %>% invisible()

Question 1

This method of clustering is, in fact, incorrect. Think about what is wrong and fix it.

Answer: For document embeddings, what matters is cosine similarity, not Euclidean distance, yet \(K\)-means forms clusters based on Euclidean distance. For unit-length vectors the two agree, since \(\lVert x - y \rVert^2 = 2 - 2\cos(x, y)\) when \(\lVert x \rVert = \lVert y \rVert = 1\), so we normalize each row of the data:

cbow_data_normalized <- cbow_data %>%
  apply(1, function(x) x / sqrt(sum(x^2))) %>% # divide each row by its Euclidean norm
  t # apply() returns one column per input row, so we transpose back
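
A quick sanity check that the normalization worked: every row should now have unit length.

summary(rowSums(cbow_data_normalized^2)) # all values should be 1; an NaN would flag a tweet with no in-vocabulary words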

tw_clust <- kmeans(cbow_data_normalized, 10)
table(tw_clust$cluster)
## 
##    1    2    3    4    5    6    7    8    9   10 
## 1202 1340 1414 1037  353  692  223 1817 1909  363
1:10 %>% lapply(cloud_by_cluster) %>% invisible()
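
The value \(K=10\) was fixed above. If you want to explore other values of \(K\), one common heuristic is an elbow plot of the total within-cluster sum of squares; this is a sketch of the idea, not a required part of the lab:

wss <- sapply(2:15, function(k) kmeans(cbow_data_normalized, k, nstart = 5)$tot.withinss)
plot(2:15, wss, type = "b", xlab = "K", ylab = "total within-cluster SS")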

Question 2

Write an R function that prints a sample of 5 tweets from a given cluster, and apply it to the list of clusters so that it prints 5 tweets from each cluster.

extract_tweets_by_cluster <- function(k) {
  t %>%
    slice(which(tw_clust$cluster == k)) %>%
    sample_n(5) %>%
    pull(clean_text) 
}

1:10 %>% lapply(extract_tweets_by_cluster) %>% print
## [[1]]
## [1] "synchrony financial  syf set to announce quarterly earnings on thursday https   t co lfewvsxpz6"                                             
## [2] "traders buy tyson foods  tsn on weakness https   t co nucfrioa6m"                                                                            
## [3] " dpw alerted  dailystockmoney chatroom yesterday      join our chatroom today  subscription is free       nflx  qqq  https   t co tt8awlkhwx"
## [4] "rt  reuters  u s  labor board judge rejects mcdonald s bid to settle franchisees  case https   t co 5xrkdqbzcl  mcd https   t co wssrayk3dv" 
## [5] "traders buy shares of united technologies  utx on weakness https   t co kyfrzx3yed"                                                          
## 
## [[2]]
## [1] "binance coin  bnb price   13 00 quick  sign up before binance closes registration             https   t co hbeuxyqlzb"                       
## [2] "rt  dfstoken  want to trade  dfs  find us at the following exchanges   1  https   t co fvralqfnai 2  https   t co lma6pcpn81 3  https   t c "
## [3] "receive 10  fee discount here to leverage on bitmex      register for bonus    https   t co lkviqtp543   storj  https   t co xet08myssi"     
## [4] "our bot on discord notifies faster than crypto tracker bot      join us    https   t co ir0bfudbkf   cpc  ctr  https   t co rumswvn2cx"      
## [5] "join us for signals from multiple paid groups all in one      join us    https   t co c6otmeb68o   lgd  zeph  gpro  https   t co 0lsnkw1nze" 
## 
## [[3]]
## [1] "eli lilly and co  lly given a  100 00 price target at cantor fitzgerald https   t co byttkgblym"                                             
## [2] "bought  cbio 11 10 pre mkt sold 11 81   71 w 1k shares live in chat nice start to the day   710 already up over my  https   t co ftnwuewbec" 
## [3] "amazon web services has been chosen to provide a majority of the cloud infrastructure for twenty first century fox   https   t co vb5wrkhsrv"
## [4] "goldman downgrades  clx from neutral to sell https   t co yac1ki00qz"                                                                        
## [5] "u s  weekly oil production hit a new milestone last week and domestic crude stockpiles unexpectedly rose  oott  xom  https   t co nmnkagl7fr"
## 
## [[4]]
## [1] "rt  dfstoken   airdrop       vote for dfs on  wolfex  amp  receive 100  dfs on 7 24  rules  retweet tag 5 friends follow  wolf exchange join the "
## [2] " evhc whack on chanos short comment at conference"                                                                                                
## [3] " traderstewie  tjx hod as well  couple retail names looking good ahead of back to school "                                                        
## [4] "rt  financialbuzz  watch us report live from the floor of the nyse  this weeks weekly wrap up includes  nvda  jd  vrx  roku  bkng  symc  tt "     
## [5] "rt  andybiotech   jnj remicade keeps 94  of volume share despite two infliximab  biosimilars from  pfe and  mrk    just sad to see biosimil "     
## 
## [[5]]
## [1] "the blockchain   based essentia  ess moved down   0 0003180892 on july 16 17   https   t co uofixcx3q9"                                      
## [2] "short sale volume  not short interest  for  itub at 2018 07 17 is 58   https   t co zyhhpczfx1  bbd 56   bch 70   jpm 31   cib 34 "          
## [3] " xrx open interest for maturity 07 20 2018  high put 26 00 high call 33 00  maxpain  options https   t co r9zvf0dxw8 https   t co thn3cay6ch"
## [4] " vlo high oi range is 135 00 to 122 00 for option expiration 07 20 2018  maxpain  options https   t co top9wrtwpq https   t co txysogudba"   
## [5] " hes high oi range is 60 00 to 67 50 for option expiration 07 20 2018  maxpain  options https   t co q2nqjbcakm https   t co dm9rmnbnrc"     
## 
## [[6]]
## [1] "rt  420 invest     420 invest    we fund  marijuana opportunities  dm us     aero  abbv  acan  acnff  acbff  agstf  ammj  aphqf  attbf  axi "
## [2] "naj long calls in   aal  baba  adsk  atvi  bud  dal  de  do  jblu  jd  mas  mro  mt  msft  rht  snap  sq  teva  v  wpx"                      
## [3] " ms  kate  he  eat  mr  dvd  bwa   im  one  fig"                                                                                             
## [4] " im  mr  star  dang   im  ms  usa  he  see  ms  flo  out  a  shop"                                                                           
## [5] "hoy  microsoft corp  msft cerr  a  105 25 "                                                                                                  
## 
## [[7]]
## [1] " nly annaly capital management inc  document alert from our stock news alerts app"                                                           
## [2] " cog cabot oil  amp  gas corporation end of day sec alert from our stock news alerts app"                                                    
## [3] "coin listings pumps 20 100   our bot detects it before official announcements      join us     https   t co zct5ypgiho"                      
## [4] "coin listings pumps 20 100   our bot detects it before official announcements      join us     https   t co yqznkyv6qh"                      
## [5] "rt  marketsupchuck  subscribe to our youtube channel fast graphs stock videos  ge  low  hd  fb  fn  ads  skt  svb  t  abc  on  dks  jcom  n "
## 
## [[8]]
## [1] "shelton capital management increased its baxter intl inc  bax holding   https   t co er5i0wnthc"       
## [2] " 1 61 billion in sales expected for wec energy group inc  wec this quarter https   t co vg8s464v5o"    
## [3] "were analysts bearish varian medical systems inc   var this week  https   t co z0hgv6erya"             
## [4] "deere  amp  company  de stock rating lowered by otr global https   t co lwqz5uebpz"                    
## [5] "carrizo oil  amp  gas  crzo stock rating upgraded by zacks investment research https   t co wts9565uua"
## 
## [[9]]
## [1] "similarities between  hlf   qcor  or  mnk and  tsla shorts are amazing   same story same results   https   t co gfgvchxsqv"
## [2] " aaoi https   t co g05symtm9p"                                                                                             
## [3] " ohi  sbra  cci  cio  cone  dea  doc  gt  irm  mnr  stor  umh https   t co phexgytggq"                                     
## [4] "comparing liquidity services  lqdt  amp  accenture  acn https   t co vmwnibzhry"                                           
## [5] "republic services inc  adds james p  snee  ceo hormel foods  to board  rsg  corpgov https   t co ovg8udkg2u"               
## 
## [[10]]
## [1] "operating performance of  twx  revenue in 2012 12 was 25 3 billion  usd and increased to 31 3 billion  usd in 2017   https   t co jdqjik1elh"
## [2] "892  pdco oct 24 0 puts  2 50 sweepers as puts start to accumulate here"                                                                     
## [3] "rt  teslarians   googl paid  40 58 for its  500m   jd stake its 37 75 now    value "                                                         
## [4] "watch for possible next leg up on  htz  to  14 30s once  14 06 broken"                                                                       
## [5] "detail change 1h    mint  48 79    kurt  42 78    tgt  39 22    sup  38 03    tie  36 58    cryptocurrency"

Answers