The aim of this lab is to learn how to construct clusters for documents based on GloVe vector representations.
By the end of this lab session, students should be able to
Train \(K\)-means clustering in R
Understand how to prepare the matrix of vector representations for documents in order to construct meaningful clusters.
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. There will be questions that either require a short answer (type your answer right in this document) or require modifying the R code (modify the code here). In either case, you can discuss your work with the lab instructor.
The dataset has already been downloaded; we load the required packages and read it in below:
library(tidyverse) # for manipulation with data
library(text2vec) # for GloVe
library(tm) # for text processing
library(wordcloud) # for visualizations
library(caret) # for machine learning utilities
set.seed(142)
t <- read_csv("stockerbot-export.csv")
## Warning: 598 parsing failures.
## row col expected actual file
## 230 text delimiter or quote 'stockerbot-export.csv'
## 230 text delimiter or quote B 'stockerbot-export.csv'
## 230 text delimiter or quote 'stockerbot-export.csv'
## 230 text delimiter or quote B 'stockerbot-export.csv'
## 230 text delimiter or quote <e2> 'stockerbot-export.csv'
## ... .... .................. ...... .......................
## See problems(...) for more details.
head(t)
We will first do some very basic cleaning by changing all characters to lower case and replacing everything except letters and digits with a space.
clean_tweets <- function(x) {
x %>%
tolower %>%
gsub('[^a-z0-9]', ' ', . )
}
t <- t %>%
mutate(clean_text = clean_tweets(text))
t %>%
select(text, clean_text) %>%
sample_n(10)
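Notice that the character filter turns links into fragments such as https t co followed by the shortened id. If you want a slightly cleaner corpus, an optional variant is to strip URLs before the character filter; the sketch below defines a hypothetical helper clean_tweets_no_urls that is not used in the rest of the lab.
clean_tweets_no_urls <- function(x) {
  x %>%
    tolower %>%
    gsub('(https?://|www\\.)[^[:space:]]+', ' ', .) %>% # drop URLs before the character filter
    gsub('[^a-z0-9]', ' ', . )
}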
Here is the word cloud:
visualize_text <- function(x) {
# x is a character vector
# the function extracts word frequencies, removes stopwords, and draws a word cloud
frequent_words <- termFreq(x)
frequent_words <- frequent_words[!(names(frequent_words) %in% stopwords())]
wordcloud(words = names(frequent_words),
freq = frequent_words, min.freq = 0,
max.words = 50, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
}
visualize_text(t$clean_text)
Now we will train GloVe word embeddings on our dataset.
dim_word_emb <- 100
skip_gram_window_size <- 3
# Stage 1 - build the vocabulary and the term co-occurrence matrix (tcm)
tokens <- space_tokenizer(t$clean_text)
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it) %>% prune_vocabulary(5) # keep only words that occur at least 5 times
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer,
                  skip_grams_window = skip_gram_window_size)
cat("Dimensions of tcm =", dim(tcm), "\n")
## Dimensions of tcm = 8263 8263
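Before fitting GloVe it is worth glancing at the pruned vocabulary; create_vocabulary() returns a data frame with term, term_count and doc_count columns, so a quick sketch like the following shows the most frequent surviving terms:
vocab %>%
  arrange(desc(term_count)) %>%
  head(10)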
# Stage 2 - training GloVe itself
glove = GlobalVectors$new(rank = dim_word_emb, x_max = 10)
wv_main = glove$fit_transform(tcm, n_iter = 50, convergence_tol = 0.01)
## INFO [18:09:23.055] epoch 1, loss 0.5520
## INFO [18:09:23.450] epoch 2, loss 0.2606
## INFO [18:09:23.760] epoch 3, loss 0.1413
## INFO [18:09:24.060] epoch 4, loss 0.1031
## INFO [18:09:24.390] epoch 5, loss 0.0837
## INFO [18:09:24.670] epoch 6, loss 0.0704
## INFO [18:09:24.956] epoch 7, loss 0.0606
## INFO [18:09:25.255] epoch 8, loss 0.0531
## INFO [18:09:25.568] epoch 9, loss 0.0471
## INFO [18:09:25.852] epoch 10, loss 0.0423
## INFO [18:09:26.134] epoch 11, loss 0.0383
## INFO [18:09:26.460] epoch 12, loss 0.0349
## INFO [18:09:26.741] epoch 13, loss 0.0321
## INFO [18:09:27.018] epoch 14, loss 0.0297
## INFO [18:09:27.302] epoch 15, loss 0.0276
## INFO [18:09:27.641] epoch 16, loss 0.0258
## INFO [18:09:27.911] epoch 17, loss 0.0242
## INFO [18:09:28.196] epoch 18, loss 0.0227
## INFO [18:09:28.496] epoch 19, loss 0.0215
## INFO [18:09:28.815] epoch 20, loss 0.0203
## INFO [18:09:29.091] epoch 21, loss 0.0193
## INFO [18:09:29.377] epoch 22, loss 0.0184
## INFO [18:09:29.669] epoch 23, loss 0.0175
## INFO [18:09:29.979] epoch 24, loss 0.0168
## INFO [18:09:30.258] epoch 25, loss 0.0161
## INFO [18:09:30.538] epoch 26, loss 0.0154
## INFO [18:09:30.838] epoch 27, loss 0.0148
## INFO [18:09:31.148] epoch 28, loss 0.0143
## INFO [18:09:31.427] epoch 29, loss 0.0137
## INFO [18:09:31.708] epoch 30, loss 0.0133
## INFO [18:09:32.044] epoch 31, loss 0.0128
## INFO [18:09:32.347] epoch 32, loss 0.0124
## INFO [18:09:32.644] epoch 33, loss 0.0120
## INFO [18:09:32.953] epoch 34, loss 0.0116
## INFO [18:09:33.286] epoch 35, loss 0.0113
## INFO [18:09:33.566] epoch 36, loss 0.0110
## INFO [18:09:33.838] epoch 37, loss 0.0107
## INFO [18:09:34.153] epoch 38, loss 0.0104
## INFO [18:09:34.449] epoch 39, loss 0.0101
## INFO [18:09:34.724] epoch 40, loss 0.0098
## INFO [18:09:35.000] epoch 41, loss 0.0096
## INFO [18:09:35.315] epoch 42, loss 0.0093
## INFO [18:09:35.603] epoch 43, loss 0.0091
## INFO [18:09:35.877] epoch 44, loss 0.0089
## INFO [18:09:36.169] epoch 45, loss 0.0087
## INFO [18:09:36.503] epoch 46, loss 0.0085
## INFO [18:09:36.782] epoch 47, loss 0.0083
## INFO [18:09:37.055] epoch 48, loss 0.0081
## INFO [18:09:37.351] epoch 49, loss 0.0080
## INFO [18:09:37.655] epoch 50, loss 0.0078
# GloVe learns two sets of vectors (main and context); a common choice is to use their sum
wv_context = glove$components
word_vectors = wv_main + t(wv_context)
cat("Dim of word vector matrix =", dim(word_vectors))
## Dim of word vector matrix = 8263 100
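As a quick sanity check of the embeddings, we can look at the nearest neighbours of a word in cosine similarity using text2vec::sim2. This is only a sketch: "apple" is an illustrative query and assumes that the word survived the vocabulary pruning.
query <- word_vectors["apple", , drop = FALSE]
cos_sim <- sim2(x = word_vectors, y = query, method = "cosine", norm = "l2")
head(sort(cos_sim[, 1], decreasing = TRUE), 10) # the query word itself will come first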
And now we will compute vectors representing entire tweets (a continuous-bag-of-words representation): each tweet vector is the sum of the GloVe vectors of its words, weighted by how often each word occurs in the tweet.
dtm <- create_dtm(it, vectorizer)
cbow_data <- as.matrix(dtm %*% word_vectors)
cat("CBOW Data dimensions =", dim(cbow_data), "\n")
## CBOW Data dimensions = 10350 100
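Under the hood, dtm %*% word_vectors simply sums the GloVe vectors of the words in each tweet, weighted by word counts. The sketch below rebuilds the vector of the first tweet by hand; it assumes the first tweet contains at least one vocabulary word.
words_1 <- intersect(tokens[[1]], rownames(word_vectors)) # vocabulary words in tweet 1
manual_1 <- colSums(word_vectors[words_1, , drop = FALSE] *
                      as.numeric(dtm[1, words_1]))        # count-weighted sum of word vectors
all.equal(as.numeric(manual_1), as.numeric(cbow_data[1, ]))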
Let us cluster our tweets using \(K\)-means with \(K=10\). This table shows the number of tweets in each cluster:
set.seed(25)
tw_clust <- kmeans(cbow_data, 10)
table(tw_clust$cluster)
##
## 1 2 3 4 5 6 7 8 9 10
## 3381 1 268 1005 1574 1761 1458 898 3 1
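Before looking at word clouds, it is worth inspecting the fit itself: the kmeans object stores the cluster sizes and the within-cluster sums of squares that the algorithm minimizes. The clusters with only 1-3 tweets already suggest that a few tweets with unusually large vectors are being split off on their own.
tw_clust$size         # number of tweets per cluster (same as the table above)
tw_clust$withinss     # within-cluster sum of squares, one value per cluster
tw_clust$tot.withinss # total within-cluster sum of squares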
And here are word clouds by cluster:
cloud_by_cluster <- function(k) {
t %>%
slice(which(tw_clust$cluster == k)) %>%
pull(clean_text) %>%
visualize_text()
}
1:10 %>% lapply(cloud_by_cluster) %>% invisible()
This method of clustering is, in fact, incorrect. Think about what is wrong and repair it.
Answer: what matters for comparing document embeddings is cosine similarity, not Euclidean distance, yet \(K\)-means forms clusters based on Euclidean distance. To make Euclidean distance behave like cosine similarity, we normalize each document vector to unit length: for unit vectors \(u\) and \(v\), \(\|u - v\|^2 = 2(1 - \cos(u, v))\), so Euclidean distance becomes a monotone function of cosine dissimilarity:
cbow_data_normalized <- cbow_data %>%
  apply(1, function(x) x / sqrt(sum(x^2))) %>% # divide each tweet vector by its Euclidean norm
  t # apply() returns one column per tweet, so we need to transpose the matrix!
tw_clust <- kmeans(cbow_data_normalized, 10)
table(tw_clust$cluster)
##
## 1 2 3 4 5 6 7 8 9 10
## 1202 1340 1414 1037 353 692 223 1817 1909 363
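A quick check that the normalization behaved as intended (assuming no tweet produced an all-zero CBOW vector, in which case the division above would give NaN): every row should now have unit Euclidean norm.
summary(sqrt(rowSums(cbow_data_normalized^2)))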
1:10 %>% lapply(cloud_by_cluster) %>% invisible()
Write an R function that prints a sample of 5 tweets from a given cluster, and apply it to all clusters so that 5 sample tweets are printed for each cluster.
extract_tweets_by_cluster <- function(k) {
t %>%
slice(which(tw_clust$cluster == k)) %>%
sample_n(5) %>%
pull(clean_text)
}
1:10 %>% lapply(extract_tweets_by_cluster) %>% print
## [[1]]
## [1] "synchrony financial syf set to announce quarterly earnings on thursday https t co lfewvsxpz6"
## [2] "traders buy tyson foods tsn on weakness https t co nucfrioa6m"
## [3] " dpw alerted dailystockmoney chatroom yesterday join our chatroom today subscription is free nflx qqq https t co tt8awlkhwx"
## [4] "rt reuters u s labor board judge rejects mcdonald s bid to settle franchisees case https t co 5xrkdqbzcl mcd https t co wssrayk3dv"
## [5] "traders buy shares of united technologies utx on weakness https t co kyfrzx3yed"
##
## [[2]]
## [1] "binance coin bnb price 13 00 quick sign up before binance closes registration https t co hbeuxyqlzb"
## [2] "rt dfstoken want to trade dfs find us at the following exchanges 1 https t co fvralqfnai 2 https t co lma6pcpn81 3 https t c "
## [3] "receive 10 fee discount here to leverage on bitmex register for bonus https t co lkviqtp543 storj https t co xet08myssi"
## [4] "our bot on discord notifies faster than crypto tracker bot join us https t co ir0bfudbkf cpc ctr https t co rumswvn2cx"
## [5] "join us for signals from multiple paid groups all in one join us https t co c6otmeb68o lgd zeph gpro https t co 0lsnkw1nze"
##
## [[3]]
## [1] "eli lilly and co lly given a 100 00 price target at cantor fitzgerald https t co byttkgblym"
## [2] "bought cbio 11 10 pre mkt sold 11 81 71 w 1k shares live in chat nice start to the day 710 already up over my https t co ftnwuewbec"
## [3] "amazon web services has been chosen to provide a majority of the cloud infrastructure for twenty first century fox https t co vb5wrkhsrv"
## [4] "goldman downgrades clx from neutral to sell https t co yac1ki00qz"
## [5] "u s weekly oil production hit a new milestone last week and domestic crude stockpiles unexpectedly rose oott xom https t co nmnkagl7fr"
##
## [[4]]
## [1] "rt dfstoken airdrop vote for dfs on wolfex amp receive 100 dfs on 7 24 rules retweet tag 5 friends follow wolf exchange join the "
## [2] " evhc whack on chanos short comment at conference"
## [3] " traderstewie tjx hod as well couple retail names looking good ahead of back to school "
## [4] "rt financialbuzz watch us report live from the floor of the nyse this weeks weekly wrap up includes nvda jd vrx roku bkng symc tt "
## [5] "rt andybiotech jnj remicade keeps 94 of volume share despite two infliximab biosimilars from pfe and mrk just sad to see biosimil "
##
## [[5]]
## [1] "the blockchain based essentia ess moved down 0 0003180892 on july 16 17 https t co uofixcx3q9"
## [2] "short sale volume not short interest for itub at 2018 07 17 is 58 https t co zyhhpczfx1 bbd 56 bch 70 jpm 31 cib 34 "
## [3] " xrx open interest for maturity 07 20 2018 high put 26 00 high call 33 00 maxpain options https t co r9zvf0dxw8 https t co thn3cay6ch"
## [4] " vlo high oi range is 135 00 to 122 00 for option expiration 07 20 2018 maxpain options https t co top9wrtwpq https t co txysogudba"
## [5] " hes high oi range is 60 00 to 67 50 for option expiration 07 20 2018 maxpain options https t co q2nqjbcakm https t co dm9rmnbnrc"
##
## [[6]]
## [1] "rt 420 invest 420 invest we fund marijuana opportunities dm us aero abbv acan acnff acbff agstf ammj aphqf attbf axi "
## [2] "naj long calls in aal baba adsk atvi bud dal de do jblu jd mas mro mt msft rht snap sq teva v wpx"
## [3] " ms kate he eat mr dvd bwa im one fig"
## [4] " im mr star dang im ms usa he see ms flo out a shop"
## [5] "hoy microsoft corp msft cerr a 105 25 "
##
## [[7]]
## [1] " nly annaly capital management inc document alert from our stock news alerts app"
## [2] " cog cabot oil amp gas corporation end of day sec alert from our stock news alerts app"
## [3] "coin listings pumps 20 100 our bot detects it before official announcements join us https t co zct5ypgiho"
## [4] "coin listings pumps 20 100 our bot detects it before official announcements join us https t co yqznkyv6qh"
## [5] "rt marketsupchuck subscribe to our youtube channel fast graphs stock videos ge low hd fb fn ads skt svb t abc on dks jcom n "
##
## [[8]]
## [1] "shelton capital management increased its baxter intl inc bax holding https t co er5i0wnthc"
## [2] " 1 61 billion in sales expected for wec energy group inc wec this quarter https t co vg8s464v5o"
## [3] "were analysts bearish varian medical systems inc var this week https t co z0hgv6erya"
## [4] "deere amp company de stock rating lowered by otr global https t co lwqz5uebpz"
## [5] "carrizo oil amp gas crzo stock rating upgraded by zacks investment research https t co wts9565uua"
##
## [[9]]
## [1] "similarities between hlf qcor or mnk and tsla shorts are amazing same story same results https t co gfgvchxsqv"
## [2] " aaoi https t co g05symtm9p"
## [3] " ohi sbra cci cio cone dea doc gt irm mnr stor umh https t co phexgytggq"
## [4] "comparing liquidity services lqdt amp accenture acn https t co vmwnibzhry"
## [5] "republic services inc adds james p snee ceo hormel foods to board rsg corpgov https t co ovg8udkg2u"
##
## [[10]]
## [1] "operating performance of twx revenue in 2012 12 was 25 3 billion usd and increased to 31 3 billion usd in 2017 https t co jdqjik1elh"
## [2] "892 pdco oct 24 0 puts 2 50 sweepers as puts start to accumulate here"
## [3] "rt teslarians googl paid 40 58 for its 500m jd stake its 37 75 now value "
## [4] "watch for possible next leg up on htz to 14 30s once 14 06 broken"
## [5] "detail change 1h mint 48 79 kurt 42 78 tgt 39 22 sup 38 03 tie 36 58 cryptocurrency"