A tutorial by Carlo R. M. A. Santagiustina
carlo.santagiustina@unive.it
This is an R Markdown document. R Markdown is a simple formatting syntax for authoring interactive HTML documents that combine R code, HTML and JavaScript. When you click the Knit button in the top-left corner, a document is generated that includes both the content and the output of any embedded R code chunks within this tutorial. For more details on R Markdown see http://rmarkdown.rstudio.com.
Text mining: the automatic or semi-automatic extraction of information from large numbers of unstructured texts written in a natural language (for example, using RegEx). The information to be extracted or inferred from the texts is usually specified manually beforehand. Text mining is an applied branch of information retrieval and computational linguistics.
In general, text mining workflows have three steps:
1. Data collection and pre-processing: Gather input texts that are potentially relevant for the given task and pre-process them.
2. Natural language processing: Analyze the text, infer relevant syntactic, semantic and morphological information about words and their dependency relations;
3. Text categorization and modeling: Discover and aggregate recurrent patterns, infer latent semantic representations and distinctions (topics, arguments, narratives);
Because, in our daily life, we use natural language (and not numbers and math formulas) to communicate and share information, text mining is essential for analysing this kind of data. Why use R for text mining?
1. R is open-source: free, transparent and deployable across a wide range of operating systems;
2. R has a huge community: R and its TM and NLP libraries are well documented and incessantly improved, and there are plenty of online tutorials and examples;
3. R can be used to integrate multiple languages and related libraries (like spaCy) in a pipeline: you can embed and execute scripts, commands and functions written in other programming languages, like Python, Java, C++, SQL, Julia, JS, Fortran;
4. R is focused on (but not limited to) statistics, and many authors of topic models write their code in R (or wrap their C++ libraries in an R package): it is employed by researchers in almost all fields of the quantitative social sciences.
require(magrittr)
Piping (%>%) was introduced in R by the magrittr package, which is now part of the tidyverse suite. Many other packages, like purrr, have extended the functional programming toolset for R.
This section of the tutorial is a summarized version of https://www.nateday.me/Cats_with_Pipes.html.
x %>% f gives the same output as f(x), where f is a function that accepts the data object as input;
x %>% f(y) gives the same output as f(x, y);
x %>% f %>% g %>% h gives the same output as h(g(f(x)));
x %>% f %>% g %>% h(parA=y) %>% i %>% j(parB=z) %>% k gives the same output as k(j(i(h(g(f(x)), parA=y)), parB=z)).
The left-hand side is evaluated before being passed on to the right-hand side expression.
If you don’t want to use an object x as the first input/parameter of a function when piping, you can choose to which input/parameter x is sent by using the . placeholder, as follows:
x %>% f(y, .) is equivalent to f(y, x);
x %>% f(y, z = .) is equivalent to f(y, z = x).
It is straightforward to use the placeholder several times in a right-hand side expression. However, when the placeholder only appears in nested expressions, magrittr will still apply the first-argument rule, because in most cases this results in cleaner code.
x %>% f(y = nrow(.), z = ncol(.)) is equivalent to f(x, y = nrow(x), z = ncol(x))
The behavior can be overruled by enclosing the right-hand side in braces:
x %>% {f(y = nrow(.), z = ncol(.))} is equivalent to f(y = nrow(x), z = ncol(x))
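A minimal sketch of these rules, using the built-in mtcars data frame (any data object would do):
require(magrittr)
mtcars %>% nrow()# same as nrow(mtcars)
mtcars %>% head(3)# same as head(mtcars, 3)
mtcars %>% subset(cyl == 6) %>% nrow()# same as nrow(subset(mtcars, cyl == 6))
4 %>% seq(1, .)# the . placeholder sends 4 to the second argument: same as seq(1, 4)
mtcars %>% {paste(nrow(.), "rows and", ncol(.), "columns")}# braces overrule the first-argument rule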
# creator Carlo R. M. A. Santagiustina
# creation date 29-01-2022
# language R
# R version 4.0.3 (2020-10-10)
#### 0 KNITR OPTIONS ####
knitr::opts_chunk$set(warning = FALSE, message = FALSE, echo = FALSE,fig.width=10, fig.height=10)
#### 1.1 INSTALL AND LOAD LIBRARIES ####
pkgTest <- function(x)
{
if (!require(x,character.only = TRUE))
{
install.packages(x,dep=TRUE,repos="https://cloud.r-project.org/")
if(!require(x,character.only = TRUE)) stop("Package not found")
}
}
pkgTest("remotes")# for installing libraries from github
pkgTest("knitr")# for printing HTM tables
pkgTest("tidyverse")# for data wrangling
pkgTest("quanteda")# for text mining
pkgTest("spacyr")# for NLP
pkgTest("topicmodels")# for topic model estimation
pkgTest("seededlda")# for seededLDA estimation
pkgTest("igraph")# for constructing and analysing graphs
pkgTest("visNetwork")# for plotting interactive graphs
pkgTest("corpus")#for affect WordNet-Affect lexicon (extension of WordNet, see )
pkgTest("tidytext")#for formatting topicmodels outputs
#pkgTest("udpipe")# another library for NLP
remotes::install_github("richfitz/remoji")
pkgTest("remoji")# for emoji extraction
pkgTest("reshape2")# for reshaping sededLDA outputs
The dataset that will be used in today's applications consists of 67 792 tweets (published by 40 608 users) containing the expression "work[ -]?from[ -]?anywhere", together with related metadata and user profile information.
These tweets were published between January 2019 and July 2021.
They were downloaded from the Twitter Academic API v2 using R. For more details on this API see: https://developer.twitter.com/en/products/twitter-api/academic-research. The dataset was preprocessed to remove duplicates and tweets unrelated to the "work from anywhere" issue. Dataset import in R:
## text author_id conversation_id created_at
## Length:67792 Length:67792 Length:67792 Min. :2019-01-01
## Class :character Class :character Class :character 1st Qu.:2020-03-18
## Mode :character Mode :character Mode :character Median :2020-09-02
## Mean :2020-07-23
## 3rd Qu.:2021-02-10
## Max. :2021-06-30
## in_reply_to_user_id possibly_sensitive id source
## Length:67792 Mode :logical Length:67792 Length:67792
## Class :character FALSE:67396 Class :character Class :character
## Mode :character TRUE :396 Mode :character Mode :character
##
##
##
## retweet_count reply_count like_count quote_count
## Min. : 0.000 Min. : 0.0000 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 0.000 Median : 0.0000 Median : 0.000 Median : 0.0000
## Mean : 1.121 Mean : 0.3821 Mean : 4.882 Mean : 0.1406
## 3rd Qu.: 0.000 3rd Qu.: 0.0000 3rd Qu.: 1.000 3rd Qu.: 0.0000
## Max. :5119.000 Max. :873.0000 Max. :11684.000 Max. :879.0000
| text | author_id | conversation_id | created_at | in_reply_to_user_id | possibly_sensitive | id | source | retweet_count | reply_count | like_count | quote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| #digitaltransformation Top 10 companies that offer work-from-anywhere jobs - WKMG News 6 & ClickOrlando https://t.co/Mg0XtnflUB https://t.co/3fLz3jnKq9 | 1049585267772968960 | 1080012837211291648 | 2019-01-01 | NA | FALSE | 1080012837211291648 | IFTTT | 0 | 0 | 0 | 0 |
| 40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry #International https://t.co/EMFWTHzZEj | 107631340 | 1080010814562672640 | 2019-01-01 | NA | FALSE | 1080010814562672640 | Linkgage | 0 | 0 | 0 | 0 |
| 40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry #International #link2 https://t.co/EMFWTHzZEj | 107631340 | 1079979313833934855 | 2019-01-01 | NA | FALSE | 1079979313833934855 | Linkgage | 0 | 0 | 0 | 0 |
| Top 10 companies that offer work-from-anywhere jobs https://t.co/uHzor6qahf | 2214623612 | 1079922194644385793 | 2019-01-01 | NA | FALSE | 1079922194644385793 | | 0 | 0 | 0 | 0 |
| 40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry #International #link1 https://t.co/EMFWTHzZEj | 107631340 | 1079918941319901184 | 2019-01-01 | NA | FALSE | 1079918941319901184 | Linkgage | 0 | 0 | 1 | 0 |
| Top 10 companies that offer work-from-anywhere jobs https://t.co/LOwL9En8VH | 80203918 | 1079914035754291200 | 2019-01-01 | NA | FALSE | 1079914035754291200 | | 0 | 0 | 0 | 0 |
## url id name verified
## Length:40608 Length:40608 Length:40608 Mode :logical
## Class :character Class :character Class :character FALSE:38728
## Mode :character Mode :character Mode :character TRUE :1880
##
##
##
## protected username description followers_count
## Mode :logical Length:40608 Length:40608 Min. : 0
## FALSE:40608 Class :character Class :character 1st Qu.: 107
## Mode :character Mode :character Median : 514
## Mean : 17214
## 3rd Qu.: 2106
## Max. :26099432
## following_count tweet_count listed_count created_at
## Min. : 0 Min. : 1 Min. : 0.0 Min. :2006-03-21
## 1st Qu.: 155 1st Qu.: 1007 1st Qu.: 0.0 1st Qu.:2009-11-11
## Median : 512 Median : 4570 Median : 8.0 Median :2012-07-21
## Mean : 1658 Mean : 23484 Mean : 160.2 Mean :2013-07-16
## 3rd Qu.: 1387 3rd Qu.: 17754 3rd Qu.: 57.0 3rd Qu.:2017-01-27
## Max. :755319 Max. :3976439 Max. :130184.0 Max. :2021-06-28
## location
## Length:40608
## Class :character
## Mode :character
##
##
##
| url | id | name | verified | protected | username | description | followers_count | following_count | tweet_count | listed_count | created_at | location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| https://t.co/L0FKd02CBX | 60433092 | Morissa (Dr. Rissy) Schwartz | TRUE | FALSE | MorissaSchwartz | All my puns are intended! 😆 Owner of top-rated marketing company https://t.co/3xpIccgnoi - @forbes 30 Under 30 🤩 - @entrepreneur Contributor -📚 https://t.co/8XIkYYnjSh CEO | 21180 | 8958 | 74574 | 1632 | 2009-07-27 | New Jersey, USA |
| https://t.co/i48TLClhwK | 813081982758518784 | A Better Lemonade Stand | FALSE | FALSE | ABLSecom | A Better Lemonade Stand is an online business incubator. We help entrepreneurs build, launch, and grow their online business. | 2108 | 673 | 3888 | 62 | 2016-12-25 | NA |
| NA | 862922852063862784 | Mr. J 🏴 | FALSE | FALSE | jasze36 | If you are into smartphones, tech, superheroes, video games, movies, general geekery, football, and anything Malaysian, my tweets are for you. | 447 | 346 | 35486 | 6 | 2017-05-12 | Malesia |
| https://t.co/6qjZ8PCpvM | 91478624 | Forbes | TRUE | FALSE | Forbes | Official account of Forbes, the world’s leading voice for entrepreneurial success and free enterprise, and home to breaking news, business, money and more. | 17331756 | 5149 | 347669 | 61023 | 2009-11-21 | New York, NY |
| https://t.co/wk9XBMxbNX | 35700404 | 💉💉😷💪🦠Heather Smith | TRUE | FALSE | HeatherSmithAU | 🦄Hype Girl for Accounting Apps 🦄 Subscribe to the #AccounTech newsletter 👉https://t.co/oQhoMuqNwk Lead with kindness 👸🌞🦋 💐💃 https://t.co/lS92N5HjxV | 10606 | 5177 | 92368 | 434 | 2009-04-27 | Brisbane Australia |
| https://t.co/GMdWAQKfMC | 16350207 | Christine Hueber | FALSE | FALSE | ChristineHueber | Do you have your copy of my #FREE Report yet? Get “Top 10 LinkedIn #B2B #Success Secrets” to #hack #LinkedIn for better results now @ https://t.co/GMdWAQKfMC | 8078 | 7987 | 116846 | 367 | 2008-09-18 | San Francisco, CA |
| text | author_id | conversation_id | created_at | in_reply_to_user_id | possibly_sensitive | id | source | retweet_count | reply_count | like_count | quote_count | clean_text | text_urls | text_hashtags | text_fintags | text_emoticones |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| #digitaltransformation Top 10 companies that offer work-from-anywhere jobs - WKMG News 6 & ClickOrlando https://t.co/Mg0XtnflUB https://t.co/3fLz3jnKq9 | 1049585267772968960 | 1080012837211291648 | 2019-01-01 | NA | FALSE | 1080012837211291648 | IFTTT | 0 | 0 | 0 | 0 | digitaltransformation Top 10 companies that offer work-from-anywhere jobs - WKMG News 6 and ClickOrlando | https://t.co/Mg0XtnflUB, https://t.co/3fLz3jnKq9 | #digitaltransformation | ||
| 40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry #International https://t.co/EMFWTHzZEj | 107631340 | 1080010814562672640 | 2019-01-01 | NA | FALSE | 1080010814562672640 | Linkgage | 0 | 0 | 0 | 0 | 40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry International | https://t.co/EMFWTHzZEj | #International | ||
| 40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry #International #link2 https://t.co/EMFWTHzZEj | 107631340 | 1079979313833934855 | 2019-01-01 | NA | FALSE | 1079979313833934855 | Linkgage | 0 | 0 | 0 | 0 | 40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry International link2 | https://t.co/EMFWTHzZEj | #International, #link2 | ||
| Top 10 companies that offer work-from-anywhere jobs https://t.co/uHzor6qahf | 2214623612 | 1079922194644385793 | 2019-01-01 | NA | FALSE | 1079922194644385793 | | 0 | 0 | 0 | 0 | Top 10 companies that offer work-from-anywhere jobs | https://t.co/uHzor6qahf | | | |
| 40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry #International #link1 https://t.co/EMFWTHzZEj | 107631340 | 1079918941319901184 | 2019-01-01 | NA | FALSE | 1079918941319901184 | Linkgage | 0 | 0 | 1 | 0 | 40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry International link1 | https://t.co/EMFWTHzZEj | #International, #link1 | ||
| Top 10 companies that offer work-from-anywhere jobs https://t.co/LOwL9En8VH | 80203918 | 1079914035754291200 | 2019-01-01 | NA | FALSE | 1079914035754291200 | | 0 | 0 | 0 | 0 | Top 10 companies that offer work-from-anywhere jobs | https://t.co/LOwL9En8VH | | | |
| url | id | name | verified | protected | username | description | followers_count | following_count | tweet_count | listed_count | created_at | location | clean_description | description_urls | description_hashtags | description_fintags | description_emoticones |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| https://t.co/L0FKd02CBX | 60433092 | Morissa (Dr. Rissy) Schwartz | TRUE | FALSE | MorissaSchwartz | All my puns are intended! 😆 Owner of top-rated marketing company https://t.co/3xpIccgnoi - @forbes 30 Under 30 🤩 - @entrepreneur Contributor -📚 https://t.co/8XIkYYnjSh CEO | 21180 | 8958 | 74574 | 1632 | 2009-07-27 | New Jersey, USA | All my puns are intended! Owner of top-rated marketing company - @forbes 30 Under 30 - @entrepreneur Contributor - CEO | https://t.co/3xpIccgnoi, https://t.co/8XIkYYnjSh | |||
| https://t.co/i48TLClhwK | 813081982758518784 | A Better Lemonade Stand | FALSE | FALSE | ABLSecom | A Better Lemonade Stand is an online business incubator. We help entrepreneurs build, launch, and grow their online business. | 2108 | 673 | 3888 | 62 | 2016-12-25 | NA | A Better Lemonade Stand is an online business incubator. We help entrepreneurs build, launch, and grow their online business. | ||||
| NA | 862922852063862784 | Mr. J 🏴 | FALSE | FALSE | jasze36 | If you are into smartphones, tech, superheroes, video games, movies, general geekery, football, and anything Malaysian, my tweets are for you. | 447 | 346 | 35486 | 6 | 2017-05-12 | Malesia | If you are into smartphones, tech, superheroes, video games, movies, general geekery, football, and anything Malaysian, my tweets are for you. | ||||
| https://t.co/6qjZ8PCpvM | 91478624 | Forbes | TRUE | FALSE | Forbes | Official account of Forbes, the world’s leading voice for entrepreneurial success and free enterprise, and home to breaking news, business, money and more. | 17331756 | 5149 | 347669 | 61023 | 2009-11-21 | New York, NY | Official account of Forbes, the world s leading voice for entrepreneurial success and free enterprise, and home to breaking news, business, money and more. | ||||
| https://t.co/wk9XBMxbNX | 35700404 | 💉💉😷💪🦠Heather Smith | TRUE | FALSE | HeatherSmithAU | 🦄Hype Girl for Accounting Apps 🦄 Subscribe to the #AccounTech newsletter 👉https://t.co/oQhoMuqNwk Lead with kindness 👸🌞🦋 💐💃 https://t.co/lS92N5HjxV | 10606 | 5177 | 92368 | 434 | 2009-04-27 | Brisbane Australia | Hype Girl for Accounting Apps Subscribe to the AccounTech newsletter Lead with kindness | https://t.co/oQhoMuqNwk, https://t.co/lS92N5HjxV | #AccounTech | | |
| https://t.co/GMdWAQKfMC | 16350207 | Christine Hueber | FALSE | FALSE | ChristineHueber | Do you have your copy of my #FREE Report yet? Get “Top 10 LinkedIn #B2B #Success Secrets” to #hack #LinkedIn for better results now @ https://t.co/GMdWAQKfMC | 8078 | 7987 | 116846 | 367 | 2008-09-18 | San Francisco, CA | Do you have your copy of my FREE Report yet? Get “Top 10 LinkedIn B2B Success Secrets” to hack LinkedIn for better results now @ | https://t.co/GMdWAQKfMC | #FREE , #B2B , #Success , #hack , #LinkedIn | | |
A document-feature matrix (DFM) is a matrix that describes the frequency of features (e.g., terms) that occur in a collection of documents (here tweets).
In a document-feature matrix, rows correspond to documents in the collection and columns correspond to features.
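As a minimal illustration (separate from the tweets pipeline below, and assuming the quanteda and magrittr packages loaded above), a two-document DFM can be built as follows:
toy_texts = c(d1 = "work from anywhere", d2 = "work from home, not from anywhere")
toy_dfm = toy_texts %>% quanteda::tokens(remove_punct = TRUE) %>% quanteda::dfm()# rows = documents, columns = features, cells = counts
toy_dfm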
######################################################
#### 2 FROM STRINGS TO A DOCUMENT-FEATURE MATRIX ####
######################################################
#[OPTION-1]
#data$tweets$clean_text = gsub(pattern = "work[ -]?from[ -]?anywhere", x = data$tweets$clean_text, replacement = " ", ignore.case = T, perl = T)# run this line to remove query terms directly from strings
rownames(data$tweets)=data$tweets$id
rownames(data$users)=data$users$id
#### 2.1 BUILD THE CORPUS
corpus=list(tweets=NULL, users=NULL)
corpus$tweets = quanteda::corpus(data$tweets,text_field = "clean_text",docid_field ="id")
corpus$users = quanteda::corpus(data$users,text_field = "clean_description",docid_field ="id")
#### 2.2 TOKENIZE DOCUMENTS IN CORPUS #####
tokens=list(tweets=NULL, users=NULL)
tokens$tweets= corpus$tweets %>%
quanteda::tokens(.,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_separators = TRUE,
remove_numbers = TRUE,
split_hyphens =FALSE,
verbose=TRUE
)
tokens$users= corpus$users %>%
quanteda::tokens(.,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_separators = TRUE,
remove_numbers = TRUE,
split_hyphens =FALSE,
verbose=TRUE
)
#### 2.3 BUILD DOCUMENT FEATURE MATRIX #####
dfm=list(tweets=NULL, users=NULL)
dfm$tweets = tokens$tweets %>%
quanteda::dfm(.,
tolower = TRUE,
stem = FALSE,
# select = NULL,
# remove = NULL,
# dictionary = NULL,
# thesaurus = NULL,
# case_insensitive = TRUE,
verbose = TRUE
)
print("Initial tweets DFM")
## [1] "Initial tweets DFM"
head(dfm$tweets)
## Document-feature matrix of: 6 documents, 52,013 features (100.0% sparse) and 15 docvars.
## features
## docs digitaltransformation top companies that offer
## 1080012837211291648 1 1 1 1 1
## 1080010814562672640 0 0 0 0 0
## 1079979313833934855 0 0 0 0 0
## 1079922194644385793 0 1 1 1 1
## 1079918941319901184 0 0 0 0 0
## 1079914035754291200 0 1 1 1 1
## features
## docs work-from-anywhere jobs wkmg news and
## 1080012837211291648 1 1 1 1 1
## 1080010814562672640 0 1 0 0 0
## 1079979313833934855 0 1 0 0 0
## 1079922194644385793 1 1 0 0 0
## 1079918941319901184 0 1 0 0 0
## 1079914035754291200 1 1 0 0 0
## [ reached max_nfeat ... 52,003 more features ]
print("Top Features in the initial tweets DFM")
## [1] "Top Features in the initial tweets DFM"
quanteda::topfeatures(dfm$tweets,50) %>% kable(., col.names = NULL)# top 50 features before stopwords removal
| work | 73793 |
| from | 71515 |
| anywhere | 61396 |
| to | 61052 |
| the | 55263 |
| and | 42495 |
| a | 34809 |
| you | 26839 |
| in | 24297 |
| can | 21030 |
| is | 20633 |
| of | 20627 |
| for | 20188 |
| i | 18063 |
| with | 16271 |
| your | 15161 |
| that | 11742 |
| it | 11447 |
| on | 10944 |
| are | 10270 |
| this | 9750 |
| be | 9450 |
| more | 8647 |
| remote | 7977 |
| we | 7959 |
| how | 7906 |
| our | 7719 |
| world | 7665 |
| as | 7534 |
| have | 7394 |
| work-from-anywhere | 7126 |
| at | 7086 |
| home | 6955 |
| my | 6893 |
| business | 6649 |
| new | 6541 |
| will | 5835 |
| s | 5780 |
| an | 5735 |
| employees | 5576 |
| now | 5480 |
| or | 5436 |
| time | 5413 |
| if | 5362 |
| people | 4943 |
| do | 4800 |
| but | 4724 |
| working | 4685 |
| here | 4642 |
| office | 4519 |
dfm$users = tokens$users %>%
quanteda::dfm(.,
tolower = TRUE,
stem = FALSE,
# select = NULL,
# remove = NULL,
# dictionary = NULL,
# thesaurus = NULL,
# case_insensitive = TRUE,
verbose = TRUE
)
print("Initial users DFM")
## [1] "Initial users DFM"
head(dfm$users)
## Document-feature matrix of: 6 documents, 59,828 features (100.0% sparse) and 16 docvars.
## features
## docs all my puns are intended owner of top-rated marketing
## 60433092 1 1 1 1 1 1 1 1 1
## 813081982758518784 0 0 0 0 0 0 0 0 0
## 862922852063862784 0 1 0 2 0 0 0 0 0
## 91478624 0 0 0 0 0 0 1 0 0
## 35700404 0 0 0 0 0 0 0 0 0
## 16350207 0 1 0 0 0 0 1 0 0
## features
## docs company
## 60433092 1
## 813081982758518784 0
## 862922852063862784 0
## 91478624 0
## 35700404 0
## 16350207 0
## [ reached max_nfeat ... 59,818 more features ]
print("Top Features in the initial users DFM")
## [1] "Top Features in the initial users DFM"
quanteda::topfeatures(dfm$users,50) %>% kable(., col.names = NULL)# top 50 features before stopwords removal
| and | 15903 |
| the | 11501 |
| of | 9955 |
| to | 8549 |
| a | 7804 |
| in | 6575 |
| for | 6552 |
| i | 4455 |
| my | 4290 |
| is | 3957 |
| na | 3732 |
| with | 3373 |
| business | 3306 |
| on | 3060 |
| are | 3050 |
| at | 2853 |
| your | 2816 |
| you | 2798 |
| it | 2529 |
| we | 2515 |
| all | 1923 |
| own | 1900 |
| marketing | 1844 |
| digital | 1795 |
| technology | 1766 |
| by | 1655 |
| that | 1645 |
| from | 1587 |
| tech | 1559 |
| news | 1556 |
| solutions | 1498 |
| about | 1493 |
| life | 1394 |
| work | 1378 |
| our | 1365 |
| services | 1351 |
| an | 1343 |
| not | 1328 |
| love | 1236 |
| world | 1232 |
| people | 1218 |
| more | 1196 |
| help | 1186 |
| founder | 1145 |
| views | 1122 |
| s | 1069 |
| me | 1050 |
| tweets | 1041 |
| cloud | 1008 |
| be | 989 |
#### 2.4 FILTERING THE DOCUMENT FEATURE MATRIX #####
unigrams_vocab=list(tweets=NULL, users=NULL)
for(i in c("tweets","users")){
#### 2.4.1 REMOVE STOPWORDS FROM DFM #####
if(i=="users"){
mystopwords=c(quanteda::stopwords(language = "en"),"s","$","t", "can")
}else{
#[OPTION-2]
mystopwords=c(quanteda::stopwords(language = "en"),"s","$","t", "can","work","from","anywhere","work-from-anywhere","workfromanywhere")#added query terms as stopwords (one could do better by removing them from the strings before building the corpus. See [OPTION-1])
}
dfm[[i]] = dfm_remove(dfm[[i]], pattern = mystopwords, valuetype = "fixed")
#### 2.4.2 BUILD INITIAL UNIGRAMS DICTIONARY #####
unigrams_vocab[[i]] = featnames(dfm[[i]])
#### 2.4.3 IDENTIFY RARE WORDS IN DFM #####
threshold=round(nrow(dfm[[i]])/1000,digits = 0)# one every 1000 documents
removed_rare=colSums(dfm[[i]])[colSums(dfm[[i]]>0)<threshold]#remove all tokens that don't appear in at least one document out of every 1000
names(removed_rare)=featnames(dfm[[i]])[colSums(dfm[[i]]>0)<threshold]
removed_rare=sort(removed_rare,decreasing = T)
#removed_rare %>% kable()
#table(colSums(dfm[[i]]>0)<threshold) #tabulate N features above (TRUE) and below (FALSE) threshold
#### 2.4.4 IDENTIFY SHORT WORDS (THREE CHARACTERS OR FEWER) IN DFM #####
removed_short=colSums(dfm[[i]])[nchar(featnames(dfm[[i]]))<=3]
names(removed_short)=featnames(dfm[[i]])[nchar(featnames(dfm[[i]]))<=3]
removed_short=sort(removed_short,decreasing = T)
#removed_short %>% kable()
table(nchar(featnames(dfm[[i]]))<=3)
#### 2.4.5 REMOVE RARE/SHORT/NUMBER TOKENS FROM DFM #####
dfm[[i]]=dfm_select(
dfm[[i]],
pattern = unique(c(names(removed_rare),names(removed_short))),
selection = "remove",
valuetype = "fixed",
case_insensitive = FALSE,
verbose = quanteda_options("verbose")
)
#### 2.4.6 KEEP ONLY DOCS WITH AT LEAST THREE TOKENS #####
dfm[[i]] = dfm_subset(dfm[[i]], ntoken(dfm[[i]]) >= 3)#keep only docs with at least 3 tokens
unigrams_vocab[[i]]=featnames(dfm[[i]])
#table(rowSums(dfm[[i]])>=1)#to check if all docs have at least one term
#table(colSums(dfm[[i]])>=1)#to check if all terms (i.e., features) appear at least once
docvars(dfm[[i]])=data[[i]][docnames(dfm[[i]]),]
}
#Top features in tweets dfm
print("Final tweets DFM")
## [1] "Final tweets DFM"
head(dfm$tweets)
## Document-feature matrix of: 6 documents, 1,780 features (99.7% sparse) and 17 docvars.
## features
## docs digitaltransformation companies offer jobs news best
## 1080012837211291648 1 1 1 1 1 0
## 1080010814562672640 0 0 0 1 0 1
## 1079979313833934855 0 0 0 1 0 1
## 1079922194644385793 0 1 1 1 0 0
## 1079918941319901184 0 0 0 1 0 1
## 1079914035754291200 0 1 1 1 0 0
## features
## docs international home world online
## 1080012837211291648 0 0 0 0
## 1080010814562672640 2 1 1 1
## 1079979313833934855 2 1 1 1
## 1079922194644385793 0 0 0 0
## 1079918941319901184 2 1 1 1
## 1079914035754291200 0 0 0 0
## [ reached max_nfeat ... 1,770 more features ]
print("Top Features in the final tweets DFM")
## [1] "Top Features in the final tweets DFM"
topfeatures(dfm$tweets,50) %>% kable(., col.names = NULL)
| remote | 7895 |
| world | 7454 |
| home | 6611 |
| business | 6604 |
| employees | 5226 |
| time | 5092 |
| people | 4917 |
| working | 4666 |
| office | 4425 |
| need | 3663 |
| want | 3541 |
| remotework | 3530 |
| learn | 3277 |
| just | 3054 |
| help | 3053 |
| make | 2955 |
| like | 2949 |
| team | 2834 |
| able | 2702 |
| today | 2538 |
| free | 2533 |
| join | 2489 |
| companies | 2479 |
| workfromhome | 2471 |
| best | 2463 |
| live | 2421 |
| freedom | 2374 |
| cloud | 2295 |
| online | 2248 |
| technology | 2227 |
| start | 2218 |
| company | 2144 |
| digital | 2069 |
| looking | 2054 |
| internet | 2021 |
| future | 2014 |
| travel | 1991 |
| love | 1983 |
| become | 1940 |
| many | 1900 |
| hiring | 1888 |
| jobs | 1873 |
| life | 1837 |
| read | 1829 |
| tech | 1827 |
| hours | 1801 |
| solutions | 1764 |
| pandemic | 1761 |
| find | 1727 |
| check | 1715 |
print("Final users DFM")
## [1] "Final users DFM"
head(dfm$users)
## Document-feature matrix of: 6 documents, 1,312 features (99.4% sparse) and 18 docvars.
## features
## docs owner marketing company @forbes contributor better stand
## 60433092 1 1 1 1 1 0 0
## 813081982758518784 0 0 0 0 0 1 1
## 862922852063862784 0 0 0 0 0 0 0
## 91478624 0 0 0 0 0 0 0
## 35700404 0 0 0 0 0 0 0
## 16350207 0 0 0 0 0 1 0
## features
## docs online business help
## 60433092 0 0 0
## 813081982758518784 2 2 1
## 862922852063862784 0 0 0
## 91478624 0 1 0
## 35700404 0 0 0
## 16350207 0 0 0
## [ reached max_nfeat ... 1,302 more features ]
print("Top Features in the final users DFM")
## [1] "Top Features in the final users DFM"
topfeatures(dfm$users,50) %>% kable(., col.names = NULL)
#topfeatures(dfm$tweets,50,scheme = "docfreq")# number of docs in which the token is present (scheme = "docfreq")
#topfeatures(dfm$tweets,50) / sum(ntoken(dfm$tweets)) #relative freq. of token w.r.t. total number of tokens
docvars(dfm$tweets,field = "postCOVID")=docvars(dfm$tweets,field = "created_at")>=as.Date("2020-03-11")
docvars(dfm$tweets,field = "most_liked")=docvars(dfm$tweets,field = "like_count")> quantile(docvars(dfm$tweets,field = "like_count"), c(.5))
docvars(dfm$tweets,field = "most_discussed")=docvars(dfm$tweets,field = "reply_count")> quantile(docvars(dfm$tweets,field = "reply_count"), c(.5))
docvars(dfm$tweets,field = "most_retweeted")=docvars(dfm$tweets,field = "retweet_count")> quantile(docvars(dfm$tweets,field = "retweet_count"), c(.5))
print("postCOVID top tokens counts (tweets)")
## [1] "postCOVID top tokens counts (tweets)"
topfeatures(dfm$tweets,50,groups = "postCOVID") %>% kable(., col.names = NULL)
print("(0.5 percentile) Most liked top tokens counts (tweets)")
## [1] "(0.5 percentile) Most liked top tokens counts (tweets)"
topfeatures(dfm$tweets,50,groups = "most_liked") %>% kable(., col.names = NULL)
print("(0.5 percentile) Most discussed top tokens counts (tweets)")
## [1] "(0.5 percentile) Most discussed top tokens counts (tweets)"
topfeatures(dfm$tweets,50,groups = "most_discussed") %>% kable(., col.names = NULL)
print("(0.5 percentile) Most retweeted top tokens counts (tweets)")
## [1] "(0.5 percentile) Most retweeted top tokens counts (tweets)"
topfeatures(dfm$tweets,50,groups = "most_retweeted")%>% kable(., col.names = NULL)
print("Verified accounts top tokens counts (users)")
## [1] "Verified accounts top tokens counts (users)"
topfeatures(dfm$users,50,groups = "verified") %>% kable(., col.names = NULL)
Regular expression (RegEx): a sequence of characters that defines a search pattern, used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.
With character classes you can tell the regex engine to match only one out of several characters: simply place the characters you want to match between square brackets. For example, to match an a or an e, use the character class [ae].
Some pre-built character classes that can be used inside RegEx patterns:
[:punct:]: punctuation;
[:alpha:]: letters;
[:lower:]: lowercase letters;
[:upper:]: uppercase letters;
[:digit:]: digits;
[:xdigit:]: hex digits;
[:alnum:]: letters and numbers;
[:cntrl:]: control characters;
[:graph:]: letters, numbers, and punctuation;
[:print:]: letters, numbers, punctuation, and whitespace;
[:space:]: space characters (basically equivalent to \s);
[:blank:]: space and tab;
You can create your own character classes using []:
[a-z]: matches every character between a and z (in Unicode code point order).
[^abc]: matches anything except a, b, or c.
[\^\-]: matches ^ or - (the characters are escaped).
Quantifiers (e.g., ?, *, +, {n}, {n,m}) specify the number of repetitions of the pattern to be matched.
Position anchors (e.g., ^, $, \b) specify the position conditions of the pattern to be matched; a small sketch of both follows.
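A minimal base-R sketch of character classes, quantifiers and anchors (the example strings are invented for illustration):
x = c("work from anywhere", "work-from-anywhere", "workfromanywhere", "Work from home")
grepl("work[ -]?from[ -]?anywhere", x)# ? quantifier: zero or one separator character between the words
grepl("^[Ww]ork", x)# ^ anchors the match at the start of the string; [Ww] is a character class
grepl("anywhere$", x)# $ anchors the match at the end of the string
regmatches(x, regexpr("[[:alpha:]]{5,}", x))# {5,} quantifier: runs of five or more letters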
* The kwic function (in the Quanteda library) can be used to search and filter strings by RegEx conditions, and output the context (neighborhood) of identified RegEx matches.
* The gregexpr function can be used to identify the boundaries of sub-strings that match RegEx patterns (also multi-group).
########################################
#### 3 REGEX AND REGEX DICTIONARIES ####
########################################
#### 3.1 KEYWORDS IN CONTEXT (KWIC) WITH QUANTEDA ####
#SIMPLE EXAMPLES
#tweets and descriptions containing like/dislike emojis (removed from clean_text)
print("Keyword in context: 👍")
## [1] "Keyword in context: \U0001f44d"
quanteda::kwic(data$tweets$text,"👍",window=5,) %>% head() %>% kable()# UTF code \U0001f44d
| docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
| text2377 | 53 | 53 | get paid on our results | 👍 | 😉 | 👍 |
| text2479 | 19 | 19 | as effective as hands-on healing | 👍 | 👍 | |
| text2911 | 35 | 35 | Click below and apply ! | 👍 | #witpghjobs #pghtech #womeninstem https://t.co/XyaGoc6GrO | 👍 |
| text3372 | 47 | 47 | are a remote worker . | 👍 | https://t.co/XVi89vucZx | 👍 |
| text3534 | 33 | 33 | Remote - work from anywhere | 👍 | Apply https://t.co/RcRnheL17D https://t.co/h6z8A1OHfn | 👍 |
| text3911 | 55 | 55 | the top class work . | 👍 | 👍 | 👍 |
print("Keyword in context: 👎")
## [1] "Keyword in context: \U0001f44e"
quanteda::kwic(data$tweets$text,"👎",window=5) %>% head() %>% kable()# UTF code \U0001f44e
| docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
| text47324 | 38 | 38 | is you , congratulations ! | 👎 | There are thousands of people | 👎 |
print("Keyword in context: 🏔|🏝")
## [1] "Keyword in context: \U0001f3d4|\U0001f3dd"
quanteda::kwic(data$tweets$text,"🏔|🏝",window=5,valuetype="regex") %>% head() %>% kable()# UTF code \U0001f3d4 OR UTF code \U0001f3dd
| docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
| text202 | 39 | 39 | the top of Colorado ! | 🏔 | https://t.co/djXypfBfHD | 🏔|🏝 |
| text351 | 22 | 22 | : • Work from anywhere | 🏝 | • Choose your hours ⏱ | 🏔|🏝 |
| text670 | 5 | 5 | Work from anywhere ! | 🏔 | 🏕 🏖 🏜 🏞 🏘 | 🏔|🏝 |
| text775 | 26 | 26 | as you have a phone | 🏝 | 📱 Choose your hours you | 🏔|🏝 |
| text2159 | 1 | 1 | 🏝 | BOSS BABES WANTED 🏝 Tired | 🏔|🏝 | |
| text2159 | 5 | 5 | 🏝 BOSS BABES WANTED | 🏝 | Tired of working long hours | 🏔|🏝 |
#ADVANCED EXAMPLES
print("Keyword in context: work[- ]?from[- ]?anywhere[ -]?job")
## [1] "Keyword in context: work[- ]?from[- ]?anywhere[ -]?job"
quanteda::kwic(data$tweets$text,"work[- ]?from[- ]?anywhere[ -]?job",window=5,valuetype="regex") %>% head() %>% kable()
| docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
| text14356 | 40 | 40 | #remotework #remotejobs #jobs #jobsearch #workfromanywhere | #workfromanywherejobs | #fjblog https://t.co/nK8aC7vV1R | work[- ]?from[- ]?anywhere[ -]?job |
| text14690 | 41 | 41 | " @MSN #remotework #remotejobs #workfromanywhere | #workfromanywherejobs | #jobs https://t.co/Hzecirx0DZ | work[- ]?from[- ]?anywhere[ -]?job |
| text14957 | 26 | 26 | 👉 https://t.co/pGeV1Fes5V @CNBCMakeIt #remotework #remotejobs | #workfromanywherejobs | #workfromanywhere #fjinthenews #hiring #hiringnow #worklifebalance | work[- ]?from[- ]?anywhere[ -]?job |
| text15074 | 17 | 17 | should try Visit https://t.co/u69eTjdscq #VoiceNOCTechnician | #WorkFromAnywhereJobs | #NetworkOperationsTechnician #VoiceServices #NocEngineers | work[- ]?from[- ]?anywhere[ -]?job |
| text17671 | 23 | 23 | from Anywhere ] https://t.co/uX66tc2H8o #HardwareIntegrationEngineer | #WorkFromAnywhereJobs | #DesignIntegrationEngineer #IntegrationEngineerJobs #SystemIntegration | work[- ]?from[- ]?anywhere[ -]?job |
| text18547 | 27 | 27 | mode Remote Jobs : https://t.co/ikCjnJdwuS | #WorkFromAnywhereJobs | #CiscoJobs #EmploymentOutlook #CiscoEmployment #Cisco @GE_Summit | work[- ]?from[- ]?anywhere[ -]?job |
print("Keyword in context: (?<!(do not |don't ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}")
## [1] "Keyword in context: (?<!(do not |don't ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}"
quanteda::kwic(data$tweets$text,"(?<!(do not |don't ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}",valuetype="regex",window=5) %>% head() %>% kable()
| docname | from | to | pre | keyword | post | pattern |
|---|---|---|---|---|---|---|
| text16 | 16 | 16 | anywhere as long as you | like | researching and fiddling with cutting-edge | (?<!(do not |don’t ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,} |
| text18 | 16 | 16 | anywhere as long as you | like | researching and fiddling with cutting-edge | (?<!(do not |don’t ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,} |
| text26 | 20 | 20 | freelance more so I feel | like | Im missing out on being | (?<!(do not |don’t ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,} |
| text39 | 16 | 16 | all levels , full of | Love | , Light , Joy and | (?<!(do not |don’t ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,} |
| text44 | 27 | 27 | , what places would you | like | to see ? https://t.co/swiL3mXmbX | (?<!(do not |don’t ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,} |
| text56 | 46 | 46 | . You’ll be impressed just | like | the customer will . Thank | (?<!(do not |don’t ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,} |
#### 3.2 EXTRACTING EMOJI FROM UTF-8 STRINGS USING A REGEX DICTIONARY ####
print("Emoji dictionary")
## [1] "Emoji dictionary"
head(remoji:::dat_core) %>% kable()
| emoji | description |
|---|---|
| 😄 | smiling face with open mouth and smiling eyes |
| 😃 | smiling face with open mouth |
| 😀 | grinning face |
| 😊 | smiling face with smiling eyes |
| ☺️ | white smiling face |
| 😉 | winking face |
dict_emoji=remoji:::dat_core$emoji
names(dict_emoji)=remoji:::dat_core$description
data$tweets=data$tweets %>% dplyr::mutate(emoji = regmatches(
text,
gregexpr(pattern = paste(dict_emoji,collapse = "|"), text, perl = TRUE)
))
data$users=data$users %>% dplyr::mutate(emoji = regmatches(
description,
gregexpr(pattern = paste(dict_emoji,collapse = "|"), description, perl = TRUE)
))
#### 3.3 IDENTIFY AND EXTRACT A CAUSAL ARGUMENT WITH A REGEX ####
#THE FOLLOWING CODE SNIPPET WAS DEVELOPED FOR A PAPER BY SANTAGIUSTINA & WARGLIEN ABOUT THE NO-DEAL BREXIT (FORTHCOMING)
list_cause_verbs = c("cause","causes", "causing","caused",
"determine","determines","determining","determined",
"engender","engenders","engendered","engendering",
"entail","entails","entailed","entailing",
"provoke","provokes","provoked","provoking",
"trigger","triggers","triggered","triggering"
)
print("Causal verbs:")
## [1] "Causal verbs:"
list_cause_verbs
## [1] "cause" "causes" "causing" "caused" "determine"
## [6] "determines" "determining" "determined" "engender" "engenders"
## [11] "engendered" "engendering" "entail" "entails" "entailed"
## [16] "entailing" "provoke" "provokes" "provoked" "provoking"
## [21] "trigger" "triggers" "triggered" "triggering"
#build verbs regex
all_at_end=paste(
c(list_cause_verbs),
collapse = "|"
)
#IDENTIFY CAUSAL STATEMENTS (NAIVE FORM)
cause_effect_regex = paste("(?<cause>[^:;,!]{2,})(?<rel_operator>",all_at_end,"){1}(?<effect>[^:;,!]{2,})",sep = "")
example= quanteda::tokenize_sentence(data$tweets$text[61257])[[1]][1] #"I wonder if starlink will cause a wave of reverse migration?"
print(example)
## [1] "I wonder if starlink will cause a wave of reverse migration?"
match=gregexpr(
cause_effect_regex ,
example,
perl = T,
ignore.case = T
)
print("Gregexpr output:")
## [1] "Gregexpr output:"
match
## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 60
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## attr(,"capture.start")
## cause rel_operator effect
## [1,] 1 27 32
## attr(,"capture.length")
## cause rel_operator effect
## [1,] 26 5 29
## attr(,"capture.names")
## [1] "cause" "rel_operator" "effect"
#cause
print("RegEx matches CAUSE:")
## [1] "RegEx matches CAUSE:"
st = attr(match[[1]], "capture.start")[, "cause"]
len = attr(match[[1]], "capture.length")[, "cause"]
substring(example, st, st + len - 1)
## [1] "I wonder if starlink will "
print("RegEx matches OPERATOR (RELATION):")
## [1] "RegEx matches OPERATOR (RELATION):"
#rel_operator
st = attr(match[[1]], "capture.start")[, "rel_operator"]
len = attr(match[[1]], "capture.length")[, "rel_operator"]
substring(example, st, st + len - 1)
## [1] "cause"
print("RegEx matches EFFECT:")
## [1] "RegEx matches EFFECT:"
#effect
st = attr(match[[1]], "capture.start")[, "effect"]
len = attr(match[[1]], "capture.length")[, "effect"]
substring(example, st, st + len - 1)
## [1] " a wave of reverse migration?"
A feature co-occurrence matrix (FCM) records the number of co-occurrences, within documents (here tweets), of features (e.g., terms). Each feature is identified by both a row and a column, so the matrix is symmetric, and the diagonal (often) contains feature occurrence counts.
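As a minimal illustration (separate from the tweets pipeline below, assuming quanteda is loaded), a two-document FCM can be computed as follows:
toy_tokens = quanteda::tokens(c(d1 = "remote work from anywhere", d2 = "remote work from home"))
quanteda::fcm(toy_tokens, context = "document")# within-document co-occurrence counts for each pair of features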
#### 4 FEATURES CO-OCCURRENCE MATRIX (FCM) AND NETWORK ####
topN=100 #parameter representing the number of top features considered
edge_filter_threshold=0.95 #parameter representing the percentile threshold filter applied to edges
#### 4.1 TWEETS FCM ####
tweets_fcm= dfm$tweets %>%
quanteda::dfm_select(.,pattern= names(topfeatures(dfm$tweets,topN)) %>% as.character(),selection="keep",valuetype="fixed") %>%
quanteda::fcm(.)
tweets_fcm_graph=tweets_fcm %>% igraph::graph_from_adjacency_matrix(.,weighted = T,diag = F,mode = "upper")
tweets_fcm_graph=list(edges=as.data.frame(cbind(igraph::as_edgelist(tweets_fcm_graph),E(tweets_fcm_graph)$weight)),
nodes= data.frame(token=names(topfeatures(dfm$tweets,topN)),freq=topfeatures(dfm$tweets,topN)) )
#edge (percentile) filter
keep_edges=which(as.numeric(tweets_fcm_graph$edges[,3]) >= quantile(as.numeric(tweets_fcm_graph$edges[,3]),edge_filter_threshold))
visNetwork(
height = "900px",
width = "100%",
nodes = data.frame(
id = tweets_fcm_graph$nodes[,1]
,
label = tweets_fcm_graph$nodes[,1]
,
value = as.integer(as.numeric(tweets_fcm_graph$nodes[,2]))^2
,
color.background = "white"
,
color.border= "black"
,
shape = "box"
#,group = temp_$vertices$community_walktrap
),
edges = data.frame(
from=tweets_fcm_graph$edges[keep_edges,1]
,
to= tweets_fcm_graph$edges[keep_edges,2]
,
width = as.integer(as.numeric(tweets_fcm_graph$edges[keep_edges,3])^(1/2))/5 #log(1.001+temp_$edges$weight)*1000
, smooth.enabled = F
# ,smooth.roundness = 0.70
,color = list(color = "darkgray",
highlight = "darkred")
)
) %>%
visNodes(scaling = list(label = list(enabled = T)))%>%
visIgraphLayout(layout = "layout_nicely",physics = F,smooth = T) %>%
visInteraction(
keyboard = F,
navigationButtons = T,
dragNodes = T,
dragView = T,
zoomView = T
) %>%
visOptions(
highlightNearest = list(enabled = TRUE, hover = TRUE, degree = 1),
nodesIdSelection = T
)
#### 4.2 EMOJI FCM ####
tweets_emoji_freq=data$tweets$emoji %>% unlist() %>% table() %>% sort(decreasing = T) %>% as.data.frame()
#tweets_emoji_freq %>% View()
text_emojis=data$tweets$emoji
names(text_emojis)=data$tweets$id
text_emojis_fcm= text_emojis %>%
quanteda::as.tokens(.) %>%
quanteda::dfm(.,tolower = F) %>%
quanteda::dfm_select(.,pattern= tweets_emoji_freq[1:topN,1] %>% as.character(),selection="keep",valuetype="fixed") %>%
quanteda::fcm(.)
#head(text_emojis_fcm)
text_emojis_fcm_graph=text_emojis_fcm %>% igraph::graph_from_adjacency_matrix(.,weighted = T,diag = F,mode = "upper")
text_emojis_fcm_graph=list(edges=as.data.frame(cbind(igraph::as_edgelist(text_emojis_fcm_graph),E(text_emojis_fcm_graph)$weight)),
nodes= tweets_emoji_freq[1:topN,])
keep_edges=which(as.numeric(text_emojis_fcm_graph$edges[,3]) >= quantile(as.numeric(text_emojis_fcm_graph$edges[,3]),edge_filter_threshold))
visNetwork(
height = "900px",
width = "100%",
nodes = data.frame(
id = text_emojis_fcm_graph$nodes[,1]
,
label = text_emojis_fcm_graph$nodes[,1]
,
value = as.integer(as.numeric(text_emojis_fcm_graph$nodes[,2]))^2
,
color.background = "white"
,
color.border= "black"
,
shape = "box"
#,group = temp_$vertices$community_walktrap
),
edges = data.frame(
from=text_emojis_fcm_graph$edges[keep_edges,1]
,
to= text_emojis_fcm_graph$edges[keep_edges,2]
,
width = as.integer(as.numeric(text_emojis_fcm_graph$edges[keep_edges,3])^(1/2))/5 #log(1.001+temp_$edges$weight)*1000
, smooth.enabled = F
# ,smooth.roundness = 0.70
,color = list(color = "darkgray",
highlight = "darkred")
)
) %>%
visNodes(scaling = list(label = list(enabled = T)))%>%
visIgraphLayout(layout = "layout_nicely",physics = F,smooth = T) %>%
visInteraction(
keyboard = F,
navigationButtons = T,
dragNodes = T,
dragView = T,
zoomView = T
) %>%
visOptions(
highlightNearest = list(enabled = TRUE, hover = TRUE, degree = 1),
nodesIdSelection = T
)
spaCy website: https://spacy.io/
spacyr GitHub page: https://github.com/quanteda/spacyr
We use spaCy's state-of-the-art transformer model for web data (en_core_web_trf), which is based on RoBERTa. More info at https://huggingface.co/roberta-base
- Paper on transformer models for NLP by Vaswani et al. (2017): https://arxiv.org/pdf/1706.03762.pdf
- RoBERTa paper by Liu et al. (2019): https://arxiv.org/pdf/1907.11692.pdf
Part-of-Speech tagging is the process of labeling a word in a text as corresponding to a particular part of speech (role), also based on its context;
Dependency parsing is the process of extracting the dependency parse of a sentence to represent its grammatical structure. It defines the dependency relationship between headwords and their dependents. The head of a sentence has no dependency and is called the root of the sentence.
Spacy(R) basic named entity categories:
PERSON: People, including fictional;
NORP: Nationalities or religious or political groups;
FAC: Buildings, airports, highways, bridges, etc.;
ORG: Companies, agencies, institutions, etc.;
GPE: Countries, cities, states;
LOC: Non-GPE locations, mountain ranges, bodies of water;
PRODUCT: Objects, vehicles, foods, etc. (not services);
EVENT: Named hurricanes, battles, wars, sports events, etc.;
WORK_OF_ART: Titles of books, songs, etc.;
LAW: Named documents made into laws;
LANGUAGE: Any named language.
Spacy(R) extended categories:
DATE: Absolute or relative dates or periods;
TIME: Times smaller than a day;
PERCENT: Percentage, including “%”;
MONEY: Monetary values, including unit;
QUANTITY: Measurements, as of weight or distance;
ORDINAL: “first”, “second”, etc.;
CARDINAL: Numerals that do not fall under another type.
########################################
#### 5 NATURAL LANGUAGE PROCESSING ####
########################################
#### 5.1 NLP WITH SPACY (ROBERTA TRANSFORMER MODEL) THROUGH SPACYR R LIBRARY ####
#Sys.setenv("OMP_NUM_THREADS"=10)# to set the number of threads used by spacy
#spacyr::spacy_install() #to install spacy through conda/miniconda in the local machine
#spacyr::spacy_install(lang_models = c("en_core_web_trf"))#to load the RoBERTa trf model (https://huggingface.co/roberta-base )
spacyr::spacy_initialize(model = "en_core_web_trf")#Initialize the spaCy transformer model. Other spaCy models for English web data: "en_core_web_sm", "en_core_web_lg"
#EXAMPLE
ids=c(1080470620293550080)#ID(s) of the selected tweets
tweet_example=data$tweets$clean_text[which(data$tweets$id %in% ids)]#select tweet by ID
print(tweet_example)
## [1] "Learn how to manage 1000s of devices for your remote workforce. Join us on Jan 8 at 2 PM CET. See how TeamViewer Tensor enhances your productivity by empowering all employees to work from anywhere with any device at any time in a secure and fast manner. "
NLP_example=list(spacy_raw=NULL ,entities=NULL ,nounphrases=NULL)
NLP_example$spacy_raw= spacy_parse(
tweet_example,
dependency = TRUE,
lemma = TRUE,
pos = TRUE,
entity = TRUE,
nounphrase = TRUE,
additional_attributes = c("tag_","ent_type_","norm_"))#nlp pipeleine (tokenize sentences, tokenize words, POS, DEP-REL, UNIV.POS, EE ...)
NLP_example$spacy_raw %>% kable()
| doc_id | sentence_id | token_id | token | lemma | pos | head_token_id | dep_rel | entity | nounphrase | whitespace | tag_ | ent_type_ | norm_ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| text1 | 1 | 1 | Learn | learn | VERB | 1 | ROOT | TRUE | VB | learn | |||
| text1 | 1 | 2 | how | how | ADV | 4 | advmod | TRUE | WRB | how | |||
| text1 | 1 | 3 | to | to | PART | 4 | aux | TRUE | TO | to | |||
| text1 | 1 | 4 | manage | manage | VERB | 1 | xcomp | TRUE | VB | manage | |||
| text1 | 1 | 5 | 1000s | 1000 | NOUN | 4 | dobj | CARDINAL_B | beg_root | TRUE | NNS | CARDINAL | 1000s |
| text1 | 1 | 6 | of | of | ADP | 5 | prep | TRUE | IN | of | |||
| text1 | 1 | 7 | devices | device | NOUN | 6 | pobj | beg_root | TRUE | NNS | devices | ||
| text1 | 1 | 8 | for | for | ADP | 4 | prep | TRUE | IN | for | |||
| text1 | 1 | 9 | your | your | PRON | 11 | poss | beg | TRUE | PRP$ | your | ||
| text1 | 1 | 10 | remote | remote | ADJ | 11 | amod | mid | TRUE | JJ | remote | ||
| text1 | 1 | 11 | workforce | workforce | NOUN | 8 | pobj | end_root | FALSE | NN | workforce | ||
| text1 | 1 | 12 | . | . | PUNCT | 1 | punct | TRUE | . | . | |||
| text1 | 2 | 1 | Join | join | VERB | 1 | ROOT | TRUE | VB | join | |||
| text1 | 2 | 2 | us | we | PRON | 1 | dobj | beg_root | TRUE | PRP | us | ||
| text1 | 2 | 3 | on | on | ADP | 1 | prep | TRUE | IN | on | |||
| text1 | 2 | 4 | Jan | Jan | PROPN | 3 | pobj | DATE_B | beg_root | TRUE | NNP | DATE | jan |
| text1 | 2 | 5 | 8 | 8 | NUM | 4 | nummod | DATE_I | TRUE | CD | DATE | 8 | |
| text1 | 2 | 6 | at | at | ADP | 1 | prep | TRUE | IN | at | |||
| text1 | 2 | 7 | 2 | 2 | NUM | 8 | nummod | TIME_B | beg | TRUE | CD | TIME | 2 |
| text1 | 2 | 8 | PM | pm | NOUN | 9 | compound | TIME_I | mid | TRUE | NN | TIME | pm |
| text1 | 2 | 9 | CET | CET | PROPN | 6 | pobj | TIME_I | end_root | FALSE | NNP | TIME | cet |
| text1 | 2 | 10 | . | . | PUNCT | 1 | punct | TRUE | . | . | |||
| text1 | 3 | 1 | SPACE | 1 | ROOT | FALSE | _SP | ||||||
| text1 | 4 | 1 | See | see | VERB | 1 | ROOT | TRUE | VB | see | |||
| text1 | 4 | 2 | how | how | ADV | 5 | advmod | TRUE | WRB | how | |||
| text1 | 4 | 3 | TeamViewer | TeamViewer | PROPN | 4 | compound | PRODUCT_B | beg | TRUE | NNP | PRODUCT | teamviewer |
| text1 | 4 | 4 | Tensor | Tensor | PROPN | 5 | nsubj | PRODUCT_I | end_root | TRUE | NNP | PRODUCT | tensor |
| text1 | 4 | 5 | enhances | enhance | VERB | 1 | ccomp | TRUE | VBZ | enhances | |||
| text1 | 4 | 6 | your | your | PRON | 7 | poss | beg | TRUE | PRP$ | your | ||
| text1 | 4 | 7 | productivity | productivity | NOUN | 5 | dobj | end_root | TRUE | NN | productivity | ||
| text1 | 4 | 8 | by | by | ADP | 5 | prep | TRUE | IN | by | |||
| text1 | 4 | 9 | empowering | empower | VERB | 8 | pcomp | TRUE | VBG | empowering | |||
| text1 | 4 | 10 | all | all | DET | 11 | det | beg | TRUE | DT | all | ||
| text1 | 4 | 11 | employees | employee | NOUN | 9 | dobj | end_root | TRUE | NNS | employees | ||
| text1 | 4 | 12 | to | to | PART | 13 | aux | TRUE | TO | to | |||
| text1 | 4 | 13 | work | work | VERB | 9 | xcomp | TRUE | VB | work | |||
| text1 | 4 | 14 | from | from | ADP | 13 | prep | TRUE | IN | from | |||
| text1 | 4 | 15 | anywhere | anywhere | ADV | 14 | pobj | TRUE | RB | anywhere | |||
| text1 | 4 | 16 | with | with | ADP | 13 | prep | TRUE | IN | with | |||
| text1 | 4 | 17 | any | any | DET | 18 | det | beg | TRUE | DT | any | ||
| text1 | 4 | 18 | device | device | NOUN | 16 | pobj | end_root | TRUE | NN | device | ||
| text1 | 4 | 19 | at | at | ADP | 13 | prep | TRUE | IN | at | |||
| text1 | 4 | 20 | any | any | DET | 21 | det | beg | TRUE | DT | any | ||
| text1 | 4 | 21 | time | time | NOUN | 19 | pobj | end_root | TRUE | NN | time | ||
| text1 | 4 | 22 | in | in | ADP | 13 | prep | TRUE | IN | in | |||
| text1 | 4 | 23 | a | a | DET | 27 | det | beg | TRUE | DT | a | ||
| text1 | 4 | 24 | secure | secure | ADJ | 27 | amod | mid | TRUE | JJ | secure | ||
| text1 | 4 | 25 | and | and | CCONJ | 24 | cc | mid | TRUE | CC | and | ||
| text1 | 4 | 26 | fast | fast | ADJ | 24 | conj | mid | TRUE | JJ | fast | ||
| text1 | 4 | 27 | manner | manner | NOUN | 22 | pobj | end_root | FALSE | NN | manner | ||
| text1 | 4 | 28 | . | . | PUNCT | 1 | punct | TRUE | . | . | |||
| text1 | 5 | 1 | SPACE | 1 | ROOT | FALSE | _SP |
NLP_example$entities= spacyr::entity_consolidate(NLP_example$spacy_raw)#(named) entity consolidation
NLP_example$nounphrases= spacyr::nounphrase_extract(NLP_example$spacy_raw)#nounphrase consolidation
#NLP FOR THE WHOLE DATASET (users and tweets) TAKES +-10HRS
# #Code below was used for NLP pipeline with all tweets and users descriptions
# NLP_data_raw=list(tweets=NULL,users=NULL)
# users=data$users$clean_description
# names(users)=data$users$id
# tweets=data$tweets$clean_text
# names(tweets)=data$tweets$id
# start=Sys.time()
# NLP_data_raw$users = spacy_parse(
# users,
# dependency = TRUE,
# lemma = TRUE,
# pos = TRUE,
# entity = TRUE,
# nounphrase = TRUE,
# additional_attributes = c("tag_")
# ) #,"morph"
# NLP_data_raw$tweets = spacy_parse(
# tweets,
# dependency = TRUE,
# lemma = TRUE,
# pos = TRUE,
# entity = T,
# nounphrase = T,
# additional_attributes = c("tag_")
# ) #,"morph"
# print(paste("NLP time:", Sys.time()-start, "hours"))#"NLP time: 10.1510577821732 hours" on
# spacy_finalize()
# saveRDS(NLP_data_raw,"NLP_data_raw.RDS")
NLP_data_raw=readRDS("../data/NLP_data_raw.RDS")
#View(NLP_data_raw)
#### 5.2 NLP DATA AUGMENTATION ####
#EXAMPLES:
# merging NLP data with AFINN sentiment lexicon
dict_afinn = corpus::sentiment_afinn %>%
as_tibble() %>%
rename(word = 1,
sentiment = 2)
NLP_example$spacy_raw=NLP_example$spacy_raw %>%
left_join(dict_afinn, by = c("lemma"= "word"))# use lemma or norm_ depending on dict type
# merging NLP data with Affect Wordnet
dict_wordnet =corpus::affect_wordnet %>%
as_tibble() %>%
select(term, category, emotion) %>%
rename(word = 1,
wn_category = 2,
wn_emotion = 3)
NLP_example$spacy_raw=NLP_example$spacy_raw %>%
left_join(dict_wordnet, by = c("lemma"= "word"))# use lemma or norm_ depending on dict type
#EXAMPLES:
#NLP_data_raw=readRDS("../data/NLP_data_raw.RDS")
NLP_data_augmented=NLP_data_raw
for(i in c("tweets", "users"))
{
NLP_data_augmented[[i]]=NLP_data_raw[[i]] %>%
left_join(dict_wordnet, by = c("lemma"= "word"))
NLP_data_augmented[[i]]=NLP_data_augmented[[i]] %>%
left_join(dict_afinn, by = c("lemma"= "word"))# use lemma or norm_ depending on dict type
}
View(NLP_data_augmented$tweets)
View(NLP_data_augmented$users)
#### 5.3 NLP DATA TRANSFORMATION INTO TWO DFMs: ONE FOR UNIGRAMS, THE OTHER FOR DEP.REL (ASSOCIATIONS) ####
#NLP_data_raw=readRDS("../data/NLP_data_raw.RDS")
NLP_data=list(tweets=NULL,users=NULL)
source_nodes=list(tweets=NULL,users=NULL)
target_nodes=list(tweets=NULL,users=NULL)
net_data=list(tweets=NULL,users=NULL)
quanteda_unigrams=list(tweets=NULL,users=NULL)
quanteda_associations=list(tweets=NULL,users=NULL)
associations_dfm=list(tweets=NULL,users=NULL)
unigrams_dfm=list(tweets=NULL,users=NULL)
for(i in c("tweets", "users"))
{
#if consolidating noun phrases remember to redirect head token IDs of removed objects to existing ones.
NLP_data[[i]]=NLP_data_raw[[i]][!((NLP_data_raw[[i]]$pos %in% c("PUNCT","SPACE","CCONJ","NUM","PRON","ADP","SYM","PRON","DET","PART")) |(NLP_data_raw[[i]]$dep_rel %in% c("cc"))),]# POS filter
NLP_data[[i]]=NLP_data[[i]][!nchar(NLP_data[[i]]$lemma)<=1,]#length filter (not needed if already removed)
source_nodes[[i]]=NLP_data[[i]][NLP_data[[i]]$dep_rel %in% c("aux","auxpass","cop","arg","comp","acomp","ccomp","xcomp","obj","dobj","iobj","pobj","subj","nsubj","csubj","conj","amod","appos","advcl","preconj","vmod","mwe","mark","advmod","neg","rcmod","quantmod","nn","npadvmod","tmod","num","number","prep","poss","parataxis","goeswith","ref","sdep","xsubj","compound"),] %>% # DEP-REL filter
dplyr::select(doc_id, sentence_id, token_id=head_token_id,pos_source=pos,entity_source=entity,
word_source = token, lemma_source = lemma, dep_rel)
target_nodes[[i]]= NLP_data[[i]] %>% dplyr::select(doc_id, sentence_id, token_id,pos_target=pos,entity_target=entity,
word_target = token, lemma_target = lemma)
#saveRDS(NLP_data,paste0(folder,"NLP_data.RDS"))
net_data[[i]]= dplyr::left_join(source_nodes[[i]] , target_nodes[[i]],
by = c("doc_id", "sentence_id","token_id"))
net_data[[i]]=net_data[[i]][net_data[[i]]$dep_rel!="ROOT",]
net_data[[i]]=net_data[[i]][!is.na(net_data[[i]]$lemma_target),]
net_data[[i]][,c("lemma_source","lemma_target")]
net_data[[i]]=net_data[[i]][!(net_data[[i]]$lemma_source == net_data[[i]]$lemma_target),]#remove loops
net_data[[i]]$edge=paste(net_data[[i]]$lemma_source,net_data[[i]]$lemma_target,sep = "->")
#saveRDS(net_data[[i]],paste0(folder,"net_data.RDS"))
quanteda_unigrams[[i]]=quanteda::as.tokens(NLP_data[[i]] %>% dplyr::select(doc_id, sentence_id, token_id, token = lemma)) %>% quanteda::tokens_tolower()
quanteda_associations[[i]]=quanteda::as.tokens(net_data[[i]] %>% dplyr::select(doc_id, sentence_id, token_id, token = edge)) %>% quanteda::tokens_tolower()
#keep only documents that have at least one dependency-relation edge
unigrams_dfm[[i]] =quanteda_unigrams[[i]][names(quanteda_associations[[i]])]%>% quanteda::dfm()
associations_dfm[[i]]= quanteda_associations[[i]] %>% quanteda::dfm()
}
See the LDA paper: https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf (Blei, Ng & Jordan, 2003)
See the LDA2Net paper: https://www.researchgate.net/publication/356717976_LDA2Net_Digging_under_the_surface_of_COVID-19_topics_in_scientific_literature (Minello, Santagiustina & Warglien, 2021)
#### 6 TOPIC MODELS ####
#### 6.1 LDA TOPIC MODEL (FOR TWEETS DFM FROM SECTION 2.4) ####
Ntopics=5
tweetsLDA=topicmodels::LDA(dfm$tweets,k=Ntopics,method = "VEM",control=list(seed=1,estimate.beta=TRUE))#a typical value of alpha is 50/T, where T is the number of topics; a typical value of beta is 0.1 or 200/W
#A higher alpha gives a denser document-topic distribution, whereas a lower alpha gives a sparser one.
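# Hedged sketch (not run): to fix alpha at the rule-of-thumb value 50/T instead of letting VEM estimate it,
# the control list of the topicmodels package accepts the slots 'alpha' and 'estimate.alpha':
# tweetsLDA_fixed_alpha = topicmodels::LDA(dfm$tweets, k = Ntopics, method = "VEM",
#                                          control = list(seed = 1, alpha = 50/Ntopics,
#                                                         estimate.alpha = FALSE, estimate.beta = TRUE))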
tweetsLDA_top_terms = tidytext::tidy(tweetsLDA, matrix = "beta") %>%#beta contains the word X topic distributions
group_by(topic) %>%
slice_max(beta, n = 20) %>%
ungroup() %>%
arrange(topic, -beta)
tweetsLDA_top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
#Show examples for each topic
selectedtopic=4
data$tweets[docnames(dfm$tweets[which(tweetsLDA@gamma[,selectedtopic]==max(tweetsLDA@gamma[,selectedtopic])),]),] %>% View() #gamma contains the topic X document distributions
#### 6.2 SEEDED LDA TOPIC MODEL (FOR USERS DFM FROM SECTION 2.3) ####
#define dictionary with seedwords
seed_dict=list(WFA_worker = c("nomad","worker","employee","traveller","globetrotter","backpacker"),#digital nomads et al.
WFA_place_community= c("rent","office","meeting","room","appartment","community","join"),
WFA_firm = c("employer","firm","company","corporate","business","enterprise","we","our","group","llc"),# WFA firms/employers
WFA_manager =c("ceo","executive","director","manager","management","board"),
WFA_coach =c("coach","coaching","support","supporting","consultant","consulting"),
WFA_investor=c("investor", "investing","entrepreneur","founder","funding"),
WFA_media = c("news","reporter", "journalist","writer", "magazine","journal","articles","updates","correspondent")
)
keyword_counts=list()
for(i in names(seed_dict)){
keyword_counts[[i]]=sort(colSums(dfm$users[, quanteda::featnames(dfm$users) %in% seed_dict[[i]]]),decreasing = T)
if(sum(keyword_counts[[i]])==0 | length(keyword_counts[[i]])==0){
seed_dict[[i]]=NULL
}else{
seed_dict[[i]]=names(sort(keyword_counts[[i]][keyword_counts[[i]]!=0],decreasing = T))# keep only seed words that actually occur in the dfm, ordered by frequency
}
}
lapply(keyword_counts, function(x) paste(names(x) ," (",x,")", sep="",collapse = "; "))
## $WFA_worker
## [1] "employee (138); traveller (87); nomad (76); worker (73)"
##
## $WFA_place_community
## [1] "community (649); office (324); join (285); meeting (72)"
##
## $WFA_firm
## [1] "business (3232); company (688); enterprise (377); group (319); firm (307); corporate (217); employer (109)"
##
## $WFA_manager
## [1] "management (853); director (782); manager (766); executive (385); board (273)"
##
## $WFA_coach
## [1] "consultant (772); support (662); coach (648); consulting (410); coaching (137); supporting (114)"
##
## $WFA_investor
## [1] "founder (1069); entrepreneur (807); investor (453); investing (174)"
##
## $WFA_media
## [1] "news (1489); writer (809); journalist (182); updates (179); magazine (110); reporter (94); articles (91)"
usersSLDA=seededlda::textmodel_seededlda(x = dfm$users,seed_dict %>% quanteda::dictionary(.),residual = T,
valuetype = c("fixed"),max_iter=2000, weight=0.0025,case_insensitive=T)
usersSLDA_top_terms = setNames(reshape2::melt(usersSLDA$phi),c("topic","term","beta")) %>%# phi contains the per-topic word (term) probabilities
group_by(topic) %>%
slice_max(beta, n = 25) %>%
ungroup() %>%
arrange(topic, -beta)
usersSLDA_top_terms[nchar(as.character(usersSLDA_top_terms$term))>=3 & usersSLDA_top_terms$topic!="other",] %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
table(seededlda::topics(usersSLDA))
##
## WFA_worker WFA_place_community WFA_firm WFA_manager
## 4489 3667 4159 3835
## WFA_coach WFA_investor WFA_media other
## 3266 3659 3189 3740
#Examples:
#Cisco Collaboration (userID: 15866857) https://twitter.com/ciscocollab
usersSLDA$theta["15866857",]
## WFA_worker WFA_place_community WFA_firm WFA_manager
## 0.02941176 0.08823529 0.08823529 0.08823529
## WFA_coach WFA_investor WFA_media other
## 0.08823529 0.08823529 0.44117647 0.08823529
#Financial Times (userID: 4898091) https://twitter.com/FinancialTimes
usersSLDA$theta["4898091",]
## WFA_worker WFA_place_community WFA_firm WFA_manager
## 0.05 0.05 0.05 0.05
## WFA_coach WFA_investor WFA_media other
## 0.05 0.05 0.65 0.05
#Users with higher probs by group
type="WFA_media"
data$users[names(which(usersSLDA$theta[,type]==max(usersSLDA$theta[,type]))),] %>% View()
type="WFA_worker"
data$users[names(which(usersSLDA$theta[,type]==max(usersSLDA$theta[,type]))),] %>% View()
type="WFA_firm"
data$users[names(which(usersSLDA$theta[,type]==max(usersSLDA$theta[,type]))),] %>% View()
#### 6.3 LDA2NET: (TWEETS) TOPIC MODEL ENRICHMENT WITH WEIGHTED DEPENDENCY RELATIONS ####
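The code below enriches the tweet LDA model (Section 6.1) with the dependency edges extracted above. In compact form (a sketch of what the loop computes, using the column names from the code): for each topic $j$ and each edge $u \rightarrow v$, `counts` is $c_{j,uv} \propto \sum_d \gamma_{dj}\, n_{d,uv}$ (the edge counts per document weighted by the document-topic probabilities, normalized over edges within each topic), `e_prob` is $\beta_{ju}\,\beta_{jv}$ (the product of the two word probabilities in topic $j$), and the final edge weight `eprobXcounts` is proportional to $\beta_{ju}\,\beta_{jv}\,c_{j,uv}$, rescaled to sum to one within each topic. See the LDA2Net paper for the full formulation.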
i="tweets"
#associations_dfm=readRDS("associations_dfm.RDS")
associations_vocab = data.frame(
from = gsub("->.*$", "", x = colnames(associations_dfm[[i]]), perl = T),
to = gsub("^.*->", "", x = colnames(associations_dfm[[i]]), perl = T),
stringsAsFactors = F
)
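# For example, the feature name "home->office" is split into from = "home" and to = "office".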
rownames(associations_vocab)=colnames(associations_dfm[[i]])
associations_vocab_clean = associations_vocab[(associations_vocab$from %in% colnames(dfm[[i]])) &
(associations_vocab$to %in% colnames(dfm[[i]])), ]
associations_vocab_clean$n=colSums(associations_dfm[[i]] %>% dfm_select(., pattern = rownames(associations_vocab_clean), selection = "keep",case_insensitive = F,valuetype = "fixed"))
#save cleaned associations vocabulary
saveRDS(associations_vocab_clean, file = "associations_vocab_clean.rds")
#prune associations_dfm matrix
associations_dfm[[i]] = dfm_select(associations_dfm[[i]],
pattern = rownames(associations_vocab_clean),
valuetype = "fixed")
associations_dfm[[i]] = dfm_subset(associations_dfm[[i]], docnames(associations_dfm[[i]]) %in% docnames(dfm[[i]]))
nrow(associations_dfm[[i]])
## [1] 60876
# 60,876 documents remain after pruning
#saveRDS(associations_dfm, file ="associations_mat_final.rds")
topics_docs=tweetsLDA@gamma
rownames(topics_docs)=docnames(dfm[[i]])
unigrams_vocab=featnames(dfm[[i]])
theta_match= topics_docs[rownames(topics_docs) %in% docnames(associations_dfm[[i]]),]
nrow(theta_match)#docs
## [1] 60876
ncol(theta_match)#topics
## [1] 5
nrow(associations_dfm[[i]])
## [1] 60876
ncol(associations_dfm[[i]])
## [1] 50028
AXT_counts = t(theta_match) %*% associations_dfm[[i]] # topic x association expected counts (topics on rows, edges on columns)
print(rowSums(AXT_counts))#don't sum to 1 yet...
## [1] 44762.38 44775.04 44772.65 44791.85 44782.08
AXT_counts = AXT_counts / rowSums(AXT_counts)# normalize so that, for each topic, the edge relative frequencies sum to 1
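# After this normalization each row of AXT_counts should sum to (approximately) 1; quick check, not run here:
# rowSums(AXT_counts)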
topics_unigrams= exp(tweetsLDA@beta)# tweetsLDA@beta stores log word probabilities, so exponentiate to obtain per-topic word probabilities
colnames(topics_unigrams)=tweetsLDA@terms
rowSums(topics_unigrams)# each row (topic) should sum to approximately 1
## [1] 1 1 1 1 1
p_associations = list()
#for topics:
#selected topics
topics_selection=c(1:nrow(topics_unigrams))
for (j in topics_selection) {
p_associations[[j]] = associations_vocab_clean
p_associations[[j]]$counts = AXT_counts[j, rownames(p_associations[[j]])]
print("stage 1: eprob")
p_associations[[j]]$e_prob = topics_unigrams[j, p_associations[[j]]$from]* topics_unigrams[j, p_associations[[j]]$to]
#for all bigrams observed in the corpus
print("stage 2: eprobXcounts")
p_associations[[j]]$eprobXcounts = p_associations[[j]]$e_prob * p_associations[[j]]$counts
p_associations[[j]]$eprobXcounts = p_associations[[j]]$eprobXcounts/sum(p_associations[[j]]$eprobXcounts)
print("stage 4:saving")
#saveRDS(p_associations[[j]],paste0("raw_association_topic_data_rds/topic", j, "_raw_associations.RDS"))
print(p_associations[[j]][order(p_associations[[j]]$eprobXcounts,decreasing = T),][1:10,])
}
## [1] "stage 1: eprob"
## [1] "stage 2: eprobXcounts"
## [1] "stage 4:saving"
## from to n counts e_prob
## full->time full time 699 0.0031063558 2.641982e-05
## business->help business help 114 0.0005110207 1.588158e-04
## business->need business need 144 0.0006437535 1.215408e-04
## home->office home office 255 0.0009942964 6.932092e-05
## online->business online business 217 0.0009687601 6.233408e-05
## lifestyle->business lifestyle business 213 0.0009500929 5.733182e-05
## internet->connection internet connection 1345 0.0043344491 1.243652e-05
## help->need help need 90 0.0004027860 1.215322e-04
## free->video free video 215 0.0009632411 4.633633e-05
## today->world today world 158 0.0007064408 6.281999e-05
## eprobXcounts
## full->time 0.02105370
## business->help 0.02081995
## business->need 0.02007192
## home->office 0.01768184
## online->business 0.01549135
## lifestyle->business 0.01397363
## internet->connection 0.01382866
## help->need 0.01255778
## free->video 0.01144996
## today->world 0.01138468
## [1] "stage 1: eprob"
## [1] "stage 2: eprobXcounts"
## [1] "stage 4:saving"
## from to n counts e_prob eprobXcounts
## remote->developer remote developer 746 0.0033480212 1.036141e-04 0.078673999
## remote->working remote working 356 0.0015908120 1.794641e-04 0.064747111
## full->time full time 699 0.0031263975 3.897142e-05 0.027632139
## business->start business start 784 0.0035021549 3.261780e-05 0.025906821
## learn->join learn join 247 0.0011022420 5.430114e-05 0.013574066
## remote->workforce remote workforce 181 0.0008088408 6.182902e-05 0.011341734
## long->live long live 616 0.0027368658 1.684818e-05 0.010457567
## part->time part time 327 0.0014617909 2.779884e-05 0.009215863
## digital->world digital world 409 0.0018241296 2.173901e-05 0.008993310
## remote->office remote office 30 0.0001343109 2.897606e-04 0.008826221
## [1] "stage 1: eprob"
## [1] "stage 2: eprobXcounts"
## [1] "stage 4:saving"
## from to n counts e_prob
## management->business management business 498 0.0022263938 1.291854e-04
## remote->working remote working 356 0.0015917306 1.699096e-04
## business->start business start 784 0.0035048425 5.838376e-05
## remote->team remote team 179 0.0008001553 2.025557e-04
## business->need business need 144 0.0006435043 1.960735e-04
## small->business small business 430 0.0019263390 4.666811e-05
## online->business online business 217 0.0009699626 6.938914e-05
## just->need just need 148 0.0006571279 9.143940e-05
## business->build business build 361 0.0016028522 3.487186e-05
## freedom->give freedom give 331 0.0014689882 3.734535e-05
## eprobXcounts
## management->business 0.06355299
## remote->working 0.05975964
## business->start 0.04521484
## remote->team 0.03581287
## business->need 0.02787988
## small->business 0.01986431
## online->business 0.01487192
## just->need 0.01327712
## business->build 0.01235063
## freedom->give 0.01212203
## [1] "stage 1: eprob"
## [1] "stage 2: eprobXcounts"
## [1] "stage 4:saving"
## from to n counts e_prob
## digital->world digital world 409 0.0018287919 1.308817e-04
## world->travel world travel 232 0.0010175060 1.374097e-04
## home->office home office 255 0.0009912137 1.180991e-04
## world->company world company 96 0.0004285642 2.502649e-04
## internet->connection internet connection 1345 0.0043945268 2.119476e-05
## many->people many people 182 0.0008121853 8.113873e-05
## people->want people want 127 0.0005673483 9.416670e-05
## learn->join learn join 247 0.0011026048 4.470186e-05
## company->join company join 147 0.0006512780 5.473726e-05
## full->time full time 699 0.0031490202 1.100553e-05
## eprobXcounts
## digital->world 0.056414528
## world->travel 0.032953546
## home->office 0.027590631
## world->company 0.025279219
## internet->connection 0.021952727
## many->people 0.015532136
## people->want 0.012592007
## learn->join 0.011616981
## company->join 0.008402282
## full->time 0.008168347
## [1] "stage 1: eprob"
## [1] "stage 2: eprobXcounts"
## [1] "stage 4:saving"
## from to n counts e_prob
## business->start business start 784 0.0035106167 7.162173e-05
## remote->working remote working 356 0.0015872120 1.333587e-04
## home->office home office 255 0.0009913034 1.293575e-04
## today->world today world 158 0.0007051043 1.302334e-04
## remote->team remote team 179 0.0007991811 9.285952e-05
## part->time part time 327 0.0014628260 4.390414e-05
## team->hire team hire 1114 0.0050463074 1.175657e-05
## full->time full time 699 0.0031188024 1.830996e-05
## business->build business build 361 0.0016038144 3.422970e-05
## remote->developer remote developer 746 0.0033760541 1.445099e-05
## eprobXcounts
## business->start 0.05536354
## remote->working 0.04660707
## home->office 0.02823539
## today->world 0.02021954
## remote->team 0.01634057
## part->time 0.01414144
## team->hire 0.01306321
## full->time 0.01257392
## business->build 0.01208795
## remote->developer 0.01074243
#### 6.4 PLOT (PERCENTILE FILTERED) TOPIC NETWORK ####
#color map
library(tidyverse)
library(visNetwork)
map2color <- function(x, pal, limits = NULL) {
if (is.null(limits))
limits = range(x)
pal[findInterval(x,
seq(limits[1], limits[2], length.out = length(pal) + 1),
all.inside = TRUE)]
}
mypal = colorRampPalette(c("gray", "darkred"))(100)
allcolors = gplots::col2hex(grDevices::colors()[grep('gr(a|e)y', grDevices::colors(), invert = T)])
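# Illustrative usage (not run): map2color(c(0.1, 0.5, 0.9), mypal) maps the three values to
# increasingly dark shades of the gray-to-darkred ramp defined above.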
filter_and_plot_topic = function(temp_=temp[[j]], show_nav=TRUE, show_id=TRUE){
  # Build an interactive visNetwork plot for one topic:
  # node size ~ (topic-specific word probability)^(1/4), edge width ~ (eprobXcounts)^(1/4),
  # edge color ~ expected edge counts (gray = low, dark red = high).
  temp_$vertices[, "color"] = "white"
  topicnet = visNetwork(
    main = paste0("Topic ", j),
    height = "900px",
    width = "100%",
    nodes = data.frame(
      id = temp_$vertices$name,
      label = temp_$vertices$name,
      size = (temp_$vertices$prob)^(1/4) * 20 / max((temp_$vertices$prob)^(1/4)), #log(1.00015+temp_$vertices$prob)*10000
      color.background = temp_$vertices$color,
      color.border = "black",
      title = paste("<b>", temp_$vertices$name, "</b><br> Prob:", temp_$vertices$prob, sep = "")
    ),
    edges = data.frame(
      from = temp_$edges$from,
      to = temp_$edges$to,
      width = (temp_$edges$eprobXcounts)^(1/4) * 10 / max((temp_$edges$eprobXcounts)^(1/4)),
      smooth.enabled = TRUE,
      smooth.roundness = 0.70,
      color = map2color(temp_$edges$counts, mypal),
      title = paste("<b>", temp_$edges$from, "-", temp_$edges$to, "</b><br> Prob:", temp_$edges$eprobXcounts, sep = "")
      #,group = temp_$vertices$community_walktrap
    )
  ) %>%
    visPhysics(solver = "forceAtlas2Based", forceAtlas2Based = list(gravitationalConstant = -20)) %>%
    visEdges(arrows = "to") %>%
    visInteraction(
      keyboard = FALSE,
      navigationButtons = show_nav, # previously hard-coded; now controlled by the function arguments
      dragNodes = TRUE,
      dragView = TRUE,
      zoomView = TRUE
    ) %>%
    visOptions(
      highlightNearest = list(enabled = TRUE, hover = TRUE),
      nodesIdSelection = show_id
    )
  return(topicnet)
}
net=list()
temp=list()
input=list()
input$n_percentile=0.95
input$e_percentile=0.95
for(j in c(1:nrow(topics_unigrams)))
{
temp[[j]]=list()
temp[[j]]$edges=p_associations[[j]]
temp[[j]]$vertices=data.frame(name=colnames(topics_unigrams),prob=topics_unigrams[j,],stringsAsFactors = F)
#filtering: keep only nodes above the node-probability percentile, then edges between kept (distinct) nodes above the edge-weight percentile, then drop isolated nodes
temp[[j]]$vertices = temp[[j]]$vertices[temp[[j]]$vertices$prob >= quantile(temp[[j]]$vertices$prob, probs = input$n_percentile), ]
temp[[j]]$edges = temp[[j]]$edges[(temp[[j]]$edges$to %in% temp[[j]]$vertices$name &
                                   temp[[j]]$edges$from %in% temp[[j]]$vertices$name) &
                                   temp[[j]]$edges$to != temp[[j]]$edges$from, ]
temp[[j]]$edges = temp[[j]]$edges[temp[[j]]$edges$eprobXcounts >= quantile(temp[[j]]$edges$eprobXcounts, probs = input$e_percentile), ]
temp[[j]]$vertices = temp[[j]]$vertices[temp[[j]]$vertices$name %in% c(temp[[j]]$edges$to, temp[[j]]$edges$from), ]
net[[j]]=filter_and_plot_topic(temp[[j]])
}
net[[1]]
net[[2]]
net[[3]]
net[[4]]
net[[5]]