An introduction to text mining, NLP and text classification for management studies







A tutorial by Carlo R. M. A. Santagiustina

carlo.santagiustina@unive.it

 

What is R Markdown

This is an R Markdown document. R Markdown is a simple formatting syntax for authoring interactive HTML documents that combine R code, HTML and JavaScript. When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks within this tutorial. For more details on R Markdown see http://rmarkdown.rstudio.com.

Schedule for today

Text mining, NLP and topic modeling, in brief

Text mining: the automatic or semi-automatic extraction of information from large collections of unstructured texts written in a natural language (for example, using RegEx);

In general, text mining workflows have three steps:
1. Data collection and pre-processing: Gather input texts that are potentially relevant for the given task and pre-process them.
2. Natural language processing: Analyze the text, infer relevant syntactic, semantic and morphological information about words and their dependency relations;
3. Text categorization and modeling: Discover and aggregate recurrent patterns, infer latent semantic representations and distinctions (topics, arguments, narratives);

Why should management scholars use textual data in their research?



Because, in our daily life, we use natural language (and not numbers and math formulas) to:

Why should management scholars use topic models?

Why use R for text mining and topic modeling?


1. R is open-source: free, transparent and deployable across a wide range of operating systems;
2. R has a huge community: R and its text mining (TM) and NLP libraries are well documented and continuously improved, and there are plenty of online tutorials and examples;
3. R can be used to integrate multiple languages and related libraries (like spaCy) in a pipeline: you can embed and execute scripts, commands and functions written in other programming languages, such as Python, Java, C++, SQL, Julia, JavaScript and Fortran;
4. R is focused on (but not limited to) statistics, and many authors of topic models write their code in R (or wrap their C++ libraries in an R package): it is employed by researchers in almost all fields of the quantitative social sciences;

Topic modeling workflows and pipelines



R data structures and relations



Workflows and pipelines in R

require(magrittr)

Piping with %>% was introduced in R by the magrittr package, which is now part of the tidyverse suite. Many other packages, like purrr, have extended R's functional-programming toolset.

This section of the tutorial is a summarized version of https://www.nateday.me/Cats_with_Pipes.html

R piping rules

  • x %>% f gives the same output as f(x), where f is a function that accepts the data object as input;
  • x %>% f(y) gives the same output as f(x, y);
  • x %>% f %>% g %>% h gives the same output as h(g(f(x)));
  • x %>% f %>% g %>% h(parA=y) %>% i %>% j(parB=z) %>% k gives the same output as k(j(i(h(g(f(x)), parA=y)), parB=z));

The left-hand side is evaluated before being passed on to the right-hand side expression.
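For instance, a minimal sketch of these rules using the built-in mtcars dataset (not part of today's data):

require(magrittr)
# x %>% f is the same as f(x)
identical(mtcars %>% nrow(), nrow(mtcars))# TRUE
# x %>% f(y) is the same as f(x, y)
identical(mtcars %>% head(3), head(mtcars, 3))# TRUE
# x %>% f %>% g %>% h is the same as h(g(f(x)))
identical(mtcars %>% subset(cyl == 4) %>% nrow() %>% sqrt(),
          sqrt(nrow(subset(mtcars, cyl == 4))))# TRUE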

The argument placeholder

If you don't want to use an object x as the first input/parameter of a function when piping, you can choose which input/parameter x is sent to by using the . (dot) placeholder, as follows:

  • x %>% f(y, .) is equivalent to f(y, x)
  • x %>% f(y, z = .) is equivalent to f(y, z = x)

Re-using the placeholder for attributes

It is straightforward to use the placeholder several times in a right-hand side expression. However, when the placeholder only appears in nested expressions, magrittr will still apply the first-argument rule, because in most cases this results in cleaner code.

x %>% f(y = nrow(.), z = ncol(.)) is equivalent to f(x, y = nrow(x), z = ncol(x))

The behavior can be overruled by enclosing the right-hand side in braces:

x %>% {f(y = nrow(.), z = ncol(.))} is equivalent to f(y = nrow(x), z = ncol(x))
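A minimal sketch of the placeholder rules, again using built-in data purely for illustration:

require(magrittr)
# x %>% f(y, .) is equivalent to f(y, x)
identical(4 %>% seq(1, .), seq(1, 4))# TRUE
# placeholder only inside a nested expression: the first-argument rule still applies
identical(mtcars %>% head(nrow(.) / 8), head(mtcars, nrow(mtcars) / 8))# TRUE
# braces overrule the first-argument rule
mtcars %>% {paste(nrow(.), "rows and", ncol(.), "columns")}# "32 rows and 11 columns"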

Which R packages will we use today?

# creator        Carlo R. M. A. Santagiustina  
# creation date  29-01-2022
# language       R   
# R version      4.0.3 (2020-10-10)

####  0 KNITR OPTIONS ####
knitr::opts_chunk$set(warning = FALSE, message = FALSE, echo = FALSE,fig.width=10, fig.height=10) 

##### 1.1 #INSTALL AND LOAD LIBRARIES ####
pkgTest <- function(x) {
  if (!require(x, character.only = TRUE)) {
    install.packages(x, dep = TRUE, repos = "https://cloud.r-project.org/")
    if (!require(x, character.only = TRUE)) stop("Package not found")
  }
}
pkgTest("remotes")# for installing libraries from github
pkgTest("knitr")# for printing HTM tables
pkgTest("tidyverse")# for data wrangling
pkgTest("quanteda")# for text mining 
pkgTest("spacyr")# for NLP
pkgTest("topicmodels")# for topic model estimation
pkgTest("seededlda")# for seededLDA estimation
pkgTest("igraph")# for constructing and analysing graphs
pkgTest("visNetwork")# for plotting interactive graphs
pkgTest("corpus")#for affect WordNet-Affect lexicon  (extension of WordNet, see )
pkgTest("tidytext")#for formatting topicmodels outputs
#pkgTest(udpipe)# other library for NLP
remotes::install_github("richfitz/remoji")
pkgTest("remoji")# for emoji extraction
pkgTest("reshape2")# for reshaping sededLDA outputs

The dataset: work from anywhere discussions on Twitter

The dataset that will be used in today's applications consists of 67 792 tweets (published by 40 608 users) containing the expression "work[ -]?from[ -]?anywhere", together with the related metadata and user profile information.

These tweets were published between January 2019 and July 2021.
They were downloaded from the Twitter Academic API v2 using R. For more details on this API see: https://developer.twitter.com/en/products/twitter-api/academic-research. The dataset was preprocessed to remove duplicates and tweets unrelated to the "work from anywhere" issue. Dataset import in R:
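The chunk that imports the data is not shown in this document; a minimal sketch of what it might look like is given below (the file names are hypothetical, the object structure is the one used in the rest of the tutorial).

data = list(tweets = NULL, users = NULL)
data$tweets = readRDS("../data/wfa_tweets.RDS")# hypothetical file name: tweets and their metadata
data$users = readRDS("../data/wfa_users.RDS")# hypothetical file name: user profile info
summary(data$tweets)
head(data$tweets) %>% kable()
summary(data$users)
head(data$users) %>% kable()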

##      text            author_id         conversation_id      created_at        
##  Length:67792       Length:67792       Length:67792       Min.   :2019-01-01  
##  Class :character   Class :character   Class :character   1st Qu.:2020-03-18  
##  Mode  :character   Mode  :character   Mode  :character   Median :2020-09-02  
##                                                           Mean   :2020-07-23  
##                                                           3rd Qu.:2021-02-10  
##                                                           Max.   :2021-06-30  
##  in_reply_to_user_id possibly_sensitive      id               source         
##  Length:67792        Mode :logical      Length:67792       Length:67792      
##  Class :character    FALSE:67396        Class :character   Class :character  
##  Mode  :character    TRUE :396          Mode  :character   Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##  retweet_count       reply_count         like_count         quote_count      
##  Min.   :   0.000   Min.   :  0.0000   Min.   :    0.000   Min.   :  0.0000  
##  1st Qu.:   0.000   1st Qu.:  0.0000   1st Qu.:    0.000   1st Qu.:  0.0000  
##  Median :   0.000   Median :  0.0000   Median :    0.000   Median :  0.0000  
##  Mean   :   1.121   Mean   :  0.3821   Mean   :    4.882   Mean   :  0.1406  
##  3rd Qu.:   0.000   3rd Qu.:  0.0000   3rd Qu.:    1.000   3rd Qu.:  0.0000  
##  Max.   :5119.000   Max.   :873.0000   Max.   :11684.000   Max.   :879.0000
|text |author_id |conversation_id |created_at |in_reply_to_user_id |possibly_sensitive |id |source |retweet_count |reply_count |like_count |quote_count |
|:---|:---|:---|:---|:---|:---|:---|:---|---:|---:|---:|---:|
|#digitaltransformation Top 10 companies that offer work-from-anywhere jobs - WKMG News 6 & ClickOrlando https://t.co/Mg0XtnflUB https://t.co/3fLz3jnKq9 |1049585267772968960 |1080012837211291648 |2019-01-01 |NA |FALSE |1080012837211291648 |IFTTT |0 |0 |0 |0 |
|40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry #International https://t.co/EMFWTHzZEj |107631340 |1080010814562672640 |2019-01-01 |NA |FALSE |1080010814562672640 |Linkgage |0 |0 |0 |0 |
|40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry #International #link2 https://t.co/EMFWTHzZEj |107631340 |1079979313833934855 |2019-01-01 |NA |FALSE |1079979313833934855 |Linkgage |0 |0 |0 |0 |
|Top 10 companies that offer work-from-anywhere jobs https://t.co/uHzor6qahf |2214623612 |1079922194644385793 |2019-01-01 |NA |FALSE |1079922194644385793 |Facebook |0 |0 |0 |0 |
|40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry #International #link1 https://t.co/EMFWTHzZEj |107631340 |1079918941319901184 |2019-01-01 |NA |FALSE |1079918941319901184 |Linkgage |0 |0 |1 |0 |
|Top 10 companies that offer work-from-anywhere jobs https://t.co/LOwL9En8VH |80203918 |1079914035754291200 |2019-01-01 |NA |FALSE |1079914035754291200 |Facebook |0 |0 |0 |0 |
##      url                 id                name            verified      
##  Length:40608       Length:40608       Length:40608       Mode :logical  
##  Class :character   Class :character   Class :character   FALSE:38728    
##  Mode  :character   Mode  :character   Mode  :character   TRUE :1880     
##                                                                          
##                                                                          
##                                                                          
##  protected         username         description        followers_count   
##  Mode :logical   Length:40608       Length:40608       Min.   :       0  
##  FALSE:40608     Class :character   Class :character   1st Qu.:     107  
##                  Mode  :character   Mode  :character   Median :     514  
##                                                        Mean   :   17214  
##                                                        3rd Qu.:    2106  
##                                                        Max.   :26099432  
##  following_count   tweet_count       listed_count        created_at        
##  Min.   :     0   Min.   :      1   Min.   :     0.0   Min.   :2006-03-21  
##  1st Qu.:   155   1st Qu.:   1007   1st Qu.:     0.0   1st Qu.:2009-11-11  
##  Median :   512   Median :   4570   Median :     8.0   Median :2012-07-21  
##  Mean   :  1658   Mean   :  23484   Mean   :   160.2   Mean   :2013-07-16  
##  3rd Qu.:  1387   3rd Qu.:  17754   3rd Qu.:    57.0   3rd Qu.:2017-01-27  
##  Max.   :755319   Max.   :3976439   Max.   :130184.0   Max.   :2021-06-28  
##    location        
##  Length:40608      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
|url |id |name |verified |protected |username |description |followers_count |following_count |tweet_count |listed_count |created_at |location |
|:---|:---|:---|:---|:---|:---|:---|---:|---:|---:|---:|:---|:---|
|https://t.co/L0FKd02CBX |60433092 |Morissa (Dr. Rissy) Schwartz |TRUE |FALSE |MorissaSchwartz |All my puns are intended! 😆 Owner of top-rated marketing company https://t.co/3xpIccgnoi - @forbes 30 Under 30 🤩 - @entrepreneur Contributor -📚 https://t.co/8XIkYYnjSh CEO |21180 |8958 |74574 |1632 |2009-07-27 |New Jersey, USA |
|https://t.co/i48TLClhwK |813081982758518784 |A Better Lemonade Stand |FALSE |FALSE |ABLSecom |A Better Lemonade Stand is an online business incubator. We help entrepreneurs build, launch, and grow their online business. |2108 |673 |3888 |62 |2016-12-25 |NA |
|NA |862922852063862784 |Mr. J 🏴 |FALSE |FALSE |jasze36 |If you are into smartphones, tech, superheroes, video games, movies, general geekery, football, and anything Malaysian, my tweets are for you. |447 |346 |35486 |6 |2017-05-12 |Malesia |
|https://t.co/6qjZ8PCpvM |91478624 |Forbes |TRUE |FALSE |Forbes |Official account of Forbes, the world’s leading voice for entrepreneurial success and free enterprise, and home to breaking news, business, money and more. |17331756 |5149 |347669 |61023 |2009-11-21 |New York, NY |
|https://t.co/wk9XBMxbNX |35700404 |💉💉😷💪🦠Heather Smith |TRUE |FALSE |HeatherSmithAU |🦄Hype Girl for Accounting Apps 🦄 Subscribe to the #AccounTech newsletter 👉https://t.co/oQhoMuqNwk Lead with kindness 👸🌞🦋 💐💃 https://t.co/lS92N5HjxV |10606 |5177 |92368 |434 |2009-04-27 |Brisbane Australia |
|https://t.co/GMdWAQKfMC |16350207 |Christine Hueber |FALSE |FALSE |ChristineHueber |Do you have your copy of my #FREE Report yet? Get “Top 10 LinkedIn #B2B #Success Secrets” to #hack #LinkedIn for better results now @ https://t.co/GMdWAQKfMC |8078 |7987 |116846 |367 |2008-09-18 |San Francisco, CA |

|text |author_id |conversation_id |created_at |in_reply_to_user_id |possibly_sensitive |id |source |retweet_count |reply_count |like_count |quote_count |clean_text |text_urls |text_hashtags |text_fintags |text_emoticones |
|:---|:---|:---|:---|:---|:---|:---|:---|---:|---:|---:|---:|:---|:---|:---|:---|:---|
|#digitaltransformation Top 10 companies that offer work-from-anywhere jobs - WKMG News 6 & ClickOrlando https://t.co/Mg0XtnflUB https://t.co/3fLz3jnKq9 |1049585267772968960 |1080012837211291648 |2019-01-01 |NA |FALSE |1080012837211291648 |IFTTT |0 |0 |0 |0 |digitaltransformation Top 10 companies that offer work-from-anywhere jobs - WKMG News 6 and ClickOrlando |https://t.co/Mg0XtnflUB, https://t.co/3fLz3jnKq9 |#digitaltransformation | | |
|40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry #International https://t.co/EMFWTHzZEj |107631340 |1080010814562672640 |2019-01-01 |NA |FALSE |1080010814562672640 |Linkgage |0 |0 |0 |0 |40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry International |https://t.co/EMFWTHzZEj |#International | | |
|40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry #International #link2 https://t.co/EMFWTHzZEj |107631340 |1079979313833934855 |2019-01-01 |NA |FALSE |1079979313833934855 |Linkgage |0 |0 |0 |0 |40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry International link2 |https://t.co/EMFWTHzZEj |#International, #link2 | | |
|Top 10 companies that offer work-from-anywhere jobs https://t.co/uHzor6qahf |2214623612 |1079922194644385793 |2019-01-01 |NA |FALSE |1079922194644385793 |Facebook |0 |0 |0 |0 |Top 10 companies that offer work-from-anywhere jobs |https://t.co/uHzor6qahf | | | |
|40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry #International #link1 https://t.co/EMFWTHzZEj |107631340 |1079918941319901184 |2019-01-01 |NA |FALSE |1079918941319901184 |Linkgage |0 |0 |1 |0 |40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry International link1 |https://t.co/EMFWTHzZEj |#International, #link1 | | |
|Top 10 companies that offer work-from-anywhere jobs https://t.co/LOwL9En8VH |80203918 |1079914035754291200 |2019-01-01 |NA |FALSE |1079914035754291200 |Facebook |0 |0 |0 |0 |Top 10 companies that offer work-from-anywhere jobs |https://t.co/LOwL9En8VH | | | |
|url |id |name |verified |protected |username |description |followers_count |following_count |tweet_count |listed_count |created_at |location |clean_description |description_urls |description_hashtags |description_fintags |description_emoticones |
|:---|:---|:---|:---|:---|:---|:---|---:|---:|---:|---:|:---|:---|:---|:---|:---|:---|:---|
|https://t.co/L0FKd02CBX |60433092 |Morissa (Dr. Rissy) Schwartz |TRUE |FALSE |MorissaSchwartz |All my puns are intended! 😆 Owner of top-rated marketing company https://t.co/3xpIccgnoi - @forbes 30 Under 30 🤩 - @entrepreneur Contributor -📚 https://t.co/8XIkYYnjSh CEO |21180 |8958 |74574 |1632 |2009-07-27 |New Jersey, USA |All my puns are intended! Owner of top-rated marketing company - @forbes 30 Under 30 - @entrepreneur Contributor - CEO |https://t.co/3xpIccgnoi, https://t.co/8XIkYYnjSh | | | |
|https://t.co/i48TLClhwK |813081982758518784 |A Better Lemonade Stand |FALSE |FALSE |ABLSecom |A Better Lemonade Stand is an online business incubator. We help entrepreneurs build, launch, and grow their online business. |2108 |673 |3888 |62 |2016-12-25 |NA |A Better Lemonade Stand is an online business incubator. We help entrepreneurs build, launch, and grow their online business. | | | | |
|NA |862922852063862784 |Mr. J 🏴 |FALSE |FALSE |jasze36 |If you are into smartphones, tech, superheroes, video games, movies, general geekery, football, and anything Malaysian, my tweets are for you. |447 |346 |35486 |6 |2017-05-12 |Malesia |If you are into smartphones, tech, superheroes, video games, movies, general geekery, football, and anything Malaysian, my tweets are for you. | | | | |
|https://t.co/6qjZ8PCpvM |91478624 |Forbes |TRUE |FALSE |Forbes |Official account of Forbes, the world’s leading voice for entrepreneurial success and free enterprise, and home to breaking news, business, money and more. |17331756 |5149 |347669 |61023 |2009-11-21 |New York, NY |Official account of Forbes, the world s leading voice for entrepreneurial success and free enterprise, and home to breaking news, business, money and more. | | | | |
|https://t.co/wk9XBMxbNX |35700404 |💉💉😷💪🦠Heather Smith |TRUE |FALSE |HeatherSmithAU |🦄Hype Girl for Accounting Apps 🦄 Subscribe to the #AccounTech newsletter 👉https://t.co/oQhoMuqNwk Lead with kindness 👸🌞🦋 💐💃 https://t.co/lS92N5HjxV |10606 |5177 |92368 |434 |2009-04-27 |Brisbane Australia |Hype Girl for Accounting Apps Subscribe to the AccounTech newsletter Lead with kindness |https://t.co/oQhoMuqNwk, https://t.co/lS92N5HjxV |#AccounTech | | |
|https://t.co/GMdWAQKfMC |16350207 |Christine Hueber |FALSE |FALSE |ChristineHueber |Do you have your copy of my #FREE Report yet? Get “Top 10 LinkedIn #B2B #Success Secrets” to #hack #LinkedIn for better results now @ https://t.co/GMdWAQKfMC |8078 |7987 |116846 |367 |2008-09-18 |San Francisco, CA |Do you have your copy of my FREE Report yet? Get “Top 10 LinkedIn B2B Success Secrets” to hack LinkedIn for better results now @ |https://t.co/GMdWAQKfMC |#FREE , #B2B , #Success , #hack , #LinkedIn | | |

The document-feature matrix

A document-feature matrix (DFM) is a matrix that describes the frequency of features (e.g., terms) that occur in a collection of documents (here tweets).
In a document-feature matrix, rows correspond to documents in the collection and columns correspond to features.
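Before building the DFM of the Twitter data, here is a minimal toy sketch (two invented documents), just to show the structure of a DFM:

require(quanteda)
require(magrittr)
toy_texts = c(d1 = "work from anywhere is the future of work",
              d2 = "remote work lets employees work from anywhere")
toy_dfm = toy_texts %>%
  quanteda::tokens() %>%
  quanteda::dfm()
toy_dfm# rows = the two toy documents, columns = features, cell (i,j) = count of feature j in document i (e.g. "work" has count 2 in both)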

######################################################
#### 2 FROM STRINGS TO A DOCUMENT-FEATURE MATRIX  ####
######################################################
#[OPTION-1]
 #data$tweets$clean_text = gsub(pattern = "work[ -]?from[ -]?anywhere", data$tweets$clean_text, replacement = " ", ignore.case = T, perl = T)# run this line to remove the query terms directly from the strings

rownames(data$tweets)=data$tweets$id
rownames(data$users)=data$users$id
#### 2.1 BUILD THE CORPUS
  corpus=list(tweets=NULL, users=NULL)
  corpus$tweets = quanteda::corpus(data$tweets,text_field = "clean_text",docid_field ="id")
  corpus$users = quanteda::corpus(data$users,text_field = "clean_description",docid_field ="id")
#### 2.2 TOKENIZE DOCUMENTS IN CORPUS #####
  tokens=list(tweets=NULL, users=NULL)
tokens$tweets= corpus$tweets %>% 
  quanteda::tokens(.,
                   remove_punct = TRUE,
                   remove_symbols = TRUE,
                   remove_separators = TRUE,
                   remove_numbers = TRUE,
                   split_hyphens =FALSE,
                   verbose=TRUE
  ) 
tokens$users= corpus$users %>% 
  quanteda::tokens(.,
                   remove_punct = TRUE,
                   remove_symbols = TRUE,
                   remove_separators = TRUE,
                   remove_numbers = TRUE,
                   split_hyphens =FALSE,
                   verbose=TRUE
  ) 

#### 2.3 BUILD DOCUMENT FEATURE MATRIX #####
dfm=list(tweets=NULL, users=NULL)
dfm$tweets = tokens$tweets %>%
  quanteda::dfm(.,
                tolower = TRUE,
                stem = FALSE,
                #               select = NULL,
                #                remove = NULL,
                #                dictionary = NULL,
                #                thesaurus = NULL,
                #                case_insensitive = TRUE,
                verbose = TRUE
  )
print("Initial tweets DFM")
## [1] "Initial tweets DFM"
head(dfm$tweets)
## Document-feature matrix of: 6 documents, 52,013 features (100.0% sparse) and 15 docvars.
##                      features
## docs                  digitaltransformation top companies that offer
##   1080012837211291648                     1   1         1    1     1
##   1080010814562672640                     0   0         0    0     0
##   1079979313833934855                     0   0         0    0     0
##   1079922194644385793                     0   1         1    1     1
##   1079918941319901184                     0   0         0    0     0
##   1079914035754291200                     0   1         1    1     1
##                      features
## docs                  work-from-anywhere jobs wkmg news and
##   1080012837211291648                  1    1    1    1   1
##   1080010814562672640                  0    1    0    0   0
##   1079979313833934855                  0    1    0    0   0
##   1079922194644385793                  1    1    0    0   0
##   1079918941319901184                  0    1    0    0   0
##   1079914035754291200                  1    1    0    0   0
## [ reached max_nfeat ... 52,003 more features ]
print("Top Features in the initial tweets DFM")
## [1] "Top Features in the initial tweets DFM"
quanteda::topfeatures(dfm$tweets,50) %>% kable(., col.names = NULL)# top 50 features before stopwords removal
work 73793
from 71515
anywhere 61396
to 61052
the 55263
and 42495
a 34809
you 26839
in 24297
can 21030
is 20633
of 20627
for 20188
i 18063
with 16271
your 15161
that 11742
it 11447
on 10944
are 10270
this 9750
be 9450
more 8647
remote 7977
we 7959
how 7906
our 7719
world 7665
as 7534
have 7394
work-from-anywhere 7126
at 7086
home 6955
my 6893
business 6649
new 6541
will 5835
s 5780
an 5735
employees 5576
now 5480
or 5436
time 5413
if 5362
people 4943
do 4800
but 4724
working 4685
here 4642
office 4519
dfm$users = tokens$users %>%
  quanteda::dfm(.,
                tolower = TRUE,
                stem = FALSE,
                #               select = NULL,
                #                remove = NULL,
                #                dictionary = NULL,
                #                thesaurus = NULL,
                #                case_insensitive = TRUE,
                verbose = TRUE
  )
print("Initial users DFM")
## [1] "Initial users DFM"
head(dfm$users)
## Document-feature matrix of: 6 documents, 59,828 features (100.0% sparse) and 16 docvars.
##                     features
## docs                 all my puns are intended owner of top-rated marketing
##   60433092             1  1    1   1        1     1  1         1         1
##   813081982758518784   0  0    0   0        0     0  0         0         0
##   862922852063862784   0  1    0   2        0     0  0         0         0
##   91478624             0  0    0   0        0     0  1         0         0
##   35700404             0  0    0   0        0     0  0         0         0
##   16350207             0  1    0   0        0     0  1         0         0
##                     features
## docs                 company
##   60433092                 1
##   813081982758518784       0
##   862922852063862784       0
##   91478624                 0
##   35700404                 0
##   16350207                 0
## [ reached max_nfeat ... 59,818 more features ]
print("Top Features in the initial users DFM")
## [1] "Top Features in the initial users DFM"
quanteda::topfeatures(dfm$users,50) %>% kable(., col.names = NULL)# top 50 features before stopwords removal
and 15903
the 11501
of 9955
to 8549
a 7804
in 6575
for 6552
i 4455
my 4290
is 3957
na 3732
with 3373
business 3306
on 3060
are 3050
at 2853
your 2816
you 2798
it 2529
we 2515
all 1923
own 1900
marketing 1844
digital 1795
technology 1766
by 1655
that 1645
from 1587
tech 1559
news 1556
solutions 1498
about 1493
life 1394
work 1378
our 1365
services 1351
an 1343
not 1328
love 1236
world 1232
people 1218
more 1196
help 1186
founder 1145
views 1122
s 1069
me 1050
tweets 1041
cloud 1008
be 989
#### 2.4 FILTERING THE DOCUMENT FEATURE MATRIX #####
unigrams_vocab=list(tweets=NULL, users=NULL)
for(i in c("tweets","users")){
#### 2.4.1 REMOVE STOPWORDS FROM DFM #####
if(i=="users"){
  mystopwords=c(quanteda::stopwords(language = "en"),"s","$","t", "can")
}else{
  #[OPTION-2]
  mystopwords=c(quanteda::stopwords(language = "en"),"s","$","t", "can","work","from","anywhere","work-from-anywhere","workfromanywhere")#added query terms as stopwords (one could do better by removing them from the strings before building the corpus. See [OPTION-1])
}


dfm[[i]] = dfm_remove(dfm[[i]], pattern = mystopwords, valuetype = "fixed")
  
#### 2.4.2 BUILD INITIAL UNIGRAMS DICTIONARY #####

unigrams_vocab[[i]] = featnames(dfm[[i]])

#### 2.4.3 IDENTIFY RARE WORDS IN DFM #####
threshold=round(nrow(dfm[[i]])/1000,digits = 0)# one every 1000 documents
removed_rare=colSums(dfm[[i]])[colSums(dfm[[i]]>0)<threshold]#remove all tokens that do not appear in at least 1 out of every 1000 documents 
names(removed_rare)=featnames(dfm[[i]])[colSums(dfm[[i]]>0)<threshold]
removed_rare=sort(removed_rare,decreasing = T)
#removed_rare %>% kable()
#table(colSums(dfm[[i]]>0)<threshold) #tabulate N features above (TRUE) and below (FALSE) threshold

#### 2.4.4 IDENTIFY SHORT WORDS (THREE CHARACTERS OR FEWER) IN DFM #####
removed_short=colSums(dfm[[i]])[nchar(featnames(dfm[[i]]))<=3]
names(removed_short)=featnames(dfm[[i]])[nchar(featnames(dfm[[i]]))<=3]
removed_short=sort(removed_short,decreasing = T)
#removed_short %>% kable()
table(nchar(featnames(dfm[[i]]))<=3)


#### 2.4.5 REMOVE RARE/SHORT/NUMBER TOKENS FROM DFM #####
dfm[[i]]=dfm_select(
  dfm[[i]],
  pattern = unique(c(names(removed_rare),names(removed_short))),
  selection = "remove",
  valuetype = "fixed",
  case_insensitive = FALSE,
  verbose = quanteda_options("verbose")
)

#### 2.4.6 KEEP ONLY DOCS WITH AT LEAST THREE TOKENS #####
dfm[[i]] = dfm_subset(dfm[[i]], ntoken(dfm[[i]]) >= 3)#keep only docs with at least 3 tokens

unigrams_vocab[[i]]=featnames(dfm[[i]])

#table(rowSums(dfm[[i]])>=1)#to check if all docs have at least one term 
#table(colSums(dfm[[i]])>=1)#to check if all terms (i.e., features) appear at least once
docvars(dfm[[i]])=data[[i]][docnames(dfm[[i]]),]
}
#Top features in tweets DFM
print("Final tweets DFM")
## [1] "Final tweets DFM"
head(dfm$tweets)
## Document-feature matrix of: 6 documents, 1,780 features (99.7% sparse) and 17 docvars.
##                      features
## docs                  digitaltransformation companies offer jobs news best
##   1080012837211291648                     1         1     1    1    1    0
##   1080010814562672640                     0         0     0    1    0    1
##   1079979313833934855                     0         0     0    1    0    1
##   1079922194644385793                     0         1     1    1    0    0
##   1079918941319901184                     0         0     0    1    0    1
##   1079914035754291200                     0         1     1    1    0    0
##                      features
## docs                  international home world online
##   1080012837211291648             0    0     0      0
##   1080010814562672640             2    1     1      1
##   1079979313833934855             2    1     1      1
##   1079922194644385793             0    0     0      0
##   1079918941319901184             2    1     1      1
##   1079914035754291200             0    0     0      0
## [ reached max_nfeat ... 1,770 more features ]
print("Top Features in the final tweets DFM")
## [1] "Top Features in the final tweets DFM"
topfeatures(dfm$tweets,50) %>% kable(., col.names = NULL)
remote 7895
world 7454
home 6611
business 6604
employees 5226
time 5092
people 4917
working 4666
office 4425
need 3663
want 3541
remotework 3530
learn 3277
just 3054
help 3053
make 2955
like 2949
team 2834
able 2702
today 2538
free 2533
join 2489
companies 2479
workfromhome 2471
best 2463
live 2421
freedom 2374
cloud 2295
online 2248
technology 2227
start 2218
company 2144
digital 2069
looking 2054
internet 2021
future 2014
travel 1991
love 1983
become 1940
many 1900
hiring 1888
jobs 1873
life 1837
read 1829
tech 1827
hours 1801
solutions 1764
pandemic 1761
find 1727
check 1715
print("Final users DFM")
## [1] "Final users DFM"
head(dfm$users)
## Document-feature matrix of: 6 documents, 1,312 features (99.4% sparse) and 18 docvars.
##                     features
## docs                 owner marketing company @forbes contributor better stand
##   60433092               1         1       1       1           1      0     0
##   813081982758518784     0         0       0       0           0      1     1
##   862922852063862784     0         0       0       0           0      0     0
##   91478624               0         0       0       0           0      0     0
##   35700404               0         0       0       0           0      0     0
##   16350207               0         0       0       0           0      1     0
##                     features
## docs                 online business help
##   60433092                0        0    0
##   813081982758518784      2        2    1
##   862922852063862784      0        0    0
##   91478624                0        1    0
##   35700404                0        0    0
##   16350207                0        0    0
## [ reached max_nfeat ... 1,302 more features ]
print("Top Features in the final users DFM")
## [1] "Top Features in the final users DFM"
topfeatures(dfm$users,50) %>% kable(., col.names = NULL)
#topfeatures(dfm$tweets,50,scheme = "docfreq")# number of docs in which the token is present (scheme = "docfreq")
#topfeatures(dfm$tweets,50) / sum(ntoken(dfm$tweets)) #relative freq. of token w.r.t. total number of tokens

docvars(dfm$tweets,field = "postCOVID")=docvars(dfm$tweets,field = "created_at")>=as.Date("2020-03-11")
docvars(dfm$tweets,field = "most_liked")=docvars(dfm$tweets,field = "like_count")> quantile(docvars(dfm$tweets,field = "like_count"), c(.5)) 
docvars(dfm$tweets,field = "most_discussed")=docvars(dfm$tweets,field = "reply_count")> quantile(docvars(dfm$tweets,field = "reply_count"), c(.5)) 
docvars(dfm$tweets,field = "most_retweeted")=docvars(dfm$tweets,field = "retweet_count")> quantile(docvars(dfm$tweets,field = "retweet_count"), c(.5)) 

print("postCOVID top tokens counts (tweets)")
## [1] "postCOVID top tokens counts (tweets)"
topfeatures(dfm$tweets,50,groups =  "postCOVID")  %>% kable(., col.names = NULL)
remote 2972
business 2097
world 1772
time 1714
working 1441
remotework 1355
want 1199
freedom 1180
team 1102
home 1078
hiring 1058
employees 1008
office 1005
people 993
freelance 955
love 866
start 843
make 827
workfromhome 826
need 822
travel 807
become 804
online 791
like 785
free 776
remotejobs 771
able 764
developer 748
just 740
technology 691
internet 691
today 676
looking 671
manager 668
hours 650
check 636
social 622
media 612
management 611
leading 601
world’s 599
x-team 597
life 589
brands 586
jobs 584
financial 577
help 572
boss 566
money 562
achieve 536
world 5682
home 5533
remote 4923
business 4507
employees 4218
people 3924
office 3420
time 3378
working 3225
need 2841
learn 2832
help 2481
want 2342
just 2314
remotework 2175
like 2164
live 2140
make 2128
companies 2023
join 2001
able 1938
best 1936
today 1862
future 1850
cloud 1797
free 1757
pandemic 1757
team 1732
digital 1709
solutions 1691
company 1669
workfromhome 1645
many 1577
tech 1548
technology 1536
read 1529
businesses 1493
online 1457
find 1390
long 1387
looking 1383
start 1375
workforce 1352
great 1341
security 1336
internet 1330
stay 1325
jobs 1289
still 1260
life 1248
print("(0.5 percentile) Most liked  top tokens counts (tweets)")
## [1] "(0.5 percentile) Most liked  top tokens counts (tweets)"
topfeatures(dfm$tweets,50,groups =  "most_liked") %>% kable(., col.names = NULL)
remote 4356
business 4323
world 4274
home 3789
employees 3233
time 2998
people 2583
working 2535
office 2217
need 2107
remotework 1968
want 1940
help 1905
free 1842
learn 1828
make 1765
start 1612
just 1559
online 1526
join 1518
today 1502
like 1473
freedom 1472
team 1460
workfromhome 1414
best 1408
able 1407
cloud 1382
live 1375
companies 1374
become 1348
technology 1339
hours 1339
internet 1295
company 1269
looking 1266
businesses 1229
read 1197
travel 1174
jobs 1132
love 1130
solution 1105
life 1094
solutions 1094
many 1086
digital 1069
boss 1068
check 1047
hiring 1044
future 1028
remote 3539
world 3180
home 2822
people 2334
business 2281
office 2208
working 2131
time 2094
employees 1993
want 1601
remotework 1562
need 1556
just 1495
like 1476
learn 1449
team 1374
able 1295
make 1190
help 1148
companies 1105
workfromhome 1057
best 1055
live 1046
today 1036
digital 1000
future 986
join 971
cloud 913
freedom 902
technology 888
company 875
great 868
love 853
hiring 844
tech 824
travel 817
many 814
looking 788
good 763
life 743
jobs 741
find 737
pandemic 737
also 731
internet 726
flexibility 725
online 722
long 713
free 691
productivity 689
print("(0.5 percentile) Most discussed top tokens counts (tweets)")
## [1] "(0.5 percentile) Most discussed top tokens counts (tweets)"
topfeatures(dfm$tweets,50,groups =  "most_discussed") %>% kable(., col.names = NULL)
remote 6763
world 6367
business 6086
home 5375
employees 4721
time 4289
working 3828
people 3685
office 3507
remotework 3307
learn 3062
need 2987
want 2846
help 2759
make 2491
team 2448
free 2359
workfromhome 2314
join 2299
today 2259
just 2203
like 2187
cloud 2161
best 2148
able 2123
companies 2077
freedom 2064
technology 2021
start 2020
online 1993
live 1939
digital 1804
become 1766
future 1746
company 1742
read 1720
looking 1718
solutions 1668
travel 1664
hiring 1660
love 1630
internet 1614
check 1596
jobs 1584
businesses 1569
find 1539
hours 1539
security 1538
tech 1525
pandemic 1516
home 1236
people 1232
remote 1132
world 1087
office 918
just 851
working 838
time 803
like 762
want 695
need 676
able 579
business 518
employees 505
live 482
make 464
many 410
internet 407
think 407
companies 402
company 402
also 387
team 386
good 378
move 356
going 356
love 353
really 348
life 339
even 337
looking 336
know 336
great 335
long 328
travel 327
still 326
best 315
much 312
back 312
freedom 310
tech 302
take 294
help 294
right 292
jobs 289
year 281
today 279
future 268
digital 265
hours 262
print("(0.5 percentile) Most retweeted top tokens counts (tweets)")
## [1] "(0.5 percentile) Most retweeted top tokens counts (tweets)"
topfeatures(dfm$tweets,50,groups =  "most_retweeted")%>% kable(., col.names = NULL)
world 5923
remote 5797
business 5511
home 5470
time 4291
employees 4134
people 3945
working 3681
office 3448
need 2972
want 2873
remotework 2581
just 2558
learn 2513
make 2453
help 2411
like 2399
able 2265
free 2092
today 2011
best 1978
live 1978
freedom 1971
start 1961
online 1960
workfromhome 1958
companies 1942
team 1892
join 1871
internet 1803
company 1752
technology 1740
cloud 1734
love 1720
travel 1697
become 1639
looking 1593
hours 1582
life 1568
many 1567
jobs 1495
future 1490
read 1480
digital 1477
businesses 1464
pandemic 1444
long 1422
tech 1401
solutions 1371
find 1310
remote 2098
world 1531
home 1141
business 1093
employees 1092
working 985
office 977
people 972
remotework 949
team 942
time 801
learn 764
hiring 697
need 691
want 668
help 642
join 618
digital 592
cloud 561
like 550
companies 537
today 527
future 524
workfromhome 513
make 502
just 496
technology 487
best 485
looking 461
live 443
free 441
able 437
tech 426
find 417
check 413
freedom 403
great 400
solutions 393
company 392
productivity 390
futureofwork 389
security 389
jobs 378
workforce 350
read 349
developer 348
secure 348
flexibility 341
many 333
remotejobs 332
print("Verified accounts top tokens counts (users)")
## [1] "Verified accounts top tokens counts (users)"
topfeatures(dfm$users,50,groups =  "verified") %>% kable(., col.names = NULL)
business 3062
marketing 1745
digital 1644
technology 1639
tech 1418
solutions 1416
services 1288
life 1269
work 1261
love 1160
news 1155
people 1117
help 1104
world 1063
views 1020
founder 993
tweets 953
cloud 947
software 890
travel 881
management 828
helping 823
media 812
social 789
entrepreneur 780
global 777
online 773
opinions 767
writer 757
consultant 751
manager 751
director 744
author 743
lover 730
data 724
design 697
make 684
businesses 682
husband 676
professional 672
company 659
best 658
sales 639
coach 635
security 624
father 622
time 621
community 617
follow 610
enthusiast 607
news 334
business 170
official 135
author 116
world 113
tech 108
digital 100
support 97
account 85
work 83
technology 79
follow 79
twitter 78
editor 78
global 76
people 76
founder 76
help 75
leading 68
india 66
media 65
views 62
speaker 59
latest 57
solutions 56
reporter 56
journalist 55
writer 52
future 51
best 50
tweets 49
services 47
host 47
opinions 46
every 46
marketing 44
cloud 42
businesses 42
social 41
customer 41
podcast 41
former 41
breaking 40
platform 40
travel 39
updates 39
information 39
data 39
co-founder 39
personal 39

Regular expressions (RegEx)

Regular expression (RegEx): a sequence of characters that defines a search pattern, used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation;


RegEx character classes

With a character class you can tell the regex engine to match only one out of several characters: simply place the characters you want to match between square brackets.
For example, if you need to match an a or an e, use the character class [ae].

Some pre-built character classes that can be used inside RegEx patterns:


[:punct:]: punctuation;
[:alpha:]: letters;
[:lower:]: lowercase letters;
[:upper:]: uppercase letters;
[:digit:]: digits;
[:xdigit:]: hex digits;
[:alnum:]: letters and numbers;
[:cntrl:]: control characters;
[:graph:]: letters, numbers, and punctuation;
[:print:]: letters, numbers, punctuation, and whitespace;
[:space:]: space characters (space, tab, newline, carriage return, form feed and vertical tab; roughly equivalent to \s);
[:blank:]: space and tab;

You can create your own character classes using []:


[a-z]: matches every character between a and z (in Unicode code point order).
[^abc]: matches anything except a, b, or c.
[\^\-]: matches ^ or - (the characters are escaped with a backslash).
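A minimal sketch of character classes in base R (the example strings are made up):

x = c("gray", "grey", "Work-From-Anywhere", "wfa2021")
grepl("gr[ae]y", x)# TRUE TRUE FALSE FALSE: matches "gray" or "grey"
grepl("[[:digit:]]", x)# FALSE FALSE FALSE TRUE: contains at least one digit
grepl("[A-Z]", x)# FALSE FALSE TRUE FALSE: contains an uppercase letter
gsub("[^[:alnum:] ]", " ", x)# replace everything that is not a letter, a number or a space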

RegEx quantifiers

Quantifiers specify the number of repetitions of the pattern to be matched:

  • *: matches at least 0 times;
  • +: matches at least 1 time;
  • ?: matches at most 1 time;
  • {n}: matches exactly n times;
  • {n,}: matches at least n times;
  • {n,m}: matches between n and m times;

RegEx position of pattern


Position of pattern identifiers specify the position conditions of the pattern to be matched:

  • ^: matches the start of the string;
  • $: matches the end of the string;
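A minimal sketch combining quantifiers and anchors in base R (the first pattern is the one used to collect the dataset; the example strings are made up):

x = c("work from anywhere", "work-from-anywhere", "workfromanywhere", "teamwork from anywhere else")
grepl("work[ -]?from[ -]?anywhere", x)# all TRUE: "[ -]?" matches a space or a hyphen 0 or 1 times
grepl("^work[ -]?from[ -]?anywhere$", x)# TRUE TRUE TRUE FALSE: anchored to the start and end of the string
grepl("o{2,}", c("good", "god"))# TRUE FALSE: at least two consecutive o's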

RegEx in R


* The kwic function (in the Quanteda library) can be used to search and filter strings by RegEx conditions, and output the context (neighborhood) of identified RegEx matches.
* The gregexpr function can be used to identify the boundaries of sub-strings that match RegEx patterns (also multi-group).

RegEx application

########################################
#### 3 REGEX AND REGEX DICTIONARIES ####
########################################

#### 3.1 KEYWORDS IN CONTEXT (KWIC) WITH QUANTEDA ####
#SIMPLE EXAMPLES
#tweets and descriptions containing like/dislike emojis (removed from clean_text)
print("Keyword in context: 👍")
## [1] "Keyword in context: \U0001f44d"
quanteda::kwic(data$tweets$text,"👍",window=5) %>% head() %>% kable()# UTF code \U0001f44d
docname from to pre keyword post pattern
text2377 53 53 get paid on our results 👍 😉 👍
text2479 19 19 as effective as hands-on healing 👍 👍
text2911 35 35 Click below and apply ! 👍 #witpghjobs #pghtech #womeninstem https://t.co/XyaGoc6GrO 👍
text3372 47 47 are a remote worker . 👍 https://t.co/XVi89vucZx 👍
text3534 33 33 Remote - work from anywhere 👍 Apply https://t.co/RcRnheL17D https://t.co/h6z8A1OHfn 👍
text3911 55 55 the top class work . 👍 👍 👍
print("Keyword in context: 👎")
## [1] "Keyword in context: \U0001f44e"
quanteda::kwic(data$tweets$text,"👎",window=5) %>% head() %>% kable()# UTF code \U0001f44e
docname from to pre keyword post pattern
text47324 38 38 is you , congratulations ! 👎 There are thousands of people 👎
print("Keyword in context: 🏔|🏝")
## [1] "Keyword in context: \U0001f3d4|\U0001f3dd"
quanteda::kwic(data$tweets$text,"🏔|🏝",window=5,valuetype="regex") %>% head() %>% kable()# UTF code  \U0001f3d4 OR UTF code \U0001f3dd
docname from to pre keyword post pattern
text202 39 39 the top of Colorado ! 🏔 https://t.co/djXypfBfHD 🏔|🏝
text351 22 22 : • Work from anywhere 🏝 • Choose your hours ⏱ 🏔|🏝
text670 5 5 Work from anywhere ! 🏔 🏕 🏖 🏜 🏞 🏘 🏔|🏝
text775 26 26 as you have a phone 🏝 📱 Choose your hours you 🏔|🏝
text2159 1 1 🏝 BOSS BABES WANTED 🏝 Tired 🏔|🏝
text2159 5 5 🏝 BOSS BABES WANTED 🏝 Tired of working long hours 🏔|🏝
#ADVANCED EXAMPLES
print("Keyword in context: work[- ]?from[- ]?anywhere[ -]?job")
## [1] "Keyword in context: work[- ]?from[- ]?anywhere[ -]?job"
quanteda::kwic(data$tweets$text,"work[- ]?from[- ]?anywhere[ -]?job",window=5,valuetype="regex") %>% head() %>% kable()
docname from to pre keyword post pattern
text14356 40 40 #remotework #remotejobs #jobs #jobsearch #workfromanywhere #workfromanywherejobs #fjblog https://t.co/nK8aC7vV1R work[- ]?from[- ]?anywhere[ -]?job
text14690 41 41 " @MSN #remotework #remotejobs #workfromanywhere #workfromanywherejobs #jobs https://t.co/Hzecirx0DZ work[- ]?from[- ]?anywhere[ -]?job
text14957 26 26 👉 https://t.co/pGeV1Fes5V @CNBCMakeIt #remotework #remotejobs #workfromanywherejobs #workfromanywhere #fjinthenews #hiring #hiringnow #worklifebalance work[- ]?from[- ]?anywhere[ -]?job
text15074 17 17 should try Visit https://t.co/u69eTjdscq #VoiceNOCTechnician #WorkFromAnywhereJobs #NetworkOperationsTechnician #VoiceServices #NocEngineers work[- ]?from[- ]?anywhere[ -]?job
text17671 23 23 from Anywhere ] https://t.co/uX66tc2H8o #HardwareIntegrationEngineer #WorkFromAnywhereJobs #DesignIntegrationEngineer #IntegrationEngineerJobs #SystemIntegration work[- ]?from[- ]?anywhere[ -]?job
text18547 27 27 mode Remote Jobs : https://t.co/ikCjnJdwuS #WorkFromAnywhereJobs #CiscoJobs #EmploymentOutlook #CiscoEmployment #Cisco @GE_Summit work[- ]?from[- ]?anywhere[ -]?job
print("Keyword in context: (?<!(do not |don't ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}")
## [1] "Keyword in context: (?<!(do not |don't ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}"
quanteda::kwic(data$tweets$text,"(?<!(do not |don't ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}",valuetype="regex",window=5) %>% head() %>% kable()
docname from to pre keyword post pattern
text16 16 16 anywhere as long as you like researching and fiddling with cutting-edge (?<!(do not |don’t ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}
text18 16 16 anywhere as long as you like researching and fiddling with cutting-edge (?<!(do not |don’t ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}
text26 20 20 freelance more so I feel like Im missing out on being (?<!(do not |don’t ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}
text39 16 16 all levels , full of Love , Light , Joy and (?<!(do not |don’t ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}
text44 27 27 , what places would you like to see ? https://t.co/swiL3mXmbX (?<!(do not |don’t ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}
text56 46 46 . You’ll be impressed just like the customer will . Thank (?<!(do not |don’t ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}
#### 3.2 EXTRACTING EMOJI FROM UTF-8 STRINGS USING A REGEX DICTIONARY ####
print("Emoji dictionary")
## [1] "Emoji dictionary"
head(remoji:::dat_core) %>% kable()
emoji description
😄 smiling face with open mouth and smiling eyes
😃 smiling face with open mouth
😀 grinning face
😊 smiling face with smiling eyes
☺️ white smiling face
😉 winking face
dict_emoji=remoji:::dat_core$emoji
names(dict_emoji)=remoji:::dat_core$description

data$tweets=data$tweets %>% dplyr::mutate(emoji = regmatches(
  text,
  gregexpr(pattern = paste(dict_emoji,collapse = "|"), text, perl = TRUE)
)) 

data$users=data$users %>% dplyr::mutate(emoji = regmatches(
  description,
  gregexpr(pattern = paste(dict_emoji,collapse = "|"), description, perl = TRUE)
)) 

#### 3.3 IDENTIFY AND EXTRACT A CAUSAL ARGUMENT WITH A REGEX  ####
#THE FOLLOWING CODE SNIPPET WAS DEVELOPED FOR A PAPER BY SANTAGIUSTINA & WARGLIEN ABOUT THE NO-DEAL BREXIT (FORTHCOMING)
list_cause_verbs = c("cause","causes", "causing","caused",
  "determine","determines","determining","determined",
  "engender","engenders","engendered","engendering",
  "entail","entails","entailed","entailing",
  "provoke","provokes","provoked","provoking",
  "trigger","triggers","triggered","triggering"
)
print("Causal verbs:")
## [1] "Causal verbs:"
list_cause_verbs
##  [1] "cause"       "causes"      "causing"     "caused"      "determine"  
##  [6] "determines"  "determining" "determined"  "engender"    "engenders"  
## [11] "engendered"  "engendering" "entail"      "entails"     "entailed"   
## [16] "entailing"   "provoke"     "provokes"    "provoked"    "provoking"  
## [21] "trigger"     "triggers"    "triggered"   "triggering"
#build verbs regex
all_at_end=paste(
  c(list_cause_verbs),
  collapse  = "|"
)

#IDENTIFY CAUSAL STATEMENTS (NAIVE FORM)
cause_effect_regex = paste("(?<cause>[^:;,!]{2,})(?<rel_operator>",all_at_end,"){1}(?<effect>[^:;,!]{2,})",sep = "")  

example= quanteda::tokenize_sentence(data$tweets$text[61257])[[1]][1] #"I wonder if starlink will cause a wave of reverse migration?"

print(example) 
## [1] "I wonder if starlink will cause a wave of reverse migration?"
match=gregexpr(
  cause_effect_regex ,
  example,
  perl = T,
  ignore.case = T
)
print("Gregexpr output:")
## [1] "Gregexpr output:"
match
## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 60
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## attr(,"capture.start")
##      cause rel_operator effect
## [1,]     1           27     32
## attr(,"capture.length")
##      cause rel_operator effect
## [1,]    26            5     29
## attr(,"capture.names")
## [1] "cause"        "rel_operator" "effect"
#cause
print("RegEx matches CAUSE:")
## [1] "RegEx matches CAUSE:"
st = attr(match[[1]], "capture.start")[, "cause"]
len = attr(match[[1]], "capture.length")[, "cause"]
substring(example, st, st + len - 1)
## [1] "I wonder if starlink will "
print("RegEx matches OPERATOR (RELATION):")
## [1] "RegEx matches OPERATOR (RELATION):"
#rel_operator
st = attr(match[[1]], "capture.start")[, "rel_operator"]
len = attr(match[[1]], "capture.length")[, "rel_operator"]
substring(example, st, st + len - 1)
## [1] "cause"
print("RegEx matches EFFECT:")
## [1] "RegEx matches EFFECT:"
#effect
st = attr(match[[1]], "capture.start")[, "effect"]
len = attr(match[[1]], "capture.length")[, "effect"]
substring(example, st, st + len - 1)
## [1] " a wave of reverse migration?"

Feature co-occurrence matrix

A feature co-occurrence matrix (FCM) records the number of co-occurrences, within documents (here tweets), of pairs of features (e.g., terms). Each feature is identified by both a row and a column, so the matrix is symmetric, and the diagonal (often) contains the feature occurrence counts.
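A minimal toy sketch (two invented documents) showing the structure of an FCM:

require(quanteda)
require(magrittr)
toy_fcm = c("remote work is flexible work",
            "remote teams work from anywhere") %>%
  quanteda::tokens() %>%
  quanteda::fcm(context = "document")
toy_fcm# cell (i,j) = number of within-document co-occurrences of features i and j (quanteda fills the upper triangle by default)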

##### 4 FEATURES CO-OCCURRENCE MATRIX (FCM) AND NETWORK ####
topN=100 #parameter representing the number of top features considered
edge_filter_threshold=0.95 #parameter representing the percentile threshold filter applied to edges
#### 4.1 TWEETS FCM #### 
tweets_fcm= dfm$tweets %>% 
  quanteda::dfm_select(.,pattern= names(topfeatures(dfm$tweets,topN)) %>% as.character(),selection="keep",valuetype="fixed") %>%
  quanteda::fcm(.)

tweets_fcm_graph=tweets_fcm %>% igraph::graph_from_adjacency_matrix(.,weighted = T,diag = F,mode = "upper")
tweets_fcm_graph=list(edges=as.data.frame(cbind(igraph::as_edgelist(tweets_fcm_graph),E(tweets_fcm_graph)$weight)),
                           nodes=  data.frame(token=names(topfeatures(dfm$tweets,topN)),freq=topfeatures(dfm$tweets,topN)) )
#edge (percentile) filter

keep_edges=which(as.numeric(tweets_fcm_graph$edges[,3]) >= quantile(as.numeric(tweets_fcm_graph$edges[,3]),edge_filter_threshold))
visNetwork(
  height = "900px",
  width = "100%",
  nodes = data.frame(
    id = tweets_fcm_graph$nodes[,1]
    ,
    label = tweets_fcm_graph$nodes[,1]
    ,
    value =  as.integer(as.numeric(tweets_fcm_graph$nodes[,2]))^2
    , 
    color.background = "white"
    ,
    color.border= "black"
    ,
    shape = "box"
    #,group = temp_$vertices$community_walktrap
  ),
  edges = data.frame(
    from=tweets_fcm_graph$edges[keep_edges,1]
    ,
    to= tweets_fcm_graph$edges[keep_edges,2]
    ,
    
    width = as.integer(as.numeric(tweets_fcm_graph$edges[keep_edges,3])^(1/2))/5 #log(1.001+temp_$edges$weight)*1000
    , smooth.enabled = F
    #  ,smooth.roundness = 0.70
    ,color = list(color = "darkgray",
                  highlight = "darkred")
  )
) %>%  
  visNodes(scaling = list(label = list(enabled = T)))%>%
  visIgraphLayout(layout = "layout_nicely",physics = F,smooth = T) %>%
  visInteraction(
    keyboard = F,
    navigationButtons =  T,
    dragNodes = T,
    dragView = T,
    zoomView = T
  )  %>%  
  visOptions(
    highlightNearest = list(enabled = TRUE, hover = TRUE, degree = 1),
    nodesIdSelection = T
  ) 
# 4.2 EMOJI FCM
tweets_emoji_freq=data$tweets$emoji %>% unlist() %>% table() %>% sort(decreasing = T) %>% as.data.frame() 
#tweets_emoji_freq %>% View()

text_emojis=data$tweets$emoji
names(text_emojis)=data$tweets$id
text_emojis_fcm= text_emojis %>%
  quanteda::as.tokens(.) %>%
  quanteda::dfm(.,tolower = F) %>% 
  quanteda::dfm_select(.,pattern= tweets_emoji_freq[1:topN,1] %>% as.character(),selection="keep",valuetype="fixed") %>%
  quanteda::fcm(.)
  #head(text_emojis_fcm)

text_emojis_fcm_graph=text_emojis_fcm %>% igraph::graph_from_adjacency_matrix(.,weighted = T,diag = F,mode = "upper")
text_emojis_fcm_graph=list(edges=as.data.frame(cbind(igraph::as_edgelist(text_emojis_fcm_graph),E(text_emojis_fcm_graph)$weight)),
                           nodes= tweets_emoji_freq[1:topN,])
keep_edges=which(as.numeric(text_emojis_fcm_graph$edges[,3]) >= quantile(as.numeric(text_emojis_fcm_graph$edges[,3]),edge_filter_threshold))

visNetwork(
  height = "900px",
  width = "100%",
  nodes = data.frame(
    id = text_emojis_fcm_graph$nodes[,1]
    ,
    label = text_emojis_fcm_graph$nodes[,1]
    ,
    value =  as.integer(as.numeric(text_emojis_fcm_graph$nodes[,2]))^2
    , 
    color.background = "white"
    ,
    color.border= "black"
    ,
    shape = "box"
    #,group = temp_$vertices$community_walktrap
  ),
  edges = data.frame(
    from=text_emojis_fcm_graph$edges[keep_edges,1]
    ,
    to= text_emojis_fcm_graph$edges[keep_edges,2]
    ,
    
    width = as.integer(as.numeric(text_emojis_fcm_graph$edges[keep_edges,3])^(1/2))/5 #log(1.001+temp_$edges$weight)*1000
     , smooth.enabled = F
    #  ,smooth.roundness = 0.70
    ,color = list(color = "darkgray",
    highlight = "darkred")
    )
  ) %>%  
  visNodes(scaling = list(label = list(enabled = T)))%>%
  visIgraphLayout(layout = "layout_nicely",physics = F,smooth = T) %>%
  visInteraction(
    keyboard = F,
    navigationButtons =  T,
    dragNodes = T,
    dragView = T,
    zoomView = T
  )  %>%  
  visOptions(
    highlightNearest = list(enabled = TRUE, hover = TRUE, degree = 1),
    nodesIdSelection = T
  ) 

Natural Language Processing with Spacy(R)



Spacy website: https://spacy.io/

spacyR github page: https://github.com/quanteda/spacyr
We use spaCy's state-of-the-art transformer model for web data (en_core_web_trf), which is based on RoBERTa. More info at https://huggingface.co/roberta-base

- Paper on transformer models for NLP by Vaswani et al. (2017): https://arxiv.org/pdf/1706.03762.pdf
- RoBERTa paper by Liu et al. (2019): https://arxiv.org/pdf/1907.11692.pdf

Part-of-Speech tagging



Part-of-Speech tagging is the process of labeling each word in a text as corresponding to a particular part of speech (its grammatical role), based also on its context;

Dependency parsing



Dependency parsing is the process of extracting the dependency parse of a sentence to represent its grammatical structure. It defines the dependency relationships between head words and their dependents. The head of the sentence has no dependency and is called the root of the sentence.
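A minimal sketch with spacyR (the sentence is invented; the model is initialized in section 5.1 below):

spacyr::spacy_parse("Employees can work from anywhere.",
                    pos = TRUE, dependency = TRUE) %>%
  dplyr::select(token, pos, head_token_id, dep_rel)
# each token gets a part-of-speech tag, the id of its head token and its dependency relation;
# here the main verb ("work") should be the root of the sentence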

Named entity recognition



Spacy(R) basic named entity categories:

  • PERSON: People, including fictional;
  • NORP: Nationalities or religious or political groups;
  • FAC: Buildings, airports, highways, bridges, etc;
  • ORG: Companies, agencies, institutions, etc;
  • GPE: Countries, cities, states;
  • LOC: Non-GPE locations, mountain ranges, bodies of water;
  • PRODUCT: Objects, vehicles, foods, etc. (Not services.);
  • EVENT: Named hurricanes, battles, wars, sports events, etc.;
  • WORK_OF_ART: Titles of books, songs, etc.;
  • LAW: Named documents made into laws;
  • LANGUAGE: Any named language;


Spacy(R) extended categories:

  • DATE: Absolute or relative dates or periods;
  • TIME: Times smaller than a day;
  • PERCENT: Percentage, including “%”;
  • MONEY: Monetary values, including unit;
  • QUANTITY: Measurements, as of weight or distance;
  • ORDINAL: “first”, “second”, etc.;
  • CARDINAL: Numerals that do not fall under another type;
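A minimal sketch of entity extraction with spacyR (invented sentence; the model is initialized in the chunk below):

parsed_example = spacyr::spacy_parse("Microsoft said in March 2021 that employees can work from anywhere in the United States.",
                                     entity = TRUE)
spacyr::entity_extract(parsed_example)# named entities only (e.g., ORG, GPE)
spacyr::entity_extract(parsed_example, type = "all")# also extended categories (e.g., DATE)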
########################################
#### 5 NATURAL LANGUAGE PROCESSING  ####
########################################

#### 5.1 NLP WITH SPACY (ROBERTA TRANSFORMER MODEL) THROUGH SPACYR R LIBRARY ####
#Sys.setenv("OMP_NUM_THREADS"=10)# to set the number of threads used by spacy
#spacyr::spacy_install() #to install spacy through conda/miniconda in the local machine
#spacyr::spacy_install(lang_models = c("en_core_web_trf"))#to download the RoBERTa-based transformer model (https://huggingface.co/roberta-base)
spacyr::spacy_initialize(model = "en_core_web_trf")#Initialize the spaCy transformer model. Other spaCy models for English web data: "en_core_web_sm", "en_core_web_lg"

#EXAMPLE
ids=c(1080470620293550080)#ID(s) of the selected tweets

tweet_example=data$tweets$clean_text[which(data$tweets$id %in% ids)]#select tweet by ID

print(tweet_example)
## [1] "Learn how to manage 1000s of devices for your remote workforce. Join us on Jan 8 at 2 PM CET.   See how TeamViewer Tensor enhances your productivity by empowering all employees to work from anywhere with any device at any time in a secure and fast manner.  "
NLP_example=list(spacy_raw=NULL ,entities=NULL ,nounphrases=NULL)
NLP_example$spacy_raw= spacy_parse(
  tweet_example,
   dependency = TRUE,
   lemma = TRUE,
   pos = TRUE,
   entity = TRUE,
   nounphrase = TRUE,
   additional_attributes = c("tag_","ent_type_","norm_"))#nlp pipeleine (tokenize sentences, tokenize words, POS, DEP-REL, UNIV.POS, EE ...) 
NLP_example$spacy_raw %>% kable()
| doc_id | sentence_id | token_id | token | lemma | pos | head_token_id | dep_rel | entity | nounphrase | whitespace | tag_ | ent_type_ | norm_ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| text1 | 1 | 1 | Learn | learn | VERB | 1 | ROOT |  |  | TRUE | VB |  | learn |
| text1 | 1 | 2 | how | how | ADV | 4 | advmod |  |  | TRUE | WRB |  | how |
| text1 | 1 | 3 | to | to | PART | 4 | aux |  |  | TRUE | TO |  | to |
| text1 | 1 | 4 | manage | manage | VERB | 1 | xcomp |  |  | TRUE | VB |  | manage |
| text1 | 1 | 5 | 1000s | 1000 | NOUN | 4 | dobj | CARDINAL_B | beg_root | TRUE | NNS | CARDINAL | 1000s |
| text1 | 1 | 6 | of | of | ADP | 5 | prep |  |  | TRUE | IN |  | of |
| text1 | 1 | 7 | devices | device | NOUN | 6 | pobj |  | beg_root | TRUE | NNS |  | devices |
| text1 | 1 | 8 | for | for | ADP | 4 | prep |  |  | TRUE | IN |  | for |
| text1 | 1 | 9 | your | your | PRON | 11 | poss |  | beg | TRUE | PRP$ |  | your |
| text1 | 1 | 10 | remote | remote | ADJ | 11 | amod |  | mid | TRUE | JJ |  | remote |
| text1 | 1 | 11 | workforce | workforce | NOUN | 8 | pobj |  | end_root | FALSE | NN |  | workforce |
| text1 | 1 | 12 | . | . | PUNCT | 1 | punct |  |  | TRUE | . |  | . |
| text1 | 2 | 1 | Join | join | VERB | 1 | ROOT |  |  | TRUE | VB |  | join |
| text1 | 2 | 2 | us | we | PRON | 1 | dobj |  | beg_root | TRUE | PRP |  | us |
| text1 | 2 | 3 | on | on | ADP | 1 | prep |  |  | TRUE | IN |  | on |
| text1 | 2 | 4 | Jan | Jan | PROPN | 3 | pobj | DATE_B | beg_root | TRUE | NNP | DATE | jan |
| text1 | 2 | 5 | 8 | 8 | NUM | 4 | nummod | DATE_I |  | TRUE | CD | DATE | 8 |
| text1 | 2 | 6 | at | at | ADP | 1 | prep |  |  | TRUE | IN |  | at |
| text1 | 2 | 7 | 2 | 2 | NUM | 8 | nummod | TIME_B | beg | TRUE | CD | TIME | 2 |
| text1 | 2 | 8 | PM | pm | NOUN | 9 | compound | TIME_I | mid | TRUE | NN | TIME | pm |
| text1 | 2 | 9 | CET | CET | PROPN | 6 | pobj | TIME_I | end_root | FALSE | NNP | TIME | cet |
| text1 | 2 | 10 | . | . | PUNCT | 1 | punct |  |  | TRUE | . |  | . |
| text1 | 3 | 1 |  |  | SPACE | 1 | ROOT |  |  | FALSE | _SP |  |  |
| text1 | 4 | 1 | See | see | VERB | 1 | ROOT |  |  | TRUE | VB |  | see |
| text1 | 4 | 2 | how | how | ADV | 5 | advmod |  |  | TRUE | WRB |  | how |
| text1 | 4 | 3 | TeamViewer | TeamViewer | PROPN | 4 | compound | PRODUCT_B | beg | TRUE | NNP | PRODUCT | teamviewer |
| text1 | 4 | 4 | Tensor | Tensor | PROPN | 5 | nsubj | PRODUCT_I | end_root | TRUE | NNP | PRODUCT | tensor |
| text1 | 4 | 5 | enhances | enhance | VERB | 1 | ccomp |  |  | TRUE | VBZ |  | enhances |
| text1 | 4 | 6 | your | your | PRON | 7 | poss |  | beg | TRUE | PRP$ |  | your |
| text1 | 4 | 7 | productivity | productivity | NOUN | 5 | dobj |  | end_root | TRUE | NN |  | productivity |
| text1 | 4 | 8 | by | by | ADP | 5 | prep |  |  | TRUE | IN |  | by |
| text1 | 4 | 9 | empowering | empower | VERB | 8 | pcomp |  |  | TRUE | VBG |  | empowering |
| text1 | 4 | 10 | all | all | DET | 11 | det |  | beg | TRUE | DT |  | all |
| text1 | 4 | 11 | employees | employee | NOUN | 9 | dobj |  | end_root | TRUE | NNS |  | employees |
| text1 | 4 | 12 | to | to | PART | 13 | aux |  |  | TRUE | TO |  | to |
| text1 | 4 | 13 | work | work | VERB | 9 | xcomp |  |  | TRUE | VB |  | work |
| text1 | 4 | 14 | from | from | ADP | 13 | prep |  |  | TRUE | IN |  | from |
| text1 | 4 | 15 | anywhere | anywhere | ADV | 14 | pobj |  |  | TRUE | RB |  | anywhere |
| text1 | 4 | 16 | with | with | ADP | 13 | prep |  |  | TRUE | IN |  | with |
| text1 | 4 | 17 | any | any | DET | 18 | det |  | beg | TRUE | DT |  | any |
| text1 | 4 | 18 | device | device | NOUN | 16 | pobj |  | end_root | TRUE | NN |  | device |
| text1 | 4 | 19 | at | at | ADP | 13 | prep |  |  | TRUE | IN |  | at |
| text1 | 4 | 20 | any | any | DET | 21 | det |  | beg | TRUE | DT |  | any |
| text1 | 4 | 21 | time | time | NOUN | 19 | pobj |  | end_root | TRUE | NN |  | time |
| text1 | 4 | 22 | in | in | ADP | 13 | prep |  |  | TRUE | IN |  | in |
| text1 | 4 | 23 | a | a | DET | 27 | det |  | beg | TRUE | DT |  | a |
| text1 | 4 | 24 | secure | secure | ADJ | 27 | amod |  | mid | TRUE | JJ |  | secure |
| text1 | 4 | 25 | and | and | CCONJ | 24 | cc |  | mid | TRUE | CC |  | and |
| text1 | 4 | 26 | fast | fast | ADJ | 24 | conj |  | mid | TRUE | JJ |  | fast |
| text1 | 4 | 27 | manner | manner | NOUN | 22 | pobj |  | end_root | FALSE | NN |  | manner |
| text1 | 4 | 28 | . | . | PUNCT | 1 | punct |  |  | TRUE | . |  | . |
| text1 | 5 | 1 |  |  | SPACE | 1 | ROOT |  |  | FALSE | _SP |  |  |
NLP_example$entities= spacyr::entity_consolidate(NLP_example$spacy_raw)#(named) entity consolidation
NLP_example$nounphrases= spacyr::nounphrase_extract(NLP_example$spacy_raw)#noun phrase extraction
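# For a quicker look at the NER output (including the extended types listed above, e.g. DATE, TIME),
# the recognized entities can be pulled out of the parsed object directly. A minimal illustrative
# sketch using spacyr::entity_extract() on the example tweet parsed above:
spacyr::entity_extract(NLP_example$spacy_raw, type = "all") %>% kable()#one row per entity, with its entity_type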



#NLP FOR THE WHOLE DATASET (users and tweets) TAKES ~10 HOURS
# #Code below was used for NLP pipeline with all tweets and users descriptions 
# NLP_data_raw=list(tweets=NULL,users=NULL)
# users=data$users$clean_description
# names(users)=data$users$id
# tweets=data$tweets$clean_text
# names(tweets)=data$tweets$id
# start=Sys.time()
# NLP_data_raw$users = spacy_parse(
#   users,
#   dependency = TRUE,
#   lemma = TRUE,
#   pos = TRUE,
#   entity = TRUE,
#   nounphrase = TRUE,
#   additional_attributes = c("tag_")
# ) #,"morph"
# NLP_data_raw$tweets = spacy_parse(
#   tweets,
#   dependency = TRUE,
#   lemma = TRUE,
#   pos = TRUE,
#   entity = T,
#   nounphrase = T,
#   additional_attributes = c("tag_")
# ) #,"morph"
# print(paste("NLP time:", Sys.time()-start, "hours"))#"NLP time: 10.1510577821732 hours" on 
# spacy_finalize()
# saveRDS(NLP_data_raw,"NLP_data_raw.RDS")
NLP_data_raw=readRDS("../data/NLP_data_raw.RDS")
#View(NLP_data_raw)

#### 5.2 NLP DATA AUGMENTATION ####

#EXAMPLES:
# merging NLP data with AFINN sentiment lexicon
dict_afinn = corpus::sentiment_afinn %>%
  as_tibble() %>%
  rename(word = 1,
         sentiment = 2)
NLP_example$spacy_raw=NLP_example$spacy_raw %>%
  left_join(dict_afinn, by = c("lemma"= "word"))# use lemma or norm_ depending on dict type

# merging NLP data with Affect Wordnet
dict_wordnet =corpus::affect_wordnet %>%
  as_tibble()   %>%
  select(term, category, emotion) %>%
  rename(word = 1,
         wn_category = 2,
         wn_emotion = 3)
NLP_example$spacy_raw=NLP_example$spacy_raw %>%
  left_join(dict_wordnet, by = c("lemma"= "word"))# use lemma or norm_ depending on dict type
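# The augmented token-level table can be aggregated back to the document level, e.g. to get a rough
# AFINN-based sentiment score per tweet. A minimal sketch (not part of the original pipeline): the
# distinct() step avoids double counting tokens duplicated by the one-to-many WordNet join.
NLP_example$spacy_raw %>%
  dplyr::distinct(doc_id, sentence_id, token_id, sentiment) %>% # one row per token
  dplyr::group_by(doc_id) %>%
  dplyr::summarise(afinn_score = sum(sentiment, na.rm = TRUE))  # tokens without an AFINN match (NA) are ignored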

#EXAMPLES:
#NLP_data_raw=readRDS("../data/NLP_data_raw.RDS")
NLP_data_augmented=NLP_data_raw
for(i in c("tweets", "users"))
{
  NLP_data_augmented[[i]]=NLP_data_raw[[i]] %>%
    left_join(dict_wordnet, by = c("lemma"= "word"))
  NLP_data_augmented[[i]]=NLP_data_augmented[[i]]  %>%
    left_join(dict_afinn, by = c("lemma"= "word"))# use lemma or norm_ depending on dict type
}
View(NLP_data_augmented$tweets)
View(NLP_data_augmented$users)

#### 5.3 NLP  DATA TRANSFORMATION IN TWO DFMs: ONE FOR UNIGRAMS THE OTHER FOR DEP.REL (ASSOCIATIONS) ####
#NLP_data_raw=readRDS("../data/NLP_data_raw.RDS")
NLP_data=list(tweets=NULL,users=NULL)
source_nodes=list(tweets=NULL,users=NULL)
target_nodes=list(tweets=NULL,users=NULL)
net_data=list(tweets=NULL,users=NULL)
quanteda_unigrams=list(tweets=NULL,users=NULL)
quanteda_associations=list(tweets=NULL,users=NULL)
associations_dfm=list(tweets=NULL,users=NULL)
unigrams_dfm=list(tweets=NULL,users=NULL)

for(i in c("tweets", "users"))
{
#if consolidating noun phrases remember to redirect head token IDs of removed objects to existing ones.
NLP_data[[i]]=NLP_data_raw[[i]][!((NLP_data_raw[[i]]$pos %in% c("PUNCT","SPACE","CCONJ","NUM","PRON","ADP","SYM","PRON","DET","PART")) |(NLP_data_raw[[i]]$dep_rel %in% c("cc"))),]# POS filter

NLP_data[[i]]=NLP_data[[i]][!nchar(NLP_data[[i]]$lemma)<=1,]#length filter (not needed if one-character tokens were already removed)

source_nodes[[i]]=NLP_data[[i]][NLP_data[[i]]$dep_rel %in% c("aux","auxpass","cop","arg","comp","acomp","ccomp","xcomp","obj","dobj","iobj","pobj","subj","nsubj","csubj","conj","amod","appos","advcl","preconj","vmod","mwe","mark","advmod","neg","rcmod","quantmod","nn","npadvmod","tmod","num","number","prep","poss","parataxis","goeswith","ref","sdep","xsubj","compound"),] %>% # DEP-REL filter
  dplyr::select(doc_id, sentence_id, token_id=head_token_id,pos_source=pos,entity_source=entity,
                word_source = token, lemma_source = lemma, dep_rel)
target_nodes[[i]]= NLP_data[[i]] %>% dplyr::select(doc_id, sentence_id, token_id,pos_target=pos,entity_target=entity,
                                         word_target = token, lemma_target = lemma)
#saveRDS(NLP_data,paste0(folder,"NLP_data.RDS"))
net_data[[i]]= dplyr::left_join(source_nodes[[i]] , target_nodes[[i]],
                           by = c("doc_id", "sentence_id","token_id"))
net_data[[i]]=net_data[[i]][net_data[[i]]$dep_rel!="ROOT",]
net_data[[i]]=net_data[[i]][!is.na(net_data[[i]]$lemma_target),]
net_data[[i]][,c("lemma_source","lemma_target")]
net_data[[i]]=net_data[[i]][!(net_data[[i]]$lemma_source == net_data[[i]]$lemma_target),]#remove loops

net_data[[i]]$edge=paste(net_data[[i]]$lemma_source,net_data[[i]]$lemma_target,sep = "->")
#saveRDS(net_data[[i]],paste0(folder,"net_data.RDS"))
quanteda_unigrams[[i]]=quanteda::as.tokens(NLP_data[[i]] %>% dplyr::select(doc_id, sentence_id, token_id, token = lemma)) %>% quanteda::tokens_tolower()
quanteda_associations[[i]]=quanteda::as.tokens(net_data[[i]]  %>% dplyr::select(doc_id, sentence_id, token_id, token = edge)) %>% quanteda::tokens_tolower()
#remove documents that don't have any bigram

unigrams_dfm[[i]] =quanteda_unigrams[[i]][names(quanteda_associations[[i]])]%>% quanteda::dfm()
associations_dfm[[i]]= quanteda_associations[[i]] %>% quanteda::dfm()

}
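# Before moving on to the topic models, it is worth sanity-checking the two document-feature matrices,
# e.g. by inspecting their most frequent features (a quick optional check; nothing below depends on it):
quanteda::topfeatures(unigrams_dfm$tweets, n = 20)      # most frequent lemmas
quanteda::topfeatures(associations_dfm$tweets, n = 20)  # most frequent dependency-relation edges (lemma->lemma)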

Topic Modeling with Probabilistic Graphical Models (e.g., LDA)



See the LDA paper: https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf (Blei, Ng & Jordan, 2003)
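As a brief reminder (standard LDA notation, not specific to this tutorial's data), LDA assumes the following generative process for each document $d$:

$$\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad \beta_k \sim \mathrm{Dirichlet}(\eta),$$
$$z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \qquad w_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}}),$$

where $\theta_d$ are the document-topic proportions, $\beta_k$ the topic-word distributions, $z_{d,n}$ the topic assignment of the $n$-th token and $w_{d,n}$ the observed token. The hyperparameter $\alpha$ controls how sparse the document-topic proportions are (see the comment on alpha in section 6.1 below).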

Enriching topic models and facilitating their interpretation (e.g., LDA2NET)



See LDA2Net paper: https://www.researchgate.net/publication/356717976_LDA2Net_Digging_under_the_surface_of_COVID-19_topics_in_scientific_literature (Minello, Santagiustina, Warglien, 2021)

#### 6 TOPIC MODELS ####
#### 6.1 LDA TOPIC MODEL (FOR TWEETS DFM FROM SECTION 2.3) ####
Ntopics=5
tweetsLDA=topicmodels::LDA(dfm$tweets,k=Ntopics,method = "VEM",control=list(seed=1,estimate.beta=TRUE))#a typical value of alpha is 50/k, where k is the number of topics; typical values for the topic-word prior are 0.1 or 200/W, where W is the vocabulary size
#A higher alpha gives denser document-topic distributions, whereas a lower alpha gives sparser ones.
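# If you prefer to fix the Dirichlet prior instead of letting VEM estimate it, alpha can be passed
# (and kept fixed) through the control list. A minimal sketch using the 50/k rule of thumb mentioned
# above (the object name is purely illustrative):
# tweetsLDA_fixed_alpha = topicmodels::LDA(dfm$tweets, k = Ntopics, method = "VEM",
#                                          control = list(seed = 1, alpha = 50/Ntopics, estimate.alpha = FALSE))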
tweetsLDA_top_terms = tidytext::tidy(tweetsLDA, matrix = "beta") %>%#beta contains the word X topic distributions
  group_by(topic) %>%
  slice_max(beta, n = 20) %>% 
  ungroup() %>%
  arrange(topic, -beta)

tweetsLDA_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

#Show examples for each topic
selectedtopic=4
data$tweets[docnames(dfm$tweets[which(tweetsLDA@gamma[,selectedtopic]==max(tweetsLDA@gamma[,selectedtopic])),]),] %>% View() #gamma contains the topic X document distributions


#### 6.2 SEEDED LDA TOPIC MODEL (FOR USERS DFM FROM SECTION 2.3)  ####
#define dictionary with seedwords
seed_dict=list(WFA_worker = c("nomad","worker","employee","traveller","globetrotter","backpacker"),#digital nomads et al.
               WFA_place_community= c("rent","office","meeting","room","appartment","community","join"),
               WFA_firm  = c("employer","firm","company","corporate","business","enterprise","we","our","group","llc"),# WFA firms/employers 
               WFA_manager =c("ceo","executive","director","manager","management","board"),
               WFA_coach =c("coach","coaching","support","supporting","consultant","consulting"),
               WFA_investor=c("investor", "investing","entrepreneur","founder","funding"),
               WFA_media = c("news","reporter", "journalist","writer", "magazine","journal","articles","updates","correspondent")
               )
keyword_counts=list()
for(i in names(seed_dict)){
  keyword_counts[[i]]=sort(colSums(dfm$users[,  quanteda::featnames(dfm$users) %in% seed_dict[[i]]]),decreasing = T)
  if(sum(keyword_counts[[i]])==0 | length(keyword_counts[[i]])==0){
    seed_dict[[i]]=NULL
  }else{
    seed_dict[[i]]=names(sort(keyword_counts[[i]][keyword_counts[[i]]!=0],decreasing = T))#keep only seed words that actually occur in the users DFM
  }
}
lapply(keyword_counts, function(x) paste(names(x) ," (",x,")", sep="",collapse = "; "))
## $WFA_worker
## [1] "employee (138); traveller (87); nomad (76); worker (73)"
## 
## $WFA_place_community
## [1] "community (649); office (324); join (285); meeting (72)"
## 
## $WFA_firm
## [1] "business (3232); company (688); enterprise (377); group (319); firm (307); corporate (217); employer (109)"
## 
## $WFA_manager
## [1] "management (853); director (782); manager (766); executive (385); board (273)"
## 
## $WFA_coach
## [1] "consultant (772); support (662); coach (648); consulting (410); coaching (137); supporting (114)"
## 
## $WFA_investor
## [1] "founder (1069); entrepreneur (807); investor (453); investing (174)"
## 
## $WFA_media
## [1] "news (1489); writer (809); journalist (182); updates (179); magazine (110); reporter (94); articles (91)"
usersSLDA=seededlda::textmodel_seededlda(x = dfm$users,seed_dict %>% quanteda::dictionary(.),residual = T,
                                         valuetype = c("fixed"),max_iter=2000, weight=0.0025,case_insensitive=T)


usersSLDA_top_terms = setNames(reshape2::melt(usersSLDA$phi),c("topic","term","beta")) %>%#beta contains the word X topic distributions
  group_by(topic) %>%
  slice_max(beta, n = 25) %>% 
  ungroup() %>%
  arrange(topic, -beta)

usersSLDA_top_terms[nchar(as.character(usersSLDA_top_terms$term))>=3 & usersSLDA_top_terms$topic!="other",] %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

table(seededlda::topics(usersSLDA))
## 
##          WFA_worker WFA_place_community            WFA_firm         WFA_manager 
##                4489                3667                4159                3835 
##           WFA_coach        WFA_investor           WFA_media               other 
##                3266                3659                3189                3740
#Examples:
#Cisco Collaboration (userID: 15866857) https://twitter.com/ciscocollab 
usersSLDA$theta["15866857",]
##          WFA_worker WFA_place_community            WFA_firm         WFA_manager 
##          0.02941176          0.08823529          0.08823529          0.08823529 
##           WFA_coach        WFA_investor           WFA_media               other 
##          0.08823529          0.08823529          0.44117647          0.08823529
#Financial Times (userID: 4898091) https://twitter.com/FinancialTimes
usersSLDA$theta["4898091",]
##          WFA_worker WFA_place_community            WFA_firm         WFA_manager 
##                0.05                0.05                0.05                0.05 
##           WFA_coach        WFA_investor           WFA_media               other 
##                0.05                0.05                0.65                0.05
#Users with the highest topic probability, by group
type="WFA_media"
data$users[names(which(usersSLDA$theta[,type]==max(usersSLDA$theta[,type]))),] %>% View()
type="WFA_worker"
data$users[names(which(usersSLDA$theta[,type]==max(usersSLDA$theta[,type]))),] %>% View()
type="WFA_firm"
data$users[names(which(usersSLDA$theta[,type]==max(usersSLDA$theta[,type]))),] %>% View()


####6.3 LDA2NET: (TWEETS) TOPIC MODEL ENRICHMENT WITH WEIGHTED DEPENDENCY RELATIONS ####

i="tweets"
#associations_dfm=readRDS("associations_dfm.RDS")

associations_vocab = data.frame(
  from = gsub("->.*$", "", x = colnames(associations_dfm[[i]]), perl = T),
  to = gsub("^.*->", "", x = colnames(associations_dfm[[i]]), perl = T),
  stringsAsFactors = F
)
rownames(associations_vocab)=colnames(associations_dfm[[i]])
associations_vocab_clean = associations_vocab[(associations_vocab$from %in% colnames(dfm[[i]])) &
                                      (associations_vocab$to %in% colnames(dfm[[i]])), ]
associations_vocab_clean$n=colSums(associations_dfm[[i]] %>% dfm_select(., pattern = rownames(associations_vocab_clean), selection = "keep",case_insensitive = F,valuetype = "fixed"))
#save cleaned associations vocabulary
saveRDS(associations_vocab_clean, file = "associations_vocab_clean.rds")
#prune associations_dfm matrix
associations_dfm[[i]] = dfm_select(associations_dfm[[i]],
                              pattern = rownames(associations_vocab_clean),
                              valuetype = "fixed")

associations_dfm[[i]] = dfm_subset(associations_dfm[[i]], docnames(associations_dfm[[i]]) %in% docnames(dfm[[i]]))
nrow(associations_dfm[[i]])
## [1] 60876
# 42 131 docs
#saveRDS(associations_dfm, file ="associations_mat_final.rds")
topics_docs=tweetsLDA@gamma
rownames(topics_docs)=docnames(dfm[[i]])
unigrams_vocab=featnames(dfm[[i]])
theta_match= topics_docs[rownames(topics_docs) %in% docnames(associations_dfm[[i]]),]
nrow(theta_match)#docs
## [1] 60876
ncol(theta_match)#topics
## [1] 5
nrow(associations_dfm[[i]])
## [1] 60876
ncol(associations_dfm[[i]])
## [1] 50028
AXT_counts  = t(theta_match) %*% associations_dfm[[i]] #topics x associations matrix of theta-weighted (expected) bigram counts
print(rowSums(AXT_counts))#don't sum to 1 yet...
## [1] 44762.38 44775.04 44772.65 44791.85 44782.08
AXT_counts = AXT_counts / rowSums(AXT_counts)# normalize rows so that, for each topic, the bigram relative frequencies sum to 1
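# Quick check that the row-normalisation worked as intended (optional):
print(rowSums(AXT_counts))# should now be equal (or very close) to 1 for every topic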



topics_unigrams= exp(tweetsLDA@beta)
colnames(topics_unigrams)=tweetsLDA@terms
rowSums(topics_unigrams )#good if close to 1
## [1] 1 1 1 1 1
p_associations = list()

#for topics:
#selected topics
topics_selection=c(1:nrow(topics_unigrams))
for (j in topics_selection) {
  p_associations[[j]] = associations_vocab_clean
  p_associations[[j]]$counts = AXT_counts[j, rownames(p_associations[[j]])]
  print("stage 1: eprob")
  p_associations[[j]]$e_prob =  topics_unigrams[j, p_associations[[j]]$from]* topics_unigrams[j, p_associations[[j]]$to]
  #for all bigrams observed in the corpus
  print("stage 2: eprobXcounts")
  p_associations[[j]]$eprobXcounts = p_associations[[j]]$e_prob * p_associations[[j]]$counts
  p_associations[[j]]$eprobXcounts = p_associations[[j]]$eprobXcounts/sum(p_associations[[j]]$eprobXcounts)
  print("stage 4:saving")
  #saveRDS(p_associations[[i]],paste("raw_association_topic_data_rds/topic", i, "_raw_associations.RDS", sep = ""))
  print(p_associations[[j]][order(p_associations[[j]]$eprobXcounts,decreasing = T),][1:10,])
}
## [1] "stage 1: eprob"
## [1] "stage 2: eprobXcounts"
## [1] "stage 4:saving"
##                           from         to    n       counts       e_prob
## full->time                full       time  699 0.0031063558 2.641982e-05
## business->help        business       help  114 0.0005110207 1.588158e-04
## business->need        business       need  144 0.0006437535 1.215408e-04
## home->office              home     office  255 0.0009942964 6.932092e-05
## online->business        online   business  217 0.0009687601 6.233408e-05
## lifestyle->business  lifestyle   business  213 0.0009500929 5.733182e-05
## internet->connection  internet connection 1345 0.0043344491 1.243652e-05
## help->need                help       need   90 0.0004027860 1.215322e-04
## free->video               free      video  215 0.0009632411 4.633633e-05
## today->world             today      world  158 0.0007064408 6.281999e-05
##                      eprobXcounts
## full->time             0.02105370
## business->help         0.02081995
## business->need         0.02007192
## home->office           0.01768184
## online->business       0.01549135
## lifestyle->business    0.01397363
## internet->connection   0.01382866
## help->need             0.01255778
## free->video            0.01144996
## today->world           0.01138468
## [1] "stage 1: eprob"
## [1] "stage 2: eprobXcounts"
## [1] "stage 4:saving"
##                       from        to   n       counts       e_prob eprobXcounts
## remote->developer   remote developer 746 0.0033480212 1.036141e-04  0.078673999
## remote->working     remote   working 356 0.0015908120 1.794641e-04  0.064747111
## full->time            full      time 699 0.0031263975 3.897142e-05  0.027632139
## business->start   business     start 784 0.0035021549 3.261780e-05  0.025906821
## learn->join          learn      join 247 0.0011022420 5.430114e-05  0.013574066
## remote->workforce   remote workforce 181 0.0008088408 6.182902e-05  0.011341734
## long->live            long      live 616 0.0027368658 1.684818e-05  0.010457567
## part->time            part      time 327 0.0014617909 2.779884e-05  0.009215863
## digital->world     digital     world 409 0.0018241296 2.173901e-05  0.008993310
## remote->office      remote    office  30 0.0001343109 2.897606e-04  0.008826221
## [1] "stage 1: eprob"
## [1] "stage 2: eprobXcounts"
## [1] "stage 4:saving"
##                            from       to   n       counts       e_prob
## management->business management business 498 0.0022263938 1.291854e-04
## remote->working          remote  working 356 0.0015917306 1.699096e-04
## business->start        business    start 784 0.0035048425 5.838376e-05
## remote->team             remote     team 179 0.0008001553 2.025557e-04
## business->need         business     need 144 0.0006435043 1.960735e-04
## small->business           small business 430 0.0019263390 4.666811e-05
## online->business         online business 217 0.0009699626 6.938914e-05
## just->need                 just     need 148 0.0006571279 9.143940e-05
## business->build        business    build 361 0.0016028522 3.487186e-05
## freedom->give           freedom     give 331 0.0014689882 3.734535e-05
##                      eprobXcounts
## management->business   0.06355299
## remote->working        0.05975964
## business->start        0.04521484
## remote->team           0.03581287
## business->need         0.02787988
## small->business        0.01986431
## online->business       0.01487192
## just->need             0.01327712
## business->build        0.01235063
## freedom->give          0.01212203
## [1] "stage 1: eprob"
## [1] "stage 2: eprobXcounts"
## [1] "stage 4:saving"
##                          from         to    n       counts       e_prob
## digital->world        digital      world  409 0.0018287919 1.308817e-04
## world->travel           world     travel  232 0.0010175060 1.374097e-04
## home->office             home     office  255 0.0009912137 1.180991e-04
## world->company          world    company   96 0.0004285642 2.502649e-04
## internet->connection internet connection 1345 0.0043945268 2.119476e-05
## many->people             many     people  182 0.0008121853 8.113873e-05
## people->want           people       want  127 0.0005673483 9.416670e-05
## learn->join             learn       join  247 0.0011026048 4.470186e-05
## company->join         company       join  147 0.0006512780 5.473726e-05
## full->time               full       time  699 0.0031490202 1.100553e-05
##                      eprobXcounts
## digital->world        0.056414528
## world->travel         0.032953546
## home->office          0.027590631
## world->company        0.025279219
## internet->connection  0.021952727
## many->people          0.015532136
## people->want          0.012592007
## learn->join           0.011616981
## company->join         0.008402282
## full->time            0.008168347
## [1] "stage 1: eprob"
## [1] "stage 2: eprobXcounts"
## [1] "stage 4:saving"
##                       from        to    n       counts       e_prob
## business->start   business     start  784 0.0035106167 7.162173e-05
## remote->working     remote   working  356 0.0015872120 1.333587e-04
## home->office          home    office  255 0.0009913034 1.293575e-04
## today->world         today     world  158 0.0007051043 1.302334e-04
## remote->team        remote      team  179 0.0007991811 9.285952e-05
## part->time            part      time  327 0.0014628260 4.390414e-05
## team->hire            team      hire 1114 0.0050463074 1.175657e-05
## full->time            full      time  699 0.0031188024 1.830996e-05
## business->build   business     build  361 0.0016038144 3.422970e-05
## remote->developer   remote developer  746 0.0033760541 1.445099e-05
##                   eprobXcounts
## business->start     0.05536354
## remote->working     0.04660707
## home->office        0.02823539
## today->world        0.02021954
## remote->team        0.01634057
## part->time          0.01414144
## team->hire          0.01306321
## full->time          0.01257392
## business->build     0.01208795
## remote->developer   0.01074243
#### 6.4 PLOT (PERCENTILE FILTERED) TOPIC NETWORK ####
#color map
library(tidyverse)
library(visNetwork)
map2color <- function(x, pal, limits = NULL) {
  if (is.null(limits))
    limits = range(x)
  pal[findInterval(x,
                   seq(limits[1], limits[2], length.out = length(pal) + 1),
                   all.inside = TRUE)]
}
mypal = colorRampPalette(c("gray", "darkred"))(100)
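# Toy call to illustrate what the map2color() helper does: values near the lower limit map to gray,
# values near the upper limit map to dark red (purely illustrative, not used below):
map2color(c(0, 0.25, 0.5, 1), mypal)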
allcolors = gplots::col2hex(grDevices::colors()[grep('gr(a|e)y', grDevices::colors(), invert = T)])

filter_and_plot_topic = function(temp_=temp[[j]], show_nav=T, show_id=T){ #builds the visNetwork plot for one (already filtered) topic; the topic index j used in the title is taken from the calling environment
  
  temp_$vertices[,c("color")]="white"
  ids=temp_$vertices[,c("name")]
  
  topicnet = visNetwork(
      main = paste0("Topic ", j), 
    height = "900px",
    width = "100%",
    nodes = data.frame(
      id = temp_$vertices$name
      ,
      label = temp_$vertices$name
      ,
      size =  (temp_$vertices$prob) ^ (1 / 4) * 20 / max((temp_$vertices$prob) ^
                                                           (1 / 4)) #log(1.00015+temp_$vertices$prob)*10000
      , color.background = temp_$vertices$color
      ,
      color.border= "black"
      ,
      title = paste(
        "<b>",
        temp_$vertices$name,
        "</b><br> Prob:",
        temp_$vertices$prob,
        sep = ""
      )
    ),
    edges = data.frame(
      from = temp_$edges$from
      ,
      to = temp_$edges$to
      ,
      width = (temp_$edges$eprobXcounts) ^ (1 / 4) * 10 / max((temp_$edges$eprobXcounts) ^ (1 /
                                                                                              4))
      , smooth.enabled = T
      ,smooth.roundness = 0.70
      ,color = map2color(temp_$edges$counts, mypal)
      ,
      title = paste(
        "<b>",
        temp_$edges$from
        ,
        "-",
        temp_$edges$to
        ,
        "</b><br> Prob:"
        ,
        temp_$edges$eprobXcounts
        ,

        sep = ""
      )
      #,group = temp_$vertices$community_walktrap
    )
  ) %>%
    visPhysics(solver = "forceAtlas2Based", forceAtlas2Based = list(gravitationalConstant = -20))%>%
    visEdges(arrows = "to") %>%
    visInteraction(
      keyboard = F,
      navigationButtons =  T,
      dragNodes = T,
      dragView = T,
      zoomView = T
    ) %>%
    visOptions(
      highlightNearest = list(enabled = T, hover = T),
      nodesIdSelection = T
    )
  return(topicnet)
}

net=list()
temp=list()
input=list()
input$n_percentile=0.95
input$e_percentile=0.95

for(j in c(1:nrow(topics_unigrams)))
{
  temp[[j]]=list()
  temp[[j]]$edges=p_associations[[j]]
  temp[[j]]$vertices=data.frame(name=colnames(topics_unigrams),prob=topics_unigrams[j,],stringsAsFactors = F)
  #filtration
  temp[[j]]$vertices = temp[[j]]$vertices[temp[[j]]$vertices$prob>= quantile(temp[[j]]$vertices$prob, probs = input$n_percentile), ]
  temp[[j]]$edges = temp[[j]]$edges[(
    temp[[j]]$edges$to %in% temp[[j]]$vertices$name &
      temp[[j]]$edges$from %in% temp[[j]]$vertices$name
  ) & ifelse(temp[[j]]$edges$to != temp[[j]]$edges$from, T, F), ]
  temp[[j]]$edges = temp[[j]]$edges[temp[[j]]$edges$eprobXcounts>= quantile(temp[[j]]$edges$eprobXcounts, probs = input$e_percentile), ]
  temp[[j]]$vertices  = temp[[j]]$vertices[temp[[j]]$vertices$name %in% c(temp[[j]]$edges$to,temp[[j]]$edges$from), ]
  net[[j]]=filter_and_plot_topic(temp[[j]])
}

net[[1]]
net[[2]]
net[[3]]
net[[4]]
net[[5]]
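# The interactive topic networks can also be exported as standalone HTML files with visNetwork::visSave().
# A minimal optional sketch (the file names are illustrative):
# for (j in seq_along(net)) {
#   visNetwork::visSave(net[[j]], file = paste0("topic_", j, "_network.html"), selfcontained = TRUE)
# }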

Combining structural topic models and causal argument extraction



  • “The Unfolding Structure of Arguments in Online Debates: The case of a No-Deal Brexit” (Santagiustina and Warglien, 2021)

https://www.researchgate.net/publication/350176064_The_Unfolding_Structure_of_Arguments_in_Online_Debates_The_case_of_a_No-Deal_Brexit

 

A tutorial by Carlo R. M. A. Santagiustina

carlo.santagiustina@unive.it