An introduction to text mining, NLP and text classification for management studies







A tutorial by Carlo R. M. A. Santagiustina

carlo.santagiustina@unive.it

 

What is R Markdown

This is an R Markdown document. R Markdown is a simple formatting syntax for authoring interactive HTML documents that combine R code, HTML and JavaScript. When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks within this tutorial. For more details on R Markdown see http://rmarkdown.rstudio.com.

Schedule for today

Text mining, NLP and topic modeling, in brief

Text mining: the automatic or semi-automatic extraction of information from large collections of unstructured texts written in a natural language (for example, using RegEx);

In general, text mining workflows have three steps:
1. Data collection and pre-processing: Gather input texts that are potentially relevant for the given task and pre-process them.
2. Natural language processing: Analyze the text, infer relevant syntactic, semantic and morphological information about words and their dependency relations;
3. Text categorization and modeling: Discover and aggregate recurrent patterns, infer latent semantic representations and distinctions (topics, arguments, narratives);

Why should management scholars use textual data in their research?



Because, in our daily life, we use natural language (and not numbers and math formulas) to:

Why should management scholars use topic models?

Why use R for text mining and topic modeling?


1. R is open-source: free, transparent and deployable across a wide range of operating systems;
2. R has a huge community: R and its text mining (TM) and NLP libraries are well documented and continuously improved, and there are plenty of online tutorials and examples;
3. R can be used to integrate multiple languages and related libraries (like spaCy) in a pipeline: you can embed and execute scripts, commands and functions written in other programming languages, such as Python, Java, C++, SQL, Julia, JavaScript and Fortran;
4. R is focused on (but not limited to) statistics, and many authors of topic models write their code in R (or wrap their C++ libraries in an R package): it is employed by researchers in almost all fields of the quantitative social sciences;

Topic modeling workflows and pipelines



R data structures and relations



Workflows and pipelines in R

require(magrittr)

Piping with %>% was introduced in R by the magrittr package, which is now part of the tidyverse suite. Many other packages, like purrr, have extended R's functional-programming toolset.

This section of the tutorial is a summarized version of https://www.nateday.me/Cats_with_Pipes.html

R piping rules

  • x %>% f gives the same output as f(x), where f is a function that accepts the data object as input;
  • x %>% f(y) gives the same output as f(x, y);
  • x %>% f %>% g %>% h gives the same output as h(g(f(x)));
  • x %>% f %>% g %>% h(parA=y) %>% i %>% j(parB=z) %>% k gives the same output as k(j(i(h(g(f(x)), parA=y)), parB=z));

The left-hand side is evaluated before being passed on to the right-hand side expression.
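For instance, a minimal sketch of these rules using the built-in mtcars dataset (not part of today's data):

require(magrittr)
# x %>% f is the same as f(x)
identical(mtcars %>% nrow(), nrow(mtcars))# TRUE
# x %>% f(y) is the same as f(x, y)
identical(mtcars %>% head(3), head(mtcars, 3))# TRUE
# x %>% f %>% g %>% h is the same as h(g(f(x)))
identical(mtcars %>% subset(cyl == 4) %>% nrow() %>% sqrt(),
          sqrt(nrow(subset(mtcars, cyl == 4))))# TRUE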

The argument placeholder

If you don't want to use an object x as the first input/parameter of a function when piping, you can choose which input/parameter x is sent to by using the . (dot) placeholder, as follows:

  • x %>% f(y, .) is equivalent to f(y, x)
  • x %>% f(y, z = .) is equivalent to f(y, z = x)

Re-using the placeholder for attributes

It is straightforward to use the placeholder several times in a right-hand side expression. However, when the placeholder only appears in nested expressions, magrittr will still apply the first-argument rule, because in most cases this results in cleaner code.

x %>% f(y = nrow(.), z = ncol(.)) is equivalent to f(x, y = nrow(x), z = ncol(x))

The behavior can be overruled by enclosing the right-hand side in braces:

x %>% {f(y = nrow(.), z = ncol(.))} is equivalent to f(y = nrow(x), z = ncol(x))
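A minimal sketch of the placeholder rules, again using built-in data purely for illustration:

require(magrittr)
# x %>% f(y, .) is equivalent to f(y, x)
identical(4 %>% seq(1, .), seq(1, 4))# TRUE
# placeholder only inside a nested expression: the first-argument rule still applies
identical(mtcars %>% head(nrow(.) / 8), head(mtcars, nrow(mtcars) / 8))# TRUE
# braces overrule the first-argument rule
mtcars %>% {paste(nrow(.), "rows and", ncol(.), "columns")}# "32 rows and 11 columns"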

Which R packages will we use today?

# creator        Carlo R. M. A. Santagiustina  
# creation date  29-01-2022
# language       R   
# R version      4.0.3 (2020-10-10)

####  0 KNITR OPTIONS ####
knitr::opts_chunk$set(warning = FALSE, message = FALSE, echo = FALSE,fig.width=10, fig.height=10) 

##### 1.1 #INSTALL AND LOAD LIBRARIES ####
pkgTest <- function(x) {
  if (!require(x, character.only = TRUE)) {
    install.packages(x, dep = TRUE, repos = "https://cloud.r-project.org/")
    if (!require(x, character.only = TRUE)) stop("Package not found")
  }
}
pkgTest("remotes")# for installing libraries from github
pkgTest("knitr")# for printing HTM tables
pkgTest("tidyverse")# for data wrangling
pkgTest("quanteda")# for text mining 
pkgTest("spacyr")# for NLP
pkgTest("topicmodels")# for topic model estimation
pkgTest("seededlda")# for seededLDA estimation
pkgTest("igraph")# for constructing and analysing graphs
pkgTest("visNetwork")# for plotting interactive graphs
pkgTest("corpus")#for affect WordNet-Affect lexicon  (extension of WordNet, see )
pkgTest("tidytext")#for formatting topicmodels outputs
#pkgTest(udpipe)# other library for NLP
remotes::install_github("richfitz/remoji")
pkgTest("remoji")# for emoji extraction
pkgTest("reshape2")# for reshaping sededLDA outputs

The dataset: work from anywhere discussions on Twitter

The dataset that will be used in today's applications consists of 67 792 tweets (published by 40 608 users) containing the expression "work[ -]?from[ -]?anywhere", together with the related metadata and user profile information.

These tweets were published between January 2019 and July 2021.
They were downloaded from the Twitter Academic API v2 using R. For more details on this API see: https://developer.twitter.com/en/products/twitter-api/academic-research. The dataset was preprocessed to remove duplicates and tweets unrelated to the "work from anywhere" issue. Dataset import in R:
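The chunk that imports the data is not shown in this document; a minimal sketch of what it might look like is given below (the file names are hypothetical, the object structure is the one used in the rest of the tutorial).

data = list(tweets = NULL, users = NULL)
data$tweets = readRDS("../data/wfa_tweets.RDS")# hypothetical file name: tweets and their metadata
data$users = readRDS("../data/wfa_users.RDS")# hypothetical file name: user profile info
summary(data$tweets)
head(data$tweets) %>% kable()
summary(data$users)
head(data$users) %>% kable()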

##      text            author_id         conversation_id      created_at        
##  Length:67792       Length:67792       Length:67792       Min.   :2019-01-01  
##  Class :character   Class :character   Class :character   1st Qu.:2020-03-18  
##  Mode  :character   Mode  :character   Mode  :character   Median :2020-09-02  
##                                                           Mean   :2020-07-23  
##                                                           3rd Qu.:2021-02-10  
##                                                           Max.   :2021-06-30  
##  in_reply_to_user_id possibly_sensitive      id               source         
##  Length:67792        Mode :logical      Length:67792       Length:67792      
##  Class :character    FALSE:67396        Class :character   Class :character  
##  Mode  :character    TRUE :396          Mode  :character   Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##  retweet_count       reply_count         like_count         quote_count      
##  Min.   :   0.000   Min.   :  0.0000   Min.   :    0.000   Min.   :  0.0000  
##  1st Qu.:   0.000   1st Qu.:  0.0000   1st Qu.:    0.000   1st Qu.:  0.0000  
##  Median :   0.000   Median :  0.0000   Median :    0.000   Median :  0.0000  
##  Mean   :   1.121   Mean   :  0.3821   Mean   :    4.882   Mean   :  0.1406  
##  3rd Qu.:   0.000   3rd Qu.:  0.0000   3rd Qu.:    1.000   3rd Qu.:  0.0000  
##  Max.   :5119.000   Max.   :873.0000   Max.   :11684.000   Max.   :879.0000
|text |author_id |conversation_id |created_at |in_reply_to_user_id |possibly_sensitive |id |source |retweet_count |reply_count |like_count |quote_count |
|:---|:---|:---|:---|:---|:---|:---|:---|---:|---:|---:|---:|
|#digitaltransformation Top 10 companies that offer work-from-anywhere jobs - WKMG News 6 & ClickOrlando https://t.co/Mg0XtnflUB https://t.co/3fLz3jnKq9 |1049585267772968960 |1080012837211291648 |2019-01-01 |NA |FALSE |1080012837211291648 |IFTTT |0 |0 |0 |0 |
|40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry #International https://t.co/EMFWTHzZEj |107631340 |1080010814562672640 |2019-01-01 |NA |FALSE |1080010814562672640 |Linkgage |0 |0 |0 |0 |
|40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry #International #link2 https://t.co/EMFWTHzZEj |107631340 |1079979313833934855 |2019-01-01 |NA |FALSE |1079979313833934855 |Linkgage |0 |0 |0 |0 |
|Top 10 companies that offer work-from-anywhere jobs https://t.co/uHzor6qahf |2214623612 |1079922194644385793 |2019-01-01 |NA |FALSE |1079922194644385793 |Facebook |0 |0 |0 |0 |
|40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry #International #link1 https://t.co/EMFWTHzZEj |107631340 |1079918941319901184 |2019-01-01 |NA |FALSE |1079918941319901184 |Linkgage |0 |0 |1 |0 |
|Top 10 companies that offer work-from-anywhere jobs https://t.co/LOwL9En8VH |80203918 |1079914035754291200 |2019-01-01 |NA |FALSE |1079914035754291200 |Facebook |0 |0 |0 |0 |
##      url                 id                name            verified      
##  Length:40608       Length:40608       Length:40608       Mode :logical  
##  Class :character   Class :character   Class :character   FALSE:38728    
##  Mode  :character   Mode  :character   Mode  :character   TRUE :1880     
##                                                                          
##                                                                          
##                                                                          
##  protected         username         description        followers_count   
##  Mode :logical   Length:40608       Length:40608       Min.   :       0  
##  FALSE:40608     Class :character   Class :character   1st Qu.:     107  
##                  Mode  :character   Mode  :character   Median :     514  
##                                                        Mean   :   17214  
##                                                        3rd Qu.:    2106  
##                                                        Max.   :26099432  
##  following_count   tweet_count       listed_count        created_at        
##  Min.   :     0   Min.   :      1   Min.   :     0.0   Min.   :2006-03-21  
##  1st Qu.:   155   1st Qu.:   1007   1st Qu.:     0.0   1st Qu.:2009-11-11  
##  Median :   512   Median :   4570   Median :     8.0   Median :2012-07-21  
##  Mean   :  1658   Mean   :  23484   Mean   :   160.2   Mean   :2013-07-16  
##  3rd Qu.:  1387   3rd Qu.:  17754   3rd Qu.:    57.0   3rd Qu.:2017-01-27  
##  Max.   :755319   Max.   :3976439   Max.   :130184.0   Max.   :2021-06-28  
##    location        
##  Length:40608      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
|url |id |name |verified |protected |username |description |followers_count |following_count |tweet_count |listed_count |created_at |location |
|:---|:---|:---|:---|:---|:---|:---|---:|---:|---:|---:|:---|:---|
|https://t.co/L0FKd02CBX |60433092 |Morissa (Dr. Rissy) Schwartz |TRUE |FALSE |MorissaSchwartz |All my puns are intended! 😆 Owner of top-rated marketing company https://t.co/3xpIccgnoi - @forbes 30 Under 30 🤩 - @entrepreneur Contributor -📚 https://t.co/8XIkYYnjSh CEO |21180 |8958 |74574 |1632 |2009-07-27 |New Jersey, USA |
|https://t.co/i48TLClhwK |813081982758518784 |A Better Lemonade Stand |FALSE |FALSE |ABLSecom |A Better Lemonade Stand is an online business incubator. We help entrepreneurs build, launch, and grow their online business. |2108 |673 |3888 |62 |2016-12-25 |NA |
|NA |862922852063862784 |Mr. J 🏴 |FALSE |FALSE |jasze36 |If you are into smartphones, tech, superheroes, video games, movies, general geekery, football, and anything Malaysian, my tweets are for you. |447 |346 |35486 |6 |2017-05-12 |Malesia |
|https://t.co/6qjZ8PCpvM |91478624 |Forbes |TRUE |FALSE |Forbes |Official account of Forbes, the world’s leading voice for entrepreneurial success and free enterprise, and home to breaking news, business, money and more. |17331756 |5149 |347669 |61023 |2009-11-21 |New York, NY |
|https://t.co/wk9XBMxbNX |35700404 |💉💉😷💪🦠Heather Smith |TRUE |FALSE |HeatherSmithAU |🦄Hype Girl for Accounting Apps 🦄 Subscribe to the #AccounTech newsletter 👉https://t.co/oQhoMuqNwk Lead with kindness 👸🌞🦋 💐💃 https://t.co/lS92N5HjxV |10606 |5177 |92368 |434 |2009-04-27 |Brisbane Australia |
|https://t.co/GMdWAQKfMC |16350207 |Christine Hueber |FALSE |FALSE |ChristineHueber |Do you have your copy of my #FREE Report yet? Get “Top 10 LinkedIn #B2B #Success Secrets” to #hack #LinkedIn for better results now @ https://t.co/GMdWAQKfMC |8078 |7987 |116846 |367 |2008-09-18 |San Francisco, CA |

|text |author_id |conversation_id |created_at |in_reply_to_user_id |possibly_sensitive |id |source |retweet_count |reply_count |like_count |quote_count |clean_text |text_urls |text_hashtags |text_fintags |text_emoticones |
|:---|:---|:---|:---|:---|:---|:---|:---|---:|---:|---:|---:|:---|:---|:---|:---|:---|
|#digitaltransformation Top 10 companies that offer work-from-anywhere jobs - WKMG News 6 & ClickOrlando https://t.co/Mg0XtnflUB https://t.co/3fLz3jnKq9 |1049585267772968960 |1080012837211291648 |2019-01-01 |NA |FALSE |1080012837211291648 |IFTTT |0 |0 |0 |0 |digitaltransformation Top 10 companies that offer work-from-anywhere jobs - WKMG News 6 and ClickOrlando |https://t.co/Mg0XtnflUB, https://t.co/3fLz3jnKq9 |#digitaltransformation | | |
|40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry #International https://t.co/EMFWTHzZEj |107631340 |1080010814562672640 |2019-01-01 |NA |FALSE |1080010814562672640 |Linkgage |0 |0 |0 |0 |40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry International |https://t.co/EMFWTHzZEj |#International | | |
|40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry #International #link2 https://t.co/EMFWTHzZEj |107631340 |1079979313833934855 |2019-01-01 |NA |FALSE |1079979313833934855 |Linkgage |0 |0 |0 |0 |40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry International link2 |https://t.co/EMFWTHzZEj |#International, #link2 | | |
|Top 10 companies that offer work-from-anywhere jobs https://t.co/uHzor6qahf |2214623612 |1079922194644385793 |2019-01-01 |NA |FALSE |1079922194644385793 |Facebook |0 |0 |0 |0 |Top 10 companies that offer work-from-anywhere jobs |https://t.co/uHzor6qahf | | | |
|40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry #International #link1 https://t.co/EMFWTHzZEj |107631340 |1079918941319901184 |2019-01-01 |NA |FALSE |1079918941319901184 |Linkgage |0 |0 |1 |0 |40 Best International Work from Home Jobs (Work from Anywhere in the World Online) - MoneyPantry International link1 |https://t.co/EMFWTHzZEj |#International, #link1 | | |
|Top 10 companies that offer work-from-anywhere jobs https://t.co/LOwL9En8VH |80203918 |1079914035754291200 |2019-01-01 |NA |FALSE |1079914035754291200 |Facebook |0 |0 |0 |0 |Top 10 companies that offer work-from-anywhere jobs |https://t.co/LOwL9En8VH | | | |
|url |id |name |verified |protected |username |description |followers_count |following_count |tweet_count |listed_count |created_at |location |clean_description |description_urls |description_hashtags |description_fintags |description_emoticones |
|:---|:---|:---|:---|:---|:---|:---|---:|---:|---:|---:|:---|:---|:---|:---|:---|:---|:---|
|https://t.co/L0FKd02CBX |60433092 |Morissa (Dr. Rissy) Schwartz |TRUE |FALSE |MorissaSchwartz |All my puns are intended! 😆 Owner of top-rated marketing company https://t.co/3xpIccgnoi - @forbes 30 Under 30 🤩 - @entrepreneur Contributor -📚 https://t.co/8XIkYYnjSh CEO |21180 |8958 |74574 |1632 |2009-07-27 |New Jersey, USA |All my puns are intended! Owner of top-rated marketing company - @forbes 30 Under 30 - @entrepreneur Contributor - CEO |https://t.co/3xpIccgnoi, https://t.co/8XIkYYnjSh | | | |
|https://t.co/i48TLClhwK |813081982758518784 |A Better Lemonade Stand |FALSE |FALSE |ABLSecom |A Better Lemonade Stand is an online business incubator. We help entrepreneurs build, launch, and grow their online business. |2108 |673 |3888 |62 |2016-12-25 |NA |A Better Lemonade Stand is an online business incubator. We help entrepreneurs build, launch, and grow their online business. | | | | |
|NA |862922852063862784 |Mr. J 🏴 |FALSE |FALSE |jasze36 |If you are into smartphones, tech, superheroes, video games, movies, general geekery, football, and anything Malaysian, my tweets are for you. |447 |346 |35486 |6 |2017-05-12 |Malesia |If you are into smartphones, tech, superheroes, video games, movies, general geekery, football, and anything Malaysian, my tweets are for you. | | | | |
|https://t.co/6qjZ8PCpvM |91478624 |Forbes |TRUE |FALSE |Forbes |Official account of Forbes, the world’s leading voice for entrepreneurial success and free enterprise, and home to breaking news, business, money and more. |17331756 |5149 |347669 |61023 |2009-11-21 |New York, NY |Official account of Forbes, the world s leading voice for entrepreneurial success and free enterprise, and home to breaking news, business, money and more. | | | | |
|https://t.co/wk9XBMxbNX |35700404 |💉💉😷💪🦠Heather Smith |TRUE |FALSE |HeatherSmithAU |🦄Hype Girl for Accounting Apps 🦄 Subscribe to the #AccounTech newsletter 👉https://t.co/oQhoMuqNwk Lead with kindness 👸🌞🦋 💐💃 https://t.co/lS92N5HjxV |10606 |5177 |92368 |434 |2009-04-27 |Brisbane Australia |Hype Girl for Accounting Apps Subscribe to the AccounTech newsletter Lead with kindness |https://t.co/oQhoMuqNwk, https://t.co/lS92N5HjxV |#AccounTech | | |
|https://t.co/GMdWAQKfMC |16350207 |Christine Hueber |FALSE |FALSE |ChristineHueber |Do you have your copy of my #FREE Report yet? Get “Top 10 LinkedIn #B2B #Success Secrets” to #hack #LinkedIn for better results now @ https://t.co/GMdWAQKfMC |8078 |7987 |116846 |367 |2008-09-18 |San Francisco, CA |Do you have your copy of my FREE Report yet? Get “Top 10 LinkedIn B2B Success Secrets” to hack LinkedIn for better results now @ |https://t.co/GMdWAQKfMC |#FREE , #B2B , #Success , #hack , #LinkedIn | | |

The document-feature matrix

A document-feature matrix (DFM) is a matrix that describes the frequency of features (e.g., terms) that occur in a collection of documents (here tweets).
In a document-feature matrix, rows correspond to documents in the collection and columns correspond to features.
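Before building the DFM of the Twitter data, here is a minimal toy sketch (two invented documents), just to show the structure of a DFM:

require(quanteda)
require(magrittr)
toy_texts = c(d1 = "work from anywhere is the future of work",
              d2 = "remote work lets employees work from anywhere")
toy_dfm = toy_texts %>%
  quanteda::tokens() %>%
  quanteda::dfm()
toy_dfm# rows = the two toy documents, columns = features, cell (i,j) = count of feature j in document i (e.g. "work" has count 2 in both)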

######################################################
#### 2 FROM STRINGS TO A DOCUMENT-FEATURE MATRIX  ####
######################################################
#[OPTION-1]
 #data$tweets$clean_text = gsub(pattern = "work[ -]?from[ -]?anywhere", data$tweets$clean_text, replacement = " ", ignore.case = T, perl = T)# run this line to remove the query terms directly from the strings

rownames(data$tweets)=data$tweets$id
rownames(data$users)=data$users$id
#### 2.1 BUILD THE CORPUS
  corpus=list(tweets=NULL, users=NULL)
  corpus$tweets = quanteda::corpus(data$tweets,text_field = "clean_text",docid_field ="id")
  corpus$users = quanteda::corpus(data$users,text_field = "clean_description",docid_field ="id")
#### 2.2 TOKENIZE DOCUMENTS IN CORPUS #####
  tokens=list(tweets=NULL, users=NULL)
tokens$tweets= corpus$tweets %>% 
  quanteda::tokens(.,
                   remove_punct = TRUE,
                   remove_symbols = TRUE,
                   remove_separators = TRUE,
                   remove_numbers = TRUE,
                   split_hyphens =FALSE,
                   verbose=TRUE
  ) 
tokens$users= corpus$users %>% 
  quanteda::tokens(.,
                   remove_punct = TRUE,
                   remove_symbols = TRUE,
                   remove_separators = TRUE,
                   remove_numbers = TRUE,
                   split_hyphens =FALSE,
                   verbose=TRUE
  ) 

#### 2.3 BUILD DOCUMENT FEATURE MATRIX #####
dfm=list(tweets=NULL, users=NULL)
dfm$tweets = tokens$tweets %>%
  quanteda::dfm(.,
                tolower = TRUE,
                stem = FALSE,
                #               select = NULL,
                #                remove = NULL,
                #                dictionary = NULL,
                #                thesaurus = NULL,
                #                case_insensitive = TRUE,
                verbose = TRUE
  )
print("Initial tweets DFM")
## [1] "Initial tweets DFM"
head(dfm$tweets)
## Document-feature matrix of: 6 documents, 52,013 features (100.0% sparse) and 15 docvars.
##                      features
## docs                  digitaltransformation top companies that offer
##   1080012837211291648                     1   1         1    1     1
##   1080010814562672640                     0   0         0    0     0
##   1079979313833934855                     0   0         0    0     0
##   1079922194644385793                     0   1         1    1     1
##   1079918941319901184                     0   0         0    0     0
##   1079914035754291200                     0   1         1    1     1
##                      features
## docs                  work-from-anywhere jobs wkmg news and
##   1080012837211291648                  1    1    1    1   1
##   1080010814562672640                  0    1    0    0   0
##   1079979313833934855                  0    1    0    0   0
##   1079922194644385793                  1    1    0    0   0
##   1079918941319901184                  0    1    0    0   0
##   1079914035754291200                  1    1    0    0   0
## [ reached max_nfeat ... 52,003 more features ]
print("Top Features in the initial tweets DFM")
## [1] "Top Features in the initial tweets DFM"
quanteda::topfeatures(dfm$tweets,50) %>% kable(., col.names = NULL)# top 50 features before stopwords removal
work 73793
from 71515
anywhere 61396
to 61052
the 55263
and 42495
a 34809
you 26839
in 24297
can 21030
is 20633
of 20627
for 20188
i 18063
with 16271
your 15161
that 11742
it 11447
on 10944
are 10270
this 9750
be 9450
more 8647
remote 7977
we 7959
how 7906
our 7719
world 7665
as 7534
have 7394
work-from-anywhere 7126
at 7086
home 6955
my 6893
business 6649
new 6541
will 5835
s 5780
an 5735
employees 5576
now 5480
or 5436
time 5413
if 5362
people 4943
do 4800
but 4724
working 4685
here 4642
office 4519
dfm$users = tokens$users %>%
  quanteda::dfm(.,
                tolower = TRUE,
                stem = FALSE,
                #               select = NULL,
                #                remove = NULL,
                #                dictionary = NULL,
                #                thesaurus = NULL,
                #                case_insensitive = TRUE,
                verbose = TRUE
  )
print("Initial users DFM")
## [1] "Initial users DFM"
head(dfm$users)
## Document-feature matrix of: 6 documents, 59,828 features (100.0% sparse) and 16 docvars.
##                     features
## docs                 all my puns are intended owner of top-rated marketing
##   60433092             1  1    1   1        1     1  1         1         1
##   813081982758518784   0  0    0   0        0     0  0         0         0
##   862922852063862784   0  1    0   2        0     0  0         0         0
##   91478624             0  0    0   0        0     0  1         0         0
##   35700404             0  0    0   0        0     0  0         0         0
##   16350207             0  1    0   0        0     0  1         0         0
##                     features
## docs                 company
##   60433092                 1
##   813081982758518784       0
##   862922852063862784       0
##   91478624                 0
##   35700404                 0
##   16350207                 0
## [ reached max_nfeat ... 59,818 more features ]
print("Top Features in the initial users DFM")
## [1] "Top Features in the initial users DFM"
quanteda::topfeatures(dfm$users,50) %>% kable(., col.names = NULL)# top 50 features before stopwords removal
and 15903
the 11501
of 9955
to 8549
a 7804
in 6575
for 6552
i 4455
my 4290
is 3957
na 3732
with 3373
business 3306
on 3060
are 3050
at 2853
your 2816
you 2798
it 2529
we 2515
all 1923
own 1900
marketing 1844
digital 1795
technology 1766
by 1655
that 1645
from 1587
tech 1559
news 1556
solutions 1498
about 1493
life 1394
work 1378
our 1365
services 1351
an 1343
not 1328
love 1236
world 1232
people 1218
more 1196
help 1186
founder 1145
views 1122
s 1069
me 1050
tweets 1041
cloud 1008
be 989
#### 2.4 FILTERING THE DOCUMENT FEATURE MATRIX #####
unigrams_vocab=list(tweets=NULL, users=NULL)
for(i in c("tweets","users")){
#### 2.4.1 REMOVE STOPWORDS FROM DFM #####
if(i=="users"){
  mystopwords=c(quanteda::stopwords(language = "en"),"s","$","t", "can")
}else{
  #[OPTION-2]
  mystopwords=c(quanteda::stopwords(language = "en"),"s","$","t", "can","work","from","anywhere","work-from-anywhere","workfromanywhere")#added query terms as stopwords (one could do better by removing them from the strings before building the corpus. See [OPTION-1])
}


dfm[[i]] = dfm_remove(dfm[[i]], pattern = mystopwords, valuetype = "fixed")
  
#### 2.4.2 BUILD INITIAL UNIGRAMS DICTIONARY #####

unigrams_vocab[[i]] = featnames(dfm[[i]])

#### 2.4.3 IDENTIFY RARE WORDS IN DFM #####
threshold=round(nrow(dfm[[i]])/1000,digits = 0)# one every 1000 documents
removed_rare=colSums(dfm[[i]])[colSums(dfm[[i]]>0)<threshold]#remove all tokens that do not appear in at least 1 out of every 1000 documents 
names(removed_rare)=featnames(dfm[[i]])[colSums(dfm[[i]]>0)<threshold]
removed_rare=sort(removed_rare,decreasing = T)
#removed_rare %>% kable()
#table(colSums(dfm[[i]]>0)<threshold) #tabulate N features above (TRUE) and below (FALSE) threshold

#### 2.4.4 IDENTIFY SHORT WORDS (THREE CHARACTERS OR FEWER) IN DFM #####
removed_short=colSums(dfm[[i]])[nchar(featnames(dfm[[i]]))<=3]
names(removed_short)=featnames(dfm[[i]])[nchar(featnames(dfm[[i]]))<=3]
removed_short=sort(removed_short,decreasing = T)
#removed_short %>% kable()
table(nchar(featnames(dfm[[i]]))<=3)


#### 2.4.5 REMOVE RARE/SHORT/NUMBER TOKENS FROM DFM #####
dfm[[i]]=dfm_select(
  dfm[[i]],
  pattern = unique(c(names(removed_rare),names(removed_short))),
  selection = "remove",
  valuetype = "fixed",
  case_insensitive = FALSE,
  verbose = quanteda_options("verbose")
)

#### 2.4.6 KEEP ONLY DOCS WITH AT LEAST THREE TOKENS #####
dfm[[i]] = dfm_subset(dfm[[i]], ntoken(dfm[[i]]) >= 3)#keep only docs with at least 3 tokens

unigrams_vocab[[i]]=featnames(dfm[[i]])

#table(rowSums(dfm[[i]])>=1)#to check if all docs have at least one term 
#table(colSums(dfm[[i]])>=1)#to check if all terms (i.e., features) appear at least once
docvars(dfm[[i]])=data[[i]][docnames(dfm[[i]]),]
}
#Top features in tweets DFM
print("Final tweets DFM")
## [1] "Final tweets DFM"
head(dfm$tweets)
## Document-feature matrix of: 6 documents, 1,780 features (99.7% sparse) and 17 docvars.
##                      features
## docs                  digitaltransformation companies offer jobs news best
##   1080012837211291648                     1         1     1    1    1    0
##   1080010814562672640                     0         0     0    1    0    1
##   1079979313833934855                     0         0     0    1    0    1
##   1079922194644385793                     0         1     1    1    0    0
##   1079918941319901184                     0         0     0    1    0    1
##   1079914035754291200                     0         1     1    1    0    0
##                      features
## docs                  international home world online
##   1080012837211291648             0    0     0      0
##   1080010814562672640             2    1     1      1
##   1079979313833934855             2    1     1      1
##   1079922194644385793             0    0     0      0
##   1079918941319901184             2    1     1      1
##   1079914035754291200             0    0     0      0
## [ reached max_nfeat ... 1,770 more features ]
print("Top Features in the final tweets DFM")
## [1] "Top Features in the final tweets DFM"
topfeatures(dfm$tweets,50) %>% kable(., col.names = NULL)
remote 7895
world 7454
home 6611
business 6604
employees 5226
time 5092
people 4917
working 4666
office 4425
need 3663
want 3541
remotework 3530
learn 3277
just 3054
help 3053
make 2955
like 2949
team 2834
able 2702
today 2538
free 2533
join 2489
companies 2479
workfromhome 2471
best 2463
live 2421
freedom 2374
cloud 2295
online 2248
technology 2227
start 2218
company 2144
digital 2069
looking 2054
internet 2021
future 2014
travel 1991
love 1983
become 1940
many 1900
hiring 1888
jobs 1873
life 1837
read 1829
tech 1827
hours 1801
solutions 1764
pandemic 1761
find 1727
check 1715
print("Final users DFM")
## [1] "Final users DFM"
head(dfm$users)
## Document-feature matrix of: 6 documents, 1,312 features (99.4% sparse) and 18 docvars.
##                     features
## docs                 owner marketing company @forbes contributor better stand
##   60433092               1         1       1       1           1      0     0
##   813081982758518784     0         0       0       0           0      1     1
##   862922852063862784     0         0       0       0           0      0     0
##   91478624               0         0       0       0           0      0     0
##   35700404               0         0       0       0           0      0     0
##   16350207               0         0       0       0           0      1     0
##                     features
## docs                 online business help
##   60433092                0        0    0
##   813081982758518784      2        2    1
##   862922852063862784      0        0    0
##   91478624                0        1    0
##   35700404                0        0    0
##   16350207                0        0    0
## [ reached max_nfeat ... 1,302 more features ]
print("Top Features in the final users DFM")
## [1] "Top Features in the final users DFM"
topfeatures(dfm$users,50) %>% kable(., col.names = NULL)
#topfeatures(dfm$tweets,50,scheme = "docfreq")# number of docs in which the token is present (scheme = "docfreq")
#topfeatures(dfm$tweets,50) / sum(ntoken(dfm$tweets)) #relative freq. of token w.r.t. total number of tokens

docvars(dfm$tweets,field = "postCOVID")=docvars(dfm$tweets,field = "created_at")>=as.Date("2020-03-11")
docvars(dfm$tweets,field = "most_liked")=docvars(dfm$tweets,field = "like_count")> quantile(docvars(dfm$tweets,field = "like_count"), c(.5)) 
docvars(dfm$tweets,field = "most_discussed")=docvars(dfm$tweets,field = "reply_count")> quantile(docvars(dfm$tweets,field = "reply_count"), c(.5)) 
docvars(dfm$tweets,field = "most_retweeted")=docvars(dfm$tweets,field = "retweet_count")> quantile(docvars(dfm$tweets,field = "retweet_count"), c(.5)) 

print("postCOVID top tokens counts (tweets)")
## [1] "postCOVID top tokens counts (tweets)"
topfeatures(dfm$tweets,50,groups =  "postCOVID")  %>% kable(., col.names = NULL)
remote 2972
business 2097
world 1772
time 1714
working 1441
remotework 1355
want 1199
freedom 1180
team 1102
home 1078
hiring 1058
employees 1008
office 1005
people 993
freelance 955
love 866
start 843
make 827
workfromhome 826
need 822
travel 807
become 804
online 791
like 785
free 776
remotejobs 771
able 764
developer 748
just 740
technology 691
internet 691
today 676
looking 671
manager 668
hours 650
check 636
social 622
media 612
management 611
leading 601
world’s 599
x-team 597
life 589
brands 586
jobs 584
financial 577
help 572
boss 566
money 562
achieve 536
world 5682
home 5533
remote 4923
business 4507
employees 4218
people 3924
office 3420
time 3378
working 3225
need 2841
learn 2832
help 2481
want 2342
just 2314
remotework 2175
like 2164
live 2140
make 2128
companies 2023
join 2001
able 1938
best 1936
today 1862
future 1850
cloud 1797
free 1757
pandemic 1757
team 1732
digital 1709
solutions 1691
company 1669
workfromhome 1645
many 1577
tech 1548
technology 1536
read 1529
businesses 1493
online 1457
find 1390
long 1387
looking 1383
start 1375
workforce 1352
great 1341
security 1336
internet 1330
stay 1325
jobs 1289
still 1260
life 1248
print("(0.5 percentile) Most liked  top tokens counts (tweets)")
## [1] "(0.5 percentile) Most liked  top tokens counts (tweets)"
topfeatures(dfm$tweets,50,groups =  "most_liked") %>% kable(., col.names = NULL)
remote 4356
business 4323
world 4274
home 3789
employees 3233
time 2998
people 2583
working 2535
office 2217
need 2107
remotework 1968
want 1940
help 1905
free 1842
learn 1828
make 1765
start 1612
just 1559
online 1526
join 1518
today 1502
like 1473
freedom 1472
team 1460
workfromhome 1414
best 1408
able 1407
cloud 1382
live 1375
companies 1374
become 1348
technology 1339
hours 1339
internet 1295
company 1269
looking 1266
businesses 1229
read 1197
travel 1174
jobs 1132
love 1130
solution 1105
life 1094
solutions 1094
many 1086
digital 1069
boss 1068
check 1047
hiring 1044
future 1028
remote 3539
world 3180
home 2822
people 2334
business 2281
office 2208
working 2131
time 2094
employees 1993
want 1601
remotework 1562
need 1556
just 1495
like 1476
learn 1449
team 1374
able 1295
make 1190
help 1148
companies 1105
workfromhome 1057
best 1055
live 1046
today 1036
digital 1000
future 986
join 971
cloud 913
freedom 902
technology 888
company 875
great 868
love 853
hiring 844
tech 824
travel 817
many 814
looking 788
good 763
life 743
jobs 741
find 737
pandemic 737
also 731
internet 726
flexibility 725
online 722
long 713
free 691
productivity 689
print("(0.5 percentile) Most discussed top tokens counts (tweets)")
## [1] "(0.5 percentile) Most discussed top tokens counts (tweets)"
topfeatures(dfm$tweets,50,groups =  "most_discussed") %>% kable(., col.names = NULL)
remote 6763
world 6367
business 6086
home 5375
employees 4721
time 4289
working 3828
people 3685
office 3507
remotework 3307
learn 3062
need 2987
want 2846
help 2759
make 2491
team 2448
free 2359
workfromhome 2314
join 2299
today 2259
just 2203
like 2187
cloud 2161
best 2148
able 2123
companies 2077
freedom 2064
technology 2021
start 2020
online 1993
live 1939
digital 1804
become 1766
future 1746
company 1742
read 1720
looking 1718
solutions 1668
travel 1664
hiring 1660
love 1630
internet 1614
check 1596
jobs 1584
businesses 1569
find 1539
hours 1539
security 1538
tech 1525
pandemic 1516
home 1236
people 1232
remote 1132
world 1087
office 918
just 851
working 838
time 803
like 762
want 695
need 676
able 579
business 518
employees 505
live 482
make 464
many 410
internet 407
think 407
companies 402
company 402
also 387
team 386
good 378
move 356
going 356
love 353
really 348
life 339
even 337
looking 336
know 336
great 335
long 328
travel 327
still 326
best 315
much 312
back 312
freedom 310
tech 302
take 294
help 294
right 292
jobs 289
year 281
today 279
future 268
digital 265
hours 262
print("(0.5 percentile) Most retweeted top tokens counts (tweets)")
## [1] "(0.5 percentile) Most retweeted top tokens counts (tweets)"
topfeatures(dfm$tweets,50,groups =  "most_retweeted")%>% kable(., col.names = NULL)
world 5923
remote 5797
business 5511
home 5470
time 4291
employees 4134
people 3945
working 3681
office 3448
need 2972
want 2873
remotework 2581
just 2558
learn 2513
make 2453
help 2411
like 2399
able 2265
free 2092
today 2011
best 1978
live 1978
freedom 1971
start 1961
online 1960
workfromhome 1958
companies 1942
team 1892
join 1871
internet 1803
company 1752
technology 1740
cloud 1734
love 1720
travel 1697
become 1639
looking 1593
hours 1582
life 1568
many 1567
jobs 1495
future 1490
read 1480
digital 1477
businesses 1464
pandemic 1444
long 1422
tech 1401
solutions 1371
find 1310
remote 2098
world 1531
home 1141
business 1093
employees 1092
working 985
office 977
people 972
remotework 949
team 942
time 801
learn 764
hiring 697
need 691
want 668
help 642
join 618
digital 592
cloud 561
like 550
companies 537
today 527
future 524
workfromhome 513
make 502
just 496
technology 487
best 485
looking 461
live 443
free 441
able 437
tech 426
find 417
check 413
freedom 403
great 400
solutions 393
company 392
productivity 390
futureofwork 389
security 389
jobs 378
workforce 350
read 349
developer 348
secure 348
flexibility 341
many 333
remotejobs 332
print("Verified accounts top tokens counts (users)")
## [1] "Verified accounts top tokens counts (users)"
topfeatures(dfm$users,50,groups =  "verified") %>% kable(., col.names = NULL)
business 3062
marketing 1745
digital 1644
technology 1639
tech 1418
solutions 1416
services 1288
life 1269
work 1261
love 1160
news 1155
people 1117
help 1104
world 1063
views 1020
founder 993
tweets 953
cloud 947
software 890
travel 881
management 828
helping 823
media 812
social 789
entrepreneur 780
global 777
online 773
opinions 767
writer 757
consultant 751
manager 751
director 744
author 743
lover 730
data 724
design 697
make 684
businesses 682
husband 676
professional 672
company 659
best 658
sales 639
coach 635
security 624
father 622
time 621
community 617
follow 610
enthusiast 607
news 334
business 170
official 135
author 116
world 113
tech 108
digital 100
support 97
account 85
work 83
technology 79
follow 79
twitter 78
editor 78
global 76
people 76
founder 76
help 75
leading 68
india 66
media 65
views 62
speaker 59
latest 57
solutions 56
reporter 56
journalist 55
writer 52
future 51
best 50
tweets 49
services 47
host 47
opinions 46
every 46
marketing 44
cloud 42
businesses 42
social 41
customer 41
podcast 41
former 41
breaking 40
platform 40
travel 39
updates 39
information 39
data 39
co-founder 39
personal 39

Regular expressions (RegEx)

Regular expression (RegEx): a sequence of characters that defines a search pattern, used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation;


RegEx character classes

With a character class you can tell the regex engine to match only one out of several characters: simply place the characters you want to match between square brackets.
For example, if you need to match an a or an e, use the character class [ae].

Some pre-built character classes that can be used inside RegEx patterns:


[:punct:]: punctuation;
[:alpha:]: letters;
[:lower:]: lowercase letters;
[:upper:]: uppercase letters;
[:digit:]: digits;
[:xdigit:]: hex digits;
[:alnum:]: letters and numbers;
[:cntrl:]: control characters;
[:graph:]: letters, numbers, and punctuation;
[:print:]: letters, numbers, punctuation, and whitespace;
[:space:]: space characters (space, tab, newline, carriage return, form feed and vertical tab; roughly equivalent to \s);
[:blank:]: space and tab;

You can create your own character classes using []:


[a-z]: matches every character between a and z (in Unicode code point order).
[^abc]: matches anything except a, b, or c.
[\^\-]: matches ^ or - (the characters are escaped with a backslash).
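A minimal sketch of character classes in base R (the example strings are made up):

x = c("gray", "grey", "Work-From-Anywhere", "wfa2021")
grepl("gr[ae]y", x)# TRUE TRUE FALSE FALSE: matches "gray" or "grey"
grepl("[[:digit:]]", x)# FALSE FALSE FALSE TRUE: contains at least one digit
grepl("[A-Z]", x)# FALSE FALSE TRUE FALSE: contains an uppercase letter
gsub("[^[:alnum:] ]", " ", x)# replace everything that is not a letter, a number or a space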

RegEx quantifiers

Quantifiers specify the number of repetitions of the pattern to be matched:

  • *: matches at least 0 times;
  • +: matches at least 1 time;
  • ?: matches at most 1 time;
  • {n}: matches exactly n times;
  • {n,}: matches at least n times;
  • {n,m}: matches between n and m times;

RegEx position of pattern


Position of pattern identifiers specify the position conditions of the pattern to be matched:

  • ^: matches the start of the string;
  • $: matches the end of the string;
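A minimal sketch combining quantifiers and anchors in base R (the first pattern is the one used to collect the dataset; the example strings are made up):

x = c("work from anywhere", "work-from-anywhere", "workfromanywhere", "teamwork from anywhere else")
grepl("work[ -]?from[ -]?anywhere", x)# all TRUE: "[ -]?" matches a space or a hyphen 0 or 1 times
grepl("^work[ -]?from[ -]?anywhere$", x)# TRUE TRUE TRUE FALSE: anchored to the start and end of the string
grepl("o{2,}", c("good", "god"))# TRUE FALSE: at least two consecutive o's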

RegEx in R


* The kwic function (in the Quanteda library) can be used to search and filter strings by RegEx conditions, and output the context (neighborhood) of identified RegEx matches.
* The gregexpr function can be used to identify the boundaries of sub-strings that match RegEx patterns (also multi-group).

RegEx application

########################################
#### 3 REGEX AND REGEX DICTIONARIES ####
########################################

#### 3.1 KEYWORDS IN CONTEXT (KWIC) WITH QUANTEDA ####
#SIMPLE EXAMPLES
#tweets and descriptions containing like/dislike emojis (removed from clean_text)
print("Keyword in context: 👍")
## [1] "Keyword in context: \U0001f44d"
quanteda::kwic(data$tweets$text,"👍",window=5) %>% head() %>% kable()# UTF code \U0001f44d
docname from to pre keyword post pattern
text2377 53 53 get paid on our results 👍 😉 👍
text2479 19 19 as effective as hands-on healing 👍 👍
text2911 35 35 Click below and apply ! 👍 #witpghjobs #pghtech #womeninstem https://t.co/XyaGoc6GrO 👍
text3372 47 47 are a remote worker . 👍 https://t.co/XVi89vucZx 👍
text3534 33 33 Remote - work from anywhere 👍 Apply https://t.co/RcRnheL17D https://t.co/h6z8A1OHfn 👍
text3911 55 55 the top class work . 👍 👍 👍
print("Keyword in context: 👎")
## [1] "Keyword in context: \U0001f44e"
quanteda::kwic(data$tweets$text,"👎",window=5) %>% head() %>% kable()# UTF code \U0001f44e
docname from to pre keyword post pattern
text47324 38 38 is you , congratulations ! 👎 There are thousands of people 👎
print("Keyword in context: 🏔|🏝")
## [1] "Keyword in context: \U0001f3d4|\U0001f3dd"
quanteda::kwic(data$tweets$text,"🏔|🏝",window=5,valuetype="regex") %>% head() %>% kable()# UTF code  \U0001f3d4 OR UTF code \U0001f3dd
docname from to pre keyword post pattern
text202 39 39 the top of Colorado ! 🏔 https://t.co/djXypfBfHD 🏔|🏝
text351 22 22 : • Work from anywhere 🏝 • Choose your hours ⏱ 🏔|🏝
text670 5 5 Work from anywhere ! 🏔 🏕 🏖 🏜 🏞 🏘 🏔|🏝
text775 26 26 as you have a phone 🏝 📱 Choose your hours you 🏔|🏝
text2159 1 1 🏝 BOSS BABES WANTED 🏝 Tired 🏔|🏝
text2159 5 5 🏝 BOSS BABES WANTED 🏝 Tired of working long hours 🏔|🏝
#ADVANCED EXAMPLES
print("Keyword in context: work[- ]?from[- ]?anywhere[ -]?job")
## [1] "Keyword in context: work[- ]?from[- ]?anywhere[ -]?job"
quanteda::kwic(data$tweets$text,"work[- ]?from[- ]?anywhere[ -]?job",window=5,valuetype="regex") %>% head() %>% kable()
docname from to pre keyword post pattern
text14356 40 40 #remotework #remotejobs #jobs #jobsearch #workfromanywhere #workfromanywherejobs #fjblog https://t.co/nK8aC7vV1R work[- ]?from[- ]?anywhere[ -]?job
text14690 41 41 " @MSN #remotework #remotejobs #workfromanywhere #workfromanywherejobs #jobs https://t.co/Hzecirx0DZ work[- ]?from[- ]?anywhere[ -]?job
text14957 26 26 👉 https://t.co/pGeV1Fes5V @CNBCMakeIt #remotework #remotejobs #workfromanywherejobs #workfromanywhere #fjinthenews #hiring #hiringnow #worklifebalance work[- ]?from[- ]?anywhere[ -]?job
text15074 17 17 should try Visit https://t.co/u69eTjdscq #VoiceNOCTechnician #WorkFromAnywhereJobs #NetworkOperationsTechnician #VoiceServices #NocEngineers work[- ]?from[- ]?anywhere[ -]?job
text17671 23 23 from Anywhere ] https://t.co/uX66tc2H8o #HardwareIntegrationEngineer #WorkFromAnywhereJobs #DesignIntegrationEngineer #IntegrationEngineerJobs #SystemIntegration work[- ]?from[- ]?anywhere[ -]?job
text18547 27 27 mode Remote Jobs : https://t.co/ikCjnJdwuS #WorkFromAnywhereJobs #CiscoJobs #EmploymentOutlook #CiscoEmployment #Cisco @GE_Summit work[- ]?from[- ]?anywhere[ -]?job
print("Keyword in context: (?<!(do not |don't ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}")
## [1] "Keyword in context: (?<!(do not |don't ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}"
quanteda::kwic(data$tweets$text,"(?<!(do not |don't ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}",valuetype="regex",window=5) %>% head() %>% kable()
docname from to pre keyword post pattern
text16 16 16 anywhere as long as you like researching and fiddling with cutting-edge (?<!(do not |don’t ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}
text18 16 16 anywhere as long as you like researching and fiddling with cutting-edge (?<!(do not |don’t ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}
text26 20 20 freelance more so I feel like Im missing out on being (?<!(do not |don’t ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}
text39 16 16 all levels , full of Love , Light , Joy and (?<!(do not |don’t ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}
text44 27 27 , what places would you like to see ? https://t.co/swiL3mXmbX (?<!(do not |don’t ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}
text56 46 46 . You’ll be impressed just like the customer will . Thank (?<!(do not |don’t ))(lik[a-z]{1,}|lov[a-z]{1,}|enjoy[a-z]{1,}){1,}
#### 3.2 EXTRACTING EMOJI FROM UTF-8 STRINGS USING A REGEX DICTIONARY ####
print("Emoji dictionary")
## [1] "Emoji dictionary"
head(remoji:::dat_core) %>% kable()
emoji description
😄 smiling face with open mouth and smiling eyes
😃 smiling face with open mouth
😀 grinning face
😊 smiling face with smiling eyes
☺️ white smiling face
😉 winking face
dict_emoji=remoji:::dat_core$emoji
names(dict_emoji)=remoji:::dat_core$description

data$tweets=data$tweets %>% dplyr::mutate(emoji = regmatches(
  text,
  gregexpr(pattern = paste(dict_emoji,collapse = "|"), text, perl = TRUE)
)) 

data$users=data$users %>% dplyr::mutate(emoji = regmatches(
  description,
  gregexpr(pattern = paste(dict_emoji,collapse = "|"), description, perl = TRUE)
)) 

#### 3.3 IDENTIFY AND EXTRACT A CAUSAL ARGUMENT WITH A REGEX  ####
#THE FOLLOWING CODE SNIPPET WAS DEVELOPED FOR A PAPER BY SANTAGIUSTINA & WARGLIEN ABOUT THE NO-DEAL BREXIT (FORTHCOMING)
list_cause_verbs = c("cause","causes", "causing","caused",
  "determine","determines","determining","determined",
  "engender","engenders","engendered","engendering",
  "entail","entails","entailed","entailing",
  "provoke","provokes","provoked","provoking",
  "trigger","triggers","triggered","triggering"
)
print("Causal verbs:")
## [1] "Causal verbs:"
list_cause_verbs
##  [1] "cause"       "causes"      "causing"     "caused"      "determine"  
##  [6] "determines"  "determining" "determined"  "engender"    "engenders"  
## [11] "engendered"  "engendering" "entail"      "entails"     "entailed"   
## [16] "entailing"   "provoke"     "provokes"    "provoked"    "provoking"  
## [21] "trigger"     "triggers"    "triggered"   "triggering"
#build verbs regex
all_at_end=paste(
  c(list_cause_verbs),
  collapse  = "|"
)

#IDENTIFY CAUSAL STATEMENTS (NAIVE FORM)
cause_effect_regex = paste("(?<cause>[^:;,!]{2,})(?<rel_operator>",all_at_end,"){1}(?<effect>[^:;,!]{2,})",sep = "")  

example= quanteda::tokenize_sentence(data$tweets$text[61257])[[1]][1] #"I wonder if starlink will cause a wave of reverse migration?"

print(example) 
## [1] "I wonder if starlink will cause a wave of reverse migration?"
match=gregexpr(
  cause_effect_regex ,
  example,
  perl = T,
  ignore.case = T
)
print("Gregexpr output:")
## [1] "Gregexpr output:"
match
## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 60
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## attr(,"capture.start")
##      cause rel_operator effect
## [1,]     1           27     32
## attr(,"capture.length")
##      cause rel_operator effect
## [1,]    26            5     29
## attr(,"capture.names")
## [1] "cause"        "rel_operator" "effect"
#cause
print("RegEx matches CAUSE:")
## [1] "RegEx matches CAUSE:"
st = attr(match[[1]], "capture.start")[, "cause"]
len = attr(match[[1]], "capture.length")[, "cause"]
substring(example, st, st + len - 1)
## [1] "I wonder if starlink will "
print("RegEx matches OPERATOR (RELATION):")
## [1] "RegEx matches OPERATOR (RELATION):"
#rel_operator
st = attr(match[[1]], "capture.start")[, "rel_operator"]
len = attr(match[[1]], "capture.length")[, "rel_operator"]
substring(example, st, st + len - 1)
## [1] "cause"
print("RegEx matches EFFECT:")
## [1] "RegEx matches EFFECT:"
#effect
st = attr(match[[1]], "capture.start")[, "effect"]
len = attr(match[[1]], "capture.length")[, "effect"]
substring(example, st, st + len - 1)
## [1] " a wave of reverse migration?"

Feature co-occurrence matrix

A feature co-occurrence matrix (FCM) records the number of co-occurrences, within documents (here tweets), of pairs of features (e.g., terms). Each feature is identified by both a row and a column, so the matrix is symmetric, and the diagonal (often) contains the feature occurrence counts.
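A minimal toy sketch (two invented documents) showing the structure of an FCM:

require(quanteda)
require(magrittr)
toy_fcm = c("remote work is flexible work",
            "remote teams work from anywhere") %>%
  quanteda::tokens() %>%
  quanteda::fcm(context = "document")
toy_fcm# cell (i,j) = number of within-document co-occurrences of features i and j (quanteda fills the upper triangle by default)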

##### 4 FEATURES CO-OCCURRENCE MATRIX (FCM) AND NETWORK ####
topN=100 #parameter representing the number of top features considered
edge_filter_threshold=0.95 #parameter representing the percentile threshold filter applied to edges
#### 4.1 TWEETS FCM #### 
tweets_fcm= dfm$tweets %>% 
  quanteda::dfm_select(.,pattern= names(topfeatures(dfm$tweets,topN)) %>% as.character(),selection="keep",valuetype="fixed") %>%
  quanteda::fcm(.)

tweets_fcm_graph=tweets_fcm %>% igraph::graph_from_adjacency_matrix(.,weighted = T,diag = F,mode = "upper")
tweets_fcm_graph=list(edges=as.data.frame(cbind(igraph::as_edgelist(tweets_fcm_graph),E(tweets_fcm_graph)$weight)),
                           nodes=  data.frame(token=names(topfeatures(dfm$tweets,topN)),freq=topfeatures(dfm$tweets,topN)) )
#edge (percentile) filter

keep_edges=which(as.numeric(tweets_fcm_graph$edges[,3]) >= quantile(as.numeric(tweets_fcm_graph$edges[,3]),edge_filter_threshold))
visNetwork(
  height = "900px",
  width = "100%",
  nodes = data.frame(
    id = tweets_fcm_graph$nodes[,1]
    ,
    label = tweets_fcm_graph$nodes[,1]
    ,
    value =  as.integer(as.numeric(tweets_fcm_graph$nodes[,2]))^2
    , 
    color.background = "white"
    ,
    color.border= "black"
    ,
    shape = "box"
    #,group = temp_$vertices$community_walktrap
  ),
  edges = data.frame(
    from=tweets_fcm_graph$edges[keep_edges,1]
    ,
    to= tweets_fcm_graph$edges[keep_edges,2]
    ,
    
    width = as.integer(as.numeric(tweets_fcm_graph$edges[keep_edges,3])^(1/2))/5 #log(1.001+temp_$edges$weight)*1000
    , smooth.enabled = F
    #  ,smooth.roundness = 0.70
    ,color = list(color = "darkgray",
                  highlight = "darkred")
  )
) %>%  
  visNodes(scaling = list(label = list(enabled = T)))%>%
  visIgraphLayout(layout = "layout_nicely",physics = F,smooth = T) %>%
  visInteraction(
    keyboard = F,
    navigationButtons =  T,
    dragNodes = T,
    dragView = T,
    zoomView = T
  )  %>%  
  visOptions(
    highlightNearest = list(enabled = TRUE, hover = TRUE, degree = 1),
    nodesIdSelection = T
  ) 
# 4.2 EMOJI FCM
tweets_emoji_freq=data$tweets$emoji %>% unlist() %>% table() %>% sort(decreasing = T) %>% as.data.frame() 
#tweets_emoji_freq %>% View()

text_emojis=data$tweets$emoji
names(text_emojis)=data$tweets$id
text_emojis_fcm= text_emojis %>%
  quanteda::as.tokens(.) %>%
  quanteda::dfm(.,tolower = F) %>% 
  quanteda::dfm_select(.,pattern= tweets_emoji_freq[1:topN,1] %>% as.character(),selection="keep",valuetype="fixed") %>%
  quanteda::fcm(.)
  #head(text_emojis_fcm)

text_emojis_fcm_graph=text_emojis_fcm %>% igraph::graph_from_adjacency_matrix(.,weighted = T,diag = F,mode = "upper")
text_emojis_fcm_graph=list(edges=as.data.frame(cbind(igraph::as_edgelist(text_emojis_fcm_graph),E(text_emojis_fcm_graph)$weight)),
                           nodes= tweets_emoji_freq[1:topN,])
keep_edges=which(as.numeric(text_emojis_fcm_graph$edges[,3]) >= quantile(as.numeric(text_emojis_fcm_graph$edges[,3]),edge_filter_threshold))

visNetwork(
  height = "900px",
  width = "100%",
  nodes = data.frame(
    id = text_emojis_fcm_graph$nodes[,1]
    ,
    label = text_emojis_fcm_graph$nodes[,1]
    ,
    value =  as.integer(as.numeric(text_emojis_fcm_graph$nodes[,2]))^2
    , 
    color.background = "white"
    ,
    color.border= "black"
    ,
    shape = "box"
    #,group = temp_$vertices$community_walktrap
  ),
  edges = data.frame(
    from=text_emojis_fcm_graph$edges[keep_edges,1]
    ,
    to= text_emojis_fcm_graph$edges[keep_edges,2]
    ,
    
    width = as.integer(as.numeric(text_emojis_fcm_graph$edges[keep_edges,3])^(1/2))/5 #log(1.001+temp_$edges$weight)*1000
     , smooth.enabled = F
    #  ,smooth.roundness = 0.70
    ,color = list(color = "darkgray",
    highlight = "darkred")
    )
  ) %>%  
  visNodes(scaling = list(label = list(enabled = T)))%>%
  visIgraphLayout(layout = "layout_nicely",physics = F,smooth = T) %>%
  visInteraction(
    keyboard = F,
    navigationButtons =  T,
    dragNodes = T,
    dragView = T,
    zoomView = T
  )  %>%  
  visOptions(
    highlightNearest = list(enabled = TRUE, hover = TRUE, degree = 1),
    nodesIdSelection = T
  ) 

Natural Language Processing with Spacy(R)



Spacy website: https://spacy.io/

spacyR github page: https://github.com/quanteda/spacyr
We use spaCy's state-of-the-art transformer model for web data (en_core_web_trf), which is based on RoBERTa. More info at https://huggingface.co/roberta-base

- Paper on transformer models for NLP by Vaswani et al. (2017): https://arxiv.org/pdf/1706.03762.pdf
- RoBERTa paper by Liu et al. (2019): https://arxiv.org/pdf/1907.11692.pdf

Part-of-Speech tagging



Part-of-Speech tagging is the process of labeling each word in a text as corresponding to a particular part of speech (its grammatical role), based also on its context;

Dependency parsing



Dependency parsing is the process of extracting the dependency parse of a sentence to represent its grammatical structure. It defines the dependency relationships between head words and their dependents. The head of the sentence has no dependency and is called the root of the sentence.
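A minimal sketch with spacyR (the sentence is invented; the model is initialized in section 5.1 below):

spacyr::spacy_parse("Employees can work from anywhere.",
                    pos = TRUE, dependency = TRUE) %>%
  dplyr::select(token, pos, head_token_id, dep_rel)
# each token gets a part-of-speech tag, the id of its head token and its dependency relation;
# here the main verb ("work") should be the root of the sentence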

Named entity recognition



Spacy(R) basic named entity categories:

  • PERSON: People, including fictional;
  • NORP: Nationalities or religious or political groups;
  • FAC: Buildings, airports, highways, bridges, etc;
  • ORG: Companies, agencies, institutions, etc;
  • GPE: Countries, cities, states;
  • LOC: Non-GPE locations, mountain ranges, bodies of water;
  • PRODUCT: Objects, vehicles, foods, etc. (Not services.);
  • EVENT: Named hurricanes, battles, wars, sports events, etc.;
  • WORK_OF_ART: Titles of books, songs, etc.;
  • LAW: Named documents made into laws;
  • LANGUAGE: Any named language;


Spacy(R) extended categories:

  • DATE: Absolute or relative dates or periods;
  • TIME: Times smaller than a day;
  • PERCENT: Percentage, including “%”;
  • MONEY: Monetary values, including unit;
  • QUANTITY: Measurements, as of weight or distance;
  • ORDINAL: “first”, “second”, etc.;
  • CARDINAL: Numerals that do not fall under another type;
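A minimal sketch of entity extraction with spacyR (invented sentence; the model is initialized in the chunk below):

parsed_example = spacyr::spacy_parse("Microsoft said in March 2021 that employees can work from anywhere in the United States.",
                                     entity = TRUE)
spacyr::entity_extract(parsed_example)# named entities only (e.g., ORG, GPE)
spacyr::entity_extract(parsed_example, type = "all")# also extended categories (e.g., DATE)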
########################################
#### 5 NATURAL LANGUAGE PROCESSING  ####
########################################

#### 5.1 NLP WITH SPACY (ROBERTA TRANSFORMER MODEL) THROUGH SPACYR R LIBRARY ####
#Sys.setenv("OMP_NUM_THREADS"=10)# to set the number of threads used by spacy
#spacyr::spacy_install() #to install spacy through conda/miniconda in the local machine
#spacyr::spacy_install(lang_models = c("en_core_web_trf"))#to download the RoBERTa-based transformer model (https://huggingface.co/roberta-base)
spacyr::spacy_initialize(model = "en_core_web_trf")#Initialize the spaCy transformer model. Other spaCy models for English web data: "en_core_web_sm", "en_core_web_lg"

#EXAMPLE
ids=c(1080470620293550080)#ID(s) of the selected tweets

tweet_example=data$tweets$clean_text[which(data$tweets$id %in% ids)]#select tweet by ID

print(tweet_example)
## [1] "Learn how to manage 1000s of devices for your remote workforce. Join us on Jan 8 at 2 PM CET.   See how TeamViewer Tensor enhances your productivity by empowering all employees to work from anywhere with any device at any time in a secure and fast manner.  "
NLP_example=list(spacy_raw=NULL ,entities=NULL ,nounphrases=NULL)
NLP_example$spacy_raw= spacy_parse(
  tweet_example,
   dependency = TRUE,
   lemma = TRUE,
   pos = TRUE,
   entity = TRUE,
   nounphrase = TRUE,
   additional_attributes = c("tag_","ent_type_","norm_"))#nlp pipeleine (tokenize sentences, tokenize words, POS, DEP-REL, UNIV.POS, EE ...) 
NLP_example$spacy_raw %>% kable()
| doc_id | sentence_id | token_id | token | lemma | pos | head_token_id | dep_rel | entity | nounphrase | whitespace | tag_ | ent_type_ | norm_ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| text1 | 1 | 1 | Learn | learn | VERB | 1 | ROOT |  |  | TRUE | VB |  | learn |
| text1 | 1 | 2 | how | how | ADV | 4 | advmod |  |  | TRUE | WRB |  | how |
| text1 | 1 | 3 | to | to | PART | 4 | aux |  |  | TRUE | TO |  | to |
| text1 | 1 | 4 | manage | manage | VERB | 1 | xcomp |  |  | TRUE | VB |  | manage |
| text1 | 1 | 5 | 1000s | 1000 | NOUN | 4 | dobj | CARDINAL_B | beg_root | TRUE | NNS | CARDINAL | 1000s |
| text1 | 1 | 6 | of | of | ADP | 5 | prep |  |  | TRUE | IN |  | of |
| text1 | 1 | 7 | devices | device | NOUN | 6 | pobj |  | beg_root | TRUE | NNS |  | devices |
| text1 | 1 | 8 | for | for | ADP | 4 | prep |  |  | TRUE | IN |  | for |
| text1 | 1 | 9 | your | your | PRON | 11 | poss |  | beg | TRUE | PRP$ |  | your |
| text1 | 1 | 10 | remote | remote | ADJ | 11 | amod |  | mid | TRUE | JJ |  | remote |
| text1 | 1 | 11 | workforce | workforce | NOUN | 8 | pobj |  | end_root | FALSE | NN |  | workforce |
| text1 | 1 | 12 | . | . | PUNCT | 1 | punct |  |  | TRUE | . |  | . |
| text1 | 2 | 1 | Join | join | VERB | 1 | ROOT |  |  | TRUE | VB |  | join |
| text1 | 2 | 2 | us | we | PRON | 1 | dobj |  | beg_root | TRUE | PRP |  | us |
| text1 | 2 | 3 | on | on | ADP | 1 | prep |  |  | TRUE | IN |  | on |
| text1 | 2 | 4 | Jan | Jan | PROPN | 3 | pobj | DATE_B | beg_root | TRUE | NNP | DATE | jan |
| text1 | 2 | 5 | 8 | 8 | NUM | 4 | nummod | DATE_I |  | TRUE | CD | DATE | 8 |
| text1 | 2 | 6 | at | at | ADP | 1 | prep |  |  | TRUE | IN |  | at |
| text1 | 2 | 7 | 2 | 2 | NUM | 8 | nummod | TIME_B | beg | TRUE | CD | TIME | 2 |
| text1 | 2 | 8 | PM | pm | NOUN | 9 | compound | TIME_I | mid | TRUE | NN | TIME | pm |
| text1 | 2 | 9 | CET | CET | PROPN | 6 | pobj | TIME_I | end_root | FALSE | NNP | TIME | cet |
| text1 | 2 | 10 | . | . | PUNCT | 1 | punct |  |  | TRUE | . |  | . |
| text1 | 3 | 1 |  |  | SPACE | 1 | ROOT |  |  | FALSE | _SP |  |  |
| text1 | 4 | 1 | See | see | VERB | 1 | ROOT |  |  | TRUE | VB |  | see |
| text1 | 4 | 2 | how | how | ADV | 5 | advmod |  |  | TRUE | WRB |  | how |
| text1 | 4 | 3 | TeamViewer | TeamViewer | PROPN | 4 | compound | PRODUCT_B | beg | TRUE | NNP | PRODUCT | teamviewer |
| text1 | 4 | 4 | Tensor | Tensor | PROPN | 5 | nsubj | PRODUCT_I | end_root | TRUE | NNP | PRODUCT | tensor |
| text1 | 4 | 5 | enhances | enhance | VERB | 1 | ccomp |  |  | TRUE | VBZ |  | enhances |
| text1 | 4 | 6 | your | your | PRON | 7 | poss |  | beg | TRUE | PRP$ |  | your |
| text1 | 4 | 7 | productivity | productivity | NOUN | 5 | dobj |  | end_root | TRUE | NN |  | productivity |
| text1 | 4 | 8 | by | by | ADP | 5 | prep |  |  | TRUE | IN |  | by |
| text1 | 4 | 9 | empowering | empower | VERB | 8 | pcomp |  |  | TRUE | VBG |  | empowering |
| text1 | 4 | 10 | all | all | DET | 11 | det |  | beg | TRUE | DT |  | all |
| text1 | 4 | 11 | employees | employee | NOUN | 9 | dobj |  | end_root | TRUE | NNS |  | employees |
| text1 | 4 | 12 | to | to | PART | 13 | aux |  |  | TRUE | TO |  | to |
| text1 | 4 | 13 | work | work | VERB | 9 | xcomp |  |  | TRUE | VB |  | work |
| text1 | 4 | 14 | from | from | ADP | 13 | prep |  |  | TRUE | IN |  | from |
| text1 | 4 | 15 | anywhere | anywhere | ADV | 14 | pobj |  |  | TRUE | RB |  | anywhere |
| text1 | 4 | 16 | with | with | ADP | 13 | prep |  |  | TRUE | IN |  | with |
| text1 | 4 | 17 | any | any | DET | 18 | det |  | beg | TRUE | DT |  | any |
| text1 | 4 | 18 | device | device | NOUN | 16 | pobj |  | end_root | TRUE | NN |  | device |
| text1 | 4 | 19 | at | at | ADP | 13 | prep |  |  | TRUE | IN |  | at |
| text1 | 4 | 20 | any | any | DET | 21 | det |  | beg | TRUE | DT |  | any |
| text1 | 4 | 21 | time | time | NOUN | 19 | pobj |  | end_root | TRUE | NN |  | time |
| text1 | 4 | 22 | in | in | ADP | 13 | prep |  |  | TRUE | IN |  | in |
| text1 | 4 | 23 | a | a | DET | 27 | det |  | beg | TRUE | DT |  | a |
| text1 | 4 | 24 | secure | secure | ADJ | 27 | amod |  | mid | TRUE | JJ |  | secure |
| text1 | 4 | 25 | and | and | CCONJ | 24 | cc |  | mid | TRUE | CC |  | and |
| text1 | 4 | 26 | fast | fast | ADJ | 24 | conj |  | mid | TRUE | JJ |  | fast |
| text1 | 4 | 27 | manner | manner | NOUN | 22 | pobj |  | end_root | FALSE | NN |  | manner |
| text1 | 4 | 28 | . | . | PUNCT | 1 | punct |  |  | TRUE | . |  | . |
| text1 | 5 | 1 |  |  | SPACE | 1 | ROOT |  |  | FALSE | _SP |  |  |
NLP_example$entities= spacyr::entity_consolidate(NLP_example$spacy_raw)#(named) entity consolidation
NLP_example$nounphrases= spacyr::nounphrase_extract(NLP_example$spacy_raw)#noun phrase extraction
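# For a quicker look at the NER output (including the extended types listed above, e.g. DATE, TIME),
# the recognized entities can be pulled out of the parsed object directly. A minimal illustrative
# sketch using spacyr::entity_extract() on the example tweet parsed above:
spacyr::entity_extract(NLP_example$spacy_raw, type = "all") %>% kable()#one row per entity, with its entity_type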



#NLP FOR THE WHOLE DATASET (users and tweets) TAKES ~10 HOURS
# #Code below was used for NLP pipeline with all tweets and users descriptions 
# NLP_data_raw=list(tweets=NULL,users=NULL)
# users=data$users$clean_description
# names(users)=data$users$id
# tweets=data$tweets$clean_text
# names(tweets)=data$tweets$id
# start=Sys.time()
# NLP_data_raw$users = spacy_parse(
#   users,
#   dependency = TRUE,
#   lemma = TRUE,
#   pos = TRUE,
#   entity = TRUE,
#   nounphrase = TRUE,
#   additional_attributes = c("tag_")
# ) #,"morph"
# NLP_data_raw$tweets = spacy_parse(
#   tweets,
#   dependency = TRUE,
#   lemma = TRUE,
#   pos = TRUE,
#   entity = T,
#   nounphrase = T,
#   additional_attributes = c("tag_")
# ) #,"morph"
# print(paste("NLP time:", Sys.time()-start, "hours"))#"NLP time: 10.1510577821732 hours" on 
# spacy_finalize()
# saveRDS(NLP_data_raw,"NLP_data_raw.RDS")
NLP_data_raw=readRDS("../data/NLP_data_raw.RDS")
#View(NLP_data_raw)

#### 5.2 NLP DATA AUGMENTATION ####

#EXAMPLES:
# merging NLP data with AFINN sentiment lexicon
dict_afinn = corpus::sentiment_afinn %>%
  as_tibble() %>%
  rename(word = 1,
         sentiment = 2)
NLP_example$spacy_raw=NLP_example$spacy_raw %>%
  left_join(dict_afinn, by = c("lemma"= "word"))# use lemma or norm_ depending on dict type

# merging NLP data with Affect Wordnet
dict_wordnet =corpus::affect_wordnet %>%
  as_tibble()   %>%
  select(term, category, emotion) %>%
  rename(word = 1,
         wn_category = 2,
         wn_emotion = 3)
NLP_example$spacy_raw=NLP_example$spacy_raw %>%
  left_join(dict_wordnet, by = c("lemma"= "word"))# use lemma or norm_ depending on dict type
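# The augmented token-level table can be aggregated back to the document level, e.g. to get a rough
# AFINN-based sentiment score per tweet. A minimal sketch (not part of the original pipeline): the
# distinct() step avoids double counting tokens duplicated by the one-to-many WordNet join.
NLP_example$spacy_raw %>%
  dplyr::distinct(doc_id, sentence_id, token_id, sentiment) %>% # one row per token
  dplyr::group_by(doc_id) %>%
  dplyr::summarise(afinn_score = sum(sentiment, na.rm = TRUE))  # tokens without an AFINN match (NA) are ignored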

#EXAMPLES:
#NLP_data_raw=readRDS("../data/NLP_data_raw.RDS")
NLP_data_augmented=NLP_data_raw
for(i in c("tweets", "users"))
{
  NLP_data_augmented[[i]]=NLP_data_raw[[i]] %>%
    left_join(dict_wordnet, by = c("lemma"= "word"))
  NLP_data_augmented[[i]]=NLP_data_augmented[[i]]  %>%
    left_join(dict_afinn, by = c("lemma"= "word"))# use lemma or norm_ depending on dict type
}
View(NLP_data_augmented$tweets)
View(NLP_data_augmented$users)

#### 5.3 NLP  DATA TRANSFORMATION IN TWO DFMs: ONE FOR UNIGRAMS THE OTHER FOR DEP.REL (ASSOCIATIONS) ####
#NLP_data_raw=readRDS("../data/NLP_data_raw.RDS")
NLP_data=list(tweets=NULL,users=NULL)
source_nodes=list(tweets=NULL,users=NULL)
target_nodes=list(tweets=NULL,users=NULL)
net_data=list(tweets=NULL,users=NULL)
quanteda_unigrams=list(tweets=NULL,users=NULL)
quanteda_associations=list(tweets=NULL,users=NULL)
associations_dfm=list(tweets=NULL,users=NULL)
unigrams_dfm=list(tweets=NULL,users=NULL)

for(i in c("tweets", "users"))
{
#if consolidating noun phrases remember to redirect head token IDs of removed objects to existing ones.
NLP_data[[i]]=NLP_data_raw[[i]][!((NLP_data_raw[[i]]$pos %in% c("PUNCT","SPACE","CCONJ","NUM","PRON","ADP","SYM","PRON","DET","PART")) |(NLP_data_raw[[i]]$dep_rel %in% c("cc"))),]# POS filter

NLP_data[[i]]=NLP_data[[i]][!nchar(NLP_data[[i]]$lemma)<=1,]#length filter (not needed if one-character tokens were already removed)

source_nodes[[i]]=NLP_data[[i]][NLP_data[[i]]$dep_rel %in% c("aux","auxpass","cop","arg","comp","acomp","ccomp","xcomp","obj","dobj","iobj","pobj","subj","nsubj","csubj","conj","amod","appos","advcl","preconj","vmod","mwe","mark","advmod","neg","rcmod","quantmod","nn","npadvmod","tmod","num","number","prep","poss","parataxis","goeswith","ref","sdep","xsubj","compound"),] %>% # DEP-REL filter
  dplyr::select(doc_id, sentence_id, token_id=head_token_id,pos_source=pos,entity_source=entity,
                word_source = token, lemma_source = lemma, dep_rel)
target_nodes[[i]]= NLP_data[[i]] %>% dplyr::select(doc_id, sentence_id, token_id,pos_target=pos,entity_target=entity,
                                         word_target = token, lemma_target = lemma)
#saveRDS(NLP_data,paste0(folder,"NLP_data.RDS"))
net_data[[i]]= dplyr::left_join(source_nodes[[i]] , target_nodes[[i]],
                           by = c("doc_id", "sentence_id","token_id"))
net_data[[i]]=net_data[[i]][net_data[[i]]$dep_rel!="ROOT",]
net_data[[i]]=net_data[[i]][!is.na(net_data[[i]]$lemma_target),]
net_data[[i]][,c("lemma_source","lemma_target")]
net_data[[i]]=net_data[[i]][!(net_data[[i]]$lemma_source == net_data[[i]]$lemma_target),]#remove loops

net_data[[i]]$edge=paste(net_data[[i]]$lemma_source,net_data[[i]]$lemma_target,sep = "->")
#saveRDS(net_data[[i]],paste0(folder,"net_data.RDS"))
quanteda_unigrams[[i]]=quanteda::as.tokens(NLP_data[[i]] %>% dplyr::select(doc_id, sentence_id, token_id, token = lemma)) %>% quanteda::tokens_tolower()
quanteda_associations[[i]]=quanteda::as.tokens(net_data[[i]]  %>% dplyr::select(doc_id, sentence_id, token_id, token = edge)) %>% quanteda::tokens_tolower()
#remove documents that don't have any bigram

unigrams_dfm[[i]] =quanteda_unigrams[[i]][names(quanteda_associations[[i]])]%>% quanteda::dfm()
associations_dfm[[i]]= quanteda_associations[[i]] %>% quanteda::dfm()

}
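# Before moving on to the topic models, it is worth sanity-checking the two document-feature matrices,
# e.g. by inspecting their most frequent features (a quick optional check; nothing below depends on it):
quanteda::topfeatures(unigrams_dfm$tweets, n = 20)      # most frequent lemmas
quanteda::topfeatures(associations_dfm$tweets, n = 20)  # most frequent dependency-relation edges (lemma->lemma)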

Topic Modeling with Probabilistic Graphical Models (e.g., LDA)



See the LDA paper: https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf (Blei, Ng & Jordan, 2003)
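As a brief reminder (standard LDA notation, not specific to this tutorial's data), LDA assumes the following generative process for each document $d$:

$$\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad \beta_k \sim \mathrm{Dirichlet}(\eta),$$
$$z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \qquad w_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}}),$$

where $\theta_d$ are the document-topic proportions, $\beta_k$ the topic-word distributions, $z_{d,n}$ the topic assignment of the $n$-th token and $w_{d,n}$ the observed token. The hyperparameter $\alpha$ controls how sparse the document-topic proportions are (see the comment on alpha in section 6.1 below).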

Enriching topic models and facilitating their interpretation (e.g., LDA2NET)



See LDA2Net paper: https://www.researchgate.net/publication/356717976_LDA2Net_Digging_under_the_surface_of_COVID-19_topics_in_scientific_literature (Minello, Santagiustina, Warglien, 2021)

#### 6 TOPIC MODELS ####
#### 6.1 LDA TOPIC MODEL (FOR TWEETS DFM FROM SECTION 2.3) ####
Ntopics=5
tweetsLDA=topicmodels::LDA(dfm$tweets,k=Ntopics,method = "VEM",control=list(seed=1,estimate.beta=TRUE))#a typical value of alpha is 50/k, where k is the number of topics; typical values for the topic-word prior are 0.1 or 200/W, where W is the vocabulary size
#A higher alpha gives denser document-topic distributions, whereas a lower alpha gives sparser ones.
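# If you prefer to fix the Dirichlet prior instead of letting VEM estimate it, alpha can be passed
# (and kept fixed) through the control list. A minimal sketch using the 50/k rule of thumb mentioned
# above (the object name is purely illustrative):
# tweetsLDA_fixed_alpha = topicmodels::LDA(dfm$tweets, k = Ntopics, method = "VEM",
#                                          control = list(seed = 1, alpha = 50/Ntopics, estimate.alpha = FALSE))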
tweetsLDA_top_terms = tidytext::tidy(tweetsLDA, matrix = "beta") %>%#beta contains the word X topic distributions
  group_by(topic) %>%
  slice_max(beta, n = 20) %>% 
  ungroup() %>%
  arrange(topic, -beta)

tweetsLDA_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

#Show examples for each topic
selectedtopic=4
data$tweets[docnames(dfm$tweets[which(tweetsLDA@gamma[,selectedtopic]==max(tweetsLDA@gamma[,selectedtopic])),]),] %>% View() #gamma contains the topic X document distributions


#### 6.2 SEEDED LDA TOPIC MODEL (FOR USERS DFM FROM SECTION 2.3)  ####
#define dictionary with seedwords
seed_dict=list(WFA_worker = c("nomad","worker","employee","traveller","globetrotter","backpacker"),#digital nomads et al.
               WFA_place_community= c("rent","office","meeting","room","appartment","community","join"),
               WFA_firm  = c("employer","firm","company","corporate","business","enterprise","we","our","group","llc"),# WFA firms/employers 
               WFA_manager =c("ceo","executive","director","manager","management","board"),
               WFA_coach =c("coach","coaching","support","supporting","consultant","consulting"),
               WFA_investor=c("investor", "investing","entrepreneur","founder","funding"),
               WFA_media = c("news","reporter", "journalist","writer", "magazine","journal","articles","updates","correspondent")
               )
keyword_counts=list()
for(i in names(seed_dict)){
  keyword_counts[[i]]=sort(colSums(dfm$users[,  quanteda::featnames(dfm$users) %in% seed_dict[[i]]]),decreasing = T)
  if(sum(keyword_counts[[i]])==0 | length(keyword_counts[[i]])==0){
    seed_dict[[i]]=NULL
  }else{
    seed_dict[[i]]=names(sort(keyword_counts[[i]][keyword_counts[[i]]!=0],decreasing = T))#keep only seed words that actually occur in the users DFM
  }
}
lapply(keyword_counts, function(x) paste(names(x) ," (",x,")", sep="",collapse = "; "))
## $WFA_worker
## [1] "employee (138); traveller (87); nomad (76); worker (73)"
## 
## $WFA_place_community
## [1] "community (649); office (324); join (285); meeting (72)"
## 
## $WFA_firm
## [1] "business (3232); company (688); enterprise (377); group (319); firm (307); corporate (217); employer (109)"
## 
## $WFA_manager
## [1] "management (853); director (782); manager (766); executive (385); board (273)"
## 
## $WFA_coach
## [1] "consultant (772); support (662); coach (648); consulting (410); coaching (137); supporting (114)"
## 
## $WFA_investor
## [1] "founder (1069); entrepreneur (807); investor (453); investing (174)"
## 
## $WFA_media
## [1] "news (1489); writer (809); journalist (182); updates (179); magazine (110); reporter (94); articles (91)"
usersSLDA=seededlda::textmodel_seededlda(x = dfm$users,seed_dict %>% quanteda::dictionary(.),residual = T,
                                         valuetype = c("fixed"),max_iter=2000, weight=0.0025,case_insensitive=T)


usersSLDA_top_terms = setNames(reshape2::melt(usersSLDA$phi),c("topic","term","beta")) %>%#beta contains the word X topic distributions
  group_by(topic) %>%
  slice_max(beta, n = 25) %>% 
  ungroup() %>%
  arrange(topic, -beta)

usersSLDA_top_terms[nchar(as.character(usersSLDA_top_terms$term))>=3 & usersSLDA_top_terms$topic!="other",] %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

table(seededlda::topics(usersSLDA))
## 
##          WFA_worker WFA_place_community            WFA_firm         WFA_manager 
##                4489                3667                4159                3835 
##           WFA_coach        WFA_investor           WFA_media               other 
##                3266                3659                3189                3740
#Examples:
#Cisco Collaboration (userID: 15866857) https://twitter.com/ciscocollab 
usersSLDA$theta["15866857",]
##          WFA_worker WFA_place_community            WFA_firm         WFA_manager 
##          0.02941176          0.08823529          0.08823529          0.08823529 
##           WFA_coach        WFA_investor           WFA_media               other 
##          0.08823529          0.08823529          0.44117647          0.08823529
#Financial Times (userID: 4898091) https://twitter.com/FinancialTimes
usersSLDA$theta["4898091",]
##          WFA_worker WFA_place_community            WFA_firm         WFA_manager 
##                0.05                0.05                0.05                0.05 
##           WFA_coach        WFA_investor           WFA_media               other 
##                0.05                0.05                0.65                0.05
#Users with the highest topic probability, by group
type="WFA_media"
data$users[names(which(usersSLDA$theta[,type]==max(usersSLDA$theta[,type]))),] %>% View()
type="WFA_worker"
data$users[names(which(usersSLDA$theta[,type]==max(usersSLDA$theta[,type]))),] %>% View()
type="WFA_firm"
data$users[names(which(usersSLDA$theta[,type]==max(usersSLDA$theta[,type]))),] %>% View()


####6.3 LDA2NET: (TWEETS) TOPIC MODEL ENRICHMENT WITH WEIGHTED DEPENDENCY RELATIONS ####

i="tweets"
#associations_dfm=readRDS("associations_dfm.RDS")

associations_vocab = data.frame(
  from = gsub("->.*$", "", x = colnames(associations_dfm[[i]]), perl = T),
  to = gsub("^.*->", "", x = colnames(associations_dfm[[i]]), perl = T),
  stringsAsFactors = F
)
rownames(associations_vocab)=colnames(associations_dfm[[i]])
associations_vocab_clean = associations_vocab[(associations_vocab$from %in% colnames(dfm[[i]])) &
                                      (associations_vocab$to %in% colnames(dfm[[i]])), ]
associations_vocab_clean$n=colSums(associations_dfm[[i]] %>% dfm_select(., pattern = rownames(associations_vocab_clean), selection = "keep",case_insensitive = F,valuetype = "fixed"))
#save cleaned associations vocabulary
saveRDS(associations_vocab_clean, file = "associations_vocab_clean.rds")
#prune associations_dfm matrix
associations_dfm[[i]] = dfm_select(associations_dfm[[i]],
                              pattern = rownames(associations_vocab_clean),
                              valuetype = "fixed")

associations_dfm[[i]] = dfm_subset(associations_dfm[[i]], docnames(associations_dfm[[i]]) %in% docnames(dfm[[i]]))
nrow(associations_dfm[[i]])
## [1] 60876
# 42 131 docs
#saveRDS(associations_dfm, file ="associations_mat_final.rds")
topics_docs=tweetsLDA@gamma
rownames(topics_docs)=docnames(dfm[[i]])
unigrams_vocab=featnames(dfm[[i]])
theta_match= topics_docs[rownames(topics_docs) %in% docnames(associations_dfm[[i]]),]
nrow(theta_match)#docs
## [1] 60876
ncol(theta_match)#topics
## [1] 5
nrow(associations_dfm[[i]])
## [1] 60876
ncol(associations_dfm[[i]])
## [1] 50028
AXT_counts  = t(theta_match) %*% associations_dfm[[i]] #topics x associations matrix of theta-weighted (expected) bigram counts
print(rowSums(AXT_counts))#don't sum to 1 yet...
## [1] 44762.38 44775.04 44772.65 44791.85 44782.08
AXT_counts = AXT_counts / rowSums(AXT_counts)# normalize rows so that, for each topic, the bigram relative frequencies sum to 1
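# Quick check that the row-normalisation worked as intended (optional):
print(rowSums(AXT_counts))# should now be equal (or very close) to 1 for every topic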



topics_unigrams= exp(tweetsLDA@beta)
colnames(topics_unigrams)=tweetsLDA@terms
rowSums(topics_unigrams )#good if close to 1
## [1] 1 1 1 1 1
p_associations = list()

#for topics:
#selected topics
topics_selection=c(1:nrow(topics_unigrams))
for (j in topics_selection) {
  p_associations[[j]] = associations_vocab_clean
  p_associations[[j]]$counts = AXT_counts[j, rownames(p_associations[[j]])]
  print("stage 1: eprob")
  p_associations[[j]]$e_prob =  topics_unigrams[j, p_associations[[j]]$from]* topics_unigrams[j, p_associations[[j]]$to]
  #for all bigrams observed in the corpus
  print("stage 2: eprobXcounts")
  p_associations[[j]]$eprobXcounts = p_associations[[j]]$e_prob * p_associations[[j]]$counts
  p_associations[[j]]$eprobXcounts = p_associations[[j]]$eprobXcounts/sum(p_associations[[j]]$eprobXcounts)
  print("stage 4:saving")
  #saveRDS(p_associations[[i]],paste("raw_association_topic_data_rds/topic", i, "_raw_associations.RDS", sep = ""))
  print(p_associations[[j]][order(p_associations[[j]]$eprobXcounts,decreasing = T),][1:10,])
}
## [1] "stage 1: eprob"
## [1] "stage 2: eprobXcounts"
## [1] "stage 4:saving"
##                           from         to    n       counts       e_prob
## full->time                full       time  699 0.0031063558 2.641982e-05
## business->help        business       help  114 0.0005110207 1.588158e-04
## business->need        business       need  144 0.0006437535 1.215408e-04
## home->office              home     office  255 0.0009942964 6.932092e-05
## online->business        online   business  217 0.0009687601 6.233408e-05
## lifestyle->business  lifestyle   business  213 0.0009500929 5.733182e-05
## internet->connection  internet connection 1345 0.0043344491 1.243652e-05
## help->need                help       need   90 0.0004027860 1.215322e-04
## free->video               free      video  215 0.0009632411 4.633633e-05
## today->world             today      world  158 0.0007064408 6.281999e-05
##                      eprobXcounts
## full->time             0.02105370
## business->help         0.02081995
## business->need         0.02007192
## home->office           0.01768184
## online->business       0.01549135
## lifestyle->business    0.01397363
## internet->connection   0.01382866
## help->need             0.01255778
## free->video            0.01144996
## today->world           0.01138468
## [1] "stage 1: eprob"
## [1] "stage 2: eprobXcounts"
## [1] "stage 4:saving"
##                       from        to   n       counts       e_prob eprobXcounts
## remote->developer   remote developer 746 0.0033480212 1.036141e-04  0.078673999
## remote->working     remote   working 356 0.0015908120 1.794641e-04  0.064747111
## full->time            full      time 699 0.0031263975 3.897142e-05  0.027632139
## business->start   business     start 784 0.0035021549 3.261780e-05  0.025906821
## learn->join          learn      join 247 0.0011022420 5.430114e-05  0.013574066
## remote->workforce   remote workforce 181 0.0008088408 6.182902e-05  0.011341734
## long->live            long      live 616 0.0027368658 1.684818e-05  0.010457567
## part->time            part      time 327 0.0014617909 2.779884e-05  0.009215863
## digital->world     digital     world 409 0.0018241296 2.173901e-05  0.008993310
## remote->office      remote    office  30 0.0001343109 2.897606e-04  0.008826221
## [1] "stage 1: eprob"
## [1] "stage 2: eprobXcounts"
## [1] "stage 4:saving"
##                            from       to   n       counts       e_prob
## management->business management business 498 0.0022263938 1.291854e-04
## remote->working          remote  working 356 0.0015917306 1.699096e-04
## business->start        business    start 784 0.0035048425 5.838376e-05
## remote->team             remote     team 179 0.0008001553 2.025557e-04
## business->need         business     need 144 0.0006435043 1.960735e-04
## small->business           small business 430 0.0019263390 4.666811e-05
## online->business         online business 217 0.0009699626 6.938914e-05
## just->need                 just     need 148 0.0006571279 9.143940e-05
## business->build        business    build 361 0.0016028522 3.487186e-05
## freedom->give           freedom     give 331 0.0014689882 3.734535e-05
##                      eprobXcounts
## management->business   0.06355299
## remote->working        0.05975964
## business->start        0.04521484
## remote->team           0.03581287
## business->need         0.02787988
## small->business        0.01986431
## online->business       0.01487192
## just->need             0.01327712
## business->build        0.01235063
## freedom->give          0.01212203
## [1] "stage 1: eprob"
## [1] "stage 2: eprobXcounts"
## [1] "stage 4:saving"
##                          from         to    n       counts       e_prob
## digital->world        digital      world  409 0.0018287919 1.308817e-04
## world->travel           world     travel  232 0.0010175060 1.374097e-04
## home->office             home     office  255 0.0009912137 1.180991e-04
## world->company          world    company   96 0.0004285642 2.502649e-04
## internet->connection internet connection 1345 0.0043945268 2.119476e-05
## many->people             many     people  182 0.0008121853 8.113873e-05
## people->want           people       want  127 0.0005673483 9.416670e-05
## learn->join             learn       join  247 0.0011026048 4.470186e-05
## company->join         company       join  147 0.0006512780 5.473726e-05
## full->time               full       time  699 0.0031490202 1.100553e-05
##                      eprobXcounts
## digital->world        0.056414528
## world->travel         0.032953546
## home->office          0.027590631
## world->company        0.025279219
## internet->connection  0.021952727
## many->people          0.015532136
## people->want          0.012592007
## learn->join           0.011616981
## company->join         0.008402282
## full->time            0.008168347
## [1] "stage 1: eprob"
## [1] "stage 2: eprobXcounts"
## [1] "stage 4:saving"
##                       from        to    n       counts       e_prob
## business->start   business     start  784 0.0035106167 7.162173e-05
## remote->working     remote   working  356 0.0015872120 1.333587e-04
## home->office          home    office  255 0.0009913034 1.293575e-04
## today->world         today     world  158 0.0007051043 1.302334e-04
## remote->team        remote      team  179 0.0007991811 9.285952e-05
## part->time            part      time  327 0.0014628260 4.390414e-05
## team->hire            team      hire 1114 0.0050463074 1.175657e-05
## full->time            full      time  699 0.0031188024 1.830996e-05
## business->build   business     build  361 0.0016038144 3.422970e-05
## remote->developer   remote developer  746 0.0033760541 1.445099e-05
##                   eprobXcounts
## business->start     0.05536354
## remote->working     0.04660707
## home->office        0.02823539
## today->world        0.02021954
## remote->team        0.01634057
## part->time          0.01414144
## team->hire          0.01306321
## full->time          0.01257392
## business->build     0.01208795
## remote->developer   0.01074243
#### 6.4 PLOT (PERCENTILE FILTERED) TOPIC NETWORK ####
#color map
library(tidyverse)
library(visNetwork)
map2color <- function(x, pal, limits = NULL) {
  if (is.null(limits))
    limits = range(x)
  pal[findInterval(x,
                   seq(limits[1], limits[2], length.out = length(pal) + 1),
                   all.inside = TRUE)]
}
mypal = colorRampPalette(c("gray", "darkred"))(100)
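# Toy call to illustrate what the map2color() helper does: values near the lower limit map to gray,
# values near the upper limit map to dark red (purely illustrative, not used below):
map2color(c(0, 0.25, 0.5, 1), mypal)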
allcolors = gplots::col2hex(grDevices::colors()[grep('gr(a|e)y', grDevices::colors(), invert = T)])

filter_and_plot_topic = function(temp_=temp[[j]], show_nav=T, show_id=T){ #builds the visNetwork plot for one (already filtered) topic; the topic index j used in the title is taken from the calling environment
  
  temp_$vertices[,c("color")]="white"
  ids=temp_$vertices[,c("name")]
  
  topicnet = visNetwork(
      main = paste0("Topic ", j), 
    height = "900px",
    width = "100%",
    nodes = data.frame(
      id = temp_$vertices$name
      ,
      label = temp_$vertices$name
      ,
      size =  (temp_$vertices$prob) ^ (1 / 4) * 20 / max((temp_$vertices$prob) ^
                                                           (1 / 4)) #log(1.00015+temp_$vertices$prob)*10000
      , color.background = temp_$vertices$color
      ,
      color.border= "black"
      ,
      title = paste(
        "<b>",
        temp_$vertices$name,
        "</b><br> Prob:",
        temp_$vertices$prob,
        sep = ""
      )
    ),
    edges = data.frame(
      from = temp_$edges$from
      ,
      to = temp_$edges$to
      ,
      width = (temp_$edges$eprobXcounts) ^ (1 / 4) * 10 / max((temp_$edges$eprobXcounts) ^ (1 /
                                                                                              4))
      , smooth.enabled = T
      ,smooth.roundness = 0.70
      ,color = map2color(temp_$edges$counts, mypal)
      ,
      title = paste(
        "<b>",
        temp_$edges$from
        ,
        "-",
        temp_$edges$to
        ,
        "</b><br> Prob:"
        ,
        temp_$edges$eprobXcounts
        ,

        sep = ""
      )
      #,group = temp_$vertices$community_walktrap
    )
  ) %>%
    visPhysics(solver = "forceAtlas2Based", forceAtlas2Based = list(gravitationalConstant = -20))%>%
    visEdges(arrows = "to") %>%
    visInteraction(
      keyboard = F,
      navigationButtons =  T,
      dragNodes = T,
      dragView = T,
      zoomView = T
    ) %>%
    visOptions(
      highlightNearest = list(enabled = T, hover = T),
      nodesIdSelection = T
    )
  return(topicnet)
}

net=list()
temp=list()
input=list()
input$n_percentile=0.95
input$e_percentile=0.95

for(j in c(1:nrow(topics_unigrams)))
{
  temp[[j]]=list()
  temp[[j]]$edges=p_associations[[j]]
  temp[[j]]$vertices=data.frame(name=colnames(topics_unigrams),prob=topics_unigrams[j,],stringsAsFactors = F)
  #filtration
  temp[[j]]$vertices = temp[[j]]$vertices[temp[[j]]$vertices$prob>= quantile(temp[[j]]$vertices$prob, probs = input$n_percentile), ]
  temp[[j]]$edges = temp[[j]]$edges[(
    temp[[j]]$edges$to %in% temp[[j]]$vertices$name &
      temp[[j]]$edges$from %in% temp[[j]]$vertices$name
  ) & ifelse(temp[[j]]$edges$to != temp[[j]]$edges$from, T, F), ]
  temp[[j]]$edges = temp[[j]]$edges[temp[[j]]$edges$eprobXcounts>= quantile(temp[[j]]$edges$eprobXcounts, probs = input$e_percentile), ]
  temp[[j]]$vertices  = temp[[j]]$vertices[temp[[j]]$vertices$name %in% c(temp[[j]]$edges$to,temp[[j]]$edges$from), ]
  net[[j]]=filter_and_plot_topic(temp[[j]])
}

net[[1]]
net[[2]]
net[[3]]
net[[4]]
net[[5]]
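# The interactive topic networks can also be exported as standalone HTML files with visNetwork::visSave().
# A minimal optional sketch (the file names are illustrative):
# for (j in seq_along(net)) {
#   visNetwork::visSave(net[[j]], file = paste0("topic_", j, "_network.html"), selfcontained = TRUE)
# }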

Combining structural topic models and causal argument extraction



  • “The Unfolding Structure of Arguments in Online Debates: The case of a No-Deal Brexit” (Santagiustina and Warglien, 2021)

https://www.researchgate.net/publication/350176064_The_Unfolding_Structure_of_Arguments_in_Online_Debates_The_case_of_a_No-Deal_Brexit

 

A tutorial by Carlo R. M. A. Santagiustina

carlo.santagiustina@unive.it