The New Momentum Foundation develops research tools and systems to analyze societal undercurrents in digital media networks. This walkthrough will show you how to use New Momentum's Analysis Toolkit for R. We will walk you through all the package's functions and show you how they can be used to:
In our workflow, the Analysis Toolkit for R is used after we collect and analyze data using New Momentum's Research and Data Infrastructure, which provides Infrastructure as Code to deploy data retrieval and processing systems in the Amazon cloud and saves social media data to a so-called graph database (Neo4j). This graph database allows for complex analyses using Neo4j's Cypher query language and its Graph Data Science library.
In this walkthrough, we work with a Twitter data set of Dutch tweets written between January 1st and October 31st, 2020, that include combinations of words like 'discriminatie' (i.e., 'discrimination'), 'racisme' (i.e., 'racism') and 'blm', with words like 'rotterdam', 'rdam' and '010' (i.e., the area code of Rotterdam). The data set also includes all the retweets, quotes and replies associated with these tweets. Furthermore, the set includes the user bios of the users that have published the tweets, plus all users that they are following (followees) and are being followed by (followers). This is important information, as it shows how users are (indirectly) connected beyond the scope of Twitter debates about discrimination in the city of Rotterdam. Finally, the data set also includes all the hashtags that are associated with the tweets.
In Neo4j, our data model looks as follows:
In this walkthrough, we will explore discourses about discrimination in Rotterdam through the lens of a so-called ego network, in which the nodes represent Twitter users and the edges (i.e., connections) represent whether users follow each other.
In Neo4j, we used the Graph Data Science library's implementation of the Louvain algorithm to detect communities in the ego network. We used Cypher queries to create and download a nodelist and edgelist that we visualized in the open-source network visualization software Gephi.
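As an indication only (not the exact queries we ran), streaming Louvain communities with the Graph Data Science library could look like the sketch below; the projected graph name ('ego_graph') and the id property are assumptions:

## Hypothetical Cypher, stored as an R string, to stream Louvain communities
louvain_query <- "
CALL gds.louvain.stream('ego_graph')
YIELD nodeId, communityId
RETURN gds.util.asNode(nodeId).id AS Id, communityId AS Community
"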
This is an example of a very minimal nodelist, comprising the id for each node, the community, and the number of tweets the user has written.
This is an example of a minimal edgelist, comprising the ids of the source and target nodes (directed relations are drawn from the source to the target) and a weight column that reflects whether the source user follows the target user (weight = 1) and the number of interactions from the source user to the target user (weight > 1).
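For illustration, here is a hypothetical sketch of what such minimal lists look like as tibbles (the column names follow the conventions used later in this walkthrough):

library(tibble)
## Minimal nodelist: one row per user
nodelist <- tribble(
  ~Id, ~Community, ~n_tweets,
  "1",          1,        12,
  "2",          1,         0,
  "3",          2,         5
)
## Minimal edgelist: Source follows Target (Weight = 1) or has
## interacted with Target (Weight > 1)
edgelist <- tribble(
  ~Source, ~Target, ~Weight,
  "1",     "2",           1,
  "2",     "3",           4
)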
In Gephi, we created the visualization below. Each node represents a Twitter user and the lines - which are hardly visible because there are so many - represent whether users follow each other. Thicker lines also represent interactions. Large nodes with bright colors have written more tweets than other users. The colors stand for the communities that were detected.
By filtering by weight, the network is easily transformed into an interaction network (representing interactions between users):
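In code, and assuming the edgelist is loaded as the tibble sketched above, this boils down to a single filter:

library(dplyr)
## Keep only edges that represent interactions (Weight > 1),
## dropping pure follow relations (Weight == 1)
interaction_edges <- edgelist %>%
  filter(Weight > 1)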
The visuals are pretty, but do not tell us anything yet about:
To answer these questions, we will use the Analysis Toolkit for R.
First, we install the latest version of the Analysis Toolkit for R:
## install {remotes} if not already
if (!"remotes" %in% installed.packages()) {
install.packages("remotes")
}
## install from gitlab
if (!"newmomentum" %in% installed.packages()) {
remotes::install_git("https://gitlab.com/new-momentum/analysis-toolkit-R.git")
}
Next, we load the required packages:
library(ggplot2)
library(ggraph)
library(newmomentum)
library(tidyverse)
Next, we load our data. In this case, we have retrieved the data from Neo4j and stored them locally as csv files. To create a csv of users, we have used the following query:
Getting the nodelist
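The query itself is shown as an image; as a rough sketch, it could look like the following. The node labels, relationship type and property names are assumptions about the data model, not our exact query:

## Hypothetical Cypher, stored as an R string, to export the users
nodelist_query <- "
MATCH (u:User)
OPTIONAL MATCH (u)-[:PUBLISHED]->(t:Tweet)
RETURN u.id AS Id, u.description AS description,
       u.community AS Community, count(t) AS n_tweets
"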
To create a csv of tweets, we have used the following query (note that we have capped the length of the text field for demonstration purposes):
Getting the tweets
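Again as a rough sketch under the same assumptions, with Cypher's left() capping the text field:

## Hypothetical Cypher, stored as an R string, to export the tweets
tweets_query <- "
MATCH (u:User)-[:PUBLISHED]->(t:Tweet)
RETURN t.id AS status_id, left(t.text, 140) AS text,
       u.community AS Community, t.is_retweet AS is_retweet
"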
We load the csv into R as tibbles:
users <- read_csv("data/users.csv",
col_types = list(col_character(),
col_character(),
col_integer(),
col_integer()))
tweets_raw <- read_csv("data/tweets.csv",
col_types = list(col_character(),
col_character(),
col_integer(),
col_logical()))
sample_n(users,10) %>% mutate(Community="[anonymized]") %>% print
#> # A tibble: 10 × 4
#> Id description Community n_tweets
#> <chr> <chr> <chr> <int>
#> 1 31491 Freelance mediajournalist - medianieuws oa Totaal T… [anonymi… 0
#> 2 11836 Laat Zwarte Piet niet verloren gaan! [anonymi… 0
#> 3 34359 I am a Alpha Male and I do Alpha Male things! Spa… [anonymi… 0
#> 4 14778 Tweeting about #social injustice. Freelance #digita… [anonymi… 0
#> 5 8313 Works in finance & marketing sector. I'm half Indon… [anonymi… 0
#> 6 45026 <NA> [anonymi… 0
#> 7 494 radiomaker, stukjesschrijver en Zeeuw. (zij/haar) [anonymi… 0
#> 8 2574 Carpe diem , zelfstandig ondernemer , recht,svaard… [anonymi… 0
#> 9 44777 <NA> [anonymi… 0
#> 10 17178 <NA> [anonymi… 0
sample_n(tweets_raw,10) %>% mutate(text=str_replace_all(text,"@([0-9]+)?[A-Za-z]+([0-9]+)? ","[anonymized] "))
#> # A tibble: 10 × 4
#> status_id text Community is_retweet
#> <chr> <chr> <int> <lgl>
#> 1 210 "@Rotterdam'n racist heeft mijn leven geruïne… 2 FALSE
#> 2 883 "[anonymized] [anonymized] @D66Rotterdam [ano… 2 FALSE
#> 3 2337 "Vandaag eert Frankrijk #SamuelPaty.\n\nHij s… 4 FALSE
#> 4 155 "Chinezen zijn discriminatie zat en praten me… 4 FALSE
#> 5 1213 "From today on, any White researcher calls me… 2 FALSE
#> 6 889 "@J_0_0_S_T [anonymized] [anonymized] Maurice… 2 FALSE
#> 7 645 "Rotterdam police investigating racism in cop… 1 FALSE
#> 8 896 "[anonymized] [anonymized] [anonymized] Blijf… 1 FALSE
#> 9 1802 "[anonymized] Racism: have you tried not talk… 2 FALSE
#> 10 1934 "[anonymized] [anonymized] [anonymized] [anon… 2 FALSE
To get a clear picture of who inhabits the communities in the network, we analyze the user bios in several steps:
Note that we recommend using these tools to inform further qualitative analysis. Word clouds as well as semantic networks are excellent tools to identify larger patterns in corpora of text, but what these patterns mean often strongly depends on the context. Neo4j makes it easy to retrieve bios and/or tweets that include specific words or word combinations. For example:
Pattern matching in Neo4j
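As an illustration (node label and property names are again assumptions), retrieving bios that contain a specific word boils down to a CONTAINS clause:

## Hypothetical Cypher, stored as an R string, to retrieve matching bios
bios_query <- "
MATCH (u:User)
WHERE toLower(u.description) CONTAINS 'journalist'
RETURN u.id AS Id, u.description AS description
"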
A word cloud is a visual representation that is often used to summarize large amounts of text. Word clouds provide a quick overview of the main themes and topics. When applied to a set of user bios, word clouds can give a clear impression of how users describe themselves. When plotted for each community, word clouds can shed light on what sets the communities apart.
The size of the words is based on their frequency. The same is true when word clouds are plotted for each community. However, in that case, the color intensity of the words expresses the relative prevalence of the word in comparison with the other communities. In this way, words that are more typical of user bios in a specific community are highlighted.
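To make this concrete, here is a minimal sketch of how relative prevalence can be thought of: a word's frequency share within a community divided by its average share across communities. This is an illustration with a toy corpus, not the package's internal computation:

library(dplyr)
library(tibble)
## Toy corpus: one row per word occurrence, with the community it came from
toy_corpus <- tibble(
  group = c(1, 1, 1, 2, 2, 2),
  word  = c("links", "politiek", "politiek", "rechts", "rechts", "politiek")
)
toy_corpus %>%
  count(group, word) %>%                           # frequency per community
  group_by(group) %>%
  mutate(share = n / sum(n)) %>%                   # share within community
  group_by(word) %>%
  mutate(rel_prevalence = share / mean(share)) %>% # compared to other communities
  ungroup()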
First, we create a word cloud for all the user bios, regardless of communities:
* We create a corpus using the create_corpus function.
* We prepare the word cloud using the prepare_wordcloud function.
* We plot the word cloud using the plot_wordcloud function.
## Create a corpus for the word cloud
corpus_all_who <- create_corpus(input = users,
text_name = "description")
## Create the word cloud
wordcloud_all_who <- prepare_wordcloud(corpus = corpus_all_who,
word_name = "word",
stop_lang = c("en","nl"),
n = 35,
min_count = 3,
min_ln = 3)
## Plot the word cloud
plot_wordcloud_all_who <- plot_wordcloud(wordclouds = wordcloud_all_who)
The figure below shows the word cloud for all user bios. The most frequently used words are the largest. These words are "politiek" (i.e. "politics"), "rotterdam" and "journalist". However, we also see some contrasting words, like "pro" and "anti", "rechts" (i.e. "right-wing") and "links" (i.e. "left-wing"), and "rotterdam" and "amsterdam".
plot_wordcloud_all_who
Although the word cloud for the whole network says something about the Twitter users, it is still unclear how to characterize the users in the different communities. This can best be examined by creating a word cloud of the user descriptions per community.
* We set the node colors using the color_nodelist function. For demonstration purposes, we re-color our tibble of users using color_nodelist so that the colors of the word clouds correspond with the colors of the network.
* We create a corpus using create_corpus, where we group along the community variable.
* We prepare the word clouds using prepare_wordcloud, where we group along the group variable in the corpus we just created.
* We plot the word clouds using plot_wordcloud; the resulting plots are exported into individual list entries.
## Setting the colors for each of the users based on their respective community
colors_who <- color_nodelist(input = users,
com_name = "Community",
sort = TRUE,
normalize_act_level = TRUE,
bg_color = "#ffffff") %>%
select(Id,Community,Color) %>%
mutate(Color = replace(Color, Color == "#426274", "#0868ac"))
## Create a corpus for the word clouds, where we group by community
corpus_per_com_who <- create_corpus(input = users,
text_name = "description",
grouping_id_name = "Community")
## Create the word clouds
wordcloud_per_com_who <- prepare_wordcloud(corpus = corpus_per_com_who,
word_name = "word",
grouping_id_name = "group",
stop_lang = c("en","nl"),
min_ln = 3,
n = 50,
min_count = 2)
## Plot the word clouds
plot_wordcloud_per_com_who <- plot_wordcloud(wordclouds = wordcloud_per_com_who,
color_index = select(colors_who,Community,Color) %>% distinct)
Below, the word clouds of the user bios for each community are shown. The most frequently used words are the largest, while the more relatively prevalent words have brighter colors.
plot_wordcloud_per_com_who[[1]]
plot_wordcloud_per_com_who[[2]]
plot_wordcloud_per_com_who[[3]]
Based on this analysis, we can label the communities in the network as a conservative community, a progressive community, and a Rotterdam community.
Emojis provide a way to express ideas and feelings, or to signal sarcasm and other kinds of humor. Therefore, it can be useful to examine which emojis are used most frequently.
First, we create an emoji cloud for all the user bios, regardless of communities:
* We create a corpus using the create_emojicorpus function.
* We prepare the emoji cloud using the prepare_emojicloud function.
* We plot the emoji cloud using the plot_emojicloud function.
## Create a corpus for the emoji cloud
corpus_emo_all_who <- create_emojicorpus(input = users,
text_name = "description")
## Create the emoji cloud
emojicloud_all_who <- prepare_emojicloud(corpus = corpus_emo_all_who,
emoji_name = "emoji",
n = 50,
min_count = 3)
## Plot the emoji cloud
plot_emojicloud_all_who <- plot_emojicloud(emojiclouds = emojicloud_all_who,
size = 1)
Below, the emoji cloud for the user bios of all the users in the network is shown. The larger the emoji, the more frequently the emoji is used in the bios of the users.
Emojicloud for all of the user bios
To understand how the communities compare to each other in terms of the emojis they use in their bios, we plot a cloud for each community.
* We create a corpus using create_emojicorpus and group by community.
* We prepare the emoji clouds using prepare_emojicloud, where we group along the group variable in the corpus we just created.
* We plot the emoji clouds using plot_emojicloud; the resulting plots are exported into individual list entries.
## Create a corpus for the emoji clouds, grouping by community
corpus_emo_per_com_who <- create_emojicorpus(input = users,
text_name = "description",
grouping_id_name = "Community")
## Create the emoji clouds, grouping by group
emojicloud_per_com_who <- prepare_emojicloud(corpus = corpus_emo_per_com_who,
emoji_name = "emoji",
grouping_id_name = "group",
n = 30,
min_count = 3)
## Plot the emoji clouds
plot_emojicloud_per_com_who <- plot_emojicloud(emojiclouds = emojicloud_per_com_who)
Below, the emoji clouds for the user bios of each community are shown. Here, some differences can be seen in the usage of emojis between the users of the communities.
Emojicloud for the conservative community
Emojicloud for the progressive community
Emojicloud for the Rotterdam community
To conclude: word clouds are very helpful in getting a clear image of the nature of the different communities, while emoji clouds help to understand the communities further. However, qualitative analysis of the user bios is needed to understand the meaning of the emojis, because emojis can be meant and interpreted in different ways. In our workflow, Neo4j makes it easy to retrieve bios that include specific words or word combinations.
A semantic network (semnet) is a representation of the relationships between words in a corpus. It consists of nodes and edges, where nodes represent words and edges represent relationships between the words. A semnet allows us to organize and represent complex relationships between words, which can be difficult to express using methods such as lists, tables or word clouds.
We use semnets to draw maps of the words that occur most frequently and draw edges between words that co-occur relatively frequently in individual tweets. Furthermore, we color the words according to the community in which they are most prevalent. This means, for example, that words that exclusively occur in the conservative community are colored green, whereas words that occur evenly across the communities are colored with a mix of green, red and blue. Note that we work with threshold values: words are only included in the semantic network if (a) they occur a certain number of times, and (b) have a relationship with another word of a certain strength.
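To make the co-occurrence logic concrete, here is a minimal sketch (an illustration, not the internals of the semnet function used below) that computes pairwise word correlations across individual tweets with the widyr package and keeps only sufficiently strong edges:

library(dplyr)
library(tibble)
library(widyr)
## Toy corpus: one row per word occurrence per tweet
toy_corpus <- tibble(
  status_id = c(1, 1, 2, 2, 3, 3),
  word      = c("racisme", "politie", "politie", "geweld", "racisme", "blm")
)
toy_corpus %>%
  pairwise_cor(word, status_id) %>% # phi correlation of word co-occurrence
  filter(correlation >= .2)         # drop edges below the threshold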
* We create a corpus using create_corpus.
* We create the semnet using semnet, setting the minimum correlation value at .2, the minimum occurrence of words to build the network around at n=35 (97.5th percentile), and the minimum occurrence of words to include around the central words at n=15 (95th percentile), and indicating that we only want to include clusters that comprise at least 3 words.
* We use mix_color_by_group to calculate the colors of the words so that they reflect prevalence across communities, and join the result with the semnet we have created.
* We load the semnet as a statnet object and plot the network using ggraph.
## Create a corpus for the semnet
corpus_semnet_who <- create_corpus(input = users,
text_name = "description",
grouping_id_name = "Id")
## Derive the quantiles for the word count
get_quantiles(corpus_semnet_who) %>% print(n=50)
#> # A tibble: 449 × 2
#> Count Quantile
#> <int> <dbl>
#> 1 1 66.6
#> 2 2 78.1
#> 3 3 83.3
#> 4 4 86.2
#> 5 5 88.3
#> 6 6 89.7
#> 7 7 90.8
#> 8 8 91.6
#> 9 9 92.4
#> 10 10 93
#> 11 11 93.5
#> 12 12 94.0
#> 13 13 94.4
#> 14 14 94.7
#> 15 15 95.0
#> 16 16 95.3
#> 17 17 95.5
#> 18 18 95.7
#> 19 19 95.9
#> 20 20 96.0
#> 21 21 96.2
#> 22 22 96.3
#> 23 23 96.5
#> 24 24 96.6
#> 25 25 96.7
#> 26 26 96.8
#> 27 27 96.9
#> 28 28 97.0
#> 29 29 97.1
#> 30 30 97.2
#> 31 31 97.3
#> 32 32 97.4
#> 33 33 97.4
#> 34 34 97.5
#> 35 35 97.5
#> 36 36 97.6
#> 37 37 97.6
#> 38 38 97.7
#> 39 39 97.7
#> 40 40 97.8
#> 41 41 97.8
#> 42 42 97.9
#> 43 43 97.9
#> 44 44 98.0
#> 45 45 98.0
#> 46 46 98.0
#> 47 47 98.1
#> 48 48 98.1
#> 49 49 98.1
#> 50 50 98.2
#> # ℹ 399 more rows
## Create the semnet
semnet_who <- semnet(corpus = corpus_semnet_who,
word_name = "word",
grouping_id_name = "group",
cor_cut = .2,
count_cut = .975,
contextual_cut = .95,
min_ln = 3,
stop_lang = c("en","nl"),
edge_filt = TRUE,
connected_component = 3)
#> [1] "Filtering for noise and en"
#> [1] "Filtering for nl"
#> [1] "Found 72494 words"
#> [1] "Creating network around words that occur at least 32 times, which are 1861 words"
#> [1] "Filtering context, too."
#> [1] "Including contextual words that occur at least 15 times."
#> [1] "Network has 764 edges. If high, consider filtering with edge_weight_flt."
#> [1] "The weakest edge between every two nodes is omitted."
#> [1] "Network now includes 382 edges."
#> [1] "The final network includes 552 words, that correlate at least 0.2"
#> [1] "Filtering connected component"
## Determine the word colors in the semnet
semnet_colors_who <- corpus_semnet_who %>%
filter(word %in% semnet_who[[1]]$Id) %>%
left_join(.,colors_who,by = c("group"="Id")) %>%
na.omit %>%
group_by(word,Color) %>%
summarize(coms=n_distinct(group),.groups = "drop") %>%
group_by(word) %>%
slice_max(order_by=coms,n=7,with_ties=FALSE) %>%
mutate(prop=coms/sum(coms)) %>%
group_by(Id=word) %>%
do(Color=mix_color_by_group(.$Color,length(.$coms),.$prop)) %>%
unnest(cols = "Color")
semnet_who[[1]] <- semnet_who[[1]] %>%
left_join(., semnet_colors_who)
## Create the statnet of the semnet
statnet_who <- load_as_statnet(nodelist = semnet_who[[1]],
edgelist = semnet_who[[2]],
com_name = "Community",
color_name = "Color")
## Plot the semnet
semnet_plot_who <- ggraph(statnet_who,"nicely",maxiter=10000) +
geom_edge_link(aes(width=Weight,color=Color),alpha=0.2) +
geom_node_point(aes(fill=Color, size=Weight), shape=21, color="black", stroke=0.25) +
geom_node_text(aes(label = Label, size=Weight), color="black",family="FK Grotesk",
repel=T, min.segment.length=0.2,max.iter=50000,max.time=5) +
scale_fill_identity() +
scale_edge_color_identity() +
scale_size_continuous(range = c(2, 8)) +
scale_edge_width(range = c(0.5, 5)) +
theme_newmomentum_sn()
Below, the plot of the semantic network for the user bios is shown. The larger the words, the more frequently they appear in the descriptions. The width of the edges represents the correlation between the words. If the edge between nodes is relatively thick, it means that these words are frequently used in combination with each other. Analyzing a semantic network helps find relationships between words in a large corpus, but we recommend following up these patterns with qualitative analysis to understand what they mean. The components (i.e. groups of connected words) that stand out are mostly those in one of the colors also used for the word clouds, since these components typically say something about the users in one of the communities.
semnet_plot_who
Now that we have a clear picture of who the users in the network are, we can start analyzing what they are saying. We do this by examining the tweets, following the same approach that we used for the user bios: creating word clouds and semantic networks to identify patterns, and following those up with qualitative analysis.
First, we create a word cloud for all the tweets, regardless of communities:
* We create a corpus using the create_corpus function.
* We prepare the word cloud using the prepare_wordcloud function. Using the function's arguments, we filter out stop words (stop_lang) and some words we have manually selected (filt).
* We plot the word cloud using the plot_wordcloud function.
## Remove urls and screen names from the tweets
tweets <- tweets_raw %>%
mutate(text = gsub("https?://\\S+\\s?","", text)) %>%
mutate(text = gsub("@\\S+\\s?", "", text))
## Create a corpus for the word cloud
corpus_all_what <- create_corpus(input = tweets,
text_name = "text")
## Create the word cloud
wordcloud_all_what <- prepare_wordcloud(corpus = corpus_all_what,
word_name = "word",
stop_lang = c("en","nl"),
min_ln = 3,
n = 30,
min_count = 2,
filt = c("jij","wel","goed","echt","gaan","amp","even","gaat","moet","alleen","moeten","gewoon","maken","alle","zie","waarom","weer","mag","weet","mee","denk","wij","heel","zegt","laat","bent","via","net","willen","komen"))
## Plot the word cloud
plot_wordcloud_all_what <- plot_wordcloud(wordclouds = wordcloud_all_what)
Due to the search queries described in the data section, many tweets contain words like "racisme" and "discriminatie" (i.e. "racism" and "discrimination"), together with words like "Rotterdam". We see this reflected in the word cloud of the whole data set. Almost all of the most frequently used words have something to do with racism and the Black Lives Matter movement.
plot_wordcloud_all_what
Although the previous word cloud says something about the content of all the tweets, it is more interesting to examine differences across communities. In this section, a word cloud of the tweets per community is created to get a clear picture of the different opinions and statements.
* We create a corpus using create_corpus, where we group along the community variable.
* We prepare the word clouds using prepare_wordcloud, where we group along the group variable in the corpus we just created.
* We plot the word clouds using plot_wordcloud; the resulting plots are exported into individual list entries.
## Define the word colors of the word clouds
colors_what <- tweets %>%
left_join(.,select(colors_who,Community,Color) %>% unique, by = c("Community" = "Community")) %>%
select(status_id,Community,Color)
## Create a corpus for the word clouds
corpus_per_com_what <- create_corpus(input = tweets,
text_name = "text",
grouping_id_name = "Community")
## Create the word clouds
wordcloud_per_com_what <- prepare_wordcloud(corpus = corpus_per_com_what,
word_name = "word",
grouping_id_name = "group",
stop_lang = c("en","nl"),
n = 50,
min_count = 3,
filt = c("jij","wel","goed","echt","gaan","amp","even","gaat","moet","alleen","moeten","gewoon","maken","alle","zie","waarom","weer","mag","weet","mee","denk","wij","heel","zegt","laat","bent","via","net","willen","komen","jullie","nou","maakt","komt","hoor","laten","vind"))
## Plot the word clouds
plot_wordcloud_per_com_what <- plot_wordcloud(wordclouds = wordcloud_per_com_what,
color_index = select(colors_what,Community,Color) %>% distinct)
The plots of the word clouds for the tweets per community are shown below. The larger the words, the more frequently they are used in the tweets of the corresponding communities. Bright-colored words are relatively prevalent compared to the other communities.
plot_wordcloud_per_com_what
#> [[1]]
#>
#> [[2]]
#>
#> [[3]]
* In the conservative community, the words "rotterdam" and "racisme" appear most frequently, but the most prevalent words are "bilthoven", "boerenprotest" (i.e. "farmers' protest") and "groenlinks" (i.e. the Dutch green party). Qualitative analysis showed us that this refers to a protest by Dutch farmers at the National Health Institute (RIVM) in Bilthoven (according to the conservative community, the authorities upheld double standards in enforcing compliance with COVID measures when comparing the farmers' protests with BLM protests), and probably signifies rejection of the Dutch green party, due to their opinion on the farmers' protest, among others.
* What stands out in the word cloud of the tweets by the progressive community is the usage of English words. These users apparently tweet more often in English. This means that the community includes members that are more internationally oriented, or includes many English-speaking users (qualitative analysis showed us that both are true). Almost all of the most frequently used words relate in some way to racism and discrimination.
* The tweets of the members of the Rotterdam community do not contain many words that are not also frequently used by the members of the other communities.
First, we create an emoji cloud for all the tweets, regardless of communities:
* We create a corpus using the create_emojicorpus function.
* We prepare the emoji cloud using the prepare_emojicloud function.
* We plot the emoji cloud using the plot_emojicloud function.
## Create a corpus for the emoji cloud
corpus_emo_all_what <- create_emojicorpus(input = tweets,
text_name = "text")
## Create the emoji cloud
emojicloud_all_what <- prepare_emojicloud(corpus = corpus_emo_all_what,
emoji_name = "emoji",
n = 50,
min_count = 3)
## Plot the emoji cloud
plot_emojicloud_all_what <- plot_emojicloud(emojiclouds = emojicloud_all_what,
size = 1)
The emoji cloud for the whole network mostly shows emojis that express some sort of emotion (e.g.: anger, disgust, raised fist).
Emojicloud for all of the tweets
Next, we create an emoji cloud for the tweets of each community:
* We create a corpus using create_emojicorpus and group by community.
* We prepare the emoji clouds using prepare_emojicloud, where we group along the group variable in the corpus we just created.
* We plot the emoji clouds using plot_emojicloud; the resulting plots are exported into individual list entries.
## Create a corpus for the emoji clouds
corpus_emo_per_com_what <- create_emojicorpus(input = tweets,
text_name = "text",
grouping_id_name = "Community")
## Create the emoji clouds
emojicloud_per_com_what <- prepare_emojicloud(corpus = corpus_emo_per_com_what,
emoji_name = "emoji",
grouping_id_name = "group",
n = 30,
min_count = 1)
## Plot the emoji clouds
plot_emojicloud_per_com_what <- plot_emojicloud(emojiclouds = emojicloud_per_com_what)
The emoji clouds shown below display the most frequently used emojis in the different communities.
Emojicloud for the tweets in the conservative community
Emojicloud for the tweets in the Rotterdam community
Emojicloud for the tweets in the progressive community
There are some clear differences in the emoji clouds:
To create a semantic network of the tweets, we follow the same steps as for the user bios:
* We create a corpus using create_corpus.
* We create the semnet using semnet, setting the minimum correlation value at .25, the minimum occurrence of words to build the network around at n=14 (95th percentile), and the minimum occurrence of words to include around the central words at n=9 (92.5th percentile), and indicating that we only want to include clusters that comprise at least 3 words.
* We use mix_color_by_group to calculate the colors of the words so that they reflect prevalence across communities, and join the result with the semnet we have created.
* We load the semnet as a statnet object and plot the network using ggraph.
## Create a corpus for the semnet
corpus_semnet_what <- create_corpus(input = tweets,
text_name = "text",
grouping_id_name = "status_id")
## Derive the quantiles of the word counts
get_quantiles(corpus_semnet_what) %>% print(n=25)
#> # A tibble: 165 × 2
#> Count Quantile
#> <int> <dbl>
#> 1 1 58.7
#> 2 2 73.8
#> 3 3 80.6
#> 4 4 84.4
#> 5 5 87.4
#> 6 6 89.3
#> 7 7 90.7
#> 8 8 91.7
#> 9 9 92.6
#> 10 10 93.3
#> 11 11 93.9
#> 12 12 94.4
#> 13 13 94.9
#> 14 14 95.2
#> 15 15 95.5
#> 16 16 95.7
#> 17 17 96.0
#> 18 18 96.3
#> 19 19 96.5
#> 20 20 96.6
#> 21 21 96.7
#> 22 22 96.9
#> 23 23 96.9
#> 24 24 97.1
#> 25 25 97.2
#> # ℹ 140 more rows
## Create the semnet
semnet_what <- semnet(corpus = corpus_semnet_what,
word_name = "word",
grouping_id_name = "group",
cor_cut = .25,
count_cut = .95,
contextual_cut = .925,
min_ln = 3,
stop_lang = c("en","nl"),
edge_filt = TRUE,
connected_component = 3)
#> [1] "Filtering for noise and en"
#> [1] "Filtering for nl"
#> [1] "Found 10126 words"
#> [1] "Creating network around words that occur at least 11 times, which are 519 words"
#> [1] "Filtering context, too."
#> [1] "Including contextual words that occur at least 8 times."
#> [1] "Network has 370 edges. If high, consider filtering with edge_weight_flt."
#> [1] "The weakest edge between every two nodes is omitted."
#> [1] "Network now includes 185 edges."
#> [1] "The final network includes 163 words, that correlate at least 0.25"
#> [1] "Filtering connected component"
## Define the word colors for the semnet
semnet_colors_what <- corpus_semnet_what %>%
filter(word %in% semnet_what[[1]]$Id) %>%
left_join(.,colors_what,by = c("group"="status_id")) %>%
na.omit %>%
group_by(word,Color) %>%
summarize(coms=n_distinct(group),.groups = "drop") %>%
group_by(word) %>%
slice_max(order_by=coms,n=7,with_ties=FALSE) %>%
mutate(prop=coms/sum(coms)) %>%
group_by(Id=word) %>%
do(Color=mix_color_by_group(.$Color,length(.$coms),.$prop)) %>%
unnest(cols = "Color")
semnet_what[[1]] <- semnet_what[[1]] %>%
left_join(., semnet_colors_what)
## create the statnet of the semnet
statnet_what <- load_as_statnet(nodelist = semnet_what[[1]],
edgelist = semnet_what[[2]],
com_name = "Community",
color_name = "Color")
## Plot the semnet
semnet_plot_what <- ggraph(statnet_what,"nicely",maxiter=10000) +
geom_edge_link(aes(width=Weight,color=Color),alpha=0.2) +
geom_node_point(aes(fill=Color, size=Weight), shape=21, color="black", stroke=0.25) +
geom_node_text(aes(label = Label, size=Weight), color="black",family="FK Grotesk",
repel=T, min.segment.length=0.2,max.iter=50000,max.time=5) +
scale_fill_identity() +
scale_edge_color_identity() +
scale_size_continuous(range = c(2, 8)) +
scale_edge_width(range = c(0.5, 5)) +
theme_newmomentum_sn()
The plot below shows the semantic network of the tweets, where the colors are based on the community in which the word occurs. If a word is used in tweets of multiple communities, the colors are mixed proportionally. It can be seen in the semantic network that no red colored nodes are present, only green, blue and mixes of green/blue/red. Inspecting the components and their colors helps to understand which themes communities are bringing forth and/or which frames they apply to the subject matter.
semnet_plot_what
* The semantic network shows the conservative community's discontent ("van de pot gerukt") about the authorities' alleged double standards when enforcing COVID rules during a fireworks show ("vuurwerkshow"), the farmers' protests in Bilthoven ("boerenprotest", "bilthoven", "rivm") and the BLM protests.
* The members of the progressive community tweet mostly about subjects related to the Black Lives Matter movement (e.g. "george floyd", "killed", "police", "black lives matter") and manage to translate the BLM movement to the Dutch context (e.g. "slavernij", "excuses", "racismedebat", "gesprek nodig").
* One of the subjects discussed by both communities is a leaked WhatsApp group chat of police officers in Rotterdam in which racist comments were shared ("racisme", "appgroep", "politie").
This toolkit can be used to explore any kind of textual data by creating word clouds and/or semnets for all the text and/or by group. We use the toolkit alongside our Research and Data Infrastructure to gather social media data and explore networks.