Introduction

The New Momentum Foundation develops research tools and systems to analyze societal undercurrents in digital media networks. This walkthrough shows you how to use New Momentum’s Analysis Toolkit for R. We walk you through all of the package’s functions and show you how they can be used to:

understand who is involved in social media discourses about a specific theme or topic, and
understand what these groups are saying about that theme or topic.

When do you use the Analysis Toolkit for R?

In our workflow, the Analysis Toolkit for R is used after we collect and analyze data using New Momentum’s Research and Data Infrastructure. This infrastructure provides Infrastructure as Code to deploy data retrieval and processing systems in the Amazon cloud, and saves social media data to a graph database (Neo4j). The graph database allows for complex analyses using Neo4j’s Cypher query language and its Graph Data Science library.

About the data set

In this walkthrough, we work with a Twitter data set of Dutch tweets written between January 1st, 2020 and October 31st, 2020 that combine words like ‘discriminatie’ (i.e., ‘discrimination’), ‘racisme’ (i.e., ‘racism’) and ‘blm’ with words like ‘rotterdam’, ‘rdam’ and ‘010’ (i.e., the area code of Rotterdam). The data set also includes all the retweets, quotes and replies associated with these tweets. Furthermore, the set includes the bios of the users who published the tweets, plus all the users they follow (followees) and are followed by (followers). This is important information, as it shows how users are (indirectly) connected beyond the scope of Twitter debates about discrimination in the city of Rotterdam. Finally, the data set includes all the hashtags associated with the tweets.

In Neo4j, our data model looks as follows: [Figure: data model]

About the network

In this walkthrough, we explore discourses about discrimination in Rotterdam through the lens of a so-called ego network, in which the nodes represent Twitter users and the edges (i.e., connections) represent whether users follow each other.

In Neo4j, we used the Graph Data Science library’s implementation of the Louvain algorithm to detect communities in the ego network. We used Cypher queries to create and download a nodelist and an edgelist, which we visualized in the open source network visualization software Gephi.

This is an example of a minimal nodelist, comprising the id of each node, the community, and the number of tweets the user has written. [Figure: getting the nodelist]

This is an example of a minimal edgelist, comprising the ids of the source and target nodes (directed relations are drawn from the source to the target) and a weight column that reflects whether the source user follows the target user (weight = 1) and the number of interactions from the source user to the target user (weight > 1). [Figure: getting the edgelist]

In Gephi, we created the visualization below. [Figure: the network] Each node represents a Twitter user, and the lines, which are hardly visible because there are so many, represent whether users follow each other. Thicker lines also represent interactions. Large nodes with bright colors have written more tweets than other users. The colors stand for the communities that were detected.

By filtering on weight, the network is easily transformed into an interaction network (representing interactions between users):
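
Once the edgelist is loaded in R, the same weight filter can be sketched as follows. This is a minimal illustration on a made-up edgelist; the column names `Source`, `Target` and `Weight` are assumptions, as is the reading that every edge with a weight above 1 carries at least one interaction.

```r
## Hypothetical edgelist; column names and values are assumptions
edges <- data.frame(
  Source = c("1", "1", "2", "3"),
  Target = c("2", "3", "3", "1"),
  Weight = c(1, 4, 2, 1),
  stringsAsFactors = FALSE
)

## Keep only the edges that carry interactions (weight > 1);
## edges with weight == 1 are follow relations without interactions
interactions <- edges[edges$Weight > 1, ]
print(interactions)
```

Filtering in R rather than in Gephi keeps the step reproducible, at the cost of having to re-export the filtered edgelist.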

Digging deeper

The visuals are pretty, but do not tell us anything yet about:

Who inhabits the communities?
What are the communities saying about discrimination in Rotterdam?

To answer these questions, we use the Analysis Toolkit for R.

Using the Analysis Toolkit for R

Installation

First, we install the latest version of the Analysis Toolkit for R:

## install {remotes} if it is not installed yet
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

## install the toolkit from GitLab if it is not installed yet
if (!requireNamespace("newmomentum", quietly = TRUE)) {
  remotes::install_git("https://gitlab.com/new-momentum/analysis-toolkit-R.git")
}

Load required packages

Next, we load the required packages:

library(ggplot2)
library(ggraph)
library(newmomentum)
library(tidyverse)

Data

Next, we load our data. In this case, we have retrieved the data from Neo4j and stored them locally as csv files. To create a csv of users, we have used the following query:

Getting the nodelist

To create a csv of tweets, we have used the following query (note that we have capped the length of the text field for demonstration purposes):

Getting the nodelist

We load the csv files into R as tibbles:

users <- read_csv("data/users.csv",
                  col_types = list(col_character(),
                                   col_character(),
                                   col_integer(),
                                   col_integer()))

tweets_raw <- read_csv("data/tweets.csv",
                  col_types = list(col_character(),
                                   col_character(),
                                   col_integer(),
                                   col_logical()))

sample_n(users,10) %>% mutate(Community="[anonymized]") %>% print
#> # A tibble: 10 × 4
#>    Id    description                                          Community n_tweets
#>    <chr> <chr>                                                <chr>        <int>
#>  1 31491 Freelance mediajournalist - medianieuws oa Totaal T… [anonymi…        0
#>  2 11836 Laat Zwarte Piet niet verloren gaan!                 [anonymi…        0
#>  3 34359 I am a Alpha Male and I do Alpha Male things!   Spa… [anonymi…        0
#>  4 14778 Tweeting about #social injustice. Freelance #digita… [anonymi…        0
#>  5 8313  Works in finance & marketing sector. I'm half Indon… [anonymi…        0
#>  6 45026 <NA>                                                 [anonymi…        0
#>  7 494   radiomaker, stukjesschrijver en Zeeuw. (zij/haar)    [anonymi…        0
#>  8 2574  Carpe diem , zelfstandig ondernemer  , recht,svaard… [anonymi…        0
#>  9 44777 <NA>                                                 [anonymi…        0
#> 10 17178 <NA>                                                 [anonymi…        0
sample_n(tweets_raw,10) %>% mutate(text=str_replace_all(text,"@([0-9]+)?[aA-zZ]+([0-9]+)? ","[anonymized] "))
#> # A tibble: 10 × 4
#>    status_id text                                           Community is_retweet
#>    <chr>     <chr>                                              <int> <lgl>     
#>  1 210       "@Rotterdam'n racist heeft mijn leven geruïne…         2 FALSE     
#>  2 883       "[anonymized] [anonymized] @D66Rotterdam [ano…         2 FALSE     
#>  3 2337      "Vandaag eert Frankrijk #SamuelPaty.\n\nHij s…         4 FALSE     
#>  4 155       "Chinezen zijn discriminatie zat en praten me…         4 FALSE     
#>  5 1213      "From today on, any White researcher calls me…         2 FALSE     
#>  6 889       "@J_0_0_S_T [anonymized] [anonymized] Maurice…         2 FALSE     
#>  7 645       "Rotterdam police investigating racism in cop…         1 FALSE     
#>  8 896       "[anonymized] [anonymized] [anonymized] Blijf…         1 FALSE     
#>  9 1802      "[anonymized] Racism: have you tried not talk…         2 FALSE     
#> 10 1934      "[anonymized] [anonymized] [anonymized] [anon…         2 FALSE
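
The pattern used for the printed sample leaves some screen names in place (for example `@D66Rotterdam` in the output above). As a hedged alternative, the sketch below replaces every `@` followed by letters, digits or underscores. The example tweets are made up and the pattern is our assumption, not part of the toolkit.

```r
## Made-up example tweets, not taken from the data set
example_tweets <- c("@D66Rotterdam dit klopt niet",
                    "Eens met @J_0_0_S_T en @user123!",
                    "Geen schermnaam in deze tweet")

## Replace each screen name (letters, digits and underscores after an @)
anonymized <- gsub("@[A-Za-z0-9_]+", "[anonymized]", example_tweets)
print(anonymized)
```

Because the pattern does not require a trailing space, it also catches screen names that are followed by punctuation or that end the tweet.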

Who are the Twitter users in the communities?

To get a clear picture of who inhabits the communities in the network, we analyze the user bios in several steps:

We examine the bios by creating a word cloud of all the descriptions.
We create word clouds for each community, shedding light on the distinctive character of each of the communities.
We do the same for the emojis we find in the user bios.
We create a semantic network for all the user bios. Semantic networks can provide valuable insights into the relationships between the words mentioned in the user bios.

Following patterns using qualitative analysis

Note that we recommend using these tools to inform further qualitative analysis. Word clouds and semantic networks are excellent tools to identify larger patterns in corpora of text, but what these patterns mean often strongly depends on the context. Neo4j makes it easy to retrieve bios and/or tweets that include specific words or word combinations. For example:

Getting a list of users whose bios match one or more words

Pattern matching in Neo4j
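
When the bios are already loaded in R, a comparable lookup can be sketched with base R. This is an illustration on made-up bios, not the Neo4j query from our workflow; the search words are arbitrary examples.

```r
## Made-up bios; in the walkthrough the bios live in users$description
bios <- data.frame(
  Id = c("1", "2", "3"),
  description = c("Journalist en columnist", "Trotse Rotterdammer", NA),
  stringsAsFactors = FALSE
)

## Words to look for, joined into one case-insensitive regular expression
words <- c("journalist", "auteur")
pattern <- paste(words, collapse = "|")

## grepl() returns FALSE for missing bios, so those users are dropped
matches <- bios[grepl(pattern, bios$description, ignore.case = TRUE), ]
print(matches)
```

Reading the matching bios in full is what turns a pattern in a word cloud into an interpretation.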

About word clouds

A word cloud is a visual representation that is often used to summarize large amounts of text. Word clouds provide a quick overview of the main themes and topics. When applied to a set of user bios, word clouds can give a clear impression of how users describe themselves. When plotted for each community, word clouds can shed light on what sets the communities apart.

The size of the words is based on their frequency. The same is true for when word clouds are plotted for each community. However, in that case, the color intensity of the words expresses the relative prevalence of the word in comparison with the other communities. In this way, words that are more typical of user bios in a specific community are highlighted.
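
To make the two quantities concrete, the sketch below computes a frequency and a relative prevalence per word and community on made-up data. The prevalence formula (a word's share within a community, divided by the sum of its shares across communities) is an assumption chosen for illustration; the toolkit may calculate it differently.

```r
## Made-up tokens: one row per word occurrence, per community
tokens <- data.frame(
  group = c("A", "A", "A", "A", "B", "B", "B", "B"),
  word  = c("politiek", "politiek", "rotterdam", "sport",
            "politiek", "rotterdam", "rotterdam", "rotterdam"),
  stringsAsFactors = FALSE
)

## Frequency of each word within each community (would drive word size)
counts <- as.data.frame(table(group = tokens$group, word = tokens$word),
                        stringsAsFactors = FALSE)
names(counts)[3] <- "n"

## Share of each word within its community
counts$share <- counts$n / ave(counts$n, counts$group, FUN = sum)

## Relative prevalence: a word's share in one community compared with its
## shares in all communities (would drive color intensity)
counts$prevalence <- counts$share / ave(counts$share, counts$word, FUN = sum)
print(counts[order(counts$group, -counts$n), ])
```

In this toy example "sport" occurs only in community A, so its prevalence there is 1, even though it is one of the least frequent words.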

Plotting a word cloud for all user bios

First, we create a word cloud for all the user bios, regardless of communities:

We create a so-called ‘corpus’ using the create_corpus function
We use the corpus to prepare a tibble with the data required for visualization with the prepare_wordcloud function.
We plot the word cloud with the plot_wordcloud function

## Create a corpus for the word cloud 
corpus_all_who <- create_corpus(input = users,
                            text_name = "description")

## Create the word cloud
wordcloud_all_who <- prepare_wordcloud(corpus = corpus_all_who,
                                   word_name = "word",
                                   stop_lang = c("en","nl"),
                                   n = 35,
                                   min_count = 3,
                                   min_ln = 3)

## Plot the word cloud
plot_wordcloud_all_who <- plot_wordcloud(wordclouds = wordcloud_all_who)

The figure below shows the word cloud for all user bios. The most frequently used words are the largest. These words are “politiek” (i.e. “politics”), “rotterdam” and “journalist”. However, we also see some contrasting word pairs, like “pro” and “anti”, “rechts” (i.e. “right-wing”) and “links” (i.e. “left-wing”), and “rotterdam” and “amsterdam”.

plot_wordcloud_all_who

Plotting word clouds for each community

Although the word cloud for the whole network says something about the Twitter users, it is still unclear how to characterize the users in the different communities. This is best examined by creating a word cloud of the user descriptions per community.

In the Gephi visualization, the nodes are colored according to the communities they belong to. For that, we used the function color_nodelist. For demonstration purposes, we re-color our tibble of users with color_nodelist so that the colors of the word clouds correspond with the colors of the network.
We create a corpus using create_corpus, grouping by the community variable.
We prepare a tibble with the data required to plot each of the word clouds using prepare_wordcloud, grouping by the group variable in the corpus we just created.
We plot the clouds using plot_wordcloud, which returns the plots as individual list entries.


## Setting the colors for each of the users based on their respective community
colors_who <- color_nodelist(input = users,
                          com_name = "Community",
                          sort = TRUE,
                          normalize_act_level = TRUE,
                          bg_color = "#ffffff") %>%
   select(Id,Community,Color) %>%
  mutate(Color = replace(Color, Color == "#426274", "#0868ac"))


## Create a corpus for the word clouds, where we group by community
corpus_per_com_who <- create_corpus(input = users,
                                text_name = "description",
                                grouping_id_name = "Community")

## Create the word clouds
wordcloud_per_com_who <- prepare_wordcloud(corpus = corpus_per_com_who,
                                       word_name = "word",
                                       grouping_id_name = "group",
                                       stop_lang = c("en","nl"),
                                       min_ln = 3,
                                       n = 50,
                                       min_count = 2)

## Plot the word clouds
plot_wordcloud_per_com_who <- plot_wordcloud(wordclouds = wordcloud_per_com_who,
                                         color_index = select(colors_who,Community,Color) %>% distinct)

Below, the word clouds of the user bios for each community are shown. The most frequently used words are the largest, while the more relatively prevalent words have brighter colors.

plot_wordcloud_per_com_who[[1]]

plot_wordcloud_per_com_who[[2]]

plot_wordcloud_per_com_who[[3]]

The green community can be described as a ‘conservative’ community. This is based on words like “trump”, “nexit” (i.e. the Dutch version of Brexit) and “rechts” (i.e. “right-wing”). Moreover, the conservative political parties PVV and FVD appear to be mentioned more frequently in the user bios in this community than in the other communities.
The blue community can be described as a ‘progressive’ community, due to the presence and relative prevalence of words like “social” and “groenlinks” (i.e. the Dutch green party). There is also a clear link with print media, e.g. see “journalist”, “columnist”, “boek” (i.e. “book”) and “author”.
The red community can be described as the Rotterdam community. “Rotterdam” is the most frequently used term and is one of the relatively most prevalent words in this word cloud. Furthermore, the users in this community use words like “politie” and “gemeente” (i.e. “police” and “municipality”) relatively more often than the users in the other communities. However, we also see that the user bios in this community mostly refer to personal and/or professional life (rather than ideology), for example by mentioning their profession, e.g. “manager” and “adviseur” (i.e. “consultant”), their family situation, e.g. “getrouwd” and “moeder” (i.e. “married” and “mother”), or basic interests, e.g. “sport” and “muziek” (i.e. “music”).

Based on this analysis, we can label the communities in the network as follows: conservative (green), progressive (blue) and Rotterdam (red).

About emojiclouds

Emojis provide a way to express ideas and feelings, or to signal sarcasm or other kinds of humor. Therefore, it can be useful to examine which emojis are used most frequently.

Plotting emojiclouds for all user bios

First, we create an emoji cloud for all the user bios, regardless of communities:

We create a corpus using the create_emojicorpus function
We use the corpus to prepare a tibble with the data required for visualization with the prepare_emojicloud function.
We plot the emoji cloud with the plot_emojicloud function

## Create a corpus for the emoji cloud
corpus_emo_all_who <- create_emojicorpus(input = users,
                                     text_name = "description")

## Create the emoji cloud
emojicloud_all_who <- prepare_emojicloud(corpus = corpus_emo_all_who,
                                     emoji_name = "emoji",
                                     n = 50,
                                     min_count = 3)

## Plot the emoji cloud
plot_emojicloud_all_who <- plot_emojicloud(emojiclouds = emojicloud_all_who,
                                       size = 1)

Below, the emoji cloud for the user bios of all the users in the network is shown. The larger the emoji, the more frequently the emoji is used in the bios of the users.

We see the flags of many different countries, such as the Netherlands (“NL”), the United States (“US”) and Israel (“IL”).
We see emojis that express love, such as hearts in many colors, and the rainbow flag, which represents the LGBTQ+ community.
We also see the emoji of a red cross, which can be interpreted as being against something.

Emojicloud for all of the user bios

Plotting emojiclouds for each community

To understand how the communities compare to each other in terms of the emojis they use in their bios, we plot a cloud for each community.

We create a corpus using create_emojicorpus, grouping by community.
We prepare a tibble with the data required to plot each of the emoji clouds using prepare_emojicloud, grouping by the group variable in the corpus we just created.
We plot the clouds using plot_emojicloud, which returns the plots as individual list entries.


## Create a corpus for the emoji clouds, grouping by community
corpus_emo_per_com_who <- create_emojicorpus(input = users,
                                         text_name = "description",
                                         grouping_id_name = "Community")

## Create the emoji clouds, grouping by group
emojicloud_per_com_who <- prepare_emojicloud(corpus = corpus_emo_per_com_who,
                                         emoji_name = "emoji",
                                         grouping_id_name = "group",
                                         n = 30,
                                         min_count = 3)

## Plot the emoji clouds
plot_emojicloud_per_com_who <- plot_emojicloud(emojiclouds = emojicloud_per_com_who)

Below, the emoji clouds for the user bios of each community are shown. Here, some differences can be seen in the usage of emojis between the users of the communities.

Emojis that stand out in the first emoji cloud (conservative community) are, for example, the tractor, representing solidarity with the farmers in the Netherlands, and the cross and the stop sign, implying that the users in this community are mostly against something.
The second emoji cloud shows the most frequently used emojis in the progressive community. What stands out is the size of the rainbow flags and the frequent usage of the European flag (“EU”) and the globe emojis, implying that the users in this community do not only characterize themselves as someone from a particular country, but also as citizens of Europe and the world.
The last emoji cloud shows which emojis occur the most in the bios of users in the Rotterdam community. The emojis mostly highlight the interests and hobbies of these users, such as a dog, a camera, books and a football.

Emojicloud for the conservative community

Emojicloud for the progressive community

Emojicloud for the Rotterdam community

To conclude: word clouds are very helpful in getting a clear image of the nature of different communities, while emoji clouds help to understand the communities further. However, qualitative analysis of the user bios is needed to understand the meaning of the emojis, because emojis can be meant and interpreted in different ways. In our workflow, Neo4j makes it easy to retrieve bios that include specific words or word combinations.

Plotting a semantic network for the user bios

A semantic network (semnet) is a representation of the relationships between words in a corpus. It consists of nodes and edges, where nodes represent words and edges represent relationships between the words. A semnet makes it possible to organize and represent complex relationships between words, which can be difficult to express using methods such as lists, tables or word clouds.

We use semnets to draw maps of the words that occur most frequently and draw edges between words that co-occur relatively frequently in individual texts (user bios in this section, tweets later on). Furthermore, we color the words according to the community in which they are most prevalent. This means, for example, that words that exclusively occur in the conservative community are colored green, whereas words that occur evenly across the communities are colored with a mix of green, red and blue. Note that we work with threshold values: words are only included in the semantic network if (a) they occur a certain number of times, and (b) they have a relationship of a certain strength with another word.
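
The idea of drawing an edge between words that co-occur relatively often can be sketched in base R. The example below assumes that the strength of a relationship is the phi coefficient of two words' presence across texts; this is our assumption for illustration, and the toolkit's semnet function may measure correlation differently. The bios are made up.

```r
## Made-up bios, already split into words
docs <- list(
  u1 = c("anti", "islam", "pro", "nexit"),
  u2 = c("anti", "islam"),
  u3 = c("pro", "nexit"),
  u4 = c("journalist", "columnist"),
  u5 = c("anti", "islam", "journalist")
)

## Binary document-term matrix: 1 if the word occurs in the bio, else 0
vocab <- sort(unique(unlist(docs)))
dtm <- t(sapply(docs, function(words) as.integer(vocab %in% words)))
colnames(dtm) <- vocab

## The Pearson correlation of two binary columns equals the phi coefficient
phi <- cor(dtm)

## Keep each word pair once, if its correlation reaches the threshold
cor_cut <- 0.2
pairs <- which(phi >= cor_cut & upper.tri(phi), arr.ind = TRUE)
edges <- data.frame(
  Source = vocab[pairs[, "row"]],
  Target = vocab[pairs[, "col"]],
  Weight = round(phi[pairs], 2),
  stringsAsFactors = FALSE
)
print(edges)
```

Words that always appear together, like "anti" and "islam" in this toy corpus, get the maximum correlation of 1, while pairs below the threshold produce no edge at all.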

We create a corpus using create_corpus, grouping by user id.
To determine the threshold values, we inspect the percentiles of the word counts and decide where to cut off.
We prepare the data required to plot the semnet with semnet. We set the minimum correlation to .2, the minimum occurrence of the words to build the network around to the 97.5th percentile (n=35), and the minimum occurrence of the words to include around these central words to the 95th percentile (n=15), and we indicate that we only want to include clusters that comprise at least 3 words.
We use mix_color_by_group to calculate the colors of the words so that they reflect prevalence across communities, and join the result with the semnet we have created.
We load the network into memory as a statnet object and plot the network using ggraph.
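
The percentile inspection in the second step can be illustrated with base R. The sketch assumes that the Quantile column reports the percentage of distinct words that occur at most Count times; that reading is our assumption, and the word counts are made up.

```r
## Made-up word counts (number of occurrences per distinct word)
word_counts <- c(politiek = 40, rotterdam = 35, journalist = 12,
                 sport = 3, muziek = 2, boek = 1, fiets = 1, koffie = 1)

## Empirical cumulative distribution of the counts
cum_share <- ecdf(word_counts)

## For each observed count: percentage of words occurring at most that often
quantiles <- data.frame(Count = sort(unique(word_counts)))
quantiles$Quantile <- round(100 * cum_share(quantiles$Count), 1)
print(quantiles)

## Smallest count that reaches a chosen percentile, here the 75th
cutoff <- unname(quantile(word_counts, probs = 0.75, type = 1))
print(cutoff)
```

Because word counts are heavily skewed, a small change in the percentile can shift the cut-off by many occurrences, which is why it helps to inspect the table before choosing the thresholds.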

## Create a corpus for the semnet
corpus_semnet_who <- create_corpus(input = users,
                               text_name = "description",
                               grouping_id_name = "Id")

## Derive the quantiles for the word count
get_quantiles(corpus_semnet_who) %>% print(n=50)
#> # A tibble: 449 × 2
#>    Count Quantile
#>    <int>    <dbl>
#>  1     1     66.6
#>  2     2     78.1
#>  3     3     83.3
#>  4     4     86.2
#>  5     5     88.3
#>  6     6     89.7
#>  7     7     90.8
#>  8     8     91.6
#>  9     9     92.4
#> 10    10     93  
#> 11    11     93.5
#> 12    12     94.0
#> 13    13     94.4
#> 14    14     94.7
#> 15    15     95.0
#> 16    16     95.3
#> 17    17     95.5
#> 18    18     95.7
#> 19    19     95.9
#> 20    20     96.0
#> 21    21     96.2
#> 22    22     96.3
#> 23    23     96.5
#> 24    24     96.6
#> 25    25     96.7
#> 26    26     96.8
#> 27    27     96.9
#> 28    28     97.0
#> 29    29     97.1
#> 30    30     97.2
#> 31    31     97.3
#> 32    32     97.4
#> 33    33     97.4
#> 34    34     97.5
#> 35    35     97.5
#> 36    36     97.6
#> 37    37     97.6
#> 38    38     97.7
#> 39    39     97.7
#> 40    40     97.8
#> 41    41     97.8
#> 42    42     97.9
#> 43    43     97.9
#> 44    44     98.0
#> 45    45     98.0
#> 46    46     98.0
#> 47    47     98.1
#> 48    48     98.1
#> 49    49     98.1
#> 50    50     98.2
#> # ℹ 399 more rows

## Create the semnet
semnet_who <- semnet(corpus = corpus_semnet_who,
                 word_name = "word",
                 grouping_id_name = "group",
                 cor_cut = .2,
                 count_cut = .975,
                 contextual_cut = .95,
                 min_ln = 3,
                 stop_lang = c("en","nl"),
                 edge_filt = TRUE,
                 connected_component = 3)
#> [1] "Filtering for noise and en"
#> [1] "Filtering for nl"
#> [1] "Found 72494 words"
#> [1] "Creating network around words that occur at least 32 times, which are 1861 words"
#> [1] "Filtering context, too."
#> [1] "Including contextual words that occur at least 15 times."
#> [1] "Network has 764 edges. If high, consider filtering with edge_weight_flt."
#> [1] "The weakest edge between every two nodes is omitted."
#> [1] "Network now includes 382 edges."
#> [1] "The final network includes 552 words, that correlate at least 0.2"
#> [1] "Filtering connected component"

## Determine the word colors in the semnet
semnet_colors_who <- corpus_semnet_who %>%
  filter(word %in% semnet_who[[1]]$Id) %>% 
  left_join(.,colors_who,by = c("group"="Id")) %>% 
  na.omit %>% 
  group_by(word,Color) %>%
  summarize(coms=n_distinct(group),.groups = "drop") %>%
  group_by(word) %>%
  slice_max(order_by=coms,n=7,with_ties=FALSE) %>%
  mutate(prop=coms/sum(coms)) %>%
  group_by(Id=word) %>%
  do(Color=mix_color_by_group(.$Color,length(.$coms),.$prop)) %>% 
  unnest(cols = "Color")

semnet_who[[1]] <- semnet_who[[1]] %>%
  left_join(., semnet_colors_who)

## Create the statnet of the semnet
statnet_who <- load_as_statnet(nodelist = semnet_who[[1]],
                           edgelist = semnet_who[[2]],
                           com_name = "Community",
                           color_name = "Color")

## Plot the semnet
semnet_plot_who <- ggraph(statnet_who,"nicely",maxiter=10000) +
  geom_edge_link(aes(width=Weight,color=Color),alpha=0.2) +
  geom_node_point(aes(fill=Color, size=Weight), shape=21, color="black", stroke=0.25) +
  geom_node_text(aes(label = Label, size=Weight), color="black",family="FK Grotesk",
                 repel=T, min.segment.length=0.2,max.iter=50000,max.time=5) +
  scale_fill_identity() +
  scale_edge_color_identity() +
  scale_size_continuous(range = c(2, 8)) +
  scale_edge_width(range = c(0.5, 5)) +
  theme_newmomentum_sn()

Below, the plot of the semantic network for the user bios is shown. The larger the words, the more frequently these words appear in the descriptions. The width of the edges represents the correlation between the words. If the edge between two nodes is relatively thick, these words are frequently used in combination with each other. Analyzing a semantic network helps to find relationships between words in a large corpus, but we recommend following up these patterns with qualitative analysis to understand what they mean. The components (i.e. groups of connected words) that stand out are mostly the ones in one of the colors also used for the word clouds, since these components typically say something about the users in one of the communities.

semnet_plot_who

For the conservative community (green), a few components emerge. The component with “pro”, “anti” and “islam” stands out in particular. It is important to notice that there is a strong connection between “pro” and “anti”, and between “anti” and “islam”, but not between “pro” and “islam”. This implies that the word “islam” is mostly used in combination with “anti” in the conservative community. Furthermore, the words “zwart”, “wit” and “blank” (i.e. “black”, “white” and “white”) are often used in combination with each other in this community, and the users in this community tend to mention their right to “freedom” of “speech” in their descriptions. The users also frequently mention the leaders of the conservative parties PVV and FVD, i.e. Geert Wilders and Thierry Baudet.
For the progressive community (blue), “black lives matter” is stated frequently in the bios. Besides, the users from this community often mention job titles like “european parliament member”, “(assistant) university professor” and “tweede kamerlid” (i.e. “member of the Dutch House of Representatives”). This implies a certain level of education and profession.
Only one component appears for the Rotterdam community (red), which includes words concerning emergency services, such as “politie” (i.e. “police”), “meldingen” (i.e. “reports”), and “112” and “0900-8844”, the Dutch emergency number and the non-emergency police number.
Besides these community-specific components, multiple components have mixed colors. The words in these components are used by users from multiple communities. These components are mostly general combinations of words, like “lekker eten” (i.e. “good food”), which everybody likes.
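
How a mixed color can arise from community colors is sketched below. The helper `mix_colors` is hypothetical and simply takes a weighted average in RGB space; it is not the toolkit's mix_color_by_group, whose internals are not shown in this walkthrough.

```r
## Hypothetical helper: mix hex colors as a weighted average in RGB space
mix_colors <- function(colors, weights) {
  weights <- weights / sum(weights)       # normalize the weights to sum to 1
  rgb_values <- col2rgb(colors)           # 3 x n matrix (red, green, blue)
  mixed <- round(rgb_values %*% weights)  # weighted average per channel
  rgb(mixed[1], mixed[2], mixed[3], maxColorValue = 255)
}

## A word that occurs in only one community keeps that community's color
print(mix_colors(c("#ff0000", "#0000ff"), c(1, 0)))

## A word that occurs mostly in the second community leans towards its color
print(mix_colors(c("#ff0000", "#0000ff"), c(0.25, 0.75)))
```

Averaging in RGB space is the simplest option; mixing in a perceptual color space would give visually smoother blends, at the cost of an extra dependency.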

What are the Twitter users saying?

Now that we have a clear picture of who the users in the network are, we can start analyzing what they are saying. We do this by examining the tweets, following the same approach that we used for the user bios: creating word clouds and semantic networks to identify patterns, and following those up with qualitative analysis.

Plotting a word cloud for all the tweets

First, we create a word cloud for all the tweets, regardless of communities:

We create a so-called ‘corpus’ using the create_corpus function
We use the corpus to prepare a tibble with the data required for visualization with the prepare_wordcloud function. Using the function’s arguments, we filter for stop words (stop_lang) and some words we have manually selected (filt).
We plot the word cloud with the plot_wordcloud function

## Remove URLs and screen names from the tweets
tweets <- tweets_raw %>% 
  mutate(text = gsub("https?://\\S+\\s?","", text)) %>%
  mutate(text = gsub("@\\S+\\s?", "", text))

## Create a corpus for the word cloud
corpus_all_what <- create_corpus(input = tweets,
                            text_name = "text")

## Create the word cloud
wordcloud_all_what <- prepare_wordcloud(corpus = corpus_all_what,
                                   word_name = "word",
                                   stop_lang = c("en","nl"),
                                   min_ln = 3,
                                   n = 30,
                                   min_count = 2,
                                   filt = c("jij","wel","goed","echt","gaan","amp","even","gaat","moet","alleen","moeten","gewoon","maken","alle","zie","waarom","weer","mag","weet","mee","denk","wij","heel","zegt","laat","bent","via","net","willen","komen"))

## Plot the word cloud
plot_wordcloud_all_what <- plot_wordcloud(wordclouds = wordcloud_all_what)

Due to the search queries described in the data section, many tweets contain words like “racisme” and “discriminatie” (i.e. “racism” and “discrimination”), together with words like “Rotterdam”. We see this reflected in the word cloud of the whole data set. Almost all of the most frequently used words have something to do with racism and the Black Lives Matter movement.

plot_wordcloud_all_what

Plotting word clouds for the tweets in each community

Although the previous word cloud says something about the content of all the tweets, it is more interesting to examine differences across communities. In this section, we create a word cloud of the tweets per community to get a clear picture of the different opinions and statements.

In the Gephi visualization, the nodes are colored according to the communities they belong to. We have previously colored the tibble of users to reflect the communities’ colors. We join our tibble of tweets with the tibble of users so that the tweets reflect the communities too.
We create a corpus using create_corpus, grouping by the community variable.
We prepare a tibble with the data required to plot each of the word clouds using prepare_wordcloud, grouping by the group variable in the corpus we just created.
We plot the clouds using plot_wordcloud, which returns the plots as individual list entries.


## Define the word colors of the word clouds
colors_what <- tweets %>%
  left_join(.,select(colors_who,Community,Color) %>% unique, by = c("Community" = "Community")) %>%
  select(status_id,Community,Color)

## Create a corpus for the word clouds
corpus_per_com_what <- create_corpus(input = tweets,
                                 text_name = "text",
                                 grouping_id_name = "Community")

## Create the word clouds
wordcloud_per_com_what <- prepare_wordcloud(corpus = corpus_per_com_what,
                                         word_name = "word",
                                         grouping_id_name = "group",
                                         stop_lang = c("en","nl"),
                                         n = 50,
                                         min_count = 3,
                                         filt = c("jij","wel","goed","echt","gaan","amp","even","gaat","moet","alleen","moeten","gewoon","maken","alle","zie","waarom","weer","mag","weet","mee","denk","wij","heel","zegt","laat","bent","via","net","willen","komen","jullie","nou","maakt","komt","hoor","laten","vind"))

## Plot the word clouds
plot_wordcloud_per_com_what <- plot_wordcloud(wordclouds = wordcloud_per_com_what,
                                             color_index = select(colors_what,Community,Color) %>% distinct)

The plots of the word clouds for the tweets per community are shown below. The larger words are, the more frequently they are used in the tweets of the corresponding communities. Bright-colored words express the relative prevalence of words compared to the other communities.

plot_wordcloud_per_com_what
#> [[1]]

#> 
#> [[2]]

#> 
#> [[3]]

* In the conservative community, the words “rotterdam” and “racisme” appear most frequently, but the most prevalent words are “bilthoven”, “boerenprotest” (i.e., “farmers’ protest”) and “groenlinks” (i.e., the Dutch green party). Qualitative analysis showed us that this refers to a protest by Dutch farmers at the National Health Institute (RIVM) in Bilthoven (according to the conservative community, the authorities upheld double standards in enforcing compliance with COVID measures when comparing the farmer protests with the BLM protests), and probably signifies rejection of the Dutch green party, due to, among other things, its stance on the farmers’ protest.
* What stands out in the word cloud of the tweets by the progressive community is the usage of English words. These users apparently tweet more often in English. This suggests that the community includes members that are more internationally oriented, or includes many English-speaking users (qualitative analysis showed us that both are true). Almost all of the most frequently used words relate in some way to racism and discrimination.
* The tweets of the members of the Rotterdam community contain few words that are not also frequently used by the members of the other communities.
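
The distinction between frequent and prevalent words can be illustrated with a small calculation. The sketch below is one plausible definition of relative prevalence, assumed here for illustration; the numbers are made up and this is not necessarily how plot_wordcloud derives its colors.

```r
# Toy word counts per community (illustrative numbers only)
counts <- rbind(
  conservative = c(rotterdam = 40, racisme = 30, boerenprotest = 20),
  progressive  = c(rotterdam = 35, racisme = 45, boerenprotest = 1)
)

# Share of each community's words taken up by each word
share_within <- counts / rowSums(counts)

# Relative prevalence: a community's share divided by the summed shares
# across communities, so 0.5 means "used equally" when there are two communities
prevalence <- sweep(share_within, 2, colSums(share_within), "/")
round(prevalence["conservative", ], 2)
```

In this toy example “rotterdam” is the most frequent word in the conservative community, yet “boerenprotest” is far more prevalent, because the other community hardly uses it.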

Plotting emoji clouds for the whole network

First, we create an emoji cloud for all the tweets, regardless of communities:

We create a corpus using the create_emojicorpus function.
We use the corpus to prepare a tibble with the data required for visualization with the prepare_emojicloud function.
We plot the emoji cloud with the plot_emojicloud function.

## Create a corpus for the emoji cloud
corpus_emo_all_what <- create_emojicorpus(input = tweets,
                                      text_name = "text")

## Create the emoji cloud
emojicloud_all_what <- prepare_emojicloud(corpus = corpus_emo_all_what,
                                      emoji_name = "emoji",
                                      n = 50,
                                      min_count = 3)

## Plot the emoji cloud 
plot_emojicloud_all_what <- plot_emojicloud(emojiclouds = emojicloud_all_what,
                                        size = 1)

The emoji cloud for the whole network mostly shows emojis that express some sort of emotion (e.g.: anger, disgust, raised fist).
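
As a rough idea of what building an emoji corpus involves, the base R sketch below pulls emojis out of text by their Unicode code points and counts them. The helper extract_emojis and the two code point ranges are assumptions made for illustration; the toolkit's create_emojicorpus may work differently.

```r
# Toy tweets containing emojis, written with Unicode escapes
toy_text <- c("Genoeg! \U0001F621\U0001F621", "Solidariteit \u270A \U0001F621")

# Hypothetical helper: keep characters whose code points fall in two common
# emoji blocks (a simplification; it misses flags and multi-character sequences)
extract_emojis <- function(txt) {
  cp <- unlist(lapply(enc2utf8(txt), utf8ToInt))
  is_emoji <- (cp >= 0x1F300 & cp <= 0x1FAFF) | (cp >= 0x2600 & cp <= 0x27BF)
  intToUtf8(cp[is_emoji], multiple = TRUE)
}

# Frequency table of the extracted emojis, most frequent first
emoji_counts <- sort(table(extract_emojis(toy_text)), decreasing = TRUE)
```

The resulting counts play the same role as word counts in a word cloud: the more often an emoji occurs, the larger it would be drawn.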

Emojicloud for all of the tweets

Plotting emoji clouds per community

We create a corpus using create_emojicorpus and group by community.
We prepare a tibble with the data required to plot each of the emoji clouds using prepare_emojicloud, grouping by the group variable in the corpus we just created.
We plot the clouds using plot_emojicloud, which returns each cloud as a separate list entry.

## Create a corpus for the emoji clouds
corpus_emo_per_com_what <- create_emojicorpus(input = tweets,
                                          text_name = "text",
                                          grouping_id_name = "Community")

## Create the emoji clouds
emojicloud_per_com_what <- prepare_emojicloud(corpus = corpus_emo_per_com_what,
                                          emoji_name = "emoji",
                                          grouping_id_name = "group",
                                          n = 30,
                                          min_count = 1)
 
## Plot the emoji clouds
plot_emojicloud_per_com_what <- plot_emojicloud(emojiclouds = emojicloud_per_com_what)

The emoji clouds below show the emojis most frequently used by the users in the different communities.

Emojicloud for the tweets in the conservative community

Emojicloud for the tweets in the Rotterdam community

Emojicloud for the tweets in the progressive community

There are some clear differences in the emoji clouds:

The emoji cloud of the conservative community shows that many different emojis are used about equally often, with the pointing finger, disgusted face, tractor and menorah emojis being quite prominent.
The users in the progressive community mainly use the laughing emoji, along with multiple emojis referring to the Black Lives Matter movement.
The tweets in the Rotterdam community contain almost exclusively emojis referring to the police and to the Black Lives Matter movement.

Plotting a semantic network for the tweets

Since we’re interested in how words are combined within tweets, we create a corpus grouped by status_id using create_corpus.
To determine the threshold values, we inspect the quantiles of the word counts and decide at which percentile we want to cut off.
We prepare the data required to plot the semnet with semnet, setting the minimum correlation value at .25, the minimum occurrence of words to build the network around at the 95th percentile (n=14 in the quantile table), the minimum occurrence of words to include around the central words at the 92.5th percentile (n=9 in the quantile table), and indicating that we only want to include clusters that comprise at least 3 words. Note that the log printed by semnet reports cut-offs of 11 and 8 occurrences, which differ from the values in the quantile table.
We use mix_color_by_group to calculate the colors of the words so that they reflect prevalence across communities, and join the result with the semnet we have created.
We load the network into memory as a statnet object and plot the network using ggraph.
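
The quantile step above can be sketched in base R as follows. This assumes that a quantile here means the percentage of distinct words that occur at most a given number of times; the toy data and variable names are illustrative and not taken from get_quantiles.

```r
# Toy corpus: one element per word occurrence (illustrative data)
words <- c(rep("rotterdam", 5), rep("racisme", 3), rep("politie", 3),
           "excuses", "gesprek", "debat", "vuurwerk", "tractor")

# How often does each distinct word occur?
n_per_word <- as.integer(table(words))

# For each distinct count, the percentage of words that occur at most that often
count_levels <- sort(unique(n_per_word))
quantile_tbl <- data.frame(
  Count = count_levels,
  Quantile = sapply(count_levels, function(k) 100 * mean(n_per_word <= k))
)

# Smallest count that reaches a chosen percentile, e.g. the 75th
cut_75 <- min(quantile_tbl$Count[quantile_tbl$Quantile >= 75])
```

Reading such a table from top to bottom shows how quickly the vocabulary thins out: most words occur only once, so a high percentile already corresponds to a modest count.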

## Create a corpus for the semnet
corpus_semnet_what <- create_corpus(input = tweets,
                                text_name = "text",
                                grouping_id_name = "status_id")

## Derive the quantiles of the word counts
get_quantiles(corpus_semnet_what) %>% print(n=25)
#> # A tibble: 165 × 2
#>    Count Quantile
#>    <int>    <dbl>
#>  1     1     58.7
#>  2     2     73.8
#>  3     3     80.6
#>  4     4     84.4
#>  5     5     87.4
#>  6     6     89.3
#>  7     7     90.7
#>  8     8     91.7
#>  9     9     92.6
#> 10    10     93.3
#> 11    11     93.9
#> 12    12     94.4
#> 13    13     94.9
#> 14    14     95.2
#> 15    15     95.5
#> 16    16     95.7
#> 17    17     96.0
#> 18    18     96.3
#> 19    19     96.5
#> 20    20     96.6
#> 21    21     96.7
#> 22    22     96.9
#> 23    23     96.9
#> 24    24     97.1
#> 25    25     97.2
#> # ℹ 140 more rows

## Create the semnet
semnet_what <- semnet(corpus = corpus_semnet_what,
                  word_name = "word",
                  grouping_id_name = "group",
                  cor_cut = .25,
                  count_cut = .95,
                  contextual_cut = .925,
                  min_ln = 3,
                  stop_lang = c("en","nl"),
                  edge_filt = TRUE,
                  connected_component = 3)
#> [1] "Filtering for noise and en"
#> [1] "Filtering for nl"
#> [1] "Found 10126 words"
#> [1] "Creating network around words that occur at least 11 times, which are 519 words"
#> [1] "Filtering context, too."
#> [1] "Including contextual words that occur at least 8 times."
#> [1] "Network has 370 edges. If high, consider filtering with edge_weight_flt."
#> [1] "The weakest edge between every two nodes is omitted."
#> [1] "Network now includes 185 edges."
#> [1] "The final network includes 163 words, that correlate at least 0.25"
#> [1] "Filtering connected component"
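
The correlation cut-off can be illustrated with a small example. The sketch below assumes that word correlation means the phi coefficient between the occurrence patterns of two words across tweets, which is an assumption about semnet rather than something stated here; the toy tweets are made up.

```r
# Toy tweets, already tokenised: which words occur in which tweet (illustrative)
tweets_words <- list(
  t1 = c("politie", "appgroep", "racisme"),
  t2 = c("politie", "appgroep"),
  t3 = c("boerenprotest", "bilthoven"),
  t4 = c("boerenprotest", "bilthoven", "racisme"),
  t5 = c("racisme", "politie")
)

# Binary tweet-by-word matrix: 1 if the word occurs in the tweet
vocab <- sort(unique(unlist(tweets_words)))
dtm <- t(sapply(tweets_words, function(w) as.integer(vocab %in% w)))
colnames(dtm) <- vocab

# Pearson correlation on binary columns equals the phi coefficient
word_cor <- cor(dtm)

# Keep word pairs at or above the cut-off, listing each pair once
cor_cut <- 0.25
keep <- which(word_cor >= cor_cut & upper.tri(word_cor), arr.ind = TRUE)
edges <- data.frame(
  from = vocab[keep[, "row"]],
  to = vocab[keep[, "col"]],
  weight = word_cor[keep]
)
```

Words that always appear together (here “bilthoven” and “boerenprotest”) get a correlation of 1, while a word that is spread across many different tweets correlates only weakly with any single other word and drops out at the cut-off.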

## Define the word colors for the semnet
semnet_colors_what <- corpus_semnet_what %>%
  filter(word %in% semnet_what[[1]]$Id) %>% 
  left_join(.,colors_what,by = c("group"="status_id")) %>% 
  na.omit %>% 
  group_by(word,Color) %>%
  summarize(coms=n_distinct(group),.groups = "drop") %>%
  group_by(word) %>%
  slice_max(order_by=coms,n=7,with_ties=FALSE) %>%
  mutate(prop=coms/sum(coms)) %>%
  group_by(Id=word) %>%
  do(Color=mix_color_by_group(.$Color,length(.$coms),.$prop)) %>% 
  unnest(cols = "Color")

semnet_what[[1]] <- semnet_what[[1]] %>%
   left_join(., semnet_colors_what)

## create the statnet of the semnet
statnet_what <- load_as_statnet(nodelist = semnet_what[[1]],
                            edgelist = semnet_what[[2]],
                            com_name = "Community",
                            color_name = "Color")

## Plot the semnet
semnet_plot_what <- ggraph(statnet_what,"nicely",maxiter=10000) +
  geom_edge_link(aes(width=Weight,color=Color),alpha=0.2) +
  geom_node_point(aes(fill=Color, size=Weight), shape=21, color="black", stroke=0.25) +
  geom_node_text(aes(label = Label, size=Weight), color="black",family="FK Grotesk",
                 repel=TRUE, min.segment.length=0.2,max.iter=50000,max.time=5) +
  scale_fill_identity() +
  scale_edge_color_identity() +
  scale_size_continuous(range = c(2, 8)) +
  scale_edge_width(range = c(0.5, 5)) +
  theme_newmomentum_sn()

The plot below shows the semantic network of the tweets, where each word is colored according to the communities in whose tweets it occurs. If a word is used in tweets of multiple communities, the colors are mixed proportionally. The network contains no purely red nodes, only green, blue and mixed green/blue/red ones. Inspecting the components and their colors helps to understand which themes communities bring forward and which frames they apply to the subject matter.
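
Proportional color mixing can be sketched as a weighted average in RGB space. The helper mix_hex_colors below is hypothetical and only illustrates the idea; the toolkit's mix_color_by_group may use a different color model or weighting.

```r
# Hypothetical helper: weighted average of hex colors in RGB space
mix_hex_colors <- function(colors, weights) {
  weights <- weights / sum(weights)          # normalise to proportions
  rgb_mat <- grDevices::col2rgb(colors)      # 3 x k matrix of 0-255 values
  mixed <- round(rgb_mat %*% weights)        # weighted mean per channel
  grDevices::rgb(mixed[1], mixed[2], mixed[3], maxColorValue = 255)
}

# A word used in 3 tweets from a "green" community and 1 from a "blue" one
mixed_color <- mix_hex_colors(c("#00FF00", "#0000FF"), c(3, 1))
```

With these weights the result stays close to green with a blue tint, which is how a mixed node in the network signals that a word is shared but leans towards one community.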

semnet_plot_what

* The semantic network shows the conservative community’s discontent (“van de pot gerukt”) about the authorities’ alleged double standards when enforcing COVID rules during a fireworks show (“vuurwerkshow”), the farmer protests in Bilthoven (“boerenprotest”, “bilthoven”, “rivm”) and the BLM protests.
* The members of the progressive community tweet mostly about subjects related to the Black Lives Matter movement (e.g., “george floyd”, “killed”, “police”, “black lives matter”) and relate the BLM movement to the Dutch context (e.g., “slavernij”, “excuses”, “racismedebat”, “gesprek nodig”).
* One of the subjects discussed by both communities is a leaked WhatsApp group chat of police officers in Rotterdam in which racist comments were shared (“racisme”, “appgroep”, “politie”).

Using our toolkit to analyze other kinds of data

This toolkit can be used to explore any kind of textual data by creating word clouds and/or semnets for the text as a whole and/or per group. We use the toolkit alongside our Research and Data Infrastructure to gather social media data and explore networks.

How to use the Analysis Toolkit for R

Ludo Pfaff

Roel Lutkenhaus

2023-05-24