Vaccines have been hotly debated throughout the pandemic. Opinions differ on each type and brand of vaccine, from effectiveness to side effects and duration of protection. One of the most successful vaccines to date is Pfizer-BioNTech, which has been part of the controversy as well. After coming across a dataset of tweets about the Pfizer-BioNTech vaccine, I thought it would be interesting to analyze them with word networks to see which keywords and topics were discussed most and how they relate to each other.
First, we load the libraries needed in the following steps.
library(dplyr)
library(tidytext)
library(tidyverse)
library(tidyr)
library(ggplot2)
library(igraph)
## Warning: package 'igraph' was built under R version 4.1.3
library(ggraph)
## Warning: package 'ggraph' was built under R version 4.1.3
Now let’s read in a dataset of tweets about Pfizer or BioNTech. The dataset has 16 columns, so I keep only two of them: “text”, which contains the tweet itself, and “user_name”.
vaccination <- read_csv("data/vaccination_tweets.csv")
## Rows: 11020 Columns: 16
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (8): user_name, user_location, user_description, user_created, date, tex...
## dbl (6): id, user_followers, user_friends, user_favourites, retweets, favorites
## lgl (2): user_verified, is_retweet
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
vaccination <- select(vaccination,user_name,text)
# remove_words_from_text <- function(text) {
# text <- unlist(strsplit(tolower(text), " "))
# paste(text[!text %in% words_to_remove], collapse = " ")
# }
#
# words_to_remove <- stop_words$word
# vaccination$text <- lapply(vaccination$text, remove_words_from_text)
In this step, we will tokenize the tweets into bigrams, that is, pairs of consecutive words.
ct_bigrams <- vaccination %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
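To make the tokenization step concrete, here is a minimal sketch on a single made-up tweet. The data frame toy and its contents are hypothetical and not part of the dataset; only the unnest_tokens call mirrors the one above.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row example; the tweet text is made up for illustration
toy <- tibble(user_name = "example_user",
              text = "Got my first dose today")

# Same call as above: split the text into overlapping two-word sequences
toy_bigrams <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

toy_bigrams$bigram
# "got my" "my first" "first dose" "dose today"
```

A five-word tweet yields four overlapping bigrams, and the text is lowercased by default, which is why all bigrams in the counts below are lowercase.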
There are 191753 bigrams. Now let’s do a quick count to see common bigrams:
ct_bigrams %>%
  count(bigram, sort = TRUE)
## # A tibble: 86,214 x 2
## bigram n
## <chr> <int>
## 1 of the 1051
## 2 pfizerbiontech vaccine 981
## 3 covid 19 976
## 4 the pfizerbiontech 945
## 5 dose of 671
## 6 19 vaccine 512
## 7 of pfizerbiontech 446
## 8 pfizer biontech 439
## 9 first dose 422
## 10 covid19 vaccine 401
## # ... with 86,204 more rows
As we can see, the most common bigrams contain stop words such as “of” and “the”, so we need to remove them.
bigrams_separated <- ct_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
bigram_counts <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)
bigram_counts
## # A tibble: 34,586 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 pfizerbiontech vaccine 981
## 2 covid 19 976
## 3 19 vaccine 512
## 4 pfizer biontech 439
## 5 covid19 vaccine 401
## 6 covid vaccine 325
## 7 pfizerbiontech covid19 251
## 8 pfizer vaccine 236
## 9 2nd dose 218
## 10 pfizer pfizerbiontech 208
## # ... with 34,576 more rows
Now bigram_counts contains only bigrams in which neither word is a stop word.
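The effect of the filter can be checked on a handful of bigrams taken from the counts above. This is a small sketch; the data frame toy is a hypothetical stand-in for ct_bigrams. It also shows a side effect worth knowing about: “first” appears to be in stop_words, which would explain why “first dose” (422 occurrences) drops out of the filtered counts while “2nd dose” stays.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Five bigrams copied from the unfiltered counts shown earlier
toy <- tibble(bigram = c("of the", "dose of", "first dose",
                         "2nd dose", "pfizer vaccine"))

# Split each bigram and keep it only if neither word is a stop word
toy_filtered <- toy %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word)

toy_filtered
```

Only “2nd dose” and “pfizer vaccine” survive. If bigrams such as “first dose” matter for the analysis, one option is to filter against a custom subset of stop_words instead of the full list.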
We can also combine the separated words with unite:
bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")
bigrams_united
## # A tibble: 55,002 x 2
## user_name bigram
## <chr> <chr>
## 1 "Rachel Roh" daikon paste
## 2 "Rachel Roh" cytokine storm
## 3 "Rachel Roh" storm pfizerbiontech
## 4 "Rachel Roh" pfizerbiontech xehhi~
## 5 "Albert Fong" biggest vaccination
## 6 "Albert Fong" vaccination effort
## 7 "Albert Fong" ev dlchrzjkhm
## 8 "eli\U0001f1f1\U0001f1f9\U0001f1ea\U0001f1fa\U0001f44c" coronavirus sputnikv
## 9 "eli\U0001f1f1\U0001f1f9\U0001f1ea\U0001f1fa\U0001f44c" sputnikv astrazeneca
## 10 "eli\U0001f1f1\U0001f1f9\U0001f1ea\U0001f1fa\U0001f44c" astrazeneca pfizerbi~
## # ... with 54,992 more rows
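A network can be described by an edge list with three variables: from (the node an edge starts at), to (the node it points to), and weight (a numeric value attached to the edge). Our dataset (bigram_counts) maps onto these variables as follows: from is “word1”, to is “word2”, and weight is “n”.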
Let’s use graph_from_data_frame to make the transformation:
bigram_graph <- bigram_counts %>%
  graph_from_data_frame()
bigram_graph
## IGRAPH e50df3d DN-- 21259 34586 --
## + attr: name (v/c), n (e/n)
## + edges from e50df3d (vertex names):
## [1] pfizerbiontech->vaccine covid ->19
## [3] 19 ->vaccine pfizer ->biontech
## [5] covid19 ->vaccine covid ->vaccine
## [7] pfizerbiontech->covid19 pfizer ->vaccine
## [9] 2nd ->dose pfizer ->pfizerbiontech
## [11] pfizerbiontech->covidvaccine pfizervaccine ->pfizerbiontech
## [13] 1st ->dose pfizerbiontech->covid
## [15] covidvaccine ->pfizerbiontech vaccine ->pfizerbiontech
## + ... omitted several edges
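The mapping from data frame columns to graph elements is easier to see on a tiny example. The sketch below uses a hypothetical three-row data frame with made-up counts. graph_from_data_frame reads the first two columns as the start and end of each edge and stores any remaining columns as edge attributes, which is why the printed graph above lists n as an edge attribute.

```r
library(igraph)

# Hypothetical edge list; the counts are made up for illustration
toy_counts <- data.frame(
  word1 = c("pfizer", "2nd", "1st"),
  word2 = c("vaccine", "dose", "dose"),
  n     = c(3, 2, 1)
)

toy_graph <- graph_from_data_frame(toy_counts)

vcount(toy_graph)           # number of distinct words (nodes)
ecount(toy_graph)           # number of bigrams (edges)
edge_attr_names(toy_graph)  # extra columns become edge attributes
E(toy_graph)$n              # the counts, one per edge
```

The three bigrams share the word “dose”, so the graph has five nodes and three directed edges, each pointing from word1 to word2.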
Because there are so many distinct bigrams, I decided to drop the infrequent ones. After trying several thresholds, I kept only the bigrams that appear more than 40 times.
bigram_graph_filtered <- bigram_counts %>%
  filter(n > 40) %>%
  graph_from_data_frame()
bigram_graph_filtered
## IGRAPH e51a83d DN-- 43 59 --
## + attr: name (v/c), n (e/n)
## + edges from e51a83d (vertex names):
## [1] pfizerbiontech->vaccine covid ->19
## [3] 19 ->vaccine pfizer ->biontech
## [5] covid19 ->vaccine covid ->vaccine
## [7] pfizerbiontech->covid19 pfizer ->vaccine
## [9] 2nd ->dose pfizer ->pfizerbiontech
## [11] pfizerbiontech->covidvaccine pfizervaccine ->pfizerbiontech
## [13] 1st ->dose pfizerbiontech->covid
## [15] covidvaccine ->pfizerbiontech vaccine ->pfizerbiontech
## + ... omitted several edges
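One way to compare thresholds is to tabulate how many nodes and edges remain at each cutoff. The following is only a sketch on a hypothetical toy table with made-up counts, not a record of the values that were actually tried; on the real data, toy_counts would be replaced with bigram_counts and cutoffs with the thresholds of interest.

```r
library(dplyr)
library(igraph)

# Hypothetical bigram counts, made up for illustration
toy_counts <- tibble(
  word1 = c("covid", "pfizer", "2nd", "sore", "remember"),
  word2 = c("vaccine", "vaccine", "dose", "arm", "2021"),
  n     = c(50, 45, 30, 10, 2)
)

# Candidate cutoffs to compare (assumed values)
cutoffs <- c(1, 20, 40)

# For each cutoff, build the filtered graph and record its size
summary_tbl <- bind_rows(lapply(cutoffs, function(k) {
  g <- toy_counts %>%
    filter(n > k) %>%
    graph_from_data_frame()
  tibble(cutoff = k, nodes = vcount(g), edges = ecount(g))
}))

summary_tbl
```

A table like this makes the trade-off visible: a low cutoff gives a graph too dense to read, while a high one leaves only the most obvious pairs. On the real data, the cutoff of 40 leaves 43 nodes and 59 edges, as the output above shows.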
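# Fix the seed so the randomized force-directed layout is reproducible
set.seed(100)
# Closed arrowheads show the direction from word1 to word2
a <- grid::arrow(type = "closed", length = unit(.2, "inches"))
ggraph(bigram_graph_filtered, layout = "fr") +
  # Edge transparency follows the bigram count n
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "yellow", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()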
The results from the word network were far more interesting than I had anticipated. In the middle cluster, which is made up of the most frequent bigrams, the names of Pfizer’s main competitors appear: “Moderna Vaccine” and “AstraZeneca vaccine”. This shows that people mention these brands in their tweets about Pfizer, potentially to make comparisons or express a preference. Dose counts appear as well, which shows that people frequently tweet about the number of doses they have received or still need. Another interesting arrow is “vaccine approved”, which reflects public attention to the approval of vaccines.
In the other clusters, “sore arm” stands out: it has been tweeted often, indicating a strong focus on this side effect of the Pfizer vaccine. “Protective measures”, covering complementary tips about protecting against COVID and staying safe, also appears many times. Another combination that I really liked is “remember 2021”, which shows how the year 2021 will remain in people’s memory.
To conclude, word networks seem able to derive meaningful relationships from a corpus, which is highly useful for getting insights from a large amount of text.