This notebook documents my first natural language processing project, in which I scrape Twitter data on the war in Ukraine. I collect tweets using three different keywords at two points in time: at the very start of the conflict and 20 days into the war.
My interest is in evaluating how perceptions of the war differ according to exposure to media with different political orientations. Breitbart is clearly right-wing; CNN, in my opinion, is moderate, although it has been claimed to be both left- and right-leaning. I collected the first batch of tweets on February 24-25, representing the “natural state” of opinions; the second batch, collected on March 16, should reflect the effect of about three weeks of media coverage.
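The collection step itself is not shown in this notebook. As a reference, here is a minimal sketch of what it could look like with rtweet (an assumption consistent with the status_id columns in the saved data; the query, volume, and file name are illustrative, not the exact values I used):
library(rtweet)
## requires Twitter API credentials; the standard search endpoint only reaches
## back a few days, which is why each batch was saved to an .Rda file right away
ukraine_raw <- search_tweets("ukraine", n = 18000, include_rts = FALSE)
save(ukraine_raw, file = "ukraine_raw.Rda")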
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# GENERAL TWEETS FIRST DAYS ---------------------------------------------
load("/Users/renatorusso/Desktop/TLTLAB/Ukraine analysis/ukraine_feb_24.Rda")
ukr_early1 <- as_tibble(ukraine_feb_24_real)
# DATA CLEANING AND WRANGLING ---------------------------------------------
## creating a function to remove numbers and punctuation
## in preliminary analysis, I noticed that the prevalence of "ukraine" made it
## challenging to draw conclusions from the sentiment analysis, so I decided
## to remove it too. I also remove a few words that surfaced when I first ran
## later parts of this analysis
clean_text <- function(text) {
  text <- tolower(text)                    ## normalize case
  text <- gsub("[[:digit:]]+", "", text)   ## strip numbers
  text <- gsub("[[:punct:]]+", "", text)   ## strip punctuation
  text <- gsub("ukraine", "", text)        ## drop the dominant keyword
  return(text)
}
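A quick check on a made-up tweet shows what the function strips out (note the stray "s" left behind by "ukraine's"):
clean_text("BREAKING: 2 explosions reported in #Ukraine's capital!")
## [1] "breaking  explosions reported in s capital"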
## cleaning the text
ukr_early1$text <- clean_text(ukr_early1$text)
## tokenizing the words in the "text" column
tweets_token <- ukr_early1 %>%
tidytext::unnest_tokens(word, text) %>%
count(status_id, word)
tweets_token
## # A tibble: 297,750 x 3
## status_id word n
## <chr> <chr> <int>
## 1 1496680570637725697 a 1
## 2 1496680570637725697 address 1
## 3 1496680570637725697 after 1
## 4 1496680570637725697 all 1
## 5 1496680570637725697 an 2
## 6 1496680570637725697 assistance 1
## 7 1496680570637725697 cannot 1
## 8 1496680570637725697 collapse 1
## 9 1496680570637725697 constant 1
## 10 1496680570637725697 countries 1
## # … with 297,740 more rows
## removing stop words from tidytext's standard library
tweets_token <- tweets_token %>%
anti_join(tidytext::get_stopwords())
## Joining, by = "word"
## removing more stop words by creating a custom list of stop words that appeared
## when I first ran later parts of this analysis
my_stopwords_ukr <- tibble(word = c(as.character(1:10),
"like", "dont", "de", "la", "en",
"amp", "et", "le", "les", "I", "à", "des", "pas",
"guerre", "russie", "cest", "krieg", "breitbartnews",
"cnn", "msnbc", "foxnews"))
tweets_token <- tweets_token %>%
anti_join(my_stopwords_ukr)
## Joining, by = "word"
## creating a document-term matrix
DTM <- tweets_token %>%
tidytext::cast_dtm(status_id, word, n)
DTM
## <<DocumentTermMatrix (documents: 17950, terms: 34336)>>
## Non-/sparse entries: 196894/616134306
## Sparsity : 100%
## Maximal term length: 47
## Weighting : term frequency (tf)
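The matrix is extremely sparse, since most terms occur in only a handful of tweets. If model fitting became slow, one option would be to drop very rare terms first, for example with tm (a sketch; I did not do this here):
library(tm)
## keep only terms present in at least ~0.1% of tweets
DTM_trimmed <- removeSparseTerms(DTM, sparse = 0.999)
## caveat: trimming can leave some documents with no terms at all, which LDA() rejects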
# EXPLORATORY ANALYSIS ----------------------------------------------------
## looking at the most frequent words in the token counts
tweets_token %>%
group_by(word) %>%
summarize(occurrence = sum(n)) %>%
arrange(desc(occurrence))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 34,336 x 2
## word occurrence
## <chr> <int>
## 1 putin 5527
## 2 russia 4886
## 3 war 2461
## 4 military 2355
## 5 russian 1805
## 6 operation 1678
## 7 people 1496
## 8 just 1285
## 9 now 1250
## 10 us 1155
## # … with 34,326 more rows
# TOPIC MODELING ----------------------------------------------------------
library(topicmodels)
## running Latent Dirichlet Allocation (LDA)
LDA <- topicmodels::LDA(DTM, k = 4, control = list(seed = 123))
LDA_td <- tidytext::tidy(LDA)
LDA_td
## # A tibble: 137,344 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 address 1.64e- 19
## 2 2 address 3.90e- 30
## 3 3 address 2.66e- 3
## 4 4 address 1.88e- 23
## 5 1 assistance 6.80e- 18
## 6 2 assistance 2.17e-123
## 7 3 assistance 2.95e- 4
## 8 4 assistance 1.10e- 4
## 9 1 collapse 3.44e- 5
## 10 2 collapse 2.62e- 13
## # … with 137,334 more rows
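The choice of k = 4 is not tuned above. One way to sanity-check it would be to compare perplexity across candidate values of k (a sketch; fitting several models on this DTM is slow):
## lower perplexity indicates a better fit to the training data
ks <- c(2, 4, 6, 8)
perps <- sapply(ks, function(k) {
  m <- topicmodels::LDA(DTM, k = k, control = list(seed = 123))
  topicmodels::perplexity(m)
})
data.frame(k = ks, perplexity = perps)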
## visualizing topics using ggplot2 and tidytext
library(tidytext)
topTerms <- LDA_td %>%
group_by(topic) %>%
top_n(7, beta) %>%
arrange(topic, -beta)
theme_set(theme_bw())
topTerms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_x") +
coord_flip() +
scale_x_reordered() +
labs(title = "Topics in general tweets early in the war")
library(widyr)
word_pairs_ukr_early <- tweets_token %>%
pairwise_count(word, status_id, sort = TRUE, upper = FALSE)
## Warning: `distinct_()` was deprecated in dplyr 0.7.0.
## Please use `distinct()` instead.
## See vignette('programming') for more help
## network of co-occurring words
library(ggplot2)
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
##
## compose, simplify
## The following object is masked from 'package:tidyr':
##
## crossing
## The following object is masked from 'package:tibble':
##
## as_data_frame
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(ggraph)
word_pairs_ukr_early %>%
filter(n >= 200) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void() +
labs(title = "Co-occurring terms in general tweets early in the war")
# GENERAL TWEETS 20 DAYS INTO THE WAR ---------------------------------------------
load("/Users/renatorusso/ukraine_mar_16.Rda")
ukr_20 <- as_tibble(ukraine_mar_16)
# DATA CLEANING AND WRANGLING ---------------------------------------------
## cleaning the text
ukr_20$text <- clean_text(ukr_20$text)
## tokenizing the words in the "text" column
tweets_token_20 <- ukr_20 %>%
tidytext::unnest_tokens(word, text) %>%
count(status_id, word)
## removing stop words from tidytext's standard library
tweets_token_20 <- tweets_token_20 %>%
anti_join(stop_words)
## Joining, by = "word"
## German and French words were not removed by the English "stop_words" dataset
## (and they appear a lot), so I'll add those languages' stop-word lists
## manually
stop_german <- data.frame(word = stopwords::stopwords("de"), stringsAsFactors = FALSE)
stop_french <- data.frame(word = stopwords::stopwords("fr"), stringsAsFactors = FALSE)
tweets_token_20 <- tweets_token_20 %>%
anti_join(stop_words, by = c('word')) %>%
anti_join(stop_german, by = c("word")) %>%
anti_join(stop_french, by = c("word"))
tweets_token_20 <- tweets_token_20 %>%
anti_join(my_stopwords_ukr)
## Joining, by = "word"
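This same sequence of anti-joins recurs for every subset below. A small helper could bundle them (a sketch; the sections below keep the explicit joins, as originally run):
remove_stops <- function(tokens) {
  tokens %>%
    anti_join(stop_words, by = "word") %>%
    anti_join(stop_german, by = "word") %>%
    anti_join(stop_french, by = "word") %>%
    anti_join(my_stopwords_ukr, by = "word")
}
## e.g. tweets_token_20 <- remove_stops(tweets_token_20)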
## creating a document-term matrix
DTM_20 <- tweets_token_20 %>%
tidytext::cast_dtm(status_id, word, n)
# EXPLORATORY ANALYSIS ----------------------------------------------------
## looking at the most frequent words in the token counts
tweets_token_20 %>%
group_by(word) %>%
summarize(occurrence = sum(n)) %>%
arrange(desc(occurrence))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 54,146 x 2
## word occurrence
## <chr> <int>
## 1 russia 3682
## 2 russian 2502
## 3 putin 2309
## 4 people 1631
## 5 nato 1243
## 6 biden 985
## 7 stop 954
## 8 world 881
## 9 military 874
## 10 children 824
## # … with 54,136 more rows
# TOPIC MODELING ----------------------------------------------------------
#library(topicmodels)
## running Latent Dirichlet Allocation (LDA)
LDA_20 <- topicmodels::LDA(DTM_20, k = 4, control = list(seed = 123))
LDA_td_20 <- tidytext::tidy(LDA_20)
LDA_td_20
## # A tibble: 216,584 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 americans 8.47e- 4
## 2 2 americans 3.49e-19
## 3 3 americans 9.69e- 4
## 4 4 americans 5.77e- 4
## 5 1 anymore 8.56e- 5
## 6 2 anymore 3.65e- 8
## 7 3 anymore 5.97e-11
## 8 4 anymore 2.53e- 4
## 9 1 army 4.75e- 4
## 10 2 army 7.63e- 4
## # … with 216,574 more rows
topTerms_20 <- LDA_td_20 %>%
group_by(topic) %>%
top_n(7, beta) %>%
arrange(topic, -beta)
topTerms_20 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_x") +
coord_flip() +
scale_x_reordered() +
labs(title = "Topics in general tweets 20 days into the war")
word_pairs_ukr_20 <- tweets_token_20 %>%
pairwise_count(word, status_id, sort = TRUE, upper = FALSE)
## network of co-occurring words
word_pairs_ukr_20 %>%
filter(n >= 100) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void() +
labs(title = "Co-occurring terms in general tweets 20 days into the war")
# CNN-RELATED EARLY TWEETS ---------------------------------------------
load("/Users/renatorusso/Desktop/TLTLAB/Ukraine analysis/ukraine_cnn_feb_25.Rda")
ukr_early_cnn <- as_tibble(ukraine_cnn_feb_25)
# DATA CLEANING AND WRANGLING ---------------------------------------------
## cleaning the text
ukr_early_cnn$text <- clean_text(ukr_early_cnn$text)
## tokenizing the words in the "text" column
tweets_token_cnn_early <- ukr_early_cnn %>%
tidytext::unnest_tokens(word, text) %>%
count(status_id, word)
## removing stop words from tidytext's standard library
tweets_token_cnn_early <- tweets_token_cnn_early %>%
anti_join(stop_words)
## Joining, by = "word"
tweets_token_cnn_early <- tweets_token_cnn_early %>%
anti_join(stop_words, by = c('word')) %>%
anti_join(stop_german, by = c("word")) %>%
anti_join(stop_french, by = c("word"))
tweets_token_cnn_early <- tweets_token_cnn_early %>%
anti_join(my_stopwords_ukr)
## Joining, by = "word"
## creating a document-term matrix
DTM_early_cnn <- tweets_token_cnn_early %>%
tidytext::cast_dtm(status_id, word, n)
# EXPLORATORY ANALYSIS ----------------------------------------------------
## looking at the most frequent words in the token counts
tweets_token_cnn_early %>%
group_by(word) %>%
summarize(occurrence = sum(n)) %>%
arrange(desc(occurrence))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 42,327 x 2
## word occurrence
## <chr> <int>
## 1 russia 4863
## 2 russian 3073
## 3 putin 2575
## 4 invasion 1793
## 5 ukrainian 1693
## 6 nato 1523
## 7 biden 1471
## 8 world 1132
## 9 kyiv 1074
## 10 people 1051
## # … with 42,317 more rows
# TOPIC MODELING ----------------------------------------------------------
#library(topicmodels)
## running Latent Dirichlet Allocation (LDA)
LDA_cnn_early <- topicmodels::LDA(DTM_early_cnn, k = 4, control = list(seed = 123))
LDA_cnn_early_td <- tidytext::tidy(LDA_cnn_early)
LDA_cnn_early
## A LDA_VEM topic model with 4 topics.
topTerms_cnn_early <- LDA_cnn_early_td %>%
group_by(topic) %>%
top_n(7, beta) %>%
arrange(topic, -beta)
topTerms_cnn_early %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_x") +
coord_flip() +
scale_x_reordered() +
labs(title = "Topics in CNN-related tweets early in the war")
word_pairs_cnn_early <- tweets_token_cnn_early %>%
pairwise_count(word, status_id, sort = TRUE, upper = FALSE)
## network of co-occurring words
word_pairs_cnn_early %>%
filter(n >= 200) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void() +
labs(title = "Co-occurring terms in CNN-related tweets early in the war")
# CNN-RELATED 20 DAYS INTO THE WAR ---------------------------------------------
load("/Users/renatorusso/ukraine_cnn_mar_16.Rda")
ukr_d20_cnn <- as_tibble(ukraine_cnn_mar_16)
# DATA CLEANING AND WRANGLING ---------------------------------------------
## cleaning the text
ukr_d20_cnn$text <- clean_text(ukr_d20_cnn$text)
## tokenizing the words in the "text" column
tweets_token_d20_cnn <- ukr_d20_cnn %>%
tidytext::unnest_tokens(word, text) %>%
count(status_id, word)
## removing stop words from tidytext's standard library
tweets_token_d20_cnn <- tweets_token_d20_cnn %>%
anti_join(stop_words)
## Joining, by = "word"
tweets_token_d20_cnn <- tweets_token_d20_cnn %>%
anti_join(stop_words, by = c('word')) %>%
anti_join(stop_german, by = c("word")) %>%
anti_join(stop_french, by = c("word"))
tweets_token_d20_cnn <- tweets_token_d20_cnn %>%
anti_join(my_stopwords_ukr)
## Joining, by = "word"
## creating a document-term matrix
DTM_d20_cnn <- tweets_token_d20_cnn %>%
tidytext::cast_dtm(status_id, word, n)
# EXPLORATORY ANALYSIS ----------------------------------------------------
## looking at the most frequent words in the token counts
tweets_token_d20_cnn %>%
group_by(word) %>%
summarize(occurrence = sum(n)) %>%
arrange(desc(occurrence))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 41,410 x 2
## word occurrence
## <chr> <int>
## 1 russia 3978
## 2 russian 2809
## 3 putin 2413
## 4 nato 1704
## 5 news 1251
## 6 ukrainian 1137
## 7 biden 1117
## 8 military 1088
## 9 zelensky 1080
## 10 people 1077
## # … with 41,400 more rows
# TOPIC MODELING ----------------------------------------------------------
#library(topicmodels)
## running Latent Dirichlet Allocation (LDA)
LDA_d20_cnn <- topicmodels::LDA(DTM_d20_cnn, k = 4, control = list(seed = 123))
LDA_d20_cnn_td <- tidytext::tidy(LDA_d20_cnn)
LDA_d20_cnn
## A LDA_VEM topic model with 4 topics.
topTerms_d20_cnn <- LDA_d20_cnn_td %>%
group_by(topic) %>%
top_n(7, beta) %>%
arrange(topic, -beta)
topTerms_d20_cnn %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_x") +
coord_flip() +
scale_x_reordered() +
labs(title = "Topics in CNN-related tweets 20 days into the war")
word_pairs_d20_cnn <- tweets_token_d20_cnn %>%
pairwise_count(word, status_id, sort = TRUE, upper = FALSE)
## network of co-occurring words
word_pairs_d20_cnn %>%
filter(n >= 200) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void() +
labs(title = "Co-occurring terms in CNN-related tweets 20 days into the war")
# BREITBART-RELATED EARLY TWEETS ---------------------------------------------
load("/Users/renatorusso/Desktop/TLTLAB/Ukraine analysis/ukraine_bb_feb_25.Rda")
ukr_early_bb <- as_tibble(ukraine_bb_feb_25)
# DATA CLEANING AND WRANGLING ---------------------------------------------
## cleaning the text
ukr_early_bb$text <- clean_text(ukr_early_bb$text)
## tokenizing the words in the "text" column
tweets_token_bb_early <- ukr_early_bb %>%
tidytext::unnest_tokens(word, text) %>%
count(status_id, word)
## removing stop words from tidytext's standard library
tweets_token_bb_early <- tweets_token_bb_early %>%
anti_join(stop_words)
## Joining, by = "word"
tweets_token_bb_early <- tweets_token_bb_early %>%
anti_join(stop_words, by = c('word')) %>%
anti_join(stop_german, by = c("word")) %>%
anti_join(stop_french, by = c("word"))
tweets_token_bb_early <- tweets_token_bb_early %>%
anti_join(my_stopwords_ukr)
## Joining, by = "word"
## creating a document-term matrix
DTM_early_bb <- tweets_token_bb_early %>%
tidytext::cast_dtm(status_id, word, n)
# EXPLORATORY ANALYSIS ----------------------------------------------------
## looking at the most frequent words in the token counts
tweets_token_bb_early %>%
group_by(word) %>%
summarize(occurrence = sum(n)) %>%
arrange(desc(occurrence))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3,944 x 2
## word occurrence
## <chr> <int>
## 1 russia 355
## 2 biden 350
## 3 putin 330
## 4 invasion 201
## 5 trump 192
## 6 joe 162
## 7 russian 133
## 8 sanctions 133
## 9 crisis 126
## 10 america 107
## # … with 3,934 more rows
# TOPIC MODELING ----------------------------------------------------------
#library(topicmodels)
## running Latent Dirichlet Allocation (LDA)
LDA_bb_early <- topicmodels::LDA(DTM_early_bb, k = 4, control = list(seed = 123))
LDA_bb_early_td <- tidytext::tidy(LDA_bb_early)
LDA_bb_early
## A LDA_VEM topic model with 4 topics.
topTerms_bb_early <- LDA_bb_early_td %>%
group_by(topic) %>%
top_n(7, beta) %>%
arrange(topic, -beta)
topTerms_bb_early %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_x") +
coord_flip() +
scale_x_reordered() +
labs(title = "Topics in Breitbart-related tweets early in the war")
word_pairs_bb_early <- tweets_token_bb_early %>%
pairwise_count(word, status_id, sort = TRUE, upper = FALSE)
## network of co-occurring words
word_pairs_bb_early %>%
filter(n >= 50) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void() +
labs(title = "Co-occurring terms in Breitbart-related tweets early in the war")
# BREITBART-RELATED 20 DAYS INTO THE WAR TWEETS ---------------------------------------------
load("/Users/renatorusso/ukraine_bb_mar_16.Rda")
ukr_d20_bb <- as_tibble(ukraine_bb_mar_16)
# DATA CLEANING AND WRANGLING ---------------------------------------------
## cleaning the text
ukr_d20_bb$text <- clean_text(ukr_d20_bb$text)
## tokenizing the words in the "text" column
tweets_token_d20_bb <- ukr_d20_bb %>%
tidytext::unnest_tokens(word, text) %>%
count(status_id, word)
## removing stop words from tidytext's standard library
tweets_token_d20_bb <- tweets_token_d20_bb %>%
anti_join(stop_words)
## Joining, by = "word"
tweets_token_d20_bb <- tweets_token_d20_bb %>%
anti_join(stop_words, by = c('word')) %>%
anti_join(stop_german, by = c("word")) %>%
anti_join(stop_french, by = c("word"))
tweets_token_d20_bb <- tweets_token_d20_bb %>%
anti_join(my_stopwords_ukr)
## Joining, by = "word"
## creating a document-term matrix
DTM_d20_bb <- tweets_token_d20_bb %>%
tidytext::cast_dtm(status_id, word, n)
# EXPLORATORY ANALYSIS ----------------------------------------------------
## looking at the most frequent words in the token counts
tweets_token_d20_bb %>%
group_by(word) %>%
summarize(occurrence = sum(n)) %>%
arrange(desc(occurrence))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3,459 x 2
## word occurrence
## <chr> <int>
## 1 russia 242
## 2 biden 130
## 3 invasion 103
## 4 putin 102
## 5 russian 102
## 6 trump 96
## 7 china 90
## 8 poll 72
## 9 president 61
## 10 invaded 55
## # … with 3,449 more rows
# TOPIC MODELING ----------------------------------------------------------
#library(topicmodels)
## running Latent Dirichlet Allocation (LDA)
LDA_d20_bb <- topicmodels::LDA(DTM_d20_bb, k = 4, control = list(seed = 123))
LDA_d20_bb_td <- tidytext::tidy(LDA_d20_bb)
LDA_d20_bb
## A LDA_VEM topic model with 4 topics.
topTerms_d20_bb <- LDA_d20_bb_td %>%
group_by(topic) %>%
top_n(7, beta) %>%
arrange(topic, -beta)
topTerms_d20_bb %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_x") +
coord_flip() +
scale_x_reordered() +
labs(title = "Topics in Breitbart-related tweets 20-days into the war")
word_pairs_d20_bb <- tweets_token_d20_bb %>%
pairwise_count(word, status_id, sort = TRUE, upper = FALSE)
## network of co-occurring words
word_pairs_d20_bb %>%
filter(n >= 20) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void() +
labs(title = "Co-occurring terms in Breitbart-related tweets 20 days into the war")
Now that I have created all the charts, I’ll replicate them here to make the comparison easier.
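The plotting code is identical for every subset except for the input data and the title, so a small helper would reduce repetition (a sketch; the charts below keep the explicit code, as originally run):
plot_topics <- function(top_terms, chart_title) {
  top_terms %>%
    mutate(term = reorder_within(term, beta, topic)) %>%
    ggplot(aes(term, beta, fill = factor(topic))) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~ topic, scales = "free_x") +
    coord_flip() +
    scale_x_reordered() +
    labs(title = chart_title)
}
## e.g. plot_topics(topTerms, "Topics in general tweets early in the war")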
topTerms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_x") +
coord_flip() +
scale_x_reordered() +
labs(title = "Topics in general tweets early in the war")
topTerms_20 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_x") +
coord_flip() +
scale_x_reordered() +
labs(title = "Topics in general tweets 20 days into the war")
topTerms_cnn_early %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_x") +
coord_flip() +
scale_x_reordered() +
labs(title = "Topics in CNN-related tweets early in the war")
topTerms_d20_cnn %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_x") +
coord_flip() +
scale_x_reordered() +
labs(title = "Topics in CNN-related tweets 20 days into the war")
topTerms_bb_early %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_x") +
coord_flip() +
scale_x_reordered() +
labs(title = "Topics in Breitbart-related tweets early in the war")
topTerms_d20_bb %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_x") +
coord_flip() +
scale_x_reordered() +
labs(title = "Topics in Breitbart-related tweets 20-days into the war")
Early in the war:

| Topic # | General | CNN-related | Breitbart-related |
|---------|---------|-------------|-------------------|
| 1 | Russia, Putin, war | NATO and sanctions; Biden | Oil; Trump; refugees |
| 2 | Explosions heard | Russian invasion; Europe | Biden; sanctions; border |
| 3 | Russian military operation | News coverage of the invasion | Russia; special operation |
| 4 | NATO; Putin; war | Chernobyl; control | Crisis; Biden; prices |
20 days into the war:

| Topic # | General | CNN-related | Breitbart-related |
|---------|---------|-------------|-------------------|
| 1 | NATO support | Zelensky, Biden, Putin | Democrats; failure |
| 2 | People/children | China/Russia | China; biological weapons |
| 3 | NATO, aid, Biden | Bomb in Mariupol theater | Poll; Trump |
| 4 | Stop Putin | NATO; Zelensky | Documentary; Poland |
As the tables above show, there are a few differences across the segments of the data set:
Connections to domestic issues. In tweets mentioning Breitbart News, comments connecting the conflict with US domestic issues appear from the outset, as demonstrated by topics like “Trump,” “border,” and “crisis” appearing alongside “Biden.”
China and biological weapons. Tweets mentioning Breitbart 20 days into the war associate China with biological weapons, possibly reflecting a column published on the news website about conspiracy theories concerning US-funded biological labs in Ukraine.
Domestic politics. In Breitbart-related tweets, US domestic politics is salient; Trump, for example, appears in one of the topics. Some tweets mentioning the former president refer to a public opinion poll according to which Russia would not have invaded Ukraine were Trump still US president.
word_pairs_ukr_early %>%
filter(n >= 200) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void() +
labs(title = "Co-occurring terms in general tweets early in the war")
word_pairs_ukr_20 %>%
filter(n >= 100) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void() +
labs(title = "Co-occurring terms in general tweets 20 days into the war")
word_pairs_cnn_early %>%
filter(n >= 200) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void() +
labs(title = "Co-occurring terms in CNN-related tweets early in the war")
word_pairs_d20_cnn %>%
filter(n >= 250) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void() +
labs(title = "Co-occurring terms in CNN-related tweets 20 days into the war")
word_pairs_bb_early %>%
filter(n >= 50) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void() +
labs(title = "Co-occurring terms in Breitbart-related tweets early in the war")
word_pairs_d20_bb %>%
filter(n >= 20) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void() +
labs(title = "Co-occurring terms in Breitbart-related tweets 20 days into the war")
| Subset | Noteworthy regions |
|--------|--------------------|
| General / early hours | Military operation; the UN reaction |
| General / 20 days in | Biden's reaction; Trump's connection with Putin; political and economic consequences |
| CNN / early hours | Russian forces approaching Chernobyl; the role of NATO and sanctions |
| CNN / 20 days in | Mariupol theater |
| Breitbart / early hours | Biden and crisis; border agents being sent to Poland |
| Breitbart / 20 days in | Censorship of Oliver Stone's documentary; the war helping Democrats divert attention |
As seen above, the co-occurrence charts also show some differences across subsets at the two moments. The Breitbart-related tweets seem to display stronger connections between the conflict and US domestic politics, a pattern similar to the one found in the topic charts.
My motivation for this analysis was to identify differences in perception of the war in Ukraine according to exposure to media with different political orientations, and my analysis does point to a few aspects that might be worth exploring in more depth. The most salient difference is the apparent sensitivity of a segment of users to connections between the war and domestic issues. From the outset, tweets mentioning @BreitbartNews point to connections with former president Trump, the border crisis in the US, and the war's effect of diverting the public's attention from issues those users consider more pressing, along with claims that this might benefit Democrats.
One limitation of this analysis is that the volumes of tweets differ considerably across subsets: general tweets obviously form an enormous subset, whereas CNN-related tweets appear in much lower volume, though still much higher than Breitbart-related ones. Those differences are visible in the co-occurrence charts, whose edge weights differ by orders of magnitude (tens for Breitbart-related, hundreds for CNN-related, thousands for general).
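A quick way to quantify those volume differences is to count the tweets in each subset, using the tibbles loaded above (a sketch; the counts are not reproduced here):
sapply(list(general_early = ukr_early1,
            general_d20   = ukr_20,
            cnn_early     = ukr_early_cnn,
            cnn_d20       = ukr_d20_cnn,
            bb_early      = ukr_early_bb,
            bb_d20        = ukr_d20_bb),
       nrow)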