About this project

This notebook documents my first natural language processing project, in which I scrape Twitter data on the war in Ukraine. I collect tweets matching three different keyword queries at two different moments: at the very start of the conflict and 20 days into the war.

  1. Tweets containing the term “Ukraine”. These publications serve as a sort of “control group”: they are intended to measure general perceptions about the conflict, with no filters besides time of publication and, obviously, the ability and intention to publish tweets.
  2. Tweets containing the terms “Ukraine” and “CNN”. The intent here is to evaluate publications that reference a mainstream media outlet. This represents one of the “treatment groups.”
  3. Tweets containing the terms “Ukraine” and “Breitbart”. This represents a sample of tweets associated with a right-wing website: the other “treatment group.”

My interest is in evaluating differences in perception of the war according to exposure to media with different political orientations. Breitbart is clearly right-wing; CNN, in my opinion, is moderate, although there are claims that it leans both left and right. I collected tweets starting February 24 (representing the “natural state” of opinions), and the data in this project covers tweets through March 16. The latest tweets therefore represent the effect of roughly three weeks of media coverage.
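The .Rda files loaded throughout this notebook were collected in advance. For reference, here is a minimal sketch of how a comparable collection could be pulled with the rtweet package (the status_id column in the data suggests rtweet was used); the query strings, tweet counts, and object/file names below are illustrative assumptions, not the exact parameters I used.

library(rtweet)

## one query per subset; n and include_rts are illustrative choices
ukr_general <- search_tweets("Ukraine", n = 18000, include_rts = FALSE)
ukr_cnn     <- search_tweets("Ukraine CNN", n = 18000, include_rts = FALSE)
ukr_bb      <- search_tweets("Ukraine Breitbart", n = 18000, include_rts = FALSE)

## saving a snapshot for later sessions, mirroring the load() calls below
save(ukr_general, file = "ukraine_feb_24.Rda")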

Data preparation of each subset

General/early

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# GENERAL TWEETS FIRST DAYS ---------------------------------------------
load("/Users/renatorusso/Desktop/TLTLAB/Ukraine analysis/ukraine_feb_24.Rda")
ukr_early1 <- as_tibble(ukraine_feb_24_real)

# DATA CLEANING AND WRANGLING ---------------------------------------------
## creating a function to remove numbers and punctuation
## in preliminary analysis, I noticed that the prevalence of "Ukraine" made it
## challenging to draw any conclusions from the sentiment analysis, so I decided
## to remove it too. I'm also using the custom function to remove a few words
## detected when I first ran later parts of this analysis

clean_text <- function(text) {
  text <- tolower(text)
  text <- gsub("[[:digit:]]+", "", text)
  text <- gsub("[[:punct:]]+", "", text)
  text <- gsub("ukraine", "", text)
  return(text)
}
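
As a quick sanity check, here is clean_text() applied to a made-up tweet; the input string is illustrative, not drawn from the data set:

clean_text("Breaking: UKRAINE under attack!! 24 Feb, via @CNN")
## [1] "breaking  under attack  feb via cnn"

Note that “cnn” survives this step: media names are only dropped later, via the custom stop-word list.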

## cleaning the text
ukr_early1$text <- clean_text(ukr_early1$text)

## tokenizing the words in the "text" column
tweets_token <- ukr_early1 %>% 
  tidytext::unnest_tokens(word, text) %>% 
  count(status_id, word)

tweets_token
## # A tibble: 297,750 x 3
##    status_id           word           n
##    <chr>               <chr>      <int>
##  1 1496680570637725697 a              1
##  2 1496680570637725697 address        1
##  3 1496680570637725697 after          1
##  4 1496680570637725697 all            1
##  5 1496680570637725697 an             2
##  6 1496680570637725697 assistance     1
##  7 1496680570637725697 cannot         1
##  8 1496680570637725697 collapse       1
##  9 1496680570637725697 constant       1
## 10 1496680570637725697 countries      1
## # … with 297,740 more rows
## removing stop words using tidytext's standard stop-word list
tweets_token <- tweets_token %>% 
  anti_join(tidytext::get_stopwords())
## Joining, by = "word"
## removing more stop words by creating a custom list of stop words that appeared
## when I first ran later parts of this analysis
my_stopwords_ukr <- tibble(word = c(as.character(1:10),
                                    "like", "dont", "de", "la", "en",
                                    "amp", "et", "le", "les", "I", "à", "des", "pas",
                                    "guerre", "russie", "cest", "krieg", "breitbartnews",
                                    "cnn", "msnbc", "foxnews"))

tweets_token <- tweets_token %>% 
  anti_join(my_stopwords_ukr)
## Joining, by = "word"
## creating a document-term matrix
DTM <- tweets_token %>% 
  tidytext::cast_dtm(status_id, word, n)

DTM
## <<DocumentTermMatrix (documents: 17950, terms: 34336)>>
## Non-/sparse entries: 196894/616134306
## Sparsity           : 100%
## Maximal term length: 47
## Weighting          : term frequency (tf)
# EXPLORATORY ANALYSIS ----------------------------------------------------
## looking at the most frequent words
tweets_token %>% 
  group_by(word) %>% 
  summarize(occurrence = sum(n)) %>% 
  arrange(desc(occurrence))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 34,336 x 2
##    word      occurrence
##    <chr>          <int>
##  1 putin           5527
##  2 russia          4886
##  3 war             2461
##  4 military        2355
##  5 russian         1805
##  6 operation       1678
##  7 people          1496
##  8 just            1285
##  9 now             1250
## 10 us              1155
## # … with 34,326 more rows
# TOPIC MODELING ----------------------------------------------------------
library(topicmodels)
## running Latent Dirichlet Allocation (LDA)
LDA <- topicmodels::LDA(DTM, k = 4, control = list(seed = 123))
LDA_td <- tidytext::tidy(LDA)
LDA_td
## # A tibble: 137,344 x 3
##    topic term            beta
##    <int> <chr>          <dbl>
##  1     1 address    1.64e- 19
##  2     2 address    3.90e- 30
##  3     3 address    2.66e-  3
##  4     4 address    1.88e- 23
##  5     1 assistance 6.80e- 18
##  6     2 assistance 2.17e-123
##  7     3 assistance 2.95e-  4
##  8     4 assistance 1.10e-  4
##  9     1 collapse   3.44e-  5
## 10     2 collapse   2.62e- 13
## # … with 137,334 more rows
## visualizing topics using ggplot2 and tidytext
library(tidytext)

topTerms <- LDA_td %>% 
  group_by(topic) %>% 
  top_n(7, beta) %>% 
  arrange(topic, -beta)

theme_set(theme_bw())

topTerms %>% 
  mutate(term = reorder_within(term, beta, topic)) %>% 
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_x") +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Topics in general tweets early in the war")

library(widyr)
word_pairs_ukr_early <- tweets_token %>% 
  pairwise_count(word, status_id, sort = TRUE, upper = FALSE)
## Warning: `distinct_()` was deprecated in dplyr 0.7.0.
## Please use `distinct()` instead.
## See vignette('programming') for more help
#network of co-occurring words
library(ggplot2)
library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
## 
##     compose, simplify
## The following object is masked from 'package:tidyr':
## 
##     crossing
## The following object is masked from 'package:tibble':
## 
##     as_data_frame
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
library(ggraph)

word_pairs_ukr_early %>%
  filter(n >= 200) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  theme_void() +
  labs(title = "Co-occurring terms in general tweets early in the war")

# GENERAL TWEETS 20 DAYS INTO THE WAR ---------------------------------------------
load("/Users/renatorusso/ukraine_mar_16.Rda")
ukr_20 <- as_tibble(ukraine_mar_16)

# DATA CLEANING AND WRANGLING ---------------------------------------------

## cleaning the text
ukr_20$text <- clean_text(ukr_20$text)

## tokenizing the words in the "text" column
tweets_token_20 <- ukr_20 %>% 
  tidytext::unnest_tokens(word, text) %>% 
  count(status_id, word)

## removing stop words using tidytext's standard stop-word list
tweets_token_20 <- tweets_token_20 %>% 
  anti_join(stop_words)
## Joining, by = "word"
## German and French words are not removed by the English "stop_words" dataset
## (and they appear a lot), so I'll add those languages' stop-word lists
## "manually"
stop_german <- data.frame(word = stopwords::stopwords("de"), stringsAsFactors = FALSE)

stop_french <- data.frame(word = stopwords::stopwords("fr"), stringsAsFactors = FALSE)

tweets_token_20 <-  tweets_token_20 %>% 
  anti_join(stop_words, by = c('word')) %>%
  anti_join(stop_german, by = c("word")) %>% 
  anti_join(stop_french, by = c("word"))

tweets_token_20 <- tweets_token_20 %>% 
  anti_join(my_stopwords_ukr)
## Joining, by = "word"
## creating a document-term matrix
DTM_20 <- tweets_token_20 %>% 
  tidytext::cast_dtm(status_id, word, n)

# EXPLORATORY ANALYSIS ----------------------------------------------------
## looking at the most frequent words
tweets_token_20 %>% 
  group_by(word) %>% 
  summarize(occurrence = sum(n)) %>% 
  arrange(desc(occurrence))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 54,146 x 2
##    word     occurrence
##    <chr>         <int>
##  1 russia         3682
##  2 russian        2502
##  3 putin          2309
##  4 people         1631
##  5 nato           1243
##  6 biden           985
##  7 stop            954
##  8 world           881
##  9 military        874
## 10 children        824
## # … with 54,136 more rows
# TOPIC MODELING ----------------------------------------------------------
#library(topicmodels)
## running Latent Dirichlet Allocation (LDA)
LDA_20 <- topicmodels::LDA(DTM_20, k = 4, control = list(seed = 123))
LDA_td_20 <- tidytext::tidy(LDA_20)
LDA_td_20
## # A tibble: 216,584 x 3
##    topic term          beta
##    <int> <chr>        <dbl>
##  1     1 americans 8.47e- 4
##  2     2 americans 3.49e-19
##  3     3 americans 9.69e- 4
##  4     4 americans 5.77e- 4
##  5     1 anymore   8.56e- 5
##  6     2 anymore   3.65e- 8
##  7     3 anymore   5.97e-11
##  8     4 anymore   2.53e- 4
##  9     1 army      4.75e- 4
## 10     2 army      7.63e- 4
## # … with 216,574 more rows
topTerms_20 <- LDA_td_20 %>% 
  group_by(topic) %>% 
  top_n(7, beta) %>% 
  arrange(topic, -beta)

topTerms_20 %>% 
  mutate(term = reorder_within(term, beta, topic)) %>% 
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_x") +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Topics in general tweets 20 days into the war")

word_pairs_ukr_20 <- tweets_token_20 %>% 
  pairwise_count(word, status_id, sort = TRUE, upper = FALSE)

#network of co-occurring words
word_pairs_ukr_20 %>%
  filter(n >= 100) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  theme_void() +
  labs(title = "Co-occurring terms in general tweets 20 days into the war")

# CNN-RELATED EARLY TWEETS ---------------------------------------------
load("/Users/renatorusso/Desktop/TLTLAB/Ukraine analysis/ukraine_cnn_feb_25.Rda")
ukr_early_cnn <- as_tibble(ukraine_cnn_feb_25)

# DATA CLEANING AND WRANGLING ---------------------------------------------

## cleaning the text
ukr_early_cnn$text <- clean_text(ukr_early_cnn$text)

## tokenizing the words in the "text" column
tweets_token_cnn_early <- ukr_early_cnn %>% 
  tidytext::unnest_tokens(word, text) %>% 
  count(status_id, word)

## removing stop words using tidytext's standard stop-word list
tweets_token_cnn_early <- tweets_token_cnn_early %>% 
  anti_join(stop_words)
## Joining, by = "word"
tweets_token_cnn_early <-  tweets_token_cnn_early %>% 
  anti_join(stop_words, by = c('word')) %>%
  anti_join(stop_german, by = c("word")) %>% 
  anti_join(stop_french, by = c("word"))

tweets_token_cnn_early <- tweets_token_cnn_early %>% 
  anti_join(my_stopwords_ukr)
## Joining, by = "word"
## creating a document-term matrix
DTM_early_cnn <- tweets_token_cnn_early %>% 
  tidytext::cast_dtm(status_id, word, n)

# EXPLORATORY ANALYSIS ----------------------------------------------------
## looking at the most frequent words
tweets_token_cnn_early %>% 
  group_by(word) %>% 
  summarize(occurrence = sum(n)) %>% 
  arrange(desc(occurrence))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 42,327 x 2
##    word      occurrence
##    <chr>          <int>
##  1 russia          4863
##  2 russian         3073
##  3 putin           2575
##  4 invasion        1793
##  5 ukrainian       1693
##  6 nato            1523
##  7 biden           1471
##  8 world           1132
##  9 kyiv            1074
## 10 people          1051
## # … with 42,317 more rows
# TOPIC MODELING ----------------------------------------------------------
#library(topicmodels)
## running Latent Dirichlet Allocation (LDA)
LDA_cnn_early <- topicmodels::LDA(DTM_early_cnn, k = 4, control = list(seed = 123))
LDA_cnn_early_td <- tidytext::tidy(LDA_cnn_early)
LDA_cnn_early
## A LDA_VEM topic model with 4 topics.
topTerms_cnn_early <- LDA_cnn_early_td %>% 
  group_by(topic) %>% 
  top_n(7, beta) %>% 
  arrange(topic, -beta)

topTerms_cnn_early %>% 
  mutate(term = reorder_within(term, beta, topic)) %>% 
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_x") +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Topics in CNN-related tweets early in the war")

word_pairs_cnn_early <- tweets_token_cnn_early %>% 
  pairwise_count(word, status_id, sort = TRUE, upper = FALSE)

## network of co-occurring words
word_pairs_cnn_early %>%
  filter(n >= 200) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  theme_void() +
  labs(title = "Co-occurring terms in CNN-related tweets early in the war")

# CNN-RELATED 20 DAYS INTO THE WAR ---------------------------------------------
load("/Users/renatorusso/ukraine_cnn_mar_16.Rda")
ukr_d20_cnn <- as_tibble(ukraine_cnn_mar_16)

# DATA CLEANING AND WRANGLING ---------------------------------------------

## cleaning the text
ukr_d20_cnn$text <- clean_text(ukr_d20_cnn$text)

## tokenizing the words in the "text" column
tweets_token_d20_cnn <- ukr_d20_cnn %>% 
  tidytext::unnest_tokens(word, text) %>% 
  count(status_id, word)

## removing stop words using tidytext's standard stop-word list
tweets_token_d20_cnn <- tweets_token_d20_cnn %>% 
  anti_join(stop_words)
## Joining, by = "word"
tweets_token_d20_cnn <-  tweets_token_d20_cnn %>% 
  anti_join(stop_words, by = c('word')) %>%
  anti_join(stop_german, by = c("word")) %>% 
  anti_join(stop_french, by = c("word"))

tweets_token_d20_cnn <- tweets_token_d20_cnn %>% 
  anti_join(my_stopwords_ukr)
## Joining, by = "word"
## creating a document-term matrix
DTM_d20_cnn <- tweets_token_d20_cnn %>% 
  tidytext::cast_dtm(status_id, word, n)

# EXPLORATORY ANALYSIS ----------------------------------------------------
## looking at the most frequent words
tweets_token_d20_cnn %>% 
  group_by(word) %>% 
  summarize(occurrence = sum(n)) %>% 
  arrange(desc(occurrence))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 41,410 x 2
##    word      occurrence
##    <chr>          <int>
##  1 russia          3978
##  2 russian         2809
##  3 putin           2413
##  4 nato            1704
##  5 news            1251
##  6 ukrainian       1137
##  7 biden           1117
##  8 military        1088
##  9 zelensky        1080
## 10 people          1077
## # … with 41,400 more rows
# TOPIC MODELING ----------------------------------------------------------
#library(topicmodels)
## running Latent Dirichlet Allocation (LDA)
LDA_d20_cnn <- topicmodels::LDA(DTM_d20_cnn, k = 4, control = list(seed = 123))
LDA_d20_cnn_td <- tidytext::tidy(LDA_d20_cnn)
LDA_d20_cnn
## A LDA_VEM topic model with 4 topics.
topTerms_d20_cnn <- LDA_d20_cnn_td %>% 
  group_by(topic) %>% 
  top_n(7, beta) %>% 
  arrange(topic, -beta)

topTerms_d20_cnn %>% 
  mutate(term = reorder_within(term, beta, topic)) %>% 
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_x") +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Topics in CNN-related tweets 20 days into the war")

word_pairs_d20_cnn <- tweets_token_d20_cnn %>% 
  pairwise_count(word, status_id, sort = TRUE, upper = FALSE)

## network of co-occurring words
word_pairs_d20_cnn %>%
  filter(n >= 200) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  theme_void() +
  labs(title = "Co-occurring terms in CNN-related tweets 20 days into the war")

# BREITBART-RELATED EARLY TWEETS ---------------------------------------------
load("/Users/renatorusso/Desktop/TLTLAB/Ukraine analysis/ukraine_bb_feb_25.Rda")
ukr_early_bb <- as_tibble(ukraine_bb_feb_25)

# DATA CLEANING AND WRANGLING ---------------------------------------------

## cleaning the text
ukr_early_bb$text <- clean_text(ukr_early_bb$text)

## tokenizing the words in the "text" column
tweets_token_bb_early <- ukr_early_bb %>% 
  tidytext::unnest_tokens(word, text) %>% 
  count(status_id, word)

## removing stop words using tidytext's standard stop-word list
tweets_token_bb_early <- tweets_token_bb_early %>% 
  anti_join(stop_words)
## Joining, by = "word"
tweets_token_bb_early <-  tweets_token_bb_early %>% 
  anti_join(stop_words, by = c('word')) %>%
  anti_join(stop_german, by = c("word")) %>% 
  anti_join(stop_french, by = c("word"))

tweets_token_bb_early <- tweets_token_bb_early %>% 
  anti_join(my_stopwords_ukr)
## Joining, by = "word"
## creating a document-term matrix
DTM_early_bb <- tweets_token_bb_early %>% 
  tidytext::cast_dtm(status_id, word, n)

# EXPLORATORY ANALYSIS ----------------------------------------------------
## looking at the most frequent words
tweets_token_bb_early %>% 
  group_by(word) %>% 
  summarize(occurrence = sum(n)) %>% 
  arrange(desc(occurrence))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3,944 x 2
##    word      occurrence
##    <chr>          <int>
##  1 russia           355
##  2 biden            350
##  3 putin            330
##  4 invasion         201
##  5 trump            192
##  6 joe              162
##  7 russian          133
##  8 sanctions        133
##  9 crisis           126
## 10 america          107
## # … with 3,934 more rows
# TOPIC MODELING ----------------------------------------------------------
#library(topicmodels)
## running Latent Dirichlet Allocation (LDA)
LDA_bb_early <- topicmodels::LDA(DTM_early_bb, k = 4, control = list(seed = 123))
LDA_bb_early_td <- tidytext::tidy(LDA_bb_early)
LDA_bb_early
## A LDA_VEM topic model with 4 topics.
topTerms_bb_early <- LDA_bb_early_td %>% 
  group_by(topic) %>% 
  top_n(7, beta) %>% 
  arrange(topic, -beta)

topTerms_bb_early %>% 
  mutate(term = reorder_within(term, beta, topic)) %>% 
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_x") +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Topics in Breitbart-related tweets early in the war")

word_pairs_bb_early <- tweets_token_bb_early %>% 
  pairwise_count(word, status_id, sort = TRUE, upper = FALSE)

## network of co-occurring words
word_pairs_bb_early %>%
  filter(n >= 50) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  theme_void() +
  labs(title = "Co-occurring terms in Breitbart-related tweets early in the war")

# BREITBART-RELATED 20 DAYS INTO THE WAR TWEETS ---------------------------------------------
load("/Users/renatorusso/ukraine_bb_mar_16.Rda")
ukr_d20_bb <- as_tibble(ukraine_bb_mar_16)

# DATA CLEANING AND WRANGLING ---------------------------------------------

## cleaning the text
ukr_d20_bb$text <- clean_text(ukr_d20_bb$text)

## tokenizing the words in the "text" column
tweets_token_d20_bb <- ukr_d20_bb %>% 
  tidytext::unnest_tokens(word, text) %>% 
  count(status_id, word)

## removing stop words using tidytext's standard stop-word list
tweets_token_d20_bb <- tweets_token_d20_bb %>% 
  anti_join(stop_words)
## Joining, by = "word"
tweets_token_d20_bb <-  tweets_token_d20_bb %>% 
  anti_join(stop_words, by = c('word')) %>%
  anti_join(stop_german, by = c("word")) %>% 
  anti_join(stop_french, by = c("word"))

tweets_token_d20_bb <- tweets_token_d20_bb %>% 
  anti_join(my_stopwords_ukr)
## Joining, by = "word"
## creating a document-term matrix
DTM_d20_bb <- tweets_token_d20_bb %>% 
  tidytext::cast_dtm(status_id, word, n)

# EXPLORATORY ANALYSIS ----------------------------------------------------
## looking at the most frequent words
tweets_token_d20_bb %>% 
  group_by(word) %>% 
  summarize(occurrence = sum(n)) %>% 
  arrange(desc(occurrence))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3,459 x 2
##    word      occurrence
##    <chr>          <int>
##  1 russia           242
##  2 biden            130
##  3 invasion         103
##  4 putin            102
##  5 russian          102
##  6 trump             96
##  7 china             90
##  8 poll              72
##  9 president         61
## 10 invaded           55
## # … with 3,449 more rows
# TOPIC MODELING ----------------------------------------------------------
#library(topicmodels)
## running Latent Dirichlet Allocation (LDA)
LDA_d20_bb <- topicmodels::LDA(DTM_d20_bb, k = 4, control = list(seed = 123))
LDA_d20_bb_td <- tidytext::tidy(LDA_d20_bb)
LDA_d20_bb
## A LDA_VEM topic model with 4 topics.
topTerms_d20_bb <- LDA_d20_bb_td %>% 
  group_by(topic) %>% 
  top_n(7, beta) %>% 
  arrange(topic, -beta)

topTerms_d20_bb %>% 
  mutate(term = reorder_within(term, beta, topic)) %>% 
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_x") +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Topics in Breitbart-related tweets 20-days into the war")

word_pairs_d20_bb <- tweets_token_d20_bb %>% 
  pairwise_count(word, status_id, sort = TRUE, upper = FALSE)

## network of co-occurring words
word_pairs_d20_bb %>%
  filter(n >= 20) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  theme_void() +
  labs(title = "Co-occurring terms in Breitbart-related tweets 20 days into the war")

Now that I have created all the charts, I’ll replicate them here to make the comparison easier.

Charts of most relevant topics

General/early

topTerms %>% 
  mutate(term = reorder_within(term, beta, topic)) %>% 
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_x") +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Topics in general tweets early in the war")

General/20 days

topTerms_20 %>% 
  mutate(term = reorder_within(term, beta, topic)) %>% 
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_x") +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Topics in general tweets 20 days into the war")

CNN/early

topTerms_cnn_early %>% 
  mutate(term = reorder_within(term, beta, topic)) %>% 
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_x") +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Topics in CNN-related tweets early in the war")

CNN/20 days

topTerms_d20_cnn %>% 
  mutate(term = reorder_within(term, beta, topic)) %>% 
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_x") +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Topics in CNN-related tweets 20 days into the war")

Breitbart/early

topTerms_bb_early %>% 
  mutate(term = reorder_within(term, beta, topic)) %>% 
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_x") +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Topics in Breitbart-related tweets early in the war")

Breitbart/20 days

topTerms_d20_bb %>% 
  mutate(term = reorder_within(term, beta, topic)) %>% 
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_x") +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Topics in Breitbart-related tweets 20-days into the war")


Early hours

Topic #   General                      CNN-related                     Breitbart-related
1         Russia, Putin, war           NATO and sanctions; Biden       Oil; Trump; refugees
2         Explosions heard             Russian invasion; Europe        Biden; sanctions; border
3         Russian military operation   News coverage of the invasion   Russia; special operation
4         NATO; Putin; war             Chernobyl; control              Crisis; Biden; prices

20 days into the war

Topic #   General                      CNN-related                     Breitbart-related
1         NATO support                 Zelensky, Biden, Putin          Democrats; failure
2         People/children              China/Russia                    China; biological weapons
3         NATO, aid, Biden             Bomb in Mariupol theater        Poll; Trump
4         Stop Putin                   NATO; Zelensky                  Documentary; Poland

As the tables above show, there are a few differences across the segments of the data set, such as:

Connections to domestic issues. In tweets mentioning Breitbart News, comments connecting the conflict with US domestic issues appear from the outset, as demonstrated by topics like “Trump,” “border,” and “crisis,” associated with “Biden.”

China and biological weapons. Tweets mentioning Breitbart 20 days into the war associate China with biological weapons, possibly connecting to a column published on the news website about conspiracy theories concerning US-funded biological labs in Ukraine.


Domestic politics. In Breitbart-related tweets, US domestic politics is salient; for example, Trump appears as one of the topics. Some tweets that mention the former president refer to a public opinion poll according to which respondents believe Russia would not have invaded Ukraine were Trump still the US president.

Charts of co-occurring terms

General/early

word_pairs_ukr_early %>%
  filter(n >= 200) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  theme_void() +
  labs(title = "Co-occurring terms in general tweets early in the war")

General/20 days

word_pairs_ukr_20 %>%
  filter(n >= 100) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  theme_void() +
  labs(title = "Co-occurring terms in general tweets 20 days into the war")

CNN/early

word_pairs_cnn_early %>%
  filter(n >= 200) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  theme_void() +
  labs(title = "Co-occurring terms in CNN-related tweets early in the war")

CNN/20 days

word_pairs_d20_cnn %>%
  filter(n >= 200) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  theme_void() +
  labs(title = "Co-occurring terms in CNN-related tweets 20 days into the war")

Breitbart/early

word_pairs_bb_early %>%
  filter(n >= 50) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  theme_void() +
  labs(title = "Co-occurring terms in Breitbart-related tweets early in the war")

Breitbart/20 days

word_pairs_d20_bb %>%
  filter(n >= 20) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  theme_void() +
  labs(title = "Co-occurring terms in Breitbart-related tweets 20 days into the war")


Noteworthy graph regions by subset of tweets

Subset                   Noteworthy regions
General/early hours      Military operation; the UN reaction
General/20 days          Biden’s reaction; Trump’s connection with Putin; political and economic consequences
CNN/early hours          Russian forces approaching Chernobyl; the role of NATO and sanctions
CNN/20 days              Mariupol theater
Breitbart/early hours    Biden and crisis; border agents being sent to Poland
Breitbart/20 days        Censorship of Oliver Stone’s documentary; the war helping Democrats divert attention

As seen above, the charts of co-occurring terms also show some differences across subsets at the two moments. The Breitbart-related tweets seem to display stronger connections between the conflict and US domestic politics, a pattern similar to the one found in the topic charts.


My motivation for this analysis was to identify differences in perception of the war in Ukraine according to exposure to media with different political orientations, and, indeed, my analysis points to a few aspects that might be worth exploring in more depth. The most salient difference is the apparent sensitivity of a segment of users to connections between the war and domestic issues. From the outset, tweets mentioning @BreitbartNews point to connections with former president Trump, the border crisis in the US, and the war’s effect of diverting the public’s attention from what those users see as more pressing issues, and how this might benefit Democrats.

One limitation of this analysis is that the volumes of tweets differ considerably across subsets: general tweets obviously form an enormous subset, whereas CNN-related tweets appear in much lower volume, though still much higher than Breitbart-related publications. Those differences are clear in the co-occurrence charts, whose edge counts differ by orders of magnitude (tens for Breitbart-related, hundreds for CNN-related, thousands for general tweets).
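
Since the six tibbles are still in memory at this point (and the tidyverse is already loaded), the disparity can be made explicit by counting distinct tweets per subset. A small sketch; the exact counts will of course depend on the collected data:

## counting distinct tweets in each of the six subsets
tibble(
  subset = c("general/early", "general/20 days",
             "CNN/early", "CNN/20 days",
             "Breitbart/early", "Breitbart/20 days"),
  tweets = c(n_distinct(ukr_early1$status_id),
             n_distinct(ukr_20$status_id),
             n_distinct(ukr_early_cnn$status_id),
             n_distinct(ukr_d20_cnn$status_id),
             n_distinct(ukr_early_bb$status_id),
             n_distinct(ukr_d20_bb$status_id))
)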