# Assumes the tidyverse, lubridate, and gt packages are attached (setup code not shown).
df <- read_rds(here::here("data_twitter", "data_processed", "covid_19.rds"))

Basic Data Information

total_tweets <- df %>%
  count()

n_users <- df %>%
  distinct(user_id) %>%
  count()

Below are the basic statistics (total number of tweets and number of unique users):

Total number of tweets: 659,618. Number of distinct users: 106,200.

Accounts with the highest number of tweets:

df %>%
  count(user_name, sort = T) %>%
  slice(1:10) %>%
  gt() %>%
  tab_options(
    heading.title.font.size = 13,
    column_labels.font.size = 11,
    table.font.size = 11
  ) %>%
  tab_header(
    title = "# of different tweets by user"
  ) %>%
  fmt_number(
    columns = n,
    decimals = 0
  )
# of different tweets by user
user_name n
Coronavirus bot 2,477
covid19bot 2,040
Cyber Security Feed 940
🚨#CoronaCamps R THE PLAN🚨NOT A DRILL🌹#NotMeUs 645
COVID-19 Real Time Numbers 540
NaFe065 481
. 462
The Quint 450
Hakkı Art. 439
Sally Deal 392

Accounts with the highest number of followers:

df %>%
  select(user_name, user_followers_count) %>%
  arrange(desc(user_followers_count)) %>%
  distinct(user_name, .keep_all = T) %>%
  slice(1:10) %>%
  gt() %>%
  tab_options(
    heading.title.font.size = 13,
    column_labels.font.size = 11,
    table.font.size = 11
  ) %>%
  tab_header(
    title = "Top 10 users by # of followers"
  ) %>%
  fmt_number(
    columns = user_followers_count,
    decimals = 0
  )
Top 10 users by # of followers
user_name user_followers_count
CGTN 14,050,164
Rachel Maddow MSNBC 9,930,284
Hindustan Times 7,173,683
People's Daily, China 7,097,825
Lonely Planet 6,306,359
ESPNcricinfo 5,872,015
UK Prime Minister 5,608,948
American Red Cross 5,320,293
Alfie Deyes 5,203,717
MTV NEWS 5,142,713

Which tweets were retweeted the most?

df %>%
  # n is the number of rows in df that share the same tweet id
  count(id, sort = TRUE) %>%
  slice(1:10) %>%
  # Join against one row per id so each tweet appears once in the table
  left_join(
    df %>%
      distinct(id, .keep_all = TRUE) %>%
      select(id, text, created_at),
    by = "id"
  ) %>%
  select(-id) %>%
  gt() %>%
  tab_options(
    heading.title.font.size = 13,
    column_labels.font.size = 11,
    table.font.size = 11
  ) %>%
  tab_header(
    title = "Top 10 most retweeted tweets"
  ) %>%
  fmt_number(
    columns = n,
    decimals = 0
  )
Top 10 most retweeted tweets
n text created_at
4 RT @PTI_News: Strictly-implemented social-distancing measures will reduce overall expected number of cases of #coronavirus pandemic by 62 p… 2020-03-23 16:58:47
3 RT @WajahatAli: Trump said #coronavirus came up "suddenly" &amp; they are now responding. I rarely swear but that's bullshit. Trump is 2 months… 2020-03-16 19:52:59
3 It perhaps doesn’t come as any surprise, but it’s still sad to hear to that @WestEndLIVE 2020, planned for 20 &amp; 21 June, has been cancelled due to the recent guidance from the Government over COVID-19 #coronavirus #theatrenews https://t.co/FBOu77wSG5 https://t.co/El481cNhWZ 2020-03-23 16:58:03
3 RT @PTI_News: 24-year-old man, who returned from Scotland last Thursday, tests positive for #COVIDー19 in Patna; total novel #coronavirus ca… 2020-03-23 16:58:03
3 Coronavirus briefing cancelled: Boris scraps conference for urgent #COVID_19 meeting https://t.co/GkfW2mNrBJ @Daily_Express https://t.co/lZx7Xjx3GI 2020-03-23 16:58:06
3 There are now 40,000 verified #coronavirus cases in the US.The death count is at 480 and climbing by the hour.What are the GOP doing about it?Playing the victim card.#TrumpVirus #COVIDー19 #COVIDIOTS 2020-03-23 16:58:08
3 RT @RajaniKohli: We need consistent internet connection, nothing can be worse than the unbelievable health scare which is going around the… 2020-03-23 16:58:15
3 RT @DrDenaGrayson: ⚠️Dr. Tony Fauci called the data for the anti-malarial #hydroxychlorquine “anecdotal...so you really can make any defini… 2020-03-23 16:58:20
3 The coronavirus "pandemic is accelerating. It took 67 days from the first reported case to reach the first 100,000 cases. Eleven days for the second 100,000 and just four days for the third 100,000," -- WHO Director-General Dr. Tedros Adhanom Ghebreyesus.#COVIDー19 #coronavirus 2020-03-23 16:58:31
3 #Ohio #Planned_Parenthood Refuses to Follow Order that Halts Abortions During #Coronavirus https://t.co/YK60U5pDGB @BreitbartNews https://t.co/0bTTa9hlo2 2020-03-23 16:58:32

Tweets per hour

library(scales)

df %>%
  select(created_at) %>%
  # Bin each tweet to the nearest hour (round_date() comes from lubridate)
  mutate(round_created_at_tweet = round_date(created_at, unit = "hour")) %>%
  count(round_created_at_tweet) %>%
  ggplot(aes(x = round_created_at_tweet, y = n)) +
  geom_line() +
  labs(
    x = "Date",
    y = "# of tweets"
  ) +
  scale_x_datetime() +
  theme_light()

BOTS

Distribution of prob_bot

Distribution of probability of being a bot:

df %>%
  select(prob_bot) %>%
  ggplot(aes(x = prob_bot)) +
  geom_histogram(bins = 10)

library(tidytext)

remove_reg <- "&amp;|&lt;|&gt;"
remove_urls <- "http"

df %>%
  mutate(text = str_remove_all(text, remove_reg)) %>%
  unnest_tokens(word, text, token = "tweets") %>%
  filter(
    !word %in% stop_words$word,
    !str_detect(word, remove_urls), # drop URL tokens
    !word %in% str_remove_all(stop_words$word, "'"),
    !word %in% c("rt", "#covid_19", "#covid19", "#covid", "#covid19esp", "#coronavirus", "#coronavirusesp"),
    str_detect(word, "[a-z]")
  ) -> tidy_df

tidy_df %>%
  group_by(is_bot) %>%
  count(word, sort = TRUE) %>%
  ungroup() %>%
  # Attach the total number of tokens per group to compute relative frequencies
  left_join(
    tidy_df %>%
      group_by(is_bot) %>%
      summarise(total = n()),
    by = "is_bot"
  ) %>%
  mutate(freq = n / total) -> tidy_freq

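tidy_freq %>%
  slice(1:100000) %>%
  # Keep only the columns needed so each word collapses to a single row
  select(is_bot, word, freq) %>%
  pivot_wider(names_from = is_bot, values_from = freq) %>%
  replace_na(list(`No Bot` = 0, Bot = 0)) %>%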
  ggplot(aes(x = `No Bot`, y = Bot)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")

tidy_df %>%
  filter(!str_detect(word, "^@")) %>%
  count(word, is_bot) %>%
  group_by(word) %>%
  filter(sum(n) >= 10) %>%
  ungroup() %>%
  spread(is_bot, n, fill = 0) %>%
  mutate_if(is.numeric, list(~ (. + 1) / (sum(.) + 1))) %>%
  mutate(logratio = log(`No Bot` / Bot)) %>%
  arrange(desc(logratio)) -> word_ratios

word_ratios %>%
  arrange(abs(logratio))
## # A tibble: 28,201 x 4
##    word             Bot  `No Bot`   logratio
##    <chr>          <dbl>     <dbl>      <dbl>
##  1 hundreds   0.000232  0.000232  -0.0000545
##  2 chart      0.0000662 0.0000662  0.000124 
##  3 fair       0.0000662 0.0000662  0.000124 
##  4 quit       0.0000536 0.0000536 -0.000173 
##  5 opinion    0.0000788 0.0000788  0.000326 
##  6 correct    0.000141  0.000141  -0.000397 
##  7 foreigners 0.0000863 0.0000864  0.000419 
##  8 halt       0.0000922 0.0000922 -0.000429 
##  9 detention  0.0000847 0.0000846 -0.000592 
## 10 touching   0.0000847 0.0000846 -0.000592 
## # … with 28,191 more rows
word_ratios %>%
  group_by(logratio < 0) %>%
  top_n(15, abs(logratio)) %>%
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("log odds ratio (No Bot/Bot)") +
  scale_fill_discrete(name = "", labels = c("No Bot", "Bot"))
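
The log ratio plotted above can be hard to read from the pipeline alone, so here is a minimal base R sketch of the same calculation. The word counts and the helper name smooth_share are made up for illustration; only the add-one smoothing and the No Bot/Bot ratio mirror the code above.

```r
# Made-up word counts per group (hypothetical numbers, not from the data).
counts <- data.frame(
  word   = c("update", "stay", "people"),
  bot    = c(30, 5, 15),
  no_bot = c(10, 45, 45)
)

# Add-one smoothing, then convert each count to a share of its group total.
# smooth_share is a hypothetical helper name used only in this sketch.
smooth_share <- function(x) (x + 1) / (sum(x) + 1)

# Log of the ratio of smoothed shares (No Bot over Bot), as in word_ratios.
counts$logratio <- log(smooth_share(counts$no_bot) / smooth_share(counts$bot))

# Positive values lean towards non-bot accounts, negative values towards bots.
print(counts[order(-counts$logratio), ])
```

The smoothing keeps the ratio finite for words that one group never uses, which would otherwise produce a division by zero or a log of zero.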

df %>%
  filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_remove_all(text, remove_reg)) %>%
  unnest_tokens(word, text, token = "tweets", strip_url = TRUE) %>%
  filter(
    !word %in% stop_words$word,
    !word %in% str_remove_all(stop_words$word, "'"),
    !word %in% c("rt", "#covid_19", "#covid19", "#covid", "#covid19esp", "#coronavirus", "#coronavirusesp"),
    str_detect(word, "[a-z]")
  ) %>%
  count(is_bot, word, sort = TRUE) %>%
  group_by(is_bot) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  ggplot(aes(reorder_within(word, n, is_bot), n, fill = is_bot)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  facet_wrap(~is_bot, scales = "free", ncol = 2) +
  coord_flip()

Topic Modeling

Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language.
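
To make the two mixtures concrete, the following base R sketch builds a tiny document-topic matrix (gamma) and topic-word matrix (beta) by hand and multiplies them. All numbers, document names, and words are invented for illustration; nothing here is fitted to the tweet data.

```r
# Toy illustration of the LDA mixture structure (all numbers are made up).
# gamma: one row per document, one column per topic; each row sums to 1.
gamma <- matrix(
  c(0.9, 0.1,
    0.5, 0.5,
    0.2, 0.8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc_a", "doc_b", "doc_c"), c("topic_1", "topic_2"))
)

# beta: one row per topic, one column per word; each row sums to 1.
beta <- matrix(
  c(0.6, 0.3, 0.1, 0.0,
    0.0, 0.1, 0.3, 0.6),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("topic_1", "topic_2"), c("virus", "test", "home", "stay"))
)

# Implied word distribution per document: a gamma-weighted blend of the topics.
word_probs <- gamma %*% beta
print(round(word_probs, 2))
```

A document that splits its weight between both topics, like doc_b here, ends up with probability mass on the words of both, which is the "overlap" described above.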

Word-topic probabilities

tidy_df %>%
  count(user_id, word) %>%
  select(user_id, word, n) %>%
  cast_dtm(user_id, word, n) -> dtm_df

topicmodels::LDA(dtm_df, k = 20, control = list(seed = 1234)) -> lda_tweets

topic_tweets <- tidy(lda_tweets)

tw_top_terms <- topic_tweets %>%
  group_by(topic) %>%
  top_n(20, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

tw_top_terms %>%
  mutate(
    topic = paste0("Topic ", topic),
    term = reorder_within(term, beta, topic),
  ) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = F) +
  facet_wrap(~topic, scales = "free", ncol = 3) +
  coord_flip() +
  scale_x_reordered()

Document-topic probabilities

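Besides estimating each topic as a mixture of words, LDA also models each document as a mixture of topics. We can examine the per-document-per-topic probabilities, called gamma. The following chart shows how often each topic is the dominant one (highest gamma), split by bot status. Because df has one row per tweet, the join repeats each user once per tweet, so the counts are in tweets rather than users. For example, topic 8 is the dominant topic behind 92,090 tweets from accounts labelled “No Bot” and 23,040 tweets from accounts labelled “Bot”.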

topic_tweets <- tidy(lda_tweets, matrix = "gamma")

topic_tweets %>%
  left_join(df %>%
    mutate(user_id = as.character(user_id)) %>%
    select(is_bot, user_id), by = c("document" = "user_id")) %>%
  group_by(document) %>%
  top_n(1, wt = gamma) %>%
  ungroup() %>%
  count(is_bot, topic, sort = T) -> f

f %>%
  mutate(topic = fct_reorder(as_factor(topic), n)) %>%
  arrange(desc(n)) %>%
  ggplot(aes(x = topic, y = n, fill = is_bot)) +
  geom_col(show.legend = T, position = "dodge2") +
  scale_x_reordered() +
  coord_flip()
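
The counts above come from joining gamma against tweet-level rows, so prolific accounts weigh more. If a per-user count is wanted instead, one option is to pick the dominant topic first and join against one row per user. The sketch below uses invented data; gamma_toy, tweets_toy, and users_toy are hypothetical names, and it assumes each user carries a single is_bot label.

```r
library(dplyr)

# Hypothetical per-document topic probabilities, shaped like tidy(lda, matrix = "gamma").
gamma_toy <- tibble(
  document = rep(c("u1", "u2", "u3"), each = 2),
  topic    = rep(1:2, times = 3),
  gamma    = c(0.8, 0.2, 0.3, 0.7, 0.6, 0.4)
)

# Hypothetical tweet-level data: u1 has three tweets, u2 and u3 one each.
tweets_toy <- tibble(
  user_id = c("u1", "u1", "u1", "u2", "u3"),
  is_bot  = c("Bot", "Bot", "Bot", "No Bot", "No Bot")
)

# One row per user, so the join cannot multiply rows by tweet volume.
users_toy <- distinct(tweets_toy, user_id, is_bot)

# Keep the single highest-gamma topic per document, then attach the bot label.
dominant <- gamma_toy %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  left_join(users_toy, by = c("document" = "user_id"))

# Number of users (not tweets) per bot label and dominant topic.
per_user <- count(dominant, is_bot, topic)
print(per_user)
```

Using slice_max() with with_ties = FALSE guarantees exactly one topic per document, whereas top_n() keeps every tied row.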

The following table shows the same information as the previous chart:

tw_top_terms %>%
  group_by(topic) %>%
  summarise(text = paste(term, collapse = " ")) %>%
  rename("Most common words in topics" = text) -> tw_top_terms_b

f %>%
  pivot_wider(id_cols = topic, names_from = is_bot, values_from = n) %>%
  left_join(tw_top_terms_b) %>%
  select(topic, `Most common words in topics`, `No Bot`, Bot) %>%
  mutate(topic = paste0("Topic ", topic)) %>%
  group_by(topic) %>%
  gt::gt()
Most common words in topics No Bot Bot
Topic 8
trump people @drdenagrayson @realdonaldtrump health tests @tedlieu stock @rvawonk testing americans act house pandemic test ago dear american positive million 92090 23040
Topic 14
president @realdonaldtrump china people national @realjameswoods virus democrats chinese trump media news @realcandaceo spread americans pandemic world crisis america country 53214 23119
Topic 6
india people @ani govt @narendramodi indian positive government march medical pm minister time stay total spread health fight world due 41418 12029
Topic 13
uk people nhs government test home govt strategy staff johnson breaking @billneelynbc @borisjohnson boris @skynews crisis testing health italy scientists 38763 7620
Topic 16
people positive @quicktake home tested county due health close breaking #breaking public restaurants social covid19 amid closed testing city stay 31292 6913
Topic 4
people home stay advice days social support symptoms news nhs time @borisjohnson @dhscgovuk government health paracetamol plenty fine ab washed 27285 6774
Topic 9
@who spread health @drtedros people protect support hands pandemic covid19 world response social countries stay global solidarity time information care 22064 6549
Topic 18
people government australia health @sahouraxo day time workers world home schools crisis doctors australian medical sanctions coronavirus test italy morrison 22063 4224
Topic 19
people workers pandemic crisis health @berniesanders time check care biden vote sick tested plan bernie response campaign spread congress @ninaturner 21603 5904
Topic 5
people chinese coronavirus china spread government health medical fight virus president country world covid19 italy countries time govt due outbreak 20626 7027
Topic 1
people #coronavirusoutbreak time house virus makes white calling watch morning trump kungflu official referred @weijia #coronaviruspandemic #covidー19 day #coronacrisis #coronavirusupdate 20178 5900
Topic 3
total deaths confirmed worldwide italy death people reports totaling #italy breaking toll #iran coronavirus usa iran due reported infected china 19594 10833
Topic 20
china chinese people #china world virus #wuhan @jenniferatntd communist time news media government pandemic home @basedpoland stop party wuhan italy 17541 7179
Topic 15
people italy photograph country @erichaywood spread china coronavirus @who message disease test bro positive deaths modern developing announced nation suspected 17341 4206
Topic 17
amazing night supermarket women late people farmers @chefjoseandres couple found checking markets 70s sams @augienash clubolder neededlady outcashier @nygovcuomo health 14869 2908
Topic 12
front health lines professionals debt owe dept trump physician gratitude pan spouse @barackobama emergency profound treating actively wholl @rachelpatzerphd diffi 13869 4227
Topic 10
message dad #dontbeaspreader @melbrooks @maxbrooksauthor https://t.co/hqhc4ffxbe hands wash safe stupid crazy joker @arkhamvideos https://t.co/aheohta8fg pandemic remember @gordonramsay south @bbcpolitics complacent 10100 2044
Topic 2
people bay yall shelter die america feel mind paying quarantine save poor italian regional mayors violating tryna bern️ @shezalibra favourite 9107 2504
Topic 11
public health street police actions orleans crowds bourbon jeopardizing due @brantlywx home people stay month san business lost korean francisco 8860 2295
Topic 7
coronavirus update data #bot countryregion https://t.co/h2u9je2w7o#coronavirus covid19 0recovered home 0the https://t.co/gimpo4sa6g#coronavirus pandemic online free stay #cybersecurity #pandemic 1the business read 6714 5578
