library(tidyverse)
library(lubridate)
library(gt)
# Load the processed tweets (one row per tweet)
df <- read_rds(here::here("data_twitter", "data_processed", "covid_19.rds"))
total_tweets <- df %>%
count()
n_users <- df %>%
distinct(user_id) %>%
count()
Below are the basic statistics (total number of tweets and number of unique users):
Total number of tweets: 659,618. Number of distinct users: 106,200.
df %>%
count(user_name, sort = T) %>%
slice(1:10) %>%
gt() %>%
tab_options(
heading.title.font.size = 13,
column_labels.font.size = 11,
table.font.size = 11
) %>%
tab_header(
title = "# of different tweets by user"
) %>%
fmt_number(
columns = vars(n),
decimals = 0
)
| # of different tweets by user | |
|---|---|
| user_name | n |
| Coronavirus bot | 2,477 |
| covid19bot | 2,040 |
| Cyber Security Feed | 940 |
| 🚨#CoronaCamps R THE PLAN🚨NOT A DRILL🌹#NotMeUs | 645 |
| COVID-19 Real Time Numbers | 540 |
| NaFe065 | 481 |
| . | 462 |
| The Quint | 450 |
| Hakkı Art. | 439 |
| Sally Deal | 392 |
df %>%
select(user_name, user_followers_count) %>%
arrange(desc(user_followers_count)) %>%
distinct(user_name, .keep_all = T) %>%
slice(1:10) %>%
gt() %>%
tab_options(
heading.title.font.size = 13,
column_labels.font.size = 11,
table.font.size = 11
) %>%
tab_header(
title = "Top 10 users by # of followers"
) %>%
fmt_number(
columns = vars(user_followers_count),
decimals = 0
)
| Top 10 users by # of followers | |
|---|---|
| user_name | user_followers_count |
| CGTN | 14,050,164 |
| Rachel Maddow MSNBC | 9,930,284 |
| Hindustan Times | 7,173,683 |
| People's Daily, China | 7,097,825 |
| Lonely Planet | 6,306,359 |
| ESPNcricinfo | 5,872,015 |
| UK Prime Minister | 5,608,948 |
| American Red Cross | 5,320,293 |
| Alfie Deyes | 5,203,717 |
| MTV NEWS | 5,142,713 |
df %>%
count(id, sort = T) %>%
slice(1:10) %>%
# Join against one row per tweet id so each top tweet appears only once
left_join(df %>% distinct(id, text, created_at), by = "id") %>%
select(-id) %>%
gt() %>%
tab_options(
heading.title.font.size = 13,
column_labels.font.size = 11,
table.font.size = 11
) %>%
tab_header(
title = "Top 10 most retweeted tweets"
) %>%
fmt_number(
columns = vars(n),
decimals = 0
)
| Top 10 most retweeted tweets | ||
|---|---|---|
| n | text | created_at |
| 4 | RT @PTI_News: Strictly-implemented social-distancing measures will reduce overall expected number of cases of #coronavirus pandemic by 62 p… | 2020-03-23 16:58:47 |
| 3 | RT @WajahatAli: Trump said #coronavirus came up "suddenly" & they are now responding. I rarely swear but that's bullshit. Trump is 2 months… | 2020-03-16 19:52:59 |
| 3 | It perhaps doesn’t come as any surprise, but it’s still sad to hear to that @WestEndLIVE 2020, planned for 20 & 21 June, has been cancelled due to the recent guidance from the Government over COVID-19 #coronavirus #theatrenews https://t.co/FBOu77wSG5 https://t.co/El481cNhWZ | 2020-03-23 16:58:03 |
| 3 | RT @PTI_News: 24-year-old man, who returned from Scotland last Thursday, tests positive for #COVIDー19 in Patna; total novel #coronavirus ca… | 2020-03-23 16:58:03 |
| 3 | Coronavirus briefing cancelled: Boris scraps conference for urgent #COVID_19 meeting https://t.co/GkfW2mNrBJ @Daily_Express https://t.co/lZx7Xjx3GI | 2020-03-23 16:58:06 |
| 3 | There are now 40,000 verified #coronavirus cases in the US.The death count is at 480 and climbing by the hour.What are the GOP doing about it?Playing the victim card.#TrumpVirus #COVIDー19 #COVIDIOTS | 2020-03-23 16:58:08 |
| 3 | RT @RajaniKohli: We need consistent internet connection, nothing can be worse than the unbelievable health scare which is going around the… | 2020-03-23 16:58:15 |
| 3 | RT @DrDenaGrayson: ⚠️Dr. Tony Fauci called the data for the anti-malarial #hydroxychlorquine “anecdotal...so you really can make any defini… | 2020-03-23 16:58:20 |
| 3 | The coronavirus "pandemic is accelerating. It took 67 days from the first reported case to reach the first 100,000 cases. Eleven days for the second 100,000 and just four days for the third 100,000," -- WHO Director-General Dr. Tedros Adhanom Ghebreyesus.#COVIDー19 #coronavirus | 2020-03-23 16:58:31 |
| 3 | #Ohio #Planned_Parenthood Refuses to Follow Order that Halts Abortions During #Coronavirus https://t.co/YK60U5pDGB @BreitbartNews https://t.co/0bTTa9hlo2 | 2020-03-23 16:58:32 |
library(scales)
df %>%
select(created_at) %>%
mutate(round_created_at_tweet = round_date(created_at, unit = "hour")) %>%
count(round_created_at_tweet) %>%
ggplot(aes(x = round_created_at_tweet, y = n)) +
geom_line() +
labs(
title = "",
x = "Date",
y = "# of tweets"
) +
scale_x_datetime() +
theme_light() +
theme(legend.position = "bottom") +
scale_color_brewer(palette = "Set1")
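Note that `round_date()` rounds to the nearest hour, so a tweet posted at 16:58 is counted in the 17:00 bin. A minimal sketch on three made-up timestamps (the vector `ts` is illustrative, not taken from `df`) contrasts it with `floor_date()`, which would keep each tweet in the hour in which it was posted:

```r
library(lubridate)

# Hypothetical timestamps, for illustration only
ts <- ymd_hms(
  c("2020-03-23 16:58:47", "2020-03-23 16:10:00", "2020-03-23 17:20:00"),
  tz = "UTC"
)

# round_date() snaps to the nearest hour: 16:58:47 moves up to 17:00
rounded <- round_date(ts, unit = "hour")

# floor_date() truncates to the start of the hour: 16:58:47 stays in 16:00
floored <- floor_date(ts, unit = "hour")

# Tweets per hourly bin under each rule
table(format(rounded, "%H:%M"))
table(format(floored, "%H:%M"))
```

Which rule is preferable depends on how the hourly bins are meant to be read; the choice shifts counts between adjacent hours but does not change the total.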
Distribution of probability of being a bot:
df %>%
select(prob_bot) %>%
ggplot(aes(x = prob_bot)) +
geom_histogram(bins = 10)
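The later sections group tweets by an `is_bot` label with the values `Bot` and `No Bot`. How that label was derived is not shown here; one plausible construction, sketched below on made-up scores, is to threshold `prob_bot`. The 0.5 cutoff and the helper name `label_bot` are assumptions for illustration only.

```r
# Hypothetical helper: turn a bot probability into the two labels used later.
# The 0.5 cutoff is an assumption, not a value taken from the analysis.
label_bot <- function(prob_bot, cutoff = 0.5) {
  ifelse(prob_bot >= cutoff, "Bot", "No Bot")
}

scores <- c(0.05, 0.49, 0.50, 0.93) # made-up probabilities
labels <- label_bot(scores)
labels
```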
library(tidytext)
remove_reg <- "&amp;|&lt;|&gt;"
remove_urls <- "http"
df %>%
mutate(text = str_remove_all(text, remove_reg)) %>%
unnest_tokens(word, text, token = "tweets") %>%
filter(
!word %in% stop_words$word,
# Drop URL tokens; str_detect() already returns the logical filter we need
!str_detect(word, remove_urls),
!word %in% str_remove_all(stop_words$word, "'"),
!word %in% c("rt", "#covid_19", "#covid19", "#covid", "#covid19esp", "#coronavirus", "#coronavirusesp"),
str_detect(word, "[a-z]")
) -> tidy_df
tidy_df %>%
group_by(is_bot) %>%
count(word, sort = T) %>%
ungroup() %>%
left_join(tidy_df %>%
group_by(is_bot) %>%
summarise(total = n())) %>%
mutate(freq = n / total) -> tidy_freq
library(scales)
tidy_freq %>%
slice(1:100000) %>%
# Keep only the columns needed so each word ends up on a single row
select(is_bot, word, freq) %>%
pivot_wider(names_from = is_bot, values_from = freq) %>%
replace_na(list(`No Bot` = 0, Bot = 0)) %>%
ggplot(aes(x = `No Bot`, y = Bot)) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
geom_abline(color = "red")
tidy_df %>%
filter(!str_detect(word, "^@")) %>%
count(word, is_bot) %>%
group_by(word) %>%
filter(sum(n) >= 10) %>%
ungroup() %>%
spread(is_bot, n, fill = 0) %>%
mutate_if(is.numeric, list(~ (. + 1) / (sum(.) + 1))) %>%
mutate(logratio = log(`No Bot` / Bot)) %>%
arrange(desc(logratio)) -> word_ratios
word_ratios %>%
arrange(abs(logratio))
## # A tibble: 28,201 x 4
## word Bot `No Bot` logratio
## <chr> <dbl> <dbl> <dbl>
## 1 hundreds 0.000232 0.000232 -0.0000545
## 2 chart 0.0000662 0.0000662 0.000124
## 3 fair 0.0000662 0.0000662 0.000124
## 4 quit 0.0000536 0.0000536 -0.000173
## 5 opinion 0.0000788 0.0000788 0.000326
## 6 correct 0.000141 0.000141 -0.000397
## 7 foreigners 0.0000863 0.0000864 0.000419
## 8 halt 0.0000922 0.0000922 -0.000429
## 9 detention 0.0000847 0.0000846 -0.000592
## 10 touching 0.0000847 0.0000846 -0.000592
## # … with 28,191 more rows
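To make the smoothing step concrete, the sketch below repeats the same arithmetic in base R on three made-up words (the counts are illustrative, not taken from the data). Each count is incremented by one and divided by the group total plus one, so a word that never occurs in one group still gets a finite ratio, and a word used at the same rate in both groups ends up at zero, like the words in the listing above.

```r
# Made-up counts per word in each group, for illustration only
counts <- data.frame(
  word = c("lockdown", "update", "hundreds"),
  bot = c(0, 90, 10),
  no_bot = c(40, 50, 10)
)

# Same transformation as in the pipeline above: (n + 1) / (sum(n) + 1)
smooth <- function(n) (n + 1) / (sum(n) + 1)

counts$logratio <- log(smooth(counts$no_bot) / smooth(counts$bot))

# Positive values lean towards "No Bot", negative values towards "Bot"
counts[order(-counts$logratio), ]
```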
word_ratios %>%
group_by(logratio < 0) %>%
top_n(15, abs(logratio)) %>%
ungroup() %>%
mutate(word = reorder(word, logratio)) %>%
ggplot(aes(word, logratio, fill = logratio < 0)) +
geom_col(show.legend = FALSE) +
coord_flip() +
ylab("log odds ratio (No Bot/Bot)") +
scale_fill_discrete(name = "", labels = c("No Bot", "Bot"))
df %>%
filter(!str_detect(text, "^RT")) %>%
mutate(text = str_remove_all(text, remove_reg)) %>%
unnest_tokens(word, text, token = "tweets", strip_url = TRUE) %>%
filter(
!word %in% stop_words$word,
!word %in% str_remove_all(stop_words$word, "'"),
!word %in% c("rt", "#covid_19", "#covid19", "#covid", "#covid19esp", "#coronavirus", "#coronavirusesp"),
str_detect(word, "[a-z]")
) %>%
count(is_bot, word, sort = T) %>%
group_by(is_bot) %>%
# summarise(n = n()) %>%
slice_max(n, n = 10) %>%
# arrange(n) %>%
ungroup() %>%
# mutate(word = fct_infreq(word, ordered = T)) %>%
ggplot(aes(reorder_within(word, n, is_bot), n, fill = is_bot)) +
geom_col(show.legend = F) +
scale_x_reordered() +
facet_wrap(~is_bot, scales = "free", ncol = 2) +
coord_flip()
Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language:
tidy_df %>%
count(user_id, word) %>%
select(user_id, word, n) %>%
cast_dtm(user_id, word, n) -> dtm_df
topicmodels::LDA(dtm_df, k = 20, control = list(seed = 1234)) -> lda_tweets
topic_tweets <- tidy(lda_tweets)
tw_top_terms <- topic_tweets %>%
group_by(topic) %>%
top_n(20, beta) %>%
ungroup() %>%
arrange(topic, -beta)
tw_top_terms %>%
mutate(
topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic),
) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = F) +
facet_wrap(~topic, scales = "free", ncol = 3) +
coord_flip() +
scale_x_reordered()
Besides estimating each topic as a mixture of words, LDA also models each document as a mixture of topics. We can examine the per-document-per-topic probabilities, called gamma. The following chart shows how often each topic is the dominant one, split by bot status. For example, Topic 8 is the main topic in 92,090 rows labelled “No Bot” and in 23,040 labelled “Bot”. Because the join with df repeats each user once per tweet, these counts are per tweet rather than per user.
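The two sets of probabilities can be read together: beta gives the word distribution of each topic, gamma gives the topic distribution of each document, and their product gives the word distribution implied for each document. A minimal sketch with made-up numbers (two topics, three words, two users; none of these values come from the fitted model) also shows how a single dominant topic per document is picked from gamma:

```r
# Made-up per-topic word probabilities (beta); each row sums to 1
beta <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.3, 0.6),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("topic1", "topic2"), c("trump", "people", "nhs"))
)

# Made-up per-document topic probabilities (gamma); each row sums to 1
gamma <- matrix(
  c(0.90, 0.10,
    0.25, 0.75),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("user_a", "user_b"), c("topic1", "topic2"))
)

# Word distribution implied for each document: a gamma-weighted mix of topics
doc_words <- gamma %*% beta
rowSums(doc_words) # each document's word probabilities still sum to 1

# Dominant topic per document: the topic with the largest gamma
dominant <- colnames(gamma)[apply(gamma, 1, which.max)]
dominant
```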
topic_tweets <- tidy(lda_tweets, matrix = "gamma")
topic_tweets %>%
left_join(df %>%
mutate(user_id = as.character(user_id)) %>%
select(is_bot, user_id), by = c("document" = "user_id")) %>%
group_by(document) %>%
top_n(1, wt = gamma) %>%
ungroup() %>%
count(is_bot, topic, sort = T) -> f
f %>%
mutate(topic = fct_reorder(as_factor(topic), n)) %>%
arrange(desc(n)) %>%
ggplot(aes(x = topic, y = n, fill = is_bot)) +
geom_col(show.legend = T, position = "dodge2") +
scale_x_reordered() +
coord_flip()
The following table shows the same information as the previous chart:
tw_top_terms %>%
group_by(topic) %>%
summarise(text = paste(term, collapse = " ")) %>%
rename("Most common words in topics" = text) -> tw_top_terms_b
f %>%
pivot_wider(id_cols = topic, names_from = is_bot, values_from = n) %>%
left_join(tw_top_terms_b) %>%
select(topic, `Most common words in topics`, `No Bot`, Bot) %>%
mutate(topic = paste0("Topic ", topic)) %>%
group_by(topic) %>%
gt::gt()
| Topic | Most common words in topics | No Bot | Bot |
|---|---|---|---|
| Topic 8 | trump people @drdenagrayson @realdonaldtrump health tests @tedlieu stock @rvawonk testing americans act house pandemic test ago dear american positive million | 92090 | 23040 |
| Topic 14 | president @realdonaldtrump china people national @realjameswoods virus democrats chinese trump media news @realcandaceo spread americans pandemic world crisis america country | 53214 | 23119 |
| Topic 6 | india people @ani govt @narendramodi indian positive government march medical pm minister time stay total spread health fight world due | 41418 | 12029 |
| Topic 13 | uk people nhs government test home govt strategy staff johnson breaking @billneelynbc @borisjohnson boris @skynews crisis testing health italy scientists | 38763 | 7620 |
| Topic 16 | people positive @quicktake home tested county due health close breaking #breaking public restaurants social covid19 amid closed testing city stay | 31292 | 6913 |
| Topic 4 | people home stay advice days social support symptoms news nhs time @borisjohnson @dhscgovuk government health paracetamol plenty fine ab washed | 27285 | 6774 |
| Topic 9 | @who spread health @drtedros people protect support hands pandemic covid19 world response social countries stay global solidarity time information care | 22064 | 6549 |
| Topic 18 | people government australia health @sahouraxo day time workers world home schools crisis doctors australian medical sanctions coronavirus test italy morrison | 22063 | 4224 |
| Topic 19 | people workers pandemic crisis health @berniesanders time check care biden vote sick tested plan bernie response campaign spread congress @ninaturner | 21603 | 5904 |
| Topic 5 | people chinese coronavirus china spread government health medical fight virus president country world covid19 italy countries time govt due outbreak | 20626 | 7027 |
| Topic 1 | people #coronavirusoutbreak time house virus makes white calling watch morning trump kungflu official referred @weijia #coronaviruspandemic #covidー19 day #coronacrisis #coronavirusupdate | 20178 | 5900 |
| Topic 3 | total deaths confirmed worldwide italy death people reports totaling #italy breaking toll #iran coronavirus usa iran due reported infected china | 19594 | 10833 |
| Topic 20 | china chinese people #china world virus #wuhan @jenniferatntd communist time news media government pandemic home @basedpoland stop party wuhan italy | 17541 | 7179 |
| Topic 15 | people italy photograph country @erichaywood spread china coronavirus @who message disease test bro positive deaths modern developing announced nation suspected | 17341 | 4206 |
| Topic 17 | amazing night supermarket women late people farmers @chefjoseandres couple found checking markets 70s sams @augienash clubolder neededlady outcashier @nygovcuomo health | 14869 | 2908 |
| Topic 12 | front health lines professionals debt owe dept trump physician gratitude pan spouse @barackobama emergency profound treating actively wholl @rachelpatzerphd diffi | 13869 | 4227 |
| Topic 10 | message dad #dontbeaspreader @melbrooks @maxbrooksauthor https://t.co/hqhc4ffxbe hands wash safe stupid crazy joker @arkhamvideos https://t.co/aheohta8fg pandemic remember @gordonramsay south @bbcpolitics complacent | 10100 | 2044 |
| Topic 2 | people bay yall shelter die america feel mind paying quarantine save poor italian regional mayors violating tryna bern️ @shezalibra favourite | 9107 | 2504 |
| Topic 11 | public health street police actions orleans crowds bourbon jeopardizing due @brantlywx home people stay month san business lost korean francisco | 8860 | 2295 |
| Topic 7 | coronavirus update data #bot countryregion https://t.co/h2u9je2w7o#coronavirus covid19 0recovered home 0the https://t.co/gimpo4sa6g#coronavirus pandemic online free stay #cybersecurity #pandemic 1the business read | 6714 | 5578 |