The Tweet JSON data has been converted to a data frame that is compatible with the rtweet package. The advantage of this conversion is that it “saves users from the wreck of time and frustration associated with disentangling the nasty nested list returned from Twitter’s API.” (see https://rtweet.info/reference/search_tweets.html, parse argument).
The descriptions of the variables that we will use are as follows:
source("R/utils.R")
get_vars_description()
source: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object
We made five extractions of tweets that contain the keyword “Monsanto” together with “lawsuit” or “trial”, covering 2018-02-17 to 2019-01-04. We then dropped all results before 2018-05-01, because most of the earlier discussion was not about the recent lawsuit we are interested in. The combined dataset therefore contains only the tweets from 2018-05-01 to 2019-01-04.
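The date truncation described above can be sketched as follows. This is a minimal illustration on a made-up data frame, not the code used to build the dataset: `tw_raw`, `cutoff` and `tw_kept` are hypothetical names, and the `created_at` column is assumed to follow the rtweet convention of a POSIXct timestamp in UTC.

```r
# Toy stand-in for the combined extractions (values are invented).
# created_at mirrors the rtweet column name; this is an assumption.
tw_raw <- data.frame(
  status_id  = c("1", "2", "3", "4"),
  created_at = as.POSIXct(
    c("2018-02-17 09:00:00", "2018-04-30 23:59:59",
      "2018-05-01 00:00:00", "2019-01-04 12:00:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff: keep tweets posted on or after 2018-05-01 (UTC).
cutoff <- as.POSIXct("2018-05-01", tz = "UTC")
tw_kept <- tw_raw[tw_raw$created_at >= cutoff, ]

# Check the date range that survives the truncation.
range(tw_kept$created_at)
```

Comparing against a POSIXct cutoff with an explicit time zone avoids the cutoff shifting with the local time zone of the machine that runs the script.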
Total number of tweets:
tw <- readRDS("data/tweets_simp_513.rds")
nrow(tw)
## [1] 116978
Number of retweets:
table(tw$is_retweet)
##
## FALSE TRUE
## 32471 84507
Number of quoted tweets:
table(tw$is_quote)
##
## FALSE TRUE
## 113923 3055
Number of unique tweets:
tw$text %>%
plain_tweets() %>%
unique() %>%
length()
## [1] 14656
Total number of unique accounts:
users <- readRDS("data/users_simp_517.rds")
nrow(users)
## [1] 66640
Tweets per account:
nrow(tw) / nrow(users)
## [1] 1.755372
Distribution of users’ languages:
users$account_lang %>%
merge_lang() %>%
top_of_table(5) %>%
ggplot(aes(x = "", y = freq, fill = var)) +
geom_col(width = 1) +
coord_polar(theta = "y", direction = 1) +
scale_fill_brewer(name = "Language",
labels = c("Deutsch", "English", "Español",
"Français", "Nederlands", "Others"),
palette="Accent") +
labs(x = NULL, y = NULL) +
theme_minimal()
Devices/App used:
tw$source %>% table() %>% top_of_table(6) %>%
ggplot(aes(x = "", y = freq, fill = var)) +
geom_col(width = 1) +
coord_polar(theta = "y", direction = 1) +
scale_fill_brewer(palette="Accent") +
labs(x = NULL, y = NULL, fill = "Utility") +
theme_minimal()
Gender (identified by first name, via COSMOS software):
table(users$gender)
##
## FEMALE MALE UNISEX UNKNOWN
## 16331 17269 3771 29269
table(users$gender) %>%
as.data.frame() %>%
ggplot(aes(x = "", y = Freq, fill = Var1)) +
geom_col(width = 1) +
coord_polar(theta = "y", direction = 1) +
labs(x = NULL, y = NULL, fill = "Gender") +
theme_minimal()
Geovisualization:
countries <- maps::map("world", namesonly = TRUE, plot = FALSE)
maps::map("world", region = countries[-grep("Antarctica", countries)], lwd = .25)
with(tw, points(lng, lat, pch = 20, cex = .75, col = rgb(0, .3, .7, .75)))
Note that the geolocation information is available on only 487 tweets.
sum(!is.na(tw$lat))
## [1] 487
Number of tweets contributed by each user:
ggplot(users, aes(x = ncreate)) +
geom_histogram(bins = 50) +
scale_y_log10() +
labs(x = "No. of tweets contributed", y = "No. of users") +
theme_light()
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 37 rows containing missing values (geom_bar).
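The two warnings above are most likely a side effect of the log-scaled count axis: a histogram bin that contains no users has a count of 0, and log10(0) is -Inf, so those bars are dropped. A minimal base-R sketch of the mechanism, using invented values in place of `users$ncreate` (the vector `ncreate` and the bin width are assumptions made only for illustration):

```r
# Hypothetical, heavily skewed contribution counts standing in for users$ncreate.
ncreate <- c(rep(1, 50), rep(2, 20), rep(3, 5), 40)

# Bin the values as a histogram would; plot = FALSE returns the counts only.
h <- hist(ncreate, breaks = seq(0, 40, by = 4), plot = FALSE)

# Empty bins have a count of 0, and log10(0) is -Inf.
log_counts <- log10(h$counts)
n_empty <- sum(is.infinite(log_counts))
n_empty
```

The dropped bars are empty bins between the bulk of the users and a few very active accounts, so no user is lost from the plot.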
Number of users classified as bots by tweetbotornot:
table(users$is.bot)
##
## FALSE TRUE
## 66209 430
Number of tweets posted by bots:
table(tw$is.bot)
##
## FALSE TRUE
## 108097 8871
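The two bot tables are easier to compare as shares than as raw counts. The sketch below copies the counts printed above into named vectors (`bot_users` and `bot_tweets` are hypothetical names) and converts them to percentages. Rows with a missing `is.bot` value are not shown by `table()`, so the shares are relative to the classified rows only.

```r
# Counts copied from the two tables above (FALSE = not a bot, TRUE = bot).
bot_users  <- c(`FALSE` = 66209, `TRUE` = 430)
bot_tweets <- c(`FALSE` = 108097, `TRUE` = 8871)

# Convert counts to shares of the classified rows.
share_users  <- bot_users / sum(bot_users)
share_tweets <- bot_tweets / sum(bot_tweets)

# Percentage of accounts and of tweets attributed to bots.
round(100 * c(accounts = share_users[["TRUE"]],
              tweets   = share_tweets[["TRUE"]]), 2)
```

Accounts flagged as bots make up roughly 0.65% of the classified accounts but about 7.58% of the classified tweets. On the full data frames the same shares could be obtained with `prop.table(table(users$is.bot))` and `prop.table(table(tw$is.bot))`.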
rtweet::ts_plot(tw) +
#scale_y_log10() +
labs(
x = NULL, y = NULL,
title = "Tweet counts aggregated using daily intervals",
caption = "\nSource: Data collected from Twitter Premium API via Python client"
) +
scale_x_datetime(date_breaks = "1 month", date_labels = "%b") +
theme_light()
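# Aggregate by week so the plot matches its title (ts_plot defaults to daily bins)
rtweet::ts_plot(tw, by = "weeks") +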
scale_y_log10() +
labs(
x = NULL, y = NULL,
title = "Tweet counts aggregated using weekly intervals",
caption = "\nSource: Data collected from Twitter Premium API via Python client"
) +
scale_x_datetime(date_breaks = "1 month", date_labels = "%b") +
theme_light()