1 Data format

The Tweet JSON data has been converted to a data frame that is compatible with the rtweet package. The advantage of this conversion is that it “saves users from the wreck of time and frustration associated with disentangling the nasty nested list returned from Twitter’s API.” (see https://rtweet.info/reference/search_tweets.html, parse argument).

The descriptions of the variables that we will use are as follows:

source("R/utils.R")
get_vars_description()

source: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object

2 Basic Statistics

We made 5 extractions, which include the tweets that contain the keyword “Monsanto” and (“lawsuit” OR “trial”) from 2018-02-17 to 2019-01-04. We dropped the results before 2018-05-01 because most of that discussion was not about the recent lawsuit we are interested in, so the combined dataset contains only the tweets from 2018-05-01 to 2019-01-04.
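The truncation step can be sketched in base R as follows. This is a minimal illustration, not the code used for this report: the toy rows are made up, `tw_raw` and `tw_kept` are hypothetical names, and the cutoff of 2018-05-01 is taken from the date range stated for the combined dataset.

```r
# Toy stand-in for the combined extractions. `created_at` follows rtweet's
# column naming; the rows themselves are made up for illustration.
tw_raw <- data.frame(
  status_id  = c("1", "2", "3", "4"),
  created_at = as.POSIXct(
    c("2018-02-20 10:00:00", "2018-04-30 23:59:59",
      "2018-05-01 00:00:00", "2019-01-04 12:00:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Keep only the tweets posted on or after the cutoff date (assumed cutoff).
cutoff  <- as.POSIXct("2018-05-01", tz = "UTC")
tw_kept <- tw_raw[tw_raw$created_at >= cutoff, ]

# Date range of what is left.
range(tw_kept$created_at)
```

Comparing on the full timestamp rather than on a formatted date string avoids surprises at the day boundary.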

Total number of tweets:

tw <- readRDS("data/tweets_simp_513.rds")
nrow(tw)

## [1] 116978

Number of retweets:

table(tw$is_retweet)

## 
## FALSE  TRUE 
## 32471 84507

Number of quoted tweets:

table(tw$is_quote)

## 
##  FALSE   TRUE 
## 113923   3055
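The two tables above are easier to compare as shares of all tweets. A small sketch, using only the counts printed above (the variable names are illustrative):

```r
# Counts copied from the two tables above.
retweets <- c("FALSE" = 32471, "TRUE" = 84507)
quotes   <- c("FALSE" = 113923, "TRUE" = 3055)

# Both tables should add up to the same total number of tweets.
stopifnot(sum(retweets) == sum(quotes))

# Percentage of tweets in each class, rounded to one decimal place.
retweet_share <- round(100 * prop.table(retweets), 1)
quote_share   <- round(100 * prop.table(quotes), 1)

retweet_share
quote_share
```

On these counts, about 72.2% of the tweets are retweets and about 2.6% are quoted tweets.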

Number of unique tweets:

tw$text %>% 
  plain_tweets() %>% 
  unique() %>% 
  length()

## [1] 14656

3 User demographics

Total number of unique accounts:

users <- readRDS("data/users_simp_517.rds")
nrow(users)

## [1] 66640

Tweets per account:

nrow(tw) / nrow(users)

## [1] 1.755372

Distribution of users’ languages:

users$account_lang %>% 
  merge_lang() %>%
  top_of_table(5) %>%
  ggplot(aes(x = "", y = freq, fill = var)) +
  geom_col(width = 1) +
  coord_polar(theta = "y", direction = 1) + 
  scale_fill_brewer(name = "Language", 
                    labels = c("Deutsch", "English", "Español", 
                               "Français", "Nederlands", "Others"),
                    palette="Accent") +
  labs(x = NULL, y = NULL) +
  theme_minimal()
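The helper `top_of_table()` comes from R/utils.R and is not shown in this report. Judging from how it is used above (it returns columns `var` and `freq`, and the legend has an "Others" entry), it probably keeps the most frequent categories and lumps the rest together. The following is a hypothetical stand-in under that assumption, not the actual helper:

```r
# Hypothetical stand-in for top_of_table() from R/utils.R (the real helper is
# not shown in this report). It keeps the n most frequent categories of a
# one-way table and sums the remainder into "Others".
top_of_table_sketch <- function(tab, n) {
  tab  <- sort(tab, decreasing = TRUE)
  top  <- head(tab, n)
  rest <- sum(tab) - sum(top)
  out  <- data.frame(
    var  = names(top),
    freq = as.integer(top),
    stringsAsFactors = FALSE
  )
  if (rest > 0) {
    out <- rbind(out, data.frame(var = "Others", freq = as.integer(rest),
                                 stringsAsFactors = FALSE))
  }
  out
}

# Made-up language codes, for illustration only.
langs <- c("en", "en", "en", "fr", "fr", "de", "es", "nl")
res   <- top_of_table_sketch(table(langs), 2)
res
```

Lumping the tail into one category keeps the pie chart readable when there are many rare languages.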

Devices/App used:

tw$source %>% table() %>% top_of_table(6) %>%
  ggplot(aes(x = "", y = freq, fill = var)) +
  geom_col(width = 1) +
  coord_polar(theta = "y", direction = 1) + 
  scale_fill_brewer(palette="Accent") +
  labs(x = NULL, y = NULL, fill = "Utility") +
  theme_minimal()

Gender (identified by first name, via COSMOS software):

table(users$gender)

## 
##  FEMALE    MALE  UNISEX UNKNOWN 
##   16331   17269    3771   29269

table(users$gender) %>%
  as.data.frame() %>%
  ggplot(aes(x = "", y = Freq, fill = Var1)) +
  geom_col(width = 1) +
  coord_polar(theta = "y", direction = 1) +
  labs(x = NULL, y = NULL, fill = "Gender") +
  theme_minimal()

Geovisualization:

countries <- maps::map("world", namesonly = TRUE, plot = FALSE)
maps::map("world", region = countries[-grep("Antarctica", countries)], lwd = .25)
with(tw, points(lng, lat, pch = 20, cex = .75, col = rgb(0, .3, .7, .75)))

Note that the geolocation information is available on only 487 tweets.

tw$lat %>% is.na() %>% `!` %>% sum()

## [1] 487

3.1 User behavior

3.1.1 The most active users

Number of tweets contributed by each user:

ggplot(users, aes(x = ncreate)) +
  geom_histogram(bins = 50) +
  scale_y_log10() +
  labs(x = "No. of tweets contributed", y = "No. of users") +
  theme_light()

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 37 rows containing missing values (geom_bar).
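The two warnings most likely come from empty histogram bins: a bin with zero users has no finite value on a log scale, so it is dropped. The effect can be reproduced in base R with made-up counts (`ncreate` here is a toy vector, not the column from `users`):

```r
# Made-up per-user tweet counts: most users post once, one posts very often.
ncreate <- c(rep(1, 50), rep(2, 10), 3, 40)

# Bin the counts without drawing anything.
h <- hist(ncreate, breaks = seq(0, 40, by = 4), plot = FALSE)

h$counts            # several bins are empty
log10(h$counts)     # empty bins map to -Inf, which a log axis cannot draw
sum(h$counts == 0)  # number of bins that would be dropped
```

The plot itself is still valid; the dropped bins simply contain no users.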

3.1.2 Bots detection

Number of users that are classified as bots by tweetbotornot:

table(users$is.bot)

## 
## FALSE  TRUE 
## 66209   430

Number of tweets that are posted by bots:

table(tw$is.bot)

## 
##  FALSE   TRUE 
## 108097   8871

We only consider influential accounts, i.e. those whose number of contributed tweets is more than 1 standard deviation above the mean.
Threshold: fast.score > 0.95 OR normal.score > 0.90.
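One way to read the rule above is as two steps: first select the influential accounts, then apply the score threshold to them. The sketch below follows that reading on made-up accounts. The way the two conditions are combined, and the use of `ncreate` as the tweet count, are assumptions, and this is not the code that produced `is.bot`:

```r
# Made-up accounts; the column names follow the ones used in this report.
users_toy <- data.frame(
  screen_name  = c("a", "b", "c", "d", "e", "f"),
  ncreate      = c(1, 1, 2, 1, 30, 25),
  fast.score   = c(0.10, 0.99, 0.20, 0.30, 0.97, 0.40),
  normal.score = c(0.05, 0.95, 0.10, 0.20, 0.50, 0.30),
  stringsAsFactors = FALSE
)

# Step 1: influential accounts, i.e. more than 1 SD above the mean tweet count.
cutoff      <- mean(users_toy$ncreate) + sd(users_toy$ncreate)
influential <- users_toy$ncreate > cutoff

# Step 2: score rule, fast.score > 0.95 OR normal.score > 0.90.
flagged <- users_toy$fast.score > 0.95 | users_toy$normal.score > 0.90

# Assumed combination: an account is a bot only if both steps select it.
users_toy$is.bot <- influential & flagged
users_toy[, c("screen_name", "ncreate", "is.bot")]
```

In this toy data, account "b" has high scores but few tweets, so it is not flagged. Only "e" passes both steps.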

4 Time span

rtweet::ts_plot(tw) +
  #scale_y_log10() +
  labs(
    x = NULL, y = NULL,
    title = "Tweet counts aggregated using daily intervals",
    caption = "\nSource: Data collected from Twitter Premium API via Python client"
  ) +
  scale_x_datetime(date_breaks = "1 month", date_labels = "%b") +
  theme_light()

rtweet::ts_plot(tw, by = "weeks") +
  scale_y_log10() +
  labs(
    x = NULL, y = NULL,
    title = "Tweet counts aggregated using weekly intervals",
    caption = "\nSource: Data collected from Twitter Premium API via Python client"
  ) +
  scale_x_datetime(date_breaks = "1 month", date_labels = "%b") +
  theme_light()

The peak falls in the second week of August 2018, when the Roundup-Cancer verdict was finalized.
A base-10 log scale is used for the y axis because the counts are heavily skewed.
Note the slight lag between the count of all tweets and the count of original tweets. This makes sense because a retweet can only be posted after the tweet it repeats.
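To compare original tweets and retweets over time, the weekly counts can be split by tweet type. A base R sketch on made-up timestamps follows. The column names `created_at` and `is_retweet` follow rtweet's conventions, and `tw_toy` is a hypothetical stand-in for `tw`:

```r
# Made-up timestamps, for illustration only.
tw_toy <- data.frame(
  created_at = as.POSIXct(
    c("2018-08-06 09:00:00", "2018-08-10 21:00:00", "2018-08-11 08:00:00",
      "2018-08-11 09:30:00", "2018-08-14 12:00:00", "2018-08-15 18:00:00"),
    tz = "UTC"
  ),
  is_retweet = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

# Assign each tweet to the Monday that starts its week.
tw_toy$week <- cut(as.Date(tw_toy$created_at), breaks = "week")

# Weekly counts, split into original tweets (FALSE) and retweets (TRUE).
weekly <- table(week = tw_toy$week, is_retweet = tw_toy$is_retweet)
weekly
```

Plotting the two columns of `weekly` as separate lines makes any lag between original tweets and retweets visible.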

Data Summary

Steven Liu

May 2019