Coding Notebook: Clintons and Ghislaine Maxwell Twitter Conspiracy Theories

In this notebook, we are going to take a look at a dataset of Twitter data, collected using {rtweet}. I searched for roughly 18,000 tweets that mention “clintons” about three hours after the FBI announced it had arrested Ghislaine Maxwell, an associate of Jeffrey Epstein’s who is widely believed to have helped develop and manage a sex trafficking ring.

In the past, far-right communities on social media have amplified links and conspiracy theories about the Clintons’ association with Jeffrey Epstein and Ghislaine Maxwell, so my guiding research question for this post is:

Data collection

Thanks to the work that Michael Kearney has put into rtweet, collecting data is a breeze, requiring only a few lines of code. The commented lines below will run the data scrape, but since we’re interested in a snapshot in time, I’ve also provided the csv of data. The mutate() call after read_csv() re-converts the hashtags column into a list column, which is more usable later on. I load tidyverse and rtweet up front.

library(rtweet)
library(tidyverse)
# q <- search_tweets("clintons", n = 18000, include_rts = FALSE) 
# q %>% write_as_csv("clintons_tweets.csv")
q <- read_csv("clintons_tweets.csv") %>% 
  mutate(hashtags = str_split(hashtags, " "))
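To see what that mutate() call is doing, here is a minimal sketch on made-up data. The toy tibble and its values are hypothetical, and I am assuming the csv stores each tweet’s hashtags as one space-separated string, which is what the str_split() call above implies.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for what read_csv() returns: one hashtag string per tweet
toy <- tibble(
  status_id = c("x1", "x2", "x3"),
  hashtags  = c("QAnon WWG1WGA", "MAGA", NA)
)

# Split each string on spaces so the column becomes a list of character vectors
toy_list <- toy %>%
  mutate(hashtags = str_split(hashtags, " "))

# Number of elements per tweet; the tweet with no hashtags keeps a single NA
lengths(toy_list$hashtags)
```

Note that a tweet without hashtags still has length 1 here, because its only element is NA. That is why the counting code later checks is.na() instead of relying on the length.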

To visualize tweets over time, I’ll use ts_plot, which handles a lot of datetime processing for us. This is adapted from an example from rtweet.info. Unsurprisingly, the plot reveals a massive spike in mentions of the Clintons right at July 2, when the arrest of Maxwell occurred. While there’s always a low burn of anti-Clinton talk on social media, the enormous spike indicates a pretty significant shift as a result of the arrest.

q %>% 
  ts_plot("3 hours") +
  theme_minimal() +
  theme(plot.title = ggplot2::element_text(face = "bold")) +
  labs(
    x = NULL, y = NULL,
    title = "Frequency of Twitter posts about the Clintons",
    subtitle = "Tweet counts aggregated using three-hour intervals"
  )
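For intuition about what “aggregated using three-hour intervals” means, here is a rough sketch that bins a few made-up timestamps by hand. This is not ts_plot’s internal implementation, just an approximation of the idea using lubridate::floor_date; the timestamps are hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical tweet timestamps
toy <- tibble(created_at = ymd_hms(c(
  "2020-07-02 13:05:00", "2020-07-02 14:59:00",
  "2020-07-02 15:00:00", "2020-07-02 20:30:00"
), tz = "UTC"))

# Round each timestamp down to the start of its three-hour window, then count
binned <- toy %>%
  mutate(bin = floor_date(created_at, "3 hours")) %>%
  count(bin)

binned
```

The first two tweets fall into the same 12:00 window, so that bin gets a count of 2 while the other two windows get 1 each.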

My goal is to see if we can describe the users tweeting about the Clintons, and one of the ways we can do this is by observing their tweet patterns. One of my prior assumptions, for example, is that right-wing and QAnon users generally use more hashtags and @-mention more people than mainstream users. Let’s plot the average number of hashtags per tweet for each day:

q %>% 
  unnest(cols = hashtags, keep_empty = TRUE) %>% 
  mutate(num = case_when(is.na(hashtags) ~ 0,
                         TRUE ~ 1)) %>% 
  group_by(status_id) %>% 
  summarise(sum_hash = sum(num)) -> grouped_sums

q %>% 
  select(status_id, created_at) %>% 
  left_join(grouped_sums, by = "status_id") %>% 
  mutate(day = lubridate::floor_date(created_at, "days")) %>% 
  group_by(day) %>% 
  summarise(day_ave = mean(sum_hash)) %>% 
  ggplot(aes(x = day, y = day_ave)) +
  geom_point() + 
  geom_line() + 
  labs(
    x = "Date", y = "Average Hashtags per Tweet",
    title = "Average Hashtags per Tweet in Tweets about the Clintons"
  ) + 
  theme_minimal()

Not surprisingly, there’s a noticeable uptick in hashtags per Tweet on July 2, when Maxwell was arrested. While alone this cannot prove amplification by far-right actors, it indicates the possible presence of hashtag chaining - a strategy that many users employ in an attempt to boost their engagement numbers.

Another indication of this is an abnormally high number of @-mentions per Tweet, which are colloquially known as “Trump Trains” (although they’re often used in non-Trump QAnon discussions, as well).

In this dataset, we don’t see an increase in average mentions per Tweet:

q %>% 
  mutate(mentions = str_extract_all(text, "@[a-zA-Z0-9_]+")) %>% 
  unnest(cols = mentions, keep_empty = TRUE) %>% 
  mutate(mentions = str_replace(mentions, "@", "")) %>% 
  mutate(num = case_when(is.na(mentions) ~ 0,
                         TRUE ~ 1)) %>% 
  group_by(status_id) %>% 
  summarise(sum_hash = sum(num)) -> grouped_sums 

q %>% 
  select(status_id, created_at) %>% 
  left_join(grouped_sums, by = "status_id") %>% 
  mutate(day = lubridate::floor_date(created_at, "days")) %>% 
  group_by(day) %>% 
  summarise(day_ave = mean(sum_hash)) %>% 
  ggplot(aes(x = day, y = day_ave)) +
  geom_point() + 
  geom_line() + 
  labs(
    x = "Date", y = "Average Mentions per Tweet",
    title = "Average Mentions per Tweet in Tweets about the Clintons"
  ) + 
  theme_minimal()

However, we do see an uptick in the number of Tweets per day that contain 10 or more mentions, another indicator of possible far-right/QAnon coordinated amplification of a narrative.

q %>% 
  select(status_id, created_at) %>% 
  left_join(grouped_sums, by = "status_id") %>% 
  filter(sum_hash >= 10) %>%
  mutate(day = lubridate::floor_date(created_at, "days")) %>% 
  count(day) %>%
  ggplot(aes(x = day, y = n)) +
  geom_point() + 
  geom_line() + 
  labs(
    x = "Date", y = "Count of 10+-Mention Tweets",
    title = "Number of 10+-Mention Tweets per Day in Tweets about the Clintons"
  ) + 
  theme_minimal()
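The mention counting behind the last two plots can be sanity-checked on a few made-up tweets. This sketch swaps in stringr::str_count as a shortcut for the extract, unnest, and sum steps used above; the toy tweets and the n_mentions column name are hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical tweets: no mentions, two mentions, and a twelve-mention "train"
toy <- tibble(
  status_id = c("x1", "x2", "x3"),
  text = c(
    "No mentions here",
    "@alice and @bob_99 should see this",
    paste0("@user", 1:12, collapse = " ")
  )
)

# Count matches of the same @-mention pattern used in the notebook
toy_mentions <- toy %>%
  mutate(n_mentions = str_count(text, "@[a-zA-Z0-9_]+"))

# Keep only the tweets with 10 or more mentions
heavy <- toy_mentions %>% filter(n_mentions >= 10)

toy_mentions
```

Only the third tweet survives the filter, which is the same kind of tweet the 10+-mention plot is counting per day.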

If we take a look at the most dominant hashtags used in this dataset, the trend becomes clearer. Of the top 20 hashtags, 11 are directly related to right-wing ideologies, and 5 of those are linked to QAnon (#QAnon, #Pizzagate, #Obamagate, #ClintonBodyCount, and #WWG1WGA). Although QAnon supplanted Pizzagate a few years ago, the original Pizzagate conspiracy theory has made a resurgence during the COVID-19 pandemic.

q %>% 
  unnest(cols = "hashtags") %>% 
  mutate(hashtags = tolower(hashtags)) %>% 
  count(hashtags) %>% 
  filter(!is.na(hashtags)) %>%
  top_n(n = 20) %>% 
  ggplot(aes(x = reorder(hashtags, n), y = n)) +
  geom_col() + 
  coord_flip() + 
  labs(
    x = "Hashtag",
    y = "Occurrence among 18k tweets about the Clintons", 
    title = "Hashtag frequency in dataset of tweets mentioning the Clintons\n after Ghislaine Maxwell's arrest"
  ) + 
  theme_minimal()

And, of course, we see a spike in use of the three core QAnon hashtags right at July 2.

q %>% 
  unnest(cols = hashtags) %>% 
  filter(grepl("qanon|wwg1wga|pizzagate", tolower(hashtags))) %>% 
  mutate(hashtags = tolower(hashtags)) %>% 
  select(status_id, created_at) %>% 
  mutate(day = lubridate::floor_date(created_at, "days")) %>% 
  count(day) %>% 
  ggplot(aes(x = day, y = n)) +
  geom_point() + 
  geom_line() + 
  labs(
    x = "Date", y = "Frequency of QAnon-related Hashtags",
    title = "QAnon-related Hashtags per Day in Tweets about the Clintons"
  ) + 
  theme_minimal()
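One detail worth knowing about the filter above: grepl() does a substring match on the lowercased hashtag, so variants that merely contain one of the three terms are counted too, and tweets with no hashtags (NA) drop out. A small sketch with made-up hashtags (the vector below is hypothetical):

```r
# Hypothetical hashtags, including a variant and a missing value
hashtags <- c("QAnon", "qanonarmy", "WWG1WGA", "MAGA", NA)

# Same pattern as the notebook; grepl() returns FALSE for NA input
is_qanon <- grepl("qanon|wwg1wga|pizzagate", tolower(hashtags))

data.frame(hashtags, is_qanon)
```

If exact matches were wanted instead, anchoring the pattern with ^ and $ would be one option, but the looser match seems reasonable for catching spelling variants.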

At this point, I’m pretty convinced there’s at least some activity from QAnon and Pizzagate accounts (I know, I know, that’s the safest prediction I could make). Next, I will start building networks to try to visualize activity and see what else we can find. If we want to keep all of the attributes of the dataset - that is, all of the columns besides just username - we need to make an edgelist and a nodelist, each with their own attributes.

Edge attributes are characteristics of the relationship between two users. On Twitter, where each edge comes from a specific tweet that mentions one or more users, there aren’t many attributes that describe the connection itself. We might collapse multiple tweets from one person @-mentioning the same other person into a single edge, in which case the number of tweets would be an edge attribute we could use to weight that edge.

More interesting in this case are node attributes, which describe the user (or the specific tweet). Here, I’ll save the sum total of retweets garnered by all of the tweets posted by a user in this dataset. I also build a “color” attribute that uses colorRamp to apply a color gradient to the number of retweets. This way, we can color nodes in our network visualization.
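As a rough illustration of the color gradient idea, here is a base R sketch on made-up retweet totals. It uses a slightly different rescaling (log1p divided by its maximum) than the notebook’s own code, chosen so the most-retweeted account maps to fully blue; the numbers are hypothetical.

```r
# Hypothetical retweet totals for five accounts
retweets_total <- c(0, 1, 10, 100, 1000)

# colorRamp() returns a function mapping values in [0, 1] to rows of RGB (0-255)
ramp <- colorRamp(c("white", "blue"))

# Rescale to [0, 1]; log1p() compresses the long right tail of retweet counts
scaled <- log1p(retweets_total) / max(log1p(retweets_total))

# Convert the RGB matrix into hex color strings
node_colors <- rgb(ramp(scaled) / 255)
node_colors
```

An account with zero retweets comes out white and the top account comes out pure blue, with everything else shaded in between.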

After creating a nodelist and edgelist, we can use igraph::graph_from_data_frame to build a graph object.

library(igraph)
library(visNetwork)
edges <- q %>% 
  mutate(mentions = str_extract_all(text, "@[a-zA-Z0-9_]+")) %>% 
  unnest(cols = mentions) %>% 
  mutate(mentions = str_replace(mentions, "@", "")) %>% 
  mutate(screen_name = tolower(screen_name)) %>% 
  mutate(mentions = tolower(mentions)) %>% 
  mutate(qanon = case_when(grepl("qanon|wwg1wga|pizzagate|clintonbodycount|maga|obamagate", tolower(text))~1,
                           TRUE ~ 0)) %>% 
  select(screen_name, mentions, qanon)

# One row per account that tweeted, with total retweets across its tweets
nodes <- q %>% 
  mutate(screen_name = tolower(screen_name)) %>% 
  group_by(screen_name) %>% 
  summarise(retweets_total = sum(retweet_count)) %>% 
  # Add accounts that were only mentioned; they get 0 retweets
  full_join(edges %>% select(mentions), by = c("screen_name" = "mentions")) %>% 
  mutate(retweets_total = coalesce(as.numeric(retweets_total), 0)) %>% 
  # The join repeats an account once per mention, so keep one row each
  distinct(screen_name, retweets_total)

f <- colorRamp(c("white", "blue"))


nodes$color <- log(nodes$retweets_total / diff(range(nodes$retweets_total))+1)


nodes$color <- rgb(f(nodes$color)/255)

g <- graph_from_data_frame(edges, vertices = nodes)
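The weighted-edge idea mentioned earlier (collapsing repeated mentions between the same pair of users into one edge) is not used in this notebook, but a minimal sketch of it might look like the following. All account names and numbers are hypothetical, and the weight column name is my own choice.

```r
library(dplyr)
library(igraph)

# Hypothetical edgelist: one row per (tweet author, mentioned account) pair
toy_edges <- tibble(
  screen_name = c("alice", "alice", "alice", "bob"),
  mentions    = c("carol", "carol", "dave",  "carol")
)

# Collapse repeated author -> mention pairs into one edge with a weight
weighted_edges <- toy_edges %>%
  count(screen_name, mentions, name = "weight")

# Hypothetical nodelist; the first column must hold the vertex names
toy_nodes <- tibble(
  screen_name    = c("alice", "bob", "carol", "dave"),
  retweets_total = c(10, 0, 250, 3)
)

# Extra edge columns become edge attributes, extra node columns vertex attributes
toy_g <- graph_from_data_frame(weighted_edges, vertices = toy_nodes)

E(toy_g)$weight
V(toy_g)$retweets_total
```

Four mention rows become three edges, and the alice to carol edge carries a weight of 2.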

Instead of doing a normal 2D plot using igraph’s built-in plotting, I’m going to use a library called threejs to graph our network in 3D. You can interact with it: spin it, zoom in, look for relationships between the nodes. Nodes with high retweet counts are colored blue and drawn larger (node size scales with the log of total retweets). Edges that represent Tweets that mention QAnon or Trump-related terms are colored maroon; all other edges are dark gray.

What patterns do you see?

V(g)$size <- log(V(g)$retweets_total + 2) / 2
E(g)$arrow.size = .05
E(g)$arrow.width = .05
#E(g)$width = .5
E(g)$color = ifelse(E(g)$qanon == 1, "maroon", "dark gray")

library(threejs)
graphjs(g)

I notice a few things right away. There is a dense core of connections surrounded by a number of smaller groups or isolates. Most of the large nodes are in that inner core (although there are some larger ones among the disconnected nodes). It’s also clear that the inner, large community is dense with QAnon and Trump-related rhetoric. In the next code block, I separate out that large community using decompose, which splits the network object into each of its independent parts.

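# Split the graph into weakly connected components with at least 5 vertices
dg <- decompose(g, mode = "weak", min.vertices = 5)

# Take the first component, which is the large core in this dataset
k <- dg[[1]]
graphjs(k)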
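decompose() returns a list of subgraphs, and the notebook takes the first one. If you would rather not depend on the order of that list, a hedged alternative is to pick the largest component explicitly by vertex count. The toy graph below is hypothetical.

```r
library(igraph)

# Hypothetical directed edgelist with two separate clusters
toy_edges <- data.frame(
  from = c("a", "b", "c", "x"),
  to   = c("b", "c", "a", "y")
)
toy_g <- graph_from_data_frame(toy_edges)

# Split into weakly connected components
parts <- decompose(toy_g, mode = "weak")

# Measure each component and keep the one with the most vertices
sizes <- sapply(parts, vcount)
largest <- parts[[which.max(sizes)]]

V(largest)$name
```

Here the three-vertex cluster is selected regardless of where it sits in the list.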

One of the ways to make sense of the network visualization is by computing a variety of local (node-level) measures. I’ve spoken about these before. Let’s find the nodes with the highest values for each of the measures. Sorting on out degree (the number of different users a certain user has mentioned), we find a number of different accounts with the generic “noun+string of numbers” username format, which can sometimes indicate bot or inauthentic activity, especially when paired with the high out-degree measure. In degree reveals mostly celebrities, which is common with right-wing users who often throw in mentions to powerful accounts.

metrics <- data.frame(
  "Out Degree" = degree(k, mode = "out"),
  "In Degree" = degree(k, mode = "in"),
  "Eigenvector" = eigen_centrality(k)$vector,  # evcent() is the older, deprecated name
  "Betweenness" = betweenness(k),
  "Closeness" = closeness(k)
) %>% 
  arrange(desc(Out.Degree))

# all_users <- metrics %>% 
#   arrange(desc(Out.Degree)) %>%
#   row.names() 

# all_users_info <- lookup_users(all_users)
# 
# all_users_info %>% 
#   mutate(qanon = case_when(grepl("qanon|wwg1wga|qarmy|flynn|storm|white rabbit|jfk", tolower(description)) == TRUE ~ 1,
#                            TRUE ~ 0)) %>% 
#   mutate(maga = case_when(grepl("maga|kag|trump", tolower(description)) == TRUE ~ 1,
#                            TRUE ~ 0)) %>% 
#   mutate(con = case_when(grepl("conservative|2a|patriot", tolower(description)) == TRUE ~ 1,
#                            TRUE ~ 0)) -> labels

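# top_n() without `wt` ranks by the last column (Closeness), so this keeps the
# 20 highest-closeness accounts and shows them in descending out-degree order
metrics %>% arrange(desc(Out.Degree)) %>% top_n(n=20) %>% knitr::kable(format = "html")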

Account          Out.Degree  In.Degree  Eigenvector  Betweenness
digivorr                 52          5    0.0000000    831.78810
thereseosulliv2          51          2    0.0000000     85.33333
bfhistory12              50          0    0.0000000      0.00000
blakdragonheart          50          2    0.0000000     70.33333
chrisg409ubc             50          7    0.0000000    585.20238
flattielover             50          7    0.0000000   1091.93175
lor_blueeyes             50          1    0.0000000      0.00000
psfnyc5                  50          5    0.0000000   1541.05000
qdecoder                 50          1    0.0000000      0.00000
robertdobbs2018          50          4    0.0000000    440.66429
sandycedar59             50          5    0.0000000    857.04841
swbhfx                   50          1    0.0000000      0.00000
traveler3906             50          1    0.0000000      0.00000
zaharias19               50          4    0.0000000    837.56508
paronoidthe              49          1    0.0000000     46.00000
dinmark2                 45          0    0.0000000      0.00000
_netizenn                 9          1    0.0512168      4.50000
simbajoseph               6          0    0.0126231      0.00000
swedikaji                 2          0    0.0127905      0.00000
pwrcane1                  1          4    0.0000000      0.00000
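If the goal is strictly the top accounts by one measure, ranking by that column explicitly is a safer pattern. Here is a small sketch on a hypothetical mention network, using dplyr::slice_max; the account names and the account column are made up for illustration.

```r
library(dplyr)
library(igraph)

# Hypothetical mention network: "hub" mentions three accounts, "fan" mentions one
toy_g <- graph_from_data_frame(data.frame(
  from = c("hub", "hub", "hub", "fan"),
  to   = c("celeb", "a", "b", "celeb")
))

# Out-degree = accounts a user mentions; in-degree = times a user is mentioned
toy_metrics <- data.frame(
  account    = V(toy_g)$name,
  Out.Degree = degree(toy_g, mode = "out"),
  In.Degree  = degree(toy_g, mode = "in")
)

# slice_max() ranks by the named column rather than by column position
top_out <- toy_metrics %>% slice_max(Out.Degree, n = 1)
top_in  <- toy_metrics %>% slice_max(In.Degree, n = 1)

top_out
top_in
```

The heavy mentioner tops the out-degree ranking and the heavily mentioned account tops in-degree, which mirrors the “mass mentioner versus celebrity” split described above.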

That’s all for this week!

Alex Newhouse

6/30/2020