Introduction

In recent years, a novel kind of information warfare has started to emerge. With the advent of social media, various authoritarian countries are actively trying to destabilize democratic institutions in the Western world by flooding websites like Twitter and Facebook with misinformation. The agencies responsible for most of it primarily target American elections, including the presidential election of 2016 and the midterms two years later. Based on investigations by independent think tanks and intelligence agencies, Russia has been identified as the leading actor. Specifically, there is a state-funded company in Saint Petersburg that hires hundreds of people to write inflammatory comments on the American segment of the main social media platforms to pit people against each other. This instance may be only the beginning of a streak of potentially more pernicious attempts to sow division within peaceful societies worldwide.

As a Russian political exile living in the United States, I consider it particularly important to find statistical patterns in the tweets made by Putin’s trolls. It is a fascinating but dangerous phenomenon, which we need to analyze to learn how to fight back against such misinformation. Luckily, researchers have managed to identify these fake accounts and consolidate their data into several datasets that, together, contain nearly three million tweets. Just as data help predict people’s actions on a large scale, they can also help us pinpoint the logic behind hostile actors’ attacks. With that in mind, the main research question is: what statistical patterns are there in the tweets made by the Russian fake accounts? Investigating this question is crucial because such information can provide us with concrete tools to identify further attacks and develop preventive measures against them. In this study, we examine different dataset variables in search of peculiar relationships.
Because these fake accounts were created centrally, our primary focus is on tracing potential similarities between them.

For instance, at what time of day do the Russian operatives usually act? What are the most widely used languages in which these tweets were written? Are disproportionately prolific days connected to major political events, like elections or primaries? Do the majority of the accounts have roughly the same ratio of followers to following? These, among many others, are some of the questions that can help us understand the nature of the misinformation attacks that the country faces every election. Since understanding is the precursor of good decision-making, the major platforms, together with government officials, will have a more comprehensive view of the situation when crafting policies to combat future attacks. The data we are using are about 250,000 tweets, a fraction of the nearly 3 million that Clemson University researchers Darren Linvill and Patrick Warren gathered as a part of special counsel Robert Mueller’s Russia investigation. The basis for the Twitter handles included in this data are the November 2017 and June 2018 lists of Internet Research Agency-connected handles that Twitter provided to Congress. This data set contains every tweet sent from each of the 2,752 handles on the November 2017 list since May 10, 2015.

The researchers collected the data using custom searches on a tool called Social Studio, owned by Salesforce and contracted for use by Clemson’s Social Media Listening Center. The database consists of 21 variables, ranging from the date each tweet was published to the account type. For brevity, our research does not use the entire array of 9 datasets and focuses on only one. We are confident that its nearly 250,000 randomly assigned tweets will give us a lot of insight into the nature of these attacks over the period they cover.

Equipped with this dataset, we invite you to explore the most exciting patterns of the trolls’ behavior, depicted in graphs and plots. In the following pages, you will come across some revealing results of our analysis. For instance, the density plot that shows when the trolls tend to send their messages teaches us something important about the 7-hour time-zone difference between Russia and the East Coast. Another interesting finding is that the Russian bot accounts have an almost identical number of followers and retweets. It may be surprising to some, but the fake texts do not exclusively use right-wing rhetoric. There is also a lot of activity that we can associate with the opposite side of the aisle. Precisely because of this rather omnivorous political framing, we believe such insight is crucial to understand, and not just because misinformation is deleterious in and of itself. Research into these fake tweets is also a great opportunity for us to look into the cracks of our system and to potentially find a cure for our catastrophically divided society.

Methods

The overarching question is whether these Twitter accounts and their messages have underlying patterns. Therefore, it is crucial to investigate whether they share any distinctive features. This transparent, statistically driven classification is the only way to get remotely close to separating real tweets from fake ones. Without it, the social media companies and the governmental decision-makers will only have rhetoric. Imagine the following situation. A new election cycle is approaching. But this time around, all of the social media activity looks random. Yet, the spread of misinformation increases the closer we get to voting day. Without an impartial heuristic (and the only such heuristic is a mathematical one), the decision to ban or keep accounts will likely rest on disagreement with whichever party or interest group is currently in power. No further explanation is needed to realize that this would be a bad precedent for our democracy. Luckily for us, there are thousands of identified fake accounts whose hundreds of thousands of tweets contain plenty of data. We will look through them using data visualization techniques and statistical methods, including:

Summarizing tweets by date and plotting them to see any abnormalities and overall trends of activity. Then, researching the days that are noticeably more prolific than others to see if there are similarities in the news around those days. To get these exact dates, we will use the plotly package. Given that there are nearly 250,000 tweets in the dataset, the combination of ggplot2 graphs and plotly’s interactive tooltips is an efficient way to access this kind of information.
Discriminating based on specific categories and characteristics of different variables. As such, we will separate tweets by their political leanings, languages, time of day, etc. Our hypothesis going in is that right- and left-wing trolls use different kinds of rhetoric and talk about categorically different issues and events. For instance, one group panders to ultra-conservatives by talking about the American flag and the greatness of the troops, while the other emphasizes racism and inequality. The distribution of activity by category is significant to study as well.
Analyzing the linguistics of fake tweets. Using the tidytext package, we will identify the most frequently used words for both sides and see what clues that might give us. The outliers, the words that the Twitter bots use disproportionately, can direct our attention to particular individuals and websites. This information is significant because researchers will get a fuller picture of the triggers that cause divisive social media actions.
To do the previous task, we will manually clean the dataset of filler words, links, and other context-free text. Primarily, we must delete all of the “https’s” and “.com’s” of this world. Since there are going to be a lot of retweets and links, we assume that the majority of top words will be of this sort. The context-free and ambiguous lexicon consists of words like “right,” “is,” “will,” “be,” etc. However, it is important to mention that the process is not finished until no such words remain. There may be a situation where our hypothesis turns out to be wrong, and the most frequently used words are context-free. Under such circumstances, we will have to demonstrate that, too.
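
The cleaning step described above can also be sketched programmatically. The following is a minimal illustration, not the procedure used for the results in this report: it assumes the dplyr, stringr, and tidytext packages are installed, relies on tidytext’s built-in `stop_words` lexicon as a stand-in for our manual list of context-free words, and uses a made-up `toy_tweets` data frame.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Made-up example tweets (hypothetical; not taken from the data set)
toy_tweets <- data.frame(
  account_category = c("RightTroll", "LeftTroll", "RightTroll"),
  content = c("The senator is on trial https://t.co/abc123",
              "Police violence will be protested https://t.co/xyz789",
              "The trial of the senator"),
  stringsAsFactors = FALSE
)

clean_words <- toy_tweets %>%
  # Remove links before tokenizing so that "https" and "t.co" never become words
  mutate(content = str_remove_all(content, "https?://\\S+")) %>%
  # One row per lower-cased word
  unnest_tokens(tweet_word, content) %>%
  # Drop context-free words such as "the", "is", "will", "be"
  anti_join(stop_words, by = c("tweet_word" = "word")) %>%
  # Word frequencies within each account category
  count(account_category, tweet_word, sort = TRUE)

print(clean_words)
```

A lexicon-based filter like this is reproducible, but it may remove or keep different words than a manual review would, so the two approaches are best treated as complementary.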

Graphs and Interpretations

The following visualizations, together with their interpretations, present the most significant results of our analysis.

The first graph displays daily posting activity from 2015 to 2018. The y-axis is the number of tweets posted per day across all accounts, and the x-axis is each day of the period. There are three spikes of abnormally active posting. Using Plotly, we have identified the exact days that correspond to these bot messaging streaks and researched any news around those days that could potentially trigger such activity:

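# Packages used throughout the analysis
library(tidyverse)  # readr, dplyr, tidyr, forcats, ggplot2
library(lubridate)  # second(), minute(), hour()
library(plotly)     # ggplotly()
library(tidytext)   # unnest_tokens(), reorder_within(), scale_x_reordered()

russian_troll_tweets = read_csv("https://raw.githubusercontent.com/fivethirtyeight/russian-troll-tweets/master/IRAhandle_tweets_1.csv")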


russian_trolls_edited = russian_troll_tweets
russian_trolls_time = russian_troll_tweets


# Keep only the calendar date of each tweet
russian_trolls_edited = russian_trolls_edited %>%
  mutate(publish_date = as.Date(publish_date, format = "%m/%d/%Y %H:%M"))

glimpse(russian_trolls_edited)

## Rows: 243,891
## Columns: 21
## $ external_author_id <dbl> 9.06e+17, 9.06e+17, 9.06e+17, 9.06e+17, 9.06e+17, …
## $ author             <chr> "10_GOP", "10_GOP", "10_GOP", "10_GOP", "10_GOP", …
## $ content            <chr> "\"We have a sitting Democrat US Senator on trial …
## $ region             <chr> "Unknown", "Unknown", "Unknown", "Unknown", "Unkno…
## $ language           <chr> "English", "English", "English", "English", "Engli…
## $ publish_date       <date> 2017-10-01, 2017-10-01, 2017-10-01, 2017-10-01, 2…
## $ harvested_date     <chr> "10/1/2017 19:59", "10/1/2017 22:43", "10/1/2017 2…
## $ following          <dbl> 1052, 1054, 1054, 1062, 1050, 1050, 1050, 1050, 10…
## $ followers          <dbl> 9636, 9637, 9637, 9642, 9645, 9644, 9644, 9644, 96…
## $ updates            <dbl> 253, 254, 255, 256, 246, 247, 248, 249, 250, 251, …
## $ post_type          <chr> NA, NA, "RETWEET", NA, "RETWEET", NA, "RETWEET", N…
## $ account_type       <chr> "Right", "Right", "Right", "Right", "Right", "Righ…
## $ retweet            <dbl> 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ account_category   <chr> "RightTroll", "RightTroll", "RightTroll", "RightTr…
## $ new_june_2018      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ alt_external_id    <dbl> 9.058747e+17, 9.058747e+17, 9.058747e+17, 9.058747…
## $ tweet_id           <dbl> 9.145804e+17, 9.146218e+17, 9.146235e+17, 9.146391…
## $ article_url        <chr> "http://twitter.com/905874659358453760/statuses/91…
## $ tco1_step1         <chr> "https://twitter.com/10_gop/status/914580356430536…
## $ tco2_step1         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ tco3_step1         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

final_project_theme = theme_minimal() +
  theme(text = element_text(family = "serif", size = 12))


dates_and_counts <- setNames(data.frame(table(russian_trolls_edited$publish_date)), c("date_published", "daily_tweets"))


# table() turns the dates into a factor, so convert them back to Date
dates_and_counts = dates_and_counts %>%
  mutate(date_published = as.Date(date_published))

glimpse(dates_and_counts)

## Rows: 1,089
## Columns: 2
## $ date_published <date> 2014-11-27, 2014-11-28, 2014-12-01, 2014-12-04, 2014-…
## $ daily_tweets   <int> 3, 69, 10, 8, 5, 8, 5, 21, 4, 8, 2, 1, 1, 1, 21, 36, 2…

p <- ggplot(dates_and_counts, aes(x = date_published, y = daily_tweets)) +
  geom_line(color = "#56B4E9", size = 0.5) +
  labs(
    x = "Date of Publication",
    y = "Number of Tweets",
    title = "Number of Tweets by Russian Trolls",
    subtitle = "2015-2018"
  ) +
  final_project_theme

ggplotly(p)
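
The spike dates that we read off the interactive plot can also be extracted directly from the daily counts. The sketch below is only an illustration: `toy_counts` is a made-up stand-in with the same two columns as `dates_and_counts`, and its values are not taken from the data set.

```r
library(dplyr)

# Made-up daily counts (hypothetical values) shaped like dates_and_counts
toy_counts <- data.frame(
  date_published = as.Date("2016-10-01") + 0:6,
  daily_tweets   = c(120, 135, 110, 140, 150, 980, 160)
)

# Keep the three busiest days, most active first
top_days <- toy_counts %>%
  slice_max(daily_tweets, n = 3)

print(top_days)
```

Applied to the real `dates_and_counts`, the same call would list candidate days to check against the news, which is what we did by hovering over the peaks in the plot.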

Around July 22, 2015, there were, among others, the following events that we consider potential causes. The first is the then-recent Republican primary polls that unexpectedly showed a strong lead for Trump for the first time. A week earlier, there was a terrorist attack in Chattanooga, Tennessee, carried out by a Muslim man. We suppose this could also have triggered many heated discussions online, thereby providing the Saint Petersburg-based so-called troll factory with a ton of material to work with. Lastly, although probably least likely, John Kasich formally announced his presidential run. Considering the rather middle-of-the-road nature of his politics, however, we do not think that could have been the main catalyst.

A year later, the general election of 2016 was approaching. That in and of itself increased the overall daily activity. However, during this period, there were also comparatively productive days for the Russian bots. In particular, Twitter faced a disproportionate level of bot activity on the 6th of October. Based on our news research, we think two events could be the reasons for the buzz. The first is the VP debate between Mike Pence and Tim Kaine. It is easy to imagine a lot of cheering for the Indiana governor in some of those tweets and for his opponent in others. The second is that the Oklahoma high court ruled the state’s new abortion requirements unconstitutional. Regardless of party alignment, it could have provided a lot of ground for insults and polarization on social media.

The last highly active day in our dataset of nearly 250,000 tweets is August 12th, 2017. Donald Trump’s presidency was by then almost seven months old. There were a lot of discussions online about Trump’s weakening support. He also attacked Senator Richard Blumenthal on Twitter. Taking these two major news items into consideration, we believe that day was the maximum of the plot because the sitting president was tweeting a lot. That could have urged his base to smear his opponents, including the senator. Apart from that, Google fired the engineer who wrote a controversial gender memo. The public found the company’s HR decision to be very contentious. For that reason, we see how people across the political aisle could have gotten into plenty of social media fights over the biological differences between men and women.

Account Types and Languages

Next, here are two bar charts showing the number of messages by account type and by language. It is worth mentioning that the Russian trolls wrote their tweets in over ten languages, including Japanese and Farsi. Here, however, are the three most often used ones. English and Russian are self-evident. But what about Italian? Having found no Italians in the vicinity, we analyzed the news connecting Italy, Twitter, and the Russian bots. In 2018, there was a big election in Italy. A far-right politician and his party were gaining significant traction in the polls. The election there, involving their version of Trump, can explain the abundance of tweets in Italian around that time. As far as the account types are concerned, we found that there are several categories. Nevertheless, in this study, we are going to concentrate only on the left and the right trolls:

account_type_plot <- ggplot(data = russian_trolls_edited) +
  geom_bar(mapping = aes(x = account_category, fill = account_category)) +
  final_project_theme +
  labs(
    title = "What Are The Types Of Twitter Accounts?",
    x = "Account Type",
    y = "Number of Tweets"
  ) +
  theme(text = element_text(family = "serif"), legend.position = "none")

account_type_plot

# Keep only the three most common languages
top_languages = russian_trolls_edited %>%
  filter(language %in% c("Russian", "English", "Italian"))

language_plot <- ggplot(data = top_languages) +
  geom_bar(mapping = aes(x = language, fill = language)) +
  final_project_theme +
  labs(
    title = "What Are The Top Languages?",
    x = "Language",
    y = "Number of Tweets"
  ) +
  scale_y_continuous(breaks = c(0, 30000, 60000, 90000, 120000, 150000, 180000, 210000)) +
  theme(text = element_text(family = "serif"), legend.position = "none")

language_plot
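
Rather than typing the three language names by hand, the most common languages can be selected from the counts themselves. This is a small sketch under stated assumptions: `toy_tweets` is a made-up data frame with a `language` column, and the names `top_three` and `top_languages_toy` are hypothetical.

```r
library(dplyr)

# Made-up language labels (hypothetical counts, for illustration only)
toy_tweets <- data.frame(
  language = c(rep("English", 6), rep("Russian", 4), rep("Italian", 3),
               "Japanese", "Farsi"),
  stringsAsFactors = FALSE
)

# Count tweets per language and keep the names of the three largest groups
top_three <- toy_tweets %>%
  count(language, sort = TRUE) %>%
  slice_head(n = 3) %>%
  pull(language)

# Restrict the data to those languages before plotting
top_languages_toy <- toy_tweets %>%
  filter(language %in% top_three)

print(top_three)
```

Selecting the categories this way keeps the plot correct if a different file from the collection, with a different language mix, is analyzed.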

Activity Patterns Driven by Time Difference

One significant reality that we must consider when talking about intercontinental relations is the time difference. From early on, we were curious whether the different time zones affect when the Saint Petersburg trolls tweet and how their schedule differs from the average Twitter activity in the United States.

As the density plot shows, the Russians post quite a few tweets in the early morning hours, when America is usually at its quietest. For instance, when it is one in the morning on the East Coast, the workday is starting in Saint Petersburg. After that, we see consistent, although fairly subdued, activity there and very few tweets here in the US. But then, when the New World wakes up, the bots switch to posting from left-leaning accounts. The peak writing time for Americans is approximately between 10 am and 12 pm, whereas for the troll factory workers it is between 2 pm and 6 pm. This is interesting because it is late at night in that part of Russia. We believe the reason is scheduled tweeting: the posts are written during the day and published while the authors are asleep.

russian_trolls_time = russian_troll_tweets

russian_trolls_time = russian_trolls_time %>%
  mutate(
    # Parse the timestamp; "EST" is a fixed UTC-5 offset without daylight saving
    publish_date = as.POSIXct(strptime(publish_date, "%m/%d/%Y %H:%M", tz = "EST")),
    # Keep only the time of day so that every tweet falls on a common 24-hour axis
    publish_time = as.POSIXct(hms::hms(second(publish_date), minute(publish_date), hour(publish_date)))
  ) %>%
  filter(account_category %in% c("RightTroll", "LeftTroll"))

ggplot(russian_trolls_time) +
  # after_stat(scaled) rescales each density so that its peak equals 1
  geom_density(aes(x = publish_time, y = after_stat(scaled), fill = account_category), alpha = 0.75) +
  scale_x_datetime(date_breaks = "2 hours", date_labels = "%H:%M") +
  labs(
    x = "Time of Publication",
    y = "Relative Frequency",
    title = "When Do Fake Accounts Post?",
    subtitle = "2015-2018",
    fill = "Account Category"
  ) +
  final_project_theme

We expected to see a clearer discrepancy due to the time difference. Still, it is quite obvious that the Russian bots do post at different times, and that can be useful in building preventive measures against them.
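
To make the time difference concrete, the same instant can be expressed in both local times. The sketch below uses only base R; the helper name `to_spb` is hypothetical, and it assumes that the timestamps are recorded in US Eastern time, which we have not verified against the data documentation. Saint Petersburg follows Moscow time, so the `Europe/Moscow` zone is used.

```r
# Hypothetical helper: convert a timestamp in the data set's format from
# US Eastern time (an assumption) to Saint Petersburg local time.
to_spb <- function(timestamp, tz_from = "America/New_York") {
  instant <- as.POSIXct(timestamp, format = "%m/%d/%Y %H:%M", tz = tz_from)
  format(instant, "%H:%M", tz = "Europe/Moscow")
}

# 1 am on the East Coast during daylight saving time (7-hour gap)
autumn <- to_spb("10/6/2016 1:00")
# 1 am on the East Coast during standard time (8-hour gap)
winter <- to_spb("1/15/2016 1:00")

print(c(autumn = autumn, winter = winter))
```

The gap is seven hours while the East Coast observes daylight saving time and eight hours otherwise, because Russia has no seasonal clock change. A fixed-offset zone such as `"EST"` ignores this shift, which may blur the density plot by up to an hour for part of the year.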

Linguistic Analysis

The final data presentation turned out to be the most time-consuming and, therefore, the most rewarding to work on. It took a considerable amount of time to clean the dataset of all the words we described in the methods section. What we have found is a highly noticeable rhetorical discrepancy between the two sides. It could simply be explained by the fact that there is more low-hanging fruit on the right, including the words Donald Trump uses on his Twitter account. However, we think the explanation is more sophisticated. The reality is that it is quite difficult for a typical Russian to understand the nature of the problems that the American left is protesting against, like “inequality” or a “low minimum wage.” They lie on a different political plane that is hard to grasp when a person lives in an authoritarian regime with a sluggish economy, like the Russian Federation. As a result, the few themes that do get covered mostly concern the Black Lives Matter movement, and, as you can see, many context-free words remain on the left even after rigorous data cleaning. On the right, however, the top 25 words can almost summarize what Donald Trump was writing about on Twitter during those years:

# Keep only the tweet text, its language, and the account category
truncated_trolls = russian_trolls_edited %>%
  select(content, language, account_category)

tweet_words <- truncated_trolls %>%
  filter(language == "English")

tweet_words_english = tweet_words %>%
  select(-language)

head(tweet_words_english)

## # A tibble: 6 x 2
##   content                                                       account_category
##   <chr>                                                         <chr>           
## 1 "\"We have a sitting Democrat US Senator on trial for corrup… RightTroll      
## 2 "Marshawn Lynch arrives to game in anti-Trump shirt. Judging… RightTroll      
## 3 "Daughter of fallen Navy Sailor delivers powerful monologue … RightTroll      
## 4 "JUST IN: President Trump dedicates Presidents Cup golf tour… RightTroll      
## 5 "19,000 RESPECTING our National Anthem! #StandForOurAnthem🇺🇸… RightTroll      
## 6 "Dan Bongino: \"Nobody trolls liberals better than Donald Tr… RightTroll

# Split every tweet into one row per lower-cased word
tweet_words_english <- tweet_words_english %>%
  drop_na(content) %>%
  unnest_tokens(tweet_word, content)

# Count the words used by right and left trolls, most frequent first
top_words_trolls <- tweet_words_english %>%
  filter(account_category %in% c("RightTroll", "LeftTroll")) %>%
  count(account_category, tweet_word, sort = TRUE) %>%
  mutate(tweet_word = fct_inorder(tweet_word))
  
  
options(tibble.print_max = 200, tibble.print_min = 200)

# Manual removal of links, fillers, and other context-free words.
# Rows are dropped by position after inspecting the printed table, so the
# indices depend on the sort order above and on every preceding deletion.
top_words_trolls <- top_words_trolls[-c(1:9), ]
top_words_trolls <- top_words_trolls[-c(5:35), ]
top_words_trolls <- top_words_trolls[-c(13:19), ]
top_words_trolls <- top_words_trolls[-c(8:11), ]
top_words_trolls <- top_words_trolls[-c(10:13), ]
top_words_trolls <- top_words_trolls[-c(22:29), ]
top_words_trolls <- top_words_trolls[-c(29:36), ]
top_words_trolls <- top_words_trolls[-c(31:34), ]
top_words_trolls <- top_words_trolls[-c(33:38), ]
top_words_trolls <- top_words_trolls[-c(11:14), ]
top_words_trolls <- top_words_trolls[-c(30:33), ]
top_words_trolls <- top_words_trolls[-c(43:45), ]
top_words_trolls <- top_words_trolls[-c(1,6), ]
top_words_trolls <- top_words_trolls[-c(2,11,12,14,18,19,20,21), ]
top_words_trolls <- top_words_trolls[-c(42:50), ]
top_words_trolls <- top_words_trolls[-c(27,30,31,34,35,44,47,48), ]
top_words_trolls <- top_words_trolls[-c(21,22,35,38,42,44,46,50), ]
top_words_trolls <- top_words_trolls[-c(43,46,57,58,59,60,61,62,64,65,66,68,69,70,71,74,75,76,77,78,79,80,82,83,84,85,86,91,95,96,100), ]
top_words_trolls <- top_words_trolls[-c(24,45,50,51,52,53,74,75,76,85,87,96), ]
top_words_trolls <- top_words_trolls[-c(22), ]
top_words_trolls <- top_words_trolls[-c(41,61,73,89,108,109,118,129,134,136,152,163,170,191), ]
top_words_trolls <- top_words_trolls[-c(190,194), ]
 
# Keep the 25 most frequent remaining words in each account category
top_words_trolls <- top_words_trolls %>%
  group_by(account_category) %>%
  top_n(25, n) %>%
  ungroup() %>%
  mutate(account_category = as.factor(account_category),
         tweet_word = reorder_within(tweet_word, n, account_category))

ggplot(data = top_words_trolls, aes(tweet_word, n, fill = account_category)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~account_category, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  scale_y_continuous(expand = c(0,0)) +
  labs(y = "Count", x = NULL,
       title = "25 most frequent words by Account Type, Left and Right") +
  final_project_theme

The last thing to mention is the word “rt” on both sides. We specifically decided not to delete it because we read it as a reference to the website of RT (Russia Today), the major government-sponsored TV network, although the same token is also the common abbreviation for “retweet,” and our word counts cannot tell the two apart. According to our news analysis, RT’s coverage of politics in the West is polarizing. Apart from that, given that its ratings outside of Russia are not impressive, this finding can be quite helpful in identifying further attacks.
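
One way to probe where the “rt” token comes from is to count how often tweets begin with a retweet-style “RT” marker and how often they mention the rt.com domain. The following base R sketch runs on made-up strings; `toy_content` and both patterns are assumptions for illustration, not results from our data.

```r
# Made-up tweet texts (hypothetical; for illustration only)
toy_content <- c(
  "RT @someone: big news tonight",
  "Watch the full report on rt.com",
  "rt this if you agree",
  "No marker in this one"
)

# Tweets that open with the conventional retweet abbreviation
starts_with_rt <- grepl("^rt\\b", toy_content, ignore.case = TRUE, perl = TRUE)

# Tweets that mention the RT website explicitly
mentions_rt_com <- grepl("\\brt\\.com\\b", toy_content, ignore.case = TRUE, perl = TRUE)

print(table(retweet_marker = starts_with_rt, rt_domain = mentions_rt_com))
```

Running the same two checks on the `content` column would indicate how much of the “rt” count can plausibly be attributed to the network rather than to retweeting.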

Conclusion

In conclusion, here is what we have found so far:

First, there is a consistent difference between the times at which the Russian trolls post and those of most American profiles.

Second, the bots’ activity increases in times of contentious events. Predominantly, it happens whenever there is room for political argument and polarization.

Finally, the most prominent news source among these accounts appears to be RT, a Russian government-sponsored channel.

Study Limitations and Further Research Areas

Having scratched only the surface, we think there are a few promising areas into which we should dig to build a more comprehensive understanding of these attacks. For instance, it would be worthwhile to see which states the bots target most and whether there is any such regional targeting at all. More cooperation from Twitter’s executives would be helpful.

Another area is the topics, and not just the words, that these fake messages contain. Some researchers are already using machine learning algorithms to group these posts by theme. Lastly, this study is concerned only with the bots’ tweets in and of themselves. However, we find it even more important to compare them to real people’s tweets and see the differences, if there are any. In the long run, the goal is not only to protect ourselves but also to understand why we give these small groups of people around the world the power to pit us against each other in the first place. Solving that problem, even marginally, would be the most effective defense a society can create.

Investigating Statistical Patterns in Russian Troll Tweets

STAT41 Final Project

EGOR CHERNIUK