Final IRA Report

Throughout the document, if you would like to see the specific piece of code that created an output, press the code button in the right-hand corner. Alternatively, press the code button in the top right-hand corner and select ‘show all code’.

1 - Data import, preprocessing and initial exploratory analysis

1.2 Data Import

The data is a dataframe of 200,000 tweets leaked by Twitter and subsequently obtained by NBC.

## # A tibble: 6 x 16
##   user_id user_key created_at created_str         retweet_count retweeted
##     <dbl> <chr>         <dbl> <dttm>                      <dbl> <lgl>    
## 1  2.53e9 kathiem…    1.49e12 2017-02-27 14:54:00            NA NA       
## 2  2.53e9 traceyh…    1.47e12 2016-08-15 14:50:20            NA NA       
## 3 NA      evewebs…    1.44e12 2015-06-30 21:56:09            NA NA       
## 4  4.84e9 blackto…    1.47e12 2016-09-16 08:04:48            18 FALSE    
## 5  1.69e9 jacquel…    1.47e12 2016-09-18 19:46:25             0 FALSE    
## 6  2.59e9 judelam…    1.46e12 2016-04-07 11:37:45            NA NA       
## # … with 10 more variables: favorite_count <dbl>, text <chr>,
## #   tweet_id <dbl>, source <chr>, hashtags <chr>, expanded_urls <chr>,
## #   posted <chr>, mentions <chr>, retweeted_status_id <dbl>,
## #   in_reply_to_status_id <dbl>
## # A tibble: 6 x 14
##       id location name  followers_count statuses_count time_zone verified
##    <dbl> <chr>    <chr>           <dbl>          <dbl> <chr>     <lgl>   
## 1 1.00e8 still ⬆… #Eze…            1053          31858 <NA>      FALSE   
## 2 2.47e8 Chicago… B E …             650           6742 Mountain… FALSE   
## 3 2.50e8 <NA>     Chri…              44            843 <NA>      FALSE   
## 4 4.50e8 <NA>     Рамз…           94773          10877 Moscow    FALSE   
## 5 4.72e8 Санкт-П… Марг…           23305          18401 Volgograd FALSE   
## 6 1.04e9 Amerika  Dark…              22          22603 Jakarta   FALSE   
## # … with 7 more variables: lang <chr>, screen_name <chr>,
## #   description <chr>, created_at <chr>, favourites_count <dbl>,
## #   friends_count <dbl>, listed_count <dbl>

1.4 Exploratory Analysis

I want to get a feel for the data here and initially see if any interesting patterns emerge.

##  [1] "user_id"               "user_key"             
##  [3] "created_at"            "created_str"          
##  [5] "retweet_count"         "retweeted"            
##  [7] "favorite_count"        "text"                 
##  [9] "tweet_id"              "source"               
## [11] "hashtags"              "expanded_urls"        
## [13] "posted"                "mentions"             
## [15] "retweeted_status_id"   "in_reply_to_status_id"
## [17] "party"
##  [1] "id"               "location"         "name"            
##  [4] "followers_count"  "statuses_count"   "time_zone"       
##  [7] "verified"         "lang"             "screen_name"     
## [10] "description"      "created_at"       "favourites_count"
## [13] "friends_count"    "listed_count"

Exploratory Data Analysis

## Warning: Removed 21 rows containing non-finite values (stat_bin).

Interestingly, the days where the IRA were most active all fell in a short period around the autumn of 2016.

Let’s see what words were being used on the most active day:

## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.

Hmm, not that much of interest. I might come back to this later and look at what was talked about on this day relative to other days (using either log odds or tf-idf). A Google search only reveals that a hurricane was coming, but there isn’t a clear explanation for the spike in activity on this day (beyond the fact that it’s election time).

Let’s look at the (log) distribution of the number of tweets per day.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

On some days (nearly 40 of them) as few as 1 tweet was published by the IRA; on other days the count was as high as 4000.
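As a rough sketch of how the per-day counts behind this distribution can be obtained: the snippet below uses base R only and a handful of made-up timestamps in the same format as `created_str` (the values are hypothetical, not taken from the dataset).

```r
# Hypothetical timestamps, formatted like the created_str column
created_str <- c("2016-01-05 14:54:00", "2016-01-05 18:20:11",
                 "2016-01-05 23:01:45", "2016-01-06 08:04:48",
                 "2015-06-30 21:56:09")

# Keep only the date part (first 10 characters) and count tweets per day
tweet_day <- as.Date(substr(created_str, 1, 10))
tweets_per_day <- table(tweet_day)

# Daily counts span several orders of magnitude, so work on a log scale
log_counts <- log10(as.numeric(tweets_per_day))
print(tweets_per_day)
```

A histogram of `log_counts` then gives the (log) distribution of tweets per day.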

Let’s take a look at which hashtags were commonly used.

How were retweet and favourite counts distributed across the dataset?

## Warning: Removed 145399 rows containing non-finite values (stat_smooth).
## Warning: Removed 145399 rows containing missing values (geom_point).

Where were these accounts tweeting from?

I find it interesting that some accounts are openly tweeting from Russia.

Which accounts were the most prolific?

## # A tibble: 6 x 2
##   user_key           n
##   <chr>          <int>
## 1 ameliebaldwin   9269
## 2 hyddrox         6813
## 3 giselleevns     6652
## 4 patriotblake    4140
## 5 thefoundingson  3663
## 6 melvinsroberts  3346

We see ‘ameliebaldwin’ was the most prolific account, accounting for nearly 5% of all tweets (9,269 of roughly 200,000). What was she tweeting about?

## # A tibble: 10 x 1
##    text                                                                    
##    <chr>                                                                   
##  1 "It's either working class \"kills\" her, or she kills working class\nS…
##  2 We need to bring our country back from the edge of extinction!!! @realD…
##  3 "You'd better read it before watching #debates \nhttps://t.co/wPo3W6sVU…
##  4 First time ever voting in 38 Years! #ElectionDay #TrumpWinsBecause #Tru…
##  5 CALLING ON ALL #TRUMP_VOTERS  URGENT‼️ #TrumpPence16  #TrumpForPresiden… 
##  6 .@realDonaldTrump why would anyone vote for someone who will raise taxe…
##  7 I don't vote dems, but...poor Bernie https://t.co/u8O7V9kxKv            
##  8 #ImNotWithHer #NeverHilary #TrumpPence16 #MakeAmericaGreatAgain https:/…
##  9 RT @American_Woman4: #MAGA,#FEMININEAMERICA4TRUMP,#LGBT4Trump,#Fl4Trump…
## 10 "RT @Conservatexian: News post: \"TWITTER Buries 32 of Donald Trump\u00…

Interestingly, there is very little engagement (retweets, favourites) with the tweets. This is particularly relevant given that all these tweets are coming from a network of accounts that could certainly promote them more if it wanted to. Evidence of astroturfing?

Let’s see what she’s tweeting about more broadly:

## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 28,680 x 2
##    word                 n
##    <chr>            <int>
##  1 trump             1979
##  2 clinton           1094
##  3 hillary           1055
##  4 @realdonaldtrump   628
##  5 obama              604
##  6 #maga              446
##  7 @hillaryclinton    320
##  8 people             317
##  9 donald             290
## 10 #trump             276
## # … with 28,670 more rows

Key findings from EDA:

  • There were accounts openly tweeting from Russia
  • Autumn of 2016 saw by far the heaviest activity
  • Some days saw as few as 1 tweet being published, others as high as 4000
  • Hashtags were mainly political, with some non-political ones.

2 - Textual and thematic analysis

2.1 Whole corpus thematic analysis

## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 253,805 x 2
##    word                 n
##    <chr>            <int>
##  1 trump            24138
##  2 clinton          10588
##  3 hillary           9710
##  4 obama             7336
##  5 people            6345
##  6 dont              5553
##  7 donald            4875
##  8 @realdonaldtrump  4007
##  9 im                3996
## 10 #tcot             3740
## # … with 253,795 more rows

This looks broadly like what we might expect.

Now I’m going to look at bigram relations to see which words frequently come together, and then graph the relationships (the ‘bigrams’).

## Warning in graph_from_data_frame(.): In `d' `NA' elements were replaced
## with string "NA"

This graph shows the words that most frequently appear next to each other, hence we see a lot of names. However, it might be more useful to look at which words appear in the same tweet, but not necessarily next to each other, if we want to get a better understanding of the themes emerging.

I don’t know what this ‘merkelmussbleiben’ means… let’s take a look.
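For reference, counting bigrams boils down to pairing each word with the one that follows it. A minimal base R sketch, using three invented tweets and a hypothetical helper `bigrams_in()` (the report’s own code is not shown here, so this is only an illustration of the idea):

```r
# Invented example tweets
tweets <- c("make america great again",
            "hillary clinton emails",
            "crooked hillary clinton")

# Hypothetical helper: return every pair of adjacent words in one tweet
bigrams_in <- function(text) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# Count each bigram across all tweets, most frequent first
bigram_counts <- sort(table(unlist(lapply(tweets, bigrams_in))),
                      decreasing = TRUE)
print(bigram_counts)
```

Each row of such a count table becomes one weighted edge in the bigram graph.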

## # A tibble: 1,109 x 1
##    text                                                                    
##    <chr>                                                                   
##  1 @johannesvogel würde Frau Merkel 4. Amtszeit schaffen? #Merkelmussbleib…
##  2 Vorwärts immer, rückwärts nimmer! #Merkelmussbleiben                    
##  3 #Merkel hält an Flüchtlings-Deal mit der #Türkei fest #Merkelmussbleiben
##  4 Sie ist nicht gleichgültig! #Merkelmussbleiben #girlstalkselfies        
##  5 @BjoernMaatz sind die Chancen bei Bundeskanzlerin Merkel für 4. Amtszei…
##  6 Ich glaub sie ist alternativlos! #Merkelmussbleiben                     
##  7 Mehr Platz für Familie #Merkelmussbleiben                               
##  8 Merkel ist nicht radikal, dennoch macht sie keine Rückschritte! #Merkel…
##  9 Es kommt mir vor, Frau Merkel hat alle Chancen für noch eine Amtszeit! …
## 10 #Merkel rettet Syrische leben #Merkelmussbleiben                        
## # … with 1,099 more rows

OK, so firstly we can see this hashtag was used over a thousand times, and secondly we see that the tweets cover not just American politics but also German politics.

## 
##  de  en  es  fr  id  ru 
##  18 272   1   1   1  90

I’m checking the other dataset here (the one with the account details). The language variable is only populated for 383 of the 453 accounts (the counts above sum to 383), but it seems close enough.

Let’s finally consider using the correlation of words to draw out themes. In this instance we are looking at how often two words appear together in the same tweet relative to how often they appear separately. For example, a correlation of 0.99 between ‘opiceisis’ and ‘iceisis’ suggests that these words are almost always found together and almost never apart.
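To make the idea concrete, here is a small base R sketch. It assumes the correlation in question is the phi coefficient, which for presence/absence indicators is the same as the Pearson correlation; the five tweets are invented for illustration.

```r
# Invented tweets, already split into words
tweets <- list(
  c("opiceisis", "iceisis", "anonymous"),
  c("opiceisis", "iceisis"),
  c("trump", "clinton"),
  c("trump", "tax", "returns"),
  c("clinton", "tax", "returns")
)

# Binary tweet-by-word matrix: 1 if the word appears in the tweet, else 0
vocab <- sort(unique(unlist(tweets)))
presence <- t(sapply(tweets, function(words) as.integer(vocab %in% words)))
colnames(presence) <- vocab

# Pearson correlation of 0/1 columns is the phi coefficient
word_cor <- cor(presence)
print(round(word_cor["opiceisis", "iceisis"], 2))
print(round(word_cor["trump", "clinton"], 2))
```

Words that only ever occur together get a correlation of 1, while words that each occur often but rarely in the same tweet end up close to 0 or negative.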

This isn’t hugely helpful at the moment: it’s revealing names (which we know are likely to appear together anyway), although ‘tax returns’, ‘stop islam’, ‘pray for brussels’, and ‘iceisis’ and ‘opiceisis’ suggest they were purposely targeting divisive topics.

2.2 By-party thematic analysis

Given everything here it’s really hard to conclude anything other than the IRA were overtly pro Trump.

I’m going to start by comparing word frequency between the different candidates.

OK, so relatively speaking there are lots of mentions of Sanders, although the actual counts for Sanders are not very high (change the free_y scale setting to see the non-relative counts).

Looks like Clinton was more ‘popular’ on social media. Are these statistically significant differences?

##                Df     Sum Sq Mean Sq F value           Pr(>F)    
## party           3    3647215 1215738    20.4 0.00000000000034 ***
## Residuals   44857 2675917282   59654                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 20959 observations deleted due to missingness

Yes, they are.
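For readers who want to reproduce this kind of test, a one-way ANOVA can be run with base R’s `aov()`. The sketch below uses simulated data: the column names `party` and `retweet_count`, the group labels and the Poisson means are all assumptions made for illustration, not the values behind the table above.

```r
set.seed(42)

# Simulated engagement data: four groups of 50 tweets each (hypothetical)
engagement <- data.frame(
  party = factor(rep(c("Clinton", "Trump", "Sanders", "Other"), each = 50)),
  retweet_count = c(rpois(50, 20), rpois(50, 35), rpois(50, 10), rpois(50, 12))
)

# One-way ANOVA: does mean engagement differ between groups?
fit <- aov(retweet_count ~ party, data = engagement)
anova_table <- summary(fit)[[1]]
p_value <- anova_table[["Pr(>F)"]][1]
print(anova_table)
```

A small p-value only says that at least one group mean differs from the others; `TukeyHSD(fit)` is one way to see which pairs differ. Engagement counts are usually heavily skewed, so the result is worth treating with some caution.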

2.3 Using log odds ratios

Given that most of our party dataset looks at Clinton and Trump, let’s compare the log odds ratios for these two and see if it reveals anything about the content.

## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
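A manual calculation along these lines can be sketched in base R. This assumes the common tidy-text recipe of add-one smoothing followed by the log of the ratio of usage shares (strictly a log ratio of smoothed proportions rather than true odds); the word lists are invented.

```r
# Invented words from tweets mentioning each candidate
clinton_words <- c("emails", "crooked", "emails", "tax", "vote")
trump_words   <- c("maga", "tax", "vote", "maga", "wall", "vote")

# Count every word in a shared vocabulary (0 where a word is unused)
vocab <- sort(union(clinton_words, trump_words))
count_words <- function(words) as.numeric(table(factor(words, levels = vocab)))
n_clinton <- count_words(clinton_words)
n_trump   <- count_words(trump_words)

# Add-one smoothing avoids dividing by zero for words used by only one side
smoothed_share <- function(n) (n + 1) / sum(n + 1)

# Positive values lean Trump, negative values lean Clinton
log_odds <- log(smoothed_share(n_trump) / smoothed_share(n_clinton))
names(log_odds) <- vocab
print(sort(log_odds))
```

Words near zero are used about equally in both sets of tweets, so the interesting terms sit at either end of the sorted vector.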

I’m going to do another log odds comparison, but this time I’ll do it between all the parties:

So this is a similar graph but for all of the candidates. I’m leaving the previous ones up because they go into more depth on the Clinton-Trump differences, and also because I manually calculated the log odds ratio there (keeping it as a reference; this time I used the bind_log_odds function from the tidylo package).

I think I want to do some log odds ratio analysis for which topics / words were being used over time. I’ll do that later…

3. Topic Modelling

Topic modelling is a form of unsupervised machine learning that aims to draw out ‘clusters’ of topics (the number of which is specified by the user). As it stands my analysis hasn’t been particularly successful, although I think results can be improved with a bit more data preprocessing and tweaking of the algorithm.

First off, I’m going to calculate the ‘term frequency - inverse document frequency’ (tf-idf), which aims to measure how important a word is to a particular document relative to the rest of the collection.

## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## Selecting by tf_idf
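As a worked illustration of tf-idf, the base R sketch below treats each party as one ‘document’ (an assumption about how the documents were defined here) and uses the usual definitions: tf is a word’s share of its document, and idf is the natural log of the number of documents divided by the number of documents containing the word. The counts are invented.

```r
# Invented word counts, one row per (party, word) pair
counts <- data.frame(
  party = c("Clinton", "Clinton", "Trump", "Trump", "Sanders", "Sanders"),
  word  = c("vote", "emails", "vote", "maga", "vote", "bernie"),
  n     = c(4, 6, 5, 5, 2, 8)
)

# Term frequency: share of each document taken up by the word
counts$tf <- counts$n / ave(counts$n, counts$party, FUN = sum)

# Inverse document frequency: rarer across documents means a higher weight
n_docs <- length(unique(counts$party))
docs_with_word <- ave(rep(1, nrow(counts)), counts$word, FUN = sum)
counts$idf <- log(n_docs / docs_with_word)

counts$tf_idf <- counts$tf * counts$idf
print(counts[order(-counts$tf_idf), ])
```

A word such as ‘vote’ that appears in every document gets an idf of zero, so it drops out however frequent it is.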

Next, the data needs to be transformed (cast) from the current ‘tidy’ data frame into a format suitable for machine learning.
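To show what this cast does, here is a small base R sketch that turns tidy one-row-per-word counts into a document-term matrix with `xtabs()`. In the tidytext workflow the equivalent step would normally be done with `cast_dtm()` or `cast_sparse()`; the column names and counts below are invented.

```r
# Tidy format: one row per (document, word) with its count
tidy_counts <- data.frame(
  document = c("tweet_1", "tweet_1", "tweet_2", "tweet_3", "tweet_3"),
  word     = c("trump", "maga", "clinton", "trump", "clinton"),
  n        = c(2, 1, 1, 1, 3)
)

# Wide format: one row per document, one column per word, zeros filled in
dtm <- xtabs(n ~ document + word, data = tidy_counts)
print(dtm)
```

The modelling functions expect this wide layout, which is why the cast is needed before fitting a topic model.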

3.1 Structural Topic Modelling

So far my results haven’t been particularly conclusive. I also want to find a better way to visualise these results.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The gamma values here are the per-document-per-topic probabilities, that is, the estimated proportion of each document that is generated from each topic. I think the results here are not very impressive because the data is already too structured (I have manually filtered it by party to begin with). Given we already have a clear idea of what is being discussed in relation to the different candidates, it might be a better idea to use this on the considerably more unknown data (e.g. everything not directly mentioning a candidate).

OK, so here I’m filtering for tweets that don’t explicitly mention a politician (about two thirds of the dataset). Then I’m going to try to complete some machine learning analysis.

This is still STM

## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3.2 Latent Dirichlet Allocation

OK, alternatively I’m going to try using LDA. I’ve written a function here so one can input a cleaned corpus and the number of desired clusters (k =), and it will perform LDA.

library(tm)          # Corpus(), VectorSource(), DocumentTermMatrix()
library(topicmodels) # LDA()
library(tidytext)    # tidy() method for LDA models
library(dplyr)       # %>%, group_by(), top_n(), arrange()
library(ggplot2)     # plotting

top_terms_by_topic_LDA <- function(input_text, # should be a column from a data frame
                                   plot = TRUE, # return a plot? TRUE by default
                                   number_of_topics = 4) # number of topics (4 by default)
{
    # create a corpus (type of object expected by tm) and document term matrix
    corpus <- Corpus(VectorSource(input_text)) # make a corpus object
    dtm <- DocumentTermMatrix(corpus) # get the count of words/document

    # remove any empty rows in our document term matrix (if there are any
    # we'll get an error when we try to run our LDA)
    unique_indexes <- unique(dtm$i) # row indexes of documents with at least one term
    dtm <- dtm[unique_indexes, ] # get a subset of only those indexes

    # perform LDA & get the words/topic in a tidy text format
    lda <- LDA(dtm, k = number_of_topics, control = list(seed = 1234))
    topics <- tidy(lda, matrix = "beta")

    # get the top ten terms for each topic
    top_terms <- topics %>% # take the topics data frame and..
      group_by(topic) %>% # treat each topic as a different group
      top_n(10, beta) %>% # keep the 10 terms with the highest beta (ties may add more)
      ungroup() %>% # ungroup
      arrange(topic, -beta) # arrange terms by descending beta within each topic

    # if the user asks for a plot (TRUE by default)
    if (isTRUE(plot)) {
        # plot the top ten terms for each topic
        top_terms %>% # take the top terms
          mutate(term = reorder(term, beta)) %>% # sort terms by beta value
          ggplot(aes(term, beta, fill = factor(topic))) + # plot beta by topic
          geom_col(show.legend = FALSE) + # as a bar plot
          facet_wrap(~ topic, scales = "free") + # with each topic in a separate plot
          labs(x = NULL, y = "Beta") + # no x label, change y label
          coord_flip() # turn bars sideways
    } else {
        # if the user does not request a plot
        # return a data frame of sorted terms instead
        return(top_terms)
    }
}

I’m going to try a less ‘tidy’ approach here and use quanteda and tm functions. The words get filtered for stop words and then ‘stemmed’.

4. Sentiment Analysis

4.1 Data preprocessing

I am still in the process of coding this. Sentiment analysis applies a ranking or score to each word, and the scores can then be plotted by candidate or over time. To make the analysis more accurate it is necessary to factor in words preceded by negations. I have nearly finished this but the final stage is proving a bit tricky.

## Warning: Column `word_2`/`word` joining character vector and factor,
## coercing into character vector
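One way the negation step could work is to flip the sign of a word’s score whenever the previous word is a negation. The base R sketch below uses a tiny made-up lexicon in the style of AFINN scores and a hypothetical `score_tweet()` helper; it is not the code used in this report.

```r
# Hypothetical mini-lexicon with AFINN-style scores, plus a short negation list
lexicon   <- c(happy = 3, good = 3, bad = -3, corrupt = -3)
negations <- c("not", "no", "never", "without")

score_tweet <- function(text) {
  # split into lower-case words, dropping anything that is not a letter or apostrophe
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(0)

  # look up each word; words missing from the lexicon score 0
  scores <- unname(lexicon[words])
  scores[is.na(scores)] <- 0

  # reverse the score of any word directly preceded by a negation
  previous <- c("", head(words, -1))
  negated <- previous %in% negations
  sum(ifelse(negated, -scores, scores))
}

print(score_tweet("I am happy"))
print(score_tweet("I am not happy"))
```

This only catches a negation that sits immediately before the scored word, so phrases like ‘not very happy’ would still count as positive.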

I’m unsure why, but when I ran this code on a different computer tidytext::get_sentiments("afinn") wasn’t working (I think due to my internet settings), so I had to import the lexicon manually. Using the tidytext function should work for you.

4.2 AFINN sentiment analysis

## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
## Warning: Column `word` joining character vector and factor, coercing into
## character vector

I really have some questions about this. It will probably need more analysis.

## Warning: Column `word` joining character vector and factor, coercing into
## character vector

4.5 Bing sentiment analysis

This calculates words that are preceded by a negation (not added in yet).

## Joining, by = "word"

Overall, many more negative words than positive words were used for all candidates.

Run install_github("dgrtwo/drlib") (from the devtools package) in the console to download the package that provides ‘reorder_within’ (without it, faceting doesn’t produce the results in order).

## Selecting by n

5. Statistical Analysis

Within this section I want to understand which words were contributing most to influence, as measured by either retweets or favourite count. In order to do this I wanted to perform a LASSO regression, however I am struggling to preprocess the data in a way that gives meaningful findings.

This graph suggests that as we add more words to our regression model it doesn’t make it hugely more accurate.

Currently this has stopped working… I will get it up and running and update RPubs when I can. I have taken the code out of chunks so I can knit without errors.

tweet_tokens_filtered <- tweet_tokens %>%
  distinct(row, word) %>%
  add_count(word) %>%
  filter(n >= 500)

tweet_tokens_matrix <- tweet_tokens_filtered %>%
  cast_sparse(row, word)

tweet_ids <- as.integer(rownames(tweet_tokens_matrix))
tweets_lasso <- tweets %>%
  filter(!is.na(retweet_count))

retweets <- tweets_lasso$retweet_count[tweet_ids]

cv_glmnet_model_tweets <- cv.glmnet(tweet_tokens_matrix, retweets)
plot(cv_glmnet_model_tweets)

tweet_lexicon <- cv_glmnet_model_tweets$glmnet.fit %>%
  tidy() %>%
  filter(term != "(Intercept)") %>%
  filter(!str_detect(term, pattern = START %R% "http"))

tweet_lexicon %>%
  arrange(estimate) %>%
  group_by(direction = ifelse(estimate < 0, "Negative", "Positive")) %>%
  top_n(20, abs(estimate)) %>%
  ungroup() %>%
  mutate(term = fct_reorder(term, estimate)) %>%
  ggplot(aes(term, estimate, fill = direction)) +
  geom_col() +
  coord_flip() +
  labs(y = "Estimated effect of word on the retweets")

6. Executive Summary

This report looks at a dataset of 200,000 tweets we know were authored by the Internet Research Agency. I am interested in better understanding the way in which the Russian Government is trying to influence opinion. As it stands I believe there are a number of possibilities. Either -

  • They are attempting to promote a specific narrative e.g. pro-Trump
  • They are attempting to create division, stoking issues such as Black Lives Matter as well as Trump
  • They are talking about everything in an attempt to give the impression of false consensus (astro-turfing).

In order to do this I have aimed to make observations around the themes and content of the tweets. I analysed both the whole data set and tweets that made a reference to one specific candidate.

I’m going to briefly outline my current findings and observations of interest.

Firstly, supervised learning revealed a clear messaging strategy when the tweets were filtered by candidate.

These findings were reinforced by using log odds calculations to measure the likelihood that words would be used in conjunction with a particular candidate relative to how often they appeared elsewhere.

When we examine the whole corpus we see just how directed conversation was towards Trump related topics.

Unsupervised learning algorithms were used to detect themes amongst the rest of the corpus (that is, tweets that didn’t explicitly mention a politician). Both structural topic modelling and latent Dirichlet allocation were used, with mixed results. With further corpus preprocessing, and further calculation of an appropriate k value, I believe better results are obtainable. Below are the results achieved through structural topic modelling (I am still trying to work out a better visualisation method).

Sentiment analysis revealed a host of negative and positive words associated with each candidate.

Sentiment analysis revealed some changes in mood over time, although I would like to verify this with more research. Attempts to isolate individual emotions were not very successful.

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 1.4766e+09
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 92880
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1.4789e+12

For example, if we look specifically at Clinton we see a huge number of negative words influencing our analysis.

Regarding sentiment analysis, I am still in the process of coding words preceded by a negation to count against the score (so ‘not happy’ would be negative, for example). This should be finished soon.

Alex Stephenson

2019-08-24