Final IRA Report
Final IRA Report
- 1 - Data import, preprocessing and initial exploratory analysis
- 1.1 Packages
- 1.2 Data Import
- 1.3 Preprocessing
- 1.4 Exploratory Analysis
- Exploratory Data Analysis
- Key findings from EDA:** * There were accounts openly tweeting from Russia * Autumn of 2016 saw by far the heaviest activity * Some days saw as few as 1 tweet being published, others as high as 4000 * Hashtags were mainly political, with some non-political ones.
- 2 - Textual and thematic analysis
- 3. Topic Modelling
- 5. Statistical Analysis
- 6. Executive Summary
Through the document if you would like to see a specific piece of code that created the output press the code button in the right hand corner. Alternatively, press the code button in the top right hand corner and press ‘show all code’.
1 - Data import, preprocessing and initial exploratory analysis
1.1 Packages
#Packages
library(tidyverse)
library(tidyr)
library(purrr)
library(tidytext)
library(widyr)
library(ggraph)
library(igraph)
library(topicmodels)
library(lubridate)
library(Matrix)
library(glmnet)
library(doMC)
library(stm)
library(quanteda)
library(rebus)
library(devtools)
library(tidylo)
library(ggrepel)
library(rmdformats)
library(SnowballC)
library(tm)
options(scipen = 999, digits = 2)
set.seed(12345)1.2 Data Import
The data is a dataframe of 200,000 tweets leaked by Twitter and subsequently obtained by NBC.
## # A tibble: 6 x 16
## user_id user_key created_at created_str retweet_count retweeted
## <dbl> <chr> <dbl> <dttm> <dbl> <lgl>
## 1 2.53e9 kathiem… 1.49e12 2017-02-27 14:54:00 NA NA
## 2 2.53e9 traceyh… 1.47e12 2016-08-15 14:50:20 NA NA
## 3 NA evewebs… 1.44e12 2015-06-30 21:56:09 NA NA
## 4 4.84e9 blackto… 1.47e12 2016-09-16 08:04:48 18 FALSE
## 5 1.69e9 jacquel… 1.47e12 2016-09-18 19:46:25 0 FALSE
## 6 2.59e9 judelam… 1.46e12 2016-04-07 11:37:45 NA NA
## # … with 10 more variables: favorite_count <dbl>, text <chr>,
## # tweet_id <dbl>, source <chr>, hashtags <chr>, expanded_urls <chr>,
## # posted <chr>, mentions <chr>, retweeted_status_id <dbl>,
## # in_reply_to_status_id <dbl>
## # A tibble: 6 x 14
## id location name followers_count statuses_count time_zone verified
## <dbl> <chr> <chr> <dbl> <dbl> <chr> <lgl>
## 1 1.00e8 still ⬆… #Eze… 1053 31858 <NA> FALSE
## 2 2.47e8 Chicago… B E … 650 6742 Mountain… FALSE
## 3 2.50e8 <NA> Chri… 44 843 <NA> FALSE
## 4 4.50e8 <NA> Рамз… 94773 10877 Moscow FALSE
## 5 4.72e8 Санкт-П… Марг… 23305 18401 Volgograd FALSE
## 6 1.04e9 Amerika Dark… 22 22603 Jakarta FALSE
## # … with 7 more variables: lang <chr>, screen_name <chr>,
## # description <chr>, created_at <chr>, favourites_count <dbl>,
## # friends_count <dbl>, listed_count <dbl>
1.3 Preprocessing
The only main edit I need to make to the dataframe to begin with is to create a secondary data frame that is filtered to only include tweets that mention either Clinton, Sanders, Obama or Trump.
tweets$party <- NA
clinton <- c("Clinton", "Hillary", "Hillary Clinton")
trump <- c("Donald", "Trump", "Donald Trump")
obama <- c("Barack", "Obama", "Barack Obama")
sanders <- c("Bernie", "Sanders", "Bernie Sanders")
politicians <- c(clinton, trump, obama, sanders)
politicians <- paste(politicians, collapse = "|")
politicians <- tolower(politicians)
party_tweets <- tweets %>%
mutate(text = tolower(text)) %>%
filter(str_detect(text, politicians))
party_tweets$party <- str_extract(party_tweets$text, politicians)
party_tweets$party[party_tweets$party == "barack"] <- "Obama"
party_tweets$party[party_tweets$party == "obama"] <- "Obama"
party_tweets$party[party_tweets$party == "bernie"] <- "Sanders"
party_tweets$party[party_tweets$party == "sanders"] <- "Sanders"
party_tweets$party[party_tweets$party == "clinton"] <- "Clinton"
party_tweets$party[party_tweets$party == "hillary"] <- "Clinton"
party_tweets$party[party_tweets$party == "trump"] <- "Trump"
party_tweets$party[party_tweets$party == "donald"] <- "Trump"
custom_stop_words <- bind_rows(tibble(word = c("https", "t.co", "rt", "amp"),
lexicon = c("custom")),
stop_words) 1.4 Exploratory Analysis
I want to get a feel for the data here and initially see if any interesting patterns emerge.
## [1] "user_id" "user_key"
## [3] "created_at" "created_str"
## [5] "retweet_count" "retweeted"
## [7] "favorite_count" "text"
## [9] "tweet_id" "source"
## [11] "hashtags" "expanded_urls"
## [13] "posted" "mentions"
## [15] "retweeted_status_id" "in_reply_to_status_id"
## [17] "party"
## [1] "id" "location" "name"
## [4] "followers_count" "statuses_count" "time_zone"
## [7] "verified" "lang" "screen_name"
## [10] "description" "created_at" "favourites_count"
## [13] "friends_count" "listed_count"
Exploratory Data Analysis
ggplot(tweets, aes(created_str)) +
geom_histogram(bins = 100) +
labs(x = "Date", y = "Number of tweets", title = "Number of tweets between 2015 and 2017")## Warning: Removed 21 rows containing non-finite values (stat_bin).
most_active_days <- tweets %>%
mutate(created_str = floor_date(created_str, unit = "1 day")) %>%
count(created_str, sort = T)
most_active_days %>%
mutate(created_str = reorder(created_str, n)) %>%
head(n = 10) %>%
ggplot(aes(created_str, n)) +
geom_col() +
coord_flip() +
labs(title = "10 most 'active' days", y = "Number of tweets", x = "Date of tweet") Interestingly, the days where the IRA were most active all fell in a short period around the autumn of 2016.
Lets see what words were being used on this day:
most_active_day <- tweets %>%
mutate(created_str = floor_date(created_str, unit = "1 day")) %>%
filter(created_str == as.Date("2016-10-06"))
most_active_day %>%
unnest_tokens(word, text, token = "tweets") %>%
filter(!word %in% custom_stop_words$word) %>%
count(word, sort = T) %>%
filter(n > 100) %>%
mutate(word = reorder(word,n)) %>%
ggplot(aes(word, n)) +
geom_col() +
coord_flip() +
labs(x = "", y = "Word", title = "Most common words on 6th October, 2016")## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
Hmm, not that much of interest. I might come back to this later and look at what was talked about on this day relative to other days (either log odds or tf-idf). A google search only reveals that a hurricane was coming but there isn’t a clear explanation for the spike in activity on this day (beyond the fact it’s election time).
Lets look at the (log) distribution of the number of tweets per day.
most_active_days %>%
ggplot(aes(n)) +
geom_histogram() +
scale_x_log10() +
labs(title = "Number of days with 'n' amount of tweets on log scale", y = "")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
most_active_day <- tweets %>%
mutate(created_str = floor_date(created_str, unit = "1 day")) %>%
filter(created_str == as.Date("2016-10-06"))Some days (nearly 40) we saw as few as 1 tweet being published by the IRA, other days it could be as high as 4000.
Lets take a look at which hashtags were commonly used
tweets %>%
count(hashtags, sort = T) %>%
filter(hashtags != "[]") %>%
filter(n > 500) %>%
mutate(hashtags = reorder(hashtags, n)) %>%
ggplot(aes(x = hashtags, y = n)) +
geom_col() +
coord_flip() +
labs(title = "Most popular hashtags", x = "Hashtags", y = "Number of occurances")How were the number of retweets and favourites distributed across the dataset?
par(mfrow = c(1, 2))
tweets %>%
ggplot(aes(retweet_count)) +
geom_histogram() +
scale_x_log10() +
labs(title = "Retweet count on log scale", y = "", x = "Number of retweets on each tweet")tweets %>%
ggplot(aes(favorite_count)) +
geom_histogram() +
scale_x_log10() +
labs(title = "Favourite count on log scale", y = "", x = "Number of favourites on each tweet")users %>%
ggplot(aes(friends_count)) +
geom_histogram() +
scale_x_log10() +
labs(caption = "Normal log distribution with left skew", x = "Friend count", title = "Distribution of 'friends' on a log scale")corr <- as.data.frame(cor(tweets$retweet_count, tweets$favorite_count, use = "complete.obs"))
corr <- round(corr, digits = 2)
ggplot(tweets, aes(x = favorite_count, y = retweet_count)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Correlation between favourite count and retweet count", x = "Favourite count", y = "Retweet count", caption = paste("Correlation = ", corr))## Warning: Removed 145399 rows containing non-finite values (stat_smooth).
## Warning: Removed 145399 rows containing missing values (geom_point).
Where were these accounts tweeting from?
#Quickly recode
users$location[users$location == "United States"] <- "USA"
users$location[users$location == "US"] <- "USA"
users$location[users$location == "Москва"] <- "Россия"
users$location[users$location == "Москва, Россия"] <- "Россия"
users$location[users$location == "Atlanta, GA"] <- "Atlanta"
users %>%
select(location) %>%
na.omit() %>%
group_by(location) %>%
count(location, sort = T) %>%
ungroup() %>%
filter(n > 3) %>%
mutate(location = reorder(location, n)) %>%
ggplot(aes(location, n)) +
geom_col() +
coord_flip() +
labs(x = "User location", y = "Number of tweets", title = "Number of tweets by location") I find it interesting that some accounts are openly tweeting from Russia..
Which accounts were the most prolific?
## # A tibble: 6 x 2
## user_key n
## <chr> <int>
## 1 ameliebaldwin 9269
## 2 hyddrox 6813
## 3 giselleevns 6652
## 4 patriotblake 4140
## 5 thefoundingson 3663
## 6 melvinsroberts 3346
We see Amelia Baldwin was the most prolific account, accounting for nearly 5% of all tweets. What was she tweeting about?
amelia_baldwin <- tweets %>%
filter(user_key == "ameliebaldwin")
amelia_baldwin %>%
arrange(desc(retweet_count)) %>%
head(n = 10) %>%
select(text)## # A tibble: 10 x 1
## text
## <chr>
## 1 "It's either working class \"kills\" her, or she kills working class\nS…
## 2 We need to bring our country back from the edge of extinction!!! @realD…
## 3 "You'd better read it before watching #debates \nhttps://t.co/wPo3W6sVU…
## 4 First time ever voting in 38 Years! #ElectionDay #TrumpWinsBecause #Tru…
## 5 CALLING ON ALL #TRUMP_VOTERS URGENT‼️ #TrumpPence16 #TrumpForPresiden…
## 6 .@realDonaldTrump why would anyone vote for someone who will raise taxe…
## 7 I don't vote dems, but...poor Bernie https://t.co/u8O7V9kxKv
## 8 #ImNotWithHer #NeverHilary #TrumpPence16 #MakeAmericaGreatAgain https:/…
## 9 RT @American_Woman4: #MAGA,#FEMININEAMERICA4TRUMP,#LGBT4Trump,#Fl4Trump…
## 10 "RT @Conservatexian: News post: \"TWITTER Buries 32 of Donald Trump\u00…
Interestly, there is very little engagement (retweets, favourites) with the tweets. Particularly relevant given that all these tweets are coming from a network of accounts, that could certainly promote them more if wanted. Evidence of astro-turfing?
Lets see what she’s tweeting about more broadly:
amelia_baldwin %>%
filter(user_key == "ameliebaldwin") %>%
unnest_tokens(word, text, token = "tweets") %>%
filter(!word %in% custom_stop_words$word) %>%
count(word, sort = T)## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 28,680 x 2
## word n
## <chr> <int>
## 1 trump 1979
## 2 clinton 1094
## 3 hillary 1055
## 4 @realdonaldtrump 628
## 5 obama 604
## 6 #maga 446
## 7 @hillaryclinton 320
## 8 people 317
## 9 donald 290
## 10 #trump 276
## # … with 28,670 more rows
2 - Textual and thematic analysis
2.1 Whole corpus thematic analysis
tweet_tokens <- tweets %>%
select(user_id, user_key, text, created_str) %>%
na.omit() %>%
mutate(row= row_number()) %>%
unnest_tokens(word, text, token = "tweets") %>%
filter(!word %in% custom_stop_words$word)## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 253,805 x 2
## word n
## <chr> <int>
## 1 trump 24138
## 2 clinton 10588
## 3 hillary 9710
## 4 obama 7336
## 5 people 6345
## 6 dont 5553
## 7 donald 4875
## 8 @realdonaldtrump 4007
## 9 im 3996
## 10 #tcot 3740
## # … with 253,795 more rows
Looks somewhat what we might expect.
Now I’m going to look at bigram relations to see which words frequently come together and then graph the relationships (the ‘bigrams’.
bigrams <- tweets %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) #re-tokenise our tweets with bigrams.
bigrams_separated <- bigrams %>%
separate(bigram, c("word_1", "word_2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word_1 %in% custom_stop_words$word) %>%
filter(!word_2 %in% custom_stop_words$word)
bigram_counts <- bigrams_filtered %>%
count(word_1, word_2 , sort = T)
bigram_graph <- bigram_counts %>%
filter(n > 250) %>%
graph_from_data_frame()## Warning in graph_from_data_frame(.): In `d' `NA' elements were replaced
## with string "NA"
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()This graph shows the words that most frequently appear next to each other, hence we see a lot of names. However, it might be more useful to look at what words appear in the same tweet, but not neccesarily next to each other, if we want to get a better understanding of the themes emerging.
tweet_words <- tweets %>%
mutate(row = row_number()) %>%
unnest_tokens(word, text) %>%
filter(!word %in% custom_stop_words$word)
tweet_pairs <- tweet_words %>%
pairwise_count(word, row, sort = T)
tweet_pairs_graph <- tweet_pairs %>%
filter(n > 500) %>%
graph_from_data_frame()
tweet_pairs_graph <- ggraph(tweet_pairs_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
tweet_pairs_graph +
labs(title = "Words which appear in the same tweet most frequently")I don’t know what this ‘merkelmussbleiben’ means… lets take a look.
## # A tibble: 1,109 x 1
## text
## <chr>
## 1 @johannesvogel würde Frau Merkel 4. Amtszeit schaffen? #Merkelmussbleib…
## 2 Vorwärts immer, rückwärts nimmer! #Merkelmussbleiben
## 3 #Merkel hält an Flüchtlings-Deal mit der #Türkei fest #Merkelmussbleiben
## 4 Sie ist nicht gleichgültig! #Merkelmussbleiben #girlstalkselfies
## 5 @BjoernMaatz sind die Chancen bei Bundeskanzlerin Merkel für 4. Amtszei…
## 6 Ich glaub sie ist alternativlos! #Merkelmussbleiben
## 7 Mehr Platz für Familie #Merkelmussbleiben
## 8 Merkel ist nicht radikal, dennoch macht sie keine Rückschritte! #Merkel…
## 9 Es kommt mir vor, Frau Merkel hat alle Chancen für noch eine Amtszeit! …
## 10 #Merkel rettet Syrische leben #Merkelmussbleiben
## # … with 1,099 more rows
OK so firsly we can see this hashtag was used over a thousand times, secondly we see that it’s not just American politics but also German.
##
## de en es fr id ru
## 18 272 1 1 1 90
I’m checking the other dataset here (the one of the account details). The language variable only accounts for 400 of the 453 accounts but it seems close enough.
Lets finally consider using correlation of words to draw out themes. In this instance we are looking at which words appear most frequently together relative to how often they appear with other words. For example a correlation of 0.99 of ‘opiceisis’ and ‘iceisis’ suggest that these words are almost always found together and never apart.
tweet_cor <- tweet_words %>%
group_by(word) %>%
filter(n() > 200) %>%
pairwise_cor(word, row, sort = T)
tweet_cor_graph <- tweet_cor %>%
filter(correlation > .4) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void()
tweet_cor_graph +
labs(title = "Words which appear most often together relative to how often they are alone") Ths isn’t hugely helpful at the moment, it’s revealing names (which we know are likely to appear together anyway), although ‘tax returns’, ‘stop islam’ and ‘pray for brussels’, and the ‘iceisis’ and ‘opiceisis’ suggests they were purposely taregtting divisive topics.
2.2 By-party thematic analysis
Given everything here it’s really hard to conclude anything other than the IRA were overtly pro Trump.
Im going to start comparing word frequency between different candidates?
party_tokens <- party_tweets %>%
unnest_tokens(word, text, token = "tweets") %>%
filter(!word %in% custom_stop_words$word)
frequency <- party_tokens %>%
group_by(party) %>%
count(word, sort = TRUE) %>%
left_join(party_tokens %>%
group_by(party) %>%
summarise(total = n())) %>%
mutate(freq = n/total) %>%
filter(n >= 20)
frequency <- frequency %>%
select(party, word, freq) %>%
spread(party, freq) %>%
arrange(Clinton, Trump)
frequency %>%
ggplot(aes(Clinton, Trump)) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = scales::percent_format()) +
scale_y_log10(labels = scales::percent_format()) +
geom_abline(color = "red") +
labs(title = "Relative word frequency of Trump and Clinton")party_tweets %>%
ggplot(aes(party)) +
geom_bar() +
labs(title = "Number of mentions of each politician", x = "Politician", y = "")party_tweets %>%
ggplot(aes(created_str, fill = party)) +
geom_histogram(show.legend = FALSE) +
facet_wrap(~ party, scales= "free_y", ncol = 1) +
labs(caption = "Note here that the scales are different for Sanders")OK so relatively speaking there are lots of mention of Sanders, although the actual counts for Sanders are not very high (change free_y to see non relative).
party_tweets %>%
ggplot(aes(retweet_count)) +
geom_histogram() +
facet_wrap(~ party) +
scale_x_log10() +
labs(x = "Retweet count", y = "")summary_of_retweets <- party_tweets %>%
filter(!is.na(retweet_count)) %>%
group_by(party) %>%
summarise(total_rt = sum(retweet_count), num_of_tweets = sum(party == 'Clinton', party == 'Trump', party == 'Barack', party == 'Sanders')) %>%
mutate(avg_rt = total_rt / num_of_tweets) %>%
filter(!is.infinite(avg_rt))
summary_of_retweets %>%
filter(num_of_tweets > 0) %>%
ggplot(aes(party, avg_rt, fill = party)) +
geom_col() +
labs(x = "Party", y = "Average number of retweets") Looks like Clinton was more ‘popular’ on social media. Are these statistically significant differences?
## Df Sum Sq Mean Sq F value Pr(>F)
## party 3 3647215 1215738 20.4 0.00000000000034 ***
## Residuals 44857 2675917282 59654
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 20959 observations deleted due to missingness
Yes, they are.
2.3 Using logs-odd ratio
Given most of our party dataset looks at Clinton and Trump lets compare the log odds ratios for these two and see if it reveals anything about the content.
word_ratios <- party_tokens %>%
filter(party == "Trump" | party == "Clinton") %>%
count(word, party, sort = T) %>%
group_by(word) %>%
filter(sum(n) >= 50) %>%
ungroup() %>%
spread(party, n, fill = 0) %>%
mutate_if(is.numeric, funs((. + 1) / (sum(.) + 1))) %>%
mutate(logratio = log(Clinton / Trump)) %>%
arrange(desc(logratio))## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
par(mfrow = c(1, 2))
clinton_trump_lo_graph_handles <- word_ratios %>%
group_by(logratio < 0) %>%
top_n(15, abs(logratio)) %>%
ungroup() %>%
mutate(word = reorder(word, logratio)) %>%
ggplot(aes(word, logratio, fill = logratio < 0)) +
geom_col(show.legend = FALSE) +
coord_flip() +
ylab("log odds ratio (Trump/Clinton)") +
scale_fill_discrete(name = "", labels = c("Labor", "Liberal")) +
labs(title = "Log odds ratio of word use between candidates", subtitle = "With @ handles")
clinton_trump_lo_graph_handlesclinton_trump_lo_graph_no_handles <- word_ratios %>%
filter(!str_detect(word, pattern = START %R% "@")) %>%
filter(!str_detect(word, pattern = START %R% "https")) %>%
group_by(logratio < 0) %>%
top_n(15, abs(logratio)) %>%
ungroup() %>%
mutate(word = reorder(word, logratio)) %>%
ggplot(aes(word, logratio, fill = logratio < 0)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(y= "log odds ratio (Trump/Clinton)", x = "") +
scale_fill_discrete(name = "", labels = c("Trump", "Clinton")) +
labs(title = "Log odds ratio of word use between candidates", subtitle = "Without hashtags")
clinton_trump_lo_graph_no_handlesIm going to do another log odds comparison but this time I’ll do it between all the parties:
party_tokens_count <- party_tokens %>%
group_by(party) %>%
count(word, sort = T) %>%
ungroup()
party_tokens_log_odds <- party_tokens_count %>%
bind_log_odds(party, word, n)
party_tokens_log_odds_graph <- party_tokens_log_odds %>%
filter(!word %in% c("trump", "clinton", "hillary", "obama", "donald", "bernie", "sanders", "hillarys", "#trump", "#bernie", "berniesanders", "@hillaryclinton", "#hillaryclinton", "barack", "obamaas", "#obama", "@berniesanders", "@sensanders", "#berniesanders", "clintons", "#hillary", "obamas", "@realdonaldtrump")) %>%
group_by(party) %>%
top_n(10, log_odds) %>%
ungroup %>%
mutate(word = fct_reorder(word, log_odds)) %>%
ggplot(aes(word, log_odds, fill = party)) +
geom_col(show.legend = FALSE) +
facet_wrap(~party, scales = "free") +
coord_flip() +
scale_y_continuous(expand = c(0,0)) +
labs(y = "Logs odds ratio", x = "", title = "Which words or phrases are most specific to each candidate?")
party_tokens_log_odds_graphSo this is a similar graph but for all of the candidates. I’m leaving the previous ones up because it goes into more depth with the Clinton Trump differences, and also because I manually calculated the log odds ratio (keeping as reference, this time I used the bind_log_odds function from the tidylo package.)
I think I want to do some log odds ratio analysis for which topics / words were being used over time. I’ll do that later…
3. Topic Modelling
Topic modelling is a form of unsupervised machine learning that aims to draw out ‘clusters’ (specified by the user) of topics. As it stands my analysis hasn’t been particularly successful, although I think results can be improved with a bit more data preprocessing and tweaking the algorith.
First off, I’m going to calculate the ‘term frequency - inverse document frequency’ which aims to calculate how important a word is to a particular topic or document.
tidy_party_tweets <- party_tweets %>%
unnest_tokens(word, text, token = "tweets") %>%
filter(!word %in% custom_stop_words$word)## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
tidy_party_tweets_filtered <- tidy_party_tweets %>%
filter(!str_detect(word, pattern = START %R% "@")) #remove @s
party_tf_idf <- tidy_party_tweets_filtered %>%
count(party, word, sort = TRUE) %>%
bind_tf_idf(word, party, n) %>%
arrange(-tf_idf) %>%
group_by(party) %>%
top_n(15) %>%
ungroup()## Selecting by tf_idf
party_tf_idf_graph <- party_tf_idf %>%
mutate(word = reorder(word, tf_idf)) %>%
ggplot(aes(word, tf_idf, fill = party)) +
geom_col(alpha = 0.8, show.legend = FALSE) +
scale_x_reordered() +
facet_wrap(~ party, scales = "free") +
coord_flip()
party_tf_idf_graphNext, the data needs to be transformed (cast) from the current ‘tidy’ data frame into a format suitable for machine learning.
3.1Structural Topic Modelling
party_dfm <- tidy_party_tweets_filtered %>%
count(party, word, sort = T) %>%
cast_dfm(party, word, n)
structural_tm <- stm(party_dfm, K = 24, verbose = FALSE, init.type = "Spectral")
td_beta_24 <- tidy(structural_tm)
td_beta_24 %>%
group_by(topic) %>%
top_n(6, beta) %>%
ungroup() %>%
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = as.factor(topic))) +
geom_col(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
labs(x = NULL, y = expression(beta),
title = "Highest word probabilities for each topic",
subtitle = "Different words are associated with different topics")So far my results haven’t been particularly conclusive. I also want to find a better way to visualise these results.
td_gamma_24 <- tidy(structural_tm, matrix = "gamma")
ggplot(td_gamma_24, aes(gamma, fill = as.factor(topic))) +
geom_histogram(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ topic, ncol = 3) +
labs(title = "Distribution of document probabilities for each topic",
subtitle = "Each topic is associated with 1-3 stories",
y = "Number of topics", x = expression(gamma))## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The gamma result here shows the probability that a particular topic has a document that belongs to it. I think the results here are not very impressive because the data is already too structured (I have manually filtered it by party to begin with). Given we already have a clear idea of what is being discussed in relation to the different candidates it might be a better idea to use this on the considerably more unknown data (e.g. everything not directly mentioning a candidate).
Ok so here I’m filtering for tweets that don’t explicitly mention a politician (2/3 of the dataset). And them Im going to try and complete some machine learning analysis.
This is still STM
non_party_tweets <- tweets %>%
mutate(text = tolower(text)) %>%
filter(!str_detect(text, politicians))
non_party_tweets_tokens <- non_party_tweets %>%
filter(!is.na(tweet_id)) %>%
unnest_tokens(word, text, token = "tweets") %>%
filter(!word %in% custom_stop_words$word) %>%
filter(!str_detect(word, pattern = START %R% "@")) #remove @s## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
non_party_tweets_dfm <- non_party_tweets_tokens %>%
count(tweet_id, word, sort = T) %>%
cast_dfm(tweet_id, word, n)
non_party_stm <- stm(non_party_tweets_dfm, K = 24, verbose = FALSE, init.type = "Spectral")
non_party_stm_beta <- tidy(non_party_stm, matrix = "beta")
non_party_stm_beta_graph <- non_party_stm_beta %>%
group_by(topic) %>%
top_n(6, beta) %>%
ungroup() %>%
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = as.factor(topic))) +
geom_col(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
labs(x = NULL, y = expression(beta),
title = "Highest word probabilities for each topic",
subtitle = "Different words are associated with different topics")
non_party_stm_beta_graphnon_party_stm_gamma <- tidy(non_party_stm, matrix = "gamma")
non_party_stm_gamma_graph <- ggplot(non_party_stm_gamma, aes(gamma, fill = as.factor(topic))) +
geom_histogram(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ topic, ncol = 3) +
labs(title = "Distribution of document probabilities for each topic",
subtitle = "Each topic is associated with 1-3 stories",
y = "Number of topics", x = expression(gamma))
non_party_stm_gamma_graph## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(non_party_stm_gamma, aes(gamma)) +
geom_histogram() +
scale_y_log10() +
labs(title = "Distribution of probabilities for all topics",
y = "Number of documents", x = expression(gamma))## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#3.2 Latent Dirichlet Allocation
OK, alternatively I’m going to try and use LDA. I’ve written a function here so one can input their cleaned corpus and nmber of desired clusters (k =) and it will perform LDA
top_terms_by_topic_LDA <- function(input_text, # should be a columm from a dataframe
plot = T, # return a plot? TRUE by defult
number_of_topics = 4) # number of topics (4 by default)
{
# create a corpus (type of object expected by tm) and document term matrix
Corpus <- Corpus(VectorSource(input_text)) # make a corpus object
DTM <- DocumentTermMatrix(Corpus) # get the count of words/document
# remove any empty rows in our document term matrix (if there are any
# we'll get an error when we try to run our LDA)
unique_indexes <- unique(DTM$i) # get the index of each unique value
DTM <- DTM[unique_indexes,] # get a subset of only those indexes
# preform LDA & get the words/topic in a tidy text format
lda <- LDA(DTM, k = number_of_topics, control = list(seed = 1234))
topics <- tidy(lda, matrix = "beta")
# get the top ten terms for each topic
top_terms <- topics %>% # take the topics data frame and..
group_by(topic) %>% # treat each topic as a different group
top_n(10, beta) %>% # get the top 10 most informative words
ungroup() %>% # ungroup
arrange(topic, -beta) # arrange words in descending informativeness
# if the user asks for a plot (TRUE by default)
if(plot == T){
# plot the top ten terms for each topic in order
top_terms %>% # take the top terms
mutate(term = reorder(term, beta)) %>% # sort terms by beta value
ggplot(aes(term, beta, fill = factor(topic))) + # plot beta by theme
geom_col(show.legend = FALSE) + # as a bar plot
facet_wrap(~ topic, scales = "free") + # which each topic in a seperate plot
labs(x = NULL, y = "Beta") + # no x label, change y label
coord_flip() # turn bars sideways
}else{
# if the user does not request a plot
# return a list of sorted terms instead
return(top_terms)
}
}I’m going to try with a less ‘tidy’ approach here and used quanteda and tm functions. The words get filtered for stop words and then ‘stemmed’.
tweets_Corpus <- Corpus(VectorSource(non_party_tweets$text))
tweets_DTM <- DocumentTermMatrix(tweets_Corpus)
tweets_DTM_tidy <- tidy(tweets_DTM)
tweets_DTM_tidy_cleaned <- tweets_DTM_tidy %>%
filter(!term %in% custom_stop_words$word) %>%
mutate(stem = wordStem(term))
cleaned_documents <- tweets_DTM_tidy_cleaned %>%
group_by(document) %>%
mutate(terms = toString(rep(stem, count))) %>%
select(document, terms) %>%
unique()non_party_lda_24_graph <- top_terms_by_topic_LDA(cleaned_documents$terms, number_of_topics = 24)
non_party_lda_24_graph4. Sentiment Analysis
4.1 Data preprocessing
I am still in the process of coding this. Sentiment analysis aims to apply different rankings or scores to each word and then plot how this varies by candidate or over time. In order to make the analysis more accurate it is necessary to factor in word preceded by negations. I have nearly finished this but the final stage is proving a bit tricky.
afinn <- read.delim("AFINN-111.txt", header = F)
names(afinn) <- c("word", "value")
negation_words <- c("not", "no", "never", "without", "won't", "dont", "doesnt", "doesn't", "don't", "can't") #n.b. I think the tokenisation process removes apostrophes and other punctuation but not sure - defintiely performs tolower().
not_words <- bigrams_separated %>%
filter(word_1 %in% negation_words) %>%
inner_join(afinn, by = c(word_2 = "word")) %>%
count(word_2, value, sort = TRUE) %>%
mutate(value = value * -1) %>%
mutate(contribution = value * n)## Warning: Column `word_2`/`word` joining character vector and factor,
## coercing into character vector
Unsure why but when I ran this code on a different computer tidytext::get_sentiments(“afinn”) wasn’t working (I think due to my internet settings)so I had to manually import the lexicon. Using the tidytext function should work for you.
4.2 Afinn sentimental analysis
party_tokens <- party_tweets %>%
unnest_tokens(word, text, token = "tweets") %>%
filter(!word %in% custom_stop_words$word) %>%
filter(!word == "like") #like is included as a positive word, but isn't used as one. ## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## Warning: Column `word` joining character vector and factor, coercing into
## character vector
party_sentiment_over_time <- party_sentiment %>%
mutate(created_str = floor_date(created_str, unit = "1 day")) %>%
add_count(created_str, word) %>%
mutate(total_score = value * n) %>%
group_by(created_str)
sentiment_over_time <- tweet_tokens %>%
select(word, created_str) %>%
inner_join(afinn, by = "word") %>%
mutate(created_str = floor_date(created_str, unit = "1 day")) %>%
add_count(created_str, word) %>%
mutate(total_score = value * n) %>%
group_by(created_str)## Warning: Column `word` joining character vector and factor, coercing into
## character vector
sentiment_over_time_graph <- sentiment_over_time %>%
filter(created_str != as.Date("2016-07-21")) %>%
summarise(by_day_avg = sum(total_score) / n()) %>%
ggplot(aes(created_str, by_day_avg, fill = by_day_avg > 0)) +
geom_col(show.legend = F) +
scale_y_continuous(labels = scales::comma) +
labs(x = "Net sentiment", y = "", title = "The average sentiment by day across the dataset", subtitle = "AFINN lexicon used", caption = "Net sentiment was calculated by summing the total sentiment for each day and dividing by the number of tweets")
sentiment_over_time_graphparty_sentiment_over_time_graph <- party_sentiment_over_time %>%
select(created_str, word, value, n, total_score, party) %>%
group_by(created_str) %>%
ggplot(aes(created_str, mean(total_score) , fill = party)) +
geom_col(show.legend = F) +
ylab("") +
xlab("net sentiment") +
scale_y_continuous(labels = scales::comma) +
facet_wrap(~party, scales = "free")
party_sentiment_over_time_graph I really have some questions about this.Will probably need more analysis.
clinton_sentiment_words_graph <- party_tokens %>%
filter(party == "Clinton") %>%
count(word, sort = T) %>%
inner_join(afinn, by = "word") %>%
mutate(total_score = value * n) %>%
top_n(20, abs(total_score)) %>%
mutate(word = reorder(word, total_score)) %>%
ggplot(aes(x = word, y = total_score, fill = total_score > 0)) +
geom_col(show.legend = F) +
coord_flip() +
labs(y = "", x = "Contribution to score (n * value)")## Warning: Column `word` joining character vector and factor, coercing into
## character vector
4.3 Loughran sentiment analysis
party_sentiment_loughran <- party_tokens %>%
inner_join(get_sentiments("loughran"), by = "word") %>%
filter(!word == "like")
party_sentiment_loughran_graph <- party_sentiment_loughran %>%
filter(created_str > as.Date("2016-10-01") & created_str < as.Date("2016-11-01")) %>%
mutate(created_str = floor_date(created_str, unit = "1 day")) %>%
filter(sentiment != "superfluous") %>%
add_count(created_str, word) %>%
group_by(created_str) %>%
count(sentiment, party) %>%
filter(!sentiment == "litigious") %>%
ggplot(aes(created_str, n, color = party)) +
geom_smooth(se = FALSE) +
geom_point()+
facet_wrap(~ sentiment, scales = "free") +
labs(title = "Average sentiment during October 2016 for Clinton, Trump and Sanders", x = "", y = "")
party_sentiment_loughran_graphthis area graph isn’t working. Will fix.
party_sentiment_loughran %>% ggplot(aes(created_str, n, fill = sentiment, color = sentiment)) + geom_area(stat = “identity”, position = “fill”, alpha = 0.6, size = 0.3, color = “black”) + facet_wrap(~ party, scales = “free”)
4.5 Bing sentiment analysis
This calculates words that are preceded by a negation (not added in yet).
bing <- get_sentiments("bing")
bing <- bing %>%
filter(!word == "like") %>% #as before, not really use in 'i like' context.
filter(!word == "trump") #ranked as positive
bing_party_tokens <- party_tokens %>%
inner_join(bing) %>%
select(party, word, sentiment) %>%
count(word, sentiment, party, sort = T)## Joining, by = "word"
bing_negation_words <- bigrams_separated %>%
filter(word_1 %in% negation_words) %>%
inner_join(bing, by = c(word_2 = "word")) %>%
group_by(sentiment) %>%
count(word_2, sort = TRUE)Overall, many many more negative words were used for all candidates.
bing_party_tokens_bargraph <- bing_party_tokens %>%
group_by(party, sentiment) %>%
summarise(number = sum(n)) %>%
ggplot(aes(party, number, fill = sentiment)) +
geom_bar(stat = "identity", position = position_dodge())
bing_party_tokens_bargraph Insert: install_github(“dgrtwo/drlib”) into the console to download the necessary package for ‘reorder_within’ (without it facetting doesn’t produce the results in order)
faceted_bing_party_tokens <- bing_party_tokens %>%
group_by(party, sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder_within(word, n, party)) %>% #this function is from D.Robinson's dev package, dl from github.
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = F) +
facet_wrap(~ party, scales = "free") +
coord_flip() +
scale_x_reordered() +
labs(title = "Most positive / negative words used by each candidate", y = "", x = "")## Selecting by n
5. Statistical Analysis
Within this section I want to understand which words were contributing most to influence, as measured by either retweets or favourite count. In order to do this I wanted to perform a LASSO regression, however I am struggling to preprocess the data in a way that gives meaningful findings.
This graph suggests that as we add more words to our regression model it doesn’t make it hugely more accurate.
Currently this has stopped working… Will get it up and running and update RPubs when I can. I have taken out of chunks so I can knit without errors.
tweet_tokens_filtered <- tweet_tokens %>% distinct(row, word) %>% add_count(word) %>% filter(n >= 500)
tweet_tokens_matrix <- tweet_tokens_filtered %>% cast_sparse(row, word)
tweet_ids <- as.integer(rownames(tweet_tokens_matrix)) tweets_lasso <- tweets %>% filter(!is.na(retweet_count))
retweets <- tweets_lasso$retweet_count[tweet_ids]
cv_glmnet_model_tweets <- cv.glmnet(tweet_tokens_matrix, retweets) plot(cv_glmnet_model_tweets)
tweet_lexicon <- cv_glmnet_model_tweets$glmnet.fit %>% tidy() %>% filter(term != “(Intercept)”) %>% filter(!str_detect(term, pattern = START %R% “http”))
tweet_lexicon %>% arrange(estimate) %>% group_by(direction = ifelse(estimate < 0, “Negative”, “Positive”)) %>% top_n(20, abs(estimate)) %>% ungroup() %>% mutate(term = fct_reorder(term, estimate)) %>% ggplot(aes(term, estimate, fill = direction)) + geom_col() + coord_flip() + labs(y = “Estimated effect of word on the retweets”)
6. Executive Summary
This report looks at a dataset of 200,000 tweets we know were authored by the Internet Research Agency. I am interested in better understanding the way in which the Russian Government is trying to influence opinion. As it stands I believe there are a number of possibilites. Either -
- They are attempting to promote a specific narrative e.g. pro-Trump
- They are attempting to create divison, stoking issues such as Black Lives Matter as well as Trump
- They are talking about everything in an attempt to give the impression of false consensus (astro-turfing).
In order to do this I have aimed to make observations around the themes and content of the tweets. I analysed both the whole data set and tweets that made a reference to one specific candidate.
I’m going to briefly outline my current findings and observations of interest.
Firstly, supervised learning revealed a clear messaging startegy when the tweets were filtered by candidate.
These findings were reinforced by using log odd calculations to measure the likelyhood words would be used in conjuncton with a particular candidate relative to how often they appeared elsewhere.
When we examine the whole corpus we see just how directed conversation was towards Trump related topics.
par(mfrow = c(1, 2))
tweet_pairs_graph +
labs(title = "Words which appear in the same tweet most frequently")tweet_cor_graph +
labs(title = "Words which appear most often together relative to how often they are alone")Unsupervised learning algorithms were used to detect themes amongst the rest of the corpus (that is, tweets that didn’t explicitly mention a politician). Both structural topic modelling and latent dirichlet were used, with mixed results. With further corpus preprocessing, and further calulcation of an appropriate k value I believe better results are obtainable. Below, are the results achieve through structural topic modelling (trying to work out a better visualisation method).
Sentiment analysis revealed a host of negative and positive words associated with each candidate.
par(mfrow = c(1, 2))
bing_party_tokens_bargraph +
labs(title = "Number of positive or negative words used in tweets mentioning this candidate", y = "", x = "Candidate")Sentiment analysis revealed some changes in mood over time, although I would like to verify this with more research. Attempts to isolate individual emotions were not very successful.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 1.4766e+09
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 92880
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 1.4789e+12
For example, if we look specifically at Clinton we see a huge number of negative words influencing our analysis
Regarding sentiment analysis I am still in the process of coding words preceded by a negative to count against the score (so not happy would be negative, for example), this should be finished soon.