As part of this text analysis project I have taken a data set (https://www.kaggle.com/taniaj/australian-election-2019-tweets) of tweets about the Australian election. My initial analysis led me to complete a sentiment analysis (both across the whole dataset and broken down by party). Secondly, I tried to calculate which themes were most prevalent across the dataset. In order to do this I thematically explored the data by simply looking at common words, which topics and themes appeared frequently together (word pairs, correlation and bigrams) and finally used topic modelling to bring out different topics within the dataset.

First off, lets perform some Exploratory Data Analysis.

library(tidyverse)
library(tidyr)
library(purrr)
library(tidytext)
library(widyr)
library(ggraph)
library(igraph)
library(topicmodels)
library(lubridate)
library(wordcloud)

Import data and EDA

tweets <- read_csv("auspol2019.csv")
tweets <- na.omit(tweets)
tweets <- tweets %>% 
  mutate(created_at = ymd_hms(created_at))
head(tweets)
## # A tibble: 6 x 11
##   created_at               id full_text retweet_count favorite_count
##   <dttm>                <dbl> <chr>             <dbl>          <dbl>
## 1 2019-05-20 09:13:44 1.13e18 After th…             0              0
## 2 2019-05-20 09:13:43 1.13e18 @narendr…             0              0
## 3 2019-05-20 09:13:33 1.13e18 @workman…             0              0
## 4 2019-05-20 09:13:23 1.13e18 Shares w…             0              0
## 5 2019-05-20 09:12:57 1.13e18 It is di…             0              0
## 6 2019-05-20 09:12:28 1.13e18 "@robyne…             0              0
## # … with 6 more variables: user_id <dbl>, user_name <chr>,
## #   user_screen_name <chr>, user_description <chr>, user_location <chr>,
## #   user_created_at <dttm>
par(mfrow=c(2,2))

tweets %>% 
  ggplot(aes(favorite_count)) +
  geom_histogram() +
  scale_x_log10() +
  labs(x = "Log scale of favourites")

tweets %>% 
  ggplot(aes(created_at)) +
  geom_histogram(bins = 20) +
  labs(title = "Tweet distribution over time", y = "", x = "")

corr <- cor(tweets$retweet_count, tweets$favorite_count)
corr_caption <- paste("Correlation =", corr)  

tweets %>% 
  ggplot(aes(retweet_count, favorite_count)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = F) +
  labs(caption = corr_caption, title = "How do favourites and retweets correlate?")

#lets recode the data slightly to get an idea of where people are tweeting from

tweets$user_location[tweets$user_location == "Sydney"] <- "Sydney, New South Wales"
tweets$user_location[tweets$user_location == "Sydney, Australia"] <- "Sydney, New South Wales"
tweets$user_location[tweets$user_location == "Melbourne, Australia"] <- "Melbourne, Victoria"
tweets$user_location[tweets$user_location == "Brisbane, Australia"] <- "Brisbane, Queensland"
tweets$user_location[tweets$user_location == "Brisbane"] <- "Brisbane, Queensland"
tweets$user_location[tweets$user_location == "Sydney Australia"] <- "Sydney, New South Wales"


tweets %>% 
  select(user_location) %>% 
  group_by(user_location) %>% 
  count(user_location, sort = T) %>% 
  ungroup() %>% 
  filter(n >= 1000) %>% 
  filter(!user_location == "Australia") %>% 
  mutate(user_location = reorder(user_location, n)) %>% 
  ggplot(aes(user_location, n)) +
  geom_col() +
  coord_flip() +
  labs(x = "User location", y = "Number of tweets", title = "Number of tweets by location")

Tokenise the tweets

To perform this text analysis lets tokenise the tweets (create a separate observation for each word of each tweet) and remove the stopwords (common words such as “it”, “as”, “at”, that might confuse our analysis).

remove_reg <- "&amp;|&lt;|&gt;"
tidy_tweets <- tweets %>% 
  filter(!str_detect(full_text, "^RT")) %>%
  mutate(full_text = str_remove_all(full_text, remove_reg)) %>%
  unnest_tokens(word, full_text, token = "tweets") %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, "[a-z]"))

tidy_tweets %>% 
  count(word, sort = T)
## # A tibble: 175,928 x 2
##    word              n
##    <chr>         <int>
##  1 #auspol       89365
##  2 #ausvotes     52120
##  3 election      25493
##  4 australia     19953
##  5 vote          15689
##  6 #ausvotes2019 13315
##  7 labor         12225
##  8 people         8233
##  9 #ausvotes19    7596
## 10 party          6618
## # … with 175,918 more rows

What about the spike on the 18th May we saw in our inital EDA? Did anything specific happen then?

tidy_tweets %>% 
  filter(created_at >= "2019-05-18 00 00 00" & created_at <= "2019-05-19 00 00 00") %>% 
  count(word, sort = T)
## # A tibble: 66,644 x 2
##    word                  n
##    <chr>             <int>
##  1 #ausvotes         24977
##  2 #auspol           24102
##  3 election           9845
##  4 australia          7955
##  5 vote               5834
##  6 labor              3958
##  7 #ausvotes2019      3898
##  8 #australiadecides  2749
##  9 people             2735
## 10 win                2445
## # … with 66,634 more rows

Seemingly the only new hashtag here is ‘#Australiadecides’, which might be the name of a televised debate. lets see if a clearer picture emerges of what happened on that day later (Tableau would be very useful software in this instance for pinpointing individual data points and drawing conclusions).

Lets visualise our findings (and remove the hashtaged tweets because we know they’re so prevalent)

set.seed(1234)
tidy_tweets %>% 
  anti_join(stop_words) %>% 
  filter(!str_detect(word, "^#")) %>%
  count(word) %>% 
  with(wordcloud(word, n , max = 75))
## Joining, by = "word"

Sentiment Analysis

First off, lets work out which words contributed the most to the sentiment of electoral discourse!

sentiment <- tidy_tweets %>% 
  count(word, sort = T) %>% 
  inner_join(get_sentiments("afinn"), by = "word") %>% 
  mutate(total_score = n * score)

sentiment %>% 
  top_n(20, abs(total_score)) %>% 
  mutate(word = reorder(word, total_score)) %>% 
  ggplot(aes(word, total_score, fill = total_score > 0)) +
  geom_col(show.legend = F) +
  coord_flip()

How did the sentiment change over time?

Lets group the tweets into bins of a day.

sentiment_over_time <- tidy_tweets %>% 
  select(word, created_at) %>% 
  inner_join(get_sentiments("afinn"), by = "word") %>% 
  mutate(created_at = floor_date(created_at, unit = "1 day")) %>% 
  add_count(created_at, word) %>%
  mutate(total_score = score * n) %>%
  group_by(created_at) 

sentiment_over_time
## # A tibble: 159,421 x 5
## # Groups:   created_at [11]
##    word          created_at          score     n total_score
##    <chr>         <dttm>              <int> <int>       <int>
##  1 resolute      2019-05-20 00:00:00     2     3           6
##  2 bother        2019-05-20 00:00:00    -2     2          -4
##  3 shares        2019-05-20 00:00:00     1    82          82
##  4 gains         2019-05-20 00:00:00     2    10          20
##  5 disappointing 2019-05-20 00:00:00    -2    15         -30
##  6 challenge     2019-05-20 00:00:00    -1    27         -27
##  7 bloody        2019-05-20 00:00:00    -3    17         -51
##  8 spark         2019-05-20 00:00:00     1    31          31
##  9 care          2019-05-20 00:00:00     2    53         106
## 10 blame         2019-05-20 00:00:00    -2    43         -86
## # … with 159,411 more rows
sentiment_over_time %>% 
  summarise(by_day = sum(total_score)) %>% 
  ggplot(aes(created_at, by_day, fill = by_day > 0)) +
  geom_col(show.legend = F) + 
  ylab("") +
  xlab("net sentiment") +
  scale_y_continuous(labels = scales::comma)

We see here that there was vastly more positive sentiment on the 18th than on any of the other days! However, did something exceptionally positive happen on that day or is it because (as we saw in our EDA) there were the most tweets on the 18th. A small change to our code reveals the truth:

sentiment_over_time %>% 
  summarise(by_day = sum(total_score)/ sum(n)) %>% 
  ggplot(aes(created_at, by_day, fill = by_day > 0)) +
  geom_col(show.legend = F)+ 
  xlab("Net sentiment") +
  ylab("Average sentiment of each word") +
  labs(title = "The average sentiment rating for each word across our data set")

By dividing our sentiment by the amount of words we see that the 18th is no longer the most positive day. The graph shows that the average word had a positive sentiment rating of +1, but the 19th was a more positive day. The first day of our dataset, the 10th, had a net negative average sentiment.

Sentiment Analysis by party

How is the sentiment different for different parties?

tweets$party <- NA #create a new column for party name

party_tweets <- tweets %>% 
  filter(str_detect(full_text, c("Labor")) | str_detect(full_text, c("Liberal"))) #filter the tweets so only ones relating to the party remain
party_tweets$party <- grepl("Labor", party_tweets$full_text) == T #detect whether tweet contains name or not 

party_tweets$party [party_tweets$party == "TRUE"] <- "Labor"
party_tweets$party [party_tweets$party == "FALSE"] <- "Liberal"

Now lets do some analysis comparind these two groups

party_tweets %>% 
  ggplot(aes(x = created_at, fill = party)) +
  geom_histogram() +
  facet_wrap(~ party, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It doesn’t look like, within our 10 days of Tweets, an event happened that sparked particular interest in ‘Labor’ or ‘Liberal’ as the tweets follow roughly the same pattern.

Lets compare different frequency of word usage between the two groups:

tidy_party_tweets <- party_tweets %>% 
  filter(!str_detect(full_text, "^RT")) %>%
  mutate(full_text = str_remove_all(full_text, remove_reg)) %>%
  unnest_tokens(word, full_text, token = "tweets") %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, "[a-z]"))

frequency <- tidy_party_tweets %>% 
  group_by(party) %>% 
  count(word, sort = TRUE) %>% 
  left_join(tidy_party_tweets %>% 
              group_by(party) %>% 
              summarise(total = n())) %>%
  mutate(freq = n/total) %>% 
  filter(n >= 20)

frequency <- frequency %>% 
  select(party, word, freq) %>% 
  spread(party, freq) %>% 
  arrange(Labor, Liberal)

frequency %>% 
  ggplot(aes(Liberal, Labor)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = scales::percent_format()) +
  scale_y_log10(labels = scales::percent_format()) +
  geom_abline(color = "red")

It doesn’t seem like there was that much difference in word usage! Tweets with the word liberal in them were more likely to use the word ‘God’, ‘Dirty’, ‘Lying’ and mention ‘@thenationals’, their future coalition partner. ‘Labor’ tweets were more likely to use ‘win’, ‘education’ and ‘taxes’.

Next, lets use the logs-odd ratio to consider which words were most frequently used by each side.

word_ratios <- tidy_party_tweets %>% 
  count(word, party) %>% 
  group_by(word) %>% 
  filter(sum(n) >= 10) %>% #remove words that have been used less than 10 times
  ungroup() %>% 
  spread(party, n, fill = 0) %>% 
  mutate_if(is.numeric, funs((. + 1) / (sum(.) + 1))) %>%
  mutate(logratio = log(Liberal / Labor)) %>%
  arrange(desc(logratio))

word_ratios %>% 
  arrange(abs(logratio))
## # A tibble: 4,117 x 4
##    word        Labor  Liberal   logratio
##    <chr>       <dbl>    <dbl>      <dbl>
##  1 social   0.000460 0.000460 -0.0000498
##  2 debate   0.000140 0.000139 -0.00117  
##  3 dickson  0.000140 0.000139 -0.00117  
##  4 friend   0.000140 0.000139 -0.00117  
##  5 guardian 0.000279 0.000279 -0.00117  
##  6 plenty   0.000140 0.000139 -0.00117  
##  7 stuck    0.000140 0.000139 -0.00117  
##  8 pretty   0.000419 0.000418 -0.00117  
##  9 false    0.000181 0.000181  0.00168  
## 10 putting  0.000362 0.000362  0.00168  
## # … with 4,107 more rows

These words are about equally likely to be featured in a liberal or labor tweet

What are unique to each group?

word_ratios %>% 
  group_by(logratio < 0) %>% 
  top_n(15, abs(logratio)) %>% 
  ungroup() %>% 
  mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("log odds ratio (Labor/Li)") +
  scale_fill_discrete(name = "", labels = c("Labor", "Liberal"))

Our data thus suggests that labor tweets often mention liberals but liberal tweets rarely mention labor (because @ austrlainlabor, labors, labor, vote1labor etc were unique to that group). #SmellslikeliberalDesperation also serves as a reminder that these tweets mentioning that particular party shouldnt be interpreted as an endorsement of that party.

Lets do some comparison of which words were more ‘successful’ for social media engagement.

summary_of_favs <- party_tweets %>% 
  group_by(party) %>% 
  summarise(total_fav = sum(favorite_count), num_of_tweets = sum(party == 'Labor', party == 'Liberal')) %>% 
  mutate(avg_favs = total_fav / num_of_tweets)


summary_of_rts <- party_tweets %>% 
  group_by(party) %>% 
  summarise(total_rt = sum(retweet_count), num_of_tweets = sum(party == 'Labor', party == 'Liberal')) %>% 
  mutate(avg_rts = total_rt / num_of_tweets)


par(mfrow=c(2,1))
summary_of_favs %>% 
  ggplot(aes(party, avg_favs, fill = party)) +
  geom_col() +
  labs(x = "Party", y = "Average number of favourites")

summary_of_rts %>% 
  ggplot(aes(party, avg_rts, fill = party)) +
  geom_col() +
  labs(x = "Party", y = "Average number of retweets")

It seems the Liberal party had a slightly stronger social media effort. Lets check if these differences are statiscally significant.

liberals <- party_tweets %>% 
  filter(party == "Liberal")

labor <- party_tweets %>% 
  filter(party == "Labor")

t.test(labor$retweet_count, liberals$retweet_count, paired = F)
## 
##  Welch Two Sample t-test
## 
## data:  labor$retweet_count and liberals$retweet_count
## t = -1, df = 10000, p-value = 0.3
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.83  0.87
## sample estimates:
## mean of x mean of y 
##       6.3       7.3

There’s actually no statically significant difference between the number of retweets.

As opposed to using the “AFINN”" lexicon, which gives words a score, lets use the binary “Bing” lexicon to see if there are more positive or negative words used. The aim here is to see if the some frequent, high scoring words are scewing the results.

tidy_sentiment <- tidy_party_tweets %>%
  count(word, party, sort = T) %>% 
  inner_join(get_sentiments("bing"), by = "word") 


tidy_sentiment %>% 
  group_by(sentiment) %>% 
  top_n(10) %>% 
  ungroup() %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(sentiment, n, fill = sentiment)) +
  geom_col(show.legend = F) +
  coord_flip()+
  facet_wrap(~ party) +
  labs(y = "Number of positive or negative words used", x = "Sentiment")
## Selecting by sentiment

Interestingly, it seems both parties use much more negative words than positive words. However, if we remember our previous analysis, words on average had a positive sentiment. Lets change the lexicon.

avg_sentiment_graph <- tidy_party_tweets %>%
  count(word, party, sort = T) %>% 
  inner_join(get_sentiments("afinn"), by = "word") %>% 
  mutate(total_score = score * n) %>% 
  group_by(party) %>% 
  summarise(sum_total_score = mean(total_score)) %>% 
  ggplot(aes(party, sum_total_score, fill = sum_total_score > 0)) +
  geom_col(show.legend = F) +
  labs(x = "Party", y = "Average sentiment of each party")
avg_sentiment_graph

So, comparing to our previous analysis (whereby we compared the sentiment by day) it would seem that tweets mentioning a political party tend to be more negative than tweets that don’t mention a political party. Finally, lets find out which words are the most influential.

tidy_sentiment %>% 
  select(word, sentiment, n) %>% 
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
## Selecting by n

It seems our results are being somewhat scewed by the high frequency of win and its high scoring (4 points).

How accutate is our analysis, though? If ‘win’ is preceded by ‘don’t’ or ‘wont’ perhaps it’s not as influential as we think.

N Grams and word correlation

Here i’m interested in finding out two things: firstly, how many words are actually positive words preceded by negative words and secondly which words are commonly associated together?

tweet_bigrams <- tweets %>% 
  unnest_tokens(bigram, full_text, token = "ngrams", n = 2) #re-tokenise our tweets with bigrams. 

custom_stop_words <- bind_rows(tibble(word = c("https", "t.co"), 
                                      lexicon = c("custom")),
                               stop_words) #add a few more stop_words that appear with links
tweet_bigrams %>% 
  count(bigram, sort = T)
## # A tibble: 999,520 x 2
##    bigram                  n
##    <chr>               <int>
##  1 https t.co          84829
##  2 auspol https        17238
##  3 auspol ausvotes     14813
##  4 ausvotes https      11648
##  5 of the               9366
##  6 in the               8825
##  7 for the              6579
##  8 auspol ausvotes2019  5932
##  9 ausvotes auspol      5889
## 10 to the               5467
## # … with 999,510 more rows
bigrams_separated <- tweet_bigrams %>% 
  separate(bigram, c("word_1", "word_2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word_1 %in% custom_stop_words$word) %>%
  filter(!word_2 %in% custom_stop_words$word) %>% 
  filter(!word_1 %in% c("auspol", "ausvotes")) %>% 
  filter(!word_2 == c("ausvotes2019")) #here I'm removing the hashtags (I think these were the criteria for the twitter api search)

bigrams_filtered %>% 
  count(word_1, word_2 , sort = T)
## # A tibble: 384,227 x 3
##    word_1       word_2       n
##    <chr>        <chr>    <int>
##  1 federal      election  3948
##  2 scott        morrison  3204
##  3 prime        minister  3009
##  4 climate      change    2838
##  5 bill         shorten   2634
##  6 ausvotes2019 auspol    2162
##  7 bob          hawke     2102
##  8 election     2019      1755
##  9 australia    election  1692
## 10 tony         abbott    1428
## # … with 384,217 more rows

It’s interesting to see climate change featuring so heavily on the agenda. Excluding names and political terminology like ‘federal election’ its the most discussed topic. lets see which other words its associated with.

tweet_trigrams <- tweets %>% 
  unnest_tokens(bigram, full_text, token = "ngrams", n = 3) 


trigrams_separated <- tweet_trigrams %>% 
  separate(bigram, c("word_1", "word_2", "word_3"), sep = " ")

trigrams_filtered <- trigrams_separated %>%
  filter(!word_1 %in% custom_stop_words$word) %>%
  filter(!word_2 %in% custom_stop_words$word) %>% 
  filter(!word_3 %in% custom_stop_words$word)

trigrams_filtered %>% 
  filter(word_1 == "climate" & word_2 == "change" | word_2 == "climate" & word_3 == "change") %>% 
  count(word_1, word_2, word_3, sort = T)
## # A tibble: 818 x 4
##    word_1      word_2  word_3       n
##    <chr>       <chr>   <chr>    <int>
##  1 climate     change  election   297
##  2 australia’s climate change     144
##  3 climate     change  policy      94
##  4 climate     change  auspol      93
##  5 climate     change  ausvotes    51
##  6 climate     change  deniers     47
##  7 climate     change  action      41
##  8 climate     change  denying     41
##  9 climate     change  policies    41
## 10 climate     change  denial      37
## # … with 808 more rows

Lets try and get a better idea of how many of these positive scoring words are actually negatives

negation_words <- c("not", "no", "never", "without", "wont", "dont")
AFINN <- get_sentiments("afinn")

bigrams_separated %>% 
  filter(word_1 %in% negation_words) %>% 
  count(word_1, word_2, sort = T)
## # A tibble: 5,071 x 3
##    word_1 word_2     n
##    <chr>  <chr>  <int>
##  1 not    a        866
##  2 not    the      643
##  3 not    to       502
##  4 no     one      491
##  5 not    be       469
##  6 never  been     314
##  7 not    just     305
##  8 not    sure     273
##  9 no     longer   256
## 10 not    going    251
## # … with 5,061 more rows
not_words <- bigrams_separated %>%
  filter(word_1 %in% negation_words) %>%
  inner_join(AFINN, by = c(word_2 = "word")) %>%
  count(word_2, score, sort = TRUE)

not_words
## # A tibble: 550 x 3
##    word_2 score     n
##    <chr>  <int> <int>
##  1 matter     1   240
##  2 no        -1   183
##  3 forget    -1   154
##  4 doubt     -1   125
##  5 good       3   119
##  6 want       1   103
##  7 like       2    89
##  8 vision     1    62
##  9 thanks     2    58
## 10 win        4    56
## # … with 540 more rows

the most common word that followed a negation was ‘matter’, which isn’t hugely helpful. “No no”, “dont no” etc also doesn’t reveal much. However, we know that 56 of our ‘wins’ were preceded by a negation. Lets visualise this:

not_words %>% 
  mutate(contribution = score * n) %>% 
  arrange(desc(abs(contribution))) %>% 
  top_n(25) %>% 
  mutate(word_2 = reorder(word_2, contribution)) %>% 
  ggplot(aes(word_2, contribution, fill = contribution > 0)) +
  geom_col(show.legend = F) +
  coord_flip() +
  labs(x = "Words preceded by a negative", y = "Sentiment score * number of occurances")
## Selecting by contribution

Interestingly, all of these words were positive words made negative. None of the top 25 words were negative words made positive by a negation. Overall, it seems the discussion was less positive than orignally seemed.

Now we have an idea of which words were preceded by a negative lets use this data to reassess our original conclusions.

not_word_contribution <- not_words %>% 
  mutate(contribution = 0 - score * n)
sum_not_words <- sum(not_word_contribution$contribution)

par(mfrow=c(2,1))

avg_sentiment_graph_w_neg <- tidy_party_tweets %>%
  count(word, party, sort = T) %>% 
  inner_join(get_sentiments("afinn"), by = "word") %>% 
  mutate(total_score = ((score * n) + not_word_contribution$contribution))%>% 
  group_by(party) %>% 
  summarise(sum_total_score = mean(total_score)) %>% 
  ggplot(aes(party, sum_total_score, fill = sum_total_score > 0)) +
  geom_col(show.legend = F) +
  labs(x = "Party", y = "Average sentiment of each Party")

avg_sentiment_graph_w_neg +
  labs(title = "Updated graph accounting for negation")

avg_sentiment_graph +
  labs(title = "Original graph")

In fact, when accounting for negative words, both parties were much more negative than we originally thought.

Lets consider using bing, the binary lexicon from before.

bing <- get_sentiments("bing")
not_words_bing <- bigrams_separated %>%
  filter(word_1 %in% negation_words) %>%
  inner_join(bing, by = c(word_2 = "word")) %>%
  count(word_2, sentiment, sort = TRUE) %>% 
  mutate(n = n * -1)


not_words_bing %>% 
  arrange(n)
## # A tibble: 708 x 3
##    word_2 sentiment     n
##    <chr>  <chr>     <dbl>
##  1 wonder positive   -128
##  2 doubt  negative   -125
##  3 good   positive   -119
##  4 enough positive   -111
##  5 like   positive    -89
##  6 win    positive    -56
##  7 happy  positive    -50
##  8 right  positive    -49
##  9 work   positive    -44
## 10 fear   negative    -41
## # … with 698 more rows

Whats the situation with ‘wonder’ here?

bigrams_separated %>% 
  filter(word_2 == "wonder") %>% 
  filter(word_1 %in% negation_words) %>%
  count(word_1, sort = T)
## # A tibble: 1 x 2
##   word_1     n
##   <chr>  <int>
## 1 no       128

Not that interesting, but worth remembering that different lexicons can give different results.

tidy_sentiment %>% 
  select(word, sentiment, n) %>% 
  mutate(n = n + not_words_bing$n) %>% 
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
## Warning in n + not_words_bing$n: longer object length is not a multiple of
## shorter object length
## Selecting by n

Slightly different, but not a massive change.

Lets try and visualise the association between the words.

hashtags <- c("ausvotes", "auspol", "ausvotes2019", "ausvotes19")
set.seed(2016)
bigram_counts <- bigrams_filtered %>% 
  count(word_1, word_2 , sort = T) %>% 
  filter(!word_1 %in% hashtags) %>% 
  filter(!word_2 %in% hashtags)
  


bigram_graph <- bigram_counts %>% 
  filter(n > 350) %>% 
  graph_from_data_frame()
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

This is simply the bigrams we are see together (filtered for n>350). Nothing particular interesting here.

Word association

Lets now consider which pairs are most likely to be found in the same tweet (but not necessarily next to each other) and also which words have the highest correlation - which words are likely to appear together relative to how often they appear separately.

tweet_words <- tweets %>% 
  mutate(row = row_number()) %>% 
  unnest_tokens(word, full_text) %>% 
  filter(!word %in% custom_stop_words$word) %>% 
  filter(!word %in% hashtags)
  
tweet_pairs <- tweet_words %>% 
  pairwise_count(word, row, sort = T)

tweet_pairs
## # A tibble: 11,519,894 x 3
##    item1     item2         n
##    <chr>     <chr>     <dbl>
##  1 australia election   7954
##  2 election  australia  7954
##  3 vote      australia  5341
##  4 australia vote       5341
##  5 federal   election   3844
##  6 election  federal    3844
##  7 2019      election   3347
##  8 election  2019       3347
##  9 scott     morrison   3198
## 10 morrison  scott      3198
## # … with 11,519,884 more rows

this has the most common pair found for each tweet, although they’re not necessarily next to each other.

lets use correlation - that considers how often words appear together relative to how often they appear separately.

tweet_cor <- tweet_words %>% 
  group_by(word) %>% 
  filter(n() > 200) %>% 
  pairwise_cor(word, row, sort = T)

tweet_cor  
## # A tibble: 1,472,582 x 3
##    item1        item2        correlation
##    <chr>        <chr>              <dbl>
##  1 streaming    channel7           0.905
##  2 channel7     streaming          0.905
##  3 corporation  broadcasting       0.902
##  4 broadcasting corporation        0.902
##  5 michaelusher riley7news         0.884
##  6 riley7news   michaelusher       0.884
##  7 minister     prime              0.857
##  8 prime        minister           0.857
##  9 credits      franking           0.834
## 10 franking     credits            0.834
## # … with 1,472,572 more rows
tweet_cor %>% 
  filter(correlation > .50) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

Most of these words being names doesn’t really help us much, a more thorough analysis might filter the names out.

Topic modelling

Finally, lets try and unsupervised learning approach to gleaning some new insight into the topic.

tweets_dtm <- tidy_tweets %>% 
  filter(!str_detect(word, "^#")) %>%
  filter(!word %in% hashtags) %>% 
  count(id, word) %>% 
  cast_dtm(id, word, n)

tweets_lda <- LDA(tweets_dtm, k = 2, control = list(seed = 1234))

Lets extract the per topic word probabilities (beta) for the different topics

tweet_topic_terms <- tweets_lda %>% 
  tidy(matrix = "beta") %>% 
  group_by(topic) %>% 
  top_n(25, beta) %>% 
  ungroup() %>% 
  arrange(topic, -beta) 
  
tweet_topic_terms %>% 
  mutate(term = reorder(term, beta)) %>% 
  ggplot(aes(x= term, y = beta, fill = topic)) +
  geom_col(show.legend = F) +
  facet_wrap(~ topic, scales = "free_y") +
  coord_flip()

It’s probably more interesting to consider which terms had the greatest difference in log beta

beta_spread <- tweets_lda %>%
  tidy(matrix = "beta") %>% 
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .001 | topic2 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1))

beta_spread %>% 
  mutate(term = reorder(term, log_ratio)) %>% 
  top_n(20, abs(log_ratio)) %>% 
  ggplot(aes(term, log_ratio, fill = log_ratio > 0)) +
  geom_col(show.legend = F) +
  coord_flip() +
  labs(y =  "Words with the greatest difference in β between topic 2 and topic 1", x = "Terms")

The results here seem more useful. When splitting into two topics we see more loaded words being used by one group - god, environment, left and coal. However, it doesn’t look like we have split into topics of parties but instead topics explicitly mentioning politicians and ‘themes’ - Australia, God, Coal, left’. lets find out which words these were normally being accompanied by:

bigrams_filtered %>% 
  filter(word_1 %in% c("coal", "god", "environment") | word_2 %in% c("coal", "god", "environment")) %>% 
  count(word_1, word_2, sort = T) %>% 
  top_n(15, n)
## # A tibble: 17 x 3
##    word_1          word_2             n
##    <chr>           <chr>          <int>
##  1 god             bless            125
##  2 environment     minister         119
##  3 coal            mine             110
##  4 coal            mines             57
##  5 environment     auspol            53
##  6 adani           coal              49
##  7 environment     sustainability    44
##  8 financialcrisis environment       44
##  9 coal            industry          42
## 10 coal            mining            42
## 11 coal            auspol            34
## 12 coal            fired             34
## 13 invisible       environment       28
## 14 coal            loving            19
## 15 coal            miners            19
## 16 coal            power             19
## 17 god             auspol            19

Lets have a look at the gamma statistic - the probability that each document belongs to a each tocpic

lda_gamma <- tidy(tweets_lda, matrix = "gamma")

lda_gamma %>% 
  arrange(desc(gamma))
## # A tibble: 280,342 x 3
##    document            topic gamma
##    <chr>               <int> <dbl>
##  1 1129709983401046016     2 0.758
##  2 1129445844271239168     2 0.656
##  3 1129866995271438336     2 0.654
##  4 1129146318960463872     1 0.650
##  5 1127805923559231488     2 0.642
##  6 1127013486670729216     2 0.628
##  7 1127906054619127808     2 0.618
##  8 1127824090084007936     2 0.612
##  9 1130319010279464960     2 0.607
## 10 1129635356746211328     1 0.606
## # … with 280,332 more rows

Doesn’t really look like there is a clear delineation between topics:

ggplot(lda_gamma, aes(gamma)) +
  geom_histogram() +
  scale_y_log10() +
  labs(title = "Distribution of probabilities for all topics",
       y = "Number of documents", x = expression(gamma))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 9 rows containing missing values (geom_bar).

the distribution of gamma looks quite worrying here - there is effectively a 50 50 chance a tweet could be assigned to each topic.

At this point you can run through different numbers of topics and find the graph with the highest number of gamma = 0 and gamma = 1, which would suggest a high number of documents either fit into a topic or they don’t. Given that our tweets are all about one topic - politics - the distribution of gamma doesn’t really improve as we increase k.