As part of this text analysis project I have taken a data set (https://www.kaggle.com/taniaj/australian-election-2019-tweets) of tweets about the Australian election. My initial analysis led me to complete a sentiment analysis (both across the whole dataset and broken down by party). Secondly, I tried to calculate which themes were most prevalent across the dataset. In order to do this I thematically explored the data by simply looking at common words, which topics and themes appeared frequently together (word pairs, correlation and bigrams) and finally used topic modelling to bring out different topics within the dataset.
First off, lets perform some Exploratory Data Analysis.
library(tidyverse)
library(tidyr)
library(purrr)
library(tidytext)
library(widyr)
library(ggraph)
library(igraph)
library(topicmodels)
library(lubridate)
library(wordcloud)
tweets <- read_csv("auspol2019.csv")
tweets <- na.omit(tweets)
tweets <- tweets %>%
mutate(created_at = ymd_hms(created_at))
head(tweets)
## # A tibble: 6 x 11
## created_at id full_text retweet_count favorite_count
## <dttm> <dbl> <chr> <dbl> <dbl>
## 1 2019-05-20 09:13:44 1.13e18 After th… 0 0
## 2 2019-05-20 09:13:43 1.13e18 @narendr… 0 0
## 3 2019-05-20 09:13:33 1.13e18 @workman… 0 0
## 4 2019-05-20 09:13:23 1.13e18 Shares w… 0 0
## 5 2019-05-20 09:12:57 1.13e18 It is di… 0 0
## 6 2019-05-20 09:12:28 1.13e18 "@robyne… 0 0
## # … with 6 more variables: user_id <dbl>, user_name <chr>,
## # user_screen_name <chr>, user_description <chr>, user_location <chr>,
## # user_created_at <dttm>
par(mfrow=c(2,2))
tweets %>%
ggplot(aes(favorite_count)) +
geom_histogram() +
scale_x_log10() +
labs(x = "Log scale of favourites")
tweets %>%
ggplot(aes(created_at)) +
geom_histogram(bins = 20) +
labs(title = "Tweet distribution over time", y = "", x = "")
corr <- cor(tweets$retweet_count, tweets$favorite_count)
corr_caption <- paste("Correlation =", corr)
tweets %>%
ggplot(aes(retweet_count, favorite_count)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", se = F) +
labs(caption = corr_caption, title = "How do favourites and retweets correlate?")
#lets recode the data slightly to get an idea of where people are tweeting from
tweets$user_location[tweets$user_location == "Sydney"] <- "Sydney, New South Wales"
tweets$user_location[tweets$user_location == "Sydney, Australia"] <- "Sydney, New South Wales"
tweets$user_location[tweets$user_location == "Melbourne, Australia"] <- "Melbourne, Victoria"
tweets$user_location[tweets$user_location == "Brisbane, Australia"] <- "Brisbane, Queensland"
tweets$user_location[tweets$user_location == "Brisbane"] <- "Brisbane, Queensland"
tweets$user_location[tweets$user_location == "Sydney Australia"] <- "Sydney, New South Wales"
tweets %>%
select(user_location) %>%
group_by(user_location) %>%
count(user_location, sort = T) %>%
ungroup() %>%
filter(n >= 1000) %>%
filter(!user_location == "Australia") %>%
mutate(user_location = reorder(user_location, n)) %>%
ggplot(aes(user_location, n)) +
geom_col() +
coord_flip() +
labs(x = "User location", y = "Number of tweets", title = "Number of tweets by location")
To perform this text analysis lets tokenise the tweets (create a separate observation for each word of each tweet) and remove the stopwords (common words such as “it”, “as”, “at”, that might confuse our analysis).
remove_reg <- "&|<|>"
tidy_tweets <- tweets %>%
filter(!str_detect(full_text, "^RT")) %>%
mutate(full_text = str_remove_all(full_text, remove_reg)) %>%
unnest_tokens(word, full_text, token = "tweets") %>%
filter(!word %in% stop_words$word,
!word %in% str_remove_all(stop_words$word, "'"),
str_detect(word, "[a-z]"))
tidy_tweets %>%
count(word, sort = T)
## # A tibble: 175,928 x 2
## word n
## <chr> <int>
## 1 #auspol 89365
## 2 #ausvotes 52120
## 3 election 25493
## 4 australia 19953
## 5 vote 15689
## 6 #ausvotes2019 13315
## 7 labor 12225
## 8 people 8233
## 9 #ausvotes19 7596
## 10 party 6618
## # … with 175,918 more rows
What about the spike on the 18th May we saw in our inital EDA? Did anything specific happen then?
tidy_tweets %>%
filter(created_at >= "2019-05-18 00 00 00" & created_at <= "2019-05-19 00 00 00") %>%
count(word, sort = T)
## # A tibble: 66,644 x 2
## word n
## <chr> <int>
## 1 #ausvotes 24977
## 2 #auspol 24102
## 3 election 9845
## 4 australia 7955
## 5 vote 5834
## 6 labor 3958
## 7 #ausvotes2019 3898
## 8 #australiadecides 2749
## 9 people 2735
## 10 win 2445
## # … with 66,634 more rows
Seemingly the only new hashtag here is ‘#Australiadecides’, which might be the name of a televised debate. lets see if a clearer picture emerges of what happened on that day later (Tableau would be very useful software in this instance for pinpointing individual data points and drawing conclusions).
Lets visualise our findings (and remove the hashtaged tweets because we know they’re so prevalent)
set.seed(1234)
tidy_tweets %>%
anti_join(stop_words) %>%
filter(!str_detect(word, "^#")) %>%
count(word) %>%
with(wordcloud(word, n , max = 75))
## Joining, by = "word"
First off, lets work out which words contributed the most to the sentiment of electoral discourse!
sentiment <- tidy_tweets %>%
count(word, sort = T) %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
mutate(total_score = n * score)
sentiment %>%
top_n(20, abs(total_score)) %>%
mutate(word = reorder(word, total_score)) %>%
ggplot(aes(word, total_score, fill = total_score > 0)) +
geom_col(show.legend = F) +
coord_flip()
How did the sentiment change over time?
Lets group the tweets into bins of a day.
sentiment_over_time <- tidy_tweets %>%
select(word, created_at) %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
mutate(created_at = floor_date(created_at, unit = "1 day")) %>%
add_count(created_at, word) %>%
mutate(total_score = score * n) %>%
group_by(created_at)
sentiment_over_time
## # A tibble: 159,421 x 5
## # Groups: created_at [11]
## word created_at score n total_score
## <chr> <dttm> <int> <int> <int>
## 1 resolute 2019-05-20 00:00:00 2 3 6
## 2 bother 2019-05-20 00:00:00 -2 2 -4
## 3 shares 2019-05-20 00:00:00 1 82 82
## 4 gains 2019-05-20 00:00:00 2 10 20
## 5 disappointing 2019-05-20 00:00:00 -2 15 -30
## 6 challenge 2019-05-20 00:00:00 -1 27 -27
## 7 bloody 2019-05-20 00:00:00 -3 17 -51
## 8 spark 2019-05-20 00:00:00 1 31 31
## 9 care 2019-05-20 00:00:00 2 53 106
## 10 blame 2019-05-20 00:00:00 -2 43 -86
## # … with 159,411 more rows
sentiment_over_time %>%
summarise(by_day = sum(total_score)) %>%
ggplot(aes(created_at, by_day, fill = by_day > 0)) +
geom_col(show.legend = F) +
ylab("") +
xlab("net sentiment") +
scale_y_continuous(labels = scales::comma)
We see here that there was vastly more positive sentiment on the 18th than on any of the other days! However, did something exceptionally positive happen on that day or is it because (as we saw in our EDA) there were the most tweets on the 18th. A small change to our code reveals the truth:
sentiment_over_time %>%
summarise(by_day = sum(total_score)/ sum(n)) %>%
ggplot(aes(created_at, by_day, fill = by_day > 0)) +
geom_col(show.legend = F)+
xlab("Net sentiment") +
ylab("Average sentiment of each word") +
labs(title = "The average sentiment rating for each word across our data set")
By dividing our sentiment by the amount of words we see that the 18th is no longer the most positive day. The graph shows that the average word had a positive sentiment rating of +1, but the 19th was a more positive day. The first day of our dataset, the 10th, had a net negative average sentiment.
How is the sentiment different for different parties?
tweets$party <- NA #create a new column for party name
party_tweets <- tweets %>%
filter(str_detect(full_text, c("Labor")) | str_detect(full_text, c("Liberal"))) #filter the tweets so only ones relating to the party remain
party_tweets$party <- grepl("Labor", party_tweets$full_text) == T #detect whether tweet contains name or not
party_tweets$party [party_tweets$party == "TRUE"] <- "Labor"
party_tweets$party [party_tweets$party == "FALSE"] <- "Liberal"
Now lets do some analysis comparind these two groups
party_tweets %>%
ggplot(aes(x = created_at, fill = party)) +
geom_histogram() +
facet_wrap(~ party, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It doesn’t look like, within our 10 days of Tweets, an event happened that sparked particular interest in ‘Labor’ or ‘Liberal’ as the tweets follow roughly the same pattern.
Lets compare different frequency of word usage between the two groups:
tidy_party_tweets <- party_tweets %>%
filter(!str_detect(full_text, "^RT")) %>%
mutate(full_text = str_remove_all(full_text, remove_reg)) %>%
unnest_tokens(word, full_text, token = "tweets") %>%
filter(!word %in% stop_words$word,
!word %in% str_remove_all(stop_words$word, "'"),
str_detect(word, "[a-z]"))
frequency <- tidy_party_tweets %>%
group_by(party) %>%
count(word, sort = TRUE) %>%
left_join(tidy_party_tweets %>%
group_by(party) %>%
summarise(total = n())) %>%
mutate(freq = n/total) %>%
filter(n >= 20)
frequency <- frequency %>%
select(party, word, freq) %>%
spread(party, freq) %>%
arrange(Labor, Liberal)
frequency %>%
ggplot(aes(Liberal, Labor)) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = scales::percent_format()) +
scale_y_log10(labels = scales::percent_format()) +
geom_abline(color = "red")
It doesn’t seem like there was that much difference in word usage! Tweets with the word liberal in them were more likely to use the word ‘God’, ‘Dirty’, ‘Lying’ and mention ‘@thenationals’, their future coalition partner. ‘Labor’ tweets were more likely to use ‘win’, ‘education’ and ‘taxes’.
Next, lets use the logs-odd ratio to consider which words were most frequently used by each side.
word_ratios <- tidy_party_tweets %>%
count(word, party) %>%
group_by(word) %>%
filter(sum(n) >= 10) %>% #remove words that have been used less than 10 times
ungroup() %>%
spread(party, n, fill = 0) %>%
mutate_if(is.numeric, funs((. + 1) / (sum(.) + 1))) %>%
mutate(logratio = log(Liberal / Labor)) %>%
arrange(desc(logratio))
word_ratios %>%
arrange(abs(logratio))
## # A tibble: 4,117 x 4
## word Labor Liberal logratio
## <chr> <dbl> <dbl> <dbl>
## 1 social 0.000460 0.000460 -0.0000498
## 2 debate 0.000140 0.000139 -0.00117
## 3 dickson 0.000140 0.000139 -0.00117
## 4 friend 0.000140 0.000139 -0.00117
## 5 guardian 0.000279 0.000279 -0.00117
## 6 plenty 0.000140 0.000139 -0.00117
## 7 stuck 0.000140 0.000139 -0.00117
## 8 pretty 0.000419 0.000418 -0.00117
## 9 false 0.000181 0.000181 0.00168
## 10 putting 0.000362 0.000362 0.00168
## # … with 4,107 more rows
These words are about equally likely to be featured in a liberal or labor tweet
What are unique to each group?
word_ratios %>%
group_by(logratio < 0) %>%
top_n(15, abs(logratio)) %>%
ungroup() %>%
mutate(word = reorder(word, logratio)) %>%
ggplot(aes(word, logratio, fill = logratio < 0)) +
geom_col(show.legend = FALSE) +
coord_flip() +
ylab("log odds ratio (Labor/Li)") +
scale_fill_discrete(name = "", labels = c("Labor", "Liberal"))
Our data thus suggests that labor tweets often mention liberals but liberal tweets rarely mention labor (because @ austrlainlabor, labors, labor, vote1labor etc were unique to that group). #SmellslikeliberalDesperation also serves as a reminder that these tweets mentioning that particular party shouldnt be interpreted as an endorsement of that party.
Lets do some comparison of which words were more ‘successful’ for social media engagement.
summary_of_favs <- party_tweets %>%
group_by(party) %>%
summarise(total_fav = sum(favorite_count), num_of_tweets = sum(party == 'Labor', party == 'Liberal')) %>%
mutate(avg_favs = total_fav / num_of_tweets)
summary_of_rts <- party_tweets %>%
group_by(party) %>%
summarise(total_rt = sum(retweet_count), num_of_tweets = sum(party == 'Labor', party == 'Liberal')) %>%
mutate(avg_rts = total_rt / num_of_tweets)
par(mfrow=c(2,1))
summary_of_favs %>%
ggplot(aes(party, avg_favs, fill = party)) +
geom_col() +
labs(x = "Party", y = "Average number of favourites")
summary_of_rts %>%
ggplot(aes(party, avg_rts, fill = party)) +
geom_col() +
labs(x = "Party", y = "Average number of retweets")
It seems the Liberal party had a slightly stronger social media effort. Lets check if these differences are statiscally significant.
liberals <- party_tweets %>%
filter(party == "Liberal")
labor <- party_tweets %>%
filter(party == "Labor")
t.test(labor$retweet_count, liberals$retweet_count, paired = F)
##
## Welch Two Sample t-test
##
## data: labor$retweet_count and liberals$retweet_count
## t = -1, df = 10000, p-value = 0.3
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.83 0.87
## sample estimates:
## mean of x mean of y
## 6.3 7.3
There’s actually no statically significant difference between the number of retweets.
As opposed to using the “AFINN”" lexicon, which gives words a score, lets use the binary “Bing” lexicon to see if there are more positive or negative words used. The aim here is to see if the some frequent, high scoring words are scewing the results.
tidy_sentiment <- tidy_party_tweets %>%
count(word, party, sort = T) %>%
inner_join(get_sentiments("bing"), by = "word")
tidy_sentiment %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(sentiment, n, fill = sentiment)) +
geom_col(show.legend = F) +
coord_flip()+
facet_wrap(~ party) +
labs(y = "Number of positive or negative words used", x = "Sentiment")
## Selecting by sentiment
Interestingly, it seems both parties use much more negative words than positive words. However, if we remember our previous analysis, words on average had a positive sentiment. Lets change the lexicon.
avg_sentiment_graph <- tidy_party_tweets %>%
count(word, party, sort = T) %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
mutate(total_score = score * n) %>%
group_by(party) %>%
summarise(sum_total_score = mean(total_score)) %>%
ggplot(aes(party, sum_total_score, fill = sum_total_score > 0)) +
geom_col(show.legend = F) +
labs(x = "Party", y = "Average sentiment of each party")
avg_sentiment_graph
So, comparing to our previous analysis (whereby we compared the sentiment by day) it would seem that tweets mentioning a political party tend to be more negative than tweets that don’t mention a political party. Finally, lets find out which words are the most influential.
tidy_sentiment %>%
select(word, sentiment, n) %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
## Selecting by n
It seems our results are being somewhat scewed by the high frequency of win and its high scoring (4 points).
How accutate is our analysis, though? If ‘win’ is preceded by ‘don’t’ or ‘wont’ perhaps it’s not as influential as we think.
Here i’m interested in finding out two things: firstly, how many words are actually positive words preceded by negative words and secondly which words are commonly associated together?
tweet_bigrams <- tweets %>%
unnest_tokens(bigram, full_text, token = "ngrams", n = 2) #re-tokenise our tweets with bigrams.
custom_stop_words <- bind_rows(tibble(word = c("https", "t.co"),
lexicon = c("custom")),
stop_words) #add a few more stop_words that appear with links
tweet_bigrams %>%
count(bigram, sort = T)
## # A tibble: 999,520 x 2
## bigram n
## <chr> <int>
## 1 https t.co 84829
## 2 auspol https 17238
## 3 auspol ausvotes 14813
## 4 ausvotes https 11648
## 5 of the 9366
## 6 in the 8825
## 7 for the 6579
## 8 auspol ausvotes2019 5932
## 9 ausvotes auspol 5889
## 10 to the 5467
## # … with 999,510 more rows
bigrams_separated <- tweet_bigrams %>%
separate(bigram, c("word_1", "word_2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word_1 %in% custom_stop_words$word) %>%
filter(!word_2 %in% custom_stop_words$word) %>%
filter(!word_1 %in% c("auspol", "ausvotes")) %>%
filter(!word_2 == c("ausvotes2019")) #here I'm removing the hashtags (I think these were the criteria for the twitter api search)
bigrams_filtered %>%
count(word_1, word_2 , sort = T)
## # A tibble: 384,227 x 3
## word_1 word_2 n
## <chr> <chr> <int>
## 1 federal election 3948
## 2 scott morrison 3204
## 3 prime minister 3009
## 4 climate change 2838
## 5 bill shorten 2634
## 6 ausvotes2019 auspol 2162
## 7 bob hawke 2102
## 8 election 2019 1755
## 9 australia election 1692
## 10 tony abbott 1428
## # … with 384,217 more rows
It’s interesting to see climate change featuring so heavily on the agenda. Excluding names and political terminology like ‘federal election’ its the most discussed topic. lets see which other words its associated with.
tweet_trigrams <- tweets %>%
unnest_tokens(bigram, full_text, token = "ngrams", n = 3)
trigrams_separated <- tweet_trigrams %>%
separate(bigram, c("word_1", "word_2", "word_3"), sep = " ")
trigrams_filtered <- trigrams_separated %>%
filter(!word_1 %in% custom_stop_words$word) %>%
filter(!word_2 %in% custom_stop_words$word) %>%
filter(!word_3 %in% custom_stop_words$word)
trigrams_filtered %>%
filter(word_1 == "climate" & word_2 == "change" | word_2 == "climate" & word_3 == "change") %>%
count(word_1, word_2, word_3, sort = T)
## # A tibble: 818 x 4
## word_1 word_2 word_3 n
## <chr> <chr> <chr> <int>
## 1 climate change election 297
## 2 australia’s climate change 144
## 3 climate change policy 94
## 4 climate change auspol 93
## 5 climate change ausvotes 51
## 6 climate change deniers 47
## 7 climate change action 41
## 8 climate change denying 41
## 9 climate change policies 41
## 10 climate change denial 37
## # … with 808 more rows
Lets try and get a better idea of how many of these positive scoring words are actually negatives
negation_words <- c("not", "no", "never", "without", "wont", "dont")
AFINN <- get_sentiments("afinn")
bigrams_separated %>%
filter(word_1 %in% negation_words) %>%
count(word_1, word_2, sort = T)
## # A tibble: 5,071 x 3
## word_1 word_2 n
## <chr> <chr> <int>
## 1 not a 866
## 2 not the 643
## 3 not to 502
## 4 no one 491
## 5 not be 469
## 6 never been 314
## 7 not just 305
## 8 not sure 273
## 9 no longer 256
## 10 not going 251
## # … with 5,061 more rows
not_words <- bigrams_separated %>%
filter(word_1 %in% negation_words) %>%
inner_join(AFINN, by = c(word_2 = "word")) %>%
count(word_2, score, sort = TRUE)
not_words
## # A tibble: 550 x 3
## word_2 score n
## <chr> <int> <int>
## 1 matter 1 240
## 2 no -1 183
## 3 forget -1 154
## 4 doubt -1 125
## 5 good 3 119
## 6 want 1 103
## 7 like 2 89
## 8 vision 1 62
## 9 thanks 2 58
## 10 win 4 56
## # … with 540 more rows
the most common word that followed a negation was ‘matter’, which isn’t hugely helpful. “No no”, “dont no” etc also doesn’t reveal much. However, we know that 56 of our ‘wins’ were preceded by a negation. Lets visualise this:
not_words %>%
mutate(contribution = score * n) %>%
arrange(desc(abs(contribution))) %>%
top_n(25) %>%
mutate(word_2 = reorder(word_2, contribution)) %>%
ggplot(aes(word_2, contribution, fill = contribution > 0)) +
geom_col(show.legend = F) +
coord_flip() +
labs(x = "Words preceded by a negative", y = "Sentiment score * number of occurances")
## Selecting by contribution
Interestingly, all of these words were positive words made negative. None of the top 25 words were negative words made positive by a negation. Overall, it seems the discussion was less positive than orignally seemed.
Now we have an idea of which words were preceded by a negative lets use this data to reassess our original conclusions.
not_word_contribution <- not_words %>%
mutate(contribution = 0 - score * n)
sum_not_words <- sum(not_word_contribution$contribution)
par(mfrow=c(2,1))
avg_sentiment_graph_w_neg <- tidy_party_tweets %>%
count(word, party, sort = T) %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
mutate(total_score = ((score * n) + not_word_contribution$contribution))%>%
group_by(party) %>%
summarise(sum_total_score = mean(total_score)) %>%
ggplot(aes(party, sum_total_score, fill = sum_total_score > 0)) +
geom_col(show.legend = F) +
labs(x = "Party", y = "Average sentiment of each Party")
avg_sentiment_graph_w_neg +
labs(title = "Updated graph accounting for negation")
avg_sentiment_graph +
labs(title = "Original graph")
In fact, when accounting for negative words, both parties were much more negative than we originally thought.
Lets consider using bing, the binary lexicon from before.
bing <- get_sentiments("bing")
not_words_bing <- bigrams_separated %>%
filter(word_1 %in% negation_words) %>%
inner_join(bing, by = c(word_2 = "word")) %>%
count(word_2, sentiment, sort = TRUE) %>%
mutate(n = n * -1)
not_words_bing %>%
arrange(n)
## # A tibble: 708 x 3
## word_2 sentiment n
## <chr> <chr> <dbl>
## 1 wonder positive -128
## 2 doubt negative -125
## 3 good positive -119
## 4 enough positive -111
## 5 like positive -89
## 6 win positive -56
## 7 happy positive -50
## 8 right positive -49
## 9 work positive -44
## 10 fear negative -41
## # … with 698 more rows
Whats the situation with ‘wonder’ here?
bigrams_separated %>%
filter(word_2 == "wonder") %>%
filter(word_1 %in% negation_words) %>%
count(word_1, sort = T)
## # A tibble: 1 x 2
## word_1 n
## <chr> <int>
## 1 no 128
Not that interesting, but worth remembering that different lexicons can give different results.
tidy_sentiment %>%
select(word, sentiment, n) %>%
mutate(n = n + not_words_bing$n) %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
## Warning in n + not_words_bing$n: longer object length is not a multiple of
## shorter object length
## Selecting by n
Slightly different, but not a massive change.
Lets try and visualise the association between the words.
hashtags <- c("ausvotes", "auspol", "ausvotes2019", "ausvotes19")
set.seed(2016)
bigram_counts <- bigrams_filtered %>%
count(word_1, word_2 , sort = T) %>%
filter(!word_1 %in% hashtags) %>%
filter(!word_2 %in% hashtags)
bigram_graph <- bigram_counts %>%
filter(n > 350) %>%
graph_from_data_frame()
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
This is simply the bigrams we are see together (filtered for n>350). Nothing particular interesting here.
Lets now consider which pairs are most likely to be found in the same tweet (but not necessarily next to each other) and also which words have the highest correlation - which words are likely to appear together relative to how often they appear separately.
tweet_words <- tweets %>%
mutate(row = row_number()) %>%
unnest_tokens(word, full_text) %>%
filter(!word %in% custom_stop_words$word) %>%
filter(!word %in% hashtags)
tweet_pairs <- tweet_words %>%
pairwise_count(word, row, sort = T)
tweet_pairs
## # A tibble: 11,519,894 x 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 australia election 7954
## 2 election australia 7954
## 3 vote australia 5341
## 4 australia vote 5341
## 5 federal election 3844
## 6 election federal 3844
## 7 2019 election 3347
## 8 election 2019 3347
## 9 scott morrison 3198
## 10 morrison scott 3198
## # … with 11,519,884 more rows
this has the most common pair found for each tweet, although they’re not necessarily next to each other.
lets use correlation - that considers how often words appear together relative to how often they appear separately.
tweet_cor <- tweet_words %>%
group_by(word) %>%
filter(n() > 200) %>%
pairwise_cor(word, row, sort = T)
tweet_cor
## # A tibble: 1,472,582 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 streaming channel7 0.905
## 2 channel7 streaming 0.905
## 3 corporation broadcasting 0.902
## 4 broadcasting corporation 0.902
## 5 michaelusher riley7news 0.884
## 6 riley7news michaelusher 0.884
## 7 minister prime 0.857
## 8 prime minister 0.857
## 9 credits franking 0.834
## 10 franking credits 0.834
## # … with 1,472,572 more rows
tweet_cor %>%
filter(correlation > .50) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void()
Most of these words being names doesn’t really help us much, a more thorough analysis might filter the names out.
Finally, lets try and unsupervised learning approach to gleaning some new insight into the topic.
tweets_dtm <- tidy_tweets %>%
filter(!str_detect(word, "^#")) %>%
filter(!word %in% hashtags) %>%
count(id, word) %>%
cast_dtm(id, word, n)
tweets_lda <- LDA(tweets_dtm, k = 2, control = list(seed = 1234))
Lets extract the per topic word probabilities (beta) for the different topics
tweet_topic_terms <- tweets_lda %>%
tidy(matrix = "beta") %>%
group_by(topic) %>%
top_n(25, beta) %>%
ungroup() %>%
arrange(topic, -beta)
tweet_topic_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(x= term, y = beta, fill = topic)) +
geom_col(show.legend = F) +
facet_wrap(~ topic, scales = "free_y") +
coord_flip()
It’s probably more interesting to consider which terms had the greatest difference in log beta
beta_spread <- tweets_lda %>%
tidy(matrix = "beta") %>%
mutate(topic = paste0("topic", topic)) %>%
spread(topic, beta) %>%
filter(topic1 > .001 | topic2 > .001) %>%
mutate(log_ratio = log2(topic2 / topic1))
beta_spread %>%
mutate(term = reorder(term, log_ratio)) %>%
top_n(20, abs(log_ratio)) %>%
ggplot(aes(term, log_ratio, fill = log_ratio > 0)) +
geom_col(show.legend = F) +
coord_flip() +
labs(y = "Words with the greatest difference in β between topic 2 and topic 1", x = "Terms")
The results here seem more useful. When splitting into two topics we see more loaded words being used by one group - god, environment, left and coal. However, it doesn’t look like we have split into topics of parties but instead topics explicitly mentioning politicians and ‘themes’ - Australia, God, Coal, left’. lets find out which words these were normally being accompanied by:
bigrams_filtered %>%
filter(word_1 %in% c("coal", "god", "environment") | word_2 %in% c("coal", "god", "environment")) %>%
count(word_1, word_2, sort = T) %>%
top_n(15, n)
## # A tibble: 17 x 3
## word_1 word_2 n
## <chr> <chr> <int>
## 1 god bless 125
## 2 environment minister 119
## 3 coal mine 110
## 4 coal mines 57
## 5 environment auspol 53
## 6 adani coal 49
## 7 environment sustainability 44
## 8 financialcrisis environment 44
## 9 coal industry 42
## 10 coal mining 42
## 11 coal auspol 34
## 12 coal fired 34
## 13 invisible environment 28
## 14 coal loving 19
## 15 coal miners 19
## 16 coal power 19
## 17 god auspol 19
Lets have a look at the gamma statistic - the probability that each document belongs to a each tocpic
lda_gamma <- tidy(tweets_lda, matrix = "gamma")
lda_gamma %>%
arrange(desc(gamma))
## # A tibble: 280,342 x 3
## document topic gamma
## <chr> <int> <dbl>
## 1 1129709983401046016 2 0.758
## 2 1129445844271239168 2 0.656
## 3 1129866995271438336 2 0.654
## 4 1129146318960463872 1 0.650
## 5 1127805923559231488 2 0.642
## 6 1127013486670729216 2 0.628
## 7 1127906054619127808 2 0.618
## 8 1127824090084007936 2 0.612
## 9 1130319010279464960 2 0.607
## 10 1129635356746211328 1 0.606
## # … with 280,332 more rows
Doesn’t really look like there is a clear delineation between topics:
ggplot(lda_gamma, aes(gamma)) +
geom_histogram() +
scale_y_log10() +
labs(title = "Distribution of probabilities for all topics",
y = "Number of documents", x = expression(gamma))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 9 rows containing missing values (geom_bar).
the distribution of gamma looks quite worrying here - there is effectively a 50 50 chance a tweet could be assigned to each topic.
At this point you can run through different numbers of topics and find the graph with the highest number of gamma = 0 and gamma = 1, which would suggest a high number of documents either fit into a topic or they don’t. Given that our tweets are all about one topic - politics - the distribution of gamma doesn’t really improve as we increase k.