library(twitteR)
library(tidyverse)
library(tidytext)
library(igraph)
library(ggraph)
data("stop_words")
The consumer and access keys/secrets are hidden from RPubs. Acquire your own from the Twitter developer site.
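One way to keep these values out of a published script is to read them from environment variables. The following is a minimal sketch in base R, not part of the original analysis: the helper names `read_secret()` and `load_twitter_credentials()` and the four `TWITTER_*` variable names are hypothetical and can be anything you choose.

```r
# Hypothetical helper: fetch one secret from the environment and fail loudly
# if it is missing, so an empty string is never passed to the API by accident.
read_secret <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable ", name, " is not set", call. = FALSE)
  }
  value
}

# Hypothetical variable names; define them in ~/.Renviron or the shell,
# not in the script that gets published.
load_twitter_credentials <- function() {
  list(
    consumer_key    = read_secret("TWITTER_CONSUMER_KEY"),
    consumer_secret = read_secret("TWITTER_CONSUMER_SECRET"),
    access_token    = read_secret("TWITTER_ACCESS_TOKEN"),
    access_secret   = read_secret("TWITTER_ACCESS_SECRET")
  )
}

# Usage (left commented out because it requires the variables to exist):
# creds <- load_twitter_credentials()
# consumer_key    <- creds$consumer_key
# consumer_secret <- creds$consumer_secret
# access_token    <- creds$access_token
# access_secret   <- creds$access_secret
```

With the four variables assigned this way, the authentication call that follows can be run unchanged.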
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"
The 1,000 most recent tweets (retweets excluded) with the hashtag #2020Election, retrieved from Twitter as of 2/20/2019.
tweets = searchTwitter('#2020Election -filter:retweets', n = 1000, resultType = "recent")
tweets = twListToDF(tweets) %>% as_tibble()
tidy_tweets = tweets %>%
  select(text) %>%
  mutate(tweetID = row_number()) %>%
  unnest_tokens(word, text, token = "words")
tidy_tweets %>%
  count(word) %>%
  anti_join(stop_words) %>%
  # Drop URL fragments and the search hashtag; the dot is escaped so "t.co" matches literally
  filter(!str_detect(word, "t\\.co|https|2020election")) %>%
  top_n(25, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot() +
  labs(title = "Top 25 words in tweets with hashtag #2020Election") +
  geom_col(aes(x = word, y = n)) +
  coord_flip()
## Joining, by = "word"
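TF-IDF (term frequency-inverse document frequency) is a measure of word importance. It boosts words that appear frequently within a particular record/tweet but are not used widely across all tweets, and it penalizes words that appear in the majority of tweets. Here each tweet is treated as one document, and the scores for a word are averaged over the tweets that contain it.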
tidy_tweets %>%
  # Count each word within each tweet; bind_tf_idf() expects per-document counts
  count(tweetID, word) %>%
  bind_tf_idf(word, tweetID, n) %>%
  anti_join(stop_words) %>%
  filter(!str_detect(word, "t\\.co|https|2020election")) %>%
  group_by(word) %>%
  summarise(mean_tf_idf = mean(tf_idf)) %>%
  top_n(25, mean_tf_idf) %>%
  mutate(word = reorder(word, mean_tf_idf)) %>%
  ggplot() +
  labs(title = "Top 25 words by mean TF-IDF in tweets with hashtag #2020Election",
       y = "Mean TF-IDF") +
  geom_col(aes(x = word, y = mean_tf_idf)) +
  coord_flip()
## Joining, by = "word"
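
To make the TF-IDF weighting concrete, here is a small worked sketch in base R on a made-up three-document corpus (the text is invented for illustration and is not from the Twitter data). It assumes the common definition tf = (count of word in document) / (tokens in document) and idf = ln(number of documents / documents containing the word), which is our understanding of what tidytext's `bind_tf_idf()` computes.

```r
# Toy corpus: three short "tweets" (made-up text).
docs <- c(
  "vote early vote often",
  "vote for the candidate",
  "the candidate spoke"
)

# Tokenize on single spaces and record which document each token came from.
tokens <- strsplit(docs, " ", fixed = TRUE)
long <- data.frame(
  doc  = rep(seq_along(tokens), lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)

# n = number of times each word occurs within each document.
counts <- aggregate(n ~ doc + word, data = cbind(long, n = 1), FUN = sum)
counts$word <- as.character(counts$word)

# tf: share of a document's tokens taken up by the word.
counts$tf <- counts$n / ave(counts$n, counts$doc, FUN = sum)

# idf: natural log of (number of documents / documents containing the word).
# `counts` has one row per (doc, word), so tabulating words counts documents.
docs_with_word <- table(counts$word)
counts$idf <- log(length(docs) / as.numeric(docs_with_word[counts$word]))

counts$tf_idf <- counts$tf * counts$idf

# Highest-scoring (document, word) pairs first.
counts[order(-counts$tf_idf), ]
```

In this toy corpus "vote" makes up half of the first document (tf = 0.5) but also occurs in the second, so its idf is only ln(3/2). "spoke" occurs in a single document and gets the larger idf of ln(3). A word present in every document would get idf = ln(1) = 0, which is the penalty described above.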
Tweets may be tokenized by individual words (unigrams), adjacent word pairs (bigrams), or adjacent words in a window of n words (n-grams). By using a listing of bigrams and their counts, network visualizations can be created to examine word relationships.
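
The difference between these tokenizations can be sketched in a few lines of base R. `make_ngrams()` is a hypothetical helper written for illustration only; it uses a much simpler splitting rule than the tokenizers tidytext relies on, so its output will not match `unnest_tokens()` on real tweets.

```r
# Hypothetical helper: lower-case the text, split it into word tokens, then
# paste together every run of n adjacent tokens.
make_ngrams <- function(text, n = 2) {
  words <- strsplit(tolower(text), "[^a-z0-9#']+")[[1]]
  words <- words[nzchar(words)]          # drop empty strings left by the split
  if (length(words) < n) {
    return(character(0))                 # text too short to form an n-gram
  }
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

example_tweet <- "Register to vote before the deadline"
make_ngrams(example_tweet, n = 1)  # unigrams
make_ngrams(example_tweet, n = 2)  # bigrams
make_ngrams(example_tweet, n = 3)  # trigrams
```

A sentence of six words yields six unigrams, five bigrams, and four trigrams, because each n-gram starts one word later than the previous one.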
Using the ggraph package, we can extend the functionality of ggplot2 to create network graphs using bigram counts, term frequencies, or TF-IDF values.
In brief, this is accomplished by taking our tokenized bigrams, separating them into individual words, setting those words as nodes within a graph, and using a metric such as counts or TF-IDF as the weight for a directed edge between those nodes.
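
The steps in the previous paragraph can be sketched on a handful of made-up bigrams using base R only. The bigram strings below are invented for illustration; the result is an edge list with one row per directed edge and the bigram count as its weight.

```r
# Made-up bigrams standing in for the output of the tokenization step.
bigrams <- c("mail ballot", "early voting", "mail ballot",
             "ballot box", "early voting", "mail ballot")

# Count each distinct bigram; the count becomes the edge weight.
bigram_counts <- as.data.frame(table(bigram = bigrams),
                               responseName = "n",
                               stringsAsFactors = FALSE)

# Split "word1 word2" into the source and target node of a directed edge.
parts <- do.call(rbind, strsplit(bigram_counts$bigram, " ", fixed = TRUE))
edges <- data.frame(
  word1 = parts[, 1],
  word2 = parts[, 2],
  n     = bigram_counts$n,
  stringsAsFactors = FALSE
)

# The node set is every word that appears at either end of an edge.
nodes <- sort(unique(c(edges$word1, edges$word2)))

edges
nodes
```

A data frame in this shape, with the first two columns naming the endpoints and any remaining columns treated as edge attributes, is what igraph's `graph_from_data_frame()` accepts. Note that "ballot" is the target of one edge and the source of another, which is how chains of words emerge in the plotted network.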
The following example uses bigram counts to define edge weights. The igraph package in R is used to create the initial graph object from a data frame. Only the top 25 bigrams by count are retained to keep the visualization clean; because top_n() keeps ties, slightly more than 25 may appear.
tidy_tweets_bigrams = tweets %>%
  as_tibble() %>%
  select(text) %>%
  mutate(tweetID = row_number()) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))
bigram_graph = tidy_tweets_bigrams %>%
  count(bigram) %>%
  select(bigram, n) %>%
  # Drop bigrams containing URL fragments; "http" also covers "https", and the dot is escaped
  filter(!str_detect(bigram, "http|t\\.co")) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) %>%
  top_n(25, n) %>%
  graph_from_data_frame()
set.seed(2018)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n),
                 show.legend = FALSE,
                 arrow = a,
                 end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue",
                  size = 5) +
  geom_node_text(aes(label = name),
                 vjust = 1,
                 hjust = 1) +
  theme_void()
Here is a similar example that calculates the mean TF-IDF value of each bigram and keeps the top 25. The resulting network should retain the word pairs with higher importance.
tidy_tweets_bigrams_tf_idf = tidy_tweets_bigrams %>%
  # Count each bigram within each tweet; bind_tf_idf() expects per-document counts
  count(tweetID, bigram) %>%
  bind_tf_idf(bigram, tweetID, n) %>%
  group_by(bigram) %>%
  summarise(mean_tf_idf = mean(tf_idf))
bigram_graph = tidy_tweets_bigrams_tf_idf %>%
  select(bigram, mean_tf_idf) %>%
  # Drop URL fragments and the search hashtag; "http" also covers "https", and the dot is escaped
  filter(!str_detect(bigram, "http|t\\.co|2020election")) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) %>%
  top_n(25, mean_tf_idf) %>%
  graph_from_data_frame()
set.seed(2018)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = mean_tf_idf),
                 show.legend = FALSE,
                 arrow = a,
                 end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue",
                  size = 5) +
  geom_node_text(aes(label = name),
                 vjust = 1,
                 hjust = 1) +
  theme_void()