The Ashes is a traditional five-match Test cricket series between Australia and England, hosted by each country in turn and played roughly once every two years.
This vignette walks through the following steps:
1) Accessing and retrieving tweets about the Ashes from Australia using the rtweet package.
2) Plotting a map of Australia showing the Ashes-related tweets using the oz package.
3) Plotting the day-wise tweet frequency.
4) Generating a word cloud from the tweets using the wordcloud2 package.
Before you start, make sure you have a Twitter account.
R has several packages for accessing and searching Twitter data (twitteR, streamR, rtweet).
In this vignette we use the rtweet package.
#Install rtweet Package from CRAN
install.packages("rtweet" , repos = "http://cran.us.r-project.org")
#Load rtweet package
library(rtweet)
You can use the get_trends() function to identify what’s currently trending on Twitter. Here we use it to find out what’s trending in Australia.
sf <- get_trends("australia")
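Once the trends are in a data frame, the Ashes-related ones can be picked out with a pattern match. The sketch below uses a small hand-made data frame, `trends_demo`, in place of the live result, and it assumes the trend names sit in a column called `trend`; check `names(sf)` on your own result before relying on that.

```r
# Hypothetical stand-in for the result of get_trends(); the values are made up.
trends_demo <- data.frame(
  trend = c("#Ashes", "#auspol", "#cricket", "#theashes2019", "#weather"),
  stringsAsFactors = FALSE
)

# Keep only the trends whose name mentions "ashes", ignoring case.
is_ashes <- grepl("ashes", trends_demo$trend, ignore.case = TRUE)
ashes_trends <- trends_demo$trend[is_ashes]
print(ashes_trends)
```

With a live result, the same two lines would be applied to `sf` instead of `trends_demo`.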
The trending Ashes hashtags were #engvsaus, #Ashes, #ashes2019, #theashes, #ashes19 and #theashes2019.
To get the latest 18000 tweets from Australia that contain any of these hashtags, we call the search_tweets() function with the following arguments.
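q - the query to be searched. Terms joined with OR match tweets containing any one of them.
n - the number of tweets to be returned. By default, search_tweets() returns 100 tweets. The maximum is 18000 tweets every 15 minutes.
geocode - a single string holding the latitude, longitude and radius (in mi or km) around a location, separated by commas.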
We can only retrieve tweets from the past 6-9 days.
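The 18000-tweets-per-15-minutes limit mentioned above makes it easy to estimate how long a larger request would take. The helper names below are made up for illustration, and the arithmetic assumes each 15-minute window starts with its full allowance. rtweet's search_tweets() also has a `retryonratelimit` argument for waiting out the limit; see ?search_tweets for how it behaves in your installed version.

```r
# Limit quoted in this vignette: 18000 tweets per 15-minute window.
tweets_per_window <- 18000
window_minutes <- 15

# Number of rate-limit windows needed to collect n tweets.
windows_needed <- function(n) ceiling(n / tweets_per_window)

# Minimum waiting time in minutes: the first window is available immediately,
# and every further window costs one 15-minute reset.
min_wait_minutes <- function(n) (windows_needed(n) - 1) * window_minutes

print(windows_needed(18000))    # fits in a single window
print(min_wait_minutes(50000))  # three windows, so two resets
```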
The geocode for Australia can be looked up online. The call below uses latitude -25.274398 and longitude 133.775136 with a radius of 2000 miles.
rt <- search_tweets(
  q = "#engvsaus OR #Ashes OR #ashes2019 OR #theashes OR #ashes19 OR #theashes2019",
  geocode = "-25.274398,133.775136,2000mi",
  n = 18000
)
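The query and geocode strings can also be assembled from their parts, which makes it easier to add or drop a hashtag later. This is only a sketch of the string building; the variable names are illustrative and nothing here contacts Twitter.

```r
# Hashtags listed earlier in this vignette.
hashtags <- c("#engvsaus", "#Ashes", "#ashes2019", "#theashes", "#ashes19", "#theashes2019")

# Join them with OR so that a tweet matching any one hashtag is returned.
ashes_query <- paste(hashtags, collapse = " OR ")

# geocode is one string: "latitude,longitude,radius", with the radius in mi or km.
ashes_geocode <- sprintf("%s,%s,%dmi", "-25.274398", "133.775136", 2000L)

print(ashes_query)
print(ashes_geocode)
```

The two strings could then be passed as `q = ashes_query` and `geocode = ashes_geocode` in the search_tweets() call.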
The first time you run one of these functions (search_tweets(), get_timeline(), get_followers() or get_favorites()), you will be prompted to log in to your Twitter account.
The rtweet package contains the ts_plot() function, which aggregates the tweet counts over a specified time interval and plots them as a time series. Hence the ‘ts’ in its name: time series.
ts_plot(rt, by = "day")
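To see the kind of day-wise aggregation behind such a plot, the counting can be reproduced by hand. This is a base-R sketch on made-up timestamps, not rtweet's own implementation; `created_demo` stands in for the tweet creation times, which are assumed to live in a column such as `rt$created_at`.

```r
# Hypothetical creation times; the values are made up.
created_demo <- as.POSIXct(
  c("2019-09-12 09:15:00", "2019-09-12 18:40:00", "2019-09-13 02:05:00",
    "2019-09-13 11:30:00", "2019-09-13 23:59:00", "2019-09-15 07:45:00"),
  tz = "UTC"
)

# Truncate each timestamp to its calendar day, then count tweets per day.
tweet_days <- as.Date(created_demo, tz = "UTC")
daily_counts <- as.data.frame(table(day = tweet_days), stringsAsFactors = FALSE)
names(daily_counts) <- c("day", "n")
print(daily_counts)
```

Days without any tweets (2019-09-14 in this toy data) do not appear in the result, so they would need to be filled in with zeros before drawing a continuous line.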
# Install the packages required for the bar chart and the wordcloud
install.packages("rcorpora" , repos = "http://cran.us.r-project.org")
install.packages("tidytext", repos = "http://cran.us.r-project.org")
install.packages("dplyr" , repos = "http://cran.us.r-project.org")
install.packages("ggplot2" , repos = "http://cran.us.r-project.org")
install.packages("wordcloud2" , repos = "http://cran.us.r-project.org")
# Load the packages
library(rcorpora)
library(tidytext)
library(dplyr)
library(ggplot2)  # provides ggplot(), used for the top-10 bar chart
library(wordcloud2)
To make sure that the words appearing in the wordcloud are relevant, we need to clean the tweet texts by removing irrelevant words.
We can eliminate common English stop words using the words/stopwords/en corpus from the rcorpora package. After that, we use the unnest_tokens() function to transform the tweet texts to one word per row, the count() function to count how often each word occurs, and filter() to drop the stop words.
stopwords <- corpora("words/stopwords/en")$stopWords
tweetwords <- rt %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  filter(!word %in% stopwords)
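The tokenise, count and filter steps can be illustrated on a few made-up tweets with base R alone. This is a rough approximation for intuition only: unnest_tokens() uses its own tokenisation rules, and both `tweets_demo` and `stopwords_demo` are hypothetical stand-ins for the real tweet texts and the rcorpora stop word list.

```r
# Hypothetical tweet texts; the values are made up.
tweets_demo <- c(
  "The Ashes are back!",
  "What a catch in the Ashes",
  "Rain delay at the Oval"
)
# Tiny illustrative stop word list, not the rcorpora one.
stopwords_demo <- c("the", "a", "in", "at", "are", "what")

# Lower-case the text and split on anything that is not a letter, digit or apostrophe.
tokens <- unlist(strsplit(tolower(tweets_demo), "[^a-z0-9']+"))
tokens <- tokens[tokens != ""]

# Drop stop words, then count and sort by frequency (highest first).
kept <- tokens[!tokens %in% stopwords_demo]
word_counts <- sort(table(kept), decreasing = TRUE)
print(word_counts)
```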
ggplot(head(tweetwords, n = 10), aes(x = reorder(word, -n), y = n)) +
  geom_bar(stat = "identity") +
  ggtitle("Top 10 words regarding Ashes") +
  theme(
    axis.text = element_text(size = 15, angle = 90, face = "bold"),
    axis.title.x = element_blank(),
    title = element_text(size = 15)
  )
Looking at the plot above and at the 100 most frequent words, I can still see many irrelevant words, such as URL fragments and stray numbers. These can be removed by listing them in another filter() call.
tweetwords <- tweetwords %>% filter(!word %in% c("it's","t.co", "don't" , "https", "2", "1", "3", "they’ve", "2nd", "hd", "here’s", "5"))
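Listing every unwanted token by hand gets tedious, so part of the clean-up can be done with patterns instead. The following base-R sketch runs on a made-up table, `words_demo`, shaped like tweetwords (a `word` column and an `n` column). It drops tokens that start with a digit and a few URL fragments. The pattern is an assumption about what counts as noise and would also remove words such as "4th", so adjust it to your data.

```r
# Hypothetical word counts; the values are made up.
words_demo <- data.frame(
  word = c("ashes", "https", "t.co", "2", "england", "2nd", "5", "wicket"),
  n    = c(70, 40, 40, 12, 10, 6, 4, 3),
  stringsAsFactors = FALSE
)

# Tokens that start with a digit ("2", "2nd", "5") or are URL fragments.
is_numeric_like <- grepl("^[0-9]", words_demo$word)
is_url_fragment <- words_demo$word %in% c("https", "http", "t.co")

cleaned_demo <- words_demo[!(is_numeric_like | is_url_fragment), ]
print(cleaned_demo)
```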
I chose the wordcloud2 package over the wordcloud package because it provides more options for customizing the wordcloud.
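# Overwrite the count for "ashes" with 70; this replaces its real frequency and so changes how large the word is drawn
tweetwords$n[tweetwords$word == "ashes"] <- 70
wordcloud2(
  data = tweetwords,
  backgroundColor = "#34495e",
  color = rep_len(c("#F09240", "#3498db", "#2ecc71", "#f1c40f", "#8e44ad"), nrow(tweetwords)),
  minRotation = -pi/6,
  minSize = 5,
  size = 1
)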
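The `color` argument above relies on rep_len() to stretch five colours over every row of tweetwords. A small sketch with a hypothetical count of seven words shows how the palette is recycled:

```r
# The five colours used in the wordcloud2() call.
palette_demo <- c("#F09240", "#3498db", "#2ecc71", "#f1c40f", "#8e44ad")

# With 7 words, the 5 colours are repeated in order until 7 values exist.
word_colours <- rep_len(palette_demo, 7)
print(word_colours)
```

The sixth and seventh words reuse the first and second colours, so the colour of a word depends only on its row position and not on its frequency.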