Introduction

The Ashes is a traditional five-match Test cricket series between Australia and England, hosted by each country in turn and played roughly every two years.

This vignette explains the following steps in detail:

1) Accessing and retrieving tweets from Australia regarding the Ashes using the rtweet package.

2) Plotting the Australia tweet map for the Ashes-related tweets using the oz package.

3) Plotting the day-wise tweet frequency.

4) Generating a word cloud from the tweets using the wordcloud2 package.

1) Accessing and retrieving tweets from Australia regarding the Ashes using the rtweet package

Before you start, make sure you have a Twitter account.

R has several packages for accessing and searching Twitter data (twitteR, streamR, rtweet).

In this vignette we will use the rtweet package to access Twitter data.

# Install the rtweet package from CRAN
install.packages("rtweet", repos = "http://cran.us.r-project.org")
# Load the rtweet package
library(rtweet)

Authorisation

The reason for choosing rtweet is that it no longer requires a Twitter developer account or a Twitter application to use Twitter's API. Instead, you only need to call one of its functions, such as search_tweets(), get_timeline(), get_followers() or get_favorites(), to obtain API authorisation. This saves the time and hassle of creating a Twitter developer account and generating authorisation tokens.

Searching Twitter for Tweets

The trending Ashes hashtags are #engvsaus, #Ashes, #ashes2019, #theashes, #ashes19 and #theashes2019.

To get the latest 18000 tweets from Australia containing any of the above hashtags, we call the search_tweets() function with the following arguments.

q - query to be searched

n - number of tweets to be returned. By default, search_tweets() returns 100 tweets. The maximum is 18000 tweets every 15 minutes.

geocode - latitude, longitude and radius around a particular location

We can only retrieve tweets from the past 6-9 days.

The geocode for Australia can be found on this link.

rt <- search_tweets(
  q = "#engvsaus OR #Ashes OR #ashes2019 OR #theashes OR #ashes19 OR #theashes2019",
  geocode = "-25.274398,133.775136,2000mi",
  n = 18000
)
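The query and geocode strings passed to search_tweets() can also be assembled from their parts, which makes it easier to add or drop a hashtag later. The following is a minimal sketch in base R; the variable names (hashtags, ashes_query, ashes_geocode) are only illustrative and are not part of rtweet.

```r
# Hashtags taken from the list of trending Ashes hashtags
hashtags <- c("#engvsaus", "#Ashes", "#ashes2019",
              "#theashes", "#ashes19", "#theashes2019")

# Join the hashtags with OR so that a tweet matching any one of them qualifies
ashes_query <- paste(hashtags, collapse = " OR ")

# The geocode is a single "latitude,longitude,radius" string
ashes_geocode <- sprintf("%.6f,%.6f,%s", -25.274398, 133.775136, "2000mi")

print(ashes_query)
print(ashes_geocode)
```

These two strings could then be supplied as q = ashes_query and geocode = ashes_geocode in the search_tweets() call.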

The first time you run one of these functions (search_tweets(), get_timeline(), get_followers() or get_favorites()), you will be prompted to log in to your Twitter account.

3) Plotting the day-wise tweet frequency

The rtweet package contains the ts_plot() function, which aggregates the tweet frequency over a specified time interval and plots the resulting time series of tweets. The "ts" in the name stands for time series.

Day-wise tweet frequency

ts_plot(rt, by = "day")
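To see what kind of aggregation sits behind such a plot, the day-wise counts can be computed by hand. The following is a minimal base R sketch that uses a small made-up data frame (toy_tweets) in place of rt. It assumes the tweets carry a created_at timestamp column, as rtweet results do. It only illustrates the idea and is not how ts_plot() is implemented.

```r
# Made-up timestamps standing in for the created_at column of rt
toy_tweets <- data.frame(
  created_at = as.POSIXct(
    c("2019-08-01 10:15:00", "2019-08-01 18:40:00",
      "2019-08-02 09:05:00", "2019-08-04 23:59:00"),
    tz = "UTC"
  )
)

# Truncate each timestamp to its calendar day (in UTC)
tweet_day <- as.Date(toy_tweets$created_at, tz = "UTC")

# Count the tweets per day; days without any tweet do not appear in the result
daily_counts <- as.data.frame(table(day = tweet_day), responseName = "n")

print(daily_counts)
```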

4) Generating a word cloud using the wordcloud2 package

# Install the packages required for generating the wordcloud
install.packages(c("rcorpora", "tidytext", "dplyr", "wordcloud2"),
                 repos = "http://cran.us.r-project.org")
# Load the packages
library(rcorpora)
library(tidytext)
library(dplyr)
library(wordcloud2)

Tidying the Tweets

In order to make sure that the words appearing in the wordcloud are relevant, we need to clean the Tweet texts by removing irrelevant words.

We first load a list of common English words from the words/stopwords/en corpus of the rcorpora package. We then use the unnest_tokens() function to transform the tweet text into one word per row, the count() function to count how often each word occurs, and filter() to drop the stop words.

stopwords <- corpora("words/stopwords/en")$stopWords
tweetwords <- rt %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  filter(!word %in% stopwords)
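The tokenise, count and filter steps can be tried out without any Twitter data. The sketch below is a base R approximation on two made-up texts with a tiny, hypothetical stop word list (toy_stopwords). unnest_tokens() handles punctuation and casing in its own way, so treat this as an illustration of the idea rather than an exact replica.

```r
# Made-up tweet texts and a tiny stop word list, for illustration only
toy_text <- c("The Ashes is on", "Ashes day one and the crowd is loud")
toy_stopwords <- c("the", "is", "on", "and")

# One word per element: lower-case the text, then split on anything
# that is not a letter, a digit or an apostrophe
words <- unlist(strsplit(tolower(toy_text), "[^a-z0-9']+"))
words <- words[nzchar(words)]

# Count each word and store the result as a word / n table
counts <- table(words)
toy_tweetwords <- data.frame(word = names(counts),
                             n = as.vector(counts),
                             stringsAsFactors = FALSE)

# Sort by descending count, then drop the stop words
toy_tweetwords <- toy_tweetwords[order(-toy_tweetwords$n), ]
toy_tweetwords <- toy_tweetwords[!toy_tweetwords$word %in% toy_stopwords, ]

print(toy_tweetwords)
```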

Plot the 10 words with the highest frequency

# ggplot2 provides ggplot(); it is not loaded by the packages above
library(ggplot2)
ggplot(head(tweetwords, n = 10), aes(x = reorder(word, -n), y = n)) +
  geom_bar(stat = "identity") +
  ggtitle("Top 10 words regarding Ashes") +
  theme(axis.text = element_text(size = 15, angle = 90, face = "bold"),
        axis.title.x = element_blank(),
        title = element_text(size = 15))

After looking at the above plot and the 100 most frequent words, I can still see many irrelevant words. These can be removed by listing them in another filter() call.

tweetwords <- tweetwords %>% filter(!word %in% c("it's","t.co", "don't" , "https", "2", "1", "3", "they’ve", "2nd", "hd", "here’s", "5"))
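Listing every unwanted token by hand becomes tedious as more of them turn up. As a possible alternative, purely numeric tokens and URL fragments can be matched by pattern. The following is a minimal base R sketch on a made-up word vector (toy_words); the choice of patterns is an assumption and would need checking against the real tweets.

```r
# Made-up tokens standing in for the word column of tweetwords
toy_words <- c("ashes", "2", "smith", "t.co", "https", "5", "2nd")

# TRUE for tokens made up of digits only, such as "2" or "5"
is_number <- grepl("^[0-9]+$", toy_words)

# TRUE for the fragments that links typically leave behind
is_url_bit <- toy_words %in% c("t.co", "https", "http")

# Keep everything that is neither a number nor a URL fragment
kept <- toy_words[!(is_number | is_url_bit)]

print(kept)
```

Tokens such as "2nd" are not purely numeric, so they survive this filter and would still have to be listed explicitly.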

Wordcloud

I decided to use the wordcloud2 package rather than the wordcloud package, as it provides more options for customising the word cloud.

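# Override the count for "ashes" with a fixed value of 70; the search term
# itself would otherwise likely dominate the cloud
tweetwords$n[tweetwords$word == "ashes"] <- 70
wordcloud2(data = tweetwords,
           backgroundColor = "#34495e",
           # Recycle the five colours so that every word gets one
           color = rep_len(c("#F09240", "#3498db", "#2ecc71", "#f1c40f", "#8e44ad"),
                           nrow(tweetwords)),
           minRotation = -pi/6,
           minSize = 5,
           size = 1)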
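The color argument above receives one colour per word, built by recycling a five-colour palette with rep_len(). The short sketch below shows that recycling on its own; n_words is a hypothetical stand-in for nrow(tweetwords).

```r
# The five colours used in the word cloud call
cloud_palette <- c("#F09240", "#3498db", "#2ecc71", "#f1c40f", "#8e44ad")

# Hypothetical number of words; in the vignette this is nrow(tweetwords)
n_words <- 7

# rep_len() repeats the palette from the start until n_words colours exist
word_colours <- rep_len(cloud_palette, n_words)

print(word_colours)
```

With seven words, the sixth and seventh word reuse the first and second colour of the palette.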