Introduction

This vignette explains how to use the functions provided by the rtweet package to access tweets from Twitter. Twitter provides REST APIs for accessing tweets and user data, and the rtweet functions wrap these APIs.

There are several steps to perform outside R before you can access tweets. This blog post explains them well; in summary:

  1. First, you must have a Twitter account.
  2. Then, apply for Twitter developer access. This entails setting up an app and receiving confirmation from Twitter.
  3. After this, a Twitter app can be created and, importantly, consumer and access keys are generated. These are then used for authorisation when connecting to Twitter from R via the REST APIs.

To access Twitter, the first step is to install the rtweet package and load it. Some other packages I use along the way are also loaded here.

## install rtweet and qdapRegex from CRAN
install.packages("rtweet")
install.packages("qdapRegex")
## load rtweet plus the other packages used later for cleaning, plotting and the wordcloud
library(rtweet)
library(qdapRegex)  # removing URLs from tweet text
library(plyr)
library(tidyverse)
library(tm)         # corpus handling and text cleaning
library(ggplot2)
library(wordcloud)

Authorisation

Now, to access Twitter, an authorisation token must be created; the create_token function does this. Note that if only the consumer keys are supplied (or an access token was not created in the Twitter app), a web browser will pop up asking for a username and password to complete the connection to Twitter. Also note that in the code snippet you will need to add your own credentials, as these are private to each user.

# Change the next four lines based on your own consumer_key, consumer_secret, access_token, and access_secret.
consumer_key <- "Copy your Consumer Key for the application here"
consumer_secret <- "Copy your Consumer Secret for the application here"
access_token <- "Copy your Access Token for the application here"
access_secret <- "Copy your Access Secret for the application here"

## authenticate via access token
token <- create_token(
  app = "Movie Trends - UTS",
  consumer_key = consumer_key,
  consumer_secret = consumer_secret,
  access_token = access_token,
  access_secret = access_secret)
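
As an optional sanity check (a minimal sketch, not part of the original workflow), the token can be exercised by asking Twitter for the current rate limit of the search endpoint; an authentication error here usually means the keys were copied incorrectly.

## optional: confirm the token works by checking the search rate limit
rate_limit(token, "search/tweets")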

Searching For Tweets

After authentication is successful, I can search for tweets using the search_tweets function. It must be passed a query string to match against tweet text; other useful parameters are:

  q: The search query.
  n: The number of tweets to return. A maximum of 18,000 tweets can be returned per 15 minutes, so to get more than this the retryonratelimit parameter should be set to TRUE; processing will then keep fetching tweets every 15 minutes until the requested number is reached.
  geocode: Latitude, longitude and radius (in miles or kilometres) restricting where tweets come from. For example, for a 100km radius around Sydney the geocode would be '-33.8688,151.20732,100km' (an example follows this list).
  type: The type of tweets to return, either "recent", "mixed" or "popular".
  include_rts: Whether to include retweets (as generated by Twitter's built-in "retweet" function). If set to TRUE, retweets are included.
  lang: The language of the tweets to return, as an ISO 639-1 code; for example, English-language tweets have code 'en'.
  retryonratelimit: As mentioned for n, this logical value determines whether the call keeps requesting tweets every 15 minutes for volumes above 18,000.
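
For instance, a hypothetical location-restricted search using the geocode format above (100km around Sydney; tweets_syd is just an illustrative name) might look like this:

## hypothetical example: recent #captainmarvel tweets within 100km of Sydney
tweets_syd <- search_tweets('#captainmarvel', n = 1000, type = "recent",
                            include_rts = FALSE, lang = 'en',
                            geocode = '-33.8688,151.20732,100km')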

To get English tweets about the movie Captain Marvel using the hashtag #captainmarvel, the call is as follows. Under the standard APIs, only data from the last 6-9 days is returned. The tweets are put into a data frame by default.

numberOfTweets <- 15000
tweets_cm <- search_tweets('#captainmarvel', n = numberOfTweets, type = "recent",
                           include_rts = FALSE, retryonratelimit = TRUE, lang = 'en')
## Searching for tweets...
## This may take a few seconds...
## Finished collecting tweets!
nrow(tweets_cm)
## [1] 16068

Looking at the data frame returned, we can see that 16,068 tweets were collected.
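
A quick way to peek at the result (assuming the usual rtweet columns such as created_at, screen_name and text) is:

## glance at a few columns of the returned tweets
tweets_cm %>%
  select(created_at, screen_name, text) %>%
  head()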

Using the Data

Plotting a Time Series

Now that we have a set of tweets, another useful function in the rtweet package is ts_plot, which plots a time series of the tweets we have just captured. It takes a data frame and a time period to aggregate over. Here the period is hours, and the plot shows how tweets follow a distinct pattern over the course of a day.

ts_plot(tweets_cm, by = "hours")
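
Since ts_plot returns a ggplot object, the default plot can be customised with the usual ggplot2 layers; a small sketch (the labels here are just illustrative):

ts_plot(tweets_cm, by = "hours") +
  labs(x = "Date", y = "Number of Tweets",
       title = "#captainmarvel tweets per hour")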

ts_data builds the data frame behind this plot, with one row per time period and the number of tweets in that period. This is useful for producing graphs with ggplot, so below we get the tweets for the movie Green Book and compare them to Captain Marvel on the same graph. Marvel Comics dominates not only the box office but also the twittersphere.

tweets_gb <- search_tweets('#greenbook', n = numberOfTweets, type = "recent",
                           include_rts = FALSE, retryonratelimit = TRUE, lang = 'en')

# compare over 4 hour periods
ts_cm <- ts_data(tweets_cm, by = "4 hours")
ts_gb <- ts_data(tweets_gb, by = "4 hours")

# Plot the 2 lines together. Since Green Book goes back further, filter the Green Book data to start at the same time as the Captain Marvel data
ggplot(data=ts_cm)  +  
  geom_line(aes(x=time, y=n), color="Blue") +
  geom_line(data=filter(ts_gb, time >= min(ts_cm$time)), aes(x=time, y=n), color="Red") +
  labs(x = "Date", y = "Number of Tweets")

Wordcloud

To end, the tweets are loaded into a corpus (a collection of documents), cleaned by removing URLs, special characters, common words, punctuation and numbers, and then wordcloud is used to display the top 50 terms in the data. I have also removed 'captainmarvel', as this was the search term and so will obviously appear in every tweet.

# Helper functions to remove emoji and other special characters from tweet text
tweet.removeEmoji <- function(x) gsub("\\p{So}|\\p{Cn}", "", x, perl = TRUE)
tweet.removeSpecialChar <- function(x) gsub("[^[:alnum:]///' ]", "", x)
# content transformer that replaces a matched pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# strip URLs (generic and Twitter-shortened) before building the corpus
tweets_cm$text <- rm_url(tweets_cm$text)
tweets_cm$text <- rm_twitter_url(tweets_cm$text)
docs <- Corpus(VectorSource(tweets_cm$text))

docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, "`")
docs <- tm_map(docs, toSpace, ":")
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, content_transformer(tweet.removeEmoji))
docs <- tm_map(docs, content_transformer(tweet.removeSpecialChar))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeWords, "captainmarvel")
docs <- tm_map(docs, removeWords, "rt")

# Extra stop words found by analysing term frequencies after stemming and culling the DTM
myStopwords <- c('movie', 'amp', 'saw', 'time', 'film', 'really', 'watch', 'today', 'dont',
                 'got', 'didnt', 'cant', 'can', 'will', 'finally', 'going',
                 'new', 'wait', 'think', 'just', 'see', 'one', 'movies', 'still')
docs <- tm_map(docs, removeWords, myStopwords)

# convert corpus to Document Term Matrix
dtm <- DocumentTermMatrix(docs)

# collapse the matrix by summing over columns to get each term's total frequency
freq <- colSums(as.matrix(dtm))
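# (optional sketch) peek at the ten most frequent terms before plotting
head(sort(freq, decreasing = TRUE), 10)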

# setting the same seed each time ensures consistent look across clouds
set.seed(10)
# word cloud in colour showing the top 50 words
par(mar = rep(0, 4))
wordcloud(names(freq), freq, max.words = 50, colors = brewer.pal(6, "Dark2"),
          scale = c(4, 0.2), random.order = FALSE, rot.per = .15)

References