Retrieving Tweets with R

Introduction

This vignette briefly explains how to use R to retrieve tweets through the Twitter API, which authenticates requests with OAuth. For background on OAuth, see https://en.wikipedia.org/wiki/OAuth. For Twitter's specific implementation, see https://dev.twitter.com/oauth.

As an example, this vignette retrieves tweets that use the term “data science” and performs a basic exploration of them. Some thoughts on how to use Twitter and other social media are also briefly presented.

Requirements

To access tweets from R, one has to:

Create a Twitter Account
Create a Twitter Application
Install the required R package

Create a Twitter Account

This step is basic, but it is still a requirement: you will use your account to build your Twitter applications. To create a Twitter account, go to https://twitter.com.

Create a Twitter Application

Access the Twitter applications site (https://apps.twitter.com/)
Create an application
Read and agree to the Twitter Developer Agreement & Policy: https://dev.twitter.com/overview/terms/agreement-and-policy
Detailed directions on how to create an application can be found at http://docs.inboundnow.com/guide/create-twitter-application/

The Twitter user receives a Consumer Key and a Consumer Secret when creating a new application. The user must also create an Access Token and an Access Token Secret to enable external access to the application. Remember not to share these keys and tokens with untrusted parties.

Install the twitteR R Package

The call below installs the required package, provided you have a working internet connection. It also installs the packages that twitteR depends on.

install.packages("twitteR", repos = 'http://cran.us.r-project.org')
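
If some packages may already be installed, the installation step can be made conditional. The following is a minimal sketch; the helper names missing_packages and install_missing are hypothetical and not part of any package, and the repository URL is the one used in the call above.

```r
# Hypothetical helper: return the subset of `pkgs` that is not yet installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install only the packages that are missing.
# Requires a working internet connection when something has to be installed.
install_missing <- function(pkgs, repos = "http://cran.us.r-project.org") {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install, repos = repos)
  }
  invisible(to_install)
}
```

Under these assumptions, a call such as install_missing(c("twitteR", "dplyr", "knitr", "ggplot2")) would cover every package loaded in this vignette.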

Initializing R Environment

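The following packages are used in this vignette and should be loaded as shown below. Note that dplyr, knitr and ggplot2 are not dependencies of twitteR, so install them in the same way if they are missing.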

library(dplyr)
library(twitteR)
library(knitr)
library(ggplot2)

Setting Credentials

The first thing to do is to set the credentials. Copy the API key and API secret (the Consumer Key and Consumer Secret mentioned earlier), the access token and the access token secret from the Twitter apps web page where you created your application. To keep the code readable, it is a good idea to store them in variables instead of writing them directly in the setup_twitter_oauth() call shown below.

api_key <- "API KEY of the Twitter Application"
api_secret <- "API SECRET of the Twitter Application"
token <- "TOKEN of the Twitter Application"
token_secret <- "TOKEN SECRET of the Twitter Application"
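
Because these values are secrets, one option is to keep them out of the script altogether. The sketch below reads a credential from an environment variable with Sys.getenv(). The helper read_credential and the variable names (TWITTER_API_KEY and so on) are assumptions made for illustration; they are not names defined by Twitter or by twitteR.

```r
# Hypothetical helper: read one credential from an environment variable and
# stop with a clear message if the variable is not set.
read_credential <- function(var) {
  value <- Sys.getenv(var, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", var, " is not set", call. = FALSE)
  }
  value
}

# Usage, once the variables are set outside the script (e.g. in ~/.Renviron).
# The variable names below are illustrative assumptions:
# api_key      <- read_credential("TWITTER_API_KEY")
# api_secret   <- read_credential("TWITTER_API_SECRET")
# token        <- read_credential("TWITTER_TOKEN")
# token_secret <- read_credential("TWITTER_TOKEN_SECRET")
```

With this approach the script can be shared without exposing the keys, at the cost of one extra setup step on each machine.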

Connecting to Twitter Application by OAuth

Next, connect to the Twitter application you created. The function setup_twitter_oauth() does this when you pass it the credentials for your application.

setup_twitter_oauth(consumer_key = api_key,
                    consumer_secret = api_secret,
                    access_token = token,
                    access_secret = token_secret)

Retrieving Tweets

It is now possible to retrieve tweets. The call below performs this operation: we search for popular tweets that contain the term “data science”, from March 1, 2017 to April 7, 2017.

tweets <- searchTwitter(searchString = "data science", n = 100, lang = "en",
                        since = "2017-03-01", until = "2017-04-07",
                        resultType = "popular")

## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 100 tweets were requested but the
## API can only return 22

Note the following about the arguments:

searchString: the string to be searched. Use “+” to separate terms;
n: the maximum number of tweets to be retrieved. A search may return fewer than n, as the warning above shows;
since: start date of the search;
until: end date of the search;
lang: the language in which the tweets to be searched were written, given as a language code such as “en”. It is especially important if English is not your native language;
resultType: this argument can dramatically change the result of your query. Possible values are “mixed”, “recent” and “popular”. For instance, the call above returns more tweets if resultType is set to “recent”.

More details about this function can be found at https://www.rdocumentation.org/packages/twitteR/versions/1.1.9/topics/searchTwitter
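
The argument notes above can be tried out without calling the API. The sketch below joins search terms with “+” and formats the date bounds as the YYYY-MM-DD strings used in the call above. build_query is a hypothetical helper, and the commented searchTwitter() call only reuses arguments already shown in this vignette.

```r
# Hypothetical helper: join search terms with "+", as the argument notes suggest.
build_query <- function(terms) {
  paste(terms, collapse = "+")
}

query <- build_query(c("data", "science"))

# Dates are passed as character strings in YYYY-MM-DD form.
since <- format(as.Date("2017-03-01"), "%Y-%m-%d")
until <- format(as.Date("2017-04-07"), "%Y-%m-%d")

print(query)

# With an authenticated session, these values could be passed on:
# tweets <- searchTwitter(searchString = query, n = 100, lang = "en",
#                         since = since, until = until, resultType = "recent")
```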

Viewing Tweets Variables

The tweets object is a list of twitteR objects. These objects hold a lot of information about each tweet, most of which is not of interest in this vignette. The function twListToDF(), as used below, transforms the list into a data frame that is easier to understand and work with. The resulting data frame contains the following variables:

tweetsDF <- twListToDF(tweets)
names(tweetsDF)

##  [1] "text"          "favorited"     "favoriteCount" "replyToSN"    
##  [5] "created"       "truncated"     "replyToSID"    "id"           
##  [9] "replyToUID"    "statusSource"  "screenName"    "retweetCount" 
## [13] "isRetweet"     "retweeted"     "longitude"     "latitude"
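
The analysis that follows needs only a few of these variables. As a sketch, the snippet below builds a small made-up data frame with some of the same column names (the rows are invented for illustration and are not results from the search) and keeps just the columns of interest.

```r
# Invented rows that mimic a few columns of the data frame described above.
toyTweets <- data.frame(
  text = c("first tweet", "second tweet", "third tweet"),
  screenName = c("alice", "bob", "alice"),
  retweetCount = c(10, 5, 7),
  isRetweet = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Keep only the columns used later in the analysis.
slimTweets <- toyTweets[, c("screenName", "retweetCount", "text")]

# Optionally drop rows that are themselves retweets.
originals <- slimTweets[!toyTweets$isRetweet, ]
str(originals)
```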

Top 10 Popular Authors

Now that the tweets are in a data frame, we can perform a simple analysis on them. Suppose we want to measure the popularity of the authors by the number of retweets they received. Although we have 22 tweets, there are only 19 distinct authors (screenName variable), which tells us that some authors have more than one tweet in our list. We therefore sum the retweets (retweetCount variable) per author with the aggregate() function. Then we sort the aggregated data frame and show the ten authors with the most retweets.

length(tweetsDF$screenName) # total number of tweets

## [1] 22

length(unique(tweetsDF$screenName)) # number of unique tweet authors

## [1] 19

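# the formula is passed positionally so the call does not depend on the name of aggregate()'s first argument
retweetsByAuthor <- aggregate(retweetCount ~ screenName, data = tweetsDF, FUN = sum)
retweetsByAuthor <- arrange(retweetsByAuthor, desc(retweetCount)) # sort by retweets, descending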
top10 <- head(retweetsByAuthor, 10)
kable(top10)

screenName	retweetCount
ValaAfshar	136
hadleywickham	87
ScienceNews	87
Harvard	82
Informatica	68
EvanSinar	66
kaggle	65
DataCamp	42
sciencemagazine	39
foodgov	28
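
To make the aggregation step concrete, here is a small sketch on made-up data (the names and counts are invented, not taken from the search results). It uses base R's order() for the sort, as an alternative to dplyr's arrange().

```r
# Invented data: "alice" appears twice, so her counts must be added together.
toyCounts <- data.frame(
  screenName = c("alice", "bob", "alice", "carol"),
  retweetCount = c(10, 5, 7, 12),
  stringsAsFactors = FALSE
)

# Sum retweets per author.
byAuthor <- aggregate(retweetCount ~ screenName, data = toyCounts, FUN = sum)

# Sort by retweet count, largest first.
byAuthor <- byAuthor[order(byAuthor$retweetCount, decreasing = TRUE), ]
print(byAuthor)
```

In this toy case alice ends up first with 17 retweets (10 + 7), ahead of carol with 12 and bob with 5.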

Top 20 Bar Chart

The ranking can also be shown as a figure. The bar chart below shows up to 20 authors ranked by retweets among the tweets retrieved by our search. Because the search returned only 19 distinct authors, all of them appear.

ggplot(head(retweetsByAuthor, 20), aes(reorder(screenName, -retweetCount), retweetCount)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

Further Thoughts

Social media are a rich source of information about people’s behavior, as a huge number of individuals publicly express their thoughts and feelings about an enormous variety of subjects online. Knowledge can be derived from this “bag” of information by applying data science techniques. The example above could be expanded so that a company could identify popular tweeters and use their tweets to promote its brand or its products. Furthermore, by applying text mining techniques, an enterprise could measure how positively or negatively social media users evaluate its brand, actions or products. Thus, social media are a lively medium, full of information for data scientists to work with.
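
As a very rough illustration of the text mining idea, the sketch below scores a text by counting words from two tiny word lists. The word lists, the example texts and the helper name score_text are all invented for illustration; real sentiment analysis relies on far larger lexicons or on trained models.

```r
# Invented word lists, for illustration only.
positive_words <- c("great", "love", "useful")
negative_words <- c("bad", "hate", "broken")

# Hypothetical helper: positive word count minus negative word count.
score_text <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

texts <- c("I love this great product", "The update is bad and broken")
scores <- vapply(texts, score_text, integer(1), USE.NAMES = FALSE)
print(scores)
```

Applied to the text column of a tweet data frame, a score of this kind could be averaged per brand or per author, although the result would only be as good as the word lists behind it.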