This note is for the UTSA students who are taking my Data Analytics Applications (DA 6813) course. For this analysis you must have your Twitter credentials. If you don’t have them, refer to the content folder on Blackboard for instructions on how to get them.
For this note you will need the twitteR and sentimentr packages.
library(twitteR)
library(sentimentr)
library(plyr) # To get a frequency table
##
## Attaching package: 'plyr'
## The following object is masked from 'package:twitteR':
##
## id
Once twitteR loads, use your consumer key and consumer secret to set up Twitter OAuth. I am not going to display the actual code here because it contains my keys, but it will look like this:
# setup_twitter_oauth(consumer_key = "xxx", consumer_secret = "yyyy")
# Where "xxx" and "yyyy" are your credentials.
Once your credentials are accepted by Twitter, you can access its API. In this note I am going to get 1000 tweets containing a trending topic in San Antonio, TX. For this we need the woeid of the location; woeid stands for “where on earth ID”. So let’s first get that.
avloc <- availableTrendLocations()
head(avloc)
## name country woeid
## 1 Worldwide 1
## 2 Winnipeg Canada 2972
## 3 Ottawa Canada 3369
## 4 Quebec Canada 3444
## 5 Montreal Canada 3534
## 6 Toronto Canada 4118
In the above code I created an object avloc, which contains the name of each location, its country, and its woeid. For example, Toronto’s woeid is 4118. Let’s see whether San Antonio appears on this list.
avloc[avloc$name == "San Antonio",]
## name country woeid
## 389 San Antonio United States 2487796
San Antonio’s woeid is 2487796. We will need this to get the trending topics in San Antonio at a given hour. I am going to use the getTrends function from twitteR to obtain these trends. Rather than copying and pasting the woeid, I will simply reference it from avloc. In the following code, R retrieves the value in the row where the name variable in avloc is “San Antonio” and in the 3rd column, which, as we know, holds the woeid. This way I reduce the chance of making a mistake in copying and pasting the woeid.
trend <- getTrends(woeid = avloc[avloc$name == "San Antonio",3])
head(trend)
## name url
## 1 #HTGAWM http://twitter.com/search?q=%23HTGAWM
## 2 #Dolphins http://twitter.com/search?q=%23Dolphins
## 3 #NationalCoffeeDay http://twitter.com/search?q=%23NationalCoffeeDay
## 4 #ThisTown http://twitter.com/search?q=%23ThisTown
## 5 #BBOTT http://twitter.com/search?q=%23BBOTT
## 6 Visit San Antonio http://twitter.com/search?q=%22Visit+San+Antonio%22
## query woeid
## 1 %23HTGAWM 2487796
## 2 %23Dolphins 2487796
## 3 %23NationalCoffeeDay 2487796
## 4 %23ThisTown 2487796
## 5 %23BBOTT 2487796
## 6 %22Visit+San+Antonio%22 2487796
From the output, the top 6 trends are #HTGAWM, #Dolphins, #NationalCoffeeDay, #ThisTown, #BBOTT, and Visit San Antonio. As an aside, trending topics can appear with or without hashtags.
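The positional lookup used above depends on woeid staying in the 3rd column of avloc. A slightly more defensive sketch refers to the column by name and checks that exactly one row matched. The data frame below is only a small stand-in built from values printed earlier, so the snippet runs without Twitter credentials, and get_woeid is a hypothetical helper of my own, not a twitteR function.

```r
# Stand-in for the result of availableTrendLocations(), using values printed above
avloc_demo <- data.frame(
  name    = c("Worldwide", "Toronto", "San Antonio"),
  country = c("", "Canada", "United States"),
  woeid   = c(1, 4118, 2487796),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up a woeid by location name, using column names
get_woeid <- function(locations, place) {
  hit <- locations$woeid[locations$name == place]
  if (length(hit) != 1) {
    stop("Expected exactly one match for '", place, "', found ", length(hit))
  }
  hit
}

sa_woeid <- get_woeid(avloc_demo, "San Antonio")
print(sa_woeid)
```

With the real avloc object, the same helper could be passed straight to getTrends, for example getTrends(woeid = get_woeid(avloc, "San Antonio")).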
I am going to pick the first trending topic, #HTGAWM, and search for 1000 tweets that contain it. There is no guarantee that I will obtain the number of tweets I requested. If twitteR can’t retrieve that many, you will get a warning.
The Twitter API allows you to search for keywords on Twitter. As we are looking for a trending topic from San Antonio, for a meaningful search we must give the location of San Antonio. As it turns out, the location must be in a specific format: “latitude,longitude,radius”. The radius can be specified in miles or kilometers. In the following code I will specify it in miles. I am going to get the latitude and longitude of UTSA and obtain tweets sent within a 20-mile radius of this location. I could have used downtown San Antonio or any other location in San Antonio as well.
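Here is a minimal sketch of how the “latitude,longitude,radius” string could be assembled from numbers instead of being typed by hand. The helper make_geocode is hypothetical (it is not part of twitteR), and the unit suffixes mi and km are the two the search API accepts.

```r
# Hypothetical helper: build a "latitude,longitude,radius" string
make_geocode <- function(lat, lon, radius, unit = c("mi", "km")) {
  unit <- match.arg(unit)
  # Basic sanity checks on the coordinates and radius
  stopifnot(lat >= -90, lat <= 90, lon >= -180, lon <= 180, radius > 0)
  # format() with 15 significant digits avoids scientific notation
  # and drops trailing zeros
  paste0(
    format(lat, digits = 15, scientific = FALSE), ",",
    format(lon, digits = 15, scientific = FALSE), ",",
    radius, unit
  )
}

utsa_geocode <- make_geocode(29.5845579, -98.6187748, 20, "mi")
print(utsa_geocode)
```

The resulting string can be passed as the geocode argument of searchTwitter. Building it this way makes it easy to change the radius or switch to kilometers without retyping the coordinates.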
Where will you get the latitude and longitude? A crude but simple way is to use Google Maps. Search for your location and Google Maps will take you there. The URL in your browser will contain latitude and longitude. Here is a screenshot of the UTSA search in Google Maps.
[Figure: UTSA on Google Maps. Note the URL.]
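If you would rather not copy the numbers by hand, the coordinates can be pulled out of the URL with a regular expression. This is only a sketch: the URL below is an illustrative one I made up around the UTSA coordinates used in this note (the path and the zoom level are assumptions), and it relies on the latitude and longitude following an @ sign in the URL.

```r
# Illustrative URL: the coordinates are the ones used in this note;
# the rest of the URL (path, zoom level) is made up for the example
maps_url <- "https://www.google.com/maps/place/UTSA/@29.5845579,-98.6187748,17z"

# Capture the two signed decimal numbers that follow the "@"
pattern <- "@(-?[0-9.]+),(-?[0-9.]+)"
parts   <- regmatches(maps_url, regexec(pattern, maps_url))[[1]]

# parts[1] is the full match; parts[2] and parts[3] are the captured groups
lat <- as.numeric(parts[2])
lon <- as.numeric(parts[3])
print(c(lat = lat, lon = lon))
```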
With this much information we can now start our search. I am going to ask for 1000 tweets, all in English. Take note of the geocode parameter. I literally copied the first two numbers from the Google Maps URL!
tweet <- searchTwitter(trend[1, 1], n = 1000, lang = 'en', geocode = '29.5845579,-98.6187748,20mi')
## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 1000 tweets were requested but the
## API can only return 276
class(tweet) # Check the class of 'tweet' object
## [1] "list"
Once the search is complete, twitteR returns a list, which can be converted into a data frame for ease of analysis. I will use the twListToDF function from the twitteR package.
tweet <- twListToDF(tweet)
class(tweet) # Check class of 'tweet' object and verify that it's data frame
## [1] "data.frame"
Let’s print the first 10 tweets in our data frame.
head(tweet,10)
## text
## 1 @violadavis exquisitely too juicy \xed\xa0\xbd\xed\xb8\xb1\xed\xa0\xbd\xed\xb8\xb1\xed\xa0\xbd\xed\xb2\x80#HTGAWM https://t.co/ntK5KEuw2h
## 2 @violadavis OH MY DAMN \xed\xa0\xbd\xed\xb8\xb1\xed\xa0\xbd\xed\xb1\x8d\xed\xa0\xbc\xed\xbf\xbc\xed\xa0\xbd\xed\xb1\x80 #HTGAWM https://t.co/5m3s0uDNJj
## 3 #HTGAWM Do just want to go through a faze of being a badboy then go back to Connor or is it your HIV pushing Connor away
## 4 #HTGAWM But I do feel horrible for Connor the boy loves Olli and Olli was giving signals to later push him away like dude what's the deal
## 5 #HTGAWM Annalise was like Oliver useless we will keep him in back but now she's like Wipe this device clean all of… https://t.co/oSGPolt8Ah
## 6 #HTGAWM At least this dead body not Oliver but what about Connor
## 7 #HTGAWM Omg what Oliver part of this murder helping Annalise she gave the phone \xed\xa0\xbd\xed\xb3\xb1 he definitely part of the squad now
## 8 What did Annalise do now!? Ay and who is #UnderTheSheet?? And really Oliver cooking dinner but you want out? \xed\xa0\xbd\xed\xb8\xad this is too much. #HTGAWM
## 9 RT @MyNameisAmreena: Okay now I really don't know who it is #HTGAWM
## 10 #HTGAWM is driving me insane!! \xed\xa0\xbd\xed\xb8\xa9\xed\xa0\xbd\xed\xb8\xa9\xed\xa0\xbd\xed\xb8\xa9
## favorited favoriteCount replyToSN created truncated
## 1 FALSE 0 violadavis 2016-09-30 03:18:00 FALSE
## 2 FALSE 0 violadavis 2016-09-30 03:17:47 FALSE
## 3 FALSE 0 <NA> 2016-09-30 03:15:46 FALSE
## 4 FALSE 0 <NA> 2016-09-30 03:14:27 FALSE
## 5 FALSE 1 <NA> 2016-09-30 03:10:33 TRUE
## 6 FALSE 1 <NA> 2016-09-30 03:06:33 FALSE
## 7 FALSE 0 <NA> 2016-09-30 03:04:30 FALSE
## 8 FALSE 1 <NA> 2016-09-30 03:03:42 FALSE
## 9 FALSE 0 <NA> 2016-09-30 03:02:36 FALSE
## 10 FALSE 0 <NA> 2016-09-30 03:02:02 FALSE
## replyToSID id replyToUID
## 1 <NA> 781694560145580032 2717254872
## 2 781691193176436736 781694503597977600 2717254872
## 3 <NA> 781693997391552514 <NA>
## 4 <NA> 781693664678379520 <NA>
## 5 <NA> 781692685279121409 <NA>
## 6 <NA> 781691675731120128 <NA>
## 7 <NA> 781691161131954177 <NA>
## 8 <NA> 781690958266011649 <NA>
## 9 <NA> 781690682763251712 <NA>
## 10 <NA> 781690538156183553 <NA>
## statusSource
## 1 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
## 2 <a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>
## 3 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 4 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 5 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 6 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 7 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 8 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 9 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 10 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## screenName retweetCount isRetweet retweeted longitude latitude
## 1 ShannaSince1987 0 FALSE FALSE NA NA
## 2 summer0001 0 FALSE FALSE NA NA
## 3 LunaticNation 0 FALSE FALSE NA NA
## 4 LunaticNation 0 FALSE FALSE NA NA
## 5 LunaticNation 0 FALSE FALSE NA NA
## 6 LunaticNation 0 FALSE FALSE NA NA
## 7 LunaticNation 0 FALSE FALSE NA NA
## 8 sugamandy 0 FALSE FALSE NA NA
## 9 IamNot_aWhore 1 TRUE FALSE NA NA
## 10 merndadlg 0 FALSE FALSE NA NA
At this point I am going to do some basic cleaning. In tweet, the column statusSource contains information about the source of the tweet: whether it was sent from an iPhone, an Android phone, Twitter web, etc. But the values of this variable are quite messy, and it’s not possible to make a nice frequency table with them. So let’s clean up that variable.
In all the values I printed above for this variable, you will see </a> appearing at the end of each value. We can easily remove this string using the powerful gsub function in base R, replacing it with an empty string. To avoid overwriting the variable statusSource, I will create another variable, statusSource1.
tweet$statusSource1 <- gsub('</a>',"",tweet$statusSource)
head(tweet$statusSource1)
## [1] "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android"
## [2] "<a href=\"http://twitter.com/#!/download/ipad\" rel=\"nofollow\">Twitter for iPad"
## [3] "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone"
## [4] "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone"
## [5] "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone"
## [6] "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone"
As we see, gsub nicely replaced all </a> with nothing!
Next we need to remove the long text string enclosed in <>. We can use gsub with regular expressions (regex) to replace this entire string. For more regex patterns, check out this link: [http://www.endmemo.com/program/R/gsub.php]. Another helpful website is [https://www.memberpress.com/how-to-become-a-regular-expression-power-user/].
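In the following code I am overwriting statusSource1. The pattern .*> is greedy: it matches everything from the start of the string up to and including the last >, so only the text after the tag remains.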
tweet$statusSource1 <- gsub('.*>',"",tweet$statusSource1)
head(tweet$statusSource1)
## [1] "Twitter for Android" "Twitter for iPad" "Twitter for iPhone"
## [4] "Twitter for iPhone" "Twitter for iPhone" "Twitter for iPhone"
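The two gsub steps above can also be collapsed into one by matching each HTML tag directly. The sketch below uses three values copied from the statusSource output printed earlier, and it assumes the values contain only simple tags with no > character inside an attribute.

```r
# Sample values copied from the statusSource output above
status_source <- c(
  "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>",
  "<a href=\"http://twitter.com/#!/download/ipad\" rel=\"nofollow\">Twitter for iPad</a>",
  "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"
)

# "<[^>]+>" matches one tag at a time:
# "<", then one or more characters that are not ">", then ">"
clean_source <- gsub("<[^>]+>", "", status_source)
print(clean_source)
```

Because [^>] cannot run past the end of a tag, this pattern removes both the opening and the closing tag without touching the text between them.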
Now we have a clean variable! Let’s get a frequency table using the count() function in the plyr package.
plyr::count(tweet$statusSource1)
## x freq
## 1 Facebook 2
## 2 Path 1
## 3 RoundTeam 1
## 4 TVShow Time 1
## 5 Twitter for Android 94
## 6 Twitter for iPad 2
## 7 Twitter for iPhone 166
## 8 Twitter for Windows 1
## 9 Twitter Web Client 8
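If you prefer to stay in base R, table() gives the same counts, and sort() puts the most common source first. In this sketch I rebuild a stand-in for tweet$statusSource1 from the frequencies printed above so that it runs without the Twitter data; the names source_demo, freq, and freq_df are my own.

```r
# Rebuild a stand-in vector from the frequency table printed above
sources <- c("Facebook", "Path", "RoundTeam", "TVShow Time", "Twitter for Android",
             "Twitter for iPad", "Twitter for iPhone", "Twitter for Windows",
             "Twitter Web Client")
counts  <- c(2, 1, 1, 1, 94, 2, 166, 1, 8)
source_demo <- rep(sources, times = counts)

# table() counts each distinct value; sort() orders by frequency
freq <- sort(table(source_demo), decreasing = TRUE)

# Convert to a data frame with the same column names plyr::count() uses
freq_df <- data.frame(x = names(freq), freq = as.vector(freq),
                      stringsAsFactors = FALSE)
print(head(freq_df, 3))
```

Sorting makes it immediately clear that the iPhone and Android apps account for almost all of the tweets in this sample.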
This tutorial is still being updated, so I will add more here soon.