Want to see what’s popular? Twitter is a good place to start. It is a public platform that allows users to post what they are up to, and in recent years organizations and government officials have increasingly used it to broadcast news and opinions; a typical example is the common saying ‘rule by Twitter’ that followed Donald Trump’s arrival in office. Over the years, tweets have become a valuable resource for data mining experiments such as sentiment analysis. The twitteR package provides access to the Twitter API from R, allowing users to grab interesting subsets of Twitter data for their analyses.
This vignette briefly illustrates three ways to scrape a batch of tweets: from a user’s timeline, by hashtag, and by mention of a user.
First, to get access to bulk data from Twitter, we need to create an application at https://apps.twitter.com. To do that, we register as a Twitter developer and fill in some information, then create an application using the developer account we just signed up for.
The last step of the preparation phase is to get the application’s keys and tokens.
Now the Twitter account is all set up and access to Twitter data should be granted!
Besides twitteR, a few other packages are useful for scraping Twitter data:
library(twitteR)
library(purrr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:twitteR':
##
## id, location
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Use the keys and tokens acquired from the Twitter developer account:
consumerKey = "2UBwrhKMuLRvKxBw1dwCfJfs2"
consumerSecret = "dWnVwl593uB9RyJsCScCRePY6yv6Ogs6lfSqguGOLGcEgiEl3X"
accessToken = "1231894513439756289-miuF5QtVJ3o1sQXHMdTElx7UuldNKk"
accessSecret = "0afZHGFHvGER56QNVxjZK4jSkEgefpmghS79RGf3Pm4Ka"
options(httr_oauth_cache=TRUE)
setup_twitter_oauth(consumer_key = consumerKey, consumer_secret = consumerSecret,
                    access_token = accessToken, access_secret = accessSecret)
## [1] "Using direct authentication"
This process authenticates via httr. The personal keys and tokens should stay confidential, so the ones above were regenerated before this vignette was published.
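Since the keys must stay confidential, one option is to keep them out of the script entirely. The sketch below is a minimal illustration in base R, assuming the four values are stored in environment variables (for example in a `.Renviron` file). The variable names such as `TWITTER_CONSUMER_KEY` and the helper `read_twitter_credentials()` are hypothetical and not part of twitteR.

```r
# Hypothetical helper: read the four Twitter credentials from environment
# variables so that they never appear in the script itself.
read_twitter_credentials <- function(
    vars = c(consumer_key    = "TWITTER_CONSUMER_KEY",
             consumer_secret = "TWITTER_CONSUMER_SECRET",
             access_token    = "TWITTER_ACCESS_TOKEN",
             access_secret   = "TWITTER_ACCESS_SECRET")) {
  # unset = NA marks variables that are not defined at all
  values <- Sys.getenv(vars, unset = NA)
  names(values) <- names(vars)

  # Treat undefined and empty variables alike as missing
  missing <- vars[is.na(values) | values == ""]
  if (length(missing) > 0) {
    stop("Missing environment variables: ", paste(missing, collapse = ", "))
  }
  as.list(values)
}
```

With the variables set, the resulting list has the same element names as the arguments used above, so it could be passed on with `do.call(setup_twitter_oauth, read_twitter_credentials())`.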
The basic workflow of this method is to search for certain tweets, collect them in a dataframe, and then write the dataframe to a file.
### To create a list of Donald J. Trump’s 3,200 most recent tweets:
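# Request up to 3200 tweets from the user's timeline
trumptweets <- userTimeline("realDonaldTrump", n = 3200)
# Convert the list of status objects to a tibble (tbl_df() is deprecated in dplyr >= 1.0.0)
trumptweets_df <- as_tibble(map_df(trumptweets, as.data.frame))
# Save the tweets to a CSV file in the working directory
write.csv(trumptweets_df, "trumptweets.csv")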
A ‘trumptweets.csv’ file should now be in the working directory, ready to be viewed and analysed.
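When a CSV like this is read back, `read.csv()` guesses the column types, which turns the 19-digit tweet ids into floating-point numbers and, in R versions before 4.0, text into factors. The sketch below shows one way to read such a file back more safely. It uses a small made-up table in place of the real export; the ids and texts are placeholders, not real tweets.

```r
# Toy stand-in for the exported tweets; the ids are stored as text because
# they are too large to be represented exactly as double-precision numbers.
tweets <- data.frame(
  text = c("first placeholder tweet", "second placeholder tweet"),
  id   = c("1246689000000000001", "1246689000000000002"),
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")
write.csv(tweets, csv_path)

# Default type guessing: both ids collapse to the same double.
guessed <- read.csv(csv_path)
length(unique(guessed$id))

# Declaring the id column as character keeps every digit.
exact <- read.csv(csv_path,
                  colClasses = c(id = "character"),
                  stringsAsFactors = FALSE)
exact$id
```

Keeping ids as character strings matters whenever they are used later to look up or de-duplicate individual tweets.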
To get a feel for what the output dataframe looks like:
str(read.csv("trumptweets.csv"))
## 'data.frame': 33 obs. of 17 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ text : Factor w/ 33 levels ".....Could be as high as 15 Million Barrels. Good (GREAT) news for everyone!",..: 14 9 25 3 32 31 8 18 21 5 ...
## $ favorited : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ favoriteCount: int 42123 30321 67392 78175 138629 70737 124494 39623 111329 219394 ...
## $ replyToSN : Factor w/ 1 level "realDonaldTrump": NA NA NA 1 NA NA NA NA NA NA ...
## $ created : Factor w/ 32 levels "2020-04-02 12:57:41",..: 32 31 30 29 28 27 26 25 24 23 ...
## $ truncated : logi FALSE FALSE TRUE TRUE TRUE TRUE ...
## $ replyToSID : num NA NA NA 1.25e+18 NA ...
## $ id : num 1.25e+18 1.25e+18 1.25e+18 1.25e+18 1.25e+18 ...
## $ replyToUID : int NA NA NA 25073877 NA NA NA NA NA NA ...
## $ statusSource : Factor w/ 1 level "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>": 1 1 1 1 1 1 1 1 1 1 ...
## $ screenName : Factor w/ 1 level "realDonaldTrump": 1 1 1 1 1 1 1 1 1 1 ...
## $ retweetCount : int 11957 8503 17854 14016 24598 15692 24361 10140 19821 44219 ...
## $ isRetweet : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ retweeted : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ longitude : logi NA NA NA NA NA NA ...
## $ latitude : logi NA NA NA NA NA NA ...
Although some columns do not seem useful, the data still reveals valuable information in columns such as text, favoriteCount, created, and retweetCount.
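To focus on columns like these, the dataframe can be trimmed and ranked. The sketch below uses base R on a three-row stand-in whose counts are copied from the first values in the `str()` output; the `text` values are placeholders, since the full tweet texts are not shown there.

```r
# Three-row stand-in for the scraped tweets (texts are placeholders).
tweets <- data.frame(
  text          = c("tweet A", "tweet B", "tweet C"),
  favoriteCount = c(42123, 30321, 67392),
  retweetCount  = c(11957, 8503, 17854),
  isRetweet     = c(FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Keep only the columns of interest.
useful <- tweets[, c("text", "favoriteCount", "retweetCount")]

# Rank tweets from most to least retweeted.
ranked <- useful[order(useful$retweetCount, decreasing = TRUE), ]
ranked$text
```

The same pattern applies to `favoriteCount`, or to any other numeric column in the export.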
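### To search for recent tweets with the hashtag #daylightsavings, excluding retweets:
daylightsavings <- searchTwitter("#daylightsavings exclude:retweets", n = 3200)
## Warning in doRppAPICall("search/tweets", n, params = params, retryOnRateLimit =
## retryOnRateLimit, : 3200 tweets were requested but the API can only return 609
# Convert the list of status objects to a tibble (tbl_df() is deprecated in dplyr >= 1.0.0)
daylightsavings_df <- as_tibble(map_df(daylightsavings, as.data.frame))
write.csv(daylightsavings_df, "daylightsavings.csv")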
Oops, it looks like only a portion of the requested tweets was returned. To dig in a little more, look at the tail and head of the output dataframe:
tail(read.csv("daylightsavings.csv"))
## X
## 604 604
## 605 605
## 606 606
## 607 607
## 608 608
## 609 609
## text
## 604 Spring is almost here! Set your clocks forward by one hour with the Rado True Secret. #DaylightSavings… https://t.co/rhHeYxLit0
## 605 As a reminder that the clocks go forward tonight, here’s 81-year-old Evan Davies, a retired watchmaker from the Pon… https://t.co/gGdrJvJQFT
## 606 Don't forget to put your clocks forward this Sunday \U0001f337\U0001f423\n\n#springforward\n#daylightsavings https://t.co/Qe2CiyaeyG
## 607 Say “NO To UK Daylight Saving Time!” Less Sleep Less Health! MORE Accidents, Heart Attacks and Less Immunity from l… https://t.co/PEwFa27kBx
## 608 Daylight savings day is tomorrow, don't forget the clocks go forward tonight! #daylightsavingsday #daylightsavings https://t.co/YDwrpzp0Lp
## 609 @ST0NEHENGE Bet this is easier than doing my cooker clock!\n\nhttps://t.co/WlfNy2eHRg\n\n#BST #DaylightSavings
## favorited favoriteCount replyToSN created truncated replyToSID
## 604 FALSE 29 <NA> 2020-03-28 16:00:11 TRUE NA
## 605 FALSE 1 <NA> 2020-03-28 15:55:11 TRUE NA
## 606 FALSE 5 <NA> 2020-03-28 15:50:02 FALSE NA
## 607 FALSE 0 <NA> 2020-03-28 15:31:05 TRUE NA
## 608 FALSE 0 <NA> 2020-03-28 15:00:26 FALSE NA
## 609 FALSE 42 ST0NEHENGE 2020-03-28 14:45:12 FALSE 1.2438e+18
## id replyToUID
## 604 1.243931e+18 NA
## 605 1.243930e+18 NA
## 606 1.243928e+18 NA
## 607 1.243924e+18 NA
## 608 1.243916e+18 NA
## 609 1.243912e+18 45553126
## statusSource
## 604 <a href="https://prod1.sprinklr.com" rel="nofollow">Sprinklr Publishing</a>
## 605 <a href="https://www.hootsuite.com" rel="nofollow">Hootsuite Inc.</a>
## 606 <a href="https://buffer.com" rel="nofollow">Buffer</a>
## 607 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 608 <a href="https://buffer.com" rel="nofollow">Buffer</a>
## 609 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
## screenName retweetCount isRetweet retweeted longitude latitude
## 604 rado 6 FALSE FALSE NA NA
## 605 Census2021 1 FALSE FALSE NA NA
## 606 BishopBurton 1 FALSE FALSE NA NA
## 607 thesleepcoach 2 FALSE FALSE NA NA
## 608 lemonpressprint 0 FALSE FALSE NA NA
## 609 stuart180 6 FALSE FALSE NA NA
head(read.csv("daylightsavings.csv"))
## X
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## text
## 1 Any chance we can save this hour for like Dec 2020 when we can actually use it for anything other than playing scrabble? #daylightsavings
## 2 ห่างกัน3ชมแล้วนะปทท #daylightsavings
## 3 Follow up. Dogs now think dinner is late.\n#daylightsavings #dogsoftwitter https://t.co/8at5qsTOel
## 4 Remember to turn back time tonight \u23f0 #DayLightSavings #MoreTimeAtHome #lockdown @ Waitakere, New Zealand https://t.co/HI8oewcU1O
## 5 I keep telling my dog it’s only 4pm, and dinner is at 5pm. As per usual. #daylightsavings
## 6 Same, same not different #daylightsavings https://t.co/EtbeXsTLFm
## favorited favoriteCount replyToSN created truncated
## 1 FALSE 1 <NA> 2020-04-05 06:38:08 FALSE
## 2 FALSE 0 <NA> 2020-04-05 06:25:08 FALSE
## 3 FALSE 0 pupsnpopculture 2020-04-05 06:11:10 FALSE
## 4 FALSE 1 <NA> 2020-04-05 06:09:48 FALSE
## 5 FALSE 0 <NA> 2020-04-05 06:01:04 FALSE
## 6 FALSE 0 <NA> 2020-04-05 05:59:01 FALSE
## replyToSID id replyToUID
## 1 NA 1.246689e+18 NA
## 2 NA 1.246685e+18 NA
## 3 1.24656e+18 1.246682e+18 8.936359e+17
## 4 NA 1.246681e+18 NA
## 5 NA 1.246679e+18 NA
## 6 NA 1.246679e+18 NA
## statusSource
## 1 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 2 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 3 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
## 4 <a href="http://instagram.com" rel="nofollow">Instagram</a>
## 5 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 6 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## screenName retweetCount isRetweet retweeted longitude latitude
## 1 laurathistle9 0 FALSE FALSE NA NA
## 2 nmukp 0 FALSE FALSE NA NA
## 3 pupsnpopculture 0 FALSE FALSE NA NA
## 4 ToniTalijancich 0 FALSE FALSE 174.5439 -36.8491
## 5 TobyBayer 0 FALSE FALSE NA NA
## 6 adamwesterink 0 FALSE FALSE NA NA
The post times of the first and last tweets in this dataframe (2020-04-05 and 2020-03-28) suggest that the search API returns only about one week of tweets.
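The window can be measured directly from the `created` column. A minimal sketch in base R, using the newest timestamp from the `head()` output and the oldest from the `tail()` output (treating both as UTC is an assumption):

```r
# Newest and oldest 'created' values taken from the output above.
created <- c("2020-04-05 06:38:08", "2020-03-28 14:45:12")

# Parse as date-times; the time zone is assumed to be UTC.
created_time <- as.POSIXct(created, tz = "UTC")

# Length of the window covered by the search results, in days.
span_days <- as.numeric(difftime(max(created_time), min(created_time),
                                 units = "days"))
round(span_days, 2)  # about 7.66 days
```

On the full dataframe, the same calculation would use the whole `created` column instead of the two hand-copied values.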
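### To search for recent tweets mentioning @realDonaldTrump, excluding retweets:
tweetstotrump <- searchTwitter("@realDonaldTrump exclude:retweets", n = 3200)
# Convert the list of status objects to a tibble (tbl_df() is deprecated in dplyr >= 1.0.0)
tweetstotrump_df <- as_tibble(map_df(tweetstotrump, as.data.frame))
write.csv(tweetstotrump_df, "tweetstotrump.csv")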
Data scraped from Twitter with the methods above supports analyses such as how often a word appears, the tweets in which it appears, and when it appears. Analysing ‘favoriteCount’ and ‘retweetCount’ can also shed some light on social trends and preferences. twitteR offers many more functions; see https://cran.r-project.org/web/packages/twitteR/twitteR.pdf for more comprehensive help.
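As a rough illustration of the word-frequency idea, the sketch below counts words in base R across three of the tweets shown in the `head()` output. The tokenisation rule (lower-casing, dropping links, then splitting on anything that is not a letter, digit, or `#`) is a simplifying assumption, not something prescribed by twitteR.

```r
# Three tweet texts copied from the head() output above.
texts <- c(
  "Any chance we can save this hour for like Dec 2020 when we can actually use it for anything other than playing scrabble? #daylightsavings",
  "Follow up. Dogs now think dinner is late.\n#daylightsavings #dogsoftwitter https://t.co/8at5qsTOel",
  "Same, same not different #daylightsavings https://t.co/EtbeXsTLFm"
)

clean <- tolower(texts)
clean <- gsub("https?://\\S+", "", clean)  # drop links

# Split on runs of characters that are not letters, digits or '#'.
words <- unlist(strsplit(clean, "[^a-z0-9#]+"))
words <- words[nzchar(words)]

# Count each word and list the most frequent ones first.
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts, 5)
```

On a real export, `texts` would be the `text` column of the dataframe, and common filler words would usually be removed before counting.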