Today we are going to jump into the topics of Twitter, the Twitter API, and how to analyze Twitter data. After going through the basics, the rest of the lab will be yours to start playing around with collecting and analyzing your own Twitter data!
A note on preliminaries: to follow along with the below on your own, you will require the following:
Twitter is a microblogging and social networking service founded in 2006. Today, the platform boasts the following statistics:
Twitter is based on microblogging: users can send messages of up to 280 characters called Tweets. User names (screen names or handles) start with an “@” sign. Each individual can choose to follow other users, which will make their Tweets appear on that individual’s timeline. Some other features of note:
There are a number of reasons why Twitter can be useful for academic study, most clearly that Twitter can have offline effects.
Many studies on the effect of social media published thus far have used Twitter data for a variety of reasons.
The Twitter Application Programming Interface (API) allows for four basic methods for collecting Tweets.
Except for the last option, which is the one we will be using primarily in this course, Tweets can only be downloaded in real time (i.e., as they are being published).
To access the Twitter API v2, you must have an authenticated developer account and a Bearer Token. I personally like to save these within a file of the following structure for easy access:
wd <- "D:/Twitter"
setwd(wd)
twitter_info <- read.csv("twitter_info2.csv",
                         stringsAsFactors = F)
names(twitter_info)
## [1] "api_key" "api_key_secret" "bearer_token"
## [4] "access_token" "access_token_secret"
I.e. a .csv file with five columns: my API Key, API Key Secret, Bearer Token, Access Token, and Access Token Secret. Of these, it is only the Bearer Token, which you should get from your Essential Access account, that we need to start using the API!
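If you want to create such a file yourself, a sketch like the following would do it (the values here are placeholders; substitute the credentials from your own developer portal):
# Placeholder credentials -- replace with the values from your developer portal
creds <- data.frame(api_key = "YOUR_API_KEY",
                    api_key_secret = "YOUR_API_KEY_SECRET",
                    bearer_token = "YOUR_BEARER_TOKEN",
                    access_token = "YOUR_ACCESS_TOKEN",
                    access_token_secret = "YOUR_ACCESS_TOKEN_SECRET",
                    stringsAsFactors = F)
write.csv(creds, "twitter_info2.csv", row.names = FALSE)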
Unfortunately, since Twitter updated their API to v2, the number of working R packages has decreased substantially. Only two appear to exist in working form: one requires Academic Research Access (which you likely don’t have), and the other is a package on GitHub which, in my experience, doesn’t work particularly well.
But fear not! With the R programming skills we have been developing, we can easily write a few functions of our own that make collecting Tweets quite straightforward once we learn a bit about how the API works.
We are going to start exactly where I started when learning about accessing the API: with Twitter’s own tutorial, Getting started with R and v2 of the Twitter API.
You should go ahead and read that document, as well as the Twitter API v2 documentation referenced below, on your own time to get the most out of the platform. For our purposes, it is worthwhile to study the anatomy of an API call and how to generalize what we learn into useful functions.
To start, we need a few packages loaded into our environment:
library(httr)
library(jsonlite)
library(dplyr)
library(plyr)
The first of these packages is for using R to make calls over the internet. The second is useful for converting data which comes in JSON format (essentially nested lists) into data.frames which are easily used within R. The last two packages are, as we already know, quite useful for data wrangling tasks.
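To see concretely what fromJSON() gives us, here is a toy example using a made-up JSON string that mimics the nested structure of an API response:
# A made-up JSON string with the same nested shape as an API response
toy_json <- '{"data":[{"id":"1","text":"hello","public_metrics":{"like_count":3}}]}'
fromJSON(toy_json, flatten = TRUE) %>%
  as.data.frame()
##   data.id data.text data.public_metrics.like_count
## 1       1     hello                               3
The nested public_metrics object becomes an ordinary column once we flatten.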
Step 1 for any API call is to declare our bearer token, the thing which Twitter uses to verify who is making an API call and what permissions they have. From this, we will create a “header” which transmits this information alongside our request.
bearer_token <- twitter_info$bearer_token
headers <- c(`Authorization` = sprintf('Bearer %s', bearer_token))
Step 2 for any API call is to determine which endpoint we want to pass a query to and what sort of information we want to request. Here are a number of useful links to relevant API documentation.
Once we have determined what we want, we then have to form a query. The structure of a query is a simple list indicating all of the parameters we want to pass along to the API. For example, suppose that we wanted to get basic profile information for Joe Biden’s personal Twitter account:
params <- list(user.fields = "public_metrics,description",
               expansions = "pinned_tweet_id")
handle <- 'JoeBiden'
We can then set up the query by identifying the correct endpoint and sending the request to the API:
url_handle <- sprintf('https://api.twitter.com/2/users/by?usernames=%s', handle)
response <- httr::GET(url = url_handle,
                      httr::add_headers(.headers = headers),
                      query = params)
obj <- httr::content(response, as = "text")
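Before parsing any further, it is worth checking that the request actually succeeded; a status code of 200 means success, while 401 or 403 usually point to a problem with your Bearer Token:
# Check the HTTP status code of the response (200 = success)
httr::status_code(response)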
Responses from the API take the form of a JSON object:
prettify(obj)
## {
## "data": [
## {
## "description": "Husband to @DrBiden, proud father and grandfather. Ready to build back better for all Americans. Official account is @POTUS.",
## "name": "Joe Biden",
## "pinned_tweet_id": "1650801827728986112",
## "public_metrics": {
## "followers_count": 37219172,
## "following_count": 47,
## "tweet_count": 9215,
## "listed_count": 41827
## },
## "id": "939091",
## "username": "JoeBiden"
## }
## ],
## "includes": {
## "tweets": [
## {
## "edit_history_tweet_ids": [
## "1650801827728986112"
## ],
## "id": "1650801827728986112",
## "text": "Every generation has a moment where they have had to stand up for democracy. To stand up for their fundamental freedoms. I believe this is ours.\n\nThat’s why I’m running for reelection as President of the United States. Join us. Let’s finish the job. https://t.co/V9Mzpw8Sqy https://t.co/Y4NXR6B8ly"
## }
## ]
## }
## }
##
To get this into a format that can be easily manipulated within R, we need to flatten the JSON into a data.frame:
json_data <- fromJSON(obj, flatten = TRUE) %>%
  as.data.frame()
as_tibble(json_data)
## # A tibble: 1 × 12
## data.description data.name data.pinned_twe… data.id data.username
## <chr> <chr> <chr> <chr> <chr>
## 1 Husband to @DrBiden, proud f… Joe Biden 165080182772898… 939091 JoeBiden
## # … with 7 more variables: data.public_metrics.followers_count <int>,
## # data.public_metrics.following_count <int>,
## # data.public_metrics.tweet_count <int>,
## # data.public_metrics.listed_count <int>,
## # includes.tweets.edit_history_tweet_ids <list>, includes.tweets.id <chr>,
## # includes.tweets.text <chr>
Great! How might we modify the above to collect Tweets filtering by keywords and location? First, we need to select the appropriate API and modify the query accordingly. Since you only have “Essential” access, we are limited to looking through “recent” Tweets published over the past seven days.
Let’s suppose that we want to collect Tweets from people in the UK who Tweeted about either Biden or Trump and have geolocation enabled.
endpoint <- 'https://api.twitter.com/2/tweets/search/recent'
params <- list(query = "(Biden OR Trump) (place_country:GB has:geo)",
               tweet.fields = "author_id,created_at,public_metrics,geo",
               max_results = 100)
response <- httr::GET(url = endpoint,
                      httr::add_headers(.headers = headers),
                      query = params) %>%
  httr::content(., as = "text") %>%
  fromJSON(., flatten = TRUE) %>%
  as.data.frame()
as_tibble(response)
## # A tibble: 100 × 17
## data.author_id data.id data.text data.created_at data.edit_histo…
## <chr> <chr> <chr> <chr> <list>
## 1 1487008168693702659 1659729304685… "@RonnyJ… 2023-05-20T01:… <chr [1]>
## 2 1487008168693702659 1659723856607… "@MAGAIn… 2023-05-20T00:… <chr [1]>
## 3 1487008168693702659 1659718443136… "@its_th… 2023-05-20T00:… <chr [1]>
## 4 1487008168693702659 1659713948029… "@lauren… 2023-05-20T00:… <chr [1]>
## 5 1468892948309987338 1659712146412… "#RUSSIA… 2023-05-20T00:… <chr [1]>
## 6 1487008168693702659 1659708156132… "@TeamTr… 2023-05-19T23:… <chr [1]>
## 7 1487008168693702659 1659707295591… "@Gunthe… 2023-05-19T23:… <chr [1]>
## 8 1266307739938144257 1659706676637… "Trump h… 2023-05-19T23:… <chr [1]>
## 9 1487008168693702659 1659706261284… "@Gunthe… 2023-05-19T23:… <chr [1]>
## 10 1464504657263353860 1659706020648… "https:/… 2023-05-19T23:… <chr [1]>
## # … with 90 more rows, and 12 more variables: data.geo.place_id <chr>,
## # data.geo.coordinates.type <chr>, data.geo.coordinates.coordinates <list>,
## # data.public_metrics.retweet_count <int>,
## # data.public_metrics.reply_count <int>,
## # data.public_metrics.like_count <int>,
## # data.public_metrics.quote_count <int>,
## # data.public_metrics.impression_count <int>, meta.newest_id <chr>, …
If we wanted to find the locations of these Tweets, we could then do something like the following:
get_geo <- function(place_id){
link <- paste0("https://api.twitter.com/1.1/geo/id/",
place_id,
".json")
all <- httr::GET(url = link,
httr::add_headers(.headers = headers)) %>%
httr::content(., as = "text") %>%
fromJSON(., flatten = TRUE)
out <- data.frame(id = all$id,
name = all$name,
full_name = all$full_name,
country = all$country,
country_code = all$country_code,
place_type = all$place_type)
out
}
locations <- list()
for(i in 1:length(unique(response$data.geo.place_id))){
place <- unique(response$data.geo.place_id)[i]
out <- get_geo(place)
locations[[i]] <- out
}
locs <- do.call("rbind",locations)
locs
## id name full_name
## 1 6f31c24707aca514 Holywell Holywell, Wales
## 2 791e00bcadc4615f Glasgow Glasgow, Scotland
## 3 53b67b1d1cc81a51 Birmingham Birmingham, England
## 4 703ee413ec69365a Newhaven Newhaven, England
## 5 2fa9f57ed641748a Menstrie Menstrie, Scotland
## 6 511655fc081bb251 Portsmouth Portsmouth, England
## 7 65b23b0045f450f6 Kingston upon Thames Kingston upon Thames, London
## 8 01c7ed8caf11e8b2 Kensington Kensington, London
## 9 25d3e991f5637f5a South West South West, England
## 10 28679b23ed15b380 Belfast Belfast, Northern Ireland
## 11 4cb7ff8db49dfaa0 Ashford Ashford, England
## 12 038247c1b5bb34c9 Greenwich Greenwich, London
## 13 0af014accd6f6e99 Scotland Scotland, United Kingdom
## 14 315b740b108481f6 Manchester Manchester, England
## 15 06cb7db38dd5f950 Worcester Worcester, England
## 16 3d8aada45e3c5866 Beverley Beverley, England
## 17 3bc1b6cfd27ef7f6 East East, England
## 18 2a3f152d1ac5044a Hackney Hackney, London
## 19 1039a706613a52c5 Bricket Wood Bricket Wood, East
## 20 5f54245c51670911 Milford on Sea Milford on Sea, England
## 21 457b4814b4240d87 London London, England
## 22 01d711761615d783 Ormskirk Ormskirk, England
## 23 6a779d5cb7e570e8 Chelmsford Chelmsford, East
## 24 1f36c2b60fbc98ac Swansea Swansea, Wales
## 25 2fbecaad4fd6b148 Higher Penwortham Higher Penwortham, England
## 26 7ae9e2f2ff7a87cd Edinburgh Edinburgh, Scotland
## 27 31fffbe34de66921 Salford Salford, England
## 28 4b6c0ea1297b258a Leyland Leyland, England
## 29 56c45474148ca4da Paddington Paddington, London
## 30 544762ebf7fda780 Islington Islington, London
## 31 7ef79c5ab17d518c Barnet Barnet, London
## 32 3bb508d078dcdf12 King's Lynn King's Lynn, England
## 33 4249bb41acf40cfb Byker Byker, England
## 34 67bc7263f7b9047b North East North East, England
## 35 06168d1feda43857 South East South East, England
## 36 46f2260bd86a6f0c Redcar Redcar, England
## 37 46e17f9d2f1027e9 Gustard Wood Gustard Wood, East
## 38 42d0cf7d49d27c95 Hillingdon Hillingdon, London
## 39 62a2a7f86cd9a5b4 Dundee Dundee, Scotland
## 40 074898ace1d8ef8a Horsham Horsham, England
## 41 1da00c8852cc9da2 Newport Newport, Wales
## 42 593b55aac2dd394d Wimborne Minster Wimborne Minster, England
## 43 3395783754204776 Kirkham Kirkham, England
## 44 06f9f5a068aa411f Corby Corby, England
## 45 6e99018e409ed45f West Ashby West Ashby, England
## 46 1c37515518593fe3 Richmond Richmond, London
## 47 3455d7166dccdac5 West Molesey West Molesey, South East
## 48 01cf9c6409a7f7a0 Kilmarnock Kilmarnock, Scotland
## 49 58468d6e28fde202 Fleetwood Fleetwood, England
## 50 1e8d77a6081c8ce6 Pwllheli Pwllheli, Wales
## 51 07e9c7d1954fff64 Sheffield Sheffield, England
## 52 46d6cb31bd607799 Middleton Middleton, England
## 53 2dc53e413a6cbcd8 Middlesbrough Middlesbrough, England
## 54 758461cd66db17e0 Sutton Sutton, London
## 55 48eef71af51c9838 Stapleford Stapleford, England
## 56 52b3e5e1ab04b40e Hambleton Hambleton, England
## 57 759dfe79a02eb78a Blackburn Blackburn, England
## 58 38d05a66be6d4ee1 Colchester Colchester, England
## 59 45549e3e82b86df1 Braunstone Braunstone, England
## country country_code place_type
## 1 United Kingdom GB city
## 2 United Kingdom GB city
## 3 United Kingdom GB city
## 4 United Kingdom GB city
## 5 United Kingdom GB city
## 6 United Kingdom GB city
## 7 United Kingdom GB city
## 8 United Kingdom GB city
## 9 United Kingdom GB admin
## 10 United Kingdom GB city
## 11 United Kingdom GB city
## 12 United Kingdom GB city
## 13 United Kingdom GB admin
## 14 United Kingdom GB city
## 15 United Kingdom GB city
## 16 United Kingdom GB city
## 17 United Kingdom GB admin
## 18 United Kingdom GB city
## 19 United Kingdom GB city
## 20 United Kingdom GB city
## 21 United Kingdom GB city
## 22 United Kingdom GB city
## 23 United Kingdom GB city
## 24 United Kingdom GB city
## 25 United Kingdom GB city
## 26 United Kingdom GB city
## 27 United Kingdom GB city
## 28 United Kingdom GB city
## 29 United Kingdom GB city
## 30 United Kingdom GB city
## 31 United Kingdom GB city
## 32 United Kingdom GB city
## 33 United Kingdom GB city
## 34 United Kingdom GB admin
## 35 United Kingdom GB admin
## 36 United Kingdom GB city
## 37 United Kingdom GB city
## 38 United Kingdom GB city
## 39 United Kingdom GB city
## 40 United Kingdom GB city
## 41 United Kingdom GB city
## 42 United Kingdom GB city
## 43 United Kingdom GB city
## 44 United Kingdom GB city
## 45 United Kingdom GB city
## 46 United Kingdom GB city
## 47 United Kingdom GB city
## 48 United Kingdom GB city
## 49 United Kingdom GB city
## 50 United Kingdom GB city
## 51 United Kingdom GB city
## 52 United Kingdom GB city
## 53 United Kingdom GB city
## 54 United Kingdom GB city
## 55 United Kingdom GB city
## 56 United Kingdom GB city
## 57 United Kingdom GB city
## 58 United Kingdom GB city
## 59 United Kingdom GB city
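With these place names in hand, a natural next step is to merge them back onto the Tweets themselves. Here is a minimal sketch, assuming the column names shown in the output above:
# Attach the human-readable place information to each Tweet via its place_id
response_geo <- dplyr::left_join(response, locs,
                                 by = c("data.geo.place_id" = "id"))
# e.g., how many Tweets came from each place type?
table(response_geo$place_type)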
Note, though, that this endpoint is rate-limited to 75 requests per 15-minute window and does not accept batch requests.
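If you have more unique place IDs than the limit allows, a simple (if slow) workaround is to pause between requests: 75 requests per 15 minutes works out to roughly one request every 12 seconds. A minimal sketch:
place_ids <- unique(response$data.geo.place_id)
locations <- list()
for(i in seq_along(place_ids)){
  locations[[i]] <- get_geo(place_ids[i])
  Sys.sleep(12) # pause so we never exceed 75 requests per 15-minute window
}
locs <- do.call("rbind", locations)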
Anyway, that’s cool! What if we want to grab a bunch of Tweets by a particular user instead? With “Essential” access, you can grab up to a user’s last 3,200 Tweets. Applying the principles from above, and reading about pagination, we can develop the following function:
last_n_tweets <- function(bearer_token = "", user_id = "", n = 100,
tweet_fields = c("attachments",
"created_at",
"entities",
"in_reply_to_user_id",
"public_metrics",
"referenced_tweets",
"source")){
headers <- c(`Authorization` = sprintf('Bearer %s', bearer_token))
# Convert User ID into Numerical ID
sprintf('https://api.twitter.com/2/users/by?usernames=%s', user_id) %>%
httr::GET(url = .,
httr::add_headers(.headers = headers),
query = list()) %>%
httr::content(.,as="text") %>%
fromJSON(.,flatten = T) %>%
as.data.frame() -> tmp
num_id <- tmp$data.id
# For that user, grab most recent n tweets, in batches of 100
if(n <= 100){
requests <- n
}else{
requests <- rep(100,floor(n/100))
if(n %% 100 != 0){
requests <- c(requests, n %% 100)
}
}
next_token <- NA
all <- list()
# Initialize, grab first results
paste0('https://api.twitter.com/2/users/',num_id,'/tweets') %>%
httr::GET(url = .,
httr::add_headers(.headers = headers),
query = list(`max_results` = requests[1],
tweet.fields = paste(tweet_fields,collapse=","))) %>%
httr::content(.,as="text") %>%
fromJSON(.,flatten = T) %>%
as.data.frame() -> out
all[[1]] <- out
# For more than 100, need to use pagination tokens.
if(length(requests) >= 2){
next_token[2] <- unique(as.character(all[[1]]$meta.next_token))
for(i in 2:length(requests)){
paste0('https://api.twitter.com/2/users/',num_id,'/tweets') %>%
httr::GET(url = .,
httr::add_headers(.headers = headers),
query = list(`max_results` = requests[i],
tweet.fields = paste(tweet_fields,collapse=","),
pagination_token = next_token[i])) %>%
httr::content(.,as="text") %>%
fromJSON(.,flatten = T) %>%
as.data.frame() -> out
all[[i]] <- out
next_token[i + 1] <- unique(as.character(all[[i]]$meta.next_token))
}
}
do.call("rbind.fill",all)
}
Let’s test it out!
out <- last_n_tweets(twitter_info$bearer_token,"JoeBiden",3200)
as_tibble(out)
## # A tibble: 3,200 × 21
## data.id data.created_at data.edit_histo… data.text data.referenced…
## <chr> <chr> <list> <chr> <list>
## 1 1659319725359345… 2023-05-18T22:… <chr [1]> "I'm pro… <NULL>
## 2 1659209989939154… 2023-05-18T14:… <chr [1]> "RT @POT… <df [1 × 2]>
## 3 1659008430835658… 2023-05-18T01:… <chr [1]> "Made a … <NULL>
## 4 1658985812770398… 2023-05-18T00:… <chr [1]> "To this… <NULL>
## 5 1658959709041221… 2023-05-17T22:… <chr [1]> "Nancy P… <NULL>
## 6 1658933179405705… 2023-05-17T20:… <chr [1]> "We had … <NULL>
## 7 1658921350390530… 2023-05-17T19:… <chr [1]> "That’s … <NULL>
## 8 1658471867022467… 2023-05-16T13:… <chr [1]> "My plan… <NULL>
## 9 1658196835868049… 2023-05-15T19:… <chr [1]> "Student… <NULL>
## 10 1658189261529571… 2023-05-15T19:… <chr [1]> "We pass… <NULL>
## # … with 3,190 more rows, and 16 more variables:
## # data.in_reply_to_user_id <chr>, data.entities.urls <list>,
## # data.entities.annotations <list>, data.entities.mentions <list>,
## # data.attachments.media_keys <list>,
## # data.public_metrics.retweet_count <int>,
## # data.public_metrics.reply_count <int>,
## # data.public_metrics.like_count <int>, …
Excellent! Now, if you had an Academic Research account, you might want to do something similar but collect all of a user’s Tweets. The function below does just that (you won’t be able to use it, but the code is illustrative):
get_all_tweets <- function(bearer_token = "", user_id = "",
tweet_fields = c("attachments",
"created_at",
"entities",
"in_reply_to_user_id",
"public_metrics",
"referenced_tweets",
"source")){
# Build the authorization header from the supplied bearer token
headers <- c(`Authorization` = sprintf('Bearer %s', bearer_token))
# Initialize, grab first results
all <- list()
paste0('https://api.twitter.com/2/tweets/search/all') %>%
httr::GET(url = .,
httr::add_headers(.headers = headers),
query = list(query = paste0("from:",user_id),
start_time = "2006-03-22T00:00:00Z",
max_results = 500,
tweet.fields = paste(tweet_fields,collapse=",")
)) %>%
httr::content(.,as="text") %>%
fromJSON(.,flatten = T) %>%
as.data.frame() -> out
all[[1]] <- out
# For more than 500 Tweets, need to loop
nxt <- unique(as.character(all[[1]]$meta.next_token))
if(identical(nxt,character(0))){
return(out)
}else{
stop <- F
i <- 2
}
while(stop == F){
Sys.sleep(2)
paste0('https://api.twitter.com/2/tweets/search/all') %>%
httr::GET(url = .,
httr::add_headers(.headers = headers),
query = list(query = paste0("from:",user_id),
start_time = "2006-03-22T00:00:00Z",
max_results = 500,
next_token = nxt,
tweet.fields = paste(tweet_fields,collapse=",")
)) %>%
httr::content(.,as="text") %>%
fromJSON(.,flatten = T) %>%
as.data.frame() -> out
all[[i]] <- out
nxt <- unique(as.character(all[[i]]$meta.next_token))
stop <- identical(nxt,character(0))
i <- i + 1
}
do.call("rbind.fill",all)
}
Let’s take it for a spin!
tmp <- get_all_tweets(twitter_info$bearer_token, "POTUS")
plot(table(as.Date(tmp$data.created_at)))
Before we get ahead of ourselves, we want to make sure that you have the fundamentals in order. Do the following:
Write a script which…
Save and submit your working R script to the Exercise/Quiz Submission Link by the end of the day (ideally, end of lab session!).
This lab is partially based on materials provided by Sean Kates, Pablo Barbera, and Drew Dimmery.