NOTE: This is a demonstration of downloading and cleaning non-retweet Twitter data using the academic research API. I’m glad to share more best practices and lessons learned in Twitter data collection and analysis; I can be reached here.

Academic Research API

ATTENTION:

The resultant data frame has the following columns.

[1] “author_id” “conversation_id” “created_at” “hashtag”
[5] “lang” “like_count” “mention” “quote_count”
[9] “referenced_status_id” “referenced_user_id” “reply_count” “retweet_count”
[13] “status_id” “text” “type”

NOTE: author, referenced_status_id, and referenced_user information is downloaded separately and merged into this resultant data frame once the download is complete; a sketch of that step follows.
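A minimal sketch of that follow-up step, assuming academictwitteR’s get_user_profile() and the flattened data frame (df6) built at the end of this demonstration; the selected profile columns are illustrative.

author_ids <- unique(df6$author_id) # df6 is the flattened data frame built below
users <- get_user_profile(author_ids, bearer_token = bearer.token)
users_small <- users[, c("id", "username", "name")] # illustrative subset of profile fields
colnames(users_small)[colnames(users_small) == "id"] <- "author_id"
df_users <- merge(df6, users_small, by = "author_id", all.x = TRUE) # author info merged in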

library(academictwitteR) # get_all_tweets()
library(rtweet) # save_as_csv(), which prepends numerical IDs so they are stored as characters
library(dplyr) # %>% convenient data cleaning
library(tibble) # enframe(table(data$variable)) frequency table, name, value

REQUIRED:

  1. bearer_token (API): a token of 100+ characters issued by the Twitter API; a sketch for loading it from .Renviron follows.
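A minimal sketch of loading the token without hard-coding it, assuming it has been stored in .Renviron as TWITTER_BEARER (the variable academictwitteR’s set_bearer()/get_bearer() convention uses).

# set_bearer() # optional: opens .Renviron so TWITTER_BEARER=<token> can be added once
bearer.token <- Sys.getenv("TWITTER_BEARER") # read the stored token after restarting R
stopifnot(nchar(bearer.token) > 0)           # fail early if the token was not found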

Download

  1. get_all_tweets() retrieves full-archive Twitter data matching the query.

  2. query, e.g., new_query <- build_query(query = "(\"multi-string keywords\" OR keyword1 OR keyword2)", lang = "en", is_retweet = FALSE, has_mentions = NULL); a runnable sketch appears after this list.

  3. start_tweets =, the start of the targeted time range.

  4. end_tweets =, the end of the targeted time range. Because (1) most recent tweets are downloaded first and (2) most Twitter data collection requires several (many) iterations, the end_tweets has to be reset after every iteration.

  5. newtime is the time stamp of the earliest tweet from the last iteration. This variable is recreated at the end of each iteration to enable a semi-automated workflow.

  6. bind_tweets = TRUE to keep the downloaded Twitter data as a data frame.

  7. n =, number of tweets to download. DON’T BE GREEDY!

  8. page_n = is set to 500 by default, but reducing it to 200 or 100 lowers the chance of interruptions due to errors.
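A minimal sketch of setting up the first iteration, combining items 2 and 5 above; the keywords are placeholders, and initializing newtime to the current time is an assumption for the very first run (later runs reset it at the end of each iteration, as shown at the bottom of this post).

new_query <- build_query(query = "(\"multi-string keywords\" OR keyword1 OR keyword2)",
                         lang = "en", is_retweet = FALSE, has_mentions = NULL)
newtime <- format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC") # first-run end_tweets (assumption)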

Download started at 2023-01-16 19:55:16

origtw <- get_all_tweets(query = new_query, start_tweets = "2011-01-03T06:00:00Z",
                         end_tweets = newtime, bearer_token = bearer.token,
                         bind_tweets = TRUE, n = 1000, page_n = 100)
## Warning: Recommended to specify a data path in order to mitigate data loss when
## ingesting large amounts of data.
## Warning: Tweets will not be stored as JSONs or as a .rds file and will only be
## available in local memory if assigned to an object.
## query:  ("green card backlog" OR gcbacklog OR greencardbacklog) -is:retweet lang:en 
## Total pages queried: 1 (tweets captured this page: 100).
## Total pages queried: 2 (tweets captured this page: 100).
## Total pages queried: 3 (tweets captured this page: 98).
## Total pages queried: 4 (tweets captured this page: 98).
## Total pages queried: 5 (tweets captured this page: 98).
## Total pages queried: 6 (tweets captured this page: 99).
## Total pages queried: 7 (tweets captured this page: 100).
## Total pages queried: 8 (tweets captured this page: 97).
## Total pages queried: 9 (tweets captured this page: 100).
## Total pages queried: 10 (tweets captured this page: 94).
## Total pages queried: 11 (tweets captured this page: 100).
## Total tweets captured now reach 1000 : finishing collection.

Download ended at 2023-01-16 19:55:40

range(origtw$created_at) # time range of downloaded Twitter data
## [1] "2022-09-30T20:40:03.000Z" "2022-11-14T23:48:56.000Z"

Flatten Twitter Data

  1. Copy regular (non-nested) columns to a new data frame.

  2. Extract nested mention lists and convert them to a column.

  3. Extract nested hashtag lists and convert them to a column.

Regular Columns

[grepl("^RT", origtw$text)==F] is to filter out potential retweets/quoted tweets whose text are truncated.

df_nort = origtw[grepl("^RT", origtw$text)==F,c("author_id","created_at","id","conversation_id","text","lang","in_reply_to_user_id")] # non-RT rows only; "in_reply_to_user_id" will be renamed to "referenced_user_id" below
length(which(origtw$lang!="en")) # Make sure all tweets are in English
## [1] 0
names(origtw$public_metrics)
## [1] "retweet_count"    "reply_count"      "like_count"       "quote_count"     
## [5] "impression_count"
df_nort$retweet_count = origtw$public_metrics$retweet_count[grepl("^RT", origtw$text)==F]
df_nort$reply_count = origtw$public_metrics$reply_count[grepl("^RT", origtw$text)==F]
df_nort$like_count = origtw$public_metrics$like_count[grepl("^RT", origtw$text)==F]
df_nort$quote_count = origtw$public_metrics$quote_count[grepl("^RT", origtw$text)==F]
which(colnames(df_nort)=="id") # 3
## [1] 3
colnames(df_nort)[which(colnames(df_nort)=="id")] = "status_id"

Nested Mention List

mention <- origtw$entities$mentions[grepl("^RT", origtw$text)==F]
names(mention) = seq_along(mention) # name the elements in the list to match the main dataset
mention_df <- do.call("rbind", lapply(mention, as.data.frame)) # extract the data observation part from the list
mention_df$row_id <- floor(as.numeric(rownames(mention_df))) # make row_id an integer (row names like "3.1" map back to row 3)
drop_col = c("start","end")
mention_df <- mention_df[,!names(mention_df)%in%drop_col]
class(mention_df$username) # character
## [1] "character"
df2 = aggregate(trimws(mention_df$username), list(mention_df$row_id), paste, collapse=",") # row_id and comma separated mention
names(df2) = c("row_id","mention")
#df2$mention
names(df_nort)
##  [1] "author_id"           "created_at"          "status_id"          
##  [4] "conversation_id"     "text"                "lang"               
##  [7] "in_reply_to_user_id" "retweet_count"       "reply_count"        
## [10] "like_count"          "quote_count"
df_nort$row_id = seq_len(nrow(df_nort))
df3 = merge(df_nort,df2, by.x = "row_id", by.y = "row_id", all.x = TRUE)

Nested Hashtag List

hashtag = origtw$entities$hashtags[grepl("^RT", origtw$text)==F]
names(hashtag) = seq_along(hashtag)
hashtag_df = do.call("rbind", lapply(hashtag, as.data.frame))
hashtag_df$row_id = floor(as.numeric(rownames(hashtag_df)))
hashtag_df = hashtag_df[,!names(hashtag_df)%in%c("start","end")]
head(hashtag_df) # Take a look at the hashtags
##                  tag row_id
## 2   greencardbacklog      2
## 3.1          Prop308      3
## 3.2        GCbacklog      3
## 4          GCBacklog      4
## 5.1         EagleAct      5
## 5.2        GCBacklog      5
df4 = aggregate(trimws(hashtag_df$tag), list(hashtag_df$row_id), paste, collapse = ",")
names(df4) = c("row_id","hashtag")
df5 = merge(df3,df4,by.x = "row_id",by.y = "row_id",all.x = TRUE)

Extract Referenced Users and Tweets

reference <- origtw$referenced_tweets[grepl("^RT", origtw$text)==F] # same non-RT filter so row_id aligns with df_nort
names(reference) = seq_along(reference)

reference_df = do.call("rbind", lapply(reference, as.data.frame))
reference_df$row_id = floor(as.numeric(rownames(reference_df)))
# Rename the referenced tweet "id" column to "referenced_status_id"
colnames(reference_df)[which(colnames(reference_df)=="id")] = "referenced_status_id"
nrow(reference_df) # How many other tweets do the downloaded tweets refer to?
## [1] 853
df6 = merge(df5, reference_df, by.x = "row_id", by.y = "row_id", all.x = TRUE) # tweets + referenced tweets
colnames(df6)[which(colnames(df6)== "in_reply_to_user_id")] <- "referenced_user_id" # update variable name
df6 <- df6[,sort(names(df6))];nrow(df6) # the number of rows is likely to increase a little bit b/c some tweets have two referenced tweets.
## [1] 1012
df6 <- unique(df6);nrow(df6)
## [1] 1012
# Why does df6 have more rows than df5 after merging? ANSWER: some tweets have more than one referenced_status_id (both a quoted and a replied_to tweet), even though the author_id, status_id, and referenced_user_id are the same
dup_rows <- enframe(table(df6$row_id))
dup_rows <- dup_rows[rev(order(dup_rows$value)),] # examine the tweets that referred to (quoted/replied_to) two other tweets
names(df6) # examine the variable names before saving the data frame to local disk.
##  [1] "author_id"            "conversation_id"      "created_at"          
##  [4] "hashtag"              "lang"                 "like_count"          
##  [7] "mention"              "quote_count"          "referenced_status_id"
## [10] "referenced_user_id"   "reply_count"          "retweet_count"       
## [13] "row_id"               "status_id"            "text"                
## [16] "type"

Save Flattened Dataframe

save_as_csv() from rtweet must be used so that the numerical IDs are saved as characters (it prepends a marker to the ID columns so they are not mangled).

cutoffdate <- substr(range(origtw$created_at)[1], 1, 13); cutoffdate
## [1] "2022-09-30T20"
thistitle <- paste("gcbcklog_", cutoffdate, ".csv", sep = "")
df6 <- df6[,names(df6)!= "row_id"] # remove the temporary row_id variable
df6$type[is.na(df6$type)] <- "initial" # tweets that reference no other tweet are labeled "initial"
save_as_csv(df6, thistitle) # save to local disk

Set end_tweets = for Next Iteration

newtime <- range(origtw$created_at)[1]; newtime # retain the new end time for next iteration.
## [1] "2022-09-30T20:40:03.000Z"