Last compiled on 2023-01-17 22:45:51
NOTES:
Download tweets by their ids using
hydrate_tweets(ids = , bearer_token = , bind_tweets = TRUE)
Include illustrations of interim data tables
Document how to flatten nested hashtag and mention information associated with each tweet
Scenario: the ids of referenced (source) tweets (e.g., replied_to or quoted) are available, but the content of these referenced (source) tweets still needs to be collected.
The resultant data frame is ready to join with other Twitter data frames.
library(academictwitteR) # get_all_tweets(), hydrate_tweets()
library(rtweet) # save_as_csv() that prepends numerical ids as characters
library(dplyr) # %>% convenient data cleaning
gcbcklog_all2011-2023_79997.csv is the combined dataset.
Many tweets in our data reference other tweets. A referencing tweet is either replied_to or quoted (retweets are excluded because they are handled differently).
table(tweet$type)
##
## initial quoted replied_to
## 33433 10745 36562
paste(round(100*table(tweet$type)/length(tweet$type),1), "%", sep = "") # Distribution of tweet types
## [1] "41.4%" "13.3%" "45.3%"
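The pasted percentages above lose their category labels. A minimal sketch of one way to keep them attached, using `prop.table()` on a toy vector (the `type` values here are made up for illustration, not taken from the dataset):

```r
# Toy stand-in for tweet$type (illustrative values only)
type <- c("initial", "quoted", "replied_to", "replied_to", "initial")

# prop.table() converts counts to proportions and keeps the category names
pct <- round(100 * prop.table(table(type)), 1)
pct

# Percentage strings that keep their labels
labeled <- setNames(paste0(pct, "%"), names(pct))
labeled
```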
tail(tweet[tweet$type!="initial",c(3,4,15)])
## # A tibble: 6 × 3
## created_at hashtag type
## <chr> <chr> <chr>
## 1 2023-01-02T00:12:54.000Z <NA> replied_to
## 2 2023-01-02T02:32:41.000Z gcbacklog replied_to
## 3 2023-01-02T03:59:46.000Z greencardbacklog,H4EADdelays quoted
## 4 2023-01-02T17:44:58.000Z <NA> replied_to
## 5 2023-01-02T22:05:19.000Z gcbacklog replied_to
## 6 2023-01-03T00:00:02.000Z GCBacklog,EAGLEAct replied_to
ATTENTION: Remove NAs from the vector of tweet ids before requesting them.
refids <- unique(tweet$referenced_status_id);length(refids)
## [1] 32968
refids <- refids[!is.na(refids)]
round(summary(nchar(refids))) # Make sure there's no NA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10 19 19 19 19 19
There are 32967 referenced tweets whose information needs to be collected.
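With this many ids, the request is made in slices such as `refids[1:1000]`. One way to build all slices up front is sketched below; the chunk size of 1000 and the toy ids are assumptions for illustration, and the commented-out loop only indicates where `hydrate_tweets()` would be called.

```r
# Toy ids standing in for refids (illustrative only)
ids <- as.character(seq_len(2500))

chunk_size <- 1000  # assumed chunk size, mirroring refids[1:1000]
chunks <- split(ids, ceiling(seq_along(ids) / chunk_size))

lengths(chunks)  # number of ids in each chunk

# Each chunk could then be hydrated in turn, e.g.:
# for (ch in chunks) {
#   res <- hydrate_tweets(ids = ch, bearer_token = bearer.token, bind_tweets = TRUE)
# }
```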
Data collection started at 2023-01-17 22:45:53
reftws1 <- hydrate_tweets(ids = refids[1:1000], bearer_token = bearer.token, bind_tweets = TRUE) # pause after every 299 batches (100 tweets per batch)
## Batch 1 out of 10 : ids 142033295075573763 to 696801328056692740
## Total of 87 out of 1000 tweet(s) retrieved.
## Batch 2 out of 10 : ids 699625228444389376 to 903822385022984192
## Total of 168 out of 1000 tweet(s) retrieved.
## Batch 3 out of 10 : ids 903736264935034881 to 921025153206833154
## Total of 253 out of 1000 tweet(s) retrieved.
## Batch 4 out of 10 : ids 921112528540889088 to 928263204291141632
## Total of 339 out of 1000 tweet(s) retrieved.
## Batch 5 out of 10 : ids 928464991178518535 to 938465573645758464
## Total of 430 out of 1000 tweet(s) retrieved.
## Batch 6 out of 10 : ids 938516594887745541 to 944319115367399424
## Total of 518 out of 1000 tweet(s) retrieved.
## Batch 7 out of 10 : ids 944695383053295616 to 950420401309536257
## Total of 607 out of 1000 tweet(s) retrieved.
## Batch 8 out of 10 : ids 950510932018548736 to 951839934184411136
## Total of 702 out of 1000 tweet(s) retrieved.
## Batch 9 out of 10 : ids 951849705423736832 to 952565454744285186
## Total of 795 out of 1000 tweet(s) retrieved.
## Batch 10 out of 10 : ids 952565585132605440 to 953221819414253570
## Total of 882 out of 1000 tweet(s) retrieved.
Data collection completed at 2023-01-17 22:46:11
names(reftws1)
## [1] "entities" "public_metrics" "created_at"
## [4] "possibly_sensitive" "id" "edit_history_tweet_ids"
## [7] "conversation_id" "author_id" "lang"
## [10] "text" "attachments" "referenced_tweets"
## [13] "in_reply_to_user_id" "geo"
nrow(reftws1)
## [1] 882
head(reftws1[,c("created_at", "text")])
## created_at
## 1 2013-05-15T17:16:32.000Z
## 2 2013-08-17T17:49:16.000Z
## 3 2013-08-20T22:56:45.000Z
## 4 2013-10-19T00:20:09.000Z
## 5 2013-03-21T16:12:39.000Z
## 6 2013-10-19T00:38:56.000Z
## text
## 1 Facing an 8-year green card backlog, this @wharton grad has no choice but to take her business skills abroad: http://t.co/S97rVxQH8i #iMarch
## 2 RNC Calls for Immigration Reform Focused on 'Needs of United States Employers' via Breitbart http://t.co/CdEHdI3BDO
## 3 Why do you support #immigration reform? #CIR #timeisnow http://t.co/YrvoMrZ4T0
## 4 Bill Young dedicated his life to serving the people of Florida, and his loss is a great one for his constituents. http://t.co/6Fcuji2FTX
## 5 Today, the House delivered a responsible, #BalancedBudget. http://t.co/MqBpZNzPqn
## 6 Bill Young was a tireless servant to his constituents and our armed forces. My deepest sympathies go out to his family & loved ones.
range(reftws1$created_at)
## [1] "2013-03-21T16:12:39.000Z" "2018-01-16T13:42:45.000Z"
df0 <- reftws1[,c("author_id","created_at","id","conversation_id","text","lang","in_reply_to_user_id")]
Extract nested columns of the same length as the main data frame
reftws1$public_metrics[1:3,]
## retweet_count reply_count like_count quote_count impression_count
## 1 356 190 170 0 0
## 2 1 1 0 0 0
## 3 14 2 3 0 0
df1 <- df0
df1$retweet_count <- reftws1$public_metrics$retweet_count
df1$reply_count <- reftws1$public_metrics$reply_count
df1$like_count <- reftws1$public_metrics$like_count
df1$quote_count <- reftws1$public_metrics$quote_count
names(df1)
## [1] "author_id" "created_at" "id"
## [4] "conversation_id" "text" "lang"
## [7] "in_reply_to_user_id" "retweet_count" "reply_count"
## [10] "like_count" "quote_count"
We have no control over how many people are mentioned or how many hashtags are used in a given tweet. This variable number of mentions and hashtags per tweet adds extra steps to processing Twitter data.
The nested mention table has 4 variables. username is the Twitter user screen name and id is the user id (back in 2021, the id information of mentioned users wasn't available). start and end hold the numeric positions in the tweet text where the mention begins and ends.
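As a quick check of how `start` and `end` map onto the text, the offsets of the first tweet's mention in these data (42 and 50) can be used with `substr()`. The sketch assumes the offsets are zero-based with an exclusive `end`, which is consistent with the values shown in this document; R strings are one-based, hence the `+ 1`.

```r
# Beginning of the first referenced tweet's text
text <- "Facing an 8-year green card backlog, this @wharton grad has no choice"
start <- 42
end <- 50

# Shift the zero-based start by one; the exclusive end needs no shift
mention_text <- substr(text, start + 1, end)
mention_text
```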
The number of mentions varies from tweet to tweet. Thus, the nested mention table has a different number of rows for each observation.
The purpose of this step is to collapse each tweet's username values into a single comma-separated column called mention and its id values into a column called mentioned_id.
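The target shape can be illustrated on a toy list before working with the real data. The sketch below is an alternative to the row-binding approach used in this document, not a replacement for it: it collapses each list element directly with `vapply()` and returns `NA` for tweets without mentions. The helper `collapse_col()` and all usernames and ids are made up for illustration.

```r
# Toy nested mention list: one element per tweet (NULL = no mentions)
mentions <- list(
  data.frame(username = "alice", id = "11", stringsAsFactors = FALSE),
  NULL,
  data.frame(username = c("bob", "carol"), id = c("22", "33"),
             stringsAsFactors = FALSE)
)

# Hypothetical helper: collapse one column of a nested table into one string
collapse_col <- function(x, col) {
  if (is.null(x) || nrow(x) == 0) return(NA_character_)
  paste(x[[col]], collapse = ",")
}

flat <- data.frame(
  mention      = vapply(mentions, collapse_col, character(1), col = "username"),
  mentioned_id = vapply(mentions, collapse_col, character(1), col = "id"),
  stringsAsFactors = FALSE
)
flat
```

Because the result has exactly one row per tweet, it could be column-bound to the main data frame without a merge step.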
names(reftws1$entities$mentions[[1]]) # variables of the nested mention table
## [1] "start" "end" "username" "id"
data.frame(cbind(
start = reftws1$entities$mentions[[1]][,1],
end = reftws1$entities$mentions[[1]][,2],
username = paste(substr(reftws1$entities$mentions[[1]][,3],1,2),"***", sep = ""),
id = paste(substr(reftws1$entities$mentions[[1]][,4],1,3),"***", sep = "")
))
## start end username id
## 1 42 50 Wh*** 771***
mention <- reftws1$entities$mentions
names(mention) <- seq_along(mention)
mention_df <- do.call(what = "rbind", args = lapply(mention, as.data.frame))
head(mention_df)
## start end username id
## 1 42 50 Wharton 7717612
## 10 0 13 RepGoodlatte 37920978
## 11 0 8 GOPWhip 2693983838
## 13 109 116 fwd_us 1015198328441786369
## 15 0 15 HouseJudiciary 246357149
## 16 0 13 GOPoversight 22508473
mention_df$row_id <- floor(as.numeric(rownames(mention_df)))
drop_col <- c("start","end") # position columns are not needed after flattening
mention_df <- mention_df[,!names(mention_df)%in%drop_col]
class(mention_df$username)
## [1] "character"
mention_df$username <- trimws(mention_df$username)
Take a look at the resultant mention data. The row number was successfully preserved. The username and id information is masked for privacy.
head(data.frame(cbind(
username = paste(substr(mention_df$username,1,2), "***", sep = ""),
id = paste(substr(mention_df$id,1,3), "***", sep = ""),
row_id = mention_df$row_id)))
## username id row_id
## 1 Wh*** 771*** 1
## 2 Re*** 379*** 10
## 3 GO*** 269*** 11
## 4 fw*** 101*** 13
## 5 Ho*** 246*** 15
## 6 GO*** 225*** 16
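The `row_id` recovery relies on how `rbind()` labels rows when the list elements are named: a one-row element keeps its name, while a multi-row element gets suffixed names such as `3.1` and `3.2` (visible in the hashtag output later in this document). A small sketch with made-up data shows why `floor(as.numeric(rownames(...)))` returns the position of the originating tweet:

```r
# Toy nested list: tweet 2 has no mentions, tweet 3 has two
nested <- list(
  data.frame(username = "alice", stringsAsFactors = FALSE),
  NULL,
  data.frame(username = c("bob", "carol"), stringsAsFactors = FALSE)
)
names(nested) <- seq_along(nested)

# Empty elements are dropped; row names carry the list position
stacked <- do.call(what = "rbind", args = lapply(nested, as.data.frame))
rownames(stacked)

# floor() strips the ".1", ".2" suffix and leaves the tweet position
stacked$row_id <- floor(as.numeric(rownames(stacked)))
stacked
```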
Collapse the username and id values that share the same row_id into one row, separated by commas.
df2 <- aggregate(mention_df[,c("username", "id")], list(mention_df$row_id), paste, collapse = ",")
names(df2) <- c("row_id", "mention", "mentioned_id")
head(data.frame(cbind(
row_id = df2$row_id,
mention = paste(substr(df2$mention, 1, 2), "**, **, **", substr(df2$mention, nchar(df2$mention)-2, nchar(df2$mention)), sep = ""),
mentioned_id = paste(substr(df2$mentioned_id, 1, 2), "**, **, **", substr(df2$mentioned_id, nchar(df2$mentioned_id)-2, nchar(df2$mentioned_id)), sep = "")
)))
## row_id mention mentioned_id
## 1 1 Wh**, **, **ton 77**, **, **612
## 2 10 Re**, **, **tte 37**, **, **978
## 3 11 GO**, **, **hip 26**, **, **838
## 4 13 fw**, **, **_us 10**, **, **369
## 5 15 Ho**, **, **ary 24**, **, **149
## 6 16 GO**, **, **ght 22**, **, **473
Merge the aggregated mention data (df2) into the main data frame (df1)
df1$row_id <- 1:nrow(df1)
names(df1);names(df2)
## [1] "author_id" "created_at" "id"
## [4] "conversation_id" "text" "lang"
## [7] "in_reply_to_user_id" "retweet_count" "reply_count"
## [10] "like_count" "quote_count" "row_id"
## [1] "row_id" "mention" "mentioned_id"
df3 <- merge(df1, df2, by.x = "row_id", by.y = "row_id", all.x = TRUE)
names(df3)
## [1] "row_id" "author_id" "created_at"
## [4] "id" "conversation_id" "text"
## [7] "lang" "in_reply_to_user_id" "retweet_count"
## [10] "reply_count" "like_count" "quote_count"
## [13] "mention" "mentioned_id"
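Because `all.x = TRUE` makes this a left join, tweets without any mention stay in the result with `NA` in the new columns. A toy sketch of that behavior (all values made up):

```r
main <- data.frame(row_id = 1:3, text = c("a", "b", "c"),
                   stringsAsFactors = FALSE)
mentions <- data.frame(row_id = c(1, 3), mention = c("alice", "bob,carol"),
                       stringsAsFactors = FALSE)

# Left join: keep every row of `main`, fill unmatched rows with NA
joined <- merge(main, mentions, by = "row_id", all.x = TRUE)
joined

nrow(joined) == nrow(main)
```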
The nested hashtag list is handled in the same way as the mention list.
hashtag <- reftws1$entities$hashtags
names(hashtag) <- seq_along(hashtag)
hashtag_df <- do.call(what = "rbind", args = lapply(hashtag, as.data.frame))
hashtag_df$row_id <- floor(as.numeric(rownames(hashtag_df)))
head(hashtag_df)
## start end tag row_id
## 1 133 140 iMarch 1
## 3.1 19 31 immigration 3
## 3.2 40 44 CIR 3
## 3.3 45 55 timeisnow 3
## 5 42 57 BalancedBudget 5
## 8 25 37 immigration 8
df4 <- aggregate(trimws(hashtag_df$tag), list(hashtag_df$row_id), paste, collapse = ",")
names(df4) <- c("row_id", "hashtag")
head(df4)
## row_id hashtag
## 1 1 iMarch
## 2 3 immigration,CIR,timeisnow
## 3 5 BalancedBudget
## 4 8 immigration
## 5 9 immigration
## 6 10 immigration
Merge the hashtag table df4 into the main dataset (which already has the mention columns)
df5 <- merge(df3, df4, by.x = "row_id", by.y = "row_id", all.x = TRUE)
df5 <- df5[,colnames(df5) != "row_id"]
colnames(df5)[colnames(df5) == "id"] <- "status_id"
Because this is a “referenced” data frame (the source that the original tweets quoted or replied to), it is necessary to add the prefix referenced_ to the variable names.
prefix <- "referenced_"
needs_prefix <- !grepl(prefix, colnames(df5), ignore.case = FALSE) # columns not yet prefixed
colnames(df5)[needs_prefix] <- paste0(prefix, colnames(df5)[needs_prefix])
df5 <- df5[,sort(colnames(df5))]
colnames(df5)
## [1] "referenced_author_id" "referenced_conversation_id"
## [3] "referenced_created_at" "referenced_hashtag"
## [5] "referenced_in_reply_to_user_id" "referenced_lang"
## [7] "referenced_like_count" "referenced_mention"
## [9] "referenced_mentioned_id" "referenced_quote_count"
## [11] "referenced_reply_count" "referenced_retweet_count"
## [13] "referenced_status_id" "referenced_text"
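With the `referenced_` prefix in place, the table can be attached to the original tweets through the shared `referenced_status_id` key. The sketch below uses toy stand-ins for `tweet` and `df5` (the names `tweet_toy` and `ref_toy` and all values are made up) and base `merge()`; `dplyr::left_join()` would work similarly.

```r
# Toy stand-in for the original tweets
tweet_toy <- data.frame(
  status_id = c("901", "902", "903"),
  type = c("initial", "quoted", "replied_to"),
  referenced_status_id = c(NA, "501", "502"),
  stringsAsFactors = FALSE
)
# Toy stand-in for the hydrated referenced tweets
ref_toy <- data.frame(
  referenced_status_id = c("501", "502"),
  referenced_text = c("source tweet A", "source tweet B"),
  stringsAsFactors = FALSE
)

# Left join: keep every original tweet, add source content where available
combined <- merge(tweet_toy, ref_toy, by = "referenced_status_id", all.x = TRUE)
combined[, c("status_id", "type", "referenced_text")]
```

Initial tweets, and referenced tweets that could not be retrieved, end up with `NA` in the `referenced_` columns.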
thetitle <- paste("gcbcklog_ref", substr(range(reftws1$created_at)[1], 1, 4), "to", substr(range(reftws1$created_at)[2], 1, 4),"total", nrow(df5), "obs.csv", sep = ""); thetitle
## [1] "gcbcklog_ref2013to2018total882obs.csv"
save_as_csv(df5, thetitle)
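Tweet ids are kept as character strings because many of them exceed 2^53, the range in which a double can represent every integer exactly. A short sketch, using an id that appears in the mention table above and base R's CSV functions rather than `save_as_csv()`, illustrates the risk and one way to read ids back safely (the temporary file and column name are illustrative):

```r
id_chr <- "1015198328441786369"

# Converting to numeric silently changes the id
id_num <- as.numeric(id_chr)
sprintf("%.0f", id_num) == id_chr

# Round trip through CSV while forcing character columns on read
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(referenced_status_id = id_chr), tmp, row.names = FALSE)
back <- read.csv(tmp, colClasses = "character")
identical(back$referenced_status_id, id_chr)
```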
Execution ended at 2023-01-17 22:46:11