Last compiled on 2023-01-17 22:45:51

NOTES:

  1. Download tweets by their ids using hydrate_tweets(ids = , bearer_token = , bind_tweets = TRUE)

  2. Include illustrations of interim data tables

  3. Document how to flatten nested hashtag and mention information associated with each tweet

  4. Scenario: the ids of referenced (source) tweets (e.g., replied_to or quoted) are available, but the content of those source tweets still needs to be collected.

  5. The resultant data frame is ready to join with other Twitter data frames.

library(academictwitteR) # hydrate_tweets()
library(rtweet) # save_as_csv() that prepends numerical ids as characters
library(dplyr) # %>% convenient data cleaning

Import Tweet Data

gcbcklog_all2011-2023_79997.csv is the combined dataset.
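The import call itself is not shown in this document. A minimal sketch of one way to do it, assuming the file is an ordinary CSV and that the resulting object is the tweet data frame used below. Reading every column as character is my assumption, not the author's confirmed method; it is chosen so that 19-digit ids stay intact instead of being rounded when parsed as numbers. The toy file and its columns are hypothetical.

```r
# Toy stand-in for the real file; the path and columns are hypothetical
path <- tempfile(fileext = ".csv")
writeLines(c("status_id,referenced_status_id,type",
             "1610000000000000001,953221819414253570,replied_to",
             "1610000000000000002,,initial"), path)

# colClasses = "character" keeps long ids as text;
# na.strings = "" turns empty fields into NA
tweet <- read.csv(path, colClasses = "character", na.strings = "")
str(tweet)
```

For the real data the same call would take "gcbcklog_all2011-2023_79997.csv" as the path.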

Many tweets in our data reference other tweets, either as replies (replied_to) or as quotes (quoted). Retweets are excluded because they are handled differently.

table(tweet$type)
## 
##    initial     quoted replied_to 
##      33433      10745      36562
paste(round(100*table(tweet$type)/length(tweet$type),1), "%", sep = "") # Distribution of tweet types
## [1] "41.4%" "13.3%" "45.3%"
tail(tweet[tweet$type!="initial",c(3,4,15)])
## # A tibble: 6 × 3
##   created_at               hashtag                      type      
##   <chr>                    <chr>                        <chr>     
## 1 2023-01-02T00:12:54.000Z <NA>                         replied_to
## 2 2023-01-02T02:32:41.000Z gcbacklog                    replied_to
## 3 2023-01-02T03:59:46.000Z greencardbacklog,H4EADdelays quoted    
## 4 2023-01-02T17:44:58.000Z <NA>                         replied_to
## 5 2023-01-02T22:05:19.000Z gcbacklog                    replied_to
## 6 2023-01-03T00:00:02.000Z GCBacklog,EAGLEAct           replied_to

Get Tweet IDs

ATTENTION: Remove NA values from the vector of tweet ids.

refids <- unique(tweet$referenced_status_id);length(refids)
## [1] 32968
refids <- refids[!is.na(refids)]
round(summary(nchar(refids))) # Make sure there's no NA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      10      19      19      19      19      19

There are 32,967 referenced tweets whose information needs to be collected.
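The count drops by one because unique() keeps NA as a value of its own. A small sketch on made-up ids shows the same cleaning in one step, plus a stricter check than nchar(): every remaining id should consist of digits only.

```r
# Made-up ids: one NA and one duplicate
ids <- c("953221819414253570", NA, "142033295075573763", "953221819414253570")

# Drop NA first, then deduplicate
ids <- unique(ids[!is.na(ids)])

# Fail loudly if an NA or a non-numeric id slipped through
stopifnot(!anyNA(ids), all(grepl("^[0-9]+$", ids)))
length(ids)
```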

Download Referenced Tweets

Data collection started at 2023-01-17 22:45:53

reftws1 <- hydrate_tweets(ids = refids[1:1000], bearer_token = bearer.token, bind_tweets = TRUE) # pause after every 299 batches (100 tweets per batch)
## Batch 1 out of 10 : ids 142033295075573763 to 696801328056692740 
## Total of  87  out of  1000  tweet(s) retrieved.
## Batch 2 out of 10 : ids 699625228444389376 to 903822385022984192 
## Total of  168  out of  1000  tweet(s) retrieved.
## Batch 3 out of 10 : ids 903736264935034881 to 921025153206833154 
## Total of  253  out of  1000  tweet(s) retrieved.
## Batch 4 out of 10 : ids 921112528540889088 to 928263204291141632 
## Total of  339  out of  1000  tweet(s) retrieved.
## Batch 5 out of 10 : ids 928464991178518535 to 938465573645758464 
## Total of  430  out of  1000  tweet(s) retrieved.
## Batch 6 out of 10 : ids 938516594887745541 to 944319115367399424 
## Total of  518  out of  1000  tweet(s) retrieved.
## Batch 7 out of 10 : ids 944695383053295616 to 950420401309536257 
## Total of  607  out of  1000  tweet(s) retrieved.
## Batch 8 out of 10 : ids 950510932018548736 to 951839934184411136 
## Total of  702  out of  1000  tweet(s) retrieved.
## Batch 9 out of 10 : ids 951849705423736832 to 952565454744285186 
## Total of  795  out of  1000  tweet(s) retrieved.
## Batch 10 out of 10 : ids 952565585132605440 to 953221819414253570 
## Total of  882  out of  1000  tweet(s) retrieved.

Data collection completed at 2023-01-17 22:46:11
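Only the first 1,000 ids are hydrated here. The comment on the call mentions pausing after every 299 batches of 100 tweets, which suggests working through the full id vector in chunks of at most 29,900 ids. The sketch below is one way to organise that, not the author's code: hydrate_in_chunks, chunk_size, and pause_sec are hypothetical names, and the 15-minute pause is an assumption about the rate-limit window. The fetch argument stands in for a wrapper around hydrate_tweets().

```r
# fetch: any function that takes a character vector of ids and returns a data frame.
# chunk_size and pause_sec are hypothetical defaults (299 batches x 100 ids; 15 minutes).
hydrate_in_chunks <- function(ids, fetch, chunk_size = 29900, pause_sec = 15 * 60) {
  chunks <- split(ids, ceiling(seq_along(ids) / chunk_size))
  out <- vector("list", length(chunks))
  for (i in seq_along(chunks)) {
    out[[i]] <- fetch(chunks[[i]])
    if (i < length(chunks)) Sys.sleep(pause_sec)  # wait before the next chunk
  }
  out  # one data frame per chunk
}

# Toy run with a fake fetch function and no pause
fake_fetch <- function(ids) data.frame(id = ids, stringsAsFactors = FALSE)
res <- hydrate_in_chunks(as.character(1:5), fake_fetch, chunk_size = 2, pause_sec = 0)
vapply(res, nrow, integer(1))
```

The results are kept as a list because the hydrated data frames contain nested columns, which are easier to flatten chunk by chunk than to bind first.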

names(reftws1)
##  [1] "entities"               "public_metrics"         "created_at"            
##  [4] "possibly_sensitive"     "id"                     "edit_history_tweet_ids"
##  [7] "conversation_id"        "author_id"              "lang"                  
## [10] "text"                   "attachments"            "referenced_tweets"     
## [13] "in_reply_to_user_id"    "geo"
nrow(reftws1)
## [1] 882
head(reftws1[,c("created_at", "text")])
##                 created_at
## 1 2013-05-15T17:16:32.000Z
## 2 2013-08-17T17:49:16.000Z
## 3 2013-08-20T22:56:45.000Z
## 4 2013-10-19T00:20:09.000Z
## 5 2013-03-21T16:12:39.000Z
## 6 2013-10-19T00:38:56.000Z
##                                                                                                                                           text
## 1 Facing an 8-year green card backlog, this @wharton grad has no choice but to take her business skills abroad: http://t.co/S97rVxQH8i #iMarch
## 2                         RNC Calls for Immigration Reform Focused on 'Needs of United States Employers'  via Breitbart http://t.co/CdEHdI3BDO
## 3                                                               Why do you support #immigration reform? #CIR #timeisnow http://t.co/YrvoMrZ4T0
## 4     Bill Young dedicated his life to serving the people of Florida, and his loss is a great one for his constituents. http://t.co/6Fcuji2FTX
## 5                                                            Today, the House delivered a responsible, #BalancedBudget. http://t.co/MqBpZNzPqn
## 6     Bill Young was a tireless servant to his constituents and our armed forces. My deepest sympathies go out to his family &amp; loved ones.
range(reftws1$created_at)
## [1] "2013-03-21T16:12:39.000Z" "2018-01-16T13:42:45.000Z"
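Only 882 of the 1,000 requested ids came back. The document does not say why; deleted or otherwise unavailable tweets are a plausible but unverified explanation. A small sketch on made-up ids shows how the missing ones can be listed with setdiff(); on the real objects the call would be setdiff(refids[1:1000], reftws1$id).

```r
# Made-up ids: four requested, two returned
requested <- c("101", "102", "103", "104")
returned  <- data.frame(id = c("101", "104"), stringsAsFactors = FALSE)

# ids that were requested but are absent from the result
missing_ids <- setdiff(requested, returned$id)
missing_ids

# share of requested ids that did not come back
length(missing_ids) / length(requested)
```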

Copy Regular Columns

df0 <- reftws1[,c("author_id","created_at","id","conversation_id","text","lang","in_reply_to_user_id")]

Extract Nested Columns of the Same Length

reftws1$public_metrics[1:3,]
##   retweet_count reply_count like_count quote_count impression_count
## 1           356         190        170           0                0
## 2             1           1          0           0                0
## 3            14           2          3           0                0
df1 <- df0
df1$retweet_count <- reftws1$public_metrics$retweet_count
df1$reply_count <- reftws1$public_metrics$reply_count
df1$like_count <- reftws1$public_metrics$like_count
df1$quote_count <- reftws1$public_metrics$quote_count
names(df1)
##  [1] "author_id"           "created_at"          "id"                 
##  [4] "conversation_id"     "text"                "lang"               
##  [7] "in_reply_to_user_id" "retweet_count"       "reply_count"        
## [10] "like_count"          "quote_count"
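Because public_metrics is itself a data frame with one row per tweet, the four assignments above can also be written as a single cbind(). A sketch on a toy object with the same shape (the values are taken from the first two rows printed above; the tw and flat names are hypothetical):

```r
# Toy data frame with a nested data-frame column, mimicking reftws1
tw <- data.frame(id = c("101", "102"), stringsAsFactors = FALSE)
tw$public_metrics <- data.frame(retweet_count = c(356, 1), reply_count = c(190, 1),
                                like_count = c(170, 0), quote_count = c(0, 0))

# Pick the metric columns to keep and bind them next to the regular columns
keep <- c("retweet_count", "reply_count", "like_count", "quote_count")
flat <- cbind(tw["id"], tw$public_metrics[, keep])
names(flat)
```

This works only because the nested table has exactly one row per tweet; the mention and hashtag tables do not.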

Nested Columns of Different Lengths

We have no control over how many people are mentioned or how many hashtags are used in a given tweet. Because the number of mentions and hashtags varies from tweet to tweet, processing them takes extra steps.

Mention

The nested mention table has 4 variables. username is the Twitter user screen name and id is the user id (back in 2021, the id of mentioned users wasn’t available). start and end hold the character positions in the tweet where the mention begins and ends.

The number of mentions varies from tweet to tweet. Thus, the nested mention table has a different number of rows for each observation.

The purpose of this step is to concatenate the username values into a column called mention and the id values into a column called mentioned_id.
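As a point of comparison, the same goal can be reached without stacking the nested tables at all. The sketch below is an alternative to the rbind-and-aggregate route used in this document, not the author's method: it collapses each nested table directly into one string per tweet. The collapse_field helper and the toy data are hypothetical.

```r
# Toy nested list: one mention, none, two mentions
mentions <- list(
  data.frame(start = 42, end = 50, username = "user_a", id = "111",
             stringsAsFactors = FALSE),
  NULL,
  data.frame(start = c(0, 10), end = c(8, 20), username = c("user_b", "user_c"),
             id = c("222", "333"), stringsAsFactors = FALSE)
)

# Collapse one field of every nested table into a comma-separated string;
# tweets without a nested table get NA
collapse_field <- function(nested, field) {
  vapply(nested, function(x) {
    if (is.null(x) || NROW(x) == 0) return(NA_character_)
    paste(trimws(x[[field]]), collapse = ",")
  }, character(1))
}

mention      <- collapse_field(mentions, "username")
mentioned_id <- collapse_field(mentions, "id")
mention
```

The result has one element per tweet in the original order, so it can be assigned straight to a column without a merge step.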

names(reftws1$entities$mentions[[1]]) # variables of the nested mention table
## [1] "start"    "end"      "username" "id"
data.frame(cbind(
  start = reftws1$entities$mentions[[1]][,1],
  end = reftws1$entities$mentions[[1]][,2],
  username = paste(substr(reftws1$entities$mentions[[1]][,3],1,2),"***", sep = ""),
  id = paste(substr(reftws1$entities$mentions[[1]][,4],1,3),"***", sep = "")
))
##   start end username     id
## 1    42  50    Wh*** 771***
mention <- reftws1$entities$mentions
names(mention) <- seq_along(mention)
mention_df <- do.call(what = "rbind", args = lapply(mention, as.data.frame))
head(mention_df)
##    start end       username                  id
## 1     42  50        Wharton             7717612
## 10     0  13   RepGoodlatte            37920978
## 11     0   8        GOPWhip          2693983838
## 13   109 116         fwd_us 1015198328441786369
## 15     0  15 HouseJudiciary           246357149
## 16     0  13   GOPoversight            22508473
mention_df$row_id <- floor(as.numeric(rownames(mention_df)))
drop_col <- c("start", "end") # position columns are not needed downstream
mention_df <- mention_df[, !names(mention_df) %in% drop_col]
class(mention_df$username)
## [1] "character"
mention_df$username <- trimws(mention_df$username)

Take a look at the resultant mention data. The row number was successfully preserved. The username and id information was hidden for privacy.

head(data.frame(cbind(
  username = paste(substr(mention_df$username,1,2), "***", sep = ""),
  id = paste(substr(mention_df$id,1,3), "***", sep = ""),
  row_id = mention_df$row_id)))
##   username     id row_id
## 1    Wh*** 771***      1
## 2    Re*** 379***     10
## 3    GO*** 269***     11
## 4    fw*** 101***     13
## 5    Ho*** 246***     15
## 6    GO*** 225***     16
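The row_id recovery relies on how rbind() names rows when the list elements are named: a one-row table keeps the list name, and a table with several rows gets the list name plus a ".1", ".2", ... suffix. A toy sketch of that mechanism (the nested and flat objects are hypothetical):

```r
# Toy nested list: tweet 1 has one entry, tweet 2 none, tweet 3 two
nested <- list(data.frame(tag = "a", stringsAsFactors = FALSE),
               NULL,
               data.frame(tag = c("b", "c"), stringsAsFactors = FALSE))
names(nested) <- seq_along(nested)

# Stack the tables; empty elements drop out, row names carry the list position
flat <- do.call("rbind", lapply(nested, as.data.frame))
rownames(flat)

# Strip the suffix to get back the position of the tweet in the original data
flat$row_id <- floor(as.numeric(rownames(flat)))
flat
```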

Collapse the username and id values that share a row_id into one row, separated by commas.

df2 <- aggregate(mention_df[,c("username", "id")], list(mention_df$row_id), paste, collapse = ",")
names(df2) <- c("row_id", "mention", "mentioned_id")
head(data.frame(cbind(
  row_id = df2$row_id,
  mention = paste(substr(df2$mention, 1, 2), "**, **, **", substr(df2$mention, nchar(df2$mention)-2, nchar(df2$mention)), sep = ""),
  mentioned_id = paste(substr(df2$mentioned_id, 1, 2), "**, **, **", substr(df2$mentioned_id, nchar(df2$mentioned_id)-2, nchar(df2$mentioned_id)), sep = "")
)))
##   row_id         mention    mentioned_id
## 1      1 Wh**, **, **ton 77**, **, **612
## 2     10 Re**, **, **tte 37**, **, **978
## 3     11 GO**, **, **hip 26**, **, **838
## 4     13 fw**, **, **_us 10**, **, **369
## 5     15 Ho**, **, **ary 24**, **, **149
## 6     16 GO**, **, **ght 22**, **, **473
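Every row printed above happens to hold a single mention, so the effect of collapse is not visible. A toy sketch with a tweet that mentions two users (hypothetical usernames and ids) makes it explicit:

```r
# Toy long-format mention table: row_id 3 has two mentions
m <- data.frame(row_id = c(1, 3, 3),
                username = c("user_a", "user_b", "user_c"),
                id = c("111", "222", "333"),
                stringsAsFactors = FALSE)

# One row per row_id; values within a group are pasted together with commas
agg <- aggregate(m[, c("username", "id")], list(row_id = m$row_id), paste, collapse = ",")
agg
```

Naming the grouping list (row_id = ...) sets the name of the key column directly, which is a small variation on the renaming step used in the document.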

Merge the aggregated mention data (df2) into the main data frame (df1)

df1$row_id <- 1:nrow(df1)
names(df1);names(df2)
##  [1] "author_id"           "created_at"          "id"                 
##  [4] "conversation_id"     "text"                "lang"               
##  [7] "in_reply_to_user_id" "retweet_count"       "reply_count"        
## [10] "like_count"          "quote_count"         "row_id"
## [1] "row_id"       "mention"      "mentioned_id"
df3 <- merge(df1, df2, by.x = "row_id", by.y = "row_id", all.x = TRUE)
names(df3)
##  [1] "row_id"              "author_id"           "created_at"         
##  [4] "id"                  "conversation_id"     "text"               
##  [7] "lang"                "in_reply_to_user_id" "retweet_count"      
## [10] "reply_count"         "like_count"          "quote_count"        
## [13] "mention"             "mentioned_id"
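The all.x = TRUE argument matters here: tweets without any mention have no row in df2 and would be dropped by a default (inner) merge. A toy sketch of the difference, using hypothetical data:

```r
# Three tweets, of which only tweets 1 and 3 have mentions
main <- data.frame(row_id = 1:3, text = c("t1", "t2", "t3"), stringsAsFactors = FALSE)
agg  <- data.frame(row_id = c(1, 3), mention = c("user_a", "user_b,user_c"),
                   stringsAsFactors = FALSE)

inner <- merge(main, agg, by = "row_id")                # drops tweet 2
left  <- merge(main, agg, by = "row_id", all.x = TRUE)  # keeps tweet 2 with NA

nrow(inner)
left$mention
```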

Hashtag

The nested hashtag list is handled in the same way as the mention list.

hashtag <- reftws1$entities$hashtags
names(hashtag) <- seq_along(hashtag)
hashtag_df <- do.call(what = "rbind", args = lapply(hashtag, as.data.frame))
hashtag_df$row_id <- floor(as.numeric(rownames(hashtag_df)))
head(hashtag_df)
##     start end            tag row_id
## 1     133 140         iMarch      1
## 3.1    19  31    immigration      3
## 3.2    40  44            CIR      3
## 3.3    45  55      timeisnow      3
## 5      42  57 BalancedBudget      5
## 8      25  37    immigration      8
df4 <- aggregate(trimws(hashtag_df$tag), list(hashtag_df$row_id), paste, collapse = ",")
names(df4) <- c("row_id", "hashtag")
head(df4)
##   row_id                   hashtag
## 1      1                    iMarch
## 2      3 immigration,CIR,timeisnow
## 3      5            BalancedBudget
## 4      8               immigration
## 5      9               immigration
## 6     10               immigration

Merge the hashtag table (df4) into the main dataset, which already contains the mention columns

df5 <- merge(df3, df4, by.x = "row_id", by.y = "row_id", all.x = TRUE)
df5 <- df5[,colnames(df5) != "row_id"]
colnames(df5)[colnames(df5) == "id"] <- "status_id"

Rename Variables and Save

Because this is a “referenced” data frame (the source tweets that the original tweets quoted or replied to), the prefix referenced_ is added to every variable name.

prefix <- "referenced_"
needs_prefix <- !grepl(prefix, colnames(df5), ignore.case = FALSE) # columns not yet prefixed
colnames(df5)[needs_prefix] <- paste0(prefix, colnames(df5)[needs_prefix])
df5 <- df5[,sort(colnames(df5))]
colnames(df5)
##  [1] "referenced_author_id"           "referenced_conversation_id"    
##  [3] "referenced_created_at"          "referenced_hashtag"            
##  [5] "referenced_in_reply_to_user_id" "referenced_lang"               
##  [7] "referenced_like_count"          "referenced_mention"            
##  [9] "referenced_mentioned_id"        "referenced_quote_count"        
## [11] "referenced_reply_count"         "referenced_retweet_count"      
## [13] "referenced_status_id"           "referenced_text"
thetitle <- paste("gcbcklog_ref", substr(range(reftws1$created_at)[1], 1, 4), "to", substr(range(reftws1$created_at)[2], 1, 4),"total", nrow(df5), "obs.csv", sep = ""); thetitle
## [1] "gcbcklog_ref2013to2018total882obs.csv"
save_as_csv(df5, thetitle)
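With the referenced_ prefix in place, the saved table shares the key referenced_status_id with the original tweet data, so the two can be joined. A minimal sketch on toy data; the column names other than referenced_status_id are hypothetical, and a left join is my assumption about how the tables would be combined:

```r
# Toy original tweets: 901 replies to 101, 902 is an initial tweet, 903 quotes 104
tweet <- data.frame(status_id = c("901", "902", "903"),
                    referenced_status_id = c("101", NA, "104"),
                    stringsAsFactors = FALSE)

# Toy referenced (source) tweets, shaped like df5
ref <- data.frame(referenced_status_id = c("101", "104"),
                  referenced_text = c("source a", "source b"),
                  stringsAsFactors = FALSE)

# Left join: every original tweet is kept, source columns are NA where nothing matches
joined <- merge(tweet, ref, by = "referenced_status_id", all.x = TRUE)
joined[order(joined$status_id), ]
```

merge() reorders rows by the key, so sorting by status_id afterwards restores a predictable order.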

Execution ended at 2023-01-17 22:46:11