🧭 Overview

This tutorial walks you through:

  1. How to retrieve tweets programmatically via the Twitter API (v1.1 search endpoint) using the rtweet package
  2. How to run a network analysis based on the number of retweets using R’s igraph package

Using retweets to measure influence is technically more challenging than using mentions: we need to extract the user information of the original tweets and query the user information of the retweets. An original tweet’s user information is nested inside one of the returned columns (retweeted_status).

📌 Things to check

  • ✔️ You must have a Twitter Developer account with elevated access.
  • ✔️ You must have the following keys/tokens:
    1. API Key
    2. API Key Secret
    3. Access Token
    4. Access Token Secret
  • ✔️ You must use rtweet version 1.x.x. An older version may display warning messages or throw errors because the authentication functions used below (e.g., rtweet_bot(), auth_as()) are not available in releases before 1.0.0.
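
The version requirement can be checked from R before going any further. The snippet below is a minimal sketch: the helper name has_min_version() is made up for this tutorial and is not part of rtweet.

```r
# hypothetical helper (not part of rtweet): TRUE if `pkg` is installed
# and its installed version is at least `min_version`
has_min_version <- function(pkg, min_version) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= min_version
}

# FALSE means rtweet is either missing or older than 1.0.0
has_min_version("rtweet", "1.0.0")
```

If this returns FALSE, installing or updating rtweet with install.packages("rtweet") should bring in a 1.x release.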

📦 Install and load packages

Install tidyverse, rtweet, and igraph if you do not have them in your R environment.

# uncomment and run the lines below if you need to install these packages

# tidyverse is a collection of R packages for data science
# install.packages("tidyverse")

# rtweet is used to pull tweets using your own account
# install.packages("rtweet")

# igraph is a collection of network analysis tools
# install.packages("igraph")

Load packages.

library(tidyverse)
library(rtweet)
library(igraph)

🔑 Authenticate Twitter Using API Keys

The rtweet package provides three methods of authentication. This example authenticates as a bot, but the other two methods will work as well unless you’re planning to take actions on Twitter using the API (e.g., posting a tweet through API).

# credentials required to generate an API token
# replace with your own keys and tokens
api_key <- "...Your API Key..."
api_key_secret <- "...Your API Key Secret..."
access_token <- "...Your API Access Token..."
access_token_secret <-  "...Your API Access Token Secret..."

# create an authentication token as a Twitter bot
auth_token <- rtweet_bot(
  api_key = api_key,
  api_secret = api_key_secret,
  access_token = access_token,
  access_secret = access_token_secret
)

# print out token details
auth_token
## <Token>
## <oauth_endpoint>
##  request:   https://api.twitter.com/oauth/request_token
##  authorize: https://api.twitter.com/oauth/authenticate
##  access:    https://api.twitter.com/oauth/access_token
## <oauth_app> rtweet
##   key:    ...Your API Key...
##   secret: <hidden>
## <credentials> oauth_token, oauth_token_secret
## ---
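
Hard-coding keys in a script makes it easy to leak them by accident (for example, when sharing the file). One common alternative is to read them from environment variables. The sketch below assumes the variable names TWITTER_API_KEY, TWITTER_API_KEY_SECRET, TWITTER_ACCESS_TOKEN, and TWITTER_ACCESS_TOKEN_SECRET, which are invented for this example; read_credential() is likewise a hypothetical helper, not an rtweet function.

```r
# hypothetical helper: read one credential from an environment variable
# and fail loudly if it is not set
read_credential <- function(var_name) {
  value <- Sys.getenv(var_name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", var_name, " is not set.", call. = FALSE)
  }
  value
}

# hypothetical helper: collect all four credentials into a named list
# (the environment variable names are assumptions made for this example)
read_twitter_credentials <- function() {
  list(
    api_key = read_credential("TWITTER_API_KEY"),
    api_key_secret = read_credential("TWITTER_API_KEY_SECRET"),
    access_token = read_credential("TWITTER_ACCESS_TOKEN"),
    access_token_secret = read_credential("TWITTER_ACCESS_TOKEN_SECRET")
  )
}
```

With the variables set (for example in a .Renviron file), the list returned by read_twitter_credentials() can supply the four values passed to rtweet_bot() above.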

Configure rtweet to use the token generated in the previous step.

# use token authentication
auth_as(auth_token)

👉 What are tokens?

You can think of an API token as an access key card. Twitter uses the token to check whether your user credentials are valid (authentication) and whether you have permission to perform an action (authorization).

🔍 Search tweets

Use search_tweets() to search by keyword, hashtag, or mention. Because we are running a network analysis on retweets, the include_rts parameter must be TRUE.

Notes

  1. Twitter’s default search only returns tweets from within the past week. This is a limit of Twitter’s Essential/Elevated Access API. If you use an outdated hashtag, you may not see any results.
  2. Retrieving a large number of rows (e.g., n = 4000) can take up to a few minutes. If you are unsure whether your API call is working, try starting with a small number of rows (e.g., n = 100).

# search term
# can be a hashtag (e.g., #microsoft), a mention (e.g., @nike), or a keyword (e.g., iphone)
search_term <- "iphone"

# search tweets
# we specify lang = "en" to retrieve only English tweets
# include_rts must be TRUE since our network analysis uses retweets
searched_tweets <- search_tweets(
  search_term,
  n = 10000,
  include_rts = TRUE,
  parse = TRUE,
  lang = "en"
)
## Warning: Terminating paginate early due to rate limit.
## Rate limit exceeded for Twitter endpoint '/1.1/search/tweets'
## • Will receive 180 more requests at 02:01
## ℹ Set `max_id = '1567091711284842495' to continue.
## ℹ Set `retryonratelimit = TRUE` to automatically wait for reset
# the result is stored as a tibble
searched_tweets %>% head(20)

Check the number of returned tweets. Ideally, you should have a sufficient number of rows (minimum ~500). If you see a smaller number, try a different search term.

nrow(searched_tweets)
## [1] 4664

🧑 Extract user information from original tweets

Because we are running a network analysis based on retweets, we are interested in the original tweet’s user information. In other words, we are looking to answer the question “Who posted the original tweet?”. We can use the retweeted_status column to extract that information.

# check what the nested DataFrame in the "retweeted_status" column looks like
searched_tweets[[1, 'retweeted_status']]
## [[1]]
##   created_at id id_str full_text truncated display_text_range entities.hashtags
## 1       <NA> NA   <NA>      <NA>        NA               NULL              NULL
##   entities.symbols entities.user_mentions entities.urls entities.media media
## 1             NULL                   NULL          NULL           NULL  NULL
##   metadata.iso_language_code metadata.result_type source in_reply_to_status_id
## 1                       <NA>                 <NA>   <NA>                    NA
##   in_reply_to_status_id_str in_reply_to_user_id in_reply_to_user_id_str
## 1                        NA                  NA                      NA
##   in_reply_to_screen_name user.id user.id_str user.name user.screen_name
## 1                      NA      NA        <NA>      <NA>             <NA>
##   user.location user.description user.url user.entities.urls user.entities.urls
## 1          <NA>             <NA>     <NA>               NULL               NULL
##   user.protected user.followers_count user.friends_count user.listed_count
## 1             NA                   NA                 NA                NA
##   user.created_at user.favourites_count user.utc_offset user.time_zone
## 1            <NA>                    NA              NA             NA
##   user.geo_enabled user.verified user.statuses_count user.lang
## 1               NA            NA                  NA        NA
##   user.contributors_enabled user.is_translator user.is_translation_enabled
## 1                        NA                 NA                          NA
##   user.profile_background_color user.profile_background_image_url
## 1                          <NA>                              <NA>
##   user.profile_background_image_url_https user.profile_background_tile
## 1                                    <NA>                           NA
##   user.profile_image_url user.profile_image_url_https user.profile_banner_url
## 1                   <NA>                         <NA>                    <NA>
##   user.profile_link_color user.profile_sidebar_border_color
## 1                    <NA>                              <NA>
##   user.profile_sidebar_fill_color user.profile_text_color
## 1                            <NA>                    <NA>
##   user.profile_use_background_image user.has_extended_profile
## 1                                NA                        NA
##   user.default_profile user.default_profile_image user.following
## 1                   NA                         NA             NA
##   user.follow_request_sent user.notifications user.translator_type
## 1                       NA                 NA                 <NA>
##   user.withheld_in_countries geo coordinates place.id place.url
## 1                       NULL  NA          NA     <NA>      <NA>
##   place.place_type place.name place.full_name place.country_code place.country
## 1             <NA>       <NA>            <NA>               <NA>          <NA>
##   place.contained_within place.bounding_box.type place.bounding_box.coordinates
## 1                   NULL                    <NA>                           NULL
##   contributors is_quote_status retweet_count favorite_count favorited retweeted
## 1           NA              NA            NA             NA        NA        NA
##   possibly_sensitive lang quoted_status_id quoted_status_id_str
## 1                 NA <NA>               NA                 <NA>
##   quoted_status.created_at quoted_status.id quoted_status.id_str
## 1                     <NA>               NA                 <NA>
##   quoted_status.full_text quoted_status.truncated
## 1                    <NA>                      NA
##   quoted_status.display_text_range quoted_status.entities.hashtags
## 1                             NULL                            NULL
##   quoted_status.entities.symbols quoted_status.entities.user_mentions
## 1                           NULL                                 NULL
##   quoted_status.entities.urls quoted_status.metadata.iso_language_code
## 1                        NULL                                     <NA>
##   quoted_status.metadata.result_type quoted_status.source
## 1                               <NA>                 <NA>
##   quoted_status.in_reply_to_status_id quoted_status.in_reply_to_status_id_str
## 1                                  NA                                      NA
##   quoted_status.in_reply_to_user_id quoted_status.in_reply_to_user_id_str
## 1                                NA                                    NA
##   quoted_status.in_reply_to_screen_name quoted_status.user.id
## 1                                    NA                    NA
##   quoted_status.user.id_str quoted_status.user.name
## 1                      <NA>                    <NA>
##   quoted_status.user.screen_name quoted_status.user.location
## 1                           <NA>                        <NA>
##   quoted_status.user.description quoted_status.user.url
## 1                           <NA>                   <NA>
##   quoted_status.user.entities.urls quoted_status.user.entities.urls
## 1                             NULL                             NULL
##   quoted_status.user.protected quoted_status.user.followers_count
## 1                           NA                                 NA
##   quoted_status.user.friends_count quoted_status.user.listed_count
## 1                               NA                              NA
##   quoted_status.user.created_at quoted_status.user.favourites_count
## 1                          <NA>                                  NA
##   quoted_status.user.utc_offset quoted_status.user.time_zone
## 1                            NA                           NA
##   quoted_status.user.geo_enabled quoted_status.user.verified
## 1                             NA                          NA
##   quoted_status.user.statuses_count quoted_status.user.lang
## 1                                NA                      NA
##   quoted_status.user.contributors_enabled quoted_status.user.is_translator
## 1                                      NA                               NA
##   quoted_status.user.is_translation_enabled
## 1                                        NA
##   quoted_status.user.profile_background_color
## 1                                        <NA>
##   quoted_status.user.profile_background_image_url
## 1                                              NA
##   quoted_status.user.profile_background_image_url_https
## 1                                                    NA
##   quoted_status.user.profile_background_tile
## 1                                         NA
##   quoted_status.user.profile_image_url
## 1                                 <NA>
##   quoted_status.user.profile_image_url_https
## 1                                       <NA>
##   quoted_status.user.profile_banner_url quoted_status.user.profile_link_color
## 1                                  <NA>                                  <NA>
##   quoted_status.user.profile_sidebar_border_color
## 1                                            <NA>
##   quoted_status.user.profile_sidebar_fill_color
## 1                                          <NA>
##   quoted_status.user.profile_text_color
## 1                                  <NA>
##   quoted_status.user.profile_use_background_image
## 1                                              NA
##   quoted_status.user.has_extended_profile quoted_status.user.default_profile
## 1                                      NA                                 NA
##   quoted_status.user.default_profile_image quoted_status.user.following
## 1                                       NA                           NA
##   quoted_status.user.follow_request_sent quoted_status.user.notifications
## 1                                     NA                               NA
##   quoted_status.user.translator_type quoted_status.user.withheld_in_countries
## 1                               <NA>                                     NULL
##   quoted_status.geo quoted_status.coordinates quoted_status.place
## 1                NA                        NA                  NA
##   quoted_status.contributors quoted_status.is_quote_status
## 1                         NA                            NA
##   quoted_status.retweet_count quoted_status.favorite_count
## 1                          NA                           NA
##   quoted_status.favorited quoted_status.retweeted
## 1                      NA                      NA
##   quoted_status.possibly_sensitive quoted_status.lang
## 1                               NA               <NA>
# check what the "user" nested DataFrame in the "retweeted_status" column looks like
searched_tweets[[1, 'retweeted_status']][[1]] %>% pull('user')

The retweeted_status column contains a nested DataFrame in each row. We extract the original user’s information with the purrr::map_dfr() function, which applies a function to every element of a list (here, each row’s nested DataFrame) and row-binds the results into a single DataFrame. For now, you do not have to worry too much about understanding map_dfr() in detail.

# for each row, this function will grab the original user's name, screen_name,
# description, followers_count, friends_count, and favourites_count
# if a row contains an original tweet (not a retweet), NA values are returned
find_original_user_screen_name <- function(.df) {
  row_original_user_info <- .df[1, 'user'][1, c('name', 'screen_name',  'description', 'followers_count', 'friends_count', 'favourites_count')]
  row_original_user_info <- row_original_user_info %>% rename(
    original_tweet_user_name = name,
    original_tweet_user_screen_name = screen_name,
    original_tweet_user_description = description,
    original_tweet_user_followers_count = followers_count,
    original_tweet_user_friends_count = friends_count,
    original_tweet_user_favorites_count = favourites_count
  )
  
  return(row_original_user_info)
}

# store the extracted information of original tweets' users to a new variable
original_tweet_user_info <- searched_tweets$retweeted_status %>%
  map_dfr(find_original_user_screen_name)

original_tweet_user_info %>% 
  head(n = 10)
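
To see what map_dfr() does in isolation, here is a small sketch on made-up data. The toy list mimics the shape of retweeted_status (one small DataFrame per tweet, each with a nested user DataFrame), but the names and values are invented, and extract_user() is a simplified stand-in for the function above.

```r
library(tibble)
library(purrr)

# toy stand-in for searched_tweets$retweeted_status:
# the second element plays the role of an original tweet (all NA)
toy_statuses <- list(
  tibble(id = 1, user = tibble(name = "Alice", screen_name = "alice01")),
  tibble(id = NA_real_, user = tibble(name = NA_character_, screen_name = NA_character_)),
  tibble(id = 3, user = tibble(name = "Bob", screen_name = "bob02"))
)

# pull two fields out of the nested "user" DataFrame of one element
extract_user <- function(.df) {
  .df$user[1, c("name", "screen_name")]
}

# apply extract_user() to every element and row-bind the results
toy_users <- map_dfr(toy_statuses, extract_user)
toy_users
```

The result has one row per input element, in the same order, which is what allows the extracted columns to be attached back to the tweets by position.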

Add extracted user information to the original searched_tweets DataFrame.

# if the columns already exist, drop the original tweet user's information first
# this lets us re-run this cell without accidentally
# adding the same columns again
searched_tweets <- searched_tweets %>%
  select(-any_of(colnames(original_tweet_user_info)))

searched_tweets <- dplyr::bind_cols(searched_tweets, original_tweet_user_info)
searched_tweets %>% head(n = 10)

👨🏽 Query user information of retweets

The rtweet::search_tweets() function returns a unique id of the user, not the name or screen name. This is because screen names can change and are not guaranteed to be consistent.

In this step, we query the user data using rtweet::users_data(). rtweet::users_data() will retrieve users’ information for each row in searched_tweets.

searched_tweets_user_data <- users_data(searched_tweets)
searched_tweets_user_data %>% head(20)

✂️ Select only relevant columns from searched_tweets

How many columns does users_data() return? Just as we used nrow() to count rows earlier, we can find out using the ncol() function.

ncol(searched_tweets_user_data)
## [1] 23

We can see the column names using colnames().

colnames(searched_tweets_user_data)
##  [1] "id"                      "id_str"                 
##  [3] "name"                    "screen_name"            
##  [5] "location"                "description"            
##  [7] "url"                     "protected"              
##  [9] "followers_count"         "friends_count"          
## [11] "listed_count"            "created_at"             
## [13] "favourites_count"        "verified"               
## [15] "statuses_count"          "profile_image_url_https"
## [17] "profile_banner_url"      "default_profile"        
## [19] "default_profile_image"   "withheld_in_countries"  
## [21] "derived"                 "withheld_scope"         
## [23] "entities"

Select only relevant columns.

retweet_users_subset_columns <- searched_tweets_user_data %>% 
  select(
    name,
    screen_name
  ) %>% rename(
    retweet_user_name = name,
    retweet_user_screen_name = screen_name
  )
  

retweet_users_subset_columns %>% head(20)

⚒️ Combine tweets and users data

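We now have two different DataFrames: one with a list of tweets and another with user information.

Use dplyr::bind_cols() to bind the two DataFrames together. bind_cols() matches rows by position, so both DataFrames must have the same number of rows in the same order.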

searched_tweets_with_users <- dplyr::bind_cols(
  searched_tweets,
  retweet_users_subset_columns
)

searched_tweets_with_users %>% head(20)
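
Because bind_cols() pairs rows purely by position, it can be worth confirming that both inputs have the same number of rows before combining them. A minimal sketch on invented data:

```r
library(dplyr)
library(tibble)

# made-up tweets and users, one user row per tweet, in the same order
toy_tweets <- tibble(full_text = c("RT hello", "an original post"))
toy_users <- tibble(retweet_user_screen_name = c("alice01", "bob02"))

# guard against silently misaligned data
stopifnot(nrow(toy_tweets) == nrow(toy_users))

toy_combined <- bind_cols(toy_tweets, toy_users)
toy_combined
```

The same stopifnot() check could be applied to searched_tweets and retweet_users_subset_columns before the bind_cols() call above.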

✂️ Select only relevant columns from searched_tweets_with_users

Use ncol() to find out how many columns we have so far.

ncol(searched_tweets_with_users)
## [1] 51

We can see the column names using colnames().

colnames(searched_tweets_with_users)
##  [1] "created_at"                          "id"                                 
##  [3] "id_str"                              "full_text"                          
##  [5] "truncated"                           "display_text_range"                 
##  [7] "entities"                            "metadata"                           
##  [9] "source"                              "in_reply_to_status_id"              
## [11] "in_reply_to_status_id_str"           "in_reply_to_user_id"                
## [13] "in_reply_to_user_id_str"             "in_reply_to_screen_name"            
## [15] "geo"                                 "coordinates"                        
## [17] "place"                               "contributors"                       
## [19] "is_quote_status"                     "retweet_count"                      
## [21] "favorite_count"                      "favorited"                          
## [23] "retweeted"                           "lang"                               
## [25] "possibly_sensitive"                  "retweeted_status"                   
## [27] "quoted_status_id"                    "quoted_status_id_str"               
## [29] "quoted_status"                       "text"                               
## [31] "favorited_by"                        "scopes"                             
## [33] "display_text_width"                  "quoted_status_permalink"            
## [35] "quote_count"                         "timestamp_ms"                       
## [37] "reply_count"                         "filter_level"                       
## [39] "query"                               "withheld_scope"                     
## [41] "withheld_copyright"                  "withheld_in_countries"              
## [43] "possibly_sensitive_appealable"       "original_tweet_user_name"           
## [45] "original_tweet_user_screen_name"     "original_tweet_user_description"    
## [47] "original_tweet_user_followers_count" "original_tweet_user_friends_count"  
## [49] "original_tweet_user_favorites_count" "retweet_user_name"                  
## [51] "retweet_user_screen_name"

Select the columns that can be useful for our analysis.

retweet_subset_columns <- searched_tweets_with_users %>% 
  select(
    full_text,
    created_at,
    original_tweet_user_name,
    original_tweet_user_screen_name,
    original_tweet_user_description,
    original_tweet_user_followers_count,
    original_tweet_user_friends_count,
    original_tweet_user_favorites_count,
    retweet_user_name,
    retweet_user_screen_name
  ) %>%
  rename (
    retweeted_at = created_at
  )

retweet_subset_columns %>% head()

🧹 Filter retweets

Recall that our network analysis will be based on the retweets only. We will need to weed out all original tweets (i.e., non-retweets) before creating a graph.

# if the original tweet's user name is missing, we know that a row is not a retweet
retweet_subset_columns <- retweet_subset_columns %>%
  filter(!is.na(original_tweet_user_name))

retweet_subset_columns %>% head(n = 20)
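
The filtering logic can be tried on a tiny made-up DataFrame, where only the rows with a non-missing original user survive:

```r
library(dplyr)
library(tibble)

# invented data: the second row is an original tweet, so it has no original user
toy_rows <- tibble(
  full_text = c("RT @a: hi", "my own post", "RT @b: yo"),
  original_tweet_user_name = c("A", NA, "B")
)

# keep retweets only
toy_retweets <- toy_rows %>%
  filter(!is.na(original_tweet_user_name))

toy_retweets
```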

💾 (Optional) Save to a CSV file

Use the readr package’s write_csv() function to save your combined DataFrame to a CSV file. This can be useful if you want to browse through the tweets using spreadsheet software (e.g., Excel, Google Sheets).

Note that rtweet provides its own rtweet::write_as_csv() function, which performs a similar task. You can use either.

readr::write_csv(
  retweet_subset_columns,
  "retweets.csv"
)
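
To confirm that tweet text containing commas or quotes survives the export, a quick round trip on made-up data can help. This sketch writes to a temporary file rather than retweets.csv:

```r
library(readr)
library(tibble)

# invented rows, including a comma and quotes inside the text
toy_export <- tibble(
  full_text = c("RT hello", "RT \"quoted\", with a comma"),
  retweet_user_screen_name = c("alice01", "bob02")
)

csv_path <- tempfile(fileext = ".csv")
write_csv(toy_export, csv_path)

# read the file back and compare with what was written
toy_import <- read_csv(csv_path, show_col_types = FALSE)
identical(toy_export$full_text, toy_import$full_text)
```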

✏️ Create an alias

Create an alias for our preprocessed DataFrame. This is for convenience so that we do not have to type retweet_subset_columns every time.

tweets <- retweet_subset_columns

🚀 Perform network analysis

🐬 Create a DataFrame that describes a directed graph

Each retweet can be represented as a directed edge in a graph, pointing from the retweeting user’s screen name to the screen name of the user who posted the original tweet.

edges <- tweets %>% 
  select(retweet_user_screen_name, original_tweet_user_screen_name) %>% 
  rename(
    from = retweet_user_screen_name,
    to = original_tweet_user_screen_name
  )

edges <- unnest(edges, cols=to)

# display the first 20 rows
head(edges, n = 20)

🧪 Create a graph

We created the edges DataFrame in the previous step. We can build a graph object from that DataFrame. The directed = TRUE argument creates a directed graph.

graph <- graph_from_data_frame(edges, directed = TRUE)

# print graph
graph
## IGRAPH 1af7e8e DN-- 3079 2946 -- 
## + attr: name (v/c)
## + edges from 1af7e8e (vertex names):
##  [1] Shakira561     ->strangerUg      James18991457  ->lunagirl777    
##  [3] AbhilashSk37   ->Ashwsbreal      Sweetesttheresa->strangerUg     
##  [5] SymmonKembo    ->handsonmobile01 tinah982       ->ZackRazOfficial
##  [7] _7Fawaz        ->Holar_Folarin   Bmspcob        ->missufe        
##  [9] benefitcs      ->missufe         baehnjn        ->missufe        
## [11] tune_tha_goon  ->cayyluh         _mbalingwenya  ->SABreakingNews 
## [13] MariaGenerous  ->Zal_Da_Great    tinah982       ->jose_wagz      
## [15] sharlousbaby   ->MaziGadgetPlug  Tawongda       ->bennyMalama    
## + ... omitted several edges

🧮 Calculate in-degree centrality

# calculate in-degree centrality: the number of incoming edges,
# i.e., how many times each user was retweeted
deg <- degree(graph, mode = "in")

# sort by degree centrality in descending order
deg <- deg %>%
  sort(decreasing = TRUE)

deg %>% head(10)
##       missufe   theapplehub     AkiMarlin  OGBdeyforyou  kyler_steele 
##          1534           116            56            49            31 
##   Instamanbot       drayy09 stufflistings           CNN   Sakagadgets 
##            31            24            21            21            20
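
In-degree is easier to verify by hand on a small graph. The sketch below uses an invented edge list in the same from/to layout as edges, where each row means "from retweeted to":

```r
library(igraph)

# made-up edge list: three users retweet "star", two retweet "mid",
# and one retweets "u1"
toy_edges <- data.frame(
  from = c("u1", "u2", "u3", "u4", "u5", "u2"),
  to   = c("star", "star", "star", "mid", "mid", "u1")
)

toy_graph <- graph_from_data_frame(toy_edges, directed = TRUE)

# in-degree = number of incoming edges = times a user was retweeted
toy_in_degree <- degree(toy_graph, mode = "in")
sort(toy_in_degree, decreasing = TRUE)
```

Here "star" ends up with an in-degree of 3, "mid" with 2, and "u1" with 1, while users who were never retweeted have 0. Note that degree() counts every edge, so a user who retweets the same account several times contributes several times to that account's in-degree.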

Check the number of vertices (i.e., users) in our graph.

gorder(graph)
## [1] 3079

🧞 Top influencers by number of retweets

Identify the top 100 users by number of retweets.

top100 <- deg %>% head(n = 100)
top100 %>% head(10)
##       missufe   theapplehub     AkiMarlin  OGBdeyforyou  kyler_steele 
##          1534           116            56            49            31 
##   Instamanbot       drayy09 stufflistings           CNN   Sakagadgets 
##            31            24            21            21            20

top100 is a named numeric vector. Convert it to a DataFrame. This will allow us to add new columns.

df_top100 <- top100 %>%
  enframe(name = "screen_name", value="retweeted_count") 

df_top100 %>% head(n = 10)
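
As a small illustration of what enframe() does, the sketch below converts an invented named vector into a two-column DataFrame: the names become one column and the values the other.

```r
library(tibble)

# made-up named numeric vector, shaped like the output of degree()
toy_counts <- c(star = 3, mid = 2, u1 = 1)

toy_counts_df <- enframe(toy_counts, name = "screen_name", value = "retweeted_count")
toy_counts_df
```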

Extract original tweets’ user information from our original DataFrame.

users_info <- tweets %>%
  select(
    original_tweet_user_name,
    original_tweet_user_screen_name,
    original_tweet_user_description,
    original_tweet_user_followers_count,
    original_tweet_user_friends_count,
    original_tweet_user_favorites_count
  ) %>% 
  rename(
    name = original_tweet_user_name,
    screen_name = original_tweet_user_screen_name,
    description = original_tweet_user_description,
    followers_count = original_tweet_user_followers_count,
    friends_count = original_tweet_user_friends_count,
    favorites_count = original_tweet_user_favorites_count
  ) %>%
  distinct(screen_name, .keep_all = TRUE)

users_info %>% head(n = 10)

Join the user information to df_top100.

top100_details <- df_top100 %>%
  inner_join(users_info)
## Joining, by = "screen_name"
top100_details
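
One thing to keep in mind is that inner_join() keeps only rows with a match in both DataFrames. The sketch below, on invented data, contrasts it with left_join(), which keeps every row of the left DataFrame and fills the gaps with NA:

```r
library(dplyr)
library(tibble)

# made-up retweet counts and user details; "ghost" has no user details
toy_counts <- tibble(
  screen_name = c("star", "mid", "ghost"),
  retweeted_count = c(3, 2, 1)
)
toy_details <- tibble(
  screen_name = c("star", "mid"),
  followers_count = c(1000, 50)
)

# inner join: "ghost" is dropped because it has no match
toy_inner <- inner_join(toy_counts, toy_details, by = "screen_name")

# left join: "ghost" is kept with NA for followers_count
toy_left <- left_join(toy_counts, toy_details, by = "screen_name")
```

Passing by = "screen_name" explicitly also suppresses the "Joining, by = ..." message.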

Plot the 20 most retweeted users as a bar chart.

ggplot(
  data = head(top100_details, n = 20),
  aes(x = retweeted_count, y = reorder(screen_name, retweeted_count))
) +
  geom_col() +
  theme_classic() +
  xlab("Number of Retweets by Other Users") +
  ylab("User")

📊 Number of retweets distribution

ggplot(
  data = top100_details,
  aes(x = retweeted_count)
) +
  geom_histogram(
    binwidth = 3,
    color = "black",
    fill = "white"
  ) +
  theme_classic() +
  xlab("Number of Retweets by Other Users")

✨ Graph visualization

plot(
  graph,
  layout = layout_with_fr(graph),
  main = "Retweets network graph of all nodes",
  vertex.size = 4,
  vertex.label = NA,
  edge.arrow.size = 0
)
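
With thousands of vertices, the full plot can be hard to read. One possible way to get a clearer picture is to keep only the edges pointing at the most retweeted users and plot that smaller graph. The sketch below shows the idea on an invented edge list; the cutoff of two users is arbitrary.

```r
library(igraph)

# made-up edge list in the same from/to layout as `edges`
toy_edges <- data.frame(
  from = c("u1", "u2", "u3", "u4", "u5", "u2"),
  to   = c("star", "star", "star", "mid", "mid", "u1")
)
toy_graph <- graph_from_data_frame(toy_edges, directed = TRUE)

# names of the two most retweeted users
toy_in_degree <- degree(toy_graph, mode = "in")
top_names <- names(sort(toy_in_degree, decreasing = TRUE))[1:2]

# keep only retweets of those users and rebuild a smaller graph
top_edges <- toy_edges[toy_edges$to %in% top_names, ]
top_graph <- graph_from_data_frame(top_edges, directed = TRUE)

plot(
  top_graph,
  layout = layout_with_fr(top_graph),
  main = "Retweets of the most retweeted users (toy data)",
  vertex.size = 10,
  edge.arrow.size = 0.3
)
```

On the real data, the same filter could be applied to edges using the screen names in top100 before calling graph_from_data_frame().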