This tutorial walks you through using the rtweet package and the igraph package.
Using retweets to measure influence is technically more challenging than using mentions. We need to extract the user information of the original tweets and query the user information of the retweets. An original tweet’s user information is nested in one of the returned columns.
This tutorial uses rtweet version 1.x.x. Using an older version may display warning messages or throw an error, as many functions are deprecated.
Install tidyverse, rtweet, and igraph if you do not have them in your R environment.
# uncomment and run the lines below if you need to install these packages
# tidyverse is a collection of R packages for data science
# install.packages("tidyverse")
# rtweet is used to pull tweets using your own account
# install.packages("rtweet")
# igraph is a collection of network analysis tools
# install.packages("igraph")
Load packages.
library(tidyverse)
library(rtweet)
library(igraph)
The rtweet package provides three methods of
authentication. This example authenticates as a bot, but the other two
methods will work as well unless you’re planning to take actions on
Twitter using the API (e.g., posting a tweet through API).
# credentials required to generate an API token
# replace with your own keys and tokens
api_key <- "...Your API Key..."
api_key_secret <- "...Your API Key Secret..."
access_token <- "...Your API Access Token..."
access_token_secret <- "...Your API Access Token Secret..."
# create an authentication token as a Twitter bot
auth_token <- rtweet_bot(
api_key = api_key,
api_secret = api_key_secret,
access_token = access_token,
access_secret = access_token_secret
)
# print out token details
auth_token
## <Token>
## <oauth_endpoint>
## request: https://api.twitter.com/oauth/request_token
## authorize: https://api.twitter.com/oauth/authenticate
## access: https://api.twitter.com/oauth/access_token
## <oauth_app> rtweet
## key: ...Your API Key...
## secret: <hidden>
## <credentials> oauth_token, oauth_token_secret
## ---
Configure rtweet to use the token generated in the
previous step.
# use token authentication
auth_as(auth_token)
You can think of API tokens as an access key card. Twitter will use the token to check whether your user credential is valid (authentication) and whether you have permission to perform an action (authorization).
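Hard-coding keys in a script makes them easy to leak if the script is shared. One common alternative, sketched below, is to read them from environment variables. The variable name TWITTER_API_KEY and the helper read_credential() are hypothetical choices made for this illustration; rtweet does not require them.

```r
# read one credential from an environment variable and fail early if it is unset
# (read_credential() and the variable name below are illustrative assumptions)
read_credential <- function(var_name) {
  value <- Sys.getenv(var_name, unset = "")
  if (identical(value, "")) {
    stop("Environment variable ", var_name, " is not set.")
  }
  value
}

# placeholder so this sketch runs on its own; in practice, set the real value
# outside the script (e.g., in your .Renviron file)
Sys.setenv(TWITTER_API_KEY = "...Your API Key...")

api_key <- read_credential("TWITTER_API_KEY")
```

The value read this way can be passed to rtweet_bot() in the same way as the hard-coded strings.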
Use search_tweets() to search by keyword, hashtag, or mention. Because we are running a network analysis on retweets, the include_rts parameter must be TRUE.
Notes
1. Twitter’s default search only returns tweets from the past week. This is a limit of Twitter’s Essential/Elevated Access API. If you use an outdated hashtag, you may not see any results.
2. Retrieving a large number of rows (e.g., n = 4000) can take up to a few minutes. If you are unsure whether your API call is working, start with a small number of rows (e.g., n = 100).
# search term
# can be a hashtag (e.g., #microsoft), a mention (e.g., @nike), or a keyword (e.g., iphone)
search_term <- "iphone"
# search tweets
# we specify lang = "en" to retrieve only English tweets
# include_rts must be TRUE since our network analysis uses retweets
searched_tweets <- search_tweets(
search_term,
n = 10000,
include_rts = TRUE,
parse = TRUE,
lang = "en"
)
## Warning: Terminating paginate early due to rate limit.
## Rate limit exceeded for Twitter endpoint '/1.1/search/tweets'
## • Will receive 180 more requests at 02:01
## ℹ Set `max_id = '1567091711284842495' to continue.
## ℹ Set `retryonratelimit = TRUE` to automatically wait for reset
# the result is stored as a tibble
searched_tweets %>% head(20)
Check the number of returned tweets. Ideally, you should have a sufficient number of rows (at least ~500). If you see fewer, try a different search term.
nrow(searched_tweets)
## [1] 4664
Because we are running a network analysis based on retweets, we are interested in the original tweet’s user information. In other words, we want to answer the question “Who posted the original tweet?”. We can use the retweeted_status column to extract that information.
# check what the nested DataFrame in the "retweeted_status" column looks like
searched_tweets[[1, 'retweeted_status']]
## [[1]]
## created_at id id_str full_text truncated display_text_range entities.hashtags
## 1 <NA> NA <NA> <NA> NA NULL NULL
## entities.symbols entities.user_mentions entities.urls entities.media media
## 1 NULL NULL NULL NULL NULL
## metadata.iso_language_code metadata.result_type source in_reply_to_status_id
## 1 <NA> <NA> <NA> NA
## in_reply_to_status_id_str in_reply_to_user_id in_reply_to_user_id_str
## 1 NA NA NA
## in_reply_to_screen_name user.id user.id_str user.name user.screen_name
## 1 NA NA <NA> <NA> <NA>
## user.location user.description user.url user.entities.urls user.entities.urls
## 1 <NA> <NA> <NA> NULL NULL
## user.protected user.followers_count user.friends_count user.listed_count
## 1 NA NA NA NA
## user.created_at user.favourites_count user.utc_offset user.time_zone
## 1 <NA> NA NA NA
## user.geo_enabled user.verified user.statuses_count user.lang
## 1 NA NA NA NA
## user.contributors_enabled user.is_translator user.is_translation_enabled
## 1 NA NA NA
## user.profile_background_color user.profile_background_image_url
## 1 <NA> <NA>
## user.profile_background_image_url_https user.profile_background_tile
## 1 <NA> NA
## user.profile_image_url user.profile_image_url_https user.profile_banner_url
## 1 <NA> <NA> <NA>
## user.profile_link_color user.profile_sidebar_border_color
## 1 <NA> <NA>
## user.profile_sidebar_fill_color user.profile_text_color
## 1 <NA> <NA>
## user.profile_use_background_image user.has_extended_profile
## 1 NA NA
## user.default_profile user.default_profile_image user.following
## 1 NA NA NA
## user.follow_request_sent user.notifications user.translator_type
## 1 NA NA <NA>
## user.withheld_in_countries geo coordinates place.id place.url
## 1 NULL NA NA <NA> <NA>
## place.place_type place.name place.full_name place.country_code place.country
## 1 <NA> <NA> <NA> <NA> <NA>
## place.contained_within place.bounding_box.type place.bounding_box.coordinates
## 1 NULL <NA> NULL
## contributors is_quote_status retweet_count favorite_count favorited retweeted
## 1 NA NA NA NA NA NA
## possibly_sensitive lang quoted_status_id quoted_status_id_str
## 1 NA <NA> NA <NA>
## quoted_status.created_at quoted_status.id quoted_status.id_str
## 1 <NA> NA <NA>
## quoted_status.full_text quoted_status.truncated
## 1 <NA> NA
## quoted_status.display_text_range quoted_status.entities.hashtags
## 1 NULL NULL
## quoted_status.entities.symbols quoted_status.entities.user_mentions
## 1 NULL NULL
## quoted_status.entities.urls quoted_status.metadata.iso_language_code
## 1 NULL <NA>
## quoted_status.metadata.result_type quoted_status.source
## 1 <NA> <NA>
## quoted_status.in_reply_to_status_id quoted_status.in_reply_to_status_id_str
## 1 NA NA
## quoted_status.in_reply_to_user_id quoted_status.in_reply_to_user_id_str
## 1 NA NA
## quoted_status.in_reply_to_screen_name quoted_status.user.id
## 1 NA NA
## quoted_status.user.id_str quoted_status.user.name
## 1 <NA> <NA>
## quoted_status.user.screen_name quoted_status.user.location
## 1 <NA> <NA>
## quoted_status.user.description quoted_status.user.url
## 1 <NA> <NA>
## quoted_status.user.entities.urls quoted_status.user.entities.urls
## 1 NULL NULL
## quoted_status.user.protected quoted_status.user.followers_count
## 1 NA NA
## quoted_status.user.friends_count quoted_status.user.listed_count
## 1 NA NA
## quoted_status.user.created_at quoted_status.user.favourites_count
## 1 <NA> NA
## quoted_status.user.utc_offset quoted_status.user.time_zone
## 1 NA NA
## quoted_status.user.geo_enabled quoted_status.user.verified
## 1 NA NA
## quoted_status.user.statuses_count quoted_status.user.lang
## 1 NA NA
## quoted_status.user.contributors_enabled quoted_status.user.is_translator
## 1 NA NA
## quoted_status.user.is_translation_enabled
## 1 NA
## quoted_status.user.profile_background_color
## 1 <NA>
## quoted_status.user.profile_background_image_url
## 1 NA
## quoted_status.user.profile_background_image_url_https
## 1 NA
## quoted_status.user.profile_background_tile
## 1 NA
## quoted_status.user.profile_image_url
## 1 <NA>
## quoted_status.user.profile_image_url_https
## 1 <NA>
## quoted_status.user.profile_banner_url quoted_status.user.profile_link_color
## 1 <NA> <NA>
## quoted_status.user.profile_sidebar_border_color
## 1 <NA>
## quoted_status.user.profile_sidebar_fill_color
## 1 <NA>
## quoted_status.user.profile_text_color
## 1 <NA>
## quoted_status.user.profile_use_background_image
## 1 NA
## quoted_status.user.has_extended_profile quoted_status.user.default_profile
## 1 NA NA
## quoted_status.user.default_profile_image quoted_status.user.following
## 1 NA NA
## quoted_status.user.follow_request_sent quoted_status.user.notifications
## 1 NA NA
## quoted_status.user.translator_type quoted_status.user.withheld_in_countries
## 1 <NA> NULL
## quoted_status.geo quoted_status.coordinates quoted_status.place
## 1 NA NA NA
## quoted_status.contributors quoted_status.is_quote_status
## 1 NA NA
## quoted_status.retweet_count quoted_status.favorite_count
## 1 NA NA
## quoted_status.favorited quoted_status.retweeted
## 1 NA NA
## quoted_status.possibly_sensitive quoted_status.lang
## 1 NA <NA>
# check what the "user" nested DataFrame in the "retweeted_status" column looks like
searched_tweets[[1, 'retweeted_status']][[1]] %>% pull('user')
The retweeted_status column contains a nested DataFrame in each row. We extract the original user’s information using the purrr::map_dfr() function. A map() function applies a function to each element of an existing column. In our case, we use it to turn each original tweet’s user DataFrame into a single row, and map_dfr() row-binds the results into one DataFrame. For now, you do not have to worry too much about understanding the map_dfr() function.
# for each row, this function grabs the original user's name, screen_name,
# description, followers_count, friends_count, and favourites_count
# if a row contains an original tweet (not a retweet), NA values are returned
find_original_user_screen_name <- function(.df) {
  row_original_user_info <- .df[1, 'user'][1, c('name', 'screen_name', 'description', 'followers_count', 'friends_count', 'favourites_count')]
  row_original_user_info <- row_original_user_info %>% rename(
    original_tweet_user_name = name,
    original_tweet_user_screen_name = screen_name,
    original_tweet_user_description = description,
    original_tweet_user_followers_count = followers_count,
    original_tweet_user_friends_count = friends_count,
    original_tweet_user_favorites_count = favourites_count
  )
  return(row_original_user_info)
}
# store the extracted information of original tweets' users to a new variable
original_tweet_user_info <- searched_tweets$retweeted_status %>%
map_dfr(find_original_user_screen_name)
original_tweet_user_info %>%
head(n = 10)
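To see what map_dfr() does in isolation, here is a minimal sketch on made-up data. The list toy_nested and the helper toy_extract() are hypothetical stand-ins for the retweeted_status column and the extraction function; they are not part of rtweet.

```r
library(purrr)
library(tibble)

# toy stand-in for a list column: one small data frame per tweet
# (the second element mimics a row that is not a retweet)
toy_nested <- list(
  tibble(name = "Alice", screen_name = "alice01"),
  tibble(name = NA_character_, screen_name = NA_character_),
  tibble(name = "Bob", screen_name = "bob02")
)

# return a one-row tibble for each element
toy_extract <- function(.df) {
  tibble(
    original_name = .df$name[1],
    original_screen_name = .df$screen_name[1]
  )
}

# map_dfr() calls toy_extract() on every element and row-binds the results
toy_flat <- map_dfr(toy_nested, toy_extract)
toy_flat
```

The result has one row per input element, in the same order, which is what later allows it to be bound column-wise to the tweets.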
Add extracted user information to the original
searched_tweets DataFrame.
# if the columns already exist, drop the original tweet users' information first
# this lets us re-run this cell multiple times
# without accidentally adding duplicate columns
searched_tweets <- searched_tweets %>%
select(-any_of(colnames(original_tweet_user_info)))
searched_tweets <- dplyr::bind_cols(searched_tweets, original_tweet_user_info)
searched_tweets %>% head(n = 10)
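Note that bind_cols() matches rows by position, not by a key, so both DataFrames must have the same number of rows in the same order. The sketch below illustrates this on made-up data; toy_tweets and toy_users are hypothetical names used only here.

```r
library(dplyr)
library(tibble)

toy_tweets <- tibble(id_str = c("1", "2", "3"))
toy_users <- tibble(original_tweet_user_name = c("Alice", NA, "Bob"))

# bind_cols() pairs row 1 with row 1, row 2 with row 2, and so on,
# so check that the row counts agree before binding
stopifnot(nrow(toy_tweets) == nrow(toy_users))

toy_combined <- bind_cols(toy_tweets, toy_users)

# dropping the added columns first makes the step safe to re-run
toy_combined <- toy_combined %>%
  select(-any_of(colnames(toy_users))) %>%
  bind_cols(toy_users)
toy_combined
```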
The rtweet::search_tweets() function returns a unique id for each user, not the name or screen name. This is because screen names can change and are not guaranteed to be consistent.
In this step, we query the user data using
rtweet::users_data(). rtweet::users_data()
will retrieve users’ information for each row in
searched_tweets.
searched_tweets_user_data <- users_data(searched_tweets)
searched_tweets_user_data %>% head(20)
How many columns does the users_data() function return? As in one of the previous steps, we can find out using the ncol() function.
ncol(searched_tweets_user_data)
## [1] 23
We can see the column names using colnames().
colnames(searched_tweets_user_data)
## [1] "id" "id_str"
## [3] "name" "screen_name"
## [5] "location" "description"
## [7] "url" "protected"
## [9] "followers_count" "friends_count"
## [11] "listed_count" "created_at"
## [13] "favourites_count" "verified"
## [15] "statuses_count" "profile_image_url_https"
## [17] "profile_banner_url" "default_profile"
## [19] "default_profile_image" "withheld_in_countries"
## [21] "derived" "withheld_scope"
## [23] "entities"
Select only relevant columns.
retweet_users_subset_columns <- searched_tweets_user_data %>%
select(
name,
screen_name
) %>% rename(
retweet_user_name = name,
retweet_user_screen_name = screen_name
)
retweet_users_subset_columns %>% head(20)
We now have two different DataFrames: one with a list of tweets and another with user information.
Use dplyr to bind the two DataFrames together.
searched_tweets_with_users <- dplyr::bind_cols(
searched_tweets,
retweet_users_subset_columns
)
searched_tweets_with_users %>% head(20)
Use ncol() to find out how many columns we have so far.
ncol(searched_tweets_with_users)
## [1] 51
We can see the column names using colnames().
colnames(searched_tweets_with_users)
## [1] "created_at" "id"
## [3] "id_str" "full_text"
## [5] "truncated" "display_text_range"
## [7] "entities" "metadata"
## [9] "source" "in_reply_to_status_id"
## [11] "in_reply_to_status_id_str" "in_reply_to_user_id"
## [13] "in_reply_to_user_id_str" "in_reply_to_screen_name"
## [15] "geo" "coordinates"
## [17] "place" "contributors"
## [19] "is_quote_status" "retweet_count"
## [21] "favorite_count" "favorited"
## [23] "retweeted" "lang"
## [25] "possibly_sensitive" "retweeted_status"
## [27] "quoted_status_id" "quoted_status_id_str"
## [29] "quoted_status" "text"
## [31] "favorited_by" "scopes"
## [33] "display_text_width" "quoted_status_permalink"
## [35] "quote_count" "timestamp_ms"
## [37] "reply_count" "filter_level"
## [39] "query" "withheld_scope"
## [41] "withheld_copyright" "withheld_in_countries"
## [43] "possibly_sensitive_appealable" "original_tweet_user_name"
## [45] "original_tweet_user_screen_name" "original_tweet_user_description"
## [47] "original_tweet_user_followers_count" "original_tweet_user_friends_count"
## [49] "original_tweet_user_favorites_count" "retweet_user_name"
## [51] "retweet_user_screen_name"
Select the columns that can be useful for our analysis.
retweet_subset_columns <- searched_tweets_with_users %>%
select(
full_text,
created_at,
original_tweet_user_name,
original_tweet_user_screen_name,
original_tweet_user_description,
original_tweet_user_followers_count,
original_tweet_user_friends_count,
original_tweet_user_favorites_count,
retweet_user_name,
retweet_user_screen_name
) %>%
rename(
retweeted_at = created_at
)
retweet_subset_columns %>% head()
Recall that our network analysis will be based on the retweets only. We will need to weed out all original tweets (i.e., non-retweets) before creating a graph.
# if the original tweet's user name is missing, we know that a row is not a retweet
retweet_subset_columns <- retweet_subset_columns %>%
filter(!is.na(original_tweet_user_name))
retweet_subset_columns %>% head(n = 20)
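The following sketch shows the same filter on made-up rows, along with a quick count of how many rows it removes. The names toy_rows, toy_retweets, and n_removed are hypothetical and used only for illustration.

```r
library(dplyr)
library(tibble)

toy_rows <- tibble(
  full_text = c("RT @alice01: hello", "my own tweet", "RT @bob02: hi"),
  original_tweet_user_name = c("Alice", NA, "Bob")
)

# keep only rows where an original author was found, i.e., retweets
toy_retweets <- toy_rows %>%
  filter(!is.na(original_tweet_user_name))

# number of rows removed by the filter (original tweets)
n_removed <- nrow(toy_rows) - nrow(toy_retweets)
```

Comparing the row counts before and after the filter is a cheap way to confirm that the search actually returned retweets.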
Use the readr package’s readr::write_csv() to save your combined DataFrame to a CSV file. This can be useful if you want to browse through the tweets using spreadsheet software (e.g., Excel, Google Sheets).
Note that rtweet provides its own rtweet::write_as_csv() function, which performs a similar task. You can use either.
readr::write_csv(
retweet_subset_columns,
"retweets.csv"
)
Create an alias for our preprocessed DataFrame. This is for convenience so that we do not have to type retweet_subset_columns every time.
tweets <- retweet_subset_columns
Each retweet can be represented as a directed edge in a graph that points from the retweeting user’s screen name to the screen name of the user who posted the original tweet.
edges <- tweets %>%
select(retweet_user_screen_name, original_tweet_user_screen_name) %>%
rename(
from = retweet_user_screen_name,
to = original_tweet_user_screen_name
)
edges <- unnest(edges, cols = to)
# display the first 20 rows
head(edges, n = 20)
We created the edges DataFrame in the previous step. We can build a graph object from this DataFrame. The directed = TRUE argument creates a directed graph.
graph <- graph_from_data_frame(edges, directed = TRUE)
# print graph
graph
## IGRAPH 1af7e8e DN-- 3079 2946 --
## + attr: name (v/c)
## + edges from 1af7e8e (vertex names):
## [1] Shakira561 ->strangerUg James18991457 ->lunagirl777
## [3] AbhilashSk37 ->Ashwsbreal Sweetesttheresa->strangerUg
## [5] SymmonKembo ->handsonmobile01 tinah982 ->ZackRazOfficial
## [7] _7Fawaz ->Holar_Folarin Bmspcob ->missufe
## [9] benefitcs ->missufe baehnjn ->missufe
## [11] tune_tha_goon ->cayyluh _mbalingwenya ->SABreakingNews
## [13] MariaGenerous ->Zal_Da_Great tinah982 ->jose_wagz
## [15] sharlousbaby ->MaziGadgetPlug Tawongda ->bennyMalama
## + ... omitted several edges
# calculate degree centrality
deg <- degree(graph, mode = "in")
# sort by degree centrality in descending order
deg <- deg %>%
sort(decreasing = TRUE)
deg %>% head(10)
## missufe theapplehub AkiMarlin OGBdeyforyou kyler_steele
## 1534 116 56 49 31
## Instamanbot drayy09 stufflistings CNN Sakagadgets
## 31 24 21 21 20
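Because each edge points from the retweeting user to the original author, a user's in-degree is the number of retweet edges that point at them. The sketch below uses a made-up edge list (toy_edges, with invented screen names) to show how the count is derived.

```r
library(igraph)

# toy edge list: "from" retweeted a tweet posted by "to"
toy_edges <- data.frame(
  from = c("ann", "bob", "cat", "ann"),
  to   = c("dan", "dan", "dan", "eve")
)

toy_graph <- graph_from_data_frame(toy_edges, directed = TRUE)

# in-degree counts incoming edges, i.e., how many times each user was retweeted
toy_in_degree <- degree(toy_graph, mode = "in")
sort(toy_in_degree, decreasing = TRUE)
```

Here dan receives three edges and eve one, while ann, bob, and cat only retweet and therefore have an in-degree of zero. Each row of the edge list counts separately, so a user who retweets the same author twice contributes two to that author's in-degree.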
Check the number of vertices (i.e., users) in our graph.
gorder(graph)
## [1] 3079
Identify the top 100 users by number of retweets.
top100 <- deg %>% head(n = 100)
top100 %>% head(10)
## missufe theapplehub AkiMarlin OGBdeyforyou kyler_steele
## 1534 116 56 49 31
## Instamanbot drayy09 stufflistings CNN Sakagadgets
## 31 24 21 21 20
top100 is a named numeric vector. Convert it to a
DataFrame. This will allow us to add new columns.
df_top100 <- top100 %>%
  enframe(name = "screen_name", value = "retweeted_count")
df_top100 %>% head(n = 10)
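As a small illustration of what enframe() does, the sketch below converts a made-up named vector (toy_counts) into a two-column tibble: the names go into one column and the values into the other.

```r
library(tibble)

# toy named numeric vector, shaped like the output of degree()
toy_counts <- c(dan = 3, eve = 1)

# names become the "screen_name" column, values become "retweeted_count"
toy_df <- enframe(toy_counts, name = "screen_name", value = "retweeted_count")
toy_df
```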
Extract original tweets’ user information from our original DataFrame.
users_info <- tweets %>%
select(
original_tweet_user_name,
original_tweet_user_screen_name,
original_tweet_user_description,
original_tweet_user_followers_count,
original_tweet_user_friends_count,
original_tweet_user_favorites_count,
) %>%
rename(
name = original_tweet_user_name,
screen_name = original_tweet_user_screen_name,
description = original_tweet_user_description,
followers_count = original_tweet_user_followers_count,
friends_count = original_tweet_user_friends_count,
favorites_count = original_tweet_user_favorites_count
) %>%
distinct(screen_name, .keep_all = TRUE)
users_info %>% head(n = 10)
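The same author appears once per retweet in the tweets DataFrame, which is why distinct() is needed. A minimal sketch on made-up data (toy_authors is a hypothetical name) shows that .keep_all = TRUE keeps the first row seen for each screen name together with its other columns.

```r
library(dplyr)
library(tibble)

# the same author can appear several times, possibly with slightly different counts
toy_authors <- tibble(
  screen_name = c("dan", "dan", "eve"),
  followers_count = c(120, 121, 45)
)

# keep the first row for each screen_name and retain all other columns
toy_unique <- toy_authors %>%
  distinct(screen_name, .keep_all = TRUE)
toy_unique
```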
Join the user information into df_top100.
top100_details <- df_top100 %>%
inner_join(users_info)
## Joining, by = "screen_name"
top100_details
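If you prefer not to rely on dplyr guessing the join key, you can state it explicitly. The sketch below uses made-up tables (toy_top and toy_users_info) to show an inner join with by = "screen_name".

```r
library(dplyr)
library(tibble)

toy_top <- tibble(screen_name = c("dan", "eve"), retweeted_count = c(3, 1))
toy_users_info <- tibble(
  screen_name = c("dan", "eve", "zoe"),
  followers_count = c(120, 45, 9)
)

# naming the key explicitly avoids the "Joining, by = ..." message
# inner_join() keeps only screen names present in both tables
toy_details <- toy_top %>%
  inner_join(toy_users_info, by = "screen_name")
toy_details
```

In this toy data, zoe is dropped because she does not appear in toy_top.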
ggplot(
data = head(top100_details, n = 20),
aes(x = retweeted_count, y = reorder(screen_name, retweeted_count))
) +
geom_col() +
theme_classic() +
xlab("Number of Retweets by Other Users") +
ylab("User")
ggplot(
data = top100_details,
aes(x = retweeted_count)
) +
geom_histogram(
binwidth = 3,
color = "black",
fill = "white"
) +
theme_classic() +
xlab("Number of Retweets by Other Users")
plot(
  graph,
  layout = layout_with_fr(graph),
  main = "Retweets network graph of all nodes",
  vertex.size = 4,
  vertex.label = NA,
  edge.arrow.size = 0
)