This guide will illustrate how to use the rtweet package to download Twitter data and introduce network analysis with the tidygraph package. Unless you’re already registered with the Twitter API, the functions that download data won’t work; download the saved data files to work with and replicate the network analysis.
rtweet

The rtweet package directly interfaces with Twitter’s full API for app development. This means you will need to register with Twitter separately at https://apps.twitter.com/, which will eventually be retired in favor of https://developer.twitter.com/. This will provide you with unique keys to access the API. While you can input API keys directly from R scripts, I recommend adding them to a system file (~/.Renviron) to keep that information private. Finally, you’ll need to use those keys to create and authenticate a token before you’re granted access to the API. I found these pages absolutely essential in setting all this up:
Bob Rudis’s whole book is a great way to dig into rtweet’s functionality. It’s a hobbyist work-in-progress, but I’ve found it very helpful.
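Putting those pieces together, the token setup looks roughly like this sketch (the app name and environment variable names are placeholders I’ve chosen; store the real values in ~/.Renviron so they never appear in your scripts):

# in ~/.Renviron (not R code), one variable per line:
# TWITTER_CONSUMER_KEY=your_key_here
# TWITTER_CONSUMER_SECRET=your_secret_here

# then, in R, build and cache a token from those variables
library(rtweet)
token <- create_token(
  app = "my_app_name",  # the app you registered with Twitter
  consumer_key = Sys.getenv("TWITTER_CONSUMER_KEY"),
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET")
)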
rtweet is a very large package with a lot of functionality. Twitter maintains many different kinds of data and many ways to access them. You can identify trends, download tweets based on hashtags, identify and analyze retweets and favorites, look at following patterns, and more.
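To give a taste of that range, here is a sketch of a few of those entry points (these are real rtweet functions, but I’m not running them here):

# trending topics (woeid = 1 is worldwide)
trends <- get_trends(woeid = 1)
# a specific account's own recent tweets
timeline <- get_timeline("BenInquiring", n = 100)
# tweets an account has favorited
favs <- get_favorites("BenInquiring", n = 100)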
I’m interested in learning more about my social network on Twitter. I follow people from many distinct communities (academia, news and politics, soccer analysis, data science, etc.), and I’d like to see how insular or porous those communities might be.
First, we need to load the packages.
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(rtweet)
##
## Attaching package: 'rtweet'
## The following object is masked from 'package:purrr':
##
## flatten
Before we look at friends and followers, let’s grab some tweets and see what they look like.
# we can get recent tweets by hashtag
# returns data from the last 6-9 days
rstats <- search_tweets("#rstats", n = 50)
## Searching for tweets...
## Finished collecting tweets!
# and also by account handle
ben <- search_tweets("BenInquiring", n = 50)
## Searching for tweets...
## Finished collecting tweets!
ben
## # A tibble: 50 x 88
## user_id status_id created_at screen_name text source
## * <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 7155988… 10341955… 2018-08-27 21:46:38 BenInquiri… "\"The preva… Twitt…
## 2 7155988… 10341930… 2018-08-27 21:37:02 BenInquiri… "🇺🇸Populatio… Twitt…
## 3 7155988… 10341895… 2018-08-27 21:22:50 BenInquiri… "Shout out t… Twitt…
## 4 7155988… 10341894… 2018-08-27 21:22:35 BenInquiri… @MattDoyle76… Twitt…
## 5 7155988… 10340682… 2018-08-27 13:20:56 BenInquiri… UNC students… Twitt…
## 6 7155988… 10340526… 2018-08-27 12:19:02 BenInquiri… @SeanSteffen… Twitt…
## 7 7155988… 10340509… 2018-08-27 12:12:22 BenInquiri… https://t.co… Twitt…
## 8 7155988… 10334498… 2018-08-25 20:23:29 BenInquiri… "“Deplatform… Twitt…
## 9 7155988… 10333826… 2018-08-25 15:56:43 BenInquiri… As tidycensu… Twitt…
## 10 7155988… 10333278… 2018-08-25 12:18:53 BenInquiri… "I feel like… Twitt…
## # ... with 40 more rows, and 82 more variables: display_text_width <dbl>,
## # reply_to_status_id <chr>, reply_to_user_id <chr>,
## # reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## # favorite_count <int>, retweet_count <int>, hashtags <list>,
## # symbols <list>, urls_url <list>, urls_t.co <list>,
## # urls_expanded_url <list>, media_url <list>, media_t.co <list>,
## # media_expanded_url <list>, media_type <list>, ext_media_url <list>,
## # ext_media_t.co <list>, ext_media_expanded_url <list>,
## # ext_media_type <chr>, mentions_user_id <list>,
## # mentions_screen_name <list>, lang <chr>, quoted_status_id <chr>,
## # quoted_text <chr>, quoted_created_at <dttm>, quoted_source <chr>,
## # quoted_favorite_count <int>, quoted_retweet_count <int>,
## # quoted_user_id <chr>, quoted_screen_name <chr>, quoted_name <chr>,
## # quoted_followers_count <int>, quoted_friends_count <int>,
## # quoted_statuses_count <int>, quoted_location <chr>,
## # quoted_description <chr>, quoted_verified <lgl>,
## # retweet_status_id <chr>, retweet_text <chr>,
## # retweet_created_at <dttm>, retweet_source <chr>,
## # retweet_favorite_count <int>, retweet_retweet_count <int>,
## # retweet_user_id <chr>, retweet_screen_name <chr>, retweet_name <chr>,
## # retweet_followers_count <int>, retweet_friends_count <int>,
## # retweet_statuses_count <int>, retweet_location <chr>,
## # retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## # place_name <chr>, place_full_name <chr>, place_type <chr>,
## # country <chr>, country_code <chr>, geo_coords <list>,
## # coords_coords <list>, bbox_coords <list>, status_url <chr>,
## # name <chr>, location <chr>, description <chr>, url <lgl>,
## # protected <lgl>, followers_count <int>, friends_count <int>,
## # listed_count <int>, statuses_count <int>, favourites_count <int>,
## # account_created_at <dttm>, verified <lgl>, profile_url <chr>,
## # profile_expanded_url <chr>, account_lang <chr>,
## # profile_banner_url <chr>, profile_background_url <lgl>,
## # profile_image_url <chr>
You can also get basic information about a specific account.
ben_profile <- search_users("BenInquiring")
## Searching for users...
## Finished collecting users!
ben_profile$name
## [1] "Benjamin Bellman"
ben_profile$description
## [1] "Sociology PhD student @BrownUniversity. Spatial demography, urban studies, soccer nerd. Jr. contributor for @AnalysisEvolved. Go Pids."
ben_profile$location
## [1] "Providence, RI"
When working with APIs, it’s important to remember rate limiting. API queries cost server time and money, so rate limiting API keys helps regulate traffic. The Twitter API is actually a collection of smaller APIs, each with its own limits. We can check our key’s status on all of them with the rate_limit() function; the reset column shows the minutes remaining until each limit resets.
rate_limit()
## # A tibble: 145 x 7
## query limit remaining reset reset_at timestamp
## <chr> <int> <int> <time> <dttm> <dttm>
## 1 lists/l… 15 15 15.00… 2018-08-28 00:03:45 2018-08-27 23:48:45
## 2 lists/m… 75 75 15.00… 2018-08-28 00:03:45 2018-08-27 23:48:45
## 3 lists/s… 15 15 15.00… 2018-08-28 00:03:45 2018-08-27 23:48:45
## 4 lists/m… 900 900 15.00… 2018-08-28 00:03:45 2018-08-27 23:48:45
## 5 lists/s… 15 15 15.00… 2018-08-28 00:03:45 2018-08-27 23:48:45
## 6 lists/s… 75 75 15.00… 2018-08-28 00:03:45 2018-08-27 23:48:45
## 7 lists/o… 15 15 15.00… 2018-08-28 00:03:45 2018-08-27 23:48:45
## 8 lists/s… 180 180 15.00… 2018-08-28 00:03:45 2018-08-27 23:48:45
## 9 lists/m… 15 15 15.00… 2018-08-28 00:03:45 2018-08-27 23:48:45
## 10 lists/s… 900 900 15.00… 2018-08-28 00:03:45 2018-08-27 23:48:45
## # ... with 135 more rows, and 1 more variable: app <chr>
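If you only care about one endpoint, rate_limit() also accepts a query argument that filters this table down; for example, we can check the endpoint we’re about to lean on:

# check only the limit relevant to get_friends()
rate_limit(query = "get_friends")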
The rate limit for accessing friends (Twitter’s term for the accounts a user follows) is 15 requests per 15 minutes. I follow 757 people, whose friends lists would take nearly 13 hours to download. Instead, I’ll analyze a 75-account sample of my 231 followers, telling R to sleep for 15 minutes before pulling the next 15 accounts’ friends.
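As a rough check on that arithmetic (each batch of 15 requests costs about 15 minutes of waiting):

ceiling(757 / 15) * 15 / 60  # friends lists for all 757 accounts: ~12.75 hours
ceiling(75 / 15) * 15 / 60   # a 75-account sample: ~1.25 hours

First, let’s grab the follower list itself.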
my_followers <- get_followers("BenInquiring")
glimpse(my_followers)
## Observations: 231
## Variables: 1
## $ user_id <chr> "328301647", "321563970", "1977085874", "464906075", "...
The result is a tbl with one column: numeric account IDs rather than screen names. I prefer to treat it as a vector and sample 75 accounts.
ids <- sample(my_followers$user_id, 75)
Now I’ll loop through the IDs and pause the loop every 15 accounts. See you in an hour!
# create an empty list to store results
friends <- list()

# loop over the sampled IDs
for (a in seq_along(ids)) {
  friends[[a]] <- get_friends(ids[a])
  # pause after every 15th account to respect the rate limit
  if (a %% 15 == 0) {
    Sys.sleep(15 * 60)  # Sys.sleep() takes time in seconds
  }
}
# combine the data tables in the list
friends <- bind_rows(friends) %>%
  rename(friend = user_id)

library(here)
write_csv(friends, here("data", "twitter_friends.csv"))
Unfortunately, all but 37 of the API queries I tried failed. One error message said I wasn’t authorized (likely a protected account), another said the page didn’t exist any more (likely a deleted account).
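If I were re-running the download, I’d wrap the query in base R’s tryCatch() so those failures get logged and skipped; here’s a sketch (how rtweet surfaces each error may vary):

friends <- list()
for (a in seq_along(ids)) {
  friends[[a]] <- tryCatch(
    get_friends(ids[a]),
    error = function(e) {
      message("skipping ", ids[a], ": ", conditionMessage(e))
      NULL  # bind_rows() drops NULL list elements
    }
  )
  # still respect the rate limit
  if (a %% 15 == 0) Sys.sleep(15 * 60)
}

Still, we now have edge data for a graph. Let’s see if any of my followers are following each other.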
tidygraph

For a detailed introduction to tidygraph, see this post from the package’s author:
Let’s bring in the Twitter friend data and turn it into a tbl_graph object. But first, let’s see how many of these followers’ friends are also my followers.
## here() starts at /Users/benjaminbellman/Documents/Computer Backup/Repos/r-workshop18
friends <- read_csv(here("data", "twitter_friends.csv"))
## Parsed with column specification:
## cols(
## user = col_integer(),
## friend = col_double()
## )
filter(friends, friend %in% user)
## # A tibble: 0 x 2
## # ... with 2 variables: user <int>, friend <dbl>
Uh oh! None of these 37 followers are following any of my other followers. Let’s check instead to see if they’re following any of the same people (apart from me). To do this, I’ll count the times an ID appears in the friend column and drop all the rows where the ID appears only once.
net <- friends %>%
  group_by(friend) %>%
  mutate(count = n()) %>%
  ungroup() %>%
  filter(count > 1)
glimpse(net)
## Observations: 91
## Variables: 3
## $ user <int> 18273808, 18273808, 18273808, 18273808, 18273808, 18273...
## $ friend <dbl> 183387449, 29479000, 16727535, 158487331, 813286, 60865...
## $ count <int> 2, 2, 2, 2, 3, 3, 2, 3, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2...
We’ve got links! Let’s create a directed tbl_graph!
library(tidygraph)
##
## Attaching package: 'tidygraph'
## The following object is masked from 'package:stats':
##
## filter
g <- net %>%
  select(user, friend) %>%  # drop the count column
  as_tbl_graph()
g
## # A tbl_graph: 59 nodes and 91 edges
## #
## # A directed acyclic simple graph with 1 component
## #
## # Node Data: 59 x 1 (active)
## name
## <chr>
## 1 18273808
## 2 144996068
## 3 227841374
## 4 207090115
## 5 121545138
## 6 110975849
## # ... with 53 more rows
## #
## # Edge Data: 91 x 2
## from to
## <int> <int>
## 1 1 20
## 2 1 21
## 3 1 22
## # ... with 88 more rows
Network analysis is best expressed with visualizations, which can be built on the ggplot2 framework using the ggraph package.
library(ggraph)
ggraph(g) +
  geom_edge_link() +
  geom_node_point(size = 3, colour = 'steelblue') +
  theme_graph()
## Using `nicely` as default layout
It certainly looks like certain nodes are more connected than others, and that there are clusters of nodes gathered around these central links. These are probably my followers, who are then following lots of other accounts that can overlap. We can visualize the direction of each link with the arrow = argument of geom_edge_link() and the grid package’s arrow() function. I’ll also drop the nodes to help see the arrows.
ggraph(g) +
  geom_edge_link(edge_width = 0.25, arrow = arrow(30, unit(.15, "cm"))) +
  theme_graph()
## Using `nicely` as default layout
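That “nicely” message means ggraph picked a layout algorithm for us. For a reproducible arrangement, you can request one explicitly through the layout argument; here’s a sketch using the first graph (“kk” is the Kamada-Kawai layout):

ggraph(g, layout = "kk") +  # "fr" and "circle" are other options
  geom_edge_link() +
  geom_node_point(size = 3, colour = 'steelblue') +
  theme_graph()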
Let’s calculate the centrality of the nodes and visualize this attribute. tidygraph contains LOTS of different measures of centrality; I’ll use betweenness, since that’s the concept I’m most familiar with. Centrality is an attribute of the nodes, so we have to activate() that part of the g object. Note that I’m making the graph undirected, since only my own followers have connections to other nodes when links are directed.
g2 <- net %>%
  select(user, friend) %>%            # drop the count column
  as_tbl_graph(directed = FALSE) %>%  # make undirected
  activate(nodes) %>%
  mutate(centrality = centrality_betweenness())
g2
## # A tbl_graph: 59 nodes and 91 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 59 x 2 (active)
## name centrality
## <chr> <dbl>
## 1 18273808 223.
## 2 144996068 178.
## 3 227841374 127.
## 4 207090115 75.2
## 5 121545138 164.
## 6 110975849 0
## # ... with 53 more rows
## #
## # Edge Data: 91 x 2
## from to
## <int> <int>
## 1 1 20
## 2 1 21
## 3 1 22
## # ... with 88 more rows
ggraph(g2) +
  geom_edge_link() +
  geom_node_point(aes(size = centrality, colour = centrality)) +
  theme_graph()
## Using `nicely` as default layout
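Betweenness isn’t the only option, of course. As a sketch, here are a few of tidygraph’s other node-level centrality functions applied to the same graph (not run here):

g2 %>%
  activate(nodes) %>%
  mutate(
    degree = centrality_degree(),        # number of incident edges
    closeness = centrality_closeness(),  # inverse of summed distances to all nodes
    pagerank = centrality_pagerank()     # PageRank score
  )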
This brief tutorial only scratches the surface of both rtweet and tidygraph, but it shows that with a little extra background, R makes API calls and new data formats fairly easy to work with. It takes a lot of searching through documentation to find the right functions and arguments, but it’s within your grasp!