In this quick tutorial you will learn how to visualize Twitter data. You will learn how to get access to Twitter’s API, how to retrieve tweets using rtweet, and how to study a Twitter user’s personal “Twittersphere”.
In this figure you can see a network of all major hashtags used in tweets that also include an @mention of @BernieSanders. Each of these hashtags corresponds to one of the nodes in the network. The linkages between the nodes depend on how often the different hashtags are used in combination. Hashtags that are rarely used in combination with other hashtags have small nodes, while more “connected” hashtags have larger nodes.
An easy yet fairly powerful way to get data from the Twitter API in R is the rtweet package. However, before you can use this package, you need to get approved by Twitter. To get approved as a developer, you first need to create a regular Twitter user account. Once you’ve done that, go to https://developer.twitter.com/en, click on your user name (top right corner), select “Apps”, and then click on “Create an App”. This should bring you to the following page:
The term “app” simply refers to your access to the Twitter API. You won’t have to program anything here; it’s just a form that you have to fill out. Some of the form’s questions are quite similar, so you might need to rephrase the same text a couple of times. In the “Website URL” field you can, for example, add a link to your GitHub page. Once you have completed the form, you will have to wait a couple of days for Twitter to respond via email. They might ask a few final questions, so you will probably have to rephrase your answers once again. Eventually, they will let you know that you have been approved, which means that you can now use your “app” to access the Twitter API. Just click on your app and go to “Keys and tokens”. Here you will find two so-called “keys” and two so-called “tokens” that you will need to pass to the rtweet package in order to access the Twitter API. Time to open up RStudio.
Go to https://developer.twitter.com/en again and select “Apps”. You should now be able to see your newly created app. Click on “details” and go to the “Keys and tokens” tab. You will now see the “API key” and the “API secret key” of your app. Below them you will find the “Access token” and the “Access token secret”. Regenerate these items and copy-paste them into a new R script like so:
library(rtweet)
app_name <- "Paste the name of your app here"
api_key <- "Paste the API key of your app here"
api_secret <- "Paste the API secret key of your app here"
access_token <- "Paste the Access token of your app here"
access_token_secret <- "Paste Access token secret of your app here"
twitter_token <- create_token(
app = app_name,
consumer_key = api_key,
consumer_secret = api_secret,
access_token = access_token,
access_secret = access_token_secret)
You only have to run this code once. rtweet saves the token on your computer, so in all future R sessions you can load it with the one-liner get_token(). Of course, if you regenerate your access details (on your Twitter developer page) you will have to go through this code chunk again.
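Pasting keys directly into a script makes it easy to leak them when the file is shared. One possible alternative, sketched here, is to read them from environment variables (set, for example, in ~/.Renviron). The helper read_secret() and the variable name TWITTER_API_KEY are hypothetical and not part of rtweet.

```r
# Hypothetical helper: read one credential from an environment variable
# and stop with a clear message if it is not set.
read_secret <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable '", name, "' is not set.", call. = FALSE)
  }
  value
}

# Demo with a placeholder value; in practice the variable would be set
# outside the script, e.g. in ~/.Renviron (the variable name is an assumption).
Sys.setenv(TWITTER_API_KEY = "placeholder-key")
api_key <- read_secret("TWITTER_API_KEY")
```

The resulting api_key can then be passed to create_token() in the same way as the pasted string above.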
# clear workspace and load packages
rm(list = ls())
library(rtweet)
library(tidyverse)
# connect to Twitter API
get_token()
# get most recent tweets
user <- "BernieSanders"
# n defaults to 100; 18000 is the most a single rate-limit window returns
tweets <- search_tweets(user, n = 18000, retryonratelimit = TRUE)
# What's inside of the dataframe?
dim(tweets)
## [1] 15495 90
names(tweets)[1:32]
## [1] "user_id" "status_id" "created_at"
## [4] "screen_name" "text" "source"
## [7] "display_text_width" "reply_to_status_id" "reply_to_user_id"
## [10] "reply_to_screen_name" "is_quote" "is_retweet"
## [13] "favorite_count" "retweet_count" "quote_count"
## [16] "reply_count" "hashtags" "symbols"
## [19] "urls_url" "urls_t.co" "urls_expanded_url"
## [22] "media_url" "media_t.co" "media_expanded_url"
## [25] "media_type" "ext_media_url" "ext_media_t.co"
## [28] "ext_media_expanded_url" "ext_media_type" "mentions_user_id"
## [31] "mentions_screen_name" "lang"
tweets[1:10,1:4]
## user_id status_id created_at screen_name
## 1 1048034022855790592 1272214352523931655 2020-06-14 17:08:32 HyperBaroque
## 2 879082204005052418 1272214344957415426 2020-06-14 17:08:31 BloodymirPutin
## 3 30737670 1272214322274545666 2020-06-14 17:08:25 Smartdragon
## 4 30737670 1272214104921563139 2020-06-14 17:07:33 Smartdragon
## 5 30737670 1272213404544110595 2020-06-14 17:04:46 Smartdragon
## 6 983765495387181056 1272214302796242951 2020-06-14 17:08:21 EDNA_RFRANCO
## 7 1106756013468856320 1272214265546641410 2020-06-14 17:08:12 Jessbun26
## 8 1213632755814166529 1272214252707774464 2020-06-14 17:08:09 RyanLaborOV
## 9 891603310041497600 1272214148538040322 2020-06-14 17:07:44 ViniciusN1kolod
## 10 1270665728228818944 1272214096323239939 2020-06-14 17:07:31 PhoebeHuber10
# select hashtags
htags <- tweets$hashtags
class(htags)
## [1] "list"
head(htags)
## [[1]]
## [1] "Politics" "Discord"
##
## [[2]]
## [1] "2020election" "BBNNrealNews" "ROLEX"
## [4] "historicMoment" "age" "physicalCondition"
## [7] "DonaldTrump" "JoeBiden" "aVitalQuestion"
## [10] "Biden" "vicePresident" "KamalaHarris"
## [13] "ElizabethWarren" "BernieSanders" "MartinLutherKing3"
##
## [[3]]
## [1] NA
##
## [[4]]
## [1] NA
##
## [[5]]
## [1] NA
##
## [[6]]
## [1] NA
The rtweet package returns a multitude of variables that are suitable for many different types of analyses. For our purposes, however, we’ll only need the “hashtags” variable, which stores the hashtags included in the different tweets.
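Because the hashtags variable is a list with one element per tweet, it helps to flatten it before summarizing. The following is a minimal sketch on made-up data (htags_toy is a hypothetical stand-in for tweets$hashtags) that counts how often each hashtag occurs.

```r
# Toy stand-in for tweets$hashtags: one element per tweet, NA = no hashtag
htags_toy <- list(
  c("Politics", "Discord"),
  c("JoeBiden", "Politics"),
  NA_character_,
  "Politics"
)

# Flatten the list, drop the NA placeholders, and count each hashtag
all_tags <- unlist(htags_toy)
all_tags <- all_tags[!is.na(all_tags)]
tag_counts <- sort(table(all_tags), decreasing = TRUE)
print(tag_counts)  # "Politics" comes first with a count of 3
```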
As we can see, “htags” is a list object. Each element of this list contains the hashtags used in one tweet. Some tweets don’t include any hashtags while other tweets include multiple hashtags. Whenever two (or more) hashtags are used in a single tweet, we can think of these hashtags as a pair. Depending on how many tweets jointly include these hashtags, the relationship between these hashtags is either rather weak or quite strong. When computing all these relationships for all possible pairs of hashtags, we arrive at the “Twittersphere” surrounding “@BernieSanders”.
A practical way of storing all the hashtag relationships is a so-called “adjacency matrix”. This is a square matrix with one row and one column for each hashtag. The \((i, j)\) element of this matrix tells us how many tweets include both hashtag \(i\) and hashtag \(j\).
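To make the definition concrete, here is a small sketch on three made-up tweets. It uses a tweet-by-hashtag incidence matrix and crossprod() instead of the loop shown below, so treat it as an illustrative alternative and not as the tutorial's method; all object names are hypothetical.

```r
# Three toy tweets, each with its own set of hashtags
toy <- list(
  c("a", "b"),
  c("a", "b", "c"),
  c("c")
)
tags <- sort(unique(unlist(toy)))

# Incidence matrix: one row per tweet, one column per hashtag (1 = used)
inc <- t(sapply(toy, function(x) as.integer(tags %in% x)))
colnames(inc) <- tags

# t(inc) %*% inc counts, for each pair of hashtags, the tweets using both
adj <- crossprod(inc)
diag(adj) <- 0  # a hashtag is not linked to itself
print(adj)      # "a" and "b" share two tweets, so adj["a", "b"] is 2
```

The matrix is symmetric by construction, since a tweet that pairs hashtag a with hashtag b also pairs b with a.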
# determine unique hashtags
utags <- unique(unlist(htags))
utags <- setdiff(utags, "BernieSanders") # safe even if the tag is absent
head(utags)
## [1] "Politics" "Discord" "2020election" "BBNNrealNews"
## [5] "ROLEX" "historicMoment"
# create all-zero adjacency matrix
mat <- matrix(0, length(utags), length(utags))
rownames(mat) <- utags
colnames(mat) <- utags
mat[1:6,1:6]
## Politics Discord 2020election BBNNrealNews ROLEX historicMoment
## Politics 0 0 0 0 0 0
## Discord 0 0 0 0 0 0
## 2020election 0 0 0 0 0 0
## BBNNrealNews 0 0 0 0 0 0
## ROLEX 0 0 0 0 0 0
## historicMoment 0 0 0 0 0 0
# fill adjacency matrix by looping through each tweet
for(i in seq_along(htags)){
# select the tweet's hashtags
tags <- htags[[i]]
# skip to next tweet, if the current tweet has fewer than two hashtags
if(length(tags) < 2) next
# drop the #BernieSanders hashtag itself
# (logical indexing keeps all tags when it is absent; -which() would drop them all)
tags <- tags[tags != "BernieSanders"]
# add plus one to current value in adjacency matrix
mat[tags,tags] <- mat[tags,tags] + 1
}
rm(i)
# no hashtag is linked to itself
diag(mat) <- 0 # main diagonal = 0
# inspect the adjacency matrix
dim(mat)
## [1] 1347 1347
mat[1:6,1:6]
## Politics Discord 2020election BBNNrealNews ROLEX historicMoment
## Politics 0 0 0 0 0 0
## Discord 0 0 0 0 0 0
## 2020election 0 0 0 1 1 1
## BBNNrealNews 0 0 1 0 1 1
## ROLEX 0 0 1 1 0 1
## historicMoment 0 0 1 1 1 0
So far we have only created a matrix. Now we need to tell R that this matrix is the underlying adjacency matrix of a network. We do this using the igraph package. This powerful package is used to model and manipulate network (or “graph”) objects in R. When visualizing our Twittersphere network, we want to ignore very insignificant nodes. Otherwise the resulting image would be overwhelmed by unimportant information. In particular, we want to get rid of nodes that do not have at least two connections to other nodes in the network (counting multiple connections to the same node). In the lingo of graph theory this property of nodes is called “strength” (not to be confused with “degree”, which excludes multiple connections to the same node). Moreover, we will use different colors to identify clusters of closely connected nodes.
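The difference between “strength” and “degree” is easiest to see on a tiny example. The sketch below uses a made-up three-node co-occurrence matrix and, as a simplifying assumption, an undirected graph (the code that follows uses mode = "directed").

```r
library(igraph)

# Toy co-occurrence counts: A and B appear together 3 times, A and C once
m <- matrix(0, 3, 3, dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
m["A", "B"] <- m["B", "A"] <- 3
m["A", "C"] <- m["C", "A"] <- 1

g <- graph_from_adjacency_matrix(m, mode = "undirected", weighted = TRUE)

degree(g)    # number of distinct neighbours: A = 2, B = 1, C = 1
strength(g)  # sum of edge weights:           A = 4, B = 3, C = 1
```

Note that in a directed graph built from a symmetric matrix every link is stored once in each direction, so strength(net, mode = "all") counts it twice.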
### create network from adjacency matrix
library(igraph)
net <- graph_from_adjacency_matrix(mat, mode = "directed", weighted = TRUE)
length(V(net)) # show the number of nodes in the network
## [1] 1347
# remove insignificant nodes
net <- delete_vertices(net, which(strength(net, mode = "all") < 2))
length(V(net)) # show the number of nodes in the network
## [1] 223
# determine colors based on clusters in the network
clusters <- cluster_walktrap(net, steps = 5)
member <- membership(clusters) # cluster id of each node
colors <- rainbow(length(unique(member)))
colors <- adjustcolor(colors[member], alpha.f = 0.7) # add transparency
There are of course many ways of visualizing the same underlying adjacency matrix. A very prominent family of methods, however, is force-directed graph drawing algorithms like that of Fruchterman & Reingold (1991). In a nutshell, these algorithms run a physics simulation in which a network’s nodes are connected by mechanical springs. Depending on how closely two nodes are related to each other, the spring connecting them will be either rather elastic or inelastic. A force-directed graph drawing algorithm then tries to position the nodes such that the forces of all springs are perfectly balanced. In effect, highly connected nodes are placed towards the center of the network, while relatively unimportant nodes are placed towards the periphery.
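As a small illustration of what such an algorithm returns, the sketch below runs igraph's layout_with_fr() on a made-up ring graph. The layout starts from random positions, which is why the seed is set before each call; the toy graph and object names are assumptions for demonstration only.

```r
library(igraph)

g <- make_ring(10)  # toy graph: 10 nodes connected in a circle

# The layout starts from random positions, so fix the seed before each call
set.seed(1)
coords_a <- layout_with_fr(g)
set.seed(1)
coords_b <- layout_with_fr(g)

dim(coords_a)                  # one row per node, columns = x and y
all.equal(coords_a, coords_b)  # same seed should give the same layout
```

The coordinate matrix can be passed to plot() through its layout argument, which is what the plotting code below does.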
# create pdf image
pdf("net.pdf", width = 20, height = 20)
set.seed(1) # set starting value of random number generator
# plot network
plot(simplify(net),
layout = layout_with_fr(net), # Fruchterman–Reingold algorithm
edge.arrow.width = .001, # width of arrow heads
edge.arrow.size = 0.001, # length of arrow heads
edge.color = adjustcolor(1, alpha.f = 0.05), # color of linkages
vertex.size = strength(net)^(1/2) / 2, # size of nodes
vertex.color = colors, # color of nodes
vertex.frame.color = adjustcolor("black", alpha.f = 0.9), # node frames
vertex.label.color = "black", # label color
vertex.label.cex = strength(net)^(1/2) / 10 # label size
)
# close graphics device
graphics.off() # run this line before opening the .pdf
Of course, the igraph package allows for a whole range of different visualization techniques. Going through the package’s many options is beyond the scope of this tutorial. If you would like to know more about how to use the igraph package to conduct network analyses, this tutorial is probably a good starting point.