This is part of the interactive tutorial for COMM 497DB on social data analytics. You can visit previous topics at https://curiositybits.shinyapps.io/R_social_data_analytics.
Have you wondered how information spreads on Twitter, how Instagram influencers are identified, and how different actors in an online community collaborate with or confront one another? These are the sorts of questions that are best answered using network analysis and network visualization. In network analysis of internet communities, we visualize and quantify the structure of social relationships and information flows. See a real-world application my team has built to track the upcoming Philippine General Election.
Here, you can see a retweet network based on 9,999 #BreakUpBigTech tweets. In this network, an edge between a pair of users represents a retweeting relationship. That is, two users are connected to one another if one retweets or is retweeted by the other. For simplicity, the graph below only shows users who retweeted or were retweeted by others at least twice.
Guess how the size and color of a node are determined.
Where do we begin to visualize a network? It all starts with nodes and edges. The table below shows 20 tweets.
An edgelist shows all edges in a network along with attributes of the edges. An edge is a relationship between a pair of nodes (in this case, users). An edge can be directed: for example, A retweets B is expressed as User A → User B, whereas B retweets A is expressed as User B → User A. In some cases, however, an edge is undirected. Think about your Facebook relationships: if user A is a friend of user B, then user B is by default also connected to user A.
An edgelist based on the 20 tweets looks like this. The column source lists the Twitter users who retweeted. The column target shows the users who were on the receiving end of the retweets (i.e., users who were retweeted by others). The size column is the edge weight, referring to the number of retweets that occurred between the same pair of users.
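As a minimal sketch of how such an edgelist could be assembled by hand, the base R code below counts repeated retweet pairs in a made-up data frame. The object toy_tweets and its user names are hypothetical; the column names screen_name and retweet_screen_name are assumed to follow the rtweet convention used later in this tutorial.

```r
# Hypothetical toy data: each row is one retweet (sender, retweeted user)
toy_tweets <- data.frame(
  screen_name = c("ann", "ann", "bob", "cat"),
  retweet_screen_name = c("bob", "bob", "cat", "bob"),
  stringsAsFactors = FALSE
)

# One row per retweet, with a counter column that can be summed
retweet_pairs <- data.frame(
  source = toy_tweets$screen_name,
  target = toy_tweets$retweet_screen_name,
  size = 1
)

# Collapse duplicate source-target pairs; size becomes the edge weight
toy_edgelist <- aggregate(size ~ source + target, data = retweet_pairs, FUN = sum)
print(toy_edgelist)
```

Because ann retweeted bob twice, that pair collapses into a single directed edge with a weight of 2, while the other pairs keep a weight of 1.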
In our example, a node is a Twitter user. Below is a list of nodes, with their id, labels, and attributes (e.g., size). Wonder how size is determined? We will cover this in the later part of the tutorial.
Here comes the real deal: how to turn collected tweets into a network. Previously, the process involved several steps of text cleaning to extract relevant @screennames. A new library called graphTweets (http://graphtweets.john-coene.com/index.html) makes the task much easier. graphTweets works seamlessly on Twitter data collected through rtweet (https://rtweet.info/). You can easily tweak the graphTweets code to make it work for data that come in different shapes and sizes. Below, I will use the data frame tweets as a demo. tweets contains 9,999 tweets that use #BreakUpBigTech.
Make sure graphTweets is installed and loaded. In addition to graphTweets, we also need igraph and twinetverse. To install twinetverse, use the code below.
library(devtools) # install devtools first if you do not have it
devtools::install_github("JohnCoene/twinetverse")
We begin by defining an R function. A function is a block of code organized to perform a specific task. R has a large number of built-in functions. We can also create our own functions. A self-defined function saves a lot of repetitive work. See how I define a function called extractrt below.
library(graphTweets)
library(twinetverse)
extractrt <- function(df){
  rt <- df %>%
    gt_edges(screen_name, retweet_screen_name) %>% # get edges
    gt_nodes() %>% # get nodes
    gt_collect() # collect
  return(rt)
}
The self-defined function extractrt takes in df (it has to be a data frame from rtweet for the function to work). The function then uses three built-in functions from graphTweets to extract nodes and edges from the tweets in df. In a standard data frame returned from rtweet, the sender of a retweet (the user who retweets others) is in the screen_name column, and the retweeted users are in the retweet_screen_name column. See how this particular step is spelled out below.
gt_edges(screen_name, retweet_screen_name) %>% # get edges
The function extractrt creates and returns an object called rt. rt can be easily converted to an igraph object. igraph is one of the most common network analysis libraries in R. We will deal with igraph later.
After a function is defined in R, we can apply it to a data frame of tweets. Below, we apply extractrt to tweets and create rtnet.
rtnet <- extractrt(tweets)
The reason we call this function extractrt is that it only extracts retweeting relationships. How about Twitter mentions/replies? A Twitter mention/reply signifies a more engaged mode of interaction. A retweet is mostly a passive information relay, but a Twitter mention/reply is an active outreach. In the code below, I define extractmt as a function for extracting edges and nodes from Twitter mentions/replies. This function scans tweets and extracts @screennames from the screen_name and mentions_screen_name columns. I then apply extractmt to tweets and create mtnet.
library(twinetverse)
extractmt <- function(df){
  mt <- df %>%
    gt_edges(screen_name, mentions_screen_name) %>% # get edges
    gt_nodes() %>% # get nodes
    gt_collect() # collect
  return(mt)
}
mtnet <- extractmt(tweets)
Now we have two network objects, rtnet and mtnet, and we want to take a look at them. But unlike data frames, you cannot just click to view a network object. You can use the following two self-defined functions to get node lists and edgelists from the two network objects.
# define a function called nodes to extract node information from a network object
nodes <- function(net){
  c(edges, nodes) %<-% net
  nodes$id <- as.factor(nodes$nodes)
  nodes$size <- nodes$n
  nodes <- nodes2sg(nodes)
  nodes <- nodes[,2:5]
  return(nodes)
}
# define a function called edges to extract edge information from a network object
edges <- function(net){
  c(edges, nodes) %<-% net
  edges$id <- seq(1, nrow(edges))
  edges <- edges2sg(edges)
  return(edges)
}
#apply the two self-defined functions
rtnet_nodes <- nodes(rtnet)
rtnet_edges <- edges(rtnet)
mtnet_nodes <- nodes(mtnet)
mtnet_edges <- edges(mtnet)
In the above step, we create four objects: rtnet_nodes, rtnet_edges, mtnet_nodes, and mtnet_edges. All four objects are data frames. Let’s take a look at rtnet_edges. It is an edgelist.
library(DT)
datatable(rtnet_edges, options = list(pageLength = 5))
Network analysis is essentially a mathematical process. Any user and any network can be scored on a range of attributes. To do this, we will convert our network objects into igraph objects. For example, for the retweet network, we can create an igraph object based on rtnet_edges and rtnet_nodes. See the code and comments below.
Make sure the library igraph is installed.
library(igraph) #make sure this is installed
# use rtnet_edges as the edgelist and rtnet_nodes as the node list. Set the network type as directed
rt <- graph_from_data_frame(d=rtnet_edges, vertices=rtnet_nodes, directed=T)
# set edge weight by copying the values from the size column in rtnet_edges
rt <- set_edge_attr(rt, "weight", value= rtnet_edges$size)
# we do the same for the mention network
mt <- graph_from_data_frame(d=mtnet_edges, vertices=mtnet_nodes, directed=T)
mt <- set_edge_attr(mt, "weight", value= mtnet_edges$size)
But first, let’s just take a look at some network-level indicators.
A quick way to compare different networks (e.g., the retweet network vs. the mention network) is to look at their size. Run the code below to get a count of nodes and edges in rt and mt.
Which network has more users in it? And which network has more connections?
vcount(rt) # this shows the number of nodes/vertices in rt
## [1] 9092
vcount(mt) # this shows the number of nodes/vertices in mt
## [1] 9322
ecount(rt) # this shows the number of edges in rt
## [1] 9137
ecount(mt) # this shows the number of edges in mt
## [1] 10399
A densely connected network (high density score) is a type of network in which many users are interconnected, whereas a sparse network (low density) is a network in which only a few are interconnected. Two contrasting examples of dense and sparse networks are a network of people in a family gathering in which almost everyone knows everyone else, and a network of people sitting on a public bus.
Which network is more interconnected?
edge_density(rt, loops = FALSE) # the density of rt
## [1] 0.0001105433
edge_density(mt, loops = FALSE) # the density of mt
## [1] 0.0001196796
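To make the density score less of a black box, here is a small sketch that recomputes it by hand. It assumes the usual definition for a directed network without self-loops: the number of edges divided by the number of possible edges, n × (n − 1). The helper manual_density and the toy graph are illustrative names, not part of the tutorial's code.

```r
library(igraph)

# Density of a directed network without self-loops:
# observed edges divided by possible edges, n * (n - 1)
manual_density <- function(n_nodes, n_edges) {
  n_edges / (n_nodes * (n_nodes - 1))
}

# Plug in the node and edge counts reported above for rt and mt
manual_density(9092, 9137)
manual_density(9322, 10399)

# Toy check against igraph: three users, two directed edges
toy <- graph_from_data_frame(
  data.frame(from = c("a", "b"), to = c("b", "c")),
  directed = TRUE
)
edge_density(toy, loops = FALSE)
manual_density(vcount(toy), ecount(toy))
```

The hand-computed values should agree with the edge_density output shown above, which is one way to see why both networks are so sparse: there are roughly as many edges as nodes, against tens of millions of possible edges.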
Think of centralization as a question of inequality and who is in control. In a centralized network, a small number of nodes (users) control the information flow. In a retweet network specifically, it means that only a handful of users retweet or are retweeted by others. Centralized and decentralized networks have different ramifications for the diffusion of ideas and norms and for effective mobilization.
By setting mode = c("in"), we calculate the centralization score based on the extent to which users are retweeted by others (as opposed to how much they retweet others).
So, which network is more centralized?
# Calculate centralization
centr_degree(rt, mode = c("in"), loops = TRUE, normalized = TRUE)$centralization
## [1] 0.95006
centr_degree(mt, mode = c("in"), loops = TRUE, normalized = TRUE)$centralization
## [1] 0.936797
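For intuition about what a high or low centralization score looks like, the sketch below compares two artificial networks built with igraph's make_star and make_ring. The graphs are toy examples, and loops = FALSE is used here only to keep the toy comparison simple.

```r
library(igraph)

# A star where every edge points at the hub: one user receives all retweets
star <- make_star(10, mode = "in")

# A directed ring: every user is retweeted exactly once
ring <- make_ring(10, directed = TRUE)

star_score <- centr_degree(star, mode = "in", loops = FALSE)$centralization
ring_score <- centr_degree(ring, mode = "in", loops = FALSE)$centralization

# The star should score far higher than the ring
star_score > ring_score
```

The retweet network above, where one account collects most incoming retweets, sits much closer to the star than to the ring.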
Have you heard of the saying "birds of a feather flock together"? In a network, nodes tend to cluster together based on shared attributes. For instance, Twitter users may mostly retweet content they agree with. This tendency results in clusters of nodes with similar mindsets or opinions. The extent to which a network reflects this pattern of clustering can be quantified using the clustering coefficient.
transitivity(rt)
## [1] 1.929652e-06
transitivity(mt)
## [1] 0.0001250248
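The two extremes of the clustering coefficient can be seen in a quick sketch on toy graphs. Note that igraph's transitivity treats a directed graph as undirected, so the toy graphs here are undirected as well.

```r
library(igraph)

# Three users who are all connected to each other
triangle <- make_full_graph(3)

# One hub connected to four users who are not connected to each other
hub <- make_star(5, mode = "undirected")

triangle_cc <- transitivity(triangle) # every connected triple is closed
hub_cc <- transitivity(hub)           # no connected triple is closed

c(triangle = triangle_cc, hub = hub_cc)
```

The very small scores reported above suggest that both Twitter networks look much more like the hub than the triangle.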
Reciprocity is calculated as the proportion of reciprocated ties. In the retweet network, for example, reciprocity shows the extent to which pairs of users have retweeted one another.
Which form of Twitter interactions (retweet vs. mention) is more reciprocal?
reciprocity(rt)
## [1] 0.000219082
reciprocity(mt)
## [1] 0.001733436
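A small worked case may help in reading these numbers. In the hypothetical toy network below, a and b retweet each other, and a also retweets c without being retweeted back, so two of the three edges are reciprocated.

```r
library(igraph)

# Hypothetical toy network: a <-> b is mutual, a -> c is one-way
toy <- graph_from_data_frame(
  data.frame(from = c("a", "b", "a"), to = c("b", "a", "c")),
  directed = TRUE
)

# Share of edges that are reciprocated
toy_reciprocity <- reciprocity(toy)
toy_reciprocity
```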
So far, I have introduced a range of indicators to quantify a network as a whole. Such indicators are mostly useful when comparing different networks. When analyzing a single network, we are more interested in node-level indicators.
A common task in network analysis is identifying influencers. An influencer could mean different things to different people. Here we try a couple of different metrics.
Indegree centrality measures the number of incoming connections a user has received. A high indegree in the retweet network means that the user is frequently retweeted by others. Do you agree that the most retweeted users are influencers? And why?
indegree_rt <- sort(degree(rt,mode = "in"),decreasing = TRUE)
indegree_rt[1:10] # show the top 10 users ranked by in-degree
##         ewarren     anandwrites      omanreagan   stclairashley
##            8638             107              40              40
##        guardian      chadfelixg          a35362         soyrosa
##              39              20              19              16
##         jc_cali myth_capitalism
##              15              14
Outdegree centrality measures the number of outgoing connections a user has. A high outdegree in the retweet network means that the user frequently retweets other users. What would you call such users? Mobilizers?
outdegree_rt <- sort(degree(rt,mode = "out"),decreasing = TRUE)
outdegree_rt[1:10] # show the top 10 users ranked by out-degree
## edwood05572006         raqb16   damonbethea1    fuelgrannie   gavin_bonnar
##             13              6              5              5              4
##   atheist_cvnt     natemezmer  philippejouan  sharonresists   tedgrunewald
##              3              3              3              3              3
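The difference between the two degree measures is easy to see on a toy network. The sketch below uses made-up users: three of them retweet hub, and ann additionally retweets bob.

```r
library(igraph)

# Hypothetical retweets: from = who retweets, to = who is retweeted
toy <- graph_from_data_frame(
  data.frame(
    from = c("ann", "bob", "cat", "ann"),
    to   = c("hub", "hub", "hub", "bob")
  ),
  directed = TRUE
)

toy_in <- degree(toy, mode = "in")   # how many users retweet each user
toy_out <- degree(toy, mode = "out") # how many users each user retweets

sort(toy_in, decreasing = TRUE)
sort(toy_out, decreasing = TRUE)
```

Here hub tops the in-degree ranking and ann tops the out-degree ranking. Note that degree counts edges rather than edge weights; igraph's strength function sums edge weights instead, which may matter when the same pair of users interacts many times.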
Betweenness centrality measures the number of times a node lies on the shortest path between other nodes. We use this metric to find users who act as ‘bridges’ between nodes in a network and who influence the information flow around a network.
bt <- sort(betweenness(rt, directed=T, weights=NA), decreasing = TRUE)
bt[1:10] # show the top 10 nodes by betweenness centrality
##      omanreagan      chadfelixg         soyrosa    damonbethea1
##              32              18              16              13
##         jc_cali    commondreams  resistasista76          yitzee
##               8               4               3               3
##      tuxcedocat vivek_gkrishnan
##               2               2
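The idea of a bridge can be checked on the smallest possible case. In the toy chain below (hypothetical users a, b, and c), the only path from a to c runs through b, so b is the only node that lies between others.

```r
library(igraph)

# A simple chain: a -> b -> c
chain <- graph_from_data_frame(
  data.frame(from = c("a", "b"), to = c("b", "c")),
  directed = TRUE
)

# b sits on the single shortest path from a to c
toy_bt <- betweenness(chain, directed = TRUE)
toy_bt
```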
Ever wonder how Google ranks search results? It uses the PageRank algorithm developed by Google’s founders Sergey Brin and Larry Page. We can use PageRank to locate influencers as well.
pr <- page_rank(rt, algo = c("prpack"))
pr <- sort(pr$vector,decreasing = TRUE)
pr[1:10] # show the top 10 users ranked by PageRank
##         ewarren     anandwrites myth_capitalism      omanreagan
##    0.4338045988    0.0047293401    0.0042530946    0.0042105884
##   scottrickhoff   stclairashley        guardian      chadfelixg
##    0.0020597744    0.0017596358    0.0016080549    0.0010092895
##          a35362         soyrosa
##    0.0009209153    0.0008592202
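One way to see how PageRank differs from in-degree is a toy network in which a user is retweeted only once, but by the most important account. In the hypothetical sketch below, ann, bob, and cat all retweet hub, and hub retweets ann.

```r
library(igraph)

# Hypothetical retweets: three users retweet hub, and hub retweets ann
toy <- graph_from_data_frame(
  data.frame(
    from = c("ann", "bob", "cat", "hub"),
    to   = c("hub", "hub", "hub", "ann")
  ),
  directed = TRUE
)

toy_pr <- page_rank(toy)$vector

sort(toy_pr, decreasing = TRUE) # ranking by PageRank
sum(toy_pr)                     # PageRank scores add up to 1
```

In this sketch ann has an in-degree of only 1, yet ranks well above bob and cat because her single retweet comes from hub. A similar effect may explain why the PageRank ranking above does not follow the in-degree ranking exactly.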
We use community detection algorithms to cluster users into different groups (we call such groups clusters or communities). Users in the same cluster are more connected with one another than with users outside of the cluster. By using community detection, we can reveal important divisions and fragmentation that exist due to different opinions, values, and user characteristics.
Some community detection algorithms require intensive computing. It may take a long time to produce an output.
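Before running community detection on thousands of users, it can help to see it work on a network where the answer is obvious. The sketch below builds a hypothetical undirected network of two triangles joined by a single bridge and applies cluster_edge_betweenness from igraph, which is fast on a graph this small.

```r
library(igraph)

# Two tight groups (a, b, c) and (d, e, f) joined by one bridge, c - d
toy <- graph_from_data_frame(
  data.frame(
    from = c("a", "b", "c", "d", "e", "f", "c"),
    to   = c("b", "c", "a", "e", "f", "d", "d")
  ),
  directed = FALSE
)

toy_clusters <- cluster_edge_betweenness(toy)

length(toy_clusters)     # number of clusters found
membership(toy_clusters) # cluster id of each user
```

The algorithm should separate the two triangles, since the bridge between them is the edge that most shortest paths pass through.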
k-core
Creating a k-core is fast and easy. We can use a k-core to identify a small subset of users who are the most interconnected. In a k-core, each node has at least k connections to other nodes in the core. Below we extract a 2-core (named twocore) in which each user has at least 2 edges with other users in the core.
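As a small illustration of what coreness measures, consider a hypothetical undirected toy network: a triangle of users a, b, and c, plus a user d who is attached only to c. The sketch uses which() to select the vertices to keep.

```r
library(igraph)

# Triangle a - b - c, plus d hanging off c
toy <- graph_from_data_frame(
  data.frame(
    from = c("a", "b", "c", "c"),
    to   = c("b", "c", "a", "d")
  ),
  directed = FALSE
)

toy_core <- coreness(toy, mode = "all")
toy_core

# Keep only the users whose coreness is at least 2
toy_twocore <- induced_subgraph(toy, which(toy_core >= 2))
V(toy_twocore)$name
```

User d has only one connection, so it drops out of the 2-core, while the three triangle members each keep two connections inside the core.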
kcore <- coreness(rt, mode="all")
twocore <- induced_subgraph(rt, kcore>=2)
edge betweenness (Newman-Girvan)
This is one of the community detection algorithms that are computationally intensive. Be patient while it crunches the numbers for you.
The code below creates an object called ceb. It records which cluster each node belongs to. We then print the number of clusters and the cluster IDs of the first 10 nodes.
ceb <- cluster_edge_betweenness(rt)
print(paste("There are", length(ceb), "clusters based on this community detection algorithm"))
membership(ceb)[1:10] # list only the first 10 nodes
There are many ways to visualize a network. You can make the visualization static or interactive, as this example shows. You can even create a dynamic one showing the evolution of a network over time (see an example).
Below, we will try some of the basics using two libraries, igraph and visNetwork. igraph comes with some built-in functions for visualization. visNetwork goes a step further by making the result prettier and interactive.
Before you visualize a network, here are the decisions you need to make:
In our example below, we color nodes based on the clusters they belong to. We set the node size based on PageRank score (the famous scoring technique used by Google), with central nodes represented by bigger nodes. And we don’t want to show all nodes, as that would create a messy network; instead, we show only the most interconnected subset (using k-core).
In the previous steps, we learned the code for calculating node-level and network-level metrics (e.g., centrality). Here, we will pass the metrics to nodes and store them as node attributes. This allows the visualization code to pick up the attributes and use them for sizing and coloring.
In the code below, we add PageRank score (used for node size) and the cluster id (used for assigning color). We use V(rt) to access node attributes and E(rt) to access edge attributes.
library(igraph)
library(visNetwork)
library(scales)
pr <- page_rank(rt, algo = c("prpack"))
V(rt)$size <- pr$vector*100 # set node size by PageRank score
wc <- cluster_walktrap(rt)
V(rt)$color <- membership(wc) # set color by cluster id
Since we visualize only the 2-core, we create a subset of the network.
kcore <- coreness(rt, mode="all")
twocore <- induced_subgraph(rt, kcore>=2)
Find a visualization algorithm that fits
And we visualize it. Notice that we set layout = "layout_nicely". This is how we specify which visualization algorithm to use. There is a whole bunch of them: see the listing. If you are curious about the visual effects of different algorithms, try layout = "layout_in_circle", layout = "layout_with_kk", or layout = "layout_with_sugiyama".
visIgraph(twocore,idToLabel = TRUE,layout = "layout_nicely") %>%
visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)
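To get a feel for what a layout algorithm actually does, it may help to call one directly. In igraph, layout functions such as layout_in_circle and layout_with_kk return a matrix with one row per node and one column each for the x and y coordinates. The sketch below uses a toy ring of six nodes rather than the Twitter network.

```r
library(igraph)

# A toy ring of six nodes
toy <- make_ring(6)

# Each layout function returns x and y coordinates for every node
circle_xy <- layout_in_circle(toy)
kk_xy <- layout_with_kk(toy)

dim(circle_xy) # one row per node, two columns (x and y)
head(circle_xy, 3)
head(kk_xy, 3)
```

Different algorithms place the same nodes at different coordinates, which is why the same network can look quite different from one layout to the next.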