This is part of the interactive tutorial for COMM 497DB on social data analytics. You can visit previous topics at https://curiositybits.shinyapps.io/R_social_data_analytics.

What can a network tell us?

Have you wondered how information spreads on Twitter, how Instagram influencers are identified, and how different actors in an online community collaborate with or confront one another? These are the sorts of questions that can be best answered using network analysis and network visualization. In network analysis of internet communities, we visualize and quantify the structure of social relationships and information flows. See a real-world application my team has built to track the upcoming Philippine General Election.

Here, you can see a retweet network based on 9,999 #BreakUpBigTech tweets. In this network, an edge between a pair of users represents a retweeting relationship. That is, two users are connected to one another if one retweets or is retweeted by the other. For simplicity, the graph below only shows users who retweeted or were retweeted by others at least twice.

Guess how the size and color of a node are determined.

Edges

Where do we begin to visualize a network? It all starts with nodes and edges. The table below shows 20 tweets.

An edgelist shows all edges in a network along with attributes of the edges. An edge is a relationship between a pair of nodes (in this case, users). An edge can be directed: for example, A retweets B is expressed as User A → User B, whereas B retweets A is expressed as User B → User A. In some cases, however, an edge is undirected. Think about your Facebook relationships: if user A is a friend of user B, then by default user B is also connected to user A.

An edgelist based on the 20 tweets looks like this. The column source lists the Twitter users who retweeted. The column target shows the users who were on the receiving end of the retweets (i.e., users who were retweeted by others). The size column is the edge weight, referring to the number of retweets that occurred between the same pair of users.
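To make the size column concrete, here is a minimal sketch in base R. The data frame toy_retweets and its user names are made up for illustration and are not part of the tutorial's data; the column names simply mirror the rtweet columns used later. The code counts how many times each source retweeted each target.

```r
# Hypothetical toy data: one row per retweet (not the tutorial's tweets object)
toy_retweets <- data.frame(
  screen_name         = c("alice", "alice", "bob", "carol"),
  retweet_screen_name = c("dana", "dana", "dana", "alice"),
  stringsAsFactors = FALSE
)

# Give every retweet a count of 1, then sum the counts per (source, target) pair
toy_retweets$size <- 1
toy_edgelist <- aggregate(size ~ screen_name + retweet_screen_name,
                          data = toy_retweets, FUN = sum)
names(toy_edgelist) <- c("source", "target", "size")

toy_edgelist
```

In this toy example, alice retweeted dana twice, so that pair collapses into a single edge with a size of 2, while the other two pairs keep a size of 1.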

Nodes

In our example, a node is a Twitter user. Below is a list of nodes, with their id, labels, and attributes (e.g., size). Wonder how size is determined? We will cover this in the later part of the tutorial.

How to turn tweets into a network?

Here comes the real deal: how to turn collected tweets into a network. Previously, the process involved several steps of text cleaning to extract relevant @screennames. A new library called graphTweets[http://graphtweets.john-coene.com/index.html] makes the task much easier. graphTweets works seamlessly on Twitter data collected through rtweet[https://rtweet.info/]. You can easily tweak the graphTweets code to make it work for data that come in different shapes and sizes. Below, I will use the data frame tweets as a demo. tweets contains 9,999 tweets that use #BreakUpBigTech.

Make sure graphTweets is installed and loaded. In addition to graphTweets, we also need igraph and twinetverse. To install twinetverse, use the code below.

library(devtools) # install devtools first if you do not have it
devtools::install_github("JohnCoene/twinetverse")

We begin by defining an R function. A function is a set of statements organized together to perform a specific task. R has a large number of built-in functions. We can also create our own functions. A self-defined function saves a lot of repetitive work. See how I define a function called extractrt below.

library(graphTweets)
library(twinetverse)

extractrt <- function(df){
  rt <- df %>% 
    gt_edges(screen_name, retweet_screen_name) %>% # get edges
    gt_nodes() %>% # get nodes
    gt_collect() # collect
  
  return(rt)
}

The self-defined function extractrt takes in df (it has to be a data frame from rtweet in order for the function to work). The function then uses three built-in functions from graphTweets to extract nodes and edges from the tweets in df. In a standard data frame returned from rtweet, the sender of a retweet (the user who retweets others) is in the screen_name column, and the retweeted users are in the retweet_screen_name column. See how this particular process is spelled out below.

 gt_edges(screen_name, retweet_screen_name) %>% # get edges

The function extractrt creates and returns an object called rt. rt can be easily converted to an igraph object. igraph is one of the most common network analysis libraries in R. We will deal with igraph later.

After a function is defined in R, we can apply it to a data frame of tweets. Below, we apply extractrt to tweets and create rtnet.

rtnet <- extractrt(tweets)

The reason we call this function extractrt is that it only extracts retweeting relationships. How about Twitter mentions/replies? A Twitter mention/reply signifies a more engaged mode of interaction. A retweet is mostly a passive information relay, but a Twitter mention/reply is an active outreach. In the code below, I define extractmt as a function for extracting edges and nodes in Twitter mentions/replies. This function scans tweets and extracts @screennames in the screen_name and mentions_screen_name columns. I then apply extractmt to tweets and create mtnet.

library(twinetverse)

extractmt <- function(df){
  
  mt <- df %>% 
    gt_edges(screen_name, mentions_screen_name) %>% # get edges
    gt_nodes() %>% # get nodes
    gt_collect() # collect
  
  return(mt)
}

mtnet <- extractmt(tweets)

Now that we have two network objects, rtnet and mtnet, we want to take a look at them. But unlike data frames, you cannot just click to view a network object. You can use the following two self-defined functions to get node lists and edgelists from the two network objects.

#define a function called nodes to extract node information from a network object

nodes <- function(net){
  
  c(edges, nodes) %<-% net
  nodes$id <- as.factor(nodes$nodes) 
  nodes$size <- nodes$n 
  nodes <- nodes2sg(nodes)
  nodes <- nodes[,2:5]
  
  return(nodes)
}

#define a function called edges to extract edge information from a network object

edges <- function(net){
  
  c(edges, nodes) %<-% net
  edges$id <- seq(1, nrow(edges))
  edges <- edges2sg(edges)
  
  return(edges)
}

#apply the two self-defined functions
rtnet_nodes <- nodes(rtnet)
rtnet_edges <- edges(rtnet)

mtnet_nodes <- nodes(mtnet)
mtnet_edges <- edges(mtnet)

From the above step, we create four objects: rtnet_nodes, rtnet_edges, mtnet_nodes, mtnet_edges. The four objects are all data frames. Let’s take a look at rtnet_edges. It is an edgelist.

library(DT)
datatable(rtnet_edges, options = list(pageLength = 5))

Convert to igraph

Network analysis is essentially a mathematical process. Any user and any network can be scored based on some attributes. To do this, we will convert our network objects into igraph objects. For example, for the retweet network, we can create an igraph object based on rtnet_edges and rtnet_nodes. See the code and comments below.

Make sure the library igraph is installed.

library(igraph) #make sure this is installed 

# use rtnet_edges as the edgelist and rtnet_nodes as the node list. Set the network type as directed

rt <- graph_from_data_frame(d=rtnet_edges, vertices=rtnet_nodes, directed=T) 

# set edge weight by copying the values from the size column in rtnet_edges

rt <- set_edge_attr(rt, "weight", value= rtnet_edges$size)

# we do the same for the mention network

mt <- graph_from_data_frame(d=mtnet_edges, vertices=mtnet_nodes, directed=T) 
mt <- set_edge_attr(mt, "weight", value= mtnet_edges$size)

But first, let’s just take a look at some network-level indicators.

Size matters!

A quick way to compare different networks (e.g., the retweet network vs. the mention network) is to look at their sizes. Run the code below to get a count of edges and nodes in rt and mt.

Which network has more users in it? And which network has more connections?

vcount(rt) #this shows the number of nodes/vertices in rt

## [1] 9092

vcount(mt) #this shows the number of nodes/vertices in mt

## [1] 9322

ecount(rt) #this shows the number of edges in rt

## [1] 9137

ecount(mt) #this shows the number of edges in mt

## [1] 10399

Dense or sparse?

A densely connected network (high density score) is a type of network in which many users are interconnected, whereas a sparse network (low density) is a network in which only a few are interconnected. Two contrasting examples of dense and sparse networks are a network of people in a family gathering in which almost everyone knows everyone else, and a network of people sitting on a public bus.

Which network is more interconnected?

edge_density(rt, loops = FALSE) #the density of rt

## [1] 0.0001105433

edge_density(mt, loops = FALSE) #the density of mt

## [1] 0.0001196796
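As a rough check on what the density score means, the sketch below recomputes it by hand from the node and edge counts reported above. It assumes the usual definition for a directed network without self-loops: the number of observed edges divided by the number of possible edges, n * (n - 1). The variable names are made up for this illustration.

```r
# Node and edge counts reported by vcount() and ecount() earlier in this tutorial
n_rt <- 9092; m_rt <- 9137
n_mt <- 9322; m_mt <- 10399

# Directed density without self-loops: observed edges / possible edges n * (n - 1)
density_rt <- m_rt / (n_rt * (n_rt - 1))
density_mt <- m_mt / (n_mt * (n_mt - 1))

density_rt
density_mt
```

Both values should agree with the edge_density() output above, which shows how sparse these networks are: only about one in ten thousand possible connections actually exists.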

Centralized or decentralized?

Think of centralization as a question of inequality and who is in control. In a centralized network, a small number of nodes (users) control the information flow. In a retweet network specifically, it means that only a handful of users retweet or are retweeted by others. Centralized and decentralized networks have different ramifications for the diffusion of ideas, norms, and effective mobilization.

By setting mode = c("in"), we calculate the centralization score based on the extent to which users are retweeted by others (as opposed to retweeting others).

So, which network is more centralized?

#Calculate centralization
centr_degree(rt, mode = c("in"), loops = TRUE,normalized = TRUE)$centralization

## [1] 0.95006

centr_degree(mt, mode = c("in"), loops = TRUE,normalized = TRUE)$centralization

## [1] 0.936797
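To build some intuition for these scores, here is a small sketch comparing two toy networks built with igraph: a star, where everyone points at a single hub, and a ring, where every user receives exactly one connection. Both graphs are hypothetical and unrelated to the tweets data. Note that this sketch sets loops = FALSE (the toy graphs have no self-loops), which differs from the loops = TRUE setting used above.

```r
library(igraph)

# Toy graphs (hypothetical): a hub-dominated network vs. a perfectly even one
star <- make_star(6, mode = "in")      # 5 users all point at user 1
ring <- make_ring(6, directed = TRUE)  # each user points at exactly one other user

# In-degree centralization, normalized by the theoretical maximum
star_c <- centr_degree(star, mode = "in", loops = FALSE,
                       normalized = TRUE)$centralization
ring_c <- centr_degree(ring, mode = "in", loops = FALSE,
                       normalized = TRUE)$centralization

star_c  # high: a single node receives all incoming ties
ring_c  # zero: incoming ties are spread evenly
```

The scores of 0.95 and 0.94 above therefore suggest that both Twitter networks look much more like the star than the ring.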

Birds of a feather flock together?

Have you heard of the saying birds of a feather flock together? In a network, nodes tend to cluster together based on some shared attributes. For instance, Twitter users may retweet mostly content they agree with. Hence, this tendency will result in clusters of nodes with similar mindsets or opinions. The extent to which a network reflects this pattern of clustering can be quantified using the clustering coefficient.

transitivity(rt)

## [1] 1.929652e-06

transitivity(mt)

## [1] 0.0001250248
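The clustering coefficient can be read as the share of connected triples (A linked to B, B linked to C) that are closed into triangles (A also linked to C). The sketch below illustrates the two extremes with hypothetical toy graphs from igraph; neither is derived from the tweets data.

```r
library(igraph)

# Toy graphs (hypothetical)
triangle <- make_full_graph(3)             # three users, all connected to each other
hub <- make_star(4, mode = "undirected")   # one hub with three spokes, no triangles

tr_triangle <- transitivity(triangle)  # every connected triple is closed
tr_hub <- transitivity(hub)            # no connected triple is closed

tr_triangle
tr_hub
```

Seen this way, the very low scores above suggest that both Twitter networks resemble the hub-and-spoke pattern far more than a tightly knit group.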

Is it reciprocal?

Reciprocity is calculated as the proportion of reciprocated ties. In the retweet network, for example, reciprocity shows the extent to which pairs of users have mutually retweeted one another.

Which form of Twitter interactions (retweet vs. mention) is more reciprocal?

reciprocity(rt)

## [1] 0.000219082

reciprocity(mt)

## [1] 0.001733436
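Here is a minimal sketch of how the proportion works, using a made-up three-user network (the users A, B, and C are hypothetical). A and B point at each other, while A's tie to C is not returned, so two of the three directed edges are reciprocated.

```r
library(igraph)

# Hypothetical toy edgelist: A -> B, B -> A (mutual), A -> C (one-way)
toy_edges <- data.frame(
  source = c("A", "B", "A"),
  target = c("B", "A", "C")
)
toy <- graph_from_data_frame(toy_edges, directed = TRUE)

# Share of directed edges that have a counterpart in the opposite direction
toy_reciprocity <- reciprocity(toy)
toy_reciprocity
```

With igraph's default setting, the result is the number of reciprocated edges divided by the total number of edges, which is 2/3 in this toy case.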

Look for influencers

So far, I have introduced a range of indicators to quantify a whole network. Such indicators are most useful when comparing different networks. When analyzing one single network, we are more interested in node-level indicators.

A common task in network analysis is identifying influencers. An influencer could mean different things to different people. Here we try a couple of different metrics.

Indegree centrality measures the number of incoming connections a user has received. A high indegree in the retweet network means that the user is frequently retweeted by others. Do you agree that the most retweeted users are influencers? And why?

indegree_rt <- sort(degree(rt,mode = "in"),decreasing = TRUE)
indegree_rt[1:10] #show the top 10 users ranked by in-degree

##         ewarren     anandwrites      omanreagan   stclairashley 
##            8638             107              40              40 
##        guardian      chadfelixg          a35362         soyrosa 
##              39              20              19              16 
##         jc_cali myth_capitalism 
##              15              14

Outdegree centrality measures the number of outgoing connections a user has. A high outdegree in the retweet network means that the user frequently retweets other users. What would you call such users? Mobilizers?

outdegree_rt <- sort(degree(rt,mode = "out"),decreasing = TRUE)
outdegree_rt[1:10] #show the top 10 users ranked by out-degree

## edwood05572006         raqb16   damonbethea1    fuelgrannie   gavin_bonnar 
##             13              6              5              5              4 
##   atheist_cvnt     natemezmer  philippejouan  sharonresists   tedgrunewald 
##              3              3              3              3              3
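The difference between indegree and outdegree is easy to verify on a tiny example. The sketch below uses a made-up edgelist (users A to D are hypothetical) in which B, C, and D each retweet A, and A retweets B.

```r
library(igraph)

# Hypothetical toy edgelist: source retweets target
toy_edges <- data.frame(
  source = c("B", "C", "D", "A"),
  target = c("A", "A", "A", "B")
)
toy <- graph_from_data_frame(toy_edges, directed = TRUE)

toy_in <- degree(toy, mode = "in")    # how often each user is retweeted
toy_out <- degree(toy, mode = "out")  # how often each user retweets others

sort(toy_in, decreasing = TRUE)
sort(toy_out, decreasing = TRUE)
```

In this toy network, A has the highest indegree (retweeted by three users), while every user has an outdegree of 1 because each retweets exactly one other user.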

Betweenness centrality measures the number of times a node lies on the shortest path between other nodes. We use this metric to find users who act as ‘bridges’ between nodes in a network and who influence the information flow around a network.

bt <- sort(betweenness(rt, directed=T, weights=NA), decreasing = TRUE)
bt[1:10] #show the top 10 nodes by betweenness centrality

##      omanreagan      chadfelixg         soyrosa    damonbethea1 
##              32              18              16              13 
##         jc_cali    commondreams  resistasista76          yitzee 
##               8               4               3               3 
##      tuxcedocat vivek_gkrishnan 
##               2               2
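A three-user chain is the simplest way to see what a bridge is. In the hypothetical sketch below, the only path from A to C runs through B, so B sits on one shortest path and the other two users sit on none.

```r
library(igraph)

# Hypothetical toy chain: A -> B -> C
toy_edges <- data.frame(
  source = c("A", "B"),
  target = c("B", "C")
)
toy <- graph_from_data_frame(toy_edges, directed = TRUE)

# Count, for each user, the shortest paths between other users that pass through them
toy_bt <- betweenness(toy, directed = TRUE)
toy_bt
```

Removing B from this chain would cut A off from C entirely, which is why high-betweenness users are often described as brokers of information flow.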

Ever wonder how Google ranks search results? It uses the PageRank algorithm developed by Google’s founders Sergey Brin and Larry Page. We can use PageRank to locate influencers as well.

pr <- page_rank(rt, algo = c("prpack"))
pr <- sort(pr$vector,decreasing = TRUE)
pr[1:10] #show the top 10 users ranked by PageRank

##         ewarren     anandwrites myth_capitalism      omanreagan 
##    0.4338045988    0.0047293401    0.0042530946    0.0042105884 
##   scottrickhoff   stclairashley        guardian      chadfelixg 
##    0.0020597744    0.0017596358    0.0016080549    0.0010092895 
##          a35362         soyrosa 
##    0.0009209153    0.0008592202
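A small sketch can help with reading PageRank output. In the made-up network below (users A to D are hypothetical), three users all point at A. The scores returned by page_rank() sum to 1 across all nodes, so each score can be read as a share of the total.

```r
library(igraph)

# Hypothetical toy edgelist: B, C, and D all point at A
toy_edges <- data.frame(
  source = c("B", "C", "D"),
  target = c("A", "A", "A")
)
toy <- graph_from_data_frame(toy_edges, directed = TRUE)

toy_pr <- page_rank(toy)$vector
sort(toy_pr, decreasing = TRUE)  # A should come out on top
sum(toy_pr)                      # scores add up to 1
```

Read this way, ewarren's score of roughly 0.43 above means that one account holds a little over two fifths of the total PageRank in the retweet network.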

Look for clusters/cliques

We use community detection algorithms to cluster users into different groups (we call such groups clusters or cliques). Users in the same cluster are more connected with one another than with users outside of the cluster. By using community detection, we can reveal important divisions and fragmentation that exist due to different opinions, values, and user characteristics.

Some community detection algorithms require intensive computation. It may take a long time to produce an output.

k-core

Creating a k-core is fast and easy. We can use a k-core to identify a small subset of users who are the most interconnected. In a k-core, each node has at least k connections to other nodes in the core. Below we extract a 2-core (named twocore) in which each user has at least 2 edges to other users in the core.

kcore <- coreness(rt, mode="all") 
twocore <- induced_subgraph(rt, kcore>=2)
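To see what coreness() returns, consider a hypothetical toy network: a triangle of users A, B, and C, plus a user D who is attached only to C. The sketch below treats the network as undirected for simplicity, which is an assumption of this example rather than a property of the retweet network.

```r
library(igraph)

# Hypothetical toy graph: triangle A-B-C with a pendant node D attached to C
toy <- graph_from_literal(A - B, B - C, C - A, C - D)

# Coreness: the largest k such that the node belongs to the k-core
toy_core <- coreness(toy)
toy_core

# Keep only nodes in the 2-core; D drops out because it has a single connection
toy_twocore <- induced_subgraph(toy, which(toy_core >= 2))
V(toy_twocore)$name
```

A, B, and C each keep two connections inside the core, so they have a coreness of 2, whereas D has a coreness of 1 and is filtered out.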

edge betweenness (Newman-Girvan)

This is one of the community detection algorithms that are computationally intensive. Be patient when it is crunching numbers for you.

The code below creates an object called ceb, which records which cluster each node belongs to, and then lists the cluster IDs of the first 10 nodes.

ceb <- cluster_edge_betweenness(rt) 

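# paste() joins the pieces into one string; print() cannot take them as separate arguments
print(paste("there are", length(ceb), "clusters based on this community detection algorithm"))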

membership(ceb)[1:10] #list only 10 nodes.
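Because the algorithm is slow on large networks, it can help to try it first on a tiny graph where the answer is obvious. The sketch below builds a hypothetical network of two triangles joined by a single bridge (C to D) and treats it as undirected; it is only an illustration and is not derived from the tweets data.

```r
library(igraph)

# Hypothetical toy graph: two triangles (A-B-C and D-E-F) joined by the bridge C-D
toy <- graph_from_literal(A - B, B - C, C - A,
                          D - E, E - F, F - D,
                          C - D)

# Edge betweenness repeatedly removes the edge that carries the most shortest paths
toy_ceb <- cluster_edge_betweenness(toy)

length(toy_ceb)      # number of clusters found
membership(toy_ceb)  # cluster ID for each node
```

The bridge carries every shortest path between the two triangles, so it is removed first, and the two triangles should come out as separate clusters.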

Visualize a network and make it pretty!

There are many ways to visualize a network. You can make the visualization static or interactive, as this example shows. You can even create a dynamic one showing the evolution of a network over time (see an example).

Below, we will try some of the basics using two libraries, igraph and visNetwork. igraph comes with some built-in functions for visualization. visNetwork takes it a step further by making the visualization prettier and interactive.

Before you visualize a network, here are the decisions you need to make:

do you want to assign colors to nodes based on some node attributes?
do you want to set the node size based on some attributes?
do you want to show all nodes?

In our example below, we color nodes based on the clusters they belong to. We set the node size based on PageRank score (the famous scoring technique used by Google), with central nodes represented by bigger nodes. And we don’t want to show all nodes, as that would create a messy network; instead, we show only the most interconnected subset (using k-core).

In the previous steps, we saw the code for calculating node-level and network-level metrics (e.g., centrality). Here, we will pass the metrics to nodes and store them as node attributes. This will allow the visualization code to pick up the attributes and use them for sizing and coloring.

In the code below, we add PageRank score (used for node size) and the cluster id (used for assigning color). We use V(rt) to access node attributes and E(rt) to access edge attributes.

library(igraph)
library(visNetwork)
library(scales)

pr <-page_rank(rt, algo = c("prpack"))
V(rt)$size <- pr$vector*100  #set node size by PageRank scores.

wc <- cluster_walktrap(rt)

V(rt)$color <- membership(wc) # set color by subgroup id
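If the V() and E() accessors are new to you, the following sketch shows them on a hypothetical three-node graph. The attribute values are arbitrary and chosen only for illustration.

```r
library(igraph)

# Hypothetical toy graph: A - B - C
toy <- graph_from_literal(A - B, B - C)

# Store one value per node; the order follows V(toy), here A, B, C
V(toy)$size <- c(10, 20, 10)
V(toy)$color <- c("red", "blue", "red")

# Store one value per edge; the order follows E(toy)
E(toy)$weight <- c(1, 2)

vertex_attr_names(toy)                        # attributes now attached to nodes
igraph::as_data_frame(toy, what = "vertices") # node table with the new columns
```

Attributes named size and color are the ones a plotting function typically looks for, which is why the code above writes the PageRank scores and cluster IDs into exactly those two names.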

Since we visualize only the 2-core, we create a subset of the network.

kcore <- coreness(rt, mode="all") 
twocore <- induced_subgraph(rt, kcore>=2)

Find a visualization algorithm that fits

And we visualize it. Notice that we set layout = "layout_nicely". This is how we specify which visualization algorithm to use. There is a whole bunch of them: see the listing. If you are curious about the visual effects of different algorithms, try layout = "layout_in_circle", layout = "layout_with_kk", or layout = "layout_with_sugiyama".

visIgraph(twocore,idToLabel = TRUE,layout = "layout_nicely") %>%
  visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)

COMM497DB: Insights from networks

curiositybits.cc