An SNA in R Crash Course

Introduction

Hello everyone! This is meant to be a crash course in social network analysis for people learning about and researching online extremism. Social network analysis (SNA) is a vital skill for understanding online extremism, as extremism is, at its core, a phenomenon built on relationships. As a result, we need to use the specialized tools for collecting, analyzing, and visualizing data that depicts those relationships.

The good news is that data about the social behaviors of online extremists is everywhere, just waiting for dedicated researchers to dig into it. Although SNA itself is a well-established practice, its application in extremism studies is still fairly limited; there is a huge amount of untapped potential in both existing datasets and other open-source data to apply social network approaches.

In this tutorial, I will introduce you to the vital concepts of SNA and walk you through applying them to real-world data. By the end, you should hopefully understand enough to try it on your own and to know what to look up when you run into issues.

Prerequisites

This tutorial is presented using the R coding language. While you can do SNA in Python and other coding languages, or in a graphical user interface like UCINET or Gephi, I find it most practical and useful to code in R, where you have a lot of power at your fingertips with the SNA libraries written for it.

Don’t panic if you have never coded before! This is actually a great way to dip your toes in and start coding, so I encourage you to follow along to the best of your ability.

Before continuing: If you do not have R and RStudio installed, and you have never coded in R before, read through and follow the prerequisite install instructions in R for Data Science, the best R textbook out there (and totally free!). I also encourage you to read the first couple of chapters in that book and follow along there, too, if you can.

Please also install our core SNA package for R, igraph, and our Iron March package, ironmarch. You can install both by running the following code:

install.packages('igraph')
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("ctec-miis/ironmarch")

Loading our environment and exploring data

Every time you start coding in R, you need to load your necessary libraries. For today, we will load three: ironmarch, which has our data; igraph, which has our SNA functionality; and tidyverse, which has a whole bunch of helper functions that make our lives way easier.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ironmarch)
library(igraph)

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:purrr':
## 
##     compose, simplify

## The following object is masked from 'package:tidyr':
## 
##     crossing

## The following object is masked from 'package:tibble':
## 
##     as_data_frame

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

We’re going to build a social network for Iron March direct messages. We can load in our direct messages by running the following code:

messages <- build_messages()

For those who are new to R, I’ll break this line down for you:

* We create an object (or variable) called “messages” to store our data.
* To tell R to store the data in the “messages” object, we run the function from ironmarch called “build_messages()”, which outputs a data frame (basically a spreadsheet).
* We assign that data frame to “messages” using the arrow <-, which is equivalent to a single equals sign in other programming languages.

Run that line of code to create the messages object!

You can see that our newly created object has appeared in the “Environment” pane in RStudio. You can also check that our code worked correctly, and see what the data contains, by running:

glimpse(messages)

## Rows: 22,309
## Columns: 8
## $ msg_id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 12, 13, 14, 15, 17, 18, 19, …
## $ msg_topic_id      <int> 1, 2, 2, 2, 2, 2, 3, 3, 5, 5, 6, 6, 8, 8, 8, 8, 8, 8…
## $ msg_date          <dttm> 2011-09-16 03:49:58, 2011-09-16 11:54:08, 2011-09-1…
## $ msg_post          <chr> "<p>The best first post to make on our forums is the…
## $ msg_post_key      <chr> "3320f7f06c422ef0fb77342724b4fd24", "9204e4883321af2…
## $ msg_author_id     <int> 1, 11, 1, 11, 1, 11, 16, 14, 1, 20, 1, 11, 2, 8, 8, …
## $ msg_ip_address    <chr> "178.140.119.217", "109.78.212.13", "178.140.119.217…
## $ msg_is_first_post <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,…

You should see that this data frame contains a bunch of columns. We are particularly interested in two columns: msg_author_id and msg_topic_id. The first column contains the unique identifier for the message’s author; the second one is the unique ID for the message thread, which is how we will construct our network. This is because Iron March organized its messages data in threads. Each thread ID has multiple author IDs.
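If you want to see the "one thread, many authors" structure for yourself, here is a minimal sketch on a made-up data frame. The `toy_messages` object and its values are hypothetical; only the two column names are borrowed from the real data.

```r
library(dplyr)

# Hypothetical toy data: three threads, five authors (not real Iron March data)
toy_messages <- data.frame(
  msg_author_id = c(1, 11, 1, 11, 16, 14, 20),
  msg_topic_id  = c(1, 2, 2, 2, 3, 3, 3)
)

# Count messages and distinct authors in each thread
authors_per_thread <- toy_messages %>%
  group_by(msg_topic_id) %>%
  summarise(
    n_messages = n(),
    n_authors  = n_distinct(msg_author_id)
  )

print(authors_per_thread)
```

Running the same summary on `messages` should show which threads have more than one author. Those are the threads that will produce edges later on.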

Social Network Data

Social network data is structured differently than other types of data. The core unit of measurement is a node (also known as a vertex or entity), which is almost always an individual person. If nodes have a relationship, they are connected by edges, which can be directed or undirected. If directed, edges can be called “arcs” (although in practice we usually just call them edges).

Edges are stored in a data structure called an edgelist. Imagine a small network of three people named Adam, Meghan, and Luke. Adam and Meghan know each other, and so they have an edge between them. Luke and Adam know each other, so they also have an edge between them. Meghan and Luke do not know each other, so there is no edge between them. This edgelist would look like:

Source    Target
Adam      Luke
Adam      Meghan

This edgelist can be represented in graph form like this:

toy_data <- data.frame(
  Source = c("Adam", "Adam"),
  Target = c("Luke", "Meghan")
) 

toy_g <- graph_from_data_frame(toy_data, directed = FALSE)

plot(toy_g)

It’s obviously an extremely simple graph, but hopefully it makes some sense how an edgelist translates into a network. Our goal with the Iron March messages, then, is to create an edgelist, where the first column and the second column are both authors.
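To make the directed versus undirected distinction concrete, here is a small sketch that rebuilds the same toy edgelist as a directed graph. Treating Adam as the "sender" of both edges is purely an assumption for illustration; the object names are mine.

```r
library(igraph)

# Same toy edgelist as above, rebuilt so this block runs on its own
toy_edges <- data.frame(
  Source = c("Adam", "Adam"),
  Target = c("Luke", "Meghan")
)

# directed = TRUE reads each row as an arc pointing from Source to Target
toy_directed <- graph_from_data_frame(toy_edges, directed = TRUE)

is_directed(toy_directed)

# In a directed graph, outgoing and incoming edges are counted separately
out_deg <- degree(toy_directed, mode = "out")
in_deg  <- degree(toy_directed, mode = "in")
print(out_deg)
print(in_deg)
```

In this version Adam has two outgoing edges and no incoming ones. In the undirected graph that distinction disappears and he simply has two edges.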

Looking at our data, we can tell that isn’t the case right now. Here’s how to select just the columns we care about today (find out more about the “%>%” in R for Data Science):

messages_net <- messages %>% 
  select(msg_author_id, msg_topic_id)

glimpse(messages_net)

## Rows: 22,309
## Columns: 2
## $ msg_author_id <int> 1, 11, 1, 11, 1, 11, 16, 14, 1, 20, 1, 11, 2, 8, 8, 2, 8…
## $ msg_topic_id  <int> 1, 2, 2, 2, 2, 2, 3, 3, 5, 5, 6, 6, 8, 8, 8, 8, 8, 8, 8,…

We have a column of author ids and a column of topic ids. This is what is called a “two-mode network”, where there are two different types of nodes. But for our purposes today, we want to change this into a one-mode network, where the edgelist is just people. This is a little funky, so I’ll walk you through it in the comments in the code:

# make an object called "edgelist" to store our edges. Take messages_net to start.
edgelist <- messages_net %>% 
# "join" messages_net to itself, using msg_topic_id as the key. At this point we have three columns, msg_author_id.x, msg_topic_id, and msg_author_id.y. 
  left_join(messages_net, by = "msg_topic_id") %>% 
# filter out cases where the msg_author_id columns are the same
  filter(msg_author_id.x != msg_author_id.y) %>% 
# drop the msg_topic_id column
  select(msg_author_id.x, msg_author_id.y) %>% 
# only keep one copy of each edge
  distinct()

Joins are sometimes hard to wrap our heads around, so I highly encourage reading the section on joins in, you guessed it, R for Data Science!
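If the self-join still feels abstract, it can help to run the same idea on a handful of rows. The sketch below uses base R's `merge()` as a stand-in for `left_join()` so it runs without any packages; the `toy_threads` data and the column names `author` and `thread` are hypothetical.

```r
# Hypothetical toy data: who posted in which thread
toy_threads <- data.frame(
  author = c("A", "B", "A", "B", "C", "D"),
  thread = c(1, 1, 2, 2, 2, 3),
  stringsAsFactors = FALSE
)

# Join the table to itself on thread: every author is paired with every
# author in the same thread (including themselves)
author_pairs <- merge(toy_threads, toy_threads, by = "thread")

# Drop self-pairs, drop the thread column, and keep one copy of each pair
author_pairs <- author_pairs[author_pairs$author.x != author_pairs$author.y,
                             c("author.x", "author.y")]
author_pairs <- unique(author_pairs)

# Each pair still appears in both directions (A-B and B-A).
# Keeping only rows where x sorts before y leaves one row per undirected edge.
undirected_pairs <- author_pairs[author_pairs$author.x < author_pairs$author.y, ]

print(undirected_pairs)
```

Notice that author D, who posted alone in thread 3, drops out entirely: a user with no co-participants produces no edges, so they never make it into the network.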

After all that, we’ve now got an edgelist in an object called, naturally, “edgelist.” We’re now going to employ our first igraph function to turn this edgelist into a “graph” object, so we can actually run SNA metrics on it!

g <- graph_from_data_frame(edgelist, directed = FALSE) %>% 
  simplify(remove.multiple = TRUE)
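Here is a quick sketch of what `simplify()` is doing for us, using a made-up three-row edgelist. I call it as `igraph::simplify()` to be explicit, since purrr also has a function with that name.

```r
library(igraph)

# Hypothetical edgelist where the A-B edge is listed in both directions
dup_edges <- data.frame(
  from = c("A", "B", "A"),
  to   = c("B", "A", "C")
)

g_raw <- graph_from_data_frame(dup_edges, directed = FALSE)

# Collapse repeated edges (and drop any self-loops) into single edges
g_simple <- igraph::simplify(g_raw, remove.multiple = TRUE, remove.loops = TRUE)

ecount(g_raw)     # 3: A-B is counted twice
ecount(g_simple)  # 2: A-B and A-C
```

Our real edgelist has the same property, because the self-join produces every pair in both orders. Without `simplify()`, every degree score would be doubled.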

Network analysis and visualization

At long last, we’re at the point where we can start doing SNA. Let’s see how our network looks right now. I’ll use a couple of options to make it more legible:

plot(g, 
     vertex.label = NA, # turn off node labels
     vertex.size = .8, #reduce size of nodes
     edge.width = .8 # reduce width of edges
     )

That’s kind of cool, isn’t it? You can see that most of the users who sent any messages on Iron March are connected to that big mass in the middle, and there are only a few pairs of users who are disconnected (those little dots out in space around the outside of the central network). This indicates that Iron March wasn’t particularly fragmented, and there wasn’t much space for distinct echo chambers or isolated communities to arise. This isn’t that surprising on a forum as small and niche as Iron March, but it’s still great to have that confirmed by the visualization!
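Eyeballing the plot is a good start, but you can also count the connected pieces directly. This is a minimal sketch on a made-up graph with one triangle and one disconnected pair; the same `components()` call should work on `g`.

```r
library(igraph)

# Hypothetical network: a triangle (A, B, C) plus a separate pair (E, F)
comp_edges <- data.frame(
  from = c("A", "B", "C", "E"),
  to   = c("B", "C", "A", "F")
)
toy_net <- graph_from_data_frame(comp_edges, directed = FALSE)

comp <- components(toy_net)

comp$no     # number of connected components
comp$csize  # number of nodes in each component

# Share of all nodes that sit in the largest component
largest_share <- max(comp$csize) / vcount(toy_net)
print(largest_share)
```

If the largest component holds nearly all of the nodes, that backs up the "one big mass" impression from the plot.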

A visualization is just one small part of SNA, though. To really understand how a network is structured, we need to actually compile metrics to describe the network as a whole and the roles of individuals within it. There are a bunch of metrics that help with this task, and I’ll walk through a few here.

Local measures: centrality

The group of metrics you will use the most is called “centrality”. Centrality metrics are designed to describe the relative power of specific users in a network; thus, centrality scores are per-user metrics. There are four main types of centrality:

Degree centrality is the measure of how many edges a node has.
Closeness centrality is the measure of how near a node is to all other nodes in the network, calculated as the inverse of its average distance to them.
Betweenness centrality is the measure of how often a node is on the “shortest path” between any two other nodes in the network.
Eigenvector centrality is the most complicated metric, but essentially measures how connected a node is to other powerful nodes.

Hopefully you are already considering how each of these metrics can be used to tell a different story, or reveal different intelligence! Degree centrality, for instance, finds the most prolific users; closeness can measure the speed of information diffusion from a node to the rest of a network; betweenness locates “bridges,” or power brokers, in a network; and eigenvector finds those nodes that are particularly influential.
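To see how these measures can disagree, here is a sketch on a made-up seven-node network: two triangles (A-B-C and E-F-G) joined through a single "bridge" node, D. The network and object names are hypothetical, chosen so the answer is easy to check by hand.

```r
library(igraph)

# Hypothetical network: two triangles connected only through node D
bridge_edges <- data.frame(
  from = c("A", "A", "B", "C", "D", "E", "E", "F"),
  to   = c("B", "C", "C", "D", "E", "F", "G", "G")
)
bridge_g <- graph_from_data_frame(bridge_edges, directed = FALSE)

# One row per node, one column per centrality measure
centrality <- data.frame(
  node        = V(bridge_g)$name,
  degree      = degree(bridge_g),
  closeness   = closeness(bridge_g),
  betweenness = betweenness(bridge_g),
  eigenvector = eigen_centrality(bridge_g)$vector
)

print(centrality)
```

D has only two edges, fewer than C or E, but every shortest path between the two triangles runs through it, so it has the highest betweenness. That is the "bridge" role described above.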

We can find all four of these measures for our network’s nodes without too much issue!

# each data frame has one row per node; the row names are the author IDs
deg_df <- data.frame(degree(g)) %>% 
  rownames_to_column("author_id")
bet_df <- data.frame(betweenness(g)) %>% 
  rownames_to_column("author_id")
close_df <- data.frame(closeness(g)) %>% 
  rownames_to_column("author_id")
# eigen_centrality() is the current name for evcent()
eig_df <- data.frame(eigen_centrality(g)$vector) %>% 
  rownames_to_column("author_id")

Let’s take a look at the top 5 nodes by degree centrality:

top5 <- deg_df %>% 
  arrange(desc(degree.g.)) %>%
  top_n(n = 5)

## Selecting by degree.g.

# to make the table appear nice, ignore
kableExtra::kable(top5, format="latex")


Because I’ve worked with this data a lot, I know that author ID 0 is used for missing and deleted users, so that’s not particularly helpful. However, I also know that author ID 1 is Alexander Slavros, the founder and administrator of Iron March. The data indicates that he was incredibly active in direct messages on his forum, and in fact was the single most active user on the entire website!

Author ID 7600, meanwhile, has the second-highest degree centrality at 103, meaning he messaged 103 different users. That ID belongs to Iron March user Odin, or Brandon Russell. Russell is the founder of Atomwaffen Division and is currently under indictment for plotting power substation attacks.

There are a lot of similarities across the centrality measures in this network, but if you look at the top 5 nodes by betweenness centrality, you can see that there is a notable difference. Whereas user 9668 has the 5th-largest degree centrality, 3491 has the 5th-largest betweenness centrality.

top5_bet <- bet_df %>% 
  arrange(desc(betweenness.g.)) %>%
  top_n(n = 5)

## Selecting by betweenness.g.

#to make the table appear nice, ignore
kableExtra::kable(top5_bet, format="latex")
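If you would rather compare the two rankings in a single table than flip between two top-5 lists, one option is to put both scores side by side and rank them. This is a sketch on a made-up "two triangles and a bridge" network; `scores` and its column names are hypothetical.

```r
library(igraph)

# Hypothetical network: triangles A-B-C and E-F-G joined through node D
rank_edges <- data.frame(
  from = c("A", "A", "B", "C", "D", "E", "E", "F"),
  to   = c("B", "C", "C", "D", "E", "F", "G", "G")
)
rank_g <- graph_from_data_frame(rank_edges, directed = FALSE)

scores <- data.frame(
  author_id   = V(rank_g)$name,
  degree      = degree(rank_g),
  betweenness = betweenness(rank_g)
)

# Rank 1 = highest score; tied nodes share the best rank
scores$degree_rank      <- rank(-scores$degree, ties.method = "min")
scores$betweenness_rank <- rank(-scores$betweenness, ties.method = "min")

# Sort by betweenness rank to see who moves between the two lists
scores <- scores[order(scores$betweenness_rank), ]
print(scores)
```

Nodes whose two ranks differ a lot are the interesting ones: they are either well-connected but peripheral, or sparsely connected but sitting on a bridge.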

One way you can see the differences between the centrality measures is by visualizing them on a network map. We can size the nodes by the centrality measures to show how this works.

I’ll introduce one more topic here so that we can visualize things more easily. We will select the ego network of a particular individual so that we can look at a smaller segment of the overall network. An ego net is the network composed of just one specific node, the nodes directly connected to it, and the edges among them (so a person and their friends).
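Here is a small sketch of pulling an ego network out of a larger graph, again on a made-up seven-node network. `make_ego_graph()` returns a list with one graph per requested node, which is why we take the first element.

```r
library(igraph)

# Hypothetical network: triangles A-B-C and E-F-G joined through node D
ego_edges <- data.frame(
  from = c("A", "A", "B", "C", "D", "E", "E", "F"),
  to   = c("B", "C", "C", "D", "E", "F", "G", "G")
)
full_g <- graph_from_data_frame(ego_edges, directed = FALSE)

# order = 1 keeps the ego plus its direct neighbors
ego_d <- make_ego_graph(full_g, order = 1, nodes = "D")[[1]]
ego_c <- make_ego_graph(full_g, order = 1, nodes = "C")[[1]]

V(ego_d)$name  # D and its two neighbors
V(ego_c)$name  # C and its three neighbors

vcount(ego_d)
ecount(ego_d)
```

Raising `order` to 2 would also pull in friends of friends, which can be useful when a one-step ego net is too sparse to say much.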

In this case, we’ll size the whole network by the centrality metrics, but only visualize the ego network around Alexander Slavros.

Degree Centrality

V(g)$size <- log(degree(g)) # take the log of degree to keep node sizes manageable
ego_net <- make_ego_graph(g, nodes = V(g)[V(g)$name == 1])
ego_net <- ego_net[[1]]
plot(ego_net, vertex.label = NA)

Betweenness Centrality

V(g)$size <- log1p(betweenness(g)) # log1p() is log(1 + x), so nodes with betweenness 0 get size 0 rather than -Inf
ego_net <- make_ego_graph(g, nodes = V(g)[V(g)$name == 1])
ego_net <- ego_net[[1]]
plot(ego_net, vertex.label = NA)

Closeness Centrality

V(g)$size <- closeness(g) * 10000
ego_net <- make_ego_graph(g, nodes = V(g)[V(g)$name == 1])
ego_net <- ego_net[[1]]
plot(ego_net, vertex.label = NA)

Eigenvector Centrality

V(g)$size <- eigen_centrality(g)$vector * 10 # eigen_centrality() is the current name for evcent()
ego_net <- make_ego_graph(g, nodes = V(g)[V(g)$name == 1])
ego_net <- ego_net[[1]]
plot(ego_net, vertex.label = NA)
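Each plot above uses its own ad hoc scaling (a log here, a multiplier there) to get sensible node sizes. As an alternative, here is a sketch of a small helper that maps any set of scores onto a fixed size range. `rescale_size()` is a hypothetical function of my own, not part of igraph, and the default sizes are just a starting point.

```r
# Hypothetical helper: linearly rescale scores to a plotting size range
rescale_size <- function(x, min_size = 2, max_size = 15) {
  x <- as.numeric(x)
  rng <- range(x, na.rm = TRUE)
  # If every score is identical, give every node the midpoint size
  if (rng[1] == rng[2]) {
    return(rep((min_size + max_size) / 2, length(x)))
  }
  min_size + (x - rng[1]) / (rng[2] - rng[1]) * (max_size - min_size)
}

rescaled <- rescale_size(c(0, 5, 10))
print(rescaled)  # 2.0 8.5 15.0

# Usage with a graph would look like:
# V(g)$size <- rescale_size(betweenness(g))
```

A linear rescale keeps the differences between nodes proportional. For very skewed measures like betweenness, you may still want to take a log first so that one huge node doesn't flatten everyone else.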

Conclusion

This is meant as a quick intro to SNA and a few core metrics, but I will continue updating this tutorial with more tricks, methods, and metrics in the future. Please reach out if you run into trouble!