Introduction

Social Network Analysis is a set of methods used to visualize networks, describe specific characteristics of overall network structure, and build mathematical and statistical models of network structures and dynamics (Luke 2015). A social network is formally defined as a set of nodes that are tied by one or more types of relations (Scott and Carrington 2012). Nodes are most commonly persons or organizations, but in principle any units that can be connected to other units can be studied as nodes. For example, social network analysis has been used to study web pages, journal articles, countries, and neighborhoods. Paul Beckman and Jennifer Chi (2011) even used social network analysis to study the impact of baseball teammates on offensive metrics. Social Network Analysis can be used in many areas to evaluate how an entity is affected by the entities it is connected to, instead of evaluating the characteristics of an individual entity in isolation. This tutorial is an overview of basic Social Network Analysis functions in R. It draws from a tutorial by Katherine Ognyanova (2016) that takes a more in-depth look at some of the features of igraph.

As a way to introduce Social Network Analysis to the beginner, this tutorial takes advantage of the popular game “Six Degrees of Kevin Bacon”. Six Degrees of Kevin Bacon is a parlour game based on the “six degrees of separation” concept, which posits that any two people on Earth are six or fewer acquaintance links apart. Movie buffs challenge each other to find the shortest path between an arbitrary actor and prolific character actor Kevin Bacon. The game rests on the assumption that anyone involved in the Hollywood film industry can be linked through their film roles to Bacon within six steps. A group of players tries to connect any such individual to Kevin Bacon as quickly as possible and in as few links as possible (“Six Degrees of Kevin Bacon” 2017). For example, Tom Hanks has a “Bacon Score” of 1 because he appeared with Kevin Bacon in the movie Apollo 13. This tutorial uses three movies (Apollo 13, Forrest Gump, and The Rock) to build a social network of actors. Note that the data files spell the second title “Forest Gump”, so the code in this tutorial uses that spelling when matching on the movie name.

Requirements

The igraph package provides tools for network analysis. The main goals of the igraph library are to provide a set of data types and functions for 1) pain-free implementation of graph algorithms, 2) fast handling of large graphs with millions of vertices and edges, and 3) rapid prototyping via high-level languages like R. More information is available at http://igraph.org.

The readr package provides a fast and friendly way to read rectangular data (like ‘csv’, ‘tsv’, and ‘fwf’). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.

library(igraph)
library(readr)

The .csv files required for this tutorial, Actors.csv and Movies.csv, are hosted on GitHub. The code below reads them directly from their GitHub URLs into data frames. Alternatively, the files can be downloaded into the working directory and read by passing the local file names to read_csv.

actors <- read_csv("https://raw.githubusercontent.com/OPER682-Tucker/Social-Network-Analysis/master/Actors.csv")
movies <- read_csv("https://raw.githubusercontent.com/OPER682-Tucker/Social-Network-Analysis/master/Movies.csv")

Building the Network

The actors data frame is a list of all of the referenced actors from the three previously mentioned movies. The eight actors in this network are Tom Hanks, Gary Sinise, Robin Wright, Bill Paxton, Kevin Bacon, Ed Harris, Sean Connery and Nicolas Cage. Note that these are only the biggest stars from these movies in order to keep the network small.

## # A tibble: 8 x 3
##          Actor Gender BestActorActress
##          <chr>  <chr>            <chr>
## 1    Tom Hanks   Male           Winner
## 2  Gary Sinise   Male             None
## 3 Robin Wright Female             None
## 4  Bill Paxton   Male             None
## 5  Kevin Bacon   Male             None
## 6    Ed Harris   Male        Nominated
## 7 Sean Connery   Male             None
## 8 Nicolas Cage   Male           Winner

The columns hold each actor’s name, gender, and whether they have won or been nominated for an Academy Award for Best Actor or Best Actress.

The movies data frame contains connections between actors based on what movies they were in together.

## # A tibble: 16 x 3
##       `Actor 1`    `Actor 2`       Movie
##           <chr>        <chr>       <chr>
##  1    Tom Hanks  Gary Sinise Forest Gump
##  2    Tom Hanks Robin Wright Forest Gump
##  3  Gary Sinise Robin Wright Forest Gump
##  4    Tom Hanks  Gary Sinise   Apollo 13
##  5    Tom Hanks  Bill Paxton   Apollo 13
##  6    Tom Hanks  Kevin Bacon   Apollo 13
##  7    Tom Hanks    Ed Harris   Apollo 13
##  8  Gary Sinise  Bill Paxton   Apollo 13
##  9  Gary Sinise  Kevin Bacon   Apollo 13
## 10  Gary Sinise    Ed Harris   Apollo 13
## 11  Bill Paxton  Kevin Bacon   Apollo 13
## 12  Bill Paxton    Ed Harris   Apollo 13
## 13  Kevin Bacon    Ed Harris   Apollo 13
## 14    Ed Harris Sean Connery    The Rock
## 15    Ed Harris Nicolas Cage    The Rock
## 16 Sean Connery Nicolas Cage    The Rock

Each row names two actors and a movie in which they appeared together. For example, Tom Hanks and Gary Sinise appeared together in Forrest Gump (as Forrest Gump and Lieutenant Dan) and in Apollo 13 (as Jim Lovell and Ken Mattingly), so that pair has two rows.

The first step in building the network is to create an igraph object. We will use the igraph function graph_from_data_frame to create this object from our existing data frames. The d argument takes the edges connecting the actor nodes, which are held in the movies data frame, and the vertices argument takes the actor nodes, which are listed in the actors data frame. In some social networks the relationship is directional; for example, a professor has a directed relationship with the students he or she teaches. Because this is a list of actors who appeared in movies together, the network is undirected, so the directed argument is set to FALSE.

actorNetwork <- graph_from_data_frame(d = movies, vertices = actors, directed = FALSE)

The vertices argument requires a data frame whose first column holds the node identifiers, which in this case are the actors’ names. The d argument requires a data frame whose first two columns hold the pairs of vertex identifiers that are connected. Any additional columns in the movies and actors data frames are stored as attributes of the edges and nodes respectively. For example, the third column in the movies data frame, which names the movie that the two actors share, can be used as a categorical variable to describe the connection.
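As a check on how the two data frames map onto the graph, the following is a minimal sketch that rebuilds the same network from inline data and inspects it. The inline data frames are an assumption: they are typed in from the tables printed above (with the column names Actor1 and Actor2 chosen here for convenience) so that the block runs without the CSV files.

```r
library(igraph)

# Inline stand-ins for Actors.csv and Movies.csv (assumption: copied from the
# tables printed above so the sketch runs without downloading anything).
actors <- data.frame(
  Actor = c("Tom Hanks", "Gary Sinise", "Robin Wright", "Bill Paxton",
            "Kevin Bacon", "Ed Harris", "Sean Connery", "Nicolas Cage"),
  Gender = c("Male", "Male", "Female", "Male", "Male", "Male", "Male", "Male"),
  BestActorActress = c("Winner", "None", "None", "None",
                       "None", "Nominated", "None", "Winner"),
  stringsAsFactors = FALSE
)
casts <- list(
  "Forest Gump" = c("Tom Hanks", "Gary Sinise", "Robin Wright"),
  "Apollo 13"   = c("Tom Hanks", "Gary Sinise", "Bill Paxton",
                    "Kevin Bacon", "Ed Harris"),
  "The Rock"    = c("Ed Harris", "Sean Connery", "Nicolas Cage")
)
# One row per pair of co-stars within each movie
movies <- do.call(rbind, lapply(names(casts), function(m) {
  pairs <- t(combn(casts[[m]], 2))
  data.frame(Actor1 = pairs[, 1], Actor2 = pairs[, 2], Movie = m,
             stringsAsFactors = FALSE)
}))

actorNetwork <- graph_from_data_frame(d = movies, vertices = actors,
                                      directed = FALSE)

vcount(actorNetwork)             # number of nodes
ecount(actorNetwork)             # number of edges
vertex_attr_names(actorNetwork)  # "name" plus the extra actor columns
edge_attr_names(actorNetwork)    # extra columns of the edge table
table(E(actorNetwork)$Movie)     # edges contributed by each movie
```

The first column of the vertices data frame becomes the vertex attribute name, and every remaining column in either data frame is carried along as an attribute that can later drive colors or labels.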

Plotting the Network

The simplest way to plot the network with default settings is to use the plot function, for example:

plot(actorNetwork)

This allows you to visualize the network without any additional information. To make a diagram that depicts additional information, other plotting parameters and graph attributes can be used. For example, to show which movies connect each pair of actors, the links between actors can be color coded by movie.

E(actorNetwork)$color <- ifelse(E(actorNetwork)$Movie == "Forest Gump", "green", 
                         ifelse(E(actorNetwork)$Movie == "Apollo 13", "black",
                                "orange"))

# Re-Plot the network
plot(actorNetwork)

Now actors who appeared together in Forrest Gump are connected with green edges, those from Apollo 13 with black edges, and those from The Rock with orange edges. You can also color code nodes based on characteristics of the actors.

V(actorNetwork)$color <- ifelse(V(actorNetwork)$BestActorActress == "Winner", "gold",
                         ifelse(V(actorNetwork)$BestActorActress == "Nominated","grey",
                                "lightblue"))

#Re-Plot the Network
plot(actorNetwork)

The following is a list of igraph plotting parameters taken from Katherine Ognyanova’s tutorial (2016):

NODES

  • vertex.color Node color
  • vertex.frame.color Node border color
  • vertex.shape One of “none”, “circle”, “square”, “csquare”, “rectangle”, “crectangle”, “vrectangle”, “pie”, “raster”, or “sphere”
  • vertex.size Size of the node (default is 15)
  • vertex.size2 The second size of the node (e.g. for a rectangle)
  • vertex.label Character vector used to label the nodes
  • vertex.label.family Font family of the label (e.g. “Times”, “Helvetica”)
  • vertex.label.font Font: 1 plain, 2 bold, 3 italic, 4 bold italic, 5 symbol
  • vertex.label.cex Font size (multiplication factor, device-dependent)
  • vertex.label.dist Distance between the label and the vertex
  • vertex.label.degree The position of the label in relation to the vertex, where 0 is right, “pi” is left, “pi/2” is below, and “-pi/2” is above

EDGES

  • edge.color Edge color
  • edge.width Edge width, defaults to 1
  • edge.arrow.size Arrow size, defaults to 1
  • edge.arrow.width Arrow width, defaults to 1
  • edge.lty Line type, could be 0 or “blank”, 1 or “solid”, 2 or “dashed”, 3 or “dotted”, 4 or “dotdash”, 5 or “longdash”, 6 or “twodash”
  • edge.label Character vector used to label edges
  • edge.label.family Font family of the label (e.g. “Times”, “Helvetica”)
  • edge.label.font Font: 1 plain, 2 bold, 3 italic, 4 bold italic, 5 symbol
  • edge.label.cex Font size for edge labels
  • edge.curved Edge curvature, range 0-1 (FALSE sets it to 0, TRUE to 0.5)
  • edge.arrow.mode Vector specifying whether edges should have arrows, possible values: 0 no arrow, 1 back, 2 forward, 3 both

OTHER

  • margin Empty space margins around the plot, vector with length 4
  • frame if TRUE, the plot will be framed
  • main If set, adds a title to the plot
  • sub If set, adds a subtitle to the plot
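To illustrate how several of these parameters combine in one call, here is a minimal sketch. The network is rebuilt inline from the tables shown earlier (an assumption, so that the block runs on its own), and the named lookup vectors award_palette and movie_palette are hypothetical helpers used here in place of nested ifelse calls. The title text and the specific sizes are arbitrary choices.

```r
library(igraph)

# Inline copy of the tutorial network (assumption: typed in from the tables above)
casts <- list(
  "Forest Gump" = c("Tom Hanks", "Gary Sinise", "Robin Wright"),
  "Apollo 13"   = c("Tom Hanks", "Gary Sinise", "Bill Paxton",
                    "Kevin Bacon", "Ed Harris"),
  "The Rock"    = c("Ed Harris", "Sean Connery", "Nicolas Cage")
)
edges <- do.call(rbind, lapply(names(casts), function(m) {
  pairs <- t(combn(casts[[m]], 2))
  data.frame(from = pairs[, 1], to = pairs[, 2], Movie = m,
             stringsAsFactors = FALSE)
}))
awards <- data.frame(
  Actor = c("Tom Hanks", "Gary Sinise", "Robin Wright", "Bill Paxton",
            "Kevin Bacon", "Ed Harris", "Sean Connery", "Nicolas Cage"),
  BestActorActress = c("Winner", "None", "None", "None",
                       "None", "Nominated", "None", "Winner"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(edges, vertices = awards, directed = FALSE)

# Hypothetical lookup tables: attribute value -> color
award_palette <- c(Winner = "gold", Nominated = "grey", None = "lightblue")
movie_palette <- c("Forest Gump" = "green", "Apollo 13" = "black",
                   "The Rock" = "orange")
node_colors <- unname(award_palette[V(g)$BestActorActress])
edge_colors <- unname(movie_palette[E(g)$Movie])

# Parameters passed to plot() apply to this plot only; the graph is unchanged
plot(g,
     vertex.color = node_colors,
     vertex.frame.color = "white",
     vertex.size = 20,
     vertex.label.cex = 0.8,
     vertex.label.dist = 2,
     edge.color = edge_colors,
     edge.width = 2,
     edge.curved = 0.2,   # curvature helps separate overlapping edges
     main = "Co-star network")
```

Passing parameters to plot leaves the graph object untouched, whereas assigning to V(g)$color or E(g)$color stores the colors on the graph so that every later plot uses them.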

Legends

Finally, it can be useful to create a legend to show what the color coding for the graph means. For this the legend function works well.

plot(actorNetwork, vertex.frame.color="white")

legend("bottomright", c("Winner","Nominee", "Not Nominated"), pch=21,
  col="#777777", pt.bg=c("gold","grey","lightblue"), pt.cex=2, cex=.8)


legend("topleft", c("Forest Gump","Apollo 13", "The Rock"), 
       col=c("green","black","orange"), lty=1, cex=.8)

Describing the Network

Although visualizing the network can be useful for examining the data at a high level, one of the most important features of social network analysis is the ability to mathematically describe a node’s characteristics on the network. The positions of nodes on the network are often described in terms of centrality. Centrally positioned individuals enjoy a position of privilege over those relegated to the circumference of the network (Degenne and Forse 1999). The three main types of centrality are degree centrality, closeness centrality, and betweenness centrality.

Degree Centrality

Degree centrality is the simplest of the methods: it measures the number of connections between a node and all other nodes. Looking at the plot above, Nicolas Cage is connected to Sean Connery and Ed Harris, so he should have a degree centrality score of 2. The igraph package has a function, degree, to measure degree centrality.

degree(actorNetwork, mode="all")
##    Tom Hanks  Gary Sinise Robin Wright  Bill Paxton  Kevin Bacon 
##            6            6            2            4            4 
##    Ed Harris Sean Connery Nicolas Cage 
##            6            2            2
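One detail in these scores is worth a quick check: Tom Hanks has a degree of 6 although he has only five distinct co-stars, because he and Gary Sinise are linked twice (once per shared movie) and degree counts every edge. The sketch below, which rebuilds the network inline from the edge table shown earlier (an assumption, so that it runs on its own), compares the degree on the original multigraph with the degree after igraph's simplify function collapses duplicate edges.

```r
library(igraph)

# Inline copy of the tutorial network (assumption: typed in from the tables above)
casts <- list(
  "Forest Gump" = c("Tom Hanks", "Gary Sinise", "Robin Wright"),
  "Apollo 13"   = c("Tom Hanks", "Gary Sinise", "Bill Paxton",
                    "Kevin Bacon", "Ed Harris"),
  "The Rock"    = c("Ed Harris", "Sean Connery", "Nicolas Cage")
)
edges <- do.call(rbind, lapply(names(casts), function(m) {
  pairs <- t(combn(casts[[m]], 2))
  data.frame(from = pairs[, 1], to = pairs[, 2], Movie = m,
             stringsAsFactors = FALSE)
}))
g <- graph_from_data_frame(edges, directed = FALSE)

# simplify() removes loops and merges multiple edges between the same pair
g_simple <- simplify(g)

deg_edges     <- degree(g, mode = "all")         # counts every edge
deg_neighbors <- degree(g_simple, mode = "all")  # counts distinct neighbors

deg_edges[c("Tom Hanks", "Gary Sinise")]
deg_neighbors[c("Tom Hanks", "Gary Sinise")]
```

Which version is appropriate depends on the question: the multigraph degree reflects the number of shared appearances, while the simplified degree reflects the number of distinct co-stars.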

Closeness Centrality

Closeness centrality is an evaluation of the proximity of a node to all other nodes in a network, not only the nodes to which it is directly connected. The closeness centrality of a node is defined as the inverse of the average length of the shortest paths to or from all the other nodes in the graph. The inverse of the absolute closeness of node i is the sum of its shortest-path distances to the other nodes j: \(C_{APi}^{-1} = \sum^{n}_{j=1}d_{ij}\), where \(d_{ij}\) is the number of links on the shortest path between i and j. The relative closeness then accounts for the number of nodes in the network: \(C_{NPi}=(n-1)/C^{-1}_{APi}\). For example, to calculate the raw score for Robin Wright, note that she is connected to Tom Hanks and Gary Sinise by one link each, to Bill Paxton, Kevin Bacon, and Ed Harris by two links each, and to Sean Connery and Nicolas Cage by three links each, for a total of 14 links. To calculate relative closeness, the number of nodes minus one (7) is divided by the raw score (14) for a relative closeness of 0.5. This corresponds to an average of two links to every other node.

closeness(actorNetwork, mode="all", weights=NA, normalized=TRUE)
##    Tom Hanks  Gary Sinise Robin Wright  Bill Paxton  Kevin Bacon 
##    0.7777778    0.7777778    0.5000000    0.7000000    0.7000000 
##    Ed Harris Sean Connery Nicolas Cage 
##    0.8750000    0.5384615    0.5384615
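The hand calculation for Robin Wright can be reproduced for every actor directly from the matrix of shortest-path distances. The following is a minimal sketch of that calculation; the network is rebuilt inline from the edge table shown earlier (an assumption, so that the block is self-contained), and the variable names raw and relative are chosen here for illustration.

```r
library(igraph)

# Inline copy of the tutorial network (assumption: typed in from the tables above)
casts <- list(
  "Forest Gump" = c("Tom Hanks", "Gary Sinise", "Robin Wright"),
  "Apollo 13"   = c("Tom Hanks", "Gary Sinise", "Bill Paxton",
                    "Kevin Bacon", "Ed Harris"),
  "The Rock"    = c("Ed Harris", "Sean Connery", "Nicolas Cage")
)
edges <- do.call(rbind, lapply(names(casts), function(m) {
  pairs <- t(combn(casts[[m]], 2))
  data.frame(from = pairs[, 1], to = pairs[, 2], Movie = m,
             stringsAsFactors = FALSE)
}))
g <- graph_from_data_frame(edges, directed = FALSE)

d <- distances(g, weights = NA)  # matrix of shortest-path lengths d_ij
n <- vcount(g)

raw      <- rowSums(d)     # sum of distances from each node (d_ii is 0)
relative <- (n - 1) / raw  # relative closeness from the formula above

raw["Robin Wright"]
relative["Robin Wright"]

# Compare with igraph's built-in normalized closeness
all.equal(unname(relative),
          unname(closeness(g, mode = "all", weights = NA, normalized = TRUE)))
```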

Betweenness Centrality

Betweenness centrality offers another way of measuring an individual’s centrality. In social networks there can be weakly connected individuals who are still indispensable to certain transactions. Although these individuals may not have a high level of degree centrality, they may be chokepoints through which information moves. The betweenness of a given point to two other points is its capacity of standing on the paths that connect them (Degenne and Forse 1999). To calculate the absolute betweenness centrality for a node, its betweenness for all pairs on the graph must be summed:

\[C_{ABi} = \sum^{n}_{j}\sum^{n}_{k}b_{jk}(i), \quad j \neq k \neq i \ \text{and} \ j < k\]

Relative betweenness centrality is then calculated by: \(C_{NBi} = (2C_{ABi}) / (n^{2} - 3n + 2)\). To calculate relative betweenness in R the function betweenness is used:

betweenness(actorNetwork, directed=FALSE, weights=NA, normalized=TRUE)
##    Tom Hanks  Gary Sinise Robin Wright  Bill Paxton  Kevin Bacon 
##    0.1190476    0.1190476    0.0000000    0.0000000    0.0000000 
##    Ed Harris Sean Connery Nicolas Cage 
##    0.4761905    0.0000000    0.0000000
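The relationship between absolute and relative betweenness can be checked numerically. The sketch below computes the unnormalized scores and applies the relative betweenness formula by hand; as before, the network is rebuilt inline from the edge table shown earlier (an assumption, so that the block runs on its own), and the variable names are illustrative.

```r
library(igraph)

# Inline copy of the tutorial network (assumption: typed in from the tables above)
casts <- list(
  "Forest Gump" = c("Tom Hanks", "Gary Sinise", "Robin Wright"),
  "Apollo 13"   = c("Tom Hanks", "Gary Sinise", "Bill Paxton",
                    "Kevin Bacon", "Ed Harris"),
  "The Rock"    = c("Ed Harris", "Sean Connery", "Nicolas Cage")
)
edges <- do.call(rbind, lapply(names(casts), function(m) {
  pairs <- t(combn(casts[[m]], 2))
  data.frame(from = pairs[, 1], to = pairs[, 2], Movie = m,
             stringsAsFactors = FALSE)
}))
g <- graph_from_data_frame(edges, directed = FALSE)

n <- vcount(g)

# Absolute betweenness: for each node, the summed share of shortest paths
# between other pairs of nodes that pass through it
absolute <- betweenness(g, directed = FALSE, weights = NA, normalized = FALSE)

# Relative betweenness: divide by the number of pairs of other nodes
relative <- 2 * absolute / (n^2 - 3 * n + 2)

absolute["Ed Harris"]
relative["Ed Harris"]
```

Ed Harris lies on every shortest path between the two actors who appear only in The Rock and the five other actors, which accounts for 2 × 5 = 10 pairs out of a possible 21.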

Bacon Score

As this tutorial has been using the example of Six Degrees of Kevin Bacon to discuss social networks, it is only fitting to calculate each actor’s Bacon Score in this social network. The function distances can calculate the shortest paths between nodes on a network. The following code calculates all of the Bacon Scores:

distances(actorNetwork, v=V(actorNetwork)["Kevin Bacon"], to=V(actorNetwork), weights=NA)
##             Tom Hanks Gary Sinise Robin Wright Bill Paxton Kevin Bacon
## Kevin Bacon         1           1            2           1           0
##             Ed Harris Sean Connery Nicolas Cage
## Kevin Bacon         1            2            2
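A Bacon Score only gives the length of the chain; in the game, players also need the chain itself. As a minimal sketch, igraph's shortest_paths function returns the vertices along one shortest path. The network is again rebuilt inline from the edge table shown earlier (an assumption, so that the block is self-contained), and Nicolas Cage is an arbitrary choice of starting actor.

```r
library(igraph)

# Inline copy of the tutorial network (assumption: typed in from the tables above)
casts <- list(
  "Forest Gump" = c("Tom Hanks", "Gary Sinise", "Robin Wright"),
  "Apollo 13"   = c("Tom Hanks", "Gary Sinise", "Bill Paxton",
                    "Kevin Bacon", "Ed Harris"),
  "The Rock"    = c("Ed Harris", "Sean Connery", "Nicolas Cage")
)
edges <- do.call(rbind, lapply(names(casts), function(m) {
  pairs <- t(combn(casts[[m]], 2))
  data.frame(from = pairs[, 1], to = pairs[, 2], Movie = m,
             stringsAsFactors = FALSE)
}))
g <- graph_from_data_frame(edges, directed = FALSE)

# One shortest path from Nicolas Cage to Kevin Bacon
path <- shortest_paths(g, from = "Nicolas Cage", to = "Kevin Bacon",
                       weights = NA)
chain <- as_ids(path$vpath[[1]])  # vertex names along the path

chain
bacon_score <- length(chain) - 1  # number of links in the chain
bacon_score
```

When several shortest paths of equal length exist, shortest_paths returns only one of them; all_shortest_paths can be used to list every such chain.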

Military Utility

Social Network Analysis is currently being used by the Department of Defense for a variety of applications. In the introduction to A User’s Guide to Network Analysis in R, Luke (2015) describes the use of SNA to understand the relationships among the September 11th hijackers. The hijackers on American Airlines Flight 77, which crashed into the Pentagon, had only one member of their cell who communicated with members of other cells. Understanding the relationships within and between terror cells can make those cells easier to disrupt or dismantle. A similar application was used by the Army in Afghanistan. Intelligence analysts collected data on relationships within communities in order to map those communities. Many of the insurgents were members of marginalized parts of society who were not central to the workings of their communities. By examining the closeness centrality of community members, it was possible to identify people on the outskirts of the society who had a greater likelihood of being associated with the insurgency.

Exercises

The .csv files required for these exercises, ActorsExercise.csv and MoviesExercise.csv, are hosted on GitHub. The solution code reads them directly from their GitHub URLs into data frames. Alternatively, the files can be downloaded into the working directory and read by passing the local file names to read_csv.

For these exercises:

  1. Create an igraph plot to show the network with color-coded links for movies and color-coded nodes for gender.
  2. Who has the lowest degree centrality for the network?
  3. Who has the highest closeness centrality for the network?
  4. Who has the highest betweenness centrality for the network?

Exercise Solutions

Question 1

actors <- read_csv("https://raw.githubusercontent.com/OPER682-Tucker/Social-Network-Analysis/master/ActorsExercise.csv")
movies <- read_csv("https://raw.githubusercontent.com/OPER682-Tucker/Social-Network-Analysis/master/MoviesExercise.csv")
actorNetwork <- graph_from_data_frame(d = movies, vertices = actors, directed = FALSE)
E(actorNetwork)$color <- ifelse(E(actorNetwork)$Movie == "Forest Gump", "green", 
                         ifelse(E(actorNetwork)$Movie == "Apollo 13", "black",
                         ifelse(E(actorNetwork)$Movie == "The Rock", "orange", "red")))
V(actorNetwork)$color <- ifelse(V(actorNetwork)$Gender == "Male", "lightblue", "pink")
plot(actorNetwork)
legend("topleft", c("Male","Female"), pch=21,
  col="#777777", pt.bg=c("lightblue","pink"), pt.cex=2, cex=.8)
legend("bottomright", c("Forest Gump","Apollo 13", "The Rock", "Titanic"), 
       col=c("green","black","orange","red"), lty=1, cex=.8)

Question 2

degree(actorNetwork, mode="all")
##         Tom Hanks       Gary Sinise      Robin Wright       Bill Paxton 
##                 6                 6                 2                 7 
##       Kevin Bacon         Ed Harris      Sean Connery      Nicolas Cage 
##                 4                 6                 2                 2 
## Leonardo DiCaprio      Kate Winslet        Billy Zane 
##                 3                 3                 3

Robin Wright, Sean Connery, and Nicolas Cage have the lowest degree centrality.

Question 3

closeness(actorNetwork, mode="all", weights = NA, normalized = TRUE)
##         Tom Hanks       Gary Sinise      Robin Wright       Bill Paxton 
##         0.6666667         0.6666667         0.4347826         0.7692308 
##       Kevin Bacon         Ed Harris      Sean Connery      Nicolas Cage 
##         0.6250000         0.7142857         0.4545455         0.4545455 
## Leonardo DiCaprio      Kate Winslet        Billy Zane 
##         0.5000000         0.5000000         0.5000000

Bill Paxton has the highest level of closeness centrality in this network.

Question 4

betweenness(actorNetwork, directed = FALSE, weights = NA, normalized = TRUE)
##         Tom Hanks       Gary Sinise      Robin Wright       Bill Paxton 
##        0.08888889        0.08888889        0.00000000        0.46666667 
##       Kevin Bacon         Ed Harris      Sean Connery      Nicolas Cage 
##        0.00000000        0.35555556        0.00000000        0.00000000 
## Leonardo DiCaprio      Kate Winslet        Billy Zane 
##        0.00000000        0.00000000        0.00000000

Bill Paxton has the highest level of betweenness centrality in this network.
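Rather than reading the answers off the printed output, the minimum and maximum scores can be extracted programmatically. The following is a minimal sketch of that approach. The exercise network is rebuilt inline as an assumption: it is the tutorial network plus a Titanic cast of Bill Paxton, Leonardo DiCaprio, Kate Winslet, and Billy Zane, which is inferred from the degree output above rather than read from the exercise files.

```r
library(igraph)

# Inline stand-in for the exercise data (assumption: cast lists inferred from
# the tables and outputs above, not read from the exercise CSV files)
casts <- list(
  "Forest Gump" = c("Tom Hanks", "Gary Sinise", "Robin Wright"),
  "Apollo 13"   = c("Tom Hanks", "Gary Sinise", "Bill Paxton",
                    "Kevin Bacon", "Ed Harris"),
  "The Rock"    = c("Ed Harris", "Sean Connery", "Nicolas Cage"),
  "Titanic"     = c("Bill Paxton", "Leonardo DiCaprio", "Kate Winslet",
                    "Billy Zane")
)
edges <- do.call(rbind, lapply(names(casts), function(m) {
  pairs <- t(combn(casts[[m]], 2))
  data.frame(from = pairs[, 1], to = pairs[, 2], Movie = m,
             stringsAsFactors = FALSE)
}))
g <- graph_from_data_frame(edges, directed = FALSE)

deg <- degree(g, mode = "all")
clo <- closeness(g, mode = "all", weights = NA, normalized = TRUE)
btw <- betweenness(g, directed = FALSE, weights = NA, normalized = TRUE)

# Comparing against min() keeps all tied actors; which.min() would keep one
lowest_degree   <- names(deg)[deg == min(deg)]
highest_close   <- names(which.max(clo))
highest_between <- names(which.max(btw))

lowest_degree
highest_close
highest_between
```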

References

Beckman, Paul, and Jennifer Chi. 2011. “More Highly Connected Baseball Players Have Better Offensive Performance.” http://sabr.org/research/more-highly-connected-baseball-players-have-better-offensive-performance.

Degenne, Alain, and Michel Forse. 1999. Introducing Social Networks.

Luke, Douglas A. 2015. A User’s Guide to Network Analysis in R.

Ognyanova, Katherine. 2016. “Network Analysis and Visualization with R and Igraph.” http://kateto.net/networks-r-igraph.

Scott, John, and Peter J. Carrington. 2012. The Sage Handbook of Social Network Analysis.

“Six Degrees of Kevin Bacon.” 2017. https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon.