Centrality in igraph - for the most part

The purpose of this exercise is to acquaint you with running a few of the many centrality options available in igraph.

Getting Started with the Example Data

This part is important. Please be sure to do this part again:

Create a folder labeled “Practicum 4” someplace on your computer, such as your Desktop or wherever you will be able to easily find it again. Then, set your working directory to that folder in RStudio:

Session
- Set Working Directory
- Choose Directory…

Although it is not required, it can be helpful to follow along with this tutorial using the example data that we are using. Brendan Knapp has converted a classic network (Zachary’s “Karate Club” network) into an edgelist, and I have added some changes to give us a chance to demonstrate some simple data cleaning and manipulation.

We will use Zachary’s Karate Club network to walk though the process.

The edgelist (Zachary.csv) can be found in the “Network Data” portion of the website. You may access it directly as well using this link. If you wish to use the Karate Club network to follow along with the tutorial, download it as a CSV file and save it to the folder that you hopefully just created when you followed the directions, above.

Loading the CSV File

Because you are working with what will ultimately be network data, you should first go ahead and load a network analysis package. We are learning igraph first in this course, since it is somewhat better suited to visualization and has a wide range of available analyses.

So, go ahead and load the igraph package to get started.

library(igraph)

With igraph loaded, go ahead and read in the CSV.

Again, just to be clear, we are using Zachary’s “Karate Club” network as an example. The output you see from here on will be for that network. The network you are using for this practicum will look considerably different.

NetworkEL <- read.csv(file.choose(), header=TRUE, na.strings="")
        # na.strings="" tells R that blank cells are missing values.

Conversion from Edgelist to Network in `igraph`

An edgelist is one of the more common ways that a graph can be represented, and igraph allows us to construct a graph object directly from an edgelist by using the graph_from_edgelist() function.

With unfamiliar functions, your first step should always be to read the documentation. Take a look at it by running ?graph_from_edgelist, which searches for the documentation in RStudio’s help tab.

Think of what you see as a guide to how to use the function.

The Description section gives you a summary and basic details of the function’s intended use.
The Usage section details how the function is actually defined and the arguments that it expects.
The Arguments section explains what each of the arguments are referencing.

In graph_from_edgelist’s Description, we are told that it expects the edgelist to be a two-column matrix. In R, a matrix is a specific type of object. We can check to see if our NetworkEL is a matrix by running:

class(NetworkEL)

## [1] "data.frame"

We see that NetworkEL is actually a data.frame. While similar in concept, matrices and data frames are treated differently. This is because data frames can contain different types of columns/variables (character, numeric, etc.) while matrices only contain data of the same type.

In order for us to pass our edgelist to graph_from_edgelist(), we need to convert it into a matrix. Let’s create a new variable for the new, converted object. This is accomplished by running:

Network_Matrix <- as.matrix(NetworkEL)

We can then inspect Network_Matrix’s class to see if it works.

class(Network_Matrix)

## [1] "matrix" "array"

head(Network_Matrix)
    # You get the idea...

Returning to the documentation, we see in Usage how arguments are expected. Notice the directed = TRUE portion of the function’s arguments. This tells us that the directed argument is assigned a default in the event that we don’t pass our own argument. What this means is that unless we specify otherwise, graph_from_edgelist() defaults to making a directed graph.

Considering that our data do not represent one party being the source of an edge, and therefore that the edges have a direction, we know that our network is undirected.

Refresher on `igraph` Objects

Now, let’s create our network:

g is the variable to which we are assigning the network
Network_Matrix is the object we created by converting NetworkEL
we are explicitly setting directed = to FALSE:

g <- graph_from_edgelist(Network_Matrix, directed=FALSE)
# Or, if it should be a directed network:
gD <- graph_from_edgelist(Network_Matrix, directed=TRUE)  # Keep in mind that this network 
                                                          # was not actually undirected. We 
                                                          # are just pretending here for the 
                                                          # sake of having a directed example.

Sometimes, igraph will import an edgelist that doubles the edges. To be safe with undirected networks, simplify the network to remove duplicates.

g <- simplify(g)

We can then look at our new igraph object by simply calling its variable:

## IGRAPH c2440d8 U--- 34 78 -- 
## + edges from c2440d8:
##  [1]  1-- 2  1-- 3  1-- 4  1-- 5  1-- 6  1-- 7  1-- 8  1-- 9  1--11  1--12
## [11]  1--13  1--14  1--18  1--20  1--22  1--32  2-- 3  2-- 4  2-- 8  2--14
## [21]  2--18  2--20  2--22  2--31  3-- 4  3-- 8  3-- 9  3--10  3--14  3--28
## [31]  3--29  3--33  4-- 8  4--13  4--14  5-- 7  5--11  6-- 7  6--11  6--17
## [41]  7--17  9--31  9--33  9--34 10--34 14--34 15--33 15--34 16--33 16--34
## [51] 19--33 19--34 20--34 21--33 21--34 23--33 23--34 24--26 24--28 24--30
## [61] 24--33 24--34 25--26 25--28 25--32 26--32 27--30 27--34 28--34 29--32
## [71] 29--34 30--33 30--34 31--33 31--34 32--33 32--34 33--34

gD

## IGRAPH 2430ea6 D--- 34 155 -- 
## + edges from 2430ea6:
##  [1]  1-> 3  1-> 4  1-> 5  1-> 6  1-> 7  1-> 8  1-> 9  1->11  1->12  1->13
## [11]  1->14  1->18  1->20  1->22  1->32  2-> 1  2-> 3  2-> 4  2-> 8  2->14
## [21]  2->18  2->20  2->22  2->31  3-> 1  3-> 2  3-> 4  3-> 8  3-> 9  3->10
## [31]  3->14  3->28  3->29  3->33  4-> 1  4-> 2  4-> 3  4-> 8  4->13  4->14
## [41]  5-> 1  5-> 7  5->11  6-> 1  6-> 7  6->11  6->17  7-> 1  7-> 5  7-> 6
## [51]  7->17  8-> 1  8-> 2  8-> 3  8-> 4  9-> 1  9-> 3  9->31  9->33  9->34
## [61] 10-> 3 10->34 11-> 1 11-> 5 11-> 6 12-> 1 13-> 1 13-> 4 14-> 1 14-> 2
## [71] 14-> 3 14-> 4 14->34 15->33 15->34 16->33 16->34 17-> 6 17-> 7 18-> 1
## [81] 18-> 2 19->33 19->34 20-> 1 20-> 2 20->34 21->33 21->34 22-> 1 22-> 2
## + ... omitted several edges

Let’s understand the information contained in an igraph object:

IGRAPH simply annotates g as an igraph object
Whatever random six digit alphanumeric string follows IGRAPH is simply how igraph identifies the g for itself
- it’s not important for our purposes and will be referred to as arbitrary igraph name
UN-- refers to descriptive details of g:
- U tells us that g is an undirected graph
  - D would tell us that it is directed graph
- N indicates that g is a named graph, in that the vertices have a name attribute
- -- refers to attributes not applicable to g, but we will see them in the future:
  - W would refer to a weighted graph, where edges have a weight attribute
  - B would refer to a bipartite graph, where vertices have a type attribute
34 refers to the number of vertices in g
78 refers to the number of edges in g
attr: is a list of attributes within the graph. There are no attributes in this network. But, in cases where you load networks that use names instead of numbers, you will see name listed after attr:. You will see multiple attributes in the future.
- (v/c), which will appear following name, tells us that it is a vertex attribute of a character data type. character is simply what R calls a string.
- In the future we will also see:
  - (e/c) or (e/n) referring to edge attributes that are of character or numeric data types
  - (g/c) or (g/n) referring to graph attributes that are of character or numeric data types
+ edges from *arbitrary igraph name* (vertex names): lists a sample of g’s edges using the names of the vertices which they connect.

Analysis

Use the following code to produce the measures discussed in chapter 4 of Understanding Dark Networks.

Local Measures in `igraph`

There are many, many available centrality measures that have been developed for network analysis. At present, there is no program that is so comprehensive that it includes all of the measures. We will, therefore, limit this discussion to a subset of the measures that are included in igraph. These include the “big four” measures (degree, betweenness, closeness, and eigenvector) and a few useful others. In what follows, we introduce:

Degree Centrality
- In-degree
- Out-degree
Eigenvector Centrality
Hubs & Authorities
Closeness Centrality
Reach Centrality
Betweenness Centrality

Degree Centrality

There are variations in the way that each of these measures will treat the network you are analyzing. So it is always a good idea to check the default settings with ?. For instance, ?degree will produce the help page for the degree centrality measure. Under the word “usage” you will find:

degree(graph, v = V(graph), 
      mode = c("all", "out", "in", "total"), 
      loops = TRUE, normalized = FALSE)

This is saying that the default settings in igraph, among other things, consider loops when calculating degree (loops = TRUE). We can also see that the resulting output will be raw counts for degree, rather than normalized scores. These are defaults for the program. So if you simply calculate density using density(g), then R will return raw counts. To produce normalized output, add normalized = TRUE to your script: density(g, normalized=TRUE).

The “mode” information tells you what your options are for dealing with directed networks. You will notice that the first option in the list is “all”. This is the default, and will produce a value of degree that adds all incoming and outgoing ties (the sum of all ties adjacent to a node). To change this to indegree, just add degree(gD, mode="in"). “Mode” has no effect on undirected networks.

Try this out with the two networks (g, and gD) that we created above.

Degree <- degree(g)
Indegree.Undirected <- degree(g, mode="in")
Outdegree.Undirected <- degree(g, mode="out")

Degree.Directed <- degree(gD)
Indegree <- degree(gD, mode="in")
Outdegree <- degree(gD, mode="out")

To see what the output for each of these looks like, use the head() function. That will show you the first six observations.

A method that many may find easier, or possibly more intuitive is to group all of these measures together into a data frame and look at them all at once. To do this, use the cbind command to combine the measures for comparison.

CompareDegree <- cbind(Degree, Indegree.Undirected, Outdegree.Undirected, Degree.Directed, Indegree, Outdegree)
# Then look at just the first few observations, to save space.
head(CompareDegree)

##      Degree Indegree.Undirected Outdegree.Undirected Degree.Directed Indegree
## [1,]     16                  16                   16              31       16
## [2,]      9                   9                    9              17        8
## [3,]     10                  10                   10              20       10
## [4,]      6                   6                    6              12        6
## [5,]      3                   3                    3               6        3
## [6,]      4                   4                    4               8        4
##      Outdegree
## [1,]        15
## [2,]         9
## [3,]        10
## [4,]         6
## [5,]         3
## [6,]         4

This is a little bit of a manufactured comparison, since the network data that we are using were really designed as an undirected network. Normally, you will not see the indegree and outdegree from the directed network summing to the same total as the undirected version of the network. But, hopefully, you get the idea.

That was a fairly lengthy treatment of degree centrality. The following centralities will be fairly minimal in explanation and interpretation. The degree example should demonstrate how to modify the centrality measures to work with digraphs.

Eigenvector Centrality

Eigenvector centrality calculations are iterative and can produce a wealth of information. Most users will only be interested in the centrality scores, however. Therefore, when you calculate eigenvector centrality, you will need to tell igraph that you are only interested in the vector of centrality scores that you are calculating.

Eig <- evcent(g)$vector

Hubs and Authorities

These centrality (prominence) scores are calculated in a manner that is very similar to that of eigenvector centrality. You will, therefore, have to isolate only the vector of eigenvector scores in each of these.

Hub <- hub.score(g)$vector
Authority <- authority.score(g)$vector

Closeness

Note that this version of closeness is sensitive to Freeman’s original conceptualization of how closeness should be measured. Although it will produce measures for network that have multiple, disconnected components, it will also produce a warning that the question of what to do with disconnected components is not settled or well defined. So, caveat emptor.

Closeness <- closeness(g)

Reach

Reach is included with the igraph suite as a measure of the size of each node’s ego-network. That is, treating each node as the center of its own “ego-network” igraph produces a count of how many alters each node has at k steps out into the network.

The function is ego_size(network, k) and it actually produces a count of the number of nodes that are included in the ego-network, including the node itself. So, if you prefer a raw count, then you should use ego_size()-1 to remove the ego node from the count.

Also, note that the default for ego_size() is k=1. k=1, or one step out, is the same as calculating degree, plus one. So, we will also have to remove the ego from that calculation to give the maximum number of nodes that could be included in the ego’s ego-network.

Raw counts can be misleading and can make it difficult to compare relative differences between networks. As with other measures, it is much easier to interpret a proportion. So, a normalized version of reach can be produced by dividing reach by the size of the network. You can calculate the network’s size using vcount() to produce a count of the number of vertices in the network. As with the ego-network measure, vcount will return a count of all nodes in the network. To get the number of nodes an ego could potentially reach, we need $n-1$ to remove the ego from the count. The final version of reach is, therefore, as follows: (ego_size()-1)/(vcount()-1)).

## Reach at k=2
Reach_2 <- (ego_size(g, 2)-1)/(vcount(g)-1)

## Reach at k=3
Reach_3 <- (ego_size(g, 3)-1)/(vcount(g)-1)

Betweenness

Betweenness <- betweenness(g)

Comparing Centrality Scores

To keep all your work in one place, put it all together in one data frame. You can then export the data frame to a spreadsheet, or sort it in R (see the “Some More Advanced Versions of Doing the Above” section of this site for that).

centralities <- cbind(Degree, Eig, Hub, Authority, Closeness, Reach_2, Reach_3, Betweenness)

# Save it to your computer as a spreadsheet
write.csv(centralities, file="centralities.csv")

Comparing the Various Measures Statistically

R is primarily a statistical program. So you can take advantage of that by seeing how correlated the various centralities are.

round(cor(centralities), 2)

##             Degree  Eig  Hub Authority Closeness Reach_2 Reach_3 Betweenness
## Degree        1.00 0.92 0.92      0.92      0.77    0.44    0.52        0.91
## Eig           0.92 1.00 1.00      1.00      0.90    0.70    0.66        0.80
## Hub           0.92 1.00 1.00      1.00      0.90    0.70    0.66        0.80
## Authority     0.92 1.00 1.00      1.00      0.90    0.70    0.66        0.80
## Closeness     0.77 0.90 0.90      0.90      1.00    0.87    0.84        0.72
## Reach_2       0.44 0.70 0.70      0.70      0.87    1.00    0.66        0.42
## Reach_3       0.52 0.66 0.66      0.66      0.84    0.66    1.00        0.43
## Betweenness   0.91 0.80 0.80      0.80      0.72    0.42    0.43        1.00

The cor() function produces a correlation matrix. In this case, we can see the correlation values for each pair of centralities. We can see, for example, that degree is correlated with eigenvector, hubs, and authorities at a 0.91 level.

The round( , ) function sets the number of digits that are displayed after the decimal. If you would like to see what you would have gotten if you had not rounded to the first two places after the decimal, try: cor(centralities). The round function takes two arguments: (1) the object, number, or vector that you are rounding, and (2) the number of digits that should appear after the decimal in the output. In this case, we have set the output to just include two digits after the decimal.

Please keep in mind that this is just a “quick and dirty” way of comparing the centralities. In most cases, when you are comparing numeric attributes of vertices it is best to use statistical tests that are designed with network data in mind. Tools such as conditional uniform graphs (CUG), quadratic assignment procedure (QAP), stochastic actor oriented modeling (SAOM) exponential random graph modeling (ERGM / P*), and are explicitly designed to handle the interdependencies present within the network.

We will cover those in a later section.

Some More Advanced Options for Doing the Above

Make the centrality scores into vertex attributes

In each of the following examples, you are adding a vertex attribute. The word following V(g)$ is the name of the attribute being added to the network data. The function to the right of the assignment arrow (<-) is calculating the value of the attribute for each node in the network. Once you have finished these, check your work using g. You should see these six centralities listed among the vertex attributes.

Word to the wise, if you want to be able to access these another day, be sure to save your network (save(g, file="CentalityNetwork.Rda")). If you forget to save, you can always re-run this script. But, that is more work, isn’t it?

V(g)$degree <- degree(g)                        # Degree centrality
V(g)$eig <- evcent(g)$vector                    # Eigenvector centrality
V(g)$hubs <- hub.score(g)$vector                # "Hub" centrality
V(g)$authorities <- authority.score(g)$vector   # "Authority" centrality
V(g)$closeness <- closeness(g)                  # Closeness centrality
V(g)$betweenness <- betweenness(g)              # Vertex betweenness centrality

Alternate way to Create a Data Frame

centrality <- data.frame(row.names   = V(g)$name,
                         degree      = V(g)$degree,
                         closeness   = V(g)$closeness,
                         betweenness = V(g)$betweenness,
                         eigenvector = V(g)$eig)

centrality <- centrality[order(row.names(centrality)),]

head(centrality)

##    degree  closeness betweenness eigenvector
## 1      16 0.01724138 231.0714286   0.9521324
## 10      2 0.01315789   0.4476190   0.2749981
## 11      3 0.01149425   0.3333333   0.2034715
## 12      1 0.01111111   0.0000000   0.1415663
## 13      2 0.01123596   0.0000000   0.2256638
## 14      5 0.01562500  24.2158730   0.6065744

Save the data frame as a .csv file

Many will choose this method just to make life easier for themselves. Once you have saved this in CSV format, you can sort the output in a spreadsheet program such as Excel, Numbers, Google Sheets, etc.

write.csv(centrality, file = "Centrality.csv")

If you prefer to sort your data in R…

If we use the data frame that we created and named “centrality”, earlier, we can sort it multiple ways. For this example, we’ll only sort on the betweenness measure.

Lowest tie value

# Minimum value
min(centrality$betweenness)

## [1] 0

# Maximum value
max(centrality$betweenness)

## [1] 231.0714

# Bottom six 

head(centrality[order(-centrality$betweenness),])

##    degree  closeness betweenness eigenvector
## 1      16 0.01724138   231.07143   0.9521324
## 34     17 0.01666667   160.55159   1.0000000
## 33     12 0.01562500    76.69048   0.8266589
## 3      10 0.01694915    75.85079   0.8495542
## 32      6 0.01639344    73.00952   0.5116565
## 9       5 0.01562500    29.52937   0.6090684

# Bottom five
head(centrality[order(-centrality$betweenness),], n=5)

##    degree  closeness betweenness eigenvector
## 1      16 0.01724138   231.07143   0.9521324
## 34     17 0.01666667   160.55159   1.0000000
## 33     12 0.01562500    76.69048   0.8266589
## 3      10 0.01694915    75.85079   0.8495542
## 32      6 0.01639344    73.00952   0.5116565

# Bottom 10
head(centrality[order(-centrality$betweenness),], n=10)

##    degree  closeness betweenness eigenvector
## 1      16 0.01724138   231.07143   0.9521324
## 34     17 0.01666667   160.55159   1.0000000
## 33     12 0.01562500    76.69048   0.8266589
## 3      10 0.01694915    75.85079   0.8495542
## 32      6 0.01639344    73.00952   0.5116565
## 9       5 0.01562500    29.52937   0.6090684
## 2       9 0.01470588    28.47857   0.7123351
## 14      5 0.01562500    24.21587   0.6065744
## 20      3 0.01515152    17.14683   0.3961622
## 6       4 0.01162791    15.83333   0.2128838

Highest tie value

max(degree(g))

## [1] 17

# Top six
head(centrality[order(centrality$betweenness),])

##    degree  closeness betweenness eigenvector
## 12      1 0.01111111           0  0.14156633
## 13      2 0.01123596           0  0.22566382
## 15      2 0.01123596           0  0.27159396
## 16      2 0.01123596           0  0.27159396
## 17      2 0.00862069           0  0.06330461
## 18      2 0.01136364           0  0.24747879

# Top five
head(centrality[order(centrality$betweenness),], n=5)

##    degree  closeness betweenness eigenvector
## 12      1 0.01111111           0  0.14156633
## 13      2 0.01123596           0  0.22566382
## 15      2 0.01123596           0  0.27159396
## 16      2 0.01123596           0  0.27159396
## 17      2 0.00862069           0  0.06330461

# Top ten
head(centrality[order(centrality$betweenness),], n=10)

##    degree  closeness betweenness eigenvector
## 12      1 0.01111111           0  0.14156633
## 13      2 0.01123596           0  0.22566382
## 15      2 0.01123596           0  0.27159396
## 16      2 0.01123596           0  0.27159396
## 17      2 0.00862069           0  0.06330461
## 18      2 0.01136364           0  0.24747879
## 19      2 0.01123596           0  0.27159396
## 21      2 0.01123596           0  0.27159396
## 22      2 0.01136364           0  0.24747879
## 23      2 0.01123596           0  0.27159396

Visualization Options

Normally, igraph will calculate new coordinates every time you visualize a network. This can be annoying because it changes the orientation of the network every time. For those times when you prefer to show the network in one static position, to show the differences in node size or color, you can save the coordinates from a particular layout and reuse them multiple times. To do so, just save the coordinates from the first time you run the visualization, and reuse it as you like.

Here, lay is a variable to which you’re assigning the coordinates that igraph calculates based off an algorithm of your choosing. Here’s g using a force-directed algorithm referred to as Kamada-Kawai, the names of its creators.

lay <- layout_with_kk(g)

plot(g, layout = lay, 
     vertex.label = NA)

This is a complicated network for which Kamada-Kawai doesn’t yield a very helpful plot. You can find mode layouts to try in the igraph documentation by searching in the Help pane or running layout_. Simply assign the coordinates to the lay variable, and use your new lay as your layout argument when use the plot() function.

Here’s another example using another force-directed algorithm referred to as Fruchterman-Reingold, the names of its creators.

lay <- layout_with_fr(g)

plot(g, layout = lay, 
     vertex.label = NA)

Try some of the options from Practicum 2 to see if you can render a better visualization with some other layout.

Here is what you can do with the same layout:

Size vertices by centrality measure!

Note: Some centrality measures result in small values. To make the relative differences between them visible, it is important to rescale the centrality vector. We have noted the places where the centrality output was rescaled. There is no real rule to this. In this case, we tried out a few scaling values until the visualization took on the visual aspect we hoped to display.

It is important to rescale an entire vector of output at once in the same manner. Do not change only a few values, as that will change the relative differences between them.

lay <- layout.fruchterman.reingold(g) # save coordinates
plot.igraph(g, layout=lay)

plot.igraph(g, layout=lay, 
            vertex.size=degree(g), 
            main="Degree")

plot.igraph(gD, layout=lay, 
            vertex.size=degree(gD, mode="in"), 
            main="Indegree")

plot.igraph(gD, layout=lay, 
            vertex.size=degree(gD, mode="out"), 
            main="Outdegree")

plot.igraph(g, layout=lay, 
            vertex.size=evcent(g)$vector*15,   # Rescaled by multiplying by 15
            main="Eigenvector")

plot.igraph(gD, layout=lay, 
            vertex.size=hub.score(gD)$vector*15,   # Rescaled by multiplying by 15
            main="Hubs")

plot.igraph(gD, layout=lay, 
            vertex.size=authority.score(gD)$vector*15,   # Rescaled by multiplying by 15
            main="Authorities")

plot.igraph(g, layout=lay, 
            vertex.size=closeness(g)*1000,    # Rescaled by multiplying by 1000
            main="Closeness (X 1000)")

plot.igraph(g, layout=lay, 
            vertex.size=betweenness(g)*0.25,    # Rescaled by multiplying by 25
            main="Betweenness")

plot.igraph(g, layout=lay, 
            vertex.size=((ego_size(g, 2)-1)/(vcount(g)-1))*20,   # Rescaled by multiplying by 10
            main="Reach 2")

plot.igraph(g, layout=lay, 
            vertex.size=((ego_size(g, 2)-1)/(vcount(g)-1))*20,    # Rescaled by multiplying by 10
            main="Reach 3")