The purpose of this exercise is to acquaint you with running a few of the many centrality options available in igraph.
Create a folder labeled “Practicum 4” someplace on your computer, such as your Desktop or wherever you will be able to easily find it again. Then, set your working directory to that folder in RStudio:
Although it is not required, it can be helpful to follow along with this tutorial using the example data that we are using. Brendan Knapp has converted a classic network (Zachary’s “Karate Club” network) into an edgelist, and I have added some changes to give us a chance to demonstrate some simple data cleaning and manipulation.
We will use Zachary’s Karate Club network to walk though the process.
The edgelist (Zachary.csv) can be found in the “Network Data” portion of the website. You may access it directly as well using this link. If you wish to use the Karate Club network to follow along with the tutorial, download it as a CSV file and save it to the folder that you hopefully just created when you followed the directions, above.
Because you are working with what will ultimately be network data, you should first go ahead and load a network analysis package. We are learning igraph first in this course, since it is somewhat better suited to visualization and has a wide range of available analyses.
So, go ahead and load the igraph
package to get started.
library(igraph)
With igraph
loaded, go ahead and read in the CSV.
Again, just to be clear, we are using Zachary’s “Karate Club” network as an example. The output you see from here on will be for that network. The network you are using for this practicum will look considerably different.
NetworkEL <- read.csv(file.choose(), header=TRUE, na.strings="")
# na.strings="" tells R that blank cells are missing values.
igraph
An edgelist is one of the more common ways that a graph can be represented, and igraph
allows us to construct a graph object directly from an edgelist by using the graph_from_edgelist()
function.
With unfamiliar functions, your first step should always be to read the documentation. Take a look at it by running ?graph_from_edgelist
, which searches for the documentation in RStudio’s help tab.
Think of what you see as a guide to how to use the function.
In graph_from_edgelist
’s Description, we are told that it expects the edgelist to be a two-column matrix. In R
, a matrix is a specific type of object. We can check to see if our NetworkEL
is a matrix by running:
class(NetworkEL)
## [1] "data.frame"
We see that NetworkEL
is actually a data.frame
. While similar in concept, matrices and data frames are treated differently. This is because data frames can contain different types of columns/variables (character, numeric, etc.) while matrices only contain data of the same type.
In order for us to pass our edgelist to graph_from_edgelist()
, we need to convert it into a matrix. Let’s create a new variable for the new, converted object. This is accomplished by running:
Network_Matrix <- as.matrix(NetworkEL)
We can then inspect Network_Matrix
’s class to see if it works.
class(Network_Matrix)
## [1] "matrix" "array"
head(Network_Matrix)
# You get the idea...
Returning to the documentation, we see in Usage how arguments are expected. Notice the directed = TRUE
portion of the function’s arguments. This tells us that the directed
argument is assigned a default in the event that we don’t pass our own argument. What this means is that unless we specify otherwise, graph_from_edgelist()
defaults to making a directed graph.
Considering that our data do not represent one party being the source of an edge, and therefore that the edges have a direction, we know that our network is undirected.
igraph
ObjectsNow, let’s create our network:
g
is the variable to which we are assigning the networkNetwork_Matrix
is the object we created by converting NetworkEL
directed =
to FALSE
:g <- graph_from_edgelist(Network_Matrix, directed=FALSE)
# Or, if it should be a directed network:
gD <- graph_from_edgelist(Network_Matrix, directed=TRUE) # Keep in mind that this network
# was not actually undirected. We
# are just pretending here for the
# sake of having a directed example.
Sometimes, igraph will import an edgelist that doubles the edges. To be safe with undirected networks, simplify the network to remove duplicates.
g <- simplify(g)
We can then look at our new igraph
object by simply calling its variable:
g
## IGRAPH c2440d8 U--- 34 78 --
## + edges from c2440d8:
## [1] 1-- 2 1-- 3 1-- 4 1-- 5 1-- 6 1-- 7 1-- 8 1-- 9 1--11 1--12
## [11] 1--13 1--14 1--18 1--20 1--22 1--32 2-- 3 2-- 4 2-- 8 2--14
## [21] 2--18 2--20 2--22 2--31 3-- 4 3-- 8 3-- 9 3--10 3--14 3--28
## [31] 3--29 3--33 4-- 8 4--13 4--14 5-- 7 5--11 6-- 7 6--11 6--17
## [41] 7--17 9--31 9--33 9--34 10--34 14--34 15--33 15--34 16--33 16--34
## [51] 19--33 19--34 20--34 21--33 21--34 23--33 23--34 24--26 24--28 24--30
## [61] 24--33 24--34 25--26 25--28 25--32 26--32 27--30 27--34 28--34 29--32
## [71] 29--34 30--33 30--34 31--33 31--34 32--33 32--34 33--34
gD
## IGRAPH 2430ea6 D--- 34 155 --
## + edges from 2430ea6:
## [1] 1-> 3 1-> 4 1-> 5 1-> 6 1-> 7 1-> 8 1-> 9 1->11 1->12 1->13
## [11] 1->14 1->18 1->20 1->22 1->32 2-> 1 2-> 3 2-> 4 2-> 8 2->14
## [21] 2->18 2->20 2->22 2->31 3-> 1 3-> 2 3-> 4 3-> 8 3-> 9 3->10
## [31] 3->14 3->28 3->29 3->33 4-> 1 4-> 2 4-> 3 4-> 8 4->13 4->14
## [41] 5-> 1 5-> 7 5->11 6-> 1 6-> 7 6->11 6->17 7-> 1 7-> 5 7-> 6
## [51] 7->17 8-> 1 8-> 2 8-> 3 8-> 4 9-> 1 9-> 3 9->31 9->33 9->34
## [61] 10-> 3 10->34 11-> 1 11-> 5 11-> 6 12-> 1 13-> 1 13-> 4 14-> 1 14-> 2
## [71] 14-> 3 14-> 4 14->34 15->33 15->34 16->33 16->34 17-> 6 17-> 7 18-> 1
## [81] 18-> 2 19->33 19->34 20-> 1 20-> 2 20->34 21->33 21->34 22-> 1 22-> 2
## + ... omitted several edges
Let’s understand the information contained in an igraph
object:
IGRAPH
simply annotates g
as an igraph
objectIGRAPH
is simply how igraph
identifies the g
for itself
UN--
refers to descriptive details of g
:
U
tells us that g
is an undirected graph
D
would tell us that it is directed graphN
indicates that g
is a named graph, in that the vertices have a name
attribute--
refers to attributes not applicable to g
, but we will see them in the future:
W
would refer to a weighted graph, where edges have a weight
attributeB
would refer to a bipartite graph, where vertices have a type
attribute34
refers to the number of vertices in g
78
refers to the number of edges in g
attr:
is a list of attributes within the graph. There are no attributes in this network. But, in cases where you load networks that use names instead of numbers, you will see name
listed after attr:
. You will see multiple attributes in the future.
(v/c)
, which will appear following name
, tells us that it is a vertex attribute of a character
data type. character
is simply what R
calls a string
.(e/c)
or (e/n)
referring to edge attributes that are of character
or numeric
data types(g/c)
or (g/n)
referring to graph attributes that are of character
or numeric
data types+ edges from *arbitrary igraph name* (vertex names):
lists a sample of g
’s edges using the names of the vertices which they connect.Use the following code to produce the measures discussed in chapter 4 of Understanding Dark Networks.
igraph
There are many, many available centrality measures that have been developed for network analysis. At present, there is no program that is so comprehensive that it includes all of the measures. We will, therefore, limit this discussion to a subset of the measures that are included in igraph
. These include the “big four” measures (degree, betweenness, closeness, and eigenvector) and a few useful others. In what follows, we introduce:
There are variations in the way that each of these measures will treat the network you are analyzing. So it is always a good idea to check the default settings with ?
. For instance, ?degree
will produce the help page for the degree centrality measure. Under the word “usage” you will find:
degree(graph, v = V(graph),
mode = c("all", "out", "in", "total"),
loops = TRUE, normalized = FALSE)
This is saying that the default settings in igraph, among other things, consider loops when calculating degree (loops = TRUE
). We can also see that the resulting output will be raw counts for degree, rather than normalized scores. These are defaults for the program. So if you simply calculate density using density(g)
, then R will return raw counts. To produce normalized output, add normalized = TRUE
to your script: density(g, normalized=TRUE)
.
The “mode” information tells you what your options are for dealing with directed networks. You will notice that the first option in the list is “all”. This is the default, and will produce a value of degree that adds all incoming and outgoing ties (the sum of all ties adjacent to a node). To change this to indegree, just add degree(gD, mode="in")
. “Mode” has no effect on undirected networks.
Try this out with the two networks (g, and gD) that we created above.
Degree <- degree(g)
Indegree.Undirected <- degree(g, mode="in")
Outdegree.Undirected <- degree(g, mode="out")
Degree.Directed <- degree(gD)
Indegree <- degree(gD, mode="in")
Outdegree <- degree(gD, mode="out")
To see what the output for each of these looks like, use the head()
function. That will show you the first six observations.
A method that many may find easier, or possibly more intuitive is to group all of these measures together into a data frame and look at them all at once. To do this, use the cbind
command to combine the measures for comparison.
CompareDegree <- cbind(Degree, Indegree.Undirected, Outdegree.Undirected, Degree.Directed, Indegree, Outdegree)
# Then look at just the first few observations, to save space.
head(CompareDegree)
## Degree Indegree.Undirected Outdegree.Undirected Degree.Directed Indegree
## [1,] 16 16 16 31 16
## [2,] 9 9 9 17 8
## [3,] 10 10 10 20 10
## [4,] 6 6 6 12 6
## [5,] 3 3 3 6 3
## [6,] 4 4 4 8 4
## Outdegree
## [1,] 15
## [2,] 9
## [3,] 10
## [4,] 6
## [5,] 3
## [6,] 4
This is a little bit of a manufactured comparison, since the network data that we are using were really designed as an undirected network. Normally, you will not see the indegree and outdegree from the directed network summing to the same total as the undirected version of the network. But, hopefully, you get the idea.
That was a fairly lengthy treatment of degree centrality. The following centralities will be fairly minimal in explanation and interpretation. The degree example should demonstrate how to modify the centrality measures to work with digraphs.
Eigenvector centrality calculations are iterative and can produce a wealth of information. Most users will only be interested in the centrality scores, however. Therefore, when you calculate eigenvector centrality, you will need to tell igraph that you are only interested in the vector of centrality scores that you are calculating.
Eig <- evcent(g)$vector
Note that this version of closeness is sensitive to Freeman’s original conceptualization of how closeness should be measured. Although it will produce measures for network that have multiple, disconnected components, it will also produce a warning that the question of what to do with disconnected components is not settled or well defined. So, caveat emptor.
Closeness <- closeness(g)
Reach is included with the igraph suite as a measure of the size of each node’s ego-network. That is, treating each node as the center of its own “ego-network” igraph produces a count of how many alters each node has at k steps out into the network.
The function is ego_size(network, k)
and it actually produces a count of the number of nodes that are included in the ego-network, including the node itself. So, if you prefer a raw count, then you should use ego_size()-1
to remove the ego node from the count.
Also, note that the default for ego_size()
is k=1. k=1, or one step out, is the same as calculating degree, plus one. So, we will also have to remove the ego from that calculation to give the maximum number of nodes that could be included in the ego’s ego-network.
Raw counts can be misleading and can make it difficult to compare relative differences between networks. As with other measures, it is much easier to interpret a proportion. So, a normalized version of reach can be produced by dividing reach by the size of the network. You can calculate the network’s size using vcount()
to produce a count of the number of vertices in the network. As with the ego-network measure, vcount will return a count of all nodes in the network. To get the number of nodes an ego could potentially reach, we need \(n-1\) to remove the ego from the count. The final version of reach is, therefore, as follows: (ego_size()-1)/(vcount()-1))
.
## Reach at k=2
Reach_2 <- (ego_size(g, 2)-1)/(vcount(g)-1)
## Reach at k=3
Reach_3 <- (ego_size(g, 3)-1)/(vcount(g)-1)
Betweenness <- betweenness(g)
To keep all your work in one place, put it all together in one data frame. You can then export the data frame to a spreadsheet, or sort it in R (see the “Some More Advanced Versions of Doing the Above” section of this site for that).
centralities <- cbind(Degree, Eig, Hub, Authority, Closeness, Reach_2, Reach_3, Betweenness)
# Save it to your computer as a spreadsheet
write.csv(centralities, file="centralities.csv")
R is primarily a statistical program. So you can take advantage of that by seeing how correlated the various centralities are.
round(cor(centralities), 2)
## Degree Eig Hub Authority Closeness Reach_2 Reach_3 Betweenness
## Degree 1.00 0.92 0.92 0.92 0.77 0.44 0.52 0.91
## Eig 0.92 1.00 1.00 1.00 0.90 0.70 0.66 0.80
## Hub 0.92 1.00 1.00 1.00 0.90 0.70 0.66 0.80
## Authority 0.92 1.00 1.00 1.00 0.90 0.70 0.66 0.80
## Closeness 0.77 0.90 0.90 0.90 1.00 0.87 0.84 0.72
## Reach_2 0.44 0.70 0.70 0.70 0.87 1.00 0.66 0.42
## Reach_3 0.52 0.66 0.66 0.66 0.84 0.66 1.00 0.43
## Betweenness 0.91 0.80 0.80 0.80 0.72 0.42 0.43 1.00
The cor()
function produces a correlation matrix. In this case, we can see the correlation values for each pair of centralities. We can see, for example, that degree is correlated with eigenvector, hubs, and authorities at a 0.91 level.
The round( , )
function sets the number of digits that are displayed after the decimal. If you would like to see what you would have gotten if you had not rounded to the first two places after the decimal, try: cor(centralities)
. The round
function takes two arguments: (1) the object, number, or vector that you are rounding, and (2) the number of digits that should appear after the decimal in the output. In this case, we have set the output to just include two digits after the decimal.
Please keep in mind that this is just a “quick and dirty” way of comparing the centralities. In most cases, when you are comparing numeric attributes of vertices it is best to use statistical tests that are designed with network data in mind. Tools such as conditional uniform graphs (CUG), quadratic assignment procedure (QAP), stochastic actor oriented modeling (SAOM) exponential random graph modeling (ERGM / P*), and are explicitly designed to handle the interdependencies present within the network.
We will cover those in a later section.
In each of the following examples, you are adding a vertex attribute. The word following V(g)$
is the name of the attribute being added to the network data. The function to the right of the assignment arrow (<-
) is calculating the value of the attribute for each node in the network. Once you have finished these, check your work using g
. You should see these six centralities listed among the vertex attributes.
Word to the wise, if you want to be able to access these another day, be sure to save your network (save(g, file="CentalityNetwork.Rda")
). If you forget to save, you can always re-run this script. But, that is more work, isn’t it?
V(g)$degree <- degree(g) # Degree centrality
V(g)$eig <- evcent(g)$vector # Eigenvector centrality
V(g)$hubs <- hub.score(g)$vector # "Hub" centrality
V(g)$authorities <- authority.score(g)$vector # "Authority" centrality
V(g)$closeness <- closeness(g) # Closeness centrality
V(g)$betweenness <- betweenness(g) # Vertex betweenness centrality
centrality <- data.frame(row.names = V(g)$name,
degree = V(g)$degree,
closeness = V(g)$closeness,
betweenness = V(g)$betweenness,
eigenvector = V(g)$eig)
centrality <- centrality[order(row.names(centrality)),]
head(centrality)
## degree closeness betweenness eigenvector
## 1 16 0.01724138 231.0714286 0.9521324
## 10 2 0.01315789 0.4476190 0.2749981
## 11 3 0.01149425 0.3333333 0.2034715
## 12 1 0.01111111 0.0000000 0.1415663
## 13 2 0.01123596 0.0000000 0.2256638
## 14 5 0.01562500 24.2158730 0.6065744
Many will choose this method just to make life easier for themselves. Once you have saved this in CSV format, you can sort the output in a spreadsheet program such as Excel, Numbers, Google Sheets, etc.
write.csv(centrality, file = "Centrality.csv")
If we use the data frame that we created and named “centrality”, earlier, we can sort it multiple ways. For this example, we’ll only sort on the betweenness measure.
# Minimum value
min(centrality$betweenness)
## [1] 0
# Maximum value
max(centrality$betweenness)
## [1] 231.0714
# Bottom six
head(centrality[order(-centrality$betweenness),])
## degree closeness betweenness eigenvector
## 1 16 0.01724138 231.07143 0.9521324
## 34 17 0.01666667 160.55159 1.0000000
## 33 12 0.01562500 76.69048 0.8266589
## 3 10 0.01694915 75.85079 0.8495542
## 32 6 0.01639344 73.00952 0.5116565
## 9 5 0.01562500 29.52937 0.6090684
# Bottom five
head(centrality[order(-centrality$betweenness),], n=5)
## degree closeness betweenness eigenvector
## 1 16 0.01724138 231.07143 0.9521324
## 34 17 0.01666667 160.55159 1.0000000
## 33 12 0.01562500 76.69048 0.8266589
## 3 10 0.01694915 75.85079 0.8495542
## 32 6 0.01639344 73.00952 0.5116565
# Bottom 10
head(centrality[order(-centrality$betweenness),], n=10)
## degree closeness betweenness eigenvector
## 1 16 0.01724138 231.07143 0.9521324
## 34 17 0.01666667 160.55159 1.0000000
## 33 12 0.01562500 76.69048 0.8266589
## 3 10 0.01694915 75.85079 0.8495542
## 32 6 0.01639344 73.00952 0.5116565
## 9 5 0.01562500 29.52937 0.6090684
## 2 9 0.01470588 28.47857 0.7123351
## 14 5 0.01562500 24.21587 0.6065744
## 20 3 0.01515152 17.14683 0.3961622
## 6 4 0.01162791 15.83333 0.2128838
max(degree(g))
## [1] 17
# Top six
head(centrality[order(centrality$betweenness),])
## degree closeness betweenness eigenvector
## 12 1 0.01111111 0 0.14156633
## 13 2 0.01123596 0 0.22566382
## 15 2 0.01123596 0 0.27159396
## 16 2 0.01123596 0 0.27159396
## 17 2 0.00862069 0 0.06330461
## 18 2 0.01136364 0 0.24747879
# Top five
head(centrality[order(centrality$betweenness),], n=5)
## degree closeness betweenness eigenvector
## 12 1 0.01111111 0 0.14156633
## 13 2 0.01123596 0 0.22566382
## 15 2 0.01123596 0 0.27159396
## 16 2 0.01123596 0 0.27159396
## 17 2 0.00862069 0 0.06330461
# Top ten
head(centrality[order(centrality$betweenness),], n=10)
## degree closeness betweenness eigenvector
## 12 1 0.01111111 0 0.14156633
## 13 2 0.01123596 0 0.22566382
## 15 2 0.01123596 0 0.27159396
## 16 2 0.01123596 0 0.27159396
## 17 2 0.00862069 0 0.06330461
## 18 2 0.01136364 0 0.24747879
## 19 2 0.01123596 0 0.27159396
## 21 2 0.01123596 0 0.27159396
## 22 2 0.01136364 0 0.24747879
## 23 2 0.01123596 0 0.27159396
Normally, igraph will calculate new coordinates every time you visualize a network. This can be annoying because it changes the orientation of the network every time. For those times when you prefer to show the network in one static position, to show the differences in node size or color, you can save the coordinates from a particular layout and reuse them multiple times. To do so, just save the coordinates from the first time you run the visualization, and reuse it as you like.
Here, lay
is a variable to which you’re assigning the coordinates that igraph
calculates based off an algorithm of your choosing. Here’s g
using a force-directed algorithm referred to as Kamada-Kawai
, the names of its creators.
lay <- layout_with_kk(g)
plot(g, layout = lay,
vertex.label = NA)
This is a complicated network for which Kamada-Kawai
doesn’t yield a very helpful plot. You can find mode layouts to try in the igraph
documentation by searching in the Help pane or running layout_
. Simply assign the coordinates to the lay
variable, and use your new lay
as your layout
argument when use the plot()
function.
Here’s another example using another force-directed algorithm referred to as Fruchterman-Reingold
, the names of its creators.
lay <- layout_with_fr(g)
plot(g, layout = lay,
vertex.label = NA)
Try some of the options from Practicum 2 to see if you can render a better visualization with some other layout.
Size vertices by centrality measure!
Note: Some centrality measures result in small values. To make the relative differences between them visible, it is important to rescale the centrality vector. We have noted the places where the centrality output was rescaled. There is no real rule to this. In this case, we tried out a few scaling values until the visualization took on the visual aspect we hoped to display.
It is important to rescale an entire vector of output at once in the same manner. Do not change only a few values, as that will change the relative differences between them.
lay <- layout.fruchterman.reingold(g) # save coordinates
plot.igraph(g, layout=lay)
plot.igraph(g, layout=lay,
vertex.size=degree(g),
main="Degree")
plot.igraph(gD, layout=lay,
vertex.size=degree(gD, mode="in"),
main="Indegree")
plot.igraph(gD, layout=lay,
vertex.size=degree(gD, mode="out"),
main="Outdegree")
plot.igraph(g, layout=lay,
vertex.size=evcent(g)$vector*15, # Rescaled by multiplying by 15
main="Eigenvector")
plot.igraph(gD, layout=lay,
vertex.size=hub.score(gD)$vector*15, # Rescaled by multiplying by 15
main="Hubs")
plot.igraph(gD, layout=lay,
vertex.size=authority.score(gD)$vector*15, # Rescaled by multiplying by 15
main="Authorities")
plot.igraph(g, layout=lay,
vertex.size=closeness(g)*1000, # Rescaled by multiplying by 1000
main="Closeness (X 1000)")
plot.igraph(g, layout=lay,
vertex.size=betweenness(g)*0.25, # Rescaled by multiplying by 25
main="Betweenness")
plot.igraph(g, layout=lay,
vertex.size=((ego_size(g, 2)-1)/(vcount(g)-1))*20, # Rescaled by multiplying by 10
main="Reach 2")
plot.igraph(g, layout=lay,
vertex.size=((ego_size(g, 2)-1)/(vcount(g)-1))*20, # Rescaled by multiplying by 10
main="Reach 3")