Practicum 8 - Structural Equavelence

Practicum Setup

Networks either explicitly, or implicitly represent the flow of resources, information, influence or similar factors through a set of entities. It naturally follows that an analyst may wish to track the potential flow of those factors through the network. It can be particularly helpful to identify the paths or entities that hold the potential to facilitate, impede, or otherwise broker the flow of such factors.

For a quick overview of why we would bother with such a thing, read pages 195 to 199 in our text.

This part is important. Please be sure to do this part again

Create a folder labeled “Practicum 8” someplace on your computer, such as your Desktop or wherever you will be able to easily find it again. Then, set your working directory to that folder in RStudio:

Session
- Set Working Directory
- Choose Directory…

Getting Started

This time, we will be using two new packages - blockmodeling and concoR - in addition to statnet. Although igraph does not offer a great deal of analyses for structural equivalence, it is exceptionally easy to use, and will plot networks faster and more easily than statnet.

We’ll, therefore, begin by loading igraph and statnet, and then go on to working with the other packages.

library(statnet)
library(igraph)

Next, load the data.

For the purpose of example, we will be using data from Padgett and Ansell’s work on the Medici family. To download the data, go to the Network Data link on the course website: https://goo.gl/3rkUK4. You will find the edgelist data (Padgett.csv) and the attribute data (mediciNames.csv) in at least one of the folders there.

For your own purposes, we strongly suggest that you try out any of the other networks that we have tried so far this semester. We are using Padgett’s Medici network here for two reasons: (1) you know it well at this point; and (2) is is a small enough network to be able visualize it well and clearly on a website.

el <- read.csv(file.choose(), header=FALSE)
g <- graph.data.frame(el, directed=FALSE) # load igraph data
net<-network(el, directed=FALSE, matrix.type="edgelist") # load statnet data

For now, our goals are modest. The topics to be covered in this module are:

Structural Equivalence
- Euclidean Distance
- CONCOR
- Blockmodel Optimization
Automorphic Equivalence (Both of the following options are the same as the ones above.)
- Blockmodel Optimization
- Euclidean Distance
Regular Equivalence
- REGE
- CATREGE
- Regular Equivalence Optimization

Structural Equivalence Methods

The first thing to keep in mind about structural equivalence is that your expectations of finding perfectly equivalent vertices should be low. Perfectly substitutable vertices are rare. So, we tend to mitigate our expectations and seek more roughly equivalent vertices.

CONCOR!

The CONCOR algorithm is a classic of social network analysis that was developed in the 1970s to identify structurally equivalent nodes. CONCOR stands for “convergence of iterate correlations” and has been implemented in R as concoR. This is an elegant method for calculating structural equivalence and remains surprisingly useful and fast to this day.

The original CONCOR algorithm was developed by Ron Breiger, Scott Boorman, and Phipps Arabie. If you are interested, you can check out their original (1975) paper: “An Algorithm for Clustering Relational Data with Applications to Social Network Analysis and Comparison with Multidimensional Scaling. Journal of Mathematical Psychology, 12: 328–383. The original version was written in FORTRAN. Since then, Adam Slez has rewritten the program in R. You can find out more by checking his website: Bad Hessian.

The main idea behind CONCOR is that we want to categorize nodes into sets or groups, depending on the similarities and dissimilarities in their patterns of ties to others in the network. To do so, we run correlations for row and column vectors of a binary matrix that represents the ties between nodes in a network. The initial pass of correlations will give us some idea of the similarities and differences between nodes in the network. But, if we repeatedly continue to run correlations on the resulting matrices, then the matrix should eventually converge into a series of +1 and -1 values. Those may then be used to group nodes by their similarity and dissimilarity.

For a better discussion, see pages 205 to 208 in the text.

Although it was developed with structural equivalence in mind, we tend to use CONCOR for equivalence in general, since we rarely expect to see true structural equivalence in a network.

Because Adam Slez has not committed his concoR package to CRAN, we will have to install it from his GitHub site. You will only have to do this once.

First, however, you will have to install devtools is you have not done so already. The devtools suite contains a lot of functions that make an r package developer’s task a little easier. To check that out, visit their GitHub page (https://github.com/r-lib/devtools) and scroll down for the description. In this case, we’ll use devtools to install a package that has not yet been published and resides only on GitHub (an online code repository).

install.packages("devtools", dependencies = TRUE)

Once you have devtools, load it into R and use the install_github function to download and install CONCOR. You will notice that the package name is written as "aslez/concoR". This directs the installer to visit the “aslez” repository (which is what Adam Slez named his repository on GitHub) and install the “concoR” package that he has written and committed to the repository (just for you - and anyone else who is interested).

# Load devtools
library(devtools)

# Install concoR
install_github("aslez/concoR")

You are now ready to use concoR.

We will be using concoR with both, statnet, and igraph. We will use statnet to draw the block diagram, and we’ll use igraph to draw the network (because it is easier in igraph). So, pay careful attention to where we are using g (for igraph network data) and net (for statnet network data). It matters.

CONCOR requires a matrix, or stack of matrices to make its calculations. So, start by loading concoR and extracting a matrix from an igraph network.

# load CONCOR
library(concoR)

# Convert the network to a matrix
mat <- as.matrix(get.adjacency(g))

If you like, take a look at how the various nodes are initially correlated. (Though, you can skip this step if you prefer. It won’t change the outcome in any way.)

m0 <- cor(mat) # Calculate a correlation matrix
round(m0, 2)   # Round the output to 2 significant digits

##           ACCIAIUOL ALBIZZI BARBADORI BISCHERI CASTELLAN GUADAGNI MEDICI
## ACCIAIUOL      1.00    0.54      0.68    -0.12     -0.12    -0.15  -0.20
## ALBIZZI        0.54    1.00      0.30     0.18     -0.23    -0.28  -0.37
## BARBADORI      0.68    0.30      1.00    -0.18     -0.18    -0.22  -0.29
## BISCHERI      -0.12    0.18     -0.18     1.00      0.59    -0.28  -0.37
## CASTELLAN     -0.12   -0.23     -0.18     0.59      1.00    -0.28  -0.04
## GUADAGNI      -0.15   -0.28     -0.22    -0.28     -0.28     1.00   0.15
## MEDICI        -0.20   -0.37     -0.29    -0.37     -0.04     0.15   1.00
## PAZZI         -0.07   -0.12     -0.10    -0.12     -0.12    -0.15   0.33
## PERUZZI       -0.12   -0.23      0.30     0.18      0.18     0.09  -0.37
## RIDOLFI        0.54    0.18      0.30     0.18      0.18     0.09  -0.04
## PUCCI         -0.07   -0.12     -0.10    -0.12     -0.12    -0.15  -0.20
## GINORI        -0.07   -0.12     -0.10    -0.12     -0.12     0.45   0.33
## STROZZI       -0.15   -0.28      0.22     0.09      0.09     0.00  -0.15
## LAMBERTES     -0.07    0.54     -0.10     0.54     -0.12    -0.15  -0.20
## TORNABUON      0.54    0.59      0.30     0.18     -0.23    -0.28  -0.04
## SALVIATI       0.68    0.30      0.43    -0.18     -0.18    -0.22  -0.29
##           PAZZI PERUZZI RIDOLFI PUCCI GINORI STROZZI LAMBERTES TORNABUON
## ACCIAIUOL -0.07   -0.12    0.54 -0.07  -0.07   -0.15     -0.07      0.54
## ALBIZZI   -0.12   -0.23    0.18 -0.12  -0.12   -0.28      0.54      0.59
## BARBADORI -0.10    0.30    0.30 -0.10  -0.10    0.22     -0.10      0.30
## BISCHERI  -0.12    0.18    0.18 -0.12  -0.12    0.09      0.54      0.18
## CASTELLAN -0.12    0.18    0.18 -0.12  -0.12    0.09     -0.12     -0.23
## GUADAGNI  -0.15    0.09    0.09 -0.15   0.45    0.00     -0.15     -0.28
## MEDICI     0.33   -0.37   -0.04 -0.20   0.33   -0.15     -0.20     -0.04
## PAZZI      1.00   -0.12   -0.12 -0.07  -0.07   -0.15     -0.07     -0.12
## PERUZZI   -0.12    1.00    0.18 -0.12  -0.12    0.46     -0.12     -0.23
## RIDOLFI   -0.12    0.18    1.00 -0.12  -0.12   -0.28     -0.12      0.18
## PUCCI     -0.07   -0.12   -0.12  1.00  -0.07   -0.15     -0.07     -0.12
## GINORI    -0.07   -0.12   -0.12 -0.07   1.00   -0.15     -0.07     -0.12
## STROZZI   -0.15    0.46   -0.28 -0.15  -0.15    1.00     -0.15      0.09
## LAMBERTES -0.07   -0.12   -0.12 -0.07  -0.07   -0.15      1.00      0.54
## TORNABUON -0.12   -0.23    0.18 -0.12  -0.12    0.09      0.54      1.00
## SALVIATI  -0.10   -0.18    0.30 -0.10  -0.10   -0.22     -0.10      0.30
##           SALVIATI
## ACCIAIUOL     0.68
## ALBIZZI       0.30
## BARBADORI     0.43
## BISCHERI     -0.18
## CASTELLAN    -0.18
## GUADAGNI     -0.22
## MEDICI       -0.29
## PAZZI        -0.10
## PERUZZI      -0.18
## RIDOLFI       0.30
## PUCCI        -0.10
## GINORI       -0.10
## STROZZI      -0.22
## LAMBERTES    -0.10
## TORNABUON     0.30
## SALVIATI      1.00

The more highly correlated nodes should already be somewhat apparent. But, you will have to sort through a lot of correlations to work out which nodes belong in which equivalence classes if you stop at this point.

Next, run the concoR algorithm to identify the various structural equivalence “blocks.”

If you check the help section for ?concor_hca, which we recommend, then you will notice that there are only four arguments: the data object (m0), a cutoff to tell the program when it has reached a finishing point, the maximum number of times that it should rerun correlations (max.iter), and the number of partitions that you wish to reveal in the network (p).

In practice, you may find it necessary to adjust some or all of these arguments. The cutoff and maximum number of iterations are used when CONCOR doesn’t seem to be able to come to a solution (which we refer to as achieving “convergence”). You may need to give the function a little more opportunity to work by increasing the number of iterations that it can run. If that doesn’t work, then you may need to lower your definition of what “highly correlated” means. The cutoff is set to 0.999 by default. Once correlations reach an absolute value that matches the cutoff, the algorithm will cease and present the results. To start, you will generally leave the defaults as they are and not adjust them unless the algorithm fails to converge.

The number of partitions will likely be what you will spend the most time adjusting. CONCOR requires a priori specification of the number of groups you wish to identify. Feel free to try out a few different values. By default, concoR will seek one partition (two groups).

concor_hca(m0, cutoff = 0.999, 
           max.iter = 25, p = 1)

One last functional note: The list() command in the script for concor_hca, below, is necessary if you are using only one matrix. This function was developed to work with arrays (lists of matrices), so it doesn’t play well with single matrices without this command.

After some experimentation, we elected to create two partitions (p=2), resulting in four equivalence groups. (Partitions bifurcate each of the existing groups.)

blks <- concor_hca(list(mat),
                   max.iter=100, 
                   p = 2) 
blks

##    block    vertex
## 1      1 ACCIAIUOL
## 5      2   ALBIZZI
## 2      1 BARBADORI
## 6      2  BISCHERI
## 9      3 CASTELLAN
## 13     4  GUADAGNI
## 14     4    MEDICI
## 15     4     PAZZI
## 10     3   PERUZZI
## 3      1   RIDOLFI
## 11     3     PUCCI
## 16     4    GINORI
## 12     3   STROZZI
## 7      2 LAMBERTES
## 8      2 TORNABUON
## 4      1  SALVIATI

The output gives the vertex names, as well as the “blocks” or classes that each vertex was classified into. We can visualize this information in one of two ways: as a block matrix, or as a network visualization. We’ll do each below.

First, we can plot the network in statnet, using the blockmodel function. Note: We are using network data that was formatted for statnet here. (net)

If you check the help section (?blockmodel), the defaults for this function are as follows.

blockmodel(dat, ec, k=NULL, h=NULL, 
           block.content="density", 
           plabels=NULL, glabels=NULL, 
           rlabels=NULL, mode="digraph", 
           diag=FALSE)

The main - mandatory - arguments that the function takes are for the network (dat) and a data object that identifies equivalence classes (ec). The equivalence classes that this function will be able to use should result from one of the various structural equivalence functions in statnet. Other arguments include block.content, which may be used to emulate generalized blockmodeling procedures; and the options to select the number of classes to form (k) or where to split classes in a dendrogram (h). Labels may also be imported for the purpose of identification and visualization. These include labels to identify nodes (plabels), to identify individual networks when multiple matrices are entered (glabels), or labels for the various roles that are identified (rlabels).

It should be mentioned that the blockmodel function assumes that the network is directed (mode="digraph"). In the script below, we identify the Medici network that we loaded earlier. It is an undirected network, so we set mode to “graph.” For equivalence classes, we’ll use the classes that were derived using concor_hca. Using the ls() function to examine the object that we created using CONCOR (ls(blks)), you will see that the object contains “block” and “vertex” ID information. We will use both. The blocks are used for equivalency classes and we have supplied a title for the network using the glabels argument. The vertex labels were initially used to import “plabels”, but we determined that they made the output a little messy, so that has been suppressed (commented out: #) in the script below.

# Visualize using "blockmodel" function in statnet
#    NOTE: use statnet "net" data
blk_mod <- blockmodel(net, 
                      ec=blks$block,
                      mode="graph",
                   #  plabels=blks$vertex,
                      glabels="Medici CONCOR Equivalence") 
blk_mod

## 
## Network Blockmodel:
## 
## Block membership:
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 
##  1  2  1  2  3  4  4  4  3  1  3  4  3  2  2  1 
## 
## Reduced form blockmodel:
## 
##   Medici CONCOR Equivalence 
##         Block 1   Block 2   Block 3   Block 4
## Block 1  0.0000 0.0625000 0.3125000 0.0625000
## Block 2  0.0625 0.1666667 0.3750000 0.1875000
## Block 3  0.3125 0.3750000 0.3333333 0.0000000
## Block 4  0.0625 0.1875000 0.0000000 0.1666667

To look into the blk_mod object, use the ls() function: ls(blk_mod). There, you will see, among other items of output, “block.model”. That is the reduced form blockmodel information, presented above.

The “reduced form blockmodel” presents the density of each block in the matrix. We use this to help determine how well this partitioning format “fits” the network we are analyzing. In a best-case scenario, a good fit will have only very dense (approaching 1.0) and very sparse (approaching 0) blocks. In practice, we make do with the best we can get.

To make it more legible, try rounding the reduced form blockmodel with block densities to just two digits using the round function.

round(blk_mod$block.model, 2)

##         Block 1 Block 2 Block 3 Block 4
## Block 1    0.00    0.06    0.31    0.06
## Block 2    0.06    0.17    0.38    0.19
## Block 3    0.31    0.38    0.33    0.00
## Block 4    0.06    0.19    0.00    0.17

Here, you can see that most of the relatively dense blocks are off the diagonal. These are peripheral roles (blocks 1 & 2) - linked to the core (block 3). Block 4 is a core without much of a periphery.

Plot the blockmodel representation for a better idea of how the vertices are ordered. We cannot fit the vertex names onto the block matrix without them running together, so we’ll use vertex numbers instead for visual clarity.

plot(blk_mod)

To interpret this, you can use the vertex IDs that were contained in the CONCOR output but not used in the script above. Here, we convert the vertex names into a data frame in order to number them.

For the sake of reference, the Medicis (node #7) are in the forth block on the diagonal.

as.data.frame(blks$vertex)

##    blks$vertex
## 1    ACCIAIUOL
## 2      ALBIZZI
## 3    BARBADORI
## 4     BISCHERI
## 5    CASTELLAN
## 6     GUADAGNI
## 7       MEDICI
## 8        PAZZI
## 9      PERUZZI
## 10     RIDOLFI
## 11       PUCCI
## 12      GINORI
## 13     STROZZI
## 14   LAMBERTES
## 15   TORNABUON
## 16    SALVIATI

The CONCOR package is agnostic as to how we use the output that it creates. We can, therefore, visualize the output using igraph as well. Given that you are more used to visualizing in igraph at this point, that should be helpful.

As in the past, we start by adding the output to the network object as a vertex attribute named “blocks” (V(g)$blocks). We can then, color the nodes in the visualization by which equivalence group contains each vertex.

# Add the block designations to igraph data
V(g)$blocks <- blks$block

plot.igraph(g, vertex.color=V(g)$blocks) # plot in igraph

If you consider the visualization above, it should be apparent that the Medicis are not necessarily being placed in a class with other families that provides a satisfying description of their role in the network. Unfortunately, an additional partition, would result in eight classes and would not necessarily be better. But, try both and see what you think.

CONCOR is powerful, but tends to be - literally - categorical in its assignments, even when such a thing is not necessarily advisable. After all, the application of structural equivalence to a real world network is more of an exercise in mitigating ambiguity. We, therefore, have other ways of applying the concept to a network. One such tool is the use of blockmodels, which produce output like the block diagram above.

Optimization

For a little more control over the number of equivalence classes, you may want to also try network optimization. Optimization is a sorting approach that is meant to do much the same thing as CONCOR, though in a very different manner. The details are covered better on pages 208 to 210 of the text. But, we cover it briefly below as well.

The blockmodeling package for R was written by Aleš Žiberna, a student of Vlado Batagelj (of Pajek fame). You will need to install the package, but it is well worth the time.

install.packages("blockmodeling", dependencies=TRUE)

The “optimization” approach requires you to specify (a priori) some number of random partitions and require the algorithm to re-sort the network to a point where the various blocks contain a best fit to the network. There are a number of possible options for soting the network under the approach command. Here, we use the sum of squares methods. Also, the blocks command asks the algorithm to find “complete” blocks (all 1s, and no 0s) if possible. Try it with and without this.

“Goodness of fit” is used to determine the optimal block memberships. The algorithm does this be comparing each block with its “ideal.” Each deviation (e.g., a 0 where a 1 should be) is penalized.

For more - and much better - information on your options with this algorithm, and generalized blockmodeling in general, check out his publicaiton on the topic: ŽIBERNA, Aleš (2007): Generalized Blockmodeling of Valued Networks. Social Networks, Jan. 2007, vol. 29, no. 1, 105-126.

There are a lot of arguments (options) available in the optRandomParC function for optimizing block partitions. Here, we are only using six. A full treatment on all the arguments would require a very different format. Instead, we’ll cover just these six to bring you to a functional level. Further study is required to use this function well and fluently.

A very good place to start is the definitive text on the topic: Generalized Blockmodeling by Doreian, Batagelj, and Ferligoj. You can view the front matter here. Vlado has also written a much shorter version for use with his program, Pajek. Though, it gets the concepts across very well and will help the reader to get the basics of blockmodeling.

The options that we focus on here are as follows:

(Remember, if you would like to look at the defaults for this function, you will need to load the package first. (library(blockmodeling))

optRandomParC(M, k, rep, 
              approaches = "hom", 
              homFun = "ss", 
              blocks="com")

This function requires the input data (M) to be in matrix format, so that will be our first order of business in running it. The function also needs to know how many equivalence classes (k) to find, and how many different starting partitions to check (rep). The rep argument is usually set much higher to give the algorithm a chance to improve fit (generally $>1000$). Here, we’ll just use 10 for the sake of demonstration (also, this is a small network).

The three remaining arguments help the algorithm to select how the classes should be found. The approaches argument identifies the mothod for calculating the “goodness of fit” and can take a few different options. It is easiest to think of the various approaches in terms of how each defines a complete block (which, for binary networks, should be full of ones). Homogeniety blockmodeling ("hom") offers a multi-purpose tool by defining “complete” blocks as those where all values in the block are equal in value. For binary blockmodeling ("bin"), complete blocks must be filled with ones. In valued blockmodeling ("val"), the values in complete blocks must each be at least some prestated value (by default, the mean) In the sum of squares homogenity blockmodeling ("ss") approach, values are matched according to the block mean and penalties (for poor fit / deviations from the ideal) are assessed as the sum of squared differences from the mean ($ss(x)= \sum(x_i - \bar{x})^2$). In absolute deviations homogenity blockmodeling ("ad"), values in a block are matched according to the median and penalties are assessed according to the sum of the absolute value of differences between each value in the block and the median for that block ($ad(x)= \sum \mid x_i - MEDIAN(x) \mid$).

The homFun is used only with homogeniety blockmodeling (approaches="hom"), and establishes how the fit of each block is assessed. The two choices are to use the sum of squares ("ss") approach, or the absolute deviations ("ad") approach. In each case, the block penalties used to assess “goodness of fit” are assessed in the same manner as their corresponding approaches.

Because this is a generalized blockmodeling technique, it is possible to set one block type for the entire set of blocks, or assign an expected block type to each block. The block argument can, therefore, take either a single type, or a vector of types. There are twelve options from which you may choose. But, without a more complee discussion of generalized blockmodeling, introducing each in this context would be useless. Additionally, the definition of each changess, depending on whether the analysis is using valued, binary, or homogineity blockmodeling. For our purposes, we will stick with block="com" since we are using homoginiety blockmodeling. This way, the algorithm will seek blocks that are either empty, or complete.

If you wish to know more about the full capacities of this package, it is a very good idea to consider the readings above as well as Žiberna’s description in his original article: Generalized blockmodeling of valued networks.

library(blockmodeling)

mat <- as.matrix(get.adjacency(g)) # Extract the matrix in igraph (if you haven't already)

    # Try a two block partition.
class2 <- optRandomParC(M=mat, k=2, rep=10, 
                        approaches = "hom", 
                        homFun = "ss", 
                        blocks="com")
    # Tru a four block partition
class4 <- optRandomParC(M=mat, k=4, rep=10, 
                        approaches = "hom", 
                        homFun = "ss", 
                        blocks="com")

When you run this function, you will see the various partitioning attempts along with their corresponding error (“goodness/badness of fit”). The starting partition tells how the nodes in the network are initially, randomly, classified. The final partition is what results from the optimization process. The function will automatically select the final partition that has the lowest final error as the optimal solution.

To play around with this a little, try changing homFun to "ad", or try increasing the number of repetitions to see what happens. (Do this when you may not need the full computing capacity of your computer for a while. More repetitions take longer.)

A block visualization may be used to view the output from this run. We can plot the two solutions side-by-side to compare them.

par(mfrow=c(1,2)) # set the plot window for one row and two columns
plot(class2, main="")
  title("Two Block Partition")
plot(class4, main="") 
  title("Four Block Partition")
par(mfrow=c(1,1)) # reset the plot window back to one row and one column

Explore the output a little. Each of the objects that were created in these blockmodels contains a great deal of information and much of it is nested.

# See what is packed into the outpute object
ls(class4)

# You will see:
# [1] "best"    "call"    "checked.par"   "err"   "initial.param" "M"   "nIter"     
#
class4$best  #  Look inside "best"

# The best partition can therefore be found in: 

class4$best$best1$clu

# The error matrix can be found in:

class4$best$best1$EM

# Plot the error matrix using: 

plot(class4$best$best1$EM)

We can use the partition the same way we did in concoR. Add the partition to the network and plot it in igraph.

# Add the block designations to igraph data
V(g)$opt.blocks <- class4$best$best1$clu


plot.igraph(g, vertex.color=V(g)$opt.blocks) # plot in igraph

Euclidean Distance

Euclidean distance is a method of calculating similarities by comparing common distances from some nodes to other nodes. It may be used for either structural equivalence or automorphic equivalence. To run this analysis, we’ll use the equiv.clust function in statnet.

Before we begin, however, it is important to detach the blockmodeling package. The two do not play well together.

# BEWARE: The blockmodeling package interferes
#         with this function.
#         Detach blockmodeling before you begin. 
detach("package:blockmodeling", unload=TRUE)

If you check the help section for this function (?equiv.clust), then you will see the following arguments and their defaults.

The main - mandatory - arguments that we will want to consider in this example are the network (dat) and the type of distance to be used (method). The distance calculation defaults to method="hamming", which uses Hamming to calculate structurally equivalent classes. Other options for method include Pearson’s product moment correlation ("correlation"), euclidean distance ("euclidean"), and a nonparametric form of correlation called gamma correlation ("gamma"). It is also usually a good idea to tell statnet whether the network is directed ("digraph") or undirected ("graph"), as it defaults to mode="graph" and will not sense an undirected network on its own.

Other arguments include an alternate method for calculating equivalence categories (equiv.fun), which defaults to sedist, which, in turn makes use of the distance measure specified in method as listed above. If a distance matrix has, somehow, already been calculated, then it may be used instead of a specified method by using the equiv.dist argument. By default, the clustering method is set to cluster.method="complete", which finds similar nodes for classes. Other options for clustering method are listed in the help section of the hclust function. See: ?hclust.

Other than that, the function defaults to ignoring the diagonal ("diag=FALSE") and offers the opportunity for labels to be imported for the purpose of identification and visualization. These include labels to identify nodes (plabels), and to identify individual networks when multiple matrices are entered (glabels).

equiv.clust(dat, g=NULL, equiv.dist=NULL, 
            equiv.fun="sedist", method="hamming", 
            mode="digraph", diag=FALSE, 
            cluster.method="complete", 
            glabels=NULL, plabels=NULL, ...)

To classify nodes in the Medici network according to their Euclidean distance from one another, we need only a few of the options listed above.

For the blockmodel visualization, see the definitions listed above in the CONCOR section.

## IN STATNET ## 
#Cluster based on structural equivalence
eq<-equiv.clust(net,
                method="euclidean",
                mode="graph")

The equiv.clust function creates a set of possible clusters and may be depicted as a dendrogram using plot. To see the dendrogram, run the plot. To make the dendrogram easier to read, you will want to add in the labels that are embedded in the eq object that you just produced ("eq$labels").

plot(eq, 
     labels=eq$glabels)

Looking at the dendrogram, you can see that the Medicis are essentially in a class by themselves. To use the dendrogram as a guide, you should substitute h= for k= in the script below. If you use the h= argument, then you will be specifying where to cut the dendrogram. Using the k= argument, you are stating how many groups to find. Either one works. Use your personal preference.

If we cut the dendrogram at level four (using h=4 in the script below, rather than k=4), then we’ll have two classes: the Medicis and everyone else. If we lower the cut threshold to around h=3 (essentially drawing a horizontal line at 3), then there will be four equivalence classes: the Medicis; the Guadagnis, and two other classes.

# Form a blockmodel 
b<-blockmodel(net, eq, 
              k=4,    # find a four class solution
              mode = "graph") 
plot(b)      # Plot it

This provides a pretty good depiction of classes in the network, if you can read it. What you should be taking away from this and the visualizaiton below is that Euclidean distance identifies the top two best-connected nodes in the network as each being in classes by themselves. Additionally, we can see the four families at one end of the network (Strozzi, Castellan, Bischeri, and Peruzzi) were classified as all being in the same class, as were the sattelite families around the Medicis and Guadagnis.

It may help to visualize the results as a plotted network. This time, we’ll do it in statnet. But first, we’ll need to extract the membership vector.

If you check out the contents of the data object that we created using the blockmodel function in statnet (ls(b)), then you will see eleven types of information stored in there. Among those are “block.membership” and “order.vector.” This is because the block membership vector is stored out of order and will have to be sorted into its appropriate order before we can use it.

To reorder the block membership vector, you can use the order function from base R. All we need to do is identify what needs to be sorted (the block.membership in the data object named b) and, in square brackets, tell the order function what to use as a sorting order (the order.vector in the data object named b). We then save the sorted vector in an object named “classes,” which we will then use to color the nodes (vertex.col=classes).

classes <- b$block.membership[order(b$order.vector)]

#  use that to visualize using statnet
plot(net,              
     vertex.col=classes, 
     displaylabels=TRUE, 
     vertex.cex=2)

Plotting in statnet is nothing like plotting networks in igraph. You will note, in the script above, that statnet will display labels in the network object (displaylabels). You can adjust the size of the vertices using vertex.cex. In this case, the choice of vertex.cex=2 was arbitrary. It just looked better.

Try fiddling with the vertex sizes. Instead of entering an integer like 2, try something like: vertex.cex=degree(net); vertex.cex=evcent(net)*3; or vertex.cex=sqrt(betweenness(net)). Notice that we are multiplying the eigenvector centrality by 3 (evcent(net)*3) and taking the square root of betweenness (sqrt(betweenness(net))). That is because eigenvector centrality typically produces smaller values and betweenness produces larger values. We are using those mathematical adjustments to make the visualization easier to see. Try playing around with it to vary node sizes to your liking.

Regular Equivalence

REGE

REGE is actually a set of algorithms that compute similarities - or dissimilarities - between vertices that equate to regular equivalence. If you check the help section for REGE (?REGE), you will see that there are six variations of the REGE algorithm that are implemented in the blockmodeling package, each slightly modified to find regularly equivalent nodes in a slightly different manner from the others. The version we will be using is the classic version that was first designed and written ((in FORTRAN)[http://eclectic.ss.uci.edu/~drwhite/REGGE/REGGE.FOR]) in 1983 by Doug White. You can (read the initial paper here)[http://eclectic.ss.uci.edu/~drwhite/pub/whitereitz.pdf].

The REGE function searches for homomorphisms (similar patterns within the networks) through an iterative optimization process. The result is a matrix of similarities (or dissimilarities) that can then be optimized into block partitions using For a discussion of how REGE does this, see pages 199 and 214-216 in the text.

The program, as it is implemented in R, defaults to stopping once the optimization process ceases to produce changes in equivalency measures (until.change = TRUE) and it also pays attention to the diagonal (use.diag = TRUE). REGE requires a matrix (M) rather than a network object for igraph or statnet. The other defaults, initial dissimilarity value (E=1) and the number of iterations to be performed (iter=3) may be adjusted if the defaults fail to produce measures of similarities or differences between nodes.

The original REGE function does not work as well with binary networks as it does with valued ties in a network. As you will see, below, this can produce arbitrary results. Try the script below for classic REGE, and then change the initial dissimilarity value to two (E=2). It should produce different results, though they will still be arbitrary. For this reason, we also use a FORTRAN version of REGE (REGE.OWNM.for) that is able to handle the binary matrix. When you run each, you will see a difference.

REGE(M, E = 1, iter = 3, 
     until.change = TRUE, 
     use.diag = TRUE)

If you haven’t done so already, load the blockmodeling package. Then, get the matrix representation of the network using igraph.

library(blockmodeling) # Reload the blockmodeling package

mat <- as.matrix(get.adjacency(g)) # Extract the matrix in igraph (if you haven't already)

First, we will run the classic version of REGE.

Recall that, in either case, REGE is just producing a matrix of similarities or differences. In actuality, the function will produce several different types of output. But, the output that we actually need is the (dis)similarity matrix (referred to as E in the output), so that we can then refine the results with a blockmodeling or other clustering approach. You will see that when we run the function, we ask only for that one piece of output by appending $E to the outside of the parentheses on the function. That will call only the output labeled “E” to save in our data object. Try running the function without $E appended and then check the contents of the object into which you put it (in this case “REGE.reg”). You will see the various forms of output there.

REGE.reg<-REGE(M=mat)$E

At this point, check the REGE.reg object. You will see that there is no variation. That is because REGE could see no differences or similarities.

If we optimize this matrix using the optRandomParC function, as we did in the optimization section, then any class (block) assignments will be arbitrary. The script for doing so is below. Try it out and feel free to change the number of classes that it is seeking. Also look at the plot, for comparison later.

REGE.part1 <- optRandomParC(M=REGE.reg, 
                           k=3,        # look for 3 classes
                           rep=10,
                           approaches = "hom",
                           homFun = "ss",
                           blocks="com")
plot(REGE.part1)

Now, let’s switch to another variation of REGE that is better suited to binary networks.

# Try another variation of REGE
REGE.cat<-REGE.ownm.for(M=mat)$E 

REGE.part2 <- optRandomParC(M=REGE.cat, 
                           k=6,        # look for 6 classes
                           rep=10,
                           approaches = "hom",
                           homFun = "ss",
                           blocks="com")

## 
## 
## Starting optimization of the partiton 1 of 10 partitions.
## Starting partition: 2 6 3 2 2 5 2 2 2 2 2 1 2 2 2 4 
## Final error: 0.9973979 
## Final partition:    5 1 1 3 3 6 2 4 3 3 4 5 6 5 3 2 
## 
## 
## Starting optimization of the partiton 2 of 10 partitions.
## Starting partition: 4 3 5 1 4 2 4 5 2 1 4 4 5 1 2 6 
## Final error: 0.9973979 
## Final partition:    4 6 6 1 1 5 2 3 1 1 3 4 5 4 1 2 
## 
## 
## Starting optimization of the partiton 3 of 10 partitions.
## Starting partition: 2 6 1 1 1 1 1 1 3 1 1 5 1 1 1 4 
## Final error: 0.9973979 
## Final partition:    6 4 4 2 2 3 1 5 2 2 5 6 3 6 2 1 
## 
## 
## Starting optimization of the partiton 4 of 10 partitions.
## Starting partition: 2 5 5 3 6 1 6 2 1 4 3 1 5 4 4 2 
## Final error: 0.9132304 
## Final partition:    1 5 6 5 5 3 4 2 5 5 2 1 3 1 5 6 
## 
## 
## Starting optimization of the partiton 5 of 10 partitions.
## Starting partition: 4 4 3 1 2 5 3 3 6 5 1 4 3 4 1 3 
## Final error: 0.9973979 
## Final partition:    5 3 3 6 6 2 1 4 6 6 4 5 2 5 6 1 
## 
## 
## Starting optimization of the partiton 6 of 10 partitions.
## Starting partition: 2 2 4 6 1 5 1 5 2 6 1 4 3 3 5 2 
## Final error: 0.9132304 
## Final partition:    3 6 5 6 6 4 2 1 6 6 1 3 4 3 6 5 
## 
## 
## Starting optimization of the partiton 7 of 10 partitions.
## Starting partition: 2 2 2 2 2 2 1 2 2 2 5 2 3 6 2 4 
## Final error: 0.9973979 
## Final partition:    2 5 5 6 6 3 1 4 6 6 4 2 3 2 6 1 
## 
## 
## Starting optimization of the partiton 8 of 10 partitions.
## Starting partition: 4 3 1 5 2 6 6 5 6 2 3 1 5 3 2 4 
## Final error: 0.9728864 
## Final partition:    3 5 5 2 2 5 6 1 2 2 4 3 5 3 2 6 
## 
## 
## Starting optimization of the partiton 9 of 10 partitions.
## Starting partition: 3 1 6 2 1 4 6 4 3 2 1 5 6 5 4 2 
## Final error: 1.154139 
## Final partition:    5 3 3 6 6 3 2 1 6 4 1 5 3 5 4 2 
## 
## 
## Starting optimization of the partiton 10 of 10 partitions.
## Starting partition: 6 2 3 1 3 5 3 6 4 2 1 1 5 4 4 6 
## Final error: 0.9132304 
## Final partition:    6 2 5 2 2 3 4 1 2 2 1 6 3 6 2 5 
## 
## 
## Optimization of all partitions completed
## 1 solution(s) with minimal error = 0.9132304 found.

plot(REGE.part2)

The block matrix that you see above was not chosen arbitrarily. Rather, the six class partition was selected after first trying two classes, then three, etc. Once we saw a solution that seemed to be relatively more homogenous on the diagonal and made sense in the context of this network, we stopped increasing the number of classes and used that solution.

As you can see, the Medici family, with six ties, is again in a class by itself. This time, however, Guadagni and Strozzi share a class. They each have four undirected ties. Similarly, all the pendants that are attached to better connected nodes are in the same class; as are the Pucci and Pazzi families, the most poorly connected nodes. In other words, these nodes share structural similarities within their classes.

Structural similarities (regular equivalencies) will seldom be so easy to spot in a network visualization as they are here - especially in directed networks. So, the REGE algorithm can be very helpful for spotting structural similarities that are otherwise difficult to discern.

For the sake of easier interpretation, let’s visualize this solution - just for kicks and giggles…

V(g)$REGE.class <- REGE.part2$best$best1$clu

plot.igraph(g, vertex.color=V(g)$REGE.class)

Algebraic Approach

A close examination of the REGE otuput above will reveal that many of the structural regularities that the algorithm found could also be revealed using degree centrality - many, but certainly not all. The point is that at at least some level, centrality measures - or combinations thereof - will reveal regularities within the network. This implies that we can use an algebraic approach to understanding or identifying regularly equivalent roles and positions within a network.

By this point, you have seen the chart below many times. It is being included, not because I necessarily want you to simply apply these measures as they are, but because they are a great example of an early attempt to describe certain roles or positions within a network.

Take a look at the positions and their descriptions. In each case, some analyst took a moment to think through what sort of connections would describe various roles. This, of course, does not mean that these will always work brilliantly. For example, the CORE Lab at the Naval Postgraduate School found that emergent leaders may actually be more closely associated with closeness centrality alone. That does not mean that the emergent leader description above is necessarily wrong, but maybe just mislabeled.

Give this some thought on your own. The better your knowledge of centrality and other node-level measures (e.g., number of cliques, core location, etc.) the better you can create your own meaningful descriptions of roles and positions in a particular context.

Deliverable

Let’s get some more use out of the movie network.

Assignment Instructions
For this practicum, I want you to analyze the movie network. Which characters in the Holy Grail are structurally equivalent? In other words, which characters occupy the same, or similar roles or positions in the movie network?

The movie data are in matrix format, whereas the example data used for this page are in edgelist format. So, you should consult the page on importing the movie network data to load that network.

Try out as few or as many of the above approaches as you think are necessary to produce meaningful results. For the deliverable:

Describe the algorithm you used to produce the solution you present.
Tell why you selected that algorithm.
Describe what you see.
- Who is equivalent? (You don’t have to name them all. Just convey the idea.)
- Why does this make sense in this context?
- Why do you trust your answer? (Think “goodness of fit.”)

Good luck!