Cohesive Subgroups in R

Practicum Setup

Networks either explicitly, or implicitly represent the flow of resources, information, influence or similar factors through a set of entities. It naturally follows that an analyst may wish to track the potential flow of those factors through the network. It can be particularly helpful to identify the paths or entities that hold the potential to facilitate, impede, or otherwise broker the flow of such factors.

We will be using data from Zachary’s Karate Club network. To download the example data, go to the Network Data link on the course website. You will find the edgelist data Zachary.csv there.

This part is important. Please be sure to do this part again

Create a folder labeled “Cohesive Subgroups” someplace on your computer, such as your Desktop or wherever you will be able to easily find it again. Then, set your working directory to that folder in RStudio:

Session
- Set Working Directory
- Choose Directory…

Getting Started

We are adding to our toolset in R

As we move into more involved algorithms, we cannot continue to use just one analytic package. We are therefore going to be availing ourselves of Carter Butts’ “sna” package from this point to the end of the course. Actually, Butts first wrote “network” and “sna,” but has since subsumed those packages into the larger suite called “statnet.” We will, therefore, be referring to this package as “statnet” in this and all following practica.

If you have not done so at this point, go ahead and install statnet.

install.packages("statnet", dependencies=TRUE)

If you have already installed statnet, but it has been a while, you may update it to the newest version as well.

update.packages("statnet")

For more information on installing or updating statnet, feel free to visit their webpage on the topic.

Alternatives for loading packages

If you like, you may choose to start by loading igraph and statnet at the beginning.

library(igraph)
library(statnet)

If you have elected to load both packages at the outset, you may have noticed a warning that was generated after you loaded the second package.

Because both packages are designed for use in social network analysis, they often give their functions the same names. Although this makes perfect sense, it is often cause for some confusion in R. You will, therefore, see a warning when you load both packages together: “The following objects are masked from ‘package:igraph’:”

The packages are “masked,” because the sna package, which is integral to statnet, has eleven functions with the same names as those used in igraph. (betweenness, bonpow, closeness, components, degree, dyad.census, evcent, hierarchy, is.connected, neighborhood, triad.census) Since these functions are “masked” from igraph, R will default to running the sna version of the functions listed above, unless you give it more direction. In other words, you have a workaround: tell R which package you intend to use when you write your script.

The solution to the masking problem is fairly simple. Just identify the package that you intend to use at the time. So if, for example, you want to calculate betweenness for a network that you loaded in igraph, just enter something along the lines of igraph::betweenness(g), or bet <- igraph::betweenness(g). Entering the package name, followed by double colons (sna::, or igraph::) simply tells R where to look for the function you need. This is something that come in very handy if you are using a mix of packages in one script. You don’t have to load each package separately if you do not so wish.

One last caveat: Although we will be loading statnet from this point on, most of the basic network calculations that you have made up to this point are done using sna, which is a part of the statnet suite. We, therefore, need to identify sna as the package that we are using in some cases, and other packages in other cases. When in doubt about which package actually contains the function that you wish to use, just check it out by using the help function: ?degree. In the upper left-hand side of the help document, you will see the function name, followed by the package where it resides (degree {sna}).

Convention for loading data

When working with more than one analytic package in R, it is easy to create confusion. If, for example, you try to use a network that was loaded in statnet to make calculations in igraph, igraph will give you a warning: Error in igraph::degree(net) : Not a graph object

If you forget that you loaded the network in statnet, or if you forgot to convert the network to an igraph object, this can be fairly confusing. To help avoid such unnecessary confusion, we’ll use the following convention to identify whether a network has been loaded in igraph or statnet. All networks that are loaded using the statnet suite will use “net”, and all networks loaded in igraph will use “g.”

Using this convention, we may now load the network both ways.

You can downoad the example data here

el <- read.csv(file.choose(), header = FALSE)           # Read in the edgelist

g <- graph.data.frame(el, directed = TRUE)              # Load with igraph

net<-network(el, directed=TRUE, matrix.type="edgelist") # Load with network (for sna)

If you are not starting from an edgelist or matrix, then you still have the option of converting a network data object from igraph to sna, or from sna to igraph, using the intergraph package. If you would like to briefly explore your options with intergraph, visit this short tutorial in using its functions.

Practicum 4 already introduced how to decipher the information about an igraph network object. The sna package is a little more straightforward. Type net into the R console to get the details on the network object that sna uses.

net

##  Network attributes:
##   vertices = 34 
##   directed = TRUE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   bipartite = FALSE 
##   total edges= 155 
##     missing edges= 0 
##     non-missing edges= 155 
## 
##  Vertex attribute names: 
##     vertex.names 
## 
## No edge attributes

This was actually created in a package called “network.” The summary tells us that the network has 34 nodes, directed ties, no loops, etc. Unlike igraph, the default summary does not list the actual edges. To see the edgelist, along with the above information, use summary(net).

For more information on how to read the summary, you can check out the original (2008) paper introducing the network package.: http://www.jstatsoft.org/v24/i02/.

Community Structure Analyses

There are a lot of ways to operationalize community structure. The simplest, and most straightforward is to simply establish what components there are in the network, so that one may then focus only on the largest - or main - component.

Components

Components in igraph

Running component identification and extraction in igraph is particularly simple. The function name is “decompose” and it isn’t as gross as it sounds. As the name implies, the function will decompose a network into a list of subnetworks that represent all of either, the strong components, or the weak components, contained within the network.

# ?decompose    # for more information on options
dg <- decompose(g)
dg              # Check out the subnetworks, if any.

## [[1]]
## IGRAPH f5cda7f DN-- 34 155 -- 
## + attr: name (v/c)
## + edges from f5cda7f (vertex names):
##  [1] 1 ->3  1 ->4  1 ->5  1 ->6  1 ->7  1 ->8  1 ->9  1 ->11 1 ->12 1 ->13
## [11] 1 ->14 1 ->18 1 ->20 1 ->22 1 ->32 2 ->1  2 ->3  2 ->4  2 ->8  2 ->14
## [21] 2 ->18 2 ->20 2 ->22 2 ->31 3 ->1  3 ->2  3 ->4  3 ->8  3 ->9  3 ->10
## [31] 3 ->14 3 ->28 3 ->29 3 ->33 4 ->1  4 ->2  4 ->3  4 ->8  4 ->13 4 ->14
## [41] 5 ->1  5 ->7  5 ->11 6 ->1  6 ->7  6 ->11 6 ->17 7 ->1  7 ->5  7 ->6 
## [51] 7 ->17 8 ->1  8 ->2  8 ->3  8 ->4  9 ->1  9 ->3  9 ->31 9 ->33 9 ->34
## [61] 10->3  10->34 11->1  11->5  11->6  12->1  13->1  13->4  14->1  14->2 
## [71] 14->3  14->4  14->34 15->33 15->34 16->33 16->34 17->6  17->7  18->1 
## + ... omitted several edges

The Karate Club network has only one component. So, this example won’t offer a particularly interesting result. But, hopefully, you get the idea.

If there are multiple components contained within the network, then the object you create, as we did in the example above (dg), will contain a list of those components. You can simply type in the name of the object (dg) to see them. In many cases, these lists can be quite long. So, you can consider the edgelist for any one of them by referencing them using double square brackets, like so: dg[[1]]

Another way of limiting the number of components that you extract is to limit the size of components that you consider to be meaningful. This is especially helpful if the network is particularly large, or contains a lot of isolates. To do so, set the minimum number of vertices to some number greater than two. For more information on this and other options, check out the help section: ?decompose.

dg <- decompose(g, mode="weak", 
                min.vertices = 5)

You’ll notice, above, that the subnetworks that are pulled using the decompose function are stored as edgelists. That means that you may use them as igraph objects just as they are. Use square brackets as we discuss above to identify which component you would like to plot.

plot.igraph(dg[[1]])   # Plot the largest one

Another option you will want to consider is saving the component you are interested in analyzing under a new name so that you can focus on it without constantly having to reference the network decomposition.

Note the double square brackets [[]] in the code below. They are used to reference one or more specific elements in a vector of elements. If there had been more than one component in the object we named dg, then each would be stored inside the dg object and they would be organized by number in a vector. So, to extract the the third component from the network, you would enter dg[[3]]. In this case, there was only one (dg[[1]]), so we use the script below to save that particular subnetwork under a new name.

g2 <- dg[[1]]   # Save the largest as a new network object
                #    Next, go on to analyze the subnetwork

In practice, we will frequently be interested in only the very largest component in a network.

To automatically extract the largest component you can use the which.max function. which.max will identify the largest of the components. We use which.max in conjunction with the sapply function. sapply is a function that will iterate a command over all items in a list. In this case, we are asking R to evaluate the order (number of nodes) in each of the components in dg using the vcount function in igraph. which.max will identify the largest one, to be saved in a new object named “max.g”. We can then extract that subnetwork from the dg object and save it as “g2”.

dg <- decompose.graph(g)
max.g <- which.max(sapply(dg, vcount))
g2 <- dg[[max.g]]
# Then use g2 as you would above.

Of course, there is only one component in this network. So, g2 will look exactly like g this time. For larger, disconnected, networks, this will be an important part of your preprocessing before further analysis.

Components in statnet

Extracting components in statnet is a bit more involved. The decomposition of the network is not automatic in statnet. You will therefore need to identify which nodes belong in which components, and then use that information to extract (get.inducedSubgraph) a subset of vertices - and the edges between them - using that information.

Note: Since we are using statnet for the following example, you will notice that we have used the naming convention of net to signify that this is a network object that is suitable for use in statnet, not igraph. We will do this for all network objects used by statnet.

Here is the process for investigating and extracting components in statnet.

You will notice net %v% "component" in the code below. This is one way to either extract or add attributes to the vertices in statnet. If you read it literally, it is saying: “in the network named”net” there is (or shall be) an attribute of the vertices (%v%) named "component". The code is essentially set up as network_name %v% "attribute_name". As such, this replaces both get.vertex.attribute() and set.vertex.attribute(). You can also do this for all edges or arcs, using %e%. For more information, see ?get.vertex.attribute().

# ?component.dist    # for more information on options
cd<-component.dist(net, 
                   connected="weak")   # Run components algorithm for "weak" components

net %v% "component" <- cd$membership   # Add "component" to vertex attributes

cd$membership                          # Check out the membership to see what is prevalent

##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

cd$csize                               # Check out the size of the various components

## [1] 34

subnet <- get.inducedSubgraph(net,v=which(net %v% "component"==1)) # Component 1 was largest

subnet                                 # Check out the new subnetwork

##  Network attributes:
##   vertices = 34 
##   directed = TRUE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   bipartite = FALSE 
##   total edges= 155 
##     missing edges= 0 
##     non-missing edges= 155 
## 
##  Vertex attribute names: 
##     component vertex.names 
## 
## No edge attributes

gplot(subnet, 
      usearrows = FALSE,
      displaylabels = TRUE)

Cliques

Although the concept of cliques - maximal complete subgroups - is fairly easy to grasp, they can be somewhat difficult to analyze appropriately.

Remember that a clique is a maximal subset of a network, where each node in the set is adjacent to all other nodes in the set. The term “maximal” just means that every node that can be included in the subset has been included, and that no additional nodes may be included without violating its definition: [density(clique)=1.0]

Eliciting cliques is fairly straightforward. For this to be somewhat more realistic in social terms, we’ll adhere to the practice of considering only cliques comprised of three or more nodes to be meaningful.

Cliques in igraph

We’ll start in igraph. To see your clique analysis options in igraph, use the help function: ?igraph::cliques

First, we can find the number of cliques in a network using the count_max_cliques() function. Again, we are only interested in cliques sized 3 or greater.

Note: By default, igraph treats networks as being undirected when running all operations on cliques. If you have a directed network, then you will get a warning about this every time you do a clique analysis.

count_max_cliques(g, min=3)    # counts the number of cliques (minimum size=3)

## [1] 25

We can see that there are 25 cliques of size three or greater in this network.

We are often interested in the largest clique or cliques in a network. For information on the size of the largest clique, use clique_num().

clique_num(g)                           # size of largest clique(s)

## [1] 5

As we can see, the largest clique, or cliques, contain five nodes.

From here, we can extract the cliques.

# To list the cliques out on your screen
max_cliques(g, min=3)                          # all maximal cliques

# Alternately, create an object and call up the cliques
#   one by one.
mc <- max_cliques(g, min=3)

mc[[1]]     # View clique #1

mc[[21]]     # View clique #21

mc[[23]]    # View clique #23

We can see above, in count_max_cliques, that there are 25 cliques with three or more nodes. We did not run max_cliques here, since it would take a lot of space to display all 25 cliques. But, you are welcome to do so.

To save space, we save the results as a new object, named “mc.” You can see all the cliques by simply typing in mc, or by using the double square braces ([[]]) as we did above to identify which of the 25 cliques (or subnetworks) whose membership we are interested in seeing.

Try this for yourself. The function will give you a list of all cliques, as we define them. Again, the word “maximal” just means that you cannot add more nodes to the clique without violating the definition of a clique.

We may also turn our focus to just the largest clique or cliques,

largest_cliques(g)                      # all largest cliques

## Warning in largest_cliques(g): At core/cliques/maximal_cliques_template.h:269 :
## Edge directions are ignored for maximal clique calculation.

## [[1]]
## + 5/34 vertices, named, from 752bfa8:
## [1] 2 1 4 3 8
## 
## [[2]]
## + 5/34 vertices, named, from 752bfa8:
## [1] 2  1  4  3  14

cliq <- largest_cliques(g)              # get the largest cliques of g

## Warning in largest_cliques(g): At core/cliques/maximal_cliques_template.h:269 :
## Edge directions are ignored for maximal clique calculation.

The largest_cliques function lists two cliques, each with five vertices. If you save this information as an object (which we named “cliq” in this case), you may then plot the largest cliques.

As above, the double square brackets ([[]]) are used to reference items within the list of elements in the object we named “cliq”.

plot(induced_subgraph(g, cliq[[1]]))
plot(induced_subgraph(g, cliq[[2]]))

You can also get a little more fancy and plot them side-by-side. The length function measures how many items are in the list of elements in the object. for() is a type of function that will loop a function of set of functions oer a specified number of times. In this case, you can read this as: for each number (i) from 1 to however many are in the cliq object (1:length(cliq)), do the following… The i in plot(induced_subgraph(g, cliq[[i]])) will be replaced in each of a series of numbers. In this case, there are only two subgraphs in cliq.

par(mfrow = c(1, 2))                    # set graphics to plot 1 row, 2 cols

for(i in 1:length(cliq)){               # plot the 2 largest cliques
  plot(induced_subgraph(g, cliq[[i]]))
}

par(mfrow = c(1, 1))                  # Reset the plot window
                                      #    so you are not stuck
                                      #    with 2 columns

As mentioned, that was the fancy version. You can accomplish the same thing by specifying each network to be plotted.

par(mfrow = c(1, 2))                    # set graphics to plot 1 row, 2 cols

plot(induced_subgraph(g, cliq[[1]]))    # plot one of the 2 largest cliques
plot(induced_subgraph(g, cliq[[2]]))    # plot the other of the 2 largest cliques

par(mfrow = c(1, 1))                  # Reset the plot window
                                      #    so you are not stuck
                                      #    with 2 columns

This will accomplish the same thing as the code above.

Cliques in statnet

Statnet goes a little farther in the analysis of cliques. Though, it is also a little more involved.

Note, since we are using statnet for the following example, you will notice that we have used the naming convention of net for all network objects used by statnet. (This is the last time we will mention it.)

census <- clique.census(net, mode="graph")

ls(census)                          # Unpack the "census" object

## [1] "clique.count" "cliques"

The census object that was generated using the clique.census function in statnet has two types of information. We look at each, below.

census$clique.count                 # Number of cliques for each vertex

##   Agg 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
## 1   0 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
## 2  11 2 1 3 0 0 0 0 0 0  2  0  1  0  1  0  0  0  0  0  1  0  0  0  1  1  1  0
## 3  21 9 3 2 1 2 3 3 0 2  0  2  0  1  0  1  1  1  1  1  1  1  1  1  1  1  1  1
## 4   3 1 0 1 1 0 0 0 1 1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
## 5   2 2 2 2 2 0 0 0 1 0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0
##   28 29 30 31 32 33 34
## 1  0  0  0  0  0  0  0
## 2  2  1  0  1  1  0  3
## 3  1  1  1  0  3  7  9
## 4  0  0  1  1  0  2  2
## 5  0  0  0  0  0  0  0

The clique.count shows how many times each node appears in a clique of a particular size. In this case, we have no cliques larger than five nodes, so the y axis only goes up to five.

census$cliques                      # List all cliques

We have suppressed the output for the list of actual cliques. It was too long. So, try this on your own.

Though, as with igraph, you can use this information to plot. Though, the plots are a little more versatile.

First, let’s change the layout of the network to better reflect similarity between nodes by clique overlap. To do so, run the clique.census again. But, this time, add the option to save a “co-membership” matrix (clique.comembership="sum"). You are summing up all the times that two nodes appear in the same clique These sums are saved as links between all pairs of nodes, in matrix format. Because that is all we are interested in this time, that is the only output we are preserving from this clique census. That is why we have “$clique.comemb” included on the outside paren.

The co-membership matrix provides some advantages. You may use the clique overlap counts as a means for understanding how similar the various nodes are to one another. One of the more well-used methods for doing this is to create coordinates using a technique known as multidimensional scaling (MDS). MDS will create a distance matrix based on overlap, such that nodes with more overlap will have smaller distances between them in an n-dimensional space.

For more information on MDS, consult this conceptual tutorial. Or, you can investigate your options in R. And, for those who wish to go deeper, here is a chapter that provides a more comprehensive background into the procedure.

comemb <- clique.census(net, clique.comembership="sum")$clique.comemb # Co-membership matrix

coords <- cmdscale(1/(1+comemb))        # Perform an MDS  (multidimensional scaling) 
                                        # the more overlap, 
                                        # the "closer" the vertices
                                        
gplot(net,
      coord=coords, 
      usearrows=FALSE, 
      displaylabels=TRUE)

Using an inverse of multidimensional scaling (see: ?cmdscale), you just produced a distance matrix of similarities and used that to plot the network in statnet. (The “1+” part was just to get rid of the zeroes in the matrix.) This effectively re-envisions the social space according to clique overlap.

You may have noticed that gplot, the visualizer for statnet, operates differently from plot.igraph. For a fairly complete list of the different options in gplot, check the help section; ?gplot.

We can also reuse the co-membership matrix to make a weighted visualization of the network using co-membership as the weights. The CoNet%e%'weight'code chunk is similar to igraph’s E(g)$attribute function.

CoNet <- network(comemb, matrix.type="adjacency", 
                 directed=FALSE,
                 ignore.eval=FALSE,   # Keep edge weights
                 names.eval="weight") # names.eval stores edge weights
                                      #   as edge attribute named "weight".

CoWeights <- as.sociomatrix(CoNet,"weight") 

gplot(net,coord=coords, 
      usearrows=FALSE, 
      displaylabels=TRUE,
      edge.lwd=CoNet%e%'weight'*2) # Use the edge attribute, "weight" (multiplied by 2

                                   #    to emphasize differences).

Last, to get a better idea of node similarity, we can use a dendrogram to visualize the similarity by co-membership. We generally use hierarchical clustering to establish a “hierarchy” of similarities.

hc <- hclust(as.dist(1/(CoWeights+1)))      # Hierarchal clustering by co-membership

plot(hc)                                    # Plot the dendrogram

  # Optional: 
rect.hclust(hc, h=0.6)                      # Plot a cutoff point

  # Use the rect.hclust() to make the pattern easier to see.

k-cores

We can find k-cores with both igraph and statnet.

k-cores in igraph

coreness(g)

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  8  8  8  8  6  6  6  8  8  4  6  2  4  8  4  4  4  4  4  6  4  4  4  6  6  6 
## 27 28 29 30 31 32 33 34 
##  4  6  6  6  8  6  8  8

As you can see, the coreness function produces a table listing to which core each node belongs. Vertex names are on the top row of the table, and core classifications are on the bottom row.

We can use this to plot the network.

kcore <- coreness(g)    # Extract k-cores as a data object.
V(g)$core <- kcore      # Add the cores as a vertex attribute
plot.igraph(g, vertex.color=V(g)$core) # plot

Extracting one or more subnetworks based on k-cores

If you would like to extract one or more of the cores as a subnetwork, then first consider the values of the various cores. Remember that k-cores are assigned according to degree within each core. That means that the densest core will have the greatest values.

In this case, let’s extract the densest core. Then, we can relax our criterion a little and extract a slightly larger subset.

To start, consider the values of the various cores.

kcore

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  8  8  8  8  6  6  6  8  8  4  6  2  4  8  4  4  4  4  4  6  4  4  4  6  6  6 
## 27 28 29 30 31 32 33 34 
##  4  6  6  6  8  6  8  8

It may also help to see how many nodes are included in each core. (Keep in mind that smaller valued cores contain all larger valued cores.)

table(kcore)

## kcore
##  2  4  6  8 
##  1 11 12 10

As you can see, the greatest value is eight (8), so all nodes with a k-core value of eight (the 8-core) will be in the densest portion of the network.

We can extract the 8-core(s) from the network using the induced_subgraph() function. The arguments you will need are the name of the network, from which you are extracting some nodes and their accompanying ties, and a list of the nodes you wish to extract. In this case we can use the information contained in the “kcore” object that we made when we ran the k-cores algorithm.

To extract only the densest core, we’ll use only those nodes that reside in that core (the 8-core).

Some operators you may need for this:

Operator	Meaning
>	Greater than
<	Less than
==	Equal to
>=	Greater than or equal to
<=	Less than or equal to

Note that the “==” double equal sign used in the argument. That is used because R, like other programming languages, uses = to indicate assignment, just like “<-”. The double equal sign is used to say that we mean “equals” in the mathematical sense.

g8c <- induced_subgraph(g, kcore==8)

To extract a slightly larger and less dense core, we can extract all nodes in the 6-core (there is no 7-core in this network), which also contains the 8-core.

Recall that lower value cores should always include the higher value cores. Because k-cores identify dense regions in the network, they are defined as regions where each node is connected with at least k other nodes. Therefore, everyone in the 8-core also qualifies for membership in the 6-core, since they are connected with at least 6 alters in the core.

g6c <- induced_subgraph(g, kcore>=6)  # The value of the k-core is greather than
                                      #    or equal to 6

Feel free to plot either, or both.

plot(g8c)

plot(g6c)

k-cores in statnet

Running k-cores in statnet is not all that involved.

Keep in mind, this is a function of relative density. So, degree within the core is what is being calculated.

kcores(net)

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  8  8  8  8  6  6  6  8  8  4  6  2  4  8  4  4  4  4  4  6  4  4  4  6  6  6 
## 27 28 29 30 31 32 33 34 
##  4  6  6  6  8  6  8  8

For both igraph and statnet, we can clearly see that there are four cores. There is one node in the two core, there are eleven nodes in the four core, twelve nodes in the six core, and there are ten nodes in the eight core.

As with igraph, we can use this information to plot. The difference is, we do not need to make the cores into a network attribute first.

set.seed(18675309)

kco <- kcores(net) 

gplot(net, 
      vertex.col=kco, 
      usearrows=FALSE,
      displaylabels=TRUE)

If you consider the values of each of the cores, you will note that the 8-core is the densest region. We can extract that region for further analysis.

net_kc <- net[kco>7, kco>7]

net_kc

##    1 2 3 4 8 9 14 31 33 34
## 1  0 0 1 1 1 1  1  0  0  0
## 2  1 0 1 1 1 0  1  1  0  0
## 3  1 1 0 1 1 1  1  0  1  0
## 4  1 1 1 0 1 0  1  0  0  0
## 8  1 1 1 1 0 0  0  0  0  0
## 9  1 0 1 0 0 0  0  1  1  1
## 14 1 1 1 1 0 0  0  0  0  1
## 31 0 1 0 0 0 1  0  0  1  1
## 33 0 0 1 0 0 1  0  1  0  1
## 34 0 0 0 0 0 1  1  1  1  0

We can also plot just that region.

gplot(net[kco>7, kco>7], 
      usearrows=FALSE,
      displaylabels=TRUE)

We can also extract the region as a subnetwork for further analysis as we did in igraph.

net8c <- get.inducedSubgraph(net, which(kco==8))

plot(net8c, displaylabels=TRUE)

Girvan-Newman (Edge Betweenness)

For this and all the following, we’ll stay in igraph.

Note: you may need to convert a directed graph to undirected for some of these. If you receive a warning about directed networks, use the following to make the network undirected.

g <- as.undirected(g, mode="collapse")

Now, run the edge betweenness algorithm and then save the membership as a node attribute in the network.

eb <- cluster_edge_betweenness(g)

You have a few alternatives that you can use to check to see which nodes are in which groups. Here, we’ll present some of the built-in options. Consider that the cluster_edge_betweenness function produces a lot of output that you will not immediately see until it is unpacked. To take a look at what is produced, examine the object that we just created to store the algorithm’s output: ls(eb). This will list the contents of the object that you just made called “eb.”

A lot of information is packed into the eb object. The igraph package can unpack these for you. For a list of the, consult ?communities. Here, we’ll use a few of these built-in functions to see how the nodes in the network were grouped by the edge betweenness algorithm.

For an overview of which nodes are in which group, use the membership function.

membership(eb)

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  1  1  2  1  3  3  3  1  4  5  3  1  1  1  4  4  3  1  4  1  4  1  4  4  2  2 
## 27 28 29 30 31 32 33 34 
##  4  2  2  4  4  2  4  4

You can store this information in the network as a vertex attribute. Below, we’ll call the attribute “groups.” Though, if you are going to store the results of a few different conmmunity structure algorithms, it is a good idea to use a more specific attribute name, such as “groups_eb” so that you can tell them apart later.

V(g)$group <- membership(eb) # assign membership to vertices

You can also count the number of nodes in each group (community) using:

sizes(eb)

## Community sizes
##  1  2  3  4  5 
## 10  6  5 12  1

Looking at the groupings, above, it is apparent that groups one (1) and four (4) have the most members. If you like, you can extract each.

When doing this, remember that you added the membership to the network object as a vertex attribute names “group.” We can use that to extract each group.

g_eb1 <- induced_subgraph(g, V(g)$group==1) # extract nodes in group 1

g_eb4 <- induced_subgraph(g, V(g)$group==4) # extract nodes in group 4

plot(g_eb1)

plot(g_eb4)

To view the groups in a visualization, you will need to first supply the object with the grouping information, and then the network to which the groups belong.

Here, we are also adding a title (main=), labeling the vertices by their index number (vertex.label=), and specifying which layout we want to use (layout=). If you are just taking a quick look, then plot(eb, g) will work.

plot(eb, g,
     vertex.label = V(g)$id,  # Uses index number instead of name
     layout = layout_with_kk, 
     main="Max Modularity Solution")

 # This overlays with a community polygon. 
       # internal edges = dark green
       # external edges = red

This solution looks a little odd to me. This is especially true if you consider that node 10 is in its own group. To help decide what may be a better cut, we can use a dendrogram to get a better idea of who is in which group each time the network was parsed.

Selecting the number of groups manually

By default, the edge betweenness function will produce a group structure that corresponds with the greatest modularity value. However, the greatest modularity value is not always the best representation of how the network should be parsed.

To get a quick look at the various ways that the algorithm could divide the network, use a dendrogram. In the dendrogram, the mode="hclust" argument allows us to use the rect=2 argument to help identify what a two group solution would look like. Try changing the number of groups to see what the various solutions would look like.

plot_dendrogram(eb, 
                mode="hclust", 
                rect=2)

Read the dendrogram from top to bottom. If you were to delete the top horizontal line, then you would separate the network into two groups, like we have identified using the rect=2 argument. The next cut (remove the next horizontal line in your mind) produced three groups, but node ten was then on its own as a group unto itself.

The dendrogram should give you a little better idea of how the network may be parsed and the membership of each group as it is parsed using a particular algorithm. In this case, we can see how the Karate Club network would divide into two groups using edge betweenness.

If, you would rather seek a two community solution rather than the solution that is identified as “optimal” using modularity (the default for edge betweenness and other algorithms), you may specify that - for any of the following algorithms. The example below uses edge betweenness, but you can substitute any other community detection algorithm that uses modularity.

V(g)$group <- membership(eb)


# Change the number of communities in the solution.
eb$membership <- cut_at(eb, 2) # 2 communities

plot(eb, g,
     vertex.label = V(g)$id,  # Uses index number instead of name
     layout = layout_with_kk, 
     main="Two-Community Solution")

You may incorporate the above in the remaining algorithms on this page.

Spinglass

sg <- cluster_spinglass(g)

plot(sg, g)

# Save your work in a data frame. 
SG.Communities <- data.frame(sg$membership, sg$names)

head(SG.Communities )     # Take a peek

##   sg.membership sg.names
## 1             2        1
## 2             2        2
## 3             2        3
## 4             2        4
## 5             4        5
## 6             4        6

Spinglass will allow the user to specify the maximum number of communites using the spins argument. To see the two community solution for spinglass, set spins equal to 2.

sg_2 <- cluster_spinglass(g, spins=2)

plot(sg_2, g,
     vertex.label = V(g)$id,  # Uses index number instead of name
     layout = layout_with_kk, 
     main="Two-Community Spinglass Solution")

Getting Fancy: Assess Multiple Spinglass Iterations

sg_results_members<- list()
sg_results_modularity <- c()

for (i in 1:5) {
  set.seed(i)
  sg <- cluster_spinglass(g)
  sg_membership <- list(sg$names)
  sg_names <- list(sg$membership)
  sg_mod <- list(cluster_spinglass(g)$modularity)
  sg_results_members[i] <- sg_membership
  sg_results_modularity <- c(sg_results_modularity, sg_mod)
}

sg_results_modularity %>% unlist() %>% mean()
sg_results_modularity %>% unlist() %>% max()
sg_results_modularity %>% unlist() %>% min()
sg_results_modularity %>% unlist() %>% median()

best_sg <- which(max(unlist(sg_results_modularity)) %in% sg_results_modularity)

spins <- data.frame(sg_membership[best_sg],
                    sg_names[best_sg]) %>%
                    `colnames<-`(c("vertex", "group"))

head(spins)  # Take a peek

Louvain

lvc <- cluster_louvain(g)

sizes(lvc)

## Community sizes
##  1  2  3  4 
## 12  5 11  6

Louv <- data.frame(lvc$membership,
                   lvc$names)

head(Louv)

##   lvc.membership lvc.names
## 1              1         1
## 2              1         2
## 3              1         3
## 4              1         4
## 5              2         5
## 6              2         6

plot(lvc, g, vertex.label=V(g)$id)

A few Blockmodeling alternatives

gBlocks <- cohesive_blocks(g)

blocks(gBlocks)

## [[1]]
## + 34/34 vertices, named, from 60c2cd2:
##  [1] 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34
## 
## [[2]]
## + 28/34 vertices, named, from 60c2cd2:
##  [1] 1  2  3  4  8  9  10 13 14 15 16 18 19 20 21 22 23 24 25 26 27 28 29 30 31
## [26] 32 33 34
## 
## [[3]]
## + 6/34 vertices, named, from 60c2cd2:
## [1] 1  5  6  7  11 17
## 
## [[4]]
## + 5/34 vertices, named, from 60c2cd2:
## [1] 1 2 3 4 8
## 
## [[5]]
## + 7/34 vertices, named, from 60c2cd2:
## [1] 1  2  3  9  31 33 34
## 
## [[6]]
## + 5/34 vertices, named, from 60c2cd2:
## [1] 1  5  6  7  11
## 
## [[7]]
## + 5/34 vertices, named, from 60c2cd2:
## [1] 1  2  3  4  14
## 
## [[8]]
## + 10/34 vertices, named, from 60c2cd2:
##  [1] 3  24 25 26 28 29 30 32 33 34

igraph::hierarchy(gBlocks)

## IGRAPH 60fe8be D--- 8 7 -- 
## + edges from 60fe8be:
## [1] 1->2 1->3 2->4 2->5 3->6 2->7 2->8

parent(gBlocks)

## [1] 0 1 1 2 2 3 2 2

cohesion(gBlocks)

## [1] 1 2 2 4 3 3 4 3

max_cohesion(gBlocks)

##  [1] 4 4 4 4 3 3 3 4 3 2 3 1 2 4 2 2 2 2 2 2 2 2 2 3 3 3 2 3 3 3 3 3 3 3

graphs_from_cohesive_blocks(gBlocks, g)

## [[1]]
## IGRAPH 4938e9d UN-- 34 78 -- 
## + attr: name (v/c), core (v/n), group (v/n)
## + edges from 4938e9d (vertex names):
##  [1] 1 --2  1 --3  2 --3  1 --4  2 --4  3 --4  1 --5  1 --6  1 --7  5 --7 
## [11] 6 --7  1 --8  2 --8  3 --8  4 --8  1 --9  3 --9  3 --10 1 --11 5 --11
## [21] 6 --11 1 --12 1 --13 4 --13 1 --14 2 --14 3 --14 4 --14 6 --17 7 --17
## [31] 1 --18 2 --18 1 --20 2 --20 1 --22 2 --22 24--26 25--26 3 --28 24--28
## [41] 25--28 3 --29 24--30 27--30 2 --31 9 --31 1 --32 25--32 26--32 29--32
## [51] 3 --33 9 --33 15--33 16--33 19--33 21--33 23--33 24--33 30--33 31--33
## [61] 32--33 9 --34 10--34 14--34 15--34 16--34 19--34 20--34 21--34 23--34
## [71] 24--34 27--34 28--34 29--34 30--34 31--34 32--34 33--34
## 
## [[2]]
## IGRAPH 9fbfd61 UN-- 28 67 -- 
## + attr: name (v/c), core (v/n), group (v/n)
## + edges from 9fbfd61 (vertex names):
##  [1] 1 --2  1 --3  2 --3  1 --4  2 --4  3 --4  1 --8  2 --8  3 --8  4 --8 
## [11] 1 --9  3 --9  3 --10 1 --13 4 --13 1 --14 2 --14 3 --14 4 --14 1 --18
## [21] 2 --18 1 --20 2 --20 1 --22 2 --22 24--26 25--26 3 --28 24--28 25--28
## [31] 3 --29 24--30 27--30 2 --31 9 --31 1 --32 25--32 26--32 29--32 3 --33
## [41] 9 --33 15--33 16--33 19--33 21--33 23--33 24--33 30--33 31--33 32--33
## [51] 9 --34 10--34 14--34 15--34 16--34 19--34 20--34 21--34 23--34 24--34
## [61] 27--34 28--34 29--34 30--34 31--34 32--34 33--34
## 
## [[3]]
## IGRAPH 445db92 UN-- 6 10 -- 
## + attr: name (v/c), core (v/n), group (v/n)
## + edges from 445db92 (vertex names):
##  [1] 1--5  1--6  1--7  5--7  6--7  1--11 5--11 6--11 6--17 7--17
## 
## [[4]]
## IGRAPH 7f18be2 UN-- 5 10 -- 
## + attr: name (v/c), core (v/n), group (v/n)
## + edges from 7f18be2 (vertex names):
##  [1] 1--2 1--3 2--3 1--4 2--4 3--4 1--8 2--8 3--8 4--8
## 
## [[5]]
## IGRAPH 25aaa17 UN-- 7 13 -- 
## + attr: name (v/c), core (v/n), group (v/n)
## + edges from 25aaa17 (vertex names):
##  [1] 1 --2  1 --3  2 --3  1 --9  3 --9  2 --31 9 --31 3 --33 9 --33 31--33
## [11] 9 --34 31--34 33--34
## 
## [[6]]
## IGRAPH dbe38dd UN-- 5 8 -- 
## + attr: name (v/c), core (v/n), group (v/n)
## + edges from dbe38dd (vertex names):
## [1] 1--5  1--6  1--7  5--7  6--7  1--11 5--11 6--11
## 
## [[7]]
## IGRAPH 94e5950 UN-- 5 10 -- 
## + attr: name (v/c), core (v/n), group (v/n)
## + edges from 94e5950 (vertex names):
##  [1] 1--2  1--3  2--3  1--4  2--4  3--4  1--14 2--14 3--14 4--14
## 
## [[8]]
## IGRAPH ad628e3 UN-- 10 20 -- 
## + attr: name (v/c), core (v/n), group (v/n)
## + edges from ad628e3 (vertex names):
##  [1] 24--26 25--26 3 --28 24--28 25--28 3 --29 24--30 25--32 26--32 29--32
## [11] 3 --33 24--33 30--33 32--33 24--34 28--34 29--34 30--34 32--34 33--34

plot(gBlocks, g,
     layout = layout_with_kk)

plot_hierarchy(gBlocks)

Other Techniques for community objects

eb <- cluster_edge_betweenness(g)

algorithm(eb)         # algorithm used to create community object

## [1] "edge betweenness"

is_hierarchical(eb)   # hierarchical?

## [1] TRUE

sizes(eb)             # community szes

## Community sizes
##  1  2  3  4  5 
## 10  6  5 12  1

compare(eb, lvc)      # compare community objects

## [1] 0.4600528

as.dendrogram(eb)     # converts to a dendrogram object

## 'dendrogram' with 2 branches and 34 members total, at height 33

plot_dendrogram(eb)
plot_dendrogram(eb, mode = "hclust")

data.frame(eb$membership, # which vertex in which group
           eb$names)

##    eb.membership eb.names
## 1              1        1
## 2              1        2
## 3              2        3
## 4              1        4
## 5              3        5
## 6              3        6
## 7              3        7
## 8              1        8
## 9              4        9
## 10             5       10
## 11             3       11
## 12             1       12
## 13             1       13
## 14             1       14
## 15             4       15
## 16             4       16
## 17             3       17
## 18             1       18
## 19             4       19
## 20             1       20
## 21             4       21
## 22             1       22
## 23             4       23
## 24             4       24
## 25             2       25
## 26             2       26
## 27             4       27
## 28             2       28
## 29             2       29
## 30             4       30
## 31             4       31
## 32             2       32
## 33             4       33
## 34             4       34

This should give you a fair start at what is possible in finding and defining cohesive subgroups in R. Consult your notes of chapter 5 of the text to refresh your memory on why and how to use each approach.

Deliverable

You have two choices for this practicum. You may either use the “talked to” network from the Death of Stalin; or, if you are up for a challenge, you may use a network of first responders in the hurricane Katrina disaster. The hurricane Katrina network was collected using secondary sources, well after the event. You can read the article on the topic here: https://www.cmu.edu/joss/content/articles/volume13/ButtsActonMarcum.pdf
+ To download the data, once again, go to the Network Data link on the course website. You will find the edgelist data (KatrinaResponse.csv) on the Network Data page of the website.

Use the network of your choice to do the following:

After trying out the various analyses, select the methods that you feel are most appropriate for this network.
At minimum, you will need to extract the largest component before you do anything else.
Interpret what you present. Include a visualization or two that helps you to describe what you see.

Remember: This should be an undirected network.

Good luck!