Network Analysis

In this exercise we will learn the basics of working with networks in R and use graph analysis to better understand collaboration networks in the US House.

Basics of igraph

library(tidyverse)
library(igraph)

Making a simple graph

g <- make_graph(edges = c(1,2, 1,5), n=10, directed = FALSE)
plot(g)

This is an undirected graph with 10 vertices labeled 1 to 10. Vertex 1 is connected to vertices 2 and 5, and there are no other connections.

The following graph represents the social network of Zachary’s karate club, which captures the friendships among 34 members of a karate club at a US university in the 1970s:

g <- make_graph('Zachary')
plot(g)

Let’s change the layout to see if we can make sense of the network better.

plot(g, layout = layout_in_circle(g))

Exercise
  • Explore the igraph documentation to find other possible layouts for the graph and try some of them (the sketch after this list shows two to get you started).
  • Which of the layouts made it easy for you to understand the structure of the graph?
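
For example, here is a quick sketch of two force-directed layouts that often work well for social networks (layout_with_fr is Fruchterman-Reingold, layout_with_kk is Kamada-Kawai):

# Fruchterman-Reingold force-directed layout
plot(g, layout = layout_with_fr(g))
# Kamada-Kawai layout
plot(g, layout = layout_with_kk(g))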

Accessing edges and vertices

Edges and vertices of graphs can be accessed using E(g) and V(g). We can also attach additional information to edges and vertices, which can be helpful when selecting the vertices or edges we are most interested in.

V(g)
+ 34/34 vertices, from 3dc6c99:
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34
#Adding additional values to vertices
V(g)$val <- 3:36
V(g)$val
 [1]  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
[26] 28 29 30 31 32 33 34 35 36

We can add other attributes to the vertices. These will be helpful if we want to filter nodes of the graph.

# Create a vector of 34 names
names <- c("Amelia", "Anna", "Ava", "Bella", "Charlotte", "Chloe", "Emma", "Emily", "Evelyn", "Gabriella", "Hannah", "Isabella", "Jessica", "Julia", "Madison", "Mia", "Olivia", "Sophia", "Aaliyah", "Amari", "Ayden", "Cameron", "Christopher", "Eli", "Ethan", "Gabriel", "James", "John", "Michael", "Nathan", "Noah", "William", "Nate", "Raghav")

# Assign the names to the vertices as a new attribute
V(g)$people <- names

head(V(g)$people)
[1] "Amelia"    "Anna"      "Ava"       "Bella"     "Charlotte" "Chloe"    
E(g)
+ 78/78 edges from 3dc6c99:
 [1]  1-- 2  1-- 3  1-- 4  1-- 5  1-- 6  1-- 7  1-- 8  1-- 9  1--11  1--12
[11]  1--13  1--14  1--18  1--20  1--22  1--32  2-- 3  2-- 4  2-- 8  2--14
[21]  2--18  2--20  2--22  2--31  3-- 4  3-- 8  3--28  3--29  3--33  3--10
[31]  3-- 9  3--14  4-- 8  4--13  4--14  5-- 7  5--11  6-- 7  6--11  6--17
[41]  7--17  9--31  9--33  9--34 10--34 14--34 15--33 15--34 16--33 16--34
[51] 19--33 19--34 20--34 21--33 21--34 23--33 23--34 24--26 24--28 24--33
[61] 24--34 24--30 25--26 25--28 25--32 26--32 27--30 27--34 28--34 29--32
[71] 29--34 30--33 30--34 31--33 31--34 32--33 32--34 33--34
E(g)[1]
+ 1/78 edge from 3dc6c99:
[1] 1--2

Degree of a graph

Perhaps the simplest property one can think of is the degree. The degree of a vertex equals the number of edges adjacent to it.

degree(g)
 [1] 16  9 10  6  3  4  4  4  5  2  3  1  2  5  2  2  2  2  2  3  2  2  2  5  3
[26]  3  2  4  3  4  4  6 12 17
cbind(V(g), degree(g)) %>% head()
     [,1] [,2]
[1,]    1   16
[2,]    2    9
[3,]    3   10
[4,]    4    6
[5,]    5    3
[6,]    6    4
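
From the degree output above, vertex 34 has the highest degree (17), with vertex 1 close behind (16). A small sketch to pick out the highest-degree vertex directly:

# index of the vertex with the most connections
which.max(degree(g))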

Now that we have some idea about working with graphs, let’s work with a planning-related dataset and learn more graph analysis techniques.

Network of bill cosponsorship in the 117th Congress

We have data from the 117th Congress (the current congress) on cosponsorship of bills, and we will use this data to define a graph or network in which representatives are the nodes and cosponsorships of bills are the edges.

The dataset we have lists the sponsor and cosponsor(s) of every bill introduced in the 117th House for which this data is available from the Government Printing Office. We will create a cosponsorship network with an edge between every cosponsor and the sponsor of that bill (we are not creating links between cosponsors of the same bill, as they may not actually have worked together). We are only including original cosponsors of each bill, not those who signed on later.

First, we read the data

data = read_csv("sponsors.csv")
Rows: 50787 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): cosponsor, sponsor, bill

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
# A tibble: 6 × 3
  cosponsor                      sponsor                           bill   
  <chr>                          <chr>                             <chr>  
1 Rep. Pelosi, Nancy [D-CA]      Rep. Sarbanes, John P. [D-MD]     HR 1   
2 Rep. Lofgren, Zoe [D-CA]       Rep. Sarbanes, John P. [D-MD]     HR 1   
3 Rep. Lowenthal, Alan S. [D-CA] Rep. Fitzpatrick, Brian K. [R-PA] HR 100 
4 Rep. Carson, Andre [D-IN]      Rep. Maloney, Carolyn B. [D-NY]   HR 1004
5 Rep. Carson, Andre [D-IN]      Rep. Maloney, Carolyn B. [D-NY]   HR 1005
6 Rep. Carson, Andre [D-IN]      Rep. Maloney, Carolyn B. [D-NY]   HR 1006

To turn this into a graph, we need to format it into a data frame that has columns for the start and end of every edge, but we don’t want to have duplicate edges.

First, we need to modify the data frame to only include a single record for each sponsor/cosponsor pair. This is a little bit tricky because we are using an undirected graph - we don’t want duplicate records for the same two representatives when one was a sponsor and the other was a cosponsor, or vice-versa. To do this, we create a new data frame with sponsor1 and sponsor2 columns, and just put whoever comes earlier in alphabetical order first.

sponsors = mutate(data,
                  sponsor1=if_else(sponsor < cosponsor, sponsor, cosponsor),
                  sponsor2=if_else(sponsor < cosponsor, cosponsor, sponsor)
                  ) %>% 
  select(-c(sponsor, cosponsor)) #removing sponsor and cosponsor columns
head(sponsors)
# A tibble: 6 × 3
  bill    sponsor1                          sponsor2                       
  <chr>   <chr>                             <chr>                          
1 HR 1    Rep. Pelosi, Nancy [D-CA]         Rep. Sarbanes, John P. [D-MD]  
2 HR 1    Rep. Lofgren, Zoe [D-CA]          Rep. Sarbanes, John P. [D-MD]  
3 HR 100  Rep. Fitzpatrick, Brian K. [R-PA] Rep. Lowenthal, Alan S. [D-CA] 
4 HR 1004 Rep. Carson, Andre [D-IN]         Rep. Maloney, Carolyn B. [D-NY]
5 HR 1005 Rep. Carson, Andre [D-IN]         Rep. Maloney, Carolyn B. [D-NY]
6 HR 1006 Rep. Carson, Andre [D-IN]         Rep. Maloney, Carolyn B. [D-NY]

Now we just need to create a version of the data with a single entry for each sponsor pair

sponsor_pairs <- sponsors %>% 
  group_by(sponsor1, sponsor2) %>%
  summarize() # note the empty summarize(): the result keeps only the variables we grouped by, with no additional columns, which is exactly what we want
`summarise()` has grouped output by 'sponsor1'. You can override using the
`.groups` argument.
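
An alternative that produces the same set of pairs is dplyr's distinct(), which keeps one row per unique combination of the listed columns (a sketch; the row order may differ, which doesn't matter for building the graph):

# one row per unique sponsor pair, keeping only the two sponsor columns
sponsor_pairs <- distinct(sponsors, sponsor1, sponsor2)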

Finally, we can create the graph

graph <- graph_from_data_frame(sponsor_pairs, directed=F)

Let’s plot the graph.

plot(graph)

It’s a mess.

plot(graph,
     vertex.label = NA,
     vertex.size = 2,
     layout = layout_on_sphere(graph),
     edge.width = 0.01)

It’s still a large network that’s hard to make sense of. There also seems to be a huge amount of interconnection. The larger a network grows, the harder it is to visualize and make sense of without the help of mathematics.

We can compute shortest paths in the Congress network.

Representative Ralph Norman [R-SC] is considered the most conservative House member, and Barbara Lee [D-CA] the most liberal, based on GovTrack reports.

Let’s compute the shortest path between them:

shortest_paths(graph, "Rep. Lee, Barbara [D-CA]", "Rep. Norman, Ralph [R-SC]")
$vpath
$vpath[[1]]
+ 3/447 vertices, named, from 3e7a8c3:
[1] Rep. Lee, Barbara [D-CA]  Rep. Biggs, Andy [R-AZ]  
[3] Rep. Norman, Ralph [R-SC]


$epath
NULL

$predecessors
NULL

$inbound_edges
NULL

That’s a bit surprising - I would not have expected to need only one intermediary (Rep. Andy Biggs [R-AZ]) to get from the most liberal to the most conservative member of the House. But it’s true; we can look at the data to confirm.

First, find bills that were sponsored by Lee and cosponsored by Biggs, or vice versa. Biggs will always be sponsor1 since his name comes first alphabetically.

filter(sponsors, sponsor1=="Rep. Biggs, Andy [R-AZ]", sponsor2=="Rep. Lee, Barbara [D-CA]")
# A tibble: 1 × 3
  bill   sponsor1                sponsor2                
  <chr>  <chr>                   <chr>                   
1 HR 256 Rep. Biggs, Andy [R-AZ] Rep. Lee, Barbara [D-CA]

Lee introduced HR 256, a repeal of the 2002 Congressional authorization of use of force in Iraq, and Biggs cosponsored it (https://www.congress.gov/bill/117th-congress/house-bill/256/cosponsors).

Biggs will also appear first in the bills cosponsored with Norman:

filter(sponsors, sponsor1=="Rep. Biggs, Andy [R-AZ]", sponsor2=="Rep. Norman, Ralph [R-SC]")
# A tibble: 13 × 3
   bill    sponsor1                sponsor2                 
   <chr>   <chr>                   <chr>                    
 1 HR 1901 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
 2 HR 2384 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
 3 HR 2593 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
 4 HR 380  Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
 5 HR 381  Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
 6 HR 5106 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
 7 HR 5284 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
 8 HR 5360 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
 9 HR 581  Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
10 HR 6395 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
11 HR 6703 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
12 HR 6828 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
13 HR 7262 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]

These representatives have cosponsored many bills together, which is perhaps less surprising since they are both Republicans and Biggs is fairly conservative.

That said, it’s still somewhat surprising that the most conservative and most liberal members of Congress are so close together in cosponsorship space. Maybe this is just a fluke because Lee sponsored a popular bill? Let’s try with two other very liberal and conservative members, Alexandria Ocasio-Cortez [D-NY] and Madison Cawthorn [R-NC].

shortest_paths(graph, "Rep. Ocasio-Cortez, Alexandria [D-NY]", "Rep. Cawthorn, Madison [R-NC]")
$vpath
$vpath[[1]]
+ 3/447 vertices, named, from 3e7a8c3:
[1] Rep. Ocasio-Cortez, Alexandria [D-NY]    
[2] Rep. Johnson, Henry C. "Hank," Jr. [D-GA]
[3] Rep. Cawthorn, Madison [R-NC]            


$epath
NULL

$predecessors
NULL

$inbound_edges
NULL

No, this doesn’t appear to be a fluke. We can use some graph-level metrics to determine how prevalent this is. One such metric is the mean distance - how long is the average path between any two representatives?

mean_distance(graph)
[1] 1.758409

That is pretty short. It’s worth checking how many graph “components” there are - a component is a disconnected part of the graph with no paths to the other representatives. If there is only one component, then all nodes in the graph are directly or indirectly connected to all others. If there are many components, the mean distance could be artificially low because it’s not including distances between representatives in different components.

count_components(graph)
[1] 1

Every representative is connected to every other representative.

We can also compute the distances between all pairs of representatives.

dists = distances(graph)

What is the longest distance?

max(dists)
[1] 4

What is the distribution of distances?

table(dists)
dists
     0      1      2      3      4 
   447  49768 147992   1600      2 

Let’s find the pairs with the longest distance between them - you could consider these representatives the “most dissimilar.” The which() function finds the indices where the expression is true, and arr.ind=T tells it to return row and column indices of the matrix rather than linear indices.

which(dists == 4, arr.ind=T)
                                row col
Rep. Richmond, Cedric L. [D-LA] 440  59
Rep. Carey, Mike [R-OH]          59 440

Another measure of connectedness is the edge density, which is simply the ratio of how many connections there are in the graph to how many there could be. This is pretty easy to calculate:

total_edges = length(E(graph))
total_nodes = length(V(graph))

How many possible edges are there in the graph?

possible_edges = total_nodes * (total_nodes - 1) / 2

Now we can calculate the edge density:

total_edges / possible_edges
[1] 0.2496363

igraph has a built-in function for edge density as well. By default, this function assumes that “loop edges” from a node to itself don’t exist, which they don’t in this graph, but if you did have loop edges you would need to specify loops=TRUE to correctly calculate the denominator for this metric.

edge_density(graph)
[1] 0.2496363

There are a bunch of different types of centrality metrics, all of which aim to measure how central a given node (or edge, but we won’t look at that here) is to the graph overall. The simplest is “degree centrality”, which is just the “degree” or number of edges connected to a node. Does the degree centrality suggest any reason why Mike Carey and Cedric Richmond were so far apart?

head(degree(graph) %>% sort())
        Rep. Richmond, Cedric L. [D-LA]            Rep. Fudge, Marcia L. [D-OH] 
                                      1                                       6 
                Rep. Carey, Mike [R-OH] Rep. Cherfilus-McCormick, Sheila [D-FL] 
                                      7                                       8 
          Rep. Haaland, Debra A. [D-NM]           Rep. Brown, Shontel M. [D-OH] 
                                      8                                      13 
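
The head of the sorted degrees shows the least-connected members; to see the most-connected ones we can look at the other end of the sorted vector (a sketch, output not shown):

# representatives with the most cosponsorship connections
tail(degree(graph) %>% sort())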

More complex types of centrality include closeness centrality, which is the inverse of the total distance from a node to all others, so a higher closeness centrality indicates that a representative can reach the others in fewer hops. Closeness centrality is often normalized by multiplying by the number of nodes minus 1, so that it represents the inverse of the average distance rather than of the sum.

head(closeness(graph, normalize=T) %>% sort())
        Rep. Richmond, Cedric L. [D-LA]                 Rep. Carey, Mike [R-OH] 
                              0.4007188                               0.4313346 
Rep. Cherfilus-McCormick, Sheila [D-FL]           Rep. Brown, Shontel M. [D-OH] 
                              0.4464464                               0.4754797 
           Rep. Fudge, Marcia L. [D-OH]           Rep. Haaland, Debra A. [D-NM] 
                              0.4806034                               0.4853101 

Another type of centrality metric is betweenness centrality - this measures how often a representative is on the shortest path between pairs of other representatives.

It could be treated as a measure of the power a representative has to connect others, though treating it as a measure of “importance” can be tricky. People often assume that representatives with high betweenness centrality are very important to the network and that removing them would increase distances significantly, but this is not necessarily the case: high betweenness centrality doesn’t mean there is no other representative who could play a similar role.

head(betweenness(graph) %>% sort())
        Rep. Richmond, Cedric L. [D-LA]            Rep. Fudge, Marcia L. [D-OH] 
                             0.00000000                              0.03636754 
Rep. Cherfilus-McCormick, Sheila [D-FL]                 Rep. Carey, Mike [R-OH] 
                             0.04173913                              0.15322642 
          Rep. Haaland, Debra A. [D-NM]             Rep. Hoyer, Steny H. [D-MD] 
                             0.56132015                              1.24425937 

The final metric of centrality we’ll consider is eigenvector centrality. The centrality of each node is a scaled sum of the centralities of all the nodes it’s connected to, so a representative that is connected to other highly connected representatives scores more highly than a representative connected to the same number of less connected representatives. As it happens, this is also given by the leading eigenvector of the adjacency matrix (i.e. a square matrix with the nodes as rows and columns and a 1 wherever there is an edge between the corresponding nodes). This is roughly the algorithm used by Google to rank web pages, or at least it was when Google first started.
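
To see the connection to the adjacency matrix, here is a small sketch (it assumes the graph is small enough to handle as a dense matrix) that computes the leading eigenvector directly and rescales it the way igraph does, so the largest value is 1:

# build the 0/1 adjacency matrix of the cosponsorship graph
A <- as_adjacency_matrix(graph, sparse = FALSE)
# eigenvector for the largest eigenvalue (eigen() returns eigenvalues in decreasing order)
leading <- eigen(A)$vectors[, 1]
# rescale so the maximum is 1, matching eigen_centrality()'s convention
leading <- abs(leading) / max(abs(leading))
head(sort(leading, decreasing = TRUE))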

Unlike other centrality metrics, the eigen_centrality function returns a list. We can get the vector of centralities with $vector.

tail(eigen_centrality(graph)$vector %>% sort())
  Rep. Cicilline, David N. [D-RI]        Rep. Speier, Jackie [D-CA] 
                        0.8873877                         0.8938260 
    Rep. DeFazio, Peter A. [D-OR] Rep. Schakowsky, Janice D. [D-IL] 
                        0.8965134                         0.9104987 
         Rep. Cohen, Steve [D-TN]     Rep. Suozzi, Thomas R. [D-NY] 
                        0.9261281                         1.0000000 

Exercise

Do the different centrality metrics tell similar stories? If they differ, why do you think they do?

centralities = tibble(
  name = names(degree(graph)),
  degree = degree(graph),
  betweenness = betweenness(graph),
  closeness = closeness(graph),
  eigen = eigen_centrality(graph)$vector
)
select(centralities, -name) %>% cor()
               degree betweenness closeness     eigen
degree      1.0000000   0.8011382 0.9887684 0.8744824
betweenness 0.8011382   1.0000000 0.8152299 0.5534843
closeness   0.9887684   0.8152299 1.0000000 0.8631759
eigen       0.8744824   0.5534843 0.8631759 1.0000000
centralities %>% arrange(closeness) %>% head()
# A tibble: 6 × 5
  name                                    degree betweenness closeness   eigen
  <chr>                                    <dbl>       <dbl>     <dbl>   <dbl>
1 Rep. Richmond, Cedric L. [D-LA]              1      0       0.000898 0.00618
2 Rep. Carey, Mike [R-OH]                      7      0.153   0.000967 0.0127 
3 Rep. Cherfilus-McCormick, Sheila [D-FL]      8      0.0417  0.00100  0.0405 
4 Rep. Brown, Shontel M. [D-OH]               13      1.90    0.00107  0.0494 
5 Rep. Fudge, Marcia L. [D-OH]                 6      0.0364  0.00108  0.0337 
6 Rep. Haaland, Debra A. [D-NM]                8      0.561   0.00109  0.0399 

We’ve been working with an unweighted graph up until now, meaning all the edges were the same. However, it’s reasonable to think that people who cosponsor more often might be more closely connected, so we can use a weighted graph instead. In a weighted graph, each edge is assigned a “weight,” which can be thought of as a length or distance. A shortest path algorithm will then find not the fewest number of links but the lowest total weight. For instance, we can create a graph where the weight is the reciprocal of the number of times two representatives worked together. All we need to do is create a weight column in the data frame we use to create the graph: copy the graph-creation code from above and modify it to add that weight column.

sponsor_pairs = group_by(sponsors, sponsor1, sponsor2) %>%
  summarize(weight=1/n())
`summarise()` has grouped output by 'sponsor1'. You can override using the
`.groups` argument.
weight_graph = graph_from_data_frame(sponsor_pairs, directed=F)
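
As a quick check of what the weights change (a sketch; shortest_paths() automatically uses the weight edge attribute when one exists), we can recompute the Lee-Norman path on the weighted graph, where the algorithm now minimizes total weight rather than the number of hops:

# weighted shortest path between the most liberal and most conservative members
shortest_paths(weight_graph, "Rep. Lee, Barbara [D-CA]", "Rep. Norman, Ralph [R-SC]")$vpath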

centralities = tibble(
  name = names(degree(weight_graph)),
  degree = degree(weight_graph),
  betweenness = betweenness(weight_graph),
  closeness = closeness(weight_graph),
  eigen = eigen_centrality(weight_graph)$vector
)
head(centralities)
# A tibble: 6 × 5
  name                                        degree betweenness closeness eigen
  <chr>                                        <dbl>       <dbl>     <dbl> <dbl>
1 Del. Norton, Eleanor Holmes [D-DC]             225      4807.    0.00634 0.415
2 Del. Plaskett, Stacey E. [D-VI]                 43        13.3   0.00387 0.153
3 Del. Radewagen, Aumua Amata Coleman [R-AS]      35        65     0.00432 0.142
4 Del. Sablan, Gregorio Kilili Camacho [D-MP]    149       448     0.00495 0.549
5 Del. San Nicolas, Michael F. Q. [D-GU]         103        65.3   0.00445 0.386
6 Rep. Adams, Alma S. [D-NC]                     153        68     0.00493 0.444

How do the centralities compare to before?
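
One quick way to compare is a rank correlation between the two versions of a metric, matching representatives by name (a sketch):

# how similar are the closeness rankings before and after weighting?
closeness_unweighted <- closeness(graph)
closeness_weighted <- closeness(weight_graph)
cor(closeness_unweighted,
    closeness_weighted[names(closeness_unweighted)],
    method = "spearman")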

Some representatives have become much more important in some centrality measures (e.g. Eleanor Holmes Norton [D-DC], who is a delegate rather than a representative because residents of the District of Columbia do not have a vote in Congress.) Why do you think this is?

Some representatives have become more central because they cosponsor a lot of bills, some of which may be cosponsored by many other representatives.

Graph clustering

Graph clustering, also known as community detection, is the process of partitioning nodes in a graph into groups or clusters based on their connectivity patterns. It aims to identify densely connected subgroups of nodes that exhibit stronger internal connections than connections with nodes outside the group.

The goal of graph clustering is to reveal the underlying structure and organization of complex networks by grouping nodes that share similar characteristics or functions. Clusters represent cohesive subsets of nodes that are more densely interconnected, indicating potential communities or modules within the network. The identification of these communities helps in understanding the relationships, dynamics, and functional properties of the network.

Let’s use the Louvain method to identify clusters in the graph.

The Louvain method is an algorithm that helps us find groups or communities in a network. It works by looking at how connected the different parts of the network are. First, each node is put in its own group. Then, we repeatedly check whether moving a node to a different group would make the groups more connected, and if it would, we move it. We keep doing this until we can’t make the groups any more connected. This helps us find groups of nodes that are closely connected to each other in a network, like friends in a social network or neighborhoods in a city.

set.seed(123)
clusters <- cluster_louvain(graph)

It creates two clusters.

cbind(clusters$names, clusters$membership) %>%  head(n=15)
      [,1]                                          [,2]
 [1,] "Del. Norton, Eleanor Holmes [D-DC]"          "1" 
 [2,] "Del. Plaskett, Stacey E. [D-VI]"             "1" 
 [3,] "Del. Radewagen, Aumua Amata Coleman [R-AS]"  "1" 
 [4,] "Del. Sablan, Gregorio Kilili Camacho [D-MP]" "1" 
 [5,] "Del. San Nicolas, Michael F. Q. [D-GU]"      "1" 
 [6,] "Rep. Adams, Alma S. [D-NC]"                  "1" 
 [7,] "Rep. Aderholt, Robert B. [R-AL]"             "2" 
 [8,] "Rep. Aguilar, Pete [D-CA]"                   "1" 
 [9,] "Rep. Allen, Rick W. [R-GA]"                  "2" 
[10,] "Rep. Allred, Colin Z. [D-TX]"                "1" 
[11,] "Rep. Amodei, Mark E. [R-NV]"                 "2" 
[12,] "Rep. Armstrong, Kelly [R-ND]"                "2" 
[13,] "Rep. Arrington, Jodey C. [R-TX]"             "2" 
[14,] "Rep. Auchincloss, Jake [D-MA]"               "1" 
[15,] "Rep. Axne, Cynthia [D-IA]"                   "1" 

Only the first 15 rows are shown, but the pattern is consistent across all of them: Louvain clustering split the network into two groups, Democrats and Republicans. Using network clustering/community detection techniques, we can identify latent structures in networks.
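
We can check this more systematically by pulling each member's party out of the bracketed label in their name and cross-tabulating it with the cluster assignment (a sketch; it assumes every label contains either [D- or [R-):

# extract party from labels like "Rep. Pelosi, Nancy [D-CA]"
party <- if_else(str_detect(clusters$names, "\\[D-"), "Democrat", "Republican")
table(party, clusters$membership)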

Exercise
  • Try the following clustering algorithms: cluster_infomap and cluster_fast_greedy. Do a quick search to understand how they work. Do you find differences in the results? A possible starting point is sketched below.
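
A possible starting point (a sketch; both functions are in igraph, and cluster_fast_greedy requires an undirected graph without multi-edges, which ours is):

clusters_infomap <- cluster_infomap(graph)
clusters_greedy <- cluster_fast_greedy(graph)
# how many communities does each method find?
length(clusters_infomap)
length(clusters_greedy)
# how do the assignments overlap?
table(membership(clusters_infomap), membership(clusters_greedy))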

Credits: This tutorial borrows heavily from and builds on Matthew Wigginton Bhagat-Conway’s Network Analysis tutorial for PLAN 372, Spring 2023.