library(tidyverse)
library(igraph)Network Analysis
In this exercise we will learn the basics of working with networks in R and use graph analysis to better understand collaboration networks in the US House.
Basics of igraph
Making a simple graph
g <- make_graph(edges = c(1,2, 1,5), n=10, directed = FALSE)
plot(g)This is a graph with 10 vertices labeled 1 to 10. 1 is connected to 2 and 5, and there are no other connections.
The connections are also undirected.
The following graph represents the social network of Zachary’s karate club, that shows the friendship between 34 members of a karate club at a US university in the 1970s:
g <- make_graph('Zachary')
plot(g)Let’s change the layout to see if we can make sense of the network better.
plot(g, layout = layout.circle(g))Accessing edges and vertices
Edges and vertices of graphs can be accessed using E(g) and V(g). We can also add additional information to edges and vertices, that might be more helpful in selecting the vertices/edges we are most interested in.
V(g)+ 34/34 vertices, from 3dc6c99:
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34
#Adding additional values to vertices
V(g)$val <- 3:36
V(g)$val [1] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
[26] 28 29 30 31 32 33 34 35 36
We can add other attributes to the vertices. These will be helpful if we want to filter nodes of the graph.
# Create a vector of 34 names
names <- c("Amelia", "Anna", "Ava", "Bella", "Charlotte", "Chloe", "Emma", "Emily", "Evelyn", "Gabriella", "Hannah", "Isabella", "Jessica", "Julia", "Madison", "Mia", "Olivia", "Sophia", "Aaliyah", "Amari", "Ayden", "Cameron", "Christopher", "Eli", "Ethan", "Gabriel", "James", "John", "Michael", "Nathan", "Noah", "William", "Nate", "Raghav")
# Print the vector
V(g)$people <- names
head(V(g)$people)[1] "Amelia" "Anna" "Ava" "Bella" "Charlotte" "Chloe"
E(g)+ 78/78 edges from 3dc6c99:
[1] 1-- 2 1-- 3 1-- 4 1-- 5 1-- 6 1-- 7 1-- 8 1-- 9 1--11 1--12
[11] 1--13 1--14 1--18 1--20 1--22 1--32 2-- 3 2-- 4 2-- 8 2--14
[21] 2--18 2--20 2--22 2--31 3-- 4 3-- 8 3--28 3--29 3--33 3--10
[31] 3-- 9 3--14 4-- 8 4--13 4--14 5-- 7 5--11 6-- 7 6--11 6--17
[41] 7--17 9--31 9--33 9--34 10--34 14--34 15--33 15--34 16--33 16--34
[51] 19--33 19--34 20--34 21--33 21--34 23--33 23--34 24--26 24--28 24--33
[61] 24--34 24--30 25--26 25--28 25--32 26--32 27--30 27--34 28--34 29--32
[71] 29--34 30--33 30--34 31--33 31--34 32--33 32--34 33--34
E(g)[1]+ 1/78 edge from 3dc6c99:
[1] 1--2
Degree of a graph
Degreee Perhaps the simplest property one can think of is the degree. The degree of a vertex equals the number of edges adjacent to it.
degree(g) [1] 16 9 10 6 3 4 4 4 5 2 3 1 2 5 2 2 2 2 2 3 2 2 2 5 3
[26] 3 2 4 3 4 4 6 12 17
cbind(V(g), degree(g)) %>% head() [,1] [,2]
[1,] 1 16
[2,] 2 9
[3,] 3 10
[4,] 4 6
[5,] 5 3
[6,] 6 4
Now that we have some idea about working with graphs, let’s work with a planning related dataset and learn more graph analysis techniques.
Network of bill cosponsorshop in 117th congress
We have data from the 117th Congress (the current congress) on cosponsorship of bills, and we will use this data to define a graph or network where representatives are the nodes and cosponsorship of bills are the edges.
The dataset we have lists the sponsor and cosponsor(s) of every bill introduced in the 117th House where this data is available from the Government Printing Office. We will create a cosponsorship network where we create an edge between every cosponsor and the sponsor of that bill (we are not creating links between cosponsors of the same bill, as they may not actually have worked together). We are only including original cosponsors of the bill, not ones that signed on later.
First, we read the data
data = read_csv("sponsors.csv")Rows: 50787 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): cosponsor, sponsor, bill
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)# A tibble: 6 × 3
cosponsor sponsor bill
<chr> <chr> <chr>
1 Rep. Pelosi, Nancy [D-CA] Rep. Sarbanes, John P. [D-MD] HR 1
2 Rep. Lofgren, Zoe [D-CA] Rep. Sarbanes, John P. [D-MD] HR 1
3 Rep. Lowenthal, Alan S. [D-CA] Rep. Fitzpatrick, Brian K. [R-PA] HR 100
4 Rep. Carson, Andre [D-IN] Rep. Maloney, Carolyn B. [D-NY] HR 1004
5 Rep. Carson, Andre [D-IN] Rep. Maloney, Carolyn B. [D-NY] HR 1005
6 Rep. Carson, Andre [D-IN] Rep. Maloney, Carolyn B. [D-NY] HR 1006
To turn this into a graph, we need to format it into a data frame that has columns for the start and end of every edge, but we don’t want to have duplicate edges.
First, we need to modify the data frame to only include a single record for each sponsor/cosponsor pair. This is a little bit tricky because we are using an undirected graph - we don’t want duplicate records for the same two representatives when one was a sponsor and the other was a cosponsor, or vice-versa. To do this, we create a new data frame with sponsor1 and sponsor2 columns, and just put whoever comes earlier in alphabetical order first.
sponsors = mutate(data,
sponsor1=if_else(sponsor < cosponsor, sponsor, cosponsor),
sponsor2=if_else(sponsor < cosponsor, cosponsor, sponsor)
) %>%
select(-c(sponsor, cosponsor)) #removing sponsor and cosponsor columns
head(sponsors)# A tibble: 6 × 3
bill sponsor1 sponsor2
<chr> <chr> <chr>
1 HR 1 Rep. Pelosi, Nancy [D-CA] Rep. Sarbanes, John P. [D-MD]
2 HR 1 Rep. Lofgren, Zoe [D-CA] Rep. Sarbanes, John P. [D-MD]
3 HR 100 Rep. Fitzpatrick, Brian K. [R-PA] Rep. Lowenthal, Alan S. [D-CA]
4 HR 1004 Rep. Carson, Andre [D-IN] Rep. Maloney, Carolyn B. [D-NY]
5 HR 1005 Rep. Carson, Andre [D-IN] Rep. Maloney, Carolyn B. [D-NY]
6 HR 1006 Rep. Carson, Andre [D-IN] Rep. Maloney, Carolyn B. [D-NY]
Now we just need to create a version of the data with a single entry for each sponsor pair
sponsor_pairs <- sponsors %>%
group_by(sponsor1, sponsor2) %>%
summarize() #note the empty parameters inside summarize. This will only have the variables we are summarizing by, but no additional columns. Which is exactly what we want.`summarise()` has grouped output by 'sponsor1'. You can override using the
`.groups` argument.
Finally, we can create the graph
graph <- graph_from_data_frame(sponsor_pairs, directed=F)Let’s plot the graph.
plot(graph)It’s a mess.
plot(graph,
vertex.label = NA,
vertex.size = 2,
layout = layout_on_sphere(graph),
edge.width = 0.01)It’s still a large network that’s hard to make sense of. There also seems to be a huge amount of interconnection. The larger a network grows, the harder it’s to visualize and make sense of without the help of mathematics.
We can compute shortest paths in the Congress network.
Representative Ralph Norman [R-SC] is considered the most conservative house member, and Barbara Lee [D-CA] is the most liberal based on govtrack reports.
Let’s compute the shortest path between them
shortest_paths(graph, "Rep. Lee, Barbara [D-CA]", "Rep. Norman, Ralph [R-SC]")$vpath
$vpath[[1]]
+ 3/447 vertices, named, from 3e7a8c3:
[1] Rep. Lee, Barbara [D-CA] Rep. Biggs, Andy [R-AZ]
[3] Rep. Norman, Ralph [R-SC]
$epath
NULL
$predecessors
NULL
$inbound_edges
NULL
that’s a bit surprising - I would not have expected to only need one intermediary (Rep. Andy Biggs [R-AZ]) to get from the most liberal to most conservative member of the House. But it’s true; we can look at the data to confirm.
First, find bills that were sponsored by Lee and cosponsored by Biggs, or vice versa Biggs will always be sponsor1 since his name is first alphabetically
filter(sponsors, sponsor1=="Rep. Biggs, Andy [R-AZ]", sponsor2=="Rep. Lee, Barbara [D-CA]")# A tibble: 1 × 3
bill sponsor1 sponsor2
<chr> <chr> <chr>
1 HR 256 Rep. Biggs, Andy [R-AZ] Rep. Lee, Barbara [D-CA]
Lee introduced HR 256 and Biggs cosponsored, which was a repeal of the 2002 Congressional authorization of use of force in Iraq (https://www.congress.gov/bill/117th-congress/house-bill/256/cosponsors)
Biggs will also appear first in the bill cosponsored with Norman
filter(sponsors, sponsor1=="Rep. Biggs, Andy [R-AZ]", sponsor2=="Rep. Norman, Ralph [R-SC]")# A tibble: 13 × 3
bill sponsor1 sponsor2
<chr> <chr> <chr>
1 HR 1901 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
2 HR 2384 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
3 HR 2593 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
4 HR 380 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
5 HR 381 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
6 HR 5106 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
7 HR 5284 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
8 HR 5360 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
9 HR 581 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
10 HR 6395 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
11 HR 6703 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
12 HR 6828 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
13 HR 7262 Rep. Biggs, Andy [R-AZ] Rep. Norman, Ralph [R-SC]
These representatives have cosponsored many bills, perhaps less surprising since they are both Republicans, and Biggs is fairly conservative.
That said, it’s still somewhat surprising that the most conservative and most liberal members of Congress are so close together in cosponsorship space. Maybe this is just a fluke because Lee sponsored a popular bill? Let’s try with two other very liberal and conservative members, Alexandria Ocasio-Cortez [D-NY] and Madison Cawthorn [R-NC].
shortest_paths(graph, "Rep. Ocasio-Cortez, Alexandria [D-NY]", "Rep. Cawthorn, Madison [R-NC]")$vpath
$vpath[[1]]
+ 3/447 vertices, named, from 3e7a8c3:
[1] Rep. Ocasio-Cortez, Alexandria [D-NY]
[2] Rep. Johnson, Henry C. "Hank," Jr. [D-GA]
[3] Rep. Cawthorn, Madison [R-NC]
$epath
NULL
$predecessors
NULL
$inbound_edges
NULL
No, this doesn’t appear to be a fluke. We can use some graph-level metrics to determine how prevalent this is. One such metric is the mean distance - how long is the average path between any two representatives?
mean_distance(graph)[1] 1.758409
That is pretty short. It’s worth checking how many graph “components” there are - a component is a disconnected part of the graph with no paths to the other representatives. If there is only one component, then all nodes in the graph are directly or indirectly connected to all others. If there are many components, the mean distance could be artificially low because it’s not including distances between representatives in different components.
count_components(graph)[1] 1
Every representative is connected to every other representative.
We can compute all of the distances also, for all pairs.
dists = distances(graph)what is longest distance?
max(dists)[1] 4
What is the distribution of distances?
table(dists)dists
0 1 2 3 4
447 49768 147992 1600 2
let’s find the pairs with the longest distance between them - you could consider these representatives the “most dissimilar.” The “which” function will find the indices where the expression is true, and the arr.ind tells it to return original array/matrix indices.
which(dists == 4, arr.ind=T) row col
Rep. Richmond, Cedric L. [D-LA] 440 59
Rep. Carey, Mike [R-OH] 59 440
Another measure of connectedness is the edge density, which is simply a ratio of how many connections there are in the graph compared to how many there could be. This is pretty easy to calculate
total_edges = length(E(graph))
total_nodes = length(V(graph))How many possible edges are there in the graph?
possible_edges = total_nodes * (total_nodes - 1) / 2now we can calculate edge density
total_edges / possible_edges[1] 0.2496363
igraph has a build in function for edge density as well. By default, this function assumes that “loop edges” from a node to itself don’t exist, which they don’t in this graph, but if you did have loop edges you would need to specify loop=T to correctly calculate the denominator for this metric.
edge_density(graph)[1] 0.2496363
There are a bunch of different types of centrality metrics, all of which aim to measure how central a given node (or edge, but we won’t look at that here) is to the graph overall. The simplest is “degree centrality”, which is just the “degree” or number of edges connected to a node. Does the degree centrality suggest any reason why Mike Carey and Cedric Richmond were so far apart?
head(degree(graph) %>% sort()) Rep. Richmond, Cedric L. [D-LA] Rep. Fudge, Marcia L. [D-OH]
1 6
Rep. Carey, Mike [R-OH] Rep. Cherfilus-McCormick, Sheila [D-FL]
7 8
Rep. Haaland, Debra A. [D-NM] Rep. Brown, Shontel M. [D-OH]
8 13
more complex types of centrality include closeness centrality, which is the inverse of the total distance from a node to all others. So higher closeness centrality indicates you can reach more representatives with fewer hops. Closeness centrality is often normalized by multiplying by the number of nodes - 1 so it represents the inverse of the average distance rather than the sum.
head(closeness(graph, normalize=T) %>% sort()) Rep. Richmond, Cedric L. [D-LA] Rep. Carey, Mike [R-OH]
0.4007188 0.4313346
Rep. Cherfilus-McCormick, Sheila [D-FL] Rep. Brown, Shontel M. [D-OH]
0.4464464 0.4754797
Rep. Fudge, Marcia L. [D-OH] Rep. Haaland, Debra A. [D-NM]
0.4806034 0.4853101
Another type of centrality metric is betweenness centrality - this measures how often a representative is on the shortest path between all pairs of other representativess.
It could be treated as a measure of the power a representative has to connect others, though treating it as an “importance” can be tricky. People often assume that representatives with high betweeness centrality are very important to the network and if removed would increase the distance significantly, but this is not necessarily the case as betweenness centrality doesn’t mean there is not another representative that could play a similar role.
head(betweenness(graph) %>% sort()) Rep. Richmond, Cedric L. [D-LA] Rep. Fudge, Marcia L. [D-OH]
0.00000000 0.03636754
Rep. Cherfilus-McCormick, Sheila [D-FL] Rep. Carey, Mike [R-OH]
0.04173913 0.15322642
Rep. Haaland, Debra A. [D-NM] Rep. Hoyer, Steny H. [D-MD]
0.56132015 1.24425937
The final metric of centrality we’ll consider is eigenvector centrality. The centrality of each node is a scaled sum of the centralities of all the nodes it’s connected to, so a representative that is connected to other representatives that are highly connected scores more highly than a representative that is connected to the same number of less connected representatives. As it happens, this is also the eigenvector of the adjacency matrix (i.e. a square matrix with the nodes as rows and columns and a 1 if there is an edge between the nodes referenced by the rows and columns). This is roughly the algorithm used by Google to rank web pages, or at least it was when Google first started.
Unlike other centrality metrics, the eigen_centrality function returns a list. We can get the vector of centralities with $vector
#head(eigen_centrality(graph)$vector %>% sort())
tail(eigen_centrality(graph)$vector %>% sort()) Rep. Cicilline, David N. [D-RI] Rep. Speier, Jackie [D-CA]
0.8873877 0.8938260
Rep. DeFazio, Peter A. [D-OR] Rep. Schakowsky, Janice D. [D-IL]
0.8965134 0.9104987
Rep. Cohen, Steve [D-TN] Rep. Suozzi, Thomas R. [D-NY]
0.9261281 1.0000000
centralities = tibble(name=names(degree(graph)), degree=degree(graph), betweenness=betweenness(graph), closeness=closeness(graph), eigen=eigen_centrality(graph)$vector)
select(centralities, -name) %>% cor() degree betweenness closeness eigen
degree 1.0000000 0.8011382 0.9887684 0.8744824
betweenness 0.8011382 1.0000000 0.8152299 0.5534843
closeness 0.9887684 0.8152299 1.0000000 0.8631759
eigen 0.8744824 0.5534843 0.8631759 1.0000000
centralities %>% arrange(closeness) %>% head()# A tibble: 6 × 5
name degree betweenness closeness eigen
<chr> <dbl> <dbl> <dbl> <dbl>
1 Rep. Richmond, Cedric L. [D-LA] 1 0 0.000898 0.00618
2 Rep. Carey, Mike [R-OH] 7 0.153 0.000967 0.0127
3 Rep. Cherfilus-McCormick, Sheila [D-FL] 8 0.0417 0.00100 0.0405
4 Rep. Brown, Shontel M. [D-OH] 13 1.90 0.00107 0.0494
5 Rep. Fudge, Marcia L. [D-OH] 6 0.0364 0.00108 0.0337
6 Rep. Haaland, Debra A. [D-NM] 8 0.561 0.00109 0.0399
We’ve been working with an unweighted graph up until now, meaning the edges were all the same. However, it’s reasonable to think that people who cosponsor more often might be more connected. We can use a weighted graph instead. In a weighted graph, each edge is assigned a “weight,” which can be thought of as a length or distance. Then a shortest path algorithm will not find the fewest number of links, but the lowest total weight. For instance, we can create a graph where the weight is the reciprocal of the number of times two representatives worked together. All we need to do is create a weight column in the data frame we use to create the graph. Copy the graph creation code from above and modify it to have a weight column that is the reciprocal of the number of times two representatives worked together.
sponsor_pairs = group_by(sponsors, sponsor1, sponsor2) %>%
summarize(weight=1/n())`summarise()` has grouped output by 'sponsor1'. You can override using the
`.groups` argument.
weight_graph = graph_from_data_frame(sponsor_pairs, directed=F)
centralities = tibble(name=names(degree(weight_graph)), degree=degree(weight_graph), betweenness=betweenness(weight_graph), closeness=closeness(weight_graph), eigen=eigen_centrality(weight_graph)$vector)
head(centralities)# A tibble: 6 × 5
name degree betweenness closeness eigen
<chr> <dbl> <dbl> <dbl> <dbl>
1 Del. Norton, Eleanor Holmes [D-DC] 225 4807. 0.00634 0.415
2 Del. Plaskett, Stacey E. [D-VI] 43 13.3 0.00387 0.153
3 Del. Radewagen, Aumua Amata Coleman [R-AS] 35 65 0.00432 0.142
4 Del. Sablan, Gregorio Kilili Camacho [D-MP] 149 448 0.00495 0.549
5 Del. San Nicolas, Michael F. Q. [D-GU] 103 65.3 0.00445 0.386
6 Rep. Adams, Alma S. [D-NC] 153 68 0.00493 0.444
how do the centralities compare to before?
Some representatives have become much more important in some centrality measures (e.g. Eleanor Holmes Norton [D-DC], who is a delegate rather than a representative because residents of the District of Columbia do not have a vote in Congress.) Why do you think this is?
Some representatives have become more central because they co-sponsor a lot of bills, some of which may be co-sponsored by many representatives.
Graph clustering
Graph clustering, also known as community detection, is the process of partitioning nodes in a graph into groups or clusters based on their connectivity patterns. It aims to identify densely connected subgroups of nodes that exhibit stronger internal connections than connections with nodes outside the group.
The goal of graph clustering is to reveal the underlying structure and organization of complex networks by grouping nodes that share similar characteristics or functions. Clusters represent cohesive subsets of nodes that are more densely interconnected, indicating potential communities or modules within the network. The identification of these communities helps in understanding the relationships, dynamics, and functional properties of the network.
Graph clustering is also known as community detection as it can be used to identify closely connected communities within larger graphs.
Let’s use Louvain method to identify clusters in the graph.
The Louvain method is an algorithm that helps us find groups or communities in a network. It works by looking at how connected the different parts of the network are. First, each node is put in their own group. Then, we repeatedly check if moving a node to a different group would make the groups more connected. If it would, we move them to that group. We keep doing this until we can’t make the groups any more connected. This helps us find groups of nodes that are closely connected to each other in a network, like friends in a social network or neighborhoods in a city.
set.seed(123)
clusters <- cluster_louvain(graph)It creates two clusters.
cbind(clusters$names, clusters$membership) %>% head(n=15) [,1] [,2]
[1,] "Del. Norton, Eleanor Holmes [D-DC]" "1"
[2,] "Del. Plaskett, Stacey E. [D-VI]" "1"
[3,] "Del. Radewagen, Aumua Amata Coleman [R-AS]" "1"
[4,] "Del. Sablan, Gregorio Kilili Camacho [D-MP]" "1"
[5,] "Del. San Nicolas, Michael F. Q. [D-GU]" "1"
[6,] "Rep. Adams, Alma S. [D-NC]" "1"
[7,] "Rep. Aderholt, Robert B. [R-AL]" "2"
[8,] "Rep. Aguilar, Pete [D-CA]" "1"
[9,] "Rep. Allen, Rick W. [R-GA]" "2"
[10,] "Rep. Allred, Colin Z. [D-TX]" "1"
[11,] "Rep. Amodei, Mark E. [R-NV]" "2"
[12,] "Rep. Armstrong, Kelly [R-ND]" "2"
[13,] "Rep. Arrington, Jodey C. [R-TX]" "2"
[14,] "Rep. Auchincloss, Jake [D-MA]" "1"
[15,] "Rep. Axne, Cynthia [D-IA]" "1"
I only showed 15 observations, but it’s pretty consistent over all observations that Louvain clustering clustered the network to two groups: Democrats and Republicans. Using network clustering/community detection techniques, we can identify latent structures in networks.
Credits: This tutorial heavily borrows and builds on Matthew Wigginton Bhagat-Conway’s Network Analysis tutorial for PLAN 372 Spring 2023.