for a while now, I’ve been interested in learning about building network graphs in R. However, most tutorials that I’ve seen use data that already shaped to describe the network structure. In my line of work I’m more used to dealing with financial data and I’ve been struggling to come up with a network visualization use case in that domain.
Recently I finally got around to learning more about this subject and different implementation methods available in R. In this post I provide step-by-step guide on turning real-life raw financial data into a network graph and extracting some actionable insight.
Since this is my first foray into this area, the code below is aimed to help beginners like me take their first steps in hands-on network analysis. I realize there are probably better tools and methods for achieving the same results, so your feedback with corrections and/or suggestions would be appreciated.
The basic idea came to me from this post on using network analysis to detect fraud and other types of illicit behavior.
The authors describe a False Claims Act case regarding the quantification and identification of fraudulent medical claims. The case involved over 140,000 medical claims where a sample of 2,500 claims was selected and reviewed for suspicious activities. Of these 2,500 medical claims, 161 were identified as being false claims. The traditional types of unsupervised analyses, such as correlations, tabulations, and charts produced little or no results. Even the more sophisticated supervised predictive models, such as logistic regression, produced poor results and had little ability to identify the fraudulent claims.
When doctors treat the same patient, it is very likely they would have interactions with one another and thus know each other. This information allows us to create a network graph of doctors, detailing the relationships and the extent of interactions that the doctors had with one another and, more importantly, how these interactions were related to the fraudulent claims. Below is a representative example of the network of doctors, where the doctors associated with patients with fraudulent claims have been marked as a red node.
By creating a doctor network from actual claims data, patterns associated with fraud cases (to which the more traditional methods are blind) begin to emerge. One of the obvious features of this doctor network is homophily; that is, doctors who are associated with fraudulent claims are more likely to be connected to other doctors who are also associated with fraudulent claims. In other words, “fraud begets fraud.” In fact, of the 18 doctors who were associated with fraudulent claims only 2 were not connected with another doctor who was also associated with fraudulent claims. Stated another way, 16 of the 18 doctors who were associated with a fraudulent claim were connected with at least one other doctor who was also associated with a fraudulent claim.
In addition to the above use case example, I used this Introduction to Network Mathematics by Bruce Hoppe as an excellent primer on the theoretical concepts behind network graphs.
Instead of using the same old textbook examples to learn network visualization, I wanted to use real-world data to build a graph analogous to the network of doctors connected by common patients as shown above. I decided to use the City of Baltimore contracts data available from this portal, which I downloaded into a CSV file. My goal was not to uncover fraud, but to be able to visualize a network of city agencies interconnected by their common vendors. The a priori intuition in this case is that the agencies with similar missions would be connected through their use of common contractors.
After some research and going over the documentation for the igraph package, I came up with the following R program.
First, we load the required packages and examine the structure of the City’s contracts data.
# load required packages
library(igraph)
library(networkD3)
library(RColorBrewer)
library(DT)
library(knitr)
library(tidyr)
library(dplyr)
# markdown options
knitr::opts_chunk$set(tidy=FALSE)
# load raw data from csv
bmore.raw <-
read.csv("Baltimore_City_Contracts_-__Bureau_of_Purchases__Department_of_Finance.csv",
stringsAsFactors=FALSE)
# sample of raw data
bmore.raw %>% head(5) %>% kable
| contractNo | desc | vendorID | vendorName | agency | beginDate | endDate | totalContractAmt | amtSpent | amtRemaining |
|---|---|---|---|---|---|---|---|---|---|
| P503784:0 | Way Finding Signage - Heritage | 363 | Triangle Sign & Service | MAYOR’S OFFICE OF CIVIC PROMOTION | 08/20/2011 | 08/19/2012 | $541648.00 | $344620.00 | $197028.00 |
| P503784:0 | Way Finding Signage - Heritage | 363 | Triangle Sign & Service | City-Wide | 08/20/2011 | 08/19/2012 | $100000.00 | $76020.00 | $23980.00 |
| P504527:0 | Railroad Spikes | 936 | ROBNET, INC | PUBLIC WORKS | 10/08/2010 | 09/13/2011 | $11700.00 | $11700.00 | $0.00 |
| P506086:0 | EXTERIOR WINDOW WASHING | 4283 | With Anointed Hands, LLC | ENOCH PRATT FREE LIBRARY | 01/02/2011 | 01/20/2012 | $69480.00 | $62790.00 | $6690.00 |
| P508318:0 | Lead Abatement Low Income Residences Part II | 5961 | Coalition to End Childhood Lead Poisoning, Inc. | City-Wide | 07/18/2010 | 01/31/2012 | $261600.00 | $25765.00 | $235835.00 |
For every contract, the dataset provides the following information:
For the purposes of network analysis, we’re primarily interested in the connections between City agencies and their vendors.
Before jumping into building a network graph, I wanted to examine its nodes, aka verices. In our case the graph would have two distinct types of vertices: Agencies and Vendors.
I begin by aggregating the data by City Agency and calculating the total number of vendor connections, number of contracts, and contract amounts.
agency <-
bmore.raw %>%
# aggregate by agency
group_by(agency) %>%
summarise(vendor_count = length(unique(vendorID)),
contr_count = n(),
contr_amt = as.integer(sum(extract_numeric(totalContractAmt))/1000)) %>%
# sort by number of vendors
arrange(-vendor_count)
agency %>% datatable
The Agency summary table shows that the number of Vendor connections varies from 1 to 102.
Similarly, I aggregate the data by Vendor and calculate the total number of Agency connections, number of contracts, and contract amounts.
vendor <-
bmore.raw %>%
# aggregate by agency
group_by(vendorName) %>%
summarise(agency_count = length(unique(agency)),
contr_count = n(),
contr_amt = as.integer(sum(extract_numeric(totalContractAmt))/1000)) %>%
# sort by number of vendors
arrange(-agency_count)
vendor %>% head(20) %>% datatable
The Vendor summary table reveals that the number Agency connections per Vendor is much lower with a maximum number of Agencies contracting one vendor = 4.
Next, I define a list of network connections between the Agencies and their Vendors. To avoid multiple connections between the same Agency-Vendor pairs, I collapse the data by each unique pair and sum their contract amounts.
conn <-
bmore.raw %>%
select(agency,
vendor = vendorName,
amt = totalContractAmt) %>%
group_by(agency, vendor) %>%
summarise(n = n(), amt = as.integer(sum(extract_numeric(amt))/1000)) %>%
ungroup
conn %>% head(100) %>% datatable
Finally, I’m ready to build my first network graph from the data. Note that the connection between an Agency and a Vendor is always one-way, I use the option directed = TRUE.
set.seed(551)
# define network graph from data frame
g <- graph_from_data_frame(conn, directed = TRUE)
# list of graph vertices (nodes)
V(g)
## + 522/522 vertices, named:
## [1] BUREAU OF PURCHASES
## [2] City-Wide
## [3] COMM. ON AGING
## [4] COMM. RELATIONS
## [5] COMPTROLLER
## [6] CONVENTION COMPLEX
## [7] COURT: ORPHANS COURT
## [8] ENOCH PRATT FREE LIBRARY
## [9] FINANCE
## [10] FIRE
## + ... omitted several vertices
# list of graph edges (connections)
E(g)
## + 571/571 edges (vertex names):
## [1] BUREAU OF PURCHASES->Braden Sutphin Ink Co.
## [2] BUREAU OF PURCHASES->c&j graphics inc.
## [3] BUREAU OF PURCHASES->Capital Summit Laminators, Inc
## [4] BUREAU OF PURCHASES->Compass Group
## [5] BUREAU OF PURCHASES->Downs Engravers and Stationers
## [6] BUREAU OF PURCHASES->Express Auction Services
## [7] BUREAU OF PURCHASES->G E Richards' Graphic Supply
## [8] BUREAU OF PURCHASES->General Binding Corporation
## [9] BUREAU OF PURCHASES->LS Patton Printing Supplies
## [10] BUREAU OF PURCHASES->MARTIN SUPPLY INC.
## + ... omitted several edges
# plot the graph
par(mai=c(0,0,1,0))
plot(g)
Not surprisingly, given so many graph vertices (522), my initial graph is a mess. There are too many labels and each one of the nodes is of the same size.
To make the graph more readable, I differentiate between Agencies and Vendors. Since there are much fewer Agencies (26) than Vendors (496), I only label the Agencies and suppress Vendor labels. I also assign them different colors.
In addition, I adjust the size of each Agency node by the number of outgoing connections, so that an Agency with connections to more Vendors is larger.
The full list of graph formatting options is documented here.
# index of Vendors
is.vendor <- V(g)$name %in% conn$vendor
# Vendor labels
V(g)$label[is.vendor] <- NA
# Agency labels
V(g)$label[!is.vendor] <- V(g)$name[!is.vendor]
# node colors
V(g)$color <- ifelse(is.vendor, 'black', brewer.pal(3,'Pastel2')[1])
# node size
V(g)$size<-degree(g, mode = 'out')/5
plot(g,
vertex.label.color='darkblue',
vertex.label.cex=0.75, vertex.frame.color = NA,
edge.arrow.size = 0, edge.arrow.width = 0)
The resulting graph is not perfect, but much more readable. The Agency nodes are sized according to their respective number of outgoing connections (degrees). The Vendor nodes are shrunk to nothing, due to their lack of any outgoing connections. we can see that the Agencies with the most connections are naturally more likely to have Vendors in common with other Agencies. These large, well-connected Agencies form an inter-connected cluster of nodes in the center of the graph. The less-connected Agencies are located along the graph’s periphery.
I’m only interested in the Agency nodes and their interconnectedness. However, the Agencies are not connected to each other directly, but rather indirectly through a network of Vendors. This challenge is common in fraud detection scenarios, where it’s difficult to connect entities that want to appear not connected.
As the next step of my analysis, I would like to drop all Vendor nodes from the graph and explicitly connect those Agency nodes that share common Vendors. To achieve this goal, I was able to use the distances() function from the igraph package, which calculates the shortest distance (in terms of number of connections) between any given set of nodes. Below, is a partial distance matrix for Agency nodes only.
# matrix of shortest distances between non-vendors
d.mat <- distances(g, V(g)[!is.vendor], to = V(g)[!is.vendor])
d.mat[1:8, 1:8] %>% datatable()
The diagonal of this matrix contains all zeros, because the shortest distance from a node to itself is zero. The blank matrix cells indicate that the two nodes are not connected at all, as they do not have any Vendors in common. The shortest possible distance between two Agencies is 2, because to connect to another Agency, we must go through at least one Vendor. I use the code below to extract a list of Agency-to-Agency connections where the shortest distance equals to 2, meaning both nodes are connected directly by one Vendor.
# remove values in the lower triangle (avoid duplicates A->B is the same as b->A)
# and on matrix diagonal (connection to itself)
d.mat[!upper.tri(d.mat)] <- NA
# convert matrix to dataframe
d <-
d.mat %>%
as.data.frame() %>%
bind_cols(data_frame(agy1=row.names(.)), .) %>%
# reshape table from wide to tall
gather(agy2, d, -agy1, convert=T) %>%
# keep direct connections only
filter(d==2)
d
## Source: local data frame [47 x 3]
##
## agy1 agy2 d
## (chr) (chr) (dbl)
## 1 COMM. ON AGING ENOCH PRATT FREE LIBRARY 2
## 2 BUREAU OF PURCHASES FIRE 2
## 3 City-Wide FIRE 2
## 4 ENOCH PRATT FREE LIBRARY FIRE 2
## 5 City-Wide GENERAL SERVICES 2
## 6 ENOCH PRATT FREE LIBRARY GENERAL SERVICES 2
## 7 FIRE GENERAL SERVICES 2
## 8 City-Wide HEALTH 2
## 9 COMM. ON AGING HEALTH 2
## 10 ENOCH PRATT FREE LIBRARY HEALTH 2
## .. ... ... ...
We can now build another graph showing the network of Agencies with direct connections. It looks similar to the previous plot, but it’s much cleaner, because we dropped all Agency connections to hundreds of different Vendors.
g2 <- graph_from_data_frame(d, directed = F, vertices = agency)
plot(g2)
We can take this analysis a step further, by assigning the nodes with the most inter-connections into clusters and plotting them in different colors.
g2_groups <- cluster_optimal(g2)
V(g2)$color <- brewer.pal(12, 'Set3')[membership(g2_groups)]
plot(g2,
vertex.label.color='darkblue',
vertex.label.cex=0.75, vertex.frame.color = NA,
edge.arrow.size = 0, edge.arrow.width = 0)
Clustering the nodes reveals two major interconnected agency clusters. One contains agencies involved in law enforcement, public safety, and transportation (green nodes). The other big cluster has to do with health, housing, and community development (purple nodes). Unsurprisingly, in the center of it all is the City-wide category. A number of disconnected agencies appear the side of the graph, reflecting on the narrow and unique nature of their mission needs.
It is also possible to arrange the graph nodes into a circle with just a few lines of extra code. Of course, this layout lack the concept of network centrality apparent in the previous graph.
coords <- layout_in_circle(g2, order = order(membership(g2_groups)))
plot(g2,
layout = coords,
vertex.label.color='darkblue',
vertex.label.cex=0.75, vertex.frame.color = NA,
edge.arrow.size = 0, edge.arrow.width = 0)
D3.js is a JavaScript library for creating powerful data visualization documents. D3 helps bring data to life by rendering HTML-based graphics that are well-suited for interactive documents, such as web pages. networkD3 is an R package developed by Christopher Gandrud that lets users easily create D3 network, tree, dendrogram, and Sankey graphs with minimal amount of code.
We begin by rendering a simple graph by applying the simpleNetwork() function to the same data as before. This function requires a data frame of network links with source-target node names defined in the first two columns.
simpleNetwork(d, width = 750, height = 650,
charge=-5000, opacity = 0.95,
fontSize = 14, textColour='black')
The nodes customization options for the above simpleNetwork() function are minimal. To add cluster coloring, I used a more flexible forceNetwork() function. The data for this function needs to be formatted differently. It requires one dataset of graph nodes and a separate dataset of links defined in terms of node numbers. The code below converts the node names to node numbers (starting from 0).
# for each node assign a index ID starting with 0
node_id <-
data_frame(
id = 0:(nrow(agency)-1),
agency = agency$agency)
head(node_id, 5)
## Source: local data frame [5 x 2]
##
## id agency
## (int) (chr)
## 1 0 PUBLIC WORKS
## 2 1 City-Wide
## 3 2 GENERAL SERVICES
## 4 3 POLICE
## 5 4 FIRE
# for each source-target pair define a pair of numeric IDs
links <-
d %>%
left_join(node_id %>% rename(source = id), by = c("agy1" = "agency")) %>%
left_join(node_id %>% rename(target = id), by = c("agy2" = "agency")) %>%
arrange(source)
head(links, 5)
## Source: local data frame [5 x 5]
##
## agy1 agy2 d source target
## (chr) (chr) (dbl) (int) (int)
## 1 PUBLIC WORKS RECREATION & PARKS 2 0 12
## 2 PUBLIC WORKS SHERIFF 2 0 13
## 3 PUBLIC WORKS TRANSPORTATION 2 0 6
## 4 City-Wide FIRE 2 1 4
## 5 City-Wide GENERAL SERVICES 2 1 2
Note that the links are now specified by source-target node numbers (the last two columns). Now that the graph data is in the right format, it’s easy to build a fully formatted click-and-drag interactive D3 version of the agencies network graph.
forceNetwork(
# specify nodes data and fields
Nodes = agency %>% mutate(group = membership(g2_groups)),
NodeID = 'agency', Nodesize = 'vendor_count',
Group = 'group',
# specify links data and fields
Links = links,
Source = 'source', Target = 'target', Value = 'd',
# add formatting
opacity = 0.95, fontSize=14, charge = -500,
width = 750, height = 650
)
The examples above demonstrate how it is possible to use free and open source software like R to transform transaction-type data into network graph-ready data, build network graphs, and assign nodes into clusters based on the level of their network proximity. These methods are widely used in uncovering hidden pattern and helping to detect improper payments, fraud waste and abuse, and other illicit activities.
Download raw data and R Markdown code from GitHub.