Introduction

for a while now, I’ve been interested in learning about building network graphs in R. However, most tutorials that I’ve seen use data that already shaped to describe the network structure. In my line of work I’m more used to dealing with financial data and I’ve been struggling to come up with a network visualization use case in that domain.

Recently I finally got around to learning more about this subject and different implementation methods available in R. In this post I provide step-by-step guide on turning real-life raw financial data into a network graph and extracting some actionable insight.

Since this is my first foray into this area, the code below is aimed to help beginners like me take their first steps in hands-on network analysis. I realize there are probably better tools and methods for achieving the same results, so your feedback with corrections and/or suggestions would be appreciated.

Background and Motivation

The basic idea came to me from this post on using network analysis to detect fraud and other types of illicit behavior.

The authors describe a False Claims Act case regarding the quantification and identification of fraudulent medical claims. The case involved over 140,000 medical claims where a sample of 2,500 claims was selected and reviewed for suspicious activities. Of these 2,500 medical claims, 161 were identified as being false claims. The traditional types of unsupervised analyses, such as correlations, tabulations, and charts produced little or no results. Even the more sophisticated supervised predictive models, such as logistic regression, produced poor results and had little ability to identify the fraudulent claims.

When doctors treat the same patient, it is very likely they would have interactions with one another and thus know each other. This information allows us to create a network graph of doctors, detailing the relationships and the extent of interactions that the doctors had with one another and, more importantly, how these interactions were related to the fraudulent claims. Below is a representative example of the network of doctors, where the doctors associated with patients with fraudulent claims have been marked as a red node.

Doctor Network defined by common patients

By creating a doctor network from actual claims data, patterns associated with fraud cases (to which the more traditional methods are blind) begin to emerge. One of the obvious features of this doctor network is homophily; that is, doctors who are associated with fraudulent claims are more likely to be connected to other doctors who are also associated with fraudulent claims. In other words, “fraud begets fraud.” In fact, of the 18 doctors who were associated with fraudulent claims only 2 were not connected with another doctor who was also associated with fraudulent claims. Stated another way, 16 of the 18 doctors who were associated with a fraudulent claim were connected with at least one other doctor who was also associated with a fraudulent claim.

In addition to the above use case example, I used this Introduction to Network Mathematics by Bruce Hoppe as an excellent primer on the theoretical concepts behind network graphs.

The Data

Instead of using the same old textbook examples to learn network visualization, I wanted to use real-world data to build a graph analogous to the network of doctors connected by common patients as shown above. I decided to use the City of Baltimore contracts data available from this portal, which I downloaded into a CSV file. My goal was not to uncover fraud, but to be able to visualize a network of city agencies interconnected by their common vendors. The a priori intuition in this case is that the agencies with similar missions would be connected through their use of common contractors.

After some research and going over the documentation for the igraph package, I came up with the following R program.

Preparing Graph Data

Loading Raw Data and R Packages

First, we load the required packages and examine the structure of the City’s contracts data.

# load required packages
library(igraph)
library(networkD3)
library(RColorBrewer)
library(DT)
library(knitr)
library(tidyr)
library(dplyr)


# markdown options
knitr::opts_chunk$set(tidy=FALSE)

# load raw data from csv
bmore.raw <- 
    read.csv("Baltimore_City_Contracts_-__Bureau_of_Purchases__Department_of_Finance.csv", 
             stringsAsFactors=FALSE)

# sample of raw data
bmore.raw %>% head(5) %>% kable
contractNo desc vendorID vendorName agency beginDate endDate totalContractAmt amtSpent amtRemaining
P503784:0 Way Finding Signage - Heritage 363 Triangle Sign & Service MAYOR’S OFFICE OF CIVIC PROMOTION 08/20/2011 08/19/2012 $541648.00 $344620.00 $197028.00
P503784:0 Way Finding Signage - Heritage 363 Triangle Sign & Service City-Wide 08/20/2011 08/19/2012 $100000.00 $76020.00 $23980.00
P504527:0 Railroad Spikes 936 ROBNET, INC PUBLIC WORKS 10/08/2010 09/13/2011 $11700.00 $11700.00 $0.00
P506086:0 EXTERIOR WINDOW WASHING 4283 With Anointed Hands, LLC ENOCH PRATT FREE LIBRARY 01/02/2011 01/20/2012 $69480.00 $62790.00 $6690.00
P508318:0 Lead Abatement Low Income Residences Part II 5961 Coalition to End Childhood Lead Poisoning, Inc. City-Wide 07/18/2010 01/31/2012 $261600.00 $25765.00 $235835.00

For every contract, the dataset provides the following information:

  • Contracting Agency
  • Vendor name and ID
  • Contract description
  • Begin and end dates
  • Contract amounts

For the purposes of network analysis, we’re primarily interested in the connections between City agencies and their vendors.

Defining Network Nodes

Before jumping into building a network graph, I wanted to examine its nodes, aka verices. In our case the graph would have two distinct types of vertices: Agencies and Vendors.

City Agencies

I begin by aggregating the data by City Agency and calculating the total number of vendor connections, number of contracts, and contract amounts.

agency <- 
    bmore.raw %>% 
    # aggregate by agency
    group_by(agency) %>% 
    summarise(vendor_count = length(unique(vendorID)),
              contr_count = n(),
              contr_amt = as.integer(sum(extract_numeric(totalContractAmt))/1000)) %>% 
    # sort by number of vendors
    arrange(-vendor_count)

agency %>% datatable

The Agency summary table shows that the number of Vendor connections varies from 1 to 102.

Contract Vendors

Similarly, I aggregate the data by Vendor and calculate the total number of Agency connections, number of contracts, and contract amounts.

vendor <- 
    bmore.raw %>% 
    # aggregate by agency
    group_by(vendorName) %>% 
    summarise(agency_count = length(unique(agency)),
              contr_count = n(),
              contr_amt = as.integer(sum(extract_numeric(totalContractAmt))/1000)) %>% 
    # sort by number of vendors
    arrange(-agency_count)

vendor %>% head(20) %>% datatable

The Vendor summary table reveals that the number Agency connections per Vendor is much lower with a maximum number of Agencies contracting one vendor = 4.

Defining Network Connections

Next, I define a list of network connections between the Agencies and their Vendors. To avoid multiple connections between the same Agency-Vendor pairs, I collapse the data by each unique pair and sum their contract amounts.

conn <- 
    bmore.raw %>% 
    select(agency, 
           vendor = vendorName, 
           amt = totalContractAmt) %>% 
    group_by(agency, vendor) %>% 
    summarise(n = n(), amt = as.integer(sum(extract_numeric(amt))/1000)) %>% 
    ungroup

conn %>% head(100) %>% datatable

Network Graph with igraph package

Unformatted Graph

Finally, I’m ready to build my first network graph from the data. Note that the connection between an Agency and a Vendor is always one-way, I use the option directed = TRUE.

set.seed(551)

# define network graph from data frame
g <- graph_from_data_frame(conn, directed = TRUE)

# list of graph vertices (nodes)
V(g)
## + 522/522 vertices, named:
##   [1] BUREAU OF PURCHASES                                                        
##   [2] City-Wide                                                                  
##   [3] COMM. ON AGING                                                             
##   [4] COMM. RELATIONS                                                            
##   [5] COMPTROLLER                                                                
##   [6] CONVENTION COMPLEX                                                         
##   [7] COURT: ORPHANS COURT                                                       
##   [8] ENOCH PRATT FREE LIBRARY                                                   
##   [9] FINANCE                                                                    
##  [10] FIRE                                                                       
## + ... omitted several vertices
# list of graph edges (connections)
E(g)
## + 571/571 edges (vertex names):
##  [1] BUREAU OF PURCHASES->Braden Sutphin Ink Co.        
##  [2] BUREAU OF PURCHASES->c&j graphics inc.             
##  [3] BUREAU OF PURCHASES->Capital Summit Laminators, Inc
##  [4] BUREAU OF PURCHASES->Compass Group                 
##  [5] BUREAU OF PURCHASES->Downs Engravers and Stationers
##  [6] BUREAU OF PURCHASES->Express Auction Services      
##  [7] BUREAU OF PURCHASES->G E Richards' Graphic Supply  
##  [8] BUREAU OF PURCHASES->General Binding Corporation   
##  [9] BUREAU OF PURCHASES->LS Patton Printing Supplies   
## [10] BUREAU OF PURCHASES->MARTIN SUPPLY INC.            
## + ... omitted several edges
# plot the graph
par(mai=c(0,0,1,0))
plot(g)

Adjusting Graph Formatting

Not surprisingly, given so many graph vertices (522), my initial graph is a mess. There are too many labels and each one of the nodes is of the same size.

To make the graph more readable, I differentiate between Agencies and Vendors. Since there are much fewer Agencies (26) than Vendors (496), I only label the Agencies and suppress Vendor labels. I also assign them different colors.

In addition, I adjust the size of each Agency node by the number of outgoing connections, so that an Agency with connections to more Vendors is larger.

The full list of graph formatting options is documented here.

# index of Vendors
is.vendor <- V(g)$name %in% conn$vendor

# Vendor labels
V(g)$label[is.vendor] <- NA
# Agency labels
V(g)$label[!is.vendor] <- V(g)$name[!is.vendor]

# node colors
V(g)$color <- ifelse(is.vendor, 'black', brewer.pal(3,'Pastel2')[1])

# node size
V(g)$size<-degree(g, mode = 'out')/5

plot(g, 
     vertex.label.color='darkblue', 
     vertex.label.cex=0.75, vertex.frame.color = NA,
     edge.arrow.size = 0, edge.arrow.width = 0)

The resulting graph is not perfect, but much more readable. The Agency nodes are sized according to their respective number of outgoing connections (degrees). The Vendor nodes are shrunk to nothing, due to their lack of any outgoing connections. we can see that the Agencies with the most connections are naturally more likely to have Vendors in common with other Agencies. These large, well-connected Agencies form an inter-connected cluster of nodes in the center of the graph. The less-connected Agencies are located along the graph’s periphery.

Agency Network Graph

We can now build another graph showing the network of Agencies with direct connections. It looks similar to the previous plot, but it’s much cleaner, because we dropped all Agency connections to hundreds of different Vendors.

g2 <- graph_from_data_frame(d, directed = F, vertices = agency)
plot(g2)

Clustering Network Nodes

We can take this analysis a step further, by assigning the nodes with the most inter-connections into clusters and plotting them in different colors.

g2_groups <- cluster_optimal(g2)
V(g2)$color <- brewer.pal(12, 'Set3')[membership(g2_groups)]
plot(g2, 
     vertex.label.color='darkblue', 
     vertex.label.cex=0.75, vertex.frame.color = NA,
     edge.arrow.size = 0, edge.arrow.width = 0)

Interpreting the Results

Clustering the nodes reveals two major interconnected agency clusters. One contains agencies involved in law enforcement, public safety, and transportation (green nodes). The other big cluster has to do with health, housing, and community development (purple nodes). Unsurprisingly, in the center of it all is the City-wide category. A number of disconnected agencies appear the side of the graph, reflecting on the narrow and unique nature of their mission needs.

Re-Arranging Nodes into a Circle

It is also possible to arrange the graph nodes into a circle with just a few lines of extra code. Of course, this layout lack the concept of network centrality apparent in the previous graph.

coords <- layout_in_circle(g2, order = order(membership(g2_groups)))
plot(g2, 
     layout = coords, 
     vertex.label.color='darkblue', 
     vertex.label.cex=0.75, vertex.frame.color = NA,
     edge.arrow.size = 0, edge.arrow.width = 0)

Fun with Interactive Graphs

D3.js is a JavaScript library for creating powerful data visualization documents. D3 helps bring data to life by rendering HTML-based graphics that are well-suited for interactive documents, such as web pages. networkD3 is an R package developed by Christopher Gandrud that lets users easily create D3 network, tree, dendrogram, and Sankey graphs with minimal amount of code.

Simple Click-and-Drag Graph

We begin by rendering a simple graph by applying the simpleNetwork() function to the same data as before. This function requires a data frame of network links with source-target node names defined in the first two columns.

simpleNetwork(d, width = 750, height = 650, 
              charge=-5000, opacity = 0.95,
              fontSize = 14, textColour='black')

Adding Custom Formatting

The nodes customization options for the above simpleNetwork() function are minimal. To add cluster coloring, I used a more flexible forceNetwork() function. The data for this function needs to be formatted differently. It requires one dataset of graph nodes and a separate dataset of links defined in terms of node numbers. The code below converts the node names to node numbers (starting from 0).

# for each node assign a index ID starting with 0
node_id <- 
    data_frame(
        id = 0:(nrow(agency)-1),
        agency = agency$agency)

head(node_id, 5)
## Source: local data frame [5 x 2]
## 
##      id           agency
##   (int)            (chr)
## 1     0     PUBLIC WORKS
## 2     1        City-Wide
## 3     2 GENERAL SERVICES
## 4     3           POLICE
## 5     4             FIRE
# for each source-target pair define a pair of numeric IDs
links <- 
    d %>% 
    left_join(node_id %>% rename(source = id), by = c("agy1" = "agency")) %>% 
    left_join(node_id %>% rename(target = id), by = c("agy2" = "agency")) %>% 
    arrange(source)

head(links, 5)
## Source: local data frame [5 x 5]
## 
##           agy1               agy2     d source target
##          (chr)              (chr) (dbl)  (int)  (int)
## 1 PUBLIC WORKS RECREATION & PARKS     2      0     12
## 2 PUBLIC WORKS            SHERIFF     2      0     13
## 3 PUBLIC WORKS     TRANSPORTATION     2      0      6
## 4    City-Wide               FIRE     2      1      4
## 5    City-Wide   GENERAL SERVICES     2      1      2

Note that the links are now specified by source-target node numbers (the last two columns). Now that the graph data is in the right format, it’s easy to build a fully formatted click-and-drag interactive D3 version of the agencies network graph.

forceNetwork(
    # specify nodes data and fields
    Nodes = agency %>% mutate(group = membership(g2_groups)), 
    NodeID = 'agency', Nodesize = 'vendor_count',
    Group = 'group',
    # specify links data and fields
    Links = links, 
    Source = 'source', Target = 'target', Value = 'd',
    # add formatting
    opacity = 0.95, fontSize=14, charge = -500,
    width = 750, height = 650
)

Conclusion

The examples above demonstrate how it is possible to use free and open source software like R to transform transaction-type data into network graph-ready data, build network graphs, and assign nodes into clusters based on the level of their network proximity. These methods are widely used in uncovering hidden pattern and helping to detect improper payments, fraud waste and abuse, and other illicit activities.


Download raw data and R Markdown code from GitHub.