Tamr Government Spend Data Engineering Problem

Executive Summary

The goal of this script is to identify the number of distinct suppliers in the USA spending data set. This goal is achieved by merging rows by matching values in multiple columns on at a time. When there were multiple values in the same column per group, the mode of the column was selected for the merge. The mode was selected because trying to merge on all matches between a common field was computationally not feasible on the available resources. Below are how the counts of unique groups changed after merging by each field.

Start: 4,802,211
Unique Rows: 214,320
vendorname: 152,161
phoenno: 143,845
faxno: 141,563
parentdunsnumber: 128,201
mod_parent: 128,034

The number of unique groups was shrunk from 214,320 to 128,034. Merging the data using this methodology would allow users to answer the following questions:

Which vendors are receiving the most number of contracts and amount of money from the federal government
If someone knew which lobbying firms are representing different companies, they may be able to identify a good lobbying firm if they have similar needs.
Someone could determine the percent of vendors that are winning different types of contracts. For example Lockeed Martin & Boeing each wiht 50% of aerospace contracts.

Potential improvements to the algorithm

There are a number of potential improvements to this algorithm. It is currently designed so that it can easily be modified and improved in any of the following methods.

Creation of new variables by cleaning up or merging columns (i.e. address, city & State)
Subsituting additional functions instead of the the mode of a column value. The real overarching goal of the methodolgy is to find something common that is unique to the group. This could include:
Most frequent word combination between a multiple fields that matches against a list of company names
Address cleaned up in a nice format
When there are multiple values instead of using the mode try the second or third most frequent value randomly to find other matches
Running a cluster algorithm to split up the groups. Because this methodology is very loose to merge groups, there are likely multiple companies that are put in the same bucket. However, the buckets are fairly small, and if proper distance metrics can be created this is ripe for a clustering algorithm to separate companies within a cluster.

Set up environment and create functions

Set up environment and install packages

Create mode function that eliminates NAs

 Mode <- function(x) {
   x = na.omit(x)
   ux <- unique(x)
   ux[which.max(tabulate(match(x, ux)))]
 }

Create function to merge the clusters based on the mode of a selected column

groupCompanies = function(rawData,column_name){

  #Update merge column with merge data of interest
  rawData[,match_var := get(column_name),]
  
  #Group on group_id and calculate mode of match variable
  new.group = ddply(rawData,
                    .(group_id),
                    .parallel = TRUE,
                    summarise,
                    match_var_new = Mode(match_var))
  
  
  #Identify unique merge values and assign unique ID
  new.group = na.omit(new.group)
  new.group.ids = as.data.table(unique(new.group$match_var_new))
  names(new.group.ids)[1] = "match_var_new"
  max_group_id = max(rawData$group_id, na.rm = TRUE)
  new.group.ids$group_id_new = 
    seq.int(from = max_group_id + 1,to = max_group_id + nrow(new.group))
  
  #Merge new groupings back with group IDs
  new.group = data.table(new.group)
  setkey(new.group,match_var_new)
  new.group.ids = data.table(new.group.ids)
  setkey(new.group.ids,match_var_new)
  new.group.match = merge(new.group,new.group.ids)
  new.group.match$match_var_new = NULL
  
  #Merge groupings back to full data set
  new.group.match = data.table(na.omit(new.group.match))
  setkey(new.group.match,group_id)
  setkey(rawData,group_id)
  rawData <- data.table(merge(rawData,new.group.match,all.x = TRUE))
  
  #Update group_ids with the newly merged group IDs
  rawData[,group_id := ifelse(is.na(group_id_new),group_id,group_id_new),]
  rawData$group_id_new = NULL

  #Return table with updated data
  rawData
}

Start Data process

Prior to importing the data the data dictionary was identified and of the variables in the data set the following subset was selected to be useful in shrinking the data. There are 4802211 unique records in the data set.

Identify the unique combinations of rows in the data sets based on the variables that we pulled in.

#Read in Raw data
spendData = readRDS("spendData.rds")

#Create unique id of 1 to sum rows
spendData$val = 1

#Identify unique combinations and count of variables
spendDataUnique = ddply(spendData,
                        .(vendorname,vendoralternatename,vendordoingasbusinessname,
                          vendorlegalorganizationname,streetaddress,streetaddress2,       
                          streetaddress3,city,state,zipcode,faxno,phoneno,parentdunsnumber,
                          mod_parent),
                        summarise,
                        .parallel = TRUE,
                        count = sum(val))

#Convert to data.table, save for future use.
spendDataUnique = as.data.table(spendDataUnique)
nrow(spendDataUnique)
saveRDS(spendDataUnique,"spendDataUnique.R")

Add unique IDs to assist with merging

spendDataUnique = readRDS("spendDataUnique.R")  #Uncomment if need to load raw data
spendDataUnique = add_column(spendDataUnique,id = seq.int(nrow(spendDataUnique)),.before = 1)
spendDataUnique = add_column(spendDataUnique, group_id = spendDataUnique$id, .before = 1)

Clean some data to make the merging easier. This section can be updated to include additional rows if necessary

#Clear up obviously missing phone and fax numbers
spendDataUnique[,phoneno := gsub('\\D','',phoneno),]
spendDataUnique[,faxno := gsub('\\D','',faxno),]
spendDataUnique[grepl("^(\\d)\\1+$",faxno),faxno := NA,]
spendDataUnique[grepl("^(\\d)\\1+$",phoneno),phoneno := NA,]

Begin merging groups

Merging groups on the mode of “vendorname” reduced the number of unique groups from 214320 to

spendDataUnique = groupCompanies(spendDataUnique,"vendorname")
length(unique(spendDataUnique$group_id))

## [1] 152616

Merging groups on the mode of “phoenno” reduced to the following

spendDataUnique = groupCompanies(spendDataUnique,"phoneno")
length(unique(spendDataUnique$group_id))

## [1] 143845

Merging groups on the mode of “faxno” reduced the number of unique groups from 143838 to 141563

spendDataUnique = groupCompanies(spendDataUnique,"faxno")
length(unique(spendDataUnique$group_id))

## [1] 141653

Merging groups on the mode of “parentdunsnumber” reduced the number of unique groups from 141563 to 128201

spendDataUnique = groupCompanies(spendDataUnique,"parentdunsnumber")
length(unique(spendDataUnique$group_id))

## [1] 128201

Merging groups on the mode of “mod_parent” reduced the number of unique groups from 127740 to 128034

spendDataUnique = groupCompanies(spendDataUnique,"mod_parent")
length(unique(spendDataUnique$group_id))

## [1] 128034

Review results

Check largest clusters for obvious data cleanup & look at how big the stuff was merged

largestGroup = spendDataUnique[group_id == Mode(spendDataUnique$group_id),, ]
head(largestGroup)

##    group_id     id                            vendorname
## 1:   725337   9733 LOCKHEED MARTIN SERVICES INCORPORATED
## 2:   725337  23768 LOCKHEED MARTIN SERVICES INCORPORATED
## 3:   725337  28445 LOCKHEED MARTIN SERVICES INCORPORATED
## 4:   725337  54634 LOCKHEED MARTIN SERVICES INCORPORATED
## 5:   725337 126715 LOCKHEED MARTIN SERVICES INCORPORATED
## 6:   725337 152317 LOCKHEED MARTIN SERVICES INCORPORATED
##                             vendoralternatename vendordoingasbusinessname
## 1:                                           NA                        NA
## 2:                              LOCKHEED MARTIN                        NA
## 3: LOCKHEED MARTIN AIRCRAFT & LOGISTICS CENTERS                        NA
## 4:                              LOCKHEED MARTIN                        NA
## 5: LOCKHEED MARTIN AIRCRAFT & LOGISTICS CENTERS                        NA
## 6:                              LOCKHEED MARTIN                        NA
##              vendorlegalorganizationname                streetaddress
## 1: LOCKHEED MARTIN SERVICES INCORPORATED                2339 RTE 70 W
## 2: LOCKHEED MARTIN SERVICES INCORPORATED 2339 ROUTE 70 WEST, FLOOR 3W
## 3: LOCKHEED MARTIN SERVICES INCORPORATED 2339 ROUTE 70 WEST, FLOOR 3W
## 4: LOCKHEED MARTIN SERVICES INCORPORATED         7375 EXECUTIVE PLACE
## 5: LOCKHEED MARTIN SERVICES INCORPORATED          580 W. CHEYENNE AVE
## 6: LOCKHEED MARTIN SERVICES INCORPORATED                2339 RTE 70 W
##       streetaddress2 streetaddress3            city state zipcode faxno
## 1:                NA             NA     CHERRY HILL    NJ   08002    NA
## 2:                NA             NA     CHERRY HILL    NJ   08002    NA
## 3:                NA             NA     CHERRY HILL    NJ   08002    NA
## 4:                NA             NA        SEABROOK    MD   20706    NA
## 5: BLDG C, SUITE 130             NA NORTH LAS VEGAS    NV   89030    NA
## 6:                NA             NA     CHERRY HILL    NJ   08002    NA
##    phoneno parentdunsnumber                  mod_parent count
## 1:      NA        834951691 LOCKHEED MARTIN CORPORATION     2
## 2:      NA        834951691 LOCKHEED MARTIN CORPORATION     6
## 3:      NA        834951691 LOCKHEED MARTIN CORPORATION    10
## 4:      NA        834951691 LOCKHEED MARTIN CORPORATION    35
## 5:      NA        834951691 LOCKHEED MARTIN CORPORATION     1
## 6:      NA        834951691 LOCKHEED MARTIN CORPORATION     1
##                      match_var
## 1: LOCKHEED MARTIN CORPORATION
## 2: LOCKHEED MARTIN CORPORATION
## 3: LOCKHEED MARTIN CORPORATION
## 4: LOCKHEED MARTIN CORPORATION
## 5: LOCKHEED MARTIN CORPORATION
## 6: LOCKHEED MARTIN CORPORATION

MergeBreakdown = ddply(spendDataUnique,
                       .(group_id),
                       summarise,
                       .parallel = TRUE,
                       RowCount = length(group_id),
                       TotalCount = sum(count))