The goal of this script is to identify the number of distinct suppliers in the USA spending data set. This goal is achieved by merging rows by matching values in multiple columns on at a time. When there were multiple values in the same column per group, the mode of the column was selected for the merge. The mode was selected because trying to merge on all matches between a common field was computationally not feasible on the available resources. Below are how the counts of unique groups changed after merging by each field.
The number of unique groups was shrunk from 214,320 to 128,034. Merging the data using this methodology would allow users to answer the following questions:
There are a number of potential improvements to this algorithm. It is currently designed so that it can easily be modified and improved in any of the following methods.
Set up environment and install packages
Create mode function that eliminates NAs
Mode <- function(x) {
x = na.omit(x)
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
Create function to merge the clusters based on the mode of a selected column
groupCompanies = function(rawData,column_name){
#Update merge column with merge data of interest
rawData[,match_var := get(column_name),]
#Group on group_id and calculate mode of match variable
new.group = ddply(rawData,
.(group_id),
.parallel = TRUE,
summarise,
match_var_new = Mode(match_var))
#Identify unique merge values and assign unique ID
new.group = na.omit(new.group)
new.group.ids = as.data.table(unique(new.group$match_var_new))
names(new.group.ids)[1] = "match_var_new"
max_group_id = max(rawData$group_id, na.rm = TRUE)
new.group.ids$group_id_new =
seq.int(from = max_group_id + 1,to = max_group_id + nrow(new.group))
#Merge new groupings back with group IDs
new.group = data.table(new.group)
setkey(new.group,match_var_new)
new.group.ids = data.table(new.group.ids)
setkey(new.group.ids,match_var_new)
new.group.match = merge(new.group,new.group.ids)
new.group.match$match_var_new = NULL
#Merge groupings back to full data set
new.group.match = data.table(na.omit(new.group.match))
setkey(new.group.match,group_id)
setkey(rawData,group_id)
rawData <- data.table(merge(rawData,new.group.match,all.x = TRUE))
#Update group_ids with the newly merged group IDs
rawData[,group_id := ifelse(is.na(group_id_new),group_id,group_id_new),]
rawData$group_id_new = NULL
#Return table with updated data
rawData
}
Prior to importing the data the data dictionary was identified and of the variables in the data set the following subset was selected to be useful in shrinking the data. There are 4802211 unique records in the data set.
Identify the unique combinations of rows in the data sets based on the variables that we pulled in.
#Read in Raw data
spendData = readRDS("spendData.rds")
#Create unique id of 1 to sum rows
spendData$val = 1
#Identify unique combinations and count of variables
spendDataUnique = ddply(spendData,
.(vendorname,vendoralternatename,vendordoingasbusinessname,
vendorlegalorganizationname,streetaddress,streetaddress2,
streetaddress3,city,state,zipcode,faxno,phoneno,parentdunsnumber,
mod_parent),
summarise,
.parallel = TRUE,
count = sum(val))
#Convert to data.table, save for future use.
spendDataUnique = as.data.table(spendDataUnique)
nrow(spendDataUnique)
saveRDS(spendDataUnique,"spendDataUnique.R")
Add unique IDs to assist with merging
spendDataUnique = readRDS("spendDataUnique.R") #Uncomment if need to load raw data
spendDataUnique = add_column(spendDataUnique,id = seq.int(nrow(spendDataUnique)),.before = 1)
spendDataUnique = add_column(spendDataUnique, group_id = spendDataUnique$id, .before = 1)
Clean some data to make the merging easier. This section can be updated to include additional rows if necessary
#Clear up obviously missing phone and fax numbers
spendDataUnique[,phoneno := gsub('\\D','',phoneno),]
spendDataUnique[,faxno := gsub('\\D','',faxno),]
spendDataUnique[grepl("^(\\d)\\1+$",faxno),faxno := NA,]
spendDataUnique[grepl("^(\\d)\\1+$",phoneno),phoneno := NA,]
Merging groups on the mode of “vendorname” reduced the number of unique groups from 214320 to
spendDataUnique = groupCompanies(spendDataUnique,"vendorname")
length(unique(spendDataUnique$group_id))
## [1] 152616
Merging groups on the mode of “phoenno” reduced to the following
spendDataUnique = groupCompanies(spendDataUnique,"phoneno")
length(unique(spendDataUnique$group_id))
## [1] 143845
Merging groups on the mode of “faxno” reduced the number of unique groups from 143838 to 141563
spendDataUnique = groupCompanies(spendDataUnique,"faxno")
length(unique(spendDataUnique$group_id))
## [1] 141653
Merging groups on the mode of “parentdunsnumber” reduced the number of unique groups from 141563 to 128201
spendDataUnique = groupCompanies(spendDataUnique,"parentdunsnumber")
length(unique(spendDataUnique$group_id))
## [1] 128201
Merging groups on the mode of “mod_parent” reduced the number of unique groups from 127740 to 128034
spendDataUnique = groupCompanies(spendDataUnique,"mod_parent")
length(unique(spendDataUnique$group_id))
## [1] 128034
Check largest clusters for obvious data cleanup & look at how big the stuff was merged
largestGroup = spendDataUnique[group_id == Mode(spendDataUnique$group_id),, ]
head(largestGroup)
## group_id id vendorname
## 1: 725337 9733 LOCKHEED MARTIN SERVICES INCORPORATED
## 2: 725337 23768 LOCKHEED MARTIN SERVICES INCORPORATED
## 3: 725337 28445 LOCKHEED MARTIN SERVICES INCORPORATED
## 4: 725337 54634 LOCKHEED MARTIN SERVICES INCORPORATED
## 5: 725337 126715 LOCKHEED MARTIN SERVICES INCORPORATED
## 6: 725337 152317 LOCKHEED MARTIN SERVICES INCORPORATED
## vendoralternatename vendordoingasbusinessname
## 1: NA NA
## 2: LOCKHEED MARTIN NA
## 3: LOCKHEED MARTIN AIRCRAFT & LOGISTICS CENTERS NA
## 4: LOCKHEED MARTIN NA
## 5: LOCKHEED MARTIN AIRCRAFT & LOGISTICS CENTERS NA
## 6: LOCKHEED MARTIN NA
## vendorlegalorganizationname streetaddress
## 1: LOCKHEED MARTIN SERVICES INCORPORATED 2339 RTE 70 W
## 2: LOCKHEED MARTIN SERVICES INCORPORATED 2339 ROUTE 70 WEST, FLOOR 3W
## 3: LOCKHEED MARTIN SERVICES INCORPORATED 2339 ROUTE 70 WEST, FLOOR 3W
## 4: LOCKHEED MARTIN SERVICES INCORPORATED 7375 EXECUTIVE PLACE
## 5: LOCKHEED MARTIN SERVICES INCORPORATED 580 W. CHEYENNE AVE
## 6: LOCKHEED MARTIN SERVICES INCORPORATED 2339 RTE 70 W
## streetaddress2 streetaddress3 city state zipcode faxno
## 1: NA NA CHERRY HILL NJ 08002 NA
## 2: NA NA CHERRY HILL NJ 08002 NA
## 3: NA NA CHERRY HILL NJ 08002 NA
## 4: NA NA SEABROOK MD 20706 NA
## 5: BLDG C, SUITE 130 NA NORTH LAS VEGAS NV 89030 NA
## 6: NA NA CHERRY HILL NJ 08002 NA
## phoneno parentdunsnumber mod_parent count
## 1: NA 834951691 LOCKHEED MARTIN CORPORATION 2
## 2: NA 834951691 LOCKHEED MARTIN CORPORATION 6
## 3: NA 834951691 LOCKHEED MARTIN CORPORATION 10
## 4: NA 834951691 LOCKHEED MARTIN CORPORATION 35
## 5: NA 834951691 LOCKHEED MARTIN CORPORATION 1
## 6: NA 834951691 LOCKHEED MARTIN CORPORATION 1
## match_var
## 1: LOCKHEED MARTIN CORPORATION
## 2: LOCKHEED MARTIN CORPORATION
## 3: LOCKHEED MARTIN CORPORATION
## 4: LOCKHEED MARTIN CORPORATION
## 5: LOCKHEED MARTIN CORPORATION
## 6: LOCKHEED MARTIN CORPORATION
MergeBreakdown = ddply(spendDataUnique,
.(group_id),
summarise,
.parallel = TRUE,
RowCount = length(group_id),
TotalCount = sum(count))