This is an attempt to recreate and expand upon the work done by Fred Trotter on a network graph of pysicians and orgainizations involved in providing health care services to Medicare patients. The DocGraph was first presented at the Strata conference in 2012. This initative applied network analysis on health care provider (HCP) relationships that are not formally curated within a social network like Facebook, LinkedIn or Twitter. The FOIA enabled Medicare claims public datasets were used to infer relationships between the HCP entities. As far as we have researched (Feb 2014), the data nor the visualization of the graph is publically available yet.
In our work, we hope to create the entity relationship graph based on comman Medicare claim counts (referrals) with additional considerations to shared HCP specialization, shared drug prescriptions, propensity to prescribe propritory versus generics and geographic area. Specifically, our aim is to understand the strength of the relationship between HCPs that can, in turn, help answer the followin questions:
We see payers, pharmaceutical companies, patients, researchers and academia as potential stakeholders interested in answers to the above questions.
The insights will be aggregated at the following levels:
This effort is a part of the session end project submission requierments for Exploratory Data Analysis and Machine Learning for Data Science courses in the Data Science program at Columbia University, New York. Rajesh Madala, Mandeep Sigh and Mayank Misra worked on this project.
The scope was adjusted in April to focus on the New York, New Jersey, Connecticut tristate area and the HCPs involved with Oncology as a specialty.
The following datasets and sources were identified for this analysis:
National Plan and Provider Enumeration System (size 5 GB) data from Centers for Medicare and Medicaid Services (CMS).
Physician Referral Patterns – 2009, 2010, 2011 (size 2 GB) : These files represent 3 years of data showing the number of referrals from one physician to another (data for 2012 has also been released at the time of publishing). This dataset was made available as a FOIA request. Each record in this database captures the occurrences where a claim was made for the same patient by two health care providers (HCP) in a rolling 30 day time period. The two HCPs are picked up only if they have had 11 or more 'shared' patients for which Medicare was billed. The 11 number is a CMS standard to obfuscate identifying individual patient by their referral patterns. A record in this data set is of the form: {1112223334, 5556667778, 1100}. The first value in the NPI of the HCP who provided care the first. The second value is the NPI who provided care subsequently. The third value is the number of time this occurred in the past rolling 30 days for that year. This dataset, probably due to a technical issues, was offline for a period of time while this work was carried out. As a fall back DocGraph Edge Database v1.0 Open Source was download for a token of $1. This dataset is available for 2011 only. The documentation accompanying this dataset further explains the Physician Referral data as follows:
Medicare Part D Prescribing Data 2011 & Medicare Part D Prescribing Data (Patients 65 or Older) 2011 (size 7 GB): These data, made available by ProPublica, include all drugs prescribed by doctors 11 or more times that year to Part D patients, including those 65 and older.
Health Care Provider Taxonomy (size 2 MB). This dataset adds to the profile of a HCP entity by detailing their specialization (Type, Classification, Specialization, Definition).
National Drug Code Directory (25 MB): This dataset was used to add details for a prescribed drug. The ProPublica prescribing dataset was mapped to this Drug Directory to obtain information about the Labeler, Marketing start/end date, form strength etc.
Most of the datasets are in comma separated values. The files were individually downloaded and uncompressed. Cloud storage was used to archive the raw files and make them available to the team. The semi prepared and final versions of the data extracts were also uploaded to the cloud. They have been made public and are available here
Our first attempt at sampling, merging and enriching the dataset were based on Unix commands.
--replace comma in " "( double strings) with blank
awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "", $i) }1' npidata_20050523-20140309.csv > npidata.csv
-- create a file with npi ,entity_type_code ,provider_last_name ,provider_first_name ,provider_city ,provider_state provider_zipcode ,provider_taxonomy_code columns
cut -d, -f 1,2,6,7,31,32,33,48 < npidata.csv >> npi_data.csv
-- replace comma in " "( double strings) with blank
awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "", $i) }1' nucc_taxonomy_140.csv > nc_txnmy140.csv
-- create a file with code ,type ,classification ,specialization
cut -d, -f 1,2,3,4,31,32,33,48 < nc_txnmy140.csv >> npi_txnmy140.csv
-- replace comma with ;
sed s/\,/\;/g product.txt >> prod.txt
sed s/\\t/\,/g prod.txt >> prod1.csv
create file with proprietaryname , labelername
cut -d, -f 4,13 prod1.csv >> product.csv
The shear size of datafile, the lack of error handling and an inability to confirm the script workflow did not give us enough confidence in the output. We switched to the tried and trusted relational database to store the data and query the data. Oracle personal edition was used for this purposes.
OPTIONS (SKIP=1)
load data
truncate
into table npi_tbl
fields terminated by ","
(npi ,
entity_type_code ,
provider_last_name ,
provider_first_name ,
provider_city ,
provider_state ,
provider_zipcode ,
provider_taxonomy_code)
OPTIONS (SKIP=1)
load data
truncate
into table npi_taxnmy_tbl
fields terminated by ","
TRAILING NULLCOLS
(code ,
type ,
classification ,
specialization )
OPTIONS (SKIP=1)
load data
truncate
into table product_tbl
fields terminated by ","
TRAILING NULLCOLS
(
proprietaryname ,
labelername )
load data
truncate
into table propublica_ccw_id
fields terminated by ","
TRAILING NULLCOLS
(propub_id ,
bn ,
claim_count ,
claim_count_daw1 ,
claim_count_cmpnd2 ,
quantity_sum ,
day_supply_sum ,
gross_drug_cost_sum )
load data
truncate
into table propub_prscrbr_id
fields terminated by ","
TRAILING NULLCOLS
(propub_id ,
npi ,
dea1 ,
dea2 ,
dea3 ,
dea4 ,
dea5 ,
dea6 ,
dea7 )
SELECT distinct np.npi,
pr.npi_2,
pr.ref_cnt
FROM npi_tbl np,
npi_taxnmy_tbl nt,
propub_prscrbr_id ppr,
propublica_ccw_id pcc,
(SELECT prop_name, lblr_name
FROM (SELECT prop_name,
lblr_name,
lblr_cnt,
ROW_NUMBER ()
OVER (PARTITION BY prop_name ORDER BY lblr_cnt DESC)
rnk
FROM ( SELECT LOWER (proprietaryname) prop_name,
INITCAP (labelername) lblr_name,
COUNT (*) lblr_cnt
FROM product_tbl
GROUP BY LOWER (proprietaryname),
INITCAP (labelername)))
WHERE rnk = 1) pt,
physician_referrals pr
WHERE np.provider_taxonomy_code = nt.code
AND np.npi = ppr.npi
AND ppr.propub_id = pcc.propub_id
AND LOWER (pcc.bn) = pt.prop_name
AND np.npi = pr.npi_1
AND np.provider_state IN ('NJ', 'NY', 'CT')
AND LOWER (nt.specialization) LIKE '%oncology%'
ORDER BY npi
SELECT np.npi,
np.entity_type_code,
np.provider_last_name,
np.provider_first_name,
np.provider_city,
np.provider_state,
np.provider_zipcode,
np.provider_taxonomy_code,
nt.classification,
nt.specialization,
ppr.propub_id,
ppr.dea1,
claim_count,
claim_count_daw1,
claim_count_cmpnd2,
quantity_sum,
day_supply_sum,
gross_drug_cost_sum,
INITCAP (pt.prop_name) proprietary_name,
pt.lblr_name labeler_name,
pr.npi_2,
pr.ref_cnt
FROM npi_tbl np,
npi_taxnmy_tbl nt,
propub_prscrbr_id ppr,
propublica_ccw_id pcc,
(SELECT prop_name, lblr_name
FROM (SELECT prop_name,
lblr_name,
lblr_cnt,
ROW_NUMBER ()
OVER (PARTITION BY prop_name ORDER BY lblr_cnt DESC)
rnk
FROM ( SELECT LOWER (proprietaryname) prop_name,
INITCAP (labelername) lblr_name,
COUNT (*) lblr_cnt
FROM product_tbl
GROUP BY LOWER (proprietaryname),
INITCAP (labelername)))
WHERE rnk = 1) pt,
physician_referrals pr
WHERE np.provider_taxonomy_code = nt.code
AND np.npi = ppr.npi
AND ppr.propub_id = pcc.propub_id
AND LOWER (pcc.bn) = pt.prop_name
AND np.npi = pr.npi_1
AND np.provider_state IN ('NJ', 'NY', 'CT')
AND LOWER (nt.specialization) LIKE '%oncology%'
ORDER BY npi

The data is aggregated from multiple public sources:
library(ggplot2)
library(GGally)
## Loading required package: reshape
library(ggmap)
data <- read.csv("npi_pcg_prod.csv")
colnames(data) <- c("State", "City", "Specialization", "Classification", "NPI",
"Type", "Name", "Labeler", "Drug", "Ccount", "Cqty", "Csum")
The table below shows the frequency of
:
table(data[, "State"])
##
## CT NJ NY
## 288 1200 1972
Visualized as a barplot:
agg1 <- aggregate(Ccount ~ State, data = data, sum)
p1 <- ggplot(agg1, aes(State, Ccount, fill = State))
p1 + geom_bar(stat = "identity") + ylab("Claim Count") + ggtitle("Oncology Claims by State")
Split by
:
agg2 <- aggregate(Ccount ~ Classification, data = data, sum)
p2 <- ggplot(agg2, aes(Classification, Ccount, fill = Classification))
p2 + geom_bar(stat = "identity") + ylab("Claim Count") + ggtitle("Oncology Claims by Classification")
Correlation matrix for Claim Count, Quantity and Sum
data_sub <- data[, c(1, 10, 11, 12)]
ggpairs(data = data_sub, columns = 2:4, title = "Correlation Matrix", colour = "State")
Claim distribution
:
agg3 <- aggregate(Ccount ~ NPI + State, data = data, sum)
d3 <- density(agg3[, 3])
plot(d3, type = "n", main = "Distribution of Claim Count")
polygon(d3, col = "red", border = "gray")
Claim Count by Physicians in Tri-State Area
:
agg3 <- agg3[order(agg3$State, agg3$Ccount, decreasing = T), ]
p3 <- ggplot(agg3, aes(NPI, Ccount, colour = State))
p3 <- p3 + labs(colour = "State") + scale_x_continuous(breaks = c(1, 200))
p3 <- p3 + ggtitle("Claim Counts by Physician in TriState Area") + ylab("Claim Count")
p3 + geom_point() + facet_grid(. ~ State) + geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
Distribution of Claim Quantity
agg4 <- aggregate(Cqty ~ Labeler, data = data, sum)
d4 <- density(agg4[, 2])
plot(d4, type = "n", main = "Distribution of Claim Quantity")
polygon(d4, col = "red", border = "gray")
Claim Volume by Top Labelers

agg4 <- agg4[order(agg4$Cqty, decreasing = T), ]
agg4_top <- agg4[1:6, ]
p4 <- ggplot(agg4_top, aes(Labeler, Cqty, fill = Labeler))
p4 + geom_bar(stat = "identity") + ylab("Claim Quantity") + ggtitle("Claim Quantity by Top Labelers")
Claim Amount by Top Labelers

agg5 <- aggregate(Csum ~ Labeler, data = data, sum)
agg5 <- agg5[order(agg5$Csum, decreasing = T), ]
agg5_top <- agg5[1:6, ]
p5 <- ggplot(agg5_top, aes(Labeler, Csum, fill = Labeler))
p5 + geom_bar(stat = "identity") + ylab("Claim Sum") + ggtitle("Claim Sum by Top Labelers")
Claim Distribution on a map (work in progress):

freq <- as.data.frame(table(data$City))
longlat <- geocode(unique(as.character(data$City)))
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=YONKERS&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=ROCHESTER&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=NEW+BRUNSWICK&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=NEW+HYDE+PARK&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=KENDALL+PARK&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=NEW+YORK&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=VOORHEES&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=NEW+HAVEN&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=YORKTOWN+HEIGHTS&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=JERSEY+CITY&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=MANASQUAN&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=BROOKLYN&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=MIDDLETOWN&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=HARTFORD&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=DENVILLE&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=TRENTON&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=SLEEPY+HOLLOW&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=JAMAICA&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=BRONX&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=MORRISTOWN&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=STONY+BROOK&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=WOODBURY&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=GARDEN+CITY&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=BUFFALO&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=FLUSHING&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=CAMDEN&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=ENGLEWOOD&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=HAMDEN&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=FARMINGTON&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=ENFIELD&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=ALBANY&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=MARLTON&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=BRICK&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=MOUNT+LAUREL&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=LINCROFT&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=ELIZABETH&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=NYACK&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=COOPERSTOWN&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=GLENS+FALLS&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=MAPLEWOOD&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=MARMORA&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=RAHWAY&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=HACKENSACK&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=TEANECK&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=SUMMIT&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=WEST+HAVEN&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=STATEN+ISLAND&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=WEST+LONG+BRANCH&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=EDISON&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=NEW+MILFORD&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=MONTCLAIR&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=E+SETAUKET&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=PRINCETON&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=RIDGEWOOD&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=PHILLIPSBURG&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
cities <- cbind(freq, longlat)
colnames(cities) <- c("City", "Freq", "long", "lat")
cities <- cities[order(cities$Freq, decreasing = T), ]
top_cities <- cities[1:10, ]
states <- map_data("state")
tri_states <- subset(states, region %in% c("Connecticut", "new jersey", "new york"))
p <- ggplot(tri_states, aes(x = long, y = lat, group = group))
p <- p + geom_polygon(fill = "grey10", colour = "white")
p <- p + xlim(-80, -70)
p <- p + geom_point(data = top_cities, inherit.aes = F, aes(x = long, y = lat,
size = Freq), colour = "red", alpha = 0.8) + scale_size(name = "Claim Frequency")
p <- p + geom_text(data = top_cities, inherit.aes = F, aes(x = long, y = lat,
label = City), vjust = 1, colour = "red", alpha = 0.8)
p + ggtitle("Physician Claims in TriState Area")
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_text).
Health and Humas Service itself acknowledge that Improvements Are Needed To Ensure Provider Enumeration and Medicare Enrollment Data Are Accurate, Complete, and Consistent
Ideally we would have liked to use the American Medical Association Physician Masterfile. This file has nationa level data on each physician, including demographic, education/training and practice information. The AMA website describes this file as including “current and historical data for more than 1.4 million physicians, residents, and medical students in the United States. This figure includes approximately 411,000 graduates of foreign medical schools who reside in the United States and who have met the educational and credentialing requirements necessary for recognition”. This is not a open dataset and not an option in absence of a budget. Our choice was limted to the NPPES data or its derivations. Although all practitioners and healthcare providers that participate in Medicare, Medicaid should have an NPI, the depth and completeness of data for physicians is not at par with the AMA master file.
This exercise is based on Medicare data and therfore may not provide a complete picture for the total population.
#Stretch goal
##Research:
The following questions have been posed by a Cloudera sponsored DataScience challenge. We can aim to answer these and add more insights by including the NPI and Prescription info. The challenge list the questions as:
Some providers and regions are likely to be different in more subtle ways.
More here: http://www.cloudera.com/content/cloudera/en/training/certification/ccp-ds/challenge/register.html