Physician Affinity Graph

Summary

This is an attempt to recreate and expand upon the work done by Fred Trotter on a network graph of pysicians and orgainizations involved in providing health care services to Medicare patients. The DocGraph was first presented at the Strata conference in 2012. This initative applied network analysis on health care provider (HCP) relationships that are not formally curated within a social network like Facebook, LinkedIn or Twitter. The FOIA enabled Medicare claims public datasets were used to infer relationships between the HCP entities. As far as we have researched (Feb 2014), the data nor the visualization of the graph is publically available yet.

What do we hope to answer?

In our work, we hope to create the entity relationship graph based on comman Medicare claim counts (referrals) with additional considerations to shared HCP specialization, shared drug prescriptions, propensity to prescribe propritory versus generics and geographic area. Specifically, our aim is to understand the strength of the relationship between HCPs that can, in turn, help answer the followin questions:

  1. Who are the most influential doctors for a particular disease area in a specific geography?
  2. Can we apply ranking or an influence score to the healthcare provider?
  3. Is there a way, using these data, to develop a 'Key Opinion Leader' KOL recommendation engine for the health care provider community?

We see payers, pharmaceutical companies, patients, researchers and academia as potential stakeholders interested in answers to the above questions.

The insights will be aggregated at the following levels:

This effort is a part of the session end project submission requierments for Exploratory Data Analysis and Machine Learning for Data Science courses in the Data Science program at Columbia University, New York. Rajesh Madala, Mandeep Sigh and Mayank Misra worked on this project.

The scope was adjusted in April to focus on the New York, New Jersey, Connecticut tristate area and the HCPs involved with Oncology as a specialty.

Data Sources

The following datasets and sources were identified for this analysis:

  1. National Plan and Provider Enumeration System (size 5 GB) data from Centers for Medicare and Medicaid Services (CMS).

  2. Physician Referral Patterns – 2009, 2010, 2011 (size 2 GB) : These files represent 3 years of data showing the number of referrals from one physician to another (data for 2012 has also been released at the time of publishing). This dataset was made available as a FOIA request. Each record in this database captures the occurrences where a claim was made for the same patient by two health care providers (HCP) in a rolling 30 day time period. The two HCPs are picked up only if they have had 11 or more 'shared' patients for which Medicare was billed. The 11 number is a CMS standard to obfuscate identifying individual patient by their referral patterns. A record in this data set is of the form: {1112223334, 5556667778, 1100}. The first value in the NPI of the HCP who provided care the first. The second value is the NPI who provided care subsequently. The third value is the number of time this occurred in the past rolling 30 days for that year. This dataset, probably due to a technical issues, was offline for a period of time while this work was carried out. As a fall back DocGraph Edge Database v1.0 Open Source was download for a token of $1. This dataset is available for 2011 only. The documentation accompanying this dataset further explains the Physician Referral data as follows:

  3. Medicare Part D Prescribing Data 2011 & Medicare Part D Prescribing Data (Patients 65 or Older) 2011 (size 7 GB): These data, made available by ProPublica, include all drugs prescribed by doctors 11 or more times that year to Part D patients, including those 65 and older.

  4. Health Care Provider Taxonomy (size 2 MB). This dataset adds to the profile of a HCP entity by detailing their specialization (Type, Classification, Specialization, Definition).

  5. National Drug Code Directory (25 MB): This dataset was used to add details for a prescribed drug. The ProPublica prescribing dataset was mapped to this Drug Directory to obtain information about the Labeler, Marketing start/end date, form strength etc.

Enabling technologies

Methodology

Relationships

Data Wrangling

Most of the datasets are in comma separated values. The files were individually downloaded and uncompressed. Cloud storage was used to archive the raw files and make them available to the team. The semi prepared and final versions of the data extracts were also uploaded to the cloud. They have been made public and are available here

Unix

Our first attempt at sampling, merging and enriching the dataset were based on Unix commands.

--replace comma in " "( double strings)  with blank
awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "", $i) }1' npidata_20050523-20140309.csv > npidata.csv

-- create a file with npi ,entity_type_code ,provider_last_name ,provider_first_name ,provider_city ,provider_state provider_zipcode ,provider_taxonomy_code columns
cut -d, -f 1,2,6,7,31,32,33,48 < npidata.csv >> npi_data.csv

-- replace comma in " "( double strings)  with blank
awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "", $i) }1' nucc_taxonomy_140.csv > nc_txnmy140.csv

-- create a file with code ,type ,classification ,specialization
cut -d, -f 1,2,3,4,31,32,33,48 < nc_txnmy140.csv >> npi_txnmy140.csv

-- replace comma with ;
sed s/\,/\;/g product.txt >> prod.txt

sed s/\\t/\,/g prod.txt >> prod1.csv

create file with proprietaryname   ,   labelername
cut -d, -f 4,13 prod1.csv >> product.csv

The shear size of datafile, the lack of error handling and an inability to confirm the script workflow did not give us enough confidence in the output. We switched to the tried and trusted relational database to store the data and query the data. Oracle personal edition was used for this purposes.

Relational Database (Oracle)

OPTIONS (SKIP=1)
load data
truncate
into table npi_tbl
fields terminated by ","
(npi ,
entity_type_code ,
provider_last_name ,
provider_first_name ,
provider_city ,
provider_state ,
provider_zipcode ,
provider_taxonomy_code)
OPTIONS (SKIP=1)
load data
truncate
into table npi_taxnmy_tbl
fields terminated by ","
TRAILING NULLCOLS
(code ,
type ,
classification ,
specialization )
OPTIONS (SKIP=1)
load data
truncate
into table product_tbl
fields terminated by ","
TRAILING NULLCOLS
(  
proprietaryname   ,    
labelername      )
load data
truncate
into table propublica_ccw_id
fields terminated by ","
TRAILING NULLCOLS
(propub_id ,
bn ,
claim_count ,
claim_count_daw1 ,
claim_count_cmpnd2 ,
quantity_sum ,
day_supply_sum ,
gross_drug_cost_sum )
load data
truncate
into table propub_prscrbr_id
fields terminated by ","
TRAILING NULLCOLS
(propub_id ,
npi  ,
dea1  ,
dea2  ,
dea3 ,
dea4  ,
dea5  ,
dea6  ,
dea7 )

Data Modeling

    SELECT  distinct np.npi,
         pr.npi_2,
         pr.ref_cnt
    FROM npi_tbl np,
         npi_taxnmy_tbl nt,
         propub_prscrbr_id ppr,
         propublica_ccw_id pcc,
         (SELECT prop_name, lblr_name
            FROM (SELECT prop_name,
                         lblr_name,
                         lblr_cnt,
                         ROW_NUMBER ()
                         OVER (PARTITION BY prop_name ORDER BY lblr_cnt DESC)
                            rnk
                    FROM (  SELECT LOWER (proprietaryname) prop_name,
                                   INITCAP (labelername) lblr_name,
                                   COUNT (*) lblr_cnt
                              FROM product_tbl
                          GROUP BY LOWER (proprietaryname),
                                   INITCAP (labelername)))
           WHERE rnk = 1) pt,
         physician_referrals pr
   WHERE     np.provider_taxonomy_code = nt.code
         AND np.npi = ppr.npi
         AND ppr.propub_id = pcc.propub_id
         AND LOWER (pcc.bn) = pt.prop_name
         AND np.npi = pr.npi_1
         AND np.provider_state IN ('NJ', 'NY', 'CT')
         AND LOWER (nt.specialization) LIKE '%oncology%'
   ORDER BY npi
  SELECT np.npi,
         np.entity_type_code,
         np.provider_last_name,
         np.provider_first_name,
         np.provider_city,
         np.provider_state,
         np.provider_zipcode,
         np.provider_taxonomy_code,
         nt.classification,
         nt.specialization,
         ppr.propub_id,
         ppr.dea1,
         claim_count,
         claim_count_daw1,
         claim_count_cmpnd2,
         quantity_sum,
         day_supply_sum,
         gross_drug_cost_sum,
         INITCAP (pt.prop_name) proprietary_name,
         pt.lblr_name labeler_name,
         pr.npi_2,
         pr.ref_cnt
    FROM npi_tbl np,
         npi_taxnmy_tbl nt,
         propub_prscrbr_id ppr,
         propublica_ccw_id pcc,
         (SELECT prop_name, lblr_name
            FROM (SELECT prop_name,
                         lblr_name,
                         lblr_cnt,
                         ROW_NUMBER ()
                         OVER (PARTITION BY prop_name ORDER BY lblr_cnt DESC)
                            rnk
                    FROM (  SELECT LOWER (proprietaryname) prop_name,
                                   INITCAP (labelername) lblr_name,
                                   COUNT (*) lblr_cnt
                              FROM product_tbl
                          GROUP BY LOWER (proprietaryname),
                                   INITCAP (labelername)))
           WHERE rnk = 1) pt,
         physician_referrals pr
   WHERE     np.provider_taxonomy_code = nt.code
         AND np.npi = ppr.npi
         AND ppr.propub_id = pcc.propub_id
         AND LOWER (pcc.bn) = pt.prop_name
         AND np.npi = pr.npi_1
         AND np.provider_state IN ('NJ', 'NY', 'CT')
         AND LOWER (nt.specialization) LIKE '%oncology%'
   ORDER BY npi

Exploration and Analysis

Network Analysis of Healthcare Providers based on Medicare Claims data Network Analysis of Healthcare Providers based on Medicare Claims data

The data is aggregated from multiple public sources:

library(ggplot2)
library(GGally)
## Loading required package: reshape
library(ggmap)
data <- read.csv("npi_pcg_prod.csv")
colnames(data) <- c("State", "City", "Specialization", "Classification", "NPI", 
    "Type", "Name", "Labeler", "Drug", "Ccount", "Cqty", "Csum")

The table below shows the frequency of Claims by State:

table(data[, "State"])
## 
##   CT   NJ   NY 
##  288 1200 1972

Visualized as a barplot:

agg1 <- aggregate(Ccount ~ State, data = data, sum)
p1 <- ggplot(agg1, aes(State, Ccount, fill = State))
p1 + geom_bar(stat = "identity") + ylab("Claim Count") + ggtitle("Oncology Claims by State")

plot of chunk unnamed-chunk-3

Split by Oncology Specialty Classification:

agg2 <- aggregate(Ccount ~ Classification, data = data, sum)
p2 <- ggplot(agg2, aes(Classification, Ccount, fill = Classification))
p2 + geom_bar(stat = "identity") + ylab("Claim Count") + ggtitle("Oncology Claims by Classification")

plot of chunk unnamed-chunk-4

Correlation matrix for Claim Count, Quantity and Sum

data_sub <- data[, c(1, 10, 11, 12)]
ggpairs(data = data_sub, columns = 2:4, title = "Correlation Matrix", colour = "State")

plot of chunk unnamed-chunk-5

Claim distribution Distribution of claim count:

agg3 <- aggregate(Ccount ~ NPI + State, data = data, sum)
d3 <- density(agg3[, 3])
plot(d3, type = "n", main = "Distribution of Claim Count")
polygon(d3, col = "red", border = "gray")

plot of chunk unnamed-chunk-6

Claim Count by Physicians in Tri-State Area Claim Count by Physicians in Tri-State Area:

agg3 <- agg3[order(agg3$State, agg3$Ccount, decreasing = T), ]
p3 <- ggplot(agg3, aes(NPI, Ccount, colour = State))
p3 <- p3 + labs(colour = "State") + scale_x_continuous(breaks = c(1, 200))
p3 <- p3 + ggtitle("Claim Counts by Physician in TriState Area") + ylab("Claim Count")
p3 + geom_point() + facet_grid(. ~ State) + geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

plot of chunk unnamed-chunk-7

Distribution of Claim Quantity

agg4 <- aggregate(Cqty ~ Labeler, data = data, sum)
d4 <- density(agg4[, 2])
plot(d4, type = "n", main = "Distribution of Claim Quantity")
polygon(d4, col = "red", border = "gray")

plot of chunk unnamed-chunk-8

Claim Volume by Top Labelers Claim Count by Top Labelers

agg4 <- agg4[order(agg4$Cqty, decreasing = T), ]
agg4_top <- agg4[1:6, ]
p4 <- ggplot(agg4_top, aes(Labeler, Cqty, fill = Labeler))
p4 + geom_bar(stat = "identity") + ylab("Claim Quantity") + ggtitle("Claim Quantity by Top Labelers")

plot of chunk unnamed-chunk-9

Claim Amount by Top Labelers Claim Sum by Top Labelers

agg5 <- aggregate(Csum ~ Labeler, data = data, sum)
agg5 <- agg5[order(agg5$Csum, decreasing = T), ]
agg5_top <- agg5[1:6, ]
p5 <- ggplot(agg5_top, aes(Labeler, Csum, fill = Labeler))
p5 + geom_bar(stat = "identity") + ylab("Claim Sum") + ggtitle("Claim Sum by Top Labelers")

plot of chunk unnamed-chunk-10

Claim Distribution on a map (work in progress): Claim Distribution on a map

freq <- as.data.frame(table(data$City))
longlat <- geocode(unique(as.character(data$City)))
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=YONKERS&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=ROCHESTER&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=NEW+BRUNSWICK&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=NEW+HYDE+PARK&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=KENDALL+PARK&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=NEW+YORK&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=VOORHEES&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=NEW+HAVEN&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=YORKTOWN+HEIGHTS&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=JERSEY+CITY&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=MANASQUAN&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=BROOKLYN&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=MIDDLETOWN&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=HARTFORD&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=DENVILLE&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=TRENTON&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=SLEEPY+HOLLOW&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=JAMAICA&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=BRONX&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=MORRISTOWN&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=STONY+BROOK&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=WOODBURY&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=GARDEN+CITY&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=BUFFALO&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=FLUSHING&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=CAMDEN&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=ENGLEWOOD&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=HAMDEN&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=FARMINGTON&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=ENFIELD&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=ALBANY&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=MARLTON&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=BRICK&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=MOUNT+LAUREL&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=LINCROFT&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=ELIZABETH&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=NYACK&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=COOPERSTOWN&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=GLENS+FALLS&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=MAPLEWOOD&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=MARMORA&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=RAHWAY&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=HACKENSACK&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=TEANECK&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=SUMMIT&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=WEST+HAVEN&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=STATEN+ISLAND&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=WEST+LONG+BRANCH&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=EDISON&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=NEW+MILFORD&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=MONTCLAIR&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=E+SETAUKET&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=PRINCETON&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=RIDGEWOOD&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## .Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=PHILLIPSBURG&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
cities <- cbind(freq, longlat)
colnames(cities) <- c("City", "Freq", "long", "lat")
cities <- cities[order(cities$Freq, decreasing = T), ]
top_cities <- cities[1:10, ]
states <- map_data("state")
tri_states <- subset(states, region %in% c("Connecticut", "new jersey", "new york"))
p <- ggplot(tri_states, aes(x = long, y = lat, group = group))
p <- p + geom_polygon(fill = "grey10", colour = "white")
p <- p + xlim(-80, -70)
p <- p + geom_point(data = top_cities, inherit.aes = F, aes(x = long, y = lat, 
    size = Freq), colour = "red", alpha = 0.8) + scale_size(name = "Claim Frequency")
p <- p + geom_text(data = top_cities, inherit.aes = F, aes(x = long, y = lat, 
    label = City), vjust = 1, colour = "red", alpha = 0.8)
p + ggtitle("Physician Claims in TriState Area")
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_text).

plot of chunk unnamed-chunk-11

Limitations and Assumptions

Data are dirty (.gov data are very dirty)

Health and Humas Service itself acknowledge that Improvements Are Needed To Ensure Provider Enumeration and Medicare Enrollment Data Are Accurate, Complete, and Consistent

#Stretch goal

##Research:

The following questions have been posed by a Cloudera sponsored DataScience challenge. We can aim to answer these and add more insights by including the NPI and Prescription info. The challenge list the questions as:

Some providers and regions are likely to be different in more subtle ways.

More here: http://www.cloudera.com/content/cloudera/en/training/certification/ccp-ds/challenge/register.html