Look for an interesting 2 variable relationship in this dataset:
Here's the blurb on this dataset:
As part of the Obama Administration’s efforts to make our healthcare system more transparent, affordable, and accountable, the Centers for Medicare & Medicaid Services (CMS) has prepared a public data set, the Medicare Provider Utilization and Payment Data: Physician and Other Supplier Public Use File (Physician and Other Supplier PUF), with information on services and procedures provided to Medicare beneficiaries by physicians and other healthcare professionals. The Physician and Other Supplier PUF contains information on utilization, payment (allowed amount and Medicare payment), and submitted charges organized by National Provider Identifier (NPI), Healthcare Common Procedure Coding System (HCPCS) code, and place of service. This PUF is based on information from CMS’s National Claims History Standard Analytic Files. The data in the Physician and Other Supplier PUF covers calendar year 2012 and contains 100% final-action physician/supplier Part B non-institutional line items for the Medicare fee-for-service population.
While the Physician and Other Supplier PUF has a wealth of information on payment and utilization for Medicare Part B services, the dataset has a number of limitations. Of particular importance is the fact that the data may not be representative of a physician’s entire practice as it only includes information on Medicare fee-for-service beneficiaries. In addition, the data are not intended to indicate the quality of care provided and are not risk-adjusted to account for differences in underlying severity of disease of patient populations. For additional limitations, please review the methodology document available below.
Also, the HCPCS code is copyrighted by American Medical Association. In order to download the dataset, I had to consent to AMA's copyright restrictions. However, I didn't read or understand them, so I have no idea whether this analysis is permissible under that agreement.
library(ggplot2)
library(gridExtra)
## Loading required package: grid
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
colNames = c("npi", "nppes_provider_gender", "nppes_entity_code", "nppes_provider_zip",
"nppes_provider_state", "nppes_provider_country", "provider_type", "medicare_participation_indicator",
"place_of_service", "hcpcs_code", "hcpcs_description", "line_srvc_cnt",
"bene_unique_cnt", "bene_day_srvc_cnt", "average_Medicare_allowed_amt",
"stdev_Medicare_allowed_amt", "average_submitted_chrg_amt", "stdev_submitted_chrg_amt",
"average_Medicare_payment_amt", "stdev_Medicare_payment_amt")
colClasses = c("character", "character", "character", "character", "character",
"character", "character", "character", "character", "character", "character",
"numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric",
"numeric", "numeric")
puf = read.delim("~/MOOC_Data/DataExploration/PUF-CY2012-truncated.txt", header = FALSE,
skip = 2, nrows = 9500000, col.names = colNames, colClasses = colClasses) ## takes many minutes
What I'm interested in in this dataset is costs: amount billed, max allowed by Medicare, then amount Medicare actually paid.
But any graph from this dataframe takes forever to plot. So, let's aggregate the data a bit. Compute averages of the costs weighted by number of services delivered over whole state.
puf$tot_submitted = puf$line_srvc_cnt * puf$average_submitted_chrg_amt
puf$tot_medicare_paid = puf$line_srvc_cnt * puf$average_Medicare_payment_amt
puf$tot_medicare_allowed = puf$line_srvc_cnt * puf$average_Medicare_allowed_amt
pufAgg = puf %.% group_by(hcpcs_code) %.% summarize(n = n(), avg_submitted = sum(tot_submitted)/sum(line_srvc_cnt),
avg_medicare_allowed = sum(tot_medicare_allowed)/sum(line_srvc_cnt), avg_medicare_paid = sum(tot_medicare_paid)/sum(line_srvc_cnt)) %.%
mutate(billedToMaxRatio = avg_submitted/avg_medicare_allowed)
What is the distribution of frequency of procedures?
ggplot(data = pufAgg, aes(x = n)) + geom_histogram(binwidth = 4e+05/100) + coord_trans(y = "sqrt") +
labs(x = "number of services performed", y = "count", title = "Frequency of various medical services performed")
The distribution of service counts is very highly skewed: most kinds of services are performed only a few times a year (by all the Medicare providers in the world!). Of the 5949 service codes, 20 or so account for more than half of all services performed by number.
Let's get a feel for the distribution of costs, shall we? Focus on amount billed by provider.
ggplot(data = pufAgg, aes(x = n, y = avg_submitted)) + geom_point(rm.na = TRUE,
alpha = 1/2) + labs(x = "Number of services performed", y = "Avg submitted cost ($)",
title = "Procedure costs submitted to Medicare 2012, all procedures")
The very expensive services are actually delivered very rarely. Services billed at > $20,000 per incident were actually delivered only 10 times in the survey year (2012). And the commonly delivered kinds of services are much cheaper.
ggplot(data = pufAgg, aes(x = n, y = avg_submitted)) + geom_point(rm.na = TRUE,
alpha = 1/2) + coord_trans(x = "log10", y = "log10") + scale_x_continuous(breaks = c(1,
3, 10, 30, 100, 300, 1000, 3000, 10000, 30000, 1e+05)) + scale_y_continuous(breaks = c(0,
1, 3, 10, 30, 100, 300, 1000, 3000, 10000, 30000, 1e+05)) + labs(x = "Number of services performed",
y = "Avg submitted cost ($)", title = "Procedure costs submitted to Medicare 2012, all procedures")
Is there any correlation between frequency and submitted cost of a service? It doesn't look it on the chart, but let's compute correlation:
ServiceCostCor = cor.test(pufAgg$avg_submitted, pufAgg$n)
ServiceCostCor
##
## Pearson's product-moment correlation
##
## data: pufAgg$avg_submitted and pufAgg$n
## t = -4.495, df = 5947, p-value = 7.101e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08347 -0.03282
## sample estimates:
## cor
## -0.05818
So not a significant correlation, yet the scaled chart does show a pattern: the more often the service is performed, the tighter a range the amount billed falls into. For example, the range of prices for services performed 2 times varies from about $1 to almost $30,000, 5 orders of magnitude. But the range of prices for services performed 30,000 times is only $10 to $1000, 2 orders of magnitude. This makes sense, the more commonly performed services would receive more scrutiny for cost control.
So the amount billed for an individual delivery of a service doesn't suprise. But maybe those very expensive services add up to a significant portion of total health care billing, even though they are not common?
Computing the total amount of money spent on each kind of service (cost * number of deliveries) (and adding graphN to allow the rare service points to be plotted with bigger size)
rareThreshold = 100
pufAgg$total_submitted = pufAgg$n * pufAgg$avg_submitted
pufAgg$graphN = log10(pufAgg$n) + 3 * (pufAgg$n < rareThreshold)
ggplot(data = pufAgg, aes(x = hcpcs_code, y = total_submitted)) + geom_point(aes(size = graphN,
color = n, shape = (n < rareThreshold)), alpha = 0.5) + scale_color_gradient(low = "blue",
high = "orange") + coord_trans(y = "sqrt") + labs(x = "Service Code", y = "Total amount billed during year",
title = "Aggregate billing by service code")
From the chart, the really rare kinds of service (n < 100, plotted as triangles) do not appear above the clutter on the origin. To confirm:
AggCostTop1pct = summary(subset(pufAgg, total_submitted > quantile(total_submitted,
0.99))$n)
AggCostTop1pct
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1770 6330 16600 46500 47400 396000
minN = AggCostTop1pct[1]
minNTotal = subset(pufAgg, n == minN)$total_submitted
maxN = AggCostTop1pct[6]
maxNTotal = subset(pufAgg, n == maxN)$total_submitted
So in the top 1% of services by total amount submitted, the least frequently used service was delivered 1770 times, with total billing . But the most frequently used service had total billing of . So moderately rare services do contribute a significant minority of total billing.