Explording CMS Medicare Data

Introduction, Data Import and Verification

Synopsis

Medicare costs are increasing at 9% every year, outpacing private health insurance spending by 4.6%. Medicare, additionally, accounts for 15% of the United States federal budget and for 20% of health care expenditure. An understanding of where the money is being spent within Medicare is crucial for understanding how to reduce the overall healthcare bill in the United States.

The Centers for Medicare Services (CMS) prepares a dataset every year called the Provider Utilization and Payment Data Inpatient Public Use File (Inpatient PUF) to help improve transparency of the healthcare system.

The dataset discussed in this report includes information on services provided to those eligible for Medicare for the Fiscal Year 2014.

This document will define the items tracked within this PUF in Data Description section. Next, we will import the data and look at the integrity of the data and the types of observations for each data column. The Data Cleaning section will respond to the integrity and manipulate the data in a format that will make analysis simple.

This document will also investigate where Medicare charges are originating, analyzing providers, specific charges and the parts of the body that put the most burden on Medicare to shed some light on where preventative measures could be taken.

Data description

The column categories in this dataset are described below:

DRG.Definition: The DRG code describes the MS-DRG code for each line item. The MS-DRG (Medicare Severity Diagnosis Related Groups) coding system is used by Medicare to group, price and reimburse Medicare claims. Since this dataset concerns inpatients only, a MS-DRG is assigned to each patient. The MS-DRG ranges from 001-999 with many numbers unused to accommodate future codes. Some diagnoses have multiple MS-DRGs that include MCC, CC or no CC/MCC. MCC is a major complication, while CC is a less significant complication that occurs during the inpatient stay.

Example of MS-DRG with complication coding:

  • MS-DRG 539, Osteomyelitis with MCC
  • MS-DRG 540, Osteomyelitis with CC
  • MS-DRG 541, Osteomyelitis without CC/MCC

A DRG.Definition code is also grouped via a separate table as one of 25 Major Diagnostic Categories (MDC). Later, we will import a table that categorizes and join these tables together.

Examples of MDC categories:

  • MDC 1 Diseases and Disorders of the Nervous System
  • MDC 2 Diseases and Disorders of the Eye
  • MDC 3 Diseases and Disorders of the Ear, Nose, Mouth and Throat

Provider.ID: The Provider.ID is a CMS Certification Number (CCN) assigned to each facility. A Mayo Clinic in Rochester, MN will have a different Provider.ID than a Mayo Clinic in Jacksonville, FL.

Provider.Name: The Provider.Name is the name of the facility providing Medicare treatment.

Provider.Street.Address: The Provider.Address is the street address of the facility providing Medicare treatment.

Provider.City: The Provider.City is the city of the provider facility.

Provider.State: The Provider.State is the state of the provider facility.

Provider.Zip.Code: The Provider.Zip.Code is the Zip Code of the provider facility.

Hospital.Referral.Region: The Hospital.Referral.Region (HRRs) are geographic units of analysis based on facility location zip codes that were developed by the Dartmouth Atlas of Health Care to delineate regional health care markets in the United States.

Total.Discharges: The number of discharges billed by the provider for inpatient services.

Average.Covered.Charges: The average charge/discharge for service from the provider.

Average.Total.Payments: The average payment made to the provider including co-payment and deductible amounts that the patient pays as well as additional third-party payments for coordination of benefits.

Average.Medicare.Payments: The average payment that Medicare pays to the provider not including beneficiary co-payments/deductible nor any additional payments to third-parties for coordination of benefits.

Packages required

The packages required for the code in this report are detailed in the code chunk below:

library(tidyverse) #tidyverse imports ggplot2 and dplyr (charting/manipulation)
library(stringr) #will be used to extract medical codes from categories
library(purrr) #the map function is used to customize functions and apply on df
library(rvest) #importing MDC categorization table
library(knitr) #table display

Importing data

The data is available at the below url: https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Inpatient.html

We will import the FY 2014 data as mentioned in the Synopsis. Since this is a zip file, we will assign a temp file and dump the zip file into this temp file. Using unz() we will open the 2014_CSV file and then unlink the temp file.

#let's import the zipfile from the cms.gov website by downloading to a temp file and unzipping the file.
url <- "https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Downloads/Inpatient_Data_2014_CSV.zip"
temp <- tempfile()
download.file(url, temp)
df <- read.csv(unz(temp, "Medicare_Provider_Charge_Inpatient_DRGALL_FY2014.csv"))
unlink(temp)

#convert to a tibble for data verification purposes
df<- as_tibble(df)

Now let’s import the MDC codes using hte rvest package for scraping a table from Wikipedia. Some codes were missing from the table on Wikipedia and so we manually edit these after the import.

**link: https://en.wikipedia.org/wiki/Major_Diagnostic_Category**

url <- "https://en.wikipedia.org/wiki/Major_Diagnostic_Category"
mdc <- url %>% 
  html() %>% 
  html_nodes(xpath='//*[@id="mw-content-text"]/table') %>%
  html_table()
mdc <- mdc[[1]]

Data verification

Missing values: As this is a transparency-driven government generated summary table for payments, the expectation for data integrity is high. The user guide does not call out missing values and there are no null values in the dataset. Additionally, we explore each variable to ensure that the values make sense below.

sum(is.na(df))
## [1] 0

We use the summary() function to look at the different variable columns and check the ints for odd values. Of note is that we have a zip code with 4 digits and a provider ID of 5 digits. Let’s check these out.

summary(df)
##                                                                   DRG.Definition  
##  194 - SIMPLE PNEUMONIA & PLEURISY W CC                                  :  2837  
##  871 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W MCC                :  2837  
##  292 - HEART FAILURE & SHOCK W CC                                        :  2774  
##  392 - ESOPHAGITIS, GASTROENT & MISC DIGEST DISORDERS W/O MCC            :  2743  
##  690 - KIDNEY & URINARY TRACT INFECTIONS W/O MCC                         :  2731  
##  470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC:  2702  
##  (Other)                                                                 :186032  
##   Provider.Id                            Provider.Name   
##  Min.   : 10001   GOOD SAMARITAN HOSPITAL       :   554  
##  1st Qu.:110082   BAPTIST MEDICAL CENTER        :   460  
##  Median :240093   FLORIDA HOSPITAL              :   413  
##  Mean   :255323   NEW YORK-PRESBYTERIAN HOSPITAL:   395  
##  3rd Qu.:380014   MERCY HOSPITAL                :   393  
##  Max.   :670088   SAINT FRANCIS MEDICAL CENTER  :   374  
##                   (Other)                       :200067  
##                      Provider.Street.Address      Provider.City   
##  601 E ROLLINS ST                :   413     CHICAGO     :  1878  
##  525 EAST 68TH STREET            :   395     BALTIMORE   :  1793  
##  20 YORK ST                      :   330     NEW YORK    :  1521  
##  ONE BARNES-JEWISH HOSPITAL PLAZA:   321     HOUSTON     :  1477  
##  4755 OGLETOWN-STANTON ROAD      :   317     PHILADELPHIA:  1416  
##  1216 SECOND STREET SOUTHWEST    :   315     SPRINGFIELD :  1323  
##  (Other)                         :200565     (Other)     :193248  
##  Provider.State   Provider.Zip.Code
##  CA     : 15387   Min.   : 1040    
##  FL     : 15278   1st Qu.:25301    
##  TX     : 14192   Median :43701    
##  NY     : 11423   Mean   :47029    
##  IL     :  9620   3rd Qu.:71603    
##  PA     :  9521   Max.   :99801    
##  (Other):127235                    
##  Hospital.Referral.Region..HRR..Description Total.Discharges 
##  CA - Los Angeles     :  4206               Min.   :  11.00  
##  MA - Boston          :  4141               1st Qu.:  14.00  
##  PA - Philadelphia    :  3397               Median :  22.00  
##  GA - Atlanta         :  3218               Mean   :  36.01  
##  TX - Houston         :  3092               3rd Qu.:  39.00  
##  NY - East Long Island:  2950               Max.   :3855.00  
##  (Other)              :181652                                
##  Average.Covered.Charges Average.Total.Payments Average.Medicare.Payments
##  Min.   :   1368         Min.   :  2320         Min.   :  1330           
##  1st Qu.:  20949         1st Qu.:  6259         1st Qu.:  5012           
##  Median :  34634         Median :  9180         Median :  7715           
##  Mean   :  52589         Mean   : 13169         Mean   : 11218           
##  3rd Qu.:  60762         3rd Qu.: 14632         3rd Qu.: 12473           
##  Max.   :2238699         Max.   :434396         Max.   :403454           
## 
df %>%
  filter(Provider.Id <=100000) %>%
  group_by(Provider.Name) %>%
  summarise(n=n())
## # A tibble: 571 x 2
##                           Provider.Name     n
##                                  <fctr> <int>
## 1               ABRAZO ARROWHEAD CAMPUS    50
## 2                 ABRAZO CENTRAL CAMPUS    57
## 3                ABRAZO MARYVALE CAMPUS     8
## 4              ABRAZO SCOTTSDALE CAMPUS    17
## 5                    ABRAZO WEST CAMPUS    43
## 6              ADVENTIST MEDICAL CENTER    82
## 7    ADVENTIST MEDICAL CENTER - REEDLEY     2
## 8  AHMC ANAHEIM REGIONAL MEDICAL CENTER    56
## 9                      ALAMEDA HOSPITAL    28
## 10         ALASKA NATIVE MEDICAL CENTER    29
## # ... with 561 more rows

It looks like some places do have only 4 digits in their zip codes! New England states. Let’s now look at the provider ID with 5 digits:

df %>%
  filter(Provider.Zip.Code <=10000) %>%
  group_by(Provider.Name) %>%
  summarise(n=n())
## # A tibble: 202 x 2
##                                     Provider.Name     n
##                                            <fctr> <int>
## 1                ADCARE HOSPITAL OF WORCESTER INC     4
## 2                            ANNA JAQUES HOSPITAL    66
## 3                        AROOSTOOK MEDICAL CENTER    28
## 4  ATLANTICARE REGIONAL MEDICAL CENTER - CITY DIV   202
## 5                     BAYSHORE COMMUNITY HOSPITAL    79
## 6                BAYSTATE FRANKLIN MEDICAL CENTER    39
## 7                     BAYSTATE MARY LANE HOSPITAL    15
## 8                         BAYSTATE MEDICAL CENTER   220
## 9      BAYSTATE WING HOSPITAL AND MEDICAL CENTERS    29
## 10                 BERGEN REGIONAL MEDICAL CENTER    11
## # ... with 192 more rows

There are 561 providers with 5 digit provider IDs so this seems normal.

For the MDC data table from Wikipedia, we find that some DRG codes are missing from the categorization menu. Additionally, a “pre-MDC” row is included, which is not needed for our data.

str(mdc)
## 'data.frame':    28 obs. of  3 variables:
##  $ MDC        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Description: chr  "Pre-MDC" "Diseases and Disorders of the Nervous System" "Diseases and Disorders of the Eye" "Diseases and Disorders of the Ear, Nose, Mouth And Throat" ...
##  $ MS-DRG[1]  : chr  "" "020 - 103" "113 - 125" "129 - 159" ...
summarise(mdc)
## data frame with 0 columns and 0 rows
mdc <- mdc[-1,] #Removes pre-MDC row
mdc <- rename(mdc, DRG_codes = `MS-DRG[1]`) #Rename column to DRG_codes
mdc$DRG_codes[27] <- "981 - 997" #contained 999, which does not exist in dataset
mdc$DRG_codes[14] <- "765 - 782" #contained 998, which does not exist in dataset
mdc$DRG_codes[26] <- "001 - 017" #missing codes on Wikipedia
mdc$DRG_codes[9] <- "570 - 607" #missing codes on Wikipedia

Column information after importing the data:

Let’s look at unique values for each variable and explore the summary statistics for number data types.

#data type
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame':    202656 obs. of  12 variables:
##  $ DRG.Definition                            : Factor w/ 564 levels "001 - HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Provider.Id                               : int  10033 30103 50108 50262 50441 50454 50625 70022 70025 90011 ...
##  $ Provider.Name                             : Factor w/ 3151 levels "ABBEVILLE GENERAL HOSPITAL",..: 2913 1483 2715 2226 2670 2864 399 3144 1000 1543 ...
##  $ Provider.Street.Address                   : Factor w/ 3266 levels "#1 MEDICAL PARK DRIVE",..: 2587 2471 1599 2833 1684 2326 3020 1093 2881 270 ...
##  $ Provider.City                             : Factor w/ 1949 levels "ABBEVILLE","ABERDEEN",..: 157 1348 1499 978 1655 1527 978 1190 721 1832 ...
##  $ Provider.State                            : Factor w/ 51 levels "AK","AL","AR",..: 2 4 5 5 5 5 5 7 7 8 ...
##  $ Provider.Zip.Code                         : int  35233 85054 95816 90095 94305 94143 90048 6504 6102 20010 ...
##  $ Hospital.Referral.Region..HRR..Description: Factor w/ 306 levels "AK - Anchorage",..: 2 14 28 22 35 32 22 50 49 51 ...
##  $ Total.Discharges                          : int  13 20 25 14 23 20 43 12 17 31 ...
##  $ Average.Covered.Charges                   : num  1172866 437531 815674 1499044 2238699 ...
##  $ Average.Total.Payments                    : num  251876 240423 233197 415969 420865 ...
##  $ Average.Medicare.Payments                 : num  244458 133510 221682 366609 403454 ...
length(unique(df$DRG.Definition))
## [1] 564

DRG.Definition is a character string with 564 unique MS-DRG codes in this dataset. This means that there were (999-564) codes unassigned in the FY 2014 because they weren’t charged or they don’t exist.

length(unique(df$Provider.Id))
## [1] 3276
length(unique(df$Provider.Name))
## [1] 3151

Provider.Id is an integer, but since this is a unique ID, we could convert this later. There are 3276 unique IDs.

Provider.Name is a character string. There are 3151 unique names, which is less than IDs. Some providers have multiple locations. For example, Wesley Medical Center has a facility in KS and MS.

Provider.City is a character string. There are 1949 cities in this dataset.

Provider.State is a character string with the state names. There are 51 states. We look at the list to see why and see that Washington, D.C. is considered a State here. All of the Cities for State DC are “Washington”, which makes sense.

Provider.Zip.Code is a integer, which could be convereted to a character string later based on context. There are 2987 total zip codes in the dataset.

Hospital.Referral.Region..HRR is a character string. There are 306 HRRs.

length(unique(df$Provider.City))
## [1] 1949
length(unique(df$Provider.State))
## [1] 51
unique(df$Provider.State)
##  [1] AL AZ CA CT DC FL GA IL IN KY LA MA MI MN MO NE NJ NY NC OH OK PA SC
## [24] TN TX VA WA WI AK AR CO DE ID IA KS ME MD MS NV NH NM ND OR RI UT VT
## [47] WV HI SD MT WY
## 51 Levels: AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA ... WY
length(unique(df$Provider.Zip.Code))
## [1] 2987
length(unique(df$Hospital.Referral.Region..HRR..Description))
## [1] 306

Total.Discharges is an integer. There were between 11 and 3855 discharges for each DRG with a median of 22 and a mean of 36. Let’s look at the 3855 to see if this makes sense. The Hospital for Special Surgery in Manhattan discharged 3855 inpatients for major joint replacement or reattachment of lower extremity without complications.

Average.Covered.Charges is an integer. On average, the provider billed $52,590/discharge with a median of $34,630/discharge. The lowest amount billed per discharge was 1368 and highest was 2239000. Respectively, these are from alcohol or drug abuse at Healthsource Saginaw and heart transplant at Stanford Hospital. Both of these seem logical.

Average.Total.Payments is an integer that spans between $2320 and $434400. The median total payment per discharge was $9180, while the mean total payment per discharge is $13170. The minimum payment was again for alcohol or drug abuse inpatient treatment but at Proctor Hospital in Peoria, IL and the maximum payment was for Cranial & Nerve Disorders with Major Complications at Memorial Hospital in Jacksonville, FL.

Average.Medicare.Payments is an integer that spans between $1330 and $403500. The median total payment per discharge was $7715 and the mean was $11220. This makes sense as it’s less than the Average.Total.Payments, which includes more additional costs for the same discharges. The minimum payment was for Cranial & Nerve Disorders without Major Complications at Thibodaux Regional Medical Center in LA. The maximum payment per discharge was for Heart Transplant with Major Complications at Stanford Hospital in CA, again.

summary(df$Total.Discharges)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   14.00   22.00   36.01   39.00 3855.00
df %>% filter(Total.Discharges == 3855)
## # A tibble: 1 x 12
##                                                             DRG.Definition
##                                                                     <fctr>
## 1 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC
## # ... with 11 more variables: Provider.Id <int>, Provider.Name <fctr>,
## #   Provider.Street.Address <fctr>, Provider.City <fctr>,
## #   Provider.State <fctr>, Provider.Zip.Code <int>,
## #   Hospital.Referral.Region..HRR..Description <fctr>,
## #   Total.Discharges <int>, Average.Covered.Charges <dbl>,
## #   Average.Total.Payments <dbl>, Average.Medicare.Payments <dbl>
boxplot(df$Total.Discharges) #a little strange - worth exploring the highpoint

summary(df$Average.Covered.Charges)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1368   20950   34630   52590   60760 2239000
df %>% filter(min(Average.Covered.Charges) == Average.Covered.Charges)
## # A tibble: 1 x 12
##                                     DRG.Definition Provider.Id
##                                             <fctr>       <int>
## 1 894 - ALCOHOL/DRUG ABUSE OR DEPENDENCE, LEFT AMA      230275
## # ... with 10 more variables: Provider.Name <fctr>,
## #   Provider.Street.Address <fctr>, Provider.City <fctr>,
## #   Provider.State <fctr>, Provider.Zip.Code <int>,
## #   Hospital.Referral.Region..HRR..Description <fctr>,
## #   Total.Discharges <int>, Average.Covered.Charges <dbl>,
## #   Average.Total.Payments <dbl>, Average.Medicare.Payments <dbl>
df %>% filter(max(Average.Covered.Charges) == Average.Covered.Charges)
## # A tibble: 1 x 12
##                                                   DRG.Definition
##                                                           <fctr>
## 1 001 - HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC
## # ... with 11 more variables: Provider.Id <int>, Provider.Name <fctr>,
## #   Provider.Street.Address <fctr>, Provider.City <fctr>,
## #   Provider.State <fctr>, Provider.Zip.Code <int>,
## #   Hospital.Referral.Region..HRR..Description <fctr>,
## #   Total.Discharges <int>, Average.Covered.Charges <dbl>,
## #   Average.Total.Payments <dbl>, Average.Medicare.Payments <dbl>
summary(df$Average.Total.Payments)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2320    6259    9180   13170   14630  434400
df %>% filter(min(Average.Total.Payments) == Average.Total.Payments)
## # A tibble: 1 x 12
##                                     DRG.Definition Provider.Id
##                                             <fctr>       <int>
## 1 894 - ALCOHOL/DRUG ABUSE OR DEPENDENCE, LEFT AMA      140013
## # ... with 10 more variables: Provider.Name <fctr>,
## #   Provider.Street.Address <fctr>, Provider.City <fctr>,
## #   Provider.State <fctr>, Provider.Zip.Code <int>,
## #   Hospital.Referral.Region..HRR..Description <fctr>,
## #   Total.Discharges <int>, Average.Covered.Charges <dbl>,
## #   Average.Total.Payments <dbl>, Average.Medicare.Payments <dbl>
df %>% filter(max(Average.Total.Payments) == Average.Total.Payments)
## # A tibble: 1 x 12
##                                     DRG.Definition Provider.Id
##                                             <fctr>       <int>
## 1 073 - CRANIAL & PERIPHERAL NERVE DISORDERS W MCC      100179
## # ... with 10 more variables: Provider.Name <fctr>,
## #   Provider.Street.Address <fctr>, Provider.City <fctr>,
## #   Provider.State <fctr>, Provider.Zip.Code <int>,
## #   Hospital.Referral.Region..HRR..Description <fctr>,
## #   Total.Discharges <int>, Average.Covered.Charges <dbl>,
## #   Average.Total.Payments <dbl>, Average.Medicare.Payments <dbl>
summary(df$Average.Medicare.Payments)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1330    5012    7715   11220   12470  403500
test <- df %>%
  filter(min(Average.Medicare.Payments) == Average.Medicare.Payments)
df %>%
  filter(max(Average.Medicare.Payments) == Average.Medicare.Payments)
## # A tibble: 1 x 12
##                                                   DRG.Definition
##                                                           <fctr>
## 1 001 - HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC
## # ... with 11 more variables: Provider.Id <int>, Provider.Name <fctr>,
## #   Provider.Street.Address <fctr>, Provider.City <fctr>,
## #   Provider.State <fctr>, Provider.Zip.Code <int>,
## #   Hospital.Referral.Region..HRR..Description <fctr>,
## #   Total.Discharges <int>, Average.Covered.Charges <dbl>,
## #   Average.Total.Payments <dbl>, Average.Medicare.Payments <dbl>

For the separate MDC table, we can see that each of the three columns is unique with 27 MDCs, descriptions and DRG code ranges.

str(mdc)
## 'data.frame':    27 obs. of  3 variables:
##  $ MDC        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Description: chr  "Diseases and Disorders of the Nervous System" "Diseases and Disorders of the Eye" "Diseases and Disorders of the Ear, Nose, Mouth And Throat" "Diseases and Disorders of the Respiratory System" ...
##  $ DRG_codes  : chr  "020 - 103" "113 - 125" "129 - 159" "163 - 208" ...
summary(mdc)
##       MDC       Description         DRG_codes        
##  Min.   : 1.0   Length:27          Length:27         
##  1st Qu.: 7.5   Class :character   Class :character  
##  Median :14.0   Mode  :character   Mode  :character  
##  Mean   :14.0                                        
##  3rd Qu.:20.5                                        
##  Max.   :27.0
length(unique(mdc$MDC))
## [1] 27
length(unique(mdc$Description))
## [1] 27
length(unique(mdc$DRG_codes))
## [1] 27

Joining the MDC and DRG information together

To join the MDC categorizations to the DRG codes, we must first separate the DRG code from the DRG.Definition column into just the DRG code. We do this by creating a function, called str_extract that returns just the code.

We then use the purrr package map_chr function to apply the function through the data frame.

#create new column of DRG codes
sep_col <- function(s)
{
  return(str_extract(s, "\\d+"))
}

df$DRG_code <- map_chr(df$DRG.Definition, sep_col)

The DRG codes that are included in each MDC are listed as a range: Format: “###-###”

There are a few ways to join the MDC code and description from the MDC table into the DRG table. Two functions will be used here to take a string and search within the range of DRG codes on the MDC table and return either the MDC code or the description, depending on the function.

These two functions are applied on the DRG table to obtain two new columns, the MDC number and the MDC description, for each row.

getMDC <- function(x)
{
  for (j in seq_along(mdc$DRG_codes))
  {
    if((as.numeric(x) >= as.numeric(str_sub(mdc$DRG_codes[j],1,3))) &
      (as.numeric(x) <= as.numeric(str_sub(mdc$DRG_codes[j],-3,-1))))
    {
    return(mdc$MDC[j])
    break
    }
  }
}

getMDC_desc <- function(x)
{
  for (j in seq_along(mdc$DRG_codes))
  {
    if((as.numeric(x) >= as.numeric(str_sub(mdc$DRG_codes[j],1,3))) &
      (as.numeric(x) <= as.numeric(str_sub(mdc$DRG_codes[j],-3,-1))))
    {
    return(mdc$Description[j])
    break
    }
  }
}

df$MDC <- map_chr(df$DRG_code, getMDC)
df$MDC_desc <- map_chr(df$DRG_code, getMDC_desc)

Data Exploration

In this section, the dataset will be explored in an effort to understand which cost associated with different variables in the dataset.

Understanding Total Discharge Numbers and Location

Providers with the Most Discharges

First, we will group the data by Provider and add total discharges so we can see the most active providers.

Florida Hospital had by far the majority of discharges at 37,886 discharges in the year.

df_top <- df %>% group_by(Provider.Name) %>%
  summarise(total_discharge=sum(Total.Discharges))

boxplot(df_top$total_discharge)

summary(df_top$total_discharge)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    11.0   409.5  1326.0  2316.0  3170.0 37890.0
knitr::kable(top_n(df_top, 10, total_discharge))
Provider.Name total_discharge
BAPTIST MEDICAL CENTER 23500
BEAUMONT HOSPITAL, ROYAL OAK 19460
CHRISTIANA CARE HEALTH SERVICES, INC. 21049
FLORIDA HOSPITAL 37886
GOOD SAMARITAN HOSPITAL 18342
MAYO CLINIC HOSPITAL ROCHESTER 16543
METHODIST HEALTHCARE MEMPHIS HOSPITALS 17949
METHODIST HOSPITAL 21067
NEW YORK-PRESBYTERIAN HOSPITAL 27026
YALE-NEW HAVEN HOSPITAL 19062
df %>% group_by(Provider.Name) %>%
  summarise(total_discharge=sum(Total.Discharges)) %>% 
  top_n(10) %>%
  ggplot(aes(x=reorder(Provider.Name, -total_discharge), y=total_discharge))+
  geom_bar(stat='identity') +
  xlab("Provider") +
  ylab("Total discharges") +
  ggtitle("Most Discharges per Provider") +
  coord_flip()

States that Billed the Most to Medicare

California had by far and away the most charges. This information by itself isn’t super helpful as we should normalize for population. Ohio makes the top 10 at number 8, which is only interesting for me because I live in Ohio.

df_top <- df %>%
  mutate(total_charges = Average.Covered.Charges*Total.Discharges) %>% 
  group_by(Provider.State) %>%
  summarise(total_charges=sum(total_charges)) %>% 
  top_n(10, total_charges) %>% arrange(desc(total_charges))

knitr::kable(df_top)
Provider.State total_charges
CA 44365030424
FL 33961149467
TX 28221200172
NY 22119819185
NJ 20250229517
PA 18407847026
IL 16133340349
OH 11519645159
NC 9909072357
MI 9790055394
df_top %>% ggplot(aes(x=reorder(Provider.State, total_charges), y=total_charges)) +
  geom_bar(stat='identity') +
  ylab("State") +
  xlab("Total charges by State for code ($)") +
  ggtitle("States that have charged the most to Medicare 2014") +
  coord_flip()

Neither of these subcategories tells us a whole lot without additional information like state population.

Analyzing Average Provider Charges and Medicare Payments

By examining total billing and payments by different variables we can get a grasp of where most of the charges sit.

Which Providers Bill (and are Paid) the Most?

First, total about billed to Medicare and paid by Medicare is examined by provider.

Cedar-Sinai Medical Center had far and away the most billing to Medicare at ~$2.8 Billion. The provider was not in the top 10 for # of discharges so it’s possible that the medical center performed expensive procedures.

The ratio of paid to charged is also quantified here. This ratio for the top 10 providers that billed to Medicare is between 10.1 and 26.4%. Later we will look at how this percentage varies by MDC.

df_top <- df %>% mutate(total_c = Average.Covered.Charges*Total.Discharges,
                        total_p = Average.Medicare.Payments*Total.Discharges) %>% 
  group_by(Provider.Name) %>%
  summarise(total_charges=sum(total_c), total_paid = sum(total_p),
            ratio_paid_charged = sum(total_p)/sum(total_c)) %>% 
  top_n(10, wt = total_charges) %>%
  arrange(desc(total_charges))

knitr::kable(df_top)
Provider.Name total_charges total_paid ratio_paid_charged
CEDARS-SINAI MEDICAL CENTER 2788685964 313833786 0.1125382
FLORIDA HOSPITAL 2573698475 353695767 0.1374270
NEW YORK-PRESBYTERIAN HOSPITAL 2302847905 608910781 0.2644164
UPMC PRESBYTERIAN SHADYSIDE 1795562371 182333355 0.1015467
STANFORD HOSPITAL 1670843873 237589872 0.1421975
BAPTIST MEDICAL CENTER 1383022957 218881595 0.1582632
METHODIST HOSPITAL 1360782358 216809840 0.1593273
YALE-NEW HAVEN HOSPITAL 1337561828 335195291 0.2506017
GOOD SAMARITAN HOSPITAL 1297986815 204188419 0.1573116
MASSACHUSETTS GENERAL HOSPITAL 1255012049 298675391 0.2379861
df_top %>% ggplot(aes(x=reorder(Provider.Name, total_charges), y=total_charges)) +
  geom_bar(stat='identity') +
  xlab("Provider") +
  ylab("Total charges by Provider ($)") +
  ggtitle("Providers that billed the most in Medicare") +
  coord_flip()

df_top %>% arrange(desc(total_paid)) %>% 
  ggplot(aes(x=reorder(Provider.Name, total_paid), y=total_paid)) +
  geom_bar(stat='identity') +
  xlab("Provider") +
  ylab("Total paid to Provider ($)") +
  ggtitle("Providers that were paid the most by Medicare") +
  coord_flip()

The Most Expensive DRG Codes Across Medicare

Next, we look at the most expensive DRG across Medicare. The most expensive codes are 470 and 871, Major Joint Replacement/Reattachment of Lower Extremity and Septicemia/Severe Sepsis without MV. We can see that later in the list Septicemia or severe sepsis with MV shows again, so maybe MDC grouping will tell us something different.

The ratio paid for the top 10 charged was between 15.8 and 23.6%. The median was 22.7% and the minimum was DRG 247, a cardiovascular procedure.

df_top <- df %>% mutate(total_c = Average.Covered.Charges*Total.Discharges,
                        total_p = Average.Medicare.Payments*Total.Discharges) %>% 
  group_by(DRG.Definition) %>%
  summarise(total_charges=sum(total_c), total_paid = sum(total_p),
            ratio_paid_charged = sum(total_p)/sum(total_c)) %>% 
  top_n(10, wt = total_charges) %>%
  arrange(desc(total_charges))

knitr::kable(df_top)
DRG.Definition total_charges total_paid ratio_paid_charged
470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC 25653617785 5596091568 0.2181404
871 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W MCC 23292420516 5306424301 0.2278176
853 - INFECTIOUS & PARASITIC DISEASES W O.R. PROCEDURE W MCC 8367934257 1972994594 0.2357804
291 - HEART FAILURE & SHOCK W MCC 8331664766 1901379449 0.2282112
460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC 7745595793 1802530535 0.2327168
247 - PERC CARDIOVASC PROC W DRUG-ELUTING STENT W/O MCC 6544077580 1036121273 0.1583296
003 - ECMO OR TRACH W MV 96+ HRS OR PDX EXC FACE, MOUTH & NECK W MAJ O.R. 6497437346 1529531693 0.2354054
193 - SIMPLE PNEUMONIA & PLEURISY W MCC 5399897958 1165581160 0.2158524
292 - HEART FAILURE & SHOCK W CC 5333146088 1207892117 0.2264877
870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS 5119709555 1155757648 0.2257467
summary(df_top$ratio_paid_charged)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1583  0.2200  0.2272  0.2204  0.2316  0.2358
df_top %>% ggplot(aes(x=reorder(DRG.Definition, total_charges), y=total_charges)) +
  geom_bar(stat='identity') +
  xlab("DRG") +
  ylab("Total charges by DRG ($)") +
  theme(axis.text = element_text(size = 8)) +
  ggtitle("Most billed DRG") +
  coord_flip()

df_top %>% arrange(desc(total_paid)) %>% 
  ggplot(aes(x=reorder(DRG.Definition, total_paid), y=total_paid)) +
  geom_bar(stat='identity') +
  xlab("DRG") +
  ylab("Total paid by DRG ($)") +
  theme(axis.text = element_text(size = 8)) +
  ggtitle("Most paid DRG") +
  coord_flip()

The Most Expensive MDCs

Let’s look at a specific MDC to understand the most charged ‘organ’ system to Medicare.

df_top <- df %>% mutate(total_c = Average.Covered.Charges*Total.Discharges,
                        total_p = Average.Medicare.Payments*Total.Discharges) %>% 
  group_by(MDC_desc) %>%
  summarise(total_charges=sum(total_c), total_paid = sum(total_p),
            ratio_paid_charged = sum(total_p)/sum(total_c)) %>%
  arrange(desc(total_charges))

knitr::kable(df_top)
MDC_desc total_charges total_paid ratio_paid_charged
Diseases and Disorders of the Circulatory System 89163784935 18437486561 0.2067822
Diseases and Disorders of the Musculoskeletal System And Connective Tissue 58475069377 12786752478 0.2186702
Diseases and Disorders of the Respiratory System 44051223925 9380699345 0.2129498
Infectious and Parasitic DDs (Systemic or unspecified sites) 42566102774 9696949315 0.2278092
Diseases and Disorders of the Digestive System 31186796587 6568855659 0.2106294
Diseases and Disorders of the Nervous System 22458996599 4504694005 0.2005741
Diseases and Disorders of the Kidney And Urinary Tract 21165815046 4442528600 0.2098917
Transplants 12803788304 2979028618 0.2326678
Diseases and Disorders of the Endocrine, Nutritional And Metabolic System 7229438722 1554436781 0.2150149
Diseases and Disorders of the Hepatobiliary System And Pancreas 6292197617 1299026590 0.2064504
Diseases and Disorders of the Skin, Subcutaneous Tissue And Breast 4163313878 912154985 0.2190935
Diseases and Disorders of the Blood and Blood Forming Organs and Immunological Disorders 3419849391 723656974 0.2116049
Extensive Procedures Unrelated to Principal Diagnosis 3143558507 703853347 0.2239034
Injuries, Poison And Toxic Effect of Drugs 2507636029 564774943 0.2252221
Mental Diseases and Disorders 2315320272 745177152 0.3218463
Myeloproliferative DDs (Poorly Differentiated Neoplasms) 1681070320 388517968 0.2311135
Factors Influencing Health Status and Other Contacts with Health Services 1122970543 313803568 0.2794406
Alcohol/Drug Use or Induced Mental Disorders 792063049 266490314 0.3364509
Diseases and Disorders of the Male Reproductive System 588586033 100039171 0.1699652
Diseases and Disorders of the Ear, Nose, Mouth And Throat 525155845 94743650 0.1804105
Diseases and Disorders of the Female Reproductive System 471446747 96396244 0.2044690
Multiple Significant Trauma 312485680 72249816 0.2312100
Burns 199470077 50869737 0.2550244
Human Immunodeficiency Virus Infection 194030821 54192299 0.2792974
Pregnancy, Childbirth And Puerperium 61051060 19912454 0.3261607
Diseases and Disorders of the Eye 21265894 4322467 0.2032582
summary(df_top$ratio_paid_charged)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1700  0.2076  0.2189  0.2323  0.2323  0.3365
df_top %>% ggplot(aes(x=reorder(MDC_desc, total_charges), y=total_charges)) +
  geom_bar(stat='identity') +
  xlab("MDC") +
  ylab("Total charges by MDC ($)") +
  theme(axis.text = element_text(size = 8), axis.text.x = element_text(angle = 90)) +
  ggtitle("Most billed by MDC in Medicare") +
  coord_flip()

df_top %>% arrange(desc(total_paid)) %>% 
  ggplot(aes(x=reorder(MDC_desc, total_paid), y=total_paid)) +
  geom_bar(stat='identity') +
  xlab("MDC") +
  ylab("Total paid by MDC ($)") +
  theme(axis.text = element_text(size = 8), axis.text.x = element_text(angle = 90)) +
  ggtitle("Most paid MDC in Medicare") +
  coord_flip()

The Circulatory System puts the most burden on our Medicare system, while the Musculoskeletal System and Connective Tissue is the next highest.

To see if the proportion of amount paid to amount charged is different by MDC grouping we perform a scatter plot and find that the relationship is quite linear.

df_top %>%
  ggplot(aes(x= total_charges, y=total_paid)) +
  geom_point() +
  xlab("Total charged to Medicare ($)") +
  ylab("Total paid by Medicare ($)") +
  ggtitle("Charges versus payments for Medicare by MDC") +
  coord_flip()

This above chart is fascinating. This means that the proportion paid to the providers is quite similar across all MDCs.

When we normalize against the total amount billed to Medicare, the results show a more holistic picture. The Circulatory System accounts for a quarter of all billing to Medicare!

df_new <- df %>% mutate(total_charges = Average.Covered.Charges*Total.Discharges)
total_medicare <- sum(df_new$total_charges)

df_top <- df %>%
  mutate(total_charges = Average.Covered.Charges*Total.Discharges) %>% 
  group_by(MDC_desc) %>%
  summarise(total_charges=sum(total_charges)/total_medicare) %>% 
  top_n(10, total_charges) %>% arrange(desc(total_charges))

knitr::kable(df_top)
MDC_desc total_charges
Diseases and Disorders of the Circulatory System 0.2498197
Diseases and Disorders of the Musculoskeletal System And Connective Tissue 0.1638359
Diseases and Disorders of the Respiratory System 0.1234230
Infectious and Parasitic DDs (Systemic or unspecified sites) 0.1192620
Diseases and Disorders of the Digestive System 0.0873794
Diseases and Disorders of the Nervous System 0.0629258
Diseases and Disorders of the Kidney And Urinary Tract 0.0593025
Transplants 0.0358737
Diseases and Disorders of the Endocrine, Nutritional And Metabolic System 0.0202555
Diseases and Disorders of the Hepatobiliary System And Pancreas 0.0176295
df_top %>% ggplot(aes(x=reorder(MDC_desc, total_charges), y=total_charges)) +
  geom_bar(stat='identity') +
  xlab("MDC description") +
  ylab("% of total Medicare amount billed") +
  theme(axis.text = element_text(size = 8), axis.text.x = element_text(angle = 90)) +
  ggtitle("Most expensive MDC groupings") +
  coord_flip()

To further analyze the proportion of payments to total charges, we will use a scatter plot again.

The results are surprising. Medicare pays out more than it is billed for certain DRG and MDCs. MDC 20, alcohol and drug use or induced mental disorders is a reoccuring MDC for the top payments/amount charged.

When the scatter plot is analyzed, across all rows, it’s interesting to see that the proportions for certain providers and certain procedures within the provider have a wide variation.

knitr::kable(df %>%
  mutate(prop_paid = Average.Medicare.Payments/Average.Covered.Charges) %>% 
  top_n(15, wt = prop_paid) %>% 
  arrange(desc(prop_paid)) %>% 
  select(DRG.Definition,Provider.Name,MDC_desc,prop_paid, Total.Discharges))
DRG.Definition Provider.Name MDC_desc prop_paid Total.Discharges
894 - ALCOHOL/DRUG ABUSE OR DEPENDENCE, LEFT AMA KENSINGTON HOSPITAL Alcohol/Drug Use or Induced Mental Disorders 6.473323 13
897 - ALCOHOL/DRUG ABUSE OR DEPENDENCE W/O REHABILITATION THERAPY W/O MCC KENSINGTON HOSPITAL Alcohol/Drug Use or Induced Mental Disorders 4.184601 32
313 - CHEST PAIN RED LAKE HOSPITAL A Diseases and Disorders of the Circulatory System 3.349014 13
775 - VAGINAL DELIVERY W/O COMPLICATING DIAGNOSES WOMAN’S HOSPITAL Pregnancy, Childbirth And Puerperium 3.320698 15
894 - ALCOHOL/DRUG ABUSE OR DEPENDENCE, LEFT AMA COLLEGE HOSPITAL COSTA MESA Alcohol/Drug Use or Induced Mental Disorders 2.813144 21
689 - KIDNEY & URINARY TRACT INFECTIONS W MCC FORT DEFIANCE INDIAN HOSPITAL Diseases and Disorders of the Kidney And Urinary Tract 2.727023 17
897 - ALCOHOL/DRUG ABUSE OR DEPENDENCE W/O REHABILITATION THERAPY W/O MCC COLLEGE HOSPITAL COSTA MESA Alcohol/Drug Use or Induced Mental Disorders 2.246975 106
895 - ALCOHOL/DRUG ABUSE OR DEPENDENCE W REHABILITATION THERAPY ST ELIZABETH’S MEDICAL CENTER Alcohol/Drug Use or Induced Mental Disorders 2.222524 114
690 - KIDNEY & URINARY TRACT INFECTIONS W/O MCC YUKON KUSKOKWIM DELTA REG HOSPITAL Diseases and Disorders of the Kidney And Urinary Tract 2.121023 11
894 - ALCOHOL/DRUG ABUSE OR DEPENDENCE, LEFT AMA BELLEVUE HOSPITAL CENTER Alcohol/Drug Use or Induced Mental Disorders 2.106272 16
847 - CHEMOTHERAPY W/O ACUTE LEUKEMIA AS SECONDARY DIAGNOSIS W CC BRONX-LEBANON HOSPITAL CENTER Myeloproliferative DDs (Poorly Differentiated Neoplasms) 2.071471 21
470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC CHICKASAW NATION MEDICAL CENTER Diseases and Disorders of the Musculoskeletal System And Connective Tissue 2.035351 16
897 - ALCOHOL/DRUG ABUSE OR DEPENDENCE W/O REHABILITATION THERAPY W/O MCC BRONX-LEBANON HOSPITAL CENTER Alcohol/Drug Use or Induced Mental Disorders 2.026275 56
603 - CELLULITIS W/O MCC YUKON KUSKOKWIM DELTA REG HOSPITAL Diseases and Disorders of the Skin, Subcutaneous Tissue And Breast 2.021620 22
743 - UTERINE & ADNEXA PROC FOR NON-MALIGNANCY W/O CC/MCC WOMAN’S HOSPITAL Diseases and Disorders of the Female Reproductive System 1.991255 24
df %>% 
  ggplot(aes(x= Average.Covered.Charges, y=Average.Medicare.Payments)) +
  geom_point() +
  xlab("Total charged to Medicare ($)") +
  ylab("Total paid by Medicare ($)") +
  ggtitle("Charges versus payments for Medicare by MDC") +
  coord_flip()

Exploring an individual MDC - Diseases of the Circulatory System

As seen earlier, the Circulatory System puts the most burden on our Medicare system, while the Musculoskeletal System and Connective Tissue comes in second.

In this section, we will explore MDC 5, Diseases and Disorders, to understand what contributes to the high expenditure of this category.

df5 <- df %>%
  filter(MDC == "5")
nrow(df5)
## [1] 45605

We see above that the total number of rows is about a quarter of the rows in the entire dataset. We saw already that total amount charged to medicare from this MDC was about 25%, so we will assign the total billed from this MDC to a variable so we can see which DRGs contributed most to this number.

df5_tot<- df5 %>% mutate(total_charges = Average.Covered.Charges*Total.Discharges)
total_mdc5 <- sum(df5_tot$total_charges)

Let’s see which DRGs and Providers had the highest average charges.

df5_top <- df5 %>%
  top_n(15, wt = Average.Covered.Charges) %>% 
  arrange(desc(Average.Covered.Charges)) %>% 
  select(DRG_code, Provider.Name,Average.Covered.Charges)

knitr::kable(df5_top)
DRG_code Provider.Name Average.Covered.Charges
216 GOOD SAMARITAN HOSPITAL 960960.6
216 STANFORD HOSPITAL 912233.5
215 ROBERT WOOD JOHNSON UNIVERSITY HOSPITAL 834653.1
216 CEDARS-SINAI MEDICAL CENTER 819072.0
219 GOOD SAMARITAN HOSPITAL 818491.4
219 REGIONAL MEDICAL CENTER OF SAN JOSE 754724.9
216 DOCTORS MEDICAL CENTER 751233.3
216 UNIVERSITY OF CALIFORNIA DAVIS MEDICAL CENTER 700702.9
233 GARFIELD MEDICAL CENTER 695798.1
233 REGIONAL MEDICAL CENTER OF SAN JOSE 689924.3
216 MERCY GENERAL HOSPITAL 688602.1
233 DOCTORS MEDICAL CENTER 676566.5
233 EASTON HOSPITAL 675319.0
235 STANFORD HOSPITAL 671651.5
219 ST HELENA HOSPITAL 671387.0
summary(df5_top$Average.Covered.Charges)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  671400  682600  700700  754800  818800  961000

The distribution from the summary function above shows that some providers and DRGs were abnormally high relative to the median and mean. The most expensive DRG is 216 with 219 also occurring frequently in the top 10. Good Samaritan Hospital and Stanford Hospital appear twice in this top 15 list.

When we look across DRGs below, we find that 291, Heart Failure, accounts for the highest amount of charges at 10% of the total MDC. Within 291, we find large discrepencies amongst providers and what they are charging. We show the top 15 culprits below.

df_new1 <- df5 %>%
  mutate(tot_charges = Total.Discharges*Average.Covered.Charges) %>% 
  group_by(DRG.Definition) %>% 
  summarise(pct = sum(tot_charges)/total_mdc5) %>% 
  arrange(desc(pct)) %>% 
  top_n(10)

knitr::kable(df_new1) # top 10 DRGs within MDC
DRG.Definition pct
291 - HEART FAILURE & SHOCK W MCC 0.0934423
247 - PERC CARDIOVASC PROC W DRUG-ELUTING STENT W/O MCC 0.0733939
292 - HEART FAILURE & SHOCK W CC 0.0598129
219 - CARDIAC VALVE & OTH MAJ CARDIOTHORACIC PROC W/O CARD CATH W MCC 0.0497051
220 - CARDIAC VALVE & OTH MAJ CARDIOTHORACIC PROC W/O CARD CATH W CC 0.0418262
287 - CIRCULATORY DISORDERS EXCEPT AMI, W CARD CATH W/O MCC 0.0381481
246 - PERC CARDIOVASC PROC W DRUG-ELUTING STENT W MCC OR 4+ VESSELS/STENTS 0.0367650
238 - MAJOR CARDIOVASC PROCEDURES W/O MCC 0.0346714
280 - ACUTE MYOCARDIAL INFARCTION, DISCHARGED ALIVE W MCC 0.0343773
252 - OTHER VASCULAR PROCEDURES W MCC 0.0320281
summary(df_new1$pct) #range shows that some DRGs are much more expensive
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03203 0.03519 0.03999 0.04942 0.05729 0.09344
sum(df_new1$pct)
## [1] 0.4941701

The above tibble shows us that 10 DRG diagnoses account for 50% of the MDC. The majority of these DRGs deal with Heart Failure, Stents or Major Cardiovasc Procedures. So 10 diagnoses account for 12.5% of all Medicare expenses.

For further visualization, we look at a scatter plot of all charges versus the payment for just heart failure, code 291. There is a wide array within a single DRG. Remember, these are average charges, not total charged.

df291 <- df5 %>% filter(DRG_code == "291")

summary(df291$Average.Covered.Charges)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8195   23930   33590   40170   49730  208600
boxplot(df291$Average.Covered.Charges)

knitr::kable(df291 %>% 
  select(Provider.Name, Average.Covered.Charges) %>%
  arrange(desc(Average.Covered.Charges)) %>% 
  top_n(15, Average.Covered.Charges))
Provider.Name Average.Covered.Charges
STANFORD HOSPITAL 208578.0
CAREPOINT HEALTH-CHRIST HOSPITAL 203639.7
CAREPOINT HEALTH - BAYONNE MEDICAL CENTER 193575.6
MONTEREY PARK HOSPITAL 183909.4
SAINT PETER’S UNIVERSITY HOSPITAL 162203.3
HAHNEMANN UNIVERSITY HOSPITAL 156441.7
WESTCHESTER MEDICAL CENTER 154738.5
CAPITAL HEALTH MEDICAL CENTER - HOPEWELL 153973.9
GATEWAY REGIONAL MEDICAL CENTER 153616.5
CAPITAL HEALTH SYSTEM-FULD CAMPUS 149121.5
REGIONAL MEDICAL CENTER OF SAN JOSE 148973.0
DOCTORS MEDICAL CENTER 146757.2
CEDARS-SINAI MEDICAL CENTER 146381.1
SEQUOIA HOSPITAL 141240.4
VALLEY HOSPITAL MEDICAL CENTER 139482.7
df291 %>% 
  ggplot(aes(x= Average.Covered.Charges, y=Average.Medicare.Payments)) +
  geom_point() +
  xlab("Average charged to Medicare ($)") +
  ylab("Average paid by Medicare ($)") +
  ggtitle("Charges versus payments for Medicare by MDC") +
  coord_flip()

The above analysis of the individual MDC shows a wide array paid to hospitals performing precedures on the same diagnoses. This tells us a simple things. That prevention should focus on care of the heart. Heart precedures account for 25% of all billed to Medicare and 10 diagnoses accounting for half of those charges.

Additionally, many providers seem to charge more for the same procedure. It would be worthwhile for Medicare to investigate (assuming they have not already) why certain providers bill so far away from the median.

Summary

This publically available dataset provides transparency to how Medicare is paying across procedures. After attaching the MDC organ system to the DRG codes we can see a story of what charges make up the largest share of Medicare charges.

When breaking the data down on an MDC level, it is easy to see that a few organ systems make up a majority of the Medicare charges. Ailments of the Circulatory System, the Musculoskeletal System, the Respiratory System and of Infectious diseases make up a disproportionatly high share of Medicare costs.

Across MDCs, however, the portion paid compared to the portion charged by the provider generated a linear relationship. This means that Medicare pays similar amounts to the providers regardless of the organ system.

When analyzing DRG codes, joint replacement of the lower part of the body made up the largest total cost to Medicare. of the top 10, most of the DRGs had a similar paid/charge ratio. This illustated that Medicare payments seem to be consistent across the top expenditures.

We see a disproportionate share of total amount billed to Medicare attributed to the Circulatory system. 10 diagnoses within this MDC account for half of the amount billed to Medicare under this MDC. Additionally, there is a wide array of average charges within the most expensive DRG, 291 - Heart Failure.

As shared in the previous section, the MDC level analysis gives us the best glimpse at Providers that charge more for a given DRG as well as allows the researcher to group the charges in a logical way. The Circulatory MDC accounts for a quarter of all bills to Medicare with a few procedures making up the majority. It would be worth investigating why some providers charge more (it’s possible that the better institutions get the trickier cases) as well as how the government can target preventative measures to reduce the burden on the government long term.