Synopsis

The Centers for Medicare Services (CMS) prepares a dataset every year called the Provider Utilization and Payment Data Inpatient Public Use File (Inpatient PUF) to help improve transparency of the healthcare system. This dataset includes information on services provided to those eligible for Medicare for the Fiscal Year 2014.

This document will define the items tracked within this PUF in Data Description section. Next, we will import the data and look at the integrity of the data and the types of observations for each data column. The Data Cleaning section will respond to the integrity and manipulate the data in a format that will make analysis simple.

Finally, this document will state the analysis goals and plans to portray the data by the deadline, December 10, 2016.

Data description

The column categories in this dataset are described below:

DRG.Definition: The DRG code describes the MS-DRG code for each line item. The MS-DRG (Medicare Severity Diagnosis Related Groups) coding system is used by Medicare to group, price and reimburse Medicare claims. Since this dataset concerns inpatients only, a MS-DRG is assigned to each patient. The MS-DRG ranges from 001-999 with many numbers unused to accommodate future codes. Some diagnoses have multiple MS-DRGs that include MCC, CC or no CC/MCC. MCC is a major complication, while CC is a less significant complication that occurs during the inpatient stay.

Example of MS-DRG with complication coding:

A DRG.Definition code is also grouped via a separate table as one of 25 Major Diagnostic Categories (MDC). Later, we will import a table that categorizes and join these tables together.

Examples of MDC categories:

Provider.ID: The Provider.ID is a CMS Certification Number (CCN) assigned to each facility. A Mayo Clinic in Rochester, MN will have a different Provider.ID than a Mayo Clinic in Jacksonville, FL.

Provider.Name: The Provider.Name is the name of the facility providing Medicare treatment.

Provider.Street.Address: The Provider.Address is the street address of the facility providing Medicare treatment.

Provider.City: The Provider.City is the city of the provider facility.

Provider.State: The Provider.State is the state of the provider facility.

Provider.Zip.Code: The Provider.Zip.Code is the Zip Code of the provider facility.

Hospital.Referral.Region: The Hospital.Referral.Region (HRRs) are geographic units of analysis based on facility location zip codes that were developed by the Dartmouth Atlas of Health Care to delineate regional health care markets in the United States.

Total.Discharges: The number of discharges billed by the provider for inpatient services.

Average.Covered.Charges: The average charge/discharge for service from the provider.

Average.Total.Payments: The average payment made to the provider including co-payment and deductible amounts that the patient pays as well as additional third-party payments for coordination of benefits.

Average.Medicare.Payments: The average payment that Medicare pays to the provider not including beneficiary co-payments/deductible nor any additional payments to third-parties for coordination of benefits.

Packages required

library(tidyverse)

Importing data

The data is available at the below url: https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Inpatient.html

We will import the FY 2014 data as mentioned in the Synopsis. Since this is a zip file, we will assign a temp file and dump the zip file into this temp file. Using unz() we will open the 2014_CSV file and then unlink the temp file.

#let's import the zipfile from the cms.gov website by downloading to a temp file and unzipping the file.
url <- "https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Downloads/Inpatient_Data_2014_CSV.zip"
temp <- tempfile()
download.file(url, temp)
df <- read.csv(unz(temp, "Medicare_Provider_Charge_Inpatient_DRGALL_FY2014.csv"))
unlink(temp)

#convert to a tibble for data verification purposes
df<- as_tibble(df)

Data verification

Missing values: As this is a transparency-driven government generated summary table for payments, the expectation for data integrity is high. The user guide does not call out missing values and there are no null values in the dataset. Additionally, we explore each variable to ensure that the values make sense below.

sum(is.na(df))
## [1] 0

We use the summary() function to look at the different variable columns and check the ints for odd values. Of note is that we have a zip code with 4 digits and a provider ID of 5 digits. Let’s check these out.

summary(df)
##                                                                   DRG.Definition  
##  194 - SIMPLE PNEUMONIA & PLEURISY W CC                                  :  2837  
##  871 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W MCC                :  2837  
##  292 - HEART FAILURE & SHOCK W CC                                        :  2774  
##  392 - ESOPHAGITIS, GASTROENT & MISC DIGEST DISORDERS W/O MCC            :  2743  
##  690 - KIDNEY & URINARY TRACT INFECTIONS W/O MCC                         :  2731  
##  470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC:  2702  
##  (Other)                                                                 :186032  
##   Provider.Id                            Provider.Name   
##  Min.   : 10001   GOOD SAMARITAN HOSPITAL       :   554  
##  1st Qu.:110082   BAPTIST MEDICAL CENTER        :   460  
##  Median :240093   FLORIDA HOSPITAL              :   413  
##  Mean   :255323   NEW YORK-PRESBYTERIAN HOSPITAL:   395  
##  3rd Qu.:380014   MERCY HOSPITAL                :   393  
##  Max.   :670088   SAINT FRANCIS MEDICAL CENTER  :   374  
##                   (Other)                       :200067  
##                      Provider.Street.Address      Provider.City   
##  601 E ROLLINS ST                :   413     CHICAGO     :  1878  
##  525 EAST 68TH STREET            :   395     BALTIMORE   :  1793  
##  20 YORK ST                      :   330     NEW YORK    :  1521  
##  ONE BARNES-JEWISH HOSPITAL PLAZA:   321     HOUSTON     :  1477  
##  4755 OGLETOWN-STANTON ROAD      :   317     PHILADELPHIA:  1416  
##  1216 SECOND STREET SOUTHWEST    :   315     SPRINGFIELD :  1323  
##  (Other)                         :200565     (Other)     :193248  
##  Provider.State   Provider.Zip.Code
##  CA     : 15387   Min.   : 1040    
##  FL     : 15278   1st Qu.:25301    
##  TX     : 14192   Median :43701    
##  NY     : 11423   Mean   :47029    
##  IL     :  9620   3rd Qu.:71603    
##  PA     :  9521   Max.   :99801    
##  (Other):127235                    
##  Hospital.Referral.Region..HRR..Description Total.Discharges 
##  CA - Los Angeles     :  4206               Min.   :  11.00  
##  MA - Boston          :  4141               1st Qu.:  14.00  
##  PA - Philadelphia    :  3397               Median :  22.00  
##  GA - Atlanta         :  3218               Mean   :  36.01  
##  TX - Houston         :  3092               3rd Qu.:  39.00  
##  NY - East Long Island:  2950               Max.   :3855.00  
##  (Other)              :181652                                
##  Average.Covered.Charges Average.Total.Payments Average.Medicare.Payments
##  Min.   :   1368         Min.   :  2320         Min.   :  1330           
##  1st Qu.:  20949         1st Qu.:  6259         1st Qu.:  5012           
##  Median :  34634         Median :  9180         Median :  7715           
##  Mean   :  52589         Mean   : 13169         Mean   : 11218           
##  3rd Qu.:  60762         3rd Qu.: 14632         3rd Qu.: 12473           
##  Max.   :2238699         Max.   :434396         Max.   :403454           
## 
df %>% filter(Provider.Id <=100000) %>% group_by(Provider.Name) %>% summarise(n=n())
## # A tibble: 571 x 2
##                           Provider.Name     n
##                                  <fctr> <int>
## 1               ABRAZO ARROWHEAD CAMPUS    50
## 2                 ABRAZO CENTRAL CAMPUS    57
## 3                ABRAZO MARYVALE CAMPUS     8
## 4              ABRAZO SCOTTSDALE CAMPUS    17
## 5                    ABRAZO WEST CAMPUS    43
## 6              ADVENTIST MEDICAL CENTER    82
## 7    ADVENTIST MEDICAL CENTER - REEDLEY     2
## 8  AHMC ANAHEIM REGIONAL MEDICAL CENTER    56
## 9                      ALAMEDA HOSPITAL    28
## 10         ALASKA NATIVE MEDICAL CENTER    29
## # ... with 561 more rows

It looks like some places do have only 4 digits in their zip codes! New England states. Let’s now look at the provider ID with 5 digits:

df %>% filter(Provider.Zip.Code <=10000) %>% group_by(Provider.Name) %>% summarise(n=n())
## # A tibble: 202 x 2
##                                     Provider.Name     n
##                                            <fctr> <int>
## 1                ADCARE HOSPITAL OF WORCESTER INC     4
## 2                            ANNA JAQUES HOSPITAL    66
## 3                        AROOSTOOK MEDICAL CENTER    28
## 4  ATLANTICARE REGIONAL MEDICAL CENTER - CITY DIV   202
## 5                     BAYSHORE COMMUNITY HOSPITAL    79
## 6                BAYSTATE FRANKLIN MEDICAL CENTER    39
## 7                     BAYSTATE MARY LANE HOSPITAL    15
## 8                         BAYSTATE MEDICAL CENTER   220
## 9      BAYSTATE WING HOSPITAL AND MEDICAL CENTERS    29
## 10                 BERGEN REGIONAL MEDICAL CENTER    11
## # ... with 192 more rows

There are 561 providers with 5 digit provider IDs so this seems normal.

Column information after importing the data:

Let’s look at unique values for each variable and explore the summary statistics for numbers.

#data type
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame':    202656 obs. of  12 variables:
##  $ DRG.Definition                            : Factor w/ 564 levels "001 - HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Provider.Id                               : int  10033 30103 50108 50262 50441 50454 50625 70022 70025 90011 ...
##  $ Provider.Name                             : Factor w/ 3151 levels "ABBEVILLE GENERAL HOSPITAL",..: 2913 1483 2715 2226 2670 2864 399 3144 1000 1543 ...
##  $ Provider.Street.Address                   : Factor w/ 3266 levels "#1 MEDICAL PARK DRIVE",..: 2587 2471 1599 2833 1684 2326 3020 1093 2881 270 ...
##  $ Provider.City                             : Factor w/ 1949 levels "ABBEVILLE","ABERDEEN",..: 157 1348 1499 978 1655 1527 978 1190 721 1832 ...
##  $ Provider.State                            : Factor w/ 51 levels "AK","AL","AR",..: 2 4 5 5 5 5 5 7 7 8 ...
##  $ Provider.Zip.Code                         : int  35233 85054 95816 90095 94305 94143 90048 6504 6102 20010 ...
##  $ Hospital.Referral.Region..HRR..Description: Factor w/ 306 levels "AK - Anchorage",..: 2 14 28 22 35 32 22 50 49 51 ...
##  $ Total.Discharges                          : int  13 20 25 14 23 20 43 12 17 31 ...
##  $ Average.Covered.Charges                   : num  1172866 437531 815674 1499044 2238699 ...
##  $ Average.Total.Payments                    : num  251876 240423 233197 415969 420865 ...
##  $ Average.Medicare.Payments                 : num  244458 133510 221682 366609 403454 ...
length(unique(df$DRG.Definition))
## [1] 564

DRG.Definition is a character string with 564 unique MS-DRG codes in this dataset. This means that there were 999-564 codes unassigned in the FY 2014 because they weren’t charged or they don’t exist.

length(unique(df$Provider.Id))
## [1] 3276
length(unique(df$Provider.Name))
## [1] 3151
#Question: how to find the duplicate names for unique IDs?

Provider.Id is an integer, but since this is a unique ID, we could convert this later. There are 3276 unique IDs.

Provider.Name is a character string. There are 3151 unique names, which is less than IDs. Some providers have multiple locations. For example, Wesley Medical Center has a facility in KS and MS.

Provider.City is a character string. There are 1949 cities in this dataset.

Provider.State is a character string with the state names. There are 51 states. We look at the list to see why and see that Washington, D.C. is considered a State here. All of the Cities for State DC are “Washington”, which makes sense.

Provider.Zip.Code is a integer, which could be convereted to a character string later based on context. There are 2987 total zip codes in the dataset.

Hospital.Referral.Region..HRR is a character string. We will rename this mouthful later. There are 306 HRRs.

length(unique(df$Provider.City))
## [1] 1949
length(unique(df$Provider.State))
## [1] 51
unique(df$Provider.State)
##  [1] AL AZ CA CT DC FL GA IL IN KY LA MA MI MN MO NE NJ NY NC OH OK PA SC
## [24] TN TX VA WA WI AK AR CO DE ID IA KS ME MD MS NV NH NM ND OR RI UT VT
## [47] WV HI SD MT WY
## 51 Levels: AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA ... WY
length(unique(df$Provider.Zip.Code))
## [1] 2987
length(unique(df$Hospital.Referral.Region..HRR..Description))
## [1] 306

Total.Discharges is an integer. There were between 11 and 3855 discharges for each DRG with a median of 22 and a mean of 36. Let’s look at the 3855 to see if this makes sense. The Hospital for Special Surgery in Manhattan discharged 3855 inpatients for major joint replacement or reattachment of lower extremity without complications.

Average.Covered.Charges is an integer. On average, the provider billed $52,590/discharge with a median of $34,630/discharge. The lowest amount billed per discharge was 1368 and highest was 2239000. Respectively, these are from alcohol or drug abuse at Healthsource Saginaw and heart transplant at Stanford Hospital. Both of these seem logical.

Average.Total.Payments is an integer that spans between $2320 and $434400. The median total payment per discharge was $9180, while the mean total payment per discharge is $13170. The minimum payment was again for alcohol or drug abuse inpatient treatment but at Proctor Hospital in Peoria, IL and the maximum payment was for Cranial & Nerve Disorders with Major Complications at Memorial Hospital in Jacksonville, FL.

Average.Medicare.Payments is an integer that spans between $1330 and $403500. The median total payment per discharge was $7715 and the mean was $11220. This makes sense as it’s less than the Average.Total.Payments, which includes more additional costs for the same discharges. The minimum payment was for Cranial & Nerve Disorders without Major Complications at Thibodaux Regional Medical Center in LA. The maximum payment per discharge was for Heart Transplant with Major Complications at Stanford Hospital in CA, again.

summary(df$Total.Discharges)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   14.00   22.00   36.01   39.00 3855.00
df %>% filter(Total.Discharges == 3855)
## # A tibble: 1 x 12
##                                                             DRG.Definition
##                                                                     <fctr>
## 1 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC
## # ... with 11 more variables: Provider.Id <int>, Provider.Name <fctr>,
## #   Provider.Street.Address <fctr>, Provider.City <fctr>,
## #   Provider.State <fctr>, Provider.Zip.Code <int>,
## #   Hospital.Referral.Region..HRR..Description <fctr>,
## #   Total.Discharges <int>, Average.Covered.Charges <dbl>,
## #   Average.Total.Payments <dbl>, Average.Medicare.Payments <dbl>
boxplot(df$Total.Discharges) #a little strange - worth exploring the highpoint

summary(df$Average.Covered.Charges)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1368   20950   34630   52590   60760 2239000
df %>% filter(min(Average.Covered.Charges) == Average.Covered.Charges)
## # A tibble: 1 x 12
##                                     DRG.Definition Provider.Id
##                                             <fctr>       <int>
## 1 894 - ALCOHOL/DRUG ABUSE OR DEPENDENCE, LEFT AMA      230275
## # ... with 10 more variables: Provider.Name <fctr>,
## #   Provider.Street.Address <fctr>, Provider.City <fctr>,
## #   Provider.State <fctr>, Provider.Zip.Code <int>,
## #   Hospital.Referral.Region..HRR..Description <fctr>,
## #   Total.Discharges <int>, Average.Covered.Charges <dbl>,
## #   Average.Total.Payments <dbl>, Average.Medicare.Payments <dbl>
df %>% filter(max(Average.Covered.Charges) == Average.Covered.Charges)
## # A tibble: 1 x 12
##                                                   DRG.Definition
##                                                           <fctr>
## 1 001 - HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC
## # ... with 11 more variables: Provider.Id <int>, Provider.Name <fctr>,
## #   Provider.Street.Address <fctr>, Provider.City <fctr>,
## #   Provider.State <fctr>, Provider.Zip.Code <int>,
## #   Hospital.Referral.Region..HRR..Description <fctr>,
## #   Total.Discharges <int>, Average.Covered.Charges <dbl>,
## #   Average.Total.Payments <dbl>, Average.Medicare.Payments <dbl>
summary(df$Average.Total.Payments)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2320    6259    9180   13170   14630  434400
df %>% filter(min(Average.Total.Payments) == Average.Total.Payments)
## # A tibble: 1 x 12
##                                     DRG.Definition Provider.Id
##                                             <fctr>       <int>
## 1 894 - ALCOHOL/DRUG ABUSE OR DEPENDENCE, LEFT AMA      140013
## # ... with 10 more variables: Provider.Name <fctr>,
## #   Provider.Street.Address <fctr>, Provider.City <fctr>,
## #   Provider.State <fctr>, Provider.Zip.Code <int>,
## #   Hospital.Referral.Region..HRR..Description <fctr>,
## #   Total.Discharges <int>, Average.Covered.Charges <dbl>,
## #   Average.Total.Payments <dbl>, Average.Medicare.Payments <dbl>
df %>% filter(max(Average.Total.Payments) == Average.Total.Payments)
## # A tibble: 1 x 12
##                                     DRG.Definition Provider.Id
##                                             <fctr>       <int>
## 1 073 - CRANIAL & PERIPHERAL NERVE DISORDERS W MCC      100179
## # ... with 10 more variables: Provider.Name <fctr>,
## #   Provider.Street.Address <fctr>, Provider.City <fctr>,
## #   Provider.State <fctr>, Provider.Zip.Code <int>,
## #   Hospital.Referral.Region..HRR..Description <fctr>,
## #   Total.Discharges <int>, Average.Covered.Charges <dbl>,
## #   Average.Total.Payments <dbl>, Average.Medicare.Payments <dbl>
summary(df$Average.Medicare.Payments)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1330    5012    7715   11220   12470  403500
test <- df %>% filter(min(Average.Medicare.Payments) == Average.Medicare.Payments)
df %>% filter(max(Average.Medicare.Payments) == Average.Medicare.Payments)
## # A tibble: 1 x 12
##                                                   DRG.Definition
##                                                           <fctr>
## 1 001 - HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC
## # ... with 11 more variables: Provider.Id <int>, Provider.Name <fctr>,
## #   Provider.Street.Address <fctr>, Provider.City <fctr>,
## #   Provider.State <fctr>, Provider.Zip.Code <int>,
## #   Hospital.Referral.Region..HRR..Description <fctr>,
## #   Total.Discharges <int>, Average.Covered.Charges <dbl>,
## #   Average.Total.Payments <dbl>, Average.Medicare.Payments <dbl>

PLACEHOLDER: Here we will add one more column for the MDC for each DRG by joining two tables.

Planned Analysis

National facts: 1.) We will look at top 10 charged per discharge overall (table/graph) Provider level DRG level By state By HRR 2.) We will then look at top 10 amount charged per overall (table/graph) by Provider level DRG level By state By HRR 3.) We will look at top 10/lowest 10 greatest difference between amount charged and total payment (table/graph) by DRG and by Provider 4.) We will look at the charge distributions by MDC (which organ system generates the highest charges)

MDC specific facts: The user will select an MDC and the above analysis will be done on the MDC level (except 4)

Additionally - DRG with highest complications overall, and provider with highest MCC