Final Project - DRG Payment Analysis

The data set used for this particular project came from Data.gov. It is the Center for Medicare and Medicaid Services official data regarding Inpatient Services. The Data reflects the top 100 DRG codes. Diagnosis-Related Group are a set of codes that group diagnosis in order to facilitate payments to hospitals. These codes make it easier for hospitals to bill for services they provide. Some of the top DRGS include: Chest Pain, Pneumonia and COPD. The Below data only represents the NY Metropolitan Area and Surrounding Suburbs.The below code shows how this data is imported:

library(RCurl)

## Loading required package: bitops

library(sqldf)

## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: DBI

library(ggplot2)


x <- getURL("https://raw.githubusercontent.com/mfarris9505/DRG-Dataset-/master/DRG%20Payments%202011.csv")

DRG_Payments<- read.csv(text =x) 

#Removing Excess Columns Simplify 
DRG_Payments$Provider.Street.Address <-NULL
DRG_Payments$Provider.Zip.Code <-NULL
DRG_Payments$Provider.State <-NULL

#Renaming Columns
columns <- c("DRG", "ID", "Provider", "City", "Region", "DC","Charges","Payments", "Medicare")
names(DRG_Payments) <- columns

#Reclassifying Dollar Values as Numeric
DRG_Payments$Payments <- as.numeric(as.character(sub("\\$","",DRG_Payments$Payments)))
DRG_Payments$Charges <- as.numeric(as.character(sub("\\$","",DRG_Payments$Charges)))
DRG_Payments$Medicare <- as.numeric(as.character(sub("\\$","",DRG_Payments$Medicare)))

As you can see this is the data I used previously from the R assignment. However, When was retrieving this data, I wanted to create a comparsion from DRG payments to average income. For my own personal view, this has been an area of interest for me. I work in hospital finance, and have an interest in payments and rates based on the population I work with. As an employee of a city hospital, I understand that the majority of our patients come from and underserved population, and are generally in the lower income bracket. So, I used SQL to run the following queries to determine which city had the highest charges and payments, and I focused singularly on one DRG - Chest Pain to find which cities had the highest and lowest payments.

Query_Charges <- "SELECT City, AVG(Charges), AVG(Payments)
                  FROM DRG_Payments
                  WHERE DRG like '313%'
                  GROUP BY City
                  ORDER BY AVG(Charges)"
City <- sqldf(Query_Charges)

## Loading required package: tcltk

City

##               City AVG(Charges) AVG(Payments)
## 1          JAMAICA     7868.785      6766.540
## 2         ELMHURST     7953.170      7161.840
## 3     NEW ROCHELLE     8287.290      5340.790
## 4          YONKERS     9323.205      5046.215
## 5      MOUNT KISCO     9498.600      3820.880
## 6      EAST MEADOW     9662.870      6559.730
## 7      SOUTHAMPTON    10348.010      4298.880
## 8            BRONX    10784.743      7127.389
## 9       BRONXVILLE    10866.420      3757.130
## 10       GLEN COVE    11483.500      3947.110
## 11       PLAINVIEW    11657.870      4097.100
## 12       OCEANSIDE    12195.430      3981.390
## 13        FLUSHING    12291.930      5798.855
## 14    FAR ROCKAWAY    12356.695      6370.535
## 15        BROOKLYN    12487.284      6647.715
## 16    WHITE PLAINS    12742.700      3785.880
## 17        NEW YORK    12833.925      6554.029
## 18   STATEN ISLAND    12865.140      5337.230
## 19       GREENPORT    12997.780      4422.000
## 20       RIVERHEAD    12998.760      4058.340
## 21    FOREST HILLS    13350.930      4901.720
## 22        BETHPAGE    13665.310      3713.080
## 23      HUNTINGTON    14056.650      3733.530
## 24    MOUNT VERNON    14548.560      5514.760
## 25     STONY BROOK    14550.390      5658.720
## 26 CORTLANDT MANOR    14921.610      3917.160
## 27      LONG BEACH    14931.170      4587.760
## 28          ROSLYN    16453.660      4143.540
## 29   VALLEY STREAM    16456.880      3894.660
## 30   SLEEPY HOLLOW    17058.920      4639.470
## 31       BAY SHORE    17873.400      4480.130
## 32        NEWBURGH    17968.280      3940.030
## 33 ROCKVILLE CENTR    18141.430      4061.370
## 34  PORT JEFFERSON    18398.895      3913.060
## 35       MANHASSET    19411.000      5217.380
## 36      MIDDLETOWN    19418.890      3929.060
## 37      WEST ISLIP    19498.220      4147.000
## 38       SMITHTOWN    22401.330      3697.240
## 39   NEW HYDE PARK    22448.770      6203.000
## 40        VALHALLA    23048.230      6228.240
## 41       PATCHOGUE    23481.430      4127.950
## 42         MINEOLA    25022.290      5072.330

summary(City)

##               City     AVG(Charges)   AVG(Payments) 
##  BAY SHORE      : 1   Min.   : 7869   Min.   :3697  
##  BETHPAGE       : 1   1st Qu.:11792   1st Qu.:3942  
##  BRONX          : 1   Median :13508   Median :4451  
##  BRONXVILLE     : 1   Mean   :14729   Mean   :4871  
##  BROOKLYN       : 1   3rd Qu.:17945   3rd Qu.:5623  
##  CORTLANDT MANOR: 1   Max.   :25022   Max.   :7162  
##  (Other)        :36

What I found quite surprising about this data was that despite the vastly different charges from one city to the next, the payments remained relatively the same.

I repeated the data for all the top DRG to see if the results would much different.

Query_Charges <- "SELECT City, AVG(Charges),AVG(Payments)
                  FROM DRG_Payments
                  GROUP BY City
                  ORDER BY AVG(Charges)"
City <- sqldf(Query_Charges)
City

##               City AVG(Charges) AVG(Payments)
## 1          JAMAICA     18281.03     15580.804
## 2     MOUNT VERNON     19403.99     10802.598
## 3         ELMHURST     20573.45     15782.273
## 4      MOUNT KISCO     23838.91      8958.089
## 5      EAST MEADOW     24818.40     15627.257
## 6          YONKERS     25681.68     13050.371
## 7     NEW ROCHELLE     25726.12     13758.726
## 8        GREENPORT     27999.30      7295.806
## 9       LONG BEACH     28796.80      8879.311
## 10     SOUTHAMPTON     28811.65      8687.643
## 11    FAR ROCKAWAY     28930.31     15242.021
## 12           BRONX     29942.89     17025.578
## 13       GLEN COVE     30753.17      9546.289
## 14        BROOKLYN     30998.76     16665.233
## 15        FLUSHING     31077.35     14890.020
## 16      BRONXVILLE     31338.64      9375.367
## 17    WHITE PLAINS     31713.79     10197.311
## 18        BETHPAGE     33300.56      9148.238
## 19    FOREST HILLS     33335.35     12568.871
## 20   VALLEY STREAM     34861.67      9207.622
## 21       BAY SHORE     35456.88     11079.026
## 22   STATEN ISLAND     36516.86     13335.398
## 23   SLEEPY HOLLOW     37671.87     10505.240
## 24      HUNTINGTON     38078.08     10266.602
## 25        NEWBURGH     39933.58     10886.627
## 26 ROCKVILLE CENTR     39962.56     10635.395
## 27       PLAINVIEW     40134.75     11862.878
## 28       RIVERHEAD     41095.88     10042.643
## 29       OCEANSIDE     41461.97     10929.350
## 30  PORT JEFFERSON     42242.16      9652.358
## 31 CORTLANDT MANOR     42594.33     10192.186
## 32        NEW YORK     45454.40     16388.576
## 33      MIDDLETOWN     46043.21     10340.652
## 34          ROSLYN     49015.27     10995.534
## 35     STONY BROOK     49272.02     16478.854
## 36       SMITHTOWN     51198.23      9602.994
## 37   NEW HYDE PARK     51790.80     16623.213
## 38      WEST ISLIP     53466.39     11719.705
## 39       MANHASSET     54888.50     14493.102
## 40         MINEOLA     56302.44     13478.650
## 41       PATCHOGUE     58229.75     10359.315
## 42        VALHALLA     92312.90     22982.273

summary(City)

##               City     AVG(Charges)   AVG(Payments)  
##  BAY SHORE      : 1   Min.   :18281   Min.   : 7296  
##  BETHPAGE       : 1   1st Qu.:29183   1st Qu.:10080  
##  BRONX          : 1   Median :35987   Median :10962  
##  BRONXVILLE     : 1   Mean   :38174   Mean   :12265  
##  BROOKLYN       : 1   3rd Qu.:44739   3rd Qu.:14791  
##  CORTLANDT MANOR: 1   Max.   :92313   Max.   :22982  
##  (Other)        :36

I found that most of the cities remained within the same position (based on charges), so for further analysis I am going to take the top 5 and lowest 5 cities for further analysis.

Top Five	Bottom Five
West Islip	East Meadows
Manhasset	Mount Kisco
Mineola	Elmhurst
Patchogue	Mount vernon
Valhalla	Jamaica

An interesting side note: My hospital is in Jamaica. Apparently we are not charging enough.

For these 10 cities, I found the Average Income for the Year of 2011 along with the relative poverty rates-from wikipedia. I went ahead and created these tables in SQL and exported the data to a csv file. The code is shown here

*Note: I could have created these tables in SQL and imported them here, to verify my skills with creating tables. However, I felt that to be much too tedious, and this was a simplier solution. However, I did verify my skills with joins later

Town <-c('WEST ISLIP','MANHASSET','MINEOLA','PATCHOGUE','VALHALLA','EAST MEADOW','MOUNT KISCO', 'ELMHURST','MOUNT VERNON','JAMAICA')
Median_income <- c( 103789,105938,60706,47027,76003,67185,62699,40394,47128,47944)
Poverty_Rate <- c(.034,.057,.042,.107,.017,.018,.105,.205,.139,.072)
Population <- c(28335,8080,18799,11798,3162,38132,10877,88427,67292,76579)
income <-data.frame(Town,Median_income,Poverty_Rate,Population)
income

##            Town Median_income Poverty_Rate Population
## 1    WEST ISLIP        103789        0.034      28335
## 2     MANHASSET        105938        0.057       8080
## 3       MINEOLA         60706        0.042      18799
## 4     PATCHOGUE         47027        0.107      11798
## 5      VALHALLA         76003        0.017       3162
## 6   EAST MEADOW         67185        0.018      38132
## 7   MOUNT KISCO         62699        0.105      10877
## 8      ELMHURST         40394        0.205      88427
## 9  MOUNT VERNON         47128        0.139      67292
## 10      JAMAICA         47944        0.072      76579

What is very telling from the data presented already is that bottom five tend to have lower incomes than the top five, with two cities in the Top 5 breaking the 6 figure mark. From what I see here, it would be very interesting to create a much more elaborate dataset, however, that seems a bit out of scope, and maybe for another day.

It was now time to combine the datasets. To do this, I created a left join from the small dataset to the larger one in order to sparse out the data I was not focusing on.

join_query <-"SELECT I.*, P.*
            FROM income I
            Left JOIN DRG_Payments P
            WHERE City = Town"
Combo <- sqldf(join_query)

I plotted the Payment VS Charges to show the comparsion.

ggplot(Combo, aes(x=Charges, y=Payments)) + geom_point(aes(colour=Combo$Town))

As you can see, Valhalla has several significantly high charged and high paying DRGs. This is likely due to some specialization, which is both billed and paid at a higher rate (Neurosurgery/Cancer Specialization etc.). However, from the basic trending data, it can been seen that most of the Payments seem to be relatively the same. Plotting the Histogram shows that all the majority of the payments fall between 5 and 15 thousand dollars.

hist(Combo$Payments)

Next I wanted to create a correlation relation between Median Income and Average Charges. To do this I created a new sql table with

income_query <- "SELECT Town, AVG(Charges) AS Charges 
              From Combo
              GROUP BY Town
              ORDER BY Town"

incometotal <- sqldf(income_query)

joinincome_query2 <- "SELECT c.Town, c.Charges, i.Median_Income AS Income 
                From incometotal c
                Join income i
                Where c.Town = i.Town"
IncomeData <- sqldf(joinincome_query2)
 
IncomeData

##            Town  Charges Income
## 1   EAST MEADOW 24818.40  67185
## 2      ELMHURST 20573.45  40394
## 3       JAMAICA 18281.03  47944
## 4     MANHASSET 54888.50 105938
## 5       MINEOLA 56302.44  60706
## 6   MOUNT KISCO 23838.91  62699
## 7  MOUNT VERNON 19403.99  47128
## 8     PATCHOGUE 58229.75  47027
## 9      VALHALLA 92312.90  76003
## 10   WEST ISLIP 53466.39 103789

I plotted the Median Income vs. Total Charges to see if there is a correlation between the two.

ggplot(IncomeData, aes(x=Charges, y=Income)) + geom_point(aes(colour=IncomeData$Town))

x<-IncomeData$Charges
y<-IncomeData$Income
cor(x,y)

## [1] 0.4876487

So, for the Data Science Math Portion of this Final I decied to try and create some fictionalized Probabilites. For this I took the existing Data set, and used the Number of DC for a given DRG to find some percentages. For instance, I took the total number of DC as the total sample size from all town, and I assumed this was the number of total hospital visit (its not). I divide the number of Chest Pain DRGs by that total number to get a percentage, and repeated that for each city. The data looks like this:

drg_query <- "SELECT Town, SUM(DC) AS CountCH
              From Combo
              WHERE DRG LIKE '313%'
              GROUP BY Town
              ORDER BY Town"
Chest_Pain<- sqldf(drg_query)

sum_query <- "SELECT Town, SUM(DC) as CountTotal, AVG(Poverty_Rate) As PovertyRate
              From Combo
              GROUP BY Town
              ORDER BY Town"

sumtotal <- sqldf(sum_query)

join_query2 <- "SELECT c.Town, c.CountTotal, ch.CountCH, c.PovertyRate
                From sumtotal c
                Join Chest_Pain Ch
                Where c.Town = ch.Town"
ChestPainData <- sqldf(join_query2)
 
ChestPainData$PercentChestPain = ChestPainData$CountCH/ChestPainData$CountTotal
ChestPainData

##            Town CountTotal CountCH PovertyRate PercentChestPain
## 1   EAST MEADOW       2271     126       0.018       0.05548217
## 2      ELMHURST       1945     119       0.205       0.06118252
## 3       JAMAICA       3733     177       0.072       0.04741495
## 4     MANHASSET      11630     420       0.057       0.03611350
## 5       MINEOLA       7052     302       0.042       0.04282473
## 6   MOUNT KISCO       2268     137       0.105       0.06040564
## 7  MOUNT VERNON        548      25       0.139       0.04562044
## 8     PATCHOGUE       5116     222       0.107       0.04339328
## 9      VALHALLA       2001      69       0.017       0.03448276
## 10   WEST ISLIP       6315     215       0.034       0.03404592

From the previous source, I combind that data with the poverty rate, and now have two percentages. If we make several far fetched assumptions,we could find the a pretend conditional probabilty. For instance, what is the probabilty that a person in the hospital is both below the poverty level and in the hospital for chest pain. First, we would have to assume that these are independent events. Second, we assume only people from their respective city goes the their respective hospital. So what is the probability of a random patient is both in the hospital for chest pain, and beneath the poverty level. This would be given by multplying the 2 probabilties.

ChestPainData$CondProb = ChestPainData$PovertyRate * ChestPainData$PercentChestPain

ChestPainData

##            Town CountTotal CountCH PovertyRate PercentChestPain
## 1   EAST MEADOW       2271     126       0.018       0.05548217
## 2      ELMHURST       1945     119       0.205       0.06118252
## 3       JAMAICA       3733     177       0.072       0.04741495
## 4     MANHASSET      11630     420       0.057       0.03611350
## 5       MINEOLA       7052     302       0.042       0.04282473
## 6   MOUNT KISCO       2268     137       0.105       0.06040564
## 7  MOUNT VERNON        548      25       0.139       0.04562044
## 8     PATCHOGUE       5116     222       0.107       0.04339328
## 9      VALHALLA       2001      69       0.017       0.03448276
## 10   WEST ISLIP       6315     215       0.034       0.03404592
##        CondProb
## 1  0.0009986790
## 2  0.0125424165
## 3  0.0034138762
## 4  0.0020584695
## 5  0.0017986387
## 6  0.0063425926
## 7  0.0063412409
## 8  0.0046430805
## 9  0.0005862069
## 10 0.0011575614

Final Project

Matthew Farris

July 26, 2015

Final Project - DRG Payment Analysis