Final Project - DRG Payment Analysis

library(RCurl)
## Loading required package: bitops
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: DBI
library(ggplot2)


x <- getURL("https://raw.githubusercontent.com/mfarris9505/DRG-Dataset-/master/DRG%20Payments%202011.csv")

DRG_Payments<- read.csv(text =x) 

#Removing Excess Columns Simplify 
DRG_Payments$Provider.Street.Address <-NULL
DRG_Payments$Provider.Zip.Code <-NULL
DRG_Payments$Provider.State <-NULL

#Renaming Columns
columns <- c("DRG", "ID", "Provider", "City", "Region", "DC","Charges","Payments", "Medicare")
names(DRG_Payments) <- columns

#Reclassifying Dollar Values as Numeric
DRG_Payments$Payments <- as.numeric(as.character(sub("\\$","",DRG_Payments$Payments)))
DRG_Payments$Charges <- as.numeric(as.character(sub("\\$","",DRG_Payments$Charges)))
DRG_Payments$Medicare <- as.numeric(as.character(sub("\\$","",DRG_Payments$Medicare)))

As you can see this is the data I used previously from the R assignment. However, When was retrieving this data, I wanted to create a comparsion from DRG payments to average income. For my own personal view, this has been an area of interest for me. I work in hospital finance, and have an interest in payments an rates based on the population I work with. As an employee of a city hospital, I understand that the majority of our patients come from and underserved population, and are generally in the lower income bracket. So, I used SQL to run the following queries to determine which city had the highest charges and payments, and I focused singularly on one DRG - Chest Pain to find which cities had the highest and lowest payments.

Query_Charges <- "select City, AVG(Charges),AVG(Payments)
                  from DRG_Payments
                  WHERE DRG like '313%'
                  Group BY City
                  Order By AVG(Charges)"
City <- sqldf(Query_Charges)
## Loading required package: tcltk
City
##               City AVG(Charges) AVG(Payments)
## 1          JAMAICA     7868.785      6766.540
## 2         ELMHURST     7953.170      7161.840
## 3     NEW ROCHELLE     8287.290      5340.790
## 4          YONKERS     9323.205      5046.215
## 5      MOUNT KISCO     9498.600      3820.880
## 6      EAST MEADOW     9662.870      6559.730
## 7      SOUTHAMPTON    10348.010      4298.880
## 8            BRONX    10784.743      7127.389
## 9       BRONXVILLE    10866.420      3757.130
## 10       GLEN COVE    11483.500      3947.110
## 11       PLAINVIEW    11657.870      4097.100
## 12       OCEANSIDE    12195.430      3981.390
## 13        FLUSHING    12291.930      5798.855
## 14    FAR ROCKAWAY    12356.695      6370.535
## 15        BROOKLYN    12487.284      6647.715
## 16    WHITE PLAINS    12742.700      3785.880
## 17        NEW YORK    12833.925      6554.029
## 18   STATEN ISLAND    12865.140      5337.230
## 19       GREENPORT    12997.780      4422.000
## 20       RIVERHEAD    12998.760      4058.340
## 21    FOREST HILLS    13350.930      4901.720
## 22        BETHPAGE    13665.310      3713.080
## 23      HUNTINGTON    14056.650      3733.530
## 24    MOUNT VERNON    14548.560      5514.760
## 25     STONY BROOK    14550.390      5658.720
## 26 CORTLANDT MANOR    14921.610      3917.160
## 27      LONG BEACH    14931.170      4587.760
## 28          ROSLYN    16453.660      4143.540
## 29   VALLEY STREAM    16456.880      3894.660
## 30   SLEEPY HOLLOW    17058.920      4639.470
## 31       BAY SHORE    17873.400      4480.130
## 32        NEWBURGH    17968.280      3940.030
## 33 ROCKVILLE CENTR    18141.430      4061.370
## 34  PORT JEFFERSON    18398.895      3913.060
## 35       MANHASSET    19411.000      5217.380
## 36      MIDDLETOWN    19418.890      3929.060
## 37      WEST ISLIP    19498.220      4147.000
## 38       SMITHTOWN    22401.330      3697.240
## 39   NEW HYDE PARK    22448.770      6203.000
## 40        VALHALLA    23048.230      6228.240
## 41       PATCHOGUE    23481.430      4127.950
## 42         MINEOLA    25022.290      5072.330

What I found quite surprising about this data was that despite the vastly different charges from one city to the next, the payments remained relatively the same. I repeated the data for all the top DRG to see if the results would much different.

Query_Charges <- "select City, AVG(Charges),AVG(Payments)
                  from DRG_Payments
                  Group BY City
                  Order By AVG(Charges)"
City <- sqldf(Query_Charges)
City
##               City AVG(Charges) AVG(Payments)
## 1          JAMAICA     18281.03     15580.804
## 2     MOUNT VERNON     19403.99     10802.598
## 3         ELMHURST     20573.45     15782.273
## 4      MOUNT KISCO     23838.91      8958.089
## 5      EAST MEADOW     24818.40     15627.257
## 6          YONKERS     25681.68     13050.371
## 7     NEW ROCHELLE     25726.12     13758.726
## 8        GREENPORT     27999.30      7295.806
## 9       LONG BEACH     28796.80      8879.311
## 10     SOUTHAMPTON     28811.65      8687.643
## 11    FAR ROCKAWAY     28930.31     15242.021
## 12           BRONX     29942.89     17025.578
## 13       GLEN COVE     30753.17      9546.289
## 14        BROOKLYN     30998.76     16665.233
## 15        FLUSHING     31077.35     14890.020
## 16      BRONXVILLE     31338.64      9375.367
## 17    WHITE PLAINS     31713.79     10197.311
## 18        BETHPAGE     33300.56      9148.238
## 19    FOREST HILLS     33335.35     12568.871
## 20   VALLEY STREAM     34861.67      9207.622
## 21       BAY SHORE     35456.88     11079.026
## 22   STATEN ISLAND     36516.86     13335.398
## 23   SLEEPY HOLLOW     37671.87     10505.240
## 24      HUNTINGTON     38078.08     10266.602
## 25        NEWBURGH     39933.58     10886.627
## 26 ROCKVILLE CENTR     39962.56     10635.395
## 27       PLAINVIEW     40134.75     11862.878
## 28       RIVERHEAD     41095.88     10042.643
## 29       OCEANSIDE     41461.97     10929.350
## 30  PORT JEFFERSON     42242.16      9652.358
## 31 CORTLANDT MANOR     42594.33     10192.186
## 32        NEW YORK     45454.40     16388.576
## 33      MIDDLETOWN     46043.21     10340.652
## 34          ROSLYN     49015.27     10995.534
## 35     STONY BROOK     49272.02     16478.854
## 36       SMITHTOWN     51198.23      9602.994
## 37   NEW HYDE PARK     51790.80     16623.213
## 38      WEST ISLIP     53466.39     11719.705
## 39       MANHASSET     54888.50     14493.102
## 40         MINEOLA     56302.44     13478.650
## 41       PATCHOGUE     58229.75     10359.315
## 42        VALHALLA     92312.90     22982.273

I found that most of the cities remained within the same position (based on charges), so for further analysis I am going to take the top 5 and lowest 5 cities for further analysis.

Top Five Bottom Five
West Islip East Meadows
Manhasset Mount Kisco
Mineola Elmhurst
Patchogue Mount vernon
Valhalla Jamaica

An interesting side note: My hospital is in Jamaica. Apparently we are not charging enough.

For these 10 cities, I found the Average Income for the Year of 2011 along with the relative poverty rates-from wikipedia. I went ahead and created these tables in SQL amd exported the data to a csv file. The code is shown here

*Note: I could have created these tables in SQL and imported them here, to verify my skills with creating tables. However, I felt that to be much too tedious, and this was a simplier solution. However, I did verify my skills with joins later

Town <-c('WEST ISLIP','MANHASSET','MINEOLA','PATCHOGUE','VALHALLA','EAST MEADOW','MOUNT KISCO', 'ELMHURST','MOUNT VERNON','JAMAICA')
Median_income <- c( 103789,105938,60706,47027,76003,67185,62699,40394,47128,47944)
Poverty_Rate <- c(.034,.057,.042,.107,.017,.018,.105,.205,.139,.072)
Population <- c(28335,8080,18799,11798,3162,38132,10877,88427,67292,76579)
income <-data.frame(Town,Median_income,Poverty_Rate,Population)
income
##            Town Median_income Poverty_Rate Population
## 1    WEST ISLIP        103789        0.034      28335
## 2     MANHASSET        105938        0.057       8080
## 3       MINEOLA         60706        0.042      18799
## 4     PATCHOGUE         47027        0.107      11798
## 5      VALHALLA         76003        0.017       3162
## 6   EAST MEADOW         67185        0.018      38132
## 7   MOUNT KISCO         62699        0.105      10877
## 8      ELMHURST         40394        0.205      88427
## 9  MOUNT VERNON         47128        0.139      67292
## 10      JAMAICA         47944        0.072      76579

What is very telling from the data presented already is that bottom five tend to have lower incomes than the top five, with two cities in the Top 5 breaking the 6 figure mark. From what I see here, it would be very interesting to create a much more elaborate dataset, however, that seems a bit out of scope, and maybe for another day.

It was now time to combine the datasets. To do this, I created a left join from the small dataset to the larger one in order to sparse out the data I was not focusing on.

join_query <-"SELECT I.*, P.*
            FROM income I
            Left JOIN DRG_Payments P
            WHERE City = Town"
Combo <- sqldf(join_query)

I plotted the Payment VS Charges to show the comparsion.

ggplot(Combo, aes(x=Charges, y=Payments)) + geom_point(aes(colour=Combo$Town))

As you can see, Valhalla has several significantly high charged and high paying DRGs. This is likely due to some specialization, which is both billed and paid at a higher rate (Neurosurgery/Cancer Specialization etc.). However, from the basic trending data, it can been seen that most of the Payments seem to be relatively the same. Plotting the Histogram shows that all the majority of the payments fall between 5 and 15 thousand dollars.

hist(Combo$Payments)

So, for the Data Science Math Portion of this Final I decied to try and create some fictionalized Probabilites. For this I took the existing Data set, and used the Number of DC for a given DRG to find some probabilities. For instance, I took the total number of DC from Jamaica Hospitals, and I assumed this was the number of total hospital visit. I divide the number of Chest Pain DRGs by that total number to get a percentage probabilty, and repeated that for each city. The data looks like this:

drg_query <- "SELECT Town, SUM(DC) AS CountCH
              From Combo
              WHERE DRG LIKE '313%'
              GROUP BY Town
              ORDER BY Town"
Chest_Pain<- sqldf(drg_query)

sum_query <- "SELECT Town, SUM(DC) as CountTotal, AVG(Poverty_Rate)
              From Combo
              GROUP BY Town
              ORDER BY Town"

sumtotal <- sqldf(sum_query)

join_query2 <- "SELECT c.*, ch.*
                From sumtotal c
                Join Chest_Pain Ch
                Where c.Town = ch.Town"
ChestPainData <- sqldf(join_query2)

ChestPainData$Percent = ChestPainData$CountCH/ChestPainData$CountTotal
ChestPainData
##            Town CountTotal AVG(Poverty_Rate)         Town CountCH
## 1   EAST MEADOW       2271             0.018  EAST MEADOW     126
## 2      ELMHURST       1945             0.205     ELMHURST     119
## 3       JAMAICA       3733             0.072      JAMAICA     177
## 4     MANHASSET      11630             0.057    MANHASSET     420
## 5       MINEOLA       7052             0.042      MINEOLA     302
## 6   MOUNT KISCO       2268             0.105  MOUNT KISCO     137
## 7  MOUNT VERNON        548             0.139 MOUNT VERNON      25
## 8     PATCHOGUE       5116             0.107    PATCHOGUE     222
## 9      VALHALLA       2001             0.017     VALHALLA      69
## 10   WEST ISLIP       6315             0.034   WEST ISLIP     215
##       Percent
## 1  0.05548217
## 2  0.06118252
## 3  0.04741495
## 4  0.03611350
## 5  0.04282473
## 6  0.06040564
## 7  0.04562044
## 8  0.04339328
## 9  0.03448276
## 10 0.03404592

From this data we could find the a pretend conditional probabilty. For instance, what is the probabilty that a person in the hospital is both below the poverty level and in the hospital for chest pain. We would have to assume that these are independent events.