The data set used for this particular project came from Data.gov. It is the Center for Medicare and Medicaid Services official data regarding Inpatient Services. The Data reflects the top 100 DRG codes. Diagnosis-Related Group are a set of codes that group diagnosis in order to facilitate payments to hospitals. These codes make it easier for hospitals to bill for services they provide. Some of the top DRGS include: Chest Pain, Pneumonia and COPD. The Below data only represents the NY Metropolitan Area and Surrounding Suburbs.The below code shows how this data is imported:
library(RCurl)
## Loading required package: bitops
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: DBI
library(ggplot2)
x <- getURL("https://raw.githubusercontent.com/mfarris9505/DRG-Dataset-/master/DRG%20Payments%202011.csv")
DRG_Payments<- read.csv(text =x)
#Removing Excess Columns Simplify
DRG_Payments$Provider.Street.Address <-NULL
DRG_Payments$Provider.Zip.Code <-NULL
DRG_Payments$Provider.State <-NULL
#Renaming Columns
columns <- c("DRG", "ID", "Provider", "City", "Region", "DC","Charges","Payments", "Medicare")
names(DRG_Payments) <- columns
#Reclassifying Dollar Values as Numeric
DRG_Payments$Payments <- as.numeric(as.character(sub("\\$","",DRG_Payments$Payments)))
DRG_Payments$Charges <- as.numeric(as.character(sub("\\$","",DRG_Payments$Charges)))
DRG_Payments$Medicare <- as.numeric(as.character(sub("\\$","",DRG_Payments$Medicare)))
As you can see this is the data I used previously from the R assignment. However, When was retrieving this data, I wanted to create a comparsion from DRG payments to average income. For my own personal view, this has been an area of interest for me. I work in hospital finance, and have an interest in payments and rates based on the population I work with. As an employee of a city hospital, I understand that the majority of our patients come from and underserved population, and are generally in the lower income bracket. So, I used SQL to run the following queries to determine which city had the highest charges and payments, and I focused singularly on one DRG - Chest Pain to find which cities had the highest and lowest payments.
Query_Charges <- "SELECT City, AVG(Charges), AVG(Payments)
FROM DRG_Payments
WHERE DRG like '313%'
GROUP BY City
ORDER BY AVG(Charges)"
City <- sqldf(Query_Charges)
## Loading required package: tcltk
City
## City AVG(Charges) AVG(Payments)
## 1 JAMAICA 7868.785 6766.540
## 2 ELMHURST 7953.170 7161.840
## 3 NEW ROCHELLE 8287.290 5340.790
## 4 YONKERS 9323.205 5046.215
## 5 MOUNT KISCO 9498.600 3820.880
## 6 EAST MEADOW 9662.870 6559.730
## 7 SOUTHAMPTON 10348.010 4298.880
## 8 BRONX 10784.743 7127.389
## 9 BRONXVILLE 10866.420 3757.130
## 10 GLEN COVE 11483.500 3947.110
## 11 PLAINVIEW 11657.870 4097.100
## 12 OCEANSIDE 12195.430 3981.390
## 13 FLUSHING 12291.930 5798.855
## 14 FAR ROCKAWAY 12356.695 6370.535
## 15 BROOKLYN 12487.284 6647.715
## 16 WHITE PLAINS 12742.700 3785.880
## 17 NEW YORK 12833.925 6554.029
## 18 STATEN ISLAND 12865.140 5337.230
## 19 GREENPORT 12997.780 4422.000
## 20 RIVERHEAD 12998.760 4058.340
## 21 FOREST HILLS 13350.930 4901.720
## 22 BETHPAGE 13665.310 3713.080
## 23 HUNTINGTON 14056.650 3733.530
## 24 MOUNT VERNON 14548.560 5514.760
## 25 STONY BROOK 14550.390 5658.720
## 26 CORTLANDT MANOR 14921.610 3917.160
## 27 LONG BEACH 14931.170 4587.760
## 28 ROSLYN 16453.660 4143.540
## 29 VALLEY STREAM 16456.880 3894.660
## 30 SLEEPY HOLLOW 17058.920 4639.470
## 31 BAY SHORE 17873.400 4480.130
## 32 NEWBURGH 17968.280 3940.030
## 33 ROCKVILLE CENTR 18141.430 4061.370
## 34 PORT JEFFERSON 18398.895 3913.060
## 35 MANHASSET 19411.000 5217.380
## 36 MIDDLETOWN 19418.890 3929.060
## 37 WEST ISLIP 19498.220 4147.000
## 38 SMITHTOWN 22401.330 3697.240
## 39 NEW HYDE PARK 22448.770 6203.000
## 40 VALHALLA 23048.230 6228.240
## 41 PATCHOGUE 23481.430 4127.950
## 42 MINEOLA 25022.290 5072.330
summary(City)
## City AVG(Charges) AVG(Payments)
## BAY SHORE : 1 Min. : 7869 Min. :3697
## BETHPAGE : 1 1st Qu.:11792 1st Qu.:3942
## BRONX : 1 Median :13508 Median :4451
## BRONXVILLE : 1 Mean :14729 Mean :4871
## BROOKLYN : 1 3rd Qu.:17945 3rd Qu.:5623
## CORTLANDT MANOR: 1 Max. :25022 Max. :7162
## (Other) :36
What I found quite surprising about this data was that despite the vastly different charges from one city to the next, the payments remained relatively the same.
I repeated the data for all the top DRG to see if the results would much different.
Query_Charges <- "SELECT City, AVG(Charges),AVG(Payments)
FROM DRG_Payments
GROUP BY City
ORDER BY AVG(Charges)"
City <- sqldf(Query_Charges)
City
## City AVG(Charges) AVG(Payments)
## 1 JAMAICA 18281.03 15580.804
## 2 MOUNT VERNON 19403.99 10802.598
## 3 ELMHURST 20573.45 15782.273
## 4 MOUNT KISCO 23838.91 8958.089
## 5 EAST MEADOW 24818.40 15627.257
## 6 YONKERS 25681.68 13050.371
## 7 NEW ROCHELLE 25726.12 13758.726
## 8 GREENPORT 27999.30 7295.806
## 9 LONG BEACH 28796.80 8879.311
## 10 SOUTHAMPTON 28811.65 8687.643
## 11 FAR ROCKAWAY 28930.31 15242.021
## 12 BRONX 29942.89 17025.578
## 13 GLEN COVE 30753.17 9546.289
## 14 BROOKLYN 30998.76 16665.233
## 15 FLUSHING 31077.35 14890.020
## 16 BRONXVILLE 31338.64 9375.367
## 17 WHITE PLAINS 31713.79 10197.311
## 18 BETHPAGE 33300.56 9148.238
## 19 FOREST HILLS 33335.35 12568.871
## 20 VALLEY STREAM 34861.67 9207.622
## 21 BAY SHORE 35456.88 11079.026
## 22 STATEN ISLAND 36516.86 13335.398
## 23 SLEEPY HOLLOW 37671.87 10505.240
## 24 HUNTINGTON 38078.08 10266.602
## 25 NEWBURGH 39933.58 10886.627
## 26 ROCKVILLE CENTR 39962.56 10635.395
## 27 PLAINVIEW 40134.75 11862.878
## 28 RIVERHEAD 41095.88 10042.643
## 29 OCEANSIDE 41461.97 10929.350
## 30 PORT JEFFERSON 42242.16 9652.358
## 31 CORTLANDT MANOR 42594.33 10192.186
## 32 NEW YORK 45454.40 16388.576
## 33 MIDDLETOWN 46043.21 10340.652
## 34 ROSLYN 49015.27 10995.534
## 35 STONY BROOK 49272.02 16478.854
## 36 SMITHTOWN 51198.23 9602.994
## 37 NEW HYDE PARK 51790.80 16623.213
## 38 WEST ISLIP 53466.39 11719.705
## 39 MANHASSET 54888.50 14493.102
## 40 MINEOLA 56302.44 13478.650
## 41 PATCHOGUE 58229.75 10359.315
## 42 VALHALLA 92312.90 22982.273
summary(City)
## City AVG(Charges) AVG(Payments)
## BAY SHORE : 1 Min. :18281 Min. : 7296
## BETHPAGE : 1 1st Qu.:29183 1st Qu.:10080
## BRONX : 1 Median :35987 Median :10962
## BRONXVILLE : 1 Mean :38174 Mean :12265
## BROOKLYN : 1 3rd Qu.:44739 3rd Qu.:14791
## CORTLANDT MANOR: 1 Max. :92313 Max. :22982
## (Other) :36
I found that most of the cities remained within the same position (based on charges), so for further analysis I am going to take the top 5 and lowest 5 cities for further analysis.
| Top Five | Bottom Five |
|---|---|
| West Islip | East Meadows |
| Manhasset | Mount Kisco |
| Mineola | Elmhurst |
| Patchogue | Mount vernon |
| Valhalla | Jamaica |
An interesting side note: My hospital is in Jamaica. Apparently we are not charging enough.
For these 10 cities, I found the Average Income for the Year of 2011 along with the relative poverty rates-from wikipedia. I went ahead and created these tables in SQL and exported the data to a csv file. The code is shown here
*Note: I could have created these tables in SQL and imported them here, to verify my skills with creating tables. However, I felt that to be much too tedious, and this was a simplier solution. However, I did verify my skills with joins later
Town <-c('WEST ISLIP','MANHASSET','MINEOLA','PATCHOGUE','VALHALLA','EAST MEADOW','MOUNT KISCO', 'ELMHURST','MOUNT VERNON','JAMAICA')
Median_income <- c( 103789,105938,60706,47027,76003,67185,62699,40394,47128,47944)
Poverty_Rate <- c(.034,.057,.042,.107,.017,.018,.105,.205,.139,.072)
Population <- c(28335,8080,18799,11798,3162,38132,10877,88427,67292,76579)
income <-data.frame(Town,Median_income,Poverty_Rate,Population)
income
## Town Median_income Poverty_Rate Population
## 1 WEST ISLIP 103789 0.034 28335
## 2 MANHASSET 105938 0.057 8080
## 3 MINEOLA 60706 0.042 18799
## 4 PATCHOGUE 47027 0.107 11798
## 5 VALHALLA 76003 0.017 3162
## 6 EAST MEADOW 67185 0.018 38132
## 7 MOUNT KISCO 62699 0.105 10877
## 8 ELMHURST 40394 0.205 88427
## 9 MOUNT VERNON 47128 0.139 67292
## 10 JAMAICA 47944 0.072 76579
What is very telling from the data presented already is that bottom five tend to have lower incomes than the top five, with two cities in the Top 5 breaking the 6 figure mark. From what I see here, it would be very interesting to create a much more elaborate dataset, however, that seems a bit out of scope, and maybe for another day.
It was now time to combine the datasets. To do this, I created a left join from the small dataset to the larger one in order to sparse out the data I was not focusing on.
join_query <-"SELECT I.*, P.*
FROM income I
Left JOIN DRG_Payments P
WHERE City = Town"
Combo <- sqldf(join_query)
I plotted the Payment VS Charges to show the comparsion.
ggplot(Combo, aes(x=Charges, y=Payments)) + geom_point(aes(colour=Combo$Town))
As you can see, Valhalla has several significantly high charged and high paying DRGs. This is likely due to some specialization, which is both billed and paid at a higher rate (Neurosurgery/Cancer Specialization etc.). However, from the basic trending data, it can been seen that most of the Payments seem to be relatively the same. Plotting the Histogram shows that all the majority of the payments fall between 5 and 15 thousand dollars.
hist(Combo$Payments)
Next I wanted to create a correlation relation between Median Income and Average Charges. To do this I created a new sql table with
income_query <- "SELECT Town, AVG(Charges) AS Charges
From Combo
GROUP BY Town
ORDER BY Town"
incometotal <- sqldf(income_query)
joinincome_query2 <- "SELECT c.Town, c.Charges, i.Median_Income AS Income
From incometotal c
Join income i
Where c.Town = i.Town"
IncomeData <- sqldf(joinincome_query2)
IncomeData
## Town Charges Income
## 1 EAST MEADOW 24818.40 67185
## 2 ELMHURST 20573.45 40394
## 3 JAMAICA 18281.03 47944
## 4 MANHASSET 54888.50 105938
## 5 MINEOLA 56302.44 60706
## 6 MOUNT KISCO 23838.91 62699
## 7 MOUNT VERNON 19403.99 47128
## 8 PATCHOGUE 58229.75 47027
## 9 VALHALLA 92312.90 76003
## 10 WEST ISLIP 53466.39 103789
I plotted the Median Income vs. Total Charges to see if there is a correlation between the two.
ggplot(IncomeData, aes(x=Charges, y=Income)) + geom_point(aes(colour=IncomeData$Town))
x<-IncomeData$Charges
y<-IncomeData$Income
cor(x,y)
## [1] 0.4876487
So, for the Data Science Math Portion of this Final I decied to try and create some fictionalized Probabilites. For this I took the existing Data set, and used the Number of DC for a given DRG to find some percentages. For instance, I took the total number of DC as the total sample size from all town, and I assumed this was the number of total hospital visit (its not). I divide the number of Chest Pain DRGs by that total number to get a percentage, and repeated that for each city. The data looks like this:
drg_query <- "SELECT Town, SUM(DC) AS CountCH
From Combo
WHERE DRG LIKE '313%'
GROUP BY Town
ORDER BY Town"
Chest_Pain<- sqldf(drg_query)
sum_query <- "SELECT Town, SUM(DC) as CountTotal, AVG(Poverty_Rate) As PovertyRate
From Combo
GROUP BY Town
ORDER BY Town"
sumtotal <- sqldf(sum_query)
join_query2 <- "SELECT c.Town, c.CountTotal, ch.CountCH, c.PovertyRate
From sumtotal c
Join Chest_Pain Ch
Where c.Town = ch.Town"
ChestPainData <- sqldf(join_query2)
ChestPainData$PercentChestPain = ChestPainData$CountCH/ChestPainData$CountTotal
ChestPainData
## Town CountTotal CountCH PovertyRate PercentChestPain
## 1 EAST MEADOW 2271 126 0.018 0.05548217
## 2 ELMHURST 1945 119 0.205 0.06118252
## 3 JAMAICA 3733 177 0.072 0.04741495
## 4 MANHASSET 11630 420 0.057 0.03611350
## 5 MINEOLA 7052 302 0.042 0.04282473
## 6 MOUNT KISCO 2268 137 0.105 0.06040564
## 7 MOUNT VERNON 548 25 0.139 0.04562044
## 8 PATCHOGUE 5116 222 0.107 0.04339328
## 9 VALHALLA 2001 69 0.017 0.03448276
## 10 WEST ISLIP 6315 215 0.034 0.03404592
From the previous source, I combind that data with the poverty rate, and now have two percentages. If we make several far fetched assumptions,we could find the a pretend conditional probabilty. For instance, what is the probabilty that a person in the hospital is both below the poverty level and in the hospital for chest pain. First, we would have to assume that these are independent events. Second, we assume only people from their respective city goes the their respective hospital. So what is the probability of a random patient is both in the hospital for chest pain, and beneath the poverty level. This would be given by multplying the 2 probabilties.
ChestPainData$CondProb = ChestPainData$PovertyRate * ChestPainData$PercentChestPain
ChestPainData
## Town CountTotal CountCH PovertyRate PercentChestPain
## 1 EAST MEADOW 2271 126 0.018 0.05548217
## 2 ELMHURST 1945 119 0.205 0.06118252
## 3 JAMAICA 3733 177 0.072 0.04741495
## 4 MANHASSET 11630 420 0.057 0.03611350
## 5 MINEOLA 7052 302 0.042 0.04282473
## 6 MOUNT KISCO 2268 137 0.105 0.06040564
## 7 MOUNT VERNON 548 25 0.139 0.04562044
## 8 PATCHOGUE 5116 222 0.107 0.04339328
## 9 VALHALLA 2001 69 0.017 0.03448276
## 10 WEST ISLIP 6315 215 0.034 0.03404592
## CondProb
## 1 0.0009986790
## 2 0.0125424165
## 3 0.0034138762
## 4 0.0020584695
## 5 0.0017986387
## 6 0.0063425926
## 7 0.0063412409
## 8 0.0046430805
## 9 0.0005862069
## 10 0.0011575614