Part 1 - Introduction
The Affordable Care Act (ACA), signed into law in 2010, along with other recent changes affecting healthcare reimbursement models, has caused significant shifts in incentives in the operations of large hospitals and health care systems. Inpatient care has been particularly impacted, as hospitals seek to avoid financial penalties related to readmissions and reduce the lengths of inpatient stays. Moreover, certain categories of surgeries have shifted from requiring multi-day hospital days to being done in a few hours as outpatient procedures at ambulatory surgery centers.
For this project, we will examine annual operational statistics related to California hospitals, including numbers of inpatient beds and inpatient patient census days and surgical procedure volume, to analyze trends from 2013 to 2016. In particular, we will focus on changes in inpatient and outpatient surgical volumes.
HealthData.gov makes a number of informative and robust data sets available for public download. Our dataset of interest includes a large number of statistics about hospital facilities provided by the State of California for its Office of Statewide Health Planning and Development (OSHPD) program.
Part 2 - Data
- Load libraries
library(knitr)
library(kableExtra)
library(prettydoc)
library(RCurl)
library(dplyr)
library(ggplot2)
library(tidyr)
library(tidyverse)
library(sqldf)
- Load the data as a dataframe
Note: The file as originally downloaded from HealthData.gov exceeds 85 MB. To fit the requirements of being under 25 MB for posting on Github, columns with extraneous information such as the contact information for the person who compiled data at each facility was removed. Additionally, measures and metrics that are not of interest for this analysis were removed. The full dataset can be accessed here.
#load the csv into R
df_hosputil <- read.csv("https://raw.githubusercontent.com/littlejohnjeff/Data_606/master/2012-2016-hosputilmr_edited.csv")
#Display top rows of dataframe
head(df_hosputil)
## Year OSHPD_ID FAC_NAME FAC_ADDRESS_ONE FAC_ADDRESS_TWO FAC_CITY
## 1 2012 106010735 ALAMEDA HOSPITAL 2070 CLINTON ALAMEDA
## 2 2012 106010735 ALAMEDA HOSPITAL 2070 CLINTON ALAMEDA
## 3 2012 106010735 ALAMEDA HOSPITAL 2070 CLINTON ALAMEDA
## 4 2012 106010735 ALAMEDA HOSPITAL 2070 CLINTON ALAMEDA
## 5 2012 106010735 ALAMEDA HOSPITAL 2070 CLINTON ALAMEDA
## 6 2012 106010735 ALAMEDA HOSPITAL 2070 CLINTON ALAMEDA
## FAC_ZIPCODE FAC_OPER_CURRYR BEG_DATE END_DATE PARENT_NAME PARENT_ZIP_9
## 1 94501 Yes 1/1/2012 12/31/2012
## 2 94501 Yes 1/1/2012 12/31/2012
## 3 94501 Yes 1/1/2012 12/31/2012
## 4 94501 Yes 1/1/2012 12/31/2012
## 5 94501 Yes 1/1/2012 12/31/2012
## 6 94501 Yes 1/1/2012 12/31/2012
## LIC_STATUS LIC_STATUS_DATE TRAUMA_CTR TEACH_HOSP ACLAIMS_NO
## 1 Open 1/1/1946 NO 140000011
## 2 Open 1/1/1946 NO 140000011
## 3 Open 1/1/1946 NO 140000011
## 4 Open 1/1/1946 NO 140000011
## 5 Open 1/1/1946 NO 140000011
## 6 Open 1/1/1946 NO 140000011
## ASSEMBLY_DIST SENATE_DIST CONGRESS_DIST CENS_TRACT HEALTH_SVC_AREA
## 1 16 9 13 6001428500 05 - East Bay
## 2 16 9 13 6001428500 05 - East Bay
## 3 16 9 13 6001428500 05 - East Bay
## 4 16 9 13 6001428500 05 - East Bay
## 5 16 9 13 6001428500 05 - East Bay
## 6 16 9 13 6001428500 05 - East Bay
## COUNTY LICENSE_NO FACILITY_LEVEL LONGITUDE LATITUDE TYPE_LIC
## 1 Alameda 140000002 Parent Facility -122.2536 37.76295 General Acute Care
## 2 Alameda 140000002 Parent Facility -122.2536 37.76295 General Acute Care
## 3 Alameda 140000002 Parent Facility -122.2536 37.76295 General Acute Care
## 4 Alameda 140000002 Parent Facility -122.2536 37.76295 General Acute Care
## 5 Alameda 140000002 Parent Facility -122.2536 37.76295 General Acute Care
## 6 Alameda 140000002 Parent Facility -122.2536 37.76295 General Acute Care
## TYPE_SVC_PRINCIPAL Measure.Variable Description
## 1 General Medical / Surgical 3.1.1 MED_SURG_BED_LIC
## 2 General Medical / Surgical 3.1.2 MED_SURG_LICBED_DAY
## 3 General Medical / Surgical 3.1.3 MED_SURG_DIS
## 4 General Medical / Surgical 3.1.5 MED_SURG_CENS_DAY
## 5 General Medical / Surgical 3.25.1 HOSP_BED_LIC_TOTL
## 6 General Medical / Surgical 3.25.2 HOSP_LICBED_DAY_TOTL
## Response Amount
## 1 204
## 2 49104
## 3 2619
## 4 9457
## 5 401
## 6 121206
- Subset Columns
Here, we remove columns that will not be included in our analysis and rows related to a measure that’s clearly outside our scope of interest.
#subset to our columns of interest
df_hosputil_2 <- df_hosputil[c(2,3,1,31,33)]
#remove rows with certain measure
df_hosputil_2 <- df_hosputil_2[df_hosputil_2$Description != "AMB_SURG_PROG",]
head(df_hosputil_2)
## OSHPD_ID FAC_NAME Year Description Amount
## 1 106010735 ALAMEDA HOSPITAL 2012 MED_SURG_BED_LIC 204
## 2 106010735 ALAMEDA HOSPITAL 2012 MED_SURG_LICBED_DAY 49104
## 3 106010735 ALAMEDA HOSPITAL 2012 MED_SURG_DIS 2619
## 4 106010735 ALAMEDA HOSPITAL 2012 MED_SURG_CENS_DAY 9457
## 5 106010735 ALAMEDA HOSPITAL 2012 HOSP_BED_LIC_TOTL 401
## 6 106010735 ALAMEDA HOSPITAL 2012 HOSP_LICBED_DAY_TOTL 121206
- Tidy data
Following the tidy principles:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell
So each hospital and year will be a row, and we’re going to take the measures horizontal.
#turn Description field into columns; amount as Key using Spread
df_hosputil_3 <- df_hosputil_2 %>%
spread(key = Description, value = Amount)
head(df_hosputil_3)
## OSHPD_ID FAC_NAME Year
## 1 106010735 ALAMEDA HOSPITAL 2012
## 2 106010735 ALAMEDA HOSPITAL 2013
## 3 106010735 ALAMEDA HOSPITAL 2014
## 4 106010735 ALAMEDA HOSPITAL 2015
## 5 106010735 ALAMEDA HOSPITAL 2016
## 6 106010739 ALTA BATES SUMMIT MED CTR-ALTA BATES CAMPUS 2012
## HOSP_BED_LIC_TOTL HOSP_CENS_DAY_TOTL HOSP_DIS_TOTL HOSP_LICBED_DAY_TOTL
## 1 401 46073 3006 121206
## 2 281 72058 3026 102565
## 3 281 47310 1882 102565
## 4 281 70488 2476 102565
## 5 281 71354 2154 102846
## 6 347 77913 17003 127002
## MED_SURG_BED_LIC MED_SURG_CENS_DAY MED_SURG_DIS MED_SURG_LICBED_DAY
## 1 204 9457 2619 49104
## 2 84 9853 2575 30660
## 3 84 5619 1547 30660
## 4 84 6410 2045 30660
## 5 84 7140 1964 30744
## 6 146 34109 7896 53436
## SURG_IP SURG_OP
## 1 43099 74637
## 2 568 1477
## 3 544 843
## 4 630 840
## 5 635 1560
## 6 394287 321180
Now that our data is prepared for analysis, let’s talk a little more about what it means.
Each case represents a California hospital. Note that a multi-facility health system will be represented by one case per facility. There are 661 distinct facilities in the dataset that met our initial criteria, but that’s just a count of distinct facilities present in the data from 2012 to 2016. Some facilities closed; others opened in the time period. We might later exclude certain facilities due to missing or suspicious data. Thus, we start with 661 cases but expect to have fewer at the conclusion of our analysis.
What is the data source? Individual hospitals and health systems are required to self-report this facility data on an annual basis. Reported numbers do receive regulatory and other review, so there are strong incentives to report accurate information. However, possibilities of error and ambiguities of interpretation are entirely possible.
I should note that as a hospital data analyst/developer, I have prepared the data submissions for these programs previously. There’s a lot of ambiguity in reporting these numbers. The definitions are unclear. A hospital may have unused licensed beds. A health system might shift all its surgeries of one type to another facility. These observations are not independent - again, some of these facilities are owned or operated by the same parent organization, and many of these facilities operate in the same, near-zero-sum markets.
Annual California hospital utilization data are published here:
https://healthdata.gov/dataset/hospital-annual-utilization-report-pivot-tables
Note that for the purposes of this exercise we will be using the Hospital Annual Utilization Data - Machine-Readable Format CSV, which combines annual data from 2012 to 2016 in a format more friendly to analysis than the individual annual files.
You may have noticed 2012 values in the data despite our statement that we’ll be examining changes from 2013 to 2016. The 2012 values for certain hospitals and measures seem abnormally high, either due to error or the data being reported as an aggregation of pre-2013 annual values. So we will start with 2013.
The response variable is the change in outpatient surgical volume from 2013 to 2016. The explanatory variable is the change in inpatient surgical volume from 2013 to 2016.
This is an observational study - we will tentatively draw conclusions about the US healthcare system despite our data being limited to the state of California and the fact that there are numerous state-level distinctions in healthcare funding and incentives. In other words, any conclusions would at best only be invitations for further study. There’s a lot of bias in the data. Because it’s not an experimental study, no causation can be established.
In summary, this is an interesting dataset of great utility in our day job, but it’s not appropriate for true statistical inference.
Part 3 - Exploratory Data Analysis
For now, our big question is how annual inpatient and outpatient surgical volumes changed from 2013 to 2016, so we’re going to create some calculations to show the delta between 2013 and 2016 for these measures for each facility. This will likely take some time and require a smarter approach, but here is a start. We will calculate some summary statistics and then visualize.
#count number of facilities in dataset
sqldf("select count(distinct(FAC_NAME)) from df_hosputil_3")
## count(distinct(FAC_NAME))
## 1 661
Ok, we have 661 distinct hospitals in the dataset. But what about facilities that opened in 2015? Or ones that opened in 2014? If we were advanced health policy researchers, we would want to consider these closely. Evolving market pressures have caused an increasing number of rural hospitals to close in the last few years. A large health system may have closed an inpatient facility and shifted business to a new ambulatory surgery center.
But we are staying relatively simple with our inquiry. In hospitals that were open in 2013 through 2016, how did their outpatient and inpatient surgery values change?
Before we filter , let’s add lag columns
#objective - get difference between 2013 and 2016 values for these surgery counts
df_hosputil_3 <- df_hosputil_3 %>%
group_by(FAC_NAME) %>%
mutate(HOSP_OP_SURG_DIFF = as.numeric(SURG_OP - lag(SURG_OP, default = SURG_OP[4],n=3)))
df_hosputil_3 <- df_hosputil_3 %>%
group_by(FAC_NAME) %>%
mutate(HOSP_IP_SURG_DIFF = as.numeric(SURG_IP - lag(SURG_IP, default = SURG_IP[4],n=3)))
Now, let’s subset the data frames for our respective years and produce some summary 2013 and 2016 statistics
df_hosp_13 <- df_hosputil_3[df_hosputil_3$Year == 2013,]
df_hosp_16 <- df_hosputil_3[df_hosputil_3$Year == 2016,]
summary(df_hosp_13)
## OSHPD_ID FAC_NAME
## Min. :106010735 ALVARADO HOSPITAL MEDICAL CENTER : 2
## 1st Qu.:106190328 ADVENTIST MEDICAL CENTER : 1
## Median :106301279 ADVENTIST MEDICAL CENTER-SELMA : 1
## Mean :106871906 ADVENTIST MEDICAL CENTER - REEDLEY : 1
## 3rd Qu.:106380842 AHMC ANAHEIM REGIONAL MEDICAL CENTER: 1
## Max. :206351814 ALAMEDA COUNTY MEDICAL CENTER : 1
## (Other) :502
## Year HOSP_BED_LIC_TOTL HOSP_CENS_DAY_TOTL HOSP_DIS_TOTL
## Min. :2013 Min. : 10.0 Min. : 300 Min. : 15
## 1st Qu.:2013 1st Qu.: 67.0 1st Qu.: 12299 1st Qu.: 1194
## Median :2013 Median : 140.0 Median : 27024 Median : 4428
## Mean :2013 Mean : 202.5 Mean : 41813 Mean : 6969
## 3rd Qu.:2013 3rd Qu.: 269.0 3rd Qu.: 56060 3rd Qu.:10940
## Max. :2013 Max. :1500.0 Max. :511585 Max. :47332
## NA's :24 NA's :24
## HOSP_LICBED_DAY_TOTL MED_SURG_BED_LIC MED_SURG_CENS_DAY MED_SURG_DIS
## Min. : 288 Min. : 4.0 Min. : 35 Min. : 5
## 1st Qu.: 24455 1st Qu.: 47.0 1st Qu.: 7422 1st Qu.: 1385
## Median : 51100 Median : 96.0 Median : 17077 Median : 4238
## Mean : 74230 Mean :122.4 Mean : 24060 Mean : 5766
## 3rd Qu.: 99645 3rd Qu.:176.0 3rd Qu.: 34713 3rd Qu.: 9174
## Max. :547500 Max. :591.0 Max. :177330 Max. :36670
## NA's :94 NA's :117 NA's :117
## MED_SURG_LICBED_DAY SURG_IP SURG_OP HOSP_OP_SURG_DIFF
## Min. : 1460 Min. : 1.0 Min. : 3 Min. :-9365.0
## 1st Qu.: 16790 1st Qu.: 691.8 1st Qu.: 1110 1st Qu.: -372.8
## Median : 35040 Median : 1645.0 Median : 2283 Median : -83.5
## Mean : 44894 Mean : 2363.5 Mean : 3291 Mean : -183.2
## 3rd Qu.: 64331 3rd Qu.: 3439.0 3rd Qu.: 4118 3rd Qu.: 154.0
## Max. :219730 Max. :16213.0 Max. :21678 Max. : 4925.0
## NA's :93 NA's :135 NA's :142 NA's :227
## HOSP_IP_SURG_DIFF
## Min. :-3475.000
## 1st Qu.: -121.000
## Median : 13.000
## Mean : 0.239
## 3rd Qu.: 157.000
## Max. : 3008.000
## NA's :224
summary(df_hosp_16)
## OSHPD_ID
## Min. :106010735
## 1st Qu.:106190324
## Median :106301310
## Mean :106678916
## 3rd Qu.:106380863
## Max. :206351814
##
## FAC_NAME Year
## ALVARADO HOSPITAL MEDICAL CENTER : 2 Min. :2016
## FOUNTAIN VALLEY REGIONAL HOSPITAL AND MEDICAL CENT: 2 1st Qu.:2016
## ADVENTIST HEALTH MEDICAL CENTER TEHACHAPI VALLEY : 1 Median :2016
## ADVENTIST MEDICAL CENTER : 1 Mean :2016
## ADVENTIST MEDICAL CENTER - CENTRAL VALLEY : 1 3rd Qu.:2016
## ADVENTIST MEDICAL CENTER - REEDLEY : 1 Max. :2016
## (Other) :498
## HOSP_BED_LIC_TOTL HOSP_CENS_DAY_TOTL HOSP_DIS_TOTL HOSP_LICBED_DAY_TOTL
## Min. : 10.0 Min. : 51 Min. : 5 Min. : 3234
## 1st Qu.: 64.0 1st Qu.: 12060 1st Qu.: 1077 1st Qu.: 23150
## Median : 139.0 Median : 27841 Median : 4594 Median : 51240
## Mean : 197.5 Mean : 42569 Mean : 6961 Mean : 72057
## 3rd Qu.: 270.0 3rd Qu.: 57851 3rd Qu.:10822 3rd Qu.: 98820
## Max. :1500.0 Max. :572258 Max. :49736 Max. :549000
## NA's :14 NA's :14
## MED_SURG_BED_LIC MED_SURG_CENS_DAY MED_SURG_DIS MED_SURG_LICBED_DAY
## Min. : 4.0 Min. : 9 Min. : 1 Min. : 1464
## 1st Qu.: 48.0 1st Qu.: 7852 1st Qu.: 1492 1st Qu.: 17202
## Median : 99.0 Median : 18556 Median : 4410 Median : 36051
## Mean :122.1 Mean : 25855 Mean : 5963 Mean : 44464
## 3rd Qu.:176.0 3rd Qu.: 37119 3rd Qu.: 8863 3rd Qu.: 64050
## Max. :591.0 Max. :191110 Max. :39285 Max. :216306
## NA's :98 NA's :121 NA's :121 NA's :96
## SURG_IP SURG_OP HOSP_OP_SURG_DIFF HOSP_IP_SURG_DIFF
## Min. : 2.0 Min. : 13 Min. :-7394.0 Min. :-2756.000
## 1st Qu.: 660.5 1st Qu.: 1104 1st Qu.: -194.8 1st Qu.: -194.000
## Median : 1731.0 Median : 2416 Median : 106.0 Median : -26.000
## Mean : 2381.6 Mean : 3683 Mean : 246.6 Mean : 6.004
## 3rd Qu.: 3411.5 3rd Qu.: 4634 3rd Qu.: 471.5 3rd Qu.: 134.000
## Max. :15280.0 Max. :27682 Max. : 9588.0 Max. : 3475.000
## NA's :135 NA's :143 NA's :232 NA's :229
The 2013 data includes 502 facilities. The 2016 data has 498 facilities. We are gonna restrict our dataset to facilities that reported data for both years - and presume that this means we’re capturing data for all CA facilities that were operational from at least 2013 to 2016. Let’s restrict our annual datasets to only include facilities in the other data set.
#Generate vector of facilities in annual data
fac_2013 <- sqldf("select distinct(FAC_NAME) from df_hosp_13")
fac_2016 <- sqldf("select distinct(FAC_NAME) from df_hosp_16")
#limit data to facilities available in the other year
df_hosp_13 <- df_hosp_13 %>% filter(FAC_NAME %in% fac_2016$FAC_NAME)
df_hosp_16 <- df_hosp_16 %>% filter(FAC_NAME %in% fac_2013$FAC_NAME)
#check
summary(df_hosp_13)
## OSHPD_ID FAC_NAME
## Min. :106010735 ALVARADO HOSPITAL MEDICAL CENTER : 2
## 1st Qu.:106190390 ADVENTIST MEDICAL CENTER : 1
## Median :106301338 ADVENTIST MEDICAL CENTER - REEDLEY : 1
## Mean :106816090 AHMC ANAHEIM REGIONAL MEDICAL CENTER: 1
## 3rd Qu.:106380940 ALAMEDA HOSPITAL : 1
## Max. :206351814 ALVARADO PARKWAY INSTITUTE B.H.S. : 1
## (Other) :373
## Year HOSP_BED_LIC_TOTL HOSP_CENS_DAY_TOTL HOSP_DIS_TOTL
## Min. :2013 Min. : 10.00 Min. : 300 Min. : 15
## 1st Qu.:2013 1st Qu.: 63.75 1st Qu.: 11939 1st Qu.: 1215
## Median :2013 Median : 136.00 Median : 25514 Median : 4180
## Mean :2013 Mean : 203.81 Mean : 41970 Mean : 6780
## 3rd Qu.:2013 3rd Qu.: 278.25 3rd Qu.: 56285 3rd Qu.:10160
## Max. :2013 Max. :1500.00 Max. :511585 Max. :47332
## NA's :8 NA's :8
## HOSP_LICBED_DAY_TOTL MED_SURG_BED_LIC MED_SURG_CENS_DAY MED_SURG_DIS
## Min. : 288 Min. : 4.0 Min. : 35 Min. : 7
## 1st Qu.: 22904 1st Qu.: 46.0 1st Qu.: 5905 1st Qu.: 1352
## Median : 48545 Median : 92.0 Median : 15787 Median : 3898
## Mean : 74476 Mean :121.5 Mean : 23442 Mean : 5611
## 3rd Qu.:101744 3rd Qu.:177.0 3rd Qu.: 34593 3rd Qu.: 8810
## Max. :547500 Max. :591.0 Max. :177330 Max. :36670
## NA's :70 NA's :78 NA's :78
## MED_SURG_LICBED_DAY SURG_IP SURG_OP HOSP_OP_SURG_DIFF
## Min. : 1460 Min. : 1 Min. : 3.0 Min. :-9365.0
## 1st Qu.: 16242 1st Qu.: 680 1st Qu.: 977.5 1st Qu.: -329.5
## Median : 33215 Median : 1511 Median : 1947.5 Median : -82.0
## Mean : 44216 Mean : 2270 Mean : 2800.6 Mean : -173.1
## 3rd Qu.: 64605 3rd Qu.: 3185 3rd Qu.: 3444.8 3rd Qu.: 157.0
## Max. :213931 Max. :16213 Max. :15590.0 Max. : 4925.0
## NA's :69 NA's :95 NA's :100 NA's :105
## HOSP_IP_SURG_DIFF
## Min. :-3475.000
## 1st Qu.: -111.500
## Median : 14.000
## Mean : -0.504
## 3rd Qu.: 163.750
## Max. : 3008.000
## NA's :102
summary(df_hosp_16)
## OSHPD_ID FAC_NAME
## Min. :106010735 ALVARADO HOSPITAL MEDICAL CENTER : 2
## 1st Qu.:106190390 ADVENTIST MEDICAL CENTER : 1
## Median :106301338 ADVENTIST MEDICAL CENTER - REEDLEY : 1
## Mean :106816090 AHMC ANAHEIM REGIONAL MEDICAL CENTER: 1
## 3rd Qu.:106380940 ALAMEDA HOSPITAL : 1
## Max. :206351814 ALVARADO PARKWAY INSTITUTE B.H.S. : 1
## (Other) :373
## Year HOSP_BED_LIC_TOTL HOSP_CENS_DAY_TOTL HOSP_DIS_TOTL
## Min. :2016 Min. : 10.0 Min. : 99 Min. : 5
## 1st Qu.:2016 1st Qu.: 62.0 1st Qu.: 11826 1st Qu.: 1096
## Median :2016 Median : 133.5 Median : 26934 Median : 4254
## Mean :2016 Mean : 200.3 Mean : 43239 Mean : 6856
## 3rd Qu.:2016 3rd Qu.: 279.5 3rd Qu.: 58597 3rd Qu.:10238
## Max. :2016 Max. :1500.0 Max. :572258 Max. :49736
## NA's :8 NA's :8
## HOSP_LICBED_DAY_TOTL MED_SURG_BED_LIC MED_SURG_CENS_DAY MED_SURG_DIS
## Min. : 3660 Min. : 4.0 Min. : 9 Min. : 1
## 1st Qu.: 22601 1st Qu.: 47.0 1st Qu.: 6946 1st Qu.: 1464
## Median : 49044 Median : 94.0 Median : 17060 Median : 4157
## Mean : 73379 Mean :121.7 Mean : 25707 Mean : 5916
## 3rd Qu.:102288 3rd Qu.:176.8 3rd Qu.: 37197 3rd Qu.: 8868
## Max. :549000 Max. :591.0 Max. :191110 Max. :39285
## NA's :72 NA's :85 NA's :85
## MED_SURG_LICBED_DAY SURG_IP SURG_OP HOSP_OP_SURG_DIFF
## Min. : 1464 Min. : 3.0 Min. : 46 Min. :-7394.0
## 1st Qu.: 16600 1st Qu.: 654.5 1st Qu.: 1015 1st Qu.: -194.8
## Median : 34404 Median : 1490.0 Median : 1992 Median : 106.0
## Mean : 44467 Mean : 2320.0 Mean : 3079 Mean : 246.6
## 3rd Qu.: 64416 3rd Qu.: 3293.5 3rd Qu.: 3870 3rd Qu.: 471.5
## Max. :216306 Max. :15280.0 Max. :19000 Max. : 9588.0
## NA's :71 NA's :101 NA's :104 NA's :106
## HOSP_IP_SURG_DIFF
## Min. :-2756.000
## 1st Qu.: -194.000
## Median : -26.000
## Mean : 6.004
## 3rd Qu.: 134.000
## Max. : 3475.000
## NA's :103
What about instances in which a hospital didn’t have inpatient and outpatient surgery cases in 2013 or 2016? There may be cases where a hospital opened a new outpatient surgery center to move business to in 2015, but we’re going to exclude these as well by removing the cases with ‘NA’ values in our difference columns
df_hosp_16 <- df_hosp_16 %>% filter(!is.na(HOSP_OP_SURG_DIFF))
df_hosp_16 <- df_hosp_16 %>% filter(!is.na(HOSP_IP_SURG_DIFF))
summary(df_hosp_16)
## OSHPD_ID FAC_NAME
## Min. :106010735 ADVENTIST MEDICAL CENTER : 1
## 1st Qu.:106190519 ADVENTIST MEDICAL CENTER - REEDLEY : 1
## Median :106301337 AHMC ANAHEIM REGIONAL MEDICAL CENTER: 1
## Mean :106295268 ALAMEDA HOSPITAL : 1
## 3rd Qu.:106380962 ALVARADO HOSPITAL MEDICAL CENTER : 1
## Max. :106580996 ANTELOPE VALLEY HOSPITAL : 1
## (Other) :261
## Year HOSP_BED_LIC_TOTL HOSP_CENS_DAY_TOTL HOSP_DIS_TOTL
## Min. :2016 Min. : 10.0 Min. : 279 Min. : 185
## 1st Qu.:2016 1st Qu.:101.0 1st Qu.: 15130 1st Qu.: 3262
## Median :2016 Median :178.0 Median : 36587 Median : 6859
## Mean :2016 Mean :221.4 Mean : 45203 Mean : 9025
## 3rd Qu.:2016 3rd Qu.:328.5 3rd Qu.: 67095 3rd Qu.:13466
## Max. :2016 Max. :886.0 Max. :258958 Max. :49736
##
## HOSP_LICBED_DAY_TOTL MED_SURG_BED_LIC MED_SURG_CENS_DAY MED_SURG_DIS
## Min. : 3660 Min. : 10.0 Min. : 279 Min. : 149
## 1st Qu.: 36966 1st Qu.: 59.0 1st Qu.: 8644 1st Qu.: 2250
## Median : 65148 Median :106.0 Median : 18790 Median : 4760
## Mean : 81060 Mean :133.7 Mean : 27589 Mean : 6570
## 3rd Qu.:120231 3rd Qu.:192.0 3rd Qu.: 41634 3rd Qu.: 9712
## Max. :324276 Max. :591.0 Max. :191110 Max. :39285
## NA's :2 NA's :4 NA's :4
## MED_SURG_LICBED_DAY SURG_IP SURG_OP HOSP_OP_SURG_DIFF
## Min. : 3660 Min. : 3.0 Min. : 46 Min. :-7394.0
## 1st Qu.: 21594 1st Qu.: 772.5 1st Qu.: 1065 1st Qu.: -199.0
## Median : 38796 Median : 1673.0 Median : 2007 Median : 105.0
## Mean : 48971 Mean : 2407.8 Mean : 3127 Mean : 240.7
## 3rd Qu.: 70272 3rd Qu.: 3346.0 3rd Qu.: 3928 3rd Qu.: 463.5
## Max. :216306 Max. :15280.0 Max. :19000 Max. : 9588.0
## NA's :2
## HOSP_IP_SURG_DIFF
## Min. :-2756.00
## 1st Qu.: -195.00
## Median : -26.00
## Mean : 7.76
## 3rd Qu.: 144.50
## Max. : 3475.00
##
This really reduces our number of cases - we only see 278 facilities that submitted data in both 2013 and 2016 that had inpatient and outpatient surgery numbers in all cases. Based on our healthcare experience, we guess that fewer than half this reduction involves facilities closing but instead could be mergers, changes in reporting, license changes, etc.
- Total Numbers
As a sanity check, let’s do a boxplot of the annual census days by facility and reporting year. Note that this will include facilities filtered out in the previous step. We’re not looking at just 2013 and 2016 yet.
# boxplot of census days
boxplot(df_hosputil_3$HOSP_CENS_DAY_TOTL ~ df_hosputil_3$Year, col="blue", main="Distribution of Annual Total Census Days", ylab="Annual Census Days", xlab="Years")
Now, let’s take a look at the outpatient surgery volumes for the two years in question via histogram.
#histogram of outpatient surgeries for 2013
histogram(df_hosp_13$SURG_OP, type = "count", col="blue", main="Histogram: 2013 Outpatient Surgeries by Facility", ylab="Facility Count", xlab="Annual OP Surgeries by Facility")
#histogram of outpatient surgeries for 2016
histogram(df_hosp_16$SURG_OP, type = "count", col="blue", main="Histogram: 2016 Outpatient Surgeries by Facility", ylab="Facility Count", xlab="Annual OP Surgeries by Facility")
No readily apparently significant differences in the histograms of the two years is apparent. Let’s look at the delta - the calculated change columns from 2013 to 2016 from the 2016 data set. We also applied some additional filtering to the 2016 data.
ggplot(df_hosp_16, aes(x=HOSP_IP_SURG_DIFF, y=HOSP_OP_SURG_DIFF, size=HOSP_CENS_DAY_TOTL)) +
geom_point(alpha=0.2)
Under our hunch that hospitals have shifted many inpatient procedures to outpatient procedures, the majority of points would be in quadrant 1 - upper left. Of course, we could also have points in the upper right as long as the outpatient difference exceeded the inpatient growth - or, in a declining market, the lower left if the inpatient drop is greater than the outpatient decrease.
Part 4 - Correlation & Inference
We’re interested in the relationship between inpatient and outpatient surgery cases in 2013 and then in 2016. We could test if there was a significant change in inpatient cases between the two years - using the whole state population - but that’s not really our question.
Instead, we want to know if there’s a linear relationship between inpatient and outpatient changes at hospitals. We’ll call the population correlation coefficient p. Our null hypothesis is that there is no relationship between changes in inpatient and outpatient surgical volumes among facilities. We will use Pearson’s correlation, a method to examine if there is a linear relationship among variables.
To avoid disappoint, we’re going to commit the grievous sin of working our methods before ensuring the conditions are right for using them.
Let’s get a correlation.
cor.test(df_hosp_16$HOSP_IP_SURG_DIFF,df_hosp_16$HOSP_OP_SURG_DIFF, method="pearson")
##
## Pearson's product-moment correlation
##
## data: df_hosp_16$HOSP_IP_SURG_DIFF and df_hosp_16$HOSP_OP_SURG_DIFF
## t = 3.9646, df = 265, p-value = 9.464e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1199881 0.3468195
## sample estimates:
## cor
## 0.2366254
Here we see our t-test statistic values of 3.9646. Our degrees of freedom reflects our cases less 2. Our p-value indicates the significance of the test - it’s well less than the industry standard 0.05, so we could say that there’s a significant but small correlation between a change in inpatient surgery cases and outpatient surgery cases with a correlation (r) of .2366254. The lower-end of the 95% confidence interval exceeds 0, which would represents our null hypothesis that there is no relationship.
Let’s plot our relationship.
library("ggpubr")
ggscatter(df_hosp_16, x = df_hosp_16$HOSP_IP_SURG_DIFF, y = df_hosp_16$HOSP_OP_SURG_DIFF,
add = "reg.line", conf.int = TRUE,
cor.method = "pearson",
xlab = "Changes in Inpatient Surgery Volumes: 2013-2016", ylab = "Changes in Outpatient Surgery Volumes: 2013-2016")
Oh, that doesn’t look so good. Pearson’s correlation relies on five assumptions:
- Variables must either be interval or ratio measurements - yes, interval
- Variables must be approximately normally distributed - let’s check
hist(df_hosp_16$HOSP_IP_SURG_DIFF)
hist(df_hosp_16$HOSP_OP_SURG_DIFF)
Fairly normal.
- There is a linear relationship - utoh. Our first scatterplot did not show this.
- Outliers are kept to a minimum or removed - we have some.
- There is homscedasticity of the data - nope, variances are not uniform along our best fit line.
So we did our Pearson Correlation testing prematurely for practice, but it did not satisfy the conditions.
Let’s take a step back and go for something simpler. We’ll pretend our full population of CA hospitals is a valid sample of US hospitals from 2013 to 2016 and construct a 95% confidence interval for the increase in outpatient surgical cases per facility. Above, we saw a nearly normal distribution.
We’re pretending it’s an independent sample of more than 30 cases that is less than 10% of the total population. Given California’s size, it’s hospitals likely constitute more than 10% of US hospitals, but we’re already bending rules, so…
mean_op <- mean(df_hosp_16$HOSP_OP_SURG_DIFF)
#Here, we're cheating a bit too
sd_op <- sd(df_hosp_16$HOSP_OP_SURG_DIFF)
n_op <- 267
se_op <- qnorm(.975) * (sd_op/sqrt(n_op))
mean_op - se_op
## [1] 60.26167
mean_op + se_op
## [1] 421.1466
So, in another world where we have true independent sample, we have 95% confidence that the true mean growth in outpatient surigcal cases from 2013 to 2016 was between 60.26 and 421.15.
Part 5 - Conclusion
We have looked closely at data related to surgical procedures in California and how those numbers changed from 2013 to 2016, and we draw no conclusions other than the aging of the baby boomers may be overwhelming changes in business strategy in terms of growth in surgical cases.
We also conclude that we should have picked a data set more suited to testing what we learned this semester than a great whole population data set in which we have professional interest. But it was an educational exercise nonetheless.
Sources
R for Data Science: https://r4ds.had.co.nz/tidy-data.html
Pearson correlation: https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php