Part 1 - Introduction

The Affordable Care Act (ACA), signed into law in 2010, along with other recent changes affecting healthcare reimbursement models, has caused significant shifts in incentives in the operations of large hospitals and health care systems. Inpatient care has been particularly impacted, as hospitals seek to avoid financial penalties related to readmissions and reduce the lengths of inpatient stays. Moreover, certain categories of surgeries have shifted from requiring multi-day hospital days to being done in a few hours as outpatient procedures at ambulatory surgery centers.

For this project, we will examine annual operational statistics related to California hospitals, including numbers of inpatient beds and inpatient patient census days and surgical procedure volume, to analyze trends from 2013 to 2016. In particular, we will focus on changes in inpatient and outpatient surgical volumes.

HealthData.gov makes a number of informative and robust data sets available for public download. Our dataset of interest includes a large number of statistics about hospital facilities provided by the State of California for its Office of Statewide Health Planning and Development (OSHPD) program.

Part 2 - Data

  • Load libraries
library(knitr)
library(kableExtra)
library(prettydoc)
library(RCurl)
library(dplyr)
library(ggplot2)
library(tidyr)
library(tidyverse)
library(sqldf)
  • Load the data as a dataframe

Note: The file as originally downloaded from HealthData.gov exceeds 85 MB. To fit the requirements of being under 25 MB for posting on Github, columns with extraneous information such as the contact information for the person who compiled data at each facility was removed. Additionally, measures and metrics that are not of interest for this analysis were removed. The full dataset can be accessed here.

#load the csv into R
df_hosputil <- read.csv("https://raw.githubusercontent.com/littlejohnjeff/Data_606/master/2012-2016-hosputilmr_edited.csv")
#Display top rows of dataframe
head(df_hosputil)
##   Year  OSHPD_ID         FAC_NAME FAC_ADDRESS_ONE FAC_ADDRESS_TWO FAC_CITY
## 1 2012 106010735 ALAMEDA HOSPITAL    2070 CLINTON                  ALAMEDA
## 2 2012 106010735 ALAMEDA HOSPITAL    2070 CLINTON                  ALAMEDA
## 3 2012 106010735 ALAMEDA HOSPITAL    2070 CLINTON                  ALAMEDA
## 4 2012 106010735 ALAMEDA HOSPITAL    2070 CLINTON                  ALAMEDA
## 5 2012 106010735 ALAMEDA HOSPITAL    2070 CLINTON                  ALAMEDA
## 6 2012 106010735 ALAMEDA HOSPITAL    2070 CLINTON                  ALAMEDA
##   FAC_ZIPCODE FAC_OPER_CURRYR BEG_DATE   END_DATE PARENT_NAME PARENT_ZIP_9
## 1       94501             Yes 1/1/2012 12/31/2012                         
## 2       94501             Yes 1/1/2012 12/31/2012                         
## 3       94501             Yes 1/1/2012 12/31/2012                         
## 4       94501             Yes 1/1/2012 12/31/2012                         
## 5       94501             Yes 1/1/2012 12/31/2012                         
## 6       94501             Yes 1/1/2012 12/31/2012                         
##   LIC_STATUS LIC_STATUS_DATE TRAUMA_CTR TEACH_HOSP ACLAIMS_NO
## 1       Open        1/1/1946                    NO  140000011
## 2       Open        1/1/1946                    NO  140000011
## 3       Open        1/1/1946                    NO  140000011
## 4       Open        1/1/1946                    NO  140000011
## 5       Open        1/1/1946                    NO  140000011
## 6       Open        1/1/1946                    NO  140000011
##   ASSEMBLY_DIST SENATE_DIST CONGRESS_DIST CENS_TRACT HEALTH_SVC_AREA
## 1            16           9            13 6001428500   05 - East Bay
## 2            16           9            13 6001428500   05 - East Bay
## 3            16           9            13 6001428500   05 - East Bay
## 4            16           9            13 6001428500   05 - East Bay
## 5            16           9            13 6001428500   05 - East Bay
## 6            16           9            13 6001428500   05 - East Bay
##    COUNTY LICENSE_NO  FACILITY_LEVEL LONGITUDE LATITUDE           TYPE_LIC
## 1 Alameda  140000002 Parent Facility -122.2536 37.76295 General Acute Care
## 2 Alameda  140000002 Parent Facility -122.2536 37.76295 General Acute Care
## 3 Alameda  140000002 Parent Facility -122.2536 37.76295 General Acute Care
## 4 Alameda  140000002 Parent Facility -122.2536 37.76295 General Acute Care
## 5 Alameda  140000002 Parent Facility -122.2536 37.76295 General Acute Care
## 6 Alameda  140000002 Parent Facility -122.2536 37.76295 General Acute Care
##           TYPE_SVC_PRINCIPAL Measure.Variable          Description
## 1 General Medical / Surgical            3.1.1     MED_SURG_BED_LIC
## 2 General Medical / Surgical            3.1.2  MED_SURG_LICBED_DAY
## 3 General Medical / Surgical            3.1.3         MED_SURG_DIS
## 4 General Medical / Surgical            3.1.5    MED_SURG_CENS_DAY
## 5 General Medical / Surgical           3.25.1    HOSP_BED_LIC_TOTL
## 6 General Medical / Surgical           3.25.2 HOSP_LICBED_DAY_TOTL
##   Response Amount
## 1             204
## 2           49104
## 3            2619
## 4            9457
## 5             401
## 6          121206
  • Subset Columns

Here, we remove columns that will not be included in our analysis and rows related to a measure that’s clearly outside our scope of interest.

#subset to our columns of interest
df_hosputil_2 <- df_hosputil[c(2,3,1,31,33)]
#remove rows with certain measure
df_hosputil_2 <- df_hosputil_2[df_hosputil_2$Description != "AMB_SURG_PROG",]
head(df_hosputil_2)
##    OSHPD_ID         FAC_NAME Year          Description Amount
## 1 106010735 ALAMEDA HOSPITAL 2012     MED_SURG_BED_LIC    204
## 2 106010735 ALAMEDA HOSPITAL 2012  MED_SURG_LICBED_DAY  49104
## 3 106010735 ALAMEDA HOSPITAL 2012         MED_SURG_DIS   2619
## 4 106010735 ALAMEDA HOSPITAL 2012    MED_SURG_CENS_DAY   9457
## 5 106010735 ALAMEDA HOSPITAL 2012    HOSP_BED_LIC_TOTL    401
## 6 106010735 ALAMEDA HOSPITAL 2012 HOSP_LICBED_DAY_TOTL 121206
  • Tidy data

Following the tidy principles:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell

So each hospital and year will be a row, and we’re going to take the measures horizontal.

#turn Description field into columns; amount as Key using Spread 
df_hosputil_3 <- df_hosputil_2 %>%
  spread(key = Description, value = Amount)

head(df_hosputil_3)
##    OSHPD_ID                                    FAC_NAME Year
## 1 106010735                            ALAMEDA HOSPITAL 2012
## 2 106010735                            ALAMEDA HOSPITAL 2013
## 3 106010735                            ALAMEDA HOSPITAL 2014
## 4 106010735                            ALAMEDA HOSPITAL 2015
## 5 106010735                            ALAMEDA HOSPITAL 2016
## 6 106010739 ALTA BATES SUMMIT MED CTR-ALTA BATES CAMPUS 2012
##   HOSP_BED_LIC_TOTL HOSP_CENS_DAY_TOTL HOSP_DIS_TOTL HOSP_LICBED_DAY_TOTL
## 1               401              46073          3006               121206
## 2               281              72058          3026               102565
## 3               281              47310          1882               102565
## 4               281              70488          2476               102565
## 5               281              71354          2154               102846
## 6               347              77913         17003               127002
##   MED_SURG_BED_LIC MED_SURG_CENS_DAY MED_SURG_DIS MED_SURG_LICBED_DAY
## 1              204              9457         2619               49104
## 2               84              9853         2575               30660
## 3               84              5619         1547               30660
## 4               84              6410         2045               30660
## 5               84              7140         1964               30744
## 6              146             34109         7896               53436
##   SURG_IP SURG_OP
## 1   43099   74637
## 2     568    1477
## 3     544     843
## 4     630     840
## 5     635    1560
## 6  394287  321180

Now that our data is prepared for analysis, let’s talk a little more about what it means.

Each case represents a California hospital. Note that a multi-facility health system will be represented by one case per facility. There are 661 distinct facilities in the dataset that met our initial criteria, but that’s just a count of distinct facilities present in the data from 2012 to 2016. Some facilities closed; others opened in the time period. We might later exclude certain facilities due to missing or suspicious data. Thus, we start with 661 cases but expect to have fewer at the conclusion of our analysis.

What is the data source? Individual hospitals and health systems are required to self-report this facility data on an annual basis. Reported numbers do receive regulatory and other review, so there are strong incentives to report accurate information. However, possibilities of error and ambiguities of interpretation are entirely possible.

I should note that as a hospital data analyst/developer, I have prepared the data submissions for these programs previously. There’s a lot of ambiguity in reporting these numbers. The definitions are unclear. A hospital may have unused licensed beds. A health system might shift all its surgeries of one type to another facility. These observations are not independent - again, some of these facilities are owned or operated by the same parent organization, and many of these facilities operate in the same, near-zero-sum markets.

Annual California hospital utilization data are published here:

https://healthdata.gov/dataset/hospital-annual-utilization-report-pivot-tables

Note that for the purposes of this exercise we will be using the Hospital Annual Utilization Data - Machine-Readable Format CSV, which combines annual data from 2012 to 2016 in a format more friendly to analysis than the individual annual files.

You may have noticed 2012 values in the data despite our statement that we’ll be examining changes from 2013 to 2016. The 2012 values for certain hospitals and measures seem abnormally high, either due to error or the data being reported as an aggregation of pre-2013 annual values. So we will start with 2013.

The response variable is the change in outpatient surgical volume from 2013 to 2016. The explanatory variable is the change in inpatient surgical volume from 2013 to 2016.
This is an observational study - we will tentatively draw conclusions about the US healthcare system despite our data being limited to the state of California and the fact that there are numerous state-level distinctions in healthcare funding and incentives. In other words, any conclusions would at best only be invitations for further study. There’s a lot of bias in the data. Because it’s not an experimental study, no causation can be established.

In summary, this is an interesting dataset of great utility in our day job, but it’s not appropriate for true statistical inference.

Part 3 - Exploratory Data Analysis

For now, our big question is how annual inpatient and outpatient surgical volumes changed from 2013 to 2016, so we’re going to create some calculations to show the delta between 2013 and 2016 for these measures for each facility. This will likely take some time and require a smarter approach, but here is a start. We will calculate some summary statistics and then visualize.

#count number of facilities in dataset
sqldf("select count(distinct(FAC_NAME)) from df_hosputil_3")
##   count(distinct(FAC_NAME))
## 1                       661

Ok, we have 661 distinct hospitals in the dataset. But what about facilities that opened in 2015? Or ones that opened in 2014? If we were advanced health policy researchers, we would want to consider these closely. Evolving market pressures have caused an increasing number of rural hospitals to close in the last few years. A large health system may have closed an inpatient facility and shifted business to a new ambulatory surgery center.

But we are staying relatively simple with our inquiry. In hospitals that were open in 2013 through 2016, how did their outpatient and inpatient surgery values change?

Before we filter , let’s add lag columns

#objective - get difference between 2013 and 2016 values for these surgery counts
df_hosputil_3 <- df_hosputil_3 %>%
  group_by(FAC_NAME) %>%
  mutate(HOSP_OP_SURG_DIFF = as.numeric(SURG_OP - lag(SURG_OP, default = SURG_OP[4],n=3)))

df_hosputil_3 <- df_hosputil_3 %>%
  group_by(FAC_NAME) %>%
  mutate(HOSP_IP_SURG_DIFF = as.numeric(SURG_IP - lag(SURG_IP, default = SURG_IP[4],n=3)))

Now, let’s subset the data frames for our respective years and produce some summary 2013 and 2016 statistics

df_hosp_13 <- df_hosputil_3[df_hosputil_3$Year == 2013,]
df_hosp_16 <- df_hosputil_3[df_hosputil_3$Year == 2016,]

summary(df_hosp_13)
##     OSHPD_ID                                         FAC_NAME  
##  Min.   :106010735   ALVARADO HOSPITAL MEDICAL CENTER    :  2  
##  1st Qu.:106190328   ADVENTIST MEDICAL CENTER            :  1  
##  Median :106301279   ADVENTIST MEDICAL CENTER-SELMA      :  1  
##  Mean   :106871906   ADVENTIST MEDICAL CENTER - REEDLEY  :  1  
##  3rd Qu.:106380842   AHMC ANAHEIM REGIONAL MEDICAL CENTER:  1  
##  Max.   :206351814   ALAMEDA COUNTY MEDICAL CENTER       :  1  
##                      (Other)                             :502  
##       Year      HOSP_BED_LIC_TOTL HOSP_CENS_DAY_TOTL HOSP_DIS_TOTL  
##  Min.   :2013   Min.   :  10.0    Min.   :   300     Min.   :   15  
##  1st Qu.:2013   1st Qu.:  67.0    1st Qu.: 12299     1st Qu.: 1194  
##  Median :2013   Median : 140.0    Median : 27024     Median : 4428  
##  Mean   :2013   Mean   : 202.5    Mean   : 41813     Mean   : 6969  
##  3rd Qu.:2013   3rd Qu.: 269.0    3rd Qu.: 56060     3rd Qu.:10940  
##  Max.   :2013   Max.   :1500.0    Max.   :511585     Max.   :47332  
##                                   NA's   :24         NA's   :24     
##  HOSP_LICBED_DAY_TOTL MED_SURG_BED_LIC MED_SURG_CENS_DAY  MED_SURG_DIS  
##  Min.   :   288       Min.   :  4.0    Min.   :    35    Min.   :    5  
##  1st Qu.: 24455       1st Qu.: 47.0    1st Qu.:  7422    1st Qu.: 1385  
##  Median : 51100       Median : 96.0    Median : 17077    Median : 4238  
##  Mean   : 74230       Mean   :122.4    Mean   : 24060    Mean   : 5766  
##  3rd Qu.: 99645       3rd Qu.:176.0    3rd Qu.: 34713    3rd Qu.: 9174  
##  Max.   :547500       Max.   :591.0    Max.   :177330    Max.   :36670  
##                       NA's   :94       NA's   :117       NA's   :117    
##  MED_SURG_LICBED_DAY    SURG_IP           SURG_OP      HOSP_OP_SURG_DIFF
##  Min.   :  1460      Min.   :    1.0   Min.   :    3   Min.   :-9365.0  
##  1st Qu.: 16790      1st Qu.:  691.8   1st Qu.: 1110   1st Qu.: -372.8  
##  Median : 35040      Median : 1645.0   Median : 2283   Median :  -83.5  
##  Mean   : 44894      Mean   : 2363.5   Mean   : 3291   Mean   : -183.2  
##  3rd Qu.: 64331      3rd Qu.: 3439.0   3rd Qu.: 4118   3rd Qu.:  154.0  
##  Max.   :219730      Max.   :16213.0   Max.   :21678   Max.   : 4925.0  
##  NA's   :93          NA's   :135       NA's   :142     NA's   :227      
##  HOSP_IP_SURG_DIFF  
##  Min.   :-3475.000  
##  1st Qu.: -121.000  
##  Median :   13.000  
##  Mean   :    0.239  
##  3rd Qu.:  157.000  
##  Max.   : 3008.000  
##  NA's   :224
summary(df_hosp_16)
##     OSHPD_ID        
##  Min.   :106010735  
##  1st Qu.:106190324  
##  Median :106301310  
##  Mean   :106678916  
##  3rd Qu.:106380863  
##  Max.   :206351814  
##                     
##                                                FAC_NAME        Year     
##  ALVARADO HOSPITAL MEDICAL CENTER                  :  2   Min.   :2016  
##  FOUNTAIN VALLEY REGIONAL HOSPITAL AND MEDICAL CENT:  2   1st Qu.:2016  
##  ADVENTIST HEALTH MEDICAL CENTER TEHACHAPI VALLEY  :  1   Median :2016  
##  ADVENTIST MEDICAL CENTER                          :  1   Mean   :2016  
##  ADVENTIST MEDICAL CENTER - CENTRAL VALLEY         :  1   3rd Qu.:2016  
##  ADVENTIST MEDICAL CENTER - REEDLEY                :  1   Max.   :2016  
##  (Other)                                           :498                 
##  HOSP_BED_LIC_TOTL HOSP_CENS_DAY_TOTL HOSP_DIS_TOTL   HOSP_LICBED_DAY_TOTL
##  Min.   :  10.0    Min.   :    51     Min.   :    5   Min.   :  3234      
##  1st Qu.:  64.0    1st Qu.: 12060     1st Qu.: 1077   1st Qu.: 23150      
##  Median : 139.0    Median : 27841     Median : 4594   Median : 51240      
##  Mean   : 197.5    Mean   : 42569     Mean   : 6961   Mean   : 72057      
##  3rd Qu.: 270.0    3rd Qu.: 57851     3rd Qu.:10822   3rd Qu.: 98820      
##  Max.   :1500.0    Max.   :572258     Max.   :49736   Max.   :549000      
##                    NA's   :14         NA's   :14                          
##  MED_SURG_BED_LIC MED_SURG_CENS_DAY  MED_SURG_DIS   MED_SURG_LICBED_DAY
##  Min.   :  4.0    Min.   :     9    Min.   :    1   Min.   :  1464     
##  1st Qu.: 48.0    1st Qu.:  7852    1st Qu.: 1492   1st Qu.: 17202     
##  Median : 99.0    Median : 18556    Median : 4410   Median : 36051     
##  Mean   :122.1    Mean   : 25855    Mean   : 5963   Mean   : 44464     
##  3rd Qu.:176.0    3rd Qu.: 37119    3rd Qu.: 8863   3rd Qu.: 64050     
##  Max.   :591.0    Max.   :191110    Max.   :39285   Max.   :216306     
##  NA's   :98       NA's   :121       NA's   :121     NA's   :96         
##     SURG_IP           SURG_OP      HOSP_OP_SURG_DIFF HOSP_IP_SURG_DIFF  
##  Min.   :    2.0   Min.   :   13   Min.   :-7394.0   Min.   :-2756.000  
##  1st Qu.:  660.5   1st Qu.: 1104   1st Qu.: -194.8   1st Qu.: -194.000  
##  Median : 1731.0   Median : 2416   Median :  106.0   Median :  -26.000  
##  Mean   : 2381.6   Mean   : 3683   Mean   :  246.6   Mean   :    6.004  
##  3rd Qu.: 3411.5   3rd Qu.: 4634   3rd Qu.:  471.5   3rd Qu.:  134.000  
##  Max.   :15280.0   Max.   :27682   Max.   : 9588.0   Max.   : 3475.000  
##  NA's   :135       NA's   :143     NA's   :232       NA's   :229

The 2013 data includes 502 facilities. The 2016 data has 498 facilities. We are gonna restrict our dataset to facilities that reported data for both years - and presume that this means we’re capturing data for all CA facilities that were operational from at least 2013 to 2016. Let’s restrict our annual datasets to only include facilities in the other data set.

#Generate vector of facilities in annual data 
fac_2013 <- sqldf("select distinct(FAC_NAME) from df_hosp_13")
fac_2016 <- sqldf("select distinct(FAC_NAME) from df_hosp_16")

#limit data to facilities available in the other year
df_hosp_13 <- df_hosp_13 %>% filter(FAC_NAME %in% fac_2016$FAC_NAME)
df_hosp_16 <- df_hosp_16 %>% filter(FAC_NAME %in% fac_2013$FAC_NAME)

#check
summary(df_hosp_13)
##     OSHPD_ID                                         FAC_NAME  
##  Min.   :106010735   ALVARADO HOSPITAL MEDICAL CENTER    :  2  
##  1st Qu.:106190390   ADVENTIST MEDICAL CENTER            :  1  
##  Median :106301338   ADVENTIST MEDICAL CENTER - REEDLEY  :  1  
##  Mean   :106816090   AHMC ANAHEIM REGIONAL MEDICAL CENTER:  1  
##  3rd Qu.:106380940   ALAMEDA HOSPITAL                    :  1  
##  Max.   :206351814   ALVARADO PARKWAY INSTITUTE B.H.S.   :  1  
##                      (Other)                             :373  
##       Year      HOSP_BED_LIC_TOTL HOSP_CENS_DAY_TOTL HOSP_DIS_TOTL  
##  Min.   :2013   Min.   :  10.00   Min.   :   300     Min.   :   15  
##  1st Qu.:2013   1st Qu.:  63.75   1st Qu.: 11939     1st Qu.: 1215  
##  Median :2013   Median : 136.00   Median : 25514     Median : 4180  
##  Mean   :2013   Mean   : 203.81   Mean   : 41970     Mean   : 6780  
##  3rd Qu.:2013   3rd Qu.: 278.25   3rd Qu.: 56285     3rd Qu.:10160  
##  Max.   :2013   Max.   :1500.00   Max.   :511585     Max.   :47332  
##                                   NA's   :8          NA's   :8      
##  HOSP_LICBED_DAY_TOTL MED_SURG_BED_LIC MED_SURG_CENS_DAY  MED_SURG_DIS  
##  Min.   :   288       Min.   :  4.0    Min.   :    35    Min.   :    7  
##  1st Qu.: 22904       1st Qu.: 46.0    1st Qu.:  5905    1st Qu.: 1352  
##  Median : 48545       Median : 92.0    Median : 15787    Median : 3898  
##  Mean   : 74476       Mean   :121.5    Mean   : 23442    Mean   : 5611  
##  3rd Qu.:101744       3rd Qu.:177.0    3rd Qu.: 34593    3rd Qu.: 8810  
##  Max.   :547500       Max.   :591.0    Max.   :177330    Max.   :36670  
##                       NA's   :70       NA's   :78        NA's   :78     
##  MED_SURG_LICBED_DAY    SURG_IP         SURG_OP        HOSP_OP_SURG_DIFF
##  Min.   :  1460      Min.   :    1   Min.   :    3.0   Min.   :-9365.0  
##  1st Qu.: 16242      1st Qu.:  680   1st Qu.:  977.5   1st Qu.: -329.5  
##  Median : 33215      Median : 1511   Median : 1947.5   Median :  -82.0  
##  Mean   : 44216      Mean   : 2270   Mean   : 2800.6   Mean   : -173.1  
##  3rd Qu.: 64605      3rd Qu.: 3185   3rd Qu.: 3444.8   3rd Qu.:  157.0  
##  Max.   :213931      Max.   :16213   Max.   :15590.0   Max.   : 4925.0  
##  NA's   :69          NA's   :95      NA's   :100       NA's   :105      
##  HOSP_IP_SURG_DIFF  
##  Min.   :-3475.000  
##  1st Qu.: -111.500  
##  Median :   14.000  
##  Mean   :   -0.504  
##  3rd Qu.:  163.750  
##  Max.   : 3008.000  
##  NA's   :102
summary(df_hosp_16)
##     OSHPD_ID                                         FAC_NAME  
##  Min.   :106010735   ALVARADO HOSPITAL MEDICAL CENTER    :  2  
##  1st Qu.:106190390   ADVENTIST MEDICAL CENTER            :  1  
##  Median :106301338   ADVENTIST MEDICAL CENTER - REEDLEY  :  1  
##  Mean   :106816090   AHMC ANAHEIM REGIONAL MEDICAL CENTER:  1  
##  3rd Qu.:106380940   ALAMEDA HOSPITAL                    :  1  
##  Max.   :206351814   ALVARADO PARKWAY INSTITUTE B.H.S.   :  1  
##                      (Other)                             :373  
##       Year      HOSP_BED_LIC_TOTL HOSP_CENS_DAY_TOTL HOSP_DIS_TOTL  
##  Min.   :2016   Min.   :  10.0    Min.   :    99     Min.   :    5  
##  1st Qu.:2016   1st Qu.:  62.0    1st Qu.: 11826     1st Qu.: 1096  
##  Median :2016   Median : 133.5    Median : 26934     Median : 4254  
##  Mean   :2016   Mean   : 200.3    Mean   : 43239     Mean   : 6856  
##  3rd Qu.:2016   3rd Qu.: 279.5    3rd Qu.: 58597     3rd Qu.:10238  
##  Max.   :2016   Max.   :1500.0    Max.   :572258     Max.   :49736  
##                                   NA's   :8          NA's   :8      
##  HOSP_LICBED_DAY_TOTL MED_SURG_BED_LIC MED_SURG_CENS_DAY  MED_SURG_DIS  
##  Min.   :  3660       Min.   :  4.0    Min.   :     9    Min.   :    1  
##  1st Qu.: 22601       1st Qu.: 47.0    1st Qu.:  6946    1st Qu.: 1464  
##  Median : 49044       Median : 94.0    Median : 17060    Median : 4157  
##  Mean   : 73379       Mean   :121.7    Mean   : 25707    Mean   : 5916  
##  3rd Qu.:102288       3rd Qu.:176.8    3rd Qu.: 37197    3rd Qu.: 8868  
##  Max.   :549000       Max.   :591.0    Max.   :191110    Max.   :39285  
##                       NA's   :72       NA's   :85        NA's   :85     
##  MED_SURG_LICBED_DAY    SURG_IP           SURG_OP      HOSP_OP_SURG_DIFF
##  Min.   :  1464      Min.   :    3.0   Min.   :   46   Min.   :-7394.0  
##  1st Qu.: 16600      1st Qu.:  654.5   1st Qu.: 1015   1st Qu.: -194.8  
##  Median : 34404      Median : 1490.0   Median : 1992   Median :  106.0  
##  Mean   : 44467      Mean   : 2320.0   Mean   : 3079   Mean   :  246.6  
##  3rd Qu.: 64416      3rd Qu.: 3293.5   3rd Qu.: 3870   3rd Qu.:  471.5  
##  Max.   :216306      Max.   :15280.0   Max.   :19000   Max.   : 9588.0  
##  NA's   :71          NA's   :101       NA's   :104     NA's   :106      
##  HOSP_IP_SURG_DIFF  
##  Min.   :-2756.000  
##  1st Qu.: -194.000  
##  Median :  -26.000  
##  Mean   :    6.004  
##  3rd Qu.:  134.000  
##  Max.   : 3475.000  
##  NA's   :103

What about instances in which a hospital didn’t have inpatient and outpatient surgery cases in 2013 or 2016? There may be cases where a hospital opened a new outpatient surgery center to move business to in 2015, but we’re going to exclude these as well by removing the cases with ‘NA’ values in our difference columns

df_hosp_16 <- df_hosp_16 %>% filter(!is.na(HOSP_OP_SURG_DIFF)) 
df_hosp_16 <- df_hosp_16 %>% filter(!is.na(HOSP_IP_SURG_DIFF)) 

summary(df_hosp_16)
##     OSHPD_ID                                         FAC_NAME  
##  Min.   :106010735   ADVENTIST MEDICAL CENTER            :  1  
##  1st Qu.:106190519   ADVENTIST MEDICAL CENTER - REEDLEY  :  1  
##  Median :106301337   AHMC ANAHEIM REGIONAL MEDICAL CENTER:  1  
##  Mean   :106295268   ALAMEDA HOSPITAL                    :  1  
##  3rd Qu.:106380962   ALVARADO HOSPITAL MEDICAL CENTER    :  1  
##  Max.   :106580996   ANTELOPE VALLEY HOSPITAL            :  1  
##                      (Other)                             :261  
##       Year      HOSP_BED_LIC_TOTL HOSP_CENS_DAY_TOTL HOSP_DIS_TOTL  
##  Min.   :2016   Min.   : 10.0     Min.   :   279     Min.   :  185  
##  1st Qu.:2016   1st Qu.:101.0     1st Qu.: 15130     1st Qu.: 3262  
##  Median :2016   Median :178.0     Median : 36587     Median : 6859  
##  Mean   :2016   Mean   :221.4     Mean   : 45203     Mean   : 9025  
##  3rd Qu.:2016   3rd Qu.:328.5     3rd Qu.: 67095     3rd Qu.:13466  
##  Max.   :2016   Max.   :886.0     Max.   :258958     Max.   :49736  
##                                                                     
##  HOSP_LICBED_DAY_TOTL MED_SURG_BED_LIC MED_SURG_CENS_DAY  MED_SURG_DIS  
##  Min.   :  3660       Min.   : 10.0    Min.   :   279    Min.   :  149  
##  1st Qu.: 36966       1st Qu.: 59.0    1st Qu.:  8644    1st Qu.: 2250  
##  Median : 65148       Median :106.0    Median : 18790    Median : 4760  
##  Mean   : 81060       Mean   :133.7    Mean   : 27589    Mean   : 6570  
##  3rd Qu.:120231       3rd Qu.:192.0    3rd Qu.: 41634    3rd Qu.: 9712  
##  Max.   :324276       Max.   :591.0    Max.   :191110    Max.   :39285  
##                       NA's   :2        NA's   :4         NA's   :4      
##  MED_SURG_LICBED_DAY    SURG_IP           SURG_OP      HOSP_OP_SURG_DIFF
##  Min.   :  3660      Min.   :    3.0   Min.   :   46   Min.   :-7394.0  
##  1st Qu.: 21594      1st Qu.:  772.5   1st Qu.: 1065   1st Qu.: -199.0  
##  Median : 38796      Median : 1673.0   Median : 2007   Median :  105.0  
##  Mean   : 48971      Mean   : 2407.8   Mean   : 3127   Mean   :  240.7  
##  3rd Qu.: 70272      3rd Qu.: 3346.0   3rd Qu.: 3928   3rd Qu.:  463.5  
##  Max.   :216306      Max.   :15280.0   Max.   :19000   Max.   : 9588.0  
##  NA's   :2                                                              
##  HOSP_IP_SURG_DIFF 
##  Min.   :-2756.00  
##  1st Qu.: -195.00  
##  Median :  -26.00  
##  Mean   :    7.76  
##  3rd Qu.:  144.50  
##  Max.   : 3475.00  
## 

This really reduces our number of cases - we only see 278 facilities that submitted data in both 2013 and 2016 that had inpatient and outpatient surgery numbers in all cases. Based on our healthcare experience, we guess that fewer than half this reduction involves facilities closing but instead could be mergers, changes in reporting, license changes, etc.

  • Total Numbers

As a sanity check, let’s do a boxplot of the annual census days by facility and reporting year. Note that this will include facilities filtered out in the previous step. We’re not looking at just 2013 and 2016 yet.

# boxplot of census days
boxplot(df_hosputil_3$HOSP_CENS_DAY_TOTL ~ df_hosputil_3$Year, col="blue", main="Distribution of Annual Total Census Days", ylab="Annual Census Days", xlab="Years") 

Now, let’s take a look at the outpatient surgery volumes for the two years in question via histogram.

#histogram of outpatient surgeries for 2013
histogram(df_hosp_13$SURG_OP, type = "count", col="blue", main="Histogram: 2013 Outpatient Surgeries by Facility", ylab="Facility Count", xlab="Annual OP Surgeries by Facility")

#histogram of outpatient surgeries for 2016
histogram(df_hosp_16$SURG_OP, type = "count", col="blue", main="Histogram: 2016 Outpatient Surgeries by Facility", ylab="Facility Count", xlab="Annual OP Surgeries by Facility")

No readily apparently significant differences in the histograms of the two years is apparent. Let’s look at the delta - the calculated change columns from 2013 to 2016 from the 2016 data set. We also applied some additional filtering to the 2016 data.

ggplot(df_hosp_16, aes(x=HOSP_IP_SURG_DIFF, y=HOSP_OP_SURG_DIFF, size=HOSP_CENS_DAY_TOTL)) +
  geom_point(alpha=0.2)

Under our hunch that hospitals have shifted many inpatient procedures to outpatient procedures, the majority of points would be in quadrant 1 - upper left. Of course, we could also have points in the upper right as long as the outpatient difference exceeded the inpatient growth - or, in a declining market, the lower left if the inpatient drop is greater than the outpatient decrease.

Part 4 - Correlation & Inference

We’re interested in the relationship between inpatient and outpatient surgery cases in 2013 and then in 2016. We could test if there was a significant change in inpatient cases between the two years - using the whole state population - but that’s not really our question.

Instead, we want to know if there’s a linear relationship between inpatient and outpatient changes at hospitals. We’ll call the population correlation coefficient p. Our null hypothesis is that there is no relationship between changes in inpatient and outpatient surgical volumes among facilities. We will use Pearson’s correlation, a method to examine if there is a linear relationship among variables.

To avoid disappoint, we’re going to commit the grievous sin of working our methods before ensuring the conditions are right for using them.

Let’s get a correlation.

cor.test(df_hosp_16$HOSP_IP_SURG_DIFF,df_hosp_16$HOSP_OP_SURG_DIFF, method="pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  df_hosp_16$HOSP_IP_SURG_DIFF and df_hosp_16$HOSP_OP_SURG_DIFF
## t = 3.9646, df = 265, p-value = 9.464e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1199881 0.3468195
## sample estimates:
##       cor 
## 0.2366254

Here we see our t-test statistic values of 3.9646. Our degrees of freedom reflects our cases less 2. Our p-value indicates the significance of the test - it’s well less than the industry standard 0.05, so we could say that there’s a significant but small correlation between a change in inpatient surgery cases and outpatient surgery cases with a correlation (r) of .2366254. The lower-end of the 95% confidence interval exceeds 0, which would represents our null hypothesis that there is no relationship.

Let’s plot our relationship.

library("ggpubr")
ggscatter(df_hosp_16, x = df_hosp_16$HOSP_IP_SURG_DIFF, y = df_hosp_16$HOSP_OP_SURG_DIFF, 
          add = "reg.line", conf.int = TRUE,
        cor.method = "pearson",
         xlab = "Changes in Inpatient Surgery Volumes: 2013-2016", ylab = "Changes in Outpatient Surgery Volumes: 2013-2016")

Oh, that doesn’t look so good. Pearson’s correlation relies on five assumptions:

  1. Variables must either be interval or ratio measurements - yes, interval
  2. Variables must be approximately normally distributed - let’s check
hist(df_hosp_16$HOSP_IP_SURG_DIFF)

hist(df_hosp_16$HOSP_OP_SURG_DIFF)

Fairly normal.

  1. There is a linear relationship - utoh. Our first scatterplot did not show this.
  2. Outliers are kept to a minimum or removed - we have some.
  3. There is homscedasticity of the data - nope, variances are not uniform along our best fit line.

So we did our Pearson Correlation testing prematurely for practice, but it did not satisfy the conditions.

Let’s take a step back and go for something simpler. We’ll pretend our full population of CA hospitals is a valid sample of US hospitals from 2013 to 2016 and construct a 95% confidence interval for the increase in outpatient surgical cases per facility. Above, we saw a nearly normal distribution.

We’re pretending it’s an independent sample of more than 30 cases that is less than 10% of the total population. Given California’s size, it’s hospitals likely constitute more than 10% of US hospitals, but we’re already bending rules, so…

mean_op <- mean(df_hosp_16$HOSP_OP_SURG_DIFF)
#Here, we're cheating a bit too
sd_op <- sd(df_hosp_16$HOSP_OP_SURG_DIFF)
n_op <- 267

se_op <- qnorm(.975) * (sd_op/sqrt(n_op))

mean_op - se_op
## [1] 60.26167
mean_op + se_op
## [1] 421.1466

So, in another world where we have true independent sample, we have 95% confidence that the true mean growth in outpatient surigcal cases from 2013 to 2016 was between 60.26 and 421.15.

Part 5 - Conclusion

We have looked closely at data related to surgical procedures in California and how those numbers changed from 2013 to 2016, and we draw no conclusions other than the aging of the baby boomers may be overwhelming changes in business strategy in terms of growth in surgical cases.

We also conclude that we should have picked a data set more suited to testing what we learned this semester than a great whole population data set in which we have professional interest. But it was an educational exercise nonetheless.