First Dataset is available from CA Hospital Mortality Rates for 2012-2013..

Description of first dataset: The dataset contains risk-adjusted mortality rates, quality ratings, and number of deaths and cases for 6 medical conditions treated (Acute Stroke, Acute Myocardial Infarction, Heart Failure, Gastrointestinal Hemorrhage, Hip Fracture and Pneumonia) and 6 procedures performed (Abdominal Aortic Aneurysm Repair, Carotid Endarterectomy, Craniotomy, Esophageal Resection, Pancreatic Resection, Percutaneous Coronary Intervention) in California hospitals for 2012 and 2013. This dataset does not include conditions treated or procedures performed in outpatient settings. Please refer to statewide table for CA overall mortality rates.

Second Dataset is availabe from Hospital Profitability for 2009-2013.

Description of second dataset: The dataset contains income statement information for all licensed, comparable hospitals in the state of California. Kaiser hospitals, state mental hospitals, psychiatric health facilities, and hospitals with mainly long-term care patients are excluded. Deductions from Revenue, Net Patient Revenue, Net from Operations (Operating Revenue less Operating Expense), and Net Income for public hospitals has been adjusted for Disproportionate Share intergovernmental transfers for funding the Disproportionate Share Hospital Program. The program gets federal matching funds to pay supplemental payments to hospitals with a disproportionate share of uninsured, underinsured, and Medi-Cal patients.

Question: Can we predict hospital ratings based on risk adjusted mortality rates, number of deaths, number of cases, medical procedures performed, medical conditions treated and hospital profitability for 2012-2013?

Load the data from csv file.

data <- read.csv("California_Hospital_Inpatient_Mortality_Rates_and_Quality_Ratings__2012-2013.csv",sep=",",header=TRUE)
df <- tbl_df(data)
glimpse(df)
## Observations: 11,169
## Variables: 12
## $ Year                         <int> 2012, 2012, 2012, 2012, 2012, 201...
## $ County                       <fctr> Alameda, Alameda, Alameda, Alame...
## $ Hospital                     <fctr> Alameda Hospital, Alameda Hospit...
## $ OSHPDID                      <int> 106010735, 106010735, 106010735, ...
## $ Procedure.Condition          <fctr> PCI, AMI, Acute Stroke Ischemic,...
## $ Risk.Adjusted.Mortality.Rate <dbl> NA, 5.5, 2.3, NA, 1.3, 0.0, NA, 2...
## $ X..of.Deaths                 <int> NA, 4, 1, NA, 1, 0, NA, 5, NA, 5,...
## $ X..of.Cases                  <int> NA, 41, 60, NA, 65, 5, NA, 11, NA...
## $ Hospital.Ratings             <fctr> As Expected, As Expected, As Exp...
## $ Longitude                    <dbl> -122.2536, -122.2536, -122.2536, ...
## $ Latitude                     <dbl> 37.76295, 37.76295, 37.76295, 37....
## $ location1                    <fctr> (37.762953, -122.25362), (37.762...

Load the data from the website.

setwd("C:/Users/maria/Dropbox (Personal)/SpringBoard Fund/Rprojects/")
download.file("https://chhs.data.ca.gov/resource/dayj-x3m3.csv",destfile="mortality_data.csv")
data <- read.csv("mortality_data.csv",sep=",",header=TRUE)
df <- tbl_df(data)
glimpse(df)

Description, analysis and cleaning of variables in the dataset.

Dataset: 11169 observations and 12 variables.

Variables with missing values:

Remove missing values, since number of missing values consists of half of dataset.

df_clean <- df[which(is.na(df$X..of.Cases)==F),]

Clean Dataset: 6165 observations and 12 variables.

Variables with no missing values:

summary(df_clean$Hospital.Ratings)
## As Expected      Better       Worse 
##        5797         158         210
summary(df_clean$Procedure.Condition)
##                AAA Repair              Acute Stroke 
##                       283                       617 
##  Acute Stroke Hemorrhagic     Acute Stroke Ischemic 
##                       466                       615 
## Acute Stroke Subarachnoid                       AMI 
##                       241                       590 
##    Carotid Endarterectomy                Craniotomy 
##                       404                       298 
##      Esophageal Resection             GI Hemorrhage 
##                        75                       622 
##             Heart Failure              Hip Fracture 
##                       616                       426 
##         Pancreatic Cancer          Pancreatic Other 
##                       142                       130 
##      Pancreatic Resection                       PCI 
##                       190                       299 
##                 Pneumonia 
##                       151

Decoding Procedure/Condition variable.

According to the American Stroke Association (ASA), strokes can be classified into 2 main categories: 87% are ischemic strokes, caused by blockage of an artery; 13% are hemorrhagic strokes, caused by bleeding. Ischemic strokes are further divided into 2 groups: thrombotic and embolic strokes. Hemorrhagic strokes are divided into 2 main categories: intracerebral and subarachnoid hemorrhages.

Our clean dataset has four categories for Acute Stroke:

Within each hospital, there are different notations for Acute Stroke variable. It suggests that different doctor uses different notations for the condition. These four categories are combined in one: Acute Stroke.

df_clean$Procedure.Condition <- gsub("Acute Stroke .*","Acute Stroke",df_clean$Procedure.Condition)
df_clean$Procedure.Condition <- factor(df_clean$Procedure.Condition)

Two additional categories are present in Procedure/Condition variable:

These categories are separate medical conditions and are not combined in one category.

The Procedure.Condition variable contains 6 medical procedures and 8 medical conditions. To indicate what procedure was performed or what condition was treated, the Medical_Categorey variable was added to the Clean Dataset.

df_clean <- df_clean %>% 
  mutate(Medical_Category = ifelse(grepl("Repair",Procedure.Condition) | grepl("Endarterectomy",Procedure.Condition) | grepl("Craniotomy",Procedure.Condition) | grepl("Resection",Procedure.Condition) | grepl("PCI",Procedure.Condition),               "Procedure", "Condition"))

glimpse(df_clean)
## Observations: 6,165
## Variables: 13
## $ Year                         <int> 2012, 2012, 2012, 2012, 2012, 201...
## $ County                       <fctr> Alameda, Alameda, Alameda, Alame...
## $ Hospital                     <fctr> Alameda Hospital, Alameda Hospit...
## $ OSHPDID                      <int> 106010735, 106010735, 106010735, ...
## $ Procedure.Condition          <fctr> AMI, Acute Stroke, GI Hemorrhage...
## $ Risk.Adjusted.Mortality.Rate <dbl> 5.5, 2.3, 1.3, 0.0, 25.1, 2.6, 8....
## $ X..of.Deaths                 <int> 4, 1, 1, 0, 5, 5, 7, 0, 5, 18, 8,...
## $ X..of.Cases                  <int> 41, 60, 65, 5, 11, 139, 72, 35, 1...
## $ Hospital.Ratings             <fctr> As Expected, As Expected, As Exp...
## $ Longitude                    <dbl> -122.2536, -122.2536, -122.2536, ...
## $ Latitude                     <dbl> 37.76295, 37.76295, 37.76295, 37....
## $ location1                    <fctr> (37.762953, -122.25362), (37.762...
## $ Medical_Category             <chr> "Condition", "Condition", "Condit...

Initial Explanatory Data Analysis

Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for 2012 and 2013 years.

p1 <- ggplot(df_clean,aes(log(X..of.Deaths),fill=factor(Year),colour=factor(Year)))+
  #geom_histogram(position="dodge")
  geom_density(alpha = 0.1)

p2 <- ggplot(df_clean,aes(log(X..of.Cases),fill=factor(Year),colour=factor(Year)))+
  #geom_histogram(position="dodge")
  geom_density(alpha = 0.1)

p3 <- ggplot(df_clean,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Year),colour=factor(Year)))+
  #geom_histogram(position="dodge")
  geom_density(alpha = 0.1)

grid.arrange(p1, p2, p3, ncol=1)

Histrograms of # of Deaths for Hospital.Ratings.

df_hr_ae <- df_clean[which(df_clean$Hospital.Ratings=="As Expected"),]
df_hr <- df_clean[which(df_clean$Hospital.Ratings!="As Expected"),]

p1 <- ggplot(df_hr_ae,aes(log(X..of.Deaths),fill=factor(Hospital.Ratings),
                          colour=factor(Hospital.Ratings)))+
  #geom_histogram(position="dodge")
  geom_density(alpha = 0.1)

p4 <- ggplot(df_hr,aes(log(X..of.Deaths),fill=factor(Hospital.Ratings),
                       colour=factor(Hospital.Ratings)))+
  #geom_histogram(position="dodge")
  geom_density(alpha = 0.1)

grid.arrange(p1, p4, ncol=1)

Histrograms of # of Cases for Hospital.Ratings.

p2 <- ggplot(df_hr_ae,aes(log(X..of.Cases),fill=factor(Hospital.Ratings),
                          colour=factor(Hospital.Ratings)))+
  #geom_histogram(position="dodge")
  geom_density(alpha = 0.1)

p5 <- ggplot(df_hr,aes(log(X..of.Cases),fill=factor(Hospital.Ratings),
                       colour=factor(Hospital.Ratings)))+
  #geom_histogram(position="dodge")
  geom_density(alpha = 0.1)

grid.arrange(p2, p5, ncol=1)

Histrograms of Risk.Adjusted.Mortality.Rate for Hospital.Ratings.

p3 <- ggplot(df_hr_ae,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Hospital.Ratings),
                          colour=factor(Hospital.Ratings)))+
  #geom_histogram(position="dodge")+
  geom_density(alpha = 0.1)

p6 <- ggplot(df_hr,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Hospital.Ratings),
                       colour=factor(Hospital.Ratings)))+
  #geom_histogram(position="dodge")+
  geom_density(alpha = 0.1)

grid.arrange(p3, p6, ncol=1)

Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for Year and Hospital.Ratings.

p4 <- ggplot(df_hr_ae,aes(log(X..of.Deaths),fill=factor(Year),colour=factor(Year)))+
  #geom_histogram(position="dodge")+
  geom_density(alpha = 0.1)+
  facet_grid(Hospital.Ratings ~ ., scale="free_y", space="free_y")

p5 <- ggplot(df_hr_ae,aes(log(X..of.Cases),fill=factor(Year),colour=factor(Year)))+
  #geom_histogram(position="dodge")+
  geom_density(alpha = 0.1)+
  facet_grid(Hospital.Ratings ~ ., scale="free_y", space="free_y")

p6 <- ggplot(df_hr_ae,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Year),colour=factor(Year)))+
  #geom_histogram(position="dodge")+
  geom_density(alpha = 0.1)+
  facet_grid(Hospital.Ratings ~ ., scale="free_y", space="free_y")

grid.arrange(p4, p5, p6, ncol=1)

p7 <- ggplot(df_hr,aes(log(X..of.Deaths),fill=factor(Year),colour=factor(Year)))+
  #geom_histogram(position="dodge")+
  geom_density(alpha = 0.1)+
  facet_grid(Hospital.Ratings ~ ., scale="free_y", space="free_y")

p8 <- ggplot(df_hr,aes(log(X..of.Cases),fill=factor(Year),colour=factor(Year)))+
  #geom_histogram(position="dodge")+
  geom_density(alpha = 0.1)+
  facet_grid(Hospital.Ratings ~ ., scale="free_y", space="free_y")

p9 <- ggplot(df_hr,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Year),colour=factor(Year)))+
  #geom_histogram(position="dodge")+
  geom_density(alpha = 0.1)+
  facet_grid(Hospital.Ratings ~ ., scale="free_y", space="free_y")

grid.arrange(p7, p8, p9, ncol=2)

Conclusion 1: There is association between the risk adjusted mortality rate and hospital ratings.

Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for Procedures Performed.

df_p <- df_clean[which(df_clean$Medical_Category=="Procedure"),]

p4 <- ggplot(df_p,aes(log(X..of.Deaths),fill=factor(Procedure.Condition),colour=factor(Procedure.Condition)))+
  geom_histogram(position="stack")
  #geom_density(alpha = 0.1)

p4

p5 <- ggplot(df_p,aes(log(X..of.Cases),fill=factor(Procedure.Condition),colour=factor(Procedure.Condition)))+
  geom_histogram(position="stack")
  #geom_density(alpha = 0.1)

p5

p6 <- ggplot(df_p,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Procedure.Condition),colour=factor(Procedure.Condition)))+
  geom_histogram(position="stack")
  #geom_density(alpha = 0.1)

p6

Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for Conditions Treated.

df_c <- df_clean[which(df_clean$Medical_Category=="Condition"),]

p7 <- ggplot(df_c,aes(log(X..of.Deaths),fill=factor(Procedure.Condition),colour=factor(Procedure.Condition)))+
  geom_histogram(position="stack")
  #geom_density(alpha = 0.1)

p7

p8 <- ggplot(df_c,aes(log(X..of.Cases),fill=factor(Procedure.Condition),colour=factor(Procedure.Condition)))+
  geom_histogram(position="stack")
  #geom_density(alpha = 0.1)

p8

p9 <- ggplot(df_c,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Procedure.Condition),colour=factor(Procedure.Condition)))+
  geom_histogram(position="stack")
  #geom_density(alpha = 0.1)

p9

#grid.arrange(p4, p5, p6, ncol=1)

Conclusion 2:

Histrograms for Procedures for Hospital.Ratings.

df_p_hr_ae <- df_p[which(df_p$Hospital.Ratings=="As Expected"),]
df_p_hr <- df_p[which(df_p$Hospital.Ratings!="As Expected"),]

p1 <- ggplot(df_p_hr_ae,aes(Procedure.Condition, colour=factor(Hospital.Ratings),fill=factor(Hospital.Ratings))) +
  stat_count(width = 0.3, position="dodge")+
  labs(title="Histogram for Procedures")+
  labs(x="Procedure", y="Count")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p2 <- ggplot(df_p_hr,aes(Procedure.Condition,colour=factor(Hospital.Ratings),fill=factor(Hospital.Ratings))) +
  stat_count(width = 0.3, position="dodge")+
  labs(title="Histogram for Procedures")+
  labs(x="Procedure", y="Count")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

grid.arrange(p1, p2, ncol=1)

Histrograms for Conditions for Hospital.Ratings.

df_c_hr_ae <- df_c[which(df_c$Hospital.Ratings=="As Expected"),]
df_c_hr <- df_c[which(df_c$Hospital.Ratings!="As Expected"),]

p3 <- ggplot(df_c_hr_ae,aes(Procedure.Condition, colour=factor(Hospital.Ratings),fill=factor(Hospital.Ratings))) +
  stat_count(width = 0.3, position="dodge")+
  labs(title="Histogram for Conditions")+
  labs(x="Condition", y="Count")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p4 <- ggplot(df_c_hr,aes(Procedure.Condition,colour=factor(Hospital.Ratings),fill=factor(Hospital.Ratings))) +
  stat_count(width = 0.3, position="dodge")+
  labs(title="Histogram for Consitions")+
  labs(x="Condition", y="Count")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

grid.arrange(p3, p4, ncol=1)

Associations between medical procedure or condition with hospital ratings.

table(df_clean$Procedure.Condition,df_clean$Hospital.Ratings)
##                         
##                          As Expected Better Worse
##   AAA Repair                     281      0     2
##   Acute Stroke                  1792     74    73
##   AMI                            536     23    31
##   Carotid Endarterectomy         397      0     7
##   Craniotomy                     262     18    18
##   Esophageal Resection            75      0     0
##   GI Hemorrhage                  595     10    17
##   Heart Failure                  556     23    37
##   Hip Fracture                   422      0     4
##   Pancreatic Cancer              140      0     2
##   Pancreatic Other               128      0     2
##   Pancreatic Resection           186      0     4
##   PCI                            280      6    13
##   Pneumonia                      147      4     0

Conclusion 3:

Summary (from conclusions):

Risk.Adjusted.Mortality.Rate versus X..of.Cases or X..of.Deaths.

fit1 <- lm(Risk.Adjusted.Mortality.Rate ~ X..of.Cases, data=df_clean)
#summary(fit1)
p1 <- ggplot(df_clean, aes(X..of.Cases,Risk.Adjusted.Mortality.Rate))+ 
  geom_point(alpha = 0.1)+
  stat_smooth(method='lm')

fit2 <- lm(Risk.Adjusted.Mortality.Rate ~ X..of.Deaths, data=df_clean)
#summary(fit2)
p2 <- ggplot(df_clean, aes(X..of.Deaths,Risk.Adjusted.Mortality.Rate))+
  geom_point(alpha = 0.1)+
  stat_smooth(method='lm')

grid.arrange(p1, p2, ncol=2)

Mapping of Hospital.Ratings. (need to understand more here)

CAmap <- get_map(location="California",source="google",maptype="roadmap",crop=FALSE,zoom=6)
ggmap (CAmap) +
  geom_point(aes(x=Longitude,y=Latitude,size=Hospital.Ratings),data=df_clean,alpha=.3,color="darkred")#,size=2)