Introduction

Importance: Using hospital quality ratings, patients are able to make a better decision in what hospital they want to be treated and where the best care is available in state of California, based on overall hospital performance or based on particular medical condition or procedure.

Question: Can we predict hospital quality ratings based on risk adjusted mortality rates, number of deaths, number of cases, medical procedures performed and medical conditions treated for 2012-2013?

Description of Data Set

Dataset: is available from California Hospital Inpatient Mortality Rates and Quality Ratings, 2012-2013.

Description of dataset: The dataset contains risk-adjusted mortality rates, and number of deaths and cases for 6 medical conditions treated (Acute Stroke, Acute Myocardial Infarction, Heart Failure, Gastrointestinal Hemorrhage, Hip Fracture and Pneumonia) and 6 procedures performed (Abdominal Aortic Aneurysm Repair, Carotid Endarterectomy, Craniotomy, Esophageal Resection, Pancreatic Resection, Percutaneous Coronary Intervention) in California hospitals for 2012 and 2013. This dataset does not include conditions treated or procedures performed in outpatient settings.

Description, Analysis and Cleaning of Variables in the Data Set

Load the data from csv file.

setwd("C:/Users/postdoc/Dropbox (Personal)/SpringBoard Fund/Rprojects/")
data <- read.csv("California_Hospital_Inpatient_Mortality_Rates_and_Quality_Ratings__2012-2013.csv",sep=",",header=TRUE)
df <- tbl_df(data)
# glimpse(df)

Dataset: 11169 observations and 12 variables.

Variables with missing values:

Remove missing values, because number of missing values consists of half of dataset.

df_clean <- df[which(is.na(df$X..of.Cases)==F),]

Clean Dataset: 6165 observations and 12 variables.

Variables with no missing values:

summary(df_clean$Hospital.Ratings)
## As Expected      Better       Worse 
##        5797         158         210
summary(df_clean$Procedure.Condition)
##                AAA Repair              Acute Stroke 
##                       283                       617 
##  Acute Stroke Hemorrhagic     Acute Stroke Ischemic 
##                       466                       615 
## Acute Stroke Subarachnoid                       AMI 
##                       241                       590 
##    Carotid Endarterectomy                Craniotomy 
##                       404                       298 
##      Esophageal Resection             GI Hemorrhage 
##                        75                       622 
##             Heart Failure              Hip Fracture 
##                       616                       426 
##         Pancreatic Cancer          Pancreatic Other 
##                       142                       130 
##      Pancreatic Resection                       PCI 
##                       190                       299 
##                 Pneumonia 
##                       151

Decoding Procedure/Condition variable.

According to the American Stroke Association (ASA), strokes can be classified into 2 main categories: 87% are ischemic strokes, caused by blockage of an artery; 13% are hemorrhagic strokes, caused by bleeding. Ischemic strokes are further divided into 2 groups: thrombotic and embolic strokes. Hemorrhagic strokes are divided into 2 main categories: intracerebral and subarachnoid hemorrhages.

Our clean dataset has four categories for Acute Stroke:

Within each hospital, there are different notations for Acute Stroke variable. It suggests that different doctor uses different notations for the condition. These four categories are combined in one: Acute Stroke.

df_clean$Procedure.Condition <- gsub("Acute Stroke .*","Acute Stroke",df_clean$Procedure.Condition)
df_clean$Procedure.Condition <- factor(df_clean$Procedure.Condition)

Two additional categories are present in Procedure/Condition variable:

These categories are separate medical conditions and are not combined in one category.

The Procedure.Condition variable contains 6 medical procedures and 8 medical conditions. To indicate what procedure was performed or what condition was treated, the Medical_Categorey variable was added to the clean dataset.

df_clean <- df_clean %>% 
  mutate(Medical_Category = ifelse(grepl("Repair",Procedure.Condition) | grepl("Endarterectomy",Procedure.Condition) | grepl("Craniotomy",Procedure.Condition) | grepl("Resection",Procedure.Condition) | grepl("PCI",Procedure.Condition),               "Procedure", "Condition"))
#glimpse(df_clean)

Explanatory Data Analysis

Density Plots for # of Cases, # of Deaths and Risk Adjusted Mortality Rate by Hospital Ratings.

p1 <- ggplot(df_clean,aes(log(X..of.Cases),fill=factor(Hospital.Ratings),colour=factor(Hospital.Ratings)))+
  geom_density(alpha = 0.1)

p2 <- ggplot(df_clean,aes(log(X..of.Deaths),fill=factor(Hospital.Ratings),colour=factor(Hospital.Ratings)))+
  geom_density(alpha = 0.1)

p3 <- ggplot(df_clean,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Hospital.Ratings),colour=factor(Hospital.Ratings)))+
  geom_density(alpha = 0.1)

grid.arrange(p1, p2, p3, ncol=1)

Conclusions 1:

Associations between medical procedures or conditions with hospital ratings, number of cases, number of deaths and risk adjusted mortality rate.

Procedures.

df_p <- df_clean[which(df_clean$Medical_Category=="Procedure"),]

df_p_all <- df_p %>% 
  group_by(Procedure.Condition) %>% 
  summarise(all_cases = sum(X..of.Cases), 
            all_deaths = sum(X..of.Deaths),
            all_mortality_rate = sum(Risk.Adjusted.Mortality.Rate))

# df_p_all %>% arrange(desc(all_cases))
# df_p_all %>% arrange(desc(all_deaths))
# df_p_all %>% arrange(desc(all_mortality_rate))

df_p_all 
## # A tibble: 6 x 4
##      Procedure.Condition all_cases all_deaths all_mortality_rate
##                   <fctr>     <int>      <int>              <dbl>
## 1             AAA Repair      4927         59              508.5
## 2 Carotid Endarterectomy     12478         60              290.2
## 3             Craniotomy     30164       2159             2354.6
## 4   Esophageal Resection       619         28              436.9
## 5   Pancreatic Resection      3356         93             1002.8
## 6                    PCI     78660       2028              793.6

Conditions.

df_c <- df_clean[which(df_clean$Medical_Category=="Condition"),]

df_c_all <- df_c %>% 
  group_by(Procedure.Condition) %>% 
  summarise(all_cases = sum(X..of.Cases), 
            all_deaths = sum(X..of.Deaths),
            all_mortality_rate = sum(Risk.Adjusted.Mortality.Rate))

#df_c_all %>% arrange(desc(all_cases))
#df_c_all %>% arrange(desc(all_deaths))
#df_c_all %>% arrange(desc(all_mortality_rate))

df_c_all
## # A tibble: 8 x 4
##   Procedure.Condition all_cases all_deaths all_mortality_rate
##                <fctr>     <int>      <int>              <dbl>
## 1        Acute Stroke    217956      20461            26582.9
## 2                 AMI     93594       5731             3863.4
## 3       GI Hemorrhage     94804       2099             1597.8
## 4       Heart Failure    155066       4778             2200.0
## 5        Hip Fracture     32245        744              945.8
## 6   Pancreatic Cancer      1787         43              632.0
## 7    Pancreatic Other      1425         41              590.0
## 8           Pneumonia     20630       1019              552.5

Hospital Ratings.

prop.table(table(df_clean$Procedure.Condition,df_clean$Hospital.Ratings))*100
##                         
##                          As Expected     Better      Worse
##   AAA Repair               4.5579886  0.0000000  0.0324412
##   Acute Stroke            29.0673155  1.2003244  1.1841038
##   AMI                      8.6942417  0.3730738  0.5028386
##   Carotid Endarterectomy   6.4395783  0.0000000  0.1135442
##   Craniotomy               4.2497972  0.2919708  0.2919708
##   Esophageal Resection     1.2165450  0.0000000  0.0000000
##   GI Hemorrhage            9.6512571  0.1622060  0.2757502
##   Heart Failure            9.0186537  0.3730738  0.6001622
##   Hip Fracture             6.8450933  0.0000000  0.0648824
##   Pancreatic Cancer        2.2708840  0.0000000  0.0324412
##   Pancreatic Other         2.0762368  0.0000000  0.0324412
##   Pancreatic Resection     3.0170316  0.0000000  0.0648824
##   PCI                      4.5417680  0.0973236  0.2108678
##   Pneumonia                2.3844282  0.0648824  0.0000000

Conclusions 2:

Density Plots for Risk Adjusted Mortality Rate by Procedures Performed and Hospital Ratings.

p6 <- ggplot(df_p,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Hospital.Ratings),colour=factor(Hospital.Ratings)))+
  geom_density(alpha = 0.1)+
  theme(legend.position='bottom')+
  facet_wrap(~ Procedure.Condition, ncol=2, scales="free_y")

p6

Density Plots for Risk Adjusted Mortality Rate by Conditions Treated and Hospital Ratings.

p9 <- ggplot(df_c,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Hospital.Ratings),colour=factor(Hospital.Ratings)))+
  geom_density(alpha = 0.1)+
  theme(legend.position='bottom')+
  facet_wrap(~ Procedure.Condition, ncol=2, scales="free_y")

p9

Summary: The most contributing procedues and conditions.

Mapping and summary of overall hospital quality ratings and mean mortality rate among all conditions and procedures.

Summary of hospital ratings over all conditions and procedues.

df_clean <- df_clean %>% mutate(ratings =
           ifelse(grepl("As Expected",Hospital.Ratings),"0",                        
           ifelse(grepl("Better",Hospital.Ratings),"1",
           ifelse(grepl("Worse",Hospital.Ratings),"-1",NA))))
df_clean$ratings <- as.numeric(df_clean$ratings)

all_ratings <- df_clean %>% 
                group_by(Hospital,Latitude,Longitude) %>% 
                summarise(all_ratings = sum(ratings), 
                          mean_mortality_rate = mean(Risk.Adjusted.Mortality.Rate)) %>% 
                mutate(ratings =
                         ifelse(all_ratings > 0,"Better",
                         ifelse(all_ratings < 0, "Worse","As Expected")))
all_ratings$ratings <- as.factor(all_ratings$ratings)
all_ratings <- tbl_df(all_ratings)

Overall Hospital Ratings:

summary(all_ratings$ratings)
## As Expected      Better       Worse 
##         172          69          99
  • Hospitals with better quality ratings:
all_ratings %>% arrange(desc(all_ratings)) %>% select(Hospital,all_ratings) %>% slice(1:10)
## # A tibble: 10 x 2
##                                                        Hospital
##                                                          <fctr>
## 1                   Kaiser Foundation Hospital – Redwood City
## 2                         Kaiser Foundation Hospital – Sunset
## 3                             Centinela Hospital Medical Center
## 4                                        Scripps Green Hospital
## 5                                   Cedars Sinai Medical Center
## 6                                        Desert Valley Hospital
## 7                                Encino Hospital Medical Center
## 8          Glendale Adventist Medical Center – Wilson Terrace
## 9  Kaiser Foundation Hospital – Rehabilitation Center Vallejo
## 10                 Kaiser Foundation Hospital – San Francisco
## # ... with 1 more variables: all_ratings <dbl>
  • Hospitals with worse quality ratings:
all_ratings %>% arrange(all_ratings) %>% select(Hospital,all_ratings) %>% slice(1:10)
## # A tibble: 10 x 2
##                                                               Hospital
##                                                                 <fctr>
## 1                                       Palomar Health Downtown Campus
## 2  Los Angeles County/University of Southern California Medical Center
## 3                                       San Francisco General Hospital
## 4                                       Santa Barbara Cottage Hospital
## 5                                    Arrowhead Regional Medical Center
## 6                                                   Grossmont Hospital
## 7                                              Sharp Memorial Hospital
## 8                                             Antelope Valley Hospital
## 9                                               Doctors Medical Center
## 10                                Good Samaritan Hospital – San Jose
## # ... with 1 more variables: all_ratings <dbl>
  • Hospitals with the highest mean mortality rate:
all_ratings %>% arrange(desc(mean_mortality_rate)) %>% select(Hospital,mean_mortality_rate)  %>% slice(1:10)
## # A tibble: 10 x 2
##                                Hospital mean_mortality_rate
##                                  <fctr>               <dbl>
## 1           Memorial Hospital Los Banos            28.35714
## 2          Mountains Community Hospital            27.45000
## 3    Santa Ynez Valley Cottage Hospital            26.47500
## 4       Biggs Gridley Memorial Hospital            24.31111
## 5      Coalinga Regional Medical Center            23.71667
## 6       George L. Mee Memorial Hospital            19.30000
## 7  Adventist Medical Center – Reedley            18.83333
## 8     Central Valley Specialty Hospital            18.70000
## 9          San Joaquin General Hospital            14.68095
## 10       Goleta Valley Cottage Hospital            14.26000
  • Hospitals with the lowest mean mortality rate:
all_ratings %>% arrange(mean_mortality_rate) %>% select(Hospital,mean_mortality_rate) %>% slice(1:10)
## # A tibble: 10 x 2
##                                           Hospital mean_mortality_rate
##                                             <fctr>               <dbl>
## 1                       Corcoran District Hospital                   0
## 2       Eastern Plumas Hospital – Portola Campus                   0
## 3                             Glenn Medical Center                   0
## 4          Good Samaritan Hospital – Bakersfield                   0
## 5                        Hoag Orthopedic Institute                   0
## 6                  Kern Valley Healthcare District                   0
## 7  Laguna Honda Hospital and Rehabilitation Center                   0
## 8                                 Mammoth Hospital                   0
## 9                           Southern Inyo Hospital                   0
## 10                              Tehachapi Hospital                   0

Mapping of overall hospital ratings and mean mortality rates.

CAmap <- get_map(location="California",source="google",maptype="roadmap",crop=FALSE,zoom=6) 
ggmap (CAmap) +
  geom_point(aes(x=Longitude,y=Latitude,size=mean_mortality_rate,colour=ratings),data=all_ratings,alpha=0.7)+
  scale_colour_manual(values=c("Worse" = "darkred","Better" = "darkblue","As Expected" = "darkgrey"))+
  scale_size(range = c(0, 10))

Analysis 2:

I will analyze which condition or procedure has the best or worse hospital ratings, and map these results only for selected conditions or procedures.

Predictions

To predict hospital ratings I will use classification decision trees, random forests or multinomial logistic regression. First, I will use three models to train them on 2012-2013 data and access which model gives the best performance on the training data, then I will test three model performances on 2014 test data.

Recommendations to Patients

Hospitals with the best overall ratings and the lowest mean mortality rate in state of California.

best_ratings <- all_ratings %>% arrange(desc(all_ratings)) %>% slice(1:25)
lowest_mort_rate <- all_ratings %>% arrange(mean_mortality_rate) %>% slice(1:25)
best_lowest <- bind_rows(best_ratings,lowest_mort_rate)

CAmap <- get_map(location="California",source="google",maptype="roadmap",crop=FALSE,zoom=6) 
ggmap (CAmap) +
  geom_point(aes(x=Longitude,y=Latitude,size=ratings,colour=mean_mortality_rate),data=best_lowest,alpha=0.7)+
  scale_colour_distiller(type = "div", palette = "RdBu", direction = -1)

Hospitals with …

  1. Recommend to patients which hospital has the best ratings for particular medical condition or procedure in state of California;

Hospitals with …

  1. Recommend which hospital will have the best care in the future using predicted hospital ratings.