Importance: Using hospital quality ratings, patients are able to make a better decision in what hospital they want to be treated and where the best care is available in state of California, based on overall hospital performance or based on particular medical condition or procedure.
Question: Can we predict hospital quality ratings based on risk adjusted mortality rates, number of deaths, number of cases, medical procedures performed and medical conditions treated for 2012-2013?
Dataset: is available from California Hospital Inpatient Mortality Rates and Quality Ratings, 2012-2013.
Description of dataset: The dataset contains risk-adjusted mortality rates, and number of deaths and cases for 6 medical conditions treated (Acute Stroke, Acute Myocardial Infarction, Heart Failure, Gastrointestinal Hemorrhage, Hip Fracture and Pneumonia) and 6 procedures performed (Abdominal Aortic Aneurysm Repair, Carotid Endarterectomy, Craniotomy, Esophageal Resection, Pancreatic Resection, Percutaneous Coronary Intervention) in California hospitals for 2012 and 2013. This dataset does not include conditions treated or procedures performed in outpatient settings.
Load the data from csv file.
setwd("C:/Users/postdoc/Dropbox (Personal)/SpringBoard Fund/Rprojects/")
data <- read.csv("California_Hospital_Inpatient_Mortality_Rates_and_Quality_Ratings__2012-2013.csv",sep=",",header=TRUE)
df <- tbl_df(data)
# glimpse(df)
Dataset: 11169 observations and 12 variables.
Variables with missing values:
Remove missing values, because number of missing values consists of half of dataset.
df_clean <- df[which(is.na(df$X..of.Cases)==F),]
Clean Dataset: 6165 observations and 12 variables.
Variables with no missing values:
summary(df_clean$Hospital.Ratings)
## As Expected Better Worse
## 5797 158 210
summary(df_clean$Procedure.Condition)
## AAA Repair Acute Stroke
## 283 617
## Acute Stroke Hemorrhagic Acute Stroke Ischemic
## 466 615
## Acute Stroke Subarachnoid AMI
## 241 590
## Carotid Endarterectomy Craniotomy
## 404 298
## Esophageal Resection GI Hemorrhage
## 75 622
## Heart Failure Hip Fracture
## 616 426
## Pancreatic Cancer Pancreatic Other
## 142 130
## Pancreatic Resection PCI
## 190 299
## Pneumonia
## 151
Decoding Procedure/Condition variable.
According to the American Stroke Association (ASA), strokes can be classified into 2 main categories: 87% are ischemic strokes, caused by blockage of an artery; 13% are hemorrhagic strokes, caused by bleeding. Ischemic strokes are further divided into 2 groups: thrombotic and embolic strokes. Hemorrhagic strokes are divided into 2 main categories: intracerebral and subarachnoid hemorrhages.
Our clean dataset has four categories for Acute Stroke:
Within each hospital, there are different notations for Acute Stroke variable. It suggests that different doctor uses different notations for the condition. These four categories are combined in one: Acute Stroke.
df_clean$Procedure.Condition <- gsub("Acute Stroke .*","Acute Stroke",df_clean$Procedure.Condition)
df_clean$Procedure.Condition <- factor(df_clean$Procedure.Condition)
Two additional categories are present in Procedure/Condition variable:
These categories are separate medical conditions and are not combined in one category.
The Procedure.Condition variable contains 6 medical procedures and 8 medical conditions. To indicate what procedure was performed or what condition was treated, the Medical_Categorey variable was added to the clean dataset.
df_clean <- df_clean %>%
mutate(Medical_Category = ifelse(grepl("Repair",Procedure.Condition) | grepl("Endarterectomy",Procedure.Condition) | grepl("Craniotomy",Procedure.Condition) | grepl("Resection",Procedure.Condition) | grepl("PCI",Procedure.Condition), "Procedure", "Condition"))
#glimpse(df_clean)
Density Plots for # of Cases, # of Deaths and Risk Adjusted Mortality Rate by Hospital Ratings.
p1 <- ggplot(df_clean,aes(log(X..of.Cases),fill=factor(Hospital.Ratings),colour=factor(Hospital.Ratings)))+
geom_density(alpha = 0.1)
p2 <- ggplot(df_clean,aes(log(X..of.Deaths),fill=factor(Hospital.Ratings),colour=factor(Hospital.Ratings)))+
geom_density(alpha = 0.1)
p3 <- ggplot(df_clean,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Hospital.Ratings),colour=factor(Hospital.Ratings)))+
geom_density(alpha = 0.1)
grid.arrange(p1, p2, p3, ncol=1)
Conclusions 1:
Associations between medical procedures or conditions with hospital ratings, number of cases, number of deaths and risk adjusted mortality rate.
Procedures.
df_p <- df_clean[which(df_clean$Medical_Category=="Procedure"),]
df_p_all <- df_p %>%
group_by(Procedure.Condition) %>%
summarise(all_cases = sum(X..of.Cases),
all_deaths = sum(X..of.Deaths),
all_mortality_rate = sum(Risk.Adjusted.Mortality.Rate))
# df_p_all %>% arrange(desc(all_cases))
# df_p_all %>% arrange(desc(all_deaths))
# df_p_all %>% arrange(desc(all_mortality_rate))
df_p_all
## # A tibble: 6 x 4
## Procedure.Condition all_cases all_deaths all_mortality_rate
## <fctr> <int> <int> <dbl>
## 1 AAA Repair 4927 59 508.5
## 2 Carotid Endarterectomy 12478 60 290.2
## 3 Craniotomy 30164 2159 2354.6
## 4 Esophageal Resection 619 28 436.9
## 5 Pancreatic Resection 3356 93 1002.8
## 6 PCI 78660 2028 793.6
Conditions.
df_c <- df_clean[which(df_clean$Medical_Category=="Condition"),]
df_c_all <- df_c %>%
group_by(Procedure.Condition) %>%
summarise(all_cases = sum(X..of.Cases),
all_deaths = sum(X..of.Deaths),
all_mortality_rate = sum(Risk.Adjusted.Mortality.Rate))
#df_c_all %>% arrange(desc(all_cases))
#df_c_all %>% arrange(desc(all_deaths))
#df_c_all %>% arrange(desc(all_mortality_rate))
df_c_all
## # A tibble: 8 x 4
## Procedure.Condition all_cases all_deaths all_mortality_rate
## <fctr> <int> <int> <dbl>
## 1 Acute Stroke 217956 20461 26582.9
## 2 AMI 93594 5731 3863.4
## 3 GI Hemorrhage 94804 2099 1597.8
## 4 Heart Failure 155066 4778 2200.0
## 5 Hip Fracture 32245 744 945.8
## 6 Pancreatic Cancer 1787 43 632.0
## 7 Pancreatic Other 1425 41 590.0
## 8 Pneumonia 20630 1019 552.5
Hospital Ratings.
prop.table(table(df_clean$Procedure.Condition,df_clean$Hospital.Ratings))*100
##
## As Expected Better Worse
## AAA Repair 4.5579886 0.0000000 0.0324412
## Acute Stroke 29.0673155 1.2003244 1.1841038
## AMI 8.6942417 0.3730738 0.5028386
## Carotid Endarterectomy 6.4395783 0.0000000 0.1135442
## Craniotomy 4.2497972 0.2919708 0.2919708
## Esophageal Resection 1.2165450 0.0000000 0.0000000
## GI Hemorrhage 9.6512571 0.1622060 0.2757502
## Heart Failure 9.0186537 0.3730738 0.6001622
## Hip Fracture 6.8450933 0.0000000 0.0648824
## Pancreatic Cancer 2.2708840 0.0000000 0.0324412
## Pancreatic Other 2.0762368 0.0000000 0.0324412
## Pancreatic Resection 3.0170316 0.0000000 0.0648824
## PCI 4.5417680 0.0973236 0.2108678
## Pneumonia 2.3844282 0.0648824 0.0000000
Conclusions 2:
The highest mortality rates is for Craniotomy and Pancreatic Resection procedures, Acute Stroke, AMI and Heart Failure conditions.
The lowest mortality rates is for Carotid Endarterectomy procedure, Pancreatic Other and Pneumonia conditions.
As Expected ratings are for Acute Stroke, AMI, GI Hemorrhage and Heart Failure conditions.
Density Plots for Risk Adjusted Mortality Rate by Procedures Performed and Hospital Ratings.
p6 <- ggplot(df_p,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Hospital.Ratings),colour=factor(Hospital.Ratings)))+
geom_density(alpha = 0.1)+
theme(legend.position='bottom')+
facet_wrap(~ Procedure.Condition, ncol=2, scales="free_y")
p6
Density Plots for Risk Adjusted Mortality Rate by Conditions Treated and Hospital Ratings.
p9 <- ggplot(df_c,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Hospital.Ratings),colour=factor(Hospital.Ratings)))+
geom_density(alpha = 0.1)+
theme(legend.position='bottom')+
facet_wrap(~ Procedure.Condition, ncol=2, scales="free_y")
p9
Summary: The most contributing procedues and conditions.
Summary of hospital ratings over all conditions and procedues.
df_clean <- df_clean %>% mutate(ratings =
ifelse(grepl("As Expected",Hospital.Ratings),"0",
ifelse(grepl("Better",Hospital.Ratings),"1",
ifelse(grepl("Worse",Hospital.Ratings),"-1",NA))))
df_clean$ratings <- as.numeric(df_clean$ratings)
all_ratings <- df_clean %>%
group_by(Hospital,Latitude,Longitude) %>%
summarise(all_ratings = sum(ratings),
mean_mortality_rate = mean(Risk.Adjusted.Mortality.Rate)) %>%
mutate(ratings =
ifelse(all_ratings > 0,"Better",
ifelse(all_ratings < 0, "Worse","As Expected")))
all_ratings$ratings <- as.factor(all_ratings$ratings)
all_ratings <- tbl_df(all_ratings)
Overall Hospital Ratings:
summary(all_ratings$ratings)
## As Expected Better Worse
## 172 69 99
all_ratings %>% arrange(desc(all_ratings)) %>% select(Hospital,all_ratings) %>% slice(1:10)
## # A tibble: 10 x 2
## Hospital
## <fctr>
## 1 Kaiser Foundation Hospital â Redwood City
## 2 Kaiser Foundation Hospital â Sunset
## 3 Centinela Hospital Medical Center
## 4 Scripps Green Hospital
## 5 Cedars Sinai Medical Center
## 6 Desert Valley Hospital
## 7 Encino Hospital Medical Center
## 8 Glendale Adventist Medical Center â Wilson Terrace
## 9 Kaiser Foundation Hospital â Rehabilitation Center Vallejo
## 10 Kaiser Foundation Hospital â San Francisco
## # ... with 1 more variables: all_ratings <dbl>
all_ratings %>% arrange(all_ratings) %>% select(Hospital,all_ratings) %>% slice(1:10)
## # A tibble: 10 x 2
## Hospital
## <fctr>
## 1 Palomar Health Downtown Campus
## 2 Los Angeles County/University of Southern California Medical Center
## 3 San Francisco General Hospital
## 4 Santa Barbara Cottage Hospital
## 5 Arrowhead Regional Medical Center
## 6 Grossmont Hospital
## 7 Sharp Memorial Hospital
## 8 Antelope Valley Hospital
## 9 Doctors Medical Center
## 10 Good Samaritan Hospital â San Jose
## # ... with 1 more variables: all_ratings <dbl>
all_ratings %>% arrange(desc(mean_mortality_rate)) %>% select(Hospital,mean_mortality_rate) %>% slice(1:10)
## # A tibble: 10 x 2
## Hospital mean_mortality_rate
## <fctr> <dbl>
## 1 Memorial Hospital Los Banos 28.35714
## 2 Mountains Community Hospital 27.45000
## 3 Santa Ynez Valley Cottage Hospital 26.47500
## 4 Biggs Gridley Memorial Hospital 24.31111
## 5 Coalinga Regional Medical Center 23.71667
## 6 George L. Mee Memorial Hospital 19.30000
## 7 Adventist Medical Center â Reedley 18.83333
## 8 Central Valley Specialty Hospital 18.70000
## 9 San Joaquin General Hospital 14.68095
## 10 Goleta Valley Cottage Hospital 14.26000
all_ratings %>% arrange(mean_mortality_rate) %>% select(Hospital,mean_mortality_rate) %>% slice(1:10)
## # A tibble: 10 x 2
## Hospital mean_mortality_rate
## <fctr> <dbl>
## 1 Corcoran District Hospital 0
## 2 Eastern Plumas Hospital â Portola Campus 0
## 3 Glenn Medical Center 0
## 4 Good Samaritan Hospital â Bakersfield 0
## 5 Hoag Orthopedic Institute 0
## 6 Kern Valley Healthcare District 0
## 7 Laguna Honda Hospital and Rehabilitation Center 0
## 8 Mammoth Hospital 0
## 9 Southern Inyo Hospital 0
## 10 Tehachapi Hospital 0
Mapping of overall hospital ratings and mean mortality rates.
CAmap <- get_map(location="California",source="google",maptype="roadmap",crop=FALSE,zoom=6)
ggmap (CAmap) +
geom_point(aes(x=Longitude,y=Latitude,size=mean_mortality_rate,colour=ratings),data=all_ratings,alpha=0.7)+
scale_colour_manual(values=c("Worse" = "darkred","Better" = "darkblue","As Expected" = "darkgrey"))+
scale_size(range = c(0, 10))
I will analyze which condition or procedure has the best or worse hospital ratings, and map these results only for selected conditions or procedures.
To predict hospital ratings I will use classification decision trees, random forests or multinomial logistic regression. First, I will use three models to train them on 2012-2013 data and access which model gives the best performance on the training data, then I will test three model performances on 2014 test data.
best_ratings <- all_ratings %>% arrange(desc(all_ratings)) %>% slice(1:25)
lowest_mort_rate <- all_ratings %>% arrange(mean_mortality_rate) %>% slice(1:25)
best_lowest <- bind_rows(best_ratings,lowest_mort_rate)
CAmap <- get_map(location="California",source="google",maptype="roadmap",crop=FALSE,zoom=6)
ggmap (CAmap) +
geom_point(aes(x=Longitude,y=Latitude,size=ratings,colour=mean_mortality_rate),data=best_lowest,alpha=0.7)+
scale_colour_distiller(type = "div", palette = "RdBu", direction = -1)