First Dataset is available from CA Hospital Mortality Rates for 2012-2013..
Description of first dataset: The dataset contains risk-adjusted mortality rates, quality ratings, and number of deaths and cases for 6 medical conditions treated (Acute Stroke, Acute Myocardial Infarction, Heart Failure, Gastrointestinal Hemorrhage, Hip Fracture and Pneumonia) and 6 procedures performed (Abdominal Aortic Aneurysm Repair, Carotid Endarterectomy, Craniotomy, Esophageal Resection, Pancreatic Resection, Percutaneous Coronary Intervention) in California hospitals for 2012 and 2013. This dataset does not include conditions treated or procedures performed in outpatient settings. Please refer to statewide table for CA overall mortality rates.
Second Dataset is availabe from Hospital Profitability for 2009-2013.
Description of second dataset: The dataset contains income statement information for all licensed, comparable hospitals in the state of California. Kaiser hospitals, state mental hospitals, psychiatric health facilities, and hospitals with mainly long-term care patients are excluded. Deductions from Revenue, Net Patient Revenue, Net from Operations (Operating Revenue less Operating Expense), and Net Income for public hospitals has been adjusted for Disproportionate Share intergovernmental transfers for funding the Disproportionate Share Hospital Program. The program gets federal matching funds to pay supplemental payments to hospitals with a disproportionate share of uninsured, underinsured, and Medi-Cal patients.
Question: Can we predict hospital ratings based on risk adjusted mortality rates, number of deaths, number of cases, medical procedures performed, medical conditions treated and hospital profitability for 2012-2013?
Load the data from csv file.
data <- read.csv("California_Hospital_Inpatient_Mortality_Rates_and_Quality_Ratings__2012-2013.csv",sep=",",header=TRUE)
df <- tbl_df(data)
glimpse(df)
## Observations: 11,169
## Variables: 12
## $ Year <int> 2012, 2012, 2012, 2012, 2012, 201...
## $ County <fctr> Alameda, Alameda, Alameda, Alame...
## $ Hospital <fctr> Alameda Hospital, Alameda Hospit...
## $ OSHPDID <int> 106010735, 106010735, 106010735, ...
## $ Procedure.Condition <fctr> PCI, AMI, Acute Stroke Ischemic,...
## $ Risk.Adjusted.Mortality.Rate <dbl> NA, 5.5, 2.3, NA, 1.3, 0.0, NA, 2...
## $ X..of.Deaths <int> NA, 4, 1, NA, 1, 0, NA, 5, NA, 5,...
## $ X..of.Cases <int> NA, 41, 60, NA, 65, 5, NA, 11, NA...
## $ Hospital.Ratings <fctr> As Expected, As Expected, As Exp...
## $ Longitude <dbl> -122.2536, -122.2536, -122.2536, ...
## $ Latitude <dbl> 37.76295, 37.76295, 37.76295, 37....
## $ location1 <fctr> (37.762953, -122.25362), (37.762...
Load the data from the website.
setwd("C:/Users/maria/Dropbox (Personal)/SpringBoard Fund/Rprojects/")
download.file("https://chhs.data.ca.gov/resource/dayj-x3m3.csv",destfile="mortality_data.csv")
data <- read.csv("mortality_data.csv",sep=",",header=TRUE)
df <- tbl_df(data)
glimpse(df)
Dataset: 11169 observations and 12 variables.
Variables with missing values:
Remove missing values, since number of missing values consists of half of dataset.
df_clean <- df[which(is.na(df$X..of.Cases)==F),]
Clean Dataset: 6165 observations and 12 variables.
Variables with no missing values:
summary(df_clean$Hospital.Ratings)
## As Expected Better Worse
## 5797 158 210
summary(df_clean$Procedure.Condition)
## AAA Repair Acute Stroke
## 283 617
## Acute Stroke Hemorrhagic Acute Stroke Ischemic
## 466 615
## Acute Stroke Subarachnoid AMI
## 241 590
## Carotid Endarterectomy Craniotomy
## 404 298
## Esophageal Resection GI Hemorrhage
## 75 622
## Heart Failure Hip Fracture
## 616 426
## Pancreatic Cancer Pancreatic Other
## 142 130
## Pancreatic Resection PCI
## 190 299
## Pneumonia
## 151
Decoding Procedure/Condition variable.
According to the American Stroke Association (ASA), strokes can be classified into 2 main categories: 87% are ischemic strokes, caused by blockage of an artery; 13% are hemorrhagic strokes, caused by bleeding. Ischemic strokes are further divided into 2 groups: thrombotic and embolic strokes. Hemorrhagic strokes are divided into 2 main categories: intracerebral and subarachnoid hemorrhages.
Our clean dataset has four categories for Acute Stroke:
Within each hospital, there are different notations for Acute Stroke variable. It suggests that different doctor uses different notations for the condition. These four categories are combined in one: Acute Stroke.
df_clean$Procedure.Condition <- gsub("Acute Stroke .*","Acute Stroke",df_clean$Procedure.Condition)
df_clean$Procedure.Condition <- factor(df_clean$Procedure.Condition)
Two additional categories are present in Procedure/Condition variable:
These categories are separate medical conditions and are not combined in one category.
The Procedure.Condition variable contains 6 medical procedures and 8 medical conditions. To indicate what procedure was performed or what condition was treated, the Medical_Categorey variable was added to the Clean Dataset.
df_clean <- df_clean %>%
mutate(Medical_Category = ifelse(grepl("Repair",Procedure.Condition) | grepl("Endarterectomy",Procedure.Condition) | grepl("Craniotomy",Procedure.Condition) | grepl("Resection",Procedure.Condition) | grepl("PCI",Procedure.Condition), "Procedure", "Condition"))
glimpse(df_clean)
## Observations: 6,165
## Variables: 13
## $ Year <int> 2012, 2012, 2012, 2012, 2012, 201...
## $ County <fctr> Alameda, Alameda, Alameda, Alame...
## $ Hospital <fctr> Alameda Hospital, Alameda Hospit...
## $ OSHPDID <int> 106010735, 106010735, 106010735, ...
## $ Procedure.Condition <fctr> AMI, Acute Stroke, GI Hemorrhage...
## $ Risk.Adjusted.Mortality.Rate <dbl> 5.5, 2.3, 1.3, 0.0, 25.1, 2.6, 8....
## $ X..of.Deaths <int> 4, 1, 1, 0, 5, 5, 7, 0, 5, 18, 8,...
## $ X..of.Cases <int> 41, 60, 65, 5, 11, 139, 72, 35, 1...
## $ Hospital.Ratings <fctr> As Expected, As Expected, As Exp...
## $ Longitude <dbl> -122.2536, -122.2536, -122.2536, ...
## $ Latitude <dbl> 37.76295, 37.76295, 37.76295, 37....
## $ location1 <fctr> (37.762953, -122.25362), (37.762...
## $ Medical_Category <chr> "Condition", "Condition", "Condit...
Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for 2012 and 2013 years.
p1 <- ggplot(df_clean,aes(log(X..of.Deaths),fill=factor(Year),colour=factor(Year)))+
#geom_histogram(position="dodge")
geom_density(alpha = 0.1)
p2 <- ggplot(df_clean,aes(log(X..of.Cases),fill=factor(Year),colour=factor(Year)))+
#geom_histogram(position="dodge")
geom_density(alpha = 0.1)
p3 <- ggplot(df_clean,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Year),colour=factor(Year)))+
#geom_histogram(position="dodge")
geom_density(alpha = 0.1)
grid.arrange(p1, p2, p3, ncol=1)
Histrograms of # of Deaths for Hospital.Ratings.
df_hr_ae <- df_clean[which(df_clean$Hospital.Ratings=="As Expected"),]
df_hr <- df_clean[which(df_clean$Hospital.Ratings!="As Expected"),]
p1 <- ggplot(df_hr_ae,aes(log(X..of.Deaths),fill=factor(Hospital.Ratings),
colour=factor(Hospital.Ratings)))+
#geom_histogram(position="dodge")
geom_density(alpha = 0.1)
p4 <- ggplot(df_hr,aes(log(X..of.Deaths),fill=factor(Hospital.Ratings),
colour=factor(Hospital.Ratings)))+
#geom_histogram(position="dodge")
geom_density(alpha = 0.1)
grid.arrange(p1, p4, ncol=1)
Histrograms of # of Cases for Hospital.Ratings.
p2 <- ggplot(df_hr_ae,aes(log(X..of.Cases),fill=factor(Hospital.Ratings),
colour=factor(Hospital.Ratings)))+
#geom_histogram(position="dodge")
geom_density(alpha = 0.1)
p5 <- ggplot(df_hr,aes(log(X..of.Cases),fill=factor(Hospital.Ratings),
colour=factor(Hospital.Ratings)))+
#geom_histogram(position="dodge")
geom_density(alpha = 0.1)
grid.arrange(p2, p5, ncol=1)
Histrograms of Risk.Adjusted.Mortality.Rate for Hospital.Ratings.
p3 <- ggplot(df_hr_ae,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Hospital.Ratings),
colour=factor(Hospital.Ratings)))+
#geom_histogram(position="dodge")+
geom_density(alpha = 0.1)
p6 <- ggplot(df_hr,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Hospital.Ratings),
colour=factor(Hospital.Ratings)))+
#geom_histogram(position="dodge")+
geom_density(alpha = 0.1)
grid.arrange(p3, p6, ncol=1)
Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for Year and Hospital.Ratings.
p4 <- ggplot(df_hr_ae,aes(log(X..of.Deaths),fill=factor(Year),colour=factor(Year)))+
#geom_histogram(position="dodge")+
geom_density(alpha = 0.1)+
facet_grid(Hospital.Ratings ~ ., scale="free_y", space="free_y")
p5 <- ggplot(df_hr_ae,aes(log(X..of.Cases),fill=factor(Year),colour=factor(Year)))+
#geom_histogram(position="dodge")+
geom_density(alpha = 0.1)+
facet_grid(Hospital.Ratings ~ ., scale="free_y", space="free_y")
p6 <- ggplot(df_hr_ae,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Year),colour=factor(Year)))+
#geom_histogram(position="dodge")+
geom_density(alpha = 0.1)+
facet_grid(Hospital.Ratings ~ ., scale="free_y", space="free_y")
grid.arrange(p4, p5, p6, ncol=1)
p7 <- ggplot(df_hr,aes(log(X..of.Deaths),fill=factor(Year),colour=factor(Year)))+
#geom_histogram(position="dodge")+
geom_density(alpha = 0.1)+
facet_grid(Hospital.Ratings ~ ., scale="free_y", space="free_y")
p8 <- ggplot(df_hr,aes(log(X..of.Cases),fill=factor(Year),colour=factor(Year)))+
#geom_histogram(position="dodge")+
geom_density(alpha = 0.1)+
facet_grid(Hospital.Ratings ~ ., scale="free_y", space="free_y")
p9 <- ggplot(df_hr,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Year),colour=factor(Year)))+
#geom_histogram(position="dodge")+
geom_density(alpha = 0.1)+
facet_grid(Hospital.Ratings ~ ., scale="free_y", space="free_y")
grid.arrange(p7, p8, p9, ncol=2)
Conclusion 1: There is association between the risk adjusted mortality rate and hospital ratings.
Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for Procedures Performed.
df_p <- df_clean[which(df_clean$Medical_Category=="Procedure"),]
p4 <- ggplot(df_p,aes(log(X..of.Deaths),fill=factor(Procedure.Condition),colour=factor(Procedure.Condition)))+
geom_histogram(position="stack")
#geom_density(alpha = 0.1)
p4
p5 <- ggplot(df_p,aes(log(X..of.Cases),fill=factor(Procedure.Condition),colour=factor(Procedure.Condition)))+
geom_histogram(position="stack")
#geom_density(alpha = 0.1)
p5
p6 <- ggplot(df_p,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Procedure.Condition),colour=factor(Procedure.Condition)))+
geom_histogram(position="stack")
#geom_density(alpha = 0.1)
p6
Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for Conditions Treated.
df_c <- df_clean[which(df_clean$Medical_Category=="Condition"),]
p7 <- ggplot(df_c,aes(log(X..of.Deaths),fill=factor(Procedure.Condition),colour=factor(Procedure.Condition)))+
geom_histogram(position="stack")
#geom_density(alpha = 0.1)
p7
p8 <- ggplot(df_c,aes(log(X..of.Cases),fill=factor(Procedure.Condition),colour=factor(Procedure.Condition)))+
geom_histogram(position="stack")
#geom_density(alpha = 0.1)
p8
p9 <- ggplot(df_c,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Procedure.Condition),colour=factor(Procedure.Condition)))+
geom_histogram(position="stack")
#geom_density(alpha = 0.1)
p9
#grid.arrange(p4, p5, p6, ncol=1)
Conclusion 2:
Histrograms for Procedures for Hospital.Ratings.
df_p_hr_ae <- df_p[which(df_p$Hospital.Ratings=="As Expected"),]
df_p_hr <- df_p[which(df_p$Hospital.Ratings!="As Expected"),]
p1 <- ggplot(df_p_hr_ae,aes(Procedure.Condition, colour=factor(Hospital.Ratings),fill=factor(Hospital.Ratings))) +
stat_count(width = 0.3, position="dodge")+
labs(title="Histogram for Procedures")+
labs(x="Procedure", y="Count")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p2 <- ggplot(df_p_hr,aes(Procedure.Condition,colour=factor(Hospital.Ratings),fill=factor(Hospital.Ratings))) +
stat_count(width = 0.3, position="dodge")+
labs(title="Histogram for Procedures")+
labs(x="Procedure", y="Count")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
grid.arrange(p1, p2, ncol=1)
Histrograms for Conditions for Hospital.Ratings.
df_c_hr_ae <- df_c[which(df_c$Hospital.Ratings=="As Expected"),]
df_c_hr <- df_c[which(df_c$Hospital.Ratings!="As Expected"),]
p3 <- ggplot(df_c_hr_ae,aes(Procedure.Condition, colour=factor(Hospital.Ratings),fill=factor(Hospital.Ratings))) +
stat_count(width = 0.3, position="dodge")+
labs(title="Histogram for Conditions")+
labs(x="Condition", y="Count")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p4 <- ggplot(df_c_hr,aes(Procedure.Condition,colour=factor(Hospital.Ratings),fill=factor(Hospital.Ratings))) +
stat_count(width = 0.3, position="dodge")+
labs(title="Histogram for Consitions")+
labs(x="Condition", y="Count")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
grid.arrange(p3, p4, ncol=1)
Associations between medical procedure or condition with hospital ratings.
table(df_clean$Procedure.Condition,df_clean$Hospital.Ratings)
##
## As Expected Better Worse
## AAA Repair 281 0 2
## Acute Stroke 1792 74 73
## AMI 536 23 31
## Carotid Endarterectomy 397 0 7
## Craniotomy 262 18 18
## Esophageal Resection 75 0 0
## GI Hemorrhage 595 10 17
## Heart Failure 556 23 37
## Hip Fracture 422 0 4
## Pancreatic Cancer 140 0 2
## Pancreatic Other 128 0 2
## Pancreatic Resection 186 0 4
## PCI 280 6 13
## Pneumonia 147 4 0
Conclusion 3:
Higher the risk adjusted mortality rate, worse the hospital ratings.
The Risk.Adjusted.Mortality.Rate is lower for PCI procedure and for heart failure, AMI and GI Hemorrhage conditions.
Hospital ratings are worse for PCI procedure and for AMI, GI Hemorrhage and heart failure conditions.
Risk.Adjusted.Mortality.Rate versus X..of.Cases or X..of.Deaths.
fit1 <- lm(Risk.Adjusted.Mortality.Rate ~ X..of.Cases, data=df_clean)
#summary(fit1)
p1 <- ggplot(df_clean, aes(X..of.Cases,Risk.Adjusted.Mortality.Rate))+
geom_point(alpha = 0.1)+
stat_smooth(method='lm')
fit2 <- lm(Risk.Adjusted.Mortality.Rate ~ X..of.Deaths, data=df_clean)
#summary(fit2)
p2 <- ggplot(df_clean, aes(X..of.Deaths,Risk.Adjusted.Mortality.Rate))+
geom_point(alpha = 0.1)+
stat_smooth(method='lm')
grid.arrange(p1, p2, ncol=2)
Mapping of Hospital.Ratings. (need to understand more here)
CAmap <- get_map(location="California",source="google",maptype="roadmap",crop=FALSE,zoom=6)
ggmap (CAmap) +
geom_point(aes(x=Longitude,y=Latitude,size=Hospital.Ratings),data=df_clean,alpha=.3,color="darkred")#,size=2)