CA Mortality Rates: Explanatory Data Analysis.

First Dataset is available from CA Hospital Mortality Rates for 2012-2013..

Description of first dataset: The dataset contains risk-adjusted mortality rates, quality ratings, and number of deaths and cases for 6 medical conditions treated (Acute Stroke, Acute Myocardial Infarction, Heart Failure, Gastrointestinal Hemorrhage, Hip Fracture and Pneumonia) and 6 procedures performed (Abdominal Aortic Aneurysm Repair, Carotid Endarterectomy, Craniotomy, Esophageal Resection, Pancreatic Resection, Percutaneous Coronary Intervention) in California hospitals for 2012 and 2013. This dataset does not include conditions treated or procedures performed in outpatient settings. Please refer to statewide table for CA overall mortality rates.

Second Dataset is availabe from Hospital Profitability for 2009-2013.

Description of second dataset: The dataset contains income statement information for all licensed, comparable hospitals in the state of California. Kaiser hospitals, state mental hospitals, psychiatric health facilities, and hospitals with mainly long-term care patients are excluded. Deductions from Revenue, Net Patient Revenue, Net from Operations (Operating Revenue less Operating Expense), and Net Income for public hospitals has been adjusted for Disproportionate Share intergovernmental transfers for funding the Disproportionate Share Hospital Program. The program gets federal matching funds to pay supplemental payments to hospitals with a disproportionate share of uninsured, underinsured, and Medi-Cal patients.

Question: Can we predict hospital ratings based on risk adjusted mortality rates, number of deaths, number of cases, medical procedures performed, medical conditions treated and hospital profitability for 2012-2013?

Load the data from csv file.

data <- read.csv("California_Hospital_Inpatient_Mortality_Rates_and_Quality_Ratings__2012-2013.csv",sep=",",header=TRUE)
df <- tbl_df(data)
glimpse(df)

## Observations: 11,169
## Variables: 12
## $ Year                         <int> 2012, 2012, 2012, 2012, 2012, 201...
## $ County                       <fctr> Alameda, Alameda, Alameda, Alame...
## $ Hospital                     <fctr> Alameda Hospital, Alameda Hospit...
## $ OSHPDID                      <int> 106010735, 106010735, 106010735, ...
## $ Procedure.Condition          <fctr> PCI, AMI, Acute Stroke Ischemic,...
## $ Risk.Adjusted.Mortality.Rate <dbl> NA, 5.5, 2.3, NA, 1.3, 0.0, NA, 2...
## $ X..of.Deaths                 <int> NA, 4, 1, NA, 1, 0, NA, 5, NA, 5,...
## $ X..of.Cases                  <int> NA, 41, 60, NA, 65, 5, NA, 11, NA...
## $ Hospital.Ratings             <fctr> As Expected, As Expected, As Exp...
## $ Longitude                    <dbl> -122.2536, -122.2536, -122.2536, ...
## $ Latitude                     <dbl> 37.76295, 37.76295, 37.76295, 37....
## $ location1                    <fctr> (37.762953, -122.25362), (37.762...

Load the data from the website.

setwd("C:/Users/maria/Dropbox (Personal)/SpringBoard Fund/Rprojects/")
download.file("https://chhs.data.ca.gov/resource/dayj-x3m3.csv",destfile="mortality_data.csv")
data <- read.csv("mortality_data.csv",sep=",",header=TRUE)
df <- tbl_df(data)
glimpse(df)

Description, analysis and cleaning of variables in the dataset.

Dataset: 11169 observations and 12 variables.

Variables with missing values:

Risk Adjusted Mortality Rate: The Risk Adjusted Mortality Rates (RAMR) presented here adjusts the observed mortality rates. This statistical methodology takes into account pre-existing health problems that put some patients at greater risk of death to âlevel the playing fieldâ and allow fair comparisons across hospitals. 4754 missing values.
Number of Deaths: Number of patients that died in this hospital. 4926 missing values.
Number of Cases: Number of patients that had this medical procedure or condition in this hospital. 5004 missing values.

Remove missing values, since number of missing values consists of half of dataset.

df_clean <- df[which(is.na(df$X..of.Cases)==F),]

Clean Dataset: 6165 observations and 12 variables.

Variables with no missing values:

Year: two years 2012 and 2013; 3100 values for 2012 year and 3065 values for 2013 year.
County: 55 counties.
Hospital: 341 hospitals.
OSHPDID: A unique number established by the Office of Statewide Health Planning and Development (OSHPD) for identifying facilities and used in the Licensed Facility Information System (LFIS). The first three numbers identify the type of facility, the next two represent the county number, and the last five are randomly assigned within each county. 570261 unique codes.
Longitude: Longitude of hospital.
Latitude: Latitude of hospital.
location1: 333 levels.
Hospital Ratings: Comparison rating based on a 95% Confidence Interval (CI). If a hospitalâs upper CI is less than the statewide observed rate, it is designated as performing âbetterâ than the average hospital. If a hospitalâs lower CI is greater than the state rate, it is designated as performing âworseâ than the average state hospital. 3 levels of Hospital Ratings: As Expected, Better and Worse.

summary(df_clean$Hospital.Ratings)

## As Expected      Better       Worse 
##        5797         158         210

Procedure/Condition: Procedure that was performed or condition that was treated. 6 medical procedures performed: Abdominal Aortic Aneurysm (AAA) Repair, Carotid Endarterectomy, Craniotomy, Esophageal Resection, Pancreatic Resection, Percutaneous Coronary Intervention. 6 medical conditions treated: Acute Stroke, Acute Myocardial Infarction, Heart Failure, Gastrointestinal Hemorrhage, Hip Fracture and Pneumonia. Clean dataset consists of 17 levels, instead of 12.

summary(df_clean$Procedure.Condition)

##                AAA Repair              Acute Stroke 
##                       283                       617 
##  Acute Stroke Hemorrhagic     Acute Stroke Ischemic 
##                       466                       615 
## Acute Stroke Subarachnoid                       AMI 
##                       241                       590 
##    Carotid Endarterectomy                Craniotomy 
##                       404                       298 
##      Esophageal Resection             GI Hemorrhage 
##                        75                       622 
##             Heart Failure              Hip Fracture 
##                       616                       426 
##         Pancreatic Cancer          Pancreatic Other 
##                       142                       130 
##      Pancreatic Resection                       PCI 
##                       190                       299 
##                 Pneumonia 
##                       151

Decoding Procedure/Condition variable.

According to the American Stroke Association (ASA), strokes can be classified into 2 main categories: 87% are ischemic strokes, caused by blockage of an artery; 13% are hemorrhagic strokes, caused by bleeding. Ischemic strokes are further divided into 2 groups: thrombotic and embolic strokes. Hemorrhagic strokes are divided into 2 main categories: intracerebral and subarachnoid hemorrhages.

Our clean dataset has four categories for Acute Stroke:

Acute Stroke: 617 observations;
Acute Stroke Hemorrhagic: 466 observations;
Acute Stroke Ischemic: 615 obervations;
Acute Stroke Subarachnoid: 241 observations.

Within each hospital, there are different notations for Acute Stroke variable. It suggests that different doctor uses different notations for the condition. These four categories are combined in one: Acute Stroke.

df_clean$Procedure.Condition <- gsub("Acute Stroke .*","Acute Stroke",df_clean$Procedure.Condition)
df_clean$Procedure.Condition <- factor(df_clean$Procedure.Condition)

Two additional categories are present in Procedure/Condition variable:

Pancreatic Cancer: 142 observations;
Pancreatic Other: 130 observations.

These categories are separate medical conditions and are not combined in one category.

The Procedure.Condition variable contains 6 medical procedures and 8 medical conditions. To indicate what procedure was performed or what condition was treated, the Medical_Categorey variable was added to the Clean Dataset.

df_clean <- df_clean %>% 
  mutate(Medical_Category = ifelse(grepl("Repair",Procedure.Condition) | grepl("Endarterectomy",Procedure.Condition) | grepl("Craniotomy",Procedure.Condition) | grepl("Resection",Procedure.Condition) | grepl("PCI",Procedure.Condition),               "Procedure", "Condition"))

glimpse(df_clean)

## Observations: 6,165
## Variables: 13
## $ Year                         <int> 2012, 2012, 2012, 2012, 2012, 201...
## $ County                       <fctr> Alameda, Alameda, Alameda, Alame...
## $ Hospital                     <fctr> Alameda Hospital, Alameda Hospit...
## $ OSHPDID                      <int> 106010735, 106010735, 106010735, ...
## $ Procedure.Condition          <fctr> AMI, Acute Stroke, GI Hemorrhage...
## $ Risk.Adjusted.Mortality.Rate <dbl> 5.5, 2.3, 1.3, 0.0, 25.1, 2.6, 8....
## $ X..of.Deaths                 <int> 4, 1, 1, 0, 5, 5, 7, 0, 5, 18, 8,...
## $ X..of.Cases                  <int> 41, 60, 65, 5, 11, 139, 72, 35, 1...
## $ Hospital.Ratings             <fctr> As Expected, As Expected, As Exp...
## $ Longitude                    <dbl> -122.2536, -122.2536, -122.2536, ...
## $ Latitude                     <dbl> 37.76295, 37.76295, 37.76295, 37....
## $ location1                    <fctr> (37.762953, -122.25362), (37.762...
## $ Medical_Category             <chr> "Condition", "Condition", "Condit...

Initial Explanatory Data Analysis

Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for 2012 and 2013 years.

p1 <- ggplot(df_clean,aes(log(X..of.Deaths),fill=factor(Year),colour=factor(Year)))+
  #geom_histogram(position="dodge")
  geom_density(alpha = 0.1)

p2 <- ggplot(df_clean,aes(log(X..of.Cases),fill=factor(Year),colour=factor(Year)))+
  #geom_histogram(position="dodge")
  geom_density(alpha = 0.1)

p3 <- ggplot(df_clean,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Year),colour=factor(Year)))+
  #geom_histogram(position="dodge")
  geom_density(alpha = 0.1)

grid.arrange(p1, p2, p3, ncol=1)

Distributions between 2012 and 2013 years look simiar.
There are no associations between variables.
The distribution for the number of deaths is right-skewed.
The distribution for the number of cases is uniform.
The distribution for the risk adjusted mortality rate is symmetric.

Histrograms of # of Deaths for Hospital.Ratings.

df_hr_ae <- df_clean[which(df_clean$Hospital.Ratings=="As Expected"),]
df_hr <- df_clean[which(df_clean$Hospital.Ratings!="As Expected"),]

p1 <- ggplot(df_hr_ae,aes(log(X..of.Deaths),fill=factor(Hospital.Ratings),
                          colour=factor(Hospital.Ratings)))+
  #geom_histogram(position="dodge")
  geom_density(alpha = 0.1)

p4 <- ggplot(df_hr,aes(log(X..of.Deaths),fill=factor(Hospital.Ratings),
                       colour=factor(Hospital.Ratings)))+
  #geom_histogram(position="dodge")
  geom_density(alpha = 0.1)

grid.arrange(p1, p4, ncol=1)

The distribution of number of deaths with expected hosital ratings is right-skewed.
The distribution of number of deaths with worse hosital ratings is left-skewed.
The distribution of number of deaths with better hosital ratings is uniform.
There is an association between the number of deaths and hospital ratings.
The higher number of deaths, the worse hospital ratings are.

Histrograms of # of Cases for Hospital.Ratings.

p2 <- ggplot(df_hr_ae,aes(log(X..of.Cases),fill=factor(Hospital.Ratings),
                          colour=factor(Hospital.Ratings)))+
  #geom_histogram(position="dodge")
  geom_density(alpha = 0.1)

p5 <- ggplot(df_hr,aes(log(X..of.Cases),fill=factor(Hospital.Ratings),
                       colour=factor(Hospital.Ratings)))+
  #geom_histogram(position="dodge")
  geom_density(alpha = 0.1)

grid.arrange(p2, p5, ncol=1)

There are no associations between variables.
The distribution of number of cases with expected hosital ratings is uniform.
The distributions of number of cases with worse and better hosital ratings are left-skewed.

Histrograms of Risk.Adjusted.Mortality.Rate for Hospital.Ratings.

p3 <- ggplot(df_hr_ae,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Hospital.Ratings),
                          colour=factor(Hospital.Ratings)))+
  #geom_histogram(position="dodge")+
  geom_density(alpha = 0.1)

p6 <- ggplot(df_hr,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Hospital.Ratings),
                       colour=factor(Hospital.Ratings)))+
  #geom_histogram(position="dodge")+
  geom_density(alpha = 0.1)

grid.arrange(p3, p6, ncol=1)

The distribution of mortality rate with expected hosital ratings is symmetric and/or bimodal.
The distribution of mortality rate with worse hosital ratings is right-skewed.
The distribution of mortality rate with better hosital ratings is symmetric.
There is an association between the mortality rate and hospital ratings.
The higher mortality rate, the worse hospital ratings are.
The lower mortality rate, the better hospital ratings are.

Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for Year and Hospital.Ratings.

p4 <- ggplot(df_hr_ae,aes(log(X..of.Deaths),fill=factor(Year),colour=factor(Year)))+
  #geom_histogram(position="dodge")+
  geom_density(alpha = 0.1)+
  facet_grid(Hospital.Ratings ~ ., scale="free_y", space="free_y")

p5 <- ggplot(df_hr_ae,aes(log(X..of.Cases),fill=factor(Year),colour=factor(Year)))+
  #geom_histogram(position="dodge")+
  geom_density(alpha = 0.1)+
  facet_grid(Hospital.Ratings ~ ., scale="free_y", space="free_y")

p6 <- ggplot(df_hr_ae,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Year),colour=factor(Year)))+
  #geom_histogram(position="dodge")+
  geom_density(alpha = 0.1)+
  facet_grid(Hospital.Ratings ~ ., scale="free_y", space="free_y")

grid.arrange(p4, p5, p6, ncol=1)

Distributions between 2012 and 2013 years look simiar.
There are no associations between variables.

p7 <- ggplot(df_hr,aes(log(X..of.Deaths),fill=factor(Year),colour=factor(Year)))+
  #geom_histogram(position="dodge")+
  geom_density(alpha = 0.1)+
  facet_grid(Hospital.Ratings ~ ., scale="free_y", space="free_y")

p8 <- ggplot(df_hr,aes(log(X..of.Cases),fill=factor(Year),colour=factor(Year)))+
  #geom_histogram(position="dodge")+
  geom_density(alpha = 0.1)+
  facet_grid(Hospital.Ratings ~ ., scale="free_y", space="free_y")

p9 <- ggplot(df_hr,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Year),colour=factor(Year)))+
  #geom_histogram(position="dodge")+
  geom_density(alpha = 0.1)+
  facet_grid(Hospital.Ratings ~ ., scale="free_y", space="free_y")

grid.arrange(p7, p8, p9, ncol=2)

Distributions between 2012 and 2013 years look simiar.
There are no associations between variables for number of deaths and number of cases.
There is association between the risk adjusted mortality rate and hospital ratings.
Lower the risk adjusted mortality rate, better the hospital ratings.
Higher the risk adjusted mortality rate, worse the hospital ratings.

Conclusion 1: There is association between the risk adjusted mortality rate and hospital ratings.

Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for Procedures Performed.

df_p <- df_clean[which(df_clean$Medical_Category=="Procedure"),]

p4 <- ggplot(df_p,aes(log(X..of.Deaths),fill=factor(Procedure.Condition),colour=factor(Procedure.Condition)))+
  geom_histogram(position="stack")
  #geom_density(alpha = 0.1)

p4

The highest number of deaths is from PCI and craniotomy procedures.

p5 <- ggplot(df_p,aes(log(X..of.Cases),fill=factor(Procedure.Condition),colour=factor(Procedure.Condition)))+
  geom_histogram(position="stack")
  #geom_density(alpha = 0.1)

p5

The highest number of cases is from PCI and craniotomy procedures.

p6 <- ggplot(df_p,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Procedure.Condition),colour=factor(Procedure.Condition)))+
  geom_histogram(position="stack")
  #geom_density(alpha = 0.1)

p6

The highest Risk.Adjusted.Mortality.Rate is from craniotomy procedure.
The Risk.Adjusted.Mortality.Rate is lower from PCI procedure.

Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for Conditions Treated.

df_c <- df_clean[which(df_clean$Medical_Category=="Condition"),]

p7 <- ggplot(df_c,aes(log(X..of.Deaths),fill=factor(Procedure.Condition),colour=factor(Procedure.Condition)))+
  geom_histogram(position="stack")
  #geom_density(alpha = 0.1)

p7

The highest number of deaths is for acute stroke condition.
The number of deaths lower is for heart failure and AMI conditions.

p8 <- ggplot(df_c,aes(log(X..of.Cases),fill=factor(Procedure.Condition),colour=factor(Procedure.Condition)))+
  geom_histogram(position="stack")
  #geom_density(alpha = 0.1)

p8

The highest number of cases is for acute stroke condition.
The number of cases lower is for heart failure, AMI and GI Hemorrhage conditions.

p9 <- ggplot(df_c,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Procedure.Condition),colour=factor(Procedure.Condition)))+
  geom_histogram(position="stack")
  #geom_density(alpha = 0.1)

p9

#grid.arrange(p4, p5, p6, ncol=1)

The highest Risk.Adjusted.Mortality.Rate is for acute stroke condition.
The Risk.Adjusted.Mortality.Rate lower is for heart failure, AMI and GI Hemorrhage conditions.

Conclusion 2:

The highest Risk.Adjusted.Mortality.Rate is for craniotomy procedure and acute stroke condition.
The Risk.Adjusted.Mortality.Rate is lower for PCI procedure and for heart failure, AMI and GI Hemorrhage conditions.

Histrograms for Procedures for Hospital.Ratings.

df_p_hr_ae <- df_p[which(df_p$Hospital.Ratings=="As Expected"),]
df_p_hr <- df_p[which(df_p$Hospital.Ratings!="As Expected"),]

p1 <- ggplot(df_p_hr_ae,aes(Procedure.Condition, colour=factor(Hospital.Ratings),fill=factor(Hospital.Ratings))) +
  stat_count(width = 0.3, position="dodge")+
  labs(title="Histogram for Procedures")+
  labs(x="Procedure", y="Count")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p2 <- ggplot(df_p_hr,aes(Procedure.Condition,colour=factor(Hospital.Ratings),fill=factor(Hospital.Ratings))) +
  stat_count(width = 0.3, position="dodge")+
  labs(title="Histogram for Procedures")+
  labs(x="Procedure", y="Count")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

grid.arrange(p1, p2, ncol=1)

Hospital ratings are the same between worse and better for craniotomy procedure.
Hospital ratings are worse for PCI procedure.

Histrograms for Conditions for Hospital.Ratings.

df_c_hr_ae <- df_c[which(df_c$Hospital.Ratings=="As Expected"),]
df_c_hr <- df_c[which(df_c$Hospital.Ratings!="As Expected"),]

p3 <- ggplot(df_c_hr_ae,aes(Procedure.Condition, colour=factor(Hospital.Ratings),fill=factor(Hospital.Ratings))) +
  stat_count(width = 0.3, position="dodge")+
  labs(title="Histogram for Conditions")+
  labs(x="Condition", y="Count")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p4 <- ggplot(df_c_hr,aes(Procedure.Condition,colour=factor(Hospital.Ratings),fill=factor(Hospital.Ratings))) +
  stat_count(width = 0.3, position="dodge")+
  labs(title="Histogram for Consitions")+
  labs(x="Condition", y="Count")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

grid.arrange(p3, p4, ncol=1)

Hospital ratings are the same between worse and better for acute stroke condition.
Hospital ratings are worse for AMI, GI Hemorrhage and heart failure conditions.

Associations between medical procedure or condition with hospital ratings.

table(df_clean$Procedure.Condition,df_clean$Hospital.Ratings)

##                         
##                          As Expected Better Worse
##   AAA Repair                     281      0     2
##   Acute Stroke                  1792     74    73
##   AMI                            536     23    31
##   Carotid Endarterectomy         397      0     7
##   Craniotomy                     262     18    18
##   Esophageal Resection            75      0     0
##   GI Hemorrhage                  595     10    17
##   Heart Failure                  556     23    37
##   Hip Fracture                   422      0     4
##   Pancreatic Cancer              140      0     2
##   Pancreatic Other               128      0     2
##   Pancreatic Resection           186      0     4
##   PCI                            280      6    13
##   Pneumonia                      147      4     0

Conclusion 3:

Hospital ratings are the same between worse and better for craniotomy procedure and acute stroke condition.
Hospital ratings are worse for PCI procedure and for AMI, GI Hemorrhage and heart failure conditions.

Summary (from conclusions):

There is association between the risk adjusted mortality rate and hospital ratings.
Lower the risk adjusted mortality rate, better the hospital ratings.
Higher the risk adjusted mortality rate, worse the hospital ratings.
The highest Risk.Adjusted.Mortality.Rate is for craniotomy procedure and acute stroke condition.
The Risk.Adjusted.Mortality.Rate is lower for PCI procedure and for heart failure, AMI and GI Hemorrhage conditions.
Hospital ratings are the same between worse and better for craniotomy procedure and acute stroke condition.
Hospital ratings are worse for PCI procedure and for AMI, GI Hemorrhage and heart failure conditions.

Risk.Adjusted.Mortality.Rate versus X..of.Cases or X..of.Deaths.

fit1 <- lm(Risk.Adjusted.Mortality.Rate ~ X..of.Cases, data=df_clean)
#summary(fit1)
p1 <- ggplot(df_clean, aes(X..of.Cases,Risk.Adjusted.Mortality.Rate))+ 
  geom_point(alpha = 0.1)+
  stat_smooth(method='lm')

fit2 <- lm(Risk.Adjusted.Mortality.Rate ~ X..of.Deaths, data=df_clean)
#summary(fit2)
p2 <- ggplot(df_clean, aes(X..of.Deaths,Risk.Adjusted.Mortality.Rate))+
  geom_point(alpha = 0.1)+
  stat_smooth(method='lm')

grid.arrange(p1, p2, ncol=2)

There is no accociation between the Risk.Adjusted.Mortality.Rate and X..of.Cases or X..of.Deaths.

Mapping of Hospital.Ratings. (need to understand more here)

CAmap <- get_map(location="California",source="google",maptype="roadmap",crop=FALSE,zoom=6)
ggmap (CAmap) +
  geom_point(aes(x=Longitude,y=Latitude,size=Hospital.Ratings),data=df_clean,alpha=.3,color="darkred")#,size=2)

CA Mortality Rates: Explanatory Data Analysis.

Maria P. Frushicheva

Monday, August 8, 2016

Description, analysis and cleaning of variables in the dataset.

Initial Explanatory Data Analysis

Summary (from conclusions):