Milestone Report for Capstone Project

Introduction

Importance: Using hospital quality ratings, patients are able to make a better decision in what hospital they want to be treated and where the best care is available in state of California, based on overall hospital performance or based on particular medical condition or procedure.

Question: Can we predict hospital quality ratings based on risk adjusted mortality rates, number of deaths, number of cases, medical procedures performed and medical conditions treated for 2012-2013?

Description of Data Set

Dataset: is available from California Hospital Inpatient Mortality Rates and Quality Ratings, 2012-2013.

Description of dataset: The dataset contains risk-adjusted mortality rates, and number of deaths and cases for 6 medical conditions treated (Acute Stroke, Acute Myocardial Infarction, Heart Failure, Gastrointestinal Hemorrhage, Hip Fracture and Pneumonia) and 6 procedures performed (Abdominal Aortic Aneurysm Repair, Carotid Endarterectomy, Craniotomy, Esophageal Resection, Pancreatic Resection, Percutaneous Coronary Intervention) in California hospitals for 2012 and 2013. This dataset does not include conditions treated or procedures performed in outpatient settings.

Description, Analysis and Cleaning of Variables in the Data Set

Load the data from csv file.

setwd("C:/Users/postdoc/Dropbox (Personal)/SpringBoard Fund/Rprojects/")
data <- read.csv("California_Hospital_Inpatient_Mortality_Rates_and_Quality_Ratings__2012-2013.csv",sep=",",header=TRUE)
df <- tbl_df(data)
# glimpse(df)

Dataset: 11169 observations and 12 variables.

Variables with missing values:

Risk Adjusted Mortality Rate: The Risk Adjusted Mortality Rates (RAMR) presented here adjusts the observed mortality rates. This statistical methodology takes into account pre-existing health problems that put some patients at greater risk of death to level the playing field and allow fair comparisons across hospitals. 4754 missing values.
Number of Deaths: Number of patients that died in this hospital. 4926 missing values.
Number of Cases: Number of patients that had this medical procedure or condition in this hospital. 5004 missing values.

Remove missing values, because number of missing values consists of half of dataset.

df_clean <- df[which(is.na(df$X..of.Cases)==F),]

Clean Dataset: 6165 observations and 12 variables.

Variables with no missing values:

Year: two years 2012 and 2013; 3100 values for 2012 year and 3065 values for 2013 year.
County: 55 counties.
Hospital: 341 hospitals.
OSHPDID: A unique number established by the Office of Statewide Health Planning and Development (OSHPD) for identifying facilities and used in the Licensed Facility Information System (LFIS). The first three numbers identify the type of facility, the next two represent the county number, and the last five are randomly assigned within each county. 570261 unique codes.
Longitude: Longitude of hospital.
Latitude: Latitude of hospital.
location1: 333 levels.
Hospital Ratings: Comparison rating based on a 95% Confidence Interval (CI). If a hospitals upper CI is less than the statewide observed rate, it is designated as performing better than the average hospital. If a hospitals lower CI is greater than the state rate, it is designated as performing worse than the average state hospital. 3 levels of Hospital Ratings: As Expected, Better and Worse.

summary(df_clean$Hospital.Ratings)

## As Expected      Better       Worse 
##        5797         158         210

Procedure/Condition: Procedure that was performed or condition that was treated. 6 medical procedures performed: Abdominal Aortic Aneurysm (AAA) Repair, Carotid Endarterectomy, Craniotomy, Esophageal Resection, Pancreatic Resection, Percutaneous Coronary Intervention. 6 medical conditions treated: Acute Stroke, Acute Myocardial Infarction, Heart Failure, Gastrointestinal Hemorrhage, Hip Fracture and Pneumonia. Clean dataset consists of 17 levels, instead of 12.

summary(df_clean$Procedure.Condition)

##                AAA Repair              Acute Stroke 
##                       283                       617 
##  Acute Stroke Hemorrhagic     Acute Stroke Ischemic 
##                       466                       615 
## Acute Stroke Subarachnoid                       AMI 
##                       241                       590 
##    Carotid Endarterectomy                Craniotomy 
##                       404                       298 
##      Esophageal Resection             GI Hemorrhage 
##                        75                       622 
##             Heart Failure              Hip Fracture 
##                       616                       426 
##         Pancreatic Cancer          Pancreatic Other 
##                       142                       130 
##      Pancreatic Resection                       PCI 
##                       190                       299 
##                 Pneumonia 
##                       151

Decoding Procedure/Condition variable.

According to the American Stroke Association (ASA), strokes can be classified into 2 main categories: 87% are ischemic strokes, caused by blockage of an artery; 13% are hemorrhagic strokes, caused by bleeding. Ischemic strokes are further divided into 2 groups: thrombotic and embolic strokes. Hemorrhagic strokes are divided into 2 main categories: intracerebral and subarachnoid hemorrhages.

Our clean dataset has four categories for Acute Stroke:

Acute Stroke: 617 observations;
Acute Stroke Hemorrhagic: 466 observations;
Acute Stroke Ischemic: 615 obervations;
Acute Stroke Subarachnoid: 241 observations.

Within each hospital, there are different notations for Acute Stroke variable. It suggests that different doctor uses different notations for the condition. These four categories are combined in one: Acute Stroke.

df_clean$Procedure.Condition <- gsub("Acute Stroke .*","Acute Stroke",df_clean$Procedure.Condition)
df_clean$Procedure.Condition <- factor(df_clean$Procedure.Condition)

Two additional categories are present in Procedure/Condition variable:

Pancreatic Cancer: 142 observations;
Pancreatic Other: 130 observations.

These categories are separate medical conditions and are not combined in one category.

The Procedure.Condition variable contains 6 medical procedures and 8 medical conditions. To indicate what procedure was performed or what condition was treated, the Medical_Categorey variable was added to the clean dataset.

df_clean <- df_clean %>% 
  mutate(Medical_Category = ifelse(grepl("Repair",Procedure.Condition) | grepl("Endarterectomy",Procedure.Condition) | grepl("Craniotomy",Procedure.Condition) | grepl("Resection",Procedure.Condition) | grepl("PCI",Procedure.Condition),               "Procedure", "Condition"))
#glimpse(df_clean)

Explanatory Data Analysis

Density Plots for # of Cases, # of Deaths and Risk Adjusted Mortality Rate by Hospital Ratings.

p1 <- ggplot(df_clean,aes(log(X..of.Cases),fill=factor(Hospital.Ratings),colour=factor(Hospital.Ratings)))+
  geom_density(alpha = 0.1)

p2 <- ggplot(df_clean,aes(log(X..of.Deaths),fill=factor(Hospital.Ratings),colour=factor(Hospital.Ratings)))+
  geom_density(alpha = 0.1)

p3 <- ggplot(df_clean,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Hospital.Ratings),colour=factor(Hospital.Ratings)))+
  geom_density(alpha = 0.1)

grid.arrange(p1, p2, p3, ncol=1)

Conclusions 1:

Distributions between 2012 and 2013 years look simiar (graphs are not shown).
There are no associations between variables for number of deaths and number of cases.
There is association between the risk adjusted mortality rate and hospital ratings.
Lower the risk adjusted mortality rate, better the hospital ratings.
Higher the risk adjusted mortality rate, worse the hospital ratings.

Associations between medical procedures or conditions with hospital ratings, number of cases, number of deaths and risk adjusted mortality rate.

Procedures.

df_p <- df_clean[which(df_clean$Medical_Category=="Procedure"),]

df_p_all <- df_p %>% 
  group_by(Procedure.Condition) %>% 
  summarise(all_cases = sum(X..of.Cases), 
            all_deaths = sum(X..of.Deaths),
            all_mortality_rate = sum(Risk.Adjusted.Mortality.Rate))

# df_p_all %>% arrange(desc(all_cases))
# df_p_all %>% arrange(desc(all_deaths))
# df_p_all %>% arrange(desc(all_mortality_rate))

df_p_all

## # A tibble: 6 x 4
##      Procedure.Condition all_cases all_deaths all_mortality_rate
##                   <fctr>     <int>      <int>              <dbl>
## 1             AAA Repair      4927         59              508.5
## 2 Carotid Endarterectomy     12478         60              290.2
## 3             Craniotomy     30164       2159             2354.6
## 4   Esophageal Resection       619         28              436.9
## 5   Pancreatic Resection      3356         93             1002.8
## 6                    PCI     78660       2028              793.6

Conditions.

df_c <- df_clean[which(df_clean$Medical_Category=="Condition"),]

df_c_all <- df_c %>% 
  group_by(Procedure.Condition) %>% 
  summarise(all_cases = sum(X..of.Cases), 
            all_deaths = sum(X..of.Deaths),
            all_mortality_rate = sum(Risk.Adjusted.Mortality.Rate))

#df_c_all %>% arrange(desc(all_cases))
#df_c_all %>% arrange(desc(all_deaths))
#df_c_all %>% arrange(desc(all_mortality_rate))

df_c_all

## # A tibble: 8 x 4
##   Procedure.Condition all_cases all_deaths all_mortality_rate
##                <fctr>     <int>      <int>              <dbl>
## 1        Acute Stroke    217956      20461            26582.9
## 2                 AMI     93594       5731             3863.4
## 3       GI Hemorrhage     94804       2099             1597.8
## 4       Heart Failure    155066       4778             2200.0
## 5        Hip Fracture     32245        744              945.8
## 6   Pancreatic Cancer      1787         43              632.0
## 7    Pancreatic Other      1425         41              590.0
## 8           Pneumonia     20630       1019              552.5

Hospital Ratings.

prop.table(table(df_clean$Procedure.Condition,df_clean$Hospital.Ratings))*100

##                         
##                          As Expected     Better      Worse
##   AAA Repair               4.5579886  0.0000000  0.0324412
##   Acute Stroke            29.0673155  1.2003244  1.1841038
##   AMI                      8.6942417  0.3730738  0.5028386
##   Carotid Endarterectomy   6.4395783  0.0000000  0.1135442
##   Craniotomy               4.2497972  0.2919708  0.2919708
##   Esophageal Resection     1.2165450  0.0000000  0.0000000
##   GI Hemorrhage            9.6512571  0.1622060  0.2757502
##   Heart Failure            9.0186537  0.3730738  0.6001622
##   Hip Fracture             6.8450933  0.0000000  0.0648824
##   Pancreatic Cancer        2.2708840  0.0000000  0.0324412
##   Pancreatic Other         2.0762368  0.0000000  0.0324412
##   Pancreatic Resection     3.0170316  0.0000000  0.0648824
##   PCI                      4.5417680  0.0973236  0.2108678
##   Pneumonia                2.3844282  0.0648824  0.0000000

Conclusions 2:

The highest number of cases is for PCI and Craniotomy procedures, Acute Stroke, Heart Failure, AMI and GI Hemorrhage conditions.
The highest number of deaths is for Craniotomy and PCI procedures, Acute Stroke, AMI and Heart Failure conditions.
The highest mortality rates is for Craniotomy and Pancreatic Resection procedures, Acute Stroke, AMI and Heart Failure conditions.
The lowest number of cases is for Esophageal Resection procedure, Pancreatic Cancer and Pancreatic Other conditions.
The lowest number of deaths is for Esophageal Resection procedure, Pancreatic Cancer and Pancreatic Other conditions.
The lowest mortality rates is for Carotid Endarterectomy procedure, Pancreatic Other and Pneumonia conditions.
Better ratings are for Craniotomy procedure, Acute Stroke, AMI, GI Hemorrhage and Heart Failure conditions.
Worse ragings are for Craniotomy and PCI procedures, Acute Stroke, AMI, GI Hemorrhage and Heart Failure conditions.
As Expected ratings are for Acute Stroke, AMI, GI Hemorrhage and Heart Failure conditions.

Density Plots for Risk Adjusted Mortality Rate by Procedures Performed and Hospital Ratings.

p6 <- ggplot(df_p,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Hospital.Ratings),colour=factor(Hospital.Ratings)))+
  geom_density(alpha = 0.1)+
  theme(legend.position='bottom')+
  facet_wrap(~ Procedure.Condition, ncol=2, scales="free_y")

p6

Density Plots for Risk Adjusted Mortality Rate by Conditions Treated and Hospital Ratings.

p9 <- ggplot(df_c,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Hospital.Ratings),colour=factor(Hospital.Ratings)))+
  geom_density(alpha = 0.1)+
  theme(legend.position='bottom')+
  facet_wrap(~ Procedure.Condition, ncol=2, scales="free_y")

p9

Summary: The most contributing procedues and conditions.

Procedures:
- the highest: PCI, Craniotomy, Pancreatic Resection.
- the lowest: Esophageal Resection, Carotid Endarterectomy.
Conditions:
- the highest: Acute Stroke, AMI, Heart Failure, GI Hemorrhage.
- the lowest: Pancreatic Cancer, Pancreatic Other, Pneumonia.

Mapping and summary of overall hospital quality ratings and mean mortality rate among all conditions and procedures.

Summary of hospital ratings over all conditions and procedues.

df_clean <- df_clean %>% mutate(ratings =
           ifelse(grepl("As Expected",Hospital.Ratings),"0",                        
           ifelse(grepl("Better",Hospital.Ratings),"1",
           ifelse(grepl("Worse",Hospital.Ratings),"-1",NA))))
df_clean$ratings <- as.numeric(df_clean$ratings)

all_ratings <- df_clean %>% 
                group_by(Hospital,Latitude,Longitude) %>% 
                summarise(all_ratings = sum(ratings), 
                          mean_mortality_rate = mean(Risk.Adjusted.Mortality.Rate)) %>% 
                mutate(ratings =
                         ifelse(all_ratings > 0,"Better",
                         ifelse(all_ratings < 0, "Worse","As Expected")))
all_ratings$ratings <- as.factor(all_ratings$ratings)
all_ratings <- tbl_df(all_ratings)

Overall Hospital Ratings:

summary(all_ratings$ratings)

## As Expected      Better       Worse 
##         172          69          99

Hospitals with better quality ratings:

all_ratings %>% arrange(desc(all_ratings)) %>% select(Hospital,all_ratings) %>% slice(1:10)

## # A tibble: 10 x 2
##                                                        Hospital
##                                                          <fctr>
## 1                   Kaiser Foundation Hospital â Redwood City
## 2                         Kaiser Foundation Hospital â Sunset
## 3                             Centinela Hospital Medical Center
## 4                                        Scripps Green Hospital
## 5                                   Cedars Sinai Medical Center
## 6                                        Desert Valley Hospital
## 7                                Encino Hospital Medical Center
## 8          Glendale Adventist Medical Center â Wilson Terrace
## 9  Kaiser Foundation Hospital â Rehabilitation Center Vallejo
## 10                 Kaiser Foundation Hospital â San Francisco
## # ... with 1 more variables: all_ratings <dbl>

Hospitals with worse quality ratings:

all_ratings %>% arrange(all_ratings) %>% select(Hospital,all_ratings) %>% slice(1:10)

## # A tibble: 10 x 2
##                                                               Hospital
##                                                                 <fctr>
## 1                                       Palomar Health Downtown Campus
## 2  Los Angeles County/University of Southern California Medical Center
## 3                                       San Francisco General Hospital
## 4                                       Santa Barbara Cottage Hospital
## 5                                    Arrowhead Regional Medical Center
## 6                                                   Grossmont Hospital
## 7                                              Sharp Memorial Hospital
## 8                                             Antelope Valley Hospital
## 9                                               Doctors Medical Center
## 10                                Good Samaritan Hospital â San Jose
## # ... with 1 more variables: all_ratings <dbl>

Hospitals with the highest mean mortality rate:

all_ratings %>% arrange(desc(mean_mortality_rate)) %>% select(Hospital,mean_mortality_rate)  %>% slice(1:10)

## # A tibble: 10 x 2
##                                Hospital mean_mortality_rate
##                                  <fctr>               <dbl>
## 1           Memorial Hospital Los Banos            28.35714
## 2          Mountains Community Hospital            27.45000
## 3    Santa Ynez Valley Cottage Hospital            26.47500
## 4       Biggs Gridley Memorial Hospital            24.31111
## 5      Coalinga Regional Medical Center            23.71667
## 6       George L. Mee Memorial Hospital            19.30000
## 7  Adventist Medical Center â Reedley            18.83333
## 8     Central Valley Specialty Hospital            18.70000
## 9          San Joaquin General Hospital            14.68095
## 10       Goleta Valley Cottage Hospital            14.26000

Hospitals with the lowest mean mortality rate:

all_ratings %>% arrange(mean_mortality_rate) %>% select(Hospital,mean_mortality_rate) %>% slice(1:10)

## # A tibble: 10 x 2
##                                           Hospital mean_mortality_rate
##                                             <fctr>               <dbl>
## 1                       Corcoran District Hospital                   0
## 2       Eastern Plumas Hospital â Portola Campus                   0
## 3                             Glenn Medical Center                   0
## 4          Good Samaritan Hospital â Bakersfield                   0
## 5                        Hoag Orthopedic Institute                   0
## 6                  Kern Valley Healthcare District                   0
## 7  Laguna Honda Hospital and Rehabilitation Center                   0
## 8                                 Mammoth Hospital                   0
## 9                           Southern Inyo Hospital                   0
## 10                              Tehachapi Hospital                   0

Mapping of overall hospital ratings and mean mortality rates.

CAmap <- get_map(location="California",source="google",maptype="roadmap",crop=FALSE,zoom=6) 
ggmap (CAmap) +
  geom_point(aes(x=Longitude,y=Latitude,size=mean_mortality_rate,colour=ratings),data=all_ratings,alpha=0.7)+
  scale_colour_manual(values=c("Worse" = "darkred","Better" = "darkblue","As Expected" = "darkgrey"))+
  scale_size(range = c(0, 10))

Analysis 2:

I will analyze which condition or procedure has the best or worse hospital ratings, and map these results only for selected conditions or procedures.

Predictions

To predict hospital ratings I will use classification decision trees, random forests or multinomial logistic regression. First, I will use three models to train them on 2012-2013 data and access which model gives the best performance on the training data, then I will test three model performances on 2014 test data.

Recommendations to Patients

Hospitals with the best overall ratings and the lowest mean mortality rate in state of California.

best_ratings <- all_ratings %>% arrange(desc(all_ratings)) %>% slice(1:25)
lowest_mort_rate <- all_ratings %>% arrange(mean_mortality_rate) %>% slice(1:25)
best_lowest <- bind_rows(best_ratings,lowest_mort_rate)

CAmap <- get_map(location="California",source="google",maptype="roadmap",crop=FALSE,zoom=6) 
ggmap (CAmap) +
  geom_point(aes(x=Longitude,y=Latitude,size=ratings,colour=mean_mortality_rate),data=best_lowest,alpha=0.7)+
  scale_colour_distiller(type = "div", palette = "RdBu", direction = -1)

Hospitals with …

Recommend to patients which hospital has the best ratings for particular medical condition or procedure in state of California;

Hospitals with …

Recommend which hospital will have the best care in the future using predicted hospital ratings.

Milestone Report for Capstone Project

Maria P. Frushicheva

August 12, 2016

Introduction

Description of Data Set

Description, Analysis and Cleaning of Variables in the Data Set

Explanatory Data Analysis

Mapping and summary of overall hospital quality ratings and mean mortality rate among all conditions and procedures.

Analysis 2:

Predictions

Recommendations to Patients

Hospitals with the best overall ratings and the lowest mean mortality rate in state of California.

Hospitals with …

Hospitals with …