Capstone Project: CA Hospital Quality Ratings Prediction

Introduction

Importance: Using hospital quality ratings, patients are able to make a better decision in what hospital they want to be treated and where the best care is available in state of California, based on overall hospital performance or based on particular medical condition or procedure.

Question: Can we predict hospital quality ratings based on risk adjusted mortality rates, number of deaths, number of cases, medical procedures performed and medical conditions treated for 2012-2013?

Description of Data Set

Dataset: is available from California Hospital Inpatient Mortality Rates and Quality Ratings, 2012-2013.

Description of dataset: The dataset contains risk-adjusted mortality rates, and number of deaths and cases for 6 medical conditions treated (Acute Stroke, Acute Myocardial Infarction, Heart Failure, Gastrointestinal Hemorrhage, Hip Fracture and Pneumonia) and 6 procedures performed (Abdominal Aortic Aneurysm Repair, Carotid Endarterectomy, Craniotomy, Esophageal Resection, Pancreatic Resection, Percutaneous Coronary Intervention) in California hospitals for 2012 and 2013. This dataset does not include conditions treated or procedures performed in outpatient settings.

Description, Analysis and Cleaning of Variables in the Data Set

Load the data from csv file.

setwd("C:/Users/postdoc/Dropbox (Personal)/SpringBoard Fund/Rprojects/")
data <- read.csv("California_Hospital_Inpatient_Mortality_Rates_and_Quality_Ratings__2012-2013.csv",sep=",",header=TRUE)
df <- tbl_df(data)

Dataset: 11169 observations and 12 variables.

Variables with missing values:

Risk Adjusted Mortality Rate: The Risk Adjusted Mortality Rates (RAMR) presented here adjusts the observed mortality rates. This statistical methodology takes into account pre-existing health problems that put some patients at greater risk of death to level the playing field and allow fair comparisons across hospitals. 4754 missing values.
Number of Deaths: Number of patients that died in this hospital. 4926 missing values.
Number of Cases: Number of patients that had this medical procedure or condition in this hospital. 5004 missing values.

Remove missing values, because number of missing values consists of half of dataset.

df_clean <- df[which(is.na(df$X..of.Cases)==F),]

Clean Dataset: 6165 observations and 12 variables.

Variables with no missing values:

Year: two years 2012 and 2013; 3100 values for 2012 year and 3065 values for 2013 year.
County: 55 counties.
Hospital: 341 hospitals.
OSHPDID: A unique number established by the Office of Statewide Health Planning and Development (OSHPD) for identifying facilities and used in the Licensed Facility Information System (LFIS). The first three numbers identify the type of facility, the next two represent the county number, and the last five are randomly assigned within each county. 570261 unique codes.
Longitude: Longitude of hospital.
Latitude: Latitude of hospital.
location1: 333 levels.
Hospital Ratings: Comparison rating based on a 95% Confidence Interval (CI). If a hospitals upper CI is less than the statewide observed rate, it is designated as performing better than the average hospital. If a hospitals lower CI is greater than the state rate, it is designated as performing worse than the average state hospital. 3 levels of Hospital Ratings: As Expected, Better and Worse.

summary(df_clean$Hospital.Ratings)

## As Expected      Better       Worse 
##        5797         158         210

Procedure/Condition: Procedure that was performed or condition that was treated. 6 medical procedures performed: Abdominal Aortic Aneurysm (AAA) Repair, Carotid Endarterectomy, Craniotomy, Esophageal Resection, Pancreatic Resection, Percutaneous Coronary Intervention. 6 medical conditions treated: Acute Stroke, Acute Myocardial Infarction, Heart Failure, Gastrointestinal Hemorrhage, Hip Fracture and Pneumonia. Clean dataset consists of 17 levels, instead of 12.

summary(df_clean$Procedure.Condition)

##                AAA Repair              Acute Stroke 
##                       283                       617 
##  Acute Stroke Hemorrhagic     Acute Stroke Ischemic 
##                       466                       615 
## Acute Stroke Subarachnoid                       AMI 
##                       241                       590 
##    Carotid Endarterectomy                Craniotomy 
##                       404                       298 
##      Esophageal Resection             GI Hemorrhage 
##                        75                       622 
##             Heart Failure              Hip Fracture 
##                       616                       426 
##         Pancreatic Cancer          Pancreatic Other 
##                       142                       130 
##      Pancreatic Resection                       PCI 
##                       190                       299 
##                 Pneumonia 
##                       151

Decoding Procedure/Condition variable.

According to the American Stroke Association (ASA), strokes can be classified into 2 main categories: 87% are ischemic strokes, caused by blockage of an artery; 13% are hemorrhagic strokes, caused by bleeding. Ischemic strokes are further divided into 2 groups: thrombotic and embolic strokes. Hemorrhagic strokes are divided into 2 main categories: intracerebral and subarachnoid hemorrhages.

Our clean dataset has four categories for Acute Stroke:

Acute Stroke: 617 observations;
Acute Stroke Hemorrhagic: 466 observations;
Acute Stroke Ischemic: 615 obervations;
Acute Stroke Subarachnoid: 241 observations.

Within each hospital, there are different notations for Acute Stroke variable. It suggests that different doctor uses different notations for the condition. These four categories are combined in one: Acute Stroke.

df_clean$Procedure.Condition <- gsub("Acute Stroke .*","Acute Stroke",df_clean$Procedure.Condition)
df_clean$Procedure.Condition <- factor(df_clean$Procedure.Condition)

Two additional categories are present in Procedure/Condition variable:

Pancreatic Cancer: 142 observations;
Pancreatic Other: 130 observations.

These categories are separate medical conditions and are not combined in one category.

The Procedure.Condition variable contains 6 medical procedures and 8 medical conditions. To indicate what procedure was performed or what condition was treated, the Medical_Categorey variable was added to the clean dataset.

df_clean <- df_clean %>% 
  mutate(Medical_Category = ifelse(grepl("Repair",Procedure.Condition) | grepl("Endarterectomy",Procedure.Condition) | grepl("Craniotomy",Procedure.Condition) | grepl("Resection",Procedure.Condition) | grepl("PCI",Procedure.Condition),               "Procedure", "Condition"))

Explanatory Data Analysis

Density Plots for # of Cases, # of Deaths and Risk Adjusted Mortality Rate by Hospital Ratings.

p1 <- ggplot(df_clean,aes(log(X..of.Cases),fill=factor(Hospital.Ratings),colour=factor(Hospital.Ratings)))+
  geom_density(alpha = 0.1)

p2 <- ggplot(df_clean,aes(log(X..of.Deaths),fill=factor(Hospital.Ratings),colour=factor(Hospital.Ratings)))+
  geom_density(alpha = 0.1)

p3 <- ggplot(df_clean,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Hospital.Ratings),colour=factor(Hospital.Ratings)))+
  geom_density(alpha = 0.1)

grid.arrange(p1, p2, p3, ncol=1)

Conclusions 1:

Distributions between 2012 and 2013 years look simiar (graphs are not shown).
There are no associations between variables for number of deaths and number of cases.
There is association between the risk adjusted mortality rate and hospital ratings.
Lower the risk adjusted mortality rate, better the hospital ratings.
Higher the risk adjusted mortality rate, worse the hospital ratings.

Density Plots for Risk Adjusted Mortality Rate by Procedures Performed and Hospital Ratings.

df_p <- df_clean[which(df_clean$Medical_Category=="Procedure"),]

p6 <- ggplot(df_p,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Hospital.Ratings),colour=factor(Hospital.Ratings)))+
  geom_density(alpha = 0.1)+
  theme(legend.position='bottom')+
  facet_wrap(~ Procedure.Condition, ncol=2, scales="free_y")
p6

The highest mortality rates is for Craniotomy and Pancreatic Resection procedures.
Better and worst ratings are for Craniotomy and PCI procedures.
There is association between the risk adjusted mortality rate and hospital ratings.

Density Plots for Risk Adjusted Mortality Rate by Conditions Treated and Hospital Ratings.

df_c <- df_clean[which(df_clean$Medical_Category=="Condition"),]

p9 <- ggplot(df_c,aes(log(Risk.Adjusted.Mortality.Rate),fill=factor(Hospital.Ratings),colour=factor(Hospital.Ratings)))+
  geom_density(alpha = 0.1)+
  theme(legend.position='bottom')+
  facet_wrap(~ Procedure.Condition, ncol=2, scales="free_y")
p9

The highest mortality rates is for Acute Stroke, AMI and Heart Failure conditions.
Better and worse ratings are for Acute Stroke, AMI, GI Hemorrhage and Heart Failure conditions.
There is association between the risk adjusted mortality rate and hospital ratings.

Associations between medical procedures or conditions with hospital ratings, number of cases, number of deaths and risk adjusted mortality rate.

Procedures.

df_p_all <- df_p %>% 
  group_by(Procedure.Condition) %>% 
  summarise(all_cases = sum(X..of.Cases), 
            all_deaths = sum(X..of.Deaths),
            all_mortality_rate = sum(Risk.Adjusted.Mortality.Rate))
df_p_all

## # A tibble: 6 x 4
##      Procedure.Condition all_cases all_deaths all_mortality_rate
##                   <fctr>     <int>      <int>              <dbl>
## 1             AAA Repair      4927         59              508.5
## 2 Carotid Endarterectomy     12478         60              290.2
## 3             Craniotomy     30164       2159             2354.6
## 4   Esophageal Resection       619         28              436.9
## 5   Pancreatic Resection      3356         93             1002.8
## 6                    PCI     78660       2028              793.6

Conditions.

df_c_all <- df_c %>% 
  group_by(Procedure.Condition) %>% 
  summarise(all_cases = sum(X..of.Cases), 
            all_deaths = sum(X..of.Deaths),
            all_mortality_rate = sum(Risk.Adjusted.Mortality.Rate))
df_c_all

## # A tibble: 8 x 4
##   Procedure.Condition all_cases all_deaths all_mortality_rate
##                <fctr>     <int>      <int>              <dbl>
## 1        Acute Stroke    217956      20461            26582.9
## 2                 AMI     93594       5731             3863.4
## 3       GI Hemorrhage     94804       2099             1597.8
## 4       Heart Failure    155066       4778             2200.0
## 5        Hip Fracture     32245        744              945.8
## 6   Pancreatic Cancer      1787         43              632.0
## 7    Pancreatic Other      1425         41              590.0
## 8           Pneumonia     20630       1019              552.5

The highest number of cases is for PCI and Craniotomy procedures, Acute Stroke, Heart Failure, AMI and GI Hemorrhage conditions.
The highest number of deaths is for Craniotomy and PCI procedures, Acute Stroke, AMI and Heart Failure conditions.
The highest mortality rates is for Craniotomy and Pancreatic Resection procedures, Acute Stroke, AMI and Heart Failure conditions.
The lowest number of cases is for Esophageal Resection procedure, Pancreatic Cancer and Pancreatic Other conditions.
The lowest number of deaths is for Esophageal Resection procedure, Pancreatic Cancer and Pancreatic Other conditions.
The lowest mortality rates is for Carotid Endarterectomy procedure, Pancreatic Other and Pneumonia conditions.

Hospital Ratings.

prop.table(table(df_clean$Procedure.Condition,df_clean$Hospital.Ratings))*100

##                         
##                          As Expected     Better      Worse
##   AAA Repair               4.5579886  0.0000000  0.0324412
##   Acute Stroke            29.0673155  1.2003244  1.1841038
##   AMI                      8.6942417  0.3730738  0.5028386
##   Carotid Endarterectomy   6.4395783  0.0000000  0.1135442
##   Craniotomy               4.2497972  0.2919708  0.2919708
##   Esophageal Resection     1.2165450  0.0000000  0.0000000
##   GI Hemorrhage            9.6512571  0.1622060  0.2757502
##   Heart Failure            9.0186537  0.3730738  0.6001622
##   Hip Fracture             6.8450933  0.0000000  0.0648824
##   Pancreatic Cancer        2.2708840  0.0000000  0.0324412
##   Pancreatic Other         2.0762368  0.0000000  0.0324412
##   Pancreatic Resection     3.0170316  0.0000000  0.0648824
##   PCI                      4.5417680  0.0973236  0.2108678
##   Pneumonia                2.3844282  0.0648824  0.0000000

Better ratings are for Craniotomy procedure, Acute Stroke, AMI and Heart Failure conditions.
Worse ragings are for Craniotomy and PCI procedures, Acute Stroke, AMI, GI Hemorrhage and Heart Failure conditions.
As Expected ratings are for Acute Stroke, AMI, GI Hemorrhage and Heart Failure conditions.

Conclusions 2:

There is association between the risk adjusted mortality rate and hospital ratings.
Lower the risk adjusted mortality rate, better the hospital ratings.
Higher the risk adjusted mortality rate, worse the hospital ratings.
Procedures:
- with severe outcomes: PCI, Craniotomy and Pancreatic Resection.
- with good outcomes: Esophageal Resection and Carotid Endarterectomy.
Conditions:
- with severe outcomes: Acute Stroke, AMI, Heart Failure and GI Hemorrhage.
- with good outcomes: Pancreatic Cancer, Pancreatic Other and Pneumonia.

Mapping and summary of overall hospital quality ratings and mean mortality rate among all conditions and procedures.

Summary of hospital ratings over all conditions and procedues.

df_clean <- df_clean %>% mutate(ratings =
           ifelse(grepl("As Expected",Hospital.Ratings),"0",                        
           ifelse(grepl("Better",Hospital.Ratings),"1",
           ifelse(grepl("Worse",Hospital.Ratings),"-1",NA))))
df_clean$ratings <- as.numeric(df_clean$ratings)

all_ratings <- df_clean %>% 
                group_by(Hospital,Latitude,Longitude) %>% 
                summarise(all_ratings = sum(ratings), 
                          mean_mortality_rate = mean(Risk.Adjusted.Mortality.Rate)) %>% 
                mutate(ratings =
                         ifelse(all_ratings > 0,"Better",
                         ifelse(all_ratings < 0, "Worse","As Expected")))
all_ratings$ratings <- as.factor(all_ratings$ratings)
all_ratings <- tbl_df(all_ratings)

Mapping of overall hospital ratings and mean mortality rates.

CAmap <- get_map(location="California",source="google",maptype="roadmap",crop=FALSE,zoom=6) 
ggmap (CAmap) +
  geom_point(aes(x=Longitude,y=Latitude,size=mean_mortality_rate,colour=ratings),data=all_ratings,alpha=0.5)+
  scale_colour_manual(values=c("Worse" = "darkred","Better" = "darkblue","As Expected" = "darkgrey"))+
  scale_size(range = c(0, 10))

Overall Hospital Ratings:

summary(all_ratings$ratings)

## As Expected      Better       Worse 
##         172          69          99

Hospitals with the best quality ratings:

all_ratings %>% arrange(desc(all_ratings)) %>% select(Hospital) %>% slice(1:10)

## # A tibble: 10 x 1
##                                                        Hospital
##                                                          <fctr>
## 1                   Kaiser Foundation Hospital â Redwood City
## 2                         Kaiser Foundation Hospital â Sunset
## 3                             Centinela Hospital Medical Center
## 4                                        Scripps Green Hospital
## 5                                   Cedars Sinai Medical Center
## 6                                        Desert Valley Hospital
## 7                                Encino Hospital Medical Center
## 8          Glendale Adventist Medical Center â Wilson Terrace
## 9  Kaiser Foundation Hospital â Rehabilitation Center Vallejo
## 10                 Kaiser Foundation Hospital â San Francisco

Hospitals with the lowest mean mortality rate:

all_ratings %>% arrange(mean_mortality_rate) %>% select(Hospital) %>% slice(1:10)

## # A tibble: 10 x 1
##                                           Hospital
##                                             <fctr>
## 1                       Corcoran District Hospital
## 2       Eastern Plumas Hospital â Portola Campus
## 3                             Glenn Medical Center
## 4          Good Samaritan Hospital â Bakersfield
## 5                        Hoag Orthopedic Institute
## 6                  Kern Valley Healthcare District
## 7  Laguna Honda Hospital and Rehabilitation Center
## 8                                 Mammoth Hospital
## 9                           Southern Inyo Hospital
## 10                              Tehachapi Hospital

Hospitals with the worst quality ratings:

all_ratings %>% arrange(all_ratings) %>% select(Hospital) %>% slice(1:10)

## # A tibble: 10 x 1
##                                                               Hospital
##                                                                 <fctr>
## 1                                       Palomar Health Downtown Campus
## 2  Los Angeles County/University of Southern California Medical Center
## 3                                       San Francisco General Hospital
## 4                                       Santa Barbara Cottage Hospital
## 5                                    Arrowhead Regional Medical Center
## 6                                                   Grossmont Hospital
## 7                                              Sharp Memorial Hospital
## 8                                             Antelope Valley Hospital
## 9                                               Doctors Medical Center
## 10                                Good Samaritan Hospital â San Jose

Hospitals with the highest mean mortality rate:

all_ratings %>% arrange(desc(mean_mortality_rate)) %>% select(Hospital)  %>% slice(1:10)

## # A tibble: 10 x 1
##                                Hospital
##                                  <fctr>
## 1           Memorial Hospital Los Banos
## 2          Mountains Community Hospital
## 3    Santa Ynez Valley Cottage Hospital
## 4       Biggs Gridley Memorial Hospital
## 5      Coalinga Regional Medical Center
## 6       George L. Mee Memorial Hospital
## 7  Adventist Medical Center â Reedley
## 8     Central Valley Specialty Hospital
## 9          San Joaquin General Hospital
## 10       Goleta Valley Cottage Hospital

Summary of hospital quality ratings and mortality rates for Acute Stroke, AMI and Heart Failure conditions, PCI, Craniotomy and Pancreatic Resection procedures.

Summary for Acute Stroke condition.

df_as <- df_c[which(df_c$Procedure.Condition=="Acute Stroke"),]

df_as <- df_as %>% mutate(ratings =
           ifelse(grepl("As Expected",Hospital.Ratings),"0",                        
           ifelse(grepl("Better",Hospital.Ratings),"1",
           ifelse(grepl("Worse",Hospital.Ratings),"-1",NA))))
df_as$ratings <- as.numeric(df_as$ratings)

df_as_all <- df_as %>% 
                group_by(Hospital,Latitude,Longitude) %>% 
                summarise(all_ratings = sum(ratings), 
                          mean_mortality_rate = mean(Risk.Adjusted.Mortality.Rate)) %>% 
                mutate(ratings =
                         ifelse(all_ratings > 0,"Better",
                         ifelse(all_ratings < 0, "Worse","As Expected")))
df_as_all$ratings <- as.factor(df_as_all$ratings)
df_as_all <- tbl_df(df_as_all)

df_as_best_rat <- df_as_all %>% arrange(desc(all_ratings)) %>% slice(1:50)
df_as_best <- df_as_best_rat %>% arrange(mean_mortality_rate) %>% slice(1:25)
df_as_best %>% slice(1:10) %>% select(Hospital)

## # A tibble: 10 x 1
##                                                 Hospital
##                                                   <fctr>
## 1                  Pacific Alliance Medical Center, Inc.
## 2                               Anaheim General Hospital
## 3                                 Olympia Medical Center
## 4  Los Angeles County/Olive View â UCLA Medical Center
## 5                         Encino Hospital Medical Center
## 6                                Marina Del Rey Hospital
## 7                            Sutter Delta Medical Center
## 8                                 Desert Valley Hospital
## 9                           Sutter Solano Medical Center
## 10         Kaiser Foundation Hospital â Oakland Campus

Summary for AMI condition.

## # A tibble: 10 x 1
##                                           Hospital
##                                             <fctr>
## 1        Southern California Hospital at Hollywood
## 2                            Sherman Oaks Hospital
## 3           Kaiser Foundation Hospital â Antioch
## 4                   Encino Hospital Medical Center
## 5                 La Palma Intercommunity Hospital
## 6                           Scripps Green Hospital
## 7        Kaiser Foundation Hospital â Santa Rosa
## 8                         Paradise Valley Hospital
## 9  Kaiser Foundation Hospital â South Sacramento
## 10          Kaiser Foundation Hospital â Hayward

Summary for Heart Failure condition.

## # A tibble: 10 x 1
##                                Hospital
##                                  <fctr>
## 1  Adventist Medical Center â Reedley
## 2              Anaheim General Hospital
## 3                 Sherman Oaks Hospital
## 4            Barstow Community Hospital
## 5     Centinela Hospital Medical Center
## 6        Encino Hospital Medical Center
## 7                Desert Valley Hospital
## 8              Paradise Valley Hospital
## 9                Scripps Green Hospital
## 10        White Memorial Medical Center

Summary for PCI procedure.

## # A tibble: 10 x 1
##                                                Hospital
##                                                  <fctr>
## 1                                Brotman Medical Center
## 2                    Fresno Heart and Surgical Hospital
## 3                                    El Camino Hospital
## 4                        Downey Regional Medical Center
## 5                  Henry Mayo Newhall Memorial Hospital
## 6                                Scripps Green Hospital
## 7  California Pacific Medical Center â Pacific Campus
## 8                 Kaiser Foundation Hospital â Sunset
## 9      Community Memorial Hospital â San Buenaventura
## 10 Glendale Adventist Medical Center â Wilson Terrace

Summary for Craniotomy procedure.

## # A tibble: 10 x 1
##                                                Hospital
##                                                  <fctr>
## 1                                     Alhambra Hospital
## 2                                Desert Valley Hospital
## 3                     El Centro Regional Medical Center
## 4                          Saint Johnâs Health Center
## 5                                    El Camino Hospital
## 6       City of Hope Helford Clinical Research Hospital
## 7          Community Hospital of The Monterey Peninsula
## 8                 Community Hospital Monterey Peninsula
## 9  California Pacific Medical Center â Pacific Campus
## 10 Glendale Adventist Medical Center â Wilson Terrace

Summary for Pancreatic Resection procedure.

## # A tibble: 10 x 1
##                                            Hospital
##                                              <fctr>
## 1                     Alameda County Medical Center
## 2             Community Hospital Monterey Peninsula
## 3      Community Hospital of The Monterey Peninsula
## 4  Community Memorial Hospital â San Buenaventura
## 5                               Eden Medical Center
## 6                         Eisenhower Medical Center
## 7                Fresno Heart and Surgical Hospital
## 8              Good Samaritan Hospital â San Jose
## 9                                Grossmont Hospital
## 10      John Muir Medical Center â Concord Campus

Predictions

Approach

Predict hospital quality ratings using random forests and classification decision trees.
Train the models and evaluate the model performances on 2012 training data.
Test the model performances on 2013 test data.

Split the Data Set into 2012 training and 2013 test sets.

train <- df_clean[which(df_clean$Year==2012),]
test_original <- df_clean[which(df_clean$Year==2013),]
test <- subset(test_original, select = -Hospital.Ratings)

Hospital Ratings Prediction Using Random Forests.

Feature Engineering using Random Forests.

fit <- randomForest(Hospital.Ratings ~ Procedure.Condition + Risk.Adjusted.Mortality.Rate + X..of.Cases + X..of.Deaths, data=train,importance=TRUE,ntree=1000)
print(fit) # view results

## 
## Call:
##  randomForest(formula = Hospital.Ratings ~ Procedure.Condition +      Risk.Adjusted.Mortality.Rate + X..of.Cases + X..of.Deaths,      data = train, importance = TRUE, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 4.32%
## Confusion matrix:
##             As Expected Better Worse class.error
## As Expected        2902     10     5 0.005142269
## Better               56     24     1 0.703703704
## Worse                62      0    40 0.607843137

importance(fit) # importance of each predictor

##                              As Expected    Better    Worse
## Procedure.Condition            145.51584 24.974139 11.60154
## Risk.Adjusted.Mortality.Rate   104.67908 30.106975 73.00210
## X..of.Cases                     89.16738 50.456622 17.41210
## X..of.Deaths                    53.59510  6.323429 58.84989
##                              MeanDecreaseAccuracy MeanDecreaseGini
## Procedure.Condition                     149.34129         60.14185
## Risk.Adjusted.Mortality.Rate            112.79069        125.53213
## X..of.Cases                              95.16901         87.71509
## X..of.Deaths                             58.43046         67.69199

Mean Decrease Accuracy shows how important that variable is in classifying the data. They are ordered top-to-bottom as most- to least-important. Variables with a large mean decrease in accuracy are more important for classification of the data.

varImpPlot(fit,type=1)

The most important variables: Procedure.Condition and Risk.Adjusted.Mortality.Rate.
Mean Decrease Gini is a measure of how each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest. The Gini coefficient is a measure of homogeneity from 0 (homogeneous) to 1 (heterogeneous). Variables that result in nodes with higher purity have a higher decrease in Gini coefficient.

varImpPlot(fit,type=2)

Risk.Adjusted.Mortality.Rate. variable that results in nodes with higher purity.

Predictions with all variables.

# prediction
prediction <- predict(fit, test)
#fancyRpartPlot(fit)
# confusion matrix
cm <- as.matrix(table(Actual = test_original$Hospital.Ratings,Predicted = prediction))
cm

##              Predicted
## Actual        As Expected Better Worse
##   As Expected        2867      7     6
##   Better               49     28     0
##   Worse                72      0    36

n = sum(cm) # number of instances
nc = nrow(cm) # number of classes
diag = diag(cm) # number of correctly classified instances per class 
rowsums = apply(cm, 1, sum) # number of instances per class
colsums = apply(cm, 2, sum) # number of predictions per class
p = rowsums / n # distribution of instances over the actual classes
q = colsums / n # distribution of instances over the predicted classes

accuracy = sum(diag) / n 
accuracy

## [1] 0.9562806

precision = diag / colsums 
recall = diag / rowsums 
f1 = 2 * precision * recall / (precision + recall) 
data.frame(precision, recall, f1)

##             precision    recall        f1
## As Expected 0.9595047 0.9954861 0.9771643
## Better      0.8000000 0.3636364 0.5000000
## Worse       0.8571429 0.3333333 0.4800000

Clean 2012 training set and convert to the wide format based on Procedure.Condition and Risk.Adjusted.Mortality.Rate variables.

# convert train set to wide format. 
train2 <- train %>% 
                group_by(Hospital,Latitude,Longitude,Procedure.Condition) %>% 
                summarise(all_ratings = sum(ratings), 
                          mortality_rate = sum(Risk.Adjusted.Mortality.Rate)) %>% 
                mutate(ratings =
                         ifelse(all_ratings > 0,"Better",
                         ifelse(all_ratings < 0, "Worse","As Expected")))

train2$ratings <- as.factor(train2$ratings)
train2 <- tbl_df(train2)

train2_wide <- train2 %>% select(Hospital,Latitude,Longitude,Procedure.Condition,ratings,mortality_rate) %>% spread(Procedure.Condition,mortality_rate) 

# subset to ratings and procedures only
train2_wide_cut <- train2_wide[,-c(1,2,3)]
# remove white spaces from column names
colnames(train2_wide_cut) <- gsub(" ","",colnames(train2_wide_cut))
# replace NA with 0, beauce some hospitals does not treat these conditions, thus mortality rate is zero. 
train2_wide_cut[is.na(train2_wide_cut)] <- 0

Feature Enginering with Random Forests on wide format 2012 training set

fit <- randomForest(ratings ~ AAARepair + AcuteStroke + AMI + CarotidEndarterectomy + Craniotomy + EsophagealResection + GIHemorrhage + HeartFailure + HipFracture + PancreaticCancer + PancreaticOther + PancreaticResection + PCI + Pneumonia, data=train2_wide_cut,importance=TRUE,ntree=1000)
print(fit) # view results

## 
## Call:
##  randomForest(formula = ratings ~ AAARepair + AcuteStroke + AMI +      CarotidEndarterectomy + Craniotomy + EsophagealResection +      GIHemorrhage + HeartFailure + HipFracture + PancreaticCancer +      PancreaticOther + PancreaticResection + PCI + Pneumonia,      data = train2_wide_cut, importance = TRUE, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 15.98%
## Confusion matrix:
##             As Expected Better Worse class.error
## As Expected         296     17    13  0.09202454
## Better                8     37     2  0.21276596
## Worse                26      4    35  0.46153846

importance(fit) # importance of each predictor

##                       As Expected    Better     Worse MeanDecreaseAccuracy
## AAARepair              -1.1071070  5.203811  1.501276             2.667456
## AcuteStroke            11.0114687 17.605407 40.008126            41.322957
## AMI                    -1.2714426 36.702337 34.383789            43.229566
## CarotidEndarterectomy   6.5132876 12.069448 12.138739            17.444599
## Craniotomy              5.6676596 20.111117 12.475400            22.838230
## EsophagealResection    -2.3301798  2.435076  2.436611             0.784897
## GIHemorrhage            2.3487945 44.159897 39.138107            49.907648
## HeartFailure            1.5175070 46.135206 37.174181            49.184776
## HipFracture            -5.9453389 30.079553 27.773111            31.667778
## PancreaticCancer       -1.0057905  3.738689  3.682976             3.871419
## PancreaticOther        -0.2044520  5.538596  8.723983             8.821532
## PancreaticResection     1.2884631  7.435238 11.966683            12.062778
## PCI                     0.3144257 14.668247  7.926353            14.960126
## Pneumonia              -3.9746467 10.982875 18.830663            14.365316
##                       MeanDecreaseGini
## AAARepair                   1.04159696
## AcuteStroke                17.77417855
## AMI                        24.18630962
## CarotidEndarterectomy       2.09411139
## Craniotomy                  6.24026595
## EsophagealResection         0.09852053
## GIHemorrhage               26.37847005
## HeartFailure               24.45450167
## HipFracture                10.07273068
## PancreaticCancer            0.27019039
## PancreaticOther             0.93716163
## PancreaticResection         1.13203146
## PCI                         5.02526795
## Pneumonia                   3.08633895

varImpPlot(fit,type=1)

varImpPlot(fit,type=2)

The most important variables are Heart Failure, GI Hemorrhage, AMI and Acute Stroke, Hip Fracture conditions;
Other variables are less important. Need to separate procedures from conditions, since procedures are not included in classification.

Hospital Ratings Prediction Using Classification Decision Trees (CART).

All variables are included in tree construction.

tree0 <- rpart(ratings ~ AAARepair + AcuteStroke + AMI + CarotidEndarterectomy + Craniotomy + EsophagealResection + GIHemorrhage + HeartFailure + HipFracture + PancreaticCancer + PancreaticOther + PancreaticResection + PCI + Pneumonia, data = train2_wide_cut, method = "class")
printcp(tree0)

## 
## Classification tree:
## rpart(formula = ratings ~ AAARepair + AcuteStroke + AMI + CarotidEndarterectomy + 
##     Craniotomy + EsophagealResection + GIHemorrhage + HeartFailure + 
##     HipFracture + PancreaticCancer + PancreaticOther + PancreaticResection + 
##     PCI + Pneumonia, data = train2_wide_cut, method = "class")
## 
## Variables actually used in tree construction:
## [1] AcuteStroke  AMI          GIHemorrhage HeartFailure HipFracture 
## 
## Root node error: 112/438 = 0.25571
## 
## n= 438 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.059524      0   1.00000 1.00000 0.081520
## 2 0.035714      5   0.60714 0.69643 0.071490
## 3 0.010000      7   0.53571 0.66071 0.070018

fancyRpartPlot(tree0)

Future Work

Predict hospital quality ratings using multinomial logistic regression.
- Train the model and access the model performance on 2012 training data.
- Test the model performance on 2013 test data.
Compare three models: random forests, classification decision trees and multinomial logistic regression.
- Summarize which model gives the best performance on 2012 training data and on 2013 test data.
- Choose the best model and test its performance on 2014 test data.

Recommendations to Patients

Top 25 Hospitals with the best overall ratings and the lowest mean mortality rate in state of California.

best_ratings <- all_ratings %>% arrange(desc(all_ratings)) %>% slice(1:50)
best_lowest <- best_ratings %>% arrange(mean_mortality_rate) %>% slice(1:25)

CAmap <- get_map(location="California",source="google",maptype="roadmap",crop=FALSE,zoom=6) 
ggmap (CAmap) +
  geom_point(aes(x=Longitude,y=Latitude,colour=mean_mortality_rate),data=best_lowest,size=5,alpha=0.6)+
  scale_colour_gradient(limits=c(1, 5), high="red", low="darkblue")

Top Ten Hospitals with the best ratings and the lowest mean mortality rate for Acute Stroke, AMI and Heart Failure conditions.

df_as_best <- df_as_best %>% mutate(Procedure.Condition="Acute Stroke")
df_ami_best <- df_ami_best %>% mutate(Procedure.Condition="AMI")
df_hf_best <- df_hf_best %>% mutate(Procedure.Condition="Heart Failure")
best_cond <- bind_rows(df_as_best[1:10,],df_ami_best[1:10,],df_hf_best[1:10,])

CAmap <- get_map(location="California",source="google",maptype="roadmap",crop=FALSE,zoom=6) 
ggmap (CAmap) +
  geom_point(aes(x=Longitude,y=Latitude,size=mean_mortality_rate,colour=Procedure.Condition),data=best_cond,alpha=0.8)+
  scale_colour_manual(values=c("Acute Stroke"="darkred", "AMI"="darkgrey", "Heart Failure"="darkblue"))+
  scale_size(range = c(0, 10))

Top Ten Hospitals with the best ratings and the lowest mean mortality rate for PCI, Craniotomy and Pancreatic Resection procedures.

df_pci_best <- df_pci_best %>% mutate(Procedure.Condition="PCI")
df_cr_best <- df_cr_best %>% mutate(Procedure.Condition="Craniotomy")
df_pr_best <- df_pr_best %>% mutate(Procedure.Condition="Pancreatic Resection")
best_proc <- bind_rows(df_pci_best[1:10,],df_cr_best[1:10,],df_pr_best[1:10,])

CAmap <- get_map(location="California",source="google",maptype="roadmap",crop=FALSE,zoom=6) 
ggmap (CAmap) +
  geom_point(aes(x=Longitude,y=Latitude,size=mean_mortality_rate,colour=Procedure.Condition),data=best_proc,alpha=0.6)+
  scale_colour_manual(values=c("PCI"="darkblue", "Craniotomy"="darkred", "Pancreatic Resection"="darkgrey"))+
  scale_size(range = c(0, 10))

Hospitals with …

Recommend which hospital will have the best care in the future using predicted hospital ratings.