California Statewide Inpatient Mortality Rates, 2012-2013.

Dataset is available from healthdata.gov.

Description of dataset: The dataset contains risk-adjusted mortality rates, quality ratings, and number of deaths and cases for 6 medical conditions treated (Acute Stroke, Acute Myocardial Infarction, Heart Failure, Gastrointestinal Hemorrhage, Hip Fracture and Pneumonia) and 6 procedures performed (Abdominal Aortic Aneurysm Repair, Carotid Endarterectomy, Craniotomy, Esophageal Resection, Pancreatic Resection, Percutaneous Coronary Intervention) in California hospitals for 2012 and 2013. This dataset does not include conditions treated or procedures performed in outpatient settings. Please refer to statewide table for California overall rates.

Question: How the rate of in-hospital mortality across all facilities in California is correlated with the diagnoses and procedures performed during an inpatient stay?

Preliminary Questions:

diagnosis versus mortality
procedures versus mortality
cost of procedure

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Loading required package: tidyr

## Loading required package: ggplot2

## Loading required package: gridExtra

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

Load the data from csv file. (from the website - need to figure out how to do this???)

data <- read.csv("California_Hospital_Inpatient_Mortality_Rates_and_Quality_Ratings__2012-2013.csv",sep=",",header=TRUE)
df <- tbl_df(data)
glimpse(df)

## Observations: 11,169
## Variables: 12
## $ Year                         <int> 2012, 2012, 2012, 2012, 2012, 201...
## $ County                       <fctr> Alameda, Alameda, Alameda, Alame...
## $ Hospital                     <fctr> Alameda Hospital, Alameda Hospit...
## $ OSHPDID                      <int> 106010735, 106010735, 106010735, ...
## $ Procedure.Condition          <fctr> PCI, AMI, Acute Stroke Ischemic,...
## $ Risk.Adjusted.Mortality.Rate <dbl> NA, 5.5, 2.3, NA, 1.3, 0.0, NA, 2...
## $ X..of.Deaths                 <int> NA, 4, 1, NA, 1, 0, NA, 5, NA, 5,...
## $ X..of.Cases                  <int> NA, 41, 60, NA, 65, 5, NA, 11, NA...
## $ Hospital.Ratings             <fctr> As Expected, As Expected, As Exp...
## $ Longitude                    <dbl> -122.2536, -122.2536, -122.2536, ...
## $ Latitude                     <dbl> 37.76295, 37.76295, 37.76295, 37....
## $ location1                    <fctr> (37.762953, -122.25362), (37.762...

Description, analysis and cleaning of variables in the dataset.

Dataset: Observations: 11,169 and Variables: 12

Variables with no missing values:

Year: two years 2012 and 2013; 5576 values for 2012 year and 5593 values for 2013 year.
County: 55 counties.
Hospital: 341 hospitals.
OSHPDID: A unique number established by the Office of Statewide Health Planning and Development (OSHPD) for identifying facilities and used in the Licensed Facility Information System (LFIS). The first three numbers identify the type of facility, the next two represent the county number, and the last five are randomly assigned within each county. 570261 unique codes.
Longitude: Longitude of hospital.
Latitude: Latitude of hospital.
location1: 333 levels.
Hospital Ratings: Comparison rating based on a 95% Confidence Interval (CI). If a hospital’s upper CI is less than the statewide observed rate, it is designated as performing “better” than the average hospital. If a hospital’s lower CI is greater than the state rate, it is designated as performing “worse” than the average state hospital. 3 levels of Hospital Ratings: As Expected, Better and Worse.

summary(df$Hospital.Ratings)

## As Expected      Better       Worse 
##       10801         158         210

Procedure/Condition: Procedure that was performed or condition that was treated. 6 medical procedures performed: Abdominal Aortic Aneurysm (AAA) Repair, Carotid Endarterectomy, Craniotomy, Esophageal Resection, Pancreatic Resection, Percutaneous Coronary Intervention. 6 medical conditions treated: Acute Stroke, Acute Myocardial Infarction, Heart Failure, Gastrointestinal Hemorrhage, Hip Fracture and Pneumonia. 17 levels, instead of 12.

summary(df$Procedure.Condition)

##                AAA Repair              Acute Stroke 
##                       657                       657 
##  Acute Stroke Hemorrhagic     Acute Stroke Ischemic 
##                       657                       657 
## Acute Stroke Subarachnoid                       AMI 
##                       657                       657 
##    Carotid Endarterectomy                Craniotomy 
##                       657                       657 
##      Esophageal Resection             GI Hemorrhage 
##                       657                       657 
##             Heart Failure              Hip Fracture 
##                       657                       657 
##         Pancreatic Cancer          Pancreatic Other 
##                       657                       657 
##      Pancreatic Resection                       PCI 
##                       657                       657 
##                 Pneumonia 
##                       657

Decoding Procedure/Condition variable.

combine 17 levels into 12.
separate procedure and condition in two columns, potential problem here: will create NA values for missing procedure or condition.
OR: separate in two columns: the category column with values procedure or codnition, and the category values column with values of 12 procedures and conditions.

Variables with missing values:

Risk Adjusted Mortality Rate: The Risk Adjusted Mortality Rates (RAMR) presented here adjusts the observed mortality rates. This statistical methodology takes into account pre-existing health problems that put some patients at greater risk of death to “level the playing field” and allow fair comparisons across hospitals. 4754 missing values.
Number of Deaths: Number of patients that died in this hospital. 4926 missing values.
Number of Cases: Number of patients that had this medical procedure or condition in this hospital. 5004 missing values.

Solution: check if missing values in three variables are within the same observations. Can we just remove them or we need to populate them with mean or median values?

Checking for outliers.

Boxplots of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for 2012 and 2013 years.

p1 <- ggplot(df,aes(factor(Year),X..of.Deaths))+
  geom_boxplot()

p2 <- ggplot(df,aes(factor(Year),X..of.Cases))+
  geom_boxplot()

p3 <- ggplot(df,aes(factor(Year),Risk.Adjusted.Mortality.Rate))+
  geom_boxplot()

grid.arrange(p1, p2, p3, ncol=1)

## Warning: Removed 4926 rows containing non-finite values (stat_boxplot).

## Warning: Removed 5004 rows containing non-finite values (stat_boxplot).

## Warning: Removed 4754 rows containing non-finite values (stat_boxplot).

summary(df$Risk.Adjusted.Mortality.Rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   3.400   6.731   8.000 100.000    4754

summary(df$X..of.Deaths)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    3.00    6.36    8.00  109.00    4926

summary(df$X..of.Cases)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.0    16.0    59.0   121.3   179.0  1718.0    5004

Boxplots of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for Hospital.Ratings.

p1 <- ggplot(df,aes(factor(Hospital.Ratings),X..of.Deaths))+
  geom_boxplot()

p2 <- ggplot(df,aes(factor(Hospital.Ratings),X..of.Cases))+
  geom_boxplot()

p3 <- ggplot(df,aes(factor(Hospital.Ratings),Risk.Adjusted.Mortality.Rate))+
  geom_boxplot()

grid.arrange(p1, p2, p3, ncol=1)

## Warning: Removed 4926 rows containing non-finite values (stat_boxplot).

## Warning: Removed 5004 rows containing non-finite values (stat_boxplot).

## Warning: Removed 4754 rows containing non-finite values (stat_boxplot).

Initial Explanatory Data Analysis

Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for 2012 and 2013 years.

p1 <- ggplot(df,aes(X..of.Deaths,fill=factor(Year)))+
  geom_histogram(position="dodge")

p2 <- ggplot(df,aes(X..of.Cases,fill=factor(Year)))+
  geom_histogram(position="dodge")

p3 <- ggplot(df,aes(Risk.Adjusted.Mortality.Rate,fill=factor(Year)))+
  geom_histogram(position="dodge")

grid.arrange(p1, p2, p3, ncol=1)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 4926 rows containing non-finite values (stat_bin).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 5004 rows containing non-finite values (stat_bin).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 4754 rows containing non-finite values (stat_bin).

Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for Hospital.Ratings.

p4 <- ggplot(df,aes(X..of.Deaths,fill=factor(Hospital.Ratings)))+
  geom_histogram(position="dodge")

p5 <- ggplot(df,aes(X..of.Cases,fill=factor(Hospital.Ratings)))+
  geom_histogram(position="dodge")

p6 <- ggplot(df,aes(Risk.Adjusted.Mortality.Rate,fill=factor(Hospital.Ratings)))+
  geom_histogram(position="dodge")

grid.arrange(p4, p5, p6, ncol=1)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 4926 rows containing non-finite values (stat_bin).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 5004 rows containing non-finite values (stat_bin).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 4754 rows containing non-finite values (stat_bin).

Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for Year and Hospital.Ratings.

p4 <- ggplot(df,aes(X..of.Deaths,fill=factor(Year)))+
  geom_histogram(position="dodge")+
  facet_grid(". ~ Hospital.Ratings")

p5 <- ggplot(df,aes(X..of.Cases,fill=factor(Year)))+
  geom_histogram(position="dodge")+
  facet_grid(". ~ Hospital.Ratings")

p6 <- ggplot(df,aes(Risk.Adjusted.Mortality.Rate,fill=factor(Year)))+
  geom_histogram(position="dodge")+
  facet_grid(". ~ Hospital.Ratings")

grid.arrange(p4, p5, p6, ncol=1)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 4926 rows containing non-finite values (stat_bin).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 5004 rows containing non-finite values (stat_bin).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 4754 rows containing non-finite values (stat_bin).

California Statewide Inpatient Mortality Rates, 2012-2013.

Maria P. Frushicheva

Sunday, July 31, 2016

Description, analysis and cleaning of variables in the dataset.

Checking for outliers.

Initial Explanatory Data Analysis