Dataset is available from healthdata.gov.

Description of dataset: The dataset contains risk-adjusted mortality rates, quality ratings, and number of deaths and cases for 6 medical conditions treated (Acute Stroke, Acute Myocardial Infarction, Heart Failure, Gastrointestinal Hemorrhage, Hip Fracture and Pneumonia) and 6 procedures performed (Abdominal Aortic Aneurysm Repair, Carotid Endarterectomy, Craniotomy, Esophageal Resection, Pancreatic Resection, Percutaneous Coronary Intervention) in California hospitals for 2012 and 2013. This dataset does not include conditions treated or procedures performed in outpatient settings. Please refer to statewide table for California overall rates.

Question: How the rate of in-hospital mortality across all facilities in California is correlated with the diagnoses and procedures performed during an inpatient stay?

Preliminary Questions:

## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: tidyr
## Loading required package: ggplot2
## Loading required package: gridExtra
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine

Load the data from csv file. (from the website - need to figure out how to do this???)

data <- read.csv("California_Hospital_Inpatient_Mortality_Rates_and_Quality_Ratings__2012-2013.csv",sep=",",header=TRUE)
df <- tbl_df(data)
glimpse(df)
## Observations: 11,169
## Variables: 12
## $ Year                         <int> 2012, 2012, 2012, 2012, 2012, 201...
## $ County                       <fctr> Alameda, Alameda, Alameda, Alame...
## $ Hospital                     <fctr> Alameda Hospital, Alameda Hospit...
## $ OSHPDID                      <int> 106010735, 106010735, 106010735, ...
## $ Procedure.Condition          <fctr> PCI, AMI, Acute Stroke Ischemic,...
## $ Risk.Adjusted.Mortality.Rate <dbl> NA, 5.5, 2.3, NA, 1.3, 0.0, NA, 2...
## $ X..of.Deaths                 <int> NA, 4, 1, NA, 1, 0, NA, 5, NA, 5,...
## $ X..of.Cases                  <int> NA, 41, 60, NA, 65, 5, NA, 11, NA...
## $ Hospital.Ratings             <fctr> As Expected, As Expected, As Exp...
## $ Longitude                    <dbl> -122.2536, -122.2536, -122.2536, ...
## $ Latitude                     <dbl> 37.76295, 37.76295, 37.76295, 37....
## $ location1                    <fctr> (37.762953, -122.25362), (37.762...

Description, analysis and cleaning of variables in the dataset.

Dataset: Observations: 11,169 and Variables: 12

Variables with no missing values:

summary(df$Hospital.Ratings)
## As Expected      Better       Worse 
##       10801         158         210
summary(df$Procedure.Condition)
##                AAA Repair              Acute Stroke 
##                       657                       657 
##  Acute Stroke Hemorrhagic     Acute Stroke Ischemic 
##                       657                       657 
## Acute Stroke Subarachnoid                       AMI 
##                       657                       657 
##    Carotid Endarterectomy                Craniotomy 
##                       657                       657 
##      Esophageal Resection             GI Hemorrhage 
##                       657                       657 
##             Heart Failure              Hip Fracture 
##                       657                       657 
##         Pancreatic Cancer          Pancreatic Other 
##                       657                       657 
##      Pancreatic Resection                       PCI 
##                       657                       657 
##                 Pneumonia 
##                       657

Decoding Procedure/Condition variable.

Variables with missing values:

Solution: check if missing values in three variables are within the same observations. Can we just remove them or we need to populate them with mean or median values?

Checking for outliers.

Boxplots of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for 2012 and 2013 years.

p1 <- ggplot(df,aes(factor(Year),X..of.Deaths))+
  geom_boxplot()

p2 <- ggplot(df,aes(factor(Year),X..of.Cases))+
  geom_boxplot()

p3 <- ggplot(df,aes(factor(Year),Risk.Adjusted.Mortality.Rate))+
  geom_boxplot()

grid.arrange(p1, p2, p3, ncol=1)
## Warning: Removed 4926 rows containing non-finite values (stat_boxplot).
## Warning: Removed 5004 rows containing non-finite values (stat_boxplot).
## Warning: Removed 4754 rows containing non-finite values (stat_boxplot).

summary(df$Risk.Adjusted.Mortality.Rate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   3.400   6.731   8.000 100.000    4754
summary(df$X..of.Deaths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    3.00    6.36    8.00  109.00    4926
summary(df$X..of.Cases)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.0    16.0    59.0   121.3   179.0  1718.0    5004

Boxplots of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for Hospital.Ratings.

p1 <- ggplot(df,aes(factor(Hospital.Ratings),X..of.Deaths))+
  geom_boxplot()

p2 <- ggplot(df,aes(factor(Hospital.Ratings),X..of.Cases))+
  geom_boxplot()

p3 <- ggplot(df,aes(factor(Hospital.Ratings),Risk.Adjusted.Mortality.Rate))+
  geom_boxplot()

grid.arrange(p1, p2, p3, ncol=1)
## Warning: Removed 4926 rows containing non-finite values (stat_boxplot).
## Warning: Removed 5004 rows containing non-finite values (stat_boxplot).
## Warning: Removed 4754 rows containing non-finite values (stat_boxplot).

Initial Explanatory Data Analysis

Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for 2012 and 2013 years.

p1 <- ggplot(df,aes(X..of.Deaths,fill=factor(Year)))+
  geom_histogram(position="dodge")

p2 <- ggplot(df,aes(X..of.Cases,fill=factor(Year)))+
  geom_histogram(position="dodge")

p3 <- ggplot(df,aes(Risk.Adjusted.Mortality.Rate,fill=factor(Year)))+
  geom_histogram(position="dodge")

grid.arrange(p1, p2, p3, ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4926 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5004 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4754 rows containing non-finite values (stat_bin).

Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for Hospital.Ratings.

p4 <- ggplot(df,aes(X..of.Deaths,fill=factor(Hospital.Ratings)))+
  geom_histogram(position="dodge")

p5 <- ggplot(df,aes(X..of.Cases,fill=factor(Hospital.Ratings)))+
  geom_histogram(position="dodge")

p6 <- ggplot(df,aes(Risk.Adjusted.Mortality.Rate,fill=factor(Hospital.Ratings)))+
  geom_histogram(position="dodge")

grid.arrange(p4, p5, p6, ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4926 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5004 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4754 rows containing non-finite values (stat_bin).

Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for Year and Hospital.Ratings.

p4 <- ggplot(df,aes(X..of.Deaths,fill=factor(Year)))+
  geom_histogram(position="dodge")+
  facet_grid(". ~ Hospital.Ratings")

p5 <- ggplot(df,aes(X..of.Cases,fill=factor(Year)))+
  geom_histogram(position="dodge")+
  facet_grid(". ~ Hospital.Ratings")

p6 <- ggplot(df,aes(Risk.Adjusted.Mortality.Rate,fill=factor(Year)))+
  geom_histogram(position="dodge")+
  facet_grid(". ~ Hospital.Ratings")

grid.arrange(p4, p5, p6, ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4926 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5004 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4754 rows containing non-finite values (stat_bin).