Dataset is available from healthdata.gov.
Description of dataset: The dataset contains risk-adjusted mortality rates, quality ratings, and number of deaths and cases for 6 medical conditions treated (Acute Stroke, Acute Myocardial Infarction, Heart Failure, Gastrointestinal Hemorrhage, Hip Fracture and Pneumonia) and 6 procedures performed (Abdominal Aortic Aneurysm Repair, Carotid Endarterectomy, Craniotomy, Esophageal Resection, Pancreatic Resection, Percutaneous Coronary Intervention) in California hospitals for 2012 and 2013. This dataset does not include conditions treated or procedures performed in outpatient settings. Please refer to statewide table for California overall rates.
Question: How the rate of in-hospital mortality across all facilities in California is correlated with the diagnoses and procedures performed during an inpatient stay?
Preliminary Questions:
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: tidyr
## Loading required package: ggplot2
## Loading required package: gridExtra
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
Load the data from csv file. (from the website - need to figure out how to do this???)
data <- read.csv("California_Hospital_Inpatient_Mortality_Rates_and_Quality_Ratings__2012-2013.csv",sep=",",header=TRUE)
df <- tbl_df(data)
glimpse(df)
## Observations: 11,169
## Variables: 12
## $ Year <int> 2012, 2012, 2012, 2012, 2012, 201...
## $ County <fctr> Alameda, Alameda, Alameda, Alame...
## $ Hospital <fctr> Alameda Hospital, Alameda Hospit...
## $ OSHPDID <int> 106010735, 106010735, 106010735, ...
## $ Procedure.Condition <fctr> PCI, AMI, Acute Stroke Ischemic,...
## $ Risk.Adjusted.Mortality.Rate <dbl> NA, 5.5, 2.3, NA, 1.3, 0.0, NA, 2...
## $ X..of.Deaths <int> NA, 4, 1, NA, 1, 0, NA, 5, NA, 5,...
## $ X..of.Cases <int> NA, 41, 60, NA, 65, 5, NA, 11, NA...
## $ Hospital.Ratings <fctr> As Expected, As Expected, As Exp...
## $ Longitude <dbl> -122.2536, -122.2536, -122.2536, ...
## $ Latitude <dbl> 37.76295, 37.76295, 37.76295, 37....
## $ location1 <fctr> (37.762953, -122.25362), (37.762...
Dataset: Observations: 11,169 and Variables: 12
Variables with no missing values:
summary(df$Hospital.Ratings)
## As Expected Better Worse
## 10801 158 210
summary(df$Procedure.Condition)
## AAA Repair Acute Stroke
## 657 657
## Acute Stroke Hemorrhagic Acute Stroke Ischemic
## 657 657
## Acute Stroke Subarachnoid AMI
## 657 657
## Carotid Endarterectomy Craniotomy
## 657 657
## Esophageal Resection GI Hemorrhage
## 657 657
## Heart Failure Hip Fracture
## 657 657
## Pancreatic Cancer Pancreatic Other
## 657 657
## Pancreatic Resection PCI
## 657 657
## Pneumonia
## 657
Decoding Procedure/Condition variable.
Variables with missing values:
Solution: check if missing values in three variables are within the same observations. Can we just remove them or we need to populate them with mean or median values?
Boxplots of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for 2012 and 2013 years.
p1 <- ggplot(df,aes(factor(Year),X..of.Deaths))+
geom_boxplot()
p2 <- ggplot(df,aes(factor(Year),X..of.Cases))+
geom_boxplot()
p3 <- ggplot(df,aes(factor(Year),Risk.Adjusted.Mortality.Rate))+
geom_boxplot()
grid.arrange(p1, p2, p3, ncol=1)
## Warning: Removed 4926 rows containing non-finite values (stat_boxplot).
## Warning: Removed 5004 rows containing non-finite values (stat_boxplot).
## Warning: Removed 4754 rows containing non-finite values (stat_boxplot).
summary(df$Risk.Adjusted.Mortality.Rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 3.400 6.731 8.000 100.000 4754
summary(df$X..of.Deaths)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 3.00 6.36 8.00 109.00 4926
summary(df$X..of.Cases)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 16.0 59.0 121.3 179.0 1718.0 5004
Boxplots of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for Hospital.Ratings.
p1 <- ggplot(df,aes(factor(Hospital.Ratings),X..of.Deaths))+
geom_boxplot()
p2 <- ggplot(df,aes(factor(Hospital.Ratings),X..of.Cases))+
geom_boxplot()
p3 <- ggplot(df,aes(factor(Hospital.Ratings),Risk.Adjusted.Mortality.Rate))+
geom_boxplot()
grid.arrange(p1, p2, p3, ncol=1)
## Warning: Removed 4926 rows containing non-finite values (stat_boxplot).
## Warning: Removed 5004 rows containing non-finite values (stat_boxplot).
## Warning: Removed 4754 rows containing non-finite values (stat_boxplot).
Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for 2012 and 2013 years.
p1 <- ggplot(df,aes(X..of.Deaths,fill=factor(Year)))+
geom_histogram(position="dodge")
p2 <- ggplot(df,aes(X..of.Cases,fill=factor(Year)))+
geom_histogram(position="dodge")
p3 <- ggplot(df,aes(Risk.Adjusted.Mortality.Rate,fill=factor(Year)))+
geom_histogram(position="dodge")
grid.arrange(p1, p2, p3, ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4926 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5004 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4754 rows containing non-finite values (stat_bin).
Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for Hospital.Ratings.
p4 <- ggplot(df,aes(X..of.Deaths,fill=factor(Hospital.Ratings)))+
geom_histogram(position="dodge")
p5 <- ggplot(df,aes(X..of.Cases,fill=factor(Hospital.Ratings)))+
geom_histogram(position="dodge")
p6 <- ggplot(df,aes(Risk.Adjusted.Mortality.Rate,fill=factor(Hospital.Ratings)))+
geom_histogram(position="dodge")
grid.arrange(p4, p5, p6, ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4926 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5004 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4754 rows containing non-finite values (stat_bin).
Histrograms of # of Deaths, # of Cases and Risk.Adjusted.Mortality.Rate for Year and Hospital.Ratings.
p4 <- ggplot(df,aes(X..of.Deaths,fill=factor(Year)))+
geom_histogram(position="dodge")+
facet_grid(". ~ Hospital.Ratings")
p5 <- ggplot(df,aes(X..of.Cases,fill=factor(Year)))+
geom_histogram(position="dodge")+
facet_grid(". ~ Hospital.Ratings")
p6 <- ggplot(df,aes(Risk.Adjusted.Mortality.Rate,fill=factor(Year)))+
geom_histogram(position="dodge")+
facet_grid(". ~ Hospital.Ratings")
grid.arrange(p4, p5, p6, ncol=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4926 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5004 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4754 rows containing non-finite values (stat_bin).