For this project, I will be analyzing the Overall Hospital Quality Star Rating provided by the Center for Medicaid and Medicare Services (CMS). The CMS dataset contains ratings from over 5,000 hospital across the United States. It summarizes 5 keys measures (mortality rate, safety, readmission, patient experience, and timely & effective care) into a 5 star rating system.
Along with the hospital data, I will also examine a dataset from the Census Bureau that contains the median income of households by zip code. By merging these two datasets based on the zip code field, I hope to gain insights into the relationship between hospital quality ratings and income levels in the areas where the hospitals are located. In addition, I will explore any trends or patterns that may exist among hospitals with high and low ratings.
Being as I work in a hospital, I think it is important to understand why there are discrepancies between hospitals quality in healthcare. Identifying these factors can not only improve hospital ratings but improve patient outcomes as well.
Sources:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
#Extracting hospital data
hospital_df_raw <- read.csv("https://data.cms.gov/provider-data/sites/default/files/resources/092256becd267d9eeccf73bf7d16c46b_1681243512/Hospital_General_Information.csv")
#Extracting income data - The dataset was large and the page crashed when I tried viewing the table on the website so it was easier to download it and upload it to github.
income_df_raw <- read.csv("https://raw.githubusercontent.com/LeJQC/MSDS/main/DATA%20606/Final%20Project/Income%20data.csv")
# Tidying the hospital data frame and selecting the columns I need
hospital_df <- hospital_df_raw %>%
select("Facility.Name","City", "State", "ZIP.Code", "Hospital.Type", "Hospital.Ownership", "Hospital.overall.rating") %>%
rename(HospitalName = Facility.Name, HospitalRating = Hospital.overall.rating, ZipCode = ZIP.Code)
# Tidying the income data frame. Looking to make a data frame with 2 columns: zip code and median income
income_df <- income_df_raw %>%
select("NAME","S1901_C01_012E") %>%
slice(-1) %>%
mutate(NAME = as.numeric(substr(NAME, nchar(NAME) - 4, nchar(NAME))),
S1901_C01_012E = as.numeric(S1901_C01_012E)) %>%
rename(ZipCode = NAME, Income = S1901_C01_012E)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `S1901_C01_012E = as.numeric(S1901_C01_012E)`.
## Caused by warning:
## ! NAs introduced by coercion
# Merging both the hospital and income data frames together
merged_df <- merge(income_df, hospital_df, by ="ZipCode")
# Cleaning out the NA values
ratings_df <- merged_df %>%
filter(HospitalRating != "Not Available", Income >= 0) %>%
group_by(HospitalRating) %>%
mutate(HospitalRating = as.numeric(HospitalRating))
# Summary statistics by rating(`group1`)
stats <- psych::describeBy(ratings_df$Income, group = ratings_df$HospitalRating, mat = TRUE)
rownames(stats) <- NULL
stats <- stats %>%
select(c(group1, n, mean, sd, median, min, max, range, se)) %>%
rename("Rating" = "group1")
knitr::kable(stats)
Rating | n | mean | sd | median | min | max | range | se |
---|---|---|---|---|---|---|---|---|
1 | 180 | 59232.96 | 27205.07 | 53540.0 | 16764 | 160890 | 144126 | 2027.7458 |
2 | 663 | 61283.45 | 24826.76 | 54567.0 | 20239 | 174419 | 154180 | 964.1914 |
3 | 851 | 62800.59 | 23562.27 | 57324.0 | 19083 | 194462 | 175379 | 807.7041 |
4 | 851 | 68834.50 | 25469.45 | 62054.0 | 15833 | 232400 | 216567 | 873.0817 |
5 | 408 | 75001.75 | 30628.49 | 66046.5 | 11404 | 216286 | 204882 | 1516.3362 |
Surprisingly, the zip code with the lowest income has a hospital with 5 stars.
To get a general sense of income and how hospitals are rated by
state, let’s plot this using ggmap
.
library(ggmap)
## ℹ Google's Terms of Service: ]8;;https://mapsplatform.google.com<https://mapsplatform.google.com>]8;;
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
# Creating an average hospital rating by state
avg_state <- ratings_df %>%
group_by(State) %>%
mutate(avg_rating = round(mean(HospitalRating),2)) %>%
mutate(avg_income = round(mean(Income),2))
# Using the Google Maps API
register_google(key = "AIzaSyBCuGJujBhbM3SiJMW07WCEM0YFAKvG320")
# Getting the US map from Google Maps API
us_map <- get_map(location = "united states", zoom = 4, maptype = "terrain")
## ℹ <]8;;https://maps.googleapis.com/maps/api/staticmap?center=united%20states&zoom=4&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxxhttps://maps.googleapis.com/maps/api/staticmap?center=united%20states&zoom=4&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx]8;;>
## ℹ <]8;;https://maps.googleapis.com/maps/api/geocode/json?address=united+states&key=xxxhttps://maps.googleapis.com/maps/api/geocode/json?address=united+states&key=xxx]8;;>
# This has the lat and long of all the states
mapdata <- map_data("state")
# Create lookup table for state abbreviations
state_lookup <- tibble(State = state.abb, region = state.name)
state_lookup$region <- tolower(state_lookup$region)
# Merge lookup table with state ratings data frame
state_ratings <- avg_state %>%
left_join(state_lookup, by = "State") %>%
full_join(mapdata, by = "region") %>%
filter(avg_rating != 0, long != 0, Income != 0)
# Plot of US and average rating
map1_rating <- ggmap(us_map) +
geom_polygon(data = state_ratings, aes(x = long, y = lat, group = group, fill = avg_rating), color = "black", linewidth = 0.5, alpha = 0.8) +
theme_void()
map2_rating <- map1_rating +
scale_fill_gradient(name = "Average Hospital Rating", low = "yellow", high = "red", na.value = "grey50")
#map2_rating
# Saving the map because it took a long time to load
#ggsave("US Hospital rating.png", map2_rating)
# Plot of US and Income
map1_income <- ggmap(us_map) +
geom_polygon(data = state_ratings, aes(x = long, y = lat, group = group, fill = avg_income), color = "black", linewidth = 0.5, alpha = 0.8)+
theme_void()
map2_income <- map1_income +
scale_fill_gradient(name = "Income", low = "lightblue", high = "blue", na.value = "grey50")
#map2_income
#ggsave("US Income.png", map2_income)
US Hospital Rating
US Income
owner_table <- sort(table(ratings_df$Hospital.Ownership), decreasing = TRUE)
knitr::kable(owner_table)
Var1 | Freq |
---|---|
Voluntary non-profit - Private | 1454 |
Proprietary | 530 |
Voluntary non-profit - Other | 252 |
Government - Hospital District or Authority | 233 |
Voluntary non-profit - Church | 233 |
Government - Local | 197 |
Government - State | 24 |
Physician | 20 |
Government - Federal | 8 |
Tribal | 2 |
# Rating based on hospital ownership
ratings_df %>%
group_by(Hospital.Ownership) %>%
ggplot(aes(x = Hospital.Ownership, y = HospitalRating)) +
coord_flip() +
geom_bar(stat = "summary", fun = "mean")+
labs(x = "Hospital Ownership", y = "Average Rating") +
theme_bw()
# Plotting the highest and lowest rated hospital
ratings_df %>%
filter(HospitalRating == c(1,5)) %>%
ggplot(aes(x = Hospital.Ownership, fill = Hospital.Type)) +
facet_wrap(~ HospitalRating)+
geom_bar()+
coord_flip()
ratings_df %>%
ggplot(aes(x = Income)) +
geom_histogram(bins = 100) +
theme_bw()
ratings_df %>%
ggplot(aes(x = HospitalRating)) +
geom_histogram() +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Let’s look at how these two relate!
ratings_df %>%
ggplot(aes(x = Income, y = HospitalRating, group = HospitalRating)) +
geom_boxplot() +
theme_bw()
The median income seems to be increasing as the rating increases. Let’s run a linear regression analysis!
scatter_plot <- ggplot(ratings_df, aes(x = Income, y = HospitalRating, group = 1)) +
geom_jitter() +
labs(title = "Relationship Between Hospital Quality Ratings and Income Levels",
x = "Income Levels", y = "Hospital Quality Ratings") +
geom_smooth(method = lm, se = FALSE) +
theme_bw()
scatter_plot
## `geom_smooth()` using formula = 'y ~ x'
# Fit a linear regression model to the data
lm_model <- lm(HospitalRating ~ Income, data = ratings_df)
# Print the summary of the linear regression model
summary(lm_model)
##
## Call:
## lm(formula = HospitalRating ~ Income, data = ratings_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.95154 -1.02962 -0.08089 0.84345 2.19988
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.712e+00 5.501e-02 49.304 <2e-16 ***
## Income 7.702e-06 7.784e-07 9.896 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.105 on 2951 degrees of freedom
## Multiple R-squared: 0.03212, Adjusted R-squared: 0.03179
## F-statistic: 97.93 on 1 and 2951 DF, p-value: < 2.2e-16
The coefficient estimate for the Income is 7.702e-06 so for every dollar increase in Income, the hospital rating increases by 7.702e-06. As for significance, the p-value is very small indicating that there is a statistically significant relationship between income and hospital rating. However, the R-squared value is very small(0.03212), indicating that only 3% of the variation in hospital ratings is accounted for by the Income variable.
Based on the linear regression model, there is statistically significant positive, linear correlation between hospital rating and income. Although, there is a statistically significant relationship between income and hospital rating, the small coefficient estimate indicates income only contributes a small proportion(3%) of the variation in hospital rating. This may suggest that a linear regression model is not a reliable statistical test for this dataset.
There are some important limitations to note with this analysis. The hospital dataset is not inclusive of all the hospitals in the US. Hospitals that are not associated with CMS or do not have a rating were not included in the analysis. Also, the association between income and hospitals is based on zip code and on the assumption that everyone residing in the same zip code as the hospital has the same income level.
Overall, the results from this analysis highlight the need for further research into hospital quality beyond just income. In New York, NYU and Bellevue are located in the same zip code but their hospital ratings are complete opposites. NYU’s overall rating is a 5 while Bellevue’s is a 1. This illustrates that income is just a small piece of the puzzle. Factors such as demographics, health insurance, and healthcare policies can all impact hospital ratings and need to be researched as well. By identifying factors that influence the variations in hospital ratings, we can lower the discrepancies in healthcare and ensure equal, high quality care to every one.