Rankings are a common way of comparing different objects based on different metrics, and we have all interacted with this kind of data. For example, we might choose which school to attend based on studies which use metrics such as number of research publications, dollar amount of scholarships awarded, graduation outcomes and so on to rank them.
Similarly, states can be ranked based on the quality of care that their hospitals provide. The metrics used can vary widely, from number of hospital beds to percent of favorable ratings on patient experiences surveys. I stumbled upon this article that provides state healthcare rankings based on four metrics:
Number of readmissions
Number of preventable admissions
Medicare quality
Nursing home quality
The number of readmissions is a crucial metric for many hospitals, especially because Medicare reduces payments to hospitals with excess readmissions.
According to the artice, Utah is a state that is performing very well as far as the number of readmissions are concerned. It is ranked 3rd in the country, meaning the number of readmissions in Utah hospitals are relatively low. On the same scale, Mississippi is ranked 49th, indicating that Mississippi hospitals have a very high number of readmissions.
In this walkthrough, we will use R to:
Access Medicare data to obtain the numbers of readmissions for each hospital in the country
Use a t-test to compare the mean number of readmissions for hospitals in Utah and Mississippi and see if the difference in ranking is justified statistically.
library(tidyverse)
library(knitr)
library(tinytex)
The data has 19.7k rows and 12 columns.
colnames(dat)
The columns of interest for us are:
Let us rename columns 1, 4 and 10 for readability.
names(dat)[c(1, 4, 10)] <- c("Hospital", "Measure", "Readmissions")
Now, let us use a dplyr pipe to first pull the data from Utah and Mississippi, and then visualize only the 4 columns of interest:
clean <- dat %>% filter(State %in% c("UT", "MS")) %>%
select(Hospital, State, Readmissions, Measure)
Let’s take a glimpse at a few random rows of this data.
clean[25:36, ]
## Hospital State Readmissions
## 25 MERIT HEALTH RIVER OAKS MS Not Available
## 26 MERIT HEALTH RIVER OAKS MS Not Available
## 27 MERIT HEALTH RIVER OAKS MS 22
## 28 MERIT HEALTH RIVER OAKS MS 11
## 29 MERIT HEALTH RIVER OAKS MS 20
## 30 MERIT HEALTH RIVER OAKS MS 15
## 31 DAVIS HOSPITAL AND MEDICAL CENTER UT Not Available
## 32 DAVIS HOSPITAL AND MEDICAL CENTER UT Not Available
## 33 DAVIS HOSPITAL AND MEDICAL CENTER UT 11
## 34 DAVIS HOSPITAL AND MEDICAL CENTER UT 34
## 35 DAVIS HOSPITAL AND MEDICAL CENTER UT 12
## 36 DAVIS HOSPITAL AND MEDICAL CENTER UT 32
## Measure
## 25 READM_30_AMI_HRRP
## 26 READM_30_CABG_HRRP
## 27 READM_30_COPD_HRRP
## 28 READM_30_HF_HRRP
## 29 READM_30_HIP_KNEE_HRRP
## 30 READM_30_PN_HRRP
## 31 READM_30_AMI_HRRP
## 32 READM_30_CABG_HRRP
## 33 READM_30_COPD_HRRP
## 34 READM_30_HF_HRRP
## 35 READM_30_HIP_KNEE_HRRP
## 36 READM_30_PN_HRRP
We have a nice chunk of data representing readmissions from hospitals from both Utah and Mississipi, along with the particular disease for which the readmission happened.
This is because because each hospital records multiple readmissions measures. This representation of data is considered to be a ‘long’ format. This faciliates easy analysis and visualization in R.
Let’s try to make this easier to look at by:
wide <- clean %>%
mutate(Disease = ifelse(Measure =="READM_30_AMI_HRRP", "Myocardial Infarction",
ifelse(Measure == "READM_30_CABG_HRRP", "Bypass Graft",
ifelse(Measure== "READM_30_COPD_HRRP", "COPD",
ifelse(Measure == "READM_30_HF_HRRP", "Heart Failure",
ifelse(Measure == "READM_30_PN_HRRP", "Pneumonia", "Hip/Knee Replacement")))))) %>%
group_by(State, Disease) %>%
select(State, Disease, Readmissions) %>%
summarize(Mean = mean(as.numeric(Readmissions, na.rm = TRUE))) %>%
spread(Disease, Mean)
kable(wide)
| State | Bypass Graft | COPD | Heart Failure | Hip/Knee Replacement | Myocardial Infarction | Pneumonia |
|---|---|---|---|---|---|---|
| MS | 315.3492 | 208.7778 | 197.6032 | 308.6508 | 302.9365 | 204.7460 |
| UT | 343.6452 | 284.2581 | 265.0323 | 272.6129 | 329.6452 | 251.8387 |
This is an interesting table. This shows that Mississippi has, on average, fewer readmissions than Utah in every type of measure except hip/knee replacement.
Let’s use the long form data to compare readmission numbers for Utah and Mississippi. We will not be looking at the measure for readmission.
long <- clean %>% select(-Measure) %>%
mutate(Readmissions = as.numeric(Readmissions, na.rm = TRUE))
kable(head(long, 10))
| Hospital | State | Readmissions |
|---|---|---|
| ANDERSON REGIONAL MEDICAL CTR | MS | 278 |
| ANDERSON REGIONAL MEDICAL CTR | MS | 55 |
| ANDERSON REGIONAL MEDICAL CTR | MS | 35 |
| ANDERSON REGIONAL MEDICAL CTR | MS | 60 |
| ANDERSON REGIONAL MEDICAL CTR | MS | 44 |
| ANDERSON REGIONAL MEDICAL CTR | MS | 23 |
| BRIGHAM CITY COMMUNITY HOSPITAL | UT | 354 |
| BRIGHAM CITY COMMUNITY HOSPITAL | UT | 354 |
| BRIGHAM CITY COMMUNITY HOSPITAL | UT | 354 |
| BRIGHAM CITY COMMUNITY HOSPITAL | UT | 354 |
Let’s partition this data into 2 vectors containing numbers of readmissions for hospitals, one for Utah and One for Mississippi.
x <- long %>% filter(State == "UT") %>% .$Readmissions #Utah readmissions
y <- long %>% filter(State == "MS") %>% .$Readmissions #Mississippi readmissions
We are going to use this data to perform a t-test to compare the mean number of readmissions for the two states.
There are three major assumptions for a t-test:
Independence of observations: This assumption is met because the individual readmission data for any one hospital is independent of the number of readmissions for other hospitals.
Homegeneity of variances: The variances for the two groups should be equal. We can confirm that using the var.test() function in R.
var.test(x,y)
##
## F test to compare two variances
##
## data: x and y
## F = 0.83534, num df = 185, denom df = 377, p-value = 0.1657
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.6544439 1.0777766
## sample estimates:
## ratio of variances
## 0.8353404
The p-value of 0.16 suggests that we cannot reject the null hypothesis that the variances are equal. The assumption of equal variances is met.
hist(x, xlab = "Number of Readmissions",
ylab = "Number of Hospitals", col = "lightblue",
main = "Distribution of Utah Readmissions", ylim =c(0, 150))
hist(y, xlab = "Number of Readmissions",
ylab = "Number of Hospitals", col = "lightgreen",
main = "Distribution of Mississippi Readmissions", ylim = c(0, 200))
Both of these vectors are not normally distributed. There are two ways to tackle this—either by using a log transformation, using a non-parametric test like the Wilcoxon Mann Whitney. In this example, we do not have to do this, because the t-test is usually robust when samples sizes are large. Let us proceed with the test.
t.test(x, y, alternative = "less")
##
## Welch Two Sample t-test
##
## data: x and y
## t = 3.4013, df = 399.18, p-value = 0.9996
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 51.71007
## sample estimates:
## mean of x mean of y
## 291.1720 256.3439
# p value is 0.99, much more than alpha of 0.05
#there is not enough evidence to suggest that
#the mean number of readmissions is lesser for Utah than Mississippi.
There is no evidence to show that mean readmission numbers in Utah are lesser than Mississippi. Why then did Utah receive a rank of 3 and Mississippi a rank of 49? I noticed when I scrolled down in the article that while their infographic specifies that they are ranking based on readmission numbers, the commentary at the bottom shows that they are using readmission rates. This could be a reason for the discrepancy in our findings.
Let’s try a t-test again, but this time, let’s use the excess readmission ratio column to base the t-test on:
Here’s a quick repeat of the wrangling we did, but this time, we choose to keep the readmission ratio column:
colnames(dat)[7] <- "Readmission_Ratio"
stock <- dat %>% filter(State %in% c("UT", "MS")) %>%
select(Hospital, State, Readmission_Ratio)
long <- stock %>% select(-Hospital) %>%
mutate(Readmission_Ratio = as.numeric(Readmission_Ratio, na.rm = TRUE))
x <- long %>% filter(State == "UT") %>% .$Readmission_Ratio
y <- long %>% filter(State == "MS") %>% .$Readmission_Ratio
Here’s the t-test:
t.test(x, y, alternative = "less")
##
## Welch Two Sample t-test
##
## data: x and y
## t = -5.7292, df = 291.25, p-value = 1.258e-08
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -436.2049
## sample estimates:
## mean of x mean of y
## 2075.586 2688.249
The p-value is highly significant! This means we reject the hypothesis that the means of excess readmission ratios are equal, and we pick the alternate hypothesis that states that the mean of the excess readmission ratio is lesser for Utah than Mississippi.
The article ranking is vindicated. Utah is doing much better than Mississippi in keeping readmissions low.
This was an interesting demonstration of using statistics to verify claims made in media publications. I would like to take this further by downloading other variables for Utah hospitals from Hospital Compare and try to run a regression analysis to find out what is contributing to the readmission ratios.