Healthcare rankings:

Rankings are a common way of comparing different objects based on different metrics, and we have all interacted with this kind of data. For example, we might choose which school to attend based on studies which use metrics such as number of research publications, dollar amount of scholarships awarded, graduation outcomes and so on to rank them.

Similarly, states can be ranked based on the quality of care that their hospitals provide. The metrics used can vary widely, from number of hospital beds to percent of favorable ratings on patient experiences surveys. I stumbled upon this article that provides state healthcare rankings based on four metrics:

The number of readmissions is a crucial metric for many hospitals, especially because Medicare reduces payments to hospitals with excess readmissions.

According to the artice, Utah is a state that is performing very well as far as the number of readmissions are concerned. It is ranked 3rd in the country, meaning the number of readmissions in Utah hospitals are relatively low. On the same scale, Mississippi is ranked 49th, indicating that Mississippi hospitals have a very high number of readmissions.

Let’s verify this for ourselves

In this walkthrough, we will use R to:

Load libraries

library(tidyverse)
library(knitr)
library(tinytex)

Accessing the data

The dataset can be found here.

dat <- read.csv("Readmissions.csv")

Data wrangling

The data has 19.7k rows and 12 columns.

colnames(dat)

The columns of interest for us are:

Let us rename columns 1, 4 and 10 for readability.

names(dat)[c(1, 4, 10)] <- c("Hospital", "Measure", "Readmissions")

Now, let us use a dplyr pipe to first pull the data from Utah and Mississippi, and then visualize only the 4 columns of interest:

clean <- dat %>% filter(State %in% c("UT", "MS")) %>% 
  select(Hospital, State, Readmissions, Measure)

Let’s take a glimpse at a few random rows of this data.

clean[25:36, ]
##                             Hospital State  Readmissions
## 25           MERIT HEALTH RIVER OAKS    MS Not Available
## 26           MERIT HEALTH RIVER OAKS    MS Not Available
## 27           MERIT HEALTH RIVER OAKS    MS            22
## 28           MERIT HEALTH RIVER OAKS    MS            11
## 29           MERIT HEALTH RIVER OAKS    MS            20
## 30           MERIT HEALTH RIVER OAKS    MS            15
## 31 DAVIS HOSPITAL AND MEDICAL CENTER    UT Not Available
## 32 DAVIS HOSPITAL AND MEDICAL CENTER    UT Not Available
## 33 DAVIS HOSPITAL AND MEDICAL CENTER    UT            11
## 34 DAVIS HOSPITAL AND MEDICAL CENTER    UT            34
## 35 DAVIS HOSPITAL AND MEDICAL CENTER    UT            12
## 36 DAVIS HOSPITAL AND MEDICAL CENTER    UT            32
##                   Measure
## 25      READM_30_AMI_HRRP
## 26     READM_30_CABG_HRRP
## 27     READM_30_COPD_HRRP
## 28       READM_30_HF_HRRP
## 29 READM_30_HIP_KNEE_HRRP
## 30       READM_30_PN_HRRP
## 31      READM_30_AMI_HRRP
## 32     READM_30_CABG_HRRP
## 33     READM_30_COPD_HRRP
## 34       READM_30_HF_HRRP
## 35 READM_30_HIP_KNEE_HRRP
## 36       READM_30_PN_HRRP

We have a nice chunk of data representing readmissions from hospitals from both Utah and Mississipi, along with the particular disease for which the readmission happened.

Why are the names of the hospital repeating?

This is because because each hospital records multiple readmissions measures. This representation of data is considered to be a ‘long’ format. This faciliates easy analysis and visualization in R.

Let’s try to make this easier to look at by:

  • Converting measure name codes to names of diseases/procedures.
  • Changing data from long to wide.
wide <- clean %>% 
  mutate(Disease = ifelse(Measure =="READM_30_AMI_HRRP", "Myocardial Infarction", 
                   ifelse(Measure == "READM_30_CABG_HRRP", "Bypass Graft",
                   ifelse(Measure== "READM_30_COPD_HRRP", "COPD",
                   ifelse(Measure == "READM_30_HF_HRRP", "Heart Failure",
  ifelse(Measure == "READM_30_PN_HRRP", "Pneumonia", "Hip/Knee Replacement")))))) %>%
  group_by(State, Disease) %>% 
  select(State, Disease, Readmissions) %>%
summarize(Mean = mean(as.numeric(Readmissions, na.rm = TRUE))) %>% 
  spread(Disease, Mean)

kable(wide)
State Bypass Graft COPD Heart Failure Hip/Knee Replacement Myocardial Infarction Pneumonia
MS 315.3492 208.7778 197.6032 308.6508 302.9365 204.7460
UT 343.6452 284.2581 265.0323 272.6129 329.6452 251.8387

This is an interesting table. This shows that Mississippi has, on average, fewer readmissions than Utah in every type of measure except hip/knee replacement.

Let’s use the long form data to compare readmission numbers for Utah and Mississippi. We will not be looking at the measure for readmission.

long <- clean %>% select(-Measure) %>% 
mutate(Readmissions = as.numeric(Readmissions, na.rm = TRUE))

kable(head(long, 10))
Hospital State Readmissions
ANDERSON REGIONAL MEDICAL CTR MS 278
ANDERSON REGIONAL MEDICAL CTR MS 55
ANDERSON REGIONAL MEDICAL CTR MS 35
ANDERSON REGIONAL MEDICAL CTR MS 60
ANDERSON REGIONAL MEDICAL CTR MS 44
ANDERSON REGIONAL MEDICAL CTR MS 23
BRIGHAM CITY COMMUNITY HOSPITAL UT 354
BRIGHAM CITY COMMUNITY HOSPITAL UT 354
BRIGHAM CITY COMMUNITY HOSPITAL UT 354
BRIGHAM CITY COMMUNITY HOSPITAL UT 354

Preparing for the t-test

Let’s partition this data into 2 vectors containing numbers of readmissions for hospitals, one for Utah and One for Mississippi.

x <- long %>% filter(State == "UT") %>% .$Readmissions #Utah readmissions
y <- long %>% filter(State == "MS") %>% .$Readmissions #Mississippi readmissions

We are going to use this data to perform a t-test to compare the mean number of readmissions for the two states.

There are three major assumptions for a t-test:

var.test(x,y) 
## 
##  F test to compare two variances
## 
## data:  x and y
## F = 0.83534, num df = 185, denom df = 377, p-value = 0.1657
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6544439 1.0777766
## sample estimates:
## ratio of variances 
##          0.8353404

The p-value of 0.16 suggests that we cannot reject the null hypothesis that the variances are equal. The assumption of equal variances is met.

hist(x, xlab = "Number of Readmissions", 
     ylab = "Number of Hospitals", col = "lightblue", 
     main = "Distribution of Utah Readmissions", ylim =c(0, 150)) 

hist(y, xlab = "Number of Readmissions", 
     ylab = "Number of Hospitals", col = "lightgreen", 
     main = "Distribution of Mississippi Readmissions", ylim = c(0, 200)) 

Both of these vectors are not normally distributed. There are two ways to tackle this—either by using a log transformation, using a non-parametric test like the Wilcoxon Mann Whitney. In this example, we do not have to do this, because the t-test is usually robust when samples sizes are large. Let us proceed with the test.

T-test

t.test(x, y, alternative = "less") 
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 3.4013, df = 399.18, p-value = 0.9996
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf 51.71007
## sample estimates:
## mean of x mean of y 
##  291.1720  256.3439
# p value is 0.99, much more than alpha of 0.05
#there is not enough evidence to suggest that 
#the mean number of readmissions is lesser for Utah than Mississippi.

Interpretation

There is no evidence to show that mean readmission numbers in Utah are lesser than Mississippi. Why then did Utah receive a rank of 3 and Mississippi a rank of 49? I noticed when I scrolled down in the article that while their infographic specifies that they are ranking based on readmission numbers, the commentary at the bottom shows that they are using readmission rates. This could be a reason for the discrepancy in our findings.

Let’s try a t-test again, but this time, let’s use the excess readmission ratio column to base the t-test on:

Here’s a quick repeat of the wrangling we did, but this time, we choose to keep the readmission ratio column:

colnames(dat)[7] <- "Readmission_Ratio"

stock <- dat %>% filter(State %in% c("UT", "MS")) %>% 
  select(Hospital, State, Readmission_Ratio)

long <- stock %>% select(-Hospital) %>% 
mutate(Readmission_Ratio = as.numeric(Readmission_Ratio, na.rm = TRUE))

x <- long %>% filter(State == "UT") %>% .$Readmission_Ratio 
y <- long %>% filter(State == "MS") %>% .$Readmission_Ratio

Here’s the t-test:

t.test(x, y, alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -5.7292, df = 291.25, p-value = 1.258e-08
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -436.2049
## sample estimates:
## mean of x mean of y 
##  2075.586  2688.249

The p-value is highly significant! This means we reject the hypothesis that the means of excess readmission ratios are equal, and we pick the alternate hypothesis that states that the mean of the excess readmission ratio is lesser for Utah than Mississippi.

The article ranking is vindicated. Utah is doing much better than Mississippi in keeping readmissions low.

Next steps:

This was an interesting demonstration of using statistics to verify claims made in media publications. I would like to take this further by downloading other variables for Utah hospitals from Hospital Compare and try to run a regression analysis to find out what is contributing to the readmission ratios.