This project investigates variation in Medicare inpatient utilization across U.S. hospitals by focusing on service volume (total discharges) rather than dollar amounts, which can be influenced by hospital-specific charge-setting and pricing conventions. Using the CMS Medicare Inpatient Hospitals – by Provider and Service dataset (provider–DRG level records), we explore how utilization differs by clinical service (DRG), geography, and hospital context (e.g., urban–rural classification). Our methodology includes exploratory analysis of discharge distributions and variability across categories, followed by regression-based predictive modeling of discharge counts. Because utilization is a count outcome, we fit and compare generalized linear models designed for counts (Poisson and/or negative binomial), and evaluate model fit and predictive performance using a train/test split and metrics appropriate for count prediction. We also examine whether model residuals reveal hospitals or regions with systematically higher or lower utilization than expected after accounting for service mix and location. The primary outcomes of this study are (1) an interpretable, risk-adjusted “expected discharges” model, (2) a ranked set of predictors associated with higher utilization, and (3) visual summaries, including a U.S. map of utilization patterns or model residuals by state/region, to communicate geographic differences. These results provide a data-driven starting point for discussing where utilization differs most and where further clinical or operational investigation may be warranted.
The story we focus on is inpatient utilization: we model Total Discharges (Tot_Dschrgs) across U.S. hospitals and diagnosis categories. The goal is to understand how discharge volumes vary by service type (DRG) and by hospital location (including rural–urban setting), and whether reimbursement levels are related to higher utilization. Because discharge counts are non-negative integers and vary widely across provider–DRG combinations, we compare several standard count-regression approaches and summarize which model best captures the observed utilization patterns.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tibble)
library(knitr)
library(forcats)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(httr2)
library(readr)
base_url <- "https://data.cms.gov/data-api/v1/dataset/690ddc6c-2767-4618-b277-420ffb2bf27c/data"
fetch_cms_page <- function(offset = 0, size = 5000) {
req <- request(base_url) |>
req_url_query(offset = offset, size = size) |>
req_timeout(60) |>
req_retry(max_tries = 3)
resp <- req_perform(req)
# API returns an array of rows (JSON objects)
dat <- resp_body_json(resp, simplifyVector = TRUE)
as_tibble(dat)
}
fetch_cms_all <- function(size = 5000) {
out <- list()
offset <- 0
i <- 1
repeat {
message("Downloading: offset = ", offset)
page <- fetch_cms_page(offset = offset, size = size)
if (nrow(page) == 0) break
out[[i]] <- page
# last page when returned rows < requested size
if (nrow(page) < size) break
offset <- offset + size
i <- i + 1
}
bind_rows(out)
}
cms_inpatient <- fetch_cms_all(size = 5000)
## Downloading: offset = 0
## Downloading: offset = 5000
## Downloading: offset = 10000
## Downloading: offset = 15000
## Downloading: offset = 20000
## Downloading: offset = 25000
## Downloading: offset = 30000
## Downloading: offset = 35000
## Downloading: offset = 40000
## Downloading: offset = 45000
## Downloading: offset = 50000
## Downloading: offset = 55000
## Downloading: offset = 60000
## Downloading: offset = 65000
## Downloading: offset = 70000
## Downloading: offset = 75000
## Downloading: offset = 80000
## Downloading: offset = 85000
## Downloading: offset = 90000
## Downloading: offset = 95000
## Downloading: offset = 1e+05
## Downloading: offset = 105000
## Downloading: offset = 110000
## Downloading: offset = 115000
## Downloading: offset = 120000
## Downloading: offset = 125000
## Downloading: offset = 130000
## Downloading: offset = 135000
## Downloading: offset = 140000
## Downloading: offset = 145000
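Re-downloading roughly 146,000 rows on every knit is slow. A minimal caching sketch is shown below as an alternative to the direct fetch_cms_all() call above; the file name cms_inpatient_cache.csv is our own choice, not part of the CMS API, and this is not how the run above was produced.
# Optional alternative: cache the download locally so later knits skip the API.
# Reading everything back as character keeps the parse_number() step below valid.
cache_file <- "cms_inpatient_cache.csv"
if (file.exists(cache_file)) {
  cms_inpatient <- read_csv(cache_file, col_types = cols(.default = "c"))
} else {
  cms_inpatient <- fetch_cms_all(size = 5000)
  write_csv(cms_inpatient, cache_file)
}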
# Convert numeric-looking columns safely
cms_inpatient <- cms_inpatient |>
mutate(
Tot_Dschrgs = parse_number(Tot_Dschrgs),
Avg_Submtd_Cvrd_Chrg = parse_number(Avg_Submtd_Cvrd_Chrg),
Avg_Tot_Pymt_Amt = parse_number(Avg_Tot_Pymt_Amt),
Avg_Mdcr_Pymt_Amt = parse_number(Avg_Mdcr_Pymt_Amt)
)
glimpse(cms_inpatient)
## Rows: 146,427
## Columns: 15
## $ Rndrng_Prvdr_CCN <chr> "010001", "010001", "010001", "010001", "010…
## $ Rndrng_Prvdr_Org_Name <chr> "Southeast Health Medical Center", "Southeas…
## $ Rndrng_Prvdr_City <chr> "Dothan", "Dothan", "Dothan", "Dothan", "Dot…
## $ Rndrng_Prvdr_St <chr> "1108 Ross Clark Circle", "1108 Ross Clark C…
## $ Rndrng_Prvdr_State_FIPS <chr> "01", "01", "01", "01", "01", "01", "01", "0…
## $ Rndrng_Prvdr_Zip5 <chr> "36301", "36301", "36301", "36301", "36301",…
## $ Rndrng_Prvdr_State_Abrvtn <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "A…
## $ Rndrng_Prvdr_RUCA <chr> "2", "2", "2", "2", "2", "2", "2", "2", "2",…
## $ Rndrng_Prvdr_RUCA_Desc <chr> "Metropolitan area high commuting: primary f…
## $ DRG_Cd <chr> "003", "023", "024", "025", "038", "039", "0…
## $ DRG_Desc <chr> "ECMO OR TRACHEOSTOMY WITH MV >96 HOURS OR P…
## $ Tot_Dschrgs <dbl> 14, 26, 12, 16, 11, 36, 15, 62, 59, 34, 12, …
## $ Avg_Submtd_Cvrd_Chrg <dbl> 663764.36, 180980.88, 105824.33, 242539.50, …
## $ Avg_Tot_Pymt_Amt <dbl> 120219.929, 37321.038, 26936.667, 34745.375,…
## $ Avg_Mdcr_Pymt_Amt <dbl> 115544.143, 35261.808, 25048.917, 32438.625,…
Last-minute safety checks
# 1) Missingness check
colSums(is.na(cms_inpatient))
## Rndrng_Prvdr_CCN Rndrng_Prvdr_Org_Name Rndrng_Prvdr_City
## 0 0 0
## Rndrng_Prvdr_St Rndrng_Prvdr_State_FIPS Rndrng_Prvdr_Zip5
## 0 0 0
## Rndrng_Prvdr_State_Abrvtn Rndrng_Prvdr_RUCA Rndrng_Prvdr_RUCA_Desc
## 0 0 0
## DRG_Cd DRG_Desc Tot_Dschrgs
## 0 0 0
## Avg_Submtd_Cvrd_Chrg Avg_Tot_Pymt_Amt Avg_Mdcr_Pymt_Amt
## 0 0 0
# 2) Sanity check: are the numeric columns non-negative?
summary(dplyr::select(
cms_inpatient,
Tot_Dschrgs, Avg_Submtd_Cvrd_Chrg, Avg_Tot_Pymt_Amt, Avg_Mdcr_Pymt_Amt
))
## Tot_Dschrgs Avg_Submtd_Cvrd_Chrg Avg_Tot_Pymt_Amt Avg_Mdcr_Pymt_Amt
## Min. : 11.00 Min. : 3368 Min. : 1938 Min. : 180.3
## 1st Qu.: 14.00 1st Qu.: 35263 1st Qu.: 8686 1st Qu.: 6794.3
## Median : 20.00 Median : 58669 Median : 12831 Median : 10548.2
## Mean : 33.88 Mean : 90794 Mean : 18513 Mean : 15331.4
## 3rd Qu.: 35.00 3rd Qu.: 104217 3rd Qu.: 20403 3rd Qu.: 16815.0
## Max. :3210.00 Max. :10418933 Max. :761739 Max. :751479.0
We use CMS inpatient data at the provider–DRG level. The main outcome is Tot_Dschrgs (total discharges), and the key predictors are DRG_Cd (Diagnosis-Related Group code), Rndrng_Prvdr_St, and Rndrng_Prvdr_RUCA (rural–urban commuting area code). Note that in this dataset Rndrng_Prvdr_St is the provider's street address, not the state; the state abbreviation is stored in Rndrng_Prvdr_State_Abrvtn (see the glimpse above). After the rare-level lumping applied later, Rndrng_Prvdr_St therefore acts as an indicator for the 60 most frequent hospital addresses plus an "Other" bucket, i.e., a hospital-level location factor rather than a state effect. To capture how payment levels relate to utilization while reducing skew, we also include the log of the average Medicare payment: log_medicare_payment = log(Avg_Mdcr_Pymt_Amt). We apply log transforms to the dollar variables (and examine the log-discharge distribution for diagnostics) because these variables are highly right-skewed with large outliers; logging makes relationships more nearly linear and stabilizes variance, which improves interpretability and model stability, especially when comparing Poisson, negative binomial, and a log-linear OLS baseline.
Because of the very large maximum values seen in the summaries above, plots and models behave better when the dollar variables are log-transformed:
cms_inpatient2 <- cms_inpatient %>%
mutate(
log_charge = log(Avg_Submtd_Cvrd_Chrg),
log_total_payment = log(Avg_Tot_Pymt_Amt),
log_medicare_payment = log(Avg_Mdcr_Pymt_Amt)
)
We start by exploring the distribution of the total number of discharges (Tot_Dschrgs). The summary statistics computed above show that the minimum number of discharges for any provider–DRG combination is 11 (consistent with CMS suppressing provider–DRG cells with fewer than 11 discharges) and the maximum is over 3,200. We create histograms, plus a boxplot sketched after them, to visualize the distribution of discharges on the original scale and after a logarithmic transformation.
library(ggplot2)
## Distribution of discharges
ggplot(cms_inpatient2, aes(x = Tot_Dschrgs)) +
geom_histogram(bins = 50) +
labs(title = "Distribution of Total Discharges", x = "Total Discharges", y = "Count")
## Distribution of log discharges
cms_inpatient2$log_discharges <- log(cms_inpatient2$Tot_Dschrgs)
ggplot(cms_inpatient2, aes(x = log_discharges)) +
geom_histogram(bins = 50) +
labs(title = "Distribution of Log Total Discharges", x = "Log(Total Discharges)", y = "Count")
Next, we examine average discharges by diagnosis-related group (DRG) and by provider location to get a sense of variation across services and geography.
## Average discharges by DRG (top 10)
drg_summary <- cms_inpatient2 %>%
group_by(DRG_Cd) %>%
summarise(mean_discharges = mean(Tot_Dschrgs),
count = n()) %>%
arrange(desc(mean_discharges)) %>%
head(10)
knitr::kable(drg_summary, digits = 2, caption = "Top 10 DRGs by Average Discharges")
| DRG_Cd | mean_discharges | count |
|---|---|---|
| 871 | 209.78 | 2678 |
| 291 | 121.42 | 2633 |
| 274 | 69.96 | 757 |
| 885 | 69.47 | 571 |
| 177 | 67.42 | 2472 |
| 895 | 58.80 | 61 |
| 267 | 58.34 | 672 |
| 470 | 54.87 | 1311 |
| 193 | 52.15 | 2412 |
| 946 | 49.00 | 6 |
## Average discharges by provider street address (top 10)
## Note: Rndrng_Prvdr_St is the street address, so this ranks individual
## hospital locations; a state-level version using Rndrng_Prvdr_State_Abrvtn
## is sketched after this table.
address_summary <- cms_inpatient2 %>%
  group_by(Rndrng_Prvdr_St) %>%
  summarise(mean_discharges = mean(Tot_Dschrgs),
            count = n()) %>%
  arrange(desc(mean_discharges)) %>%
  head(10)
knitr::kable(address_summary, digits = 2, caption = "Top 10 Provider Addresses by Average Discharges")
| Rndrng_Prvdr_St | mean_discharges | count |
|---|---|---|
| 107 Lincoln Street | 353.25 | 4 |
| 3015 Veterans Parkway | 259.00 | 3 |
| 100 Trich Drive | 220.00 | 1 |
| 2520 Troy Drive | 206.00 | 2 |
| 3651 College Blvd | 187.00 | 3 |
| 401 South Santa Fe Avenue | 183.50 | 2 |
| 1711 West Temple Street | 145.05 | 19 |
| 23901 Lahser | 145.00 | 2 |
| 5025 N Paulina Street | 123.00 | 2 |
| 3029 West Main Street | 119.00 | 3 |
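A true state-level comparison would group by the state abbreviation column instead. A minimal sketch (its output is not reproduced here):
## Average discharges by state (sketch using the actual state column)
state_summary <- cms_inpatient2 %>%
  group_by(Rndrng_Prvdr_State_Abrvtn) %>%
  summarise(mean_discharges = mean(Tot_Dschrgs),
            count = n()) %>%
  arrange(desc(mean_discharges)) %>%
  head(10)
knitr::kable(state_summary, digits = 2, caption = "Top 10 States by Average Discharges")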
To model utilization (total discharges) we consider three types of models: (1) a Poisson GLM with a log link, (2) a negative binomial GLM, which relaxes the Poisson equal mean-variance assumption, and (3) a log-linear OLS baseline fit to log(Tot_Dschrgs).
We split the data into a training and a testing set (80/20 split) and fit each model on the training data. Predictor variables include the diagnosis-related group (DRG_Cd), the provider street-address factor (Rndrng_Prvdr_St, which after lumping acts as a hospital-level location indicator), the rural–urban commuting area code (Rndrng_Prvdr_RUCA), and the log of average Medicare payment (log_medicare_payment) to capture any relationship between payment and utilization. Because the categorical predictors have many levels, the resulting models include many dummy variables; fitting them on the full data may take some time.
set.seed(123)
DEBUG_SAMPLE_N <- NA # set to a row count for faster iteration while editing; NA uses the full data for the final run
TRAIN_PROP <- 0.80
TOP_DRG <- 50
TOP_ST <- 60
TOP_RUCA <- 20
required_cols <- c("Tot_Dschrgs", "DRG_Cd", "Rndrng_Prvdr_St", "Rndrng_Prvdr_RUCA", "log_medicare_payment")
missing <- setdiff(required_cols, names(cms_inpatient2))
if (length(missing) > 0) stop("Missing columns in cms_inpatient2: ", paste(missing, collapse = ", "))
# 1) Build modeling data
model_data <- cms_inpatient2 %>%
dplyr::select(dplyr::all_of(required_cols)) %>%
mutate(
DRG_Cd = factor(DRG_Cd),
Rndrng_Prvdr_St = factor(Rndrng_Prvdr_St),
Rndrng_Prvdr_RUCA = factor(Rndrng_Prvdr_RUCA)
) %>%
filter(Tot_Dschrgs > 0, is.finite(log_medicare_payment))
# Optional sample for speed (sample BEFORE lumping is fine)
if (!is.na(DEBUG_SAMPLE_N) && DEBUG_SAMPLE_N < nrow(model_data)) {
set.seed(123)
model_data <- dplyr::slice_sample(model_data, n = DEBUG_SAMPLE_N)
}
# Lump rare categories into "Other" (on the sampled data)
model_data <- model_data %>%
mutate(
DRG_Cd = fct_lump_n(DRG_Cd, n = TOP_DRG, other_level = "Other"),
Rndrng_Prvdr_St = fct_lump_n(Rndrng_Prvdr_St, n = TOP_ST, other_level = "Other"),
Rndrng_Prvdr_RUCA = fct_lump_n(Rndrng_Prvdr_RUCA, n = TOP_RUCA, other_level = "Other")
)
cat("Rows used:", nrow(model_data), "\n")
## Rows used: 146427
cat("Levels (DRG, State, RUCA):\n")
## Levels (DRG, State, RUCA):
print(c(
DRG = nlevels(model_data$DRG_Cd),
State = nlevels(model_data$Rndrng_Prvdr_St),
RUCA = nlevels(model_data$Rndrng_Prvdr_RUCA)
))
## DRG State RUCA
## 51 61 20
# 2) Split that GUARANTEES every factor level appears in training
ensure_levels_in_train <- function(df, vars) {
idx <- integer(0)
for (v in vars) {
lvls <- levels(df[[v]])
for (lv in lvls) {
rows <- which(df[[v]] == lv)
if (length(rows) > 0) idx <- c(idx, sample(rows, 1))
}
}
unique(idx)
}
set.seed(123)
n <- nrow(model_data)
target_train_n <- floor(TRAIN_PROP * n)
seed_idx <- ensure_levels_in_train(model_data, c("DRG_Cd","Rndrng_Prvdr_St","Rndrng_Prvdr_RUCA"))
remaining <- setdiff(seq_len(n), seed_idx)
need <- max(0, target_train_n - length(seed_idx))
train_id <- c(seed_idx, sample(remaining, need))
test_id <- setdiff(seq_len(n), train_id)
train_data <- model_data[train_id, ]
test_data <- model_data[test_id, ]
# 3) Extra safety: force test levels to match training levels exactly
for (v in c("DRG_Cd","Rndrng_Prvdr_St","Rndrng_Prvdr_RUCA")) {
train_data[[v]] <- forcats::fct_expand(train_data[[v]], "Other")
test_chr <- as.character(test_data[[v]])
test_chr[is.na(test_chr)] <- "Other"
test_chr[!test_chr %in% levels(train_data[[v]])] <- "Other"
test_data[[v]] <- factor(test_chr, levels = levels(train_data[[v]]))
}
# Quick diagnostic: should be 0 unseen levels now
unseen_counts <- sapply(c("DRG_Cd","Rndrng_Prvdr_St","Rndrng_Prvdr_RUCA"), function(v) {
sum(!as.character(test_data[[v]]) %in% levels(train_data[[v]]))
})
cat("Unseen levels in test (should be 0):\n")
## Unseen levels in test (should be 0):
print(unseen_counts)
## DRG_Cd Rndrng_Prvdr_St Rndrng_Prvdr_RUCA
## 0 0 0
# 4) Fit models
cat("Fitting Poisson...\n")
## Fitting Poisson...
t_poisson <- system.time({
poisson_model <- glm(
Tot_Dschrgs ~ DRG_Cd + Rndrng_Prvdr_St + Rndrng_Prvdr_RUCA + log_medicare_payment,
family = poisson(link = "log"),
data = train_data,
control = glm.control(maxit = 25)
)
})
print(t_poisson)
## user system elapsed
## 10.66 0.15 10.82
cat("Fitting Negative Binomial...\n")
## Fitting Negative Binomial...
t_nb <- system.time({
nb_model <- MASS::glm.nb(
Tot_Dschrgs ~ DRG_Cd + Rndrng_Prvdr_St + Rndrng_Prvdr_RUCA + log_medicare_payment,
data = train_data,
control = glm.control(maxit = 25),
maxit = 25,
trace = FALSE
)
})
print(t_nb)
## user system elapsed
## 34.12 0.79 34.92
cat("Fitting log-linear OLS...\n")
## Fitting log-linear OLS...
t_lm <- system.time({
lm_model <- lm(
log(Tot_Dschrgs) ~ DRG_Cd + Rndrng_Prvdr_St + Rndrng_Prvdr_RUCA + log_medicare_payment,
data = train_data
)
})
print(t_lm)
## user system elapsed
## 1.74 0.07 1.78
# 5) Predict on the held-out test set
test_data$pred_poisson <- predict(poisson_model, newdata = test_data, type = "response")
test_data$pred_nb <- predict(nb_model, newdata = test_data, type = "response")
test_data$pred_lm <- exp(predict(lm_model, newdata = test_data))
# 6) Evaluate
rmse <- function(a, p) sqrt(mean((a - p)^2))
mae <- function(a, p) mean(abs(a - p))
model_results <- tibble(
Model = c("Poisson", "Negative Binomial", "Log-linear OLS"),
RMSE = c(
rmse(test_data$Tot_Dschrgs, test_data$pred_poisson),
rmse(test_data$Tot_Dschrgs, test_data$pred_nb),
rmse(test_data$Tot_Dschrgs, test_data$pred_lm)
),
MAE = c(
mae(test_data$Tot_Dschrgs, test_data$pred_poisson),
mae(test_data$Tot_Dschrgs, test_data$pred_nb),
mae(test_data$Tot_Dschrgs, test_data$pred_lm)
)
)
knitr::kable(model_results, digits = 2,
caption = "Model performance on test data (Utilization/Discharges)")
| Model | RMSE | MAE |
|---|---|---|
| Poisson | 37.20 | 16.62 |
| Negative Binomial | 38.09 | 16.73 |
| Log-linear OLS | 40.90 | 15.98 |
The table compares how well each model predicts Tot_Dschrgs (discharges/utilization) on the held-out test data using RMSE and MAE (both are average prediction-error measures; lower is better).
Here, the Poisson model has the lowest RMSE and the log-linear OLS model has the lowest MAE, with the negative binomial close behind on both: Poisson does slightly better on typical squared-error performance, while OLS does slightly better on average absolute error.
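RMSE and MAE weight errors on small and large counts the same way; a count-specific criterion such as the mean Poisson deviance judges errors relative to the predicted rate. A quick sketch (not part of the original evaluation) using the predictions already stored in test_data:
# Mean Poisson deviance on the test set (sketch; lower is better).
# Safe here because Tot_Dschrgs >= 11, so log(y) is always defined.
poisson_deviance <- function(y, mu) 2 * mean(y * log(y / mu) - (y - mu))
c(
  Poisson = poisson_deviance(test_data$Tot_Dschrgs, test_data$pred_poisson),
  NegBin = poisson_deviance(test_data$Tot_Dschrgs, test_data$pred_nb),
  OLS = poisson_deviance(test_data$Tot_Dschrgs, test_data$pred_lm)
)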
Next, we interpret coefficients from the count model to identify which DRGs and location factors are most strongly associated with discharge volume. Because utilization is a count outcome and the data are overdispersed, we prioritize the negative binomial model for interpretation, focusing on the largest effects and aggregate patterns across the DRG, provider-address, and RUCA categories. This section carries the big-picture findings for the final presentation.
Model Summary
summary(nb_model)
##
## Call:
## MASS::glm.nb(formula = Tot_Dschrgs ~ DRG_Cd + Rndrng_Prvdr_St +
## Rndrng_Prvdr_RUCA + log_medicare_payment, data = train_data,
## control = glm.control(maxit = 25), maxit = 25, trace = FALSE,
## init.theta = 3.451298667, link = log)
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 1.3982939 0.2967331
## DRG_Cd065 0.0618367 0.0221746
## DRG_Cd175 -0.6403591 0.0266323
## DRG_Cd177 0.5235724 0.0205199
## DRG_Cd178 -0.3172377 0.0245420
## DRG_Cd189 0.1230315 0.0211711
## DRG_Cd190 -0.1234603 0.0224724
## DRG_Cd193 0.3538645 0.0207232
## DRG_Cd194 -0.3027007 0.0237367
## DRG_Cd208 -0.6640045 0.0257403
## DRG_Cd246 -0.5070509 0.0257304
## DRG_Cd247 -0.2725159 0.0240823
## DRG_Cd280 0.1105178 0.0217205
## DRG_Cd281 -0.4226186 0.0244798
## DRG_Cd286 -0.3124966 0.0250185
## DRG_Cd287 -0.3345458 0.0255680
## DRG_Cd291 1.1607707 0.0202071
## DRG_Cd308 -0.3036453 0.0230030
## DRG_Cd309 -0.0112784 0.0226609
## DRG_Cd310 -0.3335787 0.0261500
## DRG_Cd312 -0.1210043 0.0233056
## DRG_Cd314 -0.5462924 0.0257237
## DRG_Cd329 -0.6628565 0.0262714
## DRG_Cd330 -0.3615328 0.0252503
## DRG_Cd377 -0.2219765 0.0227571
## DRG_Cd378 0.1324901 0.0220048
## DRG_Cd389 -0.4844575 0.0241064
## DRG_Cd391 -0.4963816 0.0248925
## DRG_Cd392 0.1749381 0.0217509
## DRG_Cd394 -0.5197554 0.0261944
## DRG_Cd470 0.2853675 0.0236111
## DRG_Cd480 -0.6758924 0.0259080
## DRG_Cd481 -0.1480020 0.0221418
## DRG_Cd522 -0.4452626 0.0233672
## DRG_Cd552 -0.1733880 0.0245414
## DRG_Cd603 -0.1459281 0.0224846
## DRG_Cd637 -0.7204204 0.0255349
## DRG_Cd638 -0.5963475 0.0251423
## DRG_Cd640 -0.0643255 0.0218544
## DRG_Cd641 -0.1258449 0.0219637
## DRG_Cd682 -0.0096650 0.0215925
## DRG_Cd683 0.0459899 0.0215453
## DRG_Cd689 0.0482916 0.0218094
## DRG_Cd690 0.1248251 0.0214500
## DRG_Cd698 -0.0210627 0.0221101
## DRG_Cd699 -0.3752163 0.0259514
## DRG_Cd812 -0.4705738 0.0253299
## DRG_Cd853 -0.0536401 0.0222078
## DRG_Cd871 1.6638268 0.0199717
## DRG_Cd872 0.2505532 0.0210488
## DRG_CdOther -0.7314601 0.0164068
## Rndrng_Prvdr_St100 Madison Ave 0.2625692 0.0517168
## Rndrng_Prvdr_St1000 Blythe Blvd -0.0758730 0.0519588
## Rndrng_Prvdr_St101 Sivley Rd 0.1679558 0.0534270
## Rndrng_Prvdr_St111 South 11th Street -0.0327783 0.0536373
## Rndrng_Prvdr_St1200 South Cedar Crest Boulevard 0.4356613 0.0491926
## Rndrng_Prvdr_St1211 Medical Center Drive 0.1027909 0.0512381
## Rndrng_Prvdr_St1216 Second Street Southwest 0.3374302 0.0471819
## Rndrng_Prvdr_St1500 E Medical Center Drive, Spc 5474 -0.0449624 0.0553600
## Rndrng_Prvdr_St1600 South 48th St 0.0595970 0.0538970
## Rndrng_Prvdr_St1600 Sw Archer Rd 0.0063067 0.0539968
## Rndrng_Prvdr_St169 Ashley Ave 0.0107445 0.0535033
## Rndrng_Prvdr_St1700 S Tamiami Trl 0.4913209 0.0487173
## Rndrng_Prvdr_St1945 State Route 33 0.1062317 0.0537457
## Rndrng_Prvdr_St20 York St 0.1209532 0.0509467
## Rndrng_Prvdr_St200 East Chestnut Street 0.2344841 0.0517249
## Rndrng_Prvdr_St200 Lothrop Street -0.1046249 0.0544556
## Rndrng_Prvdr_St2100 Erwin Rd 0.0003371 0.0531830
## Rndrng_Prvdr_St2100 Stantonsburg Rd 0.0850446 0.0533941
## Rndrng_Prvdr_St2131 S 17th St Box 9000 0.1731225 0.0524833
## Rndrng_Prvdr_St251 E Huron St 0.1586621 0.0524620
## Rndrng_Prvdr_St2650 Ridge Ave 0.3315758 0.0515827
## Rndrng_Prvdr_St270 - 05 76th Avenue 0.1830098 0.0541577
## Rndrng_Prvdr_St300 Community Drive 0.2669454 0.0502584
## Rndrng_Prvdr_St300 Pasteur Drive 0.0866047 0.0502154
## Rndrng_Prvdr_St330 Brookline Avenue -0.0222496 0.0509393
## Rndrng_Prvdr_St3300 Gallows Road 0.2316641 0.0536895
## Rndrng_Prvdr_St34th & Spruce Sts 0.0133881 0.0531559
## Rndrng_Prvdr_St350 7th St N 0.2010806 0.0536927
## Rndrng_Prvdr_St3535 Olentangy River Rd 0.1449188 0.0530972
## Rndrng_Prvdr_St3601 W Thirteen Mile Rd 0.1495785 0.0519880
## Rndrng_Prvdr_St4000 Cambridge Street 0.1116961 0.0513959
## Rndrng_Prvdr_St4500 San Pablo Rd 0.0636239 0.0541604
## Rndrng_Prvdr_St4755 Ogletown-Stanton Road 0.4634774 0.0483983
## Rndrng_Prvdr_St505 Parnassus Ave, Box 0296 -0.0236681 0.0536772
## Rndrng_Prvdr_St509 Biltmore Ave 0.1488837 0.0527218
## Rndrng_Prvdr_St52 W Underwood St 0.2802639 0.0504253
## Rndrng_Prvdr_St525 East 68th Street 0.6276245 0.0440462
## Rndrng_Prvdr_St55 Fogg Road 0.3029302 0.0530472
## Rndrng_Prvdr_St55 Fruit Street 0.2638586 0.0492033
## Rndrng_Prvdr_St55 Lake Avenue North 0.0644356 0.0524918
## Rndrng_Prvdr_St550 First Avenue 0.6194161 0.0458981
## Rndrng_Prvdr_St5777 East Mayo Boulevard 0.1525402 0.0535845
## Rndrng_Prvdr_St600 Highland Avenue -0.0364390 0.0534577
## Rndrng_Prvdr_St600 Mary St 0.2451897 0.0531177
## Rndrng_Prvdr_St600 North Wolfe Street 0.0256733 0.0519197
## Rndrng_Prvdr_St601 E Rollins St 0.7820041 0.0448559
## Rndrng_Prvdr_St6019 Walnut Grove Road 0.1909223 0.0518544
## Rndrng_Prvdr_St619 South 19th Street 0.0187043 0.0518181
## Rndrng_Prvdr_St6565 Fannin 0.1643211 0.0514121
## Rndrng_Prvdr_St75 Francis Street 0.2361540 0.0497020
## Rndrng_Prvdr_St759 Chestnut Street 0.1606021 0.0530796
## Rndrng_Prvdr_St7700 Floyd Curl Dr 0.4491292 0.0479169
## Rndrng_Prvdr_St800 Prudential Dr 0.3187528 0.0507135
## Rndrng_Prvdr_St8700 Beverly Blvd 0.3832132 0.0496082
## Rndrng_Prvdr_St9500 Euclid Avenue 0.1922532 0.0518044
## Rndrng_Prvdr_StHealth Sciences Center Suny 0.2472501 0.0494797
## Rndrng_Prvdr_StOne Barnes-Jewish Hospital Plaza 0.0832560 0.0494219
## Rndrng_Prvdr_StOne Gustave L Levy Place 0.2065810 0.0522729
## Rndrng_Prvdr_StOne Wyoming Street 0.0596072 0.0534907
## Rndrng_Prvdr_StOther -0.2976400 0.0314470
## Rndrng_Prvdr_RUCA1 1.6191396 0.2930673
## Rndrng_Prvdr_RUCA1.1 1.5340547 0.2933525
## Rndrng_Prvdr_RUCA10 1.0861189 0.2945199
## Rndrng_Prvdr_RUCA10.1 1.7357404 0.3017321
## Rndrng_Prvdr_RUCA10.3 0.7710579 0.5107055
## Rndrng_Prvdr_RUCA2 1.4729047 0.2933679
## Rndrng_Prvdr_RUCA2.1 0.8742745 0.3199323
## Rndrng_Prvdr_RUCA3 0.7835080 0.3203371
## Rndrng_Prvdr_RUCA4 1.2259406 0.2931166
## Rndrng_Prvdr_RUCA4.1 1.1412413 0.2942213
## Rndrng_Prvdr_RUCA5 1.5028663 0.2942437
## Rndrng_Prvdr_RUCA6 0.2138522 0.3649867
## Rndrng_Prvdr_RUCA7 0.8234152 0.2935001
## Rndrng_Prvdr_RUCA7.1 0.7206591 0.3040949
## Rndrng_Prvdr_RUCA7.2 1.3246520 0.3018509
## Rndrng_Prvdr_RUCA8 0.6613702 0.3045020
## Rndrng_Prvdr_RUCA8.2 0.7831840 0.5256310
## Rndrng_Prvdr_RUCA9 0.5329886 0.3612613
## Rndrng_Prvdr_RUCA99 1.6603314 0.2942745
## log_medicare_payment 0.1048660 0.0030645
## z value Pr(>|z|)
## (Intercept) 4.712 2.45e-06 ***
## DRG_Cd065 2.789 0.005293 **
## DRG_Cd175 -24.044 < 2e-16 ***
## DRG_Cd177 25.515 < 2e-16 ***
## DRG_Cd178 -12.926 < 2e-16 ***
## DRG_Cd189 5.811 6.20e-09 ***
## DRG_Cd190 -5.494 3.93e-08 ***
## DRG_Cd193 17.076 < 2e-16 ***
## DRG_Cd194 -12.752 < 2e-16 ***
## DRG_Cd208 -25.796 < 2e-16 ***
## DRG_Cd246 -19.706 < 2e-16 ***
## DRG_Cd247 -11.316 < 2e-16 ***
## DRG_Cd280 5.088 3.62e-07 ***
## DRG_Cd281 -17.264 < 2e-16 ***
## DRG_Cd286 -12.491 < 2e-16 ***
## DRG_Cd287 -13.085 < 2e-16 ***
## DRG_Cd291 57.444 < 2e-16 ***
## DRG_Cd308 -13.200 < 2e-16 ***
## DRG_Cd309 -0.498 0.618694
## DRG_Cd310 -12.756 < 2e-16 ***
## DRG_Cd312 -5.192 2.08e-07 ***
## DRG_Cd314 -21.237 < 2e-16 ***
## DRG_Cd329 -25.231 < 2e-16 ***
## DRG_Cd330 -14.318 < 2e-16 ***
## DRG_Cd377 -9.754 < 2e-16 ***
## DRG_Cd378 6.021 1.73e-09 ***
## DRG_Cd389 -20.097 < 2e-16 ***
## DRG_Cd391 -19.941 < 2e-16 ***
## DRG_Cd392 8.043 8.78e-16 ***
## DRG_Cd394 -19.842 < 2e-16 ***
## DRG_Cd470 12.086 < 2e-16 ***
## DRG_Cd480 -26.088 < 2e-16 ***
## DRG_Cd481 -6.684 2.32e-11 ***
## DRG_Cd522 -19.055 < 2e-16 ***
## DRG_Cd552 -7.065 1.60e-12 ***
## DRG_Cd603 -6.490 8.58e-11 ***
## DRG_Cd637 -28.213 < 2e-16 ***
## DRG_Cd638 -23.719 < 2e-16 ***
## DRG_Cd640 -2.943 0.003247 **
## DRG_Cd641 -5.730 1.01e-08 ***
## DRG_Cd682 -0.448 0.654437
## DRG_Cd683 2.135 0.032797 *
## DRG_Cd689 2.214 0.026811 *
## DRG_Cd690 5.819 5.91e-09 ***
## DRG_Cd698 -0.953 0.340778
## DRG_Cd699 -14.458 < 2e-16 ***
## DRG_Cd812 -18.578 < 2e-16 ***
## DRG_Cd853 -2.415 0.015719 *
## DRG_Cd871 83.309 < 2e-16 ***
## DRG_Cd872 11.903 < 2e-16 ***
## DRG_CdOther -44.583 < 2e-16 ***
## Rndrng_Prvdr_St100 Madison Ave 5.077 3.83e-07 ***
## Rndrng_Prvdr_St1000 Blythe Blvd -1.460 0.144221
## Rndrng_Prvdr_St101 Sivley Rd 3.144 0.001669 **
## Rndrng_Prvdr_St111 South 11th Street -0.611 0.541127
## Rndrng_Prvdr_St1200 South Cedar Crest Boulevard 8.856 < 2e-16 ***
## Rndrng_Prvdr_St1211 Medical Center Drive 2.006 0.044841 *
## Rndrng_Prvdr_St1216 Second Street Southwest 7.152 8.57e-13 ***
## Rndrng_Prvdr_St1500 E Medical Center Drive, Spc 5474 -0.812 0.416687
## Rndrng_Prvdr_St1600 South 48th St 1.106 0.268831
## Rndrng_Prvdr_St1600 Sw Archer Rd 0.117 0.907020
## Rndrng_Prvdr_St169 Ashley Ave 0.201 0.840840
## Rndrng_Prvdr_St1700 S Tamiami Trl 10.085 < 2e-16 ***
## Rndrng_Prvdr_St1945 State Route 33 1.977 0.048091 *
## Rndrng_Prvdr_St20 York St 2.374 0.017591 *
## Rndrng_Prvdr_St200 East Chestnut Street 4.533 5.81e-06 ***
## Rndrng_Prvdr_St200 Lothrop Street -1.921 0.054696 .
## Rndrng_Prvdr_St2100 Erwin Rd 0.006 0.994942
## Rndrng_Prvdr_St2100 Stantonsburg Rd 1.593 0.111212
## Rndrng_Prvdr_St2131 S 17th St Box 9000 3.299 0.000972 ***
## Rndrng_Prvdr_St251 E Huron St 3.024 0.002492 **
## Rndrng_Prvdr_St2650 Ridge Ave 6.428 1.29e-10 ***
## Rndrng_Prvdr_St270 - 05 76th Avenue 3.379 0.000727 ***
## Rndrng_Prvdr_St300 Community Drive 5.311 1.09e-07 ***
## Rndrng_Prvdr_St300 Pasteur Drive 1.725 0.084588 .
## Rndrng_Prvdr_St330 Brookline Avenue -0.437 0.662266
## Rndrng_Prvdr_St3300 Gallows Road 4.315 1.60e-05 ***
## Rndrng_Prvdr_St34th & Spruce Sts 0.252 0.801145
## Rndrng_Prvdr_St350 7th St N 3.745 0.000180 ***
## Rndrng_Prvdr_St3535 Olentangy River Rd 2.729 0.006347 **
## Rndrng_Prvdr_St3601 W Thirteen Mile Rd 2.877 0.004013 **
## Rndrng_Prvdr_St4000 Cambridge Street 2.173 0.029762 *
## Rndrng_Prvdr_St4500 San Pablo Rd 1.175 0.240102
## Rndrng_Prvdr_St4755 Ogletown-Stanton Road 9.576 < 2e-16 ***
## Rndrng_Prvdr_St505 Parnassus Ave, Box 0296 -0.441 0.659261
## Rndrng_Prvdr_St509 Biltmore Ave 2.824 0.004744 **
## Rndrng_Prvdr_St52 W Underwood St 5.558 2.73e-08 ***
## Rndrng_Prvdr_St525 East 68th Street 14.249 < 2e-16 ***
## Rndrng_Prvdr_St55 Fogg Road 5.711 1.13e-08 ***
## Rndrng_Prvdr_St55 Fruit Street 5.363 8.20e-08 ***
## Rndrng_Prvdr_St55 Lake Avenue North 1.228 0.219621
## Rndrng_Prvdr_St550 First Avenue 13.495 < 2e-16 ***
## Rndrng_Prvdr_St5777 East Mayo Boulevard 2.847 0.004417 **
## Rndrng_Prvdr_St600 Highland Avenue -0.682 0.495465
## Rndrng_Prvdr_St600 Mary St 4.616 3.91e-06 ***
## Rndrng_Prvdr_St600 North Wolfe Street 0.494 0.620967
## Rndrng_Prvdr_St601 E Rollins St 17.434 < 2e-16 ***
## Rndrng_Prvdr_St6019 Walnut Grove Road 3.682 0.000232 ***
## Rndrng_Prvdr_St619 South 19th Street 0.361 0.718129
## Rndrng_Prvdr_St6565 Fannin 3.196 0.001393 **
## Rndrng_Prvdr_St75 Francis Street 4.751 2.02e-06 ***
## Rndrng_Prvdr_St759 Chestnut Street 3.026 0.002481 **
## Rndrng_Prvdr_St7700 Floyd Curl Dr 9.373 < 2e-16 ***
## Rndrng_Prvdr_St800 Prudential Dr 6.285 3.27e-10 ***
## Rndrng_Prvdr_St8700 Beverly Blvd 7.725 1.12e-14 ***
## Rndrng_Prvdr_St9500 Euclid Avenue 3.711 0.000206 ***
## Rndrng_Prvdr_StHealth Sciences Center Suny 4.997 5.82e-07 ***
## Rndrng_Prvdr_StOne Barnes-Jewish Hospital Plaza 1.685 0.092066 .
## Rndrng_Prvdr_StOne Gustave L Levy Place 3.952 7.75e-05 ***
## Rndrng_Prvdr_StOne Wyoming Street 1.114 0.265131
## Rndrng_Prvdr_StOther -9.465 < 2e-16 ***
## Rndrng_Prvdr_RUCA1 5.525 3.30e-08 ***
## Rndrng_Prvdr_RUCA1.1 5.229 1.70e-07 ***
## Rndrng_Prvdr_RUCA10 3.688 0.000226 ***
## Rndrng_Prvdr_RUCA10.1 5.753 8.79e-09 ***
## Rndrng_Prvdr_RUCA10.3 1.510 0.131097
## Rndrng_Prvdr_RUCA2 5.021 5.15e-07 ***
## Rndrng_Prvdr_RUCA2.1 2.733 0.006282 **
## Rndrng_Prvdr_RUCA3 2.446 0.014450 *
## Rndrng_Prvdr_RUCA4 4.182 2.88e-05 ***
## Rndrng_Prvdr_RUCA4.1 3.879 0.000105 ***
## Rndrng_Prvdr_RUCA5 5.108 3.26e-07 ***
## Rndrng_Prvdr_RUCA6 0.586 0.557931
## Rndrng_Prvdr_RUCA7 2.806 0.005024 **
## Rndrng_Prvdr_RUCA7.1 2.370 0.017795 *
## Rndrng_Prvdr_RUCA7.2 4.388 1.14e-05 ***
## Rndrng_Prvdr_RUCA8 2.172 0.029858 *
## Rndrng_Prvdr_RUCA8.2 1.490 0.136227
## Rndrng_Prvdr_RUCA9 1.475 0.140117
## Rndrng_Prvdr_RUCA99 5.642 1.68e-08 ***
## log_medicare_payment 34.219 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(3.4513) family taken to be 1)
##
## Null deviance: 248009 on 117140 degrees of freedom
## Residual deviance: 118174 on 117010 degrees of freedom
## AIC: 958720
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 3.4513
## Std. Err.: 0.0151
##
## 2 x log-likelihood: -958455.5810
RESULT: The negative binomial model fit successfully, and log_medicare_payment is strongly associated with discharge counts even after controlling for DRG, the hospital-address factor, and RUCA (it is highly statistically significant). The estimated dispersion parameter theta = 3.4513 implies a variance of roughly mu + mu^2/3.45, well above the Poisson assumption that the variance equals the mean, which is exactly why the negative binomial model is more appropriate than plain Poisson.
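To make that concrete, a back-of-the-envelope sketch (not part of the original output) compares the implied variances at the average discharge count in the training data:
# Implied variance at the sample mean under each model (illustrative sketch)
theta_hat <- nb_model$theta # roughly 3.45
mu_bar <- mean(train_data$Tot_Dschrgs)
c(
  poisson_variance = mu_bar, # Poisson: Var = mu
  negbin_variance = mu_bar + mu_bar^2 / theta_hat # NB2: Var = mu + mu^2/theta
)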
The big picture: the negative binomial model shows that inpatient discharges vary widely across the country, and the biggest drivers of discharge counts are the type of case (DRG) and where the hospital sits (the hospital-address factor and the rural–urban category). The overdispersion check supports using a negative binomial model instead of Poisson, because the discharge counts have far more variability than the Poisson assumptions allow. The coefficient table (with IRR values) highlights which DRGs and locations are linked to notably higher or lower discharge volumes, and it also shows that higher average Medicare payment is associated with higher discharge counts even after controlling for case mix and location. In short, hospitals do not all use inpatient care the same way: discharge volumes depend heavily on what kinds of patients they treat and where they operate, and higher-paid services tend to show higher utilization.
Overdispersion check (quick justification for the negative binomial)
poisson_disp <- sum(residuals(poisson_model, type = "pearson")^2) / df.residual(poisson_model)
poisson_disp
## [1] 19.20035
This is further evidence that Poisson is a poor fit and the negative binomial is more appropriate: 19.2 is the Poisson model's Pearson dispersion statistic, and values far above 1 indicate heavy overdispersion.
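A likelihood-based comparison points the same way; a one-line sketch (output not shown), where we would expect the negative binomial to have the much lower AIC given the overdispersion:
AIC(poisson_model, nb_model) # lower AIC indicates the better-fitting model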
Important Coefficients
nb_coef <- broom::tidy(nb_model) %>%
dplyr::filter(term != "(Intercept)") %>%
dplyr::mutate(IRR = exp(estimate)) %>%
dplyr::arrange(p.value)
dplyr::slice_head(nb_coef, n = 15)
A Diagnosis-Related Group (DRG) is a Medicare inpatient classification that groups similar hospital stays into the same "type of case" based on diagnosis and treatment. The nb_coef code takes the negative binomial model results, removes the intercept, converts each coefficient into an incidence rate ratio (IRR) via exp(estimate), sorts predictors by p-value, and displays the 15 predictors with the strongest evidence of association with total discharges.
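For reporting, approximate 95% confidence intervals on the IRRs can be attached with a normal approximation on the log scale, exp(estimate ± 1.96 × std.error). This is a sketch (output not shown) and an approximation, not the profile-likelihood intervals confint() would give:
# Approximate 95% CIs for the IRRs (normal approximation on the log scale)
nb_irr <- broom::tidy(nb_model) %>%
  dplyr::filter(term != "(Intercept)") %>%
  dplyr::mutate(
    IRR = exp(estimate),
    IRR_low = exp(estimate - 1.96 * std.error),
    IRR_high = exp(estimate + 1.96 * std.error)
  ) %>%
  dplyr::arrange(p.value)
knitr::kable(head(nb_irr, 15), digits = 3,
             caption = "Top 15 predictors by p-value, with IRRs and approximate 95% CIs")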
In this project, we used the CMS inpatient dataset (146,427 provider–DRG records) to study utilization, measured as total discharges (Tot_Dschrgs). The discharge counts are very right-skewed (most hospital–DRG combinations have relatively low discharges, with a smaller number having very high volumes). A quick Poisson overdispersion check returned a dispersion statistic around 19, which is far above 1 and tells us the data have much more variability than a basic Poisson model assumes.
To model discharges, we compared Poisson regression, negative binomial regression, and a log-linear OLS approach using DRG, the provider-address factor, RUCA, and log Medicare payment as predictors. On the 80/20 test split, Poisson had the lowest RMSE (best squared-error fit) and log-linear OLS had the lowest MAE (best average absolute error), with the negative binomial close behind on both. Even so, because the Poisson model clearly violates its equal mean-variance assumption (heavy overdispersion), the negative binomial model is the most appropriate for inference. Its results suggest that service type (DRG) and hospital location (the address-level factor plus RUCA) explain a large share of utilization differences, and that higher Medicare payments are associated with higher expected discharges even after controlling for DRG and location.