This project investigates variation in Medicare inpatient utilization across U.S. hospitals by focusing on service volume (total discharges) rather than dollar amounts, which can be influenced by hospital-specific charge-setting and pricing conventions. Using the CMS Medicare Inpatient Hospitals – by Provider and Service dataset (provider–DRG level records), we explore how utilization differs by clinical service (DRG), geography, and hospital context (e.g., urban–rural classification). Our methodology includes exploratory analysis of discharge distributions and variability across categories, followed by regression-based predictive modeling of discharge counts. Because utilization is a count outcome, we fit and compare generalized linear models designed for counts (Poisson and/or negative binomial), and evaluate model fit and predictive performance using a train/test split and metrics appropriate for count prediction. We also examine whether model residuals reveal hospitals or regions with systematically higher or lower utilization than expected after accounting for service mix and location. The primary outcomes of this study are (1) an interpretable, risk-adjusted “expected discharges” model, (2) a ranked set of predictors associated with higher utilization, and (3) visual summaries, including a U.S. map of utilization patterns or model residuals by state/region, to communicate geographic differences. These results provide a data-driven starting point for discussing where utilization differs most and where further clinical or operational investigation may be warranted.
The story we focus on is inpatient utilization: we model Total Discharges (Tot_Dschrgs) across U.S. hospitals and diagnosis categories. The goal is to understand how discharge volumes vary by service type (DRG) and by hospital location (including rural–urban setting), and whether reimbursement levels are related to higher utilization. Because discharge counts are non-negative integers and vary widely across provider–DRG combinations, we compare several standard count-regression approaches and summarize which model best captures the observed utilization patterns.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tibble)
library(knitr)
library(forcats)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(httr2)
library(readr)
base_url <- "https://data.cms.gov/data-api/v1/dataset/690ddc6c-2767-4618-b277-420ffb2bf27c/data"
fetch_cms_page <- function(offset = 0, size = 5000) {
req <- request(base_url) |>
req_url_query(offset = offset, size = size) |>
req_timeout(60) |>
req_retry(max_tries = 3)
resp <- req_perform(req)
# API returns an array of rows (JSON objects)
dat <- resp_body_json(resp, simplifyVector = TRUE)
as_tibble(dat)
}
fetch_cms_all <- function(size = 5000) {
out <- list()
offset <- 0
i <- 1
repeat {
message("Downloading: offset = ", offset)
page <- fetch_cms_page(offset = offset, size = size)
if (nrow(page) == 0) break
out[[i]] <- page
# last page when returned rows < requested size
if (nrow(page) < size) break
offset <- offset + size
i <- i + 1
}
bind_rows(out)
}
cms_inpatient <- fetch_cms_all(size = 5000)
## Downloading: offset = 0
## Downloading: offset = 5000
## Downloading: offset = 10000
## Downloading: offset = 15000
## Downloading: offset = 20000
## Downloading: offset = 25000
## Downloading: offset = 30000
## Downloading: offset = 35000
## Downloading: offset = 40000
## Downloading: offset = 45000
## Downloading: offset = 50000
## Downloading: offset = 55000
## Downloading: offset = 60000
## Downloading: offset = 65000
## Downloading: offset = 70000
## Downloading: offset = 75000
## Downloading: offset = 80000
## Downloading: offset = 85000
## Downloading: offset = 90000
## Downloading: offset = 95000
## Downloading: offset = 1e+05
## Downloading: offset = 105000
## Downloading: offset = 110000
## Downloading: offset = 115000
## Downloading: offset = 120000
## Downloading: offset = 125000
## Downloading: offset = 130000
## Downloading: offset = 135000
## Downloading: offset = 140000
## Downloading: offset = 145000
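Re-downloading roughly 146,000 rows on every knit is slow. A minimal caching sketch is shown below as an alternative to the direct fetch_cms_all() call above; the file name cms_inpatient_cache.csv is our own choice, not part of the CMS API, and this is not how the run above was produced.
# Optional alternative: cache the download locally so later knits skip the API.
# Reading everything back as character keeps the parse_number() step below valid.
cache_file <- "cms_inpatient_cache.csv"
if (file.exists(cache_file)) {
  cms_inpatient <- read_csv(cache_file, col_types = cols(.default = "c"))
} else {
  cms_inpatient <- fetch_cms_all(size = 5000)
  write_csv(cms_inpatient, cache_file)
}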
# Convert numeric-looking columns safely
cms_inpatient <- cms_inpatient |>
mutate(
Tot_Dschrgs = parse_number(Tot_Dschrgs),
Avg_Submtd_Cvrd_Chrg = parse_number(Avg_Submtd_Cvrd_Chrg),
Avg_Tot_Pymt_Amt = parse_number(Avg_Tot_Pymt_Amt),
Avg_Mdcr_Pymt_Amt = parse_number(Avg_Mdcr_Pymt_Amt)
)
glimpse(cms_inpatient)
## Rows: 146,427
## Columns: 15
## $ Rndrng_Prvdr_CCN <chr> "010001", "010001", "010001", "010001", "010…
## $ Rndrng_Prvdr_Org_Name <chr> "Southeast Health Medical Center", "Southeas…
## $ Rndrng_Prvdr_City <chr> "Dothan", "Dothan", "Dothan", "Dothan", "Dot…
## $ Rndrng_Prvdr_St <chr> "1108 Ross Clark Circle", "1108 Ross Clark C…
## $ Rndrng_Prvdr_State_FIPS <chr> "01", "01", "01", "01", "01", "01", "01", "0…
## $ Rndrng_Prvdr_Zip5 <chr> "36301", "36301", "36301", "36301", "36301",…
## $ Rndrng_Prvdr_State_Abrvtn <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "A…
## $ Rndrng_Prvdr_RUCA <chr> "2", "2", "2", "2", "2", "2", "2", "2", "2",…
## $ Rndrng_Prvdr_RUCA_Desc <chr> "Metropolitan area high commuting: primary f…
## $ DRG_Cd <chr> "003", "023", "024", "025", "038", "039", "0…
## $ DRG_Desc <chr> "ECMO OR TRACHEOSTOMY WITH MV >96 HOURS OR P…
## $ Tot_Dschrgs <dbl> 14, 26, 12, 16, 11, 36, 15, 62, 59, 34, 12, …
## $ Avg_Submtd_Cvrd_Chrg <dbl> 663764.36, 180980.88, 105824.33, 242539.50, …
## $ Avg_Tot_Pymt_Amt <dbl> 120219.929, 37321.038, 26936.667, 34745.375,…
## $ Avg_Mdcr_Pymt_Amt <dbl> 115544.143, 35261.808, 25048.917, 32438.625,…
Last-minute safety checks
# 1) Missingness check
colSums(is.na(cms_inpatient))
## Rndrng_Prvdr_CCN Rndrng_Prvdr_Org_Name Rndrng_Prvdr_City
## 0 0 0
## Rndrng_Prvdr_St Rndrng_Prvdr_State_FIPS Rndrng_Prvdr_Zip5
## 0 0 0
## Rndrng_Prvdr_State_Abrvtn Rndrng_Prvdr_RUCA Rndrng_Prvdr_RUCA_Desc
## 0 0 0
## DRG_Cd DRG_Desc Tot_Dschrgs
## 0 0 0
## Avg_Submtd_Cvrd_Chrg Avg_Tot_Pymt_Amt Avg_Mdcr_Pymt_Amt
## 0 0 0
# 2) Sanity check: are the numeric columns non-negative?
summary(dplyr::select(
cms_inpatient,
Tot_Dschrgs, Avg_Submtd_Cvrd_Chrg, Avg_Tot_Pymt_Amt, Avg_Mdcr_Pymt_Amt
))
## Tot_Dschrgs Avg_Submtd_Cvrd_Chrg Avg_Tot_Pymt_Amt Avg_Mdcr_Pymt_Amt
## Min. : 11.00 Min. : 3368 Min. : 1938 Min. : 180.3
## 1st Qu.: 14.00 1st Qu.: 35263 1st Qu.: 8686 1st Qu.: 6794.3
## Median : 20.00 Median : 58669 Median : 12831 Median : 10548.2
## Mean : 33.88 Mean : 90794 Mean : 18513 Mean : 15331.4
## 3rd Qu.: 35.00 3rd Qu.: 104217 3rd Qu.: 20403 3rd Qu.: 16815.0
## Max. :3210.00 Max. :10418933 Max. :761739 Max. :751479.0
We use CMS inpatient data at the provider–DRG level. The main outcome is Tot_Dschrgs (total discharges), and the key predictors are DRG_Cd (Diagnosis-Related Group code), Rndrng_Prvdr_St, and Rndrng_Prvdr_RUCA (rural–urban commuting area code). Note that in this dataset Rndrng_Prvdr_St is the provider's street address, not the state; the state abbreviation is stored in Rndrng_Prvdr_State_Abrvtn (see the glimpse above). After the rare-level lumping applied later, Rndrng_Prvdr_St therefore acts as an indicator for the 60 most frequent hospital addresses plus an "Other" bucket, i.e., a hospital-level location factor rather than a state effect. To capture how payment levels relate to utilization while reducing skew, we also include the log of the average Medicare payment: log_medicare_payment = log(Avg_Mdcr_Pymt_Amt). We apply log transforms to the dollar variables (and examine the log-discharge distribution for diagnostics) because these variables are highly right-skewed with large outliers; logging makes relationships more nearly linear and stabilizes variance, which improves interpretability and model stability, especially when comparing Poisson, negative binomial, and a log-linear OLS baseline.
Because of the very large maximum values seen in the summaries above, plots and models behave better when the dollar variables are log-transformed:
cms_inpatient2 <- cms_inpatient %>%
mutate(
log_charge = log(Avg_Submtd_Cvrd_Chrg),
log_total_payment = log(Avg_Tot_Pymt_Amt),
log_medicare_payment = log(Avg_Mdcr_Pymt_Amt)
)
We start by exploring the distribution of the total number of discharges (Tot_Dschrgs). The summary statistics computed above show that the minimum number of discharges for any provider–DRG combination is 11 (consistent with CMS suppressing provider–DRG cells with fewer than 11 discharges) and the maximum is over 3,200. We create histograms, plus a boxplot sketched after them, to visualize the distribution of discharges on the original scale and after a logarithmic transformation.
library(ggplot2)
## Distribution of discharges
ggplot(cms_inpatient2, aes(x = Tot_Dschrgs)) +
geom_histogram(bins = 50) +
labs(title = "Distribution of Total Discharges", x = "Total Discharges", y = "Count")
## Distribution of log discharges
cms_inpatient2$log_discharges <- log(cms_inpatient2$Tot_Dschrgs)
ggplot(cms_inpatient2, aes(x = log_discharges)) +
geom_histogram(bins = 50) +
labs(title = "Distribution of Log Total Discharges", x = "Log(Total Discharges)", y = "Count")
Next, we examine average discharges by diagnosis-related group (DRG) and by provider location to get a sense of variation across services and geography.
## Average discharges by DRG (top 10)
drg_summary <- cms_inpatient2 %>%
group_by(DRG_Cd) %>%
summarise(mean_discharges = mean(Tot_Dschrgs),
count = n()) %>%
arrange(desc(mean_discharges)) %>%
head(10)
knitr::kable(drg_summary, digits = 2, caption = "Top 10 DRGs by Average Discharges")
| DRG_Cd | mean_discharges | count |
|---|---|---|
| 871 | 209.78 | 2678 |
| 291 | 121.42 | 2633 |
| 274 | 69.96 | 757 |
| 885 | 69.47 | 571 |
| 177 | 67.42 | 2472 |
| 895 | 58.80 | 61 |
| 267 | 58.34 | 672 |
| 470 | 54.87 | 1311 |
| 193 | 52.15 | 2412 |
| 946 | 49.00 | 6 |
## Average discharges by provider street address (top 10)
## Note: Rndrng_Prvdr_St is the street address, so this ranks individual
## hospital locations; a state-level version using Rndrng_Prvdr_State_Abrvtn
## is sketched after this table.
address_summary <- cms_inpatient2 %>%
  group_by(Rndrng_Prvdr_St) %>%
  summarise(mean_discharges = mean(Tot_Dschrgs),
            count = n()) %>%
  arrange(desc(mean_discharges)) %>%
  head(10)
knitr::kable(address_summary, digits = 2, caption = "Top 10 Provider Addresses by Average Discharges")
| Rndrng_Prvdr_St | mean_discharges | count |
|---|---|---|
| 107 Lincoln Street | 353.25 | 4 |
| 3015 Veterans Parkway | 259.00 | 3 |
| 100 Trich Drive | 220.00 | 1 |
| 2520 Troy Drive | 206.00 | 2 |
| 3651 College Blvd | 187.00 | 3 |
| 401 South Santa Fe Avenue | 183.50 | 2 |
| 1711 West Temple Street | 145.05 | 19 |
| 23901 Lahser | 145.00 | 2 |
| 5025 N Paulina Street | 123.00 | 2 |
| 3029 West Main Street | 119.00 | 3 |
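A true state-level comparison would group by the state abbreviation column instead. A minimal sketch (its output is not reproduced here):
## Average discharges by state (sketch using the actual state column)
state_summary <- cms_inpatient2 %>%
  group_by(Rndrng_Prvdr_State_Abrvtn) %>%
  summarise(mean_discharges = mean(Tot_Dschrgs),
            count = n()) %>%
  arrange(desc(mean_discharges)) %>%
  head(10)
knitr::kable(state_summary, digits = 2, caption = "Top 10 States by Average Discharges")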
To model utilization (total discharges) we consider three types of models: (1) a Poisson GLM with a log link, (2) a negative binomial GLM, which relaxes the Poisson equal mean-variance assumption, and (3) a log-linear OLS baseline fit to log(Tot_Dschrgs).
We split the data into a training and a testing set (80/20 split) and fit each model on the training data. Predictor variables include the diagnosis-related group (DRG_Cd), the provider street-address factor (Rndrng_Prvdr_St, which after lumping acts as a hospital-level location indicator), the rural–urban commuting area code (Rndrng_Prvdr_RUCA), and the log of average Medicare payment (log_medicare_payment) to capture any relationship between payment and utilization. Because the categorical predictors have many levels, the resulting models include many dummy variables; fitting them on the full data may take some time.
set.seed(123)
DEBUG_SAMPLE_N <- NA # set to a row count for faster iteration while editing; NA uses the full data for the final run
TRAIN_PROP <- 0.80
TOP_DRG <- 50
TOP_ST <- 60
TOP_RUCA <- 20
required_cols <- c("Tot_Dschrgs", "DRG_Cd", "Rndrng_Prvdr_St", "Rndrng_Prvdr_RUCA", "log_medicare_payment")
missing <- setdiff(required_cols, names(cms_inpatient2))
if (length(missing) > 0) stop("Missing columns in cms_inpatient2: ", paste(missing, collapse = ", "))
# 1) Build modeling data
model_data <- cms_inpatient2 %>%
dplyr::select(dplyr::all_of(required_cols)) %>%
mutate(
DRG_Cd = factor(DRG_Cd),
Rndrng_Prvdr_St = factor(Rndrng_Prvdr_St),
Rndrng_Prvdr_RUCA = factor(Rndrng_Prvdr_RUCA)
) %>%
filter(Tot_Dschrgs > 0, is.finite(log_medicare_payment))
# Optional sample for speed (sample BEFORE lumping is fine)
if (!is.na(DEBUG_SAMPLE_N) && DEBUG_SAMPLE_N < nrow(model_data)) {
set.seed(123)
model_data <- dplyr::slice_sample(model_data, n = DEBUG_SAMPLE_N)
}
# Lump rare categories into "Other" (on the sampled data)
model_data <- model_data %>%
mutate(
DRG_Cd = fct_lump_n(DRG_Cd, n = TOP_DRG, other_level = "Other"),
Rndrng_Prvdr_St = fct_lump_n(Rndrng_Prvdr_St, n = TOP_ST, other_level = "Other"),
Rndrng_Prvdr_RUCA = fct_lump_n(Rndrng_Prvdr_RUCA, n = TOP_RUCA, other_level = "Other")
)
cat("Rows used:", nrow(model_data), "\n")
## Rows used: 146427
cat("Levels (DRG, State, RUCA):\n")
## Levels (DRG, State, RUCA):
print(c(
DRG = nlevels(model_data$DRG_Cd),
State = nlevels(model_data$Rndrng_Prvdr_St),
RUCA = nlevels(model_data$Rndrng_Prvdr_RUCA)
))
## DRG State RUCA
## 51 61 20
# 2) Split that GUARANTEES every factor level appears in training
ensure_levels_in_train <- function(df, vars) {
idx <- integer(0)
for (v in vars) {
lvls <- levels(df[[v]])
for (lv in lvls) {
rows <- which(df[[v]] == lv)
if (length(rows) > 0) idx <- c(idx, sample(rows, 1))
}
}
unique(idx)
}
set.seed(123)
n <- nrow(model_data)
target_train_n <- floor(TRAIN_PROP * n)
seed_idx <- ensure_levels_in_train(model_data, c("DRG_Cd","Rndrng_Prvdr_St","Rndrng_Prvdr_RUCA"))
remaining <- setdiff(seq_len(n), seed_idx)
need <- max(0, target_train_n - length(seed_idx))
train_id <- c(seed_idx, sample(remaining, need))
test_id <- setdiff(seq_len(n), train_id)
train_data <- model_data[train_id, ]
test_data <- model_data[test_id, ]
# 3) Extra safety: force test levels to match training levels exactly
for (v in c("DRG_Cd","Rndrng_Prvdr_St","Rndrng_Prvdr_RUCA")) {
train_data[[v]] <- forcats::fct_expand(train_data[[v]], "Other")
test_chr <- as.character(test_data[[v]])
test_chr[is.na(test_chr)] <- "Other"
test_chr[!test_chr %in% levels(train_data[[v]])] <- "Other"
test_data[[v]] <- factor(test_chr, levels = levels(train_data[[v]]))
}
# Quick diagnostic: should be 0 unseen levels now
unseen_counts <- sapply(c("DRG_Cd","Rndrng_Prvdr_St","Rndrng_Prvdr_RUCA"), function(v) {
sum(!as.character(test_data[[v]]) %in% levels(train_data[[v]]))
})
cat("Unseen levels in test (should be 0):\n")
## Unseen levels in test (should be 0):
print(unseen_counts)
## DRG_Cd Rndrng_Prvdr_St Rndrng_Prvdr_RUCA
## 0 0 0
# 4) Fit models
cat("Fitting Poisson...\n")
## Fitting Poisson...
t_poisson <- system.time({
poisson_model <- glm(
Tot_Dschrgs ~ DRG_Cd + Rndrng_Prvdr_St + Rndrng_Prvdr_RUCA + log_medicare_payment,
family = poisson(link = "log"),
data = train_data,
control = glm.control(maxit = 25)
)
})
print(t_poisson)
## user system elapsed
## 10.66 0.15 10.82
cat("Fitting Negative Binomial...\n")
## Fitting Negative Binomial...
t_nb <- system.time({
nb_model <- MASS::glm.nb(
Tot_Dschrgs ~ DRG_Cd + Rndrng_Prvdr_St + Rndrng_Prvdr_RUCA + log_medicare_payment,
data = train_data,
control = glm.control(maxit = 25),
maxit = 25,
trace = FALSE
)
})
print(t_nb)
## user system elapsed
## 34.12 0.79 34.92
cat("Fitting log-linear OLS...\n")
## Fitting log-linear OLS...
t_lm <- system.time({
lm_model <- lm(
log(Tot_Dschrgs) ~ DRG_Cd + Rndrng_Prvdr_St + Rndrng_Prvdr_RUCA + log_medicare_payment,
data = train_data
)
})
print(t_lm)
## user system elapsed
## 1.74 0.07 1.78
# 5) Predict on the held-out test set
test_data$pred_poisson <- predict(poisson_model, newdata = test_data, type = "response")
test_data$pred_nb <- predict(nb_model, newdata = test_data, type = "response")
test_data$pred_lm <- exp(predict(lm_model, newdata = test_data))
# 6) Evaluate
rmse <- function(a, p) sqrt(mean((a - p)^2))
mae <- function(a, p) mean(abs(a - p))
model_results <- tibble(
Model = c("Poisson", "Negative Binomial", "Log-linear OLS"),
RMSE = c(
rmse(test_data$Tot_Dschrgs, test_data$pred_poisson),
rmse(test_data$Tot_Dschrgs, test_data$pred_nb),
rmse(test_data$Tot_Dschrgs, test_data$pred_lm)
),
MAE = c(
mae(test_data$Tot_Dschrgs, test_data$pred_poisson),
mae(test_data$Tot_Dschrgs, test_data$pred_nb),
mae(test_data$Tot_Dschrgs, test_data$pred_lm)
)
)
knitr::kable(model_results, digits = 2,
caption = "Model performance on test data (Utilization/Discharges)")
| Model | RMSE | MAE |
|---|---|---|
| Poisson | 37.20 | 16.62 |
| Negative Binomial | 38.09 | 16.73 |
| Log-linear OLS | 40.90 | 15.98 |
The table compares how well each model predicts Tot_Dschrgs (discharges/utilization) on the held-out test data using RMSE and MAE (both are average prediction-error measures; lower is better).
Here, the Poisson model has the lowest RMSE and the log-linear OLS model has the lowest MAE, with the negative binomial close behind on both: Poisson does slightly better on typical squared-error performance, while OLS does slightly better on average absolute error.
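RMSE and MAE weight errors on small and large counts the same way; a count-specific criterion such as the mean Poisson deviance judges errors relative to the predicted rate. A quick sketch (not part of the original evaluation) using the predictions already stored in test_data:
# Mean Poisson deviance on the test set (sketch; lower is better).
# Safe here because Tot_Dschrgs >= 11, so log(y) is always defined.
poisson_deviance <- function(y, mu) 2 * mean(y * log(y / mu) - (y - mu))
c(
  Poisson = poisson_deviance(test_data$Tot_Dschrgs, test_data$pred_poisson),
  NegBin = poisson_deviance(test_data$Tot_Dschrgs, test_data$pred_nb),
  OLS = poisson_deviance(test_data$Tot_Dschrgs, test_data$pred_lm)
)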
Next, we interpret coefficients from the count model to identify which DRGs and location factors are most strongly associated with discharge volume. Because utilization is a count outcome and the data are overdispersed, we prioritize the negative binomial model for interpretation, focusing on the largest effects and aggregate patterns across the DRG, provider-address, and RUCA categories. This section carries the big-picture findings for the final presentation.
Model Summary
summary(nb_model)
##
## Call:
## MASS::glm.nb(formula = Tot_Dschrgs ~ DRG_Cd + Rndrng_Prvdr_St +
## Rndrng_Prvdr_RUCA + log_medicare_payment, data = train_data,
## control = glm.control(maxit = 25), maxit = 25, trace = FALSE,
## init.theta = 3.451298667, link = log)
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 1.3982939 0.2967331
## DRG_Cd065 0.0618367 0.0221746
## DRG_Cd175 -0.6403591 0.0266323
## DRG_Cd177 0.5235724 0.0205199
## DRG_Cd178 -0.3172377 0.0245420
## DRG_Cd189 0.1230315 0.0211711
## DRG_Cd190 -0.1234603 0.0224724
## DRG_Cd193 0.3538645 0.0207232
## DRG_Cd194 -0.3027007 0.0237367
## DRG_Cd208 -0.6640045 0.0257403
## DRG_Cd246 -0.5070509 0.0257304
## DRG_Cd247 -0.2725159 0.0240823
## DRG_Cd280 0.1105178 0.0217205
## DRG_Cd281 -0.4226186 0.0244798
## DRG_Cd286 -0.3124966 0.0250185
## DRG_Cd287 -0.3345458 0.0255680
## DRG_Cd291 1.1607707 0.0202071
## DRG_Cd308 -0.3036453 0.0230030
## DRG_Cd309 -0.0112784 0.0226609
## DRG_Cd310 -0.3335787 0.0261500
## DRG_Cd312 -0.1210043 0.0233056
## DRG_Cd314 -0.5462924 0.0257237
## DRG_Cd329 -0.6628565 0.0262714
## DRG_Cd330 -0.3615328 0.0252503
## DRG_Cd377 -0.2219765 0.0227571
## DRG_Cd378 0.1324901 0.0220048
## DRG_Cd389 -0.4844575 0.0241064
## DRG_Cd391 -0.4963816 0.0248925
## DRG_Cd392 0.1749381 0.0217509
## DRG_Cd394 -0.5197554 0.0261944
## DRG_Cd470 0.2853675 0.0236111
## DRG_Cd480 -0.6758924 0.0259080
## DRG_Cd481 -0.1480020 0.0221418
## DRG_Cd522 -0.4452626 0.0233672
## DRG_Cd552 -0.1733880 0.0245414
## DRG_Cd603 -0.1459281 0.0224846
## DRG_Cd637 -0.7204204 0.0255349
## DRG_Cd638 -0.5963475 0.0251423
## DRG_Cd640 -0.0643255 0.0218544
## DRG_Cd641 -0.1258449 0.0219637
## DRG_Cd682 -0.0096650 0.0215925
## DRG_Cd683 0.0459899 0.0215453
## DRG_Cd689 0.0482916 0.0218094
## DRG_Cd690 0.1248251 0.0214500
## DRG_Cd698 -0.0210627 0.0221101
## DRG_Cd699 -0.3752163 0.0259514
## DRG_Cd812 -0.4705738 0.0253299
## DRG_Cd853 -0.0536401 0.0222078
## DRG_Cd871 1.6638268 0.0199717
## DRG_Cd872 0.2505532 0.0210488
## DRG_CdOther -0.7314601 0.0164068
## Rndrng_Prvdr_St100 Madison Ave 0.2625692 0.0517168
## Rndrng_Prvdr_St1000 Blythe Blvd -0.0758730 0.0519588
## Rndrng_Prvdr_St101 Sivley Rd 0.1679558 0.0534270
## Rndrng_Prvdr_St111 South 11th Street -0.0327783 0.0536373
## Rndrng_Prvdr_St1200 South Cedar Crest Boulevard 0.4356613 0.0491926
## Rndrng_Prvdr_St1211 Medical Center Drive 0.1027909 0.0512381
## Rndrng_Prvdr_St1216 Second Street Southwest 0.3374302 0.0471819
## Rndrng_Prvdr_St1500 E Medical Center Drive, Spc 5474 -0.0449624 0.0553600
## Rndrng_Prvdr_St1600 South 48th St 0.0595970 0.0538970
## Rndrng_Prvdr_St1600 Sw Archer Rd 0.0063067 0.0539968
## Rndrng_Prvdr_St169 Ashley Ave 0.0107445 0.0535033
## Rndrng_Prvdr_St1700 S Tamiami Trl 0.4913209 0.0487173
## Rndrng_Prvdr_St1945 State Route 33 0.1062317 0.0537457
## Rndrng_Prvdr_St20 York St 0.1209532 0.0509467
## Rndrng_Prvdr_St200 East Chestnut Street 0.2344841 0.0517249
## Rndrng_Prvdr_St200 Lothrop Street -0.1046249 0.0544556
## Rndrng_Prvdr_St2100 Erwin Rd 0.0003371 0.0531830
## Rndrng_Prvdr_St2100 Stantonsburg Rd 0.0850446 0.0533941
## Rndrng_Prvdr_St2131 S 17th St Box 9000 0.1731225 0.0524833
## Rndrng_Prvdr_St251 E Huron St 0.1586621 0.0524620
## Rndrng_Prvdr_St2650 Ridge Ave 0.3315758 0.0515827
## Rndrng_Prvdr_St270 - 05 76th Avenue 0.1830098 0.0541577
## Rndrng_Prvdr_St300 Community Drive 0.2669454 0.0502584
## Rndrng_Prvdr_St300 Pasteur Drive 0.0866047 0.0502154
## Rndrng_Prvdr_St330 Brookline Avenue -0.0222496 0.0509393
## Rndrng_Prvdr_St3300 Gallows Road 0.2316641 0.0536895
## Rndrng_Prvdr_St34th & Spruce Sts 0.0133881 0.0531559
## Rndrng_Prvdr_St350 7th St N 0.2010806 0.0536927
## Rndrng_Prvdr_St3535 Olentangy River Rd 0.1449188 0.0530972
## Rndrng_Prvdr_St3601 W Thirteen Mile Rd 0.1495785 0.0519880
## Rndrng_Prvdr_St4000 Cambridge Street 0.1116961 0.0513959
## Rndrng_Prvdr_St4500 San Pablo Rd 0.0636239 0.0541604
## Rndrng_Prvdr_St4755 Ogletown-Stanton Road 0.4634774 0.0483983
## Rndrng_Prvdr_St505 Parnassus Ave, Box 0296 -0.0236681 0.0536772
## Rndrng_Prvdr_St509 Biltmore Ave 0.1488837 0.0527218
## Rndrng_Prvdr_St52 W Underwood St 0.2802639 0.0504253
## Rndrng_Prvdr_St525 East 68th Street 0.6276245 0.0440462
## Rndrng_Prvdr_St55 Fogg Road 0.3029302 0.0530472
## Rndrng_Prvdr_St55 Fruit Street 0.2638586 0.0492033
## Rndrng_Prvdr_St55 Lake Avenue North 0.0644356 0.0524918
## Rndrng_Prvdr_St550 First Avenue 0.6194161 0.0458981
## Rndrng_Prvdr_St5777 East Mayo Boulevard 0.1525402 0.0535845
## Rndrng_Prvdr_St600 Highland Avenue -0.0364390 0.0534577
## Rndrng_Prvdr_St600 Mary St 0.2451897 0.0531177
## Rndrng_Prvdr_St600 North Wolfe Street 0.0256733 0.0519197
## Rndrng_Prvdr_St601 E Rollins St 0.7820041 0.0448559
## Rndrng_Prvdr_St6019 Walnut Grove Road 0.1909223 0.0518544
## Rndrng_Prvdr_St619 South 19th Street 0.0187043 0.0518181
## Rndrng_Prvdr_St6565 Fannin 0.1643211 0.0514121
## Rndrng_Prvdr_St75 Francis Street 0.2361540 0.0497020
## Rndrng_Prvdr_St759 Chestnut Street 0.1606021 0.0530796
## Rndrng_Prvdr_St7700 Floyd Curl Dr 0.4491292 0.0479169
## Rndrng_Prvdr_St800 Prudential Dr 0.3187528 0.0507135
## Rndrng_Prvdr_St8700 Beverly Blvd 0.3832132 0.0496082
## Rndrng_Prvdr_St9500 Euclid Avenue 0.1922532 0.0518044
## Rndrng_Prvdr_StHealth Sciences Center Suny 0.2472501 0.0494797
## Rndrng_Prvdr_StOne Barnes-Jewish Hospital Plaza 0.0832560 0.0494219
## Rndrng_Prvdr_StOne Gustave L Levy Place 0.2065810 0.0522729
## Rndrng_Prvdr_StOne Wyoming Street 0.0596072 0.0534907
## Rndrng_Prvdr_StOther -0.2976400 0.0314470
## Rndrng_Prvdr_RUCA1 1.6191396 0.2930673
## Rndrng_Prvdr_RUCA1.1 1.5340547 0.2933525
## Rndrng_Prvdr_RUCA10 1.0861189 0.2945199
## Rndrng_Prvdr_RUCA10.1 1.7357404 0.3017321
## Rndrng_Prvdr_RUCA10.3 0.7710579 0.5107055
## Rndrng_Prvdr_RUCA2 1.4729047 0.2933679
## Rndrng_Prvdr_RUCA2.1 0.8742745 0.3199323
## Rndrng_Prvdr_RUCA3 0.7835080 0.3203371
## Rndrng_Prvdr_RUCA4 1.2259406 0.2931166
## Rndrng_Prvdr_RUCA4.1 1.1412413 0.2942213
## Rndrng_Prvdr_RUCA5 1.5028663 0.2942437
## Rndrng_Prvdr_RUCA6 0.2138522 0.3649867
## Rndrng_Prvdr_RUCA7 0.8234152 0.2935001
## Rndrng_Prvdr_RUCA7.1 0.7206591 0.3040949
## Rndrng_Prvdr_RUCA7.2 1.3246520 0.3018509
## Rndrng_Prvdr_RUCA8 0.6613702 0.3045020
## Rndrng_Prvdr_RUCA8.2 0.7831840 0.5256310
## Rndrng_Prvdr_RUCA9 0.5329886 0.3612613
## Rndrng_Prvdr_RUCA99 1.6603314 0.2942745
## log_medicare_payment 0.1048660 0.0030645
## z value Pr(>|z|)
## (Intercept) 4.712 2.45e-06 ***
## DRG_Cd065 2.789 0.005293 **
## DRG_Cd175 -24.044 < 2e-16 ***
## DRG_Cd177 25.515 < 2e-16 ***
## DRG_Cd178 -12.926 < 2e-16 ***
## DRG_Cd189 5.811 6.20e-09 ***
## DRG_Cd190 -5.494 3.93e-08 ***
## DRG_Cd193 17.076 < 2e-16 ***
## DRG_Cd194 -12.752 < 2e-16 ***
## DRG_Cd208 -25.796 < 2e-16 ***
## DRG_Cd246 -19.706 < 2e-16 ***
## DRG_Cd247 -11.316 < 2e-16 ***
## DRG_Cd280 5.088 3.62e-07 ***
## DRG_Cd281 -17.264 < 2e-16 ***
## DRG_Cd286 -12.491 < 2e-16 ***
## DRG_Cd287 -13.085 < 2e-16 ***
## DRG_Cd291 57.444 < 2e-16 ***
## DRG_Cd308 -13.200 < 2e-16 ***
## DRG_Cd309 -0.498 0.618694
## DRG_Cd310 -12.756 < 2e-16 ***
## DRG_Cd312 -5.192 2.08e-07 ***
## DRG_Cd314 -21.237 < 2e-16 ***
## DRG_Cd329 -25.231 < 2e-16 ***
## DRG_Cd330 -14.318 < 2e-16 ***
## DRG_Cd377 -9.754 < 2e-16 ***
## DRG_Cd378 6.021 1.73e-09 ***
## DRG_Cd389 -20.097 < 2e-16 ***
## DRG_Cd391 -19.941 < 2e-16 ***
## DRG_Cd392 8.043 8.78e-16 ***
## DRG_Cd394 -19.842 < 2e-16 ***
## DRG_Cd470 12.086 < 2e-16 ***
## DRG_Cd480 -26.088 < 2e-16 ***
## DRG_Cd481 -6.684 2.32e-11 ***
## DRG_Cd522 -19.055 < 2e-16 ***
## DRG_Cd552 -7.065 1.60e-12 ***
## DRG_Cd603 -6.490 8.58e-11 ***
## DRG_Cd637 -28.213 < 2e-16 ***
## DRG_Cd638 -23.719 < 2e-16 ***
## DRG_Cd640 -2.943 0.003247 **
## DRG_Cd641 -5.730 1.01e-08 ***
## DRG_Cd682 -0.448 0.654437
## DRG_Cd683 2.135 0.032797 *
## DRG_Cd689 2.214 0.026811 *
## DRG_Cd690 5.819 5.91e-09 ***
## DRG_Cd698 -0.953 0.340778
## DRG_Cd699 -14.458 < 2e-16 ***
## DRG_Cd812 -18.578 < 2e-16 ***
## DRG_Cd853 -2.415 0.015719 *
## DRG_Cd871 83.309 < 2e-16 ***
## DRG_Cd872 11.903 < 2e-16 ***
## DRG_CdOther -44.583 < 2e-16 ***
## Rndrng_Prvdr_St100 Madison Ave 5.077 3.83e-07 ***
## Rndrng_Prvdr_St1000 Blythe Blvd -1.460 0.144221
## Rndrng_Prvdr_St101 Sivley Rd 3.144 0.001669 **
## Rndrng_Prvdr_St111 South 11th Street -0.611 0.541127
## Rndrng_Prvdr_St1200 South Cedar Crest Boulevard 8.856 < 2e-16 ***
## Rndrng_Prvdr_St1211 Medical Center Drive 2.006 0.044841 *
## Rndrng_Prvdr_St1216 Second Street Southwest 7.152 8.57e-13 ***
## Rndrng_Prvdr_St1500 E Medical Center Drive, Spc 5474 -0.812 0.416687
## Rndrng_Prvdr_St1600 South 48th St 1.106 0.268831
## Rndrng_Prvdr_St1600 Sw Archer Rd 0.117 0.907020
## Rndrng_Prvdr_St169 Ashley Ave 0.201 0.840840
## Rndrng_Prvdr_St1700 S Tamiami Trl 10.085 < 2e-16 ***
## Rndrng_Prvdr_St1945 State Route 33 1.977 0.048091 *
## Rndrng_Prvdr_St20 York St 2.374 0.017591 *
## Rndrng_Prvdr_St200 East Chestnut Street 4.533 5.81e-06 ***
## Rndrng_Prvdr_St200 Lothrop Street -1.921 0.054696 .
## Rndrng_Prvdr_St2100 Erwin Rd 0.006 0.994942
## Rndrng_Prvdr_St2100 Stantonsburg Rd 1.593 0.111212
## Rndrng_Prvdr_St2131 S 17th St Box 9000 3.299 0.000972 ***
## Rndrng_Prvdr_St251 E Huron St 3.024 0.002492 **
## Rndrng_Prvdr_St2650 Ridge Ave 6.428 1.29e-10 ***
## Rndrng_Prvdr_St270 - 05 76th Avenue 3.379 0.000727 ***
## Rndrng_Prvdr_St300 Community Drive 5.311 1.09e-07 ***
## Rndrng_Prvdr_St300 Pasteur Drive 1.725 0.084588 .
## Rndrng_Prvdr_St330 Brookline Avenue -0.437 0.662266
## Rndrng_Prvdr_St3300 Gallows Road 4.315 1.60e-05 ***
## Rndrng_Prvdr_St34th & Spruce Sts 0.252 0.801145
## Rndrng_Prvdr_St350 7th St N 3.745 0.000180 ***
## Rndrng_Prvdr_St3535 Olentangy River Rd 2.729 0.006347 **
## Rndrng_Prvdr_St3601 W Thirteen Mile Rd 2.877 0.004013 **
## Rndrng_Prvdr_St4000 Cambridge Street 2.173 0.029762 *
## Rndrng_Prvdr_St4500 San Pablo Rd 1.175 0.240102
## Rndrng_Prvdr_St4755 Ogletown-Stanton Road 9.576 < 2e-16 ***
## Rndrng_Prvdr_St505 Parnassus Ave, Box 0296 -0.441 0.659261
## Rndrng_Prvdr_St509 Biltmore Ave 2.824 0.004744 **
## Rndrng_Prvdr_St52 W Underwood St 5.558 2.73e-08 ***
## Rndrng_Prvdr_St525 East 68th Street 14.249 < 2e-16 ***
## Rndrng_Prvdr_St55 Fogg Road 5.711 1.13e-08 ***
## Rndrng_Prvdr_St55 Fruit Street 5.363 8.20e-08 ***
## Rndrng_Prvdr_St55 Lake Avenue North 1.228 0.219621
## Rndrng_Prvdr_St550 First Avenue 13.495 < 2e-16 ***
## Rndrng_Prvdr_St5777 East Mayo Boulevard 2.847 0.004417 **
## Rndrng_Prvdr_St600 Highland Avenue -0.682 0.495465
## Rndrng_Prvdr_St600 Mary St 4.616 3.91e-06 ***
## Rndrng_Prvdr_St600 North Wolfe Street 0.494 0.620967
## Rndrng_Prvdr_St601 E Rollins St 17.434 < 2e-16 ***
## Rndrng_Prvdr_St6019 Walnut Grove Road 3.682 0.000232 ***
## Rndrng_Prvdr_St619 South 19th Street 0.361 0.718129
## Rndrng_Prvdr_St6565 Fannin 3.196 0.001393 **
## Rndrng_Prvdr_St75 Francis Street 4.751 2.02e-06 ***
## Rndrng_Prvdr_St759 Chestnut Street 3.026 0.002481 **
## Rndrng_Prvdr_St7700 Floyd Curl Dr 9.373 < 2e-16 ***
## Rndrng_Prvdr_St800 Prudential Dr 6.285 3.27e-10 ***
## Rndrng_Prvdr_St8700 Beverly Blvd 7.725 1.12e-14 ***
## Rndrng_Prvdr_St9500 Euclid Avenue 3.711 0.000206 ***
## Rndrng_Prvdr_StHealth Sciences Center Suny 4.997 5.82e-07 ***
## Rndrng_Prvdr_StOne Barnes-Jewish Hospital Plaza 1.685 0.092066 .
## Rndrng_Prvdr_StOne Gustave L Levy Place 3.952 7.75e-05 ***
## Rndrng_Prvdr_StOne Wyoming Street 1.114 0.265131
## Rndrng_Prvdr_StOther -9.465 < 2e-16 ***
## Rndrng_Prvdr_RUCA1 5.525 3.30e-08 ***
## Rndrng_Prvdr_RUCA1.1 5.229 1.70e-07 ***
## Rndrng_Prvdr_RUCA10 3.688 0.000226 ***
## Rndrng_Prvdr_RUCA10.1 5.753 8.79e-09 ***
## Rndrng_Prvdr_RUCA10.3 1.510 0.131097
## Rndrng_Prvdr_RUCA2 5.021 5.15e-07 ***
## Rndrng_Prvdr_RUCA2.1 2.733 0.006282 **
## Rndrng_Prvdr_RUCA3 2.446 0.014450 *
## Rndrng_Prvdr_RUCA4 4.182 2.88e-05 ***
## Rndrng_Prvdr_RUCA4.1 3.879 0.000105 ***
## Rndrng_Prvdr_RUCA5 5.108 3.26e-07 ***
## Rndrng_Prvdr_RUCA6 0.586 0.557931
## Rndrng_Prvdr_RUCA7 2.806 0.005024 **
## Rndrng_Prvdr_RUCA7.1 2.370 0.017795 *
## Rndrng_Prvdr_RUCA7.2 4.388 1.14e-05 ***
## Rndrng_Prvdr_RUCA8 2.172 0.029858 *
## Rndrng_Prvdr_RUCA8.2 1.490 0.136227
## Rndrng_Prvdr_RUCA9 1.475 0.140117
## Rndrng_Prvdr_RUCA99 5.642 1.68e-08 ***
## log_medicare_payment 34.219 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(3.4513) family taken to be 1)
##
## Null deviance: 248009 on 117140 degrees of freedom
## Residual deviance: 118174 on 117010 degrees of freedom
## AIC: 958720
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 3.4513
## Std. Err.: 0.0151
##
## 2 x log-likelihood: -958455.5810
RESULT: The negative binomial model fit successfully, and log_medicare_payment is strongly associated with discharge counts even after controlling for DRG, the hospital-address factor, and RUCA (it is highly statistically significant). The estimated dispersion parameter theta = 3.4513 implies a variance of roughly mu + mu^2/3.45, well above the Poisson assumption that the variance equals the mean, which is exactly why the negative binomial model is more appropriate than plain Poisson.
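To make that concrete, a back-of-the-envelope sketch (not part of the original output) compares the implied variances at the average discharge count in the training data:
# Implied variance at the sample mean under each model (illustrative sketch)
theta_hat <- nb_model$theta # roughly 3.45
mu_bar <- mean(train_data$Tot_Dschrgs)
c(
  poisson_variance = mu_bar, # Poisson: Var = mu
  negbin_variance = mu_bar + mu_bar^2 / theta_hat # NB2: Var = mu + mu^2/theta
)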
The big picture: the negative binomial model shows that inpatient discharges vary widely across the country, and the biggest drivers of discharge counts are the type of case (DRG) and where the hospital sits (the hospital-address factor and the rural–urban category). The overdispersion check supports using a negative binomial model instead of Poisson, because the discharge counts have far more variability than the Poisson assumptions allow. The coefficient table (with IRR values) highlights which DRGs and locations are linked to notably higher or lower discharge volumes, and it also shows that higher average Medicare payment is associated with higher discharge counts even after controlling for case mix and location. In short, hospitals do not all use inpatient care the same way: discharge volumes depend heavily on what kinds of patients they treat and where they operate, and higher-paid services tend to show higher utilization.
Overdispersion check (quick justification for the negative binomial)
poisson_disp <- sum(residuals(poisson_model, type = "pearson")^2) / df.residual(poisson_model)
poisson_disp
## [1] 19.20035
This is further evidence that Poisson is a poor fit and the negative binomial is more appropriate: 19.2 is the Poisson model's Pearson dispersion statistic, and values far above 1 indicate heavy overdispersion.
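A likelihood-based comparison points the same way; a one-line sketch (output not shown), where we would expect the negative binomial to have the much lower AIC given the overdispersion:
AIC(poisson_model, nb_model) # lower AIC indicates the better-fitting model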
Important Coefficients
nb_coef <- broom::tidy(nb_model) %>%
dplyr::filter(term != "(Intercept)") %>%
dplyr::mutate(IRR = exp(estimate)) %>%
dplyr::arrange(p.value)
dplyr::slice_head(nb_coef, n = 15)
A Diagnosis-Related Group (DRG) is a Medicare inpatient classification that groups similar hospital stays into the same "type of case" based on diagnosis and treatment. The nb_coef code takes the negative binomial model results, removes the intercept, converts each coefficient into an incidence rate ratio (IRR) via exp(estimate), sorts predictors by p-value, and displays the 15 predictors with the strongest evidence of association with total discharges.
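For reporting, approximate 95% confidence intervals on the IRRs can be attached with a normal approximation on the log scale, exp(estimate ± 1.96 × std.error). This is a sketch (output not shown) and an approximation, not the profile-likelihood intervals confint() would give:
# Approximate 95% CIs for the IRRs (normal approximation on the log scale)
nb_irr <- broom::tidy(nb_model) %>%
  dplyr::filter(term != "(Intercept)") %>%
  dplyr::mutate(
    IRR = exp(estimate),
    IRR_low = exp(estimate - 1.96 * std.error),
    IRR_high = exp(estimate + 1.96 * std.error)
  ) %>%
  dplyr::arrange(p.value)
knitr::kable(head(nb_irr, 15), digits = 3,
             caption = "Top 15 predictors by p-value, with IRRs and approximate 95% CIs")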
In this project, we used the CMS inpatient dataset (146,427 provider–DRG records) to study utilization, measured as total discharges (Tot_Dschrgs). The discharge counts are very right-skewed (most hospital–DRG combinations have relatively low discharges, with a smaller number having very high volumes). A quick Poisson overdispersion check returned a dispersion statistic around 19, which is far above 1 and tells us the data have much more variability than a basic Poisson model assumes.
To model discharges, we compared Poisson regression, negative binomial regression, and a log-linear OLS approach using DRG, the provider-address factor, RUCA, and log Medicare payment as predictors. On the 80/20 test split, Poisson had the lowest RMSE (best squared-error fit) and log-linear OLS had the lowest MAE (best average absolute error), with the negative binomial close behind on both. Even so, because the Poisson model clearly violates its equal mean-variance assumption (heavy overdispersion), the negative binomial model is the most appropriate for inference. Its results suggest that service type (DRG) and hospital location (the address-level factor plus RUCA) explain a large share of utilization differences, and that higher Medicare payments are associated with higher expected discharges even after controlling for DRG and location.