Hospital Ownership and Its Association with Ratings and Clinical Outcomes

Introduction

The U.S. hospital sector is composed of nonprofit, for-profit, and government institutions, each operating under different financial and organizational incentives. Nonprofit hospitals reinvest surplus revenues into facility improvements, community programs, and patient services, while for-profit entities distribute earnings to shareholders. Government hospitals often focus on safety-net services for vulnerable populations. Given the central role hospitals play—accounting for roughly one-third of national healthcare expenditures—understanding how ownership influences quality is fundamental to shaping payment policies, accreditation standards, and patient decision-making.

Empirical evidence on ownership and outcomes has been mixed. In a landmark systematic review and meta-analysis encompassing over 38 million patients from 15 observational studies, private for-profit hospitals exhibited a small but statistically significant increase in mortality risk compared to private not-for-profit hospitals (Devereaux et al., 2002). By contrast, an analysis of Medicare beneficiaries admitted for common conditions (e.g., hip fracture, stroke, heart failure) found no meaningful difference in cost-adjusted survival or functional status by ownership type (Sloan et al., 2001). These disparate findings may reflect variations in data vintage, risk adjustment, and study scope.

The significance of this work lies in its potential to inform policy and practice at a time of rapid transformation in healthcare delivery. With CMS Hospital Compare data now offering detailed facility-level metrics on readmissions, mortality, complications, and patient experience, there is an opportunity to reassess ownership-quality associations using contemporary, nationally representative data. Such an analysis can guide regulators in refining value-based payment models, help payers target quality incentives more effectively, and enable patients to make informed choices about where to seek care.

Problem Statement and Research Questions:

It is difficult to determine, clearly and in a timely manner, what drives a hospital’s ratings up or down. Pinpointing the specific factors involved is challenging, yet these ratings carry substantial business consequences for a hospital: penalties from the government/CMS as well as its reputation among patients, the public, and medical professionals. This study investigates whether nonprofit hospital ownership is associated with better quality outcomes than for-profit ownership after accounting for key facility characteristics. Specifically, it will address:

  1. Readmission Outcomes: Do excess 30-day readmission ratios for acute myocardial infarction, heart failure, pneumonia, and other conditions differ between nonprofit and for-profit hospitals?

  2. Mortality and Safety Indicators: Are 30-day mortality rates (e.g., MORT-30-AMI, MORT-30-HF) and patient safety indicators (e.g., PSI-90 composite) lower in nonprofit hospitals compared to for-profit facilities?

  3. Moderating Facility Characteristics: Which hospital attributes—such as teaching status (proxied by hospital type), availability of emergency services, and overall CMS star rating—explain any observed ownership-quality differences?

To answer these questions, the analysis will proceed from exploratory data visualization and correlational assessments to multivariable regression models, followed by more flexible machine learning approaches (tree-based models and neural networks) to gauge predictive performance. This comprehensive strategy aims to yield both interpretable effect estimates and robust predictive insights into how ownership structure relates to hospital quality.

Data Source and Preparation:

All data come from the Centers for Medicare & Medicaid Services (CMS) Hospital Compare files for fiscal year 2025. From “Hospital_General_Information.csv” we extracted facility identifiers, location (City/Town, State, ZIP), ownership status, hospital type, emergency services flag, and overall star rating. The “Complications_and_Deaths-Hospital.csv” file provided 17 measure IDs (e.g. MORT_30_AMI, PSI_90) along with Score, Denominator, and confidence bounds; these were pivoted so each facility appears once with separate columns (e.g. Score_MORT_30_AMI, Denominator_PSI_90). The “FY_2025_Hospital_Readmissions_Reduction_Program_Hospital.csv” file contributed six readmission measures (e.g. Excess Readmission Ratio_READM_30_HF_HRRP); after replacing hyphens with underscores these too were widened into individual columns. Merging on Facility ID produced a tidy dataset—one row per hospital—with a comprehensive set of quality metrics ready for analysis.

Analysis Plan and Methodologies

Exploratory Data Analysis (EDA):

I will begin by summarizing distributions of key measures (readmission ratios, mortality scores, PSI rates) and visualizing ownership-group differences using boxplots and density plots. Missing-data patterns and outliers will be assessed.

Correlation and Association Tests:

Pairwise Pearson or Spearman correlations will explore linear associations among quality metrics and with continuous covariates (e.g. star rating). Chi-square tests or t-tests will compare categorical and continuous variables across ownership groups.
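
To make these comparisons concrete, here is a minimal sketch, assuming the merged hospital-level data frame `df` built in the Code section below; the derived two-level ownership flag and the choice of metrics are illustrative assumptions, not the final specification.

# Hedged sketch: two-group ownership comparisons (illustrative, not the final analysis)
library(tidyverse)

df_two <- df %>%
  mutate(ownership2 = case_when(
    str_detect(`Hospital Ownership`, "non-profit") ~ "Nonprofit",
    `Hospital Ownership` == "Proprietary"          ~ "For-profit",
    TRUE                                           ~ NA_character_
  )) %>%
  filter(!is.na(ownership2))

# t-test comparing a continuous quality metric across the two ownership groups
t.test(Score_Score_MORT_30_HF ~ ownership2, data = df_two)

# Chi-square test for a categorical attribute (emergency services offered)
chisq.test(table(df_two$ownership2, df_two$`Emergency Services`))

# Spearman correlation between the overall star rating and the same mortality score
cor.test(df_two$`Hospital overall rating`, df_two$Score_Score_MORT_30_HF,
         method = "spearman")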

Regression Modeling:

Linear regression models will quantify the association between nonprofit status (binary predictor) and each continuous outcome, adjusting for hospital type, emergency services, and geographic region. Model diagnostics (residual plots, variance inflation factors) will check assumptions and collinearity.
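
As a sketch of one such adjusted model: `df_model`, the 0/1 `nonprofit` flag, and the use of state as a stand-in for region are assumptions for illustration, not the fitted specification reported later.

# Hedged sketch: adjusted linear model for a single readmission outcome.
# `df_model` is assumed to hold one row per hospital with cleaned column names
# and a derived 0/1 `nonprofit` indicator; all names here are placeholders.
library(car)   # vif() for collinearity checks

fit_hf <- lm(
  readm_30_hf_hrrp ~ nonprofit + hospital_type + emergency_services + state,
  data = df_model            # state used as a crude stand-in for geographic region
)
summary(fit_hf)              # adjusted association of nonprofit status with the outcome

par(mfrow = c(2, 2))
plot(fit_hf)                 # residual diagnostics
car::vif(fit_hf)             # variance inflation factors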

Tree-Based Machine Learning:

A random forest or gradient-boosted tree model will be fitted to predict a composite quality index from ownership and hospital features, with variable importance measures revealing the strongest predictors. Cross-validation will guard against overfitting.

Neural Network:

Finally, a feed-forward neural network will be trained on the same feature set to examine whether a flexible nonlinear model can improve predictive accuracy for the overall quality index. Performance of each approach will be compared using mean squared error (MSE) and R² on a held-out test set.

References

Devereaux, P. J., Choi, P. T., Lacchetti, C., Weaver, B., Schünemann, H. J., Haines, T., Lavis, J. N., Grant, B. J., Haslam, D. R., Bhandari, M., Sullivan, T., Cook, D. J., Walter, S. D., Meade, M., Khan, H., Bhatnagar, N., & Guyatt, G. H. (2002). A systematic review and meta-analysis of studies comparing mortality rates of private for-profit and private not-for-profit hospitals. CMAJ: Canadian Medical Association Journal, 166(11), 1399–1406.

Sloan, F. A., Picone, G. A., Taylor, D. H., & Chou, S. Y. (2001). Hospital ownership and cost and quality of care: Is there a dime’s worth of difference? Journal of Health Economics, 20(1), 1–21. https://doi.org/10.1016/s0167-6296(00)00066-7


Code

libraries

library(tidyverse)
library(naniar)    # missingness
library(GGally) 
library(ggplot2)
library(reshape2)
library(dplyr)
library(stringr)
library(caret)
library(randomForest)
library(nnet)
library(pROC)
library(readr)
library(janitor)
library(glmnet)
library(broom)
library(purrr)
library(tibble)
library(MLmetrics)

curating the data

load data

# 1. Set working directory
setwd("/Users/fa55w/Library/CloudStorage/OneDrive-UniversityofMissouri/Documents/FA Folders/_Teaching/Students advisees/Sparjan/data/R")

# 2. Load tidyverse for data import and manipulation

# 3. Read in each CSV
#gen_info <- read_csv("Hospital_General_Information.csv")
#readm_rrp <- read_csv("FY_2025_Hospital_Readmissions_Reduction_Program_Hospital.csv")
#comp_death <- read_csv("Complications_and_Deaths-Hospital.csv")
#hca_hcahps <- read_csv("PCH_HCAHPS_HOSPITAL.csv")
# avoiding the footnote issue that comes up with loading the whole gen_info csv file

# 2. Read each CSV
gen_info   <- read_csv("Hospital_General_Information.csv", show_col_types = FALSE)
comp_death <- read_csv("Complications_and_Deaths-Hospital.csv", show_col_types = FALSE)
readm_rrp  <- read_csv("FY_2025_Hospital_Readmissions_Reduction_Program_Hospital.csv", show_col_types = FALSE)
# and also postponing using HCAHPS for now, since it needs a lot of work with so many questions, but using only a few columns from it
#ca_hcahps <- read_csv("PCH_HCAHPS_HOSPITAL.csv", show_col_types = FALSE)
hcahps <- read_csv("HCAHPS-Hospital.csv", show_col_types = FALSE)
 

# 3. Check for duplicate Facility IDs in each file
dup_gen_info   <- gen_info   %>% count(`Facility ID`) %>% filter(n > 1)
dup_comp_death <- comp_death %>% count(`Facility ID`) %>% filter(n > 1)
dup_readm_rrp  <- readm_rrp  %>% count(`Facility ID`) %>% filter(n > 1)
#dup_hca  <- ca_hcahps %>% count(`Facility ID`) %>% filter(n > 1)
#dup_hcahps  <- hcahps %>% count(`Facility ID`) %>% filter(n > 1)



# 4. Print how many facilities are duplicated in each
cat("General info duplicates:   ", nrow(dup_gen_info), "\n")
cat("Comp/Death duplicates:     ", nrow(dup_comp_death), "\n")
cat("Readm RRP duplicates:      ", nrow(dup_readm_rrp), "\n")
#cat("HCAHPS duplicates:         ", nrow(dup_hca), "\n")

#### After identifying the columns that caused the same Facility ID to appear on multiple rows (readm_rrp$`Measure Name`, comp_death$`Measure Name`, and comp_death$`Measure ID`):

# 1. Unique Measure IDs and Names in the Complications & Deaths table
comp_ids   <- sort(unique(comp_death$`Measure ID`))
comp_names <- sort(unique(comp_death$`Measure Name`))
cat("Complications & Deaths – Measure IDs:\n")
print(comp_ids)
cat("\nComplications & Deaths – Measure Names:\n")
print(comp_names)
# 2. Unique Measure Names in the Readmissions Reduction table
readm_measures <- sort(unique(readm_rrp$`Measure Name`))
cat("\nReadmissions Reduction – Measure Names:\n")
print(readm_measures)

# 2. Double‑check which measures you have
comp_death %>% 
  distinct(`Measure ID`, `Measure Name`) %>% 
  arrange(`Measure ID`) %>% 
  print(n = Inf)
readm_rrp %>% 
  distinct(`Measure Name`) %>% 
  arrange(`Measure Name`) %>% 
  print(n = Inf)
# #4. Read in HCAHPS patient experience data
# hca_hcahps <- read_csv(
#   "PCH_HCAHPS_HOSPITAL.csv",
#   col_select = c(
#     `Facility ID`,
#     `Facility Name`,
#     `ZIP Code`,
#     `HCAHPS Measure ID`,
#     `HCAHPS Question`,
#     `HCAHPS Answer Description`,
#     `Patient Survey Star Rating`,
#     `HCAHPS Answer Percent`,
#     `HCAHPS Linear Mean Value`,
#     `Number of Completed Surveys`,
#     `Survey Response Rate Percent`,
#     `Start Date`,
#     `End Date`
#   ),
#   show_col_types = FALSE
# )

# #Inspect column names to confirm your ID field
# colnames(gen_info)
# colnames(readm_rrp)
# colnames(comp_death)
#colnames(hca_hcahps)

# #checking for the problems that showed up as warnings;
# problems(gen_info)

####################################

# 1. Read in only the columns you need
gen_info <- read_csv(
  "Hospital_General_Information.csv",
  col_select = c(
    `Facility ID`, `Facility Name`, `City/Town`, `State`, `ZIP Code`,
    `Hospital Type`, `Hospital Ownership`, `Emergency Services`,
    `Hospital overall rating`
  ),
  show_col_types = FALSE)

comp_death <- read_csv(
  "Complications_and_Deaths-Hospital.csv",
  col_select = c(
    `Facility ID`, `Facility Name`, `City/Town`, `State`, `ZIP Code`,
    `Measure ID`, `Measure Name`, `Compared to National`, `Denominator`,
    `Score`, `Lower Estimate`, `Higher Estimate`, `Start Date`, `End Date`),
  show_col_types = FALSE)

readm_rrp <- read_csv(
  "FY_2025_Hospital_Readmissions_Reduction_Program_Hospital.csv",
  col_select = c(
    `Facility ID`, `Facility Name`, `State`, `Measure Name`,
    `Number of Discharges`, `Excess Readmission Ratio`,
    `Predicted Readmission Rate`, `Expected Readmission Rate`,
    `Number of Readmissions`, `Start Date`, `End Date`),
  show_col_types = FALSE)

# 2. Pivot Complications & Deaths wide by Measure ID
comp_wide <- comp_death %>%
  select(-`Facility Name`, -`City/Town`, -`State`, -`ZIP Code`) %>%
  pivot_wider(
    id_cols    = `Facility ID`,
    names_from = `Measure ID`,
    values_from = c(Score, Denominator, `Lower Estimate`, `Higher Estimate`),
    names_glue = "{.value}_{.name}")

# 3. Pivot Readmissions Reduction wide by Measure Name
readm_wide <- readm_rrp %>%
  select(-`Facility Name`) %>%
  mutate(
    Measure_Name = str_replace_all(`Measure Name`, "[- ]", "_")) %>%
  pivot_wider(
    id_cols    = `Facility ID`,
    names_from = Measure_Name,
    values_from = c(
      `Number of Discharges`,
      `Excess Readmission Ratio`,
      `Predicted Readmission Rate`,
      `Expected Readmission Rate`,
      `Number of Readmissions`),
    names_glue = "{.value}_{.name}")

merging

# 6. Merge everything back to general info (one row per facility)
final_df <- gen_info %>%
  left_join(comp_wide,  by = "Facility ID") %>%
  left_join(readm_wide, by = "Facility ID")

# 7. Sanity check: ensure one row per facility
stopifnot(nrow(final_df) == n_distinct(final_df$`Facility ID`))

# 8. Inspect the result
glimpse(final_df)
summary(final_df)

# 9. save the final df
#readr::write_csv(final_df, "final_hospital_data.csv")
####################################

## Adding important features from the HCAHPS Hospital csv file:

# 1. Read in the HCAHPS data (if not already loaded)
# hcahps <- read_csv("PCH_HCAHPS_HOSPITAL.csv", show_col_types = FALSE)

# 2. Select the key summary metrics for pivoting:
#    - star ratings
#    - linear mean scores
#    - (optionally) answer percents
hcahps_sel <- hcahps %>%
  select(
    `Facility ID`,
    `HCAHPS Measure ID`,
    `Patient Survey Star Rating`,
    `HCAHPS Linear Mean Value`
    #, `HCAHPS Answer Percent`   # uncomment if you want percent answers too
  )

# 3. Pivot to wide format so each measure becomes its own set of columns:
hcahps_wide <- hcahps_sel %>%
  pivot_wider(
    id_cols    = `Facility ID`,
    names_from = `HCAHPS Measure ID`,
    values_from = c(`Patient Survey Star Rating`, `HCAHPS Linear Mean Value`),
    names_glue = "{.value}_{.name}"
  )


# 4. Merge the new HCAHPS features into your main df (e.g., df_focus or final_df)
df_enriched <- final_df %>%
  left_join(hcahps_wide, by = "Facility ID")

# 5. Quick sanity check: one row per facility
stopifnot(nrow(df_enriched) == n_distinct(df_enriched$`Facility ID`))

# 6. Inspect the newly added columns
glimpse(df_enriched)


# save the df_enriched
#readr::write_csv(df_enriched, "final_df_enriched_hospital_data.csv")

loading the final df

df <- read_csv("/Users/fa55w/Library/CloudStorage/OneDrive-UniversityofMissouri/Documents/FA Folders/_Teaching/Students advisees/Sparjan/data/R/final_df_enriched_hospital_data.csv", show_col_types = FALSE)

EDA

Checking missingness

# 1. Missingness ------------------------------------------------------------

# # Count of missing values per column
# df %>%
#   summarise(across(everything(), ~ sum(is.na(.)))) %>%
#   pivot_longer(everything(), names_to = "variable", values_to = "missing_count") %>%
#   arrange(desc(missing_count))

# 1. Compute the proportion of missing values in each column
missing_pct <- df %>%
  summarise(across(everything(), ~ mean(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "pct_missing")

# 2. Identify columns with <=75% missingness
keep_vars <- missing_pct %>%
  filter(pct_missing <= 0.75) %>%
  pull(variable)

# 3. Subset the dataframe to retain only those columns
df_final <- df %>%
  select(all_of(keep_vars))

# 4. Verify no column has >75% missing
df_final %>%
  summarise(across(everything(), ~ mean(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "pct_missing") %>%
  filter(pct_missing > 0.75)
## # A tibble: 0 × 2
## # ℹ 2 variables: variable <chr>, pct_missing <dbl>
# should return zero rows

# 5. Inspect the resulting dataframe
#glimpse(df_final)

#### 

# Compute per-variable missingness summary
missing_summary <- df_final %>%
  miss_var_summary() %>%       
  arrange(desc(n_miss))

# Bar chart of number of missing values by variable
missing_summary %>%
  ggplot(aes(x = reorder(variable, n_miss), y = n_miss)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Number of Missing Values per Variable",
    x = NULL,
    y = "Count of Missing"
  ) +
  theme_minimal()

####

df <- df_final

Summary stats overall

# 2. Summary Statistics

# 2a. Categorical summaries
df %>%
  select(`Hospital Ownership`, `Hospital Type`, `Emergency Services`) %>%
  map(~ table(.x, useNA = "ifany"))
## $`Hospital Ownership`
## .x
##                       Department of Defense 
##                                          32 
##                        Government - Federal 
##                                          42 
## Government - Hospital District or Authority 
##                                         519 
##                          Government - Local 
##                                         398 
##                          Government - State 
##                                         206 
##                                   Physician 
##                                          76 
##                                 Proprietary 
##                                        1059 
##                                      Tribal 
##                                          15 
##              Veterans Health Administration 
##                                         132 
##               Voluntary non-profit - Church 
##                                         273 
##                Voluntary non-profit - Other 
##                                         366 
##              Voluntary non-profit - Private 
##                                        2278 
## 
## $`Hospital Type`
## .x
##   Acute Care - Department of Defense Acute Care - Veterans Administration 
##                                   32                                  132 
##                 Acute Care Hospitals                            Childrens 
##                                 3145                                   92 
##            Critical Access Hospitals                          Psychiatric 
##                                 1368                                  627 
## 
## $`Emergency Services`
## .x
##   No  Yes 
##  916 4480
# 2b. Numeric summaries
numeric_vars <- df %>%
  select(where(is.numeric)) %>%
  names()

df %>%
  select(all_of(numeric_vars)) %>%
  summarise(across(everything(),
                   list(mean   = ~ mean(.x, na.rm = TRUE),
                        sd     = ~ sd(.x,   na.rm = TRUE),
                        median = ~ median(.x, na.rm = TRUE),
                        IQR    = ~ IQR(.x,    na.rm = TRUE))))
## # A tibble: 1 × 180
##   `Hospital overall rating_mean` Hospital overall ratin…¹ Hospital overall rat…²
##                            <dbl>                    <dbl>                  <dbl>
## 1                           3.14                     1.18                      3
## # ℹ abbreviated names: ¹​`Hospital overall rating_sd`,
## #   ²​`Hospital overall rating_median`
## # ℹ 177 more variables: `Hospital overall rating_IQR` <dbl>,
## #   Score_Score_COMP_HIP_KNEE_mean <dbl>, Score_Score_COMP_HIP_KNEE_sd <dbl>,
## #   Score_Score_COMP_HIP_KNEE_median <dbl>,
## #   Score_Score_COMP_HIP_KNEE_IQR <dbl>, Score_Score_MORT_30_AMI_mean <dbl>,
## #   Score_Score_MORT_30_AMI_sd <dbl>, Score_Score_MORT_30_AMI_median <dbl>, …

Distribution plots

# 3. Distribution Plots

# # 3a. Ownership counts
# df %>%
#   count(`Hospital Ownership`) %>%
#   ggplot(aes(x = reorder(`Hospital Ownership`, n), y = n)) +
#   geom_col(fill = "steelblue") +
#   coord_flip() +
#   labs(title = "Number of Hospitals by Ownership Type",
#        x = "Ownership Type",
#        y = "Count")

# # 3b. Star rating by ownership
# df %>%
#   ggplot(aes(x = `Hospital Ownership`, y = `Hospital overall rating`)) +
#   geom_boxplot() +
#   coord_flip() +
#   labs(title = "Hospital Star Rating by Ownership",
#        x = "Ownership Type",
#        y = "Star Rating")

# # 3c. AMI mortality by ownership
# df %>%
#   ggplot(aes(x = `Hospital Ownership`, y = Score_Score_MORT_30_AMI)) +
#   geom_boxplot() +
#   coord_flip() +
#   labs(title = "30-Day AMI Mortality Score by Ownership",
#        x = "Ownership Type",
#        y = "30-Day AMI Mortality")

#####

# Specify columns to exclude from plotting
exclude_vars <- c(
  "Facility ID",
  "Facility me",
  "City/Town",
  "ZIP Code"
)

# 1. Identify categorical variables, then drop the excluded ones
cat_vars <- df %>%
  select(where(~ is.character(.) || is.factor(.))) %>%
  select(-all_of(exclude_vars)) %>%
  names()

# 2. Identify numeric variables (ID/name/zip aren’t numeric, so no need to exclude)
num_vars <- df %>%
  select(where(is.numeric)) %>%
  names()

# 3. Function to plot categorical distributions
plot_cat <- function(var) {
  df %>%
    count(!!sym(var)) %>%
    ggplot(aes(x = reorder(!!sym(var), n), y = n)) +
    geom_col(fill = "steelblue") +
    coord_flip() +
    labs(
      title = paste("Count of", var),
      x = var, y = "Count"
    ) +
    theme_minimal()
}

# 4. Function to plot numeric distributions
plot_num <- function(var) {
  p_hist <- ggplot(df, aes(x = !!sym(var))) +
    geom_histogram(bins = 30, fill = "lightgray", color = "black") +
    labs(
      title = paste("Histogram of", var),
      x = var, y = "Frequency"
    ) +
    theme_minimal()
  
  p_box <- ggplot(df, aes(x = `Hospital Ownership`, y = !!sym(var))) +
    geom_boxplot(outlier.shape = NA) +
    coord_flip() +
    labs(
      title = paste(var, "by Ownership"),
      x = "Hospital Ownership",
      y = var
    ) +
    theme_minimal()
  
  list(histogram = p_hist, boxplot = p_box)
}

# 5. Generate & print all categorical plots (excluding ID/name/location)
for (v in cat_vars) {
  print(plot_cat(v))
}

# 6. Generate & print all numeric plots
for (v in num_vars) {
  plots <- plot_num(v)
  print(plots$histogram)
  print(plots$boxplot)
}

# # 7. Correlation heatmap of all numeric variables ----------------
# 
# cor_mat <- df %>%
#   select(all_of(num_vars)) %>%
#   drop_na() %>%
#   cor(use = "pairwise.complete.obs")
# 
# corr_plot <- reshape2::melt(cor_mat) %>%
#   ggplot(aes(x = Var1, y = Var2, fill = value)) +
#   geom_tile() +
#   scale_fill_gradient2(midpoint = 0, low = "blue", mid = "white", high = "red") +
#   theme_minimal() +
#   theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
#   labs(title = "Correlation Heatmap of All Numeric Measures")
# 
# print(corr_plot)

Some notes:

  • Voluntary non-profit (private) is by far the largest group, with over 2,200 hospitals. Proprietary (for-profit) hospitals make up the next largest share (~1,100), followed by “Government – Hospital District or Authority” (~500) and “Government – Local” (~400).
  • Nearly every ownership type centers on a 3-star rating, the CMS midpoint, suggesting no dramatic ownership-level shifts.
  • Across non-profit, for-profit, and government hospitals, the median AMI mortality score sits between 12% and 13%; a quick numeric check of these points follows below.
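
As a quick numeric check of the notes above, here is a small sketch reusing the same `df` and the column names produced by the merge; the specific tests chosen are illustrative.

# Hedged sketch: formal checks behind the ownership notes above.
library(dplyr)

# Does the star-rating distribution differ across ownership types?
kruskal.test(`Hospital overall rating` ~ `Hospital Ownership`, data = df)

# Median 30-day AMI mortality score and hospital count by ownership group
df %>%
  group_by(`Hospital Ownership`) %>%
  summarise(
    median_mort_ami = median(Score_Score_MORT_30_AMI, na.rm = TRUE),
    n = n()
  ) %>%
  arrange(desc(n))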

Correlational structure

# # 4. Correlation Structure
# 
# # 4a. Correlation matrix of numeric outcomes
# outcome_vars <- df %>%
#   select(starts_with("Score_Score_MORT"), starts_with("Excess Readmission Ratio")) %>%
#   drop_na()
# 
# cor_mat <- cor(outcome_vars)
# melted_cor <- melt(cor_mat)
# 
# ggplot(melted_cor, aes(x = Var1, y = Var2, fill = value)) +
#   geom_tile() +
#   scale_fill_gradient2(midpoint = 0, low = "blue", mid = "white", high = "red") +
#   theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
#   labs(title = "Correlation Heatmap: Mortality & Readmission Metrics")

# 1. as before
outcome_vars <- df %>%
  select(starts_with("Score_Score_MORT"), starts_with("Excess Readmission Ratio")) %>%
  drop_na()

cor_mat   <- cor(outcome_vars)
melted_cor <- melt(cor_mat, varnames = c("Var1", "Var2"), value.name = "Correlation")

# 2. Helper to build short, readable labels
#    strips the long measure-name prefixes/suffixes so the heatmap axes stay legible
short_labels <- function(x) {
  x %>%
    str_remove("^Score_Score_") %>%               # remove mortality prefix
    str_remove("^Excess Readmission Ratio_") %>%  # remove readm prefix
    str_replace_all("MORT_30_", "") %>%           # remove MORT_30_
    str_replace_all("_HRRP$", "") %>%             # remove HRRP suffix
    str_replace_all("_", " ")                      # convert underscores to spaces
}

# 3. Add short labels into the melted data
melted_cor <- melted_cor %>%
  mutate(
    Var1_short = short_labels(Var1),
    Var2_short = short_labels(Var2)
  )

# 4. Plot
ggplot(melted_cor, aes(x = Var1_short, y = Var2_short, fill = Correlation)) +
  geom_tile() +
  scale_fill_gradient2(midpoint = 0, low = "blue", mid = "white", high = "red") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title  = element_blank()
  ) +
  labs(
    title = "Correlation Heatmap: Mortality & Readmission Metrics"
  )

Notes from the heatmap:

  • Mortality x Mortality: The upper-left 6×6 block is mostly pink-to-red (r ≈ 0.5–0.8), indicating that hospitals with high mortality in one condition (say, AMI) also tend to have high mortality in other conditions (e.g., HF).
  • Readmission x Readmission: The lower-right 6×6 block is very deep red (r ≈ 0.8–0.95), showing that excess readmission ratios track extremely closely across conditions—if a hospital has a high readmission ratio for AMI, it almost certainly does for HF, pneumonia, etc.
  • By contrast, cross-domain correlations are weak. Mortality x Readmission, for example: the off-diagonal blocks (upper-right and lower-left) are mostly white to light pink (r ≈ –0.1 to +0.2), meaning a hospital’s mortality rate for a condition is largely unrelated to its readmission ratio for the same or other conditions.
  • A few mild exceptions: the darker pink patches at the intersection of certain pairs (e.g., stroke mortality vs. stroke readmission) suggest small positive correlations (r ≈ 0.2–0.3), but these are modest compared to the strong within-domain effects; a rough numeric check of these block averages follows below.
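
To back up the visual impression, here is a small sketch that averages absolute correlations within and across the two domains, reusing the `cor_mat` object from the chunk above (the column prefixes are those produced by the earlier pivoting).

# Hedged sketch: mean |r| within each domain block and across domains.
mort_idx  <- grepl("^Score_Score_MORT", rownames(cor_mat))
readm_idx <- grepl("^Excess Readmission Ratio", rownames(cor_mat))

mort_block  <- cor_mat[mort_idx,  mort_idx]
readm_block <- cor_mat[readm_idx, readm_idx]
cross_block <- cor_mat[mort_idx,  readm_idx]

cat("Mean |r|, mortality block:   ", round(mean(abs(mort_block[upper.tri(mort_block)])), 2), "\n")
cat("Mean |r|, readmission block: ", round(mean(abs(readm_block[upper.tri(readm_block)])), 2), "\n")
cat("Mean |r|, cross-domain block:", round(mean(abs(cross_block)), 2), "\n")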

Scatterplots and a pair plot

# 4b. Scatterplot matrix for selected outcomes
# 1. Define the four measure columns
measures <- c(
  "Score_Score_MORT_30_AMI",
  "Score_Score_MORT_30_HF",
  "Excess Readmission Ratio_Excess Readmission Ratio_READM_30_AMI_HRRP",
  "Excess Readmission Ratio_Excess Readmission Ratio_READM_30_HF_HRRP"
)

# 2. Subset and drop any rows with NA in those measures
df_sub <- df %>%
  select(all_of(measures), `Hospital Ownership`) %>%
  drop_na()

# 3. Filter out ownership groups with fewer than 2 observations
df_sub <- df_sub %>%
  group_by(`Hospital Ownership`) %>%
  filter(n() >= 2) %>%
  ungroup()

# 4. Custom correlation panel
corr_fn <- function(data, mapping, method = "pearson", ...) {
  x <- eval_data_col(data, mapping$x)
  y <- eval_data_col(data, mapping$y)
  r <- cor(x, y, use = "pairwise.complete.obs", method = method)
  ggally_text(
    label = paste0("r = ", round(r, 2)),
    mapping = aes(), xP = 0.5, yP = 0.5, ...
  ) + 
    theme_void()
}

# 5. the ggpairs plot with the custom upper panel
GGally::ggpairs(
  df_sub,
  columns = 1:4,
  mapping = aes(color = `Hospital Ownership`, alpha = 0.5),
  upper = list(
    continuous = corr_fn
  ),
  lower = list(
    continuous = wrap("points", size = 0.5)
  ),
  diag = list(
    continuous = wrap("densityDiag", alpha = 0.3)
  )
)

Notes from the pairplot:

  • AMI vs. HF mortality are moderately correlated (r≈0.36), suggesting hospitals with higher AMI death rates also tend to have higher HF death rates.
  • Likewise, AMI vs. HF readmission ratios correlate at about r≈0.36.
  • Mortality measures and readmission measures show near-zero correlations (|r|<0.1). In other words, a hospital’s tendency toward higher 30-day mortality does not predict its excess readmission ratios very well.
  • The overlaid densities on the diagonal show that nonprofit and for-profit hospitals have very similar visual distributions for each metric. There are no obvious, large shifts in location or spread by ownership category, at least visually.
  • Since mortality and readmission metrics behave somewhat independently, separate models or domain-specific composite indices may be needed to capture each domain; a sketch of such composites follows below.
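
One way such domain-specific composites could be built, as a minimal sketch: z-score averages within each domain, assuming the wide column prefixes used above. This is illustrative only and not part of the modeling pipeline that follows.

# Hedged sketch: separate mortality and readmission composite indices,
# each the row mean of the z-scored measures in its domain.
mort_cols  <- grep("^Score_Score_MORT", names(df), value = TRUE)
readm_cols <- grep("^Excess Readmission Ratio", names(df), value = TRUE)

df_idx <- df
df_idx$mortality_index   <- rowMeans(scale(df_idx[mort_cols]),  na.rm = TRUE)
df_idx$readmission_index <- rowMeans(scale(df_idx[readm_cols]), na.rm = TRUE)

summary(df_idx[c("mortality_index", "readmission_index")])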

Handling missingness and preprocessing

missing_pct <- df %>%
  summarise(across(everything(), ~ mean(is.na(.)))) %>%
  pivot_longer(
    cols = everything(),
    names_to  = "var",
    values_to = "pct_missing"
  ) %>%
  arrange(desc(pct_missing))
missing_pct
## # A tibble: 53 × 2
##    var                                                               pct_missing
##    <chr>                                                                   <dbl>
##  1 Score_Score_PSI_04                                                      0.709
##  2 Excess Readmission Ratio_Excess Readmission Ratio_READM_30_HIP_K…       0.706
##  3 Score_Score_COMP_HIP_KNEE                                               0.680
##  4 Excess Readmission Ratio_Excess Readmission Ratio_READM_30_AMI_H…       0.673
##  5 Score_Score_MORT_30_AMI                                                 0.629
##  6 Score_Score_MORT_30_STK                                                 0.590
##  7 Excess Readmission Ratio_Excess Readmission Ratio_READM_30_COPD_…       0.569
##  8 Score_Score_PSI_13                                                      0.544
##  9 Score_Score_PSI_10                                                      0.534
## 10 Score_Score_PSI_14                                                      0.531
## # ℹ 43 more rows

Selecting metrics with less than 50% missingness:

# 1. Compute percent missing per variable
missing_pct <- df %>%
  summarise(across(everything(), ~ mean(is.na(.)))) %>%
  pivot_longer(cols      = everything(),
               names_to  = "var",
               values_to = "pct_missing")

# 2. Identify variables to drop (>= 50% missing)
drop_vars <- missing_pct %>%
  filter(pct_missing >= 0.50) %>%
  pull(var)

# 3. Drop those high‐missingness columns from df
df_reduced <- df %>%
  select(-all_of(drop_vars))

# 4. (Optional) Report how many columns were dropped
message("Dropped ", length(drop_vars), " columns with >=50% missingness out of ", 
        ncol(df), " total.")

# 5. Inspect the remaining missingness to confirm
df_reduced %>%
  summarise(across(everything(), ~ mean(is.na(.)))) %>%
  pivot_longer(cols      = everything(),
               names_to  = "var",
               values_to = "pct_missing") %>%
  arrange(desc(pct_missing)) %>%
  print(n = 10)

# Now df_reduced contains only variables with <50% missingness.


# 3. How many hospitals remain (rows are unchanged by dropping columns)
n_total <- nrow(df)
n_focus <- nrow(df_reduced)
message(
  "Kept ", n_focus, " of ", n_total, 
  " hospitals for analysis (", 
  round(100 * n_focus / n_total, 1), "%)."
)

df <- df_reduced
# Save the reduced dataframe to CSV
#readr::write_csv(df_reduced, "df_reduced_20250428.csv")

# 1. Load and clean names --------------------------------------------------
df <- read_csv("df_reduced_20250428.csv", show_col_types = FALSE) %>%
  clean_names() %>% 
  # strip the doubled prefixes in the HCAHPS star‐rating cols
  rename_with(~ str_remove(.x, 
    "^patient_survey_star_rating_patient_survey_star_rating_"),
    starts_with("patient_survey_star_rating_patient_survey_star_rating_")
  ) %>%
  # strip the doubled prefixes in the HCAHPS linear‐score cols
  rename_with(~ str_remove(.x, 
    "^hcahps_linear_mean_value_hcahps_linear_mean_value_"),
    starts_with("hcahps_linear_mean_value_hcahps_linear_mean_value_")
  ) %>%
  # strip the doubled prefixes in the readmission cols
  rename_with(~ str_remove(.x, 
    "^excess_readmission_ratio_excess_readmission_ratio_"),
    starts_with("excess_readmission_ratio_excess_readmission_ratio_")
  )
# Save the reduced dataframe to CSV
#readr::write_csv(df, "df_final_20250429.csv")

Starting modeling:

Final data prep:

Creating a binary variable for the overall rating using a cutoff of 4, splitting into training/testing sets, and imputing missing values:

# 2. Define outcome & drop IDs/locs ----------------------------------------
df_prep <- df %>%
  select(-facility_id, -facility_name, -city_town, -state, -zip_code) %>%
  mutate(
    high_overall = factor(
      if_else(hospital_overall_rating >= 4, "yes", "no"),
      levels = c("no","yes")
    )
  )

# 3. Split 70/30 -----------------------------------------------------------
set.seed(123)
train_idx <- createDataPartition(df_prep$high_overall, p = 0.7, list = FALSE)
train     <- df_prep[train_idx, ]
test      <- df_prep[-train_idx, ]

# 4. Identify predictors, drop single‐level in TRAIN ----------------------
all_preds    <- setdiff(names(train), "high_overall")
usable_preds <- all_preds[sapply(all_preds, function(v) 
  length(unique(train[[v]])) > 1
)]

train <- train %>% select(all_of(c(usable_preds, "high_overall")))
test  <- test  %>% select(all_of(c(usable_preds, "high_overall")))

# 5. Identify numeric vs. categorical ------------------------------------
num_vars <- train %>% select(-high_overall) %>% select(where(is.numeric)) %>% names()
cat_vars <- setdiff(setdiff(names(train), num_vars), "high_overall")

# 6. Compute imputation values on TRAIN -----------------------------------
medians <- train %>% summarise(across(all_of(num_vars), ~ median(.x, na.rm=TRUE)))
mode_fun <- function(x) {
  ux <- na.omit(unique(x))
  ux[which.max(tabulate(match(x, ux)))]
}
modes <- train %>% summarise(across(all_of(cat_vars), ~ mode_fun(.x)))

# 7. Impute TRAIN and TEST -----------------------------------------------
train_imp <- train %>%
  mutate(across(all_of(num_vars),
                ~ if_else(is.na(.x), medians[[cur_column()]], .x))) %>%
  mutate(across(all_of(cat_vars),
                ~ if_else(is.na(.x), modes[[cur_column()]], .x))) %>%
  # ensure factors
  mutate(across(all_of(cat_vars), as.factor))

test_imp <- test %>%
  mutate(across(all_of(num_vars),
                ~ if_else(is.na(.x), medians[[cur_column()]], .x))) %>%
  mutate(across(all_of(cat_vars),
                ~ if_else(is.na(.x), modes[[cur_column()]], .x))) %>%
  mutate(across(all_of(cat_vars), as.factor))

# 8. Drop near‐zero‐variance predictors in TRAIN --------------------------
nzv <- nearZeroVar(train_imp, saveMetrics = FALSE)
if (length(nzv) > 0) {
  to_drop <- names(train_imp)[nzv]
  train_imp <- train_imp %>% select(-all_of(to_drop))
  test_imp  <- test_imp  %>% select(-all_of(to_drop))
}

# 9. Align factor levels in TEST to TRAIN, re‐impute NAs if any ---------
for (v in intersect(cat_vars, names(train_imp))) {
  test_imp[[v]] <- factor(test_imp[[v]], levels = levels(train_imp[[v]]))
}
# If any NAs remain in test_imp after alignment, replace with training mode
test_imp <- test_imp %>%
  mutate(across(all_of(cat_vars),
    ~ if_else(is.na(.x), modes[[cur_column()]], as.character(.x))
  )) %>%
  mutate(across(all_of(cat_vars), as.factor))

#============== saving the train and test splits
#readr::write_csv(train_imp, "train_imp.csv")
#readr::write_csv(test_imp,  "test_imp.csv")

LASSO (multinomial) to predict the overall hospital rating

A little more data prep

# 5b. Make absolutely sure there are no NAs left
train_imp <- train_imp %>% drop_na()
test_imp  <- test_imp  %>% drop_na()

# 6. Make sure high_overall is gone, and rating is a factor -----------
train_imp <- train_imp %>%
  select(-high_overall) %>%
  mutate(hospital_overall_rating = 
           factor(hospital_overall_rating, 
                  levels = sort(unique(hospital_overall_rating))))

test_imp <- test_imp %>%
  select(-high_overall) %>%
  mutate(hospital_overall_rating = 
           factor(hospital_overall_rating, 
                  levels = levels(train_imp$hospital_overall_rating)))

# 7. Build model matrices (drop the intercept) ----------------------------
X_train <- model.matrix(hospital_overall_rating ~ . -1, data = train_imp)
y_train <- train_imp$hospital_overall_rating

X_test  <- model.matrix(hospital_overall_rating ~ . -1, data = test_imp)
# pad any missing columns in X_test just like before
missing_cols <- setdiff(colnames(X_train), colnames(X_test))
if (length(missing_cols)) {
  zeros <- matrix(0, nrow = nrow(X_test), ncol = length(missing_cols),
                  dimnames = list(NULL, missing_cols))
  X_test <- cbind(X_test, zeros)
}
X_test <- X_test[, colnames(X_train)]

Building the model

# 8. Cross‐validated multinomial LASSO --------------------------------------
cvfit_multi <- cv.glmnet(
  X_train, y_train,
  family = "multinomial",
  alpha  = 1
)
best_lam <- cvfit_multi$lambda.min

# 9. Predict on test set ---------------------------------------------------
pred_multi <- predict(cvfit_multi, X_test,
                      s    = best_lam,
                      type = "class")[,1]

# 10. Confusion matrix & (if desired) per‐class ROC summaries ------------
conf <- confusionMatrix(
  factor(pred_multi, levels = levels(y_train)),
  test_imp$hospital_overall_rating
)
print(conf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4   5
##          1  26  11   3   2   0
##          2  48  53  41  15   1
##          3  15  88 119  79  13
##          4   3  19  70 110  62
##          5   0   0   8  23  37
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4078          
##                  95% CI : (0.3745, 0.4418)
##     No Information Rate : 0.2849          
##     P-Value [Acc > NIR] : 1.272e-14       
##                                           
##                   Kappa : 0.2166          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.28261  0.30994   0.4938   0.4803  0.32743
## Specificity           0.97878  0.84444   0.6777   0.7504  0.95771
## Pos Pred Value        0.61905  0.33544   0.3790   0.4167  0.54412
## Neg Pred Value        0.91791  0.82849   0.7707   0.7955  0.90231
## Prevalence            0.10875  0.20213   0.2849   0.2707  0.13357
## Detection Rate        0.03073  0.06265   0.1407   0.1300  0.04374
## Detection Prevalence  0.04965  0.18676   0.3712   0.3121  0.08038
## Balanced Accuracy     0.63069  0.57719   0.5857   0.6154  0.64257
# For multiclass AUC, you can do one‐vs‐all manually:
probs <- predict(cvfit_multi, X_test,
                 s    = best_lam,
                 type = "response")  # an array [n, K, 1]
# Example: AUC for class “5” vs rest
y_bin <- if_else(test_imp$hospital_overall_rating == "5", 1, 0)
auc5 <- roc(y_bin, probs[ , "5", 1])$auc
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
cat("AUC for predicting rating==5:", auc5, "\n")
## AUC for predicting rating==5: 0.8774161

Notes:

Computing AUCs for the other star levels:

probs <- predict(cvfit_multi, X_test, s = best_lam, type = "response")  # [n, K, 1]
aucs <- sapply(levels(y_train), function(cl) {
  bin  <- if_else(test_imp$hospital_overall_rating == cl, 1, 0)
  roc(bin, probs[, cl, 1])$auc
})
print(data.frame(star = levels(y_train), AUC = aucs))
##   star       AUC
## 1    1 0.8915782
## 2    2 0.7285943
## 3    3 0.6378245
## 4    4 0.6980141
## 5    5 0.8774161

And to inspect the selected predictors (checking nonzero coefficients at best_lam):

coef_list <- coef(cvfit_multi, s = best_lam)

Pulling up the top drivers for each class:

# 1. Convert each dgCMatrix → tibble of term + estimate + star
top_coefs <- imap_dfr(coef_list, function(sparse_mat, star_level) {
  # turn into a plain numeric vector
  est_vec <- as.matrix(sparse_mat)[,1]
  
  tibble(
    term     = names(est_vec),      # feature names
    estimate = unname(est_vec),     # coefficient values
    star     = star_level
  )
}) %>%
  filter(term != "(Intercept)", estimate != 0) %>%
  mutate(abs_est = abs(estimate))

# 2. Grab top 5 by absolute value for each star level
top5_per_star <- top_coefs %>%
  group_by(star) %>%
  slice_max(abs_est, n = 5) %>%
  arrange(star, desc(abs_est)) %>%
  select(star, term, estimate)

print(top5_per_star)
## # A tibble: 25 × 3
## # Groups:   star [5]
##    star  term                                                          estimate
##    <chr> <chr>                                                            <dbl>
##  1 1     readm_30_pn_hrrp                                                 5.48 
##  2 1     score_score_psi_90                                               1.85 
##  3 1     hospital_typeAcute Care Hospitals                               -0.920
##  4 1     emergency_servicesYes                                            0.503
##  5 1     hospital_ownershipGovernment - Local                             0.440
##  6 2     score_score_psi_90                                               1.64 
##  7 2     readm_30_pn_hrrp                                                 1.17 
##  8 2     score_score_psi_06                                               0.615
##  9 2     hospital_ownershipGovernment - Hospital District or Authority    0.541
## 10 2     hospital_ownershipPhysician                                      0.433
## # ℹ 15 more rows

Notes:

Random Forest

set.seed(123)
# 1a. Fit RF (500 trees, default mtry)
rf_fit <- randomForest(
  hospital_overall_rating ~ .,
  data       = train_imp,
  ntree      = 500,
  importance = TRUE
)

# 1b. Predict on test set
rf_pred <- predict(rf_fit, test_imp)
rf_prob <- predict(rf_fit, test_imp, type = "prob")

# 1c. Confusion matrix
cm_rf <- confusionMatrix(rf_pred, test_imp$hospital_overall_rating)
print(cm_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4   5
##          1  27  14   1   2   0
##          2  38  52  39  16   0
##          3  23  80 122  79  16
##          4   4  24  68 107  60
##          5   0   1  11  25  37
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4078          
##                  95% CI : (0.3745, 0.4418)
##     No Information Rate : 0.2849          
##     P-Value [Acc > NIR] : 1.272e-14       
##                                           
##                   Kappa : 0.2168          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.29348  0.30409   0.5062   0.4672  0.32743
## Specificity           0.97745  0.86222   0.6727   0.7472  0.94952
## Pos Pred Value        0.61364  0.35862   0.3813   0.4068  0.50000
## Neg Pred Value        0.91895  0.83024   0.7738   0.7907  0.90155
## Prevalence            0.10875  0.20213   0.2849   0.2707  0.13357
## Detection Rate        0.03191  0.06147   0.1442   0.1265  0.04374
## Detection Prevalence  0.05201  0.17139   0.3783   0.3109  0.08747
## Balanced Accuracy     0.63547  0.58316   0.5895   0.6072  0.63848
# 1d. Per‐class AUC
levels <- levels(test_imp$hospital_overall_rating)
aucs_rf <- sapply(levels, function(cl) {
  roc(
    response = ifelse(test_imp$hospital_overall_rating == cl, 1, 0),
    predictor = rf_prob[, cl]
  )$auc
})
print(data.frame(star = levels, AUC = aucs_rf))
##   star       AUC
## 1    1 0.8498371
## 2    2 0.7254624
## 3    3 0.6372004
## 4    4 0.6728359
## 5    5 0.8705707

Notes:

Neural Net

# 6a. Make new syntactic labels
old_lvls <- levels(train_imp$hospital_overall_rating)
new_lvls <- paste0("star", old_lvls)

# 6b. Recode train & test
train_imp <- train_imp %>%
  mutate(
    hospital_overall_rating = factor(
      hospital_overall_rating,
      levels = old_lvls,
      labels = new_lvls
    )
  )

test_imp <- test_imp %>%
  mutate(
    hospital_overall_rating = factor(
      hospital_overall_rating,
      levels = old_lvls,
      labels = new_lvls
    )
  )

# Now `hospital_overall_rating` has levels "star1","star2",…,"star5" which are valid names.
# 1. Set up 5-fold CV with multiclass summary
ctrl_nn <- trainControl(
  method          = "cv",
  number          = 5,
  classProbs      = TRUE,
  summaryFunction = multiClassSummary,
  savePredictions = TRUE
)

# 2. Define hyperparameter grid: size = # hidden units, decay = weight decay
nn_grid <- expand.grid(
  .size  = c(5, 10),
  .decay = c(0.1, 0.01)
)

# 3. Train the neural net
set.seed(123)
nn_fit <- train(
  hospital_overall_rating ~ .,
  data       = train_imp,
  method     = "nnet",
  trControl  = ctrl_nn,
  tuneGrid   = nn_grid,
  metric     = "Accuracy",
  preProcess = c("center", "scale"),
  trace      = FALSE,
  MaxNWts    = 10000
)

print(nn_fit)  # shows best size/decay and resampled metrics
## Neural Network 
## 
## 1977 samples
##   30 predictor
##    5 classes: 'star1', 'star2', 'star3', 'star4', 'star5' 
## 
## Pre-processing: centered (39), scaled (39) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1582, 1582, 1581, 1583, 1580 
## Resampling results across tuning parameters:
## 
##   size  decay  logLoss   AUC        prAUC      Accuracy   Kappa      Mean_F1  
##    5    0.01   1.480187  0.7026857  0.3352165  0.3525936  0.1592669  0.3422222
##    5    0.10   1.450043  0.6994196  0.3403839  0.3632048  0.1615713  0.3408159
##   10    0.01   1.701283  0.6858812  0.3268888  0.3419580  0.1440555  0.3314895
##   10    0.10   1.647690  0.6857993  0.3261458  0.3525795  0.1535789  0.3381780
##   Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value
##   0.3463234         0.8307810         0.3466716            0.8312757          
##   0.3330028         0.8312089         0.3744370            0.8321460          
##   0.3297773         0.8279640         0.3367626            0.8281677          
##   0.3340297         0.8296306         0.3476463            0.8300208          
##   Mean_Precision  Mean_Recall  Mean_Detection_Rate  Mean_Balanced_Accuracy
##   0.3466716       0.3463234    0.07051872           0.5885522             
##   0.3744370       0.3330028    0.07264096           0.5821058             
##   0.3367626       0.3297773    0.06839159           0.5788706             
##   0.3476463       0.3340297    0.07051590           0.5818301             
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 5 and decay = 0.1.
# Drop any test rows where hospital_ownership == "Tribal"
test_imp <- test_imp %>%
  filter(hospital_ownership != "Tribal")

# Sanity check: make sure levels now align
#table(test_imp$hospital_ownership)

# 4. Predict on test set
nn_pred <- predict(nn_fit, test_imp)
nn_prob <- predict(nn_fit, test_imp, type = "prob")

# 5. Confusion matrix
conf_nn <- confusionMatrix(nn_pred, test_imp$hospital_overall_rating)
print(conf_nn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction star1 star2 star3 star4 star5
##      star1    25    28     4     6     0
##      star2    45    58    50    19     3
##      star3    20    56   100    71    17
##      star4     2    27    70    98    42
##      star5     0     2    16    35    51
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3929          
##                  95% CI : (0.3598, 0.4267)
##     No Information Rate : 0.284           
##     P-Value [Acc > NIR] : 6.551e-12       
##                                           
##                   Kappa : 0.2096          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: star1 Class: star2 Class: star3 Class: star4
## Sensitivity               0.27174      0.33918       0.4167       0.4279
## Specificity               0.94954      0.82641       0.7289       0.7711
## Pos Pred Value            0.39683      0.33143       0.3788       0.4100
## Neg Pred Value            0.91432      0.83134       0.7590       0.7838
## Prevalence                0.10888      0.20237       0.2840       0.2710
## Detection Rate            0.02959      0.06864       0.1183       0.1160
## Detection Prevalence      0.07456      0.20710       0.3124       0.2828
## Balanced Accuracy         0.61064      0.58280       0.5728       0.5995
##                      Class: star5
## Sensitivity               0.45133
## Specificity               0.92760
## Pos Pred Value            0.49038
## Neg Pred Value            0.91633
## Prevalence                0.13373
## Detection Rate            0.06036
## Detection Prevalence      0.12308
## Balanced Accuracy         0.68946
# 6. One-vs-all AUC for each star level
stars   <- levels(test_imp$hospital_overall_rating)
aucs_nn <- sapply(stars, function(cl) {
  roc(
    response  = ifelse(test_imp$hospital_overall_rating == cl, 1, 0),
    predictor = nn_prob[, cl]
  )$auc
})
print(data.frame(star = stars, AUC = aucs_nn))
##        star       AUC
## star1 star1 0.8530227
## star2 star2 0.7159665
## star3 star3 0.6318457
## star4 star4 0.6891659
## star5 star5 0.8347539

Notes:

Model Comparison

# --- 1. Recompute all model probabilities on the current test_imp -----------

# 1) Rebuild X_test for the multinomial LASSO
X_test_new <- model.matrix(
  hospital_overall_rating ~ . -1,
  data = test_imp
)

# Pad/re-order columns exactly as I did originally:
missing_cols <- setdiff(colnames(X_train), colnames(X_test_new))
if(length(missing_cols)){
  zeros     <- matrix(0, nrow=nrow(X_test_new), ncol=length(missing_cols),
                      dimnames=list(NULL, missing_cols))
  X_test_new <- cbind(X_test_new, zeros)
}
X_test_new <- X_test_new[, colnames(X_train)]
stopifnot(nrow(X_test_new)==nrow(test_imp))

# 2) LASSO probabilities (n × K matrix)
lasso_probs_array <- predict(
  cvfit_multi,
  newx  = X_test_new,
  s     = best_lam,
  type  = "response"
)    # gives an n×K×1 array
lasso_probs <- lasso_probs_array[,,1]
colnames(lasso_probs) <- levels(test_imp$hospital_overall_rating)

# 3) Random‐Forest and Neural‐Net probabilities
rf_prob <- predict(rf_fit, test_imp, type="prob")
nn_prob <- predict(nn_fit, test_imp, type="prob")

# 4) Sanity check that all are aligned
stopifnot(
  nrow(lasso_probs)==nrow(test_imp),
  nrow(rf_prob)       ==nrow(test_imp),
  nrow(nn_prob)       ==nrow(test_imp)
)

# Rename RF‐prob columns to match stars
colnames(rf_prob) <- paste0("star", colnames(rf_prob))

# --- 2. Define the star‐levels once -----------------------------------------
stars <- levels(test_imp$hospital_overall_rating)

# --- 3. Compute one-vs-all AUCs for each model -------------------------------
aucs_lasso <- sapply(stars, function(cl) {
  roc_resp <- if_else(test_imp$hospital_overall_rating==cl,1,0)
  pROC::roc(roc_resp, lasso_probs[,cl])$auc
})
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
aucs_rf <- sapply(stars, function(cl) {
  roc_resp <- if_else(test_imp$hospital_overall_rating==cl,1,0)
  pROC::roc(roc_resp, rf_prob[,cl])$auc
})
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
aucs_nn <- sapply(stars, function(cl) {
  roc_resp <- if_else(test_imp$hospital_overall_rating==cl,1,0)
  pROC::roc(roc_resp, nn_prob[,cl])$auc
})
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# --- 4. Gather confusion‐matrix stats (Accuracy, Kappa) ----------------------
# Use the CM objects I already printed: `conf` (LASSO), `cm_rf`, `conf_nn`
get_cm_stats <- function(cm) {
  tibble(
    Accuracy = unname(cm$overall["Accuracy"]),
    Kappa    = unname(cm$overall["Kappa"])
  )
}

summary_table <- bind_rows(
  get_cm_stats(conf)     %>% mutate(Model="LASSO"),
  get_cm_stats(cm_rf)    %>% mutate(Model="Random Forest"),
  get_cm_stats(conf_nn)  %>% mutate(Model="Neural Net")
) %>%
  mutate(
    AUC_5star = c(aucs_lasso["star5"],
                  aucs_rf   ["star5"],
                  aucs_nn   ["star5"]),
    Mean_AUC  = c(mean(aucs_lasso),
                  mean(aucs_rf),
                  mean(aucs_nn))
  ) %>%
  select(Model, Accuracy, Kappa, Mean_AUC, AUC_5star)

print(summary_table)
## # A tibble: 3 × 5
##   Model         Accuracy Kappa Mean_AUC AUC_5star
##   <chr>            <dbl> <dbl>    <dbl>     <dbl>
## 1 LASSO            0.408 0.217    0.766     0.877
## 2 Random Forest    0.408 0.217    0.751     0.871
## 3 Neural Net       0.393 0.210    0.745     0.835
# --- 5. Overlay ROC curves for star5 vs. rest -------------------------------
roc_lasso5 <- roc(
  if_else(test_imp$hospital_overall_rating=="star5",1,0),
  lasso_probs[, "star5"]
)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_rf5    <- roc(
  if_else(test_imp$hospital_overall_rating=="star5",1,0),
  rf_prob[,    "star5"]
)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_nn5    <- roc(
  if_else(test_imp$hospital_overall_rating=="star5",1,0),
  nn_prob[,    "star5"]
)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
ggroc(
  list(
    `LASSO`         = roc_lasso5,
    `Random Forest` = roc_rf5,
    `Neural Net`    = roc_nn5
  ),
  legacy.axes = TRUE
) +
  labs(
    title = "ROC Curves for 5-Star vs. All",
    x     = "False Positive Rate",
    y     = "True Positive Rate",
    color = "Model"
  ) +
  geom_abline(linetype="dashed", color="grey50")

Notes:

Results and Discussion

Comparative Model Performance

I trained three classifiers (multinomial LASSO logistic regression, random forest, and a single-hidden-layer neural network) on the same imputed CMS Hospital Compare dataset, using a 70/30 train/test split. Random forest and LASSO achieved essentially identical raw accuracy (40.8%) and nearly identical Kappa (0.217 vs. 0.216), so neither clearly outperformed the other on raw classification. However, LASSO delivered the best discrimination overall, with a mean multiclass AUC of 0.766 versus 0.751 for the forest and 0.745 for the neural net. In the “five-star vs. all” binary task, LASSO again led (AUC = 0.877), followed by random forest (0.871) and the neural net (0.835). The overlaid ROC curves shown above confirm that both LASSO (red) and random forest (green) reach high true-positive rates at low false-positive rates, while the neural net’s (blue) curve rises more gradually.

Interpretation of Findings

The LASSO’s competitive mean AUC and best extreme-class performance illustrate how combining variable selection with logistic regression can yield both strong discrimination and straightforward interpretability. The random forest matched that accuracy and edged ahead slightly on Kappa, reflecting its capacity to capture non-linear interactions among quality measures, at the cost of more complex “black-box” inference. The shallow neural network, although theoretically able to approximate arbitrary functions, did not outperform the simpler models, likely because of the limited sample size relative to the network’s flexibility and the absence of deeper architectures or more extensive hyperparameter tuning.
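
For reference, one illustration of the more extensive tuning alluded to above: a hedged sketch of a wider caret grid for the same nnet model. The grid values and the repeated cross-validation setup are assumptions for illustration; this was not run for the results reported here.

# Hedged sketch: broader hyperparameter search for the nnet model (illustrative only).
library(caret)

ctrl_nn_wide <- trainControl(
  method          = "repeatedcv",
  number          = 5,
  repeats         = 3,
  classProbs      = TRUE,
  summaryFunction = multiClassSummary
)

nn_grid_wide <- expand.grid(
  size  = c(3, 5, 10, 20),
  decay = c(0, 0.001, 0.01, 0.1, 1)
)

set.seed(123)
nn_fit_wide <- train(
  hospital_overall_rating ~ .,
  data       = train_imp,
  method     = "nnet",
  trControl  = ctrl_nn_wide,
  tuneGrid   = nn_grid_wide,
  metric     = "AUC",               # optimize discrimination rather than raw accuracy
  preProcess = c("center", "scale"),
  trace      = FALSE,
  MaxNWts    = 20000
)
print(nn_fit_wide)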

Conclusions:

Both LASSO and random forest offer reliable pipelines for predicting hospital star ratings from publicly available CMS metrics. LASSO stands out when interpretability is paramount (e.g., identifying which readmission ratios or patient-safety indicators most drive ratings), while random forest delivers comparable performance for fully automated scoring systems where interpretability matters less. The neural network’s performance suggests that adding depth or larger training sets would be necessary before it surpasses these more parsimonious approaches on this task.

Accurate, data-driven hospital quality prediction has immediate implications for payers, providers, and patients. A LASSO-based tool may be able to pinpoint the top drivers of low or high star ratings—such as pneumonia readmission ratio and PSI-90 composite safety score—enabling health systems to allocate quality-improvement resources where they will have the greatest return. Payers can incorporate these predictive scores into value-based reimbursement models, rewarding hospitals that demonstrate both superior patient outcomes and experience. Finally, transparent, interpretable models empower patients and referral networks to make informed choices, driving competition on meaningful quality metrics rather than purely on branding or reputation.