1 Project Context

When a patient with diabetes leaves the hospital, the next month is, statistically, the most fragile stretch of their care. The Centers for Medicare and Medicaid Services have used 30-day readmission rates as a quality metric since 2012, and unplanned readmissions cost the U.S. health system roughly twenty-six billion dollars a year (Jencks et al., 2009). Diabetic inpatients are over-represented in those numbers because the same admission that brought them in — unstable glucose, fluid balance, infection — rarely resolves cleanly in three or four days.

This Module 4 notebook executes the analytic plan from our Module 3 proposal: clean the publicly available Diabetes 130-US Hospitals dataset (Strack et al., 2014), produce descriptive statistics, run our first wave of inferential tests — three chi-square tests of independence and a one-way ANOVA with Tukey HSD post-hoc — and fit a preliminary logistic regression. The regularized models (Ridge / Lasso) come in Module 6.

Research questions answered tentatively here

Q1. Which patient and treatment factors most strongly predict 30-day hospital readmission among diabetic patients?

Q2. Is HbA1c testing during admission associated with lower readmission rates across age and race groups?

Q3. Does mean length of stay differ significantly across admission types?

This HTML document is the interactive companion to the APA-formatted Word report. Every figure below is rendered with Plotly, so the reader can hover over points to inspect counts and confidence intervals, click legend items to toggle groups, and zoom into regions of interest.

2 Packages and Seed

pkgs <- c("tidyverse", "janitor", "broom", "psych",
          "rsample", "pROC", "scales", "binom",
          "plotly", "DT", "knitr", "kableExtra")
to_install <- setdiff(pkgs, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install)
invisible(lapply(pkgs, library, character.only = TRUE))

set.seed(42)

# Shared aesthetic for ggplot objects we feed to Plotly
theme_paper <- function() {
  theme_minimal(base_size = 11) +
    theme(plot.title.position = "plot",
          plot.title          = element_text(face = "bold"),
          plot.subtitle       = element_text(color = "grey45", face = "italic"),
          panel.grid.minor    = element_blank(),
          axis.text           = element_text(color = "grey25"),
          legend.position     = "top")
}

3 Load and Clean

We treat the literal "?" as missing, but keep the literal string "None" because in this dataset it is the explicit code for “HbA1c test not ordered” — not missingness. Conflating the two would erase exactly the signal that Q2 asks about.

df_raw <- read_csv("diabetic_data.csv",
                   na = c("?", ""),
                   show_col_types = FALSE)

df <- df_raw %>%
  select(-weight, -payer_code, -medical_specialty) %>%
  filter(!discharge_disposition_id %in% c(11, 13, 14, 19, 20, 21)) %>%
  arrange(encounter_id) %>%
  distinct(patient_nbr, .keep_all = TRUE) %>%
  filter(!is.na(race), !is.na(diag_1), gender != "Unknown/Invalid") %>%
  mutate(
    readmit_30 = if_else(readmitted == "<30", 1L, 0L),
    A1C_tested = if_else(A1Cresult == "None" | is.na(A1Cresult),
                         "Not tested", "Tested"),
    age = factor(age,
                 levels = c("[0-10)","[10-20)","[20-30)","[30-40)",
                            "[40-50)","[50-60)","[60-70)","[70-80)",
                            "[80-90)","[90-100)"))
  )

c(raw_rows     = nrow(df_raw),
  cleaned_rows = nrow(df),
  readmit_rate = round(mean(df$readmit_30), 4))

##     raw_rows cleaned_rows readmit_rate 
##  101766.0000   68061.0000       0.0902

After cleaning we retain 68,061 unique patients, with an observed 30-day readmission rate of 9.02% — closely matching the 8.9% baseline reported in Strack et al. (2014).

3.1 Figure 1 — Pre-Cleaning Missingness Profile

3.1.1 Interactive Plot

miss_df <- df_raw %>%
  summarise(across(everything(), ~ mean(is.na(.)) * 100)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "pct") %>%
  filter(variable %in% c("weight","payer_code","medical_specialty",
                         "race","diag_3","diag_2","diag_1")) %>%
  mutate(variable = factor(variable,
                           levels = c("diag_1","diag_2","diag_3","race",
                                      "medical_specialty","payer_code","weight")),
         flag = case_when(pct > 30 ~ "Drop (>30%)",
                          pct > 5  ~ "Caution",
                          TRUE     ~ "Keep"),
         tooltip = sprintf("<b>%s</b><br>Missing: %.1f%%<br>Decision: %s",
                           variable, pct, flag))

p1 <- plot_ly(miss_df,
              x = ~pct, y = ~variable, type = "bar",
              orientation = "h",
              color = ~flag,
              colors = c("Drop (>30%)" = "#D1495B",
                         "Caution"     = "#E9C46A",
                         "Keep"        = "#7A9E7E"),
              text = ~tooltip, hoverinfo = "text",
              marker = list(line = list(color = "grey20", width = 0.5))) %>%
  add_annotations(x = miss_df$pct, y = miss_df$variable,
                  text = sprintf("%.1f%%", miss_df$pct),
                  xanchor = "left", showarrow = FALSE,
                  font = list(size = 11, color = "grey25"),
                  xshift = 4) %>%
  layout(title = list(text = "<b>Pre-Cleaning Missingness Profile</b><br><sup>Click legend to toggle; hover bars for details</sup>",
                      x = 0, xanchor = "left"),
         xaxis = list(title = "Missingness (%)", range = c(0, 110),
                      gridcolor = "grey92"),
         yaxis = list(title = "", gridcolor = "white"),
         shapes = list(list(type = "line",
                            x0 = 30, x1 = 30, y0 = -0.5, y1 = 6.5,
                            line = list(dash = "dash", color = "#D1495B", width = 2))),
         legend = list(orientation = "h", x = 0, y = -0.15),
         hovermode = "closest")
p1

3.1.2 Data Table

miss_df %>%
  transmute(variable, missing_pct = round(pct, 2), decision = flag) %>%
  arrange(desc(missing_pct)) %>%
  DT::datatable(caption = "Missingness percentage and disposition",
                options = list(pageLength = 10, dom = 'tip'),
                rownames = FALSE)

Interpretation: We removed three columns whose missingness exceeded a 30% threshold beyond which imputation produces more noise than signal: weight (97% missing), payer code (40%), and medical specialty (49%).

3.2 Figure 2 — Outcome Distribution

3.2.1 Interactive Donut Chart

out_tbl <- df %>%
  count(readmitted) %>%
  mutate(label = case_when(readmitted == "NO"  ~ "No readmission",
                            readmitted == ">30" ~ "Readmitted >30 days",
                            readmitted == "<30" ~ "Readmitted <30 days"),
         pct   = n / sum(n) * 100,
         tooltip = sprintf("<b>%s</b><br>Patients: %s<br>Percentage: %.1f%%",
                           label, scales::comma(n), pct))

p2 <- plot_ly(out_tbl,
              labels = ~label, values = ~n, type = "pie", hole = 0.55,
              marker = list(colors = c("#7A9E7E", "#D1495B", "#E9C46A"),
                            line   = list(color = "white", width = 2)),
              text = ~tooltip, hoverinfo = "text",
              textinfo = "label+percent",
              textposition = "outside") %>%
  layout(title = list(text = "<b>Outcome Distribution (N = 68,061)</b><br><sup>Click to isolate segments; double-click to reset</sup>",
                      x = 0, xanchor = "left"),
         annotations = list(list(text = "<b>9.02%</b><br>30-day<br>readmit",
                                  x = 0.5, y = 0.5, 
                                  font = list(size = 20, color = "#D1495B"),
                                  showarrow = FALSE)),
         legend = list(orientation = "h", x = 0, y = -0.05),
         showlegend = TRUE)
p2

3.2.2 Summary Statistics

out_tbl %>%
  transmute(category = label, n, percent = round(pct, 2)) %>%
  kable(caption = "Outcome categories in the cleaned cohort",
        col.names = c("Category", "Count", "Percentage (%)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Outcome categories in the cleaned cohort
Category	Count	Percentage (%)
Readmitted <30 days	6142	9.02
Readmitted >30 days	21789	32.01
No readmission	40130	58.96

Key Finding: Although the headline number is 9% within thirty days, roughly one in three diabetic patients in this cohort eventually return to inpatient care. That ratio is what gives the readmission question its policy weight.

4 Descriptive Statistics

num_vars <- c("time_in_hospital", "num_lab_procedures", "num_procedures",
              "num_medications", "number_outpatient", "number_emergency",
              "number_inpatient", "number_diagnoses")

desc_tbl <- df %>%
  select(all_of(num_vars)) %>%
  psych::describe() %>%
  as.data.frame() %>%
  rownames_to_column("variable") %>%
  select(variable, n, mean, sd, min, max) %>%
  mutate(across(c(mean, sd), ~ round(., 2)))

DT::datatable(desc_tbl,
              caption = "Table 1. Descriptive statistics for numeric predictors",
              options = list(pageLength = 10, dom = 'tip'),
              rownames = FALSE) %>%
  DT::formatStyle(columns = 1:6, fontSize = '95%')

Notable Finding: The mean of 16 medications per encounter shows that polypharmacy is the rule, not the exception, in this population. HbA1c was recorded for only 18.3% of encounters; the remaining 81.7% are coded “None” (test not ordered).

5 Chi-Square Tests of Independence

Chi-square tests are appropriate here because every cell of each 2-way contingency table has expected count above five — comfortably above the standard rule-of-thumb threshold.

5.1 A1C Testing × 30-Day Readmission

chi_a1c <- chisq.test(table(df$A1C_tested, df$readmit_30))
chi_a1c

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(df$A1C_tested, df$readmit_30)
## X-squared = 5.3875, df = 1, p-value = 0.02028

phi_a1c <- sqrt(chi_a1c$statistic / sum(chi_a1c$observed))
cat("phi (effect size):", round(phi_a1c, 4), "\n")

## phi (effect size): 0.0089

df %>%
  group_by(A1C_tested) %>%
  summarise(n = n(),
            readmit_rate_pct = round(mean(readmit_30) * 100, 2)) %>%
  kable(caption = "30-day readmission rate by A1C testing status",
        col.names = c("A1C Testing", "Count", "Readmit Rate (%)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

30-day readmission rate by A1C testing status
A1C Testing	Count	Readmit Rate (%)
Not tested	55599	9.15
Tested	12462	8.48

Result: χ²(1, N = 68,061) = 5.39, p = .020. The φ effect size of 0.009 indicates a real but practically tiny effect — patients with A1C testing had 8.48% vs. 9.15% readmission.

5.2 Age × 30-Day Readmission

chi_age <- chisq.test(table(df$age, df$readmit_30))
chi_age

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$age, df$readmit_30)
## X-squared = 185.14, df = 9, p-value < 2.2e-16

Result: Age is the strongest categorical predictor, χ²(9) = 185.14, p < .001.

5.3 Race × 30-Day Readmission

chi_race <- chisq.test(table(df$race, df$readmit_30))
chi_race

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$race, df$readmit_30)
## X-squared = 12.235, df = 4, p-value = 0.01569

Result: Race shows a modest but reliable association, χ²(4) = 12.24, p = .016.

5.4 Figure 3 — Readmission Rate by Age × A1C Testing

This is the central visualization for Q2. Hover over any point to see the exact rate, sample size, and 95% Wilson confidence interval for that cell.

5.4.1 Interactive Plot

plot_df <- df %>%
  group_by(age, A1C_tested) %>%
  summarise(n = n(), k = sum(readmit_30), .groups = "drop") %>%
  filter(n >= 30) %>%
  rowwise() %>%
  mutate(ci = list(binom::binom.wilson(k, n))) %>%
  unnest(ci, names_sep = "_") %>%  # <<<--- FIX: Added names_sep
  mutate(rate_pct  = ci_mean * 100,    # <<<--- FIX: Changed to ci_mean
         lo_pct    = ci_lower * 100,   # <<<--- FIX: Changed to ci_lower
         hi_pct    = ci_upper * 100,   # <<<--- FIX: Changed to ci_upper
         tooltip   = sprintf(
           "<b>Age: %s</b><br>Group: %s<br>Encounters: %s<br>Rate: %.2f%%<br>95%% CI: [%.2f%%, %.2f%%]",
           age, A1C_tested, scales::comma(n), rate_pct, lo_pct, hi_pct))

overall <- mean(df$readmit_30) * 100

g3 <- ggplot(plot_df,
             aes(x = age, y = rate_pct,
                 color = A1C_tested, shape = A1C_tested,
                 text = tooltip, group = A1C_tested)) +
  geom_hline(yintercept = overall, linetype = "dashed",
             color = "grey50", linewidth = 0.4) +
  geom_errorbar(aes(ymin = lo_pct, ymax = hi_pct),
                position = position_dodge(width = 0.45),
                width = 0.18, linewidth = 0.7) +
  geom_point(aes(size = n),
             position = position_dodge(width = 0.45),
             stroke = 1.1, fill = "white") +
  scale_color_manual(values = c("Tested"     = "#1F6FB4",
                                "Not tested" = "#D1495B")) +
  scale_shape_manual(values = c("Tested" = 16, "Not tested" = 15)) +
  scale_size_continuous(range = c(3, 8), guide = "none") +
  scale_y_continuous(labels = scales::label_percent(scale = 1),
                     limits = c(0, NA)) +
  labs(title    = "30-Day Readmission Rate by Age and HbA1c Testing",
       subtitle = "Dot size ∝ encounter count; bars = 95% Wilson CIs",
       x = "Age group", y = "30-day readmission rate",
       color = "A1C status", shape = "A1C status") +
  theme_paper() +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

ggplotly(g3, tooltip = "text") %>%
  layout(title = list(text = "<b>30-Day Readmission Rate by Age × HbA1c Testing</b><br><sup>Hover for details; click legend to toggle groups</sup>",
                      x = 0),
         legend = list(orientation = "h", x = 0, y = 1.12),
         hovermode = "closest")

5.4.2 Data Table

plot_df %>%
  transmute(age, group = A1C_tested, n,
            rate_pct  = round(rate_pct, 2),
            lo_pct    = round(lo_pct, 2),
            hi_pct    = round(hi_pct, 2)) %>%
  arrange(age, group) %>%
  DT::datatable(caption = "Readmission rate with 95% Wilson CI by age × A1C",
                options = list(pageLength = 10),
                rownames = FALSE,
                colnames = c("Age", "A1C Group", "Count", 
                            "Rate (%)", "CI Lower (%)", "CI Upper (%)"))

Key Insight: Readmission risk climbs monotonically with age, plateauing near 11% in the 80+ group. The A1C-tested vs. not-tested gap is small in any age band, with overlapping confidence intervals. Age, not testing behavior, is the dominant signal.

6 One-Way ANOVA: Length of Stay × Admission Type

sub <- df %>%
  filter(admission_type_id %in% 1:3) %>%
  mutate(adm = factor(admission_type_id,
                      labels = c("Emergency", "Urgent", "Elective")))

anova_fit <- aov(time_in_hospital ~ adm, data = sub)
summary(anova_fit)

##                Df Sum Sq Mean Sq F value Pr(>F)    
## adm             2    911   455.5   52.88 <2e-16 ***
## Residuals   60259 519096     8.6                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TukeyHSD(anova_fit)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = time_in_hospital ~ adm, data = sub)
## 
## $adm
##                          diff         lwr        upr p adj
## Urgent-Emergency    0.1648989  0.09247632  0.2373214 3e-07
## Elective-Emergency -0.2093340 -0.27921032 -0.1394577 0e+00
## Elective-Urgent    -0.3742329 -0.46021101 -0.2882547 0e+00

sub %>%
  group_by(adm) %>%
  summarise(n    = n(),
            mean = round(mean(time_in_hospital), 3),
            sd   = round(sd(time_in_hospital), 3)) %>%
  kable(caption = "Length of stay (days) by admission type",
        col.names = c("Admission Type", "Count", "Mean (days)", "SD")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Length of stay (days) by admission type
Admission Type	Count	Mean (days)	SD
Emergency	34596	4.298	2.902
Urgent	12204	4.463	3.012
Elective	13462	4.089	2.948

Result: F(2, 60,259) = 52.88, p < .001. Every pairwise contrast is significant, but the absolute spread is < 0.4 days, so admission type is a weak driver of resource use.

6.1 Figure 4 — Length-of-Stay Distribution

6.1.1 Interactive Violin Plot

adm_levels <- c("Elective","Emergency","Urgent")
sub2 <- sub %>% mutate(adm = factor(adm, levels = adm_levels))

p4 <- plot_ly()
pal <- c("Elective"="#7A9E7E","Emergency"="#E07A5F","Urgent"="#3D5A80")
for (lev in adm_levels) {
  vals <- sub2$time_in_hospital[sub2$adm == lev]
  p4 <- add_trace(p4, y = vals, type = "violin", name = lev,
                  box = list(visible = TRUE),
                  meanline = list(visible = TRUE),
                  points = "outliers",
                  line = list(color = "grey25"),
                  fillcolor = pal[[lev]], opacity = 0.6,
                  hoverinfo = "y+name")
}
p4 <- p4 %>%
  layout(title = list(text = paste0(
           "<b>Length of Stay Distribution Across Admission Types</b><br>",
           "<sup>F(2, 60259) = 52.88, p < .001; Tukey: Δ = +0.21 (Eme−Ele), +0.17 (Urg−Eme), +0.37 (Urg−Ele) days</sup>"),
                      x = 0, xanchor = "left"),
         yaxis = list(title = "Time in hospital (days)",
                      range = c(0, 16), gridcolor = "grey92"),
         xaxis = list(title = "Admission type"),
         showlegend = TRUE,
         legend = list(orientation = "h", x = 0, y = -0.15),
         hovermode = "closest")
p4

6.1.2 Summary Table

sub2 %>%
  group_by(adm) %>%
  summarise(n = n(),
            mean = round(mean(time_in_hospital), 2),
            median = round(median(time_in_hospital), 2),
            sd = round(sd(time_in_hospital), 2)) %>%
  kable(caption = "Length of stay (days) by admission type",
        col.names = c("Type", "Count", "Mean", "Median", "SD")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Length of stay (days) by admission type
Type	Count	Mean	Median	SD
Elective	13462	4.09	3	2.95
Emergency	34596	4.30	4	2.90
Urgent	12204	4.46	4	3.01

7 Preliminary Logistic Regression

mdf <- df %>%
  mutate(admission_type = factor(admission_type_id)) %>%
  select(readmit_30,
         time_in_hospital, num_lab_procedures, num_procedures,
         num_medications, number_outpatient, number_emergency,
         number_inpatient, number_diagnoses,
         age, race, gender, A1C_tested, change, diabetesMed,
         insulin, admission_type) %>%
  drop_na()

split <- rsample::initial_split(mdf, prop = 0.7, strata = readmit_30)
train <- rsample::training(split)
test  <- rsample::testing(split)

fit <- glm(readmit_30 ~ ., data = train, family = binomial)
test$prob <- predict(fit, newdata = test, type = "response")
auc_val   <- pROC::auc(test$readmit_30, test$prob)
cat("Held-out AUC:", round(as.numeric(auc_val), 4), "\n")

## Held-out AUC: 0.6025

The unregularized model achieves an AUC of 0.60 on a held-out 30% sample.

num_features <- c("time_in_hospital","num_lab_procedures","num_procedures",
                  "num_medications","number_outpatient","number_emergency",
                  "number_inpatient","number_diagnoses")

fit_z <- glm(readmit_30 ~ scale(time_in_hospital) + scale(num_lab_procedures) +
                          scale(num_procedures) + scale(num_medications) +
                          scale(number_outpatient) + scale(number_emergency) +
                          scale(number_inpatient) + scale(number_diagnoses),
             data = train, family = binomial)

forest_df <- broom::tidy(fit_z, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  mutate(label = recode(term,
    "scale(time_in_hospital)"   = "Length of stay",
    "scale(num_lab_procedures)" = "# Lab procedures",
    "scale(num_procedures)"     = "# Procedures",
    "scale(num_medications)"    = "# Medications",
    "scale(number_outpatient)"  = "Prior outpatient visits",
    "scale(number_emergency)"   = "Prior ER visits",
    "scale(number_inpatient)"   = "Prior inpatient visits",
    "scale(number_diagnoses)"   = "# Diagnoses"
  ),
  direction = if_else(estimate >= 1, "Risk-increasing", "Protective")) %>%
  arrange(estimate)

7.1 Figure 5 — Forest Plot

7.1.1 Interactive Forest Plot

fdf <- forest_df %>% mutate(
  label_ord = factor(label, levels = label),
  tooltip   = sprintf("<b>%s</b><br>OR = %.3f<br>95%% CI: [%.3f, %.3f]<br>p = %.4f",
                      label, estimate, conf.low, conf.high, p.value))

p5 <- plot_ly(fdf, y = ~label_ord, x = ~estimate,
              error_x = list(type = "data",
                             symmetric = FALSE,
                             array      = ~conf.high - estimate,
                             arrayminus = ~estimate - conf.low,
                             color = "grey40", thickness = 2),
              type = "scatter", mode = "markers",
              marker = list(size = 14, line = list(width = 2, color = "white")),
              color = ~direction,
              colors = c("Risk-increasing" = "#1F6FB4",
                         "Protective"      = "#D1495B"),
              text = ~tooltip, hoverinfo = "text") %>%
  layout(title = list(text = "<b>Forest Plot — Standardized Odds Ratios</b><br><sup>OR per 1 SD increase; hover for 95% CI and p-value</sup>",
                      x = 0, xanchor = "left"),
         xaxis = list(title = "Odds ratio (per 1 SD increase)",
                      range = c(0.88, 1.36), gridcolor = "grey92"),
         yaxis = list(title = "", gridcolor = "white"),
         shapes = list(list(type = "line",
                            x0 = 1, x1 = 1, y0 = -0.5, y1 = 7.5,
                            line = list(dash = "dash", color = "grey55", width = 2))),
         legend = list(orientation = "h", x = 0, y = -0.20),
         hovermode = "closest")
p5

7.1.2 Coefficients Table

forest_df %>%
  transmute(predictor = label,
            OR_per_SD = round(estimate, 3),
            ci_low    = round(conf.low, 3),
            ci_high   = round(conf.high, 3),
            p_value   = signif(p.value, 3)) %>%
  arrange(desc(OR_per_SD)) %>%
  DT::datatable(caption = "Standardized odds ratios with 95% CI",
                options = list(pageLength = 10),
                rownames = FALSE,
                colnames = c("Predictor", "OR (per 1 SD)", 
                            "CI Lower", "CI Upper", "p-value"))

Key Finding: Prior inpatient visits (OR = 1.23) is the strongest predictor, followed by length of stay (OR = 1.14) and # diagnoses (OR = 1.12). Number of procedures is protective (OR = 0.93).

7.2 Figure 6 — ROC Curve

7.2.1 Interactive ROC

roc_obj <- pROC::roc(test$readmit_30, test$prob, quiet = TRUE)
roc_df  <- data.frame(fpr = 1 - roc_obj$specificities,
                      tpr = roc_obj$sensitivities,
                      thr = roc_obj$thresholds) %>% arrange(fpr)

target_thr <- c(0.10, 0.20, 0.30)
markers <- map_dfr(target_thr, function(t) {
  i <- which.min(abs(roc_df$thr - t))
  data.frame(fpr = roc_df$fpr[i], tpr = roc_df$tpr[i],
             label = sprintf("<b>Threshold = %.2f</b><br>FPR = %.3f<br>TPR = %.3f",
                             t, roc_df$fpr[i], roc_df$tpr[i]))
})

p6 <- plot_ly() %>%
  add_trace(data = roc_df, x = ~fpr, y = ~tpr,
            type = "scatter", mode = "lines",
            line = list(color = "#1F6FB4", width = 3),
            fill = "tozeroy", fillcolor = "rgba(31, 111, 180, 0.15)",
            name = sprintf("Logistic (AUC = %.3f)", as.numeric(auc_val)),
            hovertemplate = "<b>Model Performance</b><br>FPR: %{x:.3f}<br>TPR: %{y:.3f}<extra></extra>") %>%
  add_trace(x = c(0, 1), y = c(0, 1), type = "scatter", mode = "lines",
            line = list(dash = "dash", color = "grey55", width = 2),
            name = "Chance (AUC = 0.500)", hoverinfo = "skip") %>%
  add_trace(data = markers, x = ~fpr, y = ~tpr,
            type = "scatter", mode = "markers",
            marker = list(size = 13, color = "#D1495B",
                          line = list(width = 2, color = "white")),
            text = ~label, hoverinfo = "text",
            name = "Decision Thresholds") %>%
  layout(title = list(text = sprintf(
           "<b>ROC Curve — Held-out 30%%</b><br><sup>N_test = %s; baseline = %.1f%%; hover curve and markers for details</sup>",
           scales::comma(nrow(test)), mean(test$readmit_30) * 100),
                      x = 0, xanchor = "left"),
         xaxis = list(title = "False positive rate (1 − specificity)",
                      range = c(-0.02, 1.02), gridcolor = "grey92"),
         yaxis = list(title = "True positive rate (sensitivity)",
                      range = c(-0.02, 1.02), gridcolor = "grey92"),
         legend = list(x = 0.55, y = 0.12),
         hovermode = "closest")
p6

7.2.2 Confusion Matrix

pred <- as.integer(test$prob >= 0.5)
conf_mat <- table(Actual = test$readmit_30, Predicted = pred)
conf_mat

##       Predicted
## Actual     0     1
##      0 18527     4
##      1  1883     5

# Calculate metrics
tn <- conf_mat[1,1]; fp <- conf_mat[1,2]
fn <- conf_mat[2,1]; tp <- conf_mat[2,2]
accuracy <- (tp + tn) / sum(conf_mat)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

cat(sprintf("\nAccuracy: %.3f\nSensitivity: %.3f\nSpecificity: %.3f\n",
            accuracy, sensitivity, specificity))

## 
## Accuracy: 0.908
## Sensitivity: 0.003
## Specificity: 1.000

Interpretation: The modest AUC of 0.60 indicates weak discrimination, motivating the regularized models and interaction terms in Module 6.

8 An Interaction Worth Highlighting

Patients with 3+ prior inpatient visits run readmission rates of 25–30% — triple the population mean — regardless of how long they stay. Length of stay only matters for patients with no prior admissions.

8.1 Figure 7 — Risk Grid

8.1.1 Interactive Heatmap

df_grid <- df %>%
  mutate(los_bin = cut(time_in_hospital,
                       breaks = c(0, 2, 4, 6, 8, 14),
                       labels = c("1-2","3-4","5-6","7-8","9-14")),
         inp_bin = cut(number_inpatient,
                       breaks = c(-0.1, 0.5, 1.5, 2.5, 12.5),
                       labels = c("0","1","2","3+"))) %>%
  group_by(inp_bin, los_bin) %>%
  summarise(n = n(), rate = mean(readmit_30) * 100, .groups = "drop") %>%
  mutate(label   = sprintf("%.1f%%<br>n = %s", rate, scales::comma(n)),
         tooltip = sprintf(
           "<b>Prior Inpatient: %s</b><br>Length of Stay: %s days<br>Encounters: %s<br><b>Readmission Rate: %.2f%%</b>",
           inp_bin, los_bin, scales::comma(n), rate))

heat_mat <- df_grid %>%
  select(inp_bin, los_bin, rate) %>%
  pivot_wider(names_from = los_bin, values_from = rate) %>%
  column_to_rownames("inp_bin") %>%
  as.matrix()
text_mat <- df_grid %>%
  select(inp_bin, los_bin, label) %>%
  pivot_wider(names_from = los_bin, values_from = label) %>%
  column_to_rownames("inp_bin") %>%
  as.matrix()
hover_mat <- df_grid %>%
  select(inp_bin, los_bin, tooltip) %>%
  pivot_wider(names_from = los_bin, values_from = tooltip) %>%
  column_to_rownames("inp_bin") %>%
  as.matrix()

p7 <- plot_ly(x = colnames(heat_mat),
              y = rownames(heat_mat),
              z = heat_mat,
              type = "heatmap",
              colorscale = list(c(0, "#3D5A80"),
                                c(0.25, "#98C1D9"),
                                c(0.5, "#FAF3DD"),
                                c(0.75, "#E07A5F"),
                                c(1, "#9D2828")),
              text = hover_mat, hoverinfo = "text",
              colorbar = list(title = "Readmit<br>Rate (%)", 
                             titleside = "right")) %>%
  add_annotations(x = rep(colnames(heat_mat), each = nrow(heat_mat)),
                  y = rep(rownames(heat_mat), times = ncol(heat_mat)),
                  text = as.vector(text_mat), showarrow = FALSE,
                  font = list(size = 12, color = "white")) %>%
  layout(title = list(text = "<b>Risk Grid — Prior Admissions × Length of Stay</b><br><sup>Hover cells for details; deep red = highest risk</sup>",
                      x = 0, xanchor = "left"),
         xaxis = list(title = "Length of stay (days)", side = "bottom"),
         yaxis = list(title = "Prior inpatient visits (past year)"),
         hovermode = "closest")
p7

8.1.2 Risk Grid Table

df_grid %>%
  transmute(prior_inpatient = inp_bin,
            length_of_stay  = los_bin,
            encounters = n,
            readmission_rate_pct = round(rate, 2)) %>%
  DT::datatable(caption = "Cell-level counts and readmission rates",
                options = list(pageLength = 20),
                rownames = FALSE,
                colnames = c("Prior Admissions", "LOS (days)", 
                            "Encounters", "Readmit Rate (%)")) %>%
  DT::formatStyle('readmission_rate_pct',
                  background = DT::styleColorBar(range(df_grid$rate), '#D1495B'),
                  backgroundSize = '100% 80%',
                  backgroundRepeat = 'no-repeat',
                  backgroundPosition = 'center')

Critical Finding: This non-additive interaction (visible in the 3+ row) motivates explicit interaction terms in Module 6’s regularized models.

9 Discussion and Next Steps

The preliminary results give tentative answers to three research questions:

Q1. Age, prior inpatient utilization, length of stay, and number of diagnoses are the strongest signals; medication-change and HbA1c testing exert modest influence.
Q2. HbA1c testing reduces readmission by ~0.7 percentage points — real but practically small.
Q3. Length of stay differs across admission types but the effect is < 0.4 days.

Three patterns shaping Module 6:

Polypharmacy saturation: Average encounter involves 16 medications and 7 diagnoses
History dominates: Prior admissions predict better than current vitals
Non-linear interaction: Figure 7 shows 3+ prior visits → 25–30% risk regardless of LOS

Module 6 Plan: Ridge/Lasso with cv.glmnet, explicit interaction terms, VIF checks, and held-out validation.

10 Conclusion

The cleaned dataset (N = 68,061, 9.02% readmission) produces interpretable, defensible results. Clinical complexity and prior utilization drive most explainable variation; glycemic management matters at the margin. Module 6 will convert these findings into a regularized, validated model to flag high-risk patients before discharge.

11 References

Centers for Medicare & Medicaid Services. (2023). Hospital Readmissions Reduction Program (HRRP). https://www.cms.gov/medicare/quality-initiatives-patient-assessment-instruments/value-based-programs/hrrp/hospital-readmissions-reduction-program-hrrp

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22. https://doi.org/10.18637/jss.v033.i01

Jencks, S. F., Williams, M. V., & Coleman, E. A. (2009). Rehospitalizations among patients in the Medicare fee-for-service program. New England Journal of Medicine, 360(14), 1418–1428. https://doi.org/10.1056/NEJMsa0803563

Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Müller, M. (2011). pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77. https://doi.org/10.1186/1471-2105-12-77

Strack, B., DeShazo, J. P., Gennings, C., Olmo, J. L., Ventura, S., Cios, K. J., & Clore, J. N. (2014). Impact of HbA1c measurement on hospital readmission rates: Analysis of 70,000 clinical database patient records. BioMed Research International, 2014, 781670. https://doi.org/10.1155/2014/781670

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Predicting 30-Day Hospital Readmission in Diabetic Patients

An Initial Statistical Analysis of 100,000 Clinical Encounters

Anush Goel, Mandeep Kaur, and Sandeep Kaur

April 26, 2026