When a patient with diabetes leaves the hospital, the next month is, statistically, the most fragile stretch of their care. The Centers for Medicare and Medicaid Services have used 30-day readmission rates as a quality metric since 2012, and unplanned readmissions cost the U.S. health system roughly twenty-six billion dollars a year (Jencks et al., 2009). Diabetic inpatients are over-represented in those numbers because the same admission that brought them in — unstable glucose, fluid balance, infection — rarely resolves cleanly in three or four days.
This Module 4 notebook executes the analytic plan from our Module 3 proposal: clean the publicly available Diabetes 130-US Hospitals dataset (Strack et al., 2014), produce descriptive statistics, run our first wave of inferential tests — three chi-square tests of independence and a one-way ANOVA with Tukey HSD post-hoc — and fit a preliminary logistic regression. The regularized models (Ridge / Lasso) come in Module 6.
Research questions answered tentatively here
- Q1. Which patient and treatment factors most strongly predict 30-day hospital readmission among diabetic patients?
- Q2. Is HbA1c testing during admission associated with lower readmission rates across age and race groups?
- Q3. Does mean length of stay differ significantly across admission types?
This HTML document is the interactive companion to the APA-formatted Word report. Every figure below is rendered with Plotly, so the reader can hover over points to inspect counts and confidence intervals, click legend items to toggle groups, and zoom into regions of interest.
pkgs <- c("tidyverse", "janitor", "broom", "psych",
"rsample", "pROC", "scales", "binom",
"plotly", "DT", "knitr", "kableExtra")
to_install <- setdiff(pkgs, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install)
invisible(lapply(pkgs, library, character.only = TRUE))
set.seed(42)
# Shared aesthetic for ggplot objects we feed to Plotly
theme_paper <- function() {
theme_minimal(base_size = 11) +
theme(plot.title.position = "plot",
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(color = "grey45", face = "italic"),
panel.grid.minor = element_blank(),
axis.text = element_text(color = "grey25"),
legend.position = "top")
}We treat the literal "?" as missing, but keep
the literal string "None" because in this dataset
it is the explicit code for “HbA1c test not ordered” — not
missingness. Conflating the two would erase exactly the signal that Q2
asks about.
df_raw <- read_csv("diabetic_data.csv",
na = c("?", ""),
show_col_types = FALSE)
df <- df_raw %>%
select(-weight, -payer_code, -medical_specialty) %>%
filter(!discharge_disposition_id %in% c(11, 13, 14, 19, 20, 21)) %>%
arrange(encounter_id) %>%
distinct(patient_nbr, .keep_all = TRUE) %>%
filter(!is.na(race), !is.na(diag_1), gender != "Unknown/Invalid") %>%
mutate(
readmit_30 = if_else(readmitted == "<30", 1L, 0L),
A1C_tested = if_else(A1Cresult == "None" | is.na(A1Cresult),
"Not tested", "Tested"),
age = factor(age,
levels = c("[0-10)","[10-20)","[20-30)","[30-40)",
"[40-50)","[50-60)","[60-70)","[70-80)",
"[80-90)","[90-100)"))
)
c(raw_rows = nrow(df_raw),
cleaned_rows = nrow(df),
readmit_rate = round(mean(df$readmit_30), 4))## raw_rows cleaned_rows readmit_rate
## 101766.0000 68061.0000 0.0902
After cleaning we retain 68,061 unique patients, with an observed 30-day readmission rate of 9.02% — closely matching the 8.9% baseline reported in Strack et al. (2014).
miss_df <- df_raw %>%
summarise(across(everything(), ~ mean(is.na(.)) * 100)) %>%
pivot_longer(everything(), names_to = "variable", values_to = "pct") %>%
filter(variable %in% c("weight","payer_code","medical_specialty",
"race","diag_3","diag_2","diag_1")) %>%
mutate(variable = factor(variable,
levels = c("diag_1","diag_2","diag_3","race",
"medical_specialty","payer_code","weight")),
flag = case_when(pct > 30 ~ "Drop (>30%)",
pct > 5 ~ "Caution",
TRUE ~ "Keep"),
tooltip = sprintf("<b>%s</b><br>Missing: %.1f%%<br>Decision: %s",
variable, pct, flag))
p1 <- plot_ly(miss_df,
x = ~pct, y = ~variable, type = "bar",
orientation = "h",
color = ~flag,
colors = c("Drop (>30%)" = "#D1495B",
"Caution" = "#E9C46A",
"Keep" = "#7A9E7E"),
text = ~tooltip, hoverinfo = "text",
marker = list(line = list(color = "grey20", width = 0.5))) %>%
add_annotations(x = miss_df$pct, y = miss_df$variable,
text = sprintf("%.1f%%", miss_df$pct),
xanchor = "left", showarrow = FALSE,
font = list(size = 11, color = "grey25"),
xshift = 4) %>%
layout(title = list(text = "<b>Pre-Cleaning Missingness Profile</b><br><sup>Click legend to toggle; hover bars for details</sup>",
x = 0, xanchor = "left"),
xaxis = list(title = "Missingness (%)", range = c(0, 110),
gridcolor = "grey92"),
yaxis = list(title = "", gridcolor = "white"),
shapes = list(list(type = "line",
x0 = 30, x1 = 30, y0 = -0.5, y1 = 6.5,
line = list(dash = "dash", color = "#D1495B", width = 2))),
legend = list(orientation = "h", x = 0, y = -0.15),
hovermode = "closest")
p1miss_df %>%
transmute(variable, missing_pct = round(pct, 2), decision = flag) %>%
arrange(desc(missing_pct)) %>%
DT::datatable(caption = "Missingness percentage and disposition",
options = list(pageLength = 10, dom = 'tip'),
rownames = FALSE)Interpretation: We removed three columns whose missingness exceeded a 30% threshold beyond which imputation produces more noise than signal: weight (97% missing), payer code (40%), and medical specialty (49%).
out_tbl <- df %>%
count(readmitted) %>%
mutate(label = case_when(readmitted == "NO" ~ "No readmission",
readmitted == ">30" ~ "Readmitted >30 days",
readmitted == "<30" ~ "Readmitted <30 days"),
pct = n / sum(n) * 100,
tooltip = sprintf("<b>%s</b><br>Patients: %s<br>Percentage: %.1f%%",
label, scales::comma(n), pct))
p2 <- plot_ly(out_tbl,
labels = ~label, values = ~n, type = "pie", hole = 0.55,
marker = list(colors = c("#7A9E7E", "#D1495B", "#E9C46A"),
line = list(color = "white", width = 2)),
text = ~tooltip, hoverinfo = "text",
textinfo = "label+percent",
textposition = "outside") %>%
layout(title = list(text = "<b>Outcome Distribution (N = 68,061)</b><br><sup>Click to isolate segments; double-click to reset</sup>",
x = 0, xanchor = "left"),
annotations = list(list(text = "<b>9.02%</b><br>30-day<br>readmit",
x = 0.5, y = 0.5,
font = list(size = 20, color = "#D1495B"),
showarrow = FALSE)),
legend = list(orientation = "h", x = 0, y = -0.05),
showlegend = TRUE)
p2out_tbl %>%
transmute(category = label, n, percent = round(pct, 2)) %>%
kable(caption = "Outcome categories in the cleaned cohort",
col.names = c("Category", "Count", "Percentage (%)")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Category | Count | Percentage (%) |
|---|---|---|
| Readmitted <30 days | 6142 | 9.02 |
| Readmitted >30 days | 21789 | 32.01 |
| No readmission | 40130 | 58.96 |
Key Finding: Although the headline number is 9% within thirty days, roughly one in three diabetic patients in this cohort eventually return to inpatient care. That ratio is what gives the readmission question its policy weight.
num_vars <- c("time_in_hospital", "num_lab_procedures", "num_procedures",
"num_medications", "number_outpatient", "number_emergency",
"number_inpatient", "number_diagnoses")
desc_tbl <- df %>%
select(all_of(num_vars)) %>%
psych::describe() %>%
as.data.frame() %>%
rownames_to_column("variable") %>%
select(variable, n, mean, sd, min, max) %>%
mutate(across(c(mean, sd), ~ round(., 2)))
DT::datatable(desc_tbl,
caption = "Table 1. Descriptive statistics for numeric predictors",
options = list(pageLength = 10, dom = 'tip'),
rownames = FALSE) %>%
DT::formatStyle(columns = 1:6, fontSize = '95%')Notable Finding: The mean of 16 medications per encounter shows that polypharmacy is the rule, not the exception, in this population. HbA1c was recorded for only 18.3% of encounters; the remaining 81.7% are coded “None” (test not ordered).
Chi-square tests are appropriate here because every cell of each 2-way contingency table has expected count above five — comfortably above the standard rule-of-thumb threshold.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(df$A1C_tested, df$readmit_30)
## X-squared = 5.3875, df = 1, p-value = 0.02028
phi_a1c <- sqrt(chi_a1c$statistic / sum(chi_a1c$observed))
cat("phi (effect size):", round(phi_a1c, 4), "\n")## phi (effect size): 0.0089
df %>%
group_by(A1C_tested) %>%
summarise(n = n(),
readmit_rate_pct = round(mean(readmit_30) * 100, 2)) %>%
kable(caption = "30-day readmission rate by A1C testing status",
col.names = c("A1C Testing", "Count", "Readmit Rate (%)")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| A1C Testing | Count | Readmit Rate (%) |
|---|---|---|
| Not tested | 55599 | 9.15 |
| Tested | 12462 | 8.48 |
Result: χ²(1, N = 68,061) = 5.39, p = .020. The φ effect size of 0.009 indicates a real but practically tiny effect — patients with A1C testing had 8.48% vs. 9.15% readmission.
##
## Pearson's Chi-squared test
##
## data: table(df$age, df$readmit_30)
## X-squared = 185.14, df = 9, p-value < 2.2e-16
Result: Age is the strongest categorical predictor, χ²(9) = 185.14, p < .001.
##
## Pearson's Chi-squared test
##
## data: table(df$race, df$readmit_30)
## X-squared = 12.235, df = 4, p-value = 0.01569
Result: Race shows a modest but reliable association, χ²(4) = 12.24, p = .016.
This is the central visualization for Q2. Hover over any point to see the exact rate, sample size, and 95% Wilson confidence interval for that cell.
plot_df <- df %>%
group_by(age, A1C_tested) %>%
summarise(n = n(), k = sum(readmit_30), .groups = "drop") %>%
filter(n >= 30) %>%
rowwise() %>%
mutate(ci = list(binom::binom.wilson(k, n))) %>%
unnest(ci, names_sep = "_") %>% # <<<--- FIX: Added names_sep
mutate(rate_pct = ci_mean * 100, # <<<--- FIX: Changed to ci_mean
lo_pct = ci_lower * 100, # <<<--- FIX: Changed to ci_lower
hi_pct = ci_upper * 100, # <<<--- FIX: Changed to ci_upper
tooltip = sprintf(
"<b>Age: %s</b><br>Group: %s<br>Encounters: %s<br>Rate: %.2f%%<br>95%% CI: [%.2f%%, %.2f%%]",
age, A1C_tested, scales::comma(n), rate_pct, lo_pct, hi_pct))
overall <- mean(df$readmit_30) * 100
g3 <- ggplot(plot_df,
aes(x = age, y = rate_pct,
color = A1C_tested, shape = A1C_tested,
text = tooltip, group = A1C_tested)) +
geom_hline(yintercept = overall, linetype = "dashed",
color = "grey50", linewidth = 0.4) +
geom_errorbar(aes(ymin = lo_pct, ymax = hi_pct),
position = position_dodge(width = 0.45),
width = 0.18, linewidth = 0.7) +
geom_point(aes(size = n),
position = position_dodge(width = 0.45),
stroke = 1.1, fill = "white") +
scale_color_manual(values = c("Tested" = "#1F6FB4",
"Not tested" = "#D1495B")) +
scale_shape_manual(values = c("Tested" = 16, "Not tested" = 15)) +
scale_size_continuous(range = c(3, 8), guide = "none") +
scale_y_continuous(labels = scales::label_percent(scale = 1),
limits = c(0, NA)) +
labs(title = "30-Day Readmission Rate by Age and HbA1c Testing",
subtitle = "Dot size ∝ encounter count; bars = 95% Wilson CIs",
x = "Age group", y = "30-day readmission rate",
color = "A1C status", shape = "A1C status") +
theme_paper() +
theme(axis.text.x = element_text(angle = 20, hjust = 1))
ggplotly(g3, tooltip = "text") %>%
layout(title = list(text = "<b>30-Day Readmission Rate by Age × HbA1c Testing</b><br><sup>Hover for details; click legend to toggle groups</sup>",
x = 0),
legend = list(orientation = "h", x = 0, y = 1.12),
hovermode = "closest")plot_df %>%
transmute(age, group = A1C_tested, n,
rate_pct = round(rate_pct, 2),
lo_pct = round(lo_pct, 2),
hi_pct = round(hi_pct, 2)) %>%
arrange(age, group) %>%
DT::datatable(caption = "Readmission rate with 95% Wilson CI by age × A1C",
options = list(pageLength = 10),
rownames = FALSE,
colnames = c("Age", "A1C Group", "Count",
"Rate (%)", "CI Lower (%)", "CI Upper (%)"))Key Insight: Readmission risk climbs monotonically with age, plateauing near 11% in the 80+ group. The A1C-tested vs. not-tested gap is small in any age band, with overlapping confidence intervals. Age, not testing behavior, is the dominant signal.
sub <- df %>%
filter(admission_type_id %in% 1:3) %>%
mutate(adm = factor(admission_type_id,
labels = c("Emergency", "Urgent", "Elective")))
anova_fit <- aov(time_in_hospital ~ adm, data = sub)
summary(anova_fit)## Df Sum Sq Mean Sq F value Pr(>F)
## adm 2 911 455.5 52.88 <2e-16 ***
## Residuals 60259 519096 8.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = time_in_hospital ~ adm, data = sub)
##
## $adm
## diff lwr upr p adj
## Urgent-Emergency 0.1648989 0.09247632 0.2373214 3e-07
## Elective-Emergency -0.2093340 -0.27921032 -0.1394577 0e+00
## Elective-Urgent -0.3742329 -0.46021101 -0.2882547 0e+00
sub %>%
group_by(adm) %>%
summarise(n = n(),
mean = round(mean(time_in_hospital), 3),
sd = round(sd(time_in_hospital), 3)) %>%
kable(caption = "Length of stay (days) by admission type",
col.names = c("Admission Type", "Count", "Mean (days)", "SD")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Admission Type | Count | Mean (days) | SD |
|---|---|---|---|
| Emergency | 34596 | 4.298 | 2.902 |
| Urgent | 12204 | 4.463 | 3.012 |
| Elective | 13462 | 4.089 | 2.948 |
Result: F(2, 60,259) = 52.88, p < .001. Every pairwise contrast is significant, but the absolute spread is < 0.4 days, so admission type is a weak driver of resource use.
adm_levels <- c("Elective","Emergency","Urgent")
sub2 <- sub %>% mutate(adm = factor(adm, levels = adm_levels))
p4 <- plot_ly()
pal <- c("Elective"="#7A9E7E","Emergency"="#E07A5F","Urgent"="#3D5A80")
for (lev in adm_levels) {
vals <- sub2$time_in_hospital[sub2$adm == lev]
p4 <- add_trace(p4, y = vals, type = "violin", name = lev,
box = list(visible = TRUE),
meanline = list(visible = TRUE),
points = "outliers",
line = list(color = "grey25"),
fillcolor = pal[[lev]], opacity = 0.6,
hoverinfo = "y+name")
}
p4 <- p4 %>%
layout(title = list(text = paste0(
"<b>Length of Stay Distribution Across Admission Types</b><br>",
"<sup>F(2, 60259) = 52.88, p < .001; Tukey: Δ = +0.21 (Eme−Ele), +0.17 (Urg−Eme), +0.37 (Urg−Ele) days</sup>"),
x = 0, xanchor = "left"),
yaxis = list(title = "Time in hospital (days)",
range = c(0, 16), gridcolor = "grey92"),
xaxis = list(title = "Admission type"),
showlegend = TRUE,
legend = list(orientation = "h", x = 0, y = -0.15),
hovermode = "closest")
p4sub2 %>%
group_by(adm) %>%
summarise(n = n(),
mean = round(mean(time_in_hospital), 2),
median = round(median(time_in_hospital), 2),
sd = round(sd(time_in_hospital), 2)) %>%
kable(caption = "Length of stay (days) by admission type",
col.names = c("Type", "Count", "Mean", "Median", "SD")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Type | Count | Mean | Median | SD |
|---|---|---|---|---|
| Elective | 13462 | 4.09 | 3 | 2.95 |
| Emergency | 34596 | 4.30 | 4 | 2.90 |
| Urgent | 12204 | 4.46 | 4 | 3.01 |
mdf <- df %>%
mutate(admission_type = factor(admission_type_id)) %>%
select(readmit_30,
time_in_hospital, num_lab_procedures, num_procedures,
num_medications, number_outpatient, number_emergency,
number_inpatient, number_diagnoses,
age, race, gender, A1C_tested, change, diabetesMed,
insulin, admission_type) %>%
drop_na()
split <- rsample::initial_split(mdf, prop = 0.7, strata = readmit_30)
train <- rsample::training(split)
test <- rsample::testing(split)
fit <- glm(readmit_30 ~ ., data = train, family = binomial)
test$prob <- predict(fit, newdata = test, type = "response")
auc_val <- pROC::auc(test$readmit_30, test$prob)
cat("Held-out AUC:", round(as.numeric(auc_val), 4), "\n")## Held-out AUC: 0.6025
The unregularized model achieves an AUC of 0.60 on a held-out 30% sample.
num_features <- c("time_in_hospital","num_lab_procedures","num_procedures",
"num_medications","number_outpatient","number_emergency",
"number_inpatient","number_diagnoses")
fit_z <- glm(readmit_30 ~ scale(time_in_hospital) + scale(num_lab_procedures) +
scale(num_procedures) + scale(num_medications) +
scale(number_outpatient) + scale(number_emergency) +
scale(number_inpatient) + scale(number_diagnoses),
data = train, family = binomial)
forest_df <- broom::tidy(fit_z, exponentiate = TRUE, conf.int = TRUE) %>%
filter(term != "(Intercept)") %>%
mutate(label = recode(term,
"scale(time_in_hospital)" = "Length of stay",
"scale(num_lab_procedures)" = "# Lab procedures",
"scale(num_procedures)" = "# Procedures",
"scale(num_medications)" = "# Medications",
"scale(number_outpatient)" = "Prior outpatient visits",
"scale(number_emergency)" = "Prior ER visits",
"scale(number_inpatient)" = "Prior inpatient visits",
"scale(number_diagnoses)" = "# Diagnoses"
),
direction = if_else(estimate >= 1, "Risk-increasing", "Protective")) %>%
arrange(estimate)fdf <- forest_df %>% mutate(
label_ord = factor(label, levels = label),
tooltip = sprintf("<b>%s</b><br>OR = %.3f<br>95%% CI: [%.3f, %.3f]<br>p = %.4f",
label, estimate, conf.low, conf.high, p.value))
p5 <- plot_ly(fdf, y = ~label_ord, x = ~estimate,
error_x = list(type = "data",
symmetric = FALSE,
array = ~conf.high - estimate,
arrayminus = ~estimate - conf.low,
color = "grey40", thickness = 2),
type = "scatter", mode = "markers",
marker = list(size = 14, line = list(width = 2, color = "white")),
color = ~direction,
colors = c("Risk-increasing" = "#1F6FB4",
"Protective" = "#D1495B"),
text = ~tooltip, hoverinfo = "text") %>%
layout(title = list(text = "<b>Forest Plot — Standardized Odds Ratios</b><br><sup>OR per 1 SD increase; hover for 95% CI and p-value</sup>",
x = 0, xanchor = "left"),
xaxis = list(title = "Odds ratio (per 1 SD increase)",
range = c(0.88, 1.36), gridcolor = "grey92"),
yaxis = list(title = "", gridcolor = "white"),
shapes = list(list(type = "line",
x0 = 1, x1 = 1, y0 = -0.5, y1 = 7.5,
line = list(dash = "dash", color = "grey55", width = 2))),
legend = list(orientation = "h", x = 0, y = -0.20),
hovermode = "closest")
p5forest_df %>%
transmute(predictor = label,
OR_per_SD = round(estimate, 3),
ci_low = round(conf.low, 3),
ci_high = round(conf.high, 3),
p_value = signif(p.value, 3)) %>%
arrange(desc(OR_per_SD)) %>%
DT::datatable(caption = "Standardized odds ratios with 95% CI",
options = list(pageLength = 10),
rownames = FALSE,
colnames = c("Predictor", "OR (per 1 SD)",
"CI Lower", "CI Upper", "p-value"))Key Finding: Prior inpatient visits (OR = 1.23) is the strongest predictor, followed by length of stay (OR = 1.14) and # diagnoses (OR = 1.12). Number of procedures is protective (OR = 0.93).
roc_obj <- pROC::roc(test$readmit_30, test$prob, quiet = TRUE)
roc_df <- data.frame(fpr = 1 - roc_obj$specificities,
tpr = roc_obj$sensitivities,
thr = roc_obj$thresholds) %>% arrange(fpr)
target_thr <- c(0.10, 0.20, 0.30)
markers <- map_dfr(target_thr, function(t) {
i <- which.min(abs(roc_df$thr - t))
data.frame(fpr = roc_df$fpr[i], tpr = roc_df$tpr[i],
label = sprintf("<b>Threshold = %.2f</b><br>FPR = %.3f<br>TPR = %.3f",
t, roc_df$fpr[i], roc_df$tpr[i]))
})
p6 <- plot_ly() %>%
add_trace(data = roc_df, x = ~fpr, y = ~tpr,
type = "scatter", mode = "lines",
line = list(color = "#1F6FB4", width = 3),
fill = "tozeroy", fillcolor = "rgba(31, 111, 180, 0.15)",
name = sprintf("Logistic (AUC = %.3f)", as.numeric(auc_val)),
hovertemplate = "<b>Model Performance</b><br>FPR: %{x:.3f}<br>TPR: %{y:.3f}<extra></extra>") %>%
add_trace(x = c(0, 1), y = c(0, 1), type = "scatter", mode = "lines",
line = list(dash = "dash", color = "grey55", width = 2),
name = "Chance (AUC = 0.500)", hoverinfo = "skip") %>%
add_trace(data = markers, x = ~fpr, y = ~tpr,
type = "scatter", mode = "markers",
marker = list(size = 13, color = "#D1495B",
line = list(width = 2, color = "white")),
text = ~label, hoverinfo = "text",
name = "Decision Thresholds") %>%
layout(title = list(text = sprintf(
"<b>ROC Curve — Held-out 30%%</b><br><sup>N_test = %s; baseline = %.1f%%; hover curve and markers for details</sup>",
scales::comma(nrow(test)), mean(test$readmit_30) * 100),
x = 0, xanchor = "left"),
xaxis = list(title = "False positive rate (1 − specificity)",
range = c(-0.02, 1.02), gridcolor = "grey92"),
yaxis = list(title = "True positive rate (sensitivity)",
range = c(-0.02, 1.02), gridcolor = "grey92"),
legend = list(x = 0.55, y = 0.12),
hovermode = "closest")
p6pred <- as.integer(test$prob >= 0.5)
conf_mat <- table(Actual = test$readmit_30, Predicted = pred)
conf_mat## Predicted
## Actual 0 1
## 0 18527 4
## 1 1883 5
# Calculate metrics
tn <- conf_mat[1,1]; fp <- conf_mat[1,2]
fn <- conf_mat[2,1]; tp <- conf_mat[2,2]
accuracy <- (tp + tn) / sum(conf_mat)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
cat(sprintf("\nAccuracy: %.3f\nSensitivity: %.3f\nSpecificity: %.3f\n",
accuracy, sensitivity, specificity))##
## Accuracy: 0.908
## Sensitivity: 0.003
## Specificity: 1.000
Interpretation: The modest AUC of 0.60 indicates weak discrimination, motivating the regularized models and interaction terms in Module 6.
Patients with 3+ prior inpatient visits run readmission rates of 25–30% — triple the population mean — regardless of how long they stay. Length of stay only matters for patients with no prior admissions.
df_grid <- df %>%
mutate(los_bin = cut(time_in_hospital,
breaks = c(0, 2, 4, 6, 8, 14),
labels = c("1-2","3-4","5-6","7-8","9-14")),
inp_bin = cut(number_inpatient,
breaks = c(-0.1, 0.5, 1.5, 2.5, 12.5),
labels = c("0","1","2","3+"))) %>%
group_by(inp_bin, los_bin) %>%
summarise(n = n(), rate = mean(readmit_30) * 100, .groups = "drop") %>%
mutate(label = sprintf("%.1f%%<br>n = %s", rate, scales::comma(n)),
tooltip = sprintf(
"<b>Prior Inpatient: %s</b><br>Length of Stay: %s days<br>Encounters: %s<br><b>Readmission Rate: %.2f%%</b>",
inp_bin, los_bin, scales::comma(n), rate))
heat_mat <- df_grid %>%
select(inp_bin, los_bin, rate) %>%
pivot_wider(names_from = los_bin, values_from = rate) %>%
column_to_rownames("inp_bin") %>%
as.matrix()
text_mat <- df_grid %>%
select(inp_bin, los_bin, label) %>%
pivot_wider(names_from = los_bin, values_from = label) %>%
column_to_rownames("inp_bin") %>%
as.matrix()
hover_mat <- df_grid %>%
select(inp_bin, los_bin, tooltip) %>%
pivot_wider(names_from = los_bin, values_from = tooltip) %>%
column_to_rownames("inp_bin") %>%
as.matrix()
p7 <- plot_ly(x = colnames(heat_mat),
y = rownames(heat_mat),
z = heat_mat,
type = "heatmap",
colorscale = list(c(0, "#3D5A80"),
c(0.25, "#98C1D9"),
c(0.5, "#FAF3DD"),
c(0.75, "#E07A5F"),
c(1, "#9D2828")),
text = hover_mat, hoverinfo = "text",
colorbar = list(title = "Readmit<br>Rate (%)",
titleside = "right")) %>%
add_annotations(x = rep(colnames(heat_mat), each = nrow(heat_mat)),
y = rep(rownames(heat_mat), times = ncol(heat_mat)),
text = as.vector(text_mat), showarrow = FALSE,
font = list(size = 12, color = "white")) %>%
layout(title = list(text = "<b>Risk Grid — Prior Admissions × Length of Stay</b><br><sup>Hover cells for details; deep red = highest risk</sup>",
x = 0, xanchor = "left"),
xaxis = list(title = "Length of stay (days)", side = "bottom"),
yaxis = list(title = "Prior inpatient visits (past year)"),
hovermode = "closest")
p7df_grid %>%
transmute(prior_inpatient = inp_bin,
length_of_stay = los_bin,
encounters = n,
readmission_rate_pct = round(rate, 2)) %>%
DT::datatable(caption = "Cell-level counts and readmission rates",
options = list(pageLength = 20),
rownames = FALSE,
colnames = c("Prior Admissions", "LOS (days)",
"Encounters", "Readmit Rate (%)")) %>%
DT::formatStyle('readmission_rate_pct',
background = DT::styleColorBar(range(df_grid$rate), '#D1495B'),
backgroundSize = '100% 80%',
backgroundRepeat = 'no-repeat',
backgroundPosition = 'center')Critical Finding: This non-additive interaction (visible in the 3+ row) motivates explicit interaction terms in Module 6’s regularized models.
The preliminary results give tentative answers to three research questions:
Three patterns shaping Module 6:
Module 6 Plan: Ridge/Lasso with
cv.glmnet, explicit interaction terms, VIF checks, and
held-out validation.
The cleaned dataset (N = 68,061, 9.02% readmission) produces interpretable, defensible results. Clinical complexity and prior utilization drive most explainable variation; glycemic management matters at the margin. Module 6 will convert these findings into a regularized, validated model to flag high-risk patients before discharge.
Centers for Medicare & Medicaid Services. (2023). Hospital Readmissions Reduction Program (HRRP). https://www.cms.gov/medicare/quality-initiatives-patient-assessment-instruments/value-based-programs/hrrp/hospital-readmissions-reduction-program-hrrp
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22. https://doi.org/10.18637/jss.v033.i01
Jencks, S. F., Williams, M. V., & Coleman, E. A. (2009). Rehospitalizations among patients in the Medicare fee-for-service program. New England Journal of Medicine, 360(14), 1418–1428. https://doi.org/10.1056/NEJMsa0803563
Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Müller, M. (2011). pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77. https://doi.org/10.1186/1471-2105-12-77
Strack, B., DeShazo, J. P., Gennings, C., Olmo, J. L., Ventura, S., Cios, K. J., & Clore, J. N. (2014). Impact of HbA1c measurement on hospital readmission rates: Analysis of 70,000 clinical database patient records. BioMed Research International, 2014, 781670. https://doi.org/10.1155/2014/781670
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686