👉 Move the sliders, change demographics, toggle treatment markers — the calculator recomputes the predicted 30-day readmission probability, the YES/NO discharge flag, and the per-feature contribution chart in real time, entirely in your browser. Every input is responsive: the calculator uses Ridge coefficients (which keep all 19 features) so changing male/female, race, admission type, or any treatment toggle visibly moves the prediction. Lasso (used elsewhere in this report) drops 9 of those features for parsimony — see Which Factors Drive the Prediction below for the variable-selection view.
We built and validated a parsimonious risk model on 68,061 unique diabetic inpatients drawn from 130 U.S. hospitals (1999–2008). Lasso-regularized logistic regression — tuned with 5-fold cross-validated AUC — retained 10 of 19 candidate features and matched the unregularized GLM’s discrimination (AUC 0.616 [95% CI 0.603–0.628] vs 0.612 [0.599–0.624]). All three methods’ confidence intervals overlap — the models are statistically indistinguishable on this test set, so Lasso’s parsimony comes free.
Headline finding. A discharge protocol applying the threshold of 0.085 to the Lasso predictions correctly catches 72% of true 30-day readmissions while flagging 59% of patients for the high-risk pathway. Patients in the top decile of predicted risk readmit at >15% — roughly 1.7× the 9.0% cohort baseline.
These three profiles show the predictor behaving as expected across the risk spectrum. The probabilities below are computed with the same Ridge coefficients the live calculator uses — set the same inputs in the predictor at the top to verify.
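Under the hood the calculator is nothing more than a logistic link applied to a standardized weighted sum. A minimal R sketch of that arithmetic — predict_risk, mu, and sigma are illustrative names, not objects from the script below:

# Minimal sketch of the calculator's arithmetic (illustrative names).
# x: named vector of raw inputs; mu/sigma: training means/SDs for numeric columns.
predict_risk <- function(x, intercept, ridge_beta, mu, sigma, scale_cols) {
  x[scale_cols] <- (x[scale_cols] - mu[scale_cols]) / sigma[scale_cols]  # standardize
  eta <- intercept + sum(ridge_beta * x)   # linear predictor (log-odds)
  1 / (1 + exp(-eta))                      # logistic link -> readmission probability
}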
Q1 answered. Prior inpatient utilization, age, length of stay, on-diabetes-medication, and number of diagnoses are the five strongest predictors. Race, gender, HbA1c testing, medication change, prior outpatient visits, and admission type all collapse to zero — Lasso says they carry no incremental signal once clinical-history variables are in the model. The live calculator at the top uses Ridge coefficients so these variables stay responsive there for educational exploration; Lasso drops them in the deployed model.
A model that works on average can still fail on specific subgroups. We re-evaluated the Lasso AUC within age bands and by sex on the held-out test set. Wide CIs on smaller subgroups are expected; the question is whether any group sits clearly below the overall band.
Reading the audit. Every subgroup CI overlaps the overall AUC band. There is no evidence of materially worse performance for any age band or for either sex on this test set — but with only ~5,105 patients per age band, this audit is necessary, not sufficient. A production deployment should re-audit on contemporary data and look at calibration, PPV, and sensitivity within group, not just AUC.
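The audit itself is only a few lines per subgroup. A sketch, assuming df_te holds the held-out rows with their Lasso scores attached as p_lasso (illustrative names; the script below defines the underlying objects):

# Sketch: AUC with a DeLong 95% CI inside each age band.
subgroup_auc <- df_te |>
  group_by(age) |>
  summarise(n       = n(),
            auc     = as.numeric(pROC::auc(readmit_30, p_lasso)),
            ci_low  = pROC::ci.auc(readmit_30, p_lasso)[1],
            ci_high = pROC::ci.auc(readmit_30, p_lasso)[3],
            .groups = "drop")
print(subgroup_auc)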
At sensitivity 72% and specificity 42%, the model would flag about 147 of the 250 daily diabetic discharges, correctly catching 17 of the 23 true 30-day readmissions while generating 130 false alarms.
At 5 minutes of nurse time per flagged patient and a $50/hr loaded cost, that’s 12.2 nurse-hours/day ≈ $612/day in incremental labor. Even under the deliberately conservative assumptions that only 20% of caught readmissions are actually prevented and that each prevented readmission saves $15,000, the math yields ~$51,000/day in avoided cost — roughly an 83× return on the labor input. Break-even intervention efficacy: ≈0.24%.
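The arithmetic above, as a reproducible back-of-envelope — every number here is a stated assumption from the text, not a model output:

# Back-of-envelope cost model; all inputs are the assumptions stated above.
discharges <- 250; base_rate <- 0.0902
sens <- 0.72; spec <- 0.42
caught  <- discharges * base_rate * sens              # ~17 true readmits caught
false_p <- discharges * (1 - base_rate) * (1 - spec)  # ~130 false alarms
flagged <- caught + false_p                           # ~147 flagged/day
labor   <- flagged * (5 / 60) * 50                    # ~$612/day nurse time
saved   <- caught * 0.20 * 15000                      # ~$51,000/day avoided cost
breakeven <- labor / (caught * 15000)                 # ~0.24% efficacy to break even
c(flagged = flagged, labor = labor, saved = saved, roi = saved / labor)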
Vintage data. The cohort spans 1999–2008. Diabetes care, EMR coding standards, and discharge protocols have shifted materially since then. A 2026 deployment would need to be re-fit on contemporary data — the method generalizes; the specific coefficients should not be ported as-is.
ICD-9, not ICD-10. Diagnosis encoding predates the 2015 U.S. ICD-10 transition. Mapping is mechanical for count features, non-trivial for any feature engineering built on diagnosis codes themselves.
No vitals, no labs (except A1c), no notes, no SDOH. The model uses only billing-style administrative fields. AUC ≈ 0.61 is the ceiling of what this signal can deliver. Adding vitals, complete metabolic panel, free-text discharge summaries, or social determinants would plausibly push AUC into the 0.70+ range.
Inter-hospital transfers are invisible. Patients who readmit at facilities outside the network are not labeled positive. The true 30-day rate is almost certainly higher than 9.02%; the model is systematically under-predicting that residual risk.
Subgroup audit is necessary, not sufficient. AUC by age band and sex is uniform on this test set, but uniform AUC does not prove fairness; as noted above, a production deployment should re-audit on contemporary data and examine calibration, PPV, and sensitivity within each group.
Q1 — drivers. Prior inpatient utilization (β = +0.167), age (+0.094), length of stay (+0.095), on-diabetes-medication (+0.110), and number of diagnoses (+0.059) dominate. Race, gender, HbA1c testing, medication change, prior outpatient visits, and admission type drop out under L1 penalty.
Q2 — HbA1c. Module 4’s univariate test showed a small protective association (8.48% vs 9.15%), but Lasso shrinks the coefficient to zero in the multivariable model. The marginal HbA1c effect is captured by other clinical-history variables that correlate with whether testing was performed.
Q3 — admission type. Length of stay differs significantly across admission types (F = 52.88), but the largest pairwise gap is < 0.4 days, and Lasso drops both admission-type dummies. Statistically reliable, not clinically decisive.
Q4 — regularization. Lasso matches the unregularized GLM’s AUC (0.616 [0.603, 0.628] vs 0.612 [0.599, 0.624]) using only 10 of 19 features, with a stable solution across 10 random splits and uniform performance across age bands and sex. Equivalent discrimination at roughly half the parameter count, with overlapping CIs consistent with statistical equivalence.
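A formal version of that equivalence check is a paired DeLong test on the two ROC curves — a sketch using the test-set objects (y_te, p_lasso, p_glm) defined in the script below:

# Paired DeLong test: H0 is equal AUC for Lasso and the unregularized GLM.
roc_lasso <- pROC::roc(y_te, p_lasso, quiet = TRUE)
roc_glm   <- pROC::roc(y_te, p_glm,   quiet = TRUE)
pROC::roc.test(roc_lasso, roc_glm, method = "delong", paired = TRUE)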
The complete pipeline lives in Module_5_FinalProject_DraftReport_Rscript.R — the .R script shown below is read directly from disk at knit time, so what you see is exactly what runs end-to-end.
# =====================================================================
# Predicting 30-Day Hospital Readmission in Diabetic Patients
# ALY6015 Module 5 — Draft Report R Script
# Anush Goel, Mandeep Kaur, Sandeep Kaur — May 2026
#
# ----- Packages ----------------------------------------------------
pkgs <- c("tidyverse", "janitor", "broom", "rsample",
"glmnet", "caret", "pROC", "yardstick",
"patchwork", "scales", "jsonlite",
"ggrepel",   # geom_text_repel() is used in Figure 6
"plotly", "RColorBrewer")
to_install <- setdiff(pkgs, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install)
invisible(lapply(pkgs, library, character.only = TRUE))
set.seed(2024)
options(stringsAsFactors = FALSE)
# Color palette used across all figures (matches the predictor app)
PAL <- list(
navy = "#1F4E6B",
teal = "#2D7D9A",
accent = "#C44536",
cool = "#4A7C59",
gray = "#9CA3AF",
light = "#E8F1F4"
)
theme_paper <- function() {
theme_minimal(base_size = 11) +
theme(plot.title.position = "plot",
plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(color = "grey45", face = "italic"),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(color = "grey92"),
axis.text = element_text(color = "grey25"),
axis.title = element_text(color = "grey25"),
legend.position = "bottom",
legend.title = element_blank())
}
# ----- 1. Load and clean -----------------------------------------------
# Treat "?" as NA at read time to avoid type-matching errors in dplyr 1.1+
raw <- read_csv("diabetic_data.csv",
na = c("", "NA", "?"),
show_col_types = FALSE)
# Drop columns with >30% missingness (weight, payer_code, medical_specialty)
miss <- map_dbl(raw, ~mean(is.na(.x)))
high_miss <- names(miss[miss > 0.30])
df <- raw |> select(-any_of(high_miss))
# Remove death/hospice (cannot be readmitted) and unknown gender; keep first encounter
df <- df |>
filter(!discharge_disposition_id %in% c(11, 13, 14, 19, 20, 21)) |>
filter(gender != "Unknown/Invalid") |>
arrange(patient_nbr, encounter_id) |>
distinct(patient_nbr, .keep_all = TRUE) |>
drop_na(race, diag_1)
# Derive features
df <- df |>
mutate(
readmit_30 = as.integer(readmitted == "<30"),
age_mid = readr::parse_number(str_extract(age, "\\d+")) + 5,
male = as.integer(gender == "Male"),
race_AA = as.integer(race == "AfricanAmerican"),
race_Hisp = as.integer(race == "Hispanic"),
race_Other = as.integer(!race %in% c("Caucasian","AfricanAmerican","Hispanic")),
a1c_tested = as.integer(A1Cresult != "None"),
meds_changed = as.integer(change == "Ch"),
on_diabetesMed = as.integer(diabetesMed == "Yes"),
insulin_on = as.integer(insulin != "No"),
adm_emergency = as.integer(admission_type_id == 1),
adm_urgent = as.integer(admission_type_id == 2)
)
cat(sprintf("Cleaned cohort: %s patients, baseline 30-day readmit rate %.2f%%\n",
format(nrow(df), big.mark = ","),
100 * mean(df$readmit_30)))
# ----- 2. Module 4 carry-over inferential tests ------------------------
# Chi-square: HbA1c testing
tab_a1c <- table(df$a1c_tested, df$readmit_30)
chi_a1c <- chisq.test(tab_a1c)
phi_a1c <- sqrt(chi_a1c$statistic / sum(tab_a1c))
# Chi-square: Age
chi_age <- chisq.test(table(df$age, df$readmit_30))
# Chi-square: Race
chi_race <- chisq.test(table(df$race, df$readmit_30))
# One-way ANOVA on length of stay by admission type (1, 2, 3)
df_anova <- df |> filter(admission_type_id %in% c(1, 2, 3))
aov_los <- aov(time_in_hospital ~ factor(admission_type_id), data = df_anova)
tukey <- TukeyHSD(aov_los)
cat("\n--- Module 4 carry-over tests ---\n")
cat(sprintf("HbA1c: X^2=%.2f, df=%d, p=%.4f, phi=%.3f\n",
chi_a1c$statistic, chi_a1c$parameter, chi_a1c$p.value, phi_a1c))
cat(sprintf("Age: X^2=%.2f, df=%d, p=%.4g\n",
chi_age$statistic, chi_age$parameter, chi_age$p.value))
cat(sprintf("Race: X^2=%.2f, df=%d, p=%.4f\n",
chi_race$statistic, chi_race$parameter, chi_race$p.value))
cat("ANOVA LOS by admission type:\n"); print(summary(aov_los))
# ----- 3. Build Ridge / Lasso / GLM ------------------------------------
features <- c("time_in_hospital", "num_lab_procedures", "num_procedures",
"num_medications", "number_outpatient", "number_emergency",
"number_inpatient", "number_diagnoses", "age_mid",
"race_AA", "race_Hisp", "race_Other", "male",
"a1c_tested", "meds_changed", "on_diabetesMed", "insulin_on",
"adm_emergency", "adm_urgent")
scale_cols <- features[1:9] # standardize numeric columns only
X <- df |> select(all_of(features)) |> as.matrix()
y <- df$readmit_30
# 70/30 stratified split
split <- initial_split(df, prop = 0.70, strata = readmit_30)
idx_tr <- split$in_id  # rsample stores the sampled training-row indices here
idx_te <- setdiff(seq_len(nrow(df)), idx_tr)
X_tr <- X[idx_tr, , drop = FALSE]; y_tr <- y[idx_tr]
X_te <- X[idx_te, , drop = FALSE]; y_te <- y[idx_te]
# Standardize numeric columns using TRAINING means/SDs (no test-set leakage)
mu_tr <- colMeans(X_tr[, scale_cols, drop = FALSE])
sd_tr <- apply(X_tr[, scale_cols, drop = FALSE], 2, sd)
X_tr_s <- X_tr; X_te_s <- X_te
for (c in scale_cols) {
X_tr_s[, c] <- (X_tr[, c] - mu_tr[c]) / sd_tr[c]
X_te_s[, c] <- (X_te[, c] - mu_tr[c]) / sd_tr[c]
}
# CV.GLMNET: Lasso (alpha=1) and Ridge (alpha=0), 5-fold AUC
cat("\n--- Fitting cv.glmnet ---\n")
cv_lasso <- cv.glmnet(X_tr_s, y_tr, family = "binomial",
alpha = 1, type.measure = "auc", nfolds = 5)
cv_ridge <- cv.glmnet(X_tr_s, y_tr, family = "binomial",
alpha = 0, type.measure = "auc", nfolds = 5)
fit_glm <- glm(y_tr ~ ., data = as.data.frame(X_tr_s), family = binomial())
# Predicted probabilities on TEST set
p_lasso <- as.numeric(predict(cv_lasso, X_te_s, s = "lambda.min", type = "response"))
p_ridge <- as.numeric(predict(cv_ridge, X_te_s, s = "lambda.min", type = "response"))
p_glm <- as.numeric(predict(fit_glm, newdata = as.data.frame(X_te_s), type = "response"))
# AUCs
auc_l <- as.numeric(pROC::auc(y_te, p_lasso))
auc_r <- as.numeric(pROC::auc(y_te, p_ridge))
auc_g <- as.numeric(pROC::auc(y_te, p_glm))
cat(sprintf("Held-out AUCs: GLM=%.3f Ridge=%.3f Lasso=%.3f\n",
auc_g, auc_r, auc_l))
# Lasso coefficient vector (excluding intercept)
co_l <- as.numeric(coef(cv_lasso, s = "lambda.min"))
intercept <- co_l[1]
beta <- co_l[-1]; names(beta) <- features
n_active <- sum(abs(beta) > 1e-8)
cat(sprintf("Lasso retained %d of %d features.\n", n_active, length(features)))
# Ridge coefficients (always all retained)
co_r <- as.numeric(coef(cv_ridge, s = "lambda.min"))[-1]
names(co_r) <- features
# Threshold: anchor at 0.085 (matches the report and predictor)
THRESHOLD <- 0.085
y_pred <- as.integer(p_lasso >= THRESHOLD)
cm <- caret::confusionMatrix(factor(y_pred, levels = c(0, 1)),
factor(y_te, levels = c(0, 1)),
positive = "1")
TN <- cm$table[1,1]; FP <- cm$table[2,1]; FN <- cm$table[1,2]; TP <- cm$table[2,2]
sens <- TP / (TP + FN); spec <- TN / (TN + FP); ppv <- TP / (TP + FP)
cat(sprintf("Confusion at thr=%.3f: TN=%d FP=%d FN=%d TP=%d Sens=%.2f Spec=%.2f PPV=%.2f\n",
THRESHOLD, TN, FP, FN, TP, sens, spec, ppv))
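# (Sketch, not part of the original pipeline) The 0.085 anchor comes from the
# report; sweeping nearby thresholds makes the sensitivity/specificity
# trade-off explicit before committing to one operating point.
for (thr in c(0.06, 0.085, 0.11, 0.15)) {
  pred_thr <- as.integer(p_lasso >= thr)
  cat(sprintf("thr=%.3f sens=%.2f spec=%.2f flagged=%.0f%%\n",
              thr,
              sum(pred_thr == 1 & y_te == 1) / sum(y_te == 1),
              sum(pred_thr == 0 & y_te == 0) / sum(y_te == 0),
              100 * mean(pred_thr)))
}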
# ----- 4. Pretty labels for figures ------------------------------------
pretty <- c(
time_in_hospital = "Length of stay",
num_lab_procedures = "# Lab procedures",
num_procedures = "# In-hospital procedures",
num_medications = "# Medications",
number_outpatient = "Prior outpatient visits",
number_emergency = "Prior ER visits",
number_inpatient = "Prior inpatient visits",
number_diagnoses = "# Diagnoses",
age_mid = "Age",
race_AA = "Race: African American",
race_Hisp = "Race: Hispanic",
race_Other = "Race: Other",
male = "Sex: Male",
a1c_tested = "HbA1c was tested",
meds_changed = "Medication changed",
on_diabetesMed = "On diabetes medication",
insulin_on = "On insulin",
adm_emergency = "Admission: Emergency",
adm_urgent = "Admission: Urgent"
)
# =====================================================================
# FIGURE 1 — Held-out AUC comparison (matches docx Figure 1)
# =====================================================================
# Assemble the AUC summary table the figure plots (was missing above)
fig1_df <- tibble(
model = factor(c("Unregularized GLM", "Ridge (L2)", "Lasso (L1)"),
levels = c("Unregularized GLM", "Ridge (L2)", "Lasso (L1)")),
auc = c(auc_g, auc_r, auc_l)
)
fig1 <- ggplot(fig1_df, aes(model, auc, fill = model)) +
geom_col(width = 0.6, color = "white") +
geom_text(aes(label = sprintf("%.3f", auc)),
vjust = -0.5, fontface = "bold", size = 4) +
geom_hline(yintercept = 0.5, linetype = "dotted", color = "grey60") +
scale_fill_manual(
values = c("Unregularized GLM" = "#9CA3AF",
"Ridge (L2)" = "#2D7D9A",
"Lasso (L1)" = "#1F4E6B"),
guide = "none"
) +
scale_y_continuous(breaks = seq(0.45, 0.65, 0.05)) +
coord_cartesian(ylim = c(0.45, 0.68)) + # headroom so the value labels aren't clipped
labs(title = "Figure 1. Discrimination of three logistic models",
x = NULL, y = "Held-out AUC (30% test set)") +
theme_paper()
print(fig1)
# =====================================================================
# FIGURE 2 — Lasso variable importance (matches docx Figure 2)
# =====================================================================
fig2_df <- tibble(feature = features, beta = beta) |>
mutate(label = pretty[feature],
active = abs(beta) > 1e-8,
dir = case_when(beta > 0.001 ~ "Increases risk",
beta < -0.001 ~ "Decreases risk",
TRUE ~ "Shrunk to zero")) |>
arrange(desc(abs(beta))) |>
mutate(label = factor(label, levels = rev(label)))
fig2 <- ggplot(fig2_df, aes(beta, label, fill = dir)) +
geom_col(data = filter(fig2_df, active), width = 0.7, color = "white") +
geom_text(data = filter(fig2_df, active),
aes(label = sprintf("%+.3f", beta),
hjust = ifelse(beta > 0, -0.15, 1.15)),
size = 3, fontface = "bold") +
geom_text(data = filter(fig2_df, !active),
aes(x = 0.005, label = " shrunk to 0"),
size = 2.8, color = "grey55", fontface = "italic", hjust = 0) +
geom_vline(xintercept = 0, color = "grey30", linewidth = 0.4) +
scale_fill_manual(values = c("Increases risk" = PAL$accent,
"Decreases risk" = PAL$cool,
"Shrunk to zero" = "grey85"),
name = NULL) +
scale_x_continuous(expand = expansion(mult = c(0.18, 0.18))) +
labs(title = sprintf("Figure 2. Lasso variable importance — %d of %d retained",
n_active, length(features)),
x = "Standardized log-odds coefficient (β)", y = NULL) +
theme_paper() +
theme(legend.position = "bottom")
print(fig2)
# =====================================================================
# FIGURE 3 — ROC overlay with threshold marker (matches docx Figure 4)
# =====================================================================
# Build a tidy ROC curve per model, computing each pROC::roc object once
roc_tbl <- function(p, label) {
r <- pROC::roc(y_te, p, quiet = TRUE)
tibble(fpr = 1 - r$specificities,
tpr = r$sensitivities,
model = label) |>
arrange(fpr)
}
roc_df <- bind_rows(
roc_tbl(p_glm,   sprintf("GLM (AUC = %.3f)",   auc_g)),
roc_tbl(p_ridge, sprintf("Ridge (AUC = %.3f)", auc_r)),
roc_tbl(p_lasso, sprintf("Lasso (AUC = %.3f)", auc_l))
)
roc_df$model <- factor(roc_df$model, levels = unique(roc_df$model))
fig3 <- ggplot(roc_df, aes(fpr, tpr, color = model)) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey60") +
geom_line(linewidth = 1.0) +
geom_point(aes(x = 1 - spec, y = sens), color = PAL$accent, size = 4,
inherit.aes = FALSE) +
annotate("label", x = 1 - spec + 0.15, y = sens - 0.10,
label = sprintf("Threshold = %.3f\nSens = %.2f, Spec = %.2f",
THRESHOLD, sens, spec),
size = 3, color = "grey20", fontface = "plain",
label.size = 0.3, label.padding = unit(0.4, "lines")) +
scale_color_manual(values = c(PAL$gray, PAL$teal, PAL$navy)) +
coord_equal(xlim = c(-0.02, 1.02), ylim = c(-0.02, 1.02)) +
labs(title = "Figure 3. ROC curves on the held-out test set",
x = "False positive rate (1 − specificity)",
y = "True positive rate (sensitivity)") +
theme_paper() +
theme(legend.position = c(0.78, 0.18),
legend.background = element_rect(fill = "white", color = NA))
print(fig3)
# =====================================================================
# FIGURE 4 — Confusion matrix at chosen threshold (matches docx Figure 5)
# =====================================================================
cm_df <- tibble(
Actual = c("Actual: No", "Actual: No", "Actual: Yes", "Actual: Yes"),
Predicted = c("Predicted: No", "Predicted: Yes", "Predicted: No", "Predicted: Yes"),
count = c(TN, FP, FN, TP),
cell = c("TN", "FP", "FN", "TP")
) |> mutate(
pct = 100 * count / sum(count),
fill = case_when(cell == "TN" ~ "#E8F1F4",
cell == "TP" ~ "#DCE8E0",
TRUE ~ "#FBE5DF"),
color = case_when(cell == "TN" ~ "#1F4E6B",
cell == "TP" ~ "#2C5C42",
TRUE ~ "#7A2A1F")
)
fig4 <- ggplot(cm_df, aes(Predicted, fct_rev(factor(Actual)), fill = I(fill))) +
geom_tile(color = "white", linewidth = 3) +
geom_text(aes(label = paste0(cell, "\n", scales::comma(count),
"\n(", sprintf("%.1f%%", pct), ")"),
color = I(color)),
fontface = "bold", lineheight = 1.1, size = 4) +
labs(title = sprintf("Figure 4. Confusion matrix at threshold = %.3f", THRESHOLD),
subtitle = sprintf("Sensitivity = %.2f Specificity = %.2f PPV = %.2f n = %s",
sens, spec, ppv, scales::comma(TN+FP+FN+TP)),
x = NULL, y = NULL) +
theme_paper() +
theme(panel.grid = element_blank(),
axis.ticks = element_blank())
print(fig4)
# =====================================================================
# FIGURE 5 — Lasso vs Ridge coefficient comparison (matches docx Figure 3)
# =====================================================================
fig5_df <- tibble(feature = features,
lasso = beta,
ridge = co_r) |>
mutate(label = pretty[feature]) |>
arrange(desc(abs(ridge))) |>
mutate(label = factor(label, levels = rev(label))) |>
pivot_longer(c(lasso, ridge), names_to = "method", values_to = "coef") |>
mutate(method = factor(if_else(method == "lasso",
"Lasso (L1) — shrinks AND eliminates",
"Ridge (L2) — shrinks every coefficient"),
levels = c("Ridge (L2) — shrinks every coefficient",
"Lasso (L1) — shrinks AND eliminates")))
# Mark Lasso-zero rows
fig5_zero <- fig5_df |>
filter(grepl("Lasso", method) & abs(coef) < 1e-8)
fig5 <- ggplot(fig5_df, aes(coef, label, fill = method)) +
geom_col(position = position_dodge(width = 0.7), width = 0.6, color = "white") +
geom_point(data = fig5_zero, aes(x = 0), color = PAL$accent, size = 3,
show.legend = FALSE,
position = position_nudge(y = -0.2)) +
geom_vline(xintercept = 0, color = "grey30", linewidth = 0.4) +
scale_fill_manual(values = c(PAL$teal, PAL$navy), name = NULL) +
labs(title = "Figure 5. Coefficient comparison — Ridge shrinks, Lasso selects",
x = "Standardized coefficient (β)", y = NULL,
caption = "Red dots mark coefficients Lasso shrunk exactly to zero.") +
theme_paper() +
theme(legend.position = "bottom",
plot.caption = element_text(color = "grey45", size = 9))
print(fig5)
# =====================================================================
# FIGURE 6 — Calibration plot by deciles (matches docx Figure 6)
# =====================================================================
cal_df <- tibble(p = p_lasso, y = y_te) |>
mutate(decile = ntile(p, 10)) |>
group_by(decile) |>
summarise(pred = mean(p), obs = mean(y), n = n(), .groups = "drop") |>
mutate(label = paste0("D", decile))
baseline <- mean(y)
mx <- max(c(cal_df$pred, cal_df$obs)) * 1.10
fig6 <- ggplot(cal_df, aes(pred, obs)) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey55") +
geom_hline(yintercept = baseline, linetype = "dotted", color = PAL$accent) +
geom_line(color = PAL$navy, linewidth = 0.8, alpha = 0.5) +
geom_point(aes(size = n), color = PAL$navy, alpha = 0.85) +
ggrepel::geom_text_repel(aes(label = label), size = 3, color = "grey25",
min.segment.length = Inf) +
scale_x_continuous(labels = scales::percent_format(accuracy = 1),
limits = c(0, mx)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1),
limits = c(0, mx)) +
scale_size_area(guide = "none", max_size = 9) +
coord_equal() +
labs(title = "Figure 6. Calibration on the held-out test set",
subtitle = sprintf("Population baseline = %.1f%% (red dotted)", 100 * baseline),
x = "Mean predicted probability (per decile)",
y = "Observed readmission rate (per decile)") +
theme_paper()
print(fig6)
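# (Sketch, complementary to Figure 6) Brier score: mean squared error of the
# predicted probabilities against the observed 0/1 outcomes; lower is better.
brier <- mean((p_lasso - y_te)^2)
cat(sprintf("Brier score (Lasso, test): %.4f vs. constant-baseline %.4f\n",
            brier, mean((mean(y_tr) - y_te)^2)))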
# =====================================================================
# FIGURE 7 — Score distribution by outcome (matches docx Figure 7)
# =====================================================================
dist_df <- tibble(p = p_lasso, y = factor(y_te,
levels = c(0,1),
labels = c(sprintf("Not readmitted (n = %s)",
scales::comma(sum(y_te==0))),
sprintf("Readmitted within 30 days (n = %s)",
scales::comma(sum(y_te==1))))))
fig7 <- ggplot(dist_df, aes(p, fill = y)) +
geom_histogram(bins = 60, alpha = 0.75, color = "white", position = "identity") +
geom_vline(xintercept = THRESHOLD, color = "grey15", linetype = "dashed", linewidth = 0.7) +
geom_vline(xintercept = baseline, color = "grey60", linetype = "dotted") +
scale_fill_manual(values = c(PAL$cool, PAL$accent), name = NULL) +
scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = "Figure 7. Score distribution — how well the model separates outcomes",
subtitle = sprintf("Decision threshold (dashed) = %.3f Baseline rate (dotted) = %.1f%%",
THRESHOLD, 100 * baseline),
x = "Predicted probability of 30-day readmission",
y = "Number of patients (test set)") +
theme_paper()
print(fig7)
cat("\nAll seven figures saved as PNG.\n")
# =====================================================================
# 5. Export coefficient table and predictor payload
# =====================================================================
coef_tbl <- tibble(variable = features,
lasso_beta = beta,
ridge_beta = co_r,
abs_beta = abs(beta),
active = ifelse(abs(beta) > 1e-8, "Yes", "No")) |>
arrange(desc(abs_beta))
write_csv(coef_tbl, "lasso_coefficients.csv")
# Means/SDs of the FULL dataset for predictor standardization
# (the predictor uses these to standardize new patient inputs)
mu_full <- colMeans(X[, scale_cols, drop = FALSE])
sd_full <- apply(X[, scale_cols, drop = FALSE], 2, sd)
payload <- list(
intercept = unname(intercept),
threshold = THRESHOLD,
baseline_rate = mean(y),
scale_cols = scale_cols,
features = features,
means = as.list(mu_full),
sds = as.list(sd_full),
coefs = as.list(beta),
auc_lasso = auc_l,
auc_ridge = auc_r,
auc_glm = auc_g,
n_active = n_active
)
jsonlite::write_json(payload, "module5_predictor_payload.json",
pretty = TRUE, auto_unbox = TRUE)
cat("\nWrote: lasso_coefficients.csv + module5_predictor_payload.json\n")
cat("\n--- DONE ---\n")
ALY6015 Intermediate Analytics · College of Professional Studies, Northeastern University · May 2026
Built with R, glmnet, plotly, and ggplot2 · Data: Diabetes 130-US Hospitals (Strack et al., 2014)