👉 Move the sliders, change demographics, toggle treatment markers — the calculator recomputes the predicted 30-day readmission probability, the YES/NO discharge flag, and the per-feature contribution chart in real time, entirely in your browser. Every input is responsive: the calculator uses Ridge coefficients (which keep all 19 features) so changing male/female, race, admission type, or any treatment toggle visibly moves the prediction. Lasso (used elsewhere in this report) drops 9 of those features for parsimony — see Which Factors Drive the Prediction below for the variable-selection view.
We built and validated a parsimonious risk model on 68,061 unique diabetic inpatients drawn from 130 U.S. hospitals (1999–2008). Lasso-regularized logistic regression — tuned with 5-fold cross-validated AUC — retained 10 of 19 candidate features and matched the unregularized GLM’s discrimination (AUC 0.616 [95% CI 0.603–0.628] vs 0.612 [0.599–0.624]). All three methods’ confidence intervals overlap — the models are statistically indistinguishable on this test set, so Lasso’s parsimony comes free.
Headline finding. A discharge protocol applying the threshold of 0.085 to the Lasso predictions correctly catches 72% of true 30-day readmissions while flagging 59% of patients for the high-risk pathway. Patients in the top decile of predicted risk readmit at >15% — roughly 1.7× the 9.0% cohort baseline.
These three profiles show the predictor behaving as expected across the risk spectrum. The probabilities below are computed with the same Ridge coefficients the live calculator uses — set the same inputs in the predictor at the top to verify.
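Under the hood the calculator is nothing more than a logistic link applied to a standardized weighted sum. A minimal R sketch of that arithmetic — predict_risk, mu, and sigma are illustrative names, not objects from the script below:

# Minimal sketch of the calculator's arithmetic (illustrative names).
# x: named vector of raw inputs; mu/sigma: training means/SDs for numeric columns.
predict_risk <- function(x, intercept, ridge_beta, mu, sigma, scale_cols) {
  x[scale_cols] <- (x[scale_cols] - mu[scale_cols]) / sigma[scale_cols]  # standardize
  eta <- intercept + sum(ridge_beta * x)   # linear predictor (log-odds)
  1 / (1 + exp(-eta))                      # logistic link -> readmission probability
}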
Q1 answered. Prior inpatient utilization, age, length of stay, on-diabetes-medication, and number of diagnoses are the five strongest predictors. Race, gender, HbA1c testing, medication change, prior outpatient visits, and admission type all collapse to zero — Lasso says they carry no incremental signal once clinical-history variables are in the model. The live calculator at the top uses Ridge coefficients so these variables stay responsive there for educational exploration; Lasso drops them in the deployed model.
A model that works on average can still fail on specific subgroups. We re-evaluated the Lasso AUC within age bands and by sex on the held-out test set. Wide CIs on smaller subgroups are expected; the question is whether any group sits clearly below the overall band.
Reading the audit. Every subgroup CI overlaps the overall AUC band. There is no evidence of materially worse performance for any age band or for either sex on this test set — but with only ~5,105 patients per age band, this audit is necessary, not sufficient. A production deployment should re-audit on contemporary data and look at calibration, PPV, and sensitivity within group, not just AUC.
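The audit itself is only a few lines per subgroup. A sketch, assuming df_te holds the held-out rows with their Lasso scores attached as p_lasso (illustrative names; the script below defines the underlying objects):

# Sketch: AUC with a DeLong 95% CI inside each age band.
subgroup_auc <- df_te |>
  group_by(age) |>
  summarise(n       = n(),
            auc     = as.numeric(pROC::auc(readmit_30, p_lasso)),
            ci_low  = pROC::ci.auc(readmit_30, p_lasso)[1],
            ci_high = pROC::ci.auc(readmit_30, p_lasso)[3],
            .groups = "drop")
print(subgroup_auc)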
At sensitivity 72% and specificity 42%, the model would flag about 147 of the 250 daily diabetic discharges, correctly catching 17 of the 23 true 30-day readmissions while generating 130 false alarms.
At 5 minutes of nurse time per flagged patient and a $50/hr loaded cost, that’s 12.2 nurse-hours/day ≈ $612/day in incremental labor. Even under the deliberately conservative assumptions that only 20% of caught readmissions are actually prevented and that each prevented readmission saves $15,000, the math yields ~$51,000/day in avoided cost — roughly an 83× return on the labor input. Break-even intervention efficacy: ≈0.24%.
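The arithmetic above, as a reproducible back-of-envelope — every number here is a stated assumption from the text, not a model output:

# Back-of-envelope cost model; all inputs are the assumptions stated above.
discharges <- 250; base_rate <- 0.0902
sens <- 0.72; spec <- 0.42
caught  <- discharges * base_rate * sens              # ~17 true readmits caught
false_p <- discharges * (1 - base_rate) * (1 - spec)  # ~130 false alarms
flagged <- caught + false_p                           # ~147 flagged/day
labor   <- flagged * (5 / 60) * 50                    # ~$612/day nurse time
saved   <- caught * 0.20 * 15000                      # ~$51,000/day avoided cost
breakeven <- labor / (caught * 15000)                 # ~0.24% efficacy to break even
c(flagged = flagged, labor = labor, saved = saved, roi = saved / labor)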
Vintage data. The cohort spans 1999–2008. Diabetes care, EMR coding standards, and discharge protocols have shifted materially since then. A 2026 deployment would need to be re-fit on contemporary data — the method generalizes; the specific coefficients should not be ported as-is.
ICD-9, not ICD-10. Diagnosis encoding predates the 2015 U.S. ICD-10 transition. Mapping is mechanical for count features, non-trivial for any feature engineering built on diagnosis codes themselves.
No vitals, no labs (except A1c), no notes, no SDOH. The model uses only billing-style administrative fields. AUC ≈ 0.61 is the ceiling of what this signal can deliver. Adding vitals, complete metabolic panel, free-text discharge summaries, or social determinants would plausibly push AUC into the 0.70+ range.
Inter-hospital transfers are invisible. Patients who readmit at facilities outside the network are not labeled positive. The true 30-day rate is almost certainly higher than 9.02%; the model is systematically under-predicting that residual risk.
Subgroup audit is necessary, not sufficient. AUC by age band and sex is uniform on this test set, but uniform AUC does not prove fairness; as noted above, a production deployment should re-audit on contemporary data and examine calibration, PPV, and sensitivity within each group.
Q1 — drivers. Prior inpatient utilization (β = +0.167), age (+0.094), length of stay (+0.095), on-diabetes-medication (+0.110), and number of diagnoses (+0.059) dominate. Race, gender, HbA1c testing, medication change, prior outpatient visits, and admission type drop out under L1 penalty.
Q2 — HbA1c. Module 4’s univariate test showed a small protective association (8.48% vs 9.15%), but Lasso shrinks the coefficient to zero in the multivariable model. The marginal HbA1c effect is captured by other clinical-history variables that correlate with whether testing was performed.
Q3 — admission type. Length of stay differs significantly across admission types (F = 52.88), but the largest pairwise gap is < 0.4 days, and Lasso drops both admission-type dummies. Statistically reliable, not clinically decisive.
Q4 — regularization. Lasso matches the unregularized GLM’s AUC (0.616 [0.603, 0.628] vs 0.612 [0.599, 0.624]) using only 10 of 19 features, with a stable solution across 10 random splits and uniform performance across age bands and sex. Equivalent discrimination at roughly half the parameter count, with overlapping CIs consistent with statistical equivalence.
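A formal version of that equivalence check is a paired DeLong test on the two ROC curves — a sketch using the test-set objects (y_te, p_lasso, p_glm) defined in the script below:

# Paired DeLong test: H0 is equal AUC for Lasso and the unregularized GLM.
roc_lasso <- pROC::roc(y_te, p_lasso, quiet = TRUE)
roc_glm   <- pROC::roc(y_te, p_glm,   quiet = TRUE)
pROC::roc.test(roc_lasso, roc_glm, method = "delong", paired = TRUE)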
The complete pipeline lives in Module_5_FinalProject_DraftReport_Rscript.R — the .R script shown below is read directly from disk at knit time, so what you see is exactly what runs end-to-end.
# =====================================================================
# Predicting 30-Day Hospital Readmission in Diabetic Patients
# ALY6015 Module 5 — Draft Report R Script
# Anush Goel, Mandeep Kaur, Sandeep Kaur — May 2026
#
# ----- Packages ----------------------------------------------------
pkgs <- c("tidyverse", "janitor", "broom", "rsample",
"glmnet", "caret", "pROC", "yardstick",
"patchwork", "scales", "jsonlite",
"ggrepel",   # geom_text_repel() is used in Figure 6
"plotly", "RColorBrewer")
to_install <- setdiff(pkgs, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install)
invisible(lapply(pkgs, library, character.only = TRUE))
set.seed(2024)
options(stringsAsFactors = FALSE)
# Color palette used across all figures (matches the predictor app)
PAL <- list(
navy = "#1F4E6B",
teal = "#2D7D9A",
accent = "#C44536",
cool = "#4A7C59",
gray = "#9CA3AF",
light = "#E8F1F4"
)
theme_paper <- function() {
theme_minimal(base_size = 11) +
theme(plot.title.position = "plot",
plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(color = "grey45", face = "italic"),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(color = "grey92"),
axis.text = element_text(color = "grey25"),
axis.title = element_text(color = "grey25"),
legend.position = "bottom",
legend.title = element_blank())
}
# ----- 1. Load and clean -----------------------------------------------
# Treat "?" as NA at read time to avoid type-matching errors in dplyr 1.1+
raw <- read_csv("diabetic_data.csv",
na = c("", "NA", "?"),
show_col_types = FALSE)
# Drop columns with >30% missingness (weight, payer_code, medical_specialty)
miss <- map_dbl(raw, ~mean(is.na(.x)))
high_miss <- names(miss[miss > 0.30])
df <- raw |> select(-any_of(high_miss))
# Remove death/hospice (cannot be readmitted) and unknown gender; keep first encounter
df <- df |>
filter(!discharge_disposition_id %in% c(11, 13, 14, 19, 20, 21)) |>
filter(gender != "Unknown/Invalid") |>
arrange(patient_nbr, encounter_id) |>
distinct(patient_nbr, .keep_all = TRUE) |>
drop_na(race, diag_1)
# Derive features
df <- df |>
mutate(
readmit_30 = as.integer(readmitted == "<30"),
age_mid = readr::parse_number(str_extract(age, "\\d+")) + 5,
male = as.integer(gender == "Male"),
race_AA = as.integer(race == "AfricanAmerican"),
race_Hisp = as.integer(race == "Hispanic"),
race_Other = as.integer(!race %in% c("Caucasian","AfricanAmerican","Hispanic")),
a1c_tested = as.integer(A1Cresult != "None"),
meds_changed = as.integer(change == "Ch"),
on_diabetesMed = as.integer(diabetesMed == "Yes"),
insulin_on = as.integer(insulin != "No"),
adm_emergency = as.integer(admission_type_id == 1),
adm_urgent = as.integer(admission_type_id == 2)
)
cat(sprintf("Cleaned cohort: %s patients, baseline 30-day readmit rate %.2f%%\n",
format(nrow(df), big.mark = ","),
100 * mean(df$readmit_30)))
# ----- 2. Module 4 carry-over inferential tests ------------------------
# Chi-square: HbA1c testing
tab_a1c <- table(df$a1c_tested, df$readmit_30)
chi_a1c <- chisq.test(tab_a1c)
phi_a1c <- sqrt(chi_a1c$statistic / sum(tab_a1c))
# Chi-square: Age
chi_age <- chisq.test(table(df$age, df$readmit_30))
# Chi-square: Race
chi_race <- chisq.test(table(df$race, df$readmit_30))
# One-way ANOVA on length of stay by admission type (1, 2, 3)
df_anova <- df |> filter(admission_type_id %in% c(1, 2, 3))
aov_los <- aov(time_in_hospital ~ factor(admission_type_id), data = df_anova)
tukey <- TukeyHSD(aov_los)
cat("\n--- Module 4 carry-over tests ---\n")
cat(sprintf("HbA1c: X^2=%.2f, df=%d, p=%.4f, phi=%.3f\n",
chi_a1c$statistic, chi_a1c$parameter, chi_a1c$p.value, phi_a1c))
cat(sprintf("Age: X^2=%.2f, df=%d, p=%.4g\n",
chi_age$statistic, chi_age$parameter, chi_age$p.value))
cat(sprintf("Race: X^2=%.2f, df=%d, p=%.4f\n",
chi_race$statistic, chi_race$parameter, chi_race$p.value))
cat("ANOVA LOS by admission type:\n"); print(summary(aov_los))
# ----- 3. Build Ridge / Lasso / GLM ------------------------------------
features <- c("time_in_hospital", "num_lab_procedures", "num_procedures",
"num_medications", "number_outpatient", "number_emergency",
"number_inpatient", "number_diagnoses", "age_mid",
"race_AA", "race_Hisp", "race_Other", "male",
"a1c_tested", "meds_changed", "on_diabetesMed", "insulin_on",
"adm_emergency", "adm_urgent")
scale_cols <- features[1:9] # standardize numeric columns only
X <- df |> select(all_of(features)) |> as.matrix()
y <- df$readmit_30
# 70/30 stratified split
split <- initial_split(df, prop = 0.70, strata = readmit_30)
idx_tr <- split$in_id  # rsample stores the sampled training-row indices here
idx_te <- setdiff(seq_len(nrow(df)), idx_tr)
X_tr <- X[idx_tr, , drop = FALSE]; y_tr <- y[idx_tr]
X_te <- X[idx_te, , drop = FALSE]; y_te <- y[idx_te]
# Standardize numeric columns using TRAINING means/SDs (no test-set leakage)
mu_tr <- colMeans(X_tr[, scale_cols, drop = FALSE])
sd_tr <- apply(X_tr[, scale_cols, drop = FALSE], 2, sd)
X_tr_s <- X_tr; X_te_s <- X_te
for (c in scale_cols) {
X_tr_s[, c] <- (X_tr[, c] - mu_tr[c]) / sd_tr[c]
X_te_s[, c] <- (X_te[, c] - mu_tr[c]) / sd_tr[c]
}
# CV.GLMNET: Lasso (alpha=1) and Ridge (alpha=0), 5-fold AUC
cat("\n--- Fitting cv.glmnet ---\n")
cv_lasso <- cv.glmnet(X_tr_s, y_tr, family = "binomial",
alpha = 1, type.measure = "auc", nfolds = 5)
cv_ridge <- cv.glmnet(X_tr_s, y_tr, family = "binomial",
alpha = 0, type.measure = "auc", nfolds = 5)
fit_glm <- glm(y_tr ~ ., data = as.data.frame(X_tr_s), family = binomial())
# Predicted probabilities on TEST set
p_lasso <- as.numeric(predict(cv_lasso, X_te_s, s = "lambda.min", type = "response"))
p_ridge <- as.numeric(predict(cv_ridge, X_te_s, s = "lambda.min", type = "response"))
p_glm <- as.numeric(predict(fit_glm, newdata = as.data.frame(X_te_s), type = "response"))
# AUCs
auc_l <- as.numeric(pROC::auc(y_te, p_lasso))
auc_r <- as.numeric(pROC::auc(y_te, p_ridge))
auc_g <- as.numeric(pROC::auc(y_te, p_glm))
cat(sprintf("Held-out AUCs: GLM=%.3f Ridge=%.3f Lasso=%.3f\n",
auc_g, auc_r, auc_l))
# Lasso coefficient vector (excluding intercept)
co_l <- as.numeric(coef(cv_lasso, s = "lambda.min"))
intercept <- co_l[1]
beta <- co_l[-1]; names(beta) <- features
n_active <- sum(abs(beta) > 1e-8)
cat(sprintf("Lasso retained %d of %d features.\n", n_active, length(features)))
# Ridge coefficients (always all retained)
co_r <- as.numeric(coef(cv_ridge, s = "lambda.min"))[-1]
names(co_r) <- features
# Threshold: anchor at 0.085 (matches the report and predictor)
THRESHOLD <- 0.085
y_pred <- as.integer(p_lasso >= THRESHOLD)
cm <- caret::confusionMatrix(factor(y_pred, levels = c(0, 1)),
factor(y_te, levels = c(0, 1)),
positive = "1")
TN <- cm$table[1,1]; FP <- cm$table[2,1]; FN <- cm$table[1,2]; TP <- cm$table[2,2]
sens <- TP / (TP + FN); spec <- TN / (TN + FP); ppv <- TP / (TP + FP)
cat(sprintf("Confusion at thr=%.3f: TN=%d FP=%d FN=%d TP=%d Sens=%.2f Spec=%.2f PPV=%.2f\n",
THRESHOLD, TN, FP, FN, TP, sens, spec, ppv))
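# (Sketch, not part of the original pipeline) The 0.085 anchor comes from the
# report; sweeping nearby thresholds makes the sensitivity/specificity
# trade-off explicit before committing to one operating point.
for (thr in c(0.06, 0.085, 0.11, 0.15)) {
  pred_thr <- as.integer(p_lasso >= thr)
  cat(sprintf("thr=%.3f sens=%.2f spec=%.2f flagged=%.0f%%\n",
              thr,
              sum(pred_thr == 1 & y_te == 1) / sum(y_te == 1),
              sum(pred_thr == 0 & y_te == 0) / sum(y_te == 0),
              100 * mean(pred_thr)))
}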
# ----- 4. Pretty labels for figures ------------------------------------
pretty <- c(
time_in_hospital = "Length of stay",
num_lab_procedures = "# Lab procedures",
num_procedures = "# In-hospital procedures",
num_medications = "# Medications",
number_outpatient = "Prior outpatient visits",
number_emergency = "Prior ER visits",
number_inpatient = "Prior inpatient visits",
number_diagnoses = "# Diagnoses",
age_mid = "Age",
race_AA = "Race: African American",
race_Hisp = "Race: Hispanic",
race_Other = "Race: Other",
male = "Sex: Male",
a1c_tested = "HbA1c was tested",
meds_changed = "Medication changed",
on_diabetesMed = "On diabetes medication",
insulin_on = "On insulin",
adm_emergency = "Admission: Emergency",
adm_urgent = "Admission: Urgent"
)
# =====================================================================
# FIGURE 1 — Held-out AUC comparison (matches docx Figure 1)
# =====================================================================
# Assemble the AUC summary table the figure plots (was missing above)
fig1_df <- tibble(
model = factor(c("Unregularized GLM", "Ridge (L2)", "Lasso (L1)"),
levels = c("Unregularized GLM", "Ridge (L2)", "Lasso (L1)")),
auc = c(auc_g, auc_r, auc_l)
)
fig1 <- ggplot(fig1_df, aes(model, auc, fill = model)) +
geom_col(width = 0.6, color = "white") +
geom_text(aes(label = sprintf("%.3f", auc)),
vjust = -0.5, fontface = "bold", size = 4) +
geom_hline(yintercept = 0.5, linetype = "dotted", color = "grey60") +
scale_fill_manual(
values = c("Unregularized GLM" = "#9CA3AF",
"Ridge (L2)" = "#2D7D9A",
"Lasso (L1)" = "#1F4E6B"),
guide = "none"
) +
scale_y_continuous(breaks = seq(0.45, 0.65, 0.05)) +
coord_cartesian(ylim = c(0.45, 0.68)) + # headroom so the value labels aren't clipped
labs(title = "Figure 1. Discrimination of three logistic models",
x = NULL, y = "Held-out AUC (30% test set)") +
theme_paper()
print(fig1)
# =====================================================================
# FIGURE 2 — Lasso variable importance (matches docx Figure 2)
# =====================================================================
fig2_df <- tibble(feature = features, beta = beta) |>
mutate(label = pretty[feature],
active = abs(beta) > 1e-8,
dir = case_when(beta > 0.001 ~ "Increases risk",
beta < -0.001 ~ "Decreases risk",
TRUE ~ "Shrunk to zero")) |>
arrange(desc(abs(beta))) |>
mutate(label = factor(label, levels = rev(label)))
fig2 <- ggplot(fig2_df, aes(beta, label, fill = dir)) +
geom_col(data = filter(fig2_df, active), width = 0.7, color = "white") +
geom_text(data = filter(fig2_df, active),
aes(label = sprintf("%+.3f", beta),
hjust = ifelse(beta > 0, -0.15, 1.15)),
size = 3, fontface = "bold") +
geom_text(data = filter(fig2_df, !active),
aes(x = 0.005, label = " shrunk to 0"),
size = 2.8, color = "grey55", fontface = "italic", hjust = 0) +
geom_vline(xintercept = 0, color = "grey30", linewidth = 0.4) +
scale_fill_manual(values = c("Increases risk" = PAL$accent,
"Decreases risk" = PAL$cool,
"Shrunk to zero" = "grey85"),
name = NULL) +
scale_x_continuous(expand = expansion(mult = c(0.18, 0.18))) +
labs(title = sprintf("Figure 2. Lasso variable importance — %d of %d retained",
n_active, length(features)),
x = "Standardized log-odds coefficient (β)", y = NULL) +
theme_paper() +
theme(legend.position = "bottom")
print(fig2)
# =====================================================================
# FIGURE 3 — ROC overlay with threshold marker (matches docx Figure 4)
# =====================================================================
# Build a tidy ROC curve per model, computing each pROC::roc object once
roc_tbl <- function(p, label) {
r <- pROC::roc(y_te, p, quiet = TRUE)
tibble(fpr = 1 - r$specificities,
tpr = r$sensitivities,
model = label) |>
arrange(fpr)
}
roc_df <- bind_rows(
roc_tbl(p_glm,   sprintf("GLM (AUC = %.3f)",   auc_g)),
roc_tbl(p_ridge, sprintf("Ridge (AUC = %.3f)", auc_r)),
roc_tbl(p_lasso, sprintf("Lasso (AUC = %.3f)", auc_l))
)
roc_df$model <- factor(roc_df$model, levels = unique(roc_df$model))
fig3 <- ggplot(roc_df, aes(fpr, tpr, color = model)) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey60") +
geom_line(linewidth = 1.0) +
geom_point(aes(x = 1 - spec, y = sens), color = PAL$accent, size = 4,
inherit.aes = FALSE) +
annotate("label", x = 1 - spec + 0.15, y = sens - 0.10,
label = sprintf("Threshold = %.3f\nSens = %.2f, Spec = %.2f",
THRESHOLD, sens, spec),
size = 3, color = "grey20", fontface = "plain",
label.size = 0.3, label.padding = unit(0.4, "lines")) +
scale_color_manual(values = c(PAL$gray, PAL$teal, PAL$navy)) +
coord_equal(xlim = c(-0.02, 1.02), ylim = c(-0.02, 1.02)) +
labs(title = "Figure 3. ROC curves on the held-out test set",
x = "False positive rate (1 − specificity)",
y = "True positive rate (sensitivity)") +
theme_paper() +
theme(legend.position = c(0.78, 0.18),
legend.background = element_rect(fill = "white", color = NA))
print(fig3)
# =====================================================================
# FIGURE 4 — Confusion matrix at chosen threshold (matches docx Figure 5)
# =====================================================================
cm_df <- tibble(
Actual = c("Actual: No", "Actual: No", "Actual: Yes", "Actual: Yes"),
Predicted = c("Predicted: No", "Predicted: Yes", "Predicted: No", "Predicted: Yes"),
count = c(TN, FP, FN, TP),
cell = c("TN", "FP", "FN", "TP")
) |> mutate(
pct = 100 * count / sum(count),
fill = case_when(cell == "TN" ~ "#E8F1F4",
cell == "TP" ~ "#DCE8E0",
TRUE ~ "#FBE5DF"),
color = case_when(cell == "TN" ~ "#1F4E6B",
cell == "TP" ~ "#2C5C42",
TRUE ~ "#7A2A1F")
)
fig4 <- ggplot(cm_df, aes(Predicted, fct_rev(factor(Actual)), fill = I(fill))) +
geom_tile(color = "white", linewidth = 3) +
geom_text(aes(label = paste0(cell, "\n", scales::comma(count),
"\n(", sprintf("%.1f%%", pct), ")"),
color = I(color)),
fontface = "bold", lineheight = 1.1, size = 4) +
labs(title = sprintf("Figure 4. Confusion matrix at threshold = %.3f", THRESHOLD),
subtitle = sprintf("Sensitivity = %.2f Specificity = %.2f PPV = %.2f n = %s",
sens, spec, ppv, scales::comma(TN+FP+FN+TP)),
x = NULL, y = NULL) +
theme_paper() +
theme(panel.grid = element_blank(),
axis.ticks = element_blank())
print(fig4)
# =====================================================================
# FIGURE 5 — Lasso vs Ridge coefficient comparison (matches docx Figure 3)
# =====================================================================
fig5_df <- tibble(feature = features,
lasso = beta,
ridge = co_r) |>
mutate(label = pretty[feature]) |>
arrange(desc(abs(ridge))) |>
mutate(label = factor(label, levels = rev(label))) |>
pivot_longer(c(lasso, ridge), names_to = "method", values_to = "coef") |>
mutate(method = factor(if_else(method == "lasso",
"Lasso (L1) — shrinks AND eliminates",
"Ridge (L2) — shrinks every coefficient"),
levels = c("Ridge (L2) — shrinks every coefficient",
"Lasso (L1) — shrinks AND eliminates")))
# Mark Lasso-zero rows
fig5_zero <- fig5_df |>
filter(grepl("Lasso", method) & abs(coef) < 1e-8)
fig5 <- ggplot(fig5_df, aes(coef, label, fill = method)) +
geom_col(position = position_dodge(width = 0.7), width = 0.6, color = "white") +
geom_point(data = fig5_zero, aes(x = 0), color = PAL$accent, size = 3,
show.legend = FALSE,
position = position_nudge(y = -0.2)) +
geom_vline(xintercept = 0, color = "grey30", linewidth = 0.4) +
scale_fill_manual(values = c(PAL$teal, PAL$navy), name = NULL) +
labs(title = "Figure 5. Coefficient comparison — Ridge shrinks, Lasso selects",
x = "Standardized coefficient (β)", y = NULL,
caption = "Red dots mark coefficients Lasso shrunk exactly to zero.") +
theme_paper() +
theme(legend.position = "bottom",
plot.caption = element_text(color = "grey45", size = 9))
print(fig5)
# =====================================================================
# FIGURE 6 — Calibration plot by deciles (matches docx Figure 6)
# =====================================================================
cal_df <- tibble(p = p_lasso, y = y_te) |>
mutate(decile = ntile(p, 10)) |>
group_by(decile) |>
summarise(pred = mean(p), obs = mean(y), n = n(), .groups = "drop") |>
mutate(label = paste0("D", decile))
baseline <- mean(y)
mx <- max(c(cal_df$pred, cal_df$obs)) * 1.10
fig6 <- ggplot(cal_df, aes(pred, obs)) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey55") +
geom_hline(yintercept = baseline, linetype = "dotted", color = PAL$accent) +
geom_line(color = PAL$navy, linewidth = 0.8, alpha = 0.5) +
geom_point(aes(size = n), color = PAL$navy, alpha = 0.85) +
ggrepel::geom_text_repel(aes(label = label), size = 3, color = "grey25",
min.segment.length = Inf) +
scale_x_continuous(labels = scales::percent_format(accuracy = 1),
limits = c(0, mx)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1),
limits = c(0, mx)) +
scale_size_area(guide = "none", max_size = 9) +
coord_equal() +
labs(title = "Figure 6. Calibration on the held-out test set",
subtitle = sprintf("Population baseline = %.1f%% (red dotted)", 100 * baseline),
x = "Mean predicted probability (per decile)",
y = "Observed readmission rate (per decile)") +
theme_paper()
print(fig6)
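# (Sketch, complementary to Figure 6) Brier score: mean squared error of the
# predicted probabilities against the observed 0/1 outcomes; lower is better.
brier <- mean((p_lasso - y_te)^2)
cat(sprintf("Brier score (Lasso, test): %.4f vs. constant-baseline %.4f\n",
            brier, mean((mean(y_tr) - y_te)^2)))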
# =====================================================================
# FIGURE 7 — Score distribution by outcome (matches docx Figure 7)
# =====================================================================
dist_df <- tibble(p = p_lasso, y = factor(y_te,
levels = c(0,1),
labels = c(sprintf("Not readmitted (n = %s)",
scales::comma(sum(y_te==0))),
sprintf("Readmitted within 30 days (n = %s)",
scales::comma(sum(y_te==1))))))
fig7 <- ggplot(dist_df, aes(p, fill = y)) +
geom_histogram(bins = 60, alpha = 0.75, color = "white", position = "identity") +
geom_vline(xintercept = THRESHOLD, color = "grey15", linetype = "dashed", linewidth = 0.7) +
geom_vline(xintercept = baseline, color = "grey60", linetype = "dotted") +
scale_fill_manual(values = c(PAL$cool, PAL$accent), name = NULL) +
scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = "Figure 7. Score distribution — how well the model separates outcomes",
subtitle = sprintf("Decision threshold (dashed) = %.3f Baseline rate (dotted) = %.1f%%",
THRESHOLD, 100 * baseline),
x = "Predicted probability of 30-day readmission",
y = "Number of patients (test set)") +
theme_paper()
print(fig7)
cat("\nAll seven figures saved as PNG.\n")
# =====================================================================
# 5. Export coefficient table and predictor payload
# =====================================================================
coef_tbl <- tibble(variable = features,
lasso_beta = beta,
ridge_beta = co_r,
abs_beta = abs(beta),
active = ifelse(abs(beta) > 1e-8, "Yes", "No")) |>
arrange(desc(abs_beta))
write_csv(coef_tbl, "lasso_coefficients.csv")
# Means/SDs of the FULL dataset for predictor standardization
# (the predictor uses these to standardize new patient inputs)
mu_full <- colMeans(X[, scale_cols, drop = FALSE])
sd_full <- apply(X[, scale_cols, drop = FALSE], 2, sd)
payload <- list(
intercept = unname(intercept),
threshold = THRESHOLD,
baseline_rate = mean(y),
scale_cols = scale_cols,
features = features,
means = as.list(mu_full),
sds = as.list(sd_full),
coefs = as.list(beta),
auc_lasso = auc_l,
auc_ridge = auc_r,
auc_glm = auc_g,
n_active = n_active
)
jsonlite::write_json(payload, "module5_predictor_payload.json",
pretty = TRUE, auto_unbox = TRUE)
cat("\nWrote: lasso_coefficients.csv + module5_predictor_payload.json\n")
cat("\n--- DONE ---\n")
ALY6015 Intermediate Analytics · College of Professional Studies, Northeastern University · May 2026
Built with R, glmnet, plotly, and ggplot2 · Data: Diabetes 130-US Hospitals (Strack et al., 2014)