A multi-model decision-support and business-impact tool for diabetic discharge planning
68,061 patients 7 ML models 19 features Lasso AUC = 0.604 Live in browser
Every metric computed on the held-out test set at each model’s Youden-optimal threshold. The starred row is the best AUC.
SE(AUC) = sqrt(AUC(1-AUC) + (n+-1)(Q1-AUC^2) + (n–1)(Q2-AUC^2)) / sqrt(n+ n-)
All models cluster near AUC = 0.60. This is the information ceiling of administrative-data features for this outcome — not an algorithm-choice problem.
The curves are visually indistinguishable, confirming the AUC convergence.
Axes are min-max normalized across the 7 models, so the radar shows relative performance, not absolute. Look for models that bulge on the metric you care about (e.g. sensitivity for triage).
Each dot is one of 10 risk deciles. The diagonal is perfect calibration. Lasso’s points hug the diagonal closely — the predicted probabilities can be trusted at face value, which is essential for the business P&L math.
The curve flattens between 30K and 47K training patients. Additional data would not materially improve AUC — the bottleneck is feature signal, not sample size.
Strong correlations between admission-type indicators and between prior-utilization features explain Lasso’s feature drops — the dropped features are mostly redundant.
Lasso’s AUC across demographic subgroups. Differences under 0.02 are within statistical noise; gaps above 0.04 warrant investigation.
Reading the fairness audit. Performance is broadly comparable across race and gender subgroups (max AUC gap = 0.018). The largest disparity is by age: the model has higher discrimination for 60+ patients than for under-40s, likely because younger diabetics are rarer and the model has less training signal for them. Action: monitor performance for under-40 patients post-deployment.
Reading the scoreboard. Random Forest leads on sensitivity (71%) but pays in specificity (44%) — it casts the widest net. Lasso and Ridge tie for top AUC at 0.604 and both produce calibrated probabilities; Lasso uses fewer features. Logistic GLM is the interpretable baseline that gives you odds ratios for free. Pick Lasso for production deployment, Random Forest for sensitivity-first triage, GLM when stakeholders demand interpretability.
Where this patient ranks in the 68,061-patient cohort –
Predicted probability 8.3% · approx. 95% prediction interval [6.6%, 10.4%] based on cross-validated residual variance
Recent predictions (last 6) Each dot shows the predicted risk for a recent scenario you tested. The current scenario is the larger dot. click any dot to recall
How a clinician would use this: At point of discharge, enter the patient’s profile in under 60 seconds. If the badge reads Above baseline with the YES flag, route to the high-risk pathway — 48-hour follow-up call, home-health referral, transitional-care visit at day 7. The waterfall above explains why, which is what makes the decision auditable.
Why this tab matters. A single-number prediction is hard to trust. Seeing exactly how the model responds to a one-variable change — diabetes medication off versus on, holding everything else equal — builds clinician trust, surfaces model bugs, and exposes potential bias. Try the Prior Inpatient scenario: it’s the single strongest predictor in the Lasso model, and you can watch the gauge move from low to high.
Defaults sourced from Jencks et al. (NEJM 2009) and CMS HRRP penalty data.
The red marker is the threshold that maximizes annual net benefit under your current settings. Move the economic sliders above to see how it shifts.
Standard medical-economics framework for evaluating prediction models. The model beats both “treat everyone” and “treat no one” across all clinically reasonable thresholds — this is the formal justification for deploying it over either default.
0.08
Override the Youden-optimal threshold to test sensitivity-first or specificity-first deployments.
At this manual threshold: Net benefit – · ROI –. –
Three cost scenarios shown at each one’s Youden-optimal threshold. Lets decision-makers see which intervention budget produces the best net benefit.
Scaled to your annual discharge volume. Green = correct, red = errors.
The model concentrates risk in the top deciles — this is what powers triage even at modest AUC.
Reading the P&L. The break-even logic is simple: one prevented readmission ($15,400) is worth roughly 50 interventions ($300 each). The model only needs to convert one in 50 flagged patients into a prevented readmission to break even. At the Youden-optimal threshold and 30% prevention effectiveness, the model converts roughly one in 10 — a 5× margin over break-even.
Sensitivity analysis. Try halving prevention effectiveness to 0.15: the model still produces positive net benefit. Try setting intervention cost to $1,000 (very expensive program): the optimal threshold rises but net benefit remains positive. The model is robust to substantial parameter uncertainty — this is the case for deploying it.
Showing full cohort (N = 68,061)
After preprocessing the raw 101,766-encounter dataset, the analytic sample contains 68,061 first-encounter patients. Age dominates the distribution, with the modal patient in the 60-80 band.
Method: Chi-square test of independence on the 5-level disposition × binary outcome cross-tabulation. Cramér’s V reported to quantify effect strength independent of sample size.
Result: χ²(4, N = 68,061) = 757.06, p < .001, Cramér’s V = 0.106. We reject the null hypothesis of independence.
Method: Two-way ANOVA with interaction term. F-statistics and eta-squared (η²) reported separately to distinguish statistical from practical significance.
Result: Admission F(2, 60,253) = 53.54, p < .001, η² = 0.0018; Age F(2, 60,253) = 373.88, p < .001, η² = 0.0122; Interaction F(4, 60,253) = 2.85, p = .023, η² = 0.0002. All three null hypotheses rejected.
Method: Compare 7 classifiers (logistic GLM, Ridge L2, Lasso L1, Elastic Net α=0.5, Random Forest, XGBoost, Neural Network) on the same held-out test split.
Result: All models cluster at AUC ≈ 0.60−0.61. Lasso retains 10 of 19 features — a 47% reduction at zero AUC cost.
For the seven-model comparison details, see the Model Scoreboard tab.
What three questions together tell us. Discharge pathway is a strong leading indicator (Q1). Age, not admission type, drives capacity needs (Q2). And the predictive signal in administrative data tops out at AUC ≈ 0.60 regardless of algorithm — but that’s still enough to produce a 3× risk gradient between deciles, which is the operational engine of triage (Q3). The three answers together justify a deployment-ready model used in conjunction with discharge-pathway redesign, not in place of it.
The Diabetes 130-US Hospitals dataset (Strack et al., 2014), accessed via the UCI Machine Learning Repository, contains 101,766 inpatient encounters across 130 U.S. hospitals between 1999 and 2008.
weight (97% missing), payer_code (40%), medical_specialty (49%).
patient_nbr to avoid intra-patient dependence.
Tests the null hypothesis that discharge disposition and 30-day readmission are statistically independent. Cramér’s V quantifies effect strength (0 = none, 1 = perfect), correcting for sample size so that the result is interpretable independent of the n = 68,061 denominator that would make any tiny effect “significant”.
Tests three simultaneous nulls: no admission-type effect, no age-group effect, no interaction. Eta-squared (η²) is reported for each effect because in a 68,061-row dataset every p-value will be vanishingly small — we need an effect-size measure to know if any of it matters operationally.
cv.glmnet, α = 0, 5-fold CV optimized for AUC. Shrinks coefficients toward zero without dropping features.
70/30 stratified train/test split with caret::createDataPartition and set.seed(42). All metrics reported on the held-out test set. Optimal classification threshold chosen by Youden’s J (maximizing sensitivity + specificity − 1).
All seven models are trained once, offline. The Lasso intercept, the per-model coefficient vectors (for linear models) or linear surrogates fit on logit(p) (for tree and neural models), and the standardization statistics are exported as a single 20 KB JSON cache. The cache is embedded into the published HTML page; a JavaScript engine reads it and recomputes predictions, threshold-based decisions, and per-feature contributions in real time as the user moves sliders. No server, no R session, no compute budget, and no latency.
Required for reproducibility: the precise operational definition of every feature in the model, with the source variable in the raw dataset.
| Feature | Type | Definition | Source |
|---|---|---|---|
age_num
|
numeric | Age band midpoint (5, 15, 25, …, 95) derived from the original 10-year age groups |
age
|
gender_bin
|
binary | 1 if Male, 0 if Female; Unknown/Invalid records dropped |
gender
|
race_AA
|
binary | 1 if African American, 0 otherwise |
race
|
race_His
|
binary | 1 if Hispanic, 0 otherwise |
race
|
race_Oth
|
binary | 1 if Other (Asian, Other), 0 otherwise (Caucasian is reference) |
race
|
adm_emerg
|
binary | 1 if Emergency admission (admission_type_id = 1), 0 otherwise |
admission_type_id
|
adm_urgent
|
binary | 1 if Urgent admission (admission_type_id = 2), 0 otherwise |
admission_type_id
|
time_in_hospital
|
integer | Length of stay in days (1-14, integer) |
time_in_hospital
|
num_lab_procedures
|
integer | Number of laboratory tests during the encounter |
num_lab_procedures
|
num_procedures
|
integer | Number of non-lab procedures performed during the encounter |
num_procedures
|
num_medications
|
integer | Number of distinct generic medication names administered |
num_medications
|
number_outpatient
|
integer | Outpatient visits in the 365 days before the index encounter |
number_outpatient
|
number_emergency
|
integer | Emergency-room visits in the 365 days before the index encounter |
number_emergency
|
number_inpatient
|
integer | Inpatient admissions in the 365 days before the index encounter |
number_inpatient
|
number_diagnoses
|
integer | Number of distinct diagnosis codes assigned to the encounter |
number_diagnoses
|
A1C_tested
|
binary | 1 if HbA1c was measured during the encounter (any result), 0 if “None” |
A1Cresult
|
change_bin
|
binary | 1 if any diabetic medication was changed during the encounter, 0 otherwise |
change
|
diabetesMed_bin
|
binary | 1 if patient is prescribed diabetes medication at discharge, 0 otherwise |
diabetesMed
|
insulin_bin
|
binary | 1 if insulin was prescribed or its dose changed during the encounter, 0 otherwise |
insulin
|
Centers for Medicare & Medicaid Services. (2023). Hospital Readmissions Reduction Program (HRRP). https://www.cms.gov/medicare/quality-initiatives-patient-assessment-instruments/value-based-programs/hrrp/hospital-readmissions-reduction-program-hrrp
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794. https://doi.org/10.1145/2939672.2939785
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1-22.
Jencks, S. F., Williams, M. V., & Coleman, E. A. (2009). Rehospitalizations among patients in the Medicare fee-for-service program. New England Journal of Medicine, 360(14), 1418-1428.
Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5), 1-26.
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18-22.
Strack, B., DeShazo, J. P., Gennings, C., Olmo, J. L., Ventura, S., Cios, K. J., & Clore, J. N. (2014). Impact of HbA1c measurement on hospital readmission rates. BioMed Research International, 2014, 781670.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1), 267-288.
Reproducibility: set.seed(42). Place diabetic_data.csv in the working directory and source the script. The exported model_cache.json (~20 KB) is what this RPubs page embeds at knit time.
# =============================================================================
# ALY 6015 — Final Project (Module 6)
# Predicting 30-Day Hospital Readmission in Diabetic Patients
# Multi-Model Risk Lab — trains 7 classifiers, runs 3 statistical tests,
# prints every figure in the report, exports JSON cache for the RPubs app.
#
# Authors: Anush Goel · Mandeep Kaur · Sandeep Kaur
# Date: May 2026
# =============================================================================
# Sections (each starts with "# ---- N. Name -----" so the Appendix tab
# automatically splits them into collapsible blocks):
# 0. Setup
# 1. Data preprocessing (101,766 -> 68,061)
# 2. Q1 — Chi-square: discharge disposition x readmission
# 3. Q2 — Two-way ANOVA: length of stay ~ admission x age
# 4. Q3 — Train 7 classifiers (GLM, Ridge, Lasso, Enet, RF, XGB, NN)
# 5. Score & cache (model_cache.json)
# 6. Report figures — print only, no file saving
# Reproducibility: set.seed(42)
# =============================================================================
# ---- 0. Environment Setup & Reproducibility ----
pkgs <- c("tidyverse","janitor","glmnet","randomForest","xgboost",
"nnet","pROC","caret","jsonlite","plotly","scales","kableExtra")
for (p in pkgs) if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
invisible(lapply(pkgs, library, character.only = TRUE))
set.seed(42)
options(scipen = 999)
NAVY <- "#1A3A6C"; TEAL <- "#2C8AA8"; ORANGE <- "#E07B1F"
RED <- "#C1352D"; GREEN <- "#3A8A3A"; GREY <- "#7A7A7A"
report_theme <- theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold", color = NAVY, size = 13,
margin = ggplot2::margin(b = 10)),
plot.subtitle = element_text(color = GREY, size = 10,
margin = ggplot2::margin(b = 10)),
axis.title = element_text(size = 11),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(linewidth = 0.3, linetype = "dashed"),
legend.position = "top",
legend.title = element_text(face = "bold", size = 10))
# ---- 1. Data Preprocessing & Cohort Construction ----
df <- read_csv("diabetic_data.csv", na = c("?", ""), show_col_types = FALSE)
cat("Raw encounters:", nrow(df), "\n")
df <- df %>%
select(-weight, -payer_code, -medical_specialty) %>%
filter(!discharge_disposition_id %in% c(11, 13, 14, 19, 20, 21)) %>%
arrange(encounter_id) %>%
distinct(patient_nbr, .keep_all = TRUE) %>%
filter(!is.na(race), !is.na(diag_1), gender != "Unknown/Invalid") %>%
mutate(readmit_30 = if_else(readmitted == "<30", 1L, 0L),
A1C_tested = as.integer(A1Cresult != "None"),
change_bin = as.integer(change == "Ch"),
diabetesMed_bin = as.integer(diabetesMed == "Yes"),
insulin_bin = as.integer(insulin != "No"),
gender_bin = as.integer(gender == "Male"),
race_AA = as.integer(race == "AfricanAmerican"),
race_His = as.integer(race == "Hispanic"),
race_Oth = as.integer(race == "Other"),
age_num = case_when(
age == "[0-10)" ~ 5, age == "[10-20)" ~ 15, age == "[20-30)" ~ 25,
age == "[30-40)" ~ 35, age == "[40-50)" ~ 45, age == "[50-60)" ~ 55,
age == "[60-70)" ~ 65, age == "[70-80)" ~ 75, age == "[80-90)" ~ 85,
age == "[90-100)" ~ 95),
adm_emerg = as.integer(admission_type_id == 1),
adm_urgent = as.integer(admission_type_id == 2))
cat("Cleaned analytic sample:", nrow(df), "\n")
cat("30-day readmission rate:", scales::percent(mean(df$readmit_30), 0.01), "\n")
# ---- 2. Research Question 1: Discharge Disposition Chi-Square Analysis ----
df <- df %>% mutate(disp_grp = case_when(
discharge_disposition_id == 1 ~ "Home",
discharge_disposition_id %in% c(3, 5, 6) ~ "SNF/Rehab",
discharge_disposition_id == 2 ~ "Other Hospital",
discharge_disposition_id %in% c(4, 22, 23, 24) ~ "Home w/ Health Svc",
TRUE ~ "Other"))
tab_q1 <- table(df$disp_grp, df$readmit_30)
q1_test <- chisq.test(tab_q1); print(q1_test)
cramers_v <- sqrt(q1_test$statistic / (sum(tab_q1) * (min(dim(tab_q1)) - 1)))
cat("Cramer's V:", round(cramers_v, 4), "\n")
# ---- 3. Research Question 2: Length-of-Stay Two-Way ANOVA Analysis ----
sub <- df %>%
filter(admission_type_id %in% 1:3) %>%
mutate(adm = factor(admission_type_id, levels = c(3, 1, 2),
labels = c("Elective", "Emergency", "Urgent")),
age_grp = case_when(
age %in% c("[0-10)","[10-20)","[20-30)","[30-40)") ~ "<40",
age %in% c("[40-50)","[50-60)") ~ "40-60",
TRUE ~ "60+"),
age_grp = factor(age_grp, levels = c("<40", "40-60", "60+")))
q2_fit <- aov(time_in_hospital ~ adm * age_grp, data = sub)
print(summary(q2_fit))
ssq <- summary(q2_fit)[[1]][["Sum Sq"]]
cat("Eta sq (admission): ", round(ssq[1]/sum(ssq), 4), "\n")
cat("Eta sq (age): ", round(ssq[2]/sum(ssq), 4), "\n")
cat("Eta sq (interaction):", round(ssq[3]/sum(ssq), 4), "\n")
# ---- 4. Research Question 3: Seven-Model Predictive Comparison ----
feat_cols <- c("time_in_hospital","num_lab_procedures","num_procedures",
"num_medications","number_outpatient","number_emergency",
"number_inpatient","number_diagnoses","age_num",
"race_AA","race_His","race_Oth","gender_bin","A1C_tested",
"change_bin","diabetesMed_bin","insulin_bin",
"adm_emerg","adm_urgent")
feat_labels <- c("Length of stay (days)","# Lab procedures","# In-hospital procedures",
"# Medications","Prior outpatient visits","Prior ER visits",
"Prior inpatient visits","# Diagnoses","Age",
"Race: African American","Race: Hispanic","Race: Other",
"Sex: Male","HbA1c tested","Medication changed",
"On diabetes medication","On insulin",
"Admission: Emergency","Admission: Urgent")
X <- as.matrix(df[, feat_cols]); y <- df$readmit_30
idx <- caret::createDataPartition(y, p = 0.7, list = FALSE)
X_tr <- X[idx, ]; y_tr <- y[idx]
X_te <- X[-idx, ]; y_te <- y[-idx]
mu <- colMeans(X_tr); sd_ <- apply(X_tr, 2, sd)
X_tr_s <- scale(X_tr, center = mu, scale = sd_)
X_te_s <- scale(X_te, center = mu, scale = sd_)
cat("Train:", nrow(X_tr_s), "Test:", nrow(X_te_s), "\n")
# 4a — Logistic GLM
m_glm <- glm(y_tr ~ ., data = data.frame(X_tr_s, y_tr = y_tr), family = "binomial")
# 4b — Ridge (L2)
m_ridge <- cv.glmnet(X_tr_s, y_tr, family = "binomial",
alpha = 0, type.measure = "auc", nfolds = 5)
# 4c — Lasso (L1)
m_lasso <- cv.glmnet(X_tr_s, y_tr, family = "binomial",
alpha = 1, type.measure = "auc", nfolds = 5)
# 4d — Elastic Net (alpha = 0.5)
m_enet <- cv.glmnet(X_tr_s, y_tr, family = "binomial",
alpha = 0.5, type.measure = "auc", nfolds = 5)
# 4e — Random Forest
m_rf <- randomForest(x = X_tr_s, y = factor(y_tr),
ntree = 200, maxnodes = 100, nodesize = 20)
# 4f — XGBoost
dtrain <- xgb.DMatrix(X_tr_s, label = y_tr)
m_xgb <- xgb.train(data = dtrain, nrounds = 300, max_depth = 4,
eta = 0.05, subsample = 0.85, colsample_bytree = 0.85,
objective = "binary:logistic", eval_metric = "auc",
verbose = 0)
# 4g — Neural Network
m_nn <- nnet(x = X_tr_s, y = y_tr, size = 16, decay = 0.001,
maxit = 200, trace = FALSE, MaxNWts = 3000)
# ---- 5. Model Evaluation & Production Cache Export ----
score <- function(probs, y_true) {
roc_obj <- pROC::roc(y_true, probs, quiet = TRUE)
auc_val <- as.numeric(pROC::auc(roc_obj))
thr <- as.numeric(pROC::coords(roc_obj, "best", best.method = "youden")["threshold"])
pred <- as.integer(probs >= thr)
cm <- table(actual = y_true, predicted = factor(pred, levels = c(0, 1)))
list(auc = round(auc_val, 4), threshold = round(thr, 4),
sensitivity = round(cm[2,2] / sum(cm[2,]), 4),
specificity = round(cm[1,1] / sum(cm[1,]), 4),
precision = round(cm[2,2] / max(sum(cm[,2]), 1), 4),
accuracy = round((cm[1,1] + cm[2,2]) / sum(cm), 4),
f1 = round(2 * cm[2,2] / (2 * cm[2,2] + cm[1,2] + cm[2,1]), 4),
tn = as.integer(cm[1,1]), fp = as.integer(cm[1,2]),
fn = as.integer(cm[2,1]), tp = as.integer(cm[2,2]),
fpr_curve = as.numeric(1 - roc_obj$specificities),
tpr_curve = as.numeric(roc_obj$sensitivities))
}
p_glm <- predict(m_glm, newdata = data.frame(X_te_s), type = "response")
p_ridge <- as.numeric(predict(m_ridge, X_te_s, s = "lambda.min", type = "response"))
p_lasso <- as.numeric(predict(m_lasso, X_te_s, s = "lambda.min", type = "response"))
p_enet <- as.numeric(predict(m_enet, X_te_s, s = "lambda.min", type = "response"))
p_rf <- predict(m_rf, X_te_s, type = "prob")[, "1"]
p_xgb <- predict(m_xgb, xgb.DMatrix(X_te_s))
p_nn <- as.numeric(predict(m_nn, X_te_s, type = "raw"))
scoreboard <- tibble(
Model = c("Logistic GLM","Ridge (L2)","Lasso (L1)","Elastic Net",
"Random Forest","XGBoost","Neural Network"),
AUC = c(score(p_glm,y_te)$auc, score(p_ridge,y_te)$auc, score(p_lasso,y_te)$auc,
score(p_enet,y_te)$auc, score(p_rf,y_te)$auc, score(p_xgb,y_te)$auc,
score(p_nn,y_te)$auc))
print(scoreboard)
# Linear surrogate on logit(p) for non-linear models — lets the JS engine recompute
surrogate <- function(probs, X_s) {
eps <- 1e-6
p_clip <- pmin(pmax(probs, eps), 1 - eps)
lgt <- log(p_clip / (1 - p_clip))
fit <- lm(lgt ~ X_s)
list(coefs = as.numeric(coef(fit)[-1]),
intercept = as.numeric(coef(fit)[1]))
}
thr_grid <- function(probs, y_true) {
thr <- seq(0.01, 0.5, length.out = 50)
tp_v <- integer(length(thr)); fp_v <- integer(length(thr))
for (i in seq_along(thr)) {
pred <- as.integer(probs >= thr[i])
tp_v[i] <- sum(pred == 1 & y_true == 1)
fp_v[i] <- sum(pred == 1 & y_true == 0)
}
list(thr = thr, tp = tp_v, fp = fp_v,
all_positives = as.integer(sum(y_true == 1)),
all_negatives = as.integer(sum(y_true == 0)))
}
decile_gradient <- function(probs, y_true) {
d <- ntile(probs, 10)
out <- tibble(p = probs, y = y_true, d = d) %>%
group_by(d) %>%
summarise(pred = mean(p), obs = mean(y), n = n(), .groups = "drop")
list(decile = as.integer(out$d - 1L), pred = round(out$pred, 4),
obs = round(out$obs, 4), n = as.integer(out$n))
}
build_entry <- function(name, kind, probs, X_te_s, y_te,
coefs = NULL, intercept = NULL, importance = NULL) {
s <- score(probs, y_te)
entry <- list(
name = name, kind = kind,
metrics = list(auc = s$auc, threshold = s$threshold,
sensitivity = s$sensitivity, specificity = s$specificity,
precision = s$precision, accuracy = s$accuracy, f1 = s$f1,
tn = s$tn, fp = s$fp, fn = s$fn, tp = s$tp,
fpr_curve = round(s$fpr_curve[seq(1, length(s$fpr_curve),
length.out = 80)], 4),
tpr_curve = round(s$tpr_curve[seq(1, length(s$tpr_curve),
length.out = 80)], 4)),
deciles = decile_gradient(probs, y_te),
threshold_grid = thr_grid(probs, y_te))
if (kind == "linear") {
entry$coefs <- round(coefs, 5)
entry$intercept <- round(intercept, 5)
} else {
surr <- surrogate(probs, X_te_s)
entry$surrogate_coefs <- round(surr$coefs, 5)
entry$surrogate_intercept <- round(surr$intercept, 5)
if (!is.null(importance))
entry$feature_importance <- round(importance, 5)
}
entry
}
glm_coefs <- as.numeric(coef(m_glm))
ridge_coefs <- as.numeric(coef(m_ridge, s = "lambda.min"))
lasso_coefs <- as.numeric(coef(m_lasso, s = "lambda.min"))
enet_coefs <- as.numeric(coef(m_enet, s = "lambda.min"))
payload <- list(
features = feat_cols, labels = feat_labels,
ui = lapply(seq_along(feat_cols), function(i) {
fc <- feat_cols[i]
if (fc %in% c("time_in_hospital","num_lab_procedures","num_procedures",
"num_medications","number_outpatient","number_emergency",
"number_inpatient","number_diagnoses","age_num"))
list(type = "slider",
min = as.integer(min(X[, fc])),
max = as.integer(quantile(X[, fc], 0.99)),
default = as.integer(round(mean(X[, fc]))),
step = 1L)
else
list(type = "toggle", default = 0L)
}),
mu = round(mu, 5), sd = round(sd_, 5),
population_rate = round(mean(y_te), 5),
n_test = length(y_te),
models = list(
glm = build_entry("Logistic GLM", "linear", p_glm, X_te_s, y_te,
glm_coefs[-1], glm_coefs[1]),
ridge = build_entry("Ridge (L2)", "linear", p_ridge, X_te_s, y_te,
ridge_coefs[-1], ridge_coefs[1]),
lasso = build_entry("Lasso (L1)", "linear", p_lasso, X_te_s, y_te,
lasso_coefs[-1], lasso_coefs[1]),
enet = build_entry("Elastic Net", "linear", p_enet, X_te_s, y_te,
enet_coefs[-1], enet_coefs[1]),
rf = build_entry("Random Forest", "tree", p_rf, X_te_s, y_te,
importance = as.numeric(importance(m_rf))),
xgb = build_entry("XGBoost", "tree", p_xgb, X_te_s, y_te,
importance = xgb.importance(model = m_xgb)$Gain),
nn = build_entry("Neural Network","nn", p_nn, X_te_s, y_te)
)
)
write_json(payload, "model_cache.json", auto_unbox = TRUE,
digits = 5, pretty = FALSE)
cat("Cache exported:", file.size("model_cache.json"), "bytes\n")
# ---- 6. Publication-Ready Figures (Print Only) ----
# Each block recreates one figure that appears in the Word report. Run the
# script interactively (or source it inside RStudio) to see them in the Plots
# pane. No ggsave / no file writes — pure print() calls.
# Figure 1 — Cohort funnel
fig1 <- tibble(
Step = factor(c("Raw\nencounters","After death/\nhospice removal",
"First encounter\nper patient","After missing\nfilter"),
levels = c("Raw\nencounters","After death/\nhospice removal",
"First encounter\nper patient","After missing\nfilter")),
N = c(101766, 99343, 69990, 68061)) %>%
ggplot(aes(x = Step, y = N)) +
geom_col(fill = c(NAVY, TEAL, TEAL, GREEN), width = 0.6,
color = "white", linewidth = 1.5) +
geom_text(aes(label = scales::comma(N)), vjust = -0.5,
fontface = "bold", color = NAVY, size = 4) +
scale_y_continuous(labels = comma, limits = c(0, 115000)) +
labs(title = "Figure 1. Cohort Construction — From Raw Data to Analytic Sample",
x = NULL, y = "Encounters") +
report_theme
print(fig1)
# Figure 2a — Outcome distribution donut (No / Readmit >30d / Readmit <30d)
# Counts derived from the cleaned cohort
out_counts <- df %>%
mutate(grp = case_when(readmitted == "<30" ~ "Readmit <30d",
readmitted == ">30" ~ "Readmit >30d",
TRUE ~ "No readmit")) %>%
count(grp) %>%
mutate(grp = factor(grp, levels = c("No readmit","Readmit >30d","Readmit <30d")),
pct = n / sum(n) * 100,
label = sprintf("%s\n%.1f%%", grp, pct))
fig2a <- ggplot(out_counts, aes(x = 2, y = n, fill = grp)) +
geom_col(width = 1, color = "white", linewidth = 2) +
coord_polar(theta = "y", start = 0) +
xlim(0.5, 2.5) +
scale_fill_manual(values = c("No readmit" = GREEN, "Readmit >30d" = "#D4A52A",
"Readmit <30d" = RED), guide = "none") +
geom_text(aes(label = label), position = position_stack(vjust = 0.5),
color = "white", fontface = "bold", size = 3.5) +
annotate("text", x = 0.5, y = 0, label = sprintf("%.2f%%", mean(df$readmit_30)*100),
size = 8, fontface = "bold", color = RED) +
annotate("text", x = 0.5, y = 0, label = "30-day rate", vjust = 3,
size = 3.5, color = GREY) +
annotate("text", x = 0.5, y = 0, label = sprintf("N = %s", scales::comma(nrow(df))),
vjust = 5, size = 3, color = GREY) +
labs(title = "Figure 2a. Outcome Distribution") +
theme_void() +
theme(plot.title = element_text(face = "bold", color = NAVY, size = 13,
hjust = 0, margin = ggplot2::margin(b = 10)))
print(fig2a)
# Figure 2b — Business translation KPIs
# (Rendered as a text-only ggplot since it's a KPI panel, not a data chart)
n30 <- sum(df$readmit_30)
cost_per <- 15400
total_cost <- n30 * cost_per
kpis <- tibble(
y = c(4, 3, 2, 1),
value = c(scales::comma(n30),
scales::dollar(cost_per),
sprintf("$%.1fM", total_cost / 1e6),
"25–40%"),
label = c("patients readmitted within 30 days",
"avg cost per readmission",
"estimated 30-day readmission cost in cohort",
"estimated preventable share (Jencks et al.)"),
color = c(RED, ORANGE, NAVY, GREEN))
fig2b <- ggplot(kpis, aes(x = 0, y = y)) +
geom_text(aes(label = value, color = color),
hjust = 0, fontface = "bold", size = 7) +
geom_text(aes(label = label, x = 0.5),
hjust = 0, color = GREY, size = 3.5) +
scale_color_identity() +
coord_cartesian(xlim = c(-0.05, 1.5), ylim = c(0.5, 4.6)) +
labs(title = "Figure 2b. Business Translation") +
theme_void() +
theme(plot.title = element_text(face = "bold", color = NAVY, size = 13,
hjust = 0, margin = ggplot2::margin(b = 14)))
print(fig2b)
# Figure 3 — Q1 diverging bar (Discharge disposition vs baseline)
baseline <- mean(df$readmit_30) * 100
fig3 <- df %>%
group_by(disp_grp) %>%
summarise(rate = mean(readmit_30) * 100, n = n(), .groups = "drop") %>%
arrange(rate) %>%
mutate(delta = rate - baseline,
disp_grp = factor(disp_grp, levels = disp_grp),
color_grp = case_when(delta < 0 ~ "Lower", delta < 5 ~ "Modest", TRUE ~ "High")) %>%
ggplot(aes(x = delta, y = disp_grp, fill = color_grp)) +
geom_col(width = 0.65, color = "white") +
geom_vline(xintercept = 0, color = NAVY, linewidth = 0.6) +
geom_text(aes(x = ifelse(delta >= 0, delta + 0.25, 0.25),
label = sprintf("%.1f%% (n = %s)", rate, comma(n))),
hjust = 0, fontface = "bold", size = 3.4) +
scale_fill_manual(values = c("Lower" = GREEN, "Modest" = ORANGE, "High" = RED),
guide = "none") +
scale_x_continuous(limits = c(-5, 14)) +
annotate("label", x = 12.5, y = 1, size = 3, color = NAVY,
label = "chi sq = 757.1 · df = 4\np < .001\nCramer's V = 0.106") +
labs(title = "Figure 3. Q1 — Readmission Risk by Discharge Disposition",
x = "Difference from baseline 30-day readmission rate (pp)",
y = NULL) +
report_theme + theme(legend.position = "none")
print(fig3)
# Figure 4a — Q2 interaction plot
cell_means <- sub %>% group_by(adm, age_grp) %>%
summarise(m = mean(time_in_hospital), .groups = "drop")
fig4a <- ggplot(cell_means, aes(x = adm, y = m, color = age_grp, group = age_grp)) +
geom_line(linewidth = 1.4) +
geom_point(aes(shape = age_grp), size = 4) +
geom_text(aes(label = sprintf("%.2f", m)), vjust = -1.3,
fontface = "bold", size = 3.2, show.legend = FALSE) +
scale_color_manual(values = c("<40" = TEAL, "40-60" = ORANGE, "60+" = RED),
name = "Age band") +
scale_shape_manual(values = c("<40" = 16, "40-60" = 15, "60+" = 18),
name = "Age band") +
coord_cartesian(ylim = c(2.8, 5.0)) +
labs(title = "Figure 4a. Interaction Plot — LOS by Admission x Age",
x = "Admission Type", y = "Mean Length of Stay (days)") +
report_theme
print(fig4a)
# Figure 4b — Effect sizes (F vs eta squared)
fig4b <- tibble(
Factor = factor(c("Admission Type","Age Group","Interaction"),
levels = c("Admission Type","Age Group","Interaction")),
F_stat = c(53.54, 373.88, 2.85),
Eta_sq = c(0.0018, 0.0122, 0.0002)) %>%
ggplot(aes(x = Factor, y = F_stat, fill = Factor)) +
geom_col(width = 0.55, color = "white", linewidth = 1.5) +
geom_text(aes(label = sprintf("F = %.1f", F_stat)), vjust = -0.4,
fontface = "bold", size = 3.6) +
geom_text(aes(label = sprintf("eta sq = %.4f", Eta_sq)),
y = log10(2), color = "white", fontface = "bold", size = 3.2) +
scale_y_log10(limits = c(0.5, 800)) +
scale_fill_manual(values = c(TEAL, NAVY, GREY), guide = "none") +
labs(title = "Figure 4b. Effect Sizes",
x = NULL, y = "F statistic (log scale)") +
report_theme
print(fig4b)
# Figure 5a — Held-out AUC across all 7 models
fig5a <- tibble(
Model = factor(c("Logistic GLM","Ridge (L2)","Lasso (L1)","Elastic Net",
"Random Forest","XGBoost","Neural Network"),
levels = c("Logistic GLM","Ridge (L2)","Lasso (L1)","Elastic Net",
"Random Forest","XGBoost","Neural Network")),
AUC = c(score(p_glm,y_te)$auc, score(p_ridge,y_te)$auc,
score(p_lasso,y_te)$auc, score(p_enet,y_te)$auc,
score(p_rf,y_te)$auc, score(p_xgb,y_te)$auc,
score(p_nn,y_te)$auc)) %>%
ggplot(aes(x = Model, y = AUC, fill = Model)) +
geom_col(width = 0.55, color = "white", linewidth = 1.5) +
geom_text(aes(label = sprintf("%.3f", AUC)), vjust = -0.4,
fontface = "bold", size = 4) +
geom_hline(yintercept = 0.50, linetype = "dashed", color = RED, alpha = 0.7) +
annotate("text", x = 7.4, y = 0.505, label = "chance", color = RED, size = 3, hjust = 1) +
scale_fill_manual(values = c(GREY, TEAL, NAVY, "#7c3aed", GREEN, ORANGE, RED),
guide = "none") +
coord_cartesian(ylim = c(0.45, 0.65)) +
labs(title = "Figure 5a. Held-out AUC — 7 Models",
x = NULL, y = "AUC (test set)") +
report_theme + theme(axis.text.x = element_text(angle = 25, hjust = 1))
print(fig5a)
# Figure 5b — ROC curves (overlay)
roc_df <- bind_rows(
tibble(fpr = 1 - pROC::roc(y_te, p_glm, quiet = TRUE)$specificities,
tpr = pROC::roc(y_te, p_glm, quiet = TRUE)$sensitivities,
model = sprintf("GLM (%.3f)", score(p_glm, y_te)$auc)),
tibble(fpr = 1 - pROC::roc(y_te, p_ridge, quiet = TRUE)$specificities,
tpr = pROC::roc(y_te, p_ridge, quiet = TRUE)$sensitivities,
model = sprintf("Ridge (%.3f)", score(p_ridge, y_te)$auc)),
tibble(fpr = 1 - pROC::roc(y_te, p_lasso, quiet = TRUE)$specificities,
tpr = pROC::roc(y_te, p_lasso, quiet = TRUE)$sensitivities,
model = sprintf("Lasso (%.3f)", score(p_lasso, y_te)$auc)),
tibble(fpr = 1 - pROC::roc(y_te, p_rf, quiet = TRUE)$specificities,
tpr = pROC::roc(y_te, p_rf, quiet = TRUE)$sensitivities,
model = sprintf("RF (%.3f)", score(p_rf, y_te)$auc)))
fig5b <- ggplot(roc_df, aes(fpr, tpr, color = model)) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = GREY) +
geom_line(linewidth = 1.1) +
scale_color_manual(values = c(GREY, TEAL, NAVY, GREEN), name = NULL) +
labs(title = "Figure 5b. ROC Curves on Held-out Test Set",
x = "False positive rate (1 - specificity)",
y = "True positive rate (sensitivity)") +
report_theme + theme(legend.position = c(0.7, 0.2))
print(fig5b)
# Figure 5c — Lasso feature retention donut (10 of 19 features survive L1)
n_retained <- sum(lasso_coefs[-1] != 0)
n_dropped <- length(feat_cols) - n_retained
ret_df <- tibble(Status = factor(c("Retained","Dropped"),
levels = c("Retained","Dropped")),
Count = c(n_retained, n_dropped))
fig5c <- ggplot(ret_df, aes(x = 2, y = Count, fill = Status)) +
geom_col(width = 1, color = "white", linewidth = 2) +
coord_polar(theta = "y") +
xlim(0.5, 2.5) +
scale_fill_manual(values = c("Retained" = NAVY, "Dropped" = "#E0E0E0"),
guide = "none") +
geom_text(aes(label = sprintf("%s\n%d features", Status, Count)),
position = position_stack(vjust = 0.5),
color = c("white","#404040"), fontface = "bold", size = 3.6) +
annotate("text", x = 0.5, y = 0,
label = sprintf("%.0f%%", n_retained / length(feat_cols) * 100),
size = 9, fontface = "bold", color = NAVY) +
annotate("text", x = 0.5, y = 0, label = "feature\nparsimony",
vjust = 2.6, size = 3.2, color = GREY) +
labs(title = "Figure 5c. Lasso Feature Retention") +
theme_void() +
theme(plot.title = element_text(face = "bold", color = NAVY, size = 13,
hjust = 0, margin = ggplot2::margin(b = 10)))
print(fig5c)
# Figure 6 — Lasso variable importance
imp_df <- tibble(feature = feat_labels, beta = lasso_coefs[-1]) %>%
filter(abs(beta) > 1e-6) %>%
arrange(desc(abs(beta))) %>%
mutate(feature = factor(feature, levels = rev(feature)),
direction = if_else(beta > 0, "Increases risk", "Decreases risk"))
fig6 <- ggplot(imp_df, aes(x = beta, y = feature, fill = direction)) +
geom_col(color = "white", linewidth = 1) +
geom_vline(xintercept = 0, color = NAVY) +
geom_text(aes(label = sprintf("%+.3f", beta),
x = beta + if_else(beta > 0, 0.005, -0.005),
hjust = if_else(beta > 0, 0, 1)),
fontface = "bold", size = 3.2) +
scale_fill_manual(values = c("Increases risk" = RED, "Decreases risk" = GREEN)) +
labs(title = "Figure 6. Lasso Variable Importance",
x = "Standardized log-odds coefficient (beta)",
y = NULL, fill = NULL) +
report_theme
print(fig6)
# Figure 7a — Confusion matrix at Youden-optimal threshold (Lasso)
s_lasso <- score(p_lasso, y_te)
cm_df <- tibble(
Actual = factor(c("No","No","Yes","Yes"), levels = c("Yes","No")),
Predicted = factor(c("No","Yes","No","Yes"), levels = c("No","Yes")),
Count = c(s_lasso$tn, s_lasso$fp, s_lasso$fn, s_lasso$tp),
Label = c("TN","FP","FN","TP")) %>%
mutate(Pct = Count / sum(Count) * 100,
Fill = if_else(Label %in% c("TN","TP"), "#A8E0B8", "#F9C9C9"))
fig7a <- ggplot(cm_df, aes(Predicted, Actual, fill = Fill)) +
geom_tile(color = "white", linewidth = 2) +
geom_text(aes(label = sprintf("%s\n%s\n(%.1f%%)", Label, comma(Count), Pct)),
fontface = "bold", size = 4.5) +
scale_fill_identity() +
labs(title = sprintf("Figure 7a. Confusion Matrix at Threshold = %.3f (Lasso)",
s_lasso$threshold),
x = NULL, y = NULL) +
report_theme +
theme(panel.grid = element_blank(),
axis.text = element_text(face = "bold", size = 11))
print(fig7a)
# Figure 7b — Business read-out per 10,000 discharges
# Scale the confusion-matrix counts from the held-out test set to a 10,000-discharge population
scale_factor <- 10000 / length(y_te)
flagged_10k <- round((s_lasso$tp + s_lasso$fp) * scale_factor)
caught_10k <- round(s_lasso$tp * scale_factor)
missed_10k <- round(s_lasso$fn * scale_factor)
intervention_cost <- flagged_10k * 300 # $300 per intervention
readmission_saved <- caught_10k * 0.30 * 15400 # 30% prevented, $15,400 each
net_benefit <- readmission_saved - intervention_cost
biz <- tibble(
y = c(6, 5, 4, 3, 2, 1),
value = c(scales::comma(flagged_10k),
scales::comma(caught_10k),
scales::comma(missed_10k),
sprintf("$%.2fM", intervention_cost / 1e6),
sprintf("$%.2fM", readmission_saved / 1e6),
sprintf("%s$%.2fM",
if_else(net_benefit >= 0, "+", "-"),
abs(net_benefit) / 1e6)),
label = c("Patients flagged for high-risk pathway",
"True readmissions caught",
"True readmissions missed",
"Cost of high-risk pathway @ $300/pt",
"Cost of missed readmits @ $15,400",
"Net benefit per 10,000 discharges"),
color = c(ORANGE, GREEN, RED, NAVY, RED,
if_else(net_benefit >= 0, GREEN, RED)))
fig7b <- ggplot(biz, aes(x = 0, y = y)) +
geom_text(aes(label = value, color = color),
hjust = 0, fontface = "bold", size = 6.5) +
geom_text(aes(label = label, x = 0.42),
hjust = 0, color = GREY, size = 3.4) +
scale_color_identity() +
coord_cartesian(xlim = c(-0.05, 1.4), ylim = c(0.5, 6.6)) +
labs(title = "Figure 7b. Business Read-out per 10,000 Discharges") +
theme_void() +
theme(plot.title = element_text(face = "bold", color = NAVY, size = 13,
hjust = 0, margin = ggplot2::margin(b = 14)))
print(fig7b)
# Figure 8 — Calibration plot (Lasso)
calib <- tibble(prob = p_lasso, actual = y_te) %>%
mutate(decile = ntile(prob, 10)) %>%
group_by(decile) %>%
summarise(pred_mean = mean(prob), obs_rate = mean(actual), .groups = "drop")
fig8 <- ggplot(calib, aes(pred_mean, obs_rate)) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = GREY) +
geom_hline(yintercept = mean(y_te), linetype = "dotted", color = RED) +
geom_line(color = NAVY, linewidth = 1) +
geom_point(color = NAVY, size = 4) +
geom_text(aes(label = paste0("D", decile)), vjust = -1, hjust = 0.3,
size = 3, color = NAVY) +
coord_cartesian(xlim = c(0, 0.18), ylim = c(0, 0.18)) +
labs(title = "Figure 8. Calibration on Held-out Test Set (Lasso)",
x = "Mean predicted probability per decile",
y = "Observed readmission rate per decile") +
report_theme
print(fig8)
# Figure 9 — Risk gradient by decile (Lasso)
fig9 <- calib %>%
ggplot(aes(x = factor(decile), y = obs_rate * 100, fill = obs_rate)) +
geom_col(width = 0.7, color = "white", linewidth = 1.2) +
geom_text(aes(label = sprintf("%.1f%%", obs_rate * 100)),
vjust = -0.4, fontface = "bold", size = 3.4) +
geom_hline(yintercept = mean(y_te) * 100, linetype = "dashed", color = NAVY) +
scale_fill_distiller(palette = "RdYlGn", direction = -1, guide = "none") +
scale_x_discrete(labels = paste0("D", 1:10)) +
labs(title = "Figure 9. Risk Gradient — Observed Readmission Rate by Decile",
x = "Predicted-risk decile (D1 = lowest, D10 = highest)",
y = "Observed 30-day rate (%)") +
report_theme
print(fig9)
# =============================================================================
# END OF SCRIPT
# Knit ALY6015_FinalApp.Rmd next to render the live RPubs app.
# =============================================================================
set.seed(42) · Anush Goel, Mandeep Kaur, Sandeep Kaur