Diabetic Readmission Risk Lab

A multi-model decision-support and business-impact tool for diabetic discharge planning

Anush Goel · Mandeep Kaur · Sandeep Kaur · ALY 6015 Final Project · May 2026

68,061 patients · 7 ML models · 19 features · Lasso AUC = 0.604 · Live in browser

Model Scoreboard — Seven Algorithms, One Held-Out Comparison

All seven classifiers were trained on the same 47,642-patient training split and evaluated on the same 20,419-patient test set. Pick a champion to use in the Live Risk Predictor and Business P&L tabs.

  • Recommended model: Lasso (L1) — 10 of 19 features, auditable
  • Held-out AUC: 0.604 (95% CI [0.591, 0.617])
  • Risk gradient: 3.3× (top vs bottom decile)
  • Annual net benefit: +$1.31M (at 10K discharges, 30% prevention)
  • Models trained: 7 (linear, tree, neural)
  • Features used: 19 (administrative only)
  • Training patients: 47,642 (70% stratified split)
  • Test patients: 20,419 (30% held-out)
  • Base rate: 9.02% (imbalanced)

Cross-Model Comparison Table

All metrics are computed on the same held-out test set at each model's Youden-optimal threshold. AUC measures discrimination; sensitivity is recall; specificity is true-negative rate; precision is positive predictive value; F1 balances precision and recall. The starred row marks the best AUC.

95% CIs computed analytically via the Hanley-McNeil (1982) standard error. All seven models' CIs overlap, confirming they are statistically indistinguishable.

SE(AUC) = sqrt( [ AUC(1 − AUC) + (n+ − 1)(Q1 − AUC²) + (n− − 1)(Q2 − AUC²) ] / (n+ · n−) ), where Q1 = AUC / (2 − AUC), Q2 = 2·AUC² / (1 + AUC), and n+ and n− are the positive and negative test-set counts.
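As a sketch, the interval can be reproduced in a few lines of R; the positive and negative counts below are inferred from the 20,419-patient test set at the 9.02% base rate, so treat them as approximations:

# Hanley-McNeil (1982) analytic CI for an AUC estimate
hanley_mcneil_ci <- function(auc, n_pos, n_neg, level = 0.95) {
  q1 <- auc / (2 - auc)          # P(two random positives both outrank a negative)
  q2 <- 2 * auc^2 / (1 + auc)    # P(a positive outranks two random negatives)
  se <- sqrt((auc * (1 - auc) +
              (n_pos - 1) * (q1 - auc^2) +
              (n_neg - 1) * (q2 - auc^2)) / (n_pos * n_neg))
  z <- qnorm(1 - (1 - level) / 2)
  c(lower = auc - z * se, upper = auc + z * se)
}
hanley_mcneil_ci(0.604, n_pos = 1842, n_neg = 18577)  # approx. [0.590, 0.618]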

Discrimination — AUC Across All 7 Models

AUC is the area under the ROC curve: the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. 0.5 = chance; 1.0 = perfect.

All models cluster near AUC = 0.60. This is the information ceiling of administrative-data features for this outcome — not an algorithm-choice problem.

ROC Curves — All Models Overlaid

An ROC curve traces sensitivity (TPR) against 1 − specificity (FPR) as the threshold varies from 1 to 0. Steeper, closer-to-corner curves are better.

The curves are visually indistinguishable, confirming the AUC convergence.

Multi-Metric Radar — Where Each Model Wins

Axes are min-max normalized across the 7 models, so the radar shows relative performance, not absolute. Look for models that bulge on the metric you care about (e.g. sensitivity for triage).
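A one-liner makes the normalization concrete; metrics_df below is a placeholder name for a 7-row table of per-model metrics, not an object from the Appendix script:

minmax <- function(v) (v - min(v)) / (max(v) - min(v))   # rescale each axis to [0, 1]
# radar_df <- data.frame(lapply(metrics_df, minmax))     # one column per radar axis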

Calibration — Are Predicted Probabilities Trustworthy?

A calibration plot bins patients by their predicted probability, then plots predicted vs observed rates. Points on the diagonal mean the model is well calibrated (a "20% risk" really happens 20% of the time).

Each dot is one of 10 risk deciles. The diagonal is perfect calibration. Lasso’s points hug the diagonal closely — the predicted probabilities can be trusted at face value, which is essential for the business P&L math.

Learning Curves — Would More Training Data Help?

Train Lasso on increasingly large subsets of the training data and measure test AUC. If the curve has flattened, the model has extracted all available signal; if it is still rising, more data would help.

The curve flattens between 30K and 47K training patients. Additional data would not materially improve AUC — the bottleneck is feature signal, not sample size.
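The Appendix script does not include the learning-curve loop, so here is a minimal sketch of the procedure, assuming the script has been sourced (X_tr_s, y_tr, X_te_s, y_te in scope); the subset sizes are illustrative:

library(glmnet); library(pROC)
set.seed(42)
sizes <- c(5000, 10000, 20000, 30000, 47642)
lc_auc <- sapply(sizes, function(n) {
  rows <- sample(nrow(X_tr_s), n)                         # random training subset
  fit  <- cv.glmnet(X_tr_s[rows, ], y_tr[rows], family = "binomial",
                    alpha = 1, type.measure = "auc", nfolds = 5)
  p    <- as.numeric(predict(fit, X_te_s, s = "lambda.min", type = "response"))
  as.numeric(auc(roc(y_te, p, quiet = TRUE)))             # test AUC at this size
})
plot(sizes, lc_auc, type = "b", xlab = "Training patients", ylab = "Test AUC")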

Feature Correlation Heatmap

Pairwise Pearson correlations on the 19 features. High off-diagonal values (|r| > 0.5) indicate multicollinearity, which explains why Lasso can drop some features without losing AUC.

Strong correlations between admission-type indicators and between prior-utilization features explain Lasso’s feature drops — the dropped features are mostly redundant.
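A sketch of the heatmap computation, assuming X (the raw 19-column feature matrix) from the Appendix script is in scope:

library(ggplot2)
cor_long <- as.data.frame(as.table(cor(X)))               # pairwise Pearson r
names(cor_long) <- c("f1", "f2", "r")
ggplot(cor_long, aes(f1, f2, fill = r)) +
  geom_tile() +
  scale_fill_gradient2(limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = NULL, y = NULL)
subset(cor_long, abs(r) > 0.5 & as.integer(f1) < as.integer(f2))  # collinear pairs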

Subgroup Performance — Fairness Audit

A fair model performs comparably across demographic subgroups. Large AUC or sensitivity gaps indicate the model is more accurate for some patients than for others.

Lasso’s AUC across demographic subgroups. Differences under 0.02 are within statistical noise; gaps above 0.04 warrant investigation.

Reading the fairness audit. Performance is broadly comparable across race and gender subgroups (max AUC gap = 0.018). The largest disparity is by age: the model has higher discrimination for 60+ patients than for under-40s, likely because younger diabetics are rarer and the model has less training signal for them. Action: monitor performance for under-40 patients post-deployment.
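The subgroup audit reduces to grouped AUCs on the test rows; a sketch assuming the Appendix script has been sourced (df, idx, p_lasso, y_te in scope):

library(dplyr); library(pROC)
test_df <- df[-as.vector(idx), ] %>%                      # held-out rows of the cohort
  mutate(p = p_lasso, y = y_te)                           # attach predictions + outcome
test_df %>%
  mutate(age_band = cut(age_num, c(0, 40, 60, 100),
                        labels = c("<40", "40-60", "60+"))) %>%
  group_by(age_band) %>%                                  # swap in gender or race the same way
  summarise(n = n(), auc = as.numeric(auc(roc(y, p, quiet = TRUE))))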

Context: Comparison vs CMS HRRP Penalty

Under the CMS Hospital Readmissions Reduction Program (HRRP), the average penalty for a 200-bed hospital with above-expected readmissions is roughly $130K to $250K per year. The Lasso model’s annual net benefit of +$1.31M covers that penalty exposure approximately 5–10× over — deployment is financially compelling even before counting reputational benefits.

Reading the scoreboard. Random Forest leads on sensitivity (71%) but pays in specificity (44%) — it casts the widest net. Lasso and Ridge tie for top AUC at 0.604 and both produce calibrated probabilities; Lasso uses fewer features. Logistic GLM is the interpretable baseline that gives you odds ratios for free. Pick Lasso for production deployment, Random Forest for sensitivity-first triage, GLM when stakeholders demand interpretability.

Live Risk Predictor

Move sliders, toggle markers, change demographics — the chosen model’s prediction updates in real time. The hero panel below stays in view as you scroll through inputs.

1. Pick a Model

All seven models are loaded. Switch any time — the NO EFFECT tags update to show what each model drops.
Predicted 30-Day Readmission Risk
%
vs population baseline of 9.02%
Active model: Lasso (L1)
Decision threshold
Cohort baseline: 9.02%
Risk vs baseline
Risk percentile

Where this patient ranks in the 68,061-patient cohort

0% (lowest risk) · 50% · 100% (highest risk)

Predicted probability 8.3% · approx. 95% prediction interval [6.6%, 10.4%] based on cross-validated residual variance

Recent predictions (last 6)

Each dot shows the predicted risk for a recent scenario you tested; the current scenario is the larger dot. Click any dot to recall it.

No prior predictions yet — move sliders to start a history.
Among similar patients in the test cohort
Save scenario:

Saved scenarios will appear here

2. Adjust Patient Profile

Move sliders, toggle markers. The hero result above updates on every change. Each slider shows where the current value sits in the cohort distribution.
3. Decision Waterfall — Why This Prediction?

Read top to bottom: start at the model intercept, add each feature's contribution, and end at the final log-odds. The sigmoid of the final value gives the probability.

Use it to audit, step by step, exactly how the model reasoned.
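For a linear model on standardized inputs the decomposition is exact: log-odds = intercept + Σ βj·zj, so each addend is one bar of the waterfall. A sketch using the Appendix objects (lasso_coefs, X_te_s, feat_labels), with an arbitrary test patient:

z       <- X_te_s[1, ]                            # one standardized patient row
contrib <- lasso_coefs[-1] * z                    # per-feature log-odds contributions
prob    <- plogis(lasso_coefs[1] + sum(contrib))  # intercept + sum -> sigmoid -> probability
ord     <- order(-abs(contrib))
head(data.frame(feature = feat_labels[ord],
                contribution = round(contrib[ord], 4)), 8)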

How a clinician would use this: At point of discharge, enter the patient’s profile in under 60 seconds. If the badge reads Above baseline with the YES flag, route to the high-risk pathway — 48-hour follow-up call, home-health referral, transitional-care visit at day 7. The waterfall above explains why, which is what makes the decision auditable.

Patient Comparison — What If You Change One Thing?

Configure two patient profiles side-by-side. Toggle a single feature on Patient B to see exactly how much it moves the prediction. This is how clinicians audit model behavior and how data scientists test for fairness.

1. Try a Built-In Scenario

One click loads two patient profiles. Useful for clinical training and model auditing.
2. Configure the Two Patients

Edit either card directly. The hero numbers and the comparison table below update live.
Patient A — reference profile
Patient B — comparison profile
3. Where the Two Patients Differ

Only features with different values between A and B are listed below.
4. Driver Comparison

Top 8 features by absolute contribution, shown for both patients side by side.

Why this tab matters. A single-number prediction is hard to trust. Seeing exactly how the model responds to a one-variable change — diabetes medication off versus on, holding everything else equal — builds clinician trust, surfaces model bugs, and exposes potential bias. Try the Prior Inpatient scenario: it’s the single strongest predictor in the Lasso model, and you can watch the gauge move from low to high.

Business P&L — What Is the Model Worth?

Set your hospital’s economic assumptions. The model picks the threshold that maximizes net benefit — KPI tiles, confusion matrix, decision-curve, and break-even analysis update live.

Economic Assumptions (Edit Any Field)

All downstream calculations re-run live as you edit. Defaults are sourced from Jencks et al. (NEJM 2009) and CMS HRRP penalty data.

Care coordinator + follow-up call
Avg DRG-weighted cost
0 = no effect, 1 = all preventable
Your hospital’s volume
Your cohort’s untreated rate
EHR integration + training
Active model:
The KPI tiles update live:
  • Flagged / Year
  • Readmissions Caught
  • Readmissions Missed (false negatives)
  • Intervention Cost (annual outlay)
  • Readmission Savings (annual avoided)
  • Net Benefit / Year (bottom line)
  • ROI (savings / cost)
  • NNF (flags per catch)
  • Net / Discharge (per-patient value)
  • Do-Nothing Cost (no model deployed)
  • Optimal Threshold (net-benefit max)
Break-Even Prevention Effectiveness
Below this prevention rate, the model loses money. Current setting is 30% — the margin of safety is .
Time to Payback Implementation
Your one-time implementation cost is recovered from net benefit in this many days. Lower is better.

Net Benefit Across All Decision Thresholds

The red marker is the threshold that maximizes annual net benefit under your current settings. Move the economic sliders above to watch it shift.

Decision Curve Analysis (Vickers-Elkin)

DCA compares the model against "Treat All" and "Treat None" strategies across every plausible threshold probability. The model is useful at any threshold where its line sits above both alternatives.

Standard medical-economics framework for evaluating prediction models. The model beats both “treat everyone” and “treat no one” across all clinically reasonable thresholds — this is the formal justification for deploying it over either default.
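The net-benefit curve reduces to one formula, NB(pt) = TP/n − (FP/n) · pt/(1 − pt), with "treat all" obtained by flagging everyone. A minimal sketch, assuming p_lasso and y_te from the Appendix script:

nb_curve <- function(probs, y, pts = seq(0.02, 0.30, 0.01)) {
  n <- length(y)
  sapply(pts, function(pt) {
    flag <- probs >= pt
    tp <- sum(flag & y == 1); fp <- sum(flag & y == 0)
    c(model      = tp / n - fp / n * pt / (1 - pt),
      treat_all  = mean(y) - (1 - mean(y)) * pt / (1 - pt),
      treat_none = 0)
  })
}
pts <- seq(0.02, 0.30, 0.01)
matplot(pts, t(nb_curve(p_lasso, y_te, pts)), type = "l", lty = 1,
        xlab = "Threshold probability", ylab = "Net benefit")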

Manual Threshold Playground

Drag to set any threshold and see the confusion matrix and net benefit at that operating point. Hospital leadership may prefer a sensitivity-tilted or specificity-tilted threshold over the Youden-optimal one.

0.08

Override the Youden-optimal threshold to test sensitivity-first or specificity-first deployments.

True Positives · False Positives · False Negatives · True Negatives

At this manual threshold: Net benefit · ROI .

Cost Scenarios — Side-by-Side Comparison

Three intervention-cost scenarios, each shown at its own Youden-optimal threshold. Pick the column closest to your hospital's actual program cost to see which intervention budget produces the best net benefit.

Confusion Matrix at Optimal Threshold

Scaled to your annual discharge volume. Green = correct, red = errors.

Risk Gradient by Predicted-Risk Decile

The model concentrates risk in the top deciles — this is what powers triage even at modest AUC.

Reading the P&L. The break-even logic is simple: one prevented readmission ($15,400) is worth roughly 50 interventions ($300 each). The model only needs to convert one in 50 flagged patients into a prevented readmission to break even. At the Youden-optimal threshold and 30% prevention effectiveness, the model converts roughly one in 10 — a 5× margin over break-even.
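The paragraph's arithmetic as a quick R sanity check; the one-in-ten conversion rate is quoted from the text rather than recomputed:

cost_intervention <- 300                           # per flagged patient
cost_readmission  <- 15400                         # per avoided readmission
breakeven <- cost_intervention / cost_readmission  # conversions needed per flag
breakeven                                          # ~0.0195, i.e. ~1 catch per 51 flags
(1 / 10) / breakeven                               # stated conversion rate -> ~5.1x margin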

Sensitivity analysis. Try halving prevention effectiveness to 0.15: the model still produces positive net benefit. Try setting intervention cost to $1,000 (very expensive program): the optimal threshold rises but net benefit remains positive. The model is robust to substantial parameter uncertainty — this is the case for deploying it.

Research Findings — Three Questions, Three Statistical Instruments

Each research question was answered with the statistical method most appropriate to the data structure. All three null hypotheses were rejected, but only some effects are practically meaningful.

Showing full cohort (N = 68,061)

Cohort at a Glance

After preprocessing the raw 101,766-encounter dataset, the analytic sample contains 68,061 first-encounter patients. Age dominates the distribution, with the modal patient in the 60-80 band.

  • 9.02% of patients are readmitted within 30 days — 6,142 events on a 68,061-patient denominator.
  • The mean length of stay is 4.4 days; the 95th percentile is 11 days.
  • 78% of patients are on diabetes medication at discharge; 54% are on insulin.
  • The cohort is dominated by older patients: 65% are aged 60 or above.

Q1 — Does discharge disposition predict 30-day readmission?

Method: Chi-square test of independence on the 5-level disposition × binary outcome cross-tabulation. Cramér’s V reported to quantify effect strength independent of sample size.

Result: χ²(4, N = 68,061) = 757.06, p < .001, Cramér’s V = 0.106. We reject the null hypothesis of independence.

Substantive Interpretation

  • Patients discharged home with home-health services readmit at 19.8% — more than twice the 9.02% baseline.
  • This subgroup is only 3.2% of patients but generates 7.0% of all readmissions.
  • The Cramér’s V of 0.106 indicates a small-to-medium effect — statistically significant and practically meaningful.
  • Operational action: Re-evaluate the home-health referral criteria. If sicker patients are being routed to home-health because they cannot go home alone, the 19.8% rate reflects underlying acuity, not failure of home-health.

Q2 — Does length of stay vary by admission type and age, with interaction?

Method: Two-way ANOVA with interaction term. F-statistics and eta-squared (η²) reported separately to distinguish statistical from practical significance.

Result: Admission F(2, 60,253) = 53.54, p < .001, η² = 0.0018; Age F(2, 60,253) = 373.88, p < .001, η² = 0.0122; Interaction F(4, 60,253) = 2.85, p = .023, η² = 0.0002. All three null hypotheses rejected.

Substantive Interpretation

  • All three effects are statistically significant because n = 60,000+ is huge.
  • Only age is practically meaningful (η² = 0.0122, i.e. age explains 1.22% of LOS variance).
  • The roughly parallel lines confirm the additive structure: older patients require longer stays regardless of admission urgency.
  • Operational action: Capacity planners should size beds by age cohort first, admission type second. A unit serving primarily 60+ patients needs ~30% more bed-days per admission than a 40−60 unit.

Q3 — Can regularization recover a parsimonious model with no loss in discrimination?

Method: Compare 7 classifiers (logistic GLM, Ridge L2, Lasso L1, Elastic Net α=0.5, Random Forest, XGBoost, Neural Network) on the same held-out test split.

Result: All models cluster at AUC ≈ 0.60−0.61. Lasso retains 10 of 19 features — a 47% reduction at zero AUC cost.

  • The administrative-data ceiling. Seven different algorithm families produce statistically indistinguishable AUC. This is not a model-selection problem; the dataset’s features simply do not contain more signal.
  • Lasso drops: minority-race indicators, gender, HbA1c testing flag, admission type binaries, prior outpatient visits.
  • Lasso keeps: prior inpatient visits (strongest predictor), length of stay, number of diagnoses, age, insulin use, diabetes medication, medication change flag, and 3 others.
  • Operational action: Deploy Lasso for the discharge-screening tool. The 10-variable form is auditable on a single sheet of paper — an essential property for clinical decision support.

For the seven-model comparison details, see the Model Scoreboard tab.

What three questions together tell us. Discharge pathway is a strong leading indicator (Q1). Age, not admission type, drives capacity needs (Q2). And the predictive signal in administrative data tops out at AUC ≈ 0.60 regardless of algorithm — but that’s still enough to produce a 3× risk gradient between deciles, which is the operational engine of triage (Q3). The three answers together justify a deployment-ready model used in conjunction with discharge-pathway redesign, not in place of it.

Methods, Data, & References

Full provenance for every analytic choice. Reproduce by sourcing the R script in the Appendix tab.

Data Source & Cohort Construction

The Diabetes 130-US Hospitals dataset (Strack et al., 2014), accessed via the UCI Machine Learning Repository, contains 101,766 inpatient encounters across 130 U.S. hospitals between 1999 and 2008.

Preprocessing Pipeline

  • Removed three high-missingness columns: weight (97% missing), payer_code (40%), medical_specialty (49%).
  • Excluded deceased and hospice-discharge encounters (discharge_disposition_id ∈ {11, 13, 14, 19, 20, 21}).
  • De-duplicated to the first encounter per patient_nbr to avoid intra-patient dependence.
  • Dropped rows missing race, primary diagnosis, or with gender = Unknown/Invalid.
  • Final analytic sample: 68,061 patients with 9.02% 30-day readmission rate.

Feature Engineering (19 final predictors)

  • Utilization (3): prior inpatient, ER, and outpatient visits in the year before the index encounter.
  • Index encounter (5): length of stay, lab procedures, in-hospital procedures, medications, diagnoses.
  • Demographics (5): age, gender, race (3 indicators).
  • Treatment markers (4): HbA1c tested, medication changed, on diabetes med, on insulin.
  • Admission context (2): emergency and urgent admission indicators.

Statistical & Machine-Learning Methods

Q1 — Chi-Square Test of Independence

Tests the null hypothesis that discharge disposition and 30-day readmission are statistically independent. Cramér’s V quantifies effect strength (0 = none, 1 = perfect), correcting for sample size so that the result is interpretable independent of the n = 68,061 denominator that would make any tiny effect “significant”.

Q2 — Two-Way ANOVA With Interaction

Tests three simultaneous nulls: no admission-type effect, no age-group effect, no interaction. Eta-squared (η²) is reported for each effect because in a 68,061-row dataset every p-value will be vanishingly small — we need an effect-size measure to know if any of it matters operationally.

Q3 — Seven Classifiers Compared

  • Logistic GLM — Interpretable baseline. Coefficients are log-odds ratios.
  • Ridge (L2) — cv.glmnet, α = 0, 5-fold CV optimized for AUC. Shrinks coefficients toward zero without dropping features.
  • Lasso (L1) — Same procedure, α = 1. Drops weak features entirely.
  • Elastic Net — α = 0.5 hybrid of L1 and L2. Compromise between Ridge and Lasso.
  • Random Forest — 200 trees, at most 100 terminal nodes, minimum leaf size 20. Captures non-linear interactions.
  • XGBoost — 300 boosting rounds, depth 4, learning rate 0.05, 85% column & row subsample.
  • Neural Network — Single hidden layer with 16 logistic (sigmoid) units, weight decay 0.001, maximum of 200 training iterations.

Evaluation Protocol

70/30 stratified train/test split with caret::createDataPartition and set.seed(42). All metrics reported on the held-out test set. Optimal classification threshold chosen by Youden’s J (maximizing sensitivity + specificity − 1).
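Equivalently, Youden's J can be computed transparently instead of through pROC::coords; a sketch assuming p_lasso and y_te from the Appendix script:

library(pROC)
r <- roc(y_te, p_lasso, quiet = TRUE)
j <- r$sensitivities + r$specificities - 1   # Youden's J at every threshold
r$thresholds[which.max(j)]                   # the Youden-optimal cutoff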

Live Deployment Architecture

All seven models are trained once, offline. The Lasso intercept, the per-model coefficient vectors (for linear models) or linear surrogates fit on logit(p) (for tree and neural models), and the standardization statistics are exported as a single 20 KB JSON cache. The cache is embedded into the published HTML page; a JavaScript engine reads it and recomputes predictions, threshold-based decisions, and per-feature contributions in real time as the user moves sliders. No server, no R session, no compute budget, and no latency.
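For verification, the browser engine's computation can be replayed in R against the exported cache. Field names follow the payload built in the Appendix (mu, sd, models$lasso$coefs and $intercept); x_raw must hold the 19 raw feature values in cache$features order:

library(jsonlite)
cache <- read_json("model_cache.json", simplifyVector = TRUE)
mu  <- unlist(cache$mu)                       # standardization means
sdv <- unlist(cache$sd)                       # standardization SDs
las <- cache$models$lasso
predict_from_cache <- function(x_raw) {
  z <- (x_raw - mu) / sdv                     # standardize exactly as at training
  plogis(las$intercept + sum(las$coefs * z))  # linear score -> sigmoid
}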

Data Dictionary — All 19 Features

Required for reproducibility and clinical handoff: the precise operational definition of every feature in the model, with its source variable in the raw dataset.

| Feature | Type | Definition | Source |
|---|---|---|---|
| age_num | numeric | Age band midpoint (5, 15, 25, …, 95) derived from the original 10-year age groups | age |
| gender_bin | binary | 1 if Male, 0 if Female; Unknown/Invalid records dropped | gender |
| race_AA | binary | 1 if African American, 0 otherwise | race |
| race_His | binary | 1 if Hispanic, 0 otherwise | race |
| race_Oth | binary | 1 if Other (Asian, Other), 0 otherwise; Caucasian is the reference | race |
| adm_emerg | binary | 1 if Emergency admission (admission_type_id = 1), 0 otherwise | admission_type_id |
| adm_urgent | binary | 1 if Urgent admission (admission_type_id = 2), 0 otherwise | admission_type_id |
| time_in_hospital | integer | Length of stay in days (1–14) | time_in_hospital |
| num_lab_procedures | integer | Number of laboratory tests during the encounter | num_lab_procedures |
| num_procedures | integer | Number of non-lab procedures performed during the encounter | num_procedures |
| num_medications | integer | Number of distinct generic medication names administered | num_medications |
| number_outpatient | integer | Outpatient visits in the 365 days before the index encounter | number_outpatient |
| number_emergency | integer | Emergency-room visits in the 365 days before the index encounter | number_emergency |
| number_inpatient | integer | Inpatient admissions in the 365 days before the index encounter | number_inpatient |
| number_diagnoses | integer | Number of distinct diagnosis codes assigned to the encounter | number_diagnoses |
| A1C_tested | binary | 1 if HbA1c was measured during the encounter (any result), 0 if "None" | A1Cresult |
| change_bin | binary | 1 if any diabetic medication was changed during the encounter, 0 otherwise | change |
| diabetesMed_bin | binary | 1 if patient is prescribed diabetes medication at discharge, 0 otherwise | diabetesMed |
| insulin_bin | binary | 1 if insulin was prescribed or its dose changed during the encounter, 0 otherwise | insulin |

Limitations & Generalizability Caveats

  • Vintage of data. The dataset covers 1999–2008. Diabetes management, hospital reimbursement (HRRP launched 2012), and discharge practices have evolved substantially since. Recalibration on contemporary data is essential before clinical deployment.
  • Single-country generalizability. All 130 hospitals are U.S. facilities. Care patterns, insurance mix, and outcome definitions differ in other healthcare systems — results may not transfer without local refitting.
  • Race coding is coarse. The original dataset uses six race categories that collapse heterogeneous populations (e.g., “Asian” combines populations with very different diabetes profiles). Bias audits should use richer demographic data when available.
  • Outcome definition. 30-day readmission counts any readmission to any of the 130 facilities; readmissions to other hospitals are missed, biasing the outcome rate downward by an estimated 5–10%.
  • Administrative-data ceiling. All seven models converge near AUC = 0.60 because the available features capture utilization but not clinical state (lab values, vitals, comorbidity severity). Adding ICD-coded comorbidities, lab trends, and social-determinant data would likely raise AUC to 0.70–0.75.
  • No prospective validation. All metrics are from a single random train/test split, not a temporal or external validation cohort. Real deployments should validate on hold-out time periods.
  • Causal interpretation. Coefficients describe statistical association at discharge, not causal effect. Insulin prescribing is correlated with disease severity; the model is not a recommendation for or against insulin.

References (APA 7th)

Centers for Medicare & Medicaid Services. (2023). Hospital Readmissions Reduction Program (HRRP). https://www.cms.gov/medicare/quality-initiatives-patient-assessment-instruments/value-based-programs/hrrp/hospital-readmissions-reduction-program-hrrp

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794. https://doi.org/10.1145/2939672.2939785

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1-22.

Jencks, S. F., Williams, M. V., & Coleman, E. A. (2009). Rehospitalizations among patients in the Medicare fee-for-service program. New England Journal of Medicine, 360(14), 1418-1428.

Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5), 1-26.

Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18-22.

Strack, B., DeShazo, J. P., Gennings, C., Olmo, J. L., Ventura, S., Cios, K. J., & Clore, J. N. (2014). Impact of HbA1c measurement on hospital readmission rates. BioMed Research International, 2014, 781670.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1), 267-288.

Appendix — R Source Code

The complete reproducible R script. Click any section header to expand or collapse.

Reproducibility: set.seed(42). Place diabetic_data.csv in the working directory and source the script. The exported model_cache.json (~20 KB) is what this RPubs page embeds at knit time.

Complete R Script (645 lines) — click to collapse
# =============================================================================
# ALY 6015 — Final Project (Module 6)
# Predicting 30-Day Hospital Readmission in Diabetic Patients
# Multi-Model Risk Lab — trains 7 classifiers, runs 3 statistical tests,
# prints every figure in the report, exports JSON cache for the RPubs app.
#
# Authors:  Anush Goel · Mandeep Kaur · Sandeep Kaur
# Date:     May 2026
# =============================================================================
# Sections (each starts with "# ---- N. Name -----" so the Appendix tab
# automatically splits them into collapsible blocks):
#   0. Setup
#   1. Data preprocessing  (101,766 -> 68,061)
#   2. Q1 — Chi-square: discharge disposition x readmission
#   3. Q2 — Two-way ANOVA: length of stay ~ admission x age
#   4. Q3 — Train 7 classifiers (GLM, Ridge, Lasso, Enet, RF, XGB, NN)
#   5. Score & cache (model_cache.json)
#   6. Report figures — print only, no file saving
# Reproducibility: set.seed(42)
# =============================================================================


# ---- 0. Environment Setup & Reproducibility ----
pkgs <- c("tidyverse","janitor","glmnet","randomForest","xgboost",
          "nnet","pROC","caret","jsonlite","plotly","scales","kableExtra")
for (p in pkgs) if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
invisible(lapply(pkgs, library, character.only = TRUE))

set.seed(42)
options(scipen = 999)

NAVY  <- "#1A3A6C"; TEAL  <- "#2C8AA8"; ORANGE <- "#E07B1F"
RED   <- "#C1352D"; GREEN <- "#3A8A3A"; GREY   <- "#7A7A7A"

report_theme <- theme_minimal(base_size = 12) +
  theme(plot.title    = element_text(face = "bold", color = NAVY, size = 13,
                                     margin = ggplot2::margin(b = 10)),
        plot.subtitle = element_text(color = GREY, size = 10,
                                     margin = ggplot2::margin(b = 10)),
        axis.title    = element_text(size = 11),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_line(linewidth = 0.3, linetype = "dashed"),
        legend.position  = "top",
        legend.title     = element_text(face = "bold", size = 10))


# ---- 1. Data Preprocessing & Cohort Construction ----
df <- read_csv("diabetic_data.csv", na = c("?", ""), show_col_types = FALSE)
cat("Raw encounters:", nrow(df), "\n")

df <- df %>%
  select(-weight, -payer_code, -medical_specialty) %>%
  filter(!discharge_disposition_id %in% c(11, 13, 14, 19, 20, 21)) %>%
  arrange(encounter_id) %>%
  distinct(patient_nbr, .keep_all = TRUE) %>%
  filter(!is.na(race), !is.na(diag_1), gender != "Unknown/Invalid") %>%
  mutate(readmit_30      = if_else(readmitted == "<30", 1L, 0L),
         A1C_tested      = as.integer(A1Cresult != "None"),
         change_bin      = as.integer(change == "Ch"),
         diabetesMed_bin = as.integer(diabetesMed == "Yes"),
         insulin_bin     = as.integer(insulin != "No"),
         gender_bin      = as.integer(gender == "Male"),
         race_AA         = as.integer(race == "AfricanAmerican"),
         race_His        = as.integer(race == "Hispanic"),
         race_Oth        = as.integer(race == "Other"),
         age_num = case_when(
           age == "[0-10)"  ~  5, age == "[10-20)" ~ 15, age == "[20-30)" ~ 25,
           age == "[30-40)" ~ 35, age == "[40-50)" ~ 45, age == "[50-60)" ~ 55,
           age == "[60-70)" ~ 65, age == "[70-80)" ~ 75, age == "[80-90)" ~ 85,
           age == "[90-100)" ~ 95),
         adm_emerg  = as.integer(admission_type_id == 1),
         adm_urgent = as.integer(admission_type_id == 2))

cat("Cleaned analytic sample:", nrow(df), "\n")
cat("30-day readmission rate:", scales::percent(mean(df$readmit_30), 0.01), "\n")


# ---- 2. Research Question 1: Discharge Disposition Chi-Square Analysis ----
df <- df %>% mutate(disp_grp = case_when(
  discharge_disposition_id == 1 ~ "Home",
  discharge_disposition_id %in% c(3, 5, 6) ~ "SNF/Rehab",
  discharge_disposition_id == 2 ~ "Other Hospital",
  discharge_disposition_id %in% c(4, 22, 23, 24) ~ "Home w/ Health Svc",
  TRUE ~ "Other"))

tab_q1  <- table(df$disp_grp, df$readmit_30)
q1_test <- chisq.test(tab_q1); print(q1_test)
cramers_v <- sqrt(q1_test$statistic / (sum(tab_q1) * (min(dim(tab_q1)) - 1)))
cat("Cramer's V:", round(cramers_v, 4), "\n")


# ---- 3. Research Question 2: Length-of-Stay Two-Way ANOVA Analysis ----
sub <- df %>%
  filter(admission_type_id %in% 1:3) %>%
  mutate(adm = factor(admission_type_id, levels = c(3, 1, 2),
                      labels = c("Elective", "Emergency", "Urgent")),
         age_grp = case_when(
           age %in% c("[0-10)","[10-20)","[20-30)","[30-40)") ~ "<40",
           age %in% c("[40-50)","[50-60)")                    ~ "40-60",
           TRUE                                               ~ "60+"),
         age_grp = factor(age_grp, levels = c("<40", "40-60", "60+")))

q2_fit <- aov(time_in_hospital ~ adm * age_grp, data = sub)
print(summary(q2_fit))
ssq <- summary(q2_fit)[[1]][["Sum Sq"]]
cat("Eta sq (admission):  ", round(ssq[1]/sum(ssq), 4), "\n")
cat("Eta sq (age):        ", round(ssq[2]/sum(ssq), 4), "\n")
cat("Eta sq (interaction):", round(ssq[3]/sum(ssq), 4), "\n")


# ---- 4. Research Question 3: Seven-Model Predictive Comparison ----
feat_cols <- c("time_in_hospital","num_lab_procedures","num_procedures",
               "num_medications","number_outpatient","number_emergency",
               "number_inpatient","number_diagnoses","age_num",
               "race_AA","race_His","race_Oth","gender_bin","A1C_tested",
               "change_bin","diabetesMed_bin","insulin_bin",
               "adm_emerg","adm_urgent")
feat_labels <- c("Length of stay (days)","# Lab procedures","# In-hospital procedures",
                 "# Medications","Prior outpatient visits","Prior ER visits",
                 "Prior inpatient visits","# Diagnoses","Age",
                 "Race: African American","Race: Hispanic","Race: Other",
                 "Sex: Male","HbA1c tested","Medication changed",
                 "On diabetes medication","On insulin",
                 "Admission: Emergency","Admission: Urgent")

X <- as.matrix(df[, feat_cols]); y <- df$readmit_30
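# 70/30 split; createDataPartition stratifies on the outcome vector y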
idx  <- caret::createDataPartition(y, p = 0.7, list = FALSE)
X_tr <- X[idx, ];  y_tr <- y[idx]
X_te <- X[-idx, ]; y_te <- y[-idx]
mu  <- colMeans(X_tr); sd_ <- apply(X_tr, 2, sd)
X_tr_s <- scale(X_tr, center = mu, scale = sd_)
X_te_s <- scale(X_te, center = mu, scale = sd_)
cat("Train:", nrow(X_tr_s), "Test:", nrow(X_te_s), "\n")

# 4a — Logistic GLM
m_glm <- glm(y_tr ~ ., data = data.frame(X_tr_s, y_tr = y_tr), family = "binomial")

# 4b — Ridge (L2)
m_ridge <- cv.glmnet(X_tr_s, y_tr, family = "binomial",
                     alpha = 0, type.measure = "auc", nfolds = 5)

# 4c — Lasso (L1)
m_lasso <- cv.glmnet(X_tr_s, y_tr, family = "binomial",
                     alpha = 1, type.measure = "auc", nfolds = 5)

# 4d — Elastic Net (alpha = 0.5)
m_enet <- cv.glmnet(X_tr_s, y_tr, family = "binomial",
                    alpha = 0.5, type.measure = "auc", nfolds = 5)

# 4e — Random Forest
m_rf <- randomForest(x = X_tr_s, y = factor(y_tr),
                     ntree = 200, maxnodes = 100, nodesize = 20)

# 4f — XGBoost
dtrain <- xgb.DMatrix(X_tr_s, label = y_tr)
m_xgb  <- xgb.train(data = dtrain, nrounds = 300, max_depth = 4,
                    eta = 0.05, subsample = 0.85, colsample_bytree = 0.85,
                    objective = "binary:logistic", eval_metric = "auc",
                    verbose = 0)

# 4g — Neural Network
m_nn <- nnet(x = X_tr_s, y = y_tr, size = 16, decay = 0.001,
             maxit = 200, trace = FALSE, MaxNWts = 3000)


# ---- 5. Model Evaluation & Production Cache Export ----
score <- function(probs, y_true) {
  roc_obj <- pROC::roc(y_true, probs, quiet = TRUE)
  auc_val <- as.numeric(pROC::auc(roc_obj))
  thr     <- as.numeric(pROC::coords(roc_obj, "best", best.method = "youden")["threshold"])
  pred    <- as.integer(probs >= thr)
  cm      <- table(actual = y_true, predicted = factor(pred, levels = c(0, 1)))
  list(auc = round(auc_val, 4), threshold = round(thr, 4),
       sensitivity = round(cm[2,2] / sum(cm[2,]), 4),
       specificity = round(cm[1,1] / sum(cm[1,]), 4),
       precision   = round(cm[2,2] / max(sum(cm[,2]), 1), 4),
       accuracy    = round((cm[1,1] + cm[2,2]) / sum(cm), 4),
       f1          = round(2 * cm[2,2] / (2 * cm[2,2] + cm[1,2] + cm[2,1]), 4),
       tn = as.integer(cm[1,1]), fp = as.integer(cm[1,2]),
       fn = as.integer(cm[2,1]), tp = as.integer(cm[2,2]),
       fpr_curve = as.numeric(1 - roc_obj$specificities),
       tpr_curve = as.numeric(roc_obj$sensitivities))
}

p_glm   <- predict(m_glm, newdata = data.frame(X_te_s), type = "response")
p_ridge <- as.numeric(predict(m_ridge, X_te_s, s = "lambda.min", type = "response"))
p_lasso <- as.numeric(predict(m_lasso, X_te_s, s = "lambda.min", type = "response"))
p_enet  <- as.numeric(predict(m_enet,  X_te_s, s = "lambda.min", type = "response"))
p_rf    <- predict(m_rf, X_te_s, type = "prob")[, "1"]
p_xgb   <- predict(m_xgb, xgb.DMatrix(X_te_s))
p_nn    <- as.numeric(predict(m_nn, X_te_s, type = "raw"))

scoreboard <- tibble(
  Model = c("Logistic GLM","Ridge (L2)","Lasso (L1)","Elastic Net",
            "Random Forest","XGBoost","Neural Network"),
  AUC   = c(score(p_glm,y_te)$auc, score(p_ridge,y_te)$auc, score(p_lasso,y_te)$auc,
            score(p_enet,y_te)$auc, score(p_rf,y_te)$auc, score(p_xgb,y_te)$auc,
            score(p_nn,y_te)$auc))
print(scoreboard)

# Linear surrogate on logit(p) for non-linear models — lets the JS engine recompute
surrogate <- function(probs, X_s) {
  eps <- 1e-6
  p_clip <- pmin(pmax(probs, eps), 1 - eps)
  lgt   <- log(p_clip / (1 - p_clip))
  fit   <- lm(lgt ~ X_s)
  list(coefs = as.numeric(coef(fit)[-1]),
       intercept = as.numeric(coef(fit)[1]))
}
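# Pre-compute TP/FP counts over a 50-point threshold grid so the browser engine
# can rebuild confusion matrices and net-benefit curves without re-scoring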
thr_grid <- function(probs, y_true) {
  thr  <- seq(0.01, 0.5, length.out = 50)
  tp_v <- integer(length(thr)); fp_v <- integer(length(thr))
  for (i in seq_along(thr)) {
    pred    <- as.integer(probs >= thr[i])
    tp_v[i] <- sum(pred == 1 & y_true == 1)
    fp_v[i] <- sum(pred == 1 & y_true == 0)
  }
  list(thr = thr, tp = tp_v, fp = fp_v,
       all_positives = as.integer(sum(y_true == 1)),
       all_negatives = as.integer(sum(y_true == 0)))
}
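# Observed vs predicted event rate per predicted-risk decile; feeds the
# calibration and risk-gradient panels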
decile_gradient <- function(probs, y_true) {
  d <- ntile(probs, 10)
  out <- tibble(p = probs, y = y_true, d = d) %>%
    group_by(d) %>%
    summarise(pred = mean(p), obs = mean(y), n = n(), .groups = "drop")
  list(decile = as.integer(out$d - 1L), pred = round(out$pred, 4),
       obs = round(out$obs, 4), n = as.integer(out$n))
}
build_entry <- function(name, kind, probs, X_te_s, y_te,
                        coefs = NULL, intercept = NULL, importance = NULL) {
  s <- score(probs, y_te)
  entry <- list(
    name = name, kind = kind,
    metrics = list(auc = s$auc, threshold = s$threshold,
                   sensitivity = s$sensitivity, specificity = s$specificity,
                   precision = s$precision, accuracy = s$accuracy, f1 = s$f1,
                   tn = s$tn, fp = s$fp, fn = s$fn, tp = s$tp,
                   fpr_curve = round(s$fpr_curve[seq(1, length(s$fpr_curve),
                                                    length.out = 80)], 4),
                   tpr_curve = round(s$tpr_curve[seq(1, length(s$tpr_curve),
                                                    length.out = 80)], 4)),
    deciles = decile_gradient(probs, y_te),
    threshold_grid = thr_grid(probs, y_te))
  if (kind == "linear") {
    entry$coefs     <- round(coefs, 5)
    entry$intercept <- round(intercept, 5)
  } else {
    surr <- surrogate(probs, X_te_s)
    entry$surrogate_coefs     <- round(surr$coefs, 5)
    entry$surrogate_intercept <- round(surr$intercept, 5)
    if (!is.null(importance))
      entry$feature_importance <- round(importance, 5)
  }
  entry
}

glm_coefs   <- as.numeric(coef(m_glm))
ridge_coefs <- as.numeric(coef(m_ridge, s = "lambda.min"))
lasso_coefs <- as.numeric(coef(m_lasso, s = "lambda.min"))
enet_coefs  <- as.numeric(coef(m_enet,  s = "lambda.min"))

payload <- list(
  features = feat_cols, labels = feat_labels,
  ui = lapply(seq_along(feat_cols), function(i) {
    fc <- feat_cols[i]
    if (fc %in% c("time_in_hospital","num_lab_procedures","num_procedures",
                  "num_medications","number_outpatient","number_emergency",
                  "number_inpatient","number_diagnoses","age_num"))
      list(type = "slider",
           min  = as.integer(min(X[, fc])),
           max  = as.integer(quantile(X[, fc], 0.99)),
           default = as.integer(round(mean(X[, fc]))),
           step = 1L)
    else
      list(type = "toggle", default = 0L)
  }),
  mu = round(mu, 5), sd = round(sd_, 5),
  population_rate = round(mean(y_te), 5),
  n_test = length(y_te),
  models = list(
    glm   = build_entry("Logistic GLM",  "linear", p_glm,   X_te_s, y_te,
                        glm_coefs[-1], glm_coefs[1]),
    ridge = build_entry("Ridge (L2)",    "linear", p_ridge, X_te_s, y_te,
                        ridge_coefs[-1], ridge_coefs[1]),
    lasso = build_entry("Lasso (L1)",    "linear", p_lasso, X_te_s, y_te,
                        lasso_coefs[-1], lasso_coefs[1]),
    enet  = build_entry("Elastic Net",   "linear", p_enet,  X_te_s, y_te,
                        enet_coefs[-1], enet_coefs[1]),
    rf    = build_entry("Random Forest", "tree",   p_rf,    X_te_s, y_te,
                        importance = as.numeric(importance(m_rf))),
    xgb   = build_entry("XGBoost",       "tree",   p_xgb,   X_te_s, y_te,
                        importance = xgb.importance(model = m_xgb)$Gain),
    nn    = build_entry("Neural Network","nn",     p_nn,    X_te_s, y_te)
  )
)
write_json(payload, "model_cache.json", auto_unbox = TRUE,
           digits = 5, pretty = FALSE)
cat("Cache exported:", file.size("model_cache.json"), "bytes\n")


# ---- 6. Publication-Ready Figures (Print Only) ----
# Each block recreates one figure that appears in the Word report. Run the
# script interactively (or source it inside RStudio) to see them in the Plots
# pane. No ggsave / no file writes — pure print() calls.

# Figure 1 — Cohort funnel
fig1 <- tibble(
  Step  = factor(c("Raw\nencounters","After death/\nhospice removal",
                   "First encounter\nper patient","After missing\nfilter"),
                 levels = c("Raw\nencounters","After death/\nhospice removal",
                            "First encounter\nper patient","After missing\nfilter")),
  N     = c(101766, 99343, 69990, 68061)) %>%
  ggplot(aes(x = Step, y = N)) +
  geom_col(fill = c(NAVY, TEAL, TEAL, GREEN), width = 0.6,
           color = "white", linewidth = 1.5) +
  geom_text(aes(label = scales::comma(N)), vjust = -0.5,
            fontface = "bold", color = NAVY, size = 4) +
  scale_y_continuous(labels = comma, limits = c(0, 115000)) +
  labs(title = "Figure 1.  Cohort Construction — From Raw Data to Analytic Sample",
       x = NULL, y = "Encounters") +
  report_theme
print(fig1)

# Figure 2a — Outcome distribution donut (No / Readmit >30d / Readmit <30d)
# Counts derived from the cleaned cohort
out_counts <- df %>%
  mutate(grp = case_when(readmitted == "<30" ~ "Readmit <30d",
                          readmitted == ">30" ~ "Readmit >30d",
                          TRUE                ~ "No readmit")) %>%
  count(grp) %>%
  mutate(grp = factor(grp, levels = c("No readmit","Readmit >30d","Readmit <30d")),
         pct = n / sum(n) * 100,
         label = sprintf("%s\n%.1f%%", grp, pct))

fig2a <- ggplot(out_counts, aes(x = 2, y = n, fill = grp)) +
  geom_col(width = 1, color = "white", linewidth = 2) +
  coord_polar(theta = "y", start = 0) +
  xlim(0.5, 2.5) +
  scale_fill_manual(values = c("No readmit" = GREEN, "Readmit >30d" = "#D4A52A",
                               "Readmit <30d" = RED), guide = "none") +
  geom_text(aes(label = label), position = position_stack(vjust = 0.5),
            color = "white", fontface = "bold", size = 3.5) +
  annotate("text", x = 0.5, y = 0, label = sprintf("%.2f%%", mean(df$readmit_30)*100),
           size = 8, fontface = "bold", color = RED) +
  annotate("text", x = 0.5, y = 0, label = "30-day rate", vjust = 3,
           size = 3.5, color = GREY) +
  annotate("text", x = 0.5, y = 0, label = sprintf("N = %s", scales::comma(nrow(df))),
           vjust = 5, size = 3, color = GREY) +
  labs(title = "Figure 2a.  Outcome Distribution") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", color = NAVY, size = 13,
                                  hjust = 0, margin = ggplot2::margin(b = 10)))
print(fig2a)

# Figure 2b — Business translation KPIs
# (Rendered as a text-only ggplot since it's a KPI panel, not a data chart)
n30 <- sum(df$readmit_30)
cost_per <- 15400
total_cost <- n30 * cost_per

kpis <- tibble(
  y     = c(4, 3, 2, 1),
  value = c(scales::comma(n30),
            scales::dollar(cost_per),
            sprintf("$%.1fM", total_cost / 1e6),
            "25–40%"),
  label = c("patients readmitted within 30 days",
            "avg cost per readmission",
            "estimated 30-day readmission cost in cohort",
            "estimated preventable share (Jencks et al.)"),
  color = c(RED, ORANGE, NAVY, GREEN))

fig2b <- ggplot(kpis, aes(x = 0, y = y)) +
  geom_text(aes(label = value, color = color),
            hjust = 0, fontface = "bold", size = 7) +
  geom_text(aes(label = label, x = 0.5),
            hjust = 0, color = GREY, size = 3.5) +
  scale_color_identity() +
  coord_cartesian(xlim = c(-0.05, 1.5), ylim = c(0.5, 4.6)) +
  labs(title = "Figure 2b.  Business Translation") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", color = NAVY, size = 13,
                                  hjust = 0, margin = ggplot2::margin(b = 14)))
print(fig2b)

# Figure 3 — Q1 diverging bar (Discharge disposition vs baseline)
baseline <- mean(df$readmit_30) * 100
fig3 <- df %>%
  group_by(disp_grp) %>%
  summarise(rate = mean(readmit_30) * 100, n = n(), .groups = "drop") %>%
  arrange(rate) %>%
  mutate(delta = rate - baseline,
         disp_grp = factor(disp_grp, levels = disp_grp),
         color_grp = case_when(delta < 0 ~ "Lower", delta < 5 ~ "Modest", TRUE ~ "High")) %>%
  ggplot(aes(x = delta, y = disp_grp, fill = color_grp)) +
  geom_col(width = 0.65, color = "white") +
  geom_vline(xintercept = 0, color = NAVY, linewidth = 0.6) +
  geom_text(aes(x = ifelse(delta >= 0, delta + 0.25, 0.25),
                label = sprintf("%.1f%%  (n = %s)", rate, comma(n))),
            hjust = 0, fontface = "bold", size = 3.4) +
  scale_fill_manual(values = c("Lower" = GREEN, "Modest" = ORANGE, "High" = RED),
                    guide = "none") +
  scale_x_continuous(limits = c(-5, 14)) +
  annotate("label", x = 12.5, y = 1, size = 3, color = NAVY,
           label = "chi sq = 757.1 · df = 4\np < .001\nCramer's V = 0.106") +
  labs(title = "Figure 3.  Q1 — Readmission Risk by Discharge Disposition",
       x = "Difference from baseline 30-day readmission rate (pp)",
       y = NULL) +
  report_theme + theme(legend.position = "none")
print(fig3)

# Figure 4a — Q2 interaction plot
cell_means <- sub %>% group_by(adm, age_grp) %>%
  summarise(m = mean(time_in_hospital), .groups = "drop")
fig4a <- ggplot(cell_means, aes(x = adm, y = m, color = age_grp, group = age_grp)) +
  geom_line(linewidth = 1.4) +
  geom_point(aes(shape = age_grp), size = 4) +
  geom_text(aes(label = sprintf("%.2f", m)), vjust = -1.3,
            fontface = "bold", size = 3.2, show.legend = FALSE) +
  scale_color_manual(values = c("<40" = TEAL, "40-60" = ORANGE, "60+" = RED),
                     name = "Age band") +
  scale_shape_manual(values = c("<40" = 16, "40-60" = 15, "60+" = 18),
                     name = "Age band") +
  coord_cartesian(ylim = c(2.8, 5.0)) +
  labs(title = "Figure 4a.  Interaction Plot — LOS by Admission x Age",
       x = "Admission Type", y = "Mean Length of Stay (days)") +
  report_theme
print(fig4a)

# Figure 4b — Effect sizes (F vs eta squared)
fig4b <- tibble(
  Factor = factor(c("Admission Type","Age Group","Interaction"),
                  levels = c("Admission Type","Age Group","Interaction")),
  F_stat = c(53.54, 373.88, 2.85),
  Eta_sq = c(0.0018, 0.0122, 0.0002)) %>%
  ggplot(aes(x = Factor, y = F_stat, fill = Factor)) +
  geom_col(width = 0.55, color = "white", linewidth = 1.5) +
  geom_text(aes(label = sprintf("F = %.1f", F_stat)), vjust = -0.4,
            fontface = "bold", size = 3.6) +
  geom_text(aes(label = sprintf("eta sq = %.4f", Eta_sq)),
            y = log10(2), color = "white", fontface = "bold", size = 3.2) +
  scale_y_log10(limits = c(0.5, 800)) +
  scale_fill_manual(values = c(TEAL, NAVY, GREY), guide = "none") +
  labs(title = "Figure 4b.  Effect Sizes",
       x = NULL, y = "F statistic (log scale)") +
  report_theme
print(fig4b)

# Figure 5a — Held-out AUC across all 7 models
fig5a <- tibble(
  Model = factor(c("Logistic GLM","Ridge (L2)","Lasso (L1)","Elastic Net",
                   "Random Forest","XGBoost","Neural Network"),
                 levels = c("Logistic GLM","Ridge (L2)","Lasso (L1)","Elastic Net",
                            "Random Forest","XGBoost","Neural Network")),
  AUC   = c(score(p_glm,y_te)$auc,  score(p_ridge,y_te)$auc,
            score(p_lasso,y_te)$auc, score(p_enet,y_te)$auc,
            score(p_rf,y_te)$auc,   score(p_xgb,y_te)$auc,
            score(p_nn,y_te)$auc)) %>%
  ggplot(aes(x = Model, y = AUC, fill = Model)) +
  geom_col(width = 0.55, color = "white", linewidth = 1.5) +
  geom_text(aes(label = sprintf("%.3f", AUC)), vjust = -0.4,
            fontface = "bold", size = 4) +
  geom_hline(yintercept = 0.50, linetype = "dashed", color = RED, alpha = 0.7) +
  annotate("text", x = 7.4, y = 0.505, label = "chance", color = RED, size = 3, hjust = 1) +
  scale_fill_manual(values = c(GREY, TEAL, NAVY, "#7c3aed", GREEN, ORANGE, RED),
                    guide = "none") +
  coord_cartesian(ylim = c(0.45, 0.65)) +
  labs(title = "Figure 5a.  Held-out AUC — 7 Models",
       x = NULL, y = "AUC (test set)") +
  report_theme + theme(axis.text.x = element_text(angle = 25, hjust = 1))
print(fig5a)

# Figure 5b — ROC curves (overlay)
roc_df <- bind_rows(
  tibble(fpr = 1 - pROC::roc(y_te, p_glm,   quiet = TRUE)$specificities,
         tpr = pROC::roc(y_te, p_glm,   quiet = TRUE)$sensitivities,
         model = sprintf("GLM (%.3f)",   score(p_glm,   y_te)$auc)),
  tibble(fpr = 1 - pROC::roc(y_te, p_ridge, quiet = TRUE)$specificities,
         tpr = pROC::roc(y_te, p_ridge, quiet = TRUE)$sensitivities,
         model = sprintf("Ridge (%.3f)", score(p_ridge, y_te)$auc)),
  tibble(fpr = 1 - pROC::roc(y_te, p_lasso, quiet = TRUE)$specificities,
         tpr = pROC::roc(y_te, p_lasso, quiet = TRUE)$sensitivities,
         model = sprintf("Lasso (%.3f)", score(p_lasso, y_te)$auc)),
  tibble(fpr = 1 - pROC::roc(y_te, p_rf,    quiet = TRUE)$specificities,
         tpr = pROC::roc(y_te, p_rf,    quiet = TRUE)$sensitivities,
         model = sprintf("RF (%.3f)",    score(p_rf,    y_te)$auc)))
fig5b <- ggplot(roc_df, aes(fpr, tpr, color = model)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = GREY) +
  geom_line(linewidth = 1.1) +
  scale_color_manual(values = c(GREY, TEAL, NAVY, GREEN), name = NULL) +
  labs(title = "Figure 5b.  ROC Curves on Held-out Test Set",
       x = "False positive rate (1 - specificity)",
       y = "True positive rate (sensitivity)") +
  report_theme + theme(legend.position = c(0.7, 0.2))
print(fig5b)

# Figure 5c — Lasso feature retention donut (10 of 19 features survive L1)
n_retained <- sum(lasso_coefs[-1] != 0)
n_dropped  <- length(feat_cols) - n_retained
ret_df <- tibble(Status = factor(c("Retained","Dropped"),
                                  levels = c("Retained","Dropped")),
                 Count  = c(n_retained, n_dropped))
fig5c <- ggplot(ret_df, aes(x = 2, y = Count, fill = Status)) +
  geom_col(width = 1, color = "white", linewidth = 2) +
  coord_polar(theta = "y") +
  xlim(0.5, 2.5) +
  scale_fill_manual(values = c("Retained" = NAVY, "Dropped" = "#E0E0E0"),
                    guide = "none") +
  geom_text(aes(label = sprintf("%s\n%d features", Status, Count)),
            position = position_stack(vjust = 0.5),
            color = c("white","#404040"), fontface = "bold", size = 3.6) +
  annotate("text", x = 0.5, y = 0,
           label = sprintf("%.0f%%", n_retained / length(feat_cols) * 100),
           size = 9, fontface = "bold", color = NAVY) +
  annotate("text", x = 0.5, y = 0, label = "feature\nparsimony",
           vjust = 2.6, size = 3.2, color = GREY) +
  labs(title = "Figure 5c.  Lasso Feature Retention") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", color = NAVY, size = 13,
                                  hjust = 0, margin = ggplot2::margin(b = 10)))
print(fig5c)

# Figure 6 — Lasso variable importance
imp_df <- tibble(feature = feat_labels, beta = lasso_coefs[-1]) %>%
  filter(abs(beta) > 1e-6) %>%
  arrange(desc(abs(beta))) %>%
  mutate(feature = factor(feature, levels = rev(feature)),
         direction = if_else(beta > 0, "Increases risk", "Decreases risk"))
fig6 <- ggplot(imp_df, aes(x = beta, y = feature, fill = direction)) +
  geom_col(color = "white", linewidth = 1) +
  geom_vline(xintercept = 0, color = NAVY) +
  geom_text(aes(label = sprintf("%+.3f", beta),
                x = beta + if_else(beta > 0, 0.005, -0.005),
                hjust = if_else(beta > 0, 0, 1)),
            fontface = "bold", size = 3.2) +
  scale_fill_manual(values = c("Increases risk" = RED, "Decreases risk" = GREEN)) +
  labs(title = "Figure 6.  Lasso Variable Importance",
       x = "Standardized log-odds coefficient (beta)",
       y = NULL, fill = NULL) +
  report_theme
print(fig6)

# Figure 7a — Confusion matrix at Youden-optimal threshold (Lasso)
s_lasso <- score(p_lasso, y_te)
cm_df <- tibble(
  Actual    = factor(c("No","No","Yes","Yes"), levels = c("Yes","No")),
  Predicted = factor(c("No","Yes","No","Yes"), levels = c("No","Yes")),
  Count     = c(s_lasso$tn, s_lasso$fp, s_lasso$fn, s_lasso$tp),
  Label     = c("TN","FP","FN","TP")) %>%
  mutate(Pct = Count / sum(Count) * 100,
         Fill = if_else(Label %in% c("TN","TP"), "#A8E0B8", "#F9C9C9"))
fig7a <- ggplot(cm_df, aes(Predicted, Actual, fill = Fill)) +
  geom_tile(color = "white", linewidth = 2) +
  geom_text(aes(label = sprintf("%s\n%s\n(%.1f%%)", Label, comma(Count), Pct)),
            fontface = "bold", size = 4.5) +
  scale_fill_identity() +
  labs(title = sprintf("Figure 7a.  Confusion Matrix at Threshold = %.3f (Lasso)",
                       s_lasso$threshold),
       x = NULL, y = NULL) +
  report_theme +
  theme(panel.grid = element_blank(),
        axis.text  = element_text(face = "bold", size = 11))
print(fig7a)

# Figure 7b — Business read-out per 10,000 discharges
# Scale the confusion-matrix counts from the held-out test set to a 10,000-discharge population
scale_factor       <- 10000 / length(y_te)
flagged_10k        <- round((s_lasso$tp + s_lasso$fp) * scale_factor)
caught_10k         <- round(s_lasso$tp * scale_factor)
missed_10k         <- round(s_lasso$fn * scale_factor)
intervention_cost  <- flagged_10k * 300                 # $300 per intervention
readmission_saved  <- caught_10k * 0.30 * 15400         # 30% prevented, $15,400 each
net_benefit        <- readmission_saved - intervention_cost

biz <- tibble(
  y     = c(6, 5, 4, 3, 2, 1),
  value = c(scales::comma(flagged_10k),
            scales::comma(caught_10k),
            scales::comma(missed_10k),
            sprintf("$%.2fM", intervention_cost / 1e6),
            sprintf("$%.2fM", readmission_saved / 1e6),
            sprintf("%s$%.2fM",
                    if_else(net_benefit >= 0, "+", "-"),
                    abs(net_benefit) / 1e6)),
  label = c("Patients flagged for high-risk pathway",
            "True readmissions caught",
            "True readmissions missed",
            "Cost of high-risk pathway @ $300/pt",
            "Cost of missed readmits @ $15,400",
            "Net benefit per 10,000 discharges"),
  color = c(ORANGE, GREEN, RED, NAVY, RED,
            if_else(net_benefit >= 0, GREEN, RED)))

fig7b <- ggplot(biz, aes(x = 0, y = y)) +
  geom_text(aes(label = value, color = color),
            hjust = 0, fontface = "bold", size = 6.5) +
  geom_text(aes(label = label, x = 0.42),
            hjust = 0, color = GREY, size = 3.4) +
  scale_color_identity() +
  coord_cartesian(xlim = c(-0.05, 1.4), ylim = c(0.5, 6.6)) +
  labs(title = "Figure 7b.  Business Read-out per 10,000 Discharges") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", color = NAVY, size = 13,
                                  hjust = 0, margin = ggplot2::margin(b = 14)))
print(fig7b)

# Figure 8 — Calibration plot (Lasso)
calib <- tibble(prob = p_lasso, actual = y_te) %>%
  mutate(decile = ntile(prob, 10)) %>%
  group_by(decile) %>%
  summarise(pred_mean = mean(prob), obs_rate = mean(actual), .groups = "drop")
fig8 <- ggplot(calib, aes(pred_mean, obs_rate)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = GREY) +
  geom_hline(yintercept = mean(y_te), linetype = "dotted", color = RED) +
  geom_line(color = NAVY, linewidth = 1) +
  geom_point(color = NAVY, size = 4) +
  geom_text(aes(label = paste0("D", decile)), vjust = -1, hjust = 0.3,
            size = 3, color = NAVY) +
  coord_cartesian(xlim = c(0, 0.18), ylim = c(0, 0.18)) +
  labs(title = "Figure 8.  Calibration on Held-out Test Set (Lasso)",
       x = "Mean predicted probability per decile",
       y = "Observed readmission rate per decile") +
  report_theme
print(fig8)

# Figure 9 — Risk gradient by decile (Lasso)
fig9 <- calib %>%
  ggplot(aes(x = factor(decile), y = obs_rate * 100, fill = obs_rate)) +
  geom_col(width = 0.7, color = "white", linewidth = 1.2) +
  geom_text(aes(label = sprintf("%.1f%%", obs_rate * 100)),
            vjust = -0.4, fontface = "bold", size = 3.4) +
  geom_hline(yintercept = mean(y_te) * 100, linetype = "dashed", color = NAVY) +
  scale_fill_distiller(palette = "RdYlGn", direction = -1, guide = "none") +
  scale_x_discrete(labels = paste0("D", 1:10)) +
  labs(title = "Figure 9.  Risk Gradient — Observed Readmission Rate by Decile",
       x = "Predicted-risk decile (D1 = lowest, D10 = highest)",
       y = "Observed 30-day rate (%)") +
  report_theme
print(fig9)

# =============================================================================
# END OF SCRIPT
# Knit ALY6015_FinalApp.Rmd next to render the live RPubs app.
# =============================================================================
Built May 2026 · R 4.5.2 · model_cache.json v2 · set.seed(42) · Anush Goel, Mandeep Kaur, Sandeep Kaur