| Question | Purpose |
|---|---|
| What was the prevalence of documented in-hospital cardiovascular complications? | Describe outcome frequency and class imbalance. |
| What were the distributions and missingness patterns of admission-available predictors? | Assess feasibility and data quality. |
| What were the apparent and optimism-corrected discrimination estimates for the primary model? | Evaluate risk ranking under internal validation. |
| How uncertain was calibration of predicted risks? | Assess whether probabilities were reliable enough for interpretation. |
| Did low-risk thresholds provide useful sensitivity, PPV, and net benefit? | Evaluate exploratory clinical utility. |
| How sensitive were results to missing-data handling and highly missing biomarkers? | Assess robustness and care-process bias. |
Exploratory Development and Internal Validation of an Admission-Based Prognostic Model for Cardiovascular Complications in Hospitalized COVID-19 Patients: A Single-Hospital Feasibility Study
A Retrospective Prognostic Prediction-Modeling Study
Background: Cardiovascular complications are clinically important among hospitalized patients with COVID-19, but risk prediction in small retrospective cohorts is challenged by rare outcomes, missing biomarkers, and limited transportability across settings.
Objective: To evaluate the feasibility, apparent performance, internal validity, calibration, and threshold behavior of a low-dimensional admission-based prognostic model for in-hospital cardiovascular complications among adult hospitalized patients with COVID-19.
Methods: We conducted a retrospective single-hospital prognostic prediction-modeling feasibility study using de-identified clinical data from hospitalized patients with COVID-19. Candidate predictors included age, sex, cough, ischemic coronary artery disease, lymphocyte count, blood urea nitrogen, creatinine, partial thromboplastin time, and procalcitonin. Identifiers were removed, categorical variables were harmonized, continuous predictors were retained on their measurement scale or log-transformed where appropriate, and missingness indicators were created. The primary probability model was unweighted ridge logistic regression with median or most-frequent imputation embedded inside the modeling pipeline. Performance assessment included apparent and bootstrap optimism-corrected AUC, PR-AUC, Brier score, calibration summaries, threshold-specific metrics, and decision-curve analysis.
Results: The analytic cohort included 1,042 patients, of whom 41 (3.93%) had documented cardiovascular complications. Procalcitonin was missing in 652 patients (62.6%), and partial thromboplastin time was missing in 549 patients (52.7%). The primary unweighted ridge model showed apparent AUC 0.834 and bootstrap optimism-corrected AUC 0.774, but apparent PR-AUC was 0.184. Apparent calibration was less alarming than that of the class-weighted model but remained uncertain: calibration intercept 0.377, slope 1.192, and optimism-corrected slope 0.843. At thresholds of 0.02, 0.05, 0.10, and 0.15, sensitivity decreased from 0.951 to 0.293, PPV ranged from 0.067 to 0.245, and net benefit was positive but small. A balanced ridge sensitivity model improved apparent classification pressure but produced severe overprediction.
Conclusions: The model showed exploratory risk-ranking signal but weak positive-class precision, threshold instability, limited event count, substantial biomarker missingness, calibration uncertainty, unresolved outcome-adjudication details, and no external validation. It is not suitable for clinical deployment. Class-weighted modeling materially worsened calibration and should be treated only as a classification sensitivity analysis. Larger externally validated studies with standardized outcome adjudication, prespecified missing-data handling including multiple-imputation sensitivity analysis, calibration assessment, and clinical-utility evaluation are required before clinical decision support can be considered.
Keywords: COVID-19, cardiovascular complications, clinical prediction model, interpretable machine learning, logistic regression, internal validation, explainable AI, calibration
1 Introduction
1.1 Clinical Importance of Cardiovascular Complications in Hospitalized COVID-19
Cardiovascular complications are a clinically important source of deterioration among hospitalized patients with COVID-19. SARS-CoV-2 infection has been associated with myocardial injury, arrhythmia, heart failure, thromboembolic events, acute coronary syndromes, and cardiovascular death, particularly in patients with severe systemic illness or pre-existing cardiovascular disease (Zheng et al. 2020; Giustino et al. 2020). In hospital practice, these events matter because they may alter triage intensity, monitoring requirements, anticoagulation and cardiology consultation decisions, and escalation planning. An admission-based model for cardiovascular complication risk would therefore be most defensible as an aid to structured risk stratification and local hypothesis generation, not as a stand-alone triage rule.
1.2 Existing Prediction Models and Transportability Limits
Several COVID-19 prediction models were proposed early in the pandemic, including a multicenter risk score by Huang et al. (2020) for cardiovascular complications among patients with COVID-19. Such models are valuable because they organize clinically plausible predictors into an estimated risk, but their performance can change when transported across hospitals, populations, disease waves, laboratory workflows, outcome definitions, and missing-data mechanisms. This concern is not theoretical: early COVID-19 prediction-model literature was repeatedly criticized for optimistic performance, incomplete reporting, high risk of bias, and limited calibration assessment (Wynants et al. 2020). For this reason, any local model developed from a single-hospital cohort should be framed as exploratory model development and internal validation, reported according to TRIPOD and TRIPOD+AI principles, and interpreted through PROBAST and PROBAST+AI risk-of-bias considerations (Collins et al. 2015, 2024; Wolff et al. 2019; Moons et al. 2025).
1.3 Methodological Challenges in a Small, Imbalanced Retrospective Cohort
Small retrospective clinical datasets create a difficult setting for prognostic modeling. When cardiovascular complications are rare, the number of outcome events rather than the total sample size becomes the practical constraint on model complexity, coefficient stability, internal validation, and threshold selection (Steyerberg 2019; Harrell 2015). Severe class imbalance can make accuracy actively misleading: a model may appear accurate by predicting the majority class while failing to identify patients with complications. Precision-recall analysis is often more informative than receiver operating characteristic analysis under imbalance because positive predictive value and sensitivity are directly affected by outcome prevalence (Saito and Rehmsmeier 2015). Calibration also requires explicit assessment, because a model with acceptable discrimination may still produce risk estimates that are systematically too high or too low for clinical use (Van Calster et al. 2019). These issues are intensified when key biomarkers, such as procalcitonin, have substantial missingness that may reflect illness severity, ordering practices, or data-collection artifacts rather than random absence.
1.4 Interpretability and Explainability
Interpretability is important in clinical prediction because clinicians and investigators need to understand whether a model’s behavior is coherent with the clinical setting, whether predictor contributions are plausible, and whether local explanations are stable enough to support further investigation. However, post-hoc explanation methods should not be treated as causal evidence. Variable importance, coefficient direction, partial dependence, and local prediction breakdowns describe model behavior under the available data and preprocessing choices; they do not establish that a predictor causes cardiovascular complications. This distinction is especially important in high-stakes health-care settings, where Rudin (2019) argues for interpretable modeling approaches whenever possible, and where model-agnostic explanations require careful interpretation in light of missingness, confounding, measurement error, and validation limitations (Molnar 2025).
1.5 Clinical-Prognostic Conceptual Framework
The candidate predictors were organized around a clinical-prognostic framework rather than treated as interchangeable algorithmic features. Demographic vulnerability was represented by age and sex; pre-existing cardiovascular susceptibility by ischemic coronary artery disease; immune dysregulation by lymphocyte count; renal or systemic severity by blood urea nitrogen and creatinine; coagulation abnormality by partial thromboplastin time; and inflammatory or infectious burden by procalcitonin. Missingness indicators were interpreted separately as care-process signals, not biological mechanisms, because missing laboratory values may reflect selective ordering, documentation practices, resource availability, or illness severity.
This framework implies four cautious hypotheses. First, admission-available clinical and laboratory variables may discriminate patients with and without documented cardiovascular complications better than chance. Second, calibration may be unstable because the outcome is rare and several biomarkers are selectively measured. Third, positive predictive value and precision-recall performance may remain limited even when ROC AUC appears acceptable because the outcome prevalence is low. Fourth, excluding highly missing biomarkers may not materially reduce discrimination if missingness-related care processes are contributing to apparent model performance.
1.6 Study Objective and Research Questions
The objective of this study is to evaluate whether a low-dimensional, interpretable, admission-based prognostic model can provide stable risk ranking and sufficiently reliable absolute-risk estimates for in-hospital cardiovascular complications among adult hospitalized patients with COVID-19. The study is explicitly positioned as a retrospective, single-hospital, exploratory feasibility and internal-validation analysis. It is not a definitive external validation study and does not produce a deployable clinical decision tool.
The analysis addressed six research questions:
Table 1 provides the question hierarchy used to organize the Results and Discussion. The questions are descriptive, predictive, and decision-analytic; none are causal.
2 Methods
2.1 Study Design and Reporting Framework
This study was designed as a retrospective observational prognostic prediction-model study using a single-hospital cohort of adult patients hospitalized with COVID-19. The analysis is limited to model development and internal validation; it is not an external validation study and is not intended to produce a deployable clinical decision tool. Reporting was structured according to TRIPOD and TRIPOD+AI, which emphasize transparent description of data sources, eligibility, predictors, outcome definition, sample size, missing data, model specification, validation, and performance (Collins et al. 2015, 2024). Risk-of-bias considerations were informed by PROBAST and PROBAST+AI, particularly the domains concerning participant selection, predictor definition, outcome definition, and analysis (Wolff et al. 2019; Moons et al. 2025).
The central methodological premise was that the small number of cardiovascular complication events should govern the complexity and interpretation of the model. Contemporary guidance on prediction-model sample size emphasizes that total sample size alone is insufficient; the number of outcome events, number of predictor parameters, expected outcome fraction, and anticipated model fit all affect overfitting risk and optimism (Riley et al. 2019). The analytic cohort contained 41 events. Even before considering missingness indicators and transformations, this event count allows only a limited predictor degrees-of-freedom budget. Therefore, the model was specified as a deliberately low-dimensional, penalized, interpretable admission-based model rather than a high-dimensional machine-learning classifier. The events-per-parameter ratio was treated as a warning signal rather than a mechanical adequacy rule, and all performance estimates were interpreted as exploratory.
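The event-limited budget described above can be made concrete with simple arithmetic. The sketch below computes an events-per-parameter ratio under two assumed parameter counts: nine candidate predictors entered as one parameter each, and the same nine plus one missingness indicator for each of the six numeric predictors. These counts are illustrative assumptions; the exact number of parameters in the fitted model is not stated here.

```python
# Events-per-parameter (EPP) arithmetic for the analytic cohort.
# ASSUMPTION: the parameter counts are illustrative, not the fitted model's exact count.
N_EVENTS = 41  # documented cardiovascular complications in the cohort


def events_per_parameter(n_events: int, n_parameters: int) -> float:
    """Return the number of outcome events per candidate predictor parameter."""
    if n_parameters <= 0:
        raise ValueError("n_parameters must be positive")
    return n_events / n_parameters


base_parameters = 9       # nine candidate predictors, one parameter each (assumed)
with_indicators = 9 + 6   # plus one missingness indicator per numeric predictor (assumed)

epp_base = events_per_parameter(N_EVENTS, base_parameters)
epp_with_indicators = events_per_parameter(N_EVENTS, with_indicators)

print(f"EPP, predictors only:           {epp_base:.2f}")
print(f"EPP, with missingness flags:    {epp_with_indicators:.2f}")
```

Under either assumption the ratio falls well below the commonly cited rule of thumb of 10 events per parameter, which is consistent with treating the ratio as a warning signal and relying on penalization rather than on the ratio as an adequacy check.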
2.2 Setting and Participants
The source population consisted of adult patients aged 19 years and above with RT-PCR-confirmed COVID-19 admitted to Dr. Jose N. Rodriguez Memorial Hospital and Sanitarium from March 2020 to December 2020. The intended clinical population was hospitalized patients with moderate, severe, or critical COVID-19. Patient-identifying and administrative variables were removed before analysis, including hospital numbers and index-like columns. Records with missing predictors were not excluded by default because complete-case restriction can reduce power and may introduce bias when missingness is related to clinical severity, laboratory ordering, or outcome status.
The available analytic files did not contain a complete screening log for readmissions, transfers, patients without RT-PCR confirmation, mild cases, or missing outcome status. The analysis therefore treated the cleaned dataset as the eligible analytic cohort and assumed one record per hospitalization. If repeat admissions were present in the source system, they could not be distinguished from index hospitalizations in the de-identified modeling file. This is handled as a participant-selection limitation rather than as a resolved eligibility feature.
2.3 Outcome
The outcome was in-hospital cardiovascular complication status, represented by the binary variable Cardio, coded as 1 for cardiovascular complication present and 0 for absent. For analysis, Cardio was treated as a binary in-hospital composite cardiovascular-complication indicator as encoded in the dataset. The de-identified analytic file did not include the source-protocol definition, specific event components, adjudication source, timing of occurrence, or whether outcome assessment was blinded to candidate predictors. The safest operational interpretation is therefore that the model predicts the locally recorded Cardio indicator rather than a universally adjudicated cardiovascular endpoint. This limits clinical interpretability and means that myocardial injury, arrhythmia, heart failure, acute coronary syndrome, myocarditis, thromboembolism, stroke, cardiac arrest, and cardiovascular death cannot be confirmed as included or excluded event components from the modeling file alone. Because event components and adjudication procedures were not available in the analysis file, potential outcome heterogeneity and misclassification are treated as major limitations rather than resolved assumptions.
2.4 Predictors
Candidate predictors were restricted to admission-available demographic, symptom, comorbidity, immune, renal, coagulation, and inflammatory markers: age during admission, sex, cough, ischemic coronary artery disease, lymphocyte count, blood urea nitrogen, creatinine, partial thromboplastin time, and procalcitonin. These variables were chosen because they are clinically plausible, likely to be available early in hospitalization, and low-dimensional enough for an exploratory rare-event model. The intended predictor window was the first recorded value from emergency department arrival through the first 24 hours of hospitalization and before any documented cardiovascular complication onset. Because exact laboratory timestamps and units were not available in the modeling file, the analysis assumed that these values represented admission or early-hospitalization measurements. Possible predictor-timing ambiguity and predictor-outcome temporal overlap are therefore treated as limitations.
Continuous laboratory variables were preserved as continuous predictors. They were not one-hot encoded and should not be arbitrarily dichotomized unless the analysis is explicitly reproducing a published score or a clinically prespecified threshold. For skewed biomarkers, log-transformed versions may be evaluated as sensitivity or prespecified transformed predictors, with the original measurement scale retained for descriptive reporting.
| Predictor | Coding | Expected direction | Sensitivity handling |
|---|---|---|---|
| Age | Continuous, years | Higher risk | None |
| Sex | Male/Female | Not prespecified | None |
| Cough | Binary | Uncertain | None |
| Ischemic CAD | Binary | Higher risk if present | None |
| LYM | Continuous | Lower values may increase risk | None |
| BUN | Continuous | Higher risk | log1p |
| Crea | Continuous | Higher risk | log1p |
| PTT | Continuous | Higher/prolonged risk | log1p |
| Procalcitonin | Continuous | Higher risk | log1p + missingness |
Table 2 summarizes the candidate model specification. The table is included in the Methods rather than the Results because it documents prespecified analytic handling decisions: continuous predictors are retained on their measurement scale, binary variables are harmonized, and procalcitonin is flagged for missingness-sensitive analyses. Predictor inclusion was based on clinical plausibility rather than data-driven screening.
2.5 Data Preprocessing
Data preprocessing was conducted before model estimation but without using outcome information to create artificial predictors or tune transformations. Identifier and administrative fields were removed. Categorical values were harmonized so that Sex was represented as Male or Female, and cough and ischemic coronary artery disease were represented as binary indicators. The outcome was standardized to 1 for cardiovascular complication present and 0 for absent. Numeric predictors were converted to numeric values after removal of clearly non-numeric formatting artifacts. Implausible values were flagged rather than automatically deleted: age below 19 or above 110 years, and negative values for lymphocyte count, blood urea nitrogen, creatinine, partial thromboplastin time, or procalcitonin.
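The flag-rather-than-delete rule can be sketched as follows. This is a minimal illustration using plain dictionaries; the field names and the two example records are hypothetical and are not taken from the study dataset.

```python
# Flag (do not delete) implausible values using the plausibility rules stated above.
# ASSUMPTION: field names and example records are hypothetical.
AGE_MIN, AGE_MAX = 19, 110
NON_NEGATIVE_LABS = ("lym", "bun", "crea", "ptt", "procalcitonin")


def implausibility_flags(record: dict) -> dict:
    """Return one boolean flag per checked field; missing values (None) are never flagged."""
    flags = {}
    age = record.get("age")
    flags["age"] = age is not None and (age < AGE_MIN or age > AGE_MAX)
    for lab in NON_NEGATIVE_LABS:
        value = record.get(lab)
        flags[lab] = value is not None and value < 0
    return flags


records = [
    {"age": 64.0, "lym": 8.4, "bun": 14.9, "crea": 123.7, "ptt": None, "procalcitonin": 0.18},
    {"age": 17.0, "lym": 20.4, "bun": -1.0, "crea": 83.1, "ptt": 31.9, "procalcitonin": None},
]

# The original records are left unchanged; flags are kept alongside them.
flagged = [implausibility_flags(r) for r in records]
for record, flags in zip(records, flagged):
    print(record["age"], [name for name, is_flagged in flags.items() if is_flagged])
```

Keeping missing values unflagged separates two distinct data-quality questions: whether a value was recorded at all, and whether a recorded value is plausible.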
Outliers were assessed using descriptive summaries, interquartile-range rules, and clinical plausibility checks. Extreme values were retained initially because automatic winsorization or deletion can distort clinical signals in small datasets. Log1p transformations were created for skewed biomarkers where appropriate, particularly blood urea nitrogen, creatinine, partial thromboplastin time, and procalcitonin. Original values were retained so that transformed predictors could be compared with clinically interpretable raw-scale summaries.
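A log1p transformation that keeps the raw value available might look like the sketch below. The helper name and the choice to leave negative (flagged) values untransformed are assumptions made for illustration; the two example values are the creatinine quartiles reported in the cohort table.

```python
import math


def log1p_or_none(value):
    """Return log(1 + value) for non-negative inputs.

    Missing values (None) pass through unchanged. Negative values, which are
    flagged upstream as implausible, are left untransformed (returned as None)
    rather than silently altered. ASSUMPTION: this handling is illustrative.
    """
    if value is None or value < 0:
        return None
    return math.log1p(value)


# Raw values are retained next to the transformed ones for descriptive reporting.
creatinine_raw = [59.48, 586.73, None]
creatinine_log1p = [log1p_or_none(v) for v in creatinine_raw]

for raw, transformed in zip(creatinine_raw, creatinine_log1p):
    print(raw, transformed)
```

The transformation compresses the right tail while preserving order, so rank-based summaries of the raw and transformed values agree.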
2.6 Missing Data
Missingness was quantified overall and by outcome group for all candidate predictors. Missing values were not imputed during data cleaning. Instead, missingness was preserved in the modeling-ready dataset, and missingness indicators were created for numeric predictors so that missingness could be evaluated as a possible care-process signal. Procalcitonin and partial thromboplastin time were treated as special concerns because substantial missingness may reflect selective laboratory ordering rather than random absence. Complete-case analysis was therefore considered insufficient as the sole analytic strategy. Complete-case estimates can be biased if the probability of having a measured biomarker is related to severity, outcome, or other observed clinical variables.
For the primary ridge logistic regression pipeline, numeric predictors were imputed using median imputation and binary or categorical predictors were imputed using most-frequent imputation, with imputation embedded inside the modeling pipeline. This pragmatic strategy defines a reproducible, end-to-end prediction pipeline under the observed missingness pattern, but it is not equivalent to a full multiple-imputation analysis and does not represent missing-data uncertainty. Multiple imputation by chained equations is a principled approach for mixed clinical data when a defensible missing-at-random working assumption is plausible and the imputation model is carefully specified (van Buuren and Groothuis-Oudshoorn 2011). Multiple imputation with Rubin-rule pooling was not performed in the present exploratory analysis because the study aimed to evaluate a pragmatic pipeline using the observed single-hospital data structure. Sensitivity analyses therefore focused on exclusion of procalcitonin, exclusion of both procalcitonin and partial thromboplastin time, complete-case analysis reported as unstable, and unweighted versus class-weighted ridge comparison. A multiple-imputation sensitivity analysis, ideally embedded within internal validation, should be added in future or final analyses.
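What "embedded inside the modeling pipeline" means in practice is that imputation values are learned from training rows only and then applied unchanged to any other rows. The sketch below shows this for median imputation with a missingness indicator, using only the standard library. The class name, field name, and toy values are hypothetical; the software used in the study is not stated here, and a library pipeline (for example scikit-learn's `SimpleImputer` with `add_indicator=True` inside a `Pipeline`) would be a more usual choice.

```python
from statistics import median


class MedianImputerWithIndicator:
    """Learn per-column medians on training rows, then apply them to other rows.

    ASSUMPTION: a hypothetical helper for illustration, not the study's code.
    """

    def fit(self, rows, columns):
        self.columns = list(columns)
        self.medians_ = {}
        for col in self.columns:
            observed = [r[col] for r in rows if r[col] is not None]
            # statistics.median raises StatisticsError if a column is entirely missing.
            self.medians_[col] = median(observed)
        return self

    def transform(self, rows):
        out = []
        for r in rows:
            new = dict(r)  # copy so the input rows are not modified
            for col in self.columns:
                missing = r[col] is None
                new[f"{col}_missing"] = int(missing)  # care-process indicator
                if missing:
                    new[col] = self.medians_[col]     # training-set median only
            out.append(new)
        return out


train = [{"ptt": 28.0}, {"ptt": None}, {"ptt": 32.0}, {"ptt": 36.0}]
holdout = [{"ptt": None}, {"ptt": 30.0}]

imputer = MedianImputerWithIndicator().fit(train, ["ptt"])
holdout_imputed = imputer.transform(holdout)
print(holdout_imputed)
```

Because the median is stored at fit time, refitting the imputer inside each bootstrap resample or fold keeps information from held-out rows out of the imputation step.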
2.7 Model Development
The primary probability model was unweighted ridge logistic regression. Unweighted penalization was prioritized because the goal was risk prediction and probability estimation, whereas class weighting can improve minority-class classification pressure while distorting calibration. Firth-type penalized likelihood remains a reasonable future comparator because it can reduce small-sample bias in rare-event binary models, although predicted probabilities and calibration still require careful evaluation (Puhr et al. 2017). Class-weighted ridge regression was retained only as a sensitivity analysis for classification behavior and calibration consequences, not as the primary clinical probability model. Standard logistic regression may be reported only as an exploratory comparator and should be described as potentially unstable if the number of events is limited relative to predictor parameters.
The modeling strategy should limit predictor degrees of freedom, avoid high-dimensional dummy expansion, preserve continuous predictors, and avoid data-driven variable selection. If transformations are evaluated, they should be prespecified or handled within resampling. Scaling, imputation, transformation selection, and any threshold optimization should be estimated inside each training fold or bootstrap sample to avoid optimistic performance estimates.
2.8 Internal Validation
Because cardiovascular complication events are rare, a simple 70/30 train-test split was not used as the primary validation strategy. Such a split would allocate very few positive cases to the test set and could yield highly unstable estimates of sensitivity, calibration, and positive predictive value. The analysis used bootstrap internal validation with 200 resamples for this exploratory pass. A final submission analysis should increase this to at least 500 to 1,000 resamples if computation permits and should repeat the entire preprocessing and modeling workflow inside each resample.
Internal validation should estimate optimism-corrected discrimination, calibration, and overall prediction error. The bootstrap procedure should repeat the entire modeling workflow, including imputation and any preprocessing steps that require learned parameters. Apparent performance should be reported separately from optimism-corrected performance so readers can see the likely degree of overfitting.
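The logic of bootstrap optimism correction can be sketched as follows. This is a minimal illustration of the standard optimism bootstrap (often attributed to Harrell); the text does not specify which bootstrap variant was used. Everything in the example is an assumption made for illustration: the synthetic two-predictor data, the small gradient-descent ridge fitter, and the use of 20 resamples instead of the 200 used in the study.

```python
import math
import random


def auc(y, scores):
    """Mann-Whitney AUC: P(score of a case > score of a non-case), ties count one half."""
    pos = [s for s, t in zip(scores, y) if t == 1]
    neg = [s for s, t in zip(scores, y) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_ridge_logistic(X, y, lam=1.0, lr=0.1, steps=200):
    """Gradient descent on the L2-penalized log-loss; the intercept is not penalized."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            gb += err
            for j in range(p):
                gw[j] += err * xi[j]
        b -= lr * gb / n
        w = [wj - lr * (gwj + lam * wj) / n for wj, gwj in zip(w, gw)]
    return w, b


def predict(model, X):
    """Return linear predictors; AUC depends only on their ordering."""
    w, b = model
    return [b + sum(wj * xj for wj, xj in zip(w, xi)) for xi in X]


def optimism_corrected_auc(X, y, n_boot=20, seed=0):
    rng = random.Random(seed)
    apparent = auc(y, predict(fit_ridge_logistic(X, y), X))
    optimism, n = [], len(y)
    while len(optimism) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:  # a resample without both classes has no defined AUC
            continue
        Xb = [X[i] for i in idx]
        model = fit_ridge_logistic(Xb, yb)      # refit the whole workflow in the resample
        boot_auc = auc(yb, predict(model, Xb))  # performance in the resample itself
        orig_auc = auc(y, predict(model, X))    # same model applied to the original data
        optimism.append(boot_auc - orig_auc)
    return apparent, apparent - sum(optimism) / len(optimism)


# Synthetic data: one informative predictor, one noise predictor (illustrative only).
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(150)]
y = [1 if data_rng.random() < 1 / (1 + math.exp(2.0 - 1.2 * x[0])) else 0 for x in X]

apparent, corrected = optimism_corrected_auc(X, y)
print(f"apparent AUC {apparent:.3f}, optimism-corrected AUC {corrected:.3f}")
```

In a real analysis the call to `fit_ridge_logistic` would be replaced by the full workflow, including imputation and any learned preprocessing, so that every data-dependent step is repeated in each resample.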
2.9 Performance Metrics
Model performance was evaluated using discrimination, calibration, overall accuracy of probability estimates, and threshold-specific classification metrics. Discrimination measures included the area under the receiver operating characteristic curve with a confidence interval and the precision-recall area under the curve, because precision-recall performance is more informative when the positive class is rare (Saito and Rehmsmeier 2015). Calibration measures included the calibration intercept, the calibration slope, and a calibration plot; calibration is essential because a model can rank patients acceptably while still producing unreliable absolute risk estimates (Van Calster et al. 2019). Overall prediction error was summarized with the Brier score.
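For reference, the calibration and overall-error quantities named above are conventionally defined as follows. The notation is introduced here for illustration: $\hat{p}_i$ is the predicted risk for patient $i$, $y_i \in \{0, 1\}$ is the observed outcome, and $n$ is the number of patients. Whether the reported intercept was estimated jointly with the slope or with the slope fixed at 1 is not stated, so both conventional forms are shown.

```latex
% Calibration slope b (and jointly estimated intercept a) from logistic recalibration
\[
\operatorname{logit} \Pr(y_i = 1) = a + b \, \operatorname{logit}(\hat{p}_i)
\]

% Calibration-in-the-large: intercept estimated with the slope fixed at 1
\[
\operatorname{logit} \Pr(y_i = 1) = a_{\mathrm{CITL}} + \operatorname{logit}(\hat{p}_i)
\]

% Brier score: mean squared difference between predicted risk and outcome
\[
\mathrm{Brier} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{p}_i - y_i \right)^2
\]
```

Ideal calibration corresponds to an intercept of 0 and a slope of 1. A slope below 1 indicates predictions that are too extreme, as is typical of overfitting, and a slope above 1 indicates predictions that are too moderate. As a benchmark for the Brier score, a constant prediction equal to the observed event fraction $\pi$ has Brier score $\pi(1 - \pi)$, which for $\pi = 41/1042$ is about 0.038.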
Threshold-specific metrics included sensitivity, specificity, positive predictive value, negative predictive value, balanced accuracy, and confusion matrices at clinically relevant thresholds. Because the observed event fraction was low, thresholds of 0.02, 0.05, 0.10, and 0.15 were evaluated to reflect low-to-moderate risk-alert thresholds that are more plausible than a default 0.50 probability cutoff in this setting. The 0.50 threshold was retained only as a reference and not as a clinically justified decision threshold. Decision-curve analysis was used to estimate net benefit across threshold probabilities while recognizing that clinical utility cannot be established without external validation and calibration assessment (Vickers and Elkin 2006).
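The threshold-specific metrics and the net-benefit quantity used in decision-curve analysis can be computed from a confusion matrix, as in the sketch below. Net benefit at threshold probability $p_t$ follows the Vickers and Elkin definition, TP/n minus FP/n weighted by $p_t / (1 - p_t)$. The function names and the ten-patient toy data are hypothetical and chosen only so that the arithmetic is easy to check by hand.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count TP, FP, TN, FN when predicted risk >= threshold is called positive."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted_positive = p >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def threshold_metrics(y_true, y_prob, threshold):
    """Return sensitivity, specificity, PPV, NPV, balanced accuracy, and net benefit."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    n = tp + fp + tn + fn
    nan = float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else nan
    specificity = tn / (tn + fp) if (tn + fp) else nan
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp) if (tp + fp) else nan,
        "npv": tn / (tn + fn) if (tn + fn) else nan,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        # Net benefit: true positives minus false positives weighted by the threshold odds.
        "net_benefit": tp / n - (fp / n) * (threshold / (1 - threshold)),
    }


# Toy data (hypothetical): 2 events among 10 patients.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.30, 0.04, 0.12, 0.06, 0.03, 0.02, 0.01, 0.01, 0.08, 0.01]

metrics_at_010 = threshold_metrics(y_true, y_prob, 0.10)
for name, value in metrics_at_010.items():
    print(f"{name}: {value:.3f}")
```

At the 0.10 threshold the toy data yield one true positive, one false positive, one false negative, and seven true negatives. With only 41 events in the study cohort, each of these cells is small at higher thresholds, which is one reason the threshold-specific estimates are unstable.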
2.10 Explainability Analysis
Explainability analyses will prioritize transparent model coefficients, coefficient direction, and clinically interpretable predictor contributions. If model-agnostic tools such as DALEX variable importance, partial dependence, accumulated local effects, or local prediction breakdowns are used, they will be presented as descriptions of model behavior rather than causal explanations. Local explanations should be inspected for stability, especially when they depend on variables with high missingness or sparse positive events. Any feature-importance ranking should be interpreted alongside missingness, event count, preprocessing, and clinical plausibility.
2.11 Risk-of-Bias Considerations
| Domain | Concern | Mitigation |
|---|---|---|
| Participants | Single-hospital 2020 cohort; unclear readmission, transfer, screening-log, and index-hospitalization handling. | Finalize eligibility criteria and index-hospitalization rules from source records. |
| Predictors | Admission timing and laboratory units require source verification; exact first-24-hour and pre-outcome timing could not be confirmed from the modeling file. | Verify timestamps and units from source records; treat possible predictor leakage as a limitation until verified. |
| Outcome | Binary Cardio variable lacks source-protocol definition, event components, adjudication source, timing, and blinding information. | Insert chart-abstraction definition and adjudication procedure when available; otherwise report as a local recorded indicator only. |
| Missing data | Procalcitonin and partial thromboplastin time were highly missing; median/mode imputation with indicators does not propagate imputation uncertainty. | Add multiple-imputation sensitivity analysis with Rubin-rule pooling in future or final analyses. |
| Analysis | Only 41 events; internal validation used 200 bootstrap resamples; threshold and calibration estimates are event-limited. | Use penalization, cautious claims, and increase bootstrap resamples to 500-1,000 for final reporting. |
| Applicability | No external or temporal validation; care-process missingness and local outcome coding may not transport. | Require external validation, calibration updating, and local workflow assessment before use. |
| Ethics and governance | Ethics committee, approval or waiver number, approval date, consent-waiver status, and data-governance conditions were unavailable in the analytic package. | Insert verifiable institutional ethics language before journal submission; retain as a limitation if unavailable. |
Table 3 summarizes the main PROBAST/PROBAST+AI concerns that constrain interpretation. The table is intentionally included in the Methods because these risks affect the design, analysis, and reporting of all subsequent results.
2.12 Ethics and Confidentiality
The study used retrospective de-identified clinical data with direct identifiers removed before analysis. Hospital numbers, patient names, record numbers, and index fields were excluded from modeling. The analytic files available to the analyst did not include the institutional review board or ethics committee name, approval or waiver classification, protocol number, approval date, consent-waiver status, or formal data-sharing conditions. The analysis is therefore reported as a secondary analysis of de-identified retrospective data, with ethics approval details unavailable in the modeling file. This absence is a reporting limitation and should be interpreted conservatively.
3 Results
3.1 Cohort Characteristics
The cleaned analytic cohort included 1,042 hospitalized patients with recorded cardiovascular complication status. Cardiovascular complications were documented in 41 patients (3.93%), while 1,001 patients had no recorded cardiovascular complication. The complete-case dataset for the original main predictors contained 141 patients, underscoring that complete-case analysis alone would discard most records and could alter the study population. Figure 1 shows the retained analytic sample and the marked outcome imbalance.
Table 4 summarizes the cohort by outcome group. Patients with cardiovascular complications had a higher median age than patients without complications (64.00 vs 51.00 years). Lymphocyte values were lower among patients with complications, while blood urea nitrogen and creatinine values were higher. These descriptive differences are clinically plausible but should not be interpreted causally because they are unadjusted summaries from a retrospective cohort.
| Variable | Overall | Cardio No | Cardio Yes | Missing |
|---|---|---|---|---|
| Age, years | 52.00 (34.07, 64.00) | 51.00 (34.00, 64.00) | 64.00 (58.50, 67.57) | 261 (25.0%) |
| LYM | 19.93 (11.15, 29.05) | 20.42 (11.92, 29.26) | 8.36 (5.83, 12.55) | 65 (6.2%) |
| BUN | 5.58 (3.47, 16.43) | 5.43 (3.42, 15.98) | 14.86 (6.41, 27.41) | 170 (16.3%) |
| Crea | 83.62 (59.48, 586.73) | 83.05 (59.05, 588.53) | 123.74 (77.51, 459.75) | 144 (13.8%) |
| PTT | 31.70 (28.20, 35.40) | 31.90 (28.30, 35.40) | 30.80 (27.55, 34.08) | 549 (52.7%) |
| Procalcitonin | 0.11 (0.05, 0.51) | 0.11 (0.05, 0.46) | 0.18 (0.10, 1.00) | 652 (62.6%) |
| Male sex | 539 (51.9%) | 513 (51.5%) | 26 (63.4%) | 4 (0.4%) |
| Cough | 417 (44.2%) | 397 (44.0%) | 20 (50.0%) | 99 (9.5%) |
| Ischemic CAD | 36 (3.6%) | 32 (3.3%) | 4 (10.0%) | 30 (2.9%) |
3.2 Missingness and Data Quality
Missingness was substantial for several admission biomarkers. Procalcitonin was missing in 652 patients (62.6%), and partial thromboplastin time was missing in 549 patients (52.7%). Age was missing in 261 patients (25.0%). Figure 2 highlights the missingness pattern visually, with procalcitonin standing out as the most incomplete candidate predictor. Table 5 gives the same pattern numerically and adds outcome-stratified missingness and handling decisions.
| Variable | Missing overall | Missing by Cardio status | Implausible | Handling |
|---|---|---|---|---|
| Age | 261 (25.0%) | No: 250 (25.0%); Yes: 11 (26.8%) | 24 | Model pipeline. |
| Sex | 4 (0.4%) | No: 4 (0.4%); Yes: 0 (0.0%) | 0 | Model pipeline. |
| Cough | 99 (9.5%) | No: 98 (9.8%); Yes: 1 (2.4%) | 0 | Model pipeline. |
| Ischemic CAD | 30 (2.9%) | No: 29 (2.9%); Yes: 1 (2.4%) | 0 | Model pipeline. |
| LYM | 65 (6.2%) | No: 62 (6.2%); Yes: 3 (7.3%) | 0 | Model pipeline. |
| BUN | 170 (16.3%) | No: 160 (16.0%); Yes: 10 (24.4%) | 0 | log1p sensitivity. |
| Crea | 144 (13.8%) | No: 137 (13.7%); Yes: 7 (17.1%) | 0 | log1p sensitivity. |
| PTT | 549 (52.7%) | No: 536 (53.5%); Yes: 13 (31.7%) | 0 | log1p sensitivity. |
| Procalcitonin | 652 (62.6%) | No: 626 (62.5%); Yes: 26 (63.4%) | 0 | Missingness indicator. |
The data-quality assessment identified 24 implausible age values, defined as ages below the adult-only lower bound of 19 years or above the upper bound of 110 years. These values were flagged rather than automatically removed. Biomarker outliers were retained for modeling because automatic deletion or winsorization can distort signals in small clinical datasets. The missingness pattern in Table 5 supports the planned sensitivity analyses described in the Methods, particularly analyses excluding procalcitonin and analyses using missingness indicators.
3.3 Outcome Imbalance
The outcome distribution was severely imbalanced: cardiovascular complications occurred in only 3.93% of the cohort. This prevalence means that overall accuracy is not a reliable primary measure of model performance. Precision-recall metrics, calibration, and threshold-specific detection are more informative when the minority class is rare (Saito and Rehmsmeier 2015; Van Calster et al. 2019). The class distribution shown in Figure 1 is therefore central to interpreting all model results.
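A short arithmetic check illustrates why accuracy is uninformative at this prevalence. The sketch below uses only the cohort size and event count reported in the text and evaluates a rule that labels every patient as event-free. The variable names are illustrative.

```python
n_total = 1042    # cohort size reported in the text
n_events = 41     # documented cardiovascular complications
n_nonevents = n_total - n_events

# Confusion-matrix cells for a rule that predicts "no complication" for everyone.
tp, fp = 0, 0
tn, fn = n_nonevents, n_events

prevalence = n_events / n_total
accuracy = (tp + tn) / n_total
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced_accuracy = (sensitivity + specificity) / 2

print(f"prevalence        = {prevalence:.4f}")
print(f"accuracy          = {accuracy:.3f}")
print(f"sensitivity       = {sensitivity:.3f}")
print(f"balanced accuracy = {balanced_accuracy:.3f}")
```

The rule detects no events yet reaches an accuracy of about 0.961, which is why accuracy is not used here as a primary performance measure.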
3.4 Model Performance
The primary unweighted ridge logistic regression model had an apparent AUC of 0.834 and a bootstrap optimism-corrected AUC of 0.774. However, precision-recall performance was modest: apparent PR-AUC was 0.184 and optimism-corrected PR-AUC was 0.074, much closer to the 3.93% event prevalence than the ROC AUC alone might suggest. Figure 3 shows both the ROC and precision-recall curves. The ROC curve indicates apparent ranking ability, while the precision-recall curve makes clear that positive-class prediction remained limited under severe outcome imbalance.
Table 6 reports discrimination, calibration, Brier score, and default-threshold (0.50) metrics for the primary unweighted probability model. The apparent Brier score was 0.035. Apparent calibration was less alarming than that of the class-weighted model (Table 9), with calibration intercept 0.377 and slope 1.192, but optimism correction reduced the calibration slope to 0.843 and moved the calibration intercept to -0.462. These findings indicate that the model may contain risk-ranking signal but that absolute risk estimates remain too uncertain for clinical interpretation without external validation and calibration updating.
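The gap between ROC AUC and PR-AUC can be reproduced on a small example. The sketch below computes a rank-based AUC and average precision, which is one common estimator of PR-AUC; the source does not state which estimator was used, so this choice is an assumption. The ten toy patients and their scores are hypothetical, and the average-precision function assumes no tied scores.

```python
def roc_auc(y_true, scores):
    """Probability that a random event outranks a random non-event (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each event (assumes no tied scores)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    tp = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            total += tp / rank  # precision when this event is recalled
    return total / n_pos

# Hypothetical toy cohort: 2 events among 10 patients, ranked 2nd and 4th by score.
y_toy = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
s_toy = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]

auc_toy = roc_auc(y_toy, s_toy)
ap_toy = average_precision(y_toy, s_toy)
prevalence_toy = sum(y_toy) / len(y_toy)
print(auc_toy, ap_toy, prevalence_toy)
```

A non-informative ranking has an expected AUC of 0.5 but an expected average precision near the event prevalence, so PR-AUC values are best read against prevalence rather than against 0.5.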
| Metric | Apparent performance | Optimism-corrected performance | 95% CI (bootstrap, around apparent estimate) | Interpretation |
|---|---|---|---|---|
| AUC | 0.834 | 0.774 | 0.771 to 0.893 | Exploratory; internal validation only. |
| PR-AUC | 0.184 | 0.074 | 0.108 to 0.326 | Exploratory; internal validation only. |
| Brier score | 0.035 | 0.038 | 0.026 to 0.044 | Exploratory; internal validation only. |
| Calibration intercept | 0.377 | -0.462 | -0.481 to 1.405 | Exploratory; internal validation only. |
| Calibration slope | 1.192 | 0.843 | 0.888 to 1.605 | Exploratory; internal validation only. |
| Sensitivity (0.50 threshold) | 0.000 | NA | 0.000 to 0.000 | Exploratory; internal validation only. |
| Specificity (0.50 threshold) | 1.000 | NA | 1.000 to 1.000 | Exploratory; internal validation only. |
| PPV (0.50 threshold) | NA | NA | Not estimated | Exploratory; internal validation only. |
| NPV (0.50 threshold) | 0.961 | NA | 0.948 to 0.972 | Exploratory; internal validation only. |
| Balanced accuracy (0.50 threshold) | 0.500 | NA | 0.500 to 0.500 | Exploratory; internal validation only. |
| Accuracy (0.50 threshold) | 0.961 | NA | 0.948 to 0.972 | Exploratory; internal validation only. |
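The optimism-corrected column in Table 6 subtracts an estimated optimism from the apparent value; for AUC the implied optimism is 0.834 - 0.774 = 0.060. The following is a minimal sketch of a Harrell-style bootstrap optimism correction, offered as an illustration of the general technique and not as the authors' implementation. The toy data, the `fit_best_single_feature` stand-in model, and all function names are hypothetical; a real analysis would refit the full imputation and ridge pipeline inside each resample.

```python
import random

def roc_auc(y_true, scores):
    """Rank-based AUC; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_best_single_feature(data):
    """Toy stand-in model: keep whichever feature ranks the training outcome best."""
    ys = [y for _, y in data]
    n_features = len(data[0][0])
    best = max(range(n_features),
               key=lambda j: roc_auc(ys, [x[j] for x, _ in data]))
    return lambda x: x[best]

def optimism_corrected(data, fit, metric, n_boot=200, seed=0):
    """Return (apparent, corrected) where corrected = apparent - mean optimism."""
    rng = random.Random(seed)

    def evaluate(model, sample):
        ys = [y for _, y in sample]
        return metric(ys, [model(x) for x, _ in sample])

    apparent = evaluate(fit(data), data)
    optimism = []
    while len(optimism) < n_boot:
        boot = [rng.choice(data) for _ in data]   # resample with replacement
        if len({y for _, y in boot}) < 2:
            continue                              # both classes are needed to score
        model = fit(boot)                         # refit inside the resample
        optimism.append(evaluate(model, boot) - evaluate(model, data))
    return apparent, apparent - sum(optimism) / n_boot

# Hypothetical toy data: 60 patients, 15 events, one weak signal and one noise feature.
toy_rng = random.Random(1)
toy = []
for i in range(60):
    y = 1 if i % 4 == 0 else 0
    toy.append(((toy_rng.gauss(0.8 * y, 1.0), toy_rng.gauss(0.0, 1.0)), y))

apparent, corrected = optimism_corrected(toy, fit_best_single_feature, roc_auc)
print(round(apparent, 3), round(corrected, 3))
```

The key design point is that every data-dependent step, including imputation and penalty selection, is repeated within each bootstrap sample; otherwise the optimism estimate is too small.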
3.5 Calibration
The calibration plot in Figure 4 supports cautious interpretation of the primary unweighted model. The plotted bins include observed event fractions with 95% confidence intervals and bin counts. The plot did not show the extreme overprediction seen in the class-weighted sensitivity model, but interpretation is limited by the small number of events, compressed predicted-risk range, and wide bin-level confidence intervals. Calibration assessment is particularly important here because a model may rank patients acceptably by AUC while producing risk estimates that are poorly aligned with observed event frequencies (Van Calster et al. 2019).
The severe overprediction identified in the class-weighted analysis was not treated as the primary probability result. Instead, it is reported in Table 9 as evidence that balanced class weighting can distort probability calibration in this rare-outcome setting.
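The calibration intercept and slope in Tables 6 and 9 summarize a logistic recalibration of the observed outcome on the logit of the predicted risk. The sketch below fits both parameters jointly by Newton-Raphson. The source does not state whether the intercept was estimated jointly or with the slope fixed at 1 (calibration-in-the-large), so the joint fit is an assumption, and the toy predictions are hypothetical.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def fit_recalibration(y_true, p_pred, n_iter=25):
    """Fit logit(P(y=1)) = a + b * logit(p_pred) by Newton-Raphson.

    Returns (a, b): a is the calibration intercept, b the calibration slope.
    Ideal values are a = 0 and b = 1; a < 0 indicates overprediction.
    """
    x = [logit(p) for p in p_pred]
    a, b = 0.0, 1.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu            # score for the intercept
            g1 += (yi - mu) * xi     # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det   # Newton step: inverse information times score
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Hypothetical toy predictions that overstate risk:
# predicted 0.50 where 2 of 10 had the event, predicted 0.80 where 5 of 10 did.
p_toy = [0.5] * 10 + [0.8] * 10
y_toy = [1, 1] + [0] * 8 + [1] * 5 + [0] * 5

intercept, slope = fit_recalibration(y_toy, p_toy)
print(round(intercept, 3), round(slope, 3))
```

In this toy case the predicted odds are four times the observed odds in both groups, so the fit returns an intercept of about -1.386 (minus the natural log of 4) with a slope of 1, the signature of uniform overprediction.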
3.6 Threshold Performance and Decision-Curve Analysis
Because the observed event rate was 3.93%, the default 0.50 threshold was not clinically plausible as a primary decision threshold. Table 7 therefore reports thresholds of 0.02, 0.05, 0.10, and 0.15, with 0.50 retained only as a reference. Sensitivity decreased from 0.951 at a 0.02 threshold to 0.293 at a 0.15 threshold. PPV remained low at lower thresholds but increased from 0.067 at 0.02 to 0.245 at 0.15. Net benefit was positive but small across thresholds from 0.02 to 0.15 and was zero at 0.50 because no patients were classified positive at that threshold. Figure 5 shows the same limited clinical-utility pattern across threshold probabilities.
| Threshold | Sensitivity | Specificity | PPV | TP/FP | Net benefit | Treat all |
|---|---|---|---|---|---|---|
| 0.02 | 0.951 | 0.459 | 0.067 | 39/542 | 0.0268 | 0.0197 |
| 0.05 | 0.780 | 0.744 | 0.111 | 32/256 | 0.0178 | -0.0112 |
| 0.10 | 0.415 | 0.900 | 0.145 | 17/100 | 0.0057 | -0.0674 |
| 0.15 | 0.293 | 0.963 | 0.245 | 12/37 | 0.0053 | -0.1302 |
| 0.50 | 0.000 | 1.000 | NA | 0/0 | 0.0000 | -0.9213 |
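The entries in Table 7 follow from the true-positive and false-positive counts together with the cohort totals (1,042 patients, 41 events). The sketch below recomputes them using the standard net-benefit definition, TP/N - (FP/N) x pt/(1 - pt), where pt is the threshold probability. The function and variable names are illustrative.

```python
N_TOTAL, N_EVENTS = 1042, 41

def threshold_metrics(tp, fp, threshold, n_total=N_TOTAL, n_events=N_EVENTS):
    """Recompute Table 7 columns from TP/FP counts at one threshold probability."""
    n_nonevents = n_total - n_events
    odds = threshold / (1.0 - threshold)   # harm-to-benefit weight on false positives
    return {
        "sensitivity": tp / n_events,
        "specificity": (n_nonevents - fp) / n_nonevents,
        "ppv": tp / (tp + fp) if (tp + fp) else None,   # undefined with no positives
        "net_benefit": tp / n_total - fp / n_total * odds,
        "treat_all": n_events / n_total - n_nonevents / n_total * odds,
    }

# TP/FP counts as reported in Table 7.
table7_counts = {0.02: (39, 542), 0.05: (32, 256), 0.10: (17, 100),
                 0.15: (12, 37), 0.50: (0, 0)}

results = {thr: threshold_metrics(tp, fp, thr)
           for thr, (tp, fp) in table7_counts.items()}
for thr, m in results.items():
    ppv = "NA" if m["ppv"] is None else f"{m['ppv']:.3f}"
    print(f"{thr:.2f}  sens={m['sensitivity']:.3f}  spec={m['specificity']:.3f}  "
          f"ppv={ppv}  NB={m['net_benefit']:.4f}  treat_all={m['treat_all']:.4f}")
```

Comparing `net_benefit` with `treat_all` and with zero (treat none) at each threshold gives the decision-curve reading: at 0.02 the model's margin over treating everyone is about 0.007, and at the higher thresholds the relevant comparator is treating no one.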
3.7 Predictor Contribution and Explainability
Coefficient-based predictor contributions are summarized in Table 8 and visualized in Figure 6. The highest absolute standardized coefficient was for lymphocyte count, followed by the missingness indicators for blood urea nitrogen, partial thromboplastin time, and creatinine, and then by log-transformed blood urea nitrogen, age, and log-transformed creatinine. Sex and the remaining clinical predictors also contributed to the fitted model. These findings are consistent with a model using immune, renal, demographic, and missingness-related information, but they should be interpreted as predictive associations rather than causal effects.
| Predictor | Direction of coefficient | Variable importance rank | Clinical interpretation | Stability concern |
|---|---|---|---|---|
| LYM | Negative | 1 | Predictive; not causal. | Needs external validation. |
| BUN_missing | Positive | 2 | Predictive; not causal. | Missingness-sensitive. |
| PTT_missing | Negative | 3 | Predictive; not causal. | Missingness-sensitive. |
| Crea_missing | Negative | 4 | Predictive; not causal. | Missingness-sensitive. |
| log_BUN | Positive | 5 | Predictive; not causal. | Needs external validation. |
| Age | Positive | 6 | Predictive; not causal. | Needs external validation. |
| log_Crea | Negative | 7 | Predictive; not causal. | Needs external validation. |
| Age_missing | Positive | 8 | Predictive; not causal. | Missingness-sensitive. |
| LYM_missing | Positive | 9 | Predictive; not causal. | Missingness-sensitive. |
| log_PTT | Positive | 10 | Predictive; not causal. | Needs external validation. |
| Ischemic CAD | Positive | 11 | Predictive; not causal. | Needs external validation. |
| Cough | Positive | 12 | Predictive; not causal. | Needs external validation. |
| PCT_missing | Positive | 13 | Predictive; not causal. | Missingness-sensitive. |
| log_PCT | Positive | 14 | Predictive; not causal. | Missingness-sensitive. |
| Sex_Male | Positive | 15 | Predictive; not causal. | Needs external validation. |
Procalcitonin did not dominate the coefficient-based ranking in this reanalysis, despite being clinically plausible and highlighted in earlier exploratory work. Its interpretation is limited by 62.6% missingness. The presence of missingness indicators among the higher-ranked terms in Table 8 and Figure 6 suggests that laboratory ordering or availability may contain predictive information, but this may also reflect care-process artifacts. Feature-importance findings are therefore hypothesis-generating and require sensitivity analyses and external validation before any clinical interpretation.
3.8 Sensitivity Analyses
Sensitivity analyses are summarized in Table 9. Excluding procalcitonin terms produced nearly unchanged apparent discrimination and calibration compared with the primary unweighted model. Excluding both partial thromboplastin time and procalcitonin modestly reduced apparent discrimination, with AUC 0.826 and PR-AUC 0.150. Complete-case modeling retained only 141 patients and 6 events, producing highly unstable estimates despite apparent AUC 0.827. The balanced ridge classification sensitivity analysis had slightly higher apparent AUC 0.849 but much worse Brier score 0.172, calibration intercept -3.103, and calibration slope 0.727. This contrast supports the decision to treat class weighting as a classification-oriented sensitivity analysis rather than the primary probability model.
| Analysis | N | Events | AUC | PR-AUC | Brier | Cal. intercept | Cal. slope |
|---|---|---|---|---|---|---|---|
| Primary unweighted ridge | 1042 | 41 | 0.834 | 0.184 | 0.035 | 0.377 | 1.192 |
| No procalcitonin terms | 1042 | 41 | 0.834 | 0.183 | 0.035 | 0.379 | 1.193 |
| No PTT or procalcitonin terms | 1042 | 41 | 0.826 | 0.150 | 0.036 | 0.404 | 1.196 |
| Complete-case unweighted ridge | 141 | 6 | 0.827 | 0.316 | 0.038 | 2.637 | 2.322 |
| Balanced ridge classification sensitivity | 1042 | 41 | 0.849 | 0.154 | 0.172 | -3.103 | 0.727 |
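The size of the class-weighted calibration intercept in Table 9 is close to what reweighting alone would be expected to produce. The sketch below assumes "balanced" weights of the common form n_total / (2 x n_class); this is an assumption about how the sensitivity model was configured, and the log-odds shift is an approximation that ignores the ridge penalty. The `prior_corrected` helper shows a standard prior-shift adjustment and is not a method used in the source.

```python
import math

n_total, n_events = 1042, 41
n_nonevents = n_total - n_events

# Assumed "balanced" class weights: n_total / (2 * n_class).
w_event = n_total / (2 * n_events)
w_nonevent = n_total / (2 * n_nonevents)

# Upweighting events acts like oversampling them, which raises fitted
# log-odds by roughly log(w_event / w_nonevent) = log(n_nonevents / n_events).
logit_shift = math.log(w_event / w_nonevent)

def prior_corrected(p_weighted, shift=logit_shift):
    """Map a class-weighted probability back toward the original prevalence."""
    z = math.log(p_weighted / (1.0 - p_weighted)) - shift
    return 1.0 / (1.0 + math.exp(-z))

print(f"expected log-odds shift = {logit_shift:.3f}")
print(f"weighted 0.50 maps to   = {prior_corrected(0.5):.4f}")
```

The expected shift of about 3.195 log-odds units is similar in magnitude to the reported calibration intercept of -3.103, which is consistent with the overprediction being largely a by-product of the weighting rather than a property of the predictors.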
Full multiple imputation with Rubin-rule pooling was not performed in the present exploratory analysis. The sensitivity analyses should therefore be interpreted as pragmatic model-specification checks rather than a complete missing-data uncertainty analysis. This is a major limitation because median/mode imputation with missingness indicators does not fully represent missing-data uncertainty, and a fully prespecified multiple-imputation sensitivity analysis, ideally embedded within internal validation, could yield different calibration and discrimination estimates.
4 Discussion
4.1 Principal Findings
In this retrospective single-hospital cohort of 1,042 hospitalized patients with COVID-19, cardiovascular complications were uncommon, occurring in 41 patients (3.93%). The primary unweighted ridge logistic regression model had an apparent AUC of 0.834 and a bootstrap optimism-corrected AUC of 0.774. However, apparent PR-AUC was only 0.184, optimism-corrected PR-AUC was 0.074, and threshold-specific net benefit was small. Across thresholds from 0.02 to 0.15, sensitivity fell from 0.951 to 0.293 while specificity rose from 0.459 to 0.963 and PPV from 0.067 to 0.245. These findings support feasibility and methodological learning rather than clinical deployment.
The main result is therefore deliberately mixed. The unweighted model appeared able to rank some patients by relative risk and avoided the severe overprediction observed under class weighting, but the low event count, modest precision-recall performance, threshold instability, calibration uncertainty, and absence of external validation prevent any claim of clinical readiness. The class-weighted sensitivity analysis was a major negative calibration finding: it increased classification pressure but produced severe overprediction, with calibration intercept -3.103 and slope 0.727. This distinction is central to clinical prediction modeling: discrimination describes ranking, whereas clinical usefulness requires reliable risk estimates, sensible thresholds, and evidence that model-guided decisions would improve outcomes or net benefit (Van Calster et al. 2019; Vickers and Elkin 2006).
4.2 Comparison With Prior Literature
The findings are broadly consistent with prior work showing that cardiovascular complications in COVID-19 are clinically important and often related to systemic inflammation, myocardial injury, renal dysfunction, coagulation abnormalities, and pre-existing cardiovascular disease (Zheng et al. 2020; Giustino et al. 2020). They also align with the rationale behind the multicenter risk score proposed by Huang et al. (2020), which treated cardiovascular complications as a predictable but clinically heterogeneous outcome in hospitalized COVID-19 patients. However, the present analysis should not be interpreted as a validation of that score. The current cohort was single-hospital, smaller in effective event count, and modeled a local variable set using a different analytic strategy.
The caution is also consistent with systematic reviews of COVID-19 prediction models. Wynants et al. (2020) found that many early COVID-19 models were poorly reported, at high risk of bias, and likely optimistic without proper validation. The present analysis explicitly tries to avoid that overclaiming by reporting outcome imbalance, missingness, apparent and optimism-corrected performance, calibration concerns, and the absence of external validation.
4.3 Clinical Implications
The model should not be used to guide triage, monitoring intensity, cardiology referral, anticoagulation decisions, or discharge planning. Its most defensible clinical implication is narrower: routinely available admission variables may contain signal relevant to cardiovascular-complication risk, and that signal may help design a future, larger, externally validated local risk model. The low positive predictive value at the evaluated thresholds (0.067 at 0.02 to 0.245 at 0.15) means that most patients classified as high risk would not have documented cardiovascular complications. Conversely, the small event count limits confidence that the model would identify future complication cases reliably in another cohort.
Before clinical use could be considered, the model would require a standardized outcome definition, prospective or temporally separated validation, calibration updating if transported to a new setting, clinically justified threshold selection, and favorable decision-curve analysis across plausible threshold probabilities. In this analysis, net benefit was positive at thresholds from 0.02 to 0.15 but small throughout (0.0268 at 0.02, declining to 0.0053 at 0.15), so the decision-curve results did not establish clinical utility. This finding reinforces that a model can have acceptable discrimination but still fail to provide useful net benefit if false positives or false negatives carry substantial clinical consequences (Vickers and Elkin 2006).
4.4 Methodological Implications
This analysis illustrates why AUC alone is insufficient in imbalanced clinical prediction tasks. The apparent AUC of 0.834 could appear promising in isolation, but the PR-AUC of 0.184 and optimism-corrected PR-AUC of 0.074 show that positive-class prediction remained weak. This pattern is expected when the event rate is low and reinforces the value of precision-recall metrics for imbalanced binary outcomes (Saito and Rehmsmeier 2015).
Calibration remained a central concern. The primary unweighted model had apparent calibration intercept 0.377 and slope 1.192, but optimism correction shifted these estimates to -0.462 and 0.843, a slope below 1 that is consistent with overfitting, and the small event count makes calibration uncertainty substantial. The unweighted model was therefore less miscalibrated than the class-weighted sensitivity model, not clinically calibrated. The class-weighted model demonstrated how a classification-oriented adjustment can substantially damage calibration, with severe overprediction. Calibration is not a cosmetic add-on; it determines whether predicted probabilities can support clinical threshold decisions (Van Calster et al. 2019). In a small rare-event cohort, calibration estimates are themselves unstable, and that instability strengthens the conclusion that this model is not ready for clinical interpretation.
The missing-data pattern further limits interpretation. Procalcitonin was missing in 62.6% of patients, and partial thromboplastin time was missing in 52.7%. Missingness may represent care processes, clinical severity, laboratory availability, or documentation patterns. Missingness indicators can improve apparent prediction, but they may not transport if ordering behavior differs across hospitals or pandemic periods. Future modeling should compare complete-case analysis, multiple imputation, exclusion of highly missing biomarkers, and explicit missingness-indicator strategies.
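The median-imputation-with-indicator strategy discussed above depends on learning the fill value from training data only, so that held-out or bootstrap evaluation data never inform the imputation. The following is a minimal sketch of that idea for a single predictor; the function names and the toy procalcitonin values are hypothetical, and the study's actual pipeline is not reproduced here.

```python
from statistics import median

def fit_median_imputer(train_values):
    """Learn the imputation value from observed training values only."""
    observed = [v for v in train_values if v is not None]
    return median(observed)

def impute_with_indicator(values, fill_value):
    """Return (imputed_value, missing_flag) pairs for one predictor."""
    return [(fill_value, 1) if v is None else (v, 0) for v in values]

# Hypothetical toy values; None marks a biomarker that was not measured.
train_pct = [0.05, None, 0.20, None, 0.10]
test_pct = [None, 0.50]

fill = fit_median_imputer(train_pct)                  # learned on training data
test_features = impute_with_indicator(test_pct, fill) # applied unchanged to new data
print(fill, test_features)
```

The missing flag lets the model treat "not measured" as its own signal, which is also why such terms may not transport when ordering practices differ between hospitals or periods.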
4.5 Explainability Implications
The coefficient-based explainability analysis suggested that lymphocyte count, blood urea nitrogen, creatinine, age, sex, and missingness indicators contributed to model predictions. These findings are clinically coherent with immune, renal, demographic, and care-process information, but they are not causal explanations. Feature importance is conditional on the model class, preprocessing, missingness handling, predictor correlation structure, and outcome definition. In high-stakes clinical settings, interpretable models are preferable where possible, but interpretability does not compensate for weak validation, poor calibration, or biased data (Rudin 2019; Molnar 2025).
4.6 Strengths
This study has several defensible strengths. It used clinically plausible admission-available predictors, preserved continuous laboratory variables, avoided high-dimensional expansion of biomarkers, explicitly quantified missingness, and reported performance metrics beyond accuracy. The analysis also separated apparent performance from optimism-corrected performance and interpreted explainability outputs as model behavior rather than causal evidence. Finally, the single-hospital setting provides local relevance for future hospital-specific model development, although not generalizable clinical evidence.
4.7 Limitations
The limitations are substantial. The study was retrospective and single-hospital, with only 41 cardiovascular complication events. The low event count limits model stability, calibration assessment, internal validation, and sensitivity analyses. The outcome variable Cardio lacks a source-protocol definition, granular event components, event timing, adjudication source, and blinding information in the available dataset, so cardiovascular complications may combine heterogeneous clinical events or reflect documentation practices. The cohort had substantial missingness in procalcitonin and partial thromboplastin time, and age had notable missingness and implausible values flagged during cleaning. Exact predictor timestamps and laboratory units were not available in the modeling file; therefore, the intended first-24-hour and pre-outcome predictor window could not be independently verified, and predictor leakage cannot be excluded.
The analysis used internal validation only. No external, temporal, or prospective validation was performed. The ridge model used median/mode imputation inside the modeling pipeline and included missingness indicators. Sensitivity analyses examined complete cases, exclusion of procalcitonin terms, exclusion of both procalcitonin and partial thromboplastin time, and class-weighted ridge as a classification sensitivity analysis, but full multiple imputation with pooled estimates was not performed. This should be treated as an unresolved feasibility limitation rather than evidence that the missing-data strategy is definitive. The bootstrap optimism correction used 200 resamples, which is adequate for an exploratory pass but should be increased to 500-1,000 for final reporting. Confidence intervals were bootstrap intervals around apparent predictions rather than a full validation uncertainty framework. Decision-curve analysis was exploratory and did not establish clinical utility. Ethics committee name, approval or waiver number, approval date, consent-waiver status, and formal data-governance conditions were not available in the de-identified analytic package and would need source verification before journal submission. Finally, local explanations and coefficient rankings may be unstable because of small event count, correlated predictors, missingness, and outcome imbalance.
4.8 Future Research
Future work should begin with standardized cardiovascular outcome adjudication and a larger multicenter or temporally separated cohort. Model development should use a prespecified analysis plan, adequate sample-size justification for binary prediction, robust missing-data handling, penalized or bias-reduced logistic regression, and bootstrap internal validation. External validation should assess discrimination, calibration, PR-AUC, Brier score, threshold-specific performance, subgroup performance, and decision-curve net benefit. The local model should also be compared with Huang et al.’s original risk score only if the same predictors, coding, outcome definition, and thresholds can be reconstructed (Huang et al. 2020).
Before any deployment claim, the model would require calibration updating, threshold selection linked to clinical action, prospective validation, and impact analysis. Explainability should remain a supporting diagnostic tool for model audit and communication, not evidence of biological mechanism or clinical utility.
5 Conclusion
In this retrospective single-hospital cohort of hospitalized patients with COVID-19, an interpretable admission-based unweighted ridge logistic regression model for in-hospital cardiovascular complication risk had an apparent AUC of 0.834 and a bootstrap optimism-corrected AUC of 0.774. However, cardiovascular complications were rare, with only 41 events among 1,042 patients, apparent PR-AUC was modest at 0.184, optimism-corrected PR-AUC was 0.074, and threshold analyses showed limited positive predictive performance and small net benefit.
These findings support feasibility and hypothesis generation, not clinical deployment. Severe class imbalance, limited event count, substantial biomarker missingness, calibration uncertainty, weak threshold performance, unresolved outcome-adjudication details, unverified predictor timing, and absence of external validation prevent use as a clinical decision tool. Larger cohorts with standardized outcome adjudication, robust missing-data handling including multiple-imputation sensitivity analysis, calibration assessment, decision-curve analysis, and external validation are required before any model of this type could be considered for clinical decision support (Collins et al. 2024; Moons et al. 2025; Van Calster et al. 2019; Vickers and Elkin 2006).
6 Ethics and Data Governance
This study used de-identified retrospective clinical data, and direct identifiers were excluded from the modeling files before analysis. The files available for this analysis did not include the approving ethics committee or institutional review board, approval or waiver classification, protocol number, approval date, consent-waiver language, or formal data-governance conditions. The safest reportable statement is therefore that ethics and governance details were not available in the de-identified analytic package reviewed by the analyst. The analysis should be interpreted as a secondary analysis of de-identified clinical records, with institutional approval status not independently verifiable from the available files.