PE Diagnostic Model Report

Author

Awan

Published

April 30, 2026

1 Executive Summary

The models appear to be learning pieces of the clinical context around PE, but not PE itself cleanly.

Technical assessment: I did not find evidence that the main results are explained by a simple data-structure problem. The expected lockbox/validation sizes line up, labels are present and concordant across sources, and the available diagnostic checks did not point to a patient-join problem or to a score-direction problem (that is, higher model scores corresponding to lower PE risk).

Full EKG-only model: The EKG model appears to be picking up physiologic stress rather than a PE-specific electrical pattern. Its modest discrimination (AUROC 0.622, AUPRC 0.350, about 1.5x prevalence) appears to come mostly from encounters with more evident cardiopulmonary strain, and its false positives cluster in patients with tachycardia, higher shock index, and report language suggesting hypoxia. That signal is not useless, but it suggests the model may be learning severity or stress patterns associated with PE workups rather than identifying PE itself.

Full CXR-only model: The CXR embedding model has a view-position problem. Its predictions are strongly tied to AP (anteroposterior) CXRs, with many PE-negative AP studies called positive. AP status comes from the DICOM image metadata (VIEWPOSITION) and likely reflects a different clinical context, since AP films are often obtained as portable bedside studies in less mobile or more acutely ill patients. PA (posteroanterior) studies behave much better.

Paired-data fusion model: The fusion model does not solve the standalone-model problems. On the 177-patient paired validation set, the best fusion model (A037) does not outperform the paired CXR-only model at the matched high-sensitivity cutoff. It catches the same number of PE-positive cases (34) and misses the same number (8) as paired CXR-only, but it incorrectly calls 10 more PE-negative patients positive (87 vs 77). Fusion mostly inherits the CXR-embedding branch’s error pattern rather than adding a useful complementary signal from EKG. One reason fusion gains little from EKG is that the paired EKG-only branch is essentially at chance on the 177-subject paired lockbox (AUROC 0.506).

2 Sample Context And Derived Characteristics

I made a subject-level master frame. Each row is one patient/subject. The full frame has 5,289 subjects, formed from the union of the EKG and CXR pools:

Pool | Total pool | Tuning | Validation/Lockbox
EKG | 2,925 | 2,340 | 585
CXR | 3,250 | 2,601 | 649
Paired EKG+CXR | 886 | 709 | 177
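As a quick arithmetic check on the 5,289 figure, the union size follows from inclusion-exclusion. This sketch assumes the paired pool is exactly the overlap between the EKG and CXR pools; the report does not state that explicitly, but the numbers are consistent with it. Variable names are illustrative only.

```python
# Pool sizes copied from the table above.
ekg_total, ekg_tuning, ekg_lockbox = 2925, 2340, 585
cxr_total, cxr_tuning, cxr_lockbox = 3250, 2601, 649
paired_total, paired_tuning, paired_lockbox = 886, 709, 177

# Each pool should split cleanly into tuning + validation/lockbox.
splits_ok = (
    ekg_tuning + ekg_lockbox == ekg_total
    and cxr_tuning + cxr_lockbox == cxr_total
    and paired_tuning + paired_lockbox == paired_total
)

# Inclusion-exclusion, ASSUMING the paired pool is the EKG/CXR overlap.
union_size = ekg_total + cxr_total - paired_total

print(splits_ok, union_size)  # True 5289
```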

The 177 subjects in the paired validation set (lockbox) are the clean comparison set for Section 3, because all paired and fusion scores are available there.

Terminology note: when I refer to the paired EKG-only or paired CXR-only model, I mean a single-modality branch model trained within the paired-data pool: the n=886 patients who had both an EKG and a CXR available. These are different from the full EKG-only and full CXR-only models, which were trained on their larger modality-specific pools.

For the descriptive clinical strata, I used the following definitions:

Characteristic | How it is defined here
PE type | Acute and Subsegmental count as positive labels. No PE, Chronic, and Equivocal count as negative labels.
Tachycardia | Heart rate >=100. Bins are <80, 80-99, 100-119, and >=120.
O2 saturation | Bins are <90, 90-94, and >=95. I call these O2 saturation strata, not hypoxia.
Shock index | Heart rate divided by systolic blood pressure. Bins are <0.7, 0.7 to <0.9, and >=0.9.
CXR viewpoint | AP versus PA acquisition context. AP often means portable/sicker-patient context; PA is usually a more standard upright acquisition.
Report-text flags | These are keyword indicators from the radiology report text. They are used only to stratify model behavior. They should not be interpreted as adjudicated findings because the current extraction is not negation-aware; for example, “no pleural effusion” can still trigger the effusion search.
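The definitions above can be written down as small helper functions. This is a minimal sketch, not the pipeline code: the function names are hypothetical, a shock index of exactly 0.9 is assumed to fall in the top bin, and the keyword flag is assumed to be a plain case-insensitive substring match, which is one simple way a non-negation-aware search could behave.

```python
from typing import Optional

POSITIVE_TYPES = {"Acute", "Subsegmental"}
NEGATIVE_TYPES = {"No PE", "Chronic", "Equivocal"}

def binary_label(pe_type: str) -> int:
    """Map the PE type to the binary label used in this report."""
    if pe_type in POSITIVE_TYPES:
        return 1
    if pe_type in NEGATIVE_TYPES:
        return 0
    raise ValueError(f"unexpected PE type: {pe_type!r}")

def heart_rate_bin(hr: Optional[float]) -> str:
    if hr is None:
        return "missing"
    if hr < 80:
        return "<80"
    if hr < 100:
        return "80-99"
    if hr < 120:
        return "100-119"
    return ">=120"

def shock_index(hr: Optional[float], sbp: Optional[float]) -> Optional[float]:
    """Heart rate divided by systolic blood pressure; None if not computable."""
    if hr is None or sbp is None or sbp <= 0:
        return None
    return hr / sbp

def shock_index_bin(si: Optional[float]) -> str:
    if si is None:
        return "missing"
    if si < 0.7:
        return "<0.7"
    if si < 0.9:  # assumption: exactly 0.9 goes to the top bin
        return "0.7-0.9"
    return ">=0.9"

def naive_flag(report_text: str, keyword: str) -> bool:
    """Assumed behavior: substring match with no negation handling."""
    return keyword.lower() in report_text.lower()

print(shock_index_bin(shock_index(110, 100)))          # >=0.9
print(naive_flag("No pleural effusion.", "effusion"))  # True (negated, still flagged)
```

The last line shows why the report-text flags are treated only as strata: a negated mention still sets the flag.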

The three lockboxes have different case mixes:

Characteristic | EKG lockbox N=585 | CXR lockbox N=649 | Paired lockbox N=177
Positive / negative | 137 / 448 | 52 / 597 | 42 / 135
Acute PE | 116 | 39 | 33
Subsegmental PE | 21 | 13 | 9
Chronic | 15 | 13 | 4
Equivocal | 3 | 4 | 1
No PE | 430 | 580 | 130
Heart rate missing | 201 | 105 | 25
O2 saturation missing | 249 | 170 | 41
Shock index missing | 204 | 111 | 27
AP / PA CXR | NA | 334 / 315 | 88 / 89

The paired lockbox has a higher PE prevalence than the CXR lockbox (42/177, about 23.7%, versus 52/649, about 8.0%), so AUPRC should not be compared casually across pools. For side-by-side model comparison, the 177-subject paired lockbox is the fairer setting.

2.1 Operating Points

I show two cutoffs.

The main cutoff is the model-specific threshold that reaches sensitivity at or just above 0.80 on the relevant lockbox. This answers: if we force the model to catch roughly 80% of PE positives, where do the false positives and false negatives concentrate? Each model produces scores on its own scale, so the cutoff needed to catch about 80% of PE-positive cases differs across models.
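One way to pick such a cutoff is to take the score of the k-th highest-scoring positive, where k is the smallest count that reaches the target sensitivity. This is a minimal sketch under two assumptions: a subject is called positive when its score is at or above the cutoff, and ties are not handled specially. The function names and toy data are hypothetical.

```python
import math
from typing import Sequence

def min_positives_needed(n_pos: int, target: float = 0.80) -> int:
    # Small epsilon guards against float noise when target * n_pos is whole.
    return max(1, math.ceil(target * n_pos - 1e-9))

def high_sensitivity_cutoff(scores: Sequence[float], labels: Sequence[int],
                            target: float = 0.80) -> float:
    """Highest cutoff whose sensitivity is at or just above the target."""
    pos_scores = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not pos_scores:
        raise ValueError("need at least one positive label")
    k = min_positives_needed(len(pos_scores), target)
    return pos_scores[k - 1]

def sensitivity_at(scores: Sequence[float], labels: Sequence[int], cutoff: float) -> float:
    pos_scores = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= cutoff for s in pos_scores) / len(pos_scores)

# Toy data, not from the lockboxes.
toy_scores = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]
cutoff = high_sensitivity_cutoff(toy_scores, toy_labels)
print(cutoff, sensitivity_at(toy_scores, toy_labels, cutoff))  # 0.4 0.8
```

With 137, 52, and 42 positives, `min_positives_needed` returns 110, 42, and 34, which matches the true-positive counts reported at the high-sensitivity cutoff in the three lockboxes.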

AUROC summarizes ranking quality (0.5 is random, 1.0 is perfect). AUPRC summarizes precision-recall and is shown as value (x prevalence baseline), where 1.0x means no gain over the positive-case rate.
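The prevalence multiplier is simply AUPRC divided by the positive-case rate of the lockbox it was computed on. The sketch below recomputes it from values reported in this document; the function name is illustrative.

```python
def auprc_multiple(auprc: float, n_pos: int, n_total: int) -> float:
    """AUPRC expressed as a multiple of the prevalence baseline."""
    prevalence = n_pos / n_total
    return auprc / prevalence

# (AUPRC, positives, lockbox size) as reported in this report.
reported = {
    "Full EKG": (0.350, 137, 585),
    "Full CXR embedding": (0.114, 52, 649),
    "Paired CXR-only": (0.328, 42, 177),
}

for name, (auprc, n_pos, n_total) in reported.items():
    print(f"{name}: {auprc:.3f} ({auprc_multiple(auprc, n_pos, n_total):.1f}x)")
```

The full CXR embedding model and the paired CXR-only model both land at about 1.4x even though their raw AUPRC values differ almost threefold, which is why the multiplier is shown next to every AUPRC.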

The second cutoff is 0.50. This is not clinically meaningful for most of these models. It is useful mainly as a calibration and score-scale diagnostic. At 0.50, several models classify every subject as negative.

Sensitivity is anchored near 0.80 by design; specificity varies sharply across models, with raw CXR weakest and paired CXR-only strongest.

Because all sampled inputs were already inside the 0-48h pre-CTPA window, timing is treated below as a within-window stratification variable in each model section.

3 Section 1. Full EKG Standalone Model

Model: EKG standalone cfg0017
Cohort: 585-subject EKG lockbox

Cutoff | AUROC | AUPRC | Sensitivity | Specificity | TP | FP | TN | FN
Sensitivity >=0.80 (cutoff 0.249) | 0.622 | 0.350 (1.5x) | 0.803 | 0.382 | 110 | 277 | 171 | 27
Fixed 0.50 cutoff | 0.622 | 0.350 (1.5x) | 0.000 | 1.000 | 0 | 0 | 448 | 137

The EKG model has modest ranking performance. At the high-sensitivity cutoff it catches 110/137 (80.3%) PE positives, but it also flags 277/448 (61.8%) negatives as positive.
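The rates in the table follow directly from the four confusion-matrix counts. A small sketch, using a hypothetical `Confusion` helper and the EKG counts reported above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Confusion:
    tp: int
    fp: int
    tn: int
    fn: int

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def fp_share_of_negatives(self) -> float:
        # Equal to 1 - specificity; reported in the text as "% FP".
        return self.fp / (self.fp + self.tn)

ekg = Confusion(tp=110, fp=277, tn=171, fn=27)
print(round(ekg.sensitivity, 3))            # 0.803
print(round(ekg.specificity, 3))            # 0.382
print(round(ekg.fp_share_of_negatives, 3))  # 0.618
```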

The clinically interesting part is where the errors concentrate. The model catches acute PE better than subsegmental PE:

PE type | Positives caught at high-sensitivity cutoff
Acute PE | 96/116 (82.8%)
Subsegmental PE | 14/21 (66.7%)

The false positives are concentrated in physiologic-stress strata:

Stratum | False positives / true negatives among negatives
HR 100-119 | 63 FP / 11 TN (85.1% FP)
HR >=120 | 23 FP / 2 TN (92.0% FP)
Shock index >=0.9 | 38 FP / 4 TN (90.5% FP)
CTPA text mentions tachycardia | 60 FP / 8 TN (88.2% FP)
CTPA text mentions hypoxia/hypoxemia | 56 FP / 7 TN (88.9% FP)

EKG false positives are concentrated in physiologic-stress strata, especially tachycardia and high shock index.
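The stratum rows above are tallies restricted to PE-negative subjects. A minimal sketch of that bookkeeping follows; the record layout `(stratum, label, score)` and the toy records are assumptions for illustration, and only the 0.249 cutoff is taken from the table above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def fp_share_by_stratum(records: Iterable[Tuple[str, int, float]],
                        cutoff: float) -> Dict[str, float]:
    """Share of PE-negative subjects called positive, per stratum."""
    tallies = defaultdict(lambda: {"FP": 0, "TN": 0})
    for stratum, label, score in records:
        if label == 1:
            continue  # positives do not enter the FP/TN tally
        tallies[stratum]["FP" if score >= cutoff else "TN"] += 1
    return {s: t["FP"] / (t["FP"] + t["TN"]) for s, t in tallies.items()}

# Toy records, not real subjects.
toy_records = [
    ("HR <100", 0, 0.10), ("HR <100", 0, 0.30), ("HR <100", 1, 0.40),
    ("HR >=100", 0, 0.35), ("HR >=100", 0, 0.28),
    ("HR >=100", 0, 0.12), ("HR >=100", 1, 0.50),
]
shares = fp_share_by_stratum(toy_records, cutoff=0.249)
print(shares)
```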

My interpretation is that the EKG model is partly identifying patients who look clinically stressed. That overlaps with PE, but it also overlaps with many non-PE illnesses. This is why the model can catch many acute cases while still producing many false positives.

3.1 EKG Timing Within 0-48h

Most EKG lockbox records are close to CTPA time. The <=6h bucket contains 449/585 subjects, while >6h to 48h contains 136/585. In the timing tables, TP, FP, TN, and FN are shown as n (% of row), so the four percentages sum to approximately 100% within each row.

EKG-to-CTPA bucket | N | Pos/Neg | TP | FP | TN | FN | Sensitivity | Specificity
<=6h | 449 | 109/340 | 87 (19.4%) | 204 (45.4%) | 136 (30.3%) | 22 (4.9%) | 0.798 | 0.400
>6h to 48h | 136 | 28/108 | 23 (16.9%) | 73 (53.7%) | 35 (25.7%) | 5 (3.7%) | 0.821 | 0.324

I do not see a clean EKG timing gradient. Specificity is somewhat lower after 6 hours, but the main EKG story remains physiologic stress rather than time-before-CTPA.
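The row percentages in the EKG timing table, and the fact that the two buckets add back up to the overall EKG confusion matrix, can be checked as follows. This sketch uses only counts reported above; the helper name is illustrative.

```python
def row_percentages(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Each cell as a percent of the row total, rounded to one decimal."""
    n = tp + fp + tn + fn
    cells = {"TP": tp, "FP": fp, "TN": tn, "FN": fn}
    return {name: round(100 * count / n, 1) for name, count in cells.items()}

early = (87, 204, 136, 22)  # <=6h bucket
late = (23, 73, 35, 5)      # >6h to 48h bucket

print(row_percentages(*early))  # {'TP': 19.4, 'FP': 45.4, 'TN': 30.3, 'FN': 4.9}
print(row_percentages(*late))   # {'TP': 16.9, 'FP': 53.7, 'TN': 25.7, 'FN': 3.7}

# Bucket counts should sum to the overall TP/FP/TN/FN of the EKG lockbox.
overall = tuple(a + b for a, b in zip(early, late))
print(overall)  # (110, 277, 171, 27)
```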

I would be careful not to say the EKG model has learned an EKG signature of PE. The evidence is more consistent with a severity or physiologic-stress signal.

4 Section 2. Full CXR Standalone Models

Models: CXR embedding cfg0078 and raw-image CXR cfg0023
Cohort: 649-subject CXR lockbox

Model | Cutoff | AUROC | AUPRC | Sensitivity | Specificity | TP | FP | TN | FN
CXR embedding cfg0078 | 0.057 | 0.615 | 0.114 (1.4x) | 0.808 | 0.348 | 42 | 389 | 208 | 10
Raw CXR cfg0023 | 0.004 | 0.500 | 0.097 (1.2x) | 0.808 | 0.154 | 42 | 505 | 92 | 10

The embedding model is the only CXR model worth discussing seriously. The raw CXR model is near-null by AUROC and reaches high sensitivity only by using an extremely low cutoff, which creates 505 false positives among 597 negatives.

The main CXR embedding finding is the AP/PA split:

CXR view | Positives caught | False positives / true negatives among negatives | Specificity
AP | 29/31 (93.5%) | 267 FP / 36 TN (88.1% FP) | 0.119
PA | 13/21 (61.9%) | 122 FP / 172 TN (41.5% FP) | 0.585

CXR specificity is much lower for AP studies than PA studies, supporting the portable/sicker-patient context interpretation.

This is a large and clinically interpretable pattern. AP chest x-rays are often portable studies and tend to come from sicker patients. The model appears to treat that acquisition context as a strong PE-associated signal. That helps sensitivity among PE-positive AP studies (29/31 caught), but it causes very poor specificity among PE-negative AP studies (0.119).
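The overall specificity of 0.348 is a weighted average of the AP and PA specificities, with weights given by each view's share of PE-negative studies. The sketch below recomputes this from the counts in the AP/PA table; variable names are illustrative.

```python
# False positives and true negatives among PE-negative studies, by view.
ap_fp, ap_tn = 267, 36
pa_fp, pa_tn = 122, 172

ap_neg = ap_fp + ap_tn  # 303 AP negatives
pa_neg = pa_fp + pa_tn  # 294 PA negatives

spec_ap = ap_tn / ap_neg
spec_pa = pa_tn / pa_neg

# Weight each view by its share of all PE-negative studies.
w_ap = ap_neg / (ap_neg + pa_neg)
pooled_spec = w_ap * spec_ap + (1 - w_ap) * spec_pa

print(round(spec_ap, 3), round(spec_pa, 3), round(pooled_spec, 3))  # 0.119 0.585 0.348
```

Because AP studies make up roughly half of the negatives (303 of 597), the pooled figure sits about midway between the two views and hides how differently they behave.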

The report-text strata point in the same direction. When CXR text contains limited/portable language, the embedding model catches 20/22 (90.9%) positives but also flags 172/199 (86.4%) negatives as positive. This does not mean the text was used by the model; it means the image context and report context are aligned in a way that helps us diagnose model behavior.

4.1 CXR Timing Within 0-48h

CXR timing shows a stronger pattern than EKG timing: specificity is better in the <=6h bucket and worse in the >6h to 48h bucket. But this is not clean evidence that elapsed time itself is the mechanism, because the later CXR bucket is much more AP/portable.

CXR-to-CTPA bucket | N | Pos/Neg | TP | FP | TN | FN | Sensitivity | Specificity | AP percent
<=6h | 413 | 32/381 | 25 (6.1%) | 224 (54.2%) | 157 (38.0%) | 7 (1.7%) | 0.781 | 0.412 | 40.7%
>6h to 48h | 236 | 20/216 | 17 (7.2%) | 165 (69.9%) | 51 (21.6%) | 3 (1.3%) | 0.850 | 0.236 | 70.3%

This timing pattern supports the AP/portable-context interpretation. After 6 hours, the model catches a slightly larger fraction of positives, but it also calls many more negatives positive. The raw CXR model remains weak across timing buckets and does not change that conclusion.

The CXR model also catches acute PE better than subsegmental PE:

PE type | Positives caught at high-sensitivity cutoff
Acute PE | 33/39 (84.6%)
Subsegmental PE | 9/13 (69.2%)

I would frame the CXR embedding result as a real but limited signal. It is not simply random, but a large part of its behavior may be driven by AP/portable/sicker-patient context rather than PE-specific radiographic evidence.

5 Section 3. Paired Lockbox And Fusion

Cohort: 177-subject paired lockbox
Case mix: 42 positive, 135 negative
Models: paired EKG-only, paired CXR-only, and Fusion A037. I also include the broad-pool standalone scores restricted to these same 177 patients.

5.1 Paired-Trained Models On The Same 177 Subjects

Model | Cutoff | AUROC | AUPRC | Sensitivity | Specificity | TP | FP | TN | FN
Paired EKG-only | 0.351 | 0.506 | 0.244 (1.0x) | 0.810 | 0.259 | 34 | 100 | 35 | 8
Paired CXR-only | 0.420 | 0.632 | 0.328 (1.4x) | 0.810 | 0.430 | 34 | 77 | 58 | 8
Fusion A037 | 0.305 | 0.565 | 0.291 (1.2x) | 0.810 | 0.356 | 34 | 87 | 48 | 8

This is the central fusion result. A037 does not improve over paired CXR-only. At the same sensitivity, it has the same 34 true positives and 8 false negatives, but it creates 10 more false positives than paired CXR-only.

At matched sensitivity, A037 has the same true positives as paired CXR-only but more false positives.

The complementarity story also does not hold well. Fusion preserves most CXR-only successes, but it preserves only 6/29 (20.7%) EKG-only successes and rescues only 3/56 (5.4%) cases where both branches were wrong. In plain terms: the fusion model does not appear to be combining two independent useful signals in a way that improves the final decision.
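The complementarity bookkeeping can be sketched as a four-way grouping of subjects by whether each single-modality branch was correct, followed by a count of how often fusion was correct within each group. This assumes "EKG-only success" means the paired EKG-only branch was correct while the paired CXR-only branch was wrong; the function name and toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

def complementarity_table(ekg_ok: Sequence[bool], cxr_ok: Sequence[bool],
                          fusion_ok: Sequence[bool]) -> Dict[str, Tuple[int, int]]:
    """Per group: (subjects where fusion is correct, subjects in group)."""
    totals, fusion_correct = Counter(), Counter()
    for e, c, f in zip(ekg_ok, cxr_ok, fusion_ok):
        if e and c:
            group = "both correct"
        elif c:
            group = "CXR-only correct"
        elif e:
            group = "EKG-only correct"
        else:
            group = "both wrong"
        totals[group] += 1
        fusion_correct[group] += int(f)
    return {g: (fusion_correct[g], totals[g]) for g in totals}

# Toy correctness flags for six subjects, not real data.
ekg_ok = [True, True, False, False, True, False]
cxr_ok = [True, False, True, False, False, True]
fusion_ok = [True, False, True, False, True, True]
table = complementarity_table(ekg_ok, cxr_ok, fusion_ok)
print(table)
```

Under that reading the reported denominators are internally consistent: paired CXR-only makes 85 errors (77 FP + 8 FN), and removing the 56 subjects where both branches were wrong leaves the 29 EKG-only successes.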

5.2 Paired/Fusion Timing Within 0-48h

In the paired lockbox, the same CXR timing/context pattern appears. Paired CXR-only is more specific in the <=6h CXR bucket, while the >6h to 48h bucket is more AP-heavy and has worse specificity.

Model | Timing axis | Bucket | N | TP | FP | TN | FN | Sensitivity | Specificity
Paired EKG-only | EKG-to-CTPA | <=6h | 137 | 28 (20.4%) | 75 (54.7%) | 26 (19.0%) | 8 (5.8%) | 0.778 | 0.257
Paired EKG-only | EKG-to-CTPA | >6h to 48h | 40 | 6 (15.0%) | 25 (62.5%) | 9 (22.5%) | 0 (0.0%) | 1.000 | 0.265
Paired CXR-only | CXR-to-CTPA | <=6h | 115 | 20 (17.4%) | 41 (35.7%) | 47 (40.9%) | 7 (6.1%) | 0.741 | 0.534
Paired CXR-only | CXR-to-CTPA | >6h to 48h | 62 | 14 (22.6%) | 36 (58.1%) | 11 (17.7%) | 1 (1.6%) | 0.933 | 0.234
Fusion A037 | CXR-to-CTPA | <=6h | 115 | 21 (18.3%) | 52 (45.2%) | 36 (31.3%) | 6 (5.2%) | 0.778 | 0.409
Fusion A037 | CXR-to-CTPA | >6h to 48h | 62 | 13 (21.0%) | 35 (56.5%) | 12 (19.4%) | 2 (3.2%) | 0.867 | 0.255

Timing does not change the fusion conclusion. In the paired data, CXR-only is still better than A037 in the main <=6h CXR group. After 6 hours, both CXR-only and A037 make many false-positive calls. That later group is also much more AP-heavy: 74.2% AP after 6 hours versus 36.5% AP within 6 hours. So I would not read this as a pure timing effect. It looks more like later CXR timing is mixed with the same AP/portable-patient context that drives the CXR false positives.
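The AP percentages quoted for the timing buckets can be reconciled with the AP totals in the case-mix table (88 AP in the paired lockbox, 334 AP in the CXR lockbox). This is a rough consistency check only, since the implied counts are back-calculated from rounded percentages.

```python
# (bucket size, reported AP percent) copied from the timing tables and text.
paired_buckets = {"<=6h": (115, 36.5), ">6h to 48h": (62, 74.2)}
cxr_buckets = {"<=6h": (413, 40.7), ">6h to 48h": (236, 70.3)}

def implied_ap_counts(buckets: dict) -> dict:
    """Back-calculate AP counts from bucket size and rounded AP percent."""
    return {name: round(n * pct / 100) for name, (n, pct) in buckets.items()}

paired_ap = implied_ap_counts(paired_buckets)
cxr_ap = implied_ap_counts(cxr_buckets)

print(paired_ap, sum(paired_ap.values()))  # {'<=6h': 42, '>6h to 48h': 46} 88
print(cxr_ap, sum(cxr_ap.values()))        # {'<=6h': 168, '>6h to 48h': 166} 334
```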

5.3 Broad-Pool Standalone Scores Restricted To The Same 177 Subjects

This is a useful check because the broad EKG and broad CXR models were trained on their larger modality-specific pools, then applied to the same paired 177 subjects.

Broad score on paired 177 | AUROC | AUPRC | Cutoff from broad lockbox | Sensitivity | Specificity | TP | FP | TN | FN
Broad EKG cfg0017 | 0.637 | 0.370 (1.6x) | 0.249 | 0.905 | 0.333 | 38 | 90 | 45 | 4
Broad CXR embedding cfg0078 | 0.649 | 0.335 (1.4x) | 0.057 | 0.833 | 0.400 | 35 | 81 | 54 | 7
Broad raw CXR cfg0023 | 0.514 | 0.268 (1.1x) | 0.004 | 0.833 | 0.148 | 35 | 115 | 20 | 7

This strengthens the main point: A037 is not clearly better than the standalone scores on the same patients. The broad CXR embedding score restricted to the paired 177 catches one more positive than A037 (35 vs 34), with six fewer false positives. AUROC and AUPRC also point in the same direction.

5.4 Do The Standalone Error Patterns Hold In The Paired Lockbox?

Yes, broadly.

For CXR, the AP/portable pattern persists:

Model on paired 177 | AP positives caught | AP false positives / true negatives | PA positives caught | PA false positives / true negatives
Paired CXR-only | 25/26 (96.2%) | 56 FP / 6 TN (90.3% FP) | 9/16 (56.2%) | 21 FP / 52 TN (28.8% FP)
Fusion A037 | 24/26 (92.3%) | 57 FP / 5 TN (91.9% FP) | 10/16 (62.5%) | 30 FP / 43 TN (41.1% FP)

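Fusion does not fix the AP false-positive problem (57 vs 56 false positives among 62 AP negatives). It also worsens PA false positives relative to paired CXR-only (30 vs 21 among 73 PA negatives), which accounts for 9 of the 10 additional false positives.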

For PE type, subsegmental PE remains a weak point:

Model | Acute PE caught | Subsegmental PE caught
Paired EKG-only | 29/33 (87.9%) | 5/9 (55.6%)
Paired CXR-only | 28/33 (84.8%) | 6/9 (66.7%)
Fusion A037 | 29/33 (87.9%) | 5/9 (55.6%)

Across paired models, subsegmental PE is caught less reliably than acute PE.

The chronic/equivocal negatives are also frequently overcalled at the high-sensitivity threshold:

Model | Chronic/equivocal negatives classified positive
EKG standalone | 14/18 (77.8%)
CXR embedding standalone | 13/17 (76.5%)
Raw CXR standalone | 15/17 (88.2%)
Paired CXR-only | 5/5 (100.0%)
Fusion A037 | 4/5 (80.0%)

These are small strata, but clinically important. Chronic and equivocal cases sit near the boundary of how we define the binary label, so they should be handled as a label-boundary caveat rather than treated as ordinary clean negatives.

5.5 Fixed 0.50 Cutoff

The 0.50 threshold mostly shows that the scores are not on a common calibrated probability scale.

At 0.50, EKG standalone, CXR embedding standalone, paired EKG-only, Fusion A037, broad EKG-on-177, and broad CXR-embedding-on-177 classify every subject as negative. Paired CXR-only and the raw CXR models do produce some positive calls at 0.50, but paired CXR-only sensitivity falls to 0.524 and raw CXR sensitivity stays below 0.10. None of this changes the operating-point story; it mainly shows that 0.50 is not a shared clinical threshold across models.
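A toy illustration of why a fixed 0.50 cutoff says more about score scale than about ranking: applying a monotone rescaling to the scores leaves AUROC unchanged but completely changes what 0.50 does. The scores below are made up, and the min-max rescaling is used only to make the point; it is not a calibration method and is not something the pipeline does.

```python
from typing import List, Sequence

def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: P(positive outranks negative), ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def calls_at(scores: Sequence[float], cutoff: float = 0.5) -> List[int]:
    return [int(s >= cutoff) for s in scores]

# Toy scores that all sit below 0.5, as several models here do.
scores = [0.30, 0.22, 0.25, 0.10, 0.05, 0.18]
labels = [1, 1, 0, 0, 0, 1]

lo, hi = min(scores), max(scores)
rescaled = [(s - lo) / (hi - lo) for s in scores]  # monotone, order-preserving

print(sum(calls_at(scores)), sum(calls_at(rescaled)))  # 0 4
print(round(auroc(scores, labels), 3), round(auroc(rescaled, labels), 3))  # 0.778 0.778
```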

6 My takes

First, the master data structure is sound enough for this diagnostic report. The row unit is one subject, labels are not missing, and the expected lockbox sizes line up.

Second, the main clinical story is not “fusion improves PE detection.” The better story is “the models expose different non-PE-specific signals.” EKG leans toward physiologic stress. CXR embedding leans toward AP/portable imaging context. Both signals overlap with PE, but both also generate many false positives.

Third, the CXR embedding model is the strongest single branch in the paired setting, but its AP specificity problem is substantial. If Barbara wants to clinically inspect anything first, I would start with false positives among PE-negative AP studies and with subsegmental false negatives.

Fourth, subsegmental PE is a consistent miss pattern. This makes clinical sense: subsegmental PE is smaller and may produce weaker indirect physiologic or imaging-context signals.

Fifth, chronic/equivocal negatives should be explicitly caveated. These are binary negatives in the analysis, but they are not clinically the same as clean “no PE” negatives.

7 Caveats

  1. Some subjects have multiple CXR records available, but the primary CXR pipeline selected one record per subject using a PA-first rule: prefer PA if available, otherwise AP, then smallest dicom_id as a tie-breaker. This means the AP false-positive pattern is not explained by choosing AP over PA when both were available; selected AP studies mostly represent subjects without a PA alternative, so AP is better interpreted as an acquisition-context marker.

  2. Section 3 uses the paired/fusion CXR metadata for CXR descriptors such as AP/PA view and CXR-to-CTPA timing. This matters because the standalone CXR selector and the paired/fusion dataset can occasionally point to different CXR records for the same subject.

  3. Vitals are descriptive context. The vitals file already contains one selected vitals row per subject, drawn from measurements within 0-48 hours before that row’s CTPA. For some subjects, that vitals row could only be matched by subject_id, so it may not correspond to the exact same admission or CTPA event as the EKG/CXR used in this report. This subject-level fallback occurred for about 25% of EKG-pool subjects, 32% of CXR-pool subjects, and 30% of paired-pool subjects.

  4. Report-text flags are not negation-aware and should not be described as adjudicated clinical findings.

  5. AUPRC depends on the underlying positive-case rate. The EKG, CXR, and paired pools have different PE prevalence, so AUPRC values across pools are not directly comparable; the (x prevalence baseline) multiplier helps with that comparison.
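The PA-first selection rule in caveat 1 can be sketched as a priority sort per subject. This is an illustration rather than the pipeline code: the `CxrRecord` type and function name are hypothetical, `dicom_id` is assumed to compare as a string, and records with any other view position are assumed to be skipped.

```python
from typing import Dict, Iterable, NamedTuple

class CxrRecord(NamedTuple):
    subject_id: int
    dicom_id: str
    view_position: str  # "PA" or "AP"

VIEW_PRIORITY = {"PA": 0, "AP": 1}  # lower value is preferred

def select_primary_cxr(records: Iterable[CxrRecord]) -> Dict[int, CxrRecord]:
    """One record per subject: prefer PA, then AP, then smallest dicom_id."""
    best: Dict[int, CxrRecord] = {}
    for rec in records:
        if rec.view_position not in VIEW_PRIORITY:
            continue  # assumption: other views are not eligible
        key = (VIEW_PRIORITY[rec.view_position], rec.dicom_id)
        current = best.get(rec.subject_id)
        if current is None or key < (VIEW_PRIORITY[current.view_position], current.dicom_id):
            best[rec.subject_id] = rec
    return best

# Toy records, not real identifiers.
toy = [
    CxrRecord(1, "a1", "AP"), CxrRecord(1, "b2", "PA"), CxrRecord(1, "b1", "PA"),
    CxrRecord(2, "z9", "AP"), CxrRecord(2, "c3", "AP"),
]
selected = select_primary_cxr(toy)
print(selected[1].dicom_id, selected[1].view_position)  # b1 PA
print(selected[2].dicom_id, selected[2].view_position)  # c3 AP
```

Subject 2 ends up with an AP study only because no PA study exists for that subject, which is the situation caveat 1 describes for most selected AP studies.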