PE Diagnostic Model Report
1 Executive Summary
The models appear to be learning pieces of the clinical context around PE, but not PE itself cleanly.
Technical assessment: I did not find evidence that the main results are explained by a simple data-structure problem. The expected lockbox/validation sizes line up, labels are present and concordant across sources, and the available diagnostic checks did not point to a patient-join problem or to a score-direction problem (that is, a case where higher model scores would mean lower PE risk).
Full EKG-only model: The EKG model appears to be picking up physiologic stress rather than a PE-specific electrical pattern. Its modest discrimination (AUROC 0.622, AUPRC 0.350, about 1.5x prevalence) appears to come mostly from encounters with more evident cardiopulmonary strain, and its false positives cluster in patients with tachycardia, higher shock index, and report language suggesting hypoxia. That signal is not useless, but it suggests the model may be learning severity or stress patterns associated with PE workups rather than identifying PE itself.
Full CXR-only model: The CXR embedding model has a view-position problem. Its predictions are strongly tied to AP (anteroposterior) CXRs, with many PE-negative AP studies called positive. AP status comes from the DICOM image metadata (VIEWPOSITION) and likely reflects a different clinical context, since AP films are often obtained as portable bedside studies in less mobile or more acutely ill patients. PA (posteroanterior) studies behave much better.
Paired-data Fusion model: The fusion model does not solve the standalone-model problems. On the paired 177-patient validation set, the best fusion model (A037) does not outperform the paired CXR-only model at the matched high-sensitivity cutoff. It catches the same number of PE-positive cases (34) and misses the same number (8) as CXR-only, but it incorrectly calls more PE-negative patients positive (87 versus 77). Fusion mostly inherits the CXR-embedding branch’s error pattern rather than adding a useful complementary signal from EKG. One reason fusion does not gain much from EKG is that the paired EKG-only branch is essentially at chance on the 177-subject paired lockbox (AUROC 0.506).
2 Sample Context And Derived Characteristics
I made a subject-level master frame. Each row is one patient/subject. The full frame has 5,289 subjects, formed from the union of the EKG and CXR pools:
| Pool | Total pool | Tuning | Validation/Lockbox |
|---|---|---|---|
| EKG | 2,925 | 2,340 | 585 |
| CXR | 3,250 | 2,601 | 649 |
| Paired EKG+CXR | 886 | 709 | 177 |
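As a quick arithmetic check, the 5,289 total is what inclusion-exclusion gives from the pool sizes above. This minimal sketch assumes the paired pool is exactly the overlap between the EKG and CXR pools, which matches the description of the paired pool but is not verified here.

```python
# Pool sizes copied from the table above.
ekg_pool = 2925
cxr_pool = 3250
paired_pool = 886  # assumption: exactly the subjects present in both pools

# Inclusion-exclusion: size of union = EKG + CXR - overlap
union_size = ekg_pool + cxr_pool - paired_pool
print(union_size)
```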
The 177 subjects in the paired validation/lockbox set are the clean comparison set for Section 3, because all paired and fusion scores are available there.
Terminology note: when I refer to the paired EKG-only or paired CXR-only model, I mean a single-modality branch model trained within the paired-data pool: the n=886 patients who had both an EKG and a CXR available. These are different from the full EKG-only and full CXR-only models, which were trained on their larger modality-specific pools.
For the descriptive clinical strata, I used the following definitions:
| Characteristic | How it is defined here |
|---|---|
| PE type | Acute and Subsegmental count as positive labels. No PE, Chronic, and Equivocal count as negative labels. |
| Tachycardia | Heart rate >=100. Bins are <80, 80-99, 100-119, and >=120. |
| O2 saturation | Bins are <90, 90-94, and >=95. I call these O2 saturation strata, not hypoxia. |
| Shock index | Heart rate divided by systolic blood pressure. Bins are <0.7, 0.7-0.9, and >=0.9. |
| CXR viewpoint | AP versus PA acquisition context. AP often means portable/sicker-patient context; PA is usually a more standard upright acquisition. |
| Report-text flags | These are keyword indicators from the radiology report text. They are used only to stratify model behavior. They should not be interpreted as adjudicated findings because the current extraction is not negation-aware; for example, “no pleural effusion” can still trigger the effusion search. |
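The definitions in the table above can be sketched as small helper functions. This is an illustration, not the pipeline's code: the function names are hypothetical, a value of exactly 0.9 is assigned to the >=0.9 shock-index bin (the table leaves that boundary ambiguous), and missing inputs are given their own "missing" bin.

```python
from typing import Optional

POSITIVE_TYPES = {"Acute", "Subsegmental"}
NEGATIVE_TYPES = {"No PE", "Chronic", "Equivocal"}

def binary_label(pe_type: str) -> int:
    """Map the PE type to the binary label used in this report."""
    if pe_type in POSITIVE_TYPES:
        return 1
    if pe_type in NEGATIVE_TYPES:
        return 0
    raise ValueError(f"unexpected PE type: {pe_type!r}")

def hr_bin(hr: Optional[float]) -> str:
    """Heart-rate bins; tachycardia corresponds to the two bins at or above 100."""
    if hr is None:
        return "missing"
    if hr < 80:
        return "<80"
    if hr < 100:
        return "80-99"
    if hr < 120:
        return "100-119"
    return ">=120"

def spo2_bin(spo2: Optional[float]) -> str:
    """O2 saturation strata (descriptive bins, not a hypoxia diagnosis)."""
    if spo2 is None:
        return "missing"
    if spo2 < 90:
        return "<90"
    if spo2 < 95:
        return "90-94"
    return ">=95"

def shock_index_bin(hr: Optional[float], sbp: Optional[float]) -> str:
    """Shock index = heart rate / systolic blood pressure."""
    if hr is None or sbp is None or sbp <= 0:
        return "missing"
    si = hr / sbp
    if si < 0.7:
        return "<0.7"
    if si < 0.9:
        return "0.7-0.9"
    return ">=0.9"  # assumption: the 0.9 boundary goes to the upper bin

print(binary_label("Chronic"), hr_bin(105), spo2_bin(92), shock_index_bin(110, 100))
```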
The three lockboxes have different case mixes:
| Characteristic | EKG lockbox N=585 | CXR lockbox N=649 | Paired lockbox N=177 |
|---|---|---|---|
| Positive / negative | 137 / 448 | 52 / 597 | 42 / 135 |
| Acute PE | 116 | 39 | 33 |
| Subsegmental PE | 21 | 13 | 9 |
| Chronic | 15 | 13 | 4 |
| Equivocal | 3 | 4 | 1 |
| No PE | 430 | 580 | 130 |
| Heart rate missing | 201 | 105 | 25 |
| O2 saturation missing | 249 | 170 | 41 |
| Shock index missing | 204 | 111 | 27 |
| AP / PA CXR | NA | 334 / 315 | 88 / 89 |
The paired lockbox has a higher PE prevalence than the CXR-only lockbox, so AUPRC should not be compared casually across pools. For side-by-side model comparison, the 177 paired lockbox is the fairer setting.
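To make the prevalence point concrete, the sketch below computes each lockbox's positive-case rate from the counts in the table and expresses an AUPRC as a multiple of that rate. The helper names are hypothetical; the only outside fact used is that a non-informative ranking has an expected AUPRC roughly equal to prevalence.

```python
# Positive / negative counts copied from the case-mix table above.
lockboxes = {
    "EKG": {"pos": 137, "neg": 448},
    "CXR": {"pos": 52, "neg": 597},
    "Paired": {"pos": 42, "neg": 135},
}

def prevalence(pos: int, neg: int) -> float:
    """Positive-case rate, which is the AUPRC baseline for a random ranking."""
    return pos / (pos + neg)

def auprc_lift(auprc: float, pos: int, neg: int) -> float:
    """AUPRC expressed as a multiple of the prevalence baseline."""
    return auprc / prevalence(pos, neg)

for name, counts in lockboxes.items():
    print(f"{name}: prevalence = {prevalence(**counts):.3f}")

# Full EKG model AUPRC (0.350) relative to the EKG lockbox baseline.
print(f"EKG AUPRC lift = {auprc_lift(0.350, 137, 448):.1f}x")
```

The same AUPRC value would correspond to a very different lift in the CXR lockbox, which is why the multiplier is the safer number to compare across pools.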
2.1 Operating Points
I show two cutoffs.
The main cutoff is the model-specific threshold that reaches sensitivity at or just above 0.80 on the relevant lockbox. This answers: if we force the model to catch roughly 80% of PE positives, where do the false positives and false negatives concentrate? Each model produces scores on its own scale, so the cutoff needed to catch about 80% of PE-positive cases differs across models.
AUROC summarizes ranking quality (0.5 is random, 1.0 is perfect). AUPRC summarizes precision-recall and is shown as value (x prevalence baseline), where 1.0x means no gain over the positive-case rate.
The second cutoff is 0.50. This is not clinically meaningful for most of these models. It is useful mainly as a calibration and score-scale diagnostic. At 0.50, several models classify every subject as negative.
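A minimal sketch of how a model-specific high-sensitivity cutoff can be chosen is shown below. It assumes a subject is called positive when the score is at or above the cutoff; the function names and the toy scores are hypothetical and do not come from the actual models.

```python
import math

def threshold_for_sensitivity(scores, labels, target=0.80):
    """Return the highest cutoff whose sensitivity is at or above `target`.

    Assumes a positive call means score >= cutoff.
    """
    pos_scores = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not pos_scores:
        raise ValueError("need at least one positive case")
    # Number of positives that must be caught; the small epsilon guards
    # against floating-point noise in target * n.
    k = math.ceil(target * len(pos_scores) - 1e-9)
    return pos_scores[k - 1]

def confusion(scores, labels, cutoff):
    """Count TP, FP, TN, FN at a given cutoff."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        called_positive = s >= cutoff
        if y == 1 and called_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif called_positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn

# Toy data for illustration only.
scores = [0.9, 0.7, 0.6, 0.4, 0.2, 0.8, 0.5, 0.3, 0.1, 0.05]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

cutoff = threshold_for_sensitivity(scores, labels, target=0.80)
print(cutoff, confusion(scores, labels, cutoff))
print(0.50, confusion(scores, labels, 0.50))
```

Because the cutoff is read off each model's own positive-case scores, two models with the same target sensitivity can end up with very different numeric thresholds.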
Because all sampled inputs were already inside the 0-48h pre-CTPA window, timing is treated below as a within-window stratification variable in each model section.
3 Section 1. Full EKG Standalone Model
Model: EKG standalone cfg0017
Cohort: 585-subject EKG lockbox
| Cutoff | AUROC | AUPRC | Sensitivity | Specificity | TP | FP | TN | FN |
|---|---|---|---|---|---|---|---|---|
| Sensitivity >=0.80 cutoff, 0.249 | 0.622 | 0.350 (1.5x) | 0.803 | 0.382 | 110 | 277 | 171 | 27 |
| Fixed 0.50 cutoff | 0.622 | 0.350 (1.5x) | 0.000 | 1.000 | 0 | 0 | 448 | 137 |
The EKG model has modest ranking performance. At the high-sensitivity cutoff it catches 110/137 (80.3%) PE positives, but it also flags 277/448 (61.8%) negatives as positive.
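The figures in this paragraph follow directly from the confusion counts in the table. A short check, using only the counts reported above (the helper names are hypothetical):

```python
def sensitivity(tp: int, fn: int) -> float:
    """Share of PE-positive subjects that the model calls positive."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Share of PE-negative subjects that the model calls negative."""
    return tn / (tn + fp)

# Counts from the sensitivity >=0.80 row of the EKG table.
tp, fp, tn, fn = 110, 277, 171, 27

print(f"positives = {tp + fn}, negatives = {fp + tn}")
print(f"sensitivity = {sensitivity(tp, fn):.3f}")
print(f"specificity = {specificity(tn, fp):.3f}")
print(f"negatives flagged positive = {fp / (fp + tn):.1%}")
```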
The clinically interesting part is where the errors concentrate. The model catches acute PE better than subsegmental PE:
| PE type | Positives caught at high-sensitivity cutoff |
|---|---|
| Acute PE | 96/116 (82.8%) |
| Subsegmental PE | 14/21 (66.7%) |
The false positives are concentrated in physiologic-stress strata:
| Stratum | False positives / true negatives among negatives |
|---|---|
| HR 100-119 | 63 FP / 11 TN (85.1% FP) |
| HR >=120 | 23 FP / 2 TN (92.0% FP) |
| Shock index >=0.9 | 38 FP / 4 TN (90.5% FP) |
| CTPA text mentions tachycardia | 60 FP / 8 TN (88.2% FP) |
| CTPA text mentions hypoxia/hypoxemia | 56 FP / 7 TN (88.9% FP) |
My interpretation is that the EKG model is partly identifying patients who look clinically stressed. That overlaps with PE, but it also overlaps with many non-PE illnesses. This is why the model can catch many acute cases while still producing many false positives.
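The stratum table above is a false-positive share computed among PE-negative subjects only. A sketch of that calculation from per-subject rows follows; the row layout, function name, and toy rows are hypothetical and only illustrate the bookkeeping.

```python
from collections import defaultdict

def fp_share_by_stratum(rows):
    """rows: iterable of (stratum, label, predicted_positive).

    Only PE-negative rows (label == 0) are counted, so each share is
    FP / (FP + TN) within the stratum.
    """
    counts = defaultdict(lambda: {"FP": 0, "TN": 0})
    for stratum, label, predicted_positive in rows:
        if label != 0:
            continue  # positives do not contribute to FP/TN
        counts[stratum]["FP" if predicted_positive else "TN"] += 1
    return {s: c["FP"] / (c["FP"] + c["TN"]) for s, c in counts.items()}

# Toy rows for illustration only.
rows = [
    ("HR >=120", 0, True),
    ("HR >=120", 0, True),
    ("HR >=120", 0, False),
    ("HR <80", 0, False),
    ("HR <80", 1, True),  # positive case, ignored here
]
shares = fp_share_by_stratum(rows)
print(shares)
```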
3.1 EKG Timing Within 0-48h
Most EKG lockbox records are close to CTPA time. The <=6h bucket contains 449/585 subjects, while >6h to 48h contains 136/585. In the timing tables, TP, FP, TN, and FN are shown as n (% of row), so the four percentages sum to approximately 100% within each row.
| EKG-to-CTPA bucket | N | Pos/Neg | TP | FP | TN | FN | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|
| <=6h | 449 | 109/340 | 87 (19.4%) | 204 (45.4%) | 136 (30.3%) | 22 (4.9%) | 0.798 | 0.400 |
| >6h to 48h | 136 | 28/108 | 23 (16.9%) | 73 (53.7%) | 35 (25.7%) | 5 (3.7%) | 0.821 | 0.324 |
I do not see a clean EKG timing gradient. Specificity is somewhat lower after 6 hours, but the main EKG story remains physiologic stress rather than time-before-CTPA.
I would be careful not to say the EKG model has learned an EKG signature of PE. The evidence is more consistent with a severity or physiologic-stress signal.
4 Section 2. Full CXR Standalone Models
Models: CXR embedding cfg0078 and raw-image CXR cfg0023
Cohort: 649-subject CXR lockbox
| Model | Cutoff | AUROC | AUPRC | Sensitivity | Specificity | TP | FP | TN | FN |
|---|---|---|---|---|---|---|---|---|---|
| CXR embedding cfg0078 | 0.057 | 0.615 | 0.114 (1.4x) | 0.808 | 0.348 | 42 | 389 | 208 | 10 |
| Raw CXR cfg0023 | 0.004 | 0.500 | 0.097 (1.2x) | 0.808 | 0.154 | 42 | 505 | 92 | 10 |
The embedding model is the only CXR model worth discussing seriously. The raw CXR model is near-null by AUROC and reaches high sensitivity only by using an extremely low cutoff, which creates 505 false positives among 597 negatives.
The main CXR embedding finding is the AP/PA split:
| CXR view | Positives caught | False positives / true negatives among negatives | Specificity |
|---|---|---|---|
| AP | 29/31 (93.5%) | 267 FP / 36 TN (88.1% FP) | 0.119 |
| PA | 13/21 (61.9%) | 122 FP / 172 TN (41.5% FP) | 0.585 |
This is a large and clinically interpretable pattern. AP chest x-rays are often portable studies and tend to come from sicker patients. The model appears to treat that acquisition context as a strong PE-associated signal. That helps sensitivity in AP-positive cases, but it causes very poor specificity among AP-negative cases.
The report-text strata point in the same direction. When CXR text contains limited/portable language, the embedding model catches 20/22 (90.9%) positives but flags 172/199 (86.4%) negatives as positive. This does not mean the text was used by the model; it means the image context and report context are aligned in a way that helps us diagnose model behavior.
4.1 CXR Timing Within 0-48h
CXR timing shows a stronger pattern than EKG timing: specificity is better in the <=6h bucket and worse in the >6h to 48h bucket. But this is not clean evidence that elapsed time itself is the mechanism, because the later CXR bucket is much more AP/portable.
| CXR-to-CTPA bucket | N | Pos/Neg | TP | FP | TN | FN | Sensitivity | Specificity | AP percent |
|---|---|---|---|---|---|---|---|---|---|
| <=6h | 413 | 32/381 | 25 (6.1%) | 224 (54.2%) | 157 (38.0%) | 7 (1.7%) | 0.781 | 0.412 | 40.7% |
| >6h to 48h | 236 | 20/216 | 17 (7.2%) | 165 (69.9%) | 51 (21.6%) | 3 (1.3%) | 0.850 | 0.236 | 70.3% |
This timing pattern supports the AP/portable-context interpretation. After 6 hours, the model catches a slightly larger fraction of positives, but it also calls many more negatives positive. The raw CXR model remains weak across timing buckets and does not change that conclusion.
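One rough way to gauge how much of the timing gap the AP mix alone could account for is to reweight the view-specific specificities from the AP/PA table by each timing bucket's AP share. This is a back-of-envelope sketch, not an analysis from the report. It assumes view-specific specificity is the same in both timing buckets, and it uses the AP share of all subjects in a bucket as a stand-in for the AP share among PE-negative subjects, which the tables do not give.

```python
# View-specific specificity from the AP/PA table (TN / negatives).
SPEC_AP = 36 / (267 + 36)
SPEC_PA = 172 / (122 + 172)

def expected_specificity(ap_share: float) -> float:
    """Specificity expected if only the AP/PA mix differed between buckets."""
    return ap_share * SPEC_AP + (1 - ap_share) * SPEC_PA

# AP share and observed specificity copied from the timing table.
buckets = {
    "<=6h": {"ap_share": 0.407, "observed": 0.412},
    ">6h to 48h": {"ap_share": 0.703, "observed": 0.236},
}

for name, b in buckets.items():
    expected = expected_specificity(b["ap_share"])
    print(f"{name}: expected {expected:.3f} vs observed {b['observed']:.3f}")
```

If the expected values land close to the observed ones, that is consistent with (though not proof of) view mix rather than elapsed time driving the specificity gap.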
The CXR model also catches acute PE better than subsegmental PE:
| PE type | Positives caught at high-sensitivity cutoff |
|---|---|
| Acute PE | 33/39 (84.6%) |
| Subsegmental PE | 9/13 (69.2%) |
I would frame the CXR embedding result as a real but limited signal. It is not simply random, but a large part of its behavior may be driven by AP/portable/sicker-patient context rather than PE-specific radiographic evidence.
5 Section 3. Paired Lockbox And Fusion
Cohort: 177-subject paired lockbox
Case mix: 42 positive, 135 negative
Models: paired EKG-only, paired CXR-only, and Fusion A037. I also include the broad-pool standalone scores restricted to these same 177 patients.
5.1 Paired-Trained Models On The Same 177 Subjects
| Model | Cutoff | AUROC | AUPRC | Sensitivity | Specificity | TP | FP | TN | FN |
|---|---|---|---|---|---|---|---|---|---|
| Paired EKG-only | 0.351 | 0.506 | 0.244 (1.0x) | 0.810 | 0.259 | 34 | 100 | 35 | 8 |
| Paired CXR-only | 0.420 | 0.632 | 0.328 (1.4x) | 0.810 | 0.430 | 34 | 77 | 58 | 8 |
| Fusion A037 | 0.305 | 0.565 | 0.291 (1.2x) | 0.810 | 0.356 | 34 | 87 | 48 | 8 |
This is the central fusion result. A037 does not improve over paired CXR-only. At the same sensitivity, it has the same 34 true positives and 8 false negatives, but it creates 10 more false positives than paired CXR-only.
The complementarity story also does not hold well. Fusion preserves most of the cases that only the CXR branch got right, but it preserves only 6/29 (20.7%) of the cases that only the EKG branch got right and rescues only 3/56 (5.4%) of the cases where both branches were wrong. In plain terms: the fusion model does not appear to be combining two independent useful signals in a way that improves the final decision.
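The complementarity check can be sketched as a partition of subjects by which branch was correct, followed by a count of how often fusion is correct within each cell. This assumes "EKG-only success" means the paired EKG-only branch was right while the paired CXR-only branch was wrong, each at its own high-sensitivity cutoff. The function name and toy inputs are hypothetical.

```python
from collections import Counter

def complementarity(ekg_ok, cxr_ok, fusion_ok):
    """Each argument is a list of booleans, one per subject, that is True
    when the model's call matched the label.

    Returns {cell: (fusion_correct, cell_size)}.
    """
    cell_size = Counter()
    fusion_correct = Counter()
    for e, c, f in zip(ekg_ok, cxr_ok, fusion_ok):
        if e and c:
            cell = "both_right"
        elif e:
            cell = "ekg_only_right"
        elif c:
            cell = "cxr_only_right"
        else:
            cell = "both_wrong"
        cell_size[cell] += 1
        fusion_correct[cell] += int(f)
    return {cell: (fusion_correct[cell], n) for cell, n in cell_size.items()}

# Toy inputs for illustration only.
ekg_ok = [True, True, False, False, True]
cxr_ok = [True, False, True, False, False]
fusion_ok = [True, False, True, False, True]

result = complementarity(ekg_ok, cxr_ok, fusion_ok)
for cell, (kept, n) in result.items():
    print(f"{cell}: fusion correct in {kept}/{n}")
```

A fusion model that adds complementary signal would show high counts in the single-branch cells and at least some rescues in the both-wrong cell.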
5.2 Paired/Fusion Timing Within 0-48h
In the paired lockbox, the same CXR timing/context pattern appears. Paired CXR-only is more specific in the <=6h CXR bucket, while the >6h to 48h bucket is more AP-heavy and has worse specificity.
| Model | Timing axis | Bucket | N | TP | FP | TN | FN | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|---|---|
| Paired EKG-only | EKG-to-CTPA | <=6h | 137 | 28 (20.4%) | 75 (54.7%) | 26 (19.0%) | 8 (5.8%) | 0.778 | 0.257 |
| Paired EKG-only | EKG-to-CTPA | >6h to 48h | 40 | 6 (15.0%) | 25 (62.5%) | 9 (22.5%) | 0 (0.0%) | 1.000 | 0.265 |
| Paired CXR-only | CXR-to-CTPA | <=6h | 115 | 20 (17.4%) | 41 (35.7%) | 47 (40.9%) | 7 (6.1%) | 0.741 | 0.534 |
| Paired CXR-only | CXR-to-CTPA | >6h to 48h | 62 | 14 (22.6%) | 36 (58.1%) | 11 (17.7%) | 1 (1.6%) | 0.933 | 0.234 |
| Fusion A037 | CXR-to-CTPA | <=6h | 115 | 21 (18.3%) | 52 (45.2%) | 36 (31.3%) | 6 (5.2%) | 0.778 | 0.409 |
| Fusion A037 | CXR-to-CTPA | >6h to 48h | 62 | 13 (21.0%) | 35 (56.5%) | 12 (19.4%) | 2 (3.2%) | 0.867 | 0.255 |
Timing does not change the fusion conclusion. In the paired data, CXR-only is still better than A037 in the main <=6h CXR group. After 6 hours, both CXR-only and A037 make many false-positive calls. That later group is also much more AP-heavy: 74.2% AP after 6 hours versus 36.5% AP within 6 hours. So I would not read this as a pure timing effect. It looks more like later CXR timing is mixed with the same AP/portable-patient context that drives the CXR false positives.
5.3 Broad-Pool Standalone Scores Restricted To The Same 177 Subjects
This is a useful check because the broad EKG and broad CXR models were trained on their larger modality-specific pools, then applied to the same paired 177 subjects.
| Broad score on paired 177 | AUROC | AUPRC | Cutoff from broad lockbox | Sensitivity | Specificity | TP | FP | TN | FN |
|---|---|---|---|---|---|---|---|---|---|
| Broad EKG cfg0017 | 0.637 | 0.370 (1.6x) | 0.249 | 0.905 | 0.333 | 38 | 90 | 45 | 4 |
| Broad CXR embedding cfg0078 | 0.649 | 0.335 (1.4x) | 0.057 | 0.833 | 0.400 | 35 | 81 | 54 | 7 |
| Broad raw CXR cfg0023 | 0.514 | 0.268 (1.1x) | 0.004 | 0.833 | 0.148 | 35 | 115 | 20 | 7 |
This strengthens the main point: A037 is not clearly better than the standalone scores on the same patients. The broad CXR embedding score restricted to the paired 177 catches one more positive than A037 (35 vs 34), with six fewer false positives (81 vs 87). AUROC and AUPRC point in the same direction (0.649 vs 0.565 and 0.335 vs 0.291).
5.4 Do The Standalone Error Patterns Hold In The Paired Lockbox?
Yes, broadly.
For CXR, the AP/portable pattern persists:
| Model on paired 177 | AP positives caught | AP false positives / true negatives | PA positives caught | PA false positives / true negatives |
|---|---|---|---|---|
| Paired CXR-only | 25/26 (96.2%) | 56 FP / 6 TN (90.3% FP) | 9/16 (56.2%) | 21 FP / 52 TN (28.8% FP) |
| Fusion A037 | 24/26 (92.3%) | 57 FP / 5 TN (91.9% FP) | 10/16 (62.5%) | 30 FP / 43 TN (41.1% FP) |
Fusion does not fix the AP false-positive problem, and it worsens PA false positives compared with paired CXR-only (30 vs 21 false positives, or 41.1% vs 28.8% of PA negatives).
For PE type, subsegmental PE remains a weak point:
| Model | Acute PE caught | Subsegmental PE caught |
|---|---|---|
| Paired EKG-only | 29/33 (87.9%) | 5/9 (55.6%) |
| Paired CXR-only | 28/33 (84.8%) | 6/9 (66.7%) |
| Fusion A037 | 29/33 (87.9%) | 5/9 (55.6%) |
The chronic/equivocal negatives are also frequently overcalled at the high-sensitivity threshold:
| Model | Chronic/equivocal negatives classified positive |
|---|---|
| EKG standalone | 14/18 (77.8%) |
| CXR embedding standalone | 13/17 (76.5%) |
| Raw CXR standalone | 15/17 (88.2%) |
| Paired CXR-only | 5/5 (100.0%) |
| Fusion A037 | 4/5 (80.0%) |
These are small strata, but clinically important. Chronic and equivocal cases sit near the boundary of how we define the binary label, so they should be handled as a label-boundary caveat rather than treated as ordinary clean negatives.
5.5 Fixed 0.50 Cutoff
The 0.50 threshold mostly shows that the scores are not on a common calibrated probability scale.
At 0.50, EKG standalone, CXR embedding standalone, paired EKG-only, Fusion A037, broad EKG-on-177, and broad CXR-embedding-on-177 classify every subject as negative. Paired CXR-only and the raw CXR models do produce some positive calls at 0.50, but paired CXR-only sensitivity falls to 0.524 and raw CXR sensitivity stays below 0.10. None of this changes the operating-point story; it mainly shows that 0.50 is not a shared clinical threshold across models.
6 My takes
First, the master data structure is sound enough for this diagnostic report. The row unit is one subject, labels are not missing, and the expected lockbox sizes line up.
Second, the main clinical story is not “fusion improves PE detection.” The better story is “the models expose different non-PE-specific signals.” EKG leans toward physiologic stress. CXR embedding leans toward AP/portable imaging context. Both signals overlap with PE, but both also generate many false positives.
Third, the CXR embedding model is the strongest single branch in the paired setting, but its AP specificity problem is substantial. If Barbara wants to clinically inspect anything first, I would start with the AP false positives (PE-negative AP studies called positive) and the subsegmental false negatives.
Fourth, subsegmental PE is a consistent miss pattern. This makes clinical sense: subsegmental PE is smaller and may produce weaker indirect physiologic or imaging-context signals.
Fifth, chronic/equivocal negatives should be explicitly caveated. These are binary negatives in the analysis, but they are not clinically the same as clean “no PE” negatives.
7 Caveats
Some subjects have multiple CXR records available, but the primary CXR pipeline selected one record per subject using a PA-first rule: prefer PA if available, otherwise AP, then smallest `dicom_id` as a tie-breaker. This means the AP false-positive pattern is not explained by choosing AP over PA when both were available; selected AP studies mostly represent subjects without a PA alternative, so AP is better interpreted as an acquisition-context marker.
Section 3 uses the paired/fusion CXR metadata for CXR descriptors such as AP/PA view and CXR-to-CTPA timing. This matters because the standalone CXR selector and the paired/fusion dataset can occasionally point to different CXR records for the same subject.
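The PA-first selection rule described in this section can be sketched as follows. This is an illustration rather than the pipeline's code: the record layout and function name are hypothetical, `dicom_id` values are compared as strings (the report does not say how "smallest" is defined), and any view other than PA or AP is ranked last.

```python
def select_cxr(records):
    """Pick one CXR record per subject: prefer PA, otherwise AP,
    then the smallest dicom_id as a tie-breaker.

    records: list of dicts with 'subject_id', 'view', and 'dicom_id'.
    """
    view_rank = {"PA": 0, "AP": 1}
    best = {}
    for rec in records:
        # Lower tuples win: view preference first, then dicom_id.
        key = (view_rank.get(rec["view"], 2), rec["dicom_id"])
        sid = rec["subject_id"]
        if sid not in best or key < best[sid][0]:
            best[sid] = (key, rec)
    return {sid: rec for sid, (_, rec) in best.items()}

# Toy records for illustration only.
records = [
    {"subject_id": 1, "view": "AP", "dicom_id": "a1"},
    {"subject_id": 1, "view": "PA", "dicom_id": "b2"},
    {"subject_id": 1, "view": "PA", "dicom_id": "a9"},
    {"subject_id": 2, "view": "AP", "dicom_id": "z0"},
    {"subject_id": 2, "view": "AP", "dicom_id": "c3"},
]
selected = select_cxr(records)
for sid, rec in sorted(selected.items()):
    print(sid, rec["view"], rec["dicom_id"])
```

Under this rule a subject ends up with an AP record only when no PA record exists, which is why AP reads as an acquisition-context marker rather than a selection artifact.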
Vitals are descriptive context. The vitals file already contains one selected vitals row per subject, drawn from measurements within 0-48 hours before that row’s CTPA. For some subjects, that vitals row could only be matched by `subject_id`, so it may not correspond to the exact same admission or CTPA event as the EKG/CXR used in this report. This subject-level fallback occurred for about 25% of EKG-pool subjects, 32% of CXR-pool subjects, and 30% of paired-pool subjects.
Report-text flags are not negation-aware and should not be described as adjudicated clinical findings.
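To illustrate the negation caveat for report-text flags, the sketch below shows a plain keyword search firing on a negated sentence. The pattern and function name are hypothetical; the report does not list the actual keywords used.

```python
import re

# Hypothetical keyword pattern; the real keyword list is not given in this report.
EFFUSION_PATTERN = re.compile(r"\beffusion\b", re.IGNORECASE)

def keyword_flag(report_text: str) -> bool:
    """Return True if the keyword appears anywhere, with no negation handling."""
    return bool(EFFUSION_PATTERN.search(report_text))

print(keyword_flag("Small left pleural effusion."))
print(keyword_flag("No pleural effusion or pneumothorax."))
print(keyword_flag("Lungs are clear."))
```

The first two sentences are both flagged even though the second one rules the finding out, which is why these flags are used only to stratify model behavior.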
AUPRC depends on the underlying positive-case rate. The EKG, CXR, and paired pools have different PE prevalence, so AUPRC values across pools are not directly comparable; the `(x prevalence baseline)` multiplier helps with that comparison.