PE Diagnostic Model Report

Author

Awan

Published

April 30, 2026

1 Executive Summary

The models appear to be learning pieces of the clinical context around PE, but not PE itself cleanly.

Technical assessment: I did not find evidence that the main results are explained by a simple data-structure problem. The expected lockbox/validation sizes line up, labels are present and concordant across sources, and the available diagnostic checks did not point to a patient-join problem or to a score-direction problem (a model whose higher scores actually correspond to lower PE risk).

Full EKG-only model: The EKG model appears to be picking up physiologic stress rather than a PE-specific electrical pattern. Its modest discrimination (AUROC 0.622, AUPRC 0.350, about 1.5x prevalence) appears to come mostly from encounters with more evident cardiopulmonary strain, and its false positives cluster in patients with tachycardia, higher shock index, and report language suggesting hypoxia. That signal is not useless, but it suggests the model may be learning severity or stress patterns associated with PE workups rather than identifying PE itself.

Full CXR-only model: The CXR embedding model has a view-position problem. Its predictions are strongly tied to AP (anteroposterior) CXRs, with many PE-negative AP studies called positive. AP status comes from the DICOM image metadata (VIEWPOSITION) and likely reflects a different clinical context, since AP films are often obtained as portable bedside studies in less mobile or more acutely ill patients. PA (posteroanterior) studies behave much better.

Paired-data Fusion model: The fusion model does not solve the standalone-model problems. On the 177-patient paired validation set, the best fusion model (A037) does not outperform the paired CXR-only model at the matched high-sensitivity cutoff. It catches the same number of PE-positive cases (34) and misses the same number (8) as CXR-only, but it incorrectly calls more PE-negative patients positive (87 vs 77). Fusion mostly inherits the CXR-embedding branch’s error pattern rather than adding a useful complementary signal from EKG. One reason fusion gains little from EKG is that the paired EKG-only branch is essentially at chance on the 177-subject paired lockbox (AUROC 0.506).

2 Sample Context And Derived Characteristics

I made a subject-level master frame. Each row is one patient/subject. The full frame has 5,289 subjects, formed from the union of the EKG and CXR pools:

Pool	Total pool	Tuning	Validation/Lockbox
EKG	2,925	2,340	585
CXR	3,250	2,601	649
Paired EKG+CXR	886	709	177
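
As a quick arithmetic check on the pool sizes above, the union count follows from inclusion-exclusion if the paired pool is exactly the overlap of the EKG and CXR pools. That overlap assumption is mine and is not stated by the pipeline; the variable names are illustrative.

```python
# Pool sizes copied from the table above: (total, tuning, validation/lockbox).
pools = {
    "EKG": (2925, 2340, 585),
    "CXR": (3250, 2601, 649),
    "Paired EKG+CXR": (886, 709, 177),
}

# Each pool's tuning and lockbox counts should add up to its total.
split_ok = {
    name: tuning + lockbox == total
    for name, (total, tuning, lockbox) in pools.items()
}

# Assumption: the paired pool is exactly the EKG/CXR overlap, so
# |EKG or CXR| = |EKG| + |CXR| - |paired|.
union_size = pools["EKG"][0] + pools["CXR"][0] - pools["Paired EKG+CXR"][0]

print(split_ok)
print(union_size)  # 5289, matching the master-frame size
```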

The 177 subjects in the paired validation set are the clean comparison set for Section 3, because all paired and fusion scores are available there.

Terminology note: when I refer to the paired EKG-only or paired CXR-only model, I mean a single-modality branch model trained within the paired-data pool: the n=886 patients who had both an EKG and a CXR available. These are different from the full EKG-only and full CXR-only models, which were trained on their larger modality-specific pools.

For the descriptive clinical strata, I used the following definitions:

Characteristic	How it is defined here
PE type	Acute and Subsegmental count as positive labels. No PE, Chronic, and Equivocal count as negative labels.
Tachycardia	Heart rate `>=100`. Bins are `<80`, `80-99`, `100-119`, and `>=120`.
O2 saturation	Bins are `<90`, `90-94`, and `>=95`. I call these O2 saturation strata, not hypoxia.
Shock index	Heart rate divided by systolic blood pressure. Bins are `<0.7`, `0.7 to <0.9`, and `>=0.9`.
CXR viewpoint	AP versus PA acquisition context. AP often means portable/sicker-patient context; PA is usually a more standard upright acquisition.
Report-text flags	These are keyword indicators from the radiology report text. They are used only to stratify model behavior. They should not be interpreted as adjudicated findings because the current extraction is not negation-aware; for example, “no pleural effusion” can still trigger the effusion search.
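
The definitions in the table above can be sketched as small helper functions. This is a minimal illustration, not the pipeline code: the function names are hypothetical, the label strings are assumed to match the PE type names used here, and I assume values falling exactly on a bin edge go to the higher bin.

```python
import re

POSITIVE_TYPES = {"Acute", "Subsegmental"}
NEGATIVE_TYPES = {"No PE", "Chronic", "Equivocal"}

def binary_label(pe_type):
    """Map a PE type to the binary label used in this report."""
    if pe_type in POSITIVE_TYPES:
        return 1
    if pe_type in NEGATIVE_TYPES:
        return 0
    raise ValueError(f"unexpected PE type: {pe_type!r}")

def hr_bin(hr):
    """Heart-rate bin; tachycardia corresponds to the two bins at or above 100."""
    if hr is None:
        return "missing"
    if hr < 80:
        return "<80"
    if hr < 100:
        return "80-99"
    if hr < 120:
        return "100-119"
    return ">=120"

def o2_bin(spo2):
    """O2 saturation stratum (assumes values below 95 fall in 90-94)."""
    if spo2 is None:
        return "missing"
    if spo2 < 90:
        return "<90"
    if spo2 < 95:
        return "90-94"
    return ">=95"

def shock_index_bin(hr, sbp):
    """Shock index = heart rate / systolic BP; upper edge of 0.7-0.9 is exclusive."""
    if hr is None or sbp is None or sbp <= 0:
        return "missing"
    si = hr / sbp
    if si < 0.7:
        return "<0.7"
    if si < 0.9:
        return "0.7-0.9"
    return ">=0.9"

def keyword_flag(report_text, keyword):
    """Naive keyword match with no negation handling."""
    return re.search(re.escape(keyword), report_text, flags=re.IGNORECASE) is not None

# A negated mention still triggers the flag, which is why these flags
# are stratifiers only and not adjudicated findings.
print(keyword_flag("No pleural effusion.", "effusion"))  # True
```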

The three lockboxes have different case mixes:

Characteristic	EKG lockbox N=585	CXR lockbox N=649	Paired lockbox N=177
Positive / negative	137 / 448	52 / 597	42 / 135
Acute PE	116	39	33
Subsegmental PE	21	13	9
Chronic	15	13	4
Equivocal	3	4	1
No PE	430	580	130
Heart rate missing	201	105	25
O2 saturation missing	249	170	41
Shock index missing	204	111	27
AP / PA CXR	NA	334 / 315	88 / 89

The paired lockbox has a higher PE prevalence than the CXR-only lockbox, so AUPRC should not be compared casually across pools. For side-by-side model comparison, the 177-subject paired lockbox is the fairer setting.

2.1 Operating Points

I show two cutoffs.

The main cutoff is the model-specific threshold that reaches sensitivity at or just above 0.80 on the relevant lockbox. This answers: if we force the model to catch roughly 80% of PE positives, where do the false positives and false negatives concentrate? Each model produces scores on its own scale, so the cutoff needed to catch about 80% of PE-positive cases differs across models.
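
One way to pick such a cutoff is sketched below. It assumes a subject is called positive when score >= cutoff; the function names are hypothetical and this is not necessarily how the pipeline selected its thresholds.

```python
import math

def positives_needed(n_pos, target=0.80):
    """Smallest number of positives that must be caught to reach the target."""
    # The small epsilon guards against floating-point overshoot in target * n_pos.
    return max(1, math.ceil(target * n_pos - 1e-9))

def high_sensitivity_cutoff(scores, labels, target=0.80):
    """Return the largest cutoff whose sensitivity is at or above `target`."""
    positive_scores = sorted(
        (s for s, y in zip(scores, labels) if y == 1), reverse=True
    )
    if not positive_scores:
        raise ValueError("need at least one positive label")
    k = positives_needed(len(positive_scores), target)
    # Calling everything at or above the k-th highest positive score positive
    # catches at least k positives (more if scores tie at that value).
    return positive_scores[k - 1]

# Toy example with five positives and three negatives.
toy_scores = [0.9, 0.7, 0.5, 0.3, 0.1, 0.8, 0.4, 0.2]
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0]
print(high_sensitivity_cutoff(toy_scores, toy_labels))  # 0.3

# With the lockbox positive counts from this report:
for n_pos in (137, 52, 42):
    k = positives_needed(n_pos)
    print(n_pos, k, round(k / n_pos, 3))
```

Because sensitivity can only move in steps of one positive case, the achieved value lands at or just above 0.80 rather than exactly on it, which is why the tables show 0.803, 0.808, and 0.810.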

AUROC summarizes ranking quality (0.5 is random, 1.0 is perfect). AUPRC summarizes precision-recall and is shown as value (x prevalence baseline), where 1.0x means no gain over the positive-case rate.
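
The prevalence multiplier is simply AUPRC divided by the positive-case rate of the lockbox being scored. A minimal sketch using values reported in the model sections of this report (the function name is mine):

```python
def auprc_lift(auprc, n_pos, n_total):
    """AUPRC expressed as a multiple of the positive-case rate."""
    prevalence = n_pos / n_total
    return auprc / prevalence

# (model, AUPRC, positives, lockbox size) as reported in this document.
lockbox_models = [
    ("EKG cfg0017", 0.350, 137, 585),
    ("CXR embedding cfg0078", 0.114, 52, 649),
    ("Raw CXR cfg0023", 0.097, 52, 649),
    ("Paired CXR-only", 0.328, 42, 177),
]

for name, auprc, n_pos, n_total in lockbox_models:
    prevalence = n_pos / n_total
    lift = auprc_lift(auprc, n_pos, n_total)
    print(f"{name}: prevalence {prevalence:.3f}, lift {lift:.1f}x")
```

The raw AUPRC values differ by a factor of about three between the EKG and CXR lockboxes, while the multipliers are close, which is the reason to read the multiplier rather than the raw value across pools.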

The second cutoff is 0.50. This is not clinically meaningful for most of these models. It is useful mainly as a calibration and score-scale diagnostic. At 0.50, several models classify every subject as negative.

Sensitivity is anchored near 0.80 by design; specificity varies sharply across models, with raw CXR weakest and paired CXR-only strongest.

Because all sampled inputs were already inside the 0-48h pre-CTPA window, timing is treated below as a within-window stratification variable in each model section.

3 Section 1. Full EKG Standalone Model

Model: EKG standalone cfg0017
Cohort: 585-subject EKG lockbox

Cutoff	AUROC	AUPRC	Sensitivity	Specificity	TP	FP	TN	FN
Sensitivity >=0.80 cutoff, `0.249`	0.622	0.350 (1.5x)	0.803	0.382	110	277	171	27
Fixed 0.50 cutoff	0.622	0.350 (1.5x)	0.000	1.000	0	0	448	137

The EKG model has modest ranking performance. At the high-sensitivity cutoff it catches 110/137 (80.3%) PE positives, but it also flags 277/448 (61.8%) negatives as positive.
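
For reference, the rates in the table follow directly from the four confusion counts. A small sketch, with an illustrative function name:

```python
def rates(tp, fp, tn, fn):
    """Basic operating-point rates from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
    }

# EKG standalone cfg0017 at the high-sensitivity cutoff (counts from the table).
ekg = rates(tp=110, fp=277, tn=171, fn=27)
for name, value in ekg.items():
    print(f"{name}: {value:.3f}")
```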

The clinically interesting part is where the errors concentrate. The model catches acute PE better than subsegmental PE:

PE type	Positives caught at high-sensitivity cutoff
Acute PE	96/116 (82.8%)
Subsegmental PE	14/21 (66.7%)

The false positives are concentrated in physiologic-stress strata:

Stratum	False positives / true negatives among negatives
HR 100-119	63 FP / 11 TN (85.1% FP)
HR >=120	23 FP / 2 TN (92.0% FP)
Shock index >=0.9	38 FP / 4 TN (90.5% FP)
CTPA text mentions tachycardia	60 FP / 8 TN (88.2% FP)
CTPA text mentions hypoxia/hypoxemia	56 FP / 7 TN (88.9% FP)

EKG false positives are concentrated in physiologic-stress strata, especially tachycardia and high shock index.
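
To put the stratum percentages in context, the sketch below compares each stratum's false-positive share among negatives with the overall share in the EKG lockbox. Counts are copied from the tables above; the structure is illustrative only.

```python
# (false positives, true negatives) among PE-negative subjects in each stratum.
strata = {
    "HR 100-119": (63, 11),
    "HR >=120": (23, 2),
    "Shock index >=0.9": (38, 4),
    "CTPA text mentions tachycardia": (60, 8),
    "CTPA text mentions hypoxia/hypoxemia": (56, 7),
}
overall_fp, overall_tn = 277, 171  # all negatives in the EKG lockbox

def fp_share(fp, tn):
    """Fraction of negatives that the model calls positive."""
    return fp / (fp + tn)

overall = fp_share(overall_fp, overall_tn)
print(f"Overall: {100 * overall:.1f}% FP")
for name, (fp, tn) in strata.items():
    share = fp_share(fp, tn)
    print(f"{name}: {100 * share:.1f}% FP ({100 * (share - overall):+.1f} points)")
```

Every listed stratum sits more than 20 percentage points above the overall false-positive share, which is the concentration the text describes.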

My interpretation is that the EKG model is partly identifying patients who look clinically stressed. That overlaps with PE, but it also overlaps with many non-PE illnesses. This is why the model can catch many acute cases while still producing many false positives.

3.1 EKG Timing Within 0-48h

Most EKG lockbox records are close to CTPA time. The <=6h bucket contains 449/585 subjects, while >6h to 48h contains 136/585. In the timing tables, TP, FP, TN, and FN are shown as n (% of row), so the four percentages sum to approximately 100% within each row.

EKG-to-CTPA bucket	N	Pos/Neg	TP	FP	TN	FN	Sensitivity	Specificity
<=6h	449	109/340	87 (19.4%)	204 (45.4%)	136 (30.3%)	22 (4.9%)	0.798	0.400
>6h to 48h	136	28/108	23 (16.9%)	73 (53.7%)	35 (25.7%)	5 (3.7%)	0.821	0.324

I do not see a clean EKG timing gradient. Specificity is somewhat lower after 6 hours, but the main EKG story remains physiologic stress rather than time-before-CTPA.
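
A simple consistency check on the timing table: the bucket counts should add back to the overall EKG confusion counts, and the per-bucket rates can be recomputed from them. This is a sketch with illustrative variable names.

```python
buckets = {
    "<=6h": {"tp": 87, "fp": 204, "tn": 136, "fn": 22},
    ">6h to 48h": {"tp": 23, "fp": 73, "tn": 35, "fn": 5},
}
overall = {"tp": 110, "fp": 277, "tn": 171, "fn": 27}

# Column totals across buckets should reproduce the overall counts.
totals = {k: sum(b[k] for b in buckets.values()) for k in overall}
print(totals == overall)

bucket_rates = {}
for name, b in buckets.items():
    n = sum(b.values())
    row_pct = {k: round(100 * v / n, 1) for k, v in b.items()}
    bucket_rates[name] = {
        "n": n,
        "sensitivity": b["tp"] / (b["tp"] + b["fn"]),
        "specificity": b["tn"] / (b["tn"] + b["fp"]),
    }
    print(name, n, row_pct)
```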

I would be careful not to say the EKG model has learned an EKG signature of PE. The evidence is more consistent with a severity or physiologic-stress signal.

4 Section 2. Full CXR Standalone Models

Models: CXR embedding cfg0078 and raw-image CXR cfg0023
Cohort: 649-subject CXR lockbox

Model	Cutoff	AUROC	AUPRC	Sensitivity	Specificity	TP	FP	TN	FN
CXR embedding `cfg0078`	0.057	0.615	0.114 (1.4x)	0.808	0.348	42	389	208	10
Raw CXR `cfg0023`	0.004	0.500	0.097 (1.2x)	0.808	0.154	42	505	92	10

The embedding model is the only CXR model worth discussing seriously. The raw CXR model is near-null by AUROC and reaches high sensitivity only by using an extremely low cutoff, which creates 505 false positives among 597 negatives.

The main CXR embedding finding is the AP/PA split:

CXR view	Positives caught	False positives / true negatives among negatives	Specificity
AP	29/31 (93.5%)	267 FP / 36 TN (88.1% FP)	0.119
PA	13/21 (61.9%)	122 FP / 172 TN (41.5% FP)	0.585

CXR specificity is much lower for AP studies than PA studies, supporting the portable/sicker-patient context interpretation.
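
The pooled CXR embedding specificity is a weighted average of the two view-specific values, weighted by the number of negatives in each view. The sketch below reproduces that from the table counts; the false-negative counts are derived as positives minus positives caught.

```python
views = {
    "AP": {"tp": 29, "fn": 2, "fp": 267, "tn": 36},
    "PA": {"tp": 13, "fn": 8, "fp": 122, "tn": 172},
}

def specificity(c):
    return c["tn"] / (c["tn"] + c["fp"])

n_neg = {view: c["tn"] + c["fp"] for view, c in views.items()}
total_neg = sum(n_neg.values())

# Pooled specificity computed directly from summed counts...
pooled = sum(c["tn"] for c in views.values()) / total_neg
# ...equals the negative-count-weighted average of the per-view values.
weighted = sum(
    specificity(c) * n_neg[view] / total_neg for view, c in views.items()
)

for view, c in views.items():
    print(f"{view}: specificity {specificity(c):.3f} on {n_neg[view]} negatives")
print(f"Pooled: {pooled:.3f}")
```

With roughly equal numbers of AP and PA negatives, the pooled 0.348 hides a gap between about 0.12 and 0.59.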

This is a large and clinically interpretable pattern. AP chest x-rays are often portable studies and tend to come from sicker patients. The model appears to treat that acquisition context as a strong PE-associated signal. That helps sensitivity among PE-positive AP studies (29/31 caught), but it causes very poor specificity among PE-negative AP studies (0.119).

The report-text strata point in the same direction. When CXR text contains limited/portable language, the embedding model catches 20/22 (90.9%) positives but calls 172/199 (86.4%) negatives positive. This does not mean the text was used by the model; it means the image context and report context are aligned in a way that helps us diagnose model behavior.

4.1 CXR Timing Within 0-48h

CXR timing shows a stronger pattern than EKG timing: specificity is better in the <=6h bucket and worse in the >6h to 48h bucket. But this is not clean evidence that elapsed time itself is the mechanism, because the later CXR bucket is much more AP/portable.

CXR-to-CTPA bucket	N	Pos/Neg	TP	FP	TN	FN	Sensitivity	Specificity	AP percent
<=6h	413	32/381	25 (6.1%)	224 (54.2%)	157 (38.0%)	7 (1.7%)	0.781	0.412	40.7%
>6h to 48h	236	20/216	17 (7.2%)	165 (69.9%)	51 (21.6%)	3 (1.3%)	0.850	0.236	70.3%

This timing pattern supports the AP/portable-context interpretation. After 6 hours, the model catches a slightly larger fraction of positives, but it also calls many more negatives positive. The raw CXR model remains weak across timing buckets and does not change that conclusion.
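
To see how strongly timing and view are entangled, the AP percentages in the table can be turned back into approximate counts. These are reconstructed from rounded percentages, so they are implied counts rather than values read from the data.

```python
buckets = {
    "<=6h": {"n": 413, "ap_percent": 40.7},
    ">6h to 48h": {"n": 236, "ap_percent": 70.3},
}
lockbox_ap_total = 334  # AP studies in the CXR lockbox, from the case-mix table

# Approximate AP counts implied by the rounded percentages.
implied_ap = {
    name: round(b["n"] * b["ap_percent"] / 100) for name, b in buckets.items()
}

print(implied_ap)
print(sum(implied_ap.values()) == lockbox_ap_total)
```

The later bucket holds about 36% of the subjects but about half of the AP studies, so a timing contrast in this lockbox is partly a view contrast.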

The CXR model also catches acute PE better than subsegmental PE:

PE type	Positives caught at high-sensitivity cutoff
Acute PE	33/39 (84.6%)
Subsegmental PE	9/13 (69.2%)

I would frame the CXR embedding result as a real but limited signal. It is not simply random, but a large part of its behavior may be driven by AP/portable/sicker-patient context rather than PE-specific radiographic evidence.

5 Section 3. Paired Lockbox And Fusion

Cohort: 177-subject paired lockbox
Case mix: 42 positive, 135 negative
Models: paired EKG-only, paired CXR-only, and Fusion A037. I also include the broad-pool standalone scores restricted to these same 177 patients.

5.1 Paired-Trained Models On The Same 177 Subjects

Model	Cutoff	AUROC	AUPRC	Sensitivity	Specificity	TP	FP	TN	FN
Paired EKG-only	0.351	0.506	0.244 (1.0x)	0.810	0.259	34	100	35	8
Paired CXR-only	0.420	0.632	0.328 (1.4x)	0.810	0.430	34	77	58	8
Fusion A037	0.305	0.565	0.291 (1.2x)	0.810	0.356	34	87	48	8

This is the central fusion result. A037 does not improve over paired CXR-only. At the same sensitivity, it has the same 34 true positives and 8 false negatives, but it creates 10 more false positives than paired CXR-only.

At matched sensitivity, A037 has the same true positives as paired CXR-only but more false positives.
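
Because all three paired-trained models sit at the same sensitivity, the comparison reduces to false-positive counts on the same 135 negatives. A sketch using the table counts, with paired CXR-only as the reference:

```python
paired = {
    "Paired EKG-only": {"tp": 34, "fp": 100, "tn": 35, "fn": 8},
    "Paired CXR-only": {"tp": 34, "fp": 77, "tn": 58, "fn": 8},
    "Fusion A037": {"tp": 34, "fp": 87, "tn": 48, "fn": 8},
}
reference = paired["Paired CXR-only"]

extra_fp = {}
for name, c in paired.items():
    same_tp = c["tp"] == reference["tp"]
    extra_fp[name] = c["fp"] - reference["fp"]
    spec = c["tn"] / (c["tn"] + c["fp"])
    print(f"{name}: same TP as reference={same_tp}, "
          f"extra FP={extra_fp[name]:+d}, specificity={spec:.3f}")
```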

The complementarity story also does not hold well. Fusion preserves most CXR-only successes, but it preserves only 6/29 (20.7%) EKG-only successes and rescues only 3/56 (5.4%) cases where both branches were wrong. In plain terms: the fusion model does not appear to be combining two independent useful signals in a way that improves the final decision.
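
A complementarity tally of this kind can be sketched as below. The group definitions are my assumption about what the counts mean: "EKG-only right" is a subject the EKG branch classifies correctly and the CXR branch does not, and so on. The inputs are toy values, not report data, and the function name is hypothetical.

```python
from collections import Counter

def complementarity(ekg_ok, cxr_ok, fusion_ok):
    """For each branch-agreement group, return (fusion correct, group size)."""
    group_size = Counter()
    fusion_correct = Counter()
    for e, c, f in zip(ekg_ok, cxr_ok, fusion_ok):
        if e and c:
            group = "both right"
        elif c:
            group = "CXR-only right"
        elif e:
            group = "EKG-only right"
        else:
            group = "both wrong"
        group_size[group] += 1
        fusion_correct[group] += int(f)
    return {g: (fusion_correct[g], group_size[g]) for g in group_size}

# Toy per-subject correctness flags at each model's own cutoff.
ekg_ok = [True, True, False, False, True, False]
cxr_ok = [True, False, True, False, False, True]
fusion_ok = [True, False, True, False, True, True]

for group, (kept, size) in complementarity(ekg_ok, cxr_ok, fusion_ok).items():
    print(f"{group}: fusion correct on {kept}/{size}")
```

A fusion model that truly combined two useful signals would be expected to keep a high share of both single-branch successes, not mainly the CXR ones.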

5.2 Paired/Fusion Timing Within 0-48h

In the paired lockbox, the same CXR timing/context pattern appears. Paired CXR-only is more specific in the <=6h CXR bucket, while the >6h to 48h bucket is more AP-heavy and has worse specificity.

Model	Timing axis	Bucket	N	TP	FP	TN	FN	Sensitivity	Specificity
Paired EKG-only	EKG-to-CTPA	<=6h	137	28 (20.4%)	75 (54.7%)	26 (19.0%)	8 (5.8%)	0.778	0.257
Paired EKG-only	EKG-to-CTPA	>6h to 48h	40	6 (15.0%)	25 (62.5%)	9 (22.5%)	0 (0.0%)	1.000	0.265
Paired CXR-only	CXR-to-CTPA	<=6h	115	20 (17.4%)	41 (35.7%)	47 (40.9%)	7 (6.1%)	0.741	0.534
Paired CXR-only	CXR-to-CTPA	>6h to 48h	62	14 (22.6%)	36 (58.1%)	11 (17.7%)	1 (1.6%)	0.933	0.234
Fusion A037	CXR-to-CTPA	<=6h	115	21 (18.3%)	52 (45.2%)	36 (31.3%)	6 (5.2%)	0.778	0.409
Fusion A037	CXR-to-CTPA	>6h to 48h	62	13 (21.0%)	35 (56.5%)	12 (19.4%)	2 (3.2%)	0.867	0.255

Timing does not change the fusion conclusion. In the paired data, CXR-only is still better than A037 in the main <=6h CXR group. After 6 hours, both CXR-only and A037 make many false-positive calls. That later group is also much more AP-heavy: 74.2% AP after 6 hours versus 36.5% AP within 6 hours. So I would not read this as a pure timing effect. It looks more like later CXR timing is mixed with the same AP/portable-patient context that drives the CXR false positives.

5.3 Broad-Pool Standalone Scores Restricted To The Same 177 Subjects

This is a useful check because the broad EKG and broad CXR models were trained on their larger modality-specific pools, then applied to the same paired 177 subjects.

Broad score on paired 177	AUROC	AUPRC	Cutoff from broad lockbox	Sensitivity	Specificity	TP	FP	TN	FN
Broad EKG `cfg0017`	0.637	0.370 (1.6x)	0.249	0.905	0.333	38	90	45	4
Broad CXR embedding `cfg0078`	0.649	0.335 (1.4x)	0.057	0.833	0.400	35	81	54	7
Broad raw CXR `cfg0023`	0.514	0.268 (1.1x)	0.004	0.833	0.148	35	115	20	7

This strengthens the main point: A037 is not clearly better than the standalone scores on the same patients. The broad CXR embedding score restricted to the paired 177 catches one more positive than A037 (35 vs 34), with six fewer false positives (81 vs 87). AUROC (0.649 vs 0.565) and AUPRC (0.335 vs 0.291) point in the same direction.
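
Note that these rows reuse the cutoff chosen on each broad lockbox, so sensitivity is no longer anchored near 0.80 on the paired subjects. Applying a frozen cutoff to a different cohort can be sketched as follows, assuming a subject is called positive when score >= cutoff; the scores are toy values and the function name is hypothetical.

```python
def confusion_at(scores, labels, cutoff):
    """Confusion counts when a frozen cutoff is applied to a cohort."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, label in zip(scores, labels):
        called_positive = score >= cutoff
        if label == 1:
            counts["tp" if called_positive else "fn"] += 1
        else:
            counts["fp" if called_positive else "tn"] += 1
    return counts

# Toy cohort scored with a cutoff that was tuned elsewhere.
frozen_cutoff = 0.249
toy_scores = [0.30, 0.26, 0.20, 0.10, 0.05]
toy_labels = [1, 1, 0, 1, 0]

counts = confusion_at(toy_scores, toy_labels, frozen_cutoff)
sensitivity = counts["tp"] / (counts["tp"] + counts["fn"])
print(counts, round(sensitivity, 3))
```

This is why the broad EKG row shows sensitivity 0.905 on the paired 177 with the same 0.249 cutoff that gave 0.803 on the EKG lockbox.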

5.4 Do The Standalone Error Patterns Hold In The Paired Lockbox?

Yes, broadly.

For CXR, the AP/portable pattern persists:

Model on paired 177	AP positives caught	AP false positives / true negatives	PA positives caught	PA false positives / true negatives
Paired CXR-only	25/26 (96.2%)	56 FP / 6 TN (90.3% FP)	9/16 (56.2%)	21 FP / 52 TN (28.8% FP)
Fusion A037	24/26 (92.3%)	57 FP / 5 TN (91.9% FP)	10/16 (62.5%)	30 FP / 43 TN (41.1% FP)

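Fusion does not fix the AP false-positive problem. Compared with paired CXR-only, it adds one AP false positive (57 vs 56) and nine PA false positives (30 vs 21), so most of its extra false positives come from PA studies.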

For PE type, subsegmental PE remains a weak point:

Model	Acute PE caught	Subsegmental PE caught
Paired EKG-only	29/33 (87.9%)	5/9 (55.6%)
Paired CXR-only	28/33 (84.8%)	6/9 (66.7%)
Fusion A037	29/33 (87.9%)	5/9 (55.6%)

Across paired models, subsegmental PE is caught less reliably than acute PE.

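The chronic/equivocal negatives are also frequently overcalled at the high-sensitivity threshold. The denominators indicate that the standalone rows are evaluated on each model's own lockbox, while the paired and fusion rows use the paired lockbox: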

Model	Chronic/equivocal negatives classified positive
EKG standalone	14/18 (77.8%)
CXR embedding standalone	13/17 (76.5%)
Raw CXR standalone	15/17 (88.2%)
Paired CXR-only	5/5 (100.0%)
Fusion A037	4/5 (80.0%)

These are small strata, but clinically important. Chronic and equivocal cases sit near the boundary of how we define the binary label, so they should be handled as a label-boundary caveat rather than treated as ordinary clean negatives.
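
One way to make this caveat concrete is to recompute specificity on clean "No PE" negatives only. The sketch below subtracts the chronic/equivocal counts from each model's negatives; it assumes the standalone rows refer to each model's own lockbox and the paired rows to the paired lockbox, which is consistent with the denominators but is my reading of the table.

```python
# fp/tn: all negatives at the high-sensitivity cutoff.
# ce_fp/ce_n: chronic/equivocal negatives called positive, and their total.
models = {
    "EKG standalone": {"fp": 277, "tn": 171, "ce_fp": 14, "ce_n": 18},
    "CXR embedding standalone": {"fp": 389, "tn": 208, "ce_fp": 13, "ce_n": 17},
    "Raw CXR standalone": {"fp": 505, "tn": 92, "ce_fp": 15, "ce_n": 17},
    "Paired CXR-only": {"fp": 77, "tn": 58, "ce_fp": 5, "ce_n": 5},
    "Fusion A037": {"fp": 87, "tn": 48, "ce_fp": 4, "ce_n": 5},
}

def clean_negative_counts(m):
    """(FP, TN) restricted to 'No PE' negatives."""
    clean_fp = m["fp"] - m["ce_fp"]
    clean_tn = m["tn"] - (m["ce_n"] - m["ce_fp"])
    return clean_fp, clean_tn

for name, m in models.items():
    spec_all = m["tn"] / (m["tn"] + m["fp"])
    clean_fp, clean_tn = clean_negative_counts(m)
    spec_clean = clean_tn / (clean_tn + clean_fp)
    print(f"{name}: specificity {spec_all:.3f} (all negatives), "
          f"{spec_clean:.3f} (No PE only, n={clean_fp + clean_tn})")
```

The shifts are small because the strata are small, so excluding these cases would not change the ranking of models; the caveat is about how individual cases are interpreted.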

5.5 Fixed 0.50 Cutoff

The 0.50 threshold mostly shows that the scores are not on a common calibrated probability scale.

At 0.50, EKG standalone, CXR embedding standalone, paired EKG-only, Fusion A037, broad EKG-on-177, and broad CXR-embedding-on-177 classify every subject as negative. Paired CXR-only and the raw CXR models do produce some positive calls at 0.50, but paired CXR-only sensitivity falls to 0.524 and raw CXR sensitivity stays below 0.10. None of this changes the operating-point story; it mainly shows that 0.50 is not a shared clinical threshold across models.

6 My Takes

First, the master data structure is sound enough for this diagnostic report. The row unit is one subject, labels are not missing, and the expected lockbox sizes line up.

Second, the main clinical story is not “fusion improves PE detection.” The better story is “the models expose different non-PE-specific signals.” EKG leans toward physiologic stress. CXR embedding leans toward AP/portable imaging context. Both signals overlap with PE, but both also generate many false positives.

Third, the CXR embedding model is the strongest single branch in the paired setting, but its AP specificity problem is substantial. If Barbara wants to clinically inspect anything first, I would inspect the false positives among PE-negative AP studies and the subsegmental false negatives.

Fourth, subsegmental PE is a consistent miss pattern. This makes clinical sense: subsegmental PE is smaller and may produce weaker indirect physiologic or imaging-context signals.

Fifth, chronic/equivocal negatives should be explicitly caveated. These are binary negatives in the analysis, but they are not clinically the same as clean “no PE” negatives.

7 Caveats

Some subjects have multiple CXR records available, but the primary CXR pipeline selected one record per subject using a PA-first rule: prefer PA if available, otherwise AP, then smallest dicom_id as a tie-breaker. This means the AP false-positive pattern is not explained by choosing AP over PA when both were available; selected AP studies mostly represent subjects without a PA alternative, so AP is better interpreted as an acquisition-context marker.
Section 3 uses the paired/fusion CXR metadata for CXR descriptors such as AP/PA view and CXR-to-CTPA timing. This matters because the standalone CXR selector and the paired/fusion dataset can occasionally point to different CXR records for the same subject.
Vitals are descriptive context. The vitals file already contains one selected vitals row per subject, drawn from measurements within 0-48 hours before that row’s CTPA. For some subjects, that vitals row could only be matched by subject_id, so it may not correspond to the exact same admission or CTPA event as the EKG/CXR used in this report. This subject-level fallback occurred for about 25% of EKG-pool subjects, 32% of CXR-pool subjects, and 30% of paired-pool subjects.
Report-text flags are not negation-aware and should not be described as adjudicated clinical findings.
AUPRC depends on the underlying positive-case rate. The EKG, CXR, and paired pools have different PE prevalence, so AUPRC values across pools are not directly comparable; the (x prevalence baseline) multiplier helps with that comparison.
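
The PA-first selection rule described in the first caveat can be sketched as follows. This is an illustration of the stated rule, not the pipeline code: the `view_position` field name is hypothetical, and I assume `dicom_id` values compare in the intended order (plain string comparison here).

```python
def select_cxr(records):
    """Pick one CXR record per subject: prefer PA, otherwise AP,
    then the smallest dicom_id as a tie-breaker."""
    view_rank = {"PA": 0, "AP": 1}
    candidates = [r for r in records if r["view_position"] in view_rank]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (view_rank[r["view_position"]], r["dicom_id"]),
    )

# Toy records for two subjects.
subject_with_pa = [
    {"dicom_id": "b2", "view_position": "AP"},
    {"dicom_id": "c3", "view_position": "PA"},
    {"dicom_id": "a1", "view_position": "PA"},
]
subject_ap_only = [
    {"dicom_id": "z9", "view_position": "AP"},
    {"dicom_id": "y8", "view_position": "AP"},
]

print(select_cxr(subject_with_pa)["dicom_id"])   # a1
print(select_cxr(subject_ap_only)["dicom_id"])   # y8
```

Under this rule an AP record is selected only when the subject has no PA record, which is why a selected AP study reads as an acquisition-context marker.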