We evaluated 6 kernel configurations (K0-K5) spanning receptive fields from 82ms to 2,598ms on a 2-layer Conv2D architecture for pulmonary embolism detection from raw 12-lead EKG waveforms. 2,160 experiments (6 kernels x 3 class weights x 6 LRs x 2 dropouts x 2 data splits x 5 seeds).
Key results:
Kernel size significantly affects AUROC (Kruskal-Wallis p < 1e-14). K3 (1,122ms RF) has the highest median AUROC (0.590) and AUPRC (0.290, ~1.3x prevalence). K3 outperforms K0-K2 (Wilcoxon p < 0.001, BH-adjusted). K3 vs K4/K5: not significant.
Class weight does not affect AUROC/AUPRC. All three strategies produce overlapping distributions. Class weight affects binary predictions at cutoff 0.50: inverse_freq eliminates collapse; other strategies require lower cutoffs.
Learning rate and dropout have limited effect on AUROC within the tested ranges.
EKG alone: ~0.59 AUROC. Above chance but limited.
Methods Note: A critical optimizer bug (GPU parameters not updating after .to(cuda)) was discovered and fixed prior to these runs. All 2,160 experiments reported here used the corrected training pipeline.
Terminology
AUROC (Area Under the Receiver Operating Characteristic Curve): Measures the model’s ability to distinguish PE-positive from PE-negative patients across all possible decision thresholds. Ranges from 0.5 (random chance) to 1.0 (perfect discrimination). Threshold-independent.
AUPRC (Area Under the Precision-Recall Curve): Measures discrimination performance with emphasis on the positive (PE) class. More informative than AUROC when the positive class is rare. Baseline equals the prevalence (~0.232 for our EKG-only cohort, ~0.235 for the paired validation cohort). Values are reported with a multiple of prevalence in parentheses (e.g., 0.292 (~1.3x) means 1.3 times the baseline). Threshold-independent.
Sensitivity (Recall, True Positive Rate): Proportion of actual PE cases correctly identified. A sensitivity of 80% means 80 out of 100 PE patients are flagged. Critical for screening, as missed PE cases can be fatal.
Specificity (True Negative Rate): Proportion of actual non-PE cases correctly identified. A specificity of 60% means 60 out of 100 non-PE patients are correctly cleared.
PPV (Positive Predictive Value, Precision): Among patients the model flags as PE-positive, the proportion who actually have PE.
NPV (Negative Predictive Value): Among patients the model clears as PE-negative, the proportion who truly do not have PE.
F1 Score: Harmonic mean of precision (PPV) and sensitivity. Balances the trade-off between catching PE cases and avoiding false alarms. Ranges from 0 to 1.
Youden’s J (Youden Index): Sensitivity + Specificity - 1. Measures how much better the model is than random guessing (J=0). A model with 60% sensitivity and 60% specificity has J=0.20.
Receptive Field (RF): The temporal span of raw EKG signal that a single output unit of the convolutional network can “see.” A 1,122ms receptive field means each feature captures approximately 1-2 cardiac beats depending on heart rate (e.g., ~1.4 beats at 75 bpm, ~1.9 beats at 100 bpm). [See full explanation in the experiment design section below]
Collapse: When a model predicts the same class for all patients (e.g., all PE-negative), producing 0% sensitivity or 0% specificity. Common in imbalanced classification without proper class weighting.
Inverse Frequency Weighting: A class weighting strategy that penalizes misclassification of the minority class (PE-positive) proportional to its rarity. With ~23% PE prevalence, the positive class receives ~3.3x the loss weight of the negative class.
Cutoff (Decision Threshold): The probability above which the model classifies a patient as PE-positive. A cutoff of 0.50 means any predicted probability above 50% is classified as PE+. Lower cutoffs increase sensitivity at the cost of specificity.
2 Experiment Design
2.1 Architecture
Input: 1 x 12 x 5,000 (channel x leads x timepoints). Raw 12-lead EKG waveform sampled at 500Hz for 10 seconds (5,000 = 500Hz x 10s), unsqueezed to a single-channel 2D tensor where leads are the height dimension and time is the width dimension.
Where k1, s1, k2, pk are kernel-configuration-specific parameters (see Kernel Configurations table below). Conv2 uses a (3, k2) kernel, spanning 3 leads vertically while varying temporally. All dropout layers use the same rate (hyperparameter). Output is 2 logits (PE-negative, PE-positive), passed through softmax at inference.
Kernel configurations and corresponding receptive fields.
Kernel
Conv1
Stride
Conv2
Pool
RF (ms)
K0
(1,11)
(1,2)
(3,7)
(1,2)
82
K1
(1,11)
(1,1)
(3,5)
(1,4)
84
K2
(1,25)
(1,1)
(3,11)
(1,4)
160
K3
(1,51)
(1,2)
(3,25)
(1,8)
1122
K4
(1,75)
(1,2)
(3,35)
(1,8)
1490
K5
(1,101)
(1,2)
(3,51)
(1,10)
2598
Receptive Field Design Rationale
2.1.1 Receptive Field Design Rationale
What is a receptive field? The input to the model is 5,000 timepoints of raw EKG signal sampled at 500Hz (5,000 / 500 = 10 seconds). After passing through convolutional and pooling layers, the signal is compressed into a small feature map. Each value in that feature map was computed from some contiguous chunk of the original 5,000 timepoints. The receptive field (RF) is the width of that chunk, i.e. how much raw signal each learned feature can “see.”
If the RF is 82ms, each feature sees only a fraction of the QRS complex. If the RF is 1,122ms, each feature sees an entire P-QRS-ST-T cycle, the full electrical signature of one heartbeat. Since classic PE signs on EKG (right heart strain, ST-segment changes, T-wave inversions) are expressed across the full beat, the hypothesis is that larger RFs should improve detection.
The RF formula. For a sequence of convolutional and pooling layers, the RF grows according to:
where \(k\) is the kernel size of the current layer, and \(S_{cumulative}\) is the product of all strides in preceding layers. This captures the key insight: later layers operate on downsampled feature maps, so each position in their input represents multiple raw samples. A kernel of width 25 in Conv2 does not span 25 raw samples. Rather, it spans 25 positions that are each \(S_{cumulative}\) raw samples apart.
The large jump from 65 to 449 at Conv2 is because the cumulative stride is 16 at that point. Each of Conv2’s 25 kernel positions is 16 raw samples apart, so the kernel reaches across 24 x 16 = 384 additional raw samples.
Design constraints. Kernel sizes were chosen under three constraints:
Broad temporal sweep. Six configurations were designed to produce receptive fields spanning two orders of magnitude, from ~80ms to ~2,600ms, allowing the data to determine which timescale contains the most signal for PE detection. Rather than targeting exact cardiac cycle fractions, the configurations were spaced to sample a wide range of temporal resolutions.
2:1 kernel ratio. A consistent ~2:1 ratio between Conv1 and Conv2 kernel widths was maintained across configurations (K2-K5 range from 1.98:1 to 2.27:1), so that each layer contributes proportionally to the total RF.
Odd kernel widths. Standard convention in convolutional networks. Odd kernels allow symmetric padding (padding = (k-1)/2 on each side), preserving spatial alignment.
Given the 2:1 ratio and odd-kernel constraint, K3 uses conv1_k=51 and conv2_k=25, yielding 1,122ms. The RF tuning resolution at Conv2 is coarse: each +/-1 change in conv2_k shifts the RF by 32ms. To hit a different target while maintaining the 2:1 ratio would require changing both kernels simultaneously, so the achieved RFs are approximate by design.
Receptive field verification. RF computed from architecture parameters and confirmed against stored values.
Kernel
conv1_k
conv1_s
pool_k
conv2_k
RF_samples
RF_ms
Stored_ms
Conv1:Conv2
K0
11
2
2
7
41
82
82
1.6:1
K1
11
1
4
5
42
84
84
2:1
K2
25
1
4
11
80
160
160
2:1
K3
51
2
8
25
561
1122
1122
2:1
K4
75
2
8
35
745
1490
1490
2:1
K5
101
2
10
51
1299
2598
2598
2:1
Clinical interpretation. How much signal each kernel captures depends on the patient’s heart rate. A normal resting heart rate is ~75 bpm (800ms per cycle). PE patients commonly present with sinus tachycardia (>=100 bpm), giving <=600ms per cycle.
Show code
hr_vals <-c(800, 600, 500, 400)rf_vals <-c(82, 84, 160, 1122, 1490, 2598)# Round to nearest 0.5 for cleaner presentationround_half <-function(x) { r <-round(x *2) /2ifelse(r ==round(r), sprintf("~%.0f", r), sprintf("~%.1f", r))}hr_table <-tibble(`Heart Rate`=c("75 bpm (normal resting)", "100 bpm (tachycardia threshold)", "120 bpm", "150 bpm"),`Cycle (ms)`= hr_vals)for (j inseq_along(rf_vals)) { raw <- rf_vals[j] / hr_vals hr_table[[paste0("K", j -1)]] <-ifelse(raw <0.5, sprintf("%.1f", raw), round_half(raw))}kable(hr_table, booktabs =TRUE,caption ="Approximate number of cardiac cycles captured by each kernel at different heart rates.") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed"), full_width =FALSE)
Approximate number of cardiac cycles captured by each kernel at different heart rates.
Heart Rate
Cycle (ms)
K0
K1
K2
K3
K4
K5
75 bpm (normal resting)
800
0.1
0.1
0.2
~1.5
~2
~3
100 bpm (tachycardia threshold)
600
0.1
0.1
0.3
~2
~2.5
~4.5
120 bpm
500
0.2
0.2
0.3
~2
~3
~5
150 bpm
400
0.2
0.2
0.4
~3
~3.5
~6.5
K0-K2 capture less than one beat at any heart rate. K3 captures 1-3 beats depending on heart rate, which is enough to see both waveform morphology (ST changes, T-wave inversions) and beat-to-beat timing. K4 and K5 capture more beats but show no significant improvement over K3, suggesting the additional context does not help.
K0 and K1 were carried over from V1/V2 for backward comparison and were not designed under the same 2:1 ratio convention.
2.2 Hyperparameter Grid
Learning rate: 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3
Class weight: none, weight_1_2 (1:2), inverse_freq (~1:3.3)
Dropout: 0.10, 0.20
Data split seeds: 612, 928 (70/30 stratified splits of 2,039 EKG-only patients)
Initialization seeds: 42, 123, 456, 789, 2024
Total: 6 kernels x 36 HP combos x 2 splits x 5 seeds = 2,160 experiments. Early stopping with patience=7 on test loss, maximum 30 epochs.
2.3 Data
Training/test: 2,039 EKG-only patients, split 70/30 per data split seed. PE prevalence ~23.2%.
Validation: 132 patients from the paired pool (patients with both EKG + CXR), using fixed paired_split_seed=612. The specialist only evaluates on validation and never trains on paired data, preventing leakage into Phase 3 fusion experiments.
Model selection strategy: The EKG specialist is selected based on test set performance (613 EKG-only patients), not validation set performance. This is deliberate: because Phase 3 fusion models will be evaluated on the same 132-patient validation set, selecting the specialist based on validation would constitute indirect data leakage. In warm fusion, the specialist weights are fine-tuned with paired data, so any pre-optimization for validation patients would propagate through the fusion training. Test-based selection keeps specialist selection independent of fusion evaluation.
Overfitting safeguards: (1) Early stopping with patience=7 on test loss (maximum 30 epochs) prevents overtraining. (2) Dropout regularization is applied after each convolutional block and in the fully connected layer. (3) Two independent stratified data splits (seeds 612, 928) verify that results are not driven by a particular train/test partition. (4) Five initialization seeds per configuration quantify seed-to-seed variability. (5) The 132-patient validation set is completely held out from both training and hyperparameter selection, serving only as an independent check on generalization.
3 Results
All metrics in Sections 4.1-4.7 are threshold-independent (AUROC, AUPRC). Binary prediction metrics (sensitivity, specificity, confusion matrices) are introduced in Section 4.8.
3.1 The Full Landscape (2,160 experiments)
Show code
# One row per experiment (AUROC/AUPRC are identical across cutoff rows)auroc_data <- master %>%filter(cutoff ==0.50)ggplot(auroc_data, aes(x = test_auroc)) +geom_histogram(binwidth =0.02, fill ="#5DADE2", color ="white", alpha =0.8) +geom_vline(xintercept =0.50, linetype ="dashed", color ="#E74C3C", linewidth =0.6) +annotate("text", x =0.52, y =Inf, label ="Random chance (0.50)",hjust =0, vjust =2, color ="#E74C3C", size =3.5) +labs(x ="Test AUROC", y ="Count",title ="Distribution of test AUROC across all 2,160 experiments",caption ="All kernels, class weights, learning rates, dropout rates, splits, and seeds.") +theme_minimal(base_size =13) +theme(plot.title =element_text(face ="bold"))
The bulk of experiments fall in the 0.55-0.62 AUROC range. A small number fall below 0.50.
Show code
ggplot(auroc_data, aes(x = test_auprc)) +geom_histogram(binwidth =0.01, fill ="#58D68D", color ="white", alpha =0.8) +geom_vline(xintercept =0.232, linetype ="dashed", color ="#E74C3C", linewidth =0.6) +annotate("text", x =0.237, y =Inf, label ="Prevalence baseline (0.232)",hjust =0, vjust =2, color ="#E74C3C", size =3.5) +labs(x ="Test AUPRC", y ="Count",title ="Distribution of test AUPRC across all 2,160 experiments",caption ="Dashed line = prevalence baseline (expected AUPRC of a random classifier).") +theme_minimal(base_size =13) +theme(plot.title =element_text(face ="bold"))
Most experiments exceed the prevalence baseline (0.232), with the bulk in the 0.25-0.32 range.
3.2 Class Weight (2,160 experiments)
Show code
auroc_data <- auroc_data %>%mutate(cw_label =case_when( class_weight =="none"~"None", class_weight =="weight_1_2"~"1:2", class_weight =="inverse_freq"~"Inverse freq (~1:3.3)" ) %>%factor(levels =c("Inverse freq (~1:3.3)", "1:2", "None")))ggplot(auroc_data, aes(x = test_auroc, y = cw_label, fill = cw_label)) +geom_density_ridges(alpha =0.7, scale =1.2, rel_min_height =0.01) +geom_vline(xintercept =0.50, linetype ="dashed", color ="#E74C3C", linewidth =0.4) +scale_fill_manual(values =c("#27AE60", "#F39C12", "#E74C3C")) +labs(x ="Test AUROC", y =NULL,title ="Test AUROC distribution by class weight strategy",caption ="All kernels, LRs, dropout rates, splits, and seeds pooled (720 experiments per weight).") +theme_minimal(base_size =13) +theme(legend.position ="none", plot.title =element_text(face ="bold"))
Show code
ggplot(auroc_data, aes(x = test_auprc, y = cw_label, fill = cw_label)) +geom_density_ridges(alpha =0.7, scale =1.2, rel_min_height =0.01) +geom_vline(xintercept =0.232, linetype ="dashed", color ="#E74C3C", linewidth =0.4) +scale_fill_manual(values =c("#27AE60", "#F39C12", "#E74C3C")) +labs(x ="Test AUPRC", y =NULL,title ="Test AUPRC distribution by class weight strategy",caption ="720 experiments per weight. Dashed line = prevalence baseline (0.232).") +theme_minimal(base_size =13) +theme(legend.position ="none", plot.title =element_text(face ="bold"))
All three class weight strategies produce overlapping AUROC distributions in the 0.55-0.62 range. Class weight does not affect AUROC or AUPRC (threshold-independent metrics).
Kernel comparison, validation set. Median (SD) across 360 experiments per kernel.
Kernel
RF (ms)
AUROC
AUPRC
K0
82
0.603 (0.034)
0.363 (0.043)
K1
84
0.598 (0.046)
0.331 (0.042)
K2
160
0.616 (0.040)
0.378 (0.048)
K3
1122
0.590 (0.015)
0.307 (0.023)
K4
1490
0.579 (0.016)
0.293 (0.025)
K5
2598
0.565 (0.023)
0.296 (0.026)
K3 has the highest median AUROC and AUPRC on both test and validation sets.
3.3.1 Statistical Test
Kruskal-Wallis rank-sum test on test AUROC across the 6 kernel groups, with pairwise Wilcoxon rank-sum tests (BH-adjusted) for post-hoc comparisons. All 2,160 experiments included.
Show code
cat(sprintf("N = %d experiments (%d per kernel)\n",nrow(auroc_data),nrow(auroc_data) /n_distinct(auroc_data$kernel_id)))
N = 2160 experiments (360 per kernel)
Show code
kw <-kruskal.test(test_auroc ~ kernel_id, data = auroc_data)cat(sprintf("Kruskal-Wallis chi-squared = %.2f, df = %d, p = %.2e\n", kw$statistic, kw$parameter, kw$p.value))
Kruskal-Wallis chi-squared = 673.05, df = 5, p = 3.29e-143
epoch_data <- rec_seeds %>%mutate(split_label =factor(data_split_seed))ggplot(epoch_data, aes(x = best_epoch, y = split_label, color = class_weight)) +geom_jitter(size =3, height =0.15, alpha =0.8) +labs(x ="Best Epoch (early stopping)", y ="Data Split", color ="Class Weight",title ="Convergence epoch for candidate config across seeds",caption ="K3, LR=5e-3, dropout=0.1.") +theme_minimal(base_size =13) +theme(plot.title =element_text(face ="bold"))
3.7.1 Validation Set Performance
The validation set (132 paired patients with both CXR and EKG) is held out from specialist training. These are the patients that will be used in Phase 3 warm fusion.
Candidate config (K3, LR=5e-3, dropout=0.1) validation set performance across all seed/split/class_weight combinations.
Split
Seed
Class Weight
Val AUROC
Val AUPRC
612
42
inverse_freq
0.617
0.347
612
123
inverse_freq
0.605
0.342
612
456
inverse_freq
0.614
0.329
612
789
inverse_freq
0.602
0.336
612
2024
inverse_freq
0.592
0.317
612
42
none
0.601
0.320
612
123
none
0.579
0.317
612
456
none
0.583
0.319
612
789
none
0.590
0.304
612
2024
none
0.599
0.339
612
42
weight_1_2
0.608
0.350
612
123
weight_1_2
0.611
0.342
612
456
weight_1_2
0.603
0.318
612
789
weight_1_2
0.585
0.302
612
2024
weight_1_2
0.599
0.319
928
42
inverse_freq
0.575
0.319
928
123
inverse_freq
0.589
0.327
928
456
inverse_freq
0.566
0.290
928
789
inverse_freq
0.579
0.278
928
2024
inverse_freq
0.581
0.304
928
42
none
0.602
0.330
928
123
none
0.587
0.302
928
456
none
0.592
0.292
928
789
none
0.608
0.291
928
2024
none
0.578
0.317
928
42
weight_1_2
0.579
0.307
928
123
weight_1_2
0.588
0.309
928
456
weight_1_2
0.561
0.290
928
789
weight_1_2
0.569
0.310
928
2024
weight_1_2
0.581
0.370
Show code
rec_compare <- rec_seeds %>%group_by(class_weight) %>%summarise(N =n(),`Test AUROC`=sprintf("%.3f (%.3f)", median(test_auroc), sd(test_auroc)),`Val AUROC`=sprintf("%.3f (%.3f)", median(val_auroc), sd(val_auroc)),`Test AUPRC`=sprintf("%.3f (%.3f)", median(test_auprc), sd(test_auprc)),`Val AUPRC`=sprintf("%.3f (%.3f)", median(val_auprc), sd(val_auprc)),.groups ="drop" ) %>%rename(`Class Weight`= class_weight)kable(rec_compare, booktabs =TRUE,caption ="Candidate config: test vs validation comparison. Median (SD) across seeds and splits.") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed"), full_width =FALSE)
Candidate config: test vs validation comparison. Median (SD) across seeds and splits.
Class Weight
N
Test AUROC
Val AUROC
Test AUPRC
Val AUPRC
inverse_freq
10
0.591 (0.012)
0.590 (0.017)
0.292 (0.009)
0.323 (0.022)
none
10
0.597 (0.011)
0.591 (0.010)
0.292 (0.010)
0.317 (0.016)
weight_1_2
10
0.595 (0.011)
0.586 (0.017)
0.290 (0.013)
0.314 (0.025)
Show code
ggplot(rec_seeds, aes(x = test_auroc, y = val_auroc, color = class_weight)) +geom_abline(slope =1, intercept =0, linetype ="dashed", color ="grey50") +geom_point(aes(shape =factor(data_split_seed)), size =3, alpha =0.8) +labs(x ="Test AUROC", y ="Validation AUROC", color ="Class Weight", shape ="Split",title ="Test vs validation AUROC for candidate config",caption ="K3, LR=5e-3, dropout=0.1. Each point is one seed.") +theme_minimal(base_size =13) +theme(plot.title =element_text(face ="bold"))
Show code
ggplot(rec_seeds, aes(x = test_auprc, y = val_auprc, color = class_weight)) +geom_abline(slope =1, intercept =0, linetype ="dashed", color ="grey50") +geom_point(aes(shape =factor(data_split_seed)), size =3, alpha =0.8) +labs(x ="Test AUPRC", y ="Validation AUPRC", color ="Class Weight", shape ="Split",title ="Test vs validation AUPRC for candidate config",caption ="K3, LR=5e-3, dropout=0.1. Each point is one seed.") +theme_minimal(base_size =13) +theme(plot.title =element_text(face ="bold"))
3.8 Clinical Utility: Binary Predictions and Operating Points
Sections 4.1-4.7 used threshold-independent metrics (AUROC, AUPRC). This section examines binary prediction behavior, which depends on both the decision cutoff and the class weight strategy.
Prediction collapse rates by cutoff and class weight (2,160 experiments per cutoff).
Cutoff
Class Weight
All-Negative (Sens=0)
All-Positive (Spec=0)
Balanced
0.15
inverse_freq
0.0%
99.6%
0.4%
none
0.0%
47.4%
52.6%
weight_1_2
0.0%
91.7%
8.3%
0.25
inverse_freq
0.0%
92.5%
7.5%
none
0.0%
10.0%
90.0%
weight_1_2
0.0%
59.2%
40.8%
0.35
inverse_freq
0.0%
65.0%
35.0%
none
24.3%
3.6%
72.1%
weight_1_2
0.0%
11.5%
88.5%
0.50
inverse_freq
0.0%
1.1%
98.9%
none
98.3%
0.0%
1.7%
weight_1_2
39.4%
0.1%
60.4%
The table shows how collapse rates change across cutoffs. At cutoff 0.50, none and weight_1_2 show high all-negative rates. At lower cutoffs, these rates decrease as more models’ predicted probabilities cross the threshold.
For the cutoff analysis below (Section 4.8.3), inverse_freq with cutoff 0.50 is used as one operating point. The table above shows the full picture across strategies and cutoffs.
3.8.2 Representative Confusion Matrices by Kernel
For each kernel, one model was selected from a fixed combination (inverse_freq, LR=5e-3, dropout=0.1, split 612) by choosing the seed with median test AUROC.
Cutoff analysis for candidate K3 model (validation set, 132 patients).
Cutoff
Sensitivity
Specificity
PPV
NPV
F1
TP
FP
FN
TN
0.15
100.0%
0.0%
23.5%
0.0%
0.380
31
101
0
0
0.25
100.0%
0.0%
23.5%
0.0%
0.380
31
101
0
0
0.35
100.0%
0.0%
23.5%
0.0%
0.380
31
101
0
0
0.50
61.3%
57.4%
30.6%
82.9%
0.409
19
43
12
58
3.9 Validation vs Test Agreement
Show code
ggplot(auroc_data, aes(x = test_auroc, y = val_auroc, color = kernel_id)) +geom_abline(slope =1, intercept =0, linetype ="dashed", color ="grey50") +geom_point(alpha =0.5, size =1.5) +scale_color_brewer(palette ="Set2", name ="Kernel") +labs(x ="Test AUROC", y ="Validation AUROC",title ="Validation vs test AUROC (all 2,160 experiments)",caption ="Dashed line = perfect agreement.") +theme_minimal(base_size =13) +theme(plot.title =element_text(face ="bold"))
Show code
ggplot(auroc_data, aes(x = test_auprc, y = val_auprc, color = kernel_id)) +geom_abline(slope =1, intercept =0, linetype ="dashed", color ="grey50") +geom_point(alpha =0.5, size =1.5) +scale_color_brewer(palette ="Set2", name ="Kernel") +labs(x ="Test AUPRC", y ="Validation AUPRC",title ="Validation vs test AUPRC (all 2,160 experiments)",caption ="Dashed line = perfect agreement.") +theme_minimal(base_size =13) +theme(plot.title =element_text(face ="bold"))
Show code
cor_val <-cor(auroc_data$test_auroc, auroc_data$val_auroc, method ="spearman")cat(sprintf("Spearman correlation between test and validation AUROC: %.3f\n", cor_val))
Spearman correlation between test and validation AUROC: -0.199
Spearman correlation si slightly negative, but not worrisome.
3.10 Summary
Show code
bottom <-tibble(Item =c("Candidate kernel","Class weight effect on AUROC","Learning rate","Dropout","Test AUROC (median, all class weights)","Test AUPRC (median, all class weights)","Binary predictions at cutoff 0.50","Next step"),Finding =c("K3 (1,122ms receptive field); significantly outperforms K0-K2, comparable to K4-K5","No effect; all three strategies produce similar AUROC/AUPRC","Stable across tested range (1e-5 to 5e-3); 5e-3 carried forward","Stable (0.1 vs 0.2); 0.1 carried forward","~0.590","~0.290 (~1.3x prevalence)","Requires inverse_freq weighting to avoid collapse; other strategies need lower cutoff","Phase 3: warm fusion with CXR features using paired data"))kable(bottom, booktabs =TRUE, caption ="Summary of findings from 2,160 EKG Specialist V3 experiments.") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed"), full_width =FALSE) %>%column_spec(1, bold =TRUE, width ="5cm") %>%column_spec(2, width ="9cm")
Summary of findings from 2,160 EKG Specialist V3 experiments.
Item
Finding
Candidate kernel
K3 (1,122ms receptive field); significantly outperforms K0-K2, comparable to K4-K5
Class weight effect on AUROC
No effect; all three strategies produce similar AUROC/AUPRC
Learning rate
Stable across tested range (1e-5 to 5e-3); 5e-3 carried forward
Dropout
Stable (0.1 vs 0.2); 0.1 carried forward
Test AUROC (median, all class weights)
~0.590
Test AUPRC (median, all class weights)
~0.290 (~1.3x prevalence)
Binary predictions at cutoff 0.50
Requires inverse_freq weighting to avoid collapse; other strategies need lower cutoff
Next step
Phase 3: warm fusion with CXR features using paired data
Discussion. Kernel size is the primary hyperparameter affecting AUROC: K3 (1,122ms) significantly outperforms K0-K2, while K4-K5 show no further gain. Class weight, learning rate, and dropout do not meaningfully affect AUROC or AUPRC within the tested ranges. Class weight becomes relevant only for binary predictions: at cutoff 0.50, inverse_freq is the only strategy producing non-degenerate predictions, but this reflects the cutoff choice, not the model ranking quality. The EKG specialist achieves ~0.59 AUROC, which is above chance but limited. The candidate configuration (K3, LR=5e-3, dropout=0.1) will proceed to Phase 3 warm fusion, where both the EKG and CXR specialist weights are fine-tuned jointly on paired CXR+EKG data.