Adaptive Conformal Sets Under Distribution Shift

Background & Methods

Why eelgrass mapping matters

Eelgrass (Zostera marina) supports coastal ecosystems (habitat restoration, sediment stabilization, carbon sequestration)
Monitoring requires accurate maps over time (different imaging conditions)
Actual deployments need reliability: when should we trust the model?

Core Problem: Domain Shift Breaks Confidence

Imagery can change year-to-year (tide height, lighting, turbidity, morphology, drone degradation)
A model can be confident yet wrong under drift
Want uncertainty quantification that flags:
- ambiguous pixels
- “shifted” regions where the model is less reliable

Data: Morro Bay, California (2018–2022)

DJI Phantom 4 Pro drone orthomosaics of Morro Bay estuary, California
Labels:
- 2018–2021: dense polygon annotations → rasterized labels
- 2022: 1,000 human-labeled points (cheap evaluation)
Tiling: 448×448 chips, 50% overlap
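
As a rough illustration of the tiling step, 448×448 chips with 50% overlap correspond to a stride of 224 pixels. The sketch below lists chip origins for a raster of a given size. The function name `tile_origins` and the choice to drop partial chips at the raster edge are assumptions for illustration, not details taken from the pipeline.

```python
def tile_origins(height, width, chip=448, overlap=0.5):
    """Return (row, col) top-left corners of chips covering a raster.

    Assumption: partial chips at the bottom/right edge are dropped.
    """
    stride = int(chip * (1 - overlap))  # 448 * 0.5 = 224 pixels
    rows = range(0, height - chip + 1, stride)
    cols = range(0, width - chip + 1, stride)
    return [(r, c) for r in rows for c in cols]


# Hypothetical raster size, chosen only to make the arithmetic easy to follow.
origins = tile_origins(height=1344, width=896)
```

With this stride, every interior pixel falls in several overlapping chips, so chip-level predictions would need to be merged (for example by averaging) before pixel-wise evaluation.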

Binary Ground-Truth Mask (Rasterized from Polygons)

Background Concepts

Types of Uncertainty

Predictive uncertainty is the sum of aleatoric and epistemic uncertainties.
- Aleatoric: noise inherent to data
- Epistemic: model uncertainty (insufficient/shifted knowledge)
Under drift, a proxy for epistemic uncertainty should increase

Deep Ensembles

Ensemble members \(\{f_k\}_{k=1}^K\) output probabilities \(p_y^{(k)}(x)\)
Practical epistemic proxy (disagreement): \(V(x) = \mathrm{Var}_k\big(p_y^{(k)}(x)\big)\)
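
The disagreement proxy can be sketched in a few lines of NumPy. This is a minimal example assuming the K member outputs are stacked into one array of per-pixel eelgrass probabilities. The array layout, the function name `ensemble_mean_and_variance`, and the use of the population variance (`ddof=0`) are assumptions for illustration.

```python
import numpy as np


def ensemble_mean_and_variance(member_probs):
    """member_probs: array of shape (K, H, W) with each member's eelgrass probability.

    Returns the ensemble mean probability and the across-member variance V(x).
    """
    mean_prob = member_probs.mean(axis=0)
    variance = member_probs.var(axis=0)  # population variance across the K members
    return mean_prob, variance


# Synthetic stand-in for K=6 member outputs on a tiny 4x4 chip.
rng = np.random.default_rng(0)
member_probs = rng.uniform(0.0, 1.0, size=(6, 4, 4))
mean_prob, variance = ensemble_mean_and_variance(member_probs)
```

In the binary case the variance of the eelgrass probability equals the variance of the background probability, so a single variance map serves both labels.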

Model Architectures Used

Diverse heterogeneous ensemble (K=6):
- DeepLabv3+, U-Net, SAM-LoRA variants
- Single-year (2021) + multi-year (2018–2021) models

Split Conformal Prediction for Classification (vanilla baseline)

Conformal Guarantees

Finite-sample coverage guarantee (exact)
For any data distribution (distribution-free)
For any predictive model (model-free)

Two main limitations:

Marginal coverage
Exchangeability assumption
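
For reference, the vanilla split-conformal baseline can be sketched as follows. The nonconformity score \(s(x,y) = 1 - p_y(x)\) is an assumption here (a common choice for classification), and the synthetic calibration data and function names are illustrative only.

```python
import numpy as np


def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # rank of the order statistic to use
    if k > n:
        return np.inf  # too few calibration points: every label is kept
    return float(np.sort(cal_scores)[k - 1])


def prediction_sets(probs, q_hat):
    """probs: (n, C) class probabilities. Returns a boolean (n, C) membership mask."""
    return (1.0 - probs) <= q_hat


# Synthetic binary calibration data (column 0 = background, column 1 = eelgrass).
rng = np.random.default_rng(0)
n_cal = 2000
p_eelgrass = rng.beta(0.5, 0.5, size=n_cal)
labels = (rng.uniform(size=n_cal) < p_eelgrass).astype(int)
probs = np.column_stack([1.0 - p_eelgrass, p_eelgrass])

# Assumed score: one minus the probability assigned to the true label.
cal_scores = 1.0 - probs[np.arange(n_cal), labels]
q_hat = conformal_quantile(cal_scores, alpha=0.1)
sets = prediction_sets(probs, q_hat)
cal_coverage = sets[np.arange(n_cal), labels].mean()
```

With a binary problem and this score, a pixel gets an empty set when neither class probability reaches \(1 - \hat q\), and a two-label set only when \(\hat q \geq 0.5\).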

CP Under Drift

In 2022, difficult / OOD pixels shift the score distribution: the threshold \(\hat q_{\alpha}\) calibrated on 2021 becomes too small, so vanilla CP undercovers
Existing approaches are difficult to apply in high-dimensional drone imagery segmentation
Instead, we tailor difficulty-normalized CP, using ensemble disagreement as a shift-aware normalizer

Variance-Aware Score Normalization

Method 1: Parametric Linear Scaling

Replace the vanilla score \(s_i\) with \(s_i^\star = s_i/g(v_i)\), \(i \in C\), then run split CP on \(s^\star\). The linear choice is \(g(v) = 1 + \lambda v\):

\[S_{\lambda}(x,y)=\frac{S(x,y)}{1+\lambda V(x)},\quad \lambda \geq 0.\]

Assumes scores increase linearly with difficulty
Choose \(\lambda\) via grid search balancing OOD coverage and % singletons (efficiency)
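
A minimal sketch of the linear normalization, calibrated over a small grid of \(\lambda\) values. The base score \(1 - p_y\), the synthetic scores and variances, and the helper names are assumptions. The selection rule that trades off OOD coverage against singleton rate is not reproduced here.

```python
import numpy as np


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else float(np.sort(scores)[k - 1])


def linear_normalized(scores, variance, lam):
    """S_lambda = S / (1 + lambda * V): high-variance pixels get smaller scores."""
    return scores / (1.0 + lam * variance)


def in_set(prob_y, v, lam, q_hat):
    """Label y is kept when its normalized score (assumed 1 - p_y) is within q_hat."""
    return (1.0 - prob_y) / (1.0 + lam * v) <= q_hat


# Synthetic calibration scores that grow with ensemble variance.
rng = np.random.default_rng(1)
n = 2000
variance = rng.uniform(0.0, 0.2, size=n)
scores = rng.uniform(0.0, 0.3, size=n) * (1.0 + 4.0 * variance)

lam_grid = [0.0, 0.5, 1.0, 2.0, 3.0, 4.0]
q_hats = {lam: conformal_quantile(linear_normalized(scores, variance, lam), alpha=0.1)
          for lam in lam_grid}
```

Because each normalized score can only shrink as \(\lambda\) grows, the calibrated threshold is non-increasing in \(\lambda\). At test time the effective threshold on the raw score is \(\hat q\,(1 + \lambda V(x))\), which widens sets where the ensemble disagrees.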

Method 2: Nonparametric Normalization

\[S=a(V)U,\quad a(V)>0,\quad U \perp \!\!\! \perp V \Rightarrow\]

\[ S'(x,y) = \frac{S(x,y)}{\hat a(V(x))}, \quad \hat a(V) \approx \mathbb{E}[S \mid V] \]

Assumes scores increase multiplicatively with difficulty
Learns a non-linear scaling \(a(v)\); no tuning parameter
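
One way to estimate \(\hat a(v) \approx \mathbb{E}[S \mid V=v]\) is a binned conditional mean with linear interpolation, sketched below. The estimator (quantile bins plus `np.interp`), the flat extrapolation outside the calibration range, the small floor on \(\hat a\), and the split between fitting and calibration halves are all assumptions for illustration. The estimator actually used may differ.

```python
import numpy as np


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else float(np.sort(scores)[k - 1])


def fit_scale(fit_variance, fit_scores, n_bins=10, floor=1e-3):
    """Estimate a_hat(v) ~ E[S | V = v] from binned means of the score."""
    edges = np.quantile(fit_variance, np.linspace(0.0, 1.0, n_bins + 1))
    bin_ids = np.clip(np.searchsorted(edges, fit_variance, side="right") - 1,
                      0, n_bins - 1)
    used = [b for b in range(n_bins) if np.any(bin_ids == b)]  # skip empty bins
    centers = np.array([fit_variance[bin_ids == b].mean() for b in used])
    means = np.array([fit_scores[bin_ids == b].mean() for b in used])
    means = np.maximum(means, floor)  # keep the divisor strictly positive
    # np.interp holds the end values constant outside [centers[0], centers[-1]].
    return lambda v: np.interp(v, centers, means)


# Synthetic data following S = a(V) * U with a(v) = 0.1 + 2v and E[U] = 1.
rng = np.random.default_rng(2)
n = 5000
variance = rng.uniform(0.0, 0.2, size=n)
scores = (0.1 + 2.0 * variance) * rng.uniform(0.5, 1.5, size=n)

half = n // 2
a_hat = fit_scale(variance[:half], scores[:half])                   # fit on one half
normalized = scores[half:] / a_hat(variance[half:])                 # calibrate on the other
q_hat = conformal_quantile(normalized, alpha=0.1)
```

Since the normalized score is roughly \(U\) with mean one, the calibrated threshold lands above 1 rather than in \([0, 1]\), so it is not directly comparable to the vanilla or linear thresholds.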

Evaluation Metrics

Primary:

Global coverage vs target \(1-\alpha\)
Set composition: % empty/singleton/two-label

Additional:

class-conditional coverage
coverage vs variance bins
spatial robustness
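
The metrics above reduce to simple operations on a boolean set-membership mask. The sketch below assumes prediction sets are stored as an `(n, 2)` boolean array and labels as integers in `{0, 1}`. The function names and the variance bin edges are hypothetical.

```python
import numpy as np


def coverage(sets, labels):
    """Fraction of points whose true label is in the prediction set."""
    return float(sets[np.arange(len(labels)), labels].mean())


def set_composition(sets):
    """Percent of empty, singleton, and two-label sets (binary case)."""
    sizes = sets.sum(axis=1)
    return {name: float(100.0 * np.mean(sizes == k))
            for k, name in enumerate(["empty", "singleton", "two_label"])}


def class_conditional_coverage(sets, labels):
    """Coverage computed separately for each true class."""
    covered = sets[np.arange(len(labels)), labels]
    return {int(c): float(covered[labels == c].mean()) for c in np.unique(labels)}


def coverage_by_variance_bin(sets, labels, variance, edges):
    """Coverage within each ensemble-variance bin defined by `edges`."""
    covered = sets[np.arange(len(labels)), labels]
    bin_ids = np.digitize(variance, edges)
    return {int(b): float(covered[bin_ids == b].mean()) for b in np.unique(bin_ids)}


# Tiny hand-made example: 4 points, 2 classes.
sets = np.array([[True, False], [False, False], [True, True], [False, True]])
labels = np.array([0, 1, 1, 0])
variance = np.array([0.01, 0.01, 0.10, 0.10])

overall = coverage(sets, labels)
composition = set_composition(sets)
per_class = class_conditional_coverage(sets, labels)
per_bin = coverage_by_variance_bin(sets, labels, variance, edges=[0.05])
```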

Ensemble Results

Figure 1: Example ensemble inference outputs on a 2022 chip: Raw chip (+ point), ensemble mean probability, ensemble variance, and argmax segmentation.

Table 1: Pixel-wise performance of the ensemble and individual segmentation models on the 2022 evaluation points.

Training Year	Model	Precision	Recall	F Score	Accuracy
N/A	Ensemble	0.905	0.869	0.886	0.91
2021	U-Net	0.820	0.890	0.860	0.88
2021	SAM LoRA	0.820	0.890	0.850	0.87
2021	DeepLab	0.880	0.800	0.840	0.88
2018–2021	U-Net	0.880	0.780	0.830	0.87
2018–2021	SAM LoRA	0.810	0.800	0.810	0.84
2018–2021	DeepLab	0.880	0.890	0.860	0.89

In-Distribution Evaluation

Table 2: In-distribution (2021 test) coverage and set composition for vanilla CP, linear variance-normalized scores across \(\lambda\), and the nonparametric normalizer at \(\alpha = 0.1\).

Method	q-hat	Coverage	% singletons	% empty	% two-label
Vanilla	0.317	0.9	93.9	6.1	0.0
Linear (λ = 0.5)	0.298	0.9	93.9	6.1	0.0
Linear (λ = 1.0)	0.281	0.9	93.9	6.1	0.0
Linear (λ = 2.0)	0.254	0.9	93.9	6.1	0.0
Linear (λ = 3.0)	0.233	0.9	93.8	6.0	0.1
Linear (λ = 4.0)	0.216	0.9	93.7	6.1	0.2
Nonparametric	1.512	0.9	94.5	4.5	1.0

Temporal OOD (2022)

Linear Normalization

Table 3: Temporal OOD (2022) global coverage and set composition for vanilla CP and linear variance-normalized scores across \(\lambda\) at \(\alpha = 0.1\).

Method	Coverage	% singletons	% empty	% two-label
Vanilla	0.860	92.7	7.3	0.0
Linear (λ = 0.5)	0.867	94.4	5.6	0.0
Linear (λ = 1.0)	0.871	95.4	4.5	0.1
Linear (λ = 2.0)	0.891	95.6	3.0	1.4
Linear (λ = 3.0)	0.901	93.8	2.5	3.7
Linear (λ = 4.0)	0.908	92.4	2.1	5.5

Figure 2: Overall coverage and singleton % on 2022 OOD points as a function of the linear shrink parameter \(\lambda\) at \(\alpha = 0.1\).

Nonparametric Normalization

Figure 3: Estimated normalization scale from 2021 calibration compared to the empirical score-variance relationship observed in 2022. The extension shows extrapolation beyond calibration support.

Table 4: Temporal OOD (2022) global coverage and set composition for Vanilla CP, Linear normalization \((\lambda = 3)\), and the Nonparametric normalizer at \(\alpha = 0.1\).

Method	q-hat	Coverage	% singletons	% empty	% two-label
Vanilla	0.317	0.860	92.7	7.3	0.0
Linear (λ = 3.0)	0.233	0.901	93.8	2.5	3.7
Nonparametric	1.511	0.906	96.4	0.8	2.8

Figure 4: Coverage on 2022 OOD points as a function of ensemble-variance bins for vanilla, linear \((\lambda = 3)\), and nonparametric normalization.

Class-conditional coverage (2022)

Table 5: Per-class coverage on 2022 OOD points at \(\alpha = 0.1\) for vanilla CP, linear normalization \((\lambda = 3)\), and the nonparametric normalizer.

Method	Coverage (background)	Coverage (eelgrass)	Difference
Vanilla	0.914	0.780	0.134
Linear (λ = 3.0)	0.943	0.839	0.104
Nonparametric	0.946	0.847	0.099

Spatial Robustness

Figure 5: Spatial blocks (10 equal-count spatial blocks formed by Morton/Z-ordering)

Figure 6: Four iterations of the Z-order curve (from Wikipedia)

Table 6: Spatial robustness diagnostics on 2022 OOD points.

Method	SD	Correlation r	p-value
Vanilla	0.049	-0.622	0.055
Linear (λ = 3.0)	0.040	-0.416	0.231
Nonparametric	0.036	-0.206	0.569
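
A sketch of how such spatial diagnostics could be computed: points are ordered along a Z-order (Morton) curve, split into 10 equal-count blocks, and summarized per block. It assumes that SD is the standard deviation of block-wise coverage and that r is the Pearson correlation between block-wise coverage and block mean variance, which is what the block-wise scatter plot suggests. The integer pixel coordinates, the sample standard deviation (`ddof=1`), and all function names are assumptions.

```python
import numpy as np


def morton_code(x, y, bits=16):
    """Interleave the bits of non-negative integers x and y (Z-order index)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return code


def spatial_blocks(xs, ys, n_blocks=10):
    """Assign each point to one of n_blocks equal-count blocks along the Z-order curve."""
    codes = np.array([morton_code(int(x), int(y)) for x, y in zip(xs, ys)])
    order = np.argsort(codes, kind="stable")
    block_id = np.empty(len(codes), dtype=int)
    for b, idx in enumerate(np.array_split(order, n_blocks)):
        block_id[idx] = b
    return block_id


def block_diagnostics(covered, variance, block_id):
    """SD of block-wise coverage and its Pearson correlation with block mean variance."""
    blocks = np.unique(block_id)
    block_cov = np.array([covered[block_id == b].mean() for b in blocks])
    block_var = np.array([variance[block_id == b].mean() for b in blocks])
    sd = float(block_cov.std(ddof=1))
    r = float(np.corrcoef(block_cov, block_var)[0, 1])
    return sd, r


# Synthetic stand-in for 1,000 labeled points with pixel coordinates.
rng = np.random.default_rng(3)
xs = rng.integers(0, 4096, size=1000)
ys = rng.integers(0, 4096, size=1000)
variance = rng.uniform(0.0, 0.2, size=1000)
covered = rng.uniform(size=1000) < (0.95 - variance)  # coverage drops with variance

block_id = spatial_blocks(xs, ys)
sd, r = block_diagnostics(covered, variance, block_id)
```

With only 10 blocks the correlation estimate is noisy, so a p-value (for example from `scipy.stats.pearsonr`) is worth reporting alongside r.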

Figure 7: Scatter plot of block-wise coverage vs mean variance for vanilla, linear \((\lambda=3)\), and nonparametric methods

Figure 8: Per-block mean variance for 2022 points

Figure 9: Per-block coverage for 2022 points for vanilla, linear \((\lambda=3)\), and nonparametric methods

Visual Method Comparison

Figure 10: Visuals of the 2022 raster and overlaid ground truth points (Blue = Other, Green = Eelgrass).

Figure 11: Set composition overlay for 2022 raster for vanilla, linear \((\lambda=3)\), and nonparametric methods.

Figure 12: Set composition changes from vanilla for linear \((\lambda=3)\) and nonparametric methods. Counts for each transition type are reported in the legend.

Figure 13: Chip-level set prediction comparison on a representative 2022 point (GT = eelgrass). Both vanilla and linear methods return empty sets at that pixel, while the nonparametric method returns the correct singleton {eelgrass}.

This figure illustrates how variance normalization can change predictions. At the labeled point, the ensemble mean predicts eelgrass with low confidence (\(\hat p \approx 0.56\)), yet the ensemble variance is also fairly low (\(\hat v \approx 0.045\)). Predictive uncertainty is therefore relatively high (the pixel is intrinsically ambiguous, giving high entropy and low confidence), while epistemic uncertainty is low (the ensemble members agree, giving low variance). As a result, both vanilla CP and the linear variance-normalized method produce an empty set at the labeled point. In contrast, the nonparametric method retains eelgrass as the correct singleton prediction.
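
The arithmetic behind this example can be checked directly from the reported values, under the assumption that the nonconformity score is \(1 - p_y\). The thresholds are the reported q-hat values at \(\alpha = 0.1\). The value of \(\hat a(v)\) at this point is not reported, so the sketch only derives the range of \(\hat a(v)\) that would yield the singleton {eelgrass}.

```python
p_eelgrass, v = 0.56, 0.045

# Assumed score: one minus the probability of each candidate label.
scores = {"eelgrass": 1.0 - p_eelgrass, "other": p_eelgrass}

q_vanilla = 0.317            # vanilla threshold
q_linear, lam = 0.233, 3.0   # linear normalization with lambda = 3
q_nonparam = 1.511           # nonparametric threshold (2022 table)

vanilla_set = [y for y, s in scores.items() if s <= q_vanilla]
linear_set = [y for y, s in scores.items() if s / (1.0 + lam * v) <= q_linear]

# Nonparametric: label y is kept when s / a_hat(v) <= q_nonparam.
# {eelgrass} results when a_hat(v) lies in [a_low, a_high).
a_low = scores["eelgrass"] / q_nonparam   # about 0.291
a_high = scores["other"] / q_nonparam     # about 0.371
```

Both the vanilla and linear checks fail for both labels (0.44 and 0.56 exceed 0.317, and 0.388 and 0.493 exceed 0.233), which matches the empty sets described above. Under the same assumption, the nonparametric singleton implies an estimated scale of roughly 0.29 to 0.37 at this variance level.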

These effects are even clearer across the entire segmented chip. Vanilla CP produces many empty sets in ambiguous areas (borders between eelgrass and other, as well as darker shadows). The linear method shrinks these ambiguous areas to singletons and two-label sets (seen as thinner strips between other and eelgrass) but still leaves a large share of empty sets. The nonparametric method shrinks the ambiguous area further while producing almost no empty sets, instead providing more informative two-label sets.

Sensitivity Analysis

Table 7: Coverage and set composition on 2022 OOD points across \(\alpha \in \{0.10, 0.05, 0.025\}\).

Method	Coverage	Empty (%)	Single (%)	Two-label (%)
α = 0.025
Linear (λ=3)	0.967	0.0	70.8	29.2
Nonparametric	0.972	0.3	65.4	34.3
Vanilla	0.949	0.0	87.2	12.8
α = 0.05
Linear (λ=3)	0.952	0.0	80.5	19.5
Nonparametric	0.944	0.4	87.0	12.6
Vanilla	0.919	0.0	95.9	4.1
α = 0.1
Linear (λ=3)	0.901	2.5	93.8	3.7
Nonparametric	0.906	0.8	96.4	2.8
Vanilla	0.860	7.3	92.7	0.0

Figure 14: Coverage across \(\alpha \in \{0.1, 0.05, 0.025\}\). Dashed line shows the target coverage \(1 - \alpha\).

Figure 15: Set composition (empty, single, two-label) across \(\alpha \in \{0.1, 0.05, 0.025\}\).

Takeaways

Ensemble improves accuracy and provides useful uncertainty
ID: all conformal methods meet target coverage
OOD (2022): vanilla CP undercovers; ensemble-variance normalization recovers the lost coverage
Improved coverage vs variance and improved spatial equity
Nonparametric seemed to work best and requires no parameter tuning.

Limitations

Formal CP guarantees require exchangeability. This work targets only empirical robustness.
Assumes score–variance relationship transfers across years (may fail under larger shifts)
2022 uses point labels (not full dense raster labels)
Ensembles add compute cost

Conclusion

Ensemble disagreement provides a practical way to make conformal prediction sets more reliable under slight shift for eelgrass segmentation.
Works as a post-hoc wrapper: no retraining, no test-time labels