Adaptive Conformal Sets Under Distribution Shift

Background & Methods

Why eelgrass mapping matters

Eelgrass (Zostera marina) supports coastal ecosystems (habitat restoration, sediment stabilization, carbon sequestration)
Monitoring requires accurate maps over time (different imaging conditions)
Actual deployments need reliability: when should we trust the model?

Core Problem: Domain Shift Breaks Confidence

Imagery can change year-to-year (tide height, lighting, turbidity, morphology, drone degradation)
A model can be confident yet wrong when the data changes (drift)
Want uncertainty quantification that flags:
- ambiguous pixels
- “shifted” regions where the model is less reliable

Data: Morro Bay, California (2018–2022)

Drone imagery of Morro Bay estuary, California
Labels:
- Train (2018–2021): pixel-level mask
- Test (2022): 1,000 human-labeled points
Tiling: 448×448 chips

Binary Ground-Truth Mask (Rasterized from Polygons)

Background Concepts

Types of Uncertainty

Predictive uncertainty is the sum of aleatoric and epistemic uncertainties.
- Aleatoric: noise inherent to data
- Epistemic: model uncertainty (insufficient/shifted knowledge)
Under drift, epistemic uncertainty should increase

Deep Ensembles

Ensemble members \(\{f_k\}_{k=1}^K\) output probabilities \(p_y^{(k)}(x)\)
Practical epistemic proxy (disagreement): \(V(x) = \mathrm{Var}_k\big(p_y^{(k)}(x)\big)\)
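
The disagreement proxy can be sketched in a few lines of NumPy. This is a minimal illustration, not the pipeline used here: the function name, the `(K, H, W)` array layout, and the use of population variance (`ddof=0`) are assumptions. In the binary case the variance of the eelgrass probability equals that of the background probability, so one class is enough.

```python
import numpy as np

def ensemble_mean_and_variance(member_probs):
    """member_probs: array of shape (K, H, W) holding each member's
    per-pixel probability for one class (layout is an assumption)."""
    member_probs = np.asarray(member_probs, dtype=float)
    mean_prob = member_probs.mean(axis=0)   # ensemble mean probability
    variance = member_probs.var(axis=0)     # V(x): variance across the K members
    return mean_prob, variance

# Toy example: K = 3 members, a 1x2 image.
# The members agree on the first pixel and disagree on the second.
probs = np.array([[[0.9, 0.2]], [[0.9, 0.5]], [[0.9, 0.8]]])
mean_prob, variance = ensemble_mean_and_variance(probs)
```

Pixels where the members agree get zero variance; the second pixel, where predictions range from 0.2 to 0.8, gets a large one.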

Model Architectures Used

Diverse heterogeneous ensemble (K=6):
- DeepLabv3+, U-Net, SAM-LoRA variants
- Single-year (2021) + multi-year (2018–2021) models

Split Conformal Prediction

Goal: For each pixel, return a set of possible labels with ~\(1-\alpha\) coverage.

Train a classifier that outputs \(p(y|x)\)
Use a holdout (calibration) set to choose a cutoff \(\hat q_\alpha\):
- For each point, measure how wrong the model was (low probability on the true label = big error)
- Set \(\hat q_\alpha\) to roughly the \((1-\alpha)\) quantile of these errors
For a new pixel \(x\), include a label if its probability is high enough: \[ C(x)=\{y:p(y \mid x)\ge 1-\hat q_\alpha\}. \]

Now for the split conformal prediction algorithm.
First, take any model that outputs class probabilities, trained on a training set.
Then, on a calibration set, measure for each point how wrong the model was, also known as how nonconforming the model's output was for that prediction. A low predicted probability for the true class means a big error. Ordering these nonconformity scores and taking the appropriate quantile gives the threshold \(\hat q_\alpha\).
Finally, on the test set, for a new pixel, include a class in the prediction set if its probability is high enough compared to the threshold.

This gives us some nice guarantees:
- Coverage (the fraction of prediction sets that include the true class) will be at least \(1-\alpha\) on average
- Distribution-free: no assumptions about the form of the data distribution, beyond exchangeability of calibration and test points
- Works for any predictive model, even if misspecified
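
The calibration and prediction steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the `(n, C)` probability layout are hypothetical, and the threshold uses the usual finite-sample rank \(\lceil (n+1)(1-\alpha) \rceil\), which the slides do not spell out.

```python
import numpy as np

def calibrate_threshold(cal_probs, cal_labels, alpha):
    """cal_probs: (n, C) class probabilities; cal_labels: (n,) integer labels."""
    n = len(cal_labels)
    # Nonconformity score: one minus the probability given to the true label.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected rank of the score used as the threshold.
    rank = int(np.ceil((n + 1) * (1 - alpha)))
    if rank > n:
        return 1.0  # too few calibration points: every label is kept
    return np.sort(scores)[rank - 1]

def prediction_sets(test_probs, q_hat):
    """Boolean (m, C) mask: True where a label belongs to the prediction set."""
    return test_probs >= 1.0 - q_hat

# Toy calibration set: 10 points, all with true label 0.
p_true = np.array([0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.3])
cal_probs = np.stack([p_true, 1 - p_true], axis=1)
q_hat = calibrate_threshold(cal_probs, np.zeros(10, dtype=int), alpha=0.2)
sets = prediction_sets(np.array([[0.9, 0.1], [0.5, 0.5], [0.3, 0.7]]), q_hat)
```

In the toy example the second test pixel is ambiguous (0.5 / 0.5) and neither label clears the cutoff, so its set is empty.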

Image: https://daniel-bethell.co.uk/posts/conformal-prediction-guide/

Two Main Limitations

The guarantee is marginal (on average): some regions may over- or under-cover
Exchangeability: Assumes future data looks like past data

If conditions change in 2022, the \(\hat q_{\alpha}\) calibrated on earlier data no longer matches the model's errors, and prediction sets can undercover

Variance-Aware Score Normalization

Take vanilla score (how wrong model is): \[ S(y\mid x) = 1 - p(y \mid x). \]

Normalize by difficulty, measured by the ensemble variance \(V(x)\)

Method 1: Parametric Linear Scaling

\[S_{\lambda}(y \mid x)=\frac{S(y \mid x)}{1+\lambda V(x)},\quad \lambda \geq 0.\]

Assume scores increase linearly with difficulty
Choose \(\lambda\) via grid search balancing coverage and % singletons (efficiency)
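
A rough sketch of the linear scaling and the \(\lambda\) sweep is shown here. The selection rule in `pick_lambda` (highest singleton rate among values that reach target coverage) and the use of a separate labeled validation split are assumptions for illustration; the slides only say that the grid search balances coverage and singletons. All function names are hypothetical.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    n = len(scores)
    # Finite-sample rank, clipped to n when the calibration set is small.
    rank = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    return np.sort(scores)[rank - 1]

def linear_sets(probs, variance, lam, q_hat):
    # Label y is kept when (1 - p_y) / (1 + lam * V) <= q_hat.
    scaled = (1.0 - probs) / (1.0 + lam * variance[:, None])
    return scaled <= q_hat

def sweep_lambda(cal, val, lambdas, alpha):
    """cal and val are (probs, variance, labels) tuples.
    Returns one (lambda, q_hat, coverage, singleton_fraction) row per lambda."""
    cal_probs, cal_var, cal_y = cal
    val_probs, val_var, val_y = val
    rows = []
    for lam in lambdas:
        true_scores = (1.0 - cal_probs[np.arange(len(cal_y)), cal_y]) / (1.0 + lam * cal_var)
        q_hat = conformal_quantile(true_scores, alpha)
        sets = linear_sets(val_probs, val_var, lam, q_hat)
        coverage = sets[np.arange(len(val_y)), val_y].mean()
        singleton = (sets.sum(axis=1) == 1).mean()
        rows.append((lam, q_hat, coverage, singleton))
    return rows

def pick_lambda(rows, alpha):
    # Hypothetical rule: best singleton rate among lambdas meeting target coverage;
    # fall back to the highest-coverage lambda if none reach it.
    ok = [r for r in rows if r[2] >= 1 - alpha]
    if ok:
        return max(ok, key=lambda r: r[3])[0]
    return max(rows, key=lambda r: r[2])[0]
```

Because every score shrinks as \(\lambda\) grows, \(\hat q_\alpha\) can only decrease with \(\lambda\), which matches the trend in the q-hat column of Table 1.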

Method 2: Nonparametric Normalization

\[ S'(y \mid x) = \frac{S(y \mid x)}{\hat a(V(x))}\]

No linear assumption
Learn the scale function \(\hat a(v)\) from data; no tuning parameter
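
The slides do not say how \(\hat a\) is estimated, so the following is only one plausible sketch: a binned mean of true-label scores as a function of variance, linearly interpolated between bin centers. The estimator, the bin count, the `floor` guard, and the function names are all assumptions. Note that `np.interp` holds the end values constant outside the fitted range, which is one simple way to handle variances beyond calibration support.

```python
import numpy as np

def fit_scale(variance, scores, n_bins=10, floor=1e-3):
    """Return a_hat(v): binned mean score as a function of ensemble variance.
    (Hypothetical estimator; the source does not specify one.)"""
    edges = np.quantile(variance, np.linspace(0, 1, n_bins + 1))
    bin_idx = np.clip(np.searchsorted(edges, variance, side="right") - 1, 0, n_bins - 1)
    centers, means = [], []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            centers.append(variance[mask].mean())
            means.append(scores[mask].mean())
    centers = np.array(centers)
    means = np.maximum(np.array(means), floor)  # keep the divisor away from zero

    def a_hat(v):
        # Linear interpolation between bin centers; constant beyond the ends.
        return np.interp(v, centers, means)
    return a_hat

def normalized_sets(probs, variance, a_hat, q_hat):
    # Label y is kept when (1 - p_y) / a_hat(V) <= q_hat.
    return (1.0 - probs) / a_hat(variance)[:, None] <= q_hat
```

Since the normalized score is a ratio rather than a probability, its calibrated threshold is not bounded by 1. For the split conformal argument to carry over, \(\hat a\) would typically be fit on data separate from the points used to compute \(\hat q_\alpha\).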

Figure 1: Estimated normalization scale from 2021 calibration compared to the empirical score-variance relationship observed in 2022. The extension shows extrapolation beyond calibration support.

Evaluation Metrics

Primary:

Global coverage vs target \(1-\alpha\)
Set composition: % empty/singleton/two-label

Additional:

Class-conditional coverage
Coverage vs. variance; spatial robustness
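
These metrics are straightforward to compute from a boolean set mask. A minimal sketch, assuming prediction sets are stored as an `(n, C)` boolean array and labels as integers (the function name and the report keys are illustrative):

```python
import numpy as np

def evaluate_sets(sets, labels):
    """sets: boolean (n, C) mask of included labels; labels: (n,) true labels."""
    n = len(labels)
    covered = sets[np.arange(n), labels]   # is the true label in each set?
    sizes = sets.sum(axis=1)               # number of labels in each set
    report = {
        "coverage": covered.mean(),
        "pct_empty": 100 * (sizes == 0).mean(),
        "pct_singleton": 100 * (sizes == 1).mean(),
        "pct_two_label": 100 * (sizes == 2).mean(),
    }
    # Class-conditional coverage: coverage restricted to points of each true class.
    for c in np.unique(labels):
        report[f"coverage_class_{c}"] = covered[labels == c].mean()
    return report

sets = np.array([[True, False], [False, False], [True, True], [False, True]])
labels = np.array([0, 1, 1, 1])
report = evaluate_sets(sets, labels)
```

In the toy example, three of four sets contain the true label (coverage 0.75), with one empty set, two singletons, and one two-label set.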

Results

Table 1: In-distribution (2021 test) coverage for vanilla CP, linear variance-normalized scores across \(\lambda\), and the nonparametric normalizer at \(\alpha = 0.1\).

Method	q-hat	Coverage
Vanilla	0.317	0.9
Linear (λ = 0.5)	0.298	0.9
Linear (λ = 1.0)	0.281	0.9
Linear (λ = 2.0)	0.254	0.9
Linear (λ = 3.0)	0.233	0.9
Linear (λ = 4.0)	0.216	0.9
Nonparametric	1.512	0.9

Table 2: Temporal OOD (2022) global coverage for vanilla CP, linear variance-normalized scores across \(\lambda\), and the nonparametric normalizer at \(\alpha = 0.1\).

Method	Coverage
Vanilla	0.860
Linear (λ = 0.5)	0.867
Linear (λ = 1.0)	0.871
Linear (λ = 2.0)	0.891
Linear (λ = 3.0)	0.901
Linear (λ = 4.0)	0.908
Nonparametric	0.906

Figure 2: Overall coverage and singleton % on 2022 OOD points as a function of the linear shrink parameter \(\lambda\) at \(\alpha = 0.1\).

Table 3: Temporal OOD (2022) global coverage and set composition for Vanilla CP, Linear normalization \((\lambda = 3)\), and the Nonparametric normalizer at \(\alpha = 0.1\).

Method	q-hat	Coverage	% singletons	% empty	% two-label
Vanilla	0.317	0.860	92.7	7.3	0.0
Linear (λ = 3.0)	0.233	0.901	93.8	2.5	3.7
Nonparametric	1.511	0.906	96.4	0.8	2.8
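
A quick check on the vanilla row, assuming the two class probabilities sum to one at each pixel (an assumption about the model output, not something stated in the slides). With \(\hat q_\alpha = 0.317\), a label enters the set only if
\[ p(y \mid x) \ge 1 - 0.317 = 0.683. \]
Because \(0.683 > 0.5\), at most one of the two labels can pass, so
\[ C(x) = \begin{cases} \{\arg\max_y p(y \mid x)\} & \text{if } \max_y p(y \mid x) \ge 0.683, \\ \emptyset & \text{otherwise.} \end{cases} \]
This is consistent with the 0.0% two-label rate for vanilla CP, and it suggests that the 7.3% empty sets are pixels where neither class reaches a probability of 0.683.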

Class-conditional coverage (2022)

Table 4: Per-class coverage on 2022 OOD points at \(\alpha = 0.1\) for vanilla CP, linear normalization \((\lambda = 3)\), and the nonparametric normalizer.

Method	Coverage (background)	Coverage (eelgrass)	Difference
Vanilla	0.914	0.780	0.134
Linear (λ = 3.0)	0.943	0.839	0.104
Nonparametric	0.946	0.847	0.099

Spatial Robustness

Figure 3: Spatial blocks (10 equal-count spatial blocks formed by Morton/Z-ordering)

Figure 4: Per-block mean variance for 2022 points

Figure 5: Per-block coverage for 2022 points for vanilla, linear \((\lambda=3)\), and nonparametric methods
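
One way to form equal-count spatial blocks by Morton/Z-ordering is sketched here. The grid resolution (`bits=16`), the coordinate quantization, and the function names are assumptions; the slides state only that 10 equal-count blocks were formed this way.

```python
import numpy as np

def morton_code(ix, iy, bits=16):
    """Interleave the bits of integer grid coordinates (ix, iy)."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (2 * b)       # x bits go to even positions
        code |= ((iy >> b) & 1) << (2 * b + 1)   # y bits go to odd positions
    return code

def equal_count_blocks(x, y, n_blocks=10, bits=16):
    """Assign each point a block id so that blocks hold (nearly) equal counts."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)

    def quantize(v):
        # Map coordinates onto an integer grid with 2**bits cells per axis.
        span = v.max() - v.min()
        scaled = (v - v.min()) / span if span > 0 else np.zeros_like(v)
        return (scaled * (2**bits - 1)).astype(np.int64)

    ix, iy = quantize(x), quantize(y)
    codes = np.array([morton_code(int(a), int(b), bits) for a, b in zip(ix, iy)])
    order = np.argsort(codes, kind="stable")     # points sorted along the Z-curve
    block = np.empty(len(x), dtype=int)
    for b, idx in enumerate(np.array_split(order, n_blocks)):
        block[idx] = b                           # consecutive runs become blocks
    return block
```

Sorting along the Z-curve keeps nearby points mostly together, so cutting the sorted order into equal runs gives spatially compact blocks with matching sample sizes. Per-block coverage is then the mean of the covered indicator within each block id.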

Visual Method Comparison

Figure 6: Visuals of the 2022 raster and overlaid ground truth points (Blue = Other, Green = Eelgrass).

Figure 7: Set composition overlay for 2022 raster for vanilla, linear \((\lambda=3)\), and nonparametric methods.

Figure 8: Set composition changes from vanilla for linear \((\lambda=3)\) and nonparametric methods. Counts for each transition type are reported in the legend.

Figure 9: Example ensemble inference outputs on a 2022 chip: Raw chip (+ point), ensemble mean probability, ensemble variance, and argmax segmentation.

Figure 10: Chip-level set prediction comparison on a representative 2022 point (GT = eelgrass). Both vanilla and linear methods return empty sets at that pixel, while the nonparametric method returns the correct singleton {eelgrass}.

Sensitivity Analysis

Table 5: Coverage and set composition on 2022 OOD points across \(\alpha \in \{0.10, 0.05, 0.025\}\).

Method	Coverage	Empty (%)	Single (%)	Two-label (%)
α = 0.025
Linear (λ=3)	0.967	0.0	70.8	29.2
Nonparametric	0.972	0.3	65.4	34.3
Vanilla	0.949	0.0	87.2	12.8
α = 0.05
Linear (λ=3)	0.952	0.0	80.5	19.5
Nonparametric	0.944	0.4	87.0	12.6
Vanilla	0.919	0.0	95.9	4.1
α = 0.1
Linear (λ=3)	0.901	2.5	93.8	3.7
Nonparametric	0.906	0.8	96.4	2.8
Vanilla	0.860	7.3	92.7	0.0

Figure 11: Coverage across \(\alpha \in \{0.1, 0.05, 0.025\}\). Dashed line shows the target coverage 1 - α.

Figure 12: Set composition (empty, single, two-label) across \(\alpha \in \{0.1, 0.05, 0.025\}\).

Takeaways

ID: all conformal methods meet target coverage
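OOD (2022): vanilla CP undercovers (0.860 vs the 0.90 target); ensemble variance normalization recovers the lost coverage (0.901 for linear with λ = 3, 0.906 for nonparametric)
Variance normalization improved coverage across variance levels and made coverage more even across spatial blocks
At \(\alpha = 0.1\), the nonparametric normalizer worked best (highest coverage, fewest empty sets, smallest gap between classes) and requires no parameter tuning.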

Limitations

Formal CP guarantees require exchangeability; this work only targets empirical robustness under shift.
Assumes score–variance relationship transfers across years (may fail under larger shifts)
2022 uses point labels (not full dense raster labels)
Ensembles add compute cost

Conclusion

Ensemble disagreement provides a practical way to make conformal prediction sets more reliable under slight shift for eelgrass segmentation.
Works as a post-hoc wrapper: no retraining, no test-time labels