Adaptive Conformal Sets Under Distribution Shift

Using Ensemble Disagreement for Reliable Eelgrass Mapping

Dillon Murphy

Cal Poly San Luis Obispo

Background & Methods

Why eelgrass mapping matters

  • Eelgrass (Zostera marina) supports coastal ecosystems (habitat restoration, sediment stabilization, carbon sequestration)
  • Monitoring requires accurate maps over time (different imaging conditions)
  • Actual deployments need reliability: when should we trust the model?

Core Problem: Domain Shift Breaks Confidence

  • Imagery can change year-to-year (tide height, lighting, turbidity, morphology, drone degradation)
  • A model can be confident yet wrong when the data changes (drift)
  • Want uncertainty quantification that flags:
    • ambiguous pixels
    • “shifted” regions where the model is less reliable

Data: Morro Bay, California (2018–2022)

  • Drone imagery of Morro Bay estuary, California
  • Labels:
    • Train (2018–2021): pixel-level mask
    • Test (2022): 1,000 human-labeled points
  • Tiling: 448×448 chips

Example Raw Chip (2021)

Binary Ground-Truth Mask (Rasterized from Polygons)

Background Concepts

Types of Uncertainty

  • Predictive uncertainty is the sum of aleatoric and epistemic uncertainties.
    • Aleatoric: noise inherent to data
    • Epistemic: model uncertainty (insufficient/shifted knowledge)
  • Under drift, epistemic uncertainty should increase

Deep Ensembles

  • Ensemble members \(\{f_k\}_{k=1}^K\) output probabilities \(p_y^{(k)}(x)\)
  • Practical epistemic proxy (disagreement): \(V(x) = \operatorname{Var}_k(p_y^{(k)}(x))\)
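
A minimal NumPy sketch of the disagreement proxy, assuming each member's eelgrass probability map is stacked into one array. The name `ensemble_stats` and the (K, H, W) layout are illustrative assumptions. The slides do not say whether population or sample variance is used; population variance is assumed here. For a binary problem the variance of the eelgrass probability equals the variance of the background probability, so one map suffices.

```python
import numpy as np

def ensemble_stats(probs):
    """probs: (K, H, W) array, one eelgrass-probability map per ensemble member.

    Returns the per-pixel ensemble mean and the disagreement V(x),
    the variance of the probabilities across the K members.
    """
    probs = np.asarray(probs, dtype=float)
    mean = probs.mean(axis=0)
    var = probs.var(axis=0)  # population variance (assumption, see lead-in)
    return mean, var

# Toy example: K = 3 members on a 1x2 "image".
# Pixel 0: all members agree. Pixel 1: members disagree.
toy = np.array([[[0.9, 0.2]], [[0.9, 0.5]], [[0.9, 0.8]]])
mean, var = ensemble_stats(toy)
```

In the toy example the second pixel has nonzero variance, which is the signal the later normalization steps rely on.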

Model Architectures Used

  • Diverse heterogeneous ensemble (K=6):
    • DeepLabv3+, U-Net, SAM-LoRA variants
    • Single-year (2021) + multi-year (2018–2021) models

SAM-LoRA architecture

U-Net architecture

DeepLabv3+ architecture

Split Conformal Prediction

Goal: For each pixel, return a set of possible labels with ~\(1-\alpha\) coverage.

  1. Train a classifier that outputs \(p(y|x)\)

  2. Use a holdout (calibration) set to choose a cutoff \(\hat q_\alpha\):

    • For each point, measure how wrong the model was (low probability on the true label = big error)
  3. For a new pixel (x), include a label if its probability is high enough: \[ C(x)=\{y:p(y \mid x)\ge 1-\hat q_\alpha\}. \]
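
The three steps above can be sketched as follows. This is a hedged illustration rather than the talk's implementation: the function names are hypothetical, and the cutoff uses the standard finite-sample order statistic of rank \(\lceil (n+1)(1-\alpha) \rceil\), which the slides do not spell out.

```python
import numpy as np

def calibrate_qhat(cal_scores, alpha):
    """Cutoff from calibration scores S = 1 - p(true label | x)."""
    scores = np.sort(np.asarray(cal_scores, dtype=float))
    n = scores.size
    rank = int(np.ceil((n + 1) * (1 - alpha)))  # 1-indexed order statistic
    if rank > n:
        return np.inf  # too few calibration points for this alpha
    return scores[rank - 1]

def prediction_sets(probs, qhat):
    """probs: (N, C) class probabilities.

    Returns a boolean (N, C) mask; label y is in C(x) when
    p(y | x) >= 1 - qhat.
    """
    return np.asarray(probs, dtype=float) >= 1.0 - qhat

# Toy calibration scores (made up), n = 10, alpha = 0.2 -> rank 9.
cal_scores = [0.01, 0.05, 0.08, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.45]
qhat = calibrate_qhat(cal_scores, alpha=0.2)

# Columns are (background, eelgrass). The middle pixel is too ambiguous
# for either label to clear the 1 - qhat threshold, so its set is empty.
sets = prediction_sets([[0.70, 0.30], [0.55, 0.45], [0.95, 0.05]], qhat)
```

With two classes and a cutoff below 0.5, a pixel can receive an empty set or a singleton but never both labels.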

Two Main Limitations

  1. The guarantee is marginal (an average over all pixels); individual regions may over- or under-cover
  2. Exchangeability: assumes future data is distributed like the calibration data

If conditions change in 2022, the calibrated \(\hat q_{\alpha}\) can be too small, so the prediction sets undercover

Variance-Aware Score Normalization

Take vanilla score (how wrong model is): \[ S(y\mid x) = 1 - p(y \mid x). \]

Normalize by difficulty, measured by the ensemble variance \(V(x)\)

Method 1: Parametric Linear Scaling

\[S_{\lambda}(y \mid x)=\frac{S(y \mid x)}{1+\lambda V(x)},\quad \lambda \geq 0.\]

  • Assume scores increase linearly with difficulty

  • Choose \(\lambda\) via grid search balancing coverage and % singletons (efficiency)
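
A hedged sketch of the linear normalization. Dividing the score by \(1+\lambda V(x)\) is equivalent to including label \(y\) when \(p(y \mid x) \ge 1 - \hat q_\lambda (1+\lambda V(x))\), so high-variance pixels get a looser threshold. The helper names are hypothetical, the sweep data is synthetic, and the rule for picking \(\lambda\) from the sweep is left open because the slides do not specify it.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Order statistic of rank ceil((n + 1)(1 - alpha)) among n scores."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = scores.size
    rank = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if rank > n else scores[rank - 1]

def linear_sets(probs, var, lam, qhat):
    """probs: (N, C); var: (N,). Keep label y when S / (1 + lam * V) <= qhat."""
    probs = np.asarray(probs, dtype=float)
    var = np.asarray(var, dtype=float)
    scaled = (1.0 - probs) / (1.0 + lam * var[:, None])
    return scaled <= qhat

def sweep_lambda(cal, val, lambdas, alpha):
    """cal and val are (probs, labels, var) tuples. For each lambda,
    recalibrate qhat on cal, then report
    (lambda, qhat, coverage, singleton fraction) on val."""
    cal_probs, cal_labels, cal_var = cal
    val_probs, val_labels, val_var = val
    rows = []
    for lam in lambdas:
        true_p = cal_probs[np.arange(len(cal_labels)), cal_labels]
        qhat = conformal_quantile((1.0 - true_p) / (1.0 + lam * cal_var), alpha)
        sets = linear_sets(val_probs, val_var, lam, qhat)
        coverage = sets[np.arange(len(val_labels)), val_labels].mean()
        singleton = (sets.sum(axis=1) == 1).mean()
        rows.append((lam, qhat, coverage, singleton))
    return rows

# Two pixels with identical probabilities but different ensemble variance,
# using the lambda = 3 cutoff reported in Table 1 (qhat = 0.233).
demo = linear_sets([[0.7, 0.3], [0.7, 0.3]], var=[0.0, 0.1], lam=3.0, qhat=0.233)

# Synthetic (made-up) data, only to exercise the sweep.
rng = np.random.default_rng(0)

def make_split(n):
    p = rng.uniform(0.0, 1.0, n)
    labels = (rng.uniform(0.0, 1.0, n) < p).astype(int)
    return np.column_stack([1.0 - p, p]), labels, rng.uniform(0.0, 0.05, n)

rows = sweep_lambda(make_split(500), make_split(500), [0.0, 1.0, 3.0], alpha=0.1)
```

In `demo`, the zero-variance pixel gets an empty set while the higher-variance pixel keeps its top label, which is the mechanism by which normalization can trade empty sets for coverage.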

Method 2: Nonparametric Normalization

\[ S'(y \mid x) = \frac{S(y \mid x)}{\hat a(V(x))} \]

  • No linear assumption
  • Learn the scale function \(\hat a(v)\) from calibration data; no tuning parameter
Figure 1: Estimated normalization scale from 2021 calibration compared to the empirical score-variance relationship observed in 2022. The extension shows extrapolation beyond calibration support.
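
One way such a scale function could be estimated is sketched below. The estimator is an assumption, not taken from the slides: it bins calibration points by variance into equal-count bins, uses the mean score per bin as \(\hat a\), and interpolates between bin centers. Beyond the calibration support, `np.interp` holds the end values constant, which is only one of several possible extrapolation choices. All names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_scale(cal_var, cal_scores, n_bins=10, floor=1e-3):
    """Estimate a_hat(v): typical score size as a function of variance."""
    cal_var = np.asarray(cal_var, dtype=float)
    cal_scores = np.asarray(cal_scores, dtype=float)
    order = np.argsort(cal_var)
    centers, scales = [], []
    for chunk in np.array_split(order, n_bins):  # equal-count variance bins
        if chunk.size == 0:
            continue
        centers.append(cal_var[chunk].mean())
        # Floor keeps the denominator away from zero.
        scales.append(max(cal_scores[chunk].mean(), floor))
    centers = np.array(centers)
    scales = np.array(scales)

    def a_hat(v):
        # Linear interpolation inside the support, constant outside it.
        return np.interp(v, centers, scales)

    return a_hat

def normalized_sets(probs, var, a_hat, qhat):
    """Keep label y when S(y | x) / a_hat(V(x)) <= qhat."""
    probs = np.asarray(probs, dtype=float)
    scale = a_hat(np.asarray(var, dtype=float))[:, None]
    return (1.0 - probs) / scale <= qhat

# Synthetic calibration data in which scores grow with variance.
rng = np.random.default_rng(0)
cal_var = rng.uniform(0.0, 0.1, 2000)
cal_scores = np.clip(0.05 + 3.0 * cal_var + rng.normal(0.0, 0.02, 2000), 0.0, 1.0)

a_hat = fit_scale(cal_var, cal_scores)
normalized = np.sort(cal_scores / a_hat(cal_var))
rank = int(np.ceil((normalized.size + 1) * 0.9))  # alpha = 0.1
qhat = normalized[rank - 1]

# A test pixel whose variance (0.5) lies outside the calibration range.
sets = normalized_sets([[0.7, 0.3]], [0.5], a_hat, qhat)
```

Because the scores are divided by a typical score size, the calibrated cutoff lives on a ratio scale rather than a probability scale, so values above 1 are possible.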

Evaluation Metrics

Primary:

  • Global coverage vs target \(1-\alpha\)
  • Set composition: % empty/singleton/two-label

Additional:

  • class-conditional coverage
  • coverage as a function of ensemble variance, and spatial robustness (per-block coverage)
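
The metrics above can be computed from a boolean set mask and integer labels, as in this small sketch. The function name and the dictionary keys are illustrative assumptions.

```python
import numpy as np

def evaluate_sets(sets, labels):
    """sets: (N, C) boolean mask of included labels; labels: (N,) true classes."""
    sets = np.asarray(sets, dtype=bool)
    labels = np.asarray(labels)
    n = len(labels)
    covered = sets[np.arange(n), labels]  # is the true label in the set?
    sizes = sets.sum(axis=1)
    report = {
        "coverage": covered.mean(),
        "pct_empty": 100.0 * (sizes == 0).mean(),
        "pct_singleton": 100.0 * (sizes == 1).mean(),
        "pct_two_label": 100.0 * (sizes == 2).mean(),
    }
    # Class-conditional coverage: coverage restricted to each true class.
    for c in range(sets.shape[1]):
        mask = labels == c
        report[f"coverage_class_{c}"] = covered[mask].mean() if mask.any() else float("nan")
    return report

# Toy example with columns (background, eelgrass).
toy_sets = [[True, False], [False, True], [True, True], [False, False]]
toy_labels = [0, 0, 1, 1]
report = evaluate_sets(toy_sets, toy_labels)
```

Coverage against variance or spatial block follows the same pattern, with `covered` averaged within each variance bin or block instead of within each class.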

Results

Table 1: In-distribution (2021 test) coverage for vanilla CP, linear variance-normalized scores across \(\lambda\), and the nonparametric normalizer at \(\alpha = 0.1\).

Method              q-hat   Coverage
Vanilla             0.317   0.9
Linear (λ = 0.5)    0.298   0.9
Linear (λ = 1.0)    0.281   0.9
Linear (λ = 2.0)    0.254   0.9
Linear (λ = 3.0)    0.233   0.9
Linear (λ = 4.0)    0.216   0.9
Nonparametric       1.512   0.9

Table 2: Temporal OOD (2022) global coverage for vanilla CP, linear variance-normalized scores across \(\lambda\), and the nonparametric normalizer at \(\alpha = 0.1\).

Method              Coverage
Vanilla             0.860
Linear (λ = 0.5)    0.867
Linear (λ = 1.0)    0.871
Linear (λ = 2.0)    0.891
Linear (λ = 3.0)    0.901
Linear (λ = 4.0)    0.908
Nonparametric       0.906

Figure 2: Overall coverage and singleton % on 2022 OOD points as a function of the linear shrink parameter \(\lambda\) at \(\alpha = 0.1\).
Table 3: Temporal OOD (2022) global coverage and set composition for vanilla CP, linear normalization \((\lambda = 3)\), and the nonparametric normalizer at \(\alpha = 0.1\).

Method              q-hat   Coverage   % singletons   % empty   % two-label
Vanilla             0.317   0.860      92.7           7.3       0.0
Linear (λ = 3.0)    0.233   0.901      93.8           2.5       3.7
Nonparametric       1.511   0.906      96.4           0.8       2.8

Class-conditional coverage (2022)

Table 4: Per-class coverage on 2022 OOD points at \(\alpha = 0.1\) for vanilla CP, linear normalization \((\lambda = 3)\), and the nonparametric normalizer.

Method              Coverage (background)   Coverage (eelgrass)   Difference (background − eelgrass)
Vanilla             0.914                   0.780                 0.134
Linear (λ = 3.0)    0.943                   0.839                 0.104
Nonparametric       0.946                   0.847                 0.099

Spatial Robustness

Figure 3: Spatial blocks (10 equal-count spatial blocks formed by Morton/Z-ordering)

Figure 4: Per-block mean variance for 2022 points
Figure 5: Per-block coverage for 2022 points for vanilla, linear \((\lambda=3)\), and nonparametric methods
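
A hedged sketch of how equal-count spatial blocks could be formed with Morton (Z-order) codes, as described for Figure 3. The 16-bit quantization of coordinates and the helper names are assumptions; the slides only state that 10 equal-count blocks were formed by Morton/Z-ordering.

```python
import numpy as np

def part1by1(v):
    """Spread the low 16 bits of v so a zero bit sits between each pair."""
    v = np.asarray(v).astype(np.uint32) & np.uint32(0x0000FFFF)
    v = (v | (v << np.uint32(8))) & np.uint32(0x00FF00FF)
    v = (v | (v << np.uint32(4))) & np.uint32(0x0F0F0F0F)
    v = (v | (v << np.uint32(2))) & np.uint32(0x33333333)
    v = (v | (v << np.uint32(1))) & np.uint32(0x55555555)
    return v

def morton_code(ix, iy):
    """Interleave the bits of integer grid coordinates (x in even bits)."""
    return part1by1(ix) | (part1by1(iy) << np.uint32(1))

def quantize(c):
    """Map float coordinates onto a 16-bit integer grid (assumed resolution)."""
    c = np.asarray(c, dtype=float)
    span = c.max() - c.min()
    if span == 0:
        return np.zeros(c.shape, dtype=np.uint32)
    return np.round((c - c.min()) / span * 65535).astype(np.uint32)

def morton_blocks(x, y, n_blocks=10):
    """Assign each point to one of n_blocks equal-count blocks along the Z-curve."""
    codes = morton_code(quantize(x), quantize(y))
    order = np.argsort(codes, kind="stable")
    block = np.empty(len(codes), dtype=int)
    for b, idx in enumerate(np.array_split(order, n_blocks)):
        block[idx] = b
    return block

# 1,000 synthetic point locations, matching the size of the 2022 test set.
rng = np.random.default_rng(0)
x, y = rng.uniform(0, 1, 1000), rng.uniform(0, 1, 1000)
block = morton_blocks(x, y)
counts = np.bincount(block, minlength=10)  # points per block
```

Sorting along the Z-curve keeps nearby points in the same block most of the time, so per-block coverage and per-block mean variance can then be computed by grouping on `block`.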

Visual Method Comparison

Figure 6: The 2022 raster with overlaid ground-truth points (Blue = Other, Green = Eelgrass).
Figure 7: Set composition overlay for 2022 raster for vanilla, linear \((\lambda=3)\), and nonparametric methods.
Figure 8: Set composition changes from vanilla for linear \((\lambda=3)\) and nonparametric methods. Counts for each transition type are reported in the legend.
Figure 9: Example ensemble inference outputs on a 2022 chip: Raw chip (+ point), ensemble mean probability, ensemble variance, and argmax segmentation.
Figure 10: Chip-level set prediction comparison on a representative 2022 point (GT = eelgrass). Both vanilla and linear methods return empty sets at that pixel, while the nonparametric method returns the correct singleton {eelgrass}.

Sensitivity Analysis

Table 5: Coverage and set composition on 2022 OOD points across \(\alpha \in \{0.10, 0.05, 0.025\}\).

Method           Coverage   Empty (%)   Single (%)   Two-label (%)
α = 0.025
Linear (λ=3)     0.967      0.0         70.8         29.2
Nonparametric    0.972      0.3         65.4         34.3
Vanilla          0.949      0.0         87.2         12.8
α = 0.05
Linear (λ=3)     0.952      0.0         80.5         19.5
Nonparametric    0.944      0.4         87.0         12.6
Vanilla          0.919      0.0         95.9         4.1
α = 0.1
Linear (λ=3)     0.901      2.5         93.8         3.7
Nonparametric    0.906      0.8         96.4         2.8
Vanilla          0.860      7.3         92.7         0.0

Figure 11: Coverage across \(\alpha \in \{0.1, 0.05, 0.025\}\). Dashed line shows the target coverage \(1 - \alpha\).
Figure 12: Set composition (empty, single, two-label) across \(\alpha \in \{0.1, 0.05, 0.025\}\).

Takeaways

  1. ID (2021): all conformal methods meet target coverage
  2. OOD (2022): vanilla CP undercovers; ensemble-variance normalization recovers the lost coverage
  3. Variance normalization improved coverage across variance levels and made coverage more even across spatial blocks
  4. The nonparametric normalizer seemed to work best and requires no parameter tuning

Limitations

  • Formal CP guarantees require exchangeability. This work targets only empirical robustness.
  • Assumes score–variance relationship transfers across years (may fail under larger shifts)
  • 2022 uses point labels (not full dense raster labels)
  • Ensembles add compute cost

Conclusion

  • Ensemble disagreement provides a practical way to make conformal prediction sets more reliable under slight shift for eelgrass segmentation.
  • Works as a post-hoc wrapper: no retraining, no test-time labels

Questions?