Adaptive Conformal Sets Under Distribution Shift

Using Ensemble Disagreement as an Epistemic Normalizer for Eelgrass Segmentation under Temporal Drift

Dillon Murphy

Cal Poly San Luis Obispo

Background & Methods

Why eelgrass mapping matters

  • Eelgrass (Zostera marina) supports coastal ecosystems (habitat restoration, sediment stabilization, carbon sequestration)
  • Monitoring requires accurate maps over time (different imaging conditions)
  • Actual deployments need reliability: when should we trust the model?

Core Problem: Domain Shift Breaks Confidence

  • Imagery can change year-to-year (tide height, lighting, turbidity, morphology, drone degradation)
  • A model can be confident yet wrong under drift
  • Want uncertainty quantification that flags:
    • ambiguous pixels
    • “shifted” regions where the model is less reliable

Data: Morro Bay, California (2018–2022)

  • DJI Phantom 4 Pro drone orthomosaics of Morro Bay estuary, California
  • Labels:
    • 2018–2021: dense polygon annotations → rasterized labels
    • 2022: 1,000 human-labeled points (cheap evaluation)
  • Tiling: 448×448 chips, 50% overlap
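A minimal sketch of the tiling arithmetic: 448-pixel chips at 50% overlap imply a 224-pixel stride. The function name `chip_origins` and the edge handling (shifting the last chip inward) are assumptions made for illustration; the slides do not say how image borders were treated.

```python
def chip_origins(height, width, chip=448, overlap=0.5):
    """Return (row, col) top-left corners of chips covering an image.

    Assumption: the last chip in each direction is shifted inward so it
    stays inside the image; the slides do not specify edge handling.
    """
    stride = int(chip * (1 - overlap))  # 224 px for 448 px chips at 50% overlap

    def starts(size):
        if size <= chip:
            return [0]
        s = list(range(0, size - chip + 1, stride))
        if s[-1] != size - chip:
            s.append(size - chip)  # final chip sits flush with the image edge
        return s

    return [(r, c) for r in starts(height) for c in starts(width)]


# Hypothetical 1000 x 900 pixel image, used only to exercise the function.
origins = chip_origins(1000, 900)
```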

Example Raw Chip (2021)

Binary Ground-Truth Mask (Rasterized from Polygons)

Background Concepts

Types of Uncertainty

  • Predictive uncertainty is the sum of aleatoric and epistemic uncertainties.
    • Aleatoric: noise inherent to data
    • Epistemic: model uncertainty (insufficient/shifted knowledge)
  • Under drift, a proxy for epistemic uncertainty should increase

Deep Ensembles

  • Ensemble members \(\{f_k\}_{k=1}^K\) output probabilities \(p_y^{(k)}(x)\)
  • Practical epistemic proxy, ensemble disagreement: \(V(x) = \operatorname{Var}_k\big(p_y^{(k)}(x)\big)\)
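The disagreement map can be sketched in a few lines of NumPy. This assumes the member outputs are stacked into an array of shape (K, H, W) holding each member's eelgrass probability, and that the population variance (ddof=0) is used; neither detail is stated on the slides.

```python
import numpy as np


def ensemble_mean_and_variance(probs):
    """probs: array of shape (K, H, W), one probability map per member.

    Returns the per-pixel ensemble mean and the disagreement V(x),
    i.e. the variance across the K members.
    """
    probs = np.asarray(probs, dtype=float)
    mean = probs.mean(axis=0)
    var = probs.var(axis=0)  # population variance (ddof=0); an assumption
    return mean, var


# Toy example with K=3 members on a 1x2 image (values are made up).
toy = np.array([[[0.9, 0.2]], [[0.9, 0.5]], [[0.9, 0.8]]])
mean, var = ensemble_mean_and_variance(toy)
```

In the toy example the first pixel has identical member outputs, so its variance is zero, while the second pixel, where members disagree, gets a positive variance.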

Model Architectures Used

  • Heterogeneous ensemble (K=6):
    • DeepLabv3+, U-Net, SAM-LoRA variants
    • Single-year (2021) + multi-year (2018–2021) models

SAM-LoRA architecture

U-Net architecture

DeepLabv3+ architecture

Split Conformal Prediction for Classification (vanilla baseline)

Conformal Guarantees

  1. Finite-sample coverage guarantee (exact)
  2. For any data distribution (distribution-free)
  3. For any predictive model (model-free)

Two main limitations:

  1. Marginal coverage
  2. Exchangeability assumption
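For reference, a minimal sketch of the vanilla split conformal baseline. It assumes the common score \(s = 1 - p_y\) (one minus the probability of the true label) and the usual finite-sample-corrected quantile; the slides do not spell out the score, and the function names are illustrative.

```python
import math

import numpy as np


def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    s = np.sort(np.asarray(cal_scores, dtype=float))
    n = s.size
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    return float(s[k - 1]) if k <= n else float("inf")


def prediction_sets(probs, q_hat):
    """probs: (n, C) class probabilities. Returns a boolean (n, C) mask;
    label y is in the set when its score 1 - p_y is at most q_hat."""
    return (1.0 - np.asarray(probs, dtype=float)) <= q_hat


# Toy calibration data (made up): probability assigned to the true label.
p_true = np.array([0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55])
q_hat = conformal_quantile(1.0 - p_true, alpha=0.2)
sets = prediction_sets([[0.9, 0.1], [0.55, 0.45]], q_hat)
```

With two classes and a threshold below 0.5, a pixel whose probabilities are both near 0.5 receives an empty set, as the second toy pixel does.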

CP Under Drift

  • In 2022, difficult / OOD pixels shift the score distribution upward, so the 2021-calibrated threshold \(\hat q_{\alpha}\) is too small and the sets undercover
  • Existing approaches are difficult to apply in high-dimensional drone imagery segmentation
  • Instead, we tailor difficulty-normalized CP, using ensemble disagreement as a shift-aware normalizer

Variance-Aware Score Normalization

Method 1: Parametric Linear Scaling

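Replace the vanilla score \(s_i\) with \(s_i^\star = s_i/g(v_i)\) for each calibration point \(i \in C\), where \(v_i = V(x_i)\) is the ensemble variance and \(g > 0\) is a difficulty scale. Then run split CP on \(s^\star\). The linear choice is \(g(v) = 1 + \lambda v\):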

\[S_{\lambda}(x,y)=\frac{S(x,y)}{1+\lambda V(x)},\quad \lambda \geq 0.\]
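
Because calibration runs on the normalized score, inclusion in the prediction set is equivalent to comparing the raw score against a variance-dependent threshold. A worked illustration, assuming the \(\lambda = 3\) threshold \(\hat q_\lambda = 0.233\) reported in Table 2 and a hypothetical pixel with \(V(x) = 0.1\) (the symbol \(\hat C_\lambda\) for the prediction set is introduced here for illustration only):

\[ y \in \hat C_\lambda(x) \iff S_\lambda(x,y) \le \hat q_\lambda \iff S(x,y) \le \hat q_\lambda\,\bigl(1+\lambda V(x)\bigr) \]

\[ \hat q_\lambda\,\bigl(1+\lambda V(x)\bigr) = 0.233 \times (1 + 3 \times 0.1) = 0.3029 \]

At \(V(x) = 0\) the raw threshold is 0.233, tighter than the vanilla 0.317; it exceeds the vanilla value only once \(V(x)\) is above roughly 0.12.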

  • Assume scores increase linearly with difficulty

  • Choose \(\lambda\) via grid search balancing OOD coverage and % singletons (efficiency)
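
One way to make the grid search concrete is sketched below. The selection rule (among values of \(\lambda\) that reach the target coverage, take the one with the most singletons) is an assumption; the slides only say that coverage and efficiency were balanced. The rows are copied from Table 3, with vanilla CP entered as \(\lambda = 0\) since dividing by \(1 + 0 \cdot V\) leaves the score unchanged.

```python
def normalized_score(score, variance, lam):
    """Linear variance-normalized score S / (1 + lam * V)."""
    return score / (1.0 + lam * variance)


def select_lambda(rows, alpha):
    """rows: (lambda, coverage, pct_singletons) tuples.

    Hypothetical rule: keep rows that reach coverage >= 1 - alpha and return
    the lambda with the most singletons; if none qualifies, return the lambda
    with the highest coverage.
    """
    valid = [r for r in rows if r[1] >= 1 - alpha]
    if not valid:
        return max(rows, key=lambda r: r[1])[0]
    return max(valid, key=lambda r: r[2])[0]


# (lambda, coverage, % singletons) copied from Table 3; vanilla is lambda = 0.
table3 = [
    (0.0, 0.860, 92.7),
    (0.5, 0.867, 94.4),
    (1.0, 0.871, 95.4),
    (2.0, 0.891, 95.6),
    (3.0, 0.901, 93.8),
    (4.0, 0.908, 92.4),
]
best_lambda = select_lambda(table3, alpha=0.1)
```

Under this rule the selected value is \(\lambda = 3\), the setting carried forward in the later tables.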

Method 2: Nonparametric Normalization

\[S=a(V)U,\quad a(V)>0,\quad U \perp \!\!\! \perp V \Rightarrow\]

\[ S'(x,y) = \frac{S(x,y)}{\hat a(V(x))}, \quad \hat a(V) \approx \mathbb{E}[S \mid V] \]

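  • Assume scores scale multiplicatively with a variance-dependent factor \(a(V)\), with the residual \(U\) independent of \(V\)

  • Learn the non-linear scale \(a(v)\) from calibration data; no tuning parameter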
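
A minimal sketch of one possible estimator for \(\hat a\). The slides do not say how \(\mathbb{E}[S \mid V]\) was estimated, so the equal-count binning, the linear interpolation between bin centers, and the flat extrapolation outside the calibration range are all assumptions, as is the function name `fit_scale`. The synthetic data follow the model \(S = a(V)U\) with a made-up \(a(v) = 0.1 + 2v\).

```python
import numpy as np


def fit_scale(cal_var, cal_scores, n_bins=10, floor=1e-6):
    """Estimate a_hat(v) ~ E[S | V = v] from calibration data with equal-count bins.

    Returns a function mapping variance to a positive scale.
    """
    v = np.asarray(cal_var, dtype=float)
    s = np.asarray(cal_scores, dtype=float)
    order = np.argsort(v)
    v_bins = np.array_split(v[order], n_bins)
    s_bins = np.array_split(s[order], n_bins)
    centers = np.array([b.mean() for b in v_bins])
    means = np.maximum([b.mean() for b in s_bins], floor)  # keep a_hat > 0

    def a_hat(v_new):
        # np.interp holds the end values constant outside the calibration range.
        return np.interp(v_new, centers, means)

    return a_hat


# Synthetic calibration data (made up) following S = a(V) * U with U independent of V.
rng = np.random.default_rng(0)
v = rng.uniform(0.0, 0.2, size=2000)
u = rng.uniform(0.5, 1.5, size=2000)
s = (0.1 + 2.0 * v) * u

a_hat = fit_scale(v, s)
s_norm = s / a_hat(v)  # normalized score S' = S / a_hat(V)
corr_before = np.corrcoef(v, s)[0, 1]
corr_after = np.corrcoef(v, s_norm)[0, 1]
```

After normalization the score should be close to uncorrelated with the variance, and split CP is then run on `s_norm`. Since the normalized score is a ratio to its conditional mean, a threshold above 1 is plausible, consistent with the nonparametric \(\hat q\) values in Tables 2 and 4.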

Evaluation Metrics

Primary:

  • Global coverage vs target \(1-\alpha\)
  • Set composition: % empty/singleton/two-label

Additional:

  • class-conditional coverage
  • coverage vs variance bins
  • spatial robustness
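
The primary metrics and class-conditional coverage can be computed directly from the prediction-set masks. A sketch, assuming binary sets stored as a boolean (n, 2) array and integer labels (0 = background, 1 = eelgrass); these encodings and function names are illustrative.

```python
import numpy as np


def coverage_and_composition(sets, labels):
    """sets: boolean (n, 2) mask of included labels; labels: (n,) true class ids."""
    sets = np.asarray(sets, dtype=bool)
    labels = np.asarray(labels, dtype=int)
    covered = sets[np.arange(labels.size), labels]  # is the true label in the set?
    size = sets.sum(axis=1)
    return {
        "coverage": float(covered.mean()),
        "pct_empty": 100.0 * float((size == 0).mean()),
        "pct_singleton": 100.0 * float((size == 1).mean()),
        "pct_two_label": 100.0 * float((size == 2).mean()),
    }


def class_conditional_coverage(sets, labels):
    """Coverage computed separately over the points of each true class."""
    sets = np.asarray(sets, dtype=bool)
    labels = np.asarray(labels, dtype=int)
    covered = sets[np.arange(labels.size), labels]
    return {int(c): float(covered[labels == c].mean()) for c in np.unique(labels)}


# Toy example (made up): four points with one empty and one two-label set.
toy_sets = [[True, False], [False, True], [False, False], [True, True]]
toy_labels = [0, 0, 1, 1]
summary = coverage_and_composition(toy_sets, toy_labels)
per_class = class_conditional_coverage(toy_sets, toy_labels)
```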

Ensemble Results

Figure 1: Example ensemble inference outputs on a 2022 chip: Raw chip (+ point), ensemble mean probability, ensemble variance, and argmax segmentation.
Table 1: Pixel-wise performance of the ensemble and individual segmentation models on the 2022 evaluation points.
Training Year | Model | Precision | Recall | F Score | Accuracy
N/A | Ensemble | 0.905 | 0.869 | 0.886 | 0.91
2021 | U-Net | 0.820 | 0.890 | 0.860 | 0.88
2021 | SAM LoRA | 0.820 | 0.890 | 0.850 | 0.87
2021 | DeepLab | 0.880 | 0.800 | 0.840 | 0.88
2018–2021 | U-Net | 0.880 | 0.780 | 0.830 | 0.87
2018–2021 | SAM LoRA | 0.810 | 0.800 | 0.810 | 0.84
2018–2021 | DeepLab | 0.880 | 0.890 | 0.860 | 0.89

In-Distribution Evaluation

Table 2: In-distribution (2021 test) coverage and set composition for vanilla CP, linear variance-normalized scores across \(\lambda\), and the nonparametric normalizer at \(\alpha = 0.1\).
Method | \(\hat q\) | Coverage | % singletons | % empty | % two-label
Vanilla | 0.317 | 0.9 | 93.9 | 6.1 | 0.0
Linear (λ = 0.5) | 0.298 | 0.9 | 93.9 | 6.1 | 0.0
Linear (λ = 1.0) | 0.281 | 0.9 | 93.9 | 6.1 | 0.0
Linear (λ = 2.0) | 0.254 | 0.9 | 93.9 | 6.1 | 0.0
Linear (λ = 3.0) | 0.233 | 0.9 | 93.8 | 6.0 | 0.1
Linear (λ = 4.0) | 0.216 | 0.9 | 93.7 | 6.1 | 0.2
Nonparametric | 1.512 | 0.9 | 94.5 | 4.5 | 1.0

Temporal OOD (2022)

Linear Normalization

Table 3: Temporal OOD (2022) global coverage and set composition for vanilla CP and linear variance-normalized scores across \(\lambda\) at \(\alpha = 0.1\).
Method | Coverage | % singletons | % empty | % two-label
Vanilla | 0.860 | 92.7 | 7.3 | 0.0
Linear (λ = 0.5) | 0.867 | 94.4 | 5.6 | 0.0
Linear (λ = 1.0) | 0.871 | 95.4 | 4.5 | 0.1
Linear (λ = 2.0) | 0.891 | 95.6 | 3.0 | 1.4
Linear (λ = 3.0) | 0.901 | 93.8 | 2.5 | 3.7
Linear (λ = 4.0) | 0.908 | 92.4 | 2.1 | 5.5
Figure 2: Overall coverage and singleton % on 2022 OOD points as a function of the linear shrink parameter \(\lambda\) at \(\alpha = 0.1\).

Nonparametric Normalization

Figure 3: Estimated normalization scale from 2021 calibration compared to the empirical score-variance relationship observed in 2022. The extension shows extrapolation beyond calibration support.
Table 4: Temporal OOD (2022) global coverage and set composition for Vanilla CP, Linear normalization \((\lambda = 3)\), and the Nonparametric normalizer at \(\alpha = 0.1\).
Method | \(\hat q\) | Coverage | % singletons | % empty | % two-label
Vanilla | 0.317 | 0.860 | 92.7 | 7.3 | 0.0
Linear (λ = 3.0) | 0.233 | 0.901 | 93.8 | 2.5 | 3.7
Nonparametric | 1.511 | 0.906 | 96.4 | 0.8 | 2.8
Figure 4: Coverage on 2022 OOD points as a function of ensemble-variance bins for vanilla, linear \((\lambda = 3)\), and nonparametric normalization.

Class-conditional coverage (2022)

Table 5: Per-class coverage on 2022 OOD points at \(\alpha = 0.1\) for vanilla CP, linear normalization \((\lambda = 3)\), and the nonparametric normalizer.
Method | Coverage (background) | Coverage (eelgrass) | Difference
Vanilla | 0.914 | 0.780 | 0.134
Linear (λ = 3.0) | 0.943 | 0.839 | 0.104
Nonparametric | 0.946 | 0.847 | 0.099

Spatial Robustness

Figure 5: Spatial blocks (10 equal-count spatial blocks formed by Morton/Z-ordering)
Figure 6: Four iterations of the Z-order curve (from Wikipedia)
Table 6: Spatial robustness diagnostics on 2022 OOD points.
Method | SD (block-wise coverage) | Correlation r (block coverage vs. mean variance) | p-value
Vanilla | 0.049 | -0.622 | 0.055
Linear (λ = 3.0) | 0.040 | -0.416 | 0.231
Nonparametric | 0.036 | -0.206 | 0.569
Figure 7: Scatter plot of block-wise coverage vs mean variance for vanilla, linear \((\lambda=3)\), and nonparametric methods
Figure 8: Per-block mean variance for 2022 points
Figure 9: Per-block coverage for 2022 points for vanilla, linear \((\lambda=3)\), and nonparametric methods
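
The equal-count Morton blocks of Figure 5 can be sketched as follows. This assumes point locations are available as non-negative integer pixel coordinates and that blocks are formed by sorting points along the Z-order curve and splitting the sorted list into ten equal-count groups; the function names and the 16-bit coordinate limit are illustrative choices.

```python
import numpy as np


def interleave_bits(x, y, bits=16):
    """Morton (Z-order) code: interleave the low `bits` bits of x and y."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return code


def morton_blocks(cols, rows, n_blocks=10):
    """Assign each point to one of n_blocks equal-count blocks along the Z-order curve."""
    codes = np.array([interleave_bits(int(c), int(r)) for c, r in zip(cols, rows)])
    order = np.argsort(codes, kind="stable")
    block = np.empty(len(codes), dtype=int)
    for b, idx in enumerate(np.array_split(order, n_blocks)):
        block[idx] = b
    return block


# Toy example: 100 points on a 10 x 10 grid (made up), split into 10 blocks.
cols = np.arange(100) % 10
rows = np.arange(100) // 10
blocks = morton_blocks(cols, rows, n_blocks=10)
```

Block-wise coverage and mean variance then follow by averaging within each block id.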

Visual Method Comparison

Figure 10: Visuals of the 2022 raster and overlaid ground truth points (Blue = Other, Green = Eelgrass).
Figure 11: Set composition overlay for 2022 raster for vanilla, linear \((\lambda=3)\), and nonparametric methods.
Figure 12: Set composition changes from vanilla for linear \((\lambda=3)\) and nonparametric methods. Counts for each transition type are reported in the legend.
Figure 13: Chip-level set prediction comparison on a representative 2022 point (GT = eelgrass). Both vanilla and linear methods return empty sets at that pixel, while the nonparametric method returns the correct singleton {eelgrass}.

Sensitivity Analysis

Table 7: Coverage and set composition on 2022 OOD points across \(\alpha \in \{0.10, 0.05, 0.025\}\).
Method | Coverage | Empty (%) | Single (%) | Two-label (%)
α = 0.025
Linear (λ=3) | 0.967 | 0.0 | 70.8 | 29.2
Nonparametric | 0.972 | 0.3 | 65.4 | 34.3
Vanilla | 0.949 | 0.0 | 87.2 | 12.8
α = 0.05
Linear (λ=3) | 0.952 | 0.0 | 80.5 | 19.5
Nonparametric | 0.944 | 0.4 | 87.0 | 12.6
Vanilla | 0.919 | 0.0 | 95.9 | 4.1
α = 0.1
Linear (λ=3) | 0.901 | 2.5 | 93.8 | 3.7
Nonparametric | 0.906 | 0.8 | 96.4 | 2.8
Vanilla | 0.860 | 7.3 | 92.7 | 0.0
Figure 14: Coverage across \(\alpha \in \{0.1, 0.05, 0.025\}\). Dashed line shows the target coverage \(1 - \alpha\).
Figure 15: Set composition (empty, single, two-label) across \(\alpha \in \{0.1, 0.05, 0.025\}\).

Takeaways

  1. Ensemble improves accuracy and provides useful uncertainty
  2. ID: all conformal methods meet target coverage
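  3. OOD (2022): vanilla CP undercovers (0.860 vs. target 0.90 at \(\alpha = 0.1\)); ensemble-variance normalization recovers the lost coverage (0.901 linear, 0.906 nonparametric)
  4. Coverage is more uniform across variance bins and spatial blocks (block-wise coverage SD 0.049 for vanilla vs. 0.036 for nonparametric)
  5. Nonparametric normalization performed best at \(\alpha = 0.1\) (highest coverage and singleton rate, fewest empty sets) and requires no parameter tuning.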

Limitations

  • Formal CP guarantees require exchangeability, which temporal drift violates. This work targets only empirical robustness.
  • Assumes score–variance relationship transfers across years (may fail under larger shifts)
  • 2022 uses point labels (not full dense raster labels)
  • Ensembles add compute cost

Conclusion

  • Ensemble disagreement provides a practical way to make conformal prediction sets more reliable under slight shift for eelgrass segmentation.
  • Works as a post-hoc wrapper: no retraining, no test-time labels

Questions?