High-level results

Across both aggregation assumptions, the expert panel indicates:

Strongly negative counterfactual baseline: the mean expected change by 2042 under business-as-usual is around −12%, with a high proportion of units negative.
All policy scenarios provide positive uplifts relative to baseline: mean Q2 deltas are positive for every scenario, with a stable ranking of strength (policy_3 strongest; policy_1 weakest).
But policy is not sufficient on average to deliver net recovery: when combining baseline + uplift, mean absolute outcomes remain negative under all scenarios. policy_3 is closest to stability (near 0) but still negative in the mean, implying widespread improvement but persistent pockets of decline.
Habitat structure matters: Woodland is closest to neutral outcomes; Freshwater/Wetland and Open habitats remain the most constrained even under the strongest scenario.
Uncertainty is structured and greatest under the strongest intervention: dispersion and resampling indicate scenario ranking is fairly stable, but the precise “distance to zero” under policy_3 should be treated with uncertainty.

1. Background / purpose

This notebook cleans and analyses the Round-2 Expert Elicitation (EE) export to produce policy-relevant diagnostics of:

the counterfactual baseline expected by experts (Q1),
the incremental uplift under each policy scenario relative to that baseline (Q2),
the implied absolute outcome under policy (Q1 + Q2), i.e. whether scenarios are expected to slow decline, stabilise (≈0), or produce net recovery (>0) by 2042,
and the robustness of conclusions given a small expert panel (dispersion and resampling stability).

The survey is structured around taxa × habitat group combinations. Some taxa are represented by multiple survey_taxa rows within the same habitat group (notably freshwater assemblages and lichens). To make sure headline results are not an artefact of that survey-row multiplicity, the analysis reports key summaries under two transparent aggregation assumptions:

Assumption A (rows): all survey_taxa rows contribute as-is (faithful to survey structure).
Assumption B (collapsed): within each expert × scenario, multiple rows that represent the same taxa × habitat_group are averaged so that each unique taxa:habitat_group contributes once.

Where A and B agree, conclusions are robust to the aggregation choice; where they differ, results should be interpreted as sensitive to how survey rows represent ecological “units”.

Core framing: baseline vs delta vs absolute outcome

This analysis is structured as expert × taxa:habitat (survey row) × scenario observations.

Policy magnitudes (Q2) are explicitly defined as changes relative to the counterfactual (Q1). Therefore the implied signed outcome under policy is:

\[ \text{Absolute policy outcome (signed \%)} = \text{Counterfactual signed \%} + \text{Policy delta signed \%} \]

This ensures we keep separate:

baseline decline expectations (Q1),
incremental policy effects (Q2),
implied absolute outcomes under policy (Q1 + Q2).

Example (signed magnitudes): - If Q1b = 20 and Q1a indicates “decrease” → counterfactual = −20% - If Q2b = 10 and Q2a indicates “increase” → delta = +10% - Absolute outcome under policy = −20 + 10 = −10%

In the code, this quantity is stored as:

abs_policy_mag_signed = cf_mag_signed + pol_mag_signed

Outputs include summary tables, compact plots, category distributions, and a panel-size sensitivity test (resampling stability).

2. Data

# Packages
# ============================================================
library(readxl)
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
library(readr)
library(ggplot2)
library(forcats)
library(knitr)
library(kableExtra)
library(scales)

# ------------------------------------------------------------
# Input file
# ------------------------------------------------------------
xlsx_path <- "./data/p591243814707_benjamin.firth_38179087.xlsx"

df_raw <- read_excel(xlsx_path)

tibble(
  n_rows = nrow(df_raw),
  n_cols = ncol(df_raw),
  n_Q_cols = sum(str_detect(names(df_raw), "^Q\\d"))
) %>%
  kable(caption = "Raw Excel shape") %>%
  kable_styling(full_width = FALSE)

Raw Excel shape
n_rows	n_cols	n_Q_cols
7	445	420

2.2 Tidy and standardise

2.2.1 Strip HTML formatting

Some respondent cells contain HTML tags (e.g. <span style="...">No meaningful change</span>). We remove tags while preserving the displayed text.

Some, but not all respondent cells contain HTML (e.g. No meaningful change). Remove tags while preserving text:

strip_html <- function(x) {
  if (is.character(x) || inherits(x, "factor")) {
    x2 <- as.character(x)
    x2 <- str_replace_all(x2, "<[^>]+>", "")
    str_squish(x2)
  } else {
    x
  }
}

df_clean <- df_raw %>%
  mutate(across(everything(), strip_html))

2.2.2. New expert ID (`id_gj`) mapped from `Contact_name` (keep all experts)

Two respondents are missing some metadata but do have Contact_name. To keep all experts while removing identifiable information, we assign a stable integer ID (1..N) based on Contact_name and do not carry forward names.

df_clean <- df_clean %>%
  mutate(Contact_name = str_squish(as.character(Contact_name)))

contact_key <- df_clean %>%
  distinct(Contact_name) %>%
  filter(!is.na(Contact_name) & Contact_name != "") %>%
  arrange(Contact_name) %>%
  mutate(id_gj = row_number())

# save if needed later
#write.csv(contact_key, "./data/respondent_key_lookup.csv", row.names = FALSE)

# Expert mapping: Contact_name → id_gj
df1 <- df_clean %>%
  left_join(contact_key, by = "Contact_name")

2.2.3 Reshape Q1 + Q2 into `df_tidy` (expert × survey_taxa × scenario)

We pivot all Q* columns into long format, parse scenario, then pivot wide.

Parsing rules:

Q1* = "counterfactual" scenario
Q2*_*_{1..4} = "policy_1".."policy_4" scenario

df_q <- df1 %>%
  select(id_gj, matches("^Q\\d"))

long_q <- df_q %>%
  pivot_longer(
    cols = matches("^Q\\d"),
    names_to = "colname",
    values_to = "value",
    values_transform = list(value = as.character) # avoid type conflicts
  ) %>%
  mutate(
    q_code = str_match(colname, "^(Q\\d+[a-z])_\\d+")[,2],
    idx    = as.integer(str_match(colname, "^(Q\\d+[a-z])_(\\d+)")[,3]),
    scen_num = str_match(colname, "^(Q\\d+[a-z])_\\d+_(\\d+)")[,3],
    scen_num = if_else(is.na(scen_num), NA_integer_, as.integer(scen_num)),
    scenario = case_when(
      str_detect(q_code, "^Q1") ~ "counterfactual",
      str_detect(q_code, "^Q2") & !is.na(scen_num) ~ paste0("policy_", scen_num),
      TRUE ~ NA_character_
    ),
    question = str_trim(str_match(colname, ":\\s*([^\\(]+)\\s*\\(")[,2]),
    survey_taxa     = str_trim(str_match(colname, "\\((.*)\\)\\s*$")[,2])
  ) %>%
  mutate(
    col_out = paste0(
      q_code, "_",
      str_replace_all(str_to_lower(question), "[^a-z0-9]+", "_") %>%
        str_replace_all("^_|_$", "")
    )
  ) %>%
  filter(!is.na(scenario)) %>%
  select(id_gj, survey_taxa, scenario, col_out, value)


# reshape_wide
df_tidy <- long_q %>%
  pivot_wider(
    id_cols = c(id_gj, survey_taxa, scenario),
    names_from = col_out,
    values_from = value,
    values_fn = ~ {
      if (length(.x) == 0) NA_character_
      else if (length(.x) == 1) .x
      else paste(.x, collapse = " | ")
    }
  ) %>%
  arrange(id_gj, survey_taxa, scenario)

df_tidy %>%
  count(scenario) %>%
  kable(caption = "Scenario counts in df_tidy") %>%
  kable_styling(full_width = FALSE)

Scenario counts in df_tidy
scenario	n
counterfactual	147
policy_1	147
policy_2	147
policy_3	147
policy_4	147

2.2.4 `habitat` parsing

IMPORTANT: parsing rules

Goal: 1. Keep habitat_group logic exactly as-is (broad keyword mapping).
2. Set habitat_type to NA for all rows EXCEPT the five listed survey_taxa values, where we want a human-readable, standardised habitat_type label.

Why: - Many survey_taxa strings do not have a trailing “(…)” substring, and for reporting we only want habitat_type populated for the handful of taxa where we explicitly want a more granular label.

Output targets (exact): - “Ditch & Wetland Channel Freshwater invert assemblages” -> “Ditch & Wetland Channel” - “Running-Water Freshwater invert assemblages (Rivers/Streams)” -> “Running-Water (Rivers/Streams)” - “Standing-Water Freshwater invert assemblages (Ponds/Lakes)” -> “Standing-Water (Ponds/Lakes)” - “Saxicolous Lichens (Rock & Stone)” -> “Saxicolous (Rock & Stone)” - “Terricolous Heath/Bog Lichens” -> “Terricolous (Heath/Bog)”

# ============================================================
# Habitat parsing 
# ============================================================

# Helper: 
# Return NA unless survey_taxa is one of the five special cases
habitat_type <- function(survey_taxa) {
  x <- stringr::str_squish(as.character(survey_taxa))

  dplyr::case_when(
    # Freshwater assemblages
    stringr::str_detect(x, "^Ditch\\s*&\\s*Wetland\\s*Channel\\s+Freshwater\\s+invert\\s+assemblages\\s*$") ~
      "Ditch & Wetland Channel",

    stringr::str_detect(x, "^Running\\-Water\\s+Freshwater\\s+invert\\s+assemblages\\s*\\(Rivers/Streams\\)\\s*$") ~
      "Running-Water (Rivers/Streams)",

    stringr::str_detect(x, "^Standing\\-Water\\s+Freshwater\\s+invert\\s+assemblages\\s*\\(Ponds/Lakes\\)\\s*$") ~
      "Standing-Water (Ponds/Lakes)",

    # Lichens
    stringr::str_detect(x, "^Saxicolous\\s+Lichens\\s*\\(Rock\\s*&\\s*Stone\\)\\s*$") ~
      "Saxicolous (Rock & Stone)",

    stringr::str_detect(x, "^Terricolous\\s+Heath/Bog\\s+Lichens\\s*$") ~
      "Terricolous (Heath/Bog)",

    TRUE ~ NA_character_
  )
}

df_tidy <- df_tidy %>%
  dplyr::mutate(
    # --- habitat_type: NA except for five explicitly defined taxa ---
    habitat_type = habitat_type(survey_taxa),

    # Primary classification text for habitat_group mapping:
    # keep our original intent: use habitat_type if present, else survey_taxa.
    hg_txt = stringr::str_to_lower(dplyr::coalesce(habitat_type, survey_taxa)),

    # --- habitat_group mapping (UNCHANGED) ---
    habitat_group = dplyr::case_when(
      stringr::str_detect(hg_txt, "wetland|aquatic|water\\b|water\\-|water\\s+margin|river|stream|rivers|streams|pond|ponds|lake|lakes|ditch|channel|riparian|rheophytic|rheo") ~
        "Freshwater/Wetland",
      stringr::str_detect(hg_txt, "woodland|parkland|scrub|deadwood|epiphyt|humidity|veteran|ancient|saproxylic") ~
        "Woodland",
      stringr::str_detect(hg_txt, "grassland|heath|bog|open\\-habitat|open\\s+habitat|open\\s+ground|bare\\-ground|bare\\s+ground|exposed\\s+substrates?|terricolous") ~
        "Open habitats",
      stringr::str_detect(hg_txt, "rock|stone|saxicolous") ~
        "Open habitats",
      TRUE ~ "Unknown"
    ),

    # Force non-missing even if something slips through
    habitat_group = dplyr::if_else(is.na(habitat_group) | habitat_group == "", "Unknown", habitat_group),

    # taxa grouping (unchanged idea; keyword-based)
    taxa = dplyr::case_when(
      stringr::str_detect(survey_taxa, stringr::regex("\\bDiptera\\b", ignore_case = TRUE)) ~ "Diptera",
      stringr::str_detect(survey_taxa, stringr::regex("\\bBeetles\\b", ignore_case = TRUE)) ~ "Beetles",
      stringr::str_detect(survey_taxa, stringr::regex("Vascular Plants\\b", ignore_case = TRUE)) ~ "Vascular plants",
      stringr::str_detect(survey_taxa, stringr::regex("\\bLichens\\b", ignore_case = TRUE)) ~ "Lichens",
      stringr::str_detect(survey_taxa, stringr::regex("\\bBryophytes?\\b", ignore_case = TRUE)) ~ "Bryophytes",
      stringr::str_detect(survey_taxa, stringr::regex("\\bSpiders\\b", ignore_case = TRUE)) ~ "Spiders",
      stringr::str_detect(survey_taxa, stringr::regex("Freshwater invert assemblages\\b", ignore_case = TRUE)) ~ "Freshwater invertebrate assemblages",
      TRUE ~ "Other/unknown"
    )
  ) %>%
  dplyr::select(-hg_txt)

# Diagnostics: do we have any NA habitat_group now?
df_tidy %>%
  summarise(
    n_rows = n(),
    n_habitat_group_na = sum(is.na(habitat_group)),
    n_habitat_group_unknown = sum(habitat_group == "Unknown", na.rm = TRUE)
  ) %>%
  kable(caption = "habitat_group diagnostics (should have 0 NA)") %>%
  kable_styling(full_width = FALSE)

habitat_group diagnostics (should have 0 NA)
n_rows	n_habitat_group_na	n_habitat_group_unknown
735	0	0

# Coverage: show habitat_group by survey_taxa (useful spot-check)
df_tidy %>%
  distinct(survey_taxa, habitat_type, habitat_group) %>%
  arrange(habitat_group, survey_taxa) %>%
  kable(caption = "Mapping check: survey_taxa to habitat_group") %>%
  kable_styling(full_width = FALSE) %>%
  scroll_box(height = "320px")

Mapping check: survey_taxa to habitat_group
survey_taxa	habitat_type	habitat_group
Ditch & Wetland Channel Freshwater invert assemblages	Ditch & Wetland Channel	Freshwater/Wetland
Running-Water Freshwater invert assemblages (Rivers/Streams)	Running-Water (Rivers/Streams)	Freshwater/Wetland
Standing-Water Freshwater invert assemblages (Ponds/Lakes)	Standing-Water (Ponds/Lakes)	Freshwater/Wetland
Wetland & Aquatic Beetles	NA	Freshwater/Wetland
Wetland & Aquatic Vascular Plants	NA	Freshwater/Wetland
Wetland & Rheophytic Bryophytes	NA	Freshwater/Wetland
Wetland & Riparian Spiders	NA	Freshwater/Wetland
Wetland & Water-Margin Diptera	NA	Freshwater/Wetland
Open Grassland & Heath Vascular Plants	NA	Open habitats
Open Grassland/Heath/Bare-Ground Spiders	NA	Open habitats
Open-Ground Bryophytes (Grassland, Heath, Exposed Substrates)	NA	Open habitats
Open-Habitat & Flower-Visiting Diptera	NA	Open habitats
Open-Habitat Ground Beetles (Grassland & Heath)	NA	Open habitats
Saxicolous Lichens (Rock & Stone)	Saxicolous (Rock & Stone)	Open habitats
Terricolous Heath/Bog Lichens	Terricolous (Heath/Bog)	Open habitats
Epiphytic Lichens (Woodland & Parkland)	NA	Woodland
Saproxylic & Woodland Diptera	NA	Woodland
Saproxylic Beetles (Woodland Deadwood)	NA	Woodland
Woodland & Scrub Spiders	NA	Woodland
Woodland & Scrub Vascular Plants	NA	Woodland
Woodland Humidity & Epiphyte Bryophytes	NA	Woodland

2.3 Scoring categories and signed magnitudes

We now translate/compute:

Distribution score (1–5)
Certainty score (1–5)
Signed counterfactual magnitude (%)
Signed policy delta magnitude (%
Implied signed policy outcome (%): cf_mag_signed + pol_mag_signed

Important convention: responses of “do not know/unsure” are treated the same as missing.

# ============================================================
# Scoring helpers
# ============================================================
score_dist <- function(x){
  x0 <- str_squish(tolower(as.character(x)))
  x0[x0 %in% c("do not know/unsure","do not know","unsure","don't know","dont know","dk","")] <- NA_character_

  case_when(
    is.na(x0) ~ NA_real_,
    str_detect(x0, "robust|significant|large") & str_detect(x0, "decrease") ~ 1,
    str_detect(x0, "moderate") & str_detect(x0, "decrease") ~ 2,
    str_detect(x0, "no meaningful|no change") ~ 3,
    str_detect(x0, "moderate") & str_detect(x0, "increase") ~ 4,
    str_detect(x0, "robust|significant|large") & str_detect(x0, "increase") ~ 5,
    TRUE ~ NA_real_
  )
}

score_cert <- function(x){
  x0 <- str_squish(tolower(as.character(x)))
  x0[x0 %in% c("do not know/unsure","do not know","unsure","don't know","dont know","dk","")] <- NA_character_

  case_when(
    is.na(x0) ~ NA_real_,
    x0 %in% c("very low","very_low","very-low") ~ 1,
    x0 %in% c("low") ~ 2,
    x0 %in% c("medium","moderate") ~ 3,
    x0 %in% c("high") ~ 4,
    x0 %in% c("very high","very_high","very-high") ~ 5,
    TRUE ~ NA_real_
  )
}

dir_sign <- function(x){
  x0 <- str_squish(tolower(as.character(x)))
  x0[x0 %in% c("do not know/unsure","do not know","unsure","don't know","dont know","dk","")] <- NA_character_

  case_when(
    is.na(x0) ~ NA_real_,
    str_detect(x0, "decrease") ~ -1,
    str_detect(x0, "increase") ~  1,
    str_detect(x0, "no meaningful|no change") ~ 0,
    TRUE ~ NA_real_
  )
}


# ============================================================
# Score and sign responses
# ============================================================
Q1a <- "Q1a_likely_change_in_distribution"
Q1b <- "Q1b_magnitude_of_change"
Q1c <- "Q1c_your_certainty"

Q2a <- "Q2a_likely_change_in_distribution_compared_to_counterfactual"
Q2b <- "Q2b_magnitude_of_change_compared_to_the_counterfactual"
Q2c <- "Q2c_your_certainty"

df_scored <- df_tidy %>%
  mutate(
    # parse numeric magnitudes (percent scale assumed in survey)
    Q1b_num = parse_number(as.character(.data[[Q1b]]), na = c("N/A","NA","na","n/a","")),
    Q2b_num = parse_number(as.character(.data[[Q2b]]), na = c("N/A","NA","na","n/a","")),

    # scored categories
    cf_dist = score_dist(.data[[Q1a]]),
    pol_dist = score_dist(.data[[Q2a]]),

    cf_cert = score_cert(.data[[Q1c]]),
    pol_cert = score_cert(.data[[Q2c]]),

    # direction
    cf_dir = dir_sign(.data[[Q1a]]),
    pol_dir = dir_sign(.data[[Q2a]]),

    # signed magnitudes
    cf_mag_signed = if_else(is.na(Q1b_num) | is.na(cf_dir), NA_real_, Q1b_num * cf_dir),
    pol_mag_signed = if_else(is.na(Q2b_num) | is.na(pol_dir), NA_real_, Q2b_num * pol_dir)
  )

# ============================================================
# Attach counterfactual (Q1) values onto policy rows
# so we can compute implied absolute outcomes later
# ============================================================
scenario_levels <- c("policy_1","policy_2","policy_3","policy_4")

cf_lookup <- df_scored %>%
  filter(scenario == "counterfactual") %>%
  transmute(
    id_gj, survey_taxa,
    cf_dist2 = cf_dist,
    cf_cert2 = cf_cert,
    cf_mag_signed2 = cf_mag_signed
  )

df_scored2 <- df_scored %>%
  left_join(cf_lookup, by = c("id_gj", "survey_taxa")) %>%
  mutate(
    cf_dist = cf_dist2,
    cf_cert = cf_cert2,
    cf_mag_signed = cf_mag_signed2
  ) %>%
  select(-cf_dist2, -cf_cert2, -cf_mag_signed2) %>%
  mutate(
    abs_policy_mag_signed = if_else(
      scenario %in% scenario_levels,
      cf_mag_signed + pol_mag_signed,
      cf_mag_signed
    ),
    delta_cert = if_else(
      scenario %in% scenario_levels,
      pol_cert - cf_cert,
      NA_real_
    )
  )

df_scored2 %>%
  summarise(
    n_rows = n(),
    n_experts = n_distinct(id_gj),
    n_survey_taxa = n_distinct(survey_taxa),
    n_cf_signed = sum(!is.na(cf_mag_signed)),
    n_pol_delta_signed = sum(!is.na(pol_mag_signed)),
    n_abs_policy_signed = sum(!is.na(abs_policy_mag_signed))
  ) %>%
  kable(caption = "Coverage: signed magnitudes and absolute policy outcomes") %>%
  kable_styling(full_width = FALSE)

Coverage: signed magnitudes and absolute policy outcomes
n_rows	n_experts	n_survey_taxa	n_cf_signed	n_pol_delta_signed	n_abs_policy_signed
735	7	21	705	564	705

IMPORTANT assumption: we treat the response category “do not know/unsure” the same way as we treat NAs (ie respondent leaving questions empty == respondents selecting “do not know/unsure”)

# Expected columns from reshape
Q1a <- "Q1a_likely_change_in_distribution"
Q1b <- "Q1b_magnitude_of_change"
Q1c <- "Q1c_your_certainty"

Q2a <- "Q2a_likely_change_in_distribution_compared_to_counterfactual"
Q2b <- "Q2b_magnitude_of_change_compared_to_the_counterfactual"
Q2c <- "Q2c_your_certainty"




df_scored <- df_tidy %>%
  mutate(
    Q1b_num = parse_number(as.character(.data[[Q1b]]), na = c("N/A","NA","na","n/a","")),
    Q2b_num = parse_number(as.character(.data[[Q2b]]), na = c("N/A","NA","na","n/a","")),

    cf_dist = score_dist(.data[[Q1a]]),
    pol_dist = score_dist(.data[[Q2a]]),

    cf_cert = score_cert(.data[[Q1c]]),
    pol_cert = score_cert(.data[[Q2c]]),

    cf_dir = dir_sign(.data[[Q1a]]),
    pol_dir = dir_sign(.data[[Q2a]]),

    cf_mag_signed = if_else(is.na(Q1b_num) | is.na(cf_dir), NA_real_, Q1b_num * cf_dir),
    pol_mag_signed = if_else(is.na(Q2b_num) | is.na(pol_dir), NA_real_, Q2b_num * pol_dir),
    abs_policy_mag_signed = cf_mag_signed + pol_mag_signed,

    delta_cert = pol_cert - cf_cert
  )

df_scored %>%
  summarise(
    n_rows = n(),
    n_cf_mag = sum(!is.na(Q1b_num)),
    n_pol_mag = sum(!is.na(Q2b_num)),
    n_cf_signed = sum(!is.na(cf_mag_signed)),
    n_delta_signed = sum(!is.na(pol_mag_signed))
  ) %>%
  kable(caption = "Coverage: magnitudes and signed magnitudes") %>%
  kable_styling(full_width = FALSE)

Coverage: magnitudes and signed magnitudes
n_rows	n_cf_mag	n_pol_mag	n_cf_signed	n_delta_signed
735	141	564	141	564

# ============================================================
# FIX: attach counterfactual (Q1) values onto policy rows
# so abs_policy_mag_signed and delta_cert can be computed
# ============================================================

scenario_levels <- c("policy_1","policy_2","policy_3","policy_4")

# 1) Build a CF lookup (one row per id_gj × survey_taxa)
cf_lookup <- df_scored %>%
  filter(scenario == "counterfactual") %>%
  transmute(
    id_gj, survey_taxa,
    cf_dist2 = cf_dist,
    cf_cert2 = cf_cert,
    cf_mag_signed2 = cf_mag_signed
  )

# 2) Join CF values onto ALL rows (including policy rows)
df_scored2 <- df_scored %>%
  left_join(cf_lookup, by = c("id_gj", "survey_taxa")) %>%
  mutate(
    # overwrite cf_* with joined versions (so they exist on policy rows too)
    cf_dist = cf_dist2,
    cf_cert = cf_cert2,
    cf_mag_signed = cf_mag_signed2
  ) %>%
  select(-cf_dist2, -cf_cert2, -cf_mag_signed2) %>%
  mutate(
    # Now these work (because cf_* exist on policy rows)
    abs_policy_mag_signed = if_else(
      scenario %in% scenario_levels,
      cf_mag_signed + pol_mag_signed,
      cf_mag_signed
    ),
    delta_cert = if_else(
      scenario %in% scenario_levels,
      pol_cert - cf_cert,
      NA_real_
    )
  )

# 3) Quick sanity check: abs_policy_mag_signed should no longer be all NA
df_scored2 %>%
  group_by(scenario) %>%
  summarise(
    n = n(),
    n_abs_policy_mag = sum(!is.na(abs_policy_mag_signed)),
    n_delta_cert = sum(!is.na(delta_cert)),
    .groups = "drop"
  ) %>% print()

## # A tibble: 5 × 4
##   scenario           n n_abs_policy_mag n_delta_cert
##   <chr>          <int>            <int>        <int>
## 1 counterfactual   147              141            0
## 2 policy_1         147              141          138
## 3 policy_2         147              141          141
## 4 policy_3         147              141          141
## 5 policy_4         147              141          141

The dataset is balanced across scenarios and experts. All scenarios are evaluated on the same underlying baseline expectations, ensuring comparability.

3. Aggregation assumptions: row-based vs taxa:habitat_group

The elicitation design is such that the data includes multiple survey_taxa rows for some taxa within a single habitat group (notably freshwater assemblages and lichens). If we compute means across all survey_taxa rows, these taxa would contribute more weight to habitat-group and overall summaries simply because they have more rows.

To make this transparent, we run the analysis under two assumptions:

Assumption A (Row-based: all unique survey_taxa contribute)

Unit: > expert × scenario × survey_taxa

This preserves the survey structure exactly.

Assumption B (Collapsed within habitat: unique taxa:habitat_group contribute once)

Unit: > expert × scenario × taxa × habitat_group

This averages across multiple survey_taxa rows that fall into the same taxa:habitat_group cell for each expert and scenario.

Both assumptions are shown side-by-side for: - overall scenario summaries, and - habitat_group summaries.

scenario_levels <- c("counterfactual","policy_1","policy_2","policy_3","policy_4")

# A) Row-based: keep all survey_taxa rows (no collapsing)
df_A <- df_scored2 %>%
  filter(scenario %in% scenario_levels) %>%
  filter(!(taxa == "Lichens" & habitat_group == "Freshwater/Wetland")) %>%
  mutate(assumption = "A (rows = all survey_taxa)")

# B) Collapse within habitat: unique taxa:habitat_group per expert × scenario
df_B <- df_A %>%
  group_by(id_gj, scenario, habitat_group, taxa) %>%
  summarise(
    cf_mag_signed = mean(cf_mag_signed, na.rm = TRUE),
    pol_mag_signed = mean(pol_mag_signed, na.rm = TRUE),
    abs_policy_mag_signed = mean(abs_policy_mag_signed, na.rm = TRUE),
    cf_dist = mean(cf_dist, na.rm = TRUE),
    pol_dist = mean(pol_dist, na.rm = TRUE),
    cf_cert = mean(cf_cert, na.rm = TRUE),
    pol_cert = mean(pol_cert, na.rm = TRUE),
    n_survey_taxa_rows_collapsed = n(),
    .groups = "drop"
  ) %>%
  mutate(assumption = "B (collapsed = unique taxa:habitat_group)")

# Combined object (handy for faceting tables/plots)
df_AB <- bind_rows(df_A, df_B)

# Diagnostics: show where collapsing occurred
df_B %>%
  filter(n_survey_taxa_rows_collapsed > 1) %>%
  count(habitat_group, taxa, n_survey_taxa_rows_collapsed, sort = TRUE) %>%
  kable(caption = "Where Assumption B collapses multiple survey_taxa rows into one taxa:habitat_group unit") %>%
  kable_styling(full_width = FALSE)

Where Assumption B collapses multiple survey_taxa rows into one taxa:habitat_group unit
habitat_group	taxa	n_survey_taxa_rows_collapsed	n
Freshwater/Wetland	Freshwater invertebrate assemblages	3	35
Open habitats	Lichens	2	35

4. Expected trajectory under the counterfactual (Q1 baseline)

This section establishes the counterfactual (“business-as-usual”) baseline from Q1, before we interpret policy deltas from Q2. Because Q2 magnitudes are explicitly defined relative to Q1, we need a clear view of baseline “pull” to answer the policy-relevant question: do uplifts merely slow decline, or do they offset it sufficiently to deliver stability/recovery by 2042?

We report results under two aggregation assumptions:

Assumption A (rows = all survey_taxa): each survey row contributes as an observation. This reflects the survey design directly, but can implicitly overweight taxa that were split into multiple survey_taxa rows within a habitat group (notably freshwater assemblages and lichens).
Assumption B (collapsed = unique taxa:habitat_group): within each expert × scenario, multiple survey_taxa rows that represent the same taxa within the same habitat group are collapsed to one unit (averaged). This is the most direct way to prevent survey-row multiplicity from driving habitat-group summaries.

Importantly, taxa-level comparisons are not expected to change materially across A vs B, because the collapsing only affects where a taxon has multiple survey rows within the same habitat group. By construction, the main place where A vs B can matter is therefore habitat-group summaries.

cf_AB <- df_AB %>%
  filter(scenario == "counterfactual")

# Headline baseline (A vs B)
cf_AB %>%
  group_by(assumption) %>%
  summarise(
    n_units = n(),
    n_experts = n_distinct(id_gj),
    n_taxa = n_distinct(taxa),
    n_hab_groups = n_distinct(habitat_group),
    mean_cf = mean(cf_mag_signed, na.rm = TRUE),
    median_cf = median(cf_mag_signed, na.rm = TRUE),
    sd_cf = sd(cf_mag_signed, na.rm = TRUE),
    p10 = quantile(cf_mag_signed, 0.10, na.rm = TRUE),
    p90 = quantile(cf_mag_signed, 0.90, na.rm = TRUE),
    prop_negative = mean(cf_mag_signed < 0, na.rm = TRUE),
    prop_zero = mean(cf_mag_signed == 0, na.rm = TRUE),
    prop_positive = mean(cf_mag_signed > 0, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(across(where(is.numeric), ~ round(.x, 2))) %>%
  kable(caption = "Counterfactual baseline (Q1 signed %): Assumptions A and B") %>%
  kable_styling(full_width = FALSE)

Counterfactual baseline (Q1 signed %): Assumptions A and B
assumption	n_units	n_experts	n_taxa	n_hab_groups	mean_cf	median_cf	sd_cf	p10	p90	prop_negative	prop_zero	prop_positive
A (rows = all survey_taxa)	147	7	7	3	-12.33	-10	11.46	-30	0	0.79	0.13	0.07
B (collapsed = unique taxa:habitat_group)	126	7	7	3	-11.95	-10	11.74	-30	0	0.78	0.15	0.07

4.1 Baseline by habitat_group (Assumptions A and B)

How to read the baseline tables (and why A vs B matters here)

The baseline tables report:

mean_cf: mean signed counterfactual change (%), where negative values imply decline by 2042.
prop_negative: share of units with baseline decline (< 0), which is a compact signal of how widespread negative expectations are.
mean_cert: average reported certainty (1–5 scale).

At headline level, the baseline is strongly negative under both assumptions (mean around −12% in A and slightly less negative in B), and the proportion negative is high (around four-fifths). This indicates that, in expert judgement, continued decline is the default state without further intervention.

The habitat-group baseline table is shown explicitly under both assumptions because this is where multiplicity can matter. In these data, the only systematic collapsing in B is:

Freshwater/Wetland × Freshwater invertebrate assemblages (3 survey_taxa rows collapsed)
Open habitats × Lichens (2 survey_taxa rows collapsed)

so A can overweight those components in habitat-group means relative to other taxa that appear as a single survey row within habitat.

This table is shown under both assumptions because it is precisely where survey-row multiplicity can matter.

cf_by_hab_AB <- cf_AB %>%
  group_by(assumption, habitat_group) %>%
  summarise(
    n = n(),
    mean_cf = mean(cf_mag_signed, na.rm = TRUE),
    prop_negative = mean(cf_mag_signed < 0, na.rm = TRUE),
    mean_cert = mean(cf_cert, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(assumption, mean_cf)

cf_by_hab_AB %>%
  mutate(
    mean_cf = round(mean_cf, 2),
    mean_cert = round(mean_cert, 2),
    prop_negative = scales::percent(prop_negative, accuracy = 1)
  ) %>%
  kable(caption = "Counterfactual baseline by habitat_group (Assumptions A and B)") %>%
  kable_styling(full_width = FALSE)

Counterfactual baseline by habitat_group (Assumptions A and B)
assumption	habitat_group	n	mean_cf	prop_negative	mean_cert
A (rows = all survey_taxa)	Freshwater/Wetland	56	-13.83	87%	2.06
A (rows = all survey_taxa)	Open habitats	49	-13.43	85%	2.02
A (rows = all survey_taxa)	Woodland	42	-9.00	62%	1.93
B (collapsed = unique taxa:habitat_group)	Freshwater/Wetland	42	-13.47	85%	1.99
B (collapsed = unique taxa:habitat_group)	Open habitats	42	-13.36	85%	2.00
B (collapsed = unique taxa:habitat_group)	Woodland	42	-9.00	62%	1.93

two_cols <- c("#c9643e", "#618f72")

ggplot(cf_by_hab_AB, aes(x = habitat_group, y = mean_cf, fill = assumption)) +
  geom_col(position = position_dodge(width = 0.8)) +
  coord_flip() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  scale_fill_manual(values = two_cols) +
  labs(
    title = "Counterfactual baseline by habitat_group (Assumptions A and B)",
    x = NULL,
    y = "Mean signed % change (counterfactual)"
  ) +
  theme_minimal()

Interpretation: baseline differs by habitat group, but A vs B does not change the story

Across both assumptions, Freshwater/Wetland and Open habitats show the most negative baseline means, while Woodland is less negative. Under Assumption A, the mean counterfactual baseline is approximately:

Freshwater/Wetland: −13.83% (87% negative)
Open habitats: −13.43% (85% negative)
Woodland: −9.00% (62% negative)

Under Assumption B, the corresponding habitat-group means are very similar (Freshwater/Wetland and Open habitats remain strongly negative; Woodland remains less negative), with small shifts consistent with removing survey-row multiplicity.

Policy implication at this stage: the baseline “pull” is large enough that even sizeable positive deltas may still leave net negative outcomes once CF + delta are combined. This is exactly what Sections 5–6 test.

4.2 Baseline by taxa (A only; A == B)

Taxa-level comparisons are not meaningfully altered by collapsing within habitat, so we report once using the row-based structure.

cf_by_taxa <- df_A %>%
  filter(scenario == "counterfactual") %>%
  group_by(taxa) %>%
  summarise(
    n = n(),
    mean_cf = mean(cf_mag_signed, na.rm = TRUE),
    prop_negative = mean(cf_mag_signed < 0, na.rm = TRUE),
    mean_cert = mean(cf_cert, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(mean_cf)

cf_by_taxa %>%
  mutate(
    mean_cf = round(mean_cf, 2),
    mean_cert = round(mean_cert, 2),
    prop_negative = scales::percent(prop_negative, accuracy = 1)
  ) %>%
  kable(caption = "Counterfactual baseline by taxa (A; identical interpretation under B)") %>%
  kable_styling(full_width = FALSE)

Counterfactual baseline by taxa (A; identical interpretation under B)
taxa	n	mean_cf	prop_negative	mean_cert
Freshwater invertebrate assemblages	21	-14.86	95%	2.24
Vascular plants	21	-13.33	76%	2.19
Lichens	21	-13.00	90%	2.14
Bryophytes	21	-12.62	86%	2.00
Diptera	21	-12.39	78%	1.78
Beetles	21	-10.48	67%	1.95
Spiders	21	-9.17	61%	1.67

4.3 Compact counterfactual plots (A vs B)

ggplot(cf_AB, aes(x = cf_mag_signed, fill = assumption)) +
  geom_histogram(bins = 18, color = "white") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  facet_wrap(~ assumption) +
  scale_fill_manual(values = two_cols) +
  guides(fill = "none") +
  labs(
    title = "Counterfactual (Q1): distribution of signed magnitude (% change)",
    x = "Signed % change under counterfactual",
    y = "Count"
  ) +
  theme_minimal()

5. Policy uplifts relative to the counterfactual (Q2 deltas)

Section 5 interprets Q2 policy deltas, which are defined as incremental change relative to the counterfactual. These are not absolute outcomes. A positive delta means “less decline than counterfactual” (or “more increase than counterfactual”), but it does not tell us whether the system is stable/recovering overall.

We therefore treat Section 5 as answering:

Are the scenarios directionally beneficial relative to baseline?
How large are the uplifts, and how consistent are they?
Do uplifts differ across habitat groups, and does the A vs B aggregation choice affect those habitat-group comparisons?

The coverage table confirms that both A and B are based on the same expert panel (n=7), with B having fewer units because of the within-habitat collapsing step.

Q2 magnitudes are defined relative to the counterfactual. We show overall and habitat-group summaries under both assumptions.

pol_AB <- df_AB %>%
  filter(str_detect(scenario, "^policy_"))

pol_AB %>%
  group_by(assumption) %>%
  summarise(
    n_units = n(),
    n_nonmiss = sum(!is.na(pol_mag_signed)),
    n_experts = n_distinct(id_gj),
    .groups = "drop"
  ) %>%
  kable(caption = "Incremental policy delta (Q2) coverage check (Assumptions A and B)") %>%
  kable_styling(full_width = FALSE)

Incremental policy delta (Q2) coverage check (Assumptions A and B)
assumption	n_units	n_nonmiss	n_experts
A (rows = all survey_taxa)	588	564	7
B (collapsed = unique taxa:habitat_group)	504	480	7

5.1 Overall incremental uplifts by scenario (Assumptions A and B)

Interpretation: all policies uplift relative to counterfactual; policy_3 is consistently strongest

The scenario-level delta table shows positive mean uplifts for every policy scenario under both assumptions. Under Assumption A (row-based), the mean deltas are:

policy_1: +2.66%
policy_2: +7.77%
policy_4: +7.34%
policy_3: +10.60% (largest)

This produces a clear and stable ranking: policy_3 > policy_2 ≈ policy_4 > policy_1.

The “prop_positive” column shows that most expert–unit responses are positive deltas (for example, under policy_3 it is very high), meaning experts generally agree policies improve outcomes relative to baseline even when they disagree about magnitude.

A vs B: At the overall scenario level, collapsing multiplicity (B) should not materially change the ordering and typically shifts headline means only modestly, because only a small subset of taxa:habitat combinations are affected by the collapse step. The main sensitivity to A vs B is expected to show up in habitat-group summaries, not in the overall ranking.

delta_headline_AB <- pol_AB %>%
  group_by(assumption, scenario) %>%
  summarise(
    mean_delta = mean(pol_mag_signed, na.rm = TRUE),
    median_delta = median(pol_mag_signed, na.rm = TRUE),
    sd_delta = sd(pol_mag_signed, na.rm = TRUE),
    p10 = quantile(pol_mag_signed, 0.10, na.rm = TRUE),
    p90 = quantile(pol_mag_signed, 0.90, na.rm = TRUE),
    prop_positive = mean(pol_mag_signed > 0, na.rm = TRUE),
    prop_zero = mean(pol_mag_signed == 0, na.rm = TRUE),
    prop_negative = mean(pol_mag_signed < 0, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(assumption, desc(mean_delta))

delta_headline_AB %>%
  mutate(across(where(is.numeric), ~ round(.x, 2))) %>%
  kable(caption = "Incremental policy uplifts (Q2 deltas): overall by scenario (Assumptions A and B)") %>%
  kable_styling(full_width = FALSE)

Incremental policy uplifts (Q2 deltas): overall by scenario (Assumptions A and B)
assumption	scenario	mean_delta	median_delta	sd_delta	p10	p90	prop_positive	prop_zero	prop_negative
A (rows = all survey_taxa)	policy_3	10.60	15	14.29	-20	25.0	0.84	0.01	0.15
A (rows = all survey_taxa)	policy_2	7.77	10	9.43	-10	20.0	0.79	0.06	0.15
A (rows = all survey_taxa)	policy_4	7.34	10	10.55	-15	18.0	0.82	0.04	0.15
A (rows = all survey_taxa)	policy_1	2.66	5	7.67	-10	10.0	0.58	0.13	0.29
B (collapsed = unique taxa:habitat_group)	policy_3	10.58	15	14.30	-20	25.0	0.83	0.02	0.15
B (collapsed = unique taxa:habitat_group)	policy_2	7.94	10	9.42	-10	20.0	0.82	0.03	0.15
B (collapsed = unique taxa:habitat_group)	policy_4	7.31	10	10.56	-15	17.1	0.81	0.04	0.15
B (collapsed = unique taxa:habitat_group)	policy_1	2.74	5	7.68	-10	10.0	0.62	0.09	0.29

ggplot(delta_headline_AB, aes(x = scenario, y = mean_delta, fill = assumption)) +
  geom_col(position = position_dodge(width = 0.8)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  scale_fill_manual(values = two_cols) +
  labs(
    title = "Incremental policy uplifts by scenario (Assumptions A and B)",
    x = NULL,
    y = "Mean signed % delta"
  ) +
  theme_minimal()

5.2 Uplifts by habitat_group (Assumptions A and B)

delta_by_hab_AB <- pol_AB %>%
  group_by(assumption, habitat_group, scenario) %>%
  summarise(
    n = n(),
    mean_delta = mean(pol_mag_signed, na.rm = TRUE),
    prop_positive = mean(pol_mag_signed > 0, na.rm = TRUE),
    .groups = "drop"
  )

delta_by_hab_AB %>%
  mutate(
    mean_delta = round(mean_delta, 2),
    prop_positive = scales::percent(prop_positive, accuracy = 1)
  ) %>%
  arrange(assumption, habitat_group, scenario) %>%
  kable(caption = "Incremental policy uplifts by habitat_group (Assumptions A and B)") %>%
  kable_styling(full_width = FALSE) %>%
  scroll_box(height = "320px")

Incremental policy uplifts by habitat_group (Assumptions A and B)
assumption	habitat_group	scenario	n	mean_delta	prop_positive
A (rows = all survey_taxa)	Freshwater/Wetland	policy_1	56	2.96	59%
A (rows = all survey_taxa)	Freshwater/Wetland	policy_2	56	7.87	78%
A (rows = all survey_taxa)	Freshwater/Wetland	policy_3	56	11.69	85%
A (rows = all survey_taxa)	Freshwater/Wetland	policy_4	56	8.69	83%
A (rows = all survey_taxa)	Open habitats	policy_1	49	2.45	57%
A (rows = all survey_taxa)	Open habitats	policy_2	49	7.83	79%
A (rows = all survey_taxa)	Open habitats	policy_3	49	10.36	83%
A (rows = all survey_taxa)	Open habitats	policy_4	49	7.04	83%
A (rows = all survey_taxa)	Woodland	policy_1	42	2.50	57%
A (rows = all survey_taxa)	Woodland	policy_2	42	7.57	82%
A (rows = all survey_taxa)	Woodland	policy_3	42	9.40	82%
A (rows = all survey_taxa)	Woodland	policy_4	42	5.88	78%
B (collapsed = unique taxa:habitat_group)	Freshwater/Wetland	policy_1	42	2.85	62%
B (collapsed = unique taxa:habitat_group)	Freshwater/Wetland	policy_2	42	7.89	82%
B (collapsed = unique taxa:habitat_group)	Freshwater/Wetland	policy_3	42	11.59	85%
B (collapsed = unique taxa:habitat_group)	Freshwater/Wetland	policy_4	42	8.62	82%
B (collapsed = unique taxa:habitat_group)	Open habitats	policy_1	42	2.88	65%
B (collapsed = unique taxa:habitat_group)	Open habitats	policy_2	42	8.35	80%
B (collapsed = unique taxa:habitat_group)	Open habitats	policy_3	42	10.75	82%
B (collapsed = unique taxa:habitat_group)	Open habitats	policy_4	42	7.43	82%
B (collapsed = unique taxa:habitat_group)	Woodland	policy_1	42	2.50	57%
B (collapsed = unique taxa:habitat_group)	Woodland	policy_2	42	7.57	82%
B (collapsed = unique taxa:habitat_group)	Woodland	policy_3	42	9.40	82%
B (collapsed = unique taxa:habitat_group)	Woodland	policy_4	42	5.88	78%

ggplot(delta_by_hab_AB, aes(x = scenario, y = mean_delta, fill = assumption)) +
  geom_col(position = position_dodge(width = 0.8)) +
  facet_wrap(~ habitat_group) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  scale_fill_manual(values = two_cols) +
  labs(
    title = "Incremental policy uplifts by habitat_group (Assumptions A and B)",
    x = NULL,
    y = "Mean signed % delta"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5)
  )

Interpretation: uplifts vary by habitat group; A vs B matters most here (by design)

Habitat-group uplift summaries provide the first indication of where policy leverage is expected to be strongest. The key point is not just “uplifts are positive”, but whether uplifts in the most-negative baseline habitat groups (Freshwater/Wetland and Open habitats) are large enough to offset the baseline pull established in Section 4.

Because Assumption B collapses the freshwater assemblage multiplicity (and lichen multiplicity in open habitats), the habitat-group mean delta is the place to look for any material change between A vs B. If the habitat-group conclusions are robust across A and B, that strengthens confidence that the reported habitat differences are not artefacts of how the survey rows were structured.

5.3 Uplifts by taxa (A only; A == B)

pol_A <- df_A %>%
  filter(str_detect(scenario, "^policy_"))

delta_by_taxa <- pol_A %>%
  group_by(taxa, scenario) %>%
  summarise(
    n = n(),
    mean_delta = mean(pol_mag_signed, na.rm = TRUE),
    prop_positive = mean(pol_mag_signed > 0, na.rm = TRUE),
    .groups = "drop"
  )

delta_by_taxa %>%
  mutate(
    mean_delta = round(mean_delta, 2),
    prop_positive = scales::percent(prop_positive, accuracy = 1)
  ) %>%
  kable(caption = "Incremental policy uplifts by taxa (A; identical interpretation under B)") %>%
  kable_styling(full_width = FALSE) %>%
  scroll_box(height = "320px")

Incremental policy uplifts by taxa (A; identical interpretation under B)
taxa	scenario	n	mean_delta	prop_positive
Beetles	policy_1	21	3.33	57%
Beetles	policy_2	21	9.57	86%
Beetles	policy_3	21	11.38	86%
Beetles	policy_4	21	7.43	81%
Bryophytes	policy_1	21	1.81	62%
Bryophytes	policy_2	21	7.10	81%
Bryophytes	policy_3	21	10.38	81%
Bryophytes	policy_4	21	7.24	81%
Diptera	policy_1	21	2.22	50%
Diptera	policy_2	21	7.11	72%
Diptera	policy_3	21	9.17	78%
Diptera	policy_4	21	6.56	67%
Freshwater invertebrate assemblages	policy_1	21	3.29	57%
Freshwater invertebrate assemblages	policy_2	21	7.81	71%
Freshwater invertebrate assemblages	policy_3	21	11.95	86%
Freshwater invertebrate assemblages	policy_4	21	8.86	86%
Lichens	policy_1	21	1.10	48%
Lichens	policy_2	21	5.05	76%
Lichens	policy_3	21	8.19	86%
Lichens	policy_4	21	4.90	86%
Spiders	policy_1	21	1.83	61%
Spiders	policy_2	21	7.61	83%
Spiders	policy_3	21	9.61	83%
Spiders	policy_4	21	6.44	83%
Vascular plants	policy_1	21	4.86	71%
Vascular plants	policy_2	21	10.05	86%
Vascular plants	policy_3	21	13.14	86%
Vascular plants	policy_4	21	9.71	86%

5.4 summary

Across both aggregation assumptions, experts perceive all policy scenarios as improving outcomes relative to the counterfactual. The policy scenarios are clearly differentiated in strength, with policy_3 consistently eliciting the largest average uplift, followed by policy_2 and policy_4, with policy_1 weakest.

However, deltas alone do not answer whether the target-relevant condition (stability/recovery by 2042) is achieved. That requires combining Q1 and Q2 into absolute outcomes (Section 6).

6. Absolute outcomes under policy relative to counterfactual

Section 6 performs the key “sufficiency test” by converting the elicited components into absolute outcomes under each policy scenario:

abs_policy_mag_signed = cf_mag_signed + pol_mag_signed

This step matters because it answers the policy-facing question:

Do scenarios merely slow decline (still negative)?
Do they stabilise (near 0)?
Do they generate net recovery (positive)?

We present headline results under both Assumptions A and B. Here, A vs B matters most for habitat-group interpretation, because the collapsing step in B prevents freshwater assemblage and lichen multiplicity from dominating habitat-group means.

Absolute signed outcomes under policy are:

Absolute policy outcome = Counterfactual (Q1) + Policy delta (Q2)

6.1 Scenario-level outcomes: sufficiency test (Assumptions A and B)

policy_summary_AB <- pol_AB %>%
  group_by(assumption, scenario) %>%
  summarise(
    mean_policy = mean(abs_policy_mag_signed, na.rm = TRUE),
    median_policy = median(abs_policy_mag_signed, na.rm = TRUE),
    sd_policy = sd(abs_policy_mag_signed, na.rm = TRUE),
    p10 = quantile(abs_policy_mag_signed, 0.10, na.rm = TRUE),
    p90 = quantile(abs_policy_mag_signed, 0.90, na.rm = TRUE),
    prop_positive = mean(abs_policy_mag_signed > 0, na.rm = TRUE),
    prop_near_zero = mean(abs(abs_policy_mag_signed) <= 1, na.rm = TRUE),
    prop_negative = mean(abs_policy_mag_signed < 0, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(assumption, desc(mean_policy))

policy_summary_AB %>%
  mutate(across(where(is.numeric), ~ round(.x, 2))) %>%
  kable(caption = "Absolute signed % outcome under each policy (Assumptions A and B)") %>%
  kable_styling(full_width = FALSE)

Absolute signed % outcome under each policy (Assumptions A and B)
assumption	scenario	mean_policy	median_policy	sd_policy	p10	p90	prop_positive	prop_near_zero	prop_negative
A (rows = all survey_taxa)	policy_3	-1.73	5	21.54	-50	15.0	0.67	0.13	0.21
A (rows = all survey_taxa)	policy_2	-4.55	0	16.51	-40	9.0	0.34	0.28	0.39
A (rows = all survey_taxa)	policy_4	-4.99	0	18.74	-45	13.0	0.41	0.23	0.36
A (rows = all survey_taxa)	policy_1	-9.67	-5	14.56	-35	5.0	0.18	0.20	0.63
B (collapsed = unique taxa:habitat_group)	policy_3	-1.37	5	21.87	-50	15.0	0.68	0.12	0.21
B (collapsed = unique taxa:habitat_group)	policy_2	-4.01	0	16.88	-40	10.0	0.38	0.31	0.34
B (collapsed = unique taxa:habitat_group)	policy_4	-4.64	0	19.09	-45	13.0	0.42	0.22	0.35
B (collapsed = unique taxa:habitat_group)	policy_1	-9.20	-5	15.00	-35	5.3	0.20	0.22	0.59

Interpretation: strong uplifts, but mean absolute outcomes remain negative under every policy scenario

The scenario-level absolute outcome table shows a consistent pattern:

All scenarios improve outcomes relative to the counterfactual (consistent with Section 5),
but the mean absolute signed outcome remains negative under every scenario, i.e. the central tendency is continued decline by 2042 even after policy uplift is applied.

Under Assumption A, the mean absolute outcomes are:

policy_1: −9.67%
policy_2: −4.55%
policy_4: −4.99%
policy_3: −1.73% (closest to stability)

This is the core quantitative message: policy_3 comes close to offsetting baseline decline on average, but does not cross the stability threshold (0%) in the mean.

Two nuances are important for interpretation:

The median can be more optimistic than the mean (e.g., policy_3 median is positive), indicating that a substantial portion of units have positive outcomes even when the mean remains negative.
The p10 values remain strongly negative for all scenarios, implying persistent “tails” of severe decline that can pull the average below zero even if many units improve.

A vs B: the table is shown under both assumptions because the “headline mean” is sensitive to how units are counted. If the A vs B results are close, that strengthens confidence that conclusions are not being driven by survey-row multiplicity.

ggplot(policy_summary_AB, aes(x = scenario, y = mean_policy, fill = assumption)) +
  geom_col(position = position_dodge(width = 0.8)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  scale_fill_manual(values = two_cols) +
  labs(
    title = "Absolute signed % outcome under each policy (Assumptions A and B)",
    x = NULL,
    y = "Mean signed % change (CF + delta)"
  ) +
  theme_minimal()

6.2 Visual comparison: counterfactual mean vs policy mean (Assumptions A and B)

cf_mean_AB <- cf_AB %>%
  group_by(assumption) %>%
  summarise(mean_cf = mean(cf_mag_signed, na.rm = TRUE), .groups = "drop")

policy_summary_plot_AB <- policy_summary_AB %>%
  left_join(cf_mean_AB, by = "assumption")

ggplot(policy_summary_plot_AB, aes(x = scenario)) +
  geom_col(aes(y = mean_policy, fill = assumption), position = position_dodge(width = 0.8)) +
  facet_wrap(~ assumption) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_hline(aes(yintercept = mean_cf), linetype = "dotted") +
  scale_fill_manual(values = two_cols) +
  labs(
    title = "Mean signed % outcome: Counterfactual vs Policy (A and B)",
    subtitle = "Dotted = counterfactual mean; dashed = stability (0%)",
    x = NULL,
    y = "Mean signed % change"
  ) +
  theme_minimal()

Reading the CF vs policy mean plot: “distance to zero” is the target-relevant diagnostic

In the mean comparison plot, the dashed line at 0% represents stability (no net change by 2042). The dotted line marks the counterfactual mean. The vertical distance:

from dotted line to each policy bar ≈ the average uplift achieved, and
from each policy bar to zero ≈ the remaining recovery gap (the shortfall relative to stability).

The important result is that all scenario means sit above the counterfactual mean (uplift), but still below zero (insufficient for stability in the mean). This graph makes the distinction between “uplift” and “recovery” visually explicit.

6.3 Proportion stabilised and recovered (Assumptions A and B)

pol_AB %>%
  group_by(assumption, scenario) %>%
  summarise(
    prop_recovery = mean(abs_policy_mag_signed > 0, na.rm = TRUE),
    prop_stabilised = mean(abs_policy_mag_signed >= 0, na.rm = TRUE),
    prop_decline_remaining = mean(abs_policy_mag_signed < 0, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(across(where(is.numeric), ~ scales::percent(.x, accuracy = 1))) %>%
  kable(caption = "Proportion stabilised/recovered under each policy (Assumptions A and B)") %>%
  kable_styling(full_width = FALSE)

Proportion stabilised/recovered under each policy (Assumptions A and B)
assumption	scenario	prop_recovery	prop_stabilised	prop_decline_remaining
A (rows = all survey_taxa)	policy_1	18%	37%	63%
A (rows = all survey_taxa)	policy_2	34%	61%	39%
A (rows = all survey_taxa)	policy_3	67%	79%	21%
A (rows = all survey_taxa)	policy_4	41%	64%	36%
B (collapsed = unique taxa:habitat_group)	policy_1	20%	41%	59%
B (collapsed = unique taxa:habitat_group)	policy_2	38%	66%	34%
B (collapsed = unique taxa:habitat_group)	policy_3	68%	79%	21%
B (collapsed = unique taxa:habitat_group)	policy_4	42%	65%	35%

Interpretation: many units stabilise/recover under policy_3, but a non-trivial minority remain in decline

The stabilised/recovered fractions provide a more decision-friendly view than the mean alone:

Under policy_3, around 79% of units are stabilised (≥0), and around 67–68% are positive (>0), while roughly 21% remain negative.
We see clear monotonic improvement from policy_1 → policy_3 (and policy_4 intermediate).

This reconciles two facts that can otherwise look contradictory:

Why can the mean remain negative under policy_3 while most units are stable/positive?
- Because the remaining negative units are often sufficiently negative to “drag down” the mean; the aggregate is sensitive to tails.

A vs B: the fraction table is presented under both assumptions; the similarity between A and B proportions indicates the “many improve, some remain strongly negative” pattern is not an artefact of survey-row multiplicity.

6.4 Absolute outcomes by habitat_group (Assumptions A and B)

policy_by_hab_AB <- df_AB %>%
  group_by(assumption, habitat_group, scenario) %>%
  summarise(
    n = n(),
    mean_policy = mean(abs_policy_mag_signed, na.rm = TRUE),
    prop_positive = mean(abs_policy_mag_signed > 0, na.rm = TRUE),
    .groups = "drop"
  )

policy_by_hab_AB %>%
  mutate(
    mean_policy = round(mean_policy, 2),
    prop_positive = scales::percent(prop_positive, accuracy = 1)
  ) %>%
  arrange(assumption, habitat_group, scenario) %>%
  kable(caption = "Absolute outcomes under policy by habitat_group (Assumptions A and B)") %>%
  kable_styling(full_width = FALSE) %>%
  scroll_box(height = "360px")

Absolute outcomes under policy by habitat_group (Assumptions A and B)
assumption	habitat_group	scenario	n	mean_policy	prop_positive
A (rows = all survey_taxa)	Freshwater/Wetland	counterfactual	56	-13.83	2%
A (rows = all survey_taxa)	Freshwater/Wetland	policy_1	56	-10.87	9%
A (rows = all survey_taxa)	Freshwater/Wetland	policy_2	56	-5.96	24%
A (rows = all survey_taxa)	Freshwater/Wetland	policy_3	56	-2.15	69%
A (rows = all survey_taxa)	Freshwater/Wetland	policy_4	56	-5.15	41%
A (rows = all survey_taxa)	Open habitats	counterfactual	49	-13.43	0%
A (rows = all survey_taxa)	Open habitats	policy_1	49	-10.98	17%
A (rows = all survey_taxa)	Open habitats	policy_2	49	-5.60	36%
A (rows = all survey_taxa)	Open habitats	policy_3	49	-3.06	66%
A (rows = all survey_taxa)	Open habitats	policy_4	49	-6.38	40%
A (rows = all survey_taxa)	Woodland	counterfactual	42	-9.00	22%
A (rows = all survey_taxa)	Woodland	policy_1	42	-6.50	30%
A (rows = all survey_taxa)	Woodland	policy_2	42	-1.42	45%
A (rows = all survey_taxa)	Woodland	policy_3	42	0.40	68%
A (rows = all survey_taxa)	Woodland	policy_4	42	-3.12	42%
B (collapsed = unique taxa:habitat_group)	Freshwater/Wetland	counterfactual	42	-13.47	0%
B (collapsed = unique taxa:habitat_group)	Freshwater/Wetland	policy_1	42	-10.62	10%
B (collapsed = unique taxa:habitat_group)	Freshwater/Wetland	policy_2	42	-5.58	30%
B (collapsed = unique taxa:habitat_group)	Freshwater/Wetland	policy_3	42	-1.88	68%
B (collapsed = unique taxa:habitat_group)	Freshwater/Wetland	policy_4	42	-4.85	42%
B (collapsed = unique taxa:habitat_group)	Open habitats	counterfactual	42	-13.36	0%
B (collapsed = unique taxa:habitat_group)	Open habitats	policy_1	42	-10.49	20%
B (collapsed = unique taxa:habitat_group)	Open habitats	policy_2	42	-5.01	40%
B (collapsed = unique taxa:habitat_group)	Open habitats	policy_3	42	-2.61	70%
B (collapsed = unique taxa:habitat_group)	Open habitats	policy_4	42	-5.94	42%
B (collapsed = unique taxa:habitat_group)	Woodland	counterfactual	42	-9.00	22%
B (collapsed = unique taxa:habitat_group)	Woodland	policy_1	42	-6.50	30%
B (collapsed = unique taxa:habitat_group)	Woodland	policy_2	42	-1.42	45%
B (collapsed = unique taxa:habitat_group)	Woodland	policy_3	42	0.40	68%
B (collapsed = unique taxa:habitat_group)	Woodland	policy_4	42	-3.12	42%

ggplot(policy_by_hab_AB, aes(x = scenario, y = mean_policy, fill = assumption)) +
  geom_col(position = position_dodge(width = 0.8)) +
  facet_wrap(~ habitat_group) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  scale_fill_manual(values = two_cols) +
  labs(
    title = "Absolute outcomes under policy by habitat_group (Assumptions A and B)",
    x = NULL,
    y = "Mean signed % change (CF + delta)"
  ) +
  theme_minimal()

Interpretation: habitat-group constraints are clear, and Woodland is the closest to neutrality

The habitat-group absolute outcomes sharpen where the recovery gap sits:

Freshwater/Wetland remains negative under every scenario, including policy_3 (mean still below 0), though the proportion positive rises strongly under policy_3.
Open habitats also remain negative across scenarios; policy_3 improves outcomes markedly but does not remove the negative mean.
Woodland is the closest to neutral and can become slightly positive under the strongest scenario (consistent with the baseline being less negative here).

This pattern is consistent with the baseline in Section 4: the most negative baseline domains (Freshwater/Wetland and Open habitats) remain the binding constraints on achieving system-wide stability by 2042.

Because these habitat-group results are shown for both A and B, we can interpret the habitat differences with higher confidence: the story persists even after removing the freshwater/lichen row multiplicity in Assumption B.

6.5 Absolute outcomes by taxa (A only; A == B)

policy_by_taxa <- pol_A %>%
  group_by(taxa, scenario) %>%
  summarise(
    mean_policy = mean(abs_policy_mag_signed, na.rm = TRUE),
    prop_positive = mean(abs_policy_mag_signed > 0, na.rm = TRUE),
    .groups = "drop"
  )

policy_by_taxa %>%
  mutate(
    mean_policy = round(mean_policy, 2),
    prop_positive = scales::percent(prop_positive, accuracy = 1)
  ) %>%
  kable(caption = "Absolute outcomes under policy by taxa (A; identical interpretation under B)") %>%
  kable_styling(full_width = FALSE) %>%
  scroll_box(height = "360px")

Absolute outcomes under policy by taxa (A; identical interpretation under B)
taxa	scenario	mean_policy	prop_positive
Beetles	policy_1	-7.14	29%
Beetles	policy_2	-0.90	57%
Beetles	policy_3	0.90	67%
Beetles	policy_4	-3.05	52%
Bryophytes	policy_1	-10.81	10%
Bryophytes	policy_2	-5.52	24%
Bryophytes	policy_3	-2.24	71%
Bryophytes	policy_4	-5.38	33%
Diptera	policy_1	-10.17	22%
Diptera	policy_2	-5.28	28%
Diptera	policy_3	-3.22	61%
Diptera	policy_4	-5.83	33%
Freshwater invertebrate assemblages	policy_1	-11.57	5%
Freshwater invertebrate assemblages	policy_2	-7.05	10%
Freshwater invertebrate assemblages	policy_3	-2.90	71%
Freshwater invertebrate assemblages	policy_4	-6.00	38%
Lichens	policy_1	-11.90	5%
Lichens	policy_2	-7.95	14%
Lichens	policy_3	-4.81	52%
Lichens	policy_4	-8.10	24%
Spiders	policy_1	-7.33	33%
Spiders	policy_2	-1.56	61%
Spiders	policy_3	0.44	83%
Spiders	policy_4	-2.72	61%
Vascular plants	policy_1	-8.48	24%
Vascular plants	policy_2	-3.29	48%
Vascular plants	policy_3	-0.19	67%
Vascular plants	policy_4	-3.62	48%

6.6 Probability of full recovery (policy outcome ≥ 0) with expert-bootstrap uncertainty

To make the scenario results easier to interpret, we compute the probability of full recovery for each scenario:

Full recovery means the absolute implied policy outcome is ≥ 0% (stabilisation or net recovery).
In this workflow, Q2 magnitudes are elicited relative to the counterfactual, and the absolute outcome is therefore represented by abs_policy_mag_signed (counterfactual signed magnitude plus policy delta).

To quantify uncertainty given a small expert panel, we compute the recovery probability for each scenario and assumption, and attach 95% percentile intervals from a cluster bootstrap over experts (resampling experts,id_gj, with replacement).

# ============================================================
#  probability of full recovery (abs_policy_mag_signed >= 0)
# cluster bootstrap over experts (id_gj), by Assumption (A vs B)
# ============================================================

library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(scales)
library(knitr)
library(kableExtra)

# ---- 0) Use the Section 6 object exactly as-is ----
stopifnot(exists("df_AB"))
stopifnot(all(c("assumption","scenario","id_gj","abs_policy_mag_signed") %in% names(df_AB)))

policy_levels <- c("policy_1","policy_2","policy_3","policy_4")

dat <- df_AB %>%
  mutate(
    assumption = as.character(assumption),
    scenario   = as.character(scenario)
  ) %>%
  filter(scenario %in% policy_levels)

assumptions <- sort(unique(dat$assumption))
scenarios   <- policy_levels

# ---- 1) Define full recovery indicator ----
dat <- dat %>%
  mutate(full_recovery = abs_policy_mag_signed >= 0)

# ---- 2) Bootstrap settings ----
set.seed(1)
B <- 5000

# helper: weighted mean for logical indicators under expert-resampling weights
wtd_mean <- function(x, w) {
  ok <- !is.na(x) & !is.na(w)
  if (!any(ok)) return(NA_real_)
  sum(w[ok] * x[ok]) / sum(w[ok])
}

# ---- 3) One cluster-bootstrap draw within ONE assumption ----
boot_once_assumption <- function(df_a) {

  experts <- sort(unique(df_a$id_gj[!is.na(df_a$id_gj)]))
  Nexp <- length(experts)

  samp <- sample(experts, size = Nexp, replace = TRUE)
  w <- table(samp)

  dfb <- df_a %>%
    semi_join(tibble(id_gj = as.integer(names(w))), by = "id_gj") %>%
    mutate(.w = as.numeric(w[as.character(id_gj)]))

  dfb %>%
    group_by(scenario) %>%
    summarise(
      p = wtd_mean(as.numeric(full_recovery), .w),
      .groups = "drop"
    )
}

# ---- 4) Run bootstrap separately for A and B ----
boot_draws <- map_dfr(assumptions, function(a) {
  df_a <- dat %>% filter(assumption == a)

  map_dfr(seq_len(B), function(b) {
    boot_once_assumption(df_a) %>%
      mutate(assumption = a, b = b)
  })
})

# ---- 5) Point estimates (non-bootstrap) ----
point_est <- dat %>%
  group_by(assumption, scenario) %>%
  summarise(
    p_hat = mean(full_recovery, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  )

# ---- 6) Bootstrap percentile CIs ----
boot_ci <- boot_draws %>%
  group_by(assumption, scenario) %>%
  summarise(
    p_lo  = quantile(p, 0.025, na.rm = TRUE),
    p_med = quantile(p, 0.50,  na.rm = TRUE),
    p_hi  = quantile(p, 0.975, na.rm = TRUE),
    .groups = "drop"
  )

prob_recovery_AB <- point_est %>%
  left_join(boot_ci, by = c("assumption","scenario")) %>%
  mutate(
    scenario   = factor(scenario, levels = policy_levels),
    assumption = factor(assumption, levels = assumptions)
  ) %>%
  arrange(assumption, scenario)

# ---- 7) Pretty table ----
prob_recovery_AB %>%
  mutate(
    p_hat = percent(p_hat, accuracy = 1),
    p_lo  = percent(p_lo,  accuracy = 1),
    p_hi  = percent(p_hi,  accuracy = 1),
    ci95  = paste0(p_lo, "–", p_hi)
  ) %>%
  select(assumption, scenario, n, p_hat, ci95) %>%
  pivot_wider(
    id_cols = c(scenario),
    names_from = assumption,
    values_from = c(n, p_hat, ci95),
    names_glue = "{assumption}_{.value}"
  ) %>%
  kable(caption = "Probability of full recovery (abs_policy_mag_signed ≥ 0%) by scenario and assumption (cluster bootstrap over experts; 95% CI)") %>%
  kable_styling(full_width = FALSE)

Probability of full recovery (abs_policy_mag_signed ≥ 0%) by scenario and assumption (cluster bootstrap over experts; 95% CI)
scenario	A (rows = all survey_taxa)_n	B (collapsed = unique taxa:habitat_group)_n	A (rows = all survey_taxa)_p_hat	B (collapsed = unique taxa:habitat_group)_p_hat	A (rows = all survey_taxa)_ci95	B (collapsed = unique taxa:habitat_group)_ci95
policy_1	147	126	37%	41%	17%–58%	18%–64%
policy_2	147	126	61%	66%	37%–80%	41%–85%
policy_3	147	126	79%	79%	50%–97%	51%–98%
policy_4	147	126	64%	65%	33%–90%	35%–91%

# ---- 8) Plot: probability with CI (facet by assumption) ----
prob_recovery_AB %>%
  ggplot(aes(x = scenario, y = p_hat)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = p_lo, ymax = p_hi), width = 0.15) +
  facet_wrap(~ assumption) +
  geom_hline(yintercept=1, linetype="dashed") +
  scale_y_continuous(labels = percent_format(accuracy = 1), limits = c(0, 1)) +
  labs(
    title = "Probability of full recovery (abs_policy_mag_signed ≥ 0%)",
    subtitle = "Cluster bootstrap over experts (id_gj), 95% percentile intervals",
    x = NULL,
    y = "Probability"
  ) +
  theme_minimal()

# ---- 9) context table showing mean absolute outcomes too ----
dat %>%
  group_by(assumption, scenario) %>%
  summarise(
    mean_abs_policy = mean(abs_policy_mag_signed, na.rm = TRUE),
    median_abs_policy = median(abs_policy_mag_signed, na.rm = TRUE),
    prop_full_recovery = mean(full_recovery, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(assumption, scenario) %>%
  kable(digits = 2, caption = "Context: mean/median absolute policy outcome and raw recovery share") %>%
  kable_styling(full_width = FALSE)

Context: mean/median absolute policy outcome and raw recovery share
assumption	scenario	mean_abs_policy	median_abs_policy	prop_full_recovery
A (rows = all survey_taxa)	policy_1	-9.67	-5	0.37
A (rows = all survey_taxa)	policy_2	-4.55	0	0.61
A (rows = all survey_taxa)	policy_3	-1.73	5	0.79
A (rows = all survey_taxa)	policy_4	-4.99	0	0.64
B (collapsed = unique taxa:habitat_group)	policy_1	-9.20	-5	0.41
B (collapsed = unique taxa:habitat_group)	policy_2	-4.01	0	0.66
B (collapsed = unique taxa:habitat_group)	policy_3	-1.37	5	0.79
B (collapsed = unique taxa:habitat_group)	policy_4	-4.64	0	0.65

Across both assumptions, recovery probabilities increase strongly with policy ambition. Under Assumption A (all survey_taxa rows), the probability that the implied absolute policy outcome is ≥0% rises from 37% under policy_1 to 61% under policy_2, 79% under policy_3, and 64% under policy_4. The bootstrap intervals are wide—reflecting the small expert panel—but clearly shift upward with scenario strength (e.g. policy_3: 50–97%).

Results are highly consistent under Assumption B (collapsed to unique taxa × habitat_group), with probabilities of 41%, 66%, 79%, and 65% for policy_1–4 respectively. The similarity between A and B indicates that aggregation structure does not materially alter the overall recovery signal.

Policy_3 consistently emerges as the strongest scenario: it produces the highest probability of full recovery (~79%) under both assumptions. However, the wide bootstrap intervals (e.g. ~51–98% under B) indicate substantial expert-level uncertainty. Policy_1 remains clearly insufficient, with recovery probabilities below 50% under both assumptions and lower bounds well below 20%. Overall, stronger policy scenarios substantially increase the likelihood of stabilisation or recovery, but do not deliver near-certainty of full recovery across taxa.

6.7 Summary

Section 6 provides the core “sufficiency” conclusion:

Baseline counterfactual change is strongly negative (Section 4).
Policies generate positive deltas relative to baseline (Section 5).
After combining CF + delta, mean absolute outcomes remain negative under all scenarios, including the strongest package.
policy_3 is consistently closest to stability, and yields a high share of stable/positive units, but does not eliminate residual decline.

This creates a clear bridge to Section 7: we now test whether this sufficiency verdict is robust given panel size and disagreement.

7. Expert agreement, uncertainty structure, and robustness

Section 7 examines whether the headline conclusions (notably: “policy_3 is strongest but still not sufficient on average”) are robust to the small expert panel and to disagreement about magnitude.

We focus on three complementary diagnostics:

Dispersion (SD) in signed outcomes across experts (agreement proxy)
Mean–variance relationship (whether disagreement scales with effect size; a key Chapter-3-style diagnostic)
Resampling / panel-size sensitivity (whether means and rankings depend on who is in the panel)

To keep uncertainty aligned with the outcome summaries, dispersion and resampling are reported under both assumptions A and B.

7.1 Dispersion in signed magnitude (agreement proxy) — A and B

consensus_var_AB <- df_AB %>%
  filter(str_detect(scenario, "^policy_") | scenario == "counterfactual") %>%
  group_by(assumption, taxa, habitat_group, scenario) %>%
  summarise(
    mean_signed = mean(abs_policy_mag_signed, na.rm = TRUE),
    sd_signed   = sd(abs_policy_mag_signed, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  )

consensus_summary_AB <- consensus_var_AB %>%
  group_by(assumption, scenario) %>%
  summarise(
    mean_sd = mean(sd_signed, na.rm = TRUE),
    median_sd = median(sd_signed, na.rm = TRUE),
    .groups = "drop"
  )

consensus_summary_AB %>%
  mutate(across(where(is.numeric), ~ round(.x, 2))) %>%
  kable(caption = "Mean dispersion (SD) in signed magnitude by scenario (Assumptions A and B)") %>%
  kable_styling(full_width = FALSE)

Mean dispersion (SD) in signed magnitude by scenario (Assumptions A and B)
assumption	scenario	mean_sd	median_sd
A (rows = all survey_taxa)	counterfactual	11.94	11.13
A (rows = all survey_taxa)	policy_1	15.49	14.98
A (rows = all survey_taxa)	policy_2	17.60	16.79
A (rows = all survey_taxa)	policy_3	23.25	23.09
A (rows = all survey_taxa)	policy_4	20.18	19.79
B (collapsed = unique taxa:habitat_group)	counterfactual	11.92	11.13
B (collapsed = unique taxa:habitat_group)	policy_1	15.52	14.98
B (collapsed = unique taxa:habitat_group)	policy_2	17.65	16.79
B (collapsed = unique taxa:habitat_group)	policy_3	23.33	23.09
B (collapsed = unique taxa:habitat_group)	policy_4	20.24	19.79

Interpretation: disagreement increases under stronger policy, as expected for high-leverage interventions

The dispersion table shows that SD in signed outcomes is lowest under the counterfactual and increases under stronger policy packages, peaking around policy_3:

counterfactual mean SD ≈ 12
policy_1 mean SD ≈ 15.5
policy_2 mean SD ≈ 17.6
policy_3 mean SD ≈ 23
policy_4 mean SD ≈ 20

This pattern is typical of expert elicitation: experts agree more on the existence of baseline pressures than on the extent to which ambitious intervention can offset them. Importantly, the pattern is essentially the same under A and B, suggesting the uncertainty structure is not being driven by survey-row multiplicity.

Policy interpretation: uncertainty is concentrated where decisions are highest leverage (policy_3), so “how close to zero” policy_3 sits should be treated probabilistically rather than as a point estimate, even when the ranking of scenarios is stable.

ggplot(consensus_var_AB, aes(x = mean_signed, y = sd_signed)) +
  geom_point(alpha = 0.6) +
  facet_grid(assumption ~ scenario) +
  labs(
    title = "Mean–variance relationship across taxa × habitat_group (A and B)",
    x = "Mean signed % outcome",
    y = "Standard deviation (expert disagreement)"
  ) +
  theme_minimal()

Interpretation: the mean–variance relationship indicates structured (epistemic) uncertainty rather than noise

The mean–variance plot tests whether disagreement tends to increase as projected outcomes become more extreme. A widening spread away from zero (a “funnel” pattern) is a healthy signal: it suggests experts are most aligned around modest changes and diverge most when asked about large recovery or large decline.

In this analysis, the combination of (i) increasing dispersion under ambitious policy and (ii) mean–variance structure supports the interpretation that uncertainty is epistemic and structured (reflecting real limits on knowledge of intervention efficacy), rather than arbitrary disagreement.

7.2 Certainty structure — A and B

certainty_summary_AB <- df_AB %>%
  filter(str_detect(scenario, "^policy_") | scenario == "counterfactual") %>%
  group_by(assumption, scenario) %>%
  summarise(
    mean_cert = mean(if_else(scenario == "counterfactual", cf_cert, pol_cert), na.rm = TRUE),
    sd_cert = sd(if_else(scenario == "counterfactual", cf_cert, pol_cert), na.rm = TRUE),
    .groups = "drop"
  )

certainty_summary_AB %>%
  mutate(across(where(is.numeric), round, 2)) %>%
  kable(caption = "Mean and SD of certainty scores (1–5) (Assumptions A and B)") %>%
  kable_styling(full_width = FALSE)

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `across(where(is.numeric), round, 2)`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))

Mean and SD of certainty scores (1–5) (Assumptions A and B)
assumption	scenario	mean_cert	sd_cert
A (rows = all survey_taxa)	counterfactual	2.01	0.54
A (rows = all survey_taxa)	policy_1	2.01	0.77
A (rows = all survey_taxa)	policy_2	2.01	0.71
A (rows = all survey_taxa)	policy_3	1.97	0.74
A (rows = all survey_taxa)	policy_4	1.99	0.72
B (collapsed = unique taxa:habitat_group)	counterfactual	1.97	0.53
B (collapsed = unique taxa:habitat_group)	policy_1	2.01	0.77
B (collapsed = unique taxa:habitat_group)	policy_2	2.00	0.72
B (collapsed = unique taxa:habitat_group)	policy_3	1.96	0.75
B (collapsed = unique taxa:habitat_group)	policy_4	1.98	0.73

Interpretation: certainty scores add context, but magnitudes/dispersion carry most of the inference

Certainty scores provide a self-reported complement to dispersion. In practice, with small panels, certainty scales can be “sticky” (limited spread), so it is appropriate that the analysis relies more heavily on:

signed magnitudes (what experts think will happen),
dispersion (how much they disagree), and
resampling stability (how dependent conclusions are on panel composition).

Where certainty is systematically low in the most negative baseline domains, it reinforces the case for carrying uncertainty forward into trajectory modelling rather than over-interpreting point means.

7.3 Resampling stability — A and B

set.seed(123)
scenario_levels_pol <- c("policy_1","policy_2","policy_3","policy_4")

resample_means_any <- function(dat, k, scenario_name, n_iter = 1000){
  replicate(n_iter, {
    sample_ids <- sample(unique(dat$id_gj), k, replace = FALSE)
    dat %>%
      filter(id_gj %in% sample_ids, scenario == scenario_name) %>%
      summarise(mean_signed = mean(abs_policy_mag_signed, na.rm = TRUE)) %>%
      pull(mean_signed)
  })
}

resampling_results_AB <- df_AB %>%
  filter(str_detect(scenario, "^policy_")) %>%
  group_by(assumption) %>%
  group_modify(~{
    dat <- .x
    purrr::map_dfr(scenario_levels_pol, function(sc){
      tibble(
        scenario = sc,
        k2 = sd(resample_means_any(dat, 2, sc)),
        k3 = sd(resample_means_any(dat, 3, sc)),
        k4 = sd(resample_means_any(dat, 4, sc)),
        k5 = sd(resample_means_any(dat, 5, sc)),
        k6 = sd(resample_means_any(dat, 6, sc)),
        k7 = sd(resample_means_any(dat, 7, sc))
      )
    })
  }) %>%
  ungroup()

resampling_results_AB %>%
  mutate(across(where(is.numeric), round, 2)) %>%
  kable(caption = "Resampling SD of mean signed outcome by panel size (Assumptions A and B)") %>%
  kable_styling(full_width = FALSE)

Resampling SD of mean signed outcome by panel size (Assumptions A and B)
assumption	scenario	k2	k3	k4	k5	k6
A (rows = all survey_taxa)	policy_1	8.32	6.01	4.47	3.24	2.07
A (rows = all survey_taxa)	policy_2	9.77	7.10	5.36	3.83	2.50
A (rows = all survey_taxa)	policy_3	13.14	9.80	7.38	5.36	3.44
A (rows = all survey_taxa)	policy_4	11.02	8.40	6.31	4.60	2.97
B (collapsed = unique taxa:habitat_group)	policy_1	8.24	6.24	4.56	3.34	2.24
B (collapsed = unique taxa:habitat_group)	policy_2	10.24	7.20	5.51	4.15	2.57
B (collapsed = unique taxa:habitat_group)	policy_3	13.74	10.04	7.50	5.57	3.34
B (collapsed = unique taxa:habitat_group)	policy_4	11.63	8.48	6.30	4.68	2.97

resampling_long_AB <- resampling_results_AB %>%
  pivot_longer(cols = starts_with("k"), names_to = "panel_size", values_to = "sd_mean") %>%
  mutate(panel_size = as.numeric(str_remove(panel_size, "k")))

ggplot(resampling_long_AB, aes(x = panel_size, y = sd_mean)) +
  geom_line() +
  geom_point() +
  facet_grid(assumption ~ scenario) +
  labs(
    title = "Panel-size sensitivity of mean signed policy outcome (Assumptions A and B)",
    x = "Number of experts sampled",
    y = "SD of mean signed outcome"
  ) +
  theme_minimal()

Interpretation: scenario ordering is stable; sensitivity is highest where outcomes are closest to zero

The resampling analysis asks: “If we had elicited from a slightly different subset of experts, would the headline means and scenario ranking change?”

The expected and policy-relevant pattern is:

Stability increases quickly as panel size grows (k increases),
The strongest scenario (policy_3) shows the greatest sensitivity, because it is closest to the stability threshold and has the highest dispersion.

This supports a nuanced conclusion:

We can be fairly confident in the directional result (“policy improves outcomes”) and the relative ranking of scenarios,
while treating the absolute distance to zero under policy_3 as uncertain and best communicated with uncertainty bounds.

7.3 summary

Uncertainty diagnostics strengthen (rather than weaken) the credibility of the results:

Disagreement increases under more ambitious policy, which is realistic.
The uncertainty structure is coherent and similar under assumptions A and B.
Resampling indicates that headline conclusions are not obviously driven by a single expert.

Together, this supports using the elicitation outputs as inputs to trajectory modelling, with uncertainty propagation and careful interpretation of near-zero outcomes.

8. Integrated synthesis and implications for the extinction-risk target (ERLI)

Counterfactual baseline is strongly negative: mean change is around −12% by 2042, with a high share of units negative.
All policy scenarios are beneficial relative to baseline: mean deltas are positive for every scenario, with a consistent ranking (policy_3 > policy_2 ≈ policy_4 > policy_1).
But uplifts are not sufficient in the mean: after combining CF + delta, the mean absolute outcome remains negative under every scenario; policy_3 is closest to stability but still below 0 in the mean.
Many units improve, yet residual “pockets of decline” remain: under policy_3, most units are stable/positive, but a material minority remain negative; these tails keep the mean below zero.
Habitat constraints are clear: Freshwater/Wetland and Open habitats remain negative even under the strongest scenario; Woodland is closest to neutrality.
Uncertainty is structured and concentrated where leverage is highest: disagreement is largest under policy_3; resampling supports stable scenario ranking but uncertain distance-to-zero under ambitious intervention.

8.1 What was elicited, and what it is (and is not) telling us

This expert elicitation separates two components of expected change by 2042:

Q1 (counterfactual): what experts expect under baseline pressures without the additional policy package.
Q2 (policy delta): the incremental change expected relative to that counterfactual.

This structure is valuable because it prevents us from conflating “policies help” with “policies deliver recovery.” Q2 can be strongly positive while absolute outcomes remain negative if baseline decline is large.

All synthesis statements below are robust to the two aggregation assumptions used in the analysis:

A: each survey_taxa row contributes (survey-design faithful, but can overweight taxa split into multiple rows within a habitat group),
B: within each expert × scenario, multiple survey rows representing the same taxa within the same habitat group are collapsed to one unit (prevents row multiplicity driving habitat-group results).

Only a small subset of combinations are affected by the A→B collapse step (freshwater assemblages in Freshwater/Wetland; lichens in Open habitats), so A vs B is best seen as a sensitivity check for habitat-group interpretation rather than a competing “true” baseline.

Selective evidence to keep visible in this section: - Table: Counterfactual baseline (Q1 signed %): Assumptions A and B - Table + figure: Counterfactual baseline by habitat_group (Assumptions A and B)

8.2 Baseline pressure: the counterfactual is strongly negative

Across both assumptions, the counterfactual baseline mean is around −12%, with a high proportion of units negative.

Habitat-group baselines show that:

Freshwater/Wetland and Open habitats have the most negative baseline means (large and widespread decline expectations),
Woodland is less negative, but still predominantly negative.

This is a crucial policy framing point: the baseline “pull” is substantial, so “reasonable” policy uplifts may still be insufficient to deliver stability unless they are large, broad, and fast-acting.

Selective evidence to keep visible: - Table + figure: Counterfactual baseline by habitat_group (Assumptions A and B)

8.3 Incremental uplifts: policies help, and scenarios are clearly ranked

All policy scenarios show positive mean deltas relative to the counterfactual, with a consistent ranking:

policy_3 is strongest (mean delta around +10.6% in A),
policy_2 and policy_4 are intermediate (~+7–8%),
policy_1 is weakest (~+2–3%).

The high proportion of positive deltas, especially under policy_3, indicates broad agreement that interventions improve outcomes relative to baseline. However, because these are deltas, this result should be communicated as “policies reduce decline relative to BAU,” not as “policies deliver recovery.”

Selective evidence to keep visible: - Table: Incremental policy uplifts (Q2 deltas): overall by scenario (Assumptions A and B)
- One figure: the scenario-level uplift bar chart (faceting by assumption if shown)

8.4 Sufficiency: CF + delta implies continued decline on average (even under the strongest package)

The key sufficiency test is the absolute outcome:

Absolute outcome under policy = counterfactual + delta

This is the quantity that matters for target framing (stability vs recovery). Under Assumption A, the mean absolute outcomes remain negative for every scenario:

policy_1 ≈ −9.67%
policy_2 ≈ −4.55%
policy_4 ≈ −4.99%
policy_3 ≈ −1.73% (closest to stability)

This means the policy packages, as elicited, are best characterised as decline-reduction pathways, not system-wide recovery pathways, at least in the mean.

At the same time, the distributions are heterogeneous: under policy_3, most units are stable/positive, yet a material minority remain negative, and these residual declines keep the mean below zero. This is exactly why it is helpful to report both the mean and the recovered/stabilised fractions.

Selective evidence to keep visible: - Table: Absolute signed % outcome under each policy (Assumptions A and B) - Figure: Mean signed % outcome: counterfactual vs policy

8.5 “Most improve” is not the same as “target met”: residual declines remain binding

The recovered/stabilised/decline-remaining table provides the most accessible translation:

Under policy_3, around 79% of units are stabilised (≥0) and around 67–68% are positive (>0), while roughly 21% remain negative.

This reconciles the apparent tension between “many outcomes are positive” and “mean outcome is negative.” The minority of negative outcomes are often sufficiently negative to dominate the mean. From a policy perspective, this implies that:

the strongest package generates widespread improvement,
but there remain persistent and/or severe problem components that would need targeted strengthening if system-wide neutrality is required.

Selective evidence to keep visible: - Table: recovered/stabilised/decline remaining under A and B

8.6 Habitat-group implications: where the remaining gap sits

Habitat-group absolute outcomes indicate that:

Freshwater/Wetland remains negative under all scenarios, including policy_3, even though the proportion positive rises strongly.
Open habitats also remain negative; policy_3 improves outcomes materially but does not eliminate the negative mean.
Woodland is closest to neutrality and can become slightly positive under the strongest package.

Because these habitat-group results are reported under both A and B, one can interpret them as a robust signal rather than a survey-row artefact.

Selective evidence to keep visible: - Table/scroll box: Absolute outcomes under policy by habitat_group (Assumptions A and B)
- One figure: habitat-group mean outcome bars by scenario (faceted by habitat_group and/or assumption)

8.7 Uncertainty and robustness: what we can say confidently, and what we should hedge

Uncertainty diagnostics show:

Dispersion (expert disagreement) increases under more ambitious policy, with the largest SD under policy_3.
The mean–variance relationship is consistent with structured epistemic uncertainty: experts diverge most on high-leverage outcomes.
Resampling supports the interpretation that scenario ordering is not driven by a single expert, but that “how close to zero” policy_3 lands should be treated probabilistically rather than as a point estimate.

This supports a credible communication stance:

High confidence: policies improve outcomes, and policy_3 is strongest.
Medium confidence: policy_3 is close to stability in the mean.
Lower confidence: the precise distance to zero and whether the system crosses into net-positive territory once uncertainties are propagated.

Selective evidence to keep visible: - Table: Mean dispersion (SD) in signed magnitude by scenario (Assumptions A and B) - One figure: mean–variance scatter (faceted by scenario), OR resampling stability plot (pick one, not both)

8.8 Implications for the extinction-risk target (ERLI framing)

The statutory extinction-risk target implies no net increase in extinction risk by 2042, and ideally improvement. We have not translated percent distribution-change expectations into explicit Red List category transitions (and therefore not directly into ERLI change), because that would require an additional mapping model and strong assumptions about how these percent changes relate to category thresholds.

However, the CF + delta results provide a strong sufficiency diagnostic:

The counterfactual is strongly negative (continued deterioration pressure).
Policy scenarios reduce that deterioration, but mean outcomes remain negative under all packages, including the strongest.
Habitat-group patterns indicate persistent negative components (especially freshwater and open habitats) even under policy_3.

Therefore, on the elicited evidence alone, it is difficult to argue that any of the tested policy packages is likely to deliver system-wide stability/recovery by 2042 in a way that would reliably support ERLI neutrality or improvement. The most defensible framing is:

policy_3 is closest and may deliver stabilisation/recovery for many components,
but residual decline remains, implying a continuing risk of upward extinction-risk pressure unless additional measures close the remaining gap.

Expert Elicitation (EE) responses – Tidy + Summaries + Signed Magnitudes + Resampling

Galina Jönsson

2026-03-02