1 Species descriptions and habitat code

1.1 VicFlora data extract

We start the compilation by extracting species descriptions and habitat codes from the VicFlora website. To do this, we build VicFlora queries from the list of 316 species. The code below writes an output table containing the available species descriptions, habitat or bioregion codes, match diagnostics, and synonym fallback details.

1.1.1 Matching hierarchy

The extract applies the following rules in order:

1. For ordinary names, var., subsp., aff., and names with bracketed notes, try the original full name first.
2. For s.s. / sensu stricto names, always query the name before the s.s. marker.
3. If a parenthetical-note name fails, remove the trailing bracketed note and try again, for example Cardamine tenuifolia (small-flower form) → Cardamine tenuifolia.
4. For named sp. X records, fall back to the named record without the bracketed descriptor, for example Arthropodium sp. 2 (greenish flowers) → Arthropodium sp. 2.
5. If a var. / subsp. full-name match fails or has no profile, fall back to the parent species.
6. If an aff. full-name match fails or has no profile, query Genus + text after aff., for example Caladenia aff. vulgaris (Aireys Inlet) → Caladenia vulgaris (Aireys Inlet), then Caladenia vulgaris.
7. If the result is not_found or no_profile, inspect the VicFlora synonym fields and match names while ignoring authorship at the end of the scientific name.
8. Add plant_group from the VicFlora higher classification: bryophyte, flowering plant, other vascular plant, or unknown.
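The fallback order can be sketched in base R as follows. This is a simplified illustration, not the extraction code itself: `sketch_fallback_names()` is a hypothetical helper, it covers only rules 1 to 6, and it skips the VicFlora lookups, the synonym check, and the plant_group step. The input `"Genus species s.s."` is a placeholder, not a name from the species list.

```r
# Hypothetical, simplified illustration of the fallback order (rules 1 to 6).
# Returns the names that would be queried, in order, for one input name.
sketch_fallback_names <- function(name) {
  name <- trimws(gsub("\\s+", " ", name))
  strip_note <- function(x) trimws(sub("\\s*\\([^)]*\\)\\s*$", "", x))

  # Rule 2: s.s. / sensu stricto names are queried without the marker only.
  ss_pattern <- "\\s+(s\\.\\s*s\\.|sensu\\s+stricto)(\\s+.*|$)"
  if (grepl(ss_pattern, name, ignore.case = TRUE, perl = TRUE)) {
    return(trimws(sub(ss_pattern, "", name, ignore.case = TRUE, perl = TRUE)))
  }

  # Rule 1: the original full name always comes first.
  candidates <- name

  if (grepl("(^|\\s)aff\\.?\\s", name, perl = TRUE)) {
    # Rule 6: Genus + text after aff., with and then without the bracketed place.
    genus <- sub("\\s.*$", "", name)
    tail <- sub("^.*?\\baff\\.?\\s+", "", name, perl = TRUE)
    with_place <- paste(genus, tail)
    candidates <- c(candidates, with_place, strip_note(with_place))
  } else {
    # Rules 3 and 4: drop a trailing bracketed note.
    candidates <- c(candidates, strip_note(name))
  }

  # Rule 5: var. / subsp. names fall back to the parent binomial.
  if (grepl("\\s(var|subsp|ssp)\\.?\\s", name, perl = TRUE)) {
    candidates <- c(candidates, sub("^(\\S+\\s+\\S+).*$", "\\1", name))
  }

  unique(candidates)
}

print(sketch_fallback_names("Caladenia aff. vulgaris (Aireys Inlet)"))
print(sketch_fallback_names("Acacia leprosa var. graveolens"))
print(sketch_fallback_names("Genus species s.s."))
```

Each later name is only queried if the earlier ones fail or return no profile, so an exact match on the full name is always preferred when VicFlora has one.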

# -----------------------------
# Text helpers
# -----------------------------
clean_text <- function(x) {
  if (is.null(x) || length(x) == 0) return(NA_character_)
  x <- as.character(x)
  x[is.na(x)] <- NA_character_
  # Replace non-breaking spaces (U+00A0) with ordinary spaces.
  x <- stringr::str_replace_all(x, "\u00A0", " ")
  x <- stringr::str_replace_all(x, "[[:space:]]+", " ")
  x <- stringr::str_squish(x)
  x[!is.na(x) & !nzchar(x)] <- NA_character_
  x
}

name_key <- function(x) {
  x <- clean_text(x)
  x <- x[1]
  if (length(x) == 0 || is.na(x)) {
    return(NA_character_)
  }
  x <- enc2utf8(x)
  # Replace hybrid symbol × with plain x
  x <- stringr::str_replace_all(x, "\u00D7", "x")
  x <- stringr::str_replace_all(x, "[[:punct:]]+", " ")
  x <- stringr::str_replace_all(x, "[[:space:]]+", " ")
  stringr::str_to_lower(stringr::str_squish(x))
}


different_name_key <- function(a, b) {
  a_key <- name_key(a)
  b_key <- name_key(b)
  if (is.na(a_key) || is.na(b_key)) return(TRUE)
  !identical(a_key, b_key)
}

first_sentence <- function(x) {
  x <- clean_text(x)
  if (length(x) == 0 || is.na(x[1])) return(NA_character_)
  x <- x[1]
  pieces <- stringi::stri_split_boundaries(x, type = "sentence")[[1]]
  pieces <- clean_text(pieces)
  pieces <- pieces[!is.na(pieces)]
  if (length(pieces) == 0) x else pieces[1]
}

profile_to_paragraphs <- function(profile_html) {
  if (is.null(profile_html) || length(profile_html) == 0 || is.na(profile_html) || !nzchar(profile_html)) {
    return(character())
  }

  # VicFlora profile is HTML. Wrap in a body tag so xml2 can parse fragments safely.
  doc <- tryCatch(
    xml2::read_html(paste0("<html><body>", profile_html, "</body></html>")),
    error = function(e) NULL
  )

  if (is.null(doc)) {
    text <- clean_text(profile_html)
    return(if (is.na(text)) character() else text)
  }

  paragraphs <- rvest::html_elements(doc, "p") |> rvest::html_text2()

  # Fallback for profiles that do not use <p> tags.
  if (length(paragraphs) == 0) {
    text <- rvest::html_text2(doc)
    paragraphs <- stringr::str_split(text, "\\n\\s*\\n")[[1]]
  }

  paragraphs <- purrr::map_chr(paragraphs, clean_text)
  paragraphs <- paragraphs[!is.na(paragraphs)]
  paragraphs
}

extract_codes_from_treatment <- function(second_paragraph) {
  second_paragraph <- clean_text(second_paragraph)
  if (is.na(second_paragraph)) return(NA_character_)

  # Backup heuristic. This catches short region-style codes such as EGL, HSF, HNF, VAlp, WPro, GipP.
  candidates <- stringr::str_extract_all(second_paragraph, "\\b[A-Z][A-Za-z]{1,4}\\b")[[1]]

  exclude <- c(
    month.abb, "Sept", "NSW", "ACT", "QLD", "Qld", "SA", "WA", "NT", "TAS", "Tas", "VIC", "Vic",
    "Also", "Flowers", "Flower", "Fruit", "Fruits", "Mostly", "Grows", "Occurs", "Known", "Rare", "Common",
    "New", "South", "Wales", "Victoria", "Victorian", "Australian", "Australia", "Plants", "Plant",
    "East", "West", "North", "South", "Central", "Mt", "Mount", "River", "Creek", "Lake", "Near"
  )

  candidates <- candidates[!candidates %in% exclude]
  candidates <- candidates[stringr::str_detect(candidates, "^([A-Z]{2,5}|[A-Z][a-z]{1,3}[A-Z]|[A-Z][a-z]?[A-Z][a-z]{0,2})$")]
  candidates <- unique(candidates)

  if (length(candidates) == 0) NA_character_ else paste(candidates, collapse = ", ")
}

collapse_bioregion_codes <- function(bioregions) {
  if (is.null(bioregions) || length(bioregions) == 0) return(NA_character_)
  codes <- purrr::map_chr(bioregions, ~ .x$bioregionCode %||% NA_character_)
  codes <- clean_text(codes)
  codes <- codes[!is.na(codes)]
  if (length(codes) == 0) NA_character_ else paste(unique(codes), collapse = ", ")
}

classification_names <- function(concept) {
  if (is.null(concept) || length(concept) == 0) return(character())

  names <- c(
    concept$taxonName$fullName %||% NA_character_,
    concept$taxonName$fullNameWithAuthorship %||% NA_character_
  )

  hc <- concept$higherClassification
  if (!is.null(hc) && length(hc) > 0) {
    names <- c(
      names,
      purrr::map_chr(hc, ~ .x$taxonName$fullName %||% NA_character_),
      purrr::map_chr(hc, ~ .x$taxonName$fullNameWithAuthorship %||% NA_character_)
    )
  }

  names <- clean_text(names)
  unique(names[!is.na(names)])
}

collapse_higher_classification <- function(concept) {
  hc <- concept$higherClassification
  if (is.null(hc) || length(hc) == 0) return(NA_character_)

  pieces <- purrr::map_chr(hc, function(x) {
    nm <- clean_text(x$taxonName$fullName %||% NA_character_)[1]
    rk <- clean_text(x$taxonRank %||% NA_character_)[1]
    if (is.na(nm)) return(NA_character_)
    if (is.na(rk)) nm else paste0(rk, ": ", nm)
  })

  pieces <- pieces[!is.na(pieces)]
  if (length(pieces) == 0) NA_character_ else paste(unique(pieces), collapse = " | ")
}

infer_plant_group <- function(concept) {
  # Uses VicFlora's higherClassification to separate bryophytes from flowering plants.
  # Keeps an audit trail in higher_classification so borderline cases can be checked.
  names <- classification_names(concept)
  if (length(names) == 0) return(NA_character_)
  keys <- stringr::str_to_lower(names)

  bryophyte_patterns <- c(
    "bryophyta", "marchantiophyta", "anthocerotophyta",
    "bryopsida", "sphagnopsida", "polytrichopsida",
    "jungermanniopsida", "marchantiopsida", "anthocerotopsida",
    "moss", "liverwort", "hornwort"
  )

  flowering_patterns <- c(
    "magnoliophyta", "angiosperm", "flowering plant",
    "magnoliopsida", "liliopsida", "monocot", "dicot", "eudicot",
    "rosids", "asterids", "commelinids", "magnoliids"
  )

  other_vascular_patterns <- c(
    "pteridophyta", "polypodiophyta", "lycopodiophyta", "pinophyta",
    "cycadophyta", "ginkgophyta", "gnetophyta", "gymnosperm",
    "fern", "clubmoss", "quillwort", "conifer"
  )

  has_pattern <- function(patterns) {
    any(purrr::map_lgl(patterns, function(pat) {
      any(stringr::str_detect(keys, stringr::fixed(pat, ignore_case = TRUE)))
    }))
  }

  if (has_pattern(bryophyte_patterns)) return("bryophyte")
  if (has_pattern(flowering_patterns)) return("flowering plant")
  if (has_pattern(other_vascular_patterns)) return("other vascular plant")

  "unknown"
}

1.1.2 Species matches and match types

For most of the 316 species, an exact taxonomic name match was found in VicFlora. Where a species was instead matched to its closest affinity, or a species-level description was applied to a subspecies or variety, the match type was recorded. Examples of recorded match types:

Acacia leprosa var. graveolens: exact VicFlora match -> exact match
Variety or subspecies where the parent species was used instead -> var
Caladenia aff. vulgaris (Aireys Inlet): fallback to the closest match, Caladenia vulgaris -> aff
Arthropodium sp. 2 (greenish flowers): fallback to Arthropodium sp. 2 -> exact match if found
Cardamine tenuifolia (small-flower form): fallback to Cardamine tenuifolia -> exact match if found
Species with no record in VicFlora, e.g. Ptychomitrium muelleri -> not_found
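Once the output table exists, the recorded match types can be audited with a quick tabulation. The sketch below uses a toy data frame built from three of the examples above instead of the real output; `toy_results`, `match_counts`, and `needs_review` are hypothetical names, and `ss_var_aff` is assumed to be the column that holds the match type.

```r
# Toy rows taken from the examples above; a real run would use the output table.
toy_results <- data.frame(
  query_name = c(
    "Acacia leprosa var. graveolens",
    "Caladenia aff. vulgaris (Aireys Inlet)",
    "Ptychomitrium muelleri"
  ),
  ss_var_aff = c("exact match", "aff", "not_found"),
  stringsAsFactors = FALSE
)

# Count rows per match type.
match_counts <- table(toy_results$ss_var_aff)
print(match_counts)

# List every name that was not an exact match, for manual checking.
needs_review <- toy_results[toy_results$ss_var_aff != "exact match", ]
print(needs_review$query_name)
```

Reviewing the non-exact rows by hand is a cheap way to confirm that a parent-species or aff. fallback is acceptable for each taxon.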

# -----------------------------
# Scientific-name parsing helpers
# -----------------------------
rank_regex <- "(?:subsp\\.?|ssp\\.?|var\\.?|subvar\\.?|forma|f\\.?)"

species_binomial <- function(x) {
  x <- clean_text(x)
  x <- x[1]
  if (length(x) == 0 || is.na(x)) return(NA_character_)
  x <- stringr::str_replace_all(x, "\u00D7", "x")  # hybrid sign (U+00D7) -> plain x
  m <- stringr::str_match(x, "^([A-Z][A-Za-z-]+\\s+(?:x\\s+)?[a-z][A-Za-z-]+)\\b")
  if (!is.na(m[1, 2])) clean_text(m[1, 2]) else NA_character_
}

remove_ss_marker <- function(x) {
  x <- clean_text(x)
  x <- x[1]
  if (length(x) == 0 || is.na(x)) return(NA_character_)
  x <- stringr::str_replace(
    x,
    stringr::regex("\\s+(s\\.?\\s*s\\.?|ss|sensu\\s+stricto)(\\s+.*|$)", ignore_case = TRUE),
    ""
  )
  clean_text(x)
}

botanical_name_without_authorship <- function(x) {
  x <- clean_text(x)
  x <- x[1]
  if (length(x) == 0 || is.na(x)) return(NA_character_)
  x <- stringr::str_replace_all(x, "\u00D7", "x")  # hybrid sign (U+00D7) -> plain x
  x <- remove_ss_marker(x)

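  # Named forms like Geranium sp. 3 or Alternanthera sp. 1.
  # Require a dot or whitespace after "sp" so that epithets starting with "sp"
  # (for example "spinosa") are not mistaken for "sp." records.
  m_sp_number <- stringr::str_match(
    x,
    stringr::regex("^([A-Z][A-Za-z-]+\\s+sp(?:\\.\\s*|\\s+)[0-9A-Za-z-]+)\\b", ignore_case = FALSE)
  )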
  if (!is.na(m_sp_number[1, 2])) return(clean_text(m_sp_number[1, 2]))

  # Infraspecific rank. Keep only Genus species rank epithet, dropping authorship after that.
  infra_pat <- paste0("^([A-Z][A-Za-z-]+\\s+(?:x\\s+)?[a-z][A-Za-z-]+\\s+", rank_regex, "\\s+(?:[a-z][A-Za-z-]+|[0-9]+))\\b")
  m_infra <- stringr::str_match(x, stringr::regex(infra_pat, ignore_case = FALSE))
  if (!is.na(m_infra[1, 2])) return(clean_text(m_infra[1, 2]))

  # Ordinary binomial.
  species_binomial(x)
}

authorless_key <- function(x) {
  name_key(botanical_name_without_authorship(x))
}

name_matches_ignoring_authorship <- function(query_name, candidate_names) {
  q_key <- authorless_key(query_name)
  if (is.na(q_key)) return(FALSE)
  cand_keys <- purrr::map_chr(candidate_names, authorless_key)
  cand_keys <- cand_keys[!is.na(cand_keys)]
  any(cand_keys == q_key)
}

detect_name_flags <- function(original_name) {
  x <- clean_text(original_name)
  x <- x[1]
  if (length(x) == 0 || is.na(x)) return(list(flags = NA_character_, has_ss = FALSE, has_var = FALSE, has_aff = FALSE))

  has_ss <- stringr::str_detect(x, stringr::regex("(^|\\s)(s\\.?\\s*s\\.?|ss|sensu\\s+stricto)(\\s|$)", ignore_case = TRUE))
  has_aff <- stringr::str_detect(x, stringr::regex("(^|\\s)aff\\.?(\\s|$)", ignore_case = TRUE))
  has_var <- stringr::str_detect(x, stringr::regex(paste0("(^|\\s)", rank_regex, "(\\s|$)"), ignore_case = TRUE))

  flags <- c()
  if (has_ss) flags <- c(flags, "ss")
  if (has_var) flags <- c(flags, "var")
  if (has_aff) flags <- c(flags, "aff")

  list(
    flags = if (length(flags) == 0) "exact match" else paste(flags, collapse = "; "),
    has_ss = has_ss,
    has_var = has_var,
    has_aff = has_aff
  )
}

final_ss_var_aff <- function(original_name, attempt_reason, scrape_status) {

  status <- clean_text(scrape_status)[1]
  if (is.na(status)) return(NA_character_)

  if (!identical(status, "ok")) return(status)

  reason <- clean_text(attempt_reason)[1]
  if (is.na(reason)) return("exact match")

  if (identical(reason, "ss_name_before_marker")) return("ss")
  if (stringr::str_detect(reason, "^aff_genus_plus_text_after_aff")) return("aff")
  if (identical(reason, "var_or_subsp_parent_species_fallback")) return("var")

  "exact match"
}

failure_priority <- function(status) {
  status <- clean_text(status)[1]
  if (is.na(status)) return(0L)
  dplyr::case_when(
    status == "not_started" ~ 0L,
    status == "blank_name" ~ 0L,
    status == "not_found" ~ 1L,
    status == "error" ~ 2L,
    status == "no_profile" ~ 3L,
    TRUE ~ 1L
  )
}

aff_target_name <- function(original_name, keep_parenthetical = TRUE) {
  x <- clean_text(original_name)
  x <- x[1]
  if (length(x) == 0 || is.na(x)) return(NA_character_)

  # Keep the genus and replace the aff. phrase with the text after aff.
  # Examples:
  #   Caladenia aff. vulgaris (Aireys Inlet) -> Caladenia vulgaris (Aireys Inlet)
  #   Caladenia sp. aff. iridescens (Chapple Vale) -> Caladenia iridescens (Chapple Vale)
  #   Geranium aff. sp. 3 -> Geranium sp. 3
  m <- stringr::str_match(
    x,
    stringr::regex("^([A-Z][A-Za-z-]+)\\b.*?\\baff\\.?\\s+(.+)$", ignore_case = TRUE)
  )
  if (is.na(m[1, 2]) || is.na(m[1, 3])) return(NA_character_)

  tail <- clean_text(m[1, 3])
  tail <- stringr::str_replace(tail, stringr::regex("^of\\s+", ignore_case = TRUE), "")
  target <- clean_text(paste(m[1, 2], tail))

  if (!keep_parenthetical) {
    target <- stringr::str_remove(target, "\\s*\\([^)]*\\)\\s*$")
    target <- clean_text(target)
  }

  target
}

aff_place_name <- function(original_name) {
  x <- clean_text(original_name)
  x <- x[1]
  if (length(x) == 0 || is.na(x)) return(NA_character_)
  if (!stringr::str_detect(x, stringr::regex("(^|\\s)aff\\.?(\\s|$)", ignore_case = TRUE))) return(NA_character_)
  m <- stringr::str_match(x, "\\(([^)]+)\\)")
  if (!is.na(m[1, 2])) clean_text(m[1, 2]) else NA_character_
}

place_match_in_text <- function(place, text) {
  place <- clean_text(place)
  text <- clean_text(text)
  if (is.na(place) || is.na(text)) return(FALSE)

  place_options <- unique(c(
    place,
    stringr::str_remove(place, stringr::regex("\\s+variant$", ignore_case = TRUE))
  ))
  place_options <- place_options[!is.na(place_options) & nzchar(place_options)]

  any(purrr::map_lgl(place_options, function(p) {
    stringr::str_detect(text, stringr::fixed(p, ignore_case = TRUE))
  }))
}

named_sp_number_base <- function(original_name) {
  x <- clean_text(original_name)
  x <- x[1]
  if (length(x) == 0 || is.na(x)) return(NA_character_)

  # Named phrase records such as:
  #   Arthropodium sp. 2 (greenish flowers) -> Arthropodium sp. 2
  # Do not use bracketed informal descriptors as part of the VicFlora query.
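  # This deliberately excludes aff./cf./nov. patterns such as Caladenia sp. aff. vulgaris.
  # Require a dot or whitespace after "sp" so that epithets starting with "sp"
  # (for example "spinosa") are not mistaken for "sp." records.
  m <- stringr::str_match(
    x,
    stringr::regex("^([A-Z][A-Za-z-]+\\s+sp(?:\\.\\s*|\\s+)([0-9A-Za-z-]+))\\b", ignore_case = FALSE)
  )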
  if (is.na(m[1, 2]) || is.na(m[1, 3])) return(NA_character_)

  token <- stringr::str_to_lower(stringr::str_remove_all(m[1, 3], "\\."))
  if (token %in% c("aff", "cf", "nov", "indet")) return(NA_character_)

  clean_text(m[1, 2])
}

strip_trailing_parenthetical <- function(original_name) {
  x <- clean_text(original_name)
  x <- x[1]
  if (length(x) == 0 || is.na(x)) return(NA_character_)

  stripped <- stringr::str_remove(x, "\\s*\\([^)]*\\)\\s*$")
  stripped <- clean_text(stripped)

  if (is.na(stripped) || !different_name_key(stripped, x)) NA_character_ else stripped
}

build_search_plan <- function(original_name) {
  original_name <- clean_text(original_name)
  original_name <- original_name[1]
  flags <- detect_name_flags(original_name)

  if (length(original_name) == 0 || is.na(original_name)) {
    return(tibble::tibble(search_name = NA_character_, attempt_reason = "blank_name", allow_fuzzy = FALSE))
  }

  # Sensu stricto is the one exception to "try the full name first": it means
  # use the taxon name as written before the s.s. marker.
  if (flags$has_ss) {
    ss_name <- remove_ss_marker(original_name)
    return(tibble::tibble(
      search_name = ss_name,
      attempt_reason = "ss_name_before_marker",
      allow_fuzzy = FALSE
    ))
  }

  # Always start with the full name as supplied. This preserves true exact matches for
  # varieties, subspecies, informal forms, and parenthetical names when VicFlora has them.
  plan <- tibble::tibble(
    search_name = original_name,
    attempt_reason = "original_full_name_exact_first",
    allow_fuzzy = FALSE
  )

  # For ordinary parenthetical notes, only remove the bracketed text after the
  # full-name search has failed. This handles examples such as:
  #   Cardamine tenuifolia (small-flower form) -> Cardamine tenuifolia
  #   Arthropodium sp. 2 (greenish flowers) -> Arthropodium sp. 2
  # For aff. names, keep the specialised aff. hierarchy below because it converts
  # Caladenia aff. vulgaris (Aireys Inlet) to Caladenia vulgaris (Aireys Inlet) first,
  # and only then to Caladenia vulgaris.
  if (!flags$has_aff) {
    parenthetical_stripped <- strip_trailing_parenthetical(original_name)
    if (!is.na(parenthetical_stripped)) {
      plan <- dplyr::bind_rows(
        plan,
        tibble::tibble(
          search_name = parenthetical_stripped,
          attempt_reason = "parenthetical_note_removed_fallback",
          allow_fuzzy = FALSE
        )
      )
    }
  }

  # Named sp. number fallback also covers non-parenthetical informal strings.
  # If the stripped-parenthetical fallback already added the same name, distinct()
  # below removes the duplicate.
  sp_number_name <- named_sp_number_base(original_name)
  if (!is.na(sp_number_name) && different_name_key(sp_number_name, original_name)) {
    plan <- dplyr::bind_rows(
      plan,
      tibble::tibble(
        search_name = sp_number_name,
        attempt_reason = "sp_number_without_extra_text_fallback",
        allow_fuzzy = FALSE
      )
    )
  }

  if (flags$has_aff) {
    target_with_place <- aff_target_name(original_name, keep_parenthetical = TRUE)
    target_without_place <- aff_target_name(original_name, keep_parenthetical = FALSE)

    if (!is.na(target_with_place) && different_name_key(target_with_place, original_name)) {
      plan <- dplyr::bind_rows(
        plan,
        tibble::tibble(
          search_name = target_with_place,
          attempt_reason = "aff_genus_plus_text_after_aff_with_place",
          allow_fuzzy = FALSE
        )
      )
    }

    if (!is.na(target_without_place) &&
        different_name_key(target_without_place, target_with_place) &&
        different_name_key(target_without_place, original_name)) {
      plan <- dplyr::bind_rows(
        plan,
        tibble::tibble(
          search_name = target_without_place,
          attempt_reason = "aff_genus_plus_text_after_aff_without_place",
          allow_fuzzy = FALSE
        )
      )
    }
  }

  if (flags$has_var) {
    parent_species <- species_binomial(original_name)
    if (!is.na(parent_species) && different_name_key(parent_species, original_name)) {
      plan <- dplyr::bind_rows(
        plan,
        tibble::tibble(
          search_name = parent_species,
          attempt_reason = "var_or_subsp_parent_species_fallback",
          allow_fuzzy = FALSE
        )
      )
    }
  }

  plan |>
    dplyr::mutate(plan_order = dplyr::row_number()) |>
    dplyr::distinct(search_name, .keep_all = TRUE) |>
    dplyr::select(search_name, attempt_reason, allow_fuzzy, plan_order)
}

# -----------------------------
# VicFlora GraphQL helpers
# -----------------------------
endpoint <- "https://vicflora.rbg.vic.gov.au/graphql"

lookup_query <- '
query TaxonLookup($q: String!) {
  taxonConceptAutocomplete(q: $q) {
    ...LookupConceptFields
    synonyms {
      fullName
      fullNameWithAuthorship
    }
    synonymUsages {
      ...LookupConceptFields
      acceptedConcept {
        ...LookupConceptFields
      }
    }
    acceptedConcept {
      ...LookupConceptFields
    }
  }
}

fragment LookupConceptFields on TaxonConcept {
  id
  taxonomicStatus
  taxonRank
  taxonName {
    fullName
    fullNameWithAuthorship
  }
  higherClassification {
    taxonRank
    taxonName {
      fullName
      fullNameWithAuthorship
    }
  }
  currentProfile {
    profile
    modified
  }
  bioregions {
    bioregionCode
    bioregionName
    occurrenceStatus
  }
}
'

gql_post <- function(query, variables = list()) {
  req <- httr2::request(endpoint) |>
    httr2::req_user_agent("VicFlora species description extraction script; local research use") |>
    httr2::req_headers(Accept = "application/json") |>
    httr2::req_body_json(list(query = query, variables = variables), auto_unbox = TRUE) |>
    httr2::req_timeout(45) |>
    httr2::req_retry(max_tries = 4, backoff = ~ 1.5 ^ .x)

  response <- httr2::req_perform(req)
  parsed <- httr2::resp_body_json(response, simplifyVector = FALSE)

  if (!is.null(parsed$errors)) {
    stop(jsonlite::toJSON(parsed$errors, auto_unbox = TRUE), call. = FALSE)
  }

  parsed$data
}

concept_for_candidate <- function(candidate) {
  if (!is.null(candidate$acceptedConcept) && !is.null(candidate$acceptedConcept$id)) {
    candidate$acceptedConcept
  } else {
    candidate
  }
}

concept_for_synonym_usage <- function(synonym_usage) {
  if (!is.null(synonym_usage$acceptedConcept) && !is.null(synonym_usage$acceptedConcept$id)) {
    synonym_usage$acceptedConcept
  } else {
    synonym_usage
  }
}

candidate_display_names <- function(candidate) {
  c(
    candidate$taxonName$fullName %||% NA_character_,
    candidate$taxonName$fullNameWithAuthorship %||% NA_character_
  )
}

accepted_display_names <- function(candidate) {
  c(
    candidate$acceptedConcept$taxonName$fullName %||% NA_character_,
    candidate$acceptedConcept$taxonName$fullNameWithAuthorship %||% NA_character_
  )
}

candidate_synonym_names <- function(candidate) {
  syn_names <- character()

  if (!is.null(candidate$synonyms) && length(candidate$synonyms) > 0) {
    syn_names <- c(
      syn_names,
      purrr::map_chr(candidate$synonyms, ~ .x$fullName %||% NA_character_),
      purrr::map_chr(candidate$synonyms, ~ .x$fullNameWithAuthorship %||% NA_character_)
    )
  }

  if (!is.null(candidate$synonymUsages) && length(candidate$synonymUsages) > 0) {
    syn_names <- c(
      syn_names,
      purrr::map_chr(candidate$synonymUsages, ~ .x$taxonName$fullName %||% NA_character_),
      purrr::map_chr(candidate$synonymUsages, ~ .x$taxonName$fullNameWithAuthorship %||% NA_character_)
    )
  }

  syn_names <- clean_text(syn_names)
  unique(syn_names[!is.na(syn_names)])
}

make_empty_result <- function(original_name) {
  tibble::tibble(
    query_name = clean_text(original_name)[1],
    ss_var_aff = NA_character_,
    plant_group = NA_character_,
    higher_classification = NA_character_,
    vicflora_search_name = NA_character_,
    search_hierarchy = NA_character_,
    match_type = NA_character_,
    synonym_fallback_used = NA,
    matched_synonym_name = NA_character_,
    aff_place_name = aff_place_name(original_name),
    aff_place_found_in_treatment = NA_character_,
    matched_name = NA_character_,
    matched_name_with_authorship = NA_character_,
    matched_id = NA_character_,
    matched_taxonomic_status = NA_character_,
    used_concept_id = NA_character_,
    used_scientific_name = NA_character_,
    used_name_with_authorship = NA_character_,
    used_accepted_concept = NA,
    profile_modified = NA_character_,
    species_description = NA_character_,
    habitat_codes = NA_character_,
    bioregion_codes_api = NA_character_,
    second_paragraph_codes = NA_character_,
    treatment_first_paragraph = NA_character_,
    treatment_second_paragraph = NA_character_,
    all_treatment_text = NA_character_,
    vicflora_url = NA_character_,
    scrape_status = "not_started",
    error_message = NA_character_
  )
}

build_result_row <- function(original_name, search_name, attempt_reason, candidate, concept, match_type,
                             synonym_fallback_used = FALSE, matched_synonym_name = NA_character_) {
  base <- make_empty_result(original_name)

  profile_html <- concept$currentProfile$profile %||% NA_character_
  paragraphs <- profile_to_paragraphs(profile_html)
  first_para <- if (length(paragraphs) >= 1) paragraphs[1] else NA_character_
  second_para <- if (length(paragraphs) >= 2) paragraphs[2] else NA_character_
  all_treatment <- if (length(paragraphs) > 0) paste(paragraphs, collapse = " ") else NA_character_

  description <- first_sentence(first_para)
  bioregion_codes <- collapse_bioregion_codes(concept$bioregions)
  second_para_codes <- extract_codes_from_treatment(second_para)

  # Fall back to the API bioregion codes if the second paragraph heuristic finds no codes.
  habitat_codes <- if (!is.na(second_para_codes)) second_para_codes else bioregion_codes
  plant_group_value <- infer_plant_group(concept)
  higher_classification_value <- collapse_higher_classification(concept)

  place <- aff_place_name(original_name)
  place_found <- if (!is.na(place) && !is.na(all_treatment)) {
    if (place_match_in_text(place, all_treatment)) place else NA_character_
  } else {
    NA_character_
  }

  final_status <- if (is.na(profile_html) || !nzchar(profile_html)) "no_profile" else "ok"
  final_label <- final_ss_var_aff(original_name, attempt_reason, final_status)

  # Avoid dplyr data-mask collisions: these argument names are also output-column names.
  # If we wrote match_type = match_type, mutate would use the existing blank column
  # from base instead of the function argument.
  match_type_value <- match_type
  synonym_fallback_used_value <- synonym_fallback_used
  matched_synonym_name_value <- matched_synonym_name

  base |>
    dplyr::mutate(
      ss_var_aff = final_label,
      plant_group = plant_group_value,
      higher_classification = higher_classification_value,
      vicflora_search_name = search_name,
      search_hierarchy = attempt_reason,
      match_type = match_type_value,
      synonym_fallback_used = synonym_fallback_used_value,
      matched_synonym_name = matched_synonym_name_value,
      aff_place_found_in_treatment = place_found,
      matched_name = candidate$taxonName$fullName %||% NA_character_,
      matched_name_with_authorship = candidate$taxonName$fullNameWithAuthorship %||% NA_character_,
      matched_id = as.character(candidate$id %||% NA_character_),
      matched_taxonomic_status = candidate$taxonomicStatus %||% NA_character_,
      used_concept_id = as.character(concept$id %||% NA_character_),
      used_scientific_name = concept$taxonName$fullName %||% NA_character_,
      used_name_with_authorship = concept$taxonName$fullNameWithAuthorship %||% NA_character_,
      used_accepted_concept = !is.null(concept$id) && !is.null(candidate$id) && as.character(concept$id) != as.character(candidate$id),
      profile_modified = concept$currentProfile$modified %||% NA_character_,
      species_description = description,
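      # .env forces the local variable; a bare habitat_codes would resolve to the
      # blank habitat_codes column in base (the data-mask collision described above).
      habitat_codes = .env$habitat_codes,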
      bioregion_codes_api = bioregion_codes,
      second_paragraph_codes = second_para_codes,
      treatment_first_paragraph = first_para,
      treatment_second_paragraph = second_para,
      all_treatment_text = all_treatment,
      vicflora_url = if (!is.null(concept$id)) paste0("https://vicflora.rbg.vic.gov.au/flora/taxon/", concept$id) else NA_character_,
      scrape_status = final_status,
      error_message = NA_character_
    )
}

lookup_one_search_name <- function(original_name, search_name, attempt_reason, allow_fuzzy = FALSE) {
  if (is_blank(search_name)) {
    out <- make_empty_result(original_name)
    out$scrape_status <- "blank_name"
    out$ss_var_aff <- final_ss_var_aff(original_name, attempt_reason, out$scrape_status)
    return(out)
  }

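  # return() inside the tryCatch() expression exits this function immediately, so
  # the Sys.sleep() after the block only runs on the paths without an early return.
  # Registering the pause with on.exit() throttles every request.
  on.exit(Sys.sleep(request_pause_seconds), add = TRUE)

  out <- tryCatch({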
    data <- gql_post(lookup_query, list(q = search_name))
    candidates <- data$taxonConceptAutocomplete

    if (is.null(candidates) || length(candidates) == 0) {
      row <- make_empty_result(original_name)
      row$vicflora_search_name <- search_name
      row$search_hierarchy <- attempt_reason
      row$scrape_status <- "not_found"
      row$ss_var_aff <- final_ss_var_aff(original_name, attempt_reason, row$scrape_status)
      return(row)
    }

    search_key <- name_key(search_name)
    no_profile_rows <- list()
    choices <- list()

    add_choice <- function(candidate, concept, match_type, synonym_fallback_used = FALSE, matched_synonym_name = NA_character_) {
      choices[[length(choices) + 1]] <<- list(
        candidate = candidate,
        concept = concept,
        match_type = match_type,
        synonym_fallback_used = synonym_fallback_used,
        matched_synonym_name = matched_synonym_name
      )
    }

    # 1. Exact direct matches to the returned taxon name.
    for (candidate in candidates) {
      cand_keys <- purrr::map_chr(candidate_display_names(candidate), name_key)
      cand_keys <- cand_keys[!is.na(cand_keys)]
      if (any(cand_keys == search_key)) {
        add_choice(candidate, concept_for_candidate(candidate), "exact_returned_name", FALSE, NA_character_)
      }
    }

    # 2. Exact matches to an accepted concept returned by a synonym/misapplied result.
    for (candidate in candidates) {
      acc_keys <- purrr::map_chr(accepted_display_names(candidate), name_key)
      acc_keys <- acc_keys[!is.na(acc_keys)]
      if (length(acc_keys) > 0 && any(acc_keys == search_key)) {
        add_choice(candidate, concept_for_candidate(candidate), "exact_accepted_concept_name", FALSE, NA_character_)
      }
    }

    # 3. Synonym-tab matches, ignoring authorship at the end of the name.
    for (candidate in candidates) {
      syn_names <- candidate_synonym_names(candidate)
      if (length(syn_names) > 0 && name_matches_ignoring_authorship(original_name, syn_names)) {
        matched_syn <- syn_names[which(purrr::map_lgl(syn_names, ~ name_matches_ignoring_authorship(original_name, .x)))[1]]
        add_choice(candidate, concept_for_candidate(candidate), "synonym_tab_match", TRUE, matched_syn)
      }

      # If any synonym usage itself has an accepted concept, use that accepted concept.
      if (!is.null(candidate$synonymUsages) && length(candidate$synonymUsages) > 0) {
        for (syn_usage in candidate$synonymUsages) {
          usage_names <- c(
            syn_usage$taxonName$fullName %||% NA_character_,
            syn_usage$taxonName$fullNameWithAuthorship %||% NA_character_
          )
          if (name_matches_ignoring_authorship(original_name, usage_names)) {
            matched_syn <- clean_text(usage_names)[!is.na(clean_text(usage_names))][1]
            add_choice(candidate, concept_for_synonym_usage(syn_usage), "synonym_usage_match", TRUE, matched_syn)
          }
        }
      }
    }

    # 4. Optional fuzzy fallback. Kept available, but the default search plan sets allow_fuzzy = FALSE to avoid subpar matches.
    if (allow_fuzzy && length(choices) == 0) {
      statuses <- purrr::map_chr(candidates, ~ .x$taxonomicStatus %||% NA_character_)
      idx <- which(statuses == "ACCEPTED")
      if (length(idx) == 0) idx <- seq_along(candidates)
      candidate <- candidates[[idx[1]]]
      add_choice(candidate, concept_for_candidate(candidate), "accepted_or_first_autocomplete_fallback", FALSE, NA_character_)
    }

    if (length(choices) == 0) {
      row <- make_empty_result(original_name)
      row$vicflora_search_name <- search_name
      row$search_hierarchy <- attempt_reason
      row$scrape_status <- "not_found"
      row$ss_var_aff <- final_ss_var_aff(original_name, attempt_reason, row$scrape_status)
      return(row)
    }

    for (choice in choices) {
      row <- build_result_row(
        original_name = original_name,
        search_name = search_name,
        attempt_reason = attempt_reason,
        candidate = choice$candidate,
        concept = choice$concept,
        match_type = choice$match_type,
        synonym_fallback_used = choice$synonym_fallback_used,
        matched_synonym_name = choice$matched_synonym_name
      )

      if (identical(row$scrape_status[[1]], "ok")) return(row)
      no_profile_rows[[length(no_profile_rows) + 1]] <- row
    }

    if (length(no_profile_rows) > 0) no_profile_rows[[1]] else {
      row <- make_empty_result(original_name)
      row$vicflora_search_name <- search_name
      row$search_hierarchy <- attempt_reason
      row$scrape_status <- "not_found"
      row$ss_var_aff <- final_ss_var_aff(original_name, attempt_reason, row$scrape_status)
      row
    }
  }, error = function(e) {
    row <- make_empty_result(original_name)
    row$vicflora_search_name <- search_name
    row$search_hierarchy <- attempt_reason
    row$scrape_status <- "error"
    row$ss_var_aff <- final_ss_var_aff(original_name, attempt_reason, row$scrape_status)
    row$error_message <- conditionMessage(e)
    row
  })

  Sys.sleep(request_pause_seconds)
  out
}

scrape_one_species <- function(query_name) {
  query_name <- clean_text(query_name)[1]
  result_if_all_fail <- make_empty_result(query_name)

  if (is_blank(query_name)) {
    result_if_all_fail$scrape_status <- "blank_name"
    result_if_all_fail$ss_var_aff <- final_ss_var_aff(query_name, NA_character_, result_if_all_fail$scrape_status)
    return(result_if_all_fail)
  }

  plan <- build_search_plan(query_name)
  attempt_summaries <- character()

  for (i in seq_len(nrow(plan))) {
    attempt <- plan[i, ]
    row <- lookup_one_search_name(
      original_name = query_name,
      search_name = attempt$search_name,
      attempt_reason = attempt$attempt_reason,
      allow_fuzzy = attempt$allow_fuzzy
    )

    attempt_summaries <- c(
      attempt_summaries,
      paste0(attempt$attempt_reason, " [", attempt$search_name, "]: ", row$scrape_status[[1]])
    )

    if (identical(row$scrape_status[[1]], "ok")) {
      row$search_hierarchy <- paste(attempt_summaries, collapse = " -> ")
      row$ss_var_aff <- final_ss_var_aff(query_name, attempt$attempt_reason, row$scrape_status[[1]])
      return(row)
    }

    # Keep the most informative failure, but make sure that an all-not_found species
    # returns not_found rather than the initial not_started row.
    if (failure_priority(row$scrape_status[[1]]) >= failure_priority(result_if_all_fail$scrape_status[[1]])) {
      result_if_all_fail <- row
    }
  }

  result_if_all_fail$search_hierarchy <- paste(attempt_summaries, collapse = " -> ")
  result_if_all_fail$ss_var_aff <- final_ss_var_aff(query_name, result_if_all_fail$search_hierarchy, result_if_all_fail$scrape_status[[1]])
  result_if_all_fail
}
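
The "keep the most informative failure" rule in scrape_one_species() depends on a failure_priority() helper defined elsewhere in the workflow. The sketch below shows one way such a ranking could behave. The name failure_priority_sketch and the ordering of statuses are assumptions for illustration only, not the workflow's actual definition.

```r
# Hypothetical ranking: higher numbers are treated as more informative failures.
# The ordering below is an assumption, not the workflow's own definition.
failure_priority_sketch <- function(status) {
  ranks <- c(not_started = 0L, not_found = 1L, no_profile = 2L, error = 3L)
  out <- unname(ranks[as.character(status)])
  out[is.na(out)] <- 0L  # unknown or missing statuses rank lowest
  out
}

# Walk through the attempt statuses and keep the most informative one.
# Using >= means a later attempt replaces an earlier one of equal rank,
# so an all-"not_found" species ends as "not_found", not "not_started".
statuses <- c("not_found", "no_profile", "not_found")
kept <- "not_started"
for (s in statuses) {
  if (failure_priority_sketch(s) >= failure_priority_sketch(kept)) kept <- s
}
kept
```

Under this assumed ordering, a species whose attempts return not_found, no_profile and not_found is reported as no_profile.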

if (exists("final_tbl")) {
  print(dplyr::count(final_tbl, scrape_status, ss_var_aff, sort = TRUE))
  cat("\nRows written:", nrow(final_tbl), "\n")
  cat("CSV output:", normalizePath(output_csv, mustWork = FALSE), "\n")
  cat("XLSX output:", normalizePath(output_xlsx, mustWork = FALSE), "\n")
}

vicflora_summary <- results_unique |>
  summarise(
    n_project_taxa = n(),
    n_vicflora_matched = sum(scrape_status %in% c("success", "matched", "ok"), na.rm = TRUE),
    n_vicflora_not_found = sum(scrape_status == "not_found", na.rm = TRUE),
    n_vicflora_no_profile = sum(scrape_status == "no_profile", na.rm = TRUE),
    n_with_description = sum(!is.na(species_description) & species_description != "", na.rm = TRUE),
    n_with_habitat_codes = sum(!is.na(bioregion_codes_api) & bioregion_codes_api != "", na.rm = TRUE),
    n_bryophytes = sum(plant_group == "bryophyte", na.rm = TRUE),
    n_flowering_plants = sum(plant_group == "flowering plant", na.rm = TRUE),
    n_other_or_unknown = sum(
      is.na(plant_group) |
        plant_group %in% c("unknown", "other vascular plant"),
      na.rm = TRUE
    )
  ) |>
  pivot_longer(
    cols = everything(),
    names_to = "summary_metric",
    values_to = "value"
  )

vicflora_summary |>
  knitr::kable(
    caption = "Summary of VicFlora scraping results"
  )

Summary of VicFlora scraping results
summary_metric	value
n_project_taxa	316
n_vicflora_matched	247
n_vicflora_not_found	46
n_vicflora_no_profile	23
n_with_description	247
n_with_habitat_codes	261
n_bryophytes	53
n_flowering_plants	212
n_other_or_unknown	51

1.2 Species that are not in VICFLORA

Because VicFlora focuses mostly on vascular plants, information on most of the bryophytes (including liverworts and hornworts) has to be sourced elsewhere. Although ALA is the next obvious choice, it does not provide an accessible species description section. Most species descriptions for these groups may therefore have to be compiled manually, one species at a time. Geographic range information is, however, available on ALA for most of these species, and it has been acquired in this phase.

2 Extracting traits information

2.1 Overview of the AusTraits database

AusTraits is a large, curated database of plant trait information for Australian plant species. A plant trait is a measurable characteristic of a species, such as growth form, plant height, lifespan, seed mass, leaf size, dispersal mode, or reproductive strategy. These traits are useful because they help describe how plants live, grow, reproduce, and respond to their environment.

The database brings together trait records from many different published and unpublished sources. This means that a single species can have multiple records for the same trait, sometimes from different studies, regions, or measurement methods. For this reason, the analysis keeps the original trait records as an audit table, but also creates a simplified species-level summary that can be joined back to the main species dataset.

Here, AusTraits is used to add ecological trait information, focusing on life-history traits, to the species list.

2.2 What the AusTraits functions do in this analysis

The AusTraits part of the workflow has four main purposes.

First, the script prepares the species names from the main results_unique table. These names come from the VicFlora scrape and have already been cleaned as much as possible. The script then creates matching names that can be compared with names in the AusTraits database.

Second, the script matches the project species list to AusTraits taxonomy. It first tries to match the cleaned project species name directly to the AusTraits taxon_name column. This is the preferred match because it preserves the most specific name available, including varieties or subspecies where they are present in AusTraits. If no exact taxon-name match is found, the script then uses a species-level name and performs a fuzzy match against the AusTraits binomial column. This helps recover likely matches when names differ slightly between sources.
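
The two-stage matching described above can be sketched offline with base R. This is a minimal illustration, not the workflow's own code: the toy taxa table, the function name match_name_sketch, the "unmatched" label and the max_dist threshold are all assumptions, and utils::adist() stands in for whatever fuzzy matcher the workflow actually uses.

```r
# Toy stand-in for the AusTraits taxa table (not real AusTraits output).
taxa <- data.frame(
  taxon_name = c("Acacia leprosa var. graveolens", "Acacia leprosa", "Cardamine tenuifolia"),
  binomial   = c("Acacia leprosa", "Acacia leprosa", "Cardamine tenuifolia"),
  stringsAsFactors = FALSE
)

match_name_sketch <- function(name, taxa, max_dist = 2) {
  # Stage 1: exact match on the full taxon name (keeps var./subsp. detail).
  if (name %in% taxa$taxon_name) {
    return(data.frame(input = name, matched = name, match_type = "exact_taxon_name"))
  }
  # Stage 2: reduce to "Genus species" and fuzzy-match against binomials.
  words <- strsplit(name, " +")[[1]]
  species_level <- paste(head(words, 2), collapse = " ")
  binomials <- unique(taxa$binomial)
  d <- drop(utils::adist(species_level, binomials, ignore.case = TRUE))
  best <- which.min(d)
  if (length(best) == 1 && d[best] <= max_dist) {
    return(data.frame(input = name, matched = binomials[best], match_type = "fuzzy_binomial"))
  }
  data.frame(input = name, matched = NA_character_, match_type = "unmatched")
}

match_name_sketch("Acacia leprosa var. graveolens", taxa)$match_type  # exact stage
match_name_sketch("Cardamine tenuifolla", taxa)$match_type            # one-letter typo
match_name_sketch("Xanthorrhoea minor", taxa)$match_type              # no close binomial
```

A small edit-distance threshold keeps the fuzzy stage conservative, so a misspelling is recovered while an unrelated name stays unmatched for manual review.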

Third, the script extracts only the traits of interest (16 traits listed below). Instead of using the full AusTraits database, which is very large (over 530 traits!), the code filters the data to keep only the project species and the chosen traits of interest.

Fourth, the script appends the trait information back to the main species table. Because AusTraits can contain several records for the same species and trait, the workflow keeps two outputs:

a long-format audit table, where every AusTraits record is retained separately;
a wide-format species table, where trait values are collapsed and appended to results_unique.

The wide table is easier to read because each species remains as one row. If a species has several values for the same trait, these values are combined into a single cell separated by semicolons. The long audit table is kept so that the original records can still be checked if needed.
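
To make the difference between the two outputs concrete, the sketch below collapses a toy long-format table into one row per species using base R only. The data values and the names trait_long, collapse_values and trait_wide_demo are illustrative assumptions. The workflow itself uses dplyr and tidyr for this step.

```r
# Toy long-format trait records (illustrative values, not AusTraits output).
trait_long <- data.frame(
  row_id = c(1, 1, 1, 2),
  trait_name = c("life_history", "life_history", "plant_growth_form", "life_history"),
  value = c("perennial", "annual", "shrub", "perennial"),
  stringsAsFactors = FALSE
)

# Join the distinct non-missing values of one species x trait into one cell.
collapse_values <- function(x) {
  x <- unique(stats::na.omit(as.character(x)))
  if (length(x) == 0) NA_character_ else paste(x, collapse = "; ")
}

# Step 1: one collapsed cell per species x trait combination.
collapsed <- aggregate(value ~ row_id + trait_name, data = trait_long, FUN = collapse_values)

# Step 2: one row per species, one column per trait.
trait_wide_demo <- reshape(
  collapsed,
  idvar = "row_id", timevar = "trait_name", direction = "wide"
)
trait_wide_demo
```

Species 1 ends up with "perennial; annual" in a single life-history cell, while species 2 has a missing value for growth form because no record exists for it.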

2.3 Why the workflow uses saved outputs

Downloading or loading large biodiversity datasets can take time and may require an internet connection. To avoid repeating the same steps every time the R Markdown document is knitted, the workflow saves the AusTraits outputs as local files after they are created. Later, the document can reload these saved files directly, instead of contacting the online database again.

This makes the document faster, more reproducible, and less likely to fail because of temporary internet or server issues. The trait database contains over 530 plant traits. We are mainly interested in life-history traits, so we extract trait information only for a specific subset (the character vector traits_of_interest) for the 316 poorly known species.
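
The save-and-reload pattern described in this section can be sketched as follows. The helper name load_or_build, the file name and the placeholder data are hypothetical. The real workflow may organise its cached files differently.

```r
# Run `compute` only when no saved copy exists; otherwise reload from disk.
# `load_or_build()` and the file path are hypothetical names for illustration.
load_or_build <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

cache_file <- file.path(tempdir(), "trait_extract_demo.rds")
if (file.exists(cache_file)) file.remove(cache_file)  # start the demo from a clean state

n_builds <- 0
build_demo <- function() {
  n_builds <<- n_builds + 1  # count how often the slow step actually runs
  data.frame(taxon = "Genus species", trait_name = "life_history", value = "placeholder")
}

first  <- load_or_build(cache_file, build_demo)  # builds and saves
second <- load_or_build(cache_file, build_demo)  # reloads; build_demo() is not called again
n_builds
```

Deleting the saved file is then the explicit way to force a fresh extraction, for example after the species list changes.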

2.3.1 Extracting data by traits of interest.

Here we specify 16 plant traits: plant growth form, life history, flowering time, resprouting capacity, dispersers, reproductive maturity, vegetative reproduction ability, lifespan, recruitment time, post-fire recruitment, resprouting capacity (non-fire disturbance), resprouting capacity (proportion of individuals), seed viability, seedbank longevity, seedling establishment conditions, and fire exposure level.

The FFG nomination has a field called “Generation length”. Unfortunately, AusTraits does not hold this trait. We may have to make assumptions about generation length based on reproductive maturity and lifespan (both of which are available).
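
One possible assumption is sketched below: generation length approximated as age at first reproduction plus a fraction z of the remaining reproductive period, an approximation of the kind described in the IUCN Red List guidelines. Whether this is appropriate for FFG nominations is not established here. The function name, the default z = 0.5 and the input values are placeholders. AusTraits stores these traits as ranges or categories, so they would first need converting to single numbers (for example range midpoints).

```r
# Assumed approximation:
#   generation length = age at first reproduction
#                       + z * (lifespan - age at first reproduction)
# z (0 to 1) is a judgement call; 0.5 is a placeholder, not a recommendation.
generation_length_sketch <- function(maturity_years, lifespan_years, z = 0.5) {
  stopifnot(z >= 0, z <= 1)
  reproductive_period <- pmax(lifespan_years - maturity_years, 0)
  maturity_years + z * reproductive_period
}

# Illustrative inputs only (years).
generation_length_sketch(maturity_years = 3, lifespan_years = 30)
generation_length_sketch(maturity_years = c(1, 5), lifespan_years = c(10, 50), z = 0.3)
```

Keeping z as an explicit argument makes the assumption visible and easy to vary in a sensitivity check.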

full_list_of_traits <- summarise_database(austraits07, "trait_name")

traits_of_interest <- c(
  "plant_growth_form", "life_history", "flowering_time",
  "resprouting_capacity", "dispersers", "reproductive_maturity",
  "vegetative_reproduction_ability", "lifespan", "recruitment_time",
  "post_fire_recruitment", "resprouting_capacity_non_fire_disturbance",
  "resprouting_capacity_proportion_individuals", "seed_viability",
  "seedbank_longevity", "seedling_establishment_conditions",
  "fire_exposure_level"
)

traits_with_taxa <- austraits07$traits |>
  filter(trait_name %in% traits_of_interest) |>
  left_join(
    taxa_lookup |> select(taxon_name, binomial),
    by = "taxon_name"
  )
  
trait_rows_exact <- results_unique_matched |>
  filter(austraits_match_type == "exact_taxon_name") |>
  select(row_id, matched_taxon_name) |>
  inner_join(
    traits_with_taxa,
    by = c("matched_taxon_name" = "taxon_name")
  )

trait_rows_fuzzy <- results_unique_matched |>
  filter(austraits_match_type == "fuzzy_binomial") |>
  select(row_id, matched_binomial) |>
  inner_join(
    traits_with_taxa,
    by = c("matched_binomial" = "binomial")
  )

trait_rows_to_append <- bind_rows(
  trait_rows_exact,
  trait_rows_fuzzy
)

collapse_unique <- function(x) {
  x <- unique(na.omit(as.character(x)))
  if (length(x) == 0) NA_character_ else paste(x, collapse = "; ")
}

results_unique <- results_unique |>
  mutate(row_id = row_number())

trait_wide <- trait_rows_to_append |>
  mutate(
    value = as.character(value),
    unit = as.character(unit)
  ) |>
  group_by(row_id, trait_name) |>
  summarise(
    trait_value = collapse_unique(value),
    trait_unit = collapse_unique(unit),
    trait_source_id = collapse_unique(source_id),
    trait_dataset_id = collapse_unique(dataset_id),
    .groups = "drop"
  ) |>
  pivot_wider(
    names_from = trait_name,
    values_from = c(
      trait_value,
      trait_unit,
      trait_source_id,
      trait_dataset_id
    ),
    names_glue = "{trait_name}_{.value}"
  )

results_unique_with_traits <- results_unique |>
  left_join(trait_wide, by = "row_id")

austraits_summary <- tibble(
  summary_metric = c(
    "Project taxa",
    "Taxa with at least one selected AusTraits record",
    "Total AusTraits records extracted",
    "Selected traits found",
    "Selected traits requested"
  ),
  value = c(
    nrow(results_unique),
    trait_rows_to_append |> distinct(row_id) |> nrow(),
    nrow(trait_rows_to_append),
    trait_rows_to_append |> distinct(trait_name) |> nrow(),
    length(unique(traits_of_interest))
  )
)

austraits_summary |>
  knitr::kable(
    caption = "Summary of AusTraits trait extraction"
  )

trait_coverage_summary <- trait_rows_to_append |>
  group_by(trait_name) |>
  summarise(
    n_taxa = n_distinct(row_id),
    n_records = n(),
    n_unique_values = n_distinct(value, na.rm = TRUE),
    example_values = paste(
      head(unique(na.omit(as.character(value))), 5),
      collapse = "; "
    ),
    .groups = "drop"
  ) |>
  arrange(desc(n_taxa), trait_name)

trait_coverage_summary |>
  knitr::kable(
    caption = "Coverage of selected AusTraits traits"
  )

Coverage of selected AusTraits traits
trait_name	n_taxa	n_records	n_unique_values	example_values
plant_growth_form	222	1419	20	shrub tree; shrub; tree; fern; graminoid herb
life_history	220	913	10	perennial; annual perennial; annual; annual short_lived_perennial; short_lived_perennial
lifespan	214	375	16	500–1000; 10–50; 0–1; 50–500; 50–100
reproductive_maturity	214	337	11	5–20; 1–5; 0–1; 4–5; 2
seedbank_longevity	214	337	6	2–; –2; 1–; 10–; 5–
vegetative_reproduction_ability	214	352	2	vegetative; not_vegetative
resprouting_capacity	212	563	6	fire_killed; resprouts; partial_resprouting; partial_resprouting resprouts; fire_killed partial_resprouting
dispersers	206	401	19	invertebrates; ants passive vertebrates; wind; abiotic animals; water
flowering_time	203	482	90	nnnnnnyyyynn; nnnnnnyyyyyn; nnnnnnnyyyyn; nnnnnnnnyynn; nnnnnnnyyynn
resprouting_capacity_non_fire_disturbance	117	161	1	resprouts_non_fire_disturbance
post_fire_recruitment	46	69	2	post_fire_recruitment; post_fire_recruitment_absent
seedling_establishment_conditions	45	49	2	establish_anytime; establish_post_fire
resprouting_capacity_proportion_individuals	3	9	3	1; 0.8; 0.4
fire_exposure_level	2	3	1	aquatic_taxon

3 Geographic range

This section addresses the Geographic Range criteria of the nomination process, in particular the area of occupancy (AOO) and extent of occurrence (EOO).
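
As a reminder of what the two measures mean, the sketch below computes them for a handful of made-up points using base R. It follows the usual IUCN convention of counting occupied 2 km x 2 km grid cells for AOO and taking the area of the convex hull for EOO. The function names and coordinates are illustrative assumptions, and the inputs are assumed to be projected coordinates in metres, so longitude and latitude from ALA would need projecting first.

```r
# Inputs are assumed to be projected coordinates in metres (not longitude/latitude).
aoo_km2_sketch <- function(x, y, cell_m = 2000) {
  # Count distinct grid cells that contain at least one record.
  cells <- unique(paste(floor(x / cell_m), floor(y / cell_m)))
  length(cells) * (cell_m / 1000)^2
}

eoo_km2_sketch <- function(x, y) {
  hull <- grDevices::chull(x, y)  # indices of the convex hull vertices
  hx <- x[hull]
  hy <- y[hull]
  # Shoelace formula for polygon area, converted from m^2 to km^2.
  area_m2 <- abs(sum(hx * c(hy[-1], hy[1]) - c(hx[-1], hx[1]) * hy)) / 2
  area_m2 / 1e6
}

# Made-up points: four corners of a 10 km x 8 km rectangle plus one point
# that shares a grid cell with the first corner.
x <- c(0, 1000, 10000, 10000, 0)
y <- c(0, 500, 0, 8000, 8000)
aoo_km2_sketch(x, y)  # 4 occupied cells x 4 km^2 = 16
eoo_km2_sketch(x, y)  # 10 km x 8 km hull = 80
```

The AOO value depends on where the grid origin sits, so a real analysis would fix the grid in advance or use a dedicated package.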

3.1 ALA with the galah package

galah is the R package used to extract data from the ALA, which is the primary resource we use to determine geographic range and species distribution. Here, species records are restricted to Victoria (this can be changed if the project is expanded to a continental analysis).

3.1.1 ALA taxon matching helper functions

The taxon matching helper functions create a hierarchy of candidate names for each project species, following the same fallback logic as the cleaned VicFlora search names described above.

For each candidate name, the script uses galah::search_taxa() to query the ALA taxonomic backbone. The returned matches are filtered to retain plant taxa and to avoid broad higher-level matches such as genera or families. This prevents the script from accidentally downloading all records for a whole genus when a species-level name cannot be resolved.

When several possible matches are returned, the helper function prioritises exact scientific-name matches and exact ALA match types. The best match is then stored with the original project name, the candidate name used for the successful query, the resolved ALA scientific name, taxon concept identifier, taxon rank, and match type. Species that cannot be resolved are retained in the output with a not_found status, allowing them to be reviewed manually.

This matching step produces an auditable table of taxonomic decisions before occurrence records are downloaded. This is useful because it separates uncertainty in name resolution from uncertainty in occurrence data, and makes it possible to check which species were matched exactly, which required simplified fallback names, and which were not found in ALA.
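
The filtering and prioritisation rules described in this section can be tried offline on a toy search result, without calling the ALA. The sketch below uses base R. The taxon names are placeholders, the match_type values are illustrative, and pick_best_sketch is a hypothetical name rather than the workflow's helper.

```r
# Toy search result (placeholder names, not real ALA output).
tax <- data.frame(
  scientific_name = c("Examplea", "Examplea primula", "Examplea prima"),
  rank = c("genus", "species", "species"),
  match_type = c("higherMatch", "fuzzyMatch", "exactMatch"),
  stringsAsFactors = FALSE
)

pick_best_sketch <- function(tax, query_name) {
  # Drop genus-level and higher matches so a failed species lookup
  # never resolves to a whole genus or family.
  higher_ranks <- c("kingdom", "phylum", "class", "order", "family", "genus")
  keep <- is.na(tax$rank) | !(tolower(tax$rank) %in% higher_ranks)
  tax <- tax[keep, , drop = FALSE]
  if (nrow(tax) == 0) return(NULL)

  # Prefer an exact scientific-name match, then an exact match type.
  exact_name <- tolower(tax$scientific_name) == tolower(query_name)
  exact_type <- grepl("exact", tolower(tax$match_type))
  tax[order(-exact_name, -exact_type), , drop = FALSE][1, , drop = FALSE]
}

pick_best_sketch(tax, "examplea prima")$scientific_name  # exact name wins
is.null(pick_best_sketch(tax[1, ], "Examplea"))          # genus-only result is rejected
```

Returning NULL for a genus-only result lets the caller move on to the next candidate name, or record the species as not_found.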

victoria_exact <- victoria_sf |>
  st_make_valid() |>
  st_transform(4326) |>
  st_zm(drop = TRUE, what = "ZM")

victoria_exact <- st_as_sf(
  st_sfc(
    st_union(st_geometry(victoria_exact)),
    crs = st_crs(victoria_exact)
  )
)

# The exact polygon crop happens locally after download.
victoria_query_area <- victoria_exact

results_unique_ala_base <- results_unique_with_traits |>
  mutate(
    ala_row_id = row_number(),
    ala_input_name = as.character(.data[[species_col]]),
    ala_input_name = str_squish(ala_input_name)
  )

species_to_resolve <- results_unique_ala_base |>
  filter(!is.na(ala_input_name), ala_input_name != "") |>
  distinct(ala_input_name) |>
  mutate(ala_species_id = row_number())

# ------------------------------------------------------------
# 5. Candidate names for ALA search
# ------------------------------------------------------------
# Since vicflora_search_name is already cleaned, the exact value is tried first.
# Bracket removal and aff./var./s.s. handling are only fallback candidates.

make_ala_candidates <- function(x) {
  original <- str_squish(x)

  no_brackets <- original |>
    str_remove("\\s*\\([^)]*\\)\\s*$") |>
    str_squish()

  no_ss <- no_brackets |>
    str_remove("\\s+s\\.?s\\.?\\s*$") |>
    str_remove("\\s+ss\\s*$") |>
    str_squish()

  # Caladenia aff. vulgaris -> Caladenia vulgaris
  aff_name <- no_ss |>
    str_replace(
      "^([A-Z][A-Za-z-]+)\\s+(?:sp\\.?\\s*)?aff\\.?\\s+([a-z][a-z-]+).*$",
      "\\1 \\2"
    ) |>
    str_squish()

  # Acacia leprosa var. graveolens -> Acacia leprosa
  species_level <- no_ss |>
    str_replace(
      "^([A-Z][A-Za-z-]+\\s+[a-z][a-z-]+)\\s+(var\\.|subsp\\.|ssp\\.).*$",
      "\\1"
    ) |>
    str_squish()

  # Arthropodium sp. 2 (greenish flowers) -> Arthropodium sp. 2
  sp_number <- str_extract(
    no_brackets,
    "^[A-Z][A-Za-z-]+\\s+sp\\.?\\s*\\d+"
  )

  candidates <- c(
    original,
    no_brackets,
    no_ss,
    aff_name,
    sp_number,
    species_level
  )

  candidates |>
    unique() |>
    discard(~ is.na(.x) || .x == "")
}

# ------------------------------------------------------------
# 6. Helpers for robust ALA taxon matching
# ------------------------------------------------------------

find_col <- function(df, possible_names) {
  hit <- intersect(possible_names, names(df))
  if (length(hit) == 0) NA_character_ else hit[1]
}

safe_pull <- function(df, possible_names, default = NA_character_) {
  col <- find_col(df, possible_names)
  if (is.na(col)) {
    rep(default, nrow(df))
  } else {
    as.character(df[[col]])
  }
}

resolve_ala_taxon <- function(ala_input_name, ala_species_id) {
  candidates <- make_ala_candidates(ala_input_name)

  for (i in seq_along(candidates)) {
    query_name <- candidates[[i]]

    tax <- tryCatch(
      galah::search_taxa(query_name),
      error = function(e) NULL
    )

    if (is.null(tax) || nrow(tax) == 0) next

    tax <- as_tibble(tax)

    sci_col <- find_col(
      tax,
      c("scientific_name", "scientificName", "taxon_name", "name")
    )

    if (is.na(sci_col)) next

    tax <- tax |>
      mutate(
        .scientific_name = as.character(.data[[sci_col]]),
        .kingdom = safe_pull(tax, c("kingdom", "kingdom_name", "kingdomName")),
        .rank = safe_pull(tax, c("rank", "taxon_rank", "taxonRank")),
        .match_type = safe_pull(tax, c("match_type", "matchType"))
      )

    # Keep plant records when kingdom is supplied
    tax <- tax |>
      filter(
        is.na(.kingdom) |
          .kingdom == "" |
          str_to_lower(.kingdom) == "plantae"
      )

    # Avoid downloading whole genera/families when species-level matching fails
    tax <- tax |>
      filter(
        is.na(.rank) |
          !str_to_lower(.rank) %in% c(
            "kingdom", "phylum", "class", "order", "family", "genus"
          )
      )

    if (nrow(tax) == 0) next

    tax_best <- tax |>
      mutate(
        .exact_name = str_to_lower(.scientific_name) == str_to_lower(query_name),
        .exact_match_type = str_detect(str_to_lower(.match_type), "exact")
      ) |>
      arrange(
        desc(.exact_name),
        desc(.exact_match_type)
      ) |>
      slice(1)

    taxon_id_col <- find_col(
      tax_best,
      c("taxon_concept_id", "taxonConceptID", "taxonConceptId", "guid", "taxon_id")
    )

    return(tibble(
      ala_species_id = ala_species_id,
      ala_input_name = ala_input_name,
      ala_query_name = query_name,
      ala_candidate_number = i,
      ala_scientific_name = tax_best$.scientific_name[1],
      ala_taxon_concept_id = if (!is.na(taxon_id_col)) as.character(tax_best[[taxon_id_col]][1]) else NA_character_,
      ala_taxon_rank = tax_best$.rank[1],
      ala_match_type = tax_best$.match_type[1],
      ala_resolution_status = "matched"
    ))
  }

  tibble(
    ala_species_id = ala_species_id,
    ala_input_name = ala_input_name,
    ala_query_name = NA_character_,
    ala_candidate_number = NA_integer_,
    ala_scientific_name = NA_character_,
    ala_taxon_concept_id = NA_character_,
    ala_taxon_rank = NA_character_,
    ala_match_type = NA_character_,
    ala_resolution_status = "not_found"
  )
}

taxon_matches <- map2_dfr(
  species_to_resolve$ala_input_name,
  species_to_resolve$ala_species_id,
  resolve_ala_taxon
)

write_csv(
  taxon_matches,
  file.path(out_dir, "ala_taxon_matches.csv")
)

# Quick check
taxon_matches |>
  count(ala_resolution_status, ala_match_type)

# ------------------------------------------------------------
# 7. Helpers for occurrence downloads and exact Victoria crop
# ------------------------------------------------------------

safe_slug <- function(x) {
  x |>
    str_replace_all("[^A-Za-z0-9]+", "_") |>
    str_replace_all("^_|_$", "") |>
    str_to_lower()
}

get_count_value <- function(x) {
  if ("count" %in% names(x)) return(as.integer(x$count[1]))
  if ("totalRecords" %in% names(x)) return(as.integer(x$totalRecords[1]))
  NA_integer_
}

crop_occurrences_to_victoria <- function(occ, victoria_exact) {
  if (is.null(occ) || nrow(occ) == 0) {
    return(occ)
  }

  lon_col <- intersect(
    c("decimalLongitude", "decimal_longitude", "longitude"),
    names(occ)
  )[1]

  lat_col <- intersect(
    c("decimalLatitude", "decimal_latitude", "latitude"),
    names(occ)
  )[1]

  if (is.na(lon_col) || is.na(lat_col)) {
    warning("No longitude/latitude columns found. Returning uncropped records.")
    return(occ)
  }

  occ_sf <- occ |>
    filter(
      !is.na(.data[[lon_col]]),
      !is.na(.data[[lat_col]])
    ) |>
    st_as_sf(
      coords = c(lon_col, lat_col),
      crs = 4326,
      remove = FALSE
    )

  occ_vic <- occ_sf |>
    st_filter(victoria_exact, .predicate = st_intersects)

  occ_vic |>
    st_drop_geometry()
}

download_one_ala_species <- function(match_row,
                                     victoria_query_area,
                                     victoria_exact,
                                     out_dir,
                                     pause_seconds = 0.5) {

  if (match_row$ala_resolution_status != "matched") {
    return(tibble(
      ala_species_id = match_row$ala_species_id,
      ala_input_name = match_row$ala_input_name,
      ala_scientific_name = match_row$ala_scientific_name,
      download_status = "taxon_not_found",
      n_records_bbox = NA_integer_,
      n_records_victoria = NA_integer_,
      error_message = NA_character_
    ))
  }

  slug <- safe_slug(paste0(
    match_row$ala_species_id, "_",
    match_row$ala_input_name
  ))

  rds_file <- file.path(out_dir, paste0(slug, "_occurrences.rds"))
  csv_file <- file.path(out_dir, paste0(slug, "_occurrences.csv"))

  # Resume behaviour: skip if already downloaded
  if (file.exists(rds_file)) {
    occ <- readRDS(rds_file)

    return(tibble(
      ala_species_id = match_row$ala_species_id,
      ala_input_name = match_row$ala_input_name,
      ala_scientific_name = match_row$ala_scientific_name,
      download_status = "already_downloaded",
      n_records_bbox = NA_integer_,
      n_records_victoria = nrow(occ),
      error_message = NA_character_
    ))
  }

result <- tryCatch({

  base_query <- galah::request_data("occurrences") |>
    galah::identify(match_row$ala_scientific_name) |>
    galah::geolocate(victoria_query_area, type = "bbox")

  if (!inherits(base_query, "data_request")) {
    stop("base_query is not a galah data_request. Check galah query construction.")
  }

  bbox_count <- base_query |>
    dplyr::count() |>
    dplyr::collect() |>
    get_count_value()

  if (is.na(bbox_count) || bbox_count == 0) {
    return(tibble(
      ala_species_id = match_row$ala_species_id,
      ala_input_name = match_row$ala_input_name,
      ala_scientific_name = match_row$ala_scientific_name,
      download_status = "zero_records_bbox",
      n_records_bbox = bbox_count,
      n_records_victoria = 0L,
      error_message = NA_character_
    ))
  }

  occ_raw <- base_query |>
    dplyr::select(group = "basic") |>
    dplyr::collect()

  occ_vic <- crop_occurrences_to_victoria(
    occ = occ_raw,
    victoria_exact = victoria_exact
  )

  if (nrow(occ_vic) == 0) {
    saveRDS(occ_vic, rds_file)
    write_csv(occ_vic, csv_file)

    return(tibble(
      ala_species_id = match_row$ala_species_id,
      ala_input_name = match_row$ala_input_name,
      ala_scientific_name = match_row$ala_scientific_name,
      download_status = "zero_records_after_victoria_crop",
      n_records_bbox = bbox_count,
      n_records_victoria = 0L,
      error_message = NA_character_
    ))
  }

  occ_vic <- occ_vic |>
    mutate(
      ala_species_id = match_row$ala_species_id,
      ala_input_name = match_row$ala_input_name,
      ala_query_name = match_row$ala_query_name,
      ala_scientific_name_resolved = match_row$ala_scientific_name,
      ala_taxon_concept_id_resolved = match_row$ala_taxon_concept_id,
      ala_taxon_rank_resolved = match_row$ala_taxon_rank,
      ala_match_type = match_row$ala_match_type,
      ala_candidate_number = match_row$ala_candidate_number,
      .before = 1
    )

  saveRDS(occ_vic, rds_file)
  write_csv(occ_vic, csv_file)

  Sys.sleep(pause_seconds)

  tibble(
    ala_species_id = match_row$ala_species_id,
    ala_input_name = match_row$ala_input_name,
    ala_scientific_name = match_row$ala_scientific_name,
    download_status = "downloaded",
    n_records_bbox = bbox_count,
    n_records_victoria = nrow(occ_vic),
    error_message = NA_character_
  )

}, error = function(e) {
  tibble(
    ala_species_id = match_row$ala_species_id,
    ala_input_name = match_row$ala_input_name,
    ala_scientific_name = match_row$ala_scientific_name,
    download_status = "error",
    n_records_bbox = NA_integer_,
    n_records_victoria = NA_integer_,
    error_message = conditionMessage(e)
  )
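})

  # Return the status row explicitly rather than relying on the invisible
  # value of the assignment above.
  result
}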

ala_summary <- tibble(
  summary_metric = c(
    "Taxa submitted to ALA",
    "Taxa matched to ALA taxonomy",
    "Taxa not found in ALA taxonomy",
    "Taxa with Victoria records",
    "Total Victoria occurrence records"
  ),
  value = c(
    taxon_matches |> distinct(ala_input_name) |> nrow(),
    taxon_matches |> filter(ala_resolution_status == "matched") |> distinct(ala_input_name) |> nrow(),
    taxon_matches |> filter(ala_resolution_status == "not_found") |> distinct(ala_input_name) |> nrow(),
    download_log |> filter(n_records_victoria > 0) |> distinct(ala_input_name) |> nrow(),
    if (exists("ala_occurrences_victoria")) nrow(ala_occurrences_victoria) else NA_integer_
  )
)

ala_summary |>
  knitr::kable(
    caption = "Summary of ALA taxon matching and Victoria occurrence downloads"
  )

Summary of ALA taxon matching and Victoria occurrence downloads
summary_metric	value
Taxa submitted to ALA	313
Taxa matched to ALA taxonomy	308
Taxa not found in ALA taxonomy	5
Taxa with Victoria records	294
Total Victoria occurrence records	123525

ala_download_status_summary <- download_log |>
  count(download_status, name = "n_taxa") |>
  left_join(
    download_log |>
      group_by(download_status) |>
      summarise(
        total_records_victoria = sum(n_records_victoria, na.rm = TRUE),
        .groups = "drop"
      ),
    by = "download_status"
  ) |>
  arrange(desc(n_taxa))

ala_download_status_summary |>
  knitr::kable(
    caption = "ALA download status summary"
  )

ALA download status summary
download_status	n_taxa	total_records_victoria
downloaded	294	123525
zero_records_bbox	12	0
taxon_not_found	5	0
zero_records_after_victoria_crop	2	0

3.1.2 Data compilation inventory

The summary table combines all information gathered from VicFlora, the AusTraits database and ALA. It has been exported as a separate .xlsx file because of its wide format.

4 Steps for next project phase

4.1 Threats informed by biogeographical regions

This information can be inferred from land-use mapping or other spatial layers related to land degradation. Threat layers from the SMP project could be an option.

4.2 Population

Threatened status is not often determined by population, and information on population status is very difficult to gather. The only viable options are eliciting best estimates from experts and examining herbarium specimen notes. David Cameron has collected this information over time, so digitising his notes and assigning the relevant information to each species should be the primary focus for acquiring population-level information. There will still be gaps after this time-consuming process. The alternative is to build a statistical model that estimates the probability of a species occurring in a plot. Combined with the dimensions of the plant, this can feed into population estimates (e.g. a tree might be one individual per plot).

4.3 Translocation

This won’t be relevant to many poorly known species.

4.4 Subpopulations

Primarily relevant only if there is a single population. Rarely determines the outcome of an assessment.

4.5 Fragmentation

If in a fragmented landscape, likely to be highly fragmented.

4.6 Reduction

This is the most important criterion. As with Threats, it can be inferred from the habitat of the taxon. More specifically, the Bioregional Conservation Status of EVCs gives the amount of decline and/or reduction in habitat, which can be linked to a species through the habitat given in its species description. It basically compares pre-1750 to 2005 habitat.

4.7 Survey Effort

Never contributes to the assessment outcome.

4.8 Very Restricted

A species is restricted if its geographic range is <5 locations or <20 km. These can be calculated after the ALA extract has been cleaned and audited.

4.9 Locations

This criterion is complex to deduce. It will involve putting together a series of prevailing threats and the geographic range for those threats (e.g. roadsides, tenure, historic land use, current land use).

Poorly known species (k) data compilation workflow

Khorloo Batpurev