1 Species descriptions and habitat code

1.1 VicFlora data extract

We start the compilation by extracting species descriptions and habitat codes from the VicFlora website. For this, we use the list of 316 species to build queries against VicFlora. The code below writes an output table containing the available species descriptions, habitat or bioregion codes, match diagnostics, and synonym fallback details.

1.1.1 Matching hierarchy

The extract applies the following matching rules in order:

  1. For ordinary names, var., subsp., aff., and names with bracketed notes, try the original full name first.
  2. For s.s. / sensu stricto names, always query the name before the s.s. marker.
  3. If a name with a parenthetical note fails, remove the trailing bracketed note and try again, for example Cardamine tenuifolia (small-flower form) -> Cardamine tenuifolia.
  4. For named sp. X records, fall back to the named record without the bracketed descriptor, for example Arthropodium sp. 2 (greenish flowers) -> Arthropodium sp. 2.
  5. If a var. / subsp. full-name match fails or has no profile, fall back to the parent species.
  6. If an aff. full-name match fails or has no profile, query Genus + text after aff., for example Caladenia aff. vulgaris (Aireys Inlet) -> Caladenia vulgaris (Aireys Inlet), then Caladenia vulgaris.
  7. If the result is not_found or no_profile, inspect the VicFlora synonym fields and match names while ignoring authorship at the end of the scientific name.
  8. Add plant_group from the VicFlora higher classification: bryophyte, flowering plant, other vascular plant, or unknown.
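
The query order described above can be illustrated with a small standalone sketch. It uses base R only and a hypothetical helper, sketch_search_order(), that is not part of the extraction code. It covers rules 1 to 6 in simplified form and leaves out the synonym and plant_group steps. The name "Genus species s.s." is a made-up placeholder.

```r
# Hypothetical, simplified sketch of the query order (rules 1-6 only).
sketch_search_order <- function(name) {
  name <- trimws(gsub("\\s+", " ", name, perl = TRUE))
  note_pattern <- "\\s*\\([^)]*\\)\\s*$"

  # Rule 2: for s.s. / sensu stricto names, query only the text before the marker.
  ss_pattern <- "\\s+(s\\.\\s*s\\.|sensu\\s+stricto)(\\s.*)?$"
  if (grepl(ss_pattern, name, ignore.case = TRUE, perl = TRUE)) {
    return(sub(ss_pattern, "", name, ignore.case = TRUE, perl = TRUE))
  }

  plan <- name  # Rule 1: the original full name is always tried first.

  if (grepl("(^|\\s)aff\\.?\\s", name, perl = TRUE)) {
    # Rule 6: Genus + text after aff., with and then without the bracketed place.
    with_place <- sub("^([A-Z][A-Za-z-]+)\\b.*?\\baff\\.?\\s+(.+)$", "\\1 \\2",
                      name, perl = TRUE)
    plan <- c(plan, with_place, sub(note_pattern, "", with_place, perl = TRUE))
  } else {
    # Rules 3 and 4: drop a trailing bracketed note or descriptor.
    plan <- c(plan, sub(note_pattern, "", name, perl = TRUE))
  }

  # Rule 5: fall back to the parent species for var. / subsp. names.
  if (grepl("\\s(subsp\\.|ssp\\.|var\\.)\\s", name, perl = TRUE)) {
    plan <- c(plan, sub("^([A-Z][A-Za-z-]+\\s+[a-z-]+).*$", "\\1", name, perl = TRUE))
  }

  unique(plan)  # Remove repeated query names while keeping the order.
}

sketch_search_order("Caladenia aff. vulgaris (Aireys Inlet)")
sketch_search_order("Acacia leprosa var. graveolens")
sketch_search_order("Genus species s.s.")
```

Each call returns the query names in the order they would be tried. The first call, for example, yields the full aff. name, then Caladenia vulgaris (Aireys Inlet), then Caladenia vulgaris.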

# -----------------------------
# Text helpers
# -----------------------------
clean_text <- function(x) {
  if (is.null(x) || length(x) == 0) return(NA_character_)
  x <- as.character(x)
  x[is.na(x)] <- NA_character_
  x <- stringr::str_replace_all(x, "\\u00A0", " ")
  x <- stringr::str_replace_all(x, "\u00A0", " ")
  x <- stringr::str_replace_all(x, "[[:space:]]+", " ")
  x <- stringr::str_squish(x)
  x[!is.na(x) & !nzchar(x)] <- NA_character_
  x
}

name_key <- function(x) {
  x <- clean_text(x)
  x <- x[1]
  if (length(x) == 0 || is.na(x)) {
    return(NA_character_)
  }
  x <- enc2utf8(x)
  # Replace hybrid symbol × with plain x
  x <- stringr::str_replace_all(x, "\u00D7", "x")
  x <- stringr::str_replace_all(x, "[[:punct:]]+", " ")
  x <- stringr::str_replace_all(x, "[[:space:]]+", " ")
  stringr::str_to_lower(stringr::str_squish(x))
}


different_name_key <- function(a, b) {
  a_key <- name_key(a)
  b_key <- name_key(b)
  if (is.na(a_key) || is.na(b_key)) return(TRUE)
  !identical(a_key, b_key)
}

first_sentence <- function(x) {
  x <- clean_text(x)
  if (length(x) == 0 || is.na(x[1])) return(NA_character_)
  x <- x[1]
  pieces <- stringi::stri_split_boundaries(x, type = "sentence")[[1]]
  pieces <- clean_text(pieces)
  pieces <- pieces[!is.na(pieces)]
  if (length(pieces) == 0) x else pieces[1]
}

profile_to_paragraphs <- function(profile_html) {
  if (is.null(profile_html) || length(profile_html) == 0 || is.na(profile_html) || !nzchar(profile_html)) {
    return(character())
  }

  # VicFlora profile is HTML. Wrap in a body tag so xml2 can parse fragments safely.
  doc <- tryCatch(
    xml2::read_html(paste0("<html><body>", profile_html, "</body></html>")),
    error = function(e) NULL
  )

  if (is.null(doc)) {
    text <- clean_text(profile_html)
    return(if (is.na(text)) character() else text)
  }

  paragraphs <- rvest::html_elements(doc, "p") |> rvest::html_text2()

  # Fallback for profiles that do not use <p> tags.
  if (length(paragraphs) == 0) {
    text <- rvest::html_text2(doc)
    paragraphs <- stringr::str_split(text, "\\n\\s*\\n")[[1]]
  }

  paragraphs <- purrr::map_chr(paragraphs, clean_text)
  paragraphs <- paragraphs[!is.na(paragraphs)]
  paragraphs
}

extract_codes_from_treatment <- function(second_paragraph) {
  second_paragraph <- clean_text(second_paragraph)
  if (is.na(second_paragraph)) return(NA_character_)

  # Backup heuristic. This catches short region-style codes such as EGL, HSF, HNF, VAlp, WPro, GipP.
  candidates <- stringr::str_extract_all(second_paragraph, "\\b[A-Z][A-Za-z]{1,4}\\b")[[1]]

  exclude <- c(
    month.abb, "Sept", "NSW", "ACT", "QLD", "Qld", "SA", "WA", "NT", "TAS", "Tas", "VIC", "Vic",
    "Also", "Flowers", "Flower", "Fruit", "Fruits", "Mostly", "Grows", "Occurs", "Known", "Rare", "Common",
    "New", "South", "Wales", "Victoria", "Victorian", "Australian", "Australia", "Plants", "Plant",
    "East", "West", "North", "South", "Central", "Mt", "Mount", "River", "Creek", "Lake", "Near"
  )

  candidates <- candidates[!candidates %in% exclude]
  candidates <- candidates[stringr::str_detect(candidates, "^([A-Z]{2,5}|[A-Z][a-z]{1,3}[A-Z]|[A-Z][a-z]?[A-Z][a-z]{0,2})$")]
  candidates <- unique(candidates)

  if (length(candidates) == 0) NA_character_ else paste(candidates, collapse = ", ")
}

collapse_bioregion_codes <- function(bioregions) {
  if (is.null(bioregions) || length(bioregions) == 0) return(NA_character_)
  codes <- purrr::map_chr(bioregions, ~ .x$bioregionCode %||% NA_character_)
  codes <- clean_text(codes)
  codes <- codes[!is.na(codes)]
  if (length(codes) == 0) NA_character_ else paste(unique(codes), collapse = ", ")
}

classification_names <- function(concept) {
  if (is.null(concept) || length(concept) == 0) return(character())

  names <- c(
    concept$taxonName$fullName %||% NA_character_,
    concept$taxonName$fullNameWithAuthorship %||% NA_character_
  )

  hc <- concept$higherClassification
  if (!is.null(hc) && length(hc) > 0) {
    names <- c(
      names,
      purrr::map_chr(hc, ~ .x$taxonName$fullName %||% NA_character_),
      purrr::map_chr(hc, ~ .x$taxonName$fullNameWithAuthorship %||% NA_character_)
    )
  }

  names <- clean_text(names)
  unique(names[!is.na(names)])
}

collapse_higher_classification <- function(concept) {
  hc <- concept$higherClassification
  if (is.null(hc) || length(hc) == 0) return(NA_character_)

  pieces <- purrr::map_chr(hc, function(x) {
    nm <- clean_text(x$taxonName$fullName %||% NA_character_)[1]
    rk <- clean_text(x$taxonRank %||% NA_character_)[1]
    if (is.na(nm)) return(NA_character_)
    if (is.na(rk)) nm else paste0(rk, ": ", nm)
  })

  pieces <- pieces[!is.na(pieces)]
  if (length(pieces) == 0) NA_character_ else paste(unique(pieces), collapse = " | ")
}

infer_plant_group <- function(concept) {
  # Uses VicFlora's higherClassification to separate bryophytes from flowering plants.
  # Keeps an audit trail in higher_classification so borderline cases can be checked.
  names <- classification_names(concept)
  if (length(names) == 0) return(NA_character_)
  keys <- stringr::str_to_lower(names)

  bryophyte_patterns <- c(
    "bryophyta", "marchantiophyta", "anthocerotophyta",
    "bryopsida", "sphagnopsida", "polytrichopsida",
    "jungermanniopsida", "marchantiopsida", "anthocerotopsida",
    "moss", "liverwort", "hornwort"
  )

  flowering_patterns <- c(
    "magnoliophyta", "angiosperm", "flowering plant",
    "magnoliopsida", "liliopsida", "monocot", "dicot", "eudicot",
    "rosids", "asterids", "commelinids", "magnoliids"
  )

  other_vascular_patterns <- c(
    "pteridophyta", "polypodiophyta", "lycopodiophyta", "pinophyta",
    "cycadophyta", "ginkgophyta", "gnetophyta", "gymnosperm",
    "fern", "clubmoss", "quillwort", "conifer"
  )

  has_pattern <- function(patterns) {
    any(purrr::map_lgl(patterns, function(pat) {
      any(stringr::str_detect(keys, stringr::fixed(pat, ignore_case = TRUE)))
    }))
  }

  if (has_pattern(bryophyte_patterns)) return("bryophyte")
  if (has_pattern(flowering_patterns)) return("flowering plant")
  if (has_pattern(other_vascular_patterns)) return("other vascular plant")

  "unknown"
}
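
As a quick check of the backup code heuristic in extract_codes_from_treatment(), the sketch below restates it in base R and runs it on an invented sentence. The helper name sketch_extract_codes(), the shortened exclusion list, and the sample text are all assumptions made for illustration; they are not VicFlora data.

```r
# Hypothetical base-R restatement of the code heuristic, with a shortened exclusion list.
sketch_extract_codes <- function(text) {
  # Short capitalised tokens are the initial candidates.
  candidates <- regmatches(
    text,
    gregexpr("\\b[A-Z][A-Za-z]{1,4}\\b", text, perl = TRUE)
  )[[1]]

  # Drop tokens that look like codes but are months, states, or common words.
  exclude <- c(month.abb, "NSW", "ACT", "Qld", "SA", "WA", "NT", "Tas", "Vic", "Also", "Grows")
  candidates <- candidates[!candidates %in% exclude]

  # Keep only tokens with a code-like shape such as EGL, VAlp, or GipP.
  shape <- "^([A-Z]{2,5}|[A-Z][a-z]{1,3}[A-Z]|[A-Z][a-z]?[A-Z][a-z]{0,2})$"
  candidates <- unique(candidates[grepl(shape, candidates, perl = TRUE)])

  if (length(candidates) == 0) NA_character_ else paste(candidates, collapse = ", ")
}

# Invented sentence, for illustration only.
sketch_extract_codes("Also NSW, Tas. EGL, HSF, VAlp, GipP. Grows in damp forest.")
# "EGL, HSF, VAlp, GipP"
```

NSW has the same shape as a bioregion code, which is why the shape filter alone is not enough and the exclusion list is applied first.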

1.1.2 Species matches and match types

For most of the 316 species, exact taxonomic name matches were found in VicFlora. Where a species was instead matched to its closest affinity, or a species-level description was applied to a subspecies or variety, the match type was recorded. Examples of recorded match types:

  • Acacia leprosa var. graveolens, exact VicFlora match -> exact match
  • Parent species used for a variety or subspecies -> var
  • Caladenia aff. vulgaris (Aireys Inlet), fallback to the closest match Caladenia vulgaris -> aff
  • Arthropodium sp. 2 (greenish flowers), fallback to Arthropodium sp. 2 -> exact match if found
  • Cardamine tenuifolia (small-flower form), fallback to Cardamine tenuifolia -> exact match if found
  • Species with no record in VicFlora, e.g. Ptychomitrium muelleri -> not_found

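
The labelling in the list above can be sketched as a small standalone function. sketch_match_label() is a hypothetical helper written for illustration. The reason strings are the ones used by the search plan in the code that follows, and a lookup status of "ok" is assumed to mean that a profile was found.

```r
# Hypothetical standalone sketch of how one lookup attempt becomes a match-type label.
sketch_match_label <- function(attempt_reason, scrape_status) {
  if (is.na(scrape_status)) return(NA_character_)
  # Failed lookups are labelled with their status, e.g. not_found or no_profile.
  if (scrape_status != "ok") return(scrape_status)
  if (is.na(attempt_reason)) return("exact match")
  if (attempt_reason == "ss_name_before_marker") return("ss")
  if (startsWith(attempt_reason, "aff_genus_plus_text_after_aff")) return("aff")
  if (attempt_reason == "var_or_subsp_parent_species_fallback") return("var")
  # Full-name matches and bracketed-note fallbacks both count as exact matches.
  "exact match"
}

sketch_match_label("original_full_name_exact_first", "ok")               # "exact match"
sketch_match_label("var_or_subsp_parent_species_fallback", "ok")         # "var"
sketch_match_label("aff_genus_plus_text_after_aff_without_place", "ok")  # "aff"
sketch_match_label("parenthetical_note_removed_fallback", "ok")          # "exact match"
sketch_match_label("original_full_name_exact_first", "not_found")        # "not_found"
```
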
# -----------------------------
# Scientific-name parsing helpers
# -----------------------------
rank_regex <- "(?:subsp\\.?|ssp\\.?|var\\.?|subvar\\.?|forma|f\\.?)"

species_binomial <- function(x) {
  x <- clean_text(x)
  x <- x[1]
  if (length(x) == 0 || is.na(x)) return(NA_character_)
  # Replace hybrid symbol × with plain x so the binomial pattern below can match it.
  x <- stringr::str_replace_all(x, "\u00D7", "x")
  m <- stringr::str_match(x, "^([A-Z][A-Za-z-]+\\s+(?:x\\s+)?[a-z][A-Za-z-]+)\\b")
  if (!is.na(m[1, 2])) clean_text(m[1, 2]) else NA_character_
}

remove_ss_marker <- function(x) {
  x <- clean_text(x)
  x <- x[1]
  if (length(x) == 0 || is.na(x)) return(NA_character_)
  x <- stringr::str_replace(
    x,
    stringr::regex("\\s+(s\\.?\\s*s\\.?|ss|sensu\\s+stricto)(\\s+.*|$)", ignore_case = TRUE),
    ""
  )
  clean_text(x)
}

botanical_name_without_authorship <- function(x) {
  x <- clean_text(x)
  x <- x[1]
  if (length(x) == 0 || is.na(x)) return(NA_character_)
  # Replace hybrid symbol × with plain x before parsing the name.
  x <- stringr::str_replace_all(x, "\u00D7", "x")
  x <- remove_ss_marker(x)

  # Named forms like Geranium sp. 3 or Alternanthera sp. 1.
  m_sp_number <- stringr::str_match(
    x,
    stringr::regex("^([A-Z][A-Za-z-]+\\s+sp\\.?\\s*[0-9A-Za-z-]+)\\b", ignore_case = FALSE)
  )
  if (!is.na(m_sp_number[1, 2])) return(clean_text(m_sp_number[1, 2]))

  # Infraspecific rank. Keep only Genus species rank epithet, dropping authorship after that.
  infra_pat <- paste0("^([A-Z][A-Za-z-]+\\s+(?:x\\s+)?[a-z][A-Za-z-]+\\s+", rank_regex, "\\s+(?:[a-z][A-Za-z-]+|[0-9]+))\\b")
  m_infra <- stringr::str_match(x, stringr::regex(infra_pat, ignore_case = FALSE))
  if (!is.na(m_infra[1, 2])) return(clean_text(m_infra[1, 2]))

  # Ordinary binomial.
  species_binomial(x)
}

authorless_key <- function(x) {
  name_key(botanical_name_without_authorship(x))
}

name_matches_ignoring_authorship <- function(query_name, candidate_names) {
  q_key <- authorless_key(query_name)
  if (is.na(q_key)) return(FALSE)
  cand_keys <- purrr::map_chr(candidate_names, authorless_key)
  cand_keys <- cand_keys[!is.na(cand_keys)]
  any(cand_keys == q_key)
}

detect_name_flags <- function(original_name) {
  x <- clean_text(original_name)
  x <- x[1]
  if (length(x) == 0 || is.na(x)) return(list(flags = NA_character_, has_ss = FALSE, has_var = FALSE, has_aff = FALSE))

  has_ss <- stringr::str_detect(x, stringr::regex("(^|\\s)(s\\.?\\s*s\\.?|ss|sensu\\s+stricto)(\\s|$)", ignore_case = TRUE))
  has_aff <- stringr::str_detect(x, stringr::regex("(^|\\s)aff\\.?(\\s|$)", ignore_case = TRUE))
  has_var <- stringr::str_detect(x, stringr::regex(paste0("(^|\\s)", rank_regex, "(\\s|$)"), ignore_case = TRUE))

  flags <- c()
  if (has_ss) flags <- c(flags, "ss")
  if (has_var) flags <- c(flags, "var")
  if (has_aff) flags <- c(flags, "aff")

  list(
    flags = if (length(flags) == 0) "exact match" else paste(flags, collapse = "; "),
    has_ss = has_ss,
    has_var = has_var,
    has_aff = has_aff
  )
}

final_ss_var_aff <- function(original_name, attempt_reason, scrape_status) {

  status <- clean_text(scrape_status)[1]
  if (is.na(status)) return(NA_character_)

  if (!identical(status, "ok")) return(status)

  reason <- clean_text(attempt_reason)[1]
  if (is.na(reason)) return("exact match")

  if (identical(reason, "ss_name_before_marker")) return("ss")
  if (stringr::str_detect(reason, "^aff_genus_plus_text_after_aff")) return("aff")
  if (identical(reason, "var_or_subsp_parent_species_fallback")) return("var")

  "exact match"
}

failure_priority <- function(status) {
  status <- clean_text(status)[1]
  if (is.na(status)) return(0L)
  dplyr::case_when(
    status == "not_started" ~ 0L,
    status == "blank_name" ~ 0L,
    status == "not_found" ~ 1L,
    status == "error" ~ 2L,
    status == "no_profile" ~ 3L,
    TRUE ~ 1L
  )
}

aff_target_name <- function(original_name, keep_parenthetical = TRUE) {
  x <- clean_text(original_name)
  x <- x[1]
  if (length(x) == 0 || is.na(x)) return(NA_character_)

  # Keep the genus and replace the aff. phrase with the text after aff.
  # Examples:
  #   Caladenia aff. vulgaris (Aireys Inlet) -> Caladenia vulgaris (Aireys Inlet)
  #   Caladenia sp. aff. iridescens (Chapple Vale) -> Caladenia iridescens (Chapple Vale)
  #   Geranium aff. sp. 3 -> Geranium sp. 3
  m <- stringr::str_match(
    x,
    stringr::regex("^([A-Z][A-Za-z-]+)\\b.*?\\baff\\.?\\s+(.+)$", ignore_case = TRUE)
  )
  if (is.na(m[1, 2]) || is.na(m[1, 3])) return(NA_character_)

  tail <- clean_text(m[1, 3])
  tail <- stringr::str_replace(tail, stringr::regex("^of\\s+", ignore_case = TRUE), "")
  target <- clean_text(paste(m[1, 2], tail))

  if (!keep_parenthetical) {
    target <- stringr::str_remove(target, "\\s*\\([^)]*\\)\\s*$")
    target <- clean_text(target)
  }

  target
}

aff_place_name <- function(original_name) {
  x <- clean_text(original_name)
  x <- x[1]
  if (length(x) == 0 || is.na(x)) return(NA_character_)
  if (!stringr::str_detect(x, stringr::regex("(^|\\s)aff\\.?(\\s|$)", ignore_case = TRUE))) return(NA_character_)
  m <- stringr::str_match(x, "\\(([^)]+)\\)")
  if (!is.na(m[1, 2])) clean_text(m[1, 2]) else NA_character_
}

place_match_in_text <- function(place, text) {
  place <- clean_text(place)
  text <- clean_text(text)
  if (is.na(place) || is.na(text)) return(FALSE)

  place_options <- unique(c(
    place,
    stringr::str_remove(place, stringr::regex("\\s+variant$", ignore_case = TRUE))
  ))
  place_options <- place_options[!is.na(place_options) & nzchar(place_options)]

  any(purrr::map_lgl(place_options, function(p) {
    stringr::str_detect(text, stringr::fixed(p, ignore_case = TRUE))
  }))
}

named_sp_number_base <- function(original_name) {
  x <- clean_text(original_name)
  x <- x[1]
  if (length(x) == 0 || is.na(x)) return(NA_character_)

  # Named phrase records such as:
  #   Arthropodium sp. 2 (greenish flowers) -> Arthropodium sp. 2
  # Do not use bracketed informal descriptors as part of the VicFlora query.
  # This deliberately excludes aff./cf./nov. patterns such as Caladenia sp. aff. vulgaris.
  m <- stringr::str_match(
    x,
    stringr::regex("^([A-Z][A-Za-z-]+\\s+sp\\.?\\s*([0-9A-Za-z-]+))\\b", ignore_case = FALSE)
  )
  if (is.na(m[1, 2]) || is.na(m[1, 3])) return(NA_character_)

  token <- stringr::str_to_lower(stringr::str_remove_all(m[1, 3], "\\."))
  if (token %in% c("aff", "cf", "nov", "indet")) return(NA_character_)

  clean_text(m[1, 2])
}

strip_trailing_parenthetical <- function(original_name) {
  x <- clean_text(original_name)
  x <- x[1]
  if (length(x) == 0 || is.na(x)) return(NA_character_)

  stripped <- stringr::str_remove(x, "\\s*\\([^)]*\\)\\s*$")
  stripped <- clean_text(stripped)

  if (is.na(stripped) || !different_name_key(stripped, x)) NA_character_ else stripped
}

build_search_plan <- function(original_name) {
  original_name <- clean_text(original_name)
  original_name <- original_name[1]
  flags <- detect_name_flags(original_name)

  if (length(original_name) == 0 || is.na(original_name)) {
    return(tibble::tibble(search_name = NA_character_, attempt_reason = "blank_name", allow_fuzzy = FALSE))
  }

  # Sensu stricto is the one exception to "try the full name first": it means
  # use the taxon name as written before the s.s. marker.
  if (flags$has_ss) {
    ss_name <- remove_ss_marker(original_name)
    return(tibble::tibble(
      search_name = ss_name,
      attempt_reason = "ss_name_before_marker",
      allow_fuzzy = FALSE
    ))
  }

  # Always start with the full species name. This preserves true exact matches for
  # varieties, subspecies, informal forms, and parenthetical names when VicFlora has them.
  plan <- tibble::tibble(
    search_name = original_name,
    attempt_reason = "original_full_name_exact_first",
    allow_fuzzy = FALSE
  )

  # For ordinary parenthetical notes, only remove the bracketed text after the
  # full-name search has failed. This handles examples such as:
  #   Cardamine tenuifolia (small-flower form) -> Cardamine tenuifolia
  #   Arthropodium sp. 2 (greenish flowers) -> Arthropodium sp. 2
  # For aff. names, keep the specialised aff. hierarchy below because it converts
  # Caladenia aff. vulgaris (Aireys Inlet) to Caladenia vulgaris first.
  if (!flags$has_aff) {
    parenthetical_stripped <- strip_trailing_parenthetical(original_name)
    if (!is.na(parenthetical_stripped)) {
      plan <- dplyr::bind_rows(
        plan,
        tibble::tibble(
          search_name = parenthetical_stripped,
          attempt_reason = "parenthetical_note_removed_fallback",
          allow_fuzzy = FALSE
        )
      )
    }
  }

  # Named sp. number fallback also covers non-parenthetical informal strings.
  # If the stripped-parenthetical fallback already added the same name, distinct()
  # below removes the duplicate.
  sp_number_name <- named_sp_number_base(original_name)
  if (!is.na(sp_number_name) && different_name_key(sp_number_name, original_name)) {
    plan <- dplyr::bind_rows(
      plan,
      tibble::tibble(
        search_name = sp_number_name,
        attempt_reason = "sp_number_without_extra_text_fallback",
        allow_fuzzy = FALSE
      )
    )
  }

  if (flags$has_aff) {
    target_with_place <- aff_target_name(original_name, keep_parenthetical = TRUE)
    target_without_place <- aff_target_name(original_name, keep_parenthetical = FALSE)

    if (!is.na(target_with_place) && different_name_key(target_with_place, original_name)) {
      plan <- dplyr::bind_rows(
        plan,
        tibble::tibble(
          search_name = target_with_place,
          attempt_reason = "aff_genus_plus_text_after_aff_with_place",
          allow_fuzzy = FALSE
        )
      )
    }

    if (!is.na(target_without_place) &&
        different_name_key(target_without_place, target_with_place) &&
        different_name_key(target_without_place, original_name)) {
      plan <- dplyr::bind_rows(
        plan,
        tibble::tibble(
          search_name = target_without_place,
          attempt_reason = "aff_genus_plus_text_after_aff_without_place",
          allow_fuzzy = FALSE
        )
      )
    }
  }

  if (flags$has_var) {
    parent_species <- species_binomial(original_name)
    if (!is.na(parent_species) && different_name_key(parent_species, original_name)) {
      plan <- dplyr::bind_rows(
        plan,
        tibble::tibble(
          search_name = parent_species,
          attempt_reason = "var_or_subsp_parent_species_fallback",
          allow_fuzzy = FALSE
        )
      )
    }
  }

  plan |>
    dplyr::mutate(plan_order = dplyr::row_number()) |>
    dplyr::distinct(search_name, .keep_all = TRUE) |>
    dplyr::select(search_name, attempt_reason, allow_fuzzy, plan_order)
}
# -----------------------------
# VicFlora GraphQL helpers
# -----------------------------
endpoint <- "https://vicflora.rbg.vic.gov.au/graphql"

lookup_query <- '
query TaxonLookup($q: String!) {
  taxonConceptAutocomplete(q: $q) {
    ...LookupConceptFields
    synonyms {
      fullName
      fullNameWithAuthorship
    }
    synonymUsages {
      ...LookupConceptFields
      acceptedConcept {
        ...LookupConceptFields
      }
    }
    acceptedConcept {
      ...LookupConceptFields
    }
  }
}

fragment LookupConceptFields on TaxonConcept {
  id
  taxonomicStatus
  taxonRank
  taxonName {
    fullName
    fullNameWithAuthorship
  }
  higherClassification {
    taxonRank
    taxonName {
      fullName
      fullNameWithAuthorship
    }
  }
  currentProfile {
    profile
    modified
  }
  bioregions {
    bioregionCode
    bioregionName
    occurrenceStatus
  }
}
'

gql_post <- function(query, variables = list()) {
  req <- httr2::request(endpoint) |>
    httr2::req_user_agent("VicFlora species description extraction script; local research use") |>
    httr2::req_headers(Accept = "application/json") |>
    httr2::req_body_json(list(query = query, variables = variables), auto_unbox = TRUE) |>
    httr2::req_timeout(45) |>
    httr2::req_retry(max_tries = 4, backoff = ~ 1.5 ^ .x)

  response <- httr2::req_perform(req)
  parsed <- httr2::resp_body_json(response, simplifyVector = FALSE)

  if (!is.null(parsed$errors)) {
    stop(jsonlite::toJSON(parsed$errors, auto_unbox = TRUE), call. = FALSE)
  }

  parsed$data
}

concept_for_candidate <- function(candidate) {
  if (!is.null(candidate$acceptedConcept) && !is.null(candidate$acceptedConcept$id)) {
    candidate$acceptedConcept
  } else {
    candidate
  }
}

concept_for_synonym_usage <- function(synonym_usage) {
  if (!is.null(synonym_usage$acceptedConcept) && !is.null(synonym_usage$acceptedConcept$id)) {
    synonym_usage$acceptedConcept
  } else {
    synonym_usage
  }
}

candidate_display_names <- function(candidate) {
  c(
    candidate$taxonName$fullName %||% NA_character_,
    candidate$taxonName$fullNameWithAuthorship %||% NA_character_
  )
}

accepted_display_names <- function(candidate) {
  c(
    candidate$acceptedConcept$taxonName$fullName %||% NA_character_,
    candidate$acceptedConcept$taxonName$fullNameWithAuthorship %||% NA_character_
  )
}

candidate_synonym_names <- function(candidate) {
  syn_names <- character()

  if (!is.null(candidate$synonyms) && length(candidate$synonyms) > 0) {
    syn_names <- c(
      syn_names,
      purrr::map_chr(candidate$synonyms, ~ .x$fullName %||% NA_character_),
      purrr::map_chr(candidate$synonyms, ~ .x$fullNameWithAuthorship %||% NA_character_)
    )
  }

  if (!is.null(candidate$synonymUsages) && length(candidate$synonymUsages) > 0) {
    syn_names <- c(
      syn_names,
      purrr::map_chr(candidate$synonymUsages, ~ .x$taxonName$fullName %||% NA_character_),
      purrr::map_chr(candidate$synonymUsages, ~ .x$taxonName$fullNameWithAuthorship %||% NA_character_)
    )
  }

  syn_names <- clean_text(syn_names)
  unique(syn_names[!is.na(syn_names)])
}

make_empty_result <- function(original_name) {
  tibble::tibble(
    query_name = clean_text(original_name)[1],
    ss_var_aff = NA_character_,
    plant_group = NA_character_,
    higher_classification = NA_character_,
    vicflora_search_name = NA_character_,
    search_hierarchy = NA_character_,
    match_type = NA_character_,
    synonym_fallback_used = NA,
    matched_synonym_name = NA_character_,
    aff_place_name = aff_place_name(original_name),
    aff_place_found_in_treatment = NA_character_,
    matched_name = NA_character_,
    matched_name_with_authorship = NA_character_,
    matched_id = NA_character_,
    matched_taxonomic_status = NA_character_,
    used_concept_id = NA_character_,
    used_scientific_name = NA_character_,
    used_name_with_authorship = NA_character_,
    used_accepted_concept = NA,
    profile_modified = NA_character_,
    species_description = NA_character_,
    habitat_codes = NA_character_,
    bioregion_codes_api = NA_character_,
    second_paragraph_codes = NA_character_,
    treatment_first_paragraph = NA_character_,
    treatment_second_paragraph = NA_character_,
    all_treatment_text = NA_character_,
    vicflora_url = NA_character_,
    scrape_status = "not_started",
    error_message = NA_character_
  )
}

build_result_row <- function(original_name, search_name, attempt_reason, candidate, concept, match_type,
                             synonym_fallback_used = FALSE, matched_synonym_name = NA_character_) {
  base <- make_empty_result(original_name)

  profile_html <- concept$currentProfile$profile %||% NA_character_
  paragraphs <- profile_to_paragraphs(profile_html)
  first_para <- if (length(paragraphs) >= 1) paragraphs[1] else NA_character_
  second_para <- if (length(paragraphs) >= 2) paragraphs[2] else NA_character_
  all_treatment <- if (length(paragraphs) > 0) paste(paragraphs, collapse = " ") else NA_character_

  description <- first_sentence(first_para)
  bioregion_codes <- collapse_bioregion_codes(concept$bioregions)
  second_para_codes <- extract_codes_from_treatment(second_para)

  # Fall back to the API bioregion codes if the second paragraph heuristic finds no codes.
  habitat_codes <- if (!is.na(second_para_codes)) second_para_codes else bioregion_codes
  plant_group_value <- infer_plant_group(concept)
  higher_classification_value <- collapse_higher_classification(concept)

  place <- aff_place_name(original_name)
  place_found <- if (!is.na(place) && !is.na(all_treatment)) {
    if (place_match_in_text(place, all_treatment)) place else NA_character_
  } else {
    NA_character_
  }

  final_status <- if (is.na(profile_html) || !nzchar(profile_html)) "no_profile" else "ok"
  final_label <- final_ss_var_aff(original_name, attempt_reason, final_status)

  # Avoid dplyr data-mask collisions: these argument names are also output-column names.
  # If we wrote match_type = match_type, mutate would use the existing blank column
  # from base instead of the function argument.
  match_type_value <- match_type
  synonym_fallback_used_value <- synonym_fallback_used
  matched_synonym_name_value <- matched_synonym_name

  base |>
    dplyr::mutate(
      ss_var_aff = final_label,
      plant_group = plant_group_value,
      higher_classification = higher_classification_value,
      vicflora_search_name = search_name,
      search_hierarchy = attempt_reason,
      match_type = match_type_value,
      synonym_fallback_used = synonym_fallback_used_value,
      matched_synonym_name = matched_synonym_name_value,
      aff_place_found_in_treatment = place_found,
      matched_name = candidate$taxonName$fullName %||% NA_character_,
      matched_name_with_authorship = candidate$taxonName$fullNameWithAuthorship %||% NA_character_,
      matched_id = as.character(candidate$id %||% NA_character_),
      matched_taxonomic_status = candidate$taxonomicStatus %||% NA_character_,
      used_concept_id = as.character(concept$id %||% NA_character_),
      used_scientific_name = concept$taxonName$fullName %||% NA_character_,
      used_name_with_authorship = concept$taxonName$fullNameWithAuthorship %||% NA_character_,
      used_accepted_concept = !is.null(concept$id) && !is.null(candidate$id) && as.character(concept$id) != as.character(candidate$id),
      profile_modified = concept$currentProfile$modified %||% NA_character_,
      species_description = description,
      habitat_codes = habitat_codes,
      bioregion_codes_api = bioregion_codes,
      second_paragraph_codes = second_para_codes,
      treatment_first_paragraph = first_para,
      treatment_second_paragraph = second_para,
      all_treatment_text = all_treatment,
      vicflora_url = if (!is.null(concept$id)) paste0("https://vicflora.rbg.vic.gov.au/flora/taxon/", concept$id) else NA_character_,
      scrape_status = final_status,
      error_message = NA_character_
    )
}

lookup_one_search_name <- function(original_name, search_name, attempt_reason, allow_fuzzy = FALSE) {
  if (is_blank(search_name)) {
    out <- make_empty_result(original_name)
    out$scrape_status <- "blank_name"
    out$ss_var_aff <- final_ss_var_aff(original_name, attempt_reason, out$scrape_status)
    return(out)
  }

  out <- tryCatch({
    data <- gql_post(lookup_query, list(q = search_name))
    candidates <- data$taxonConceptAutocomplete

    if (is.null(candidates) || length(candidates) == 0) {
      row <- make_empty_result(original_name)
      row$vicflora_search_name <- search_name
      row$search_hierarchy <- attempt_reason
      row$scrape_status <- "not_found"
      row$ss_var_aff <- final_ss_var_aff(original_name, attempt_reason, row$scrape_status)
      return(row)
    }

    search_key <- name_key(search_name)
    no_profile_rows <- list()
    choices <- list()

    add_choice <- function(candidate, concept, match_type, synonym_fallback_used = FALSE, matched_synonym_name = NA_character_) {
      choices[[length(choices) + 1]] <<- list(
        candidate = candidate,
        concept = concept,
        match_type = match_type,
        synonym_fallback_used = synonym_fallback_used,
        matched_synonym_name = matched_synonym_name
      )
    }

    # 1. Exact direct matches to the returned taxon name.
    for (candidate in candidates) {
      cand_keys <- purrr::map_chr(candidate_display_names(candidate), name_key)
      cand_keys <- cand_keys[!is.na(cand_keys)]
      if (any(cand_keys == search_key)) {
        add_choice(candidate, concept_for_candidate(candidate), "exact_returned_name", FALSE, NA_character_)
      }
    }

    # 2. Exact matches to an accepted concept returned by a synonym/misapplied result.
    for (candidate in candidates) {
      acc_keys <- purrr::map_chr(accepted_display_names(candidate), name_key)
      acc_keys <- acc_keys[!is.na(acc_keys)]
      if (length(acc_keys) > 0 && any(acc_keys == search_key)) {
        add_choice(candidate, concept_for_candidate(candidate), "exact_accepted_concept_name", FALSE, NA_character_)
      }
    }

    # 3. Synonym-tab matches, ignoring authorship at the end of the name.
    for (candidate in candidates) {
      syn_names <- candidate_synonym_names(candidate)
      if (length(syn_names) > 0 && name_matches_ignoring_authorship(original_name, syn_names)) {
        matched_syn <- syn_names[which(purrr::map_lgl(syn_names, ~ name_matches_ignoring_authorship(original_name, .x)))[1]]
        add_choice(candidate, concept_for_candidate(candidate), "synonym_tab_match", TRUE, matched_syn)
      }

      # If any synonym usage itself has an accepted concept, use that accepted concept.
      if (!is.null(candidate$synonymUsages) && length(candidate$synonymUsages) > 0) {
        for (syn_usage in candidate$synonymUsages) {
          usage_names <- c(
            syn_usage$taxonName$fullName %||% NA_character_,
            syn_usage$taxonName$fullNameWithAuthorship %||% NA_character_
          )
          if (name_matches_ignoring_authorship(original_name, usage_names)) {
            matched_syn <- clean_text(usage_names)[!is.na(clean_text(usage_names))][1]
            add_choice(candidate, concept_for_synonym_usage(syn_usage), "synonym_usage_match", TRUE, matched_syn)
          }
        }
      }
    }

    # 4. Optional fuzzy fallback. Kept available, but the default search plan sets allow_fuzzy = FALSE to avoid subpar matches.
    if (allow_fuzzy && length(choices) == 0) {
      statuses <- purrr::map_chr(candidates, ~ .x$taxonomicStatus %||% NA_character_)
      idx <- which(statuses == "ACCEPTED")
      if (length(idx) == 0) idx <- seq_along(candidates)
      candidate <- candidates[[idx[1]]]
      add_choice(candidate, concept_for_candidate(candidate), "accepted_or_first_autocomplete_fallback", FALSE, NA_character_)
    }

    if (length(choices) == 0) {
      row <- make_empty_result(original_name)
      row$vicflora_search_name <- search_name
      row$search_hierarchy <- attempt_reason
      row$scrape_status <- "not_found"
      row$ss_var_aff <- final_ss_var_aff(original_name, attempt_reason, row$scrape_status)
      return(row)
    }

    for (choice in choices) {
      row <- build_result_row(
        original_name = original_name,
        search_name = search_name,
        attempt_reason = attempt_reason,
        candidate = choice$candidate,
        concept = choice$concept,
        match_type = choice$match_type,
        synonym_fallback_used = choice$synonym_fallback_used,
        matched_synonym_name = choice$matched_synonym_name
      )

      if (identical(row$scrape_status[[1]], "ok")) return(row)
      no_profile_rows[[length(no_profile_rows) + 1]] <- row
    }

    if (length(no_profile_rows) > 0) no_profile_rows[[1]] else {
      row <- make_empty_result(original_name)
      row$vicflora_search_name <- search_name
      row$search_hierarchy <- attempt_reason
      row$scrape_status <- "not_found"
      row$ss_var_aff <- final_ss_var_aff(original_name, attempt_reason, row$scrape_status)
      row
    }
  }, error = function(e) {
    row <- make_empty_result(original_name)
    row$vicflora_search_name <- search_name
    row$search_hierarchy <- attempt_reason
    row$scrape_status <- "error"
    row$ss_var_aff <- final_ss_var_aff(original_name, attempt_reason, row$scrape_status)
    row$error_message <- conditionMessage(e)
    row
  })

  Sys.sleep(request_pause_seconds)
  out
}

scrape_one_species <- function(query_name) {
  query_name <- clean_text(query_name)[1]
  result_if_all_fail <- make_empty_result(query_name)

  if (is_blank(query_name)) {
    result_if_all_fail$scrape_status <- "blank_name"
    result_if_all_fail$ss_var_aff <- final_ss_var_aff(query_name, NA_character_, result_if_all_fail$scrape_status)
    return(result_if_all_fail)
  }

  plan <- build_search_plan(query_name)
  attempt_summaries <- character()

  for (i in seq_len(nrow(plan))) {
    attempt <- plan[i, ]
    row <- lookup_one_search_name(
      original_name = query_name,
      search_name = attempt$search_name,
      attempt_reason = attempt$attempt_reason,
      allow_fuzzy = attempt$allow_fuzzy
    )

    attempt_summaries <- c(
      attempt_summaries,
      paste0(attempt$attempt_reason, " [", attempt$search_name, "]: ", row$scrape_status[[1]])
    )

    if (identical(row$scrape_status[[1]], "ok")) {
      row$search_hierarchy <- paste(attempt_summaries, collapse = " -> ")
      row$ss_var_aff <- final_ss_var_aff(query_name, attempt$attempt_reason, row$scrape_status[[1]])
      return(row)
    }

    # Keep the most informative failure, but make sure that an all-not_found species
    # returns not_found rather than the initial not_started row.
    if (failure_priority(row$scrape_status[[1]]) >= failure_priority(result_if_all_fail$scrape_status[[1]])) {
      result_if_all_fail <- row
    }
  }

  result_if_all_fail$search_hierarchy <- paste(attempt_summaries, collapse = " -> ")
  result_if_all_fail$ss_var_aff <- final_ss_var_aff(query_name, result_if_all_fail$search_hierarchy, result_if_all_fail$scrape_status[[1]])
  result_if_all_fail
}

if (exists("final_tbl")) {
  print(dplyr::count(final_tbl, scrape_status, ss_var_aff, sort = TRUE))
  cat("\nRows written:", nrow(final_tbl), "\n")
  cat("CSV output:", normalizePath(output_csv, mustWork = FALSE), "\n")
  cat("XLSX output:", normalizePath(output_xlsx, mustWork = FALSE), "\n")
}
## [1] "proj_scientific_name"
## [1] "vicflora_search_name"
vicflora_summary <- results_unique |>
  summarise(
    n_project_taxa = n(),
    n_vicflora_matched = sum(scrape_status %in% c("success", "matched", "ok"), na.rm = TRUE),
    n_vicflora_not_found = sum(scrape_status == "not_found", na.rm = TRUE),
    n_vicflora_no_profile = sum(scrape_status == "no_profile", na.rm = TRUE),
    n_with_description = sum(!is.na(species_description) & species_description != "", na.rm = TRUE),
    n_with_habitat_codes = sum(!is.na(bioregion_codes_api) & bioregion_codes_api != "", na.rm = TRUE),
    n_bryophytes = sum(plant_group == "bryophyte", na.rm = TRUE),
    n_flowering_plants = sum(plant_group == "flowering plant", na.rm = TRUE),
    n_other_or_unknown = sum(
      is.na(plant_group) |
        plant_group %in% c("unknown", "other vascular plant"),
      na.rm = TRUE
    )
  ) |>
  pivot_longer(
    cols = everything(),
    names_to = "summary_metric",
    values_to = "value"
  )

vicflora_summary |>
  knitr::kable(
    caption = "Summary of VicFlora scraping results"
  )

Summary of VicFlora scraping results

summary_metric         value
n_project_taxa           316
n_vicflora_matched       247
n_vicflora_not_found      46
n_vicflora_no_profile     23
n_with_description       247
n_with_habitat_codes     261
n_bryophytes              53
n_flowering_plants       212
n_other_or_unknown        51

1.2 Species that are not in VICFLORA

Because VicFlora focuses mostly on vascular plants, information on most of the bryophytes (mosses, liverworts and hornworts) has to be sourced elsewhere. The Atlas of Living Australia (ALA) is the next obvious choice, but it does not provide an accessible species description section, so most bryophyte descriptions may have to be compiled manually, species by species. Geographic range information is, however, available on ALA for most of these species, and it has been acquired in this phase.

2 Extracting traits information

2.1 Overview of the AusTraits database

AusTraits is a large, curated database of plant trait information for Australian plant species. A plant trait is a measurable characteristic of a species, such as growth form, plant height, lifespan, seed mass, leaf size, dispersal mode, or reproductive strategy. These traits are useful because they help describe how plants live, grow, reproduce, and respond to their environment.

The database brings together trait records from many different published and unpublished sources. This means that a single species can have multiple records for the same trait, sometimes from different studies, regions, or measurement methods. For this reason, the analysis keeps the original trait records as an audit table, but also creates a simplified species-level summary that can be joined back to the main species dataset.

Here, AusTraits is used to add ecological trait information, focusing on life-history traits, to the species list.

2.2 What the AusTraits functions do in this analysis

The AusTraits part of the workflow has four main purposes.

First, the script prepares the species names from the main results_unique table. These names come from the VicFlora scrape and have already been cleaned as much as possible. The script then creates matching names that can be compared with names in the AusTraits database.

Second, the script matches the project species list to AusTraits taxonomy. It first tries to match the cleaned project species name directly to the AusTraits taxon_name column. This is the preferred match because it preserves the most specific name available, including varieties or subspecies where they are present in AusTraits. If no exact taxon-name match is found, the script then uses a species-level name and performs a fuzzy match against the AusTraits binomial column. This helps recover likely matches when names differ slightly between sources.

Third, the script extracts only the traits of interest (16 traits listed below). Instead of using the full AusTraits database, which is very large (over 530 traits!), the code filters the data to keep only the project species and the chosen traits of interest.

Fourth, the script appends the trait information back to the main species table. Because AusTraits can contain several records for the same species and trait, the workflow keeps two outputs:

  • a long-format audit table, where every AusTraits record is retained separately;
  • a wide-format species table, where trait values are collapsed and appended to results_unique.

The wide table is easier to read because each species remains as one row. If a species has several values for the same trait, these values are combined into a single cell separated by semicolons. The long audit table is kept so that the original records can still be checked if needed.
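
The two-stage matching described in the second step can be sketched as follows. This is a minimal illustration on made-up names, not the project code: match_to_austraits(), the toy taxa_lookup table and the max_dist threshold are all assumptions, and base R's adist() (edit distance) stands in for whichever fuzzy matcher the workflow actually uses.

```r
# Hypothetical two-stage matcher: exact taxon_name first, fuzzy binomial second.
match_to_austraits <- function(project_name, taxa_lookup, max_dist = 2) {
  # Stage 1: an exact match keeps the most specific name (e.g. a variety).
  if (project_name %in% taxa_lookup$taxon_name) {
    return(data.frame(
      project_name = project_name,
      matched_name = project_name,
      austraits_match_type = "exact_taxon_name"
    ))
  }

  # Stage 2: reduce to "Genus species" and compare with the binomial column.
  binomial <- sub("^(\\S+\\s+\\S+).*$", "\\1", project_name)
  candidates <- unique(taxa_lookup$binomial)
  d <- as.vector(utils::adist(binomial, candidates))
  best <- which.min(d)

  if (length(best) == 1 && d[best] <= max_dist) {
    return(data.frame(
      project_name = project_name,
      matched_name = candidates[best],
      austraits_match_type = "fuzzy_binomial"
    ))
  }

  data.frame(
    project_name = project_name,
    matched_name = NA_character_,
    austraits_match_type = "no_match"
  )
}

# Toy lookup table (illustrative only, not real AusTraits content).
taxa_lookup <- data.frame(
  taxon_name = c("Acacia leprosa var. graveolens", "Acacia leprosa", "Caladenia vulgaris"),
  binomial   = c("Acacia leprosa", "Acacia leprosa", "Caladenia vulgaris")
)

exact_hit <- match_to_austraits("Acacia leprosa var. graveolens", taxa_lookup)
fuzzy_hit <- match_to_austraits("Caladenia vulgarus", taxa_lookup)  # deliberate misspelling
no_hit    <- match_to_austraits("Examplea fictitia", taxa_lookup)   # invented name
```

Keeping the match type next to each matched name makes it easy to review the fuzzy matches separately from the exact ones.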

2.3 Why the workflow uses saved outputs

Downloading or loading large biodiversity datasets can take time and may require an internet connection. To avoid repeating the same steps every time the R Markdown document is knitted, the workflow saves the AusTraits outputs as local files after they are created. Later, the document can reload these saved files directly, instead of contacting the online database again.

This makes the document faster, more reproducible, and less likely to fail because of temporary internet or server issues. The trait database contains over 530 plant traits. Because we are mainly interested in life-history traits, we extract trait information only for a specific subset of traits (a character vector called traits_of_interest) for the 316 poorly known species.
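
The save-then-reload pattern can be sketched as below. This is only an illustration under stated assumptions: load_or_build(), the cache path and the toy build_traits() function are hypothetical names, and the real workflow may store its outputs differently (for example as CSV or XLSX).

```r
# Hypothetical helper: run `compute` only when no cached copy exists.
load_or_build <- function(cache_file, compute) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  result <- compute()
  dir.create(dirname(cache_file), recursive = TRUE, showWarnings = FALSE)
  saveRDS(result, cache_file)
  result
}

# Demo with a temporary cache file and a counter that records how often
# the (normally slow) build step actually runs.
cache_file <- file.path(tempdir(), "austraits_cache", "trait_rows.rds")
unlink(cache_file)  # start from a clean state for the demo

n_builds <- 0
build_traits <- function() {
  n_builds <<- n_builds + 1
  data.frame(taxon_name = "Acacia leprosa", trait_name = "lifespan")
}

first  <- load_or_build(cache_file, build_traits)  # builds and saves
second <- load_or_build(cache_file, build_traits)  # reloads from disk
```

Deleting the cached file forces a fresh build, which is a simple way to refresh the data when the source database is updated.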

2.3.1 Extracting data by traits of interest.

Here we specify 16 plant traits: plant growth form, life history, flowering time, resprouting capacity, dispersers, reproductive maturity, vegetative reproduction ability, lifespan, recruitment time, post-fire recruitment, resprouting capacity (non-fire disturbance), resprouting capacity (proportion of individuals), seed viability, seedbank longevity, seedling establishment conditions, and fire exposure level.

For FFG nomination, there is a field called “Generation length”. Unfortunately, AusTraits does not have this particular trait. We may have to make assumptions about generation length based on reproductive maturity and lifespan (both of which are available).
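
One way such an assumption could be made explicit is sketched below. It uses an approximation that, as we understand it, is described in the IUCN Red List guidelines: generation length = age at first reproduction + z × length of the reproductive period. Everything here is an assumption for illustration: the helper names range_midpoint() and generation_length() are hypothetical, both traits are assumed to be in years, range values such as "5–20" are reduced to their midpoint, and z = 0.5 is a placeholder that would need to be chosen per taxon.

```r
# Hypothetical helper: turn an AusTraits-style range ("5–20", "2–", "2")
# into a single number by averaging whichever bounds are present.
range_midpoint <- function(x) {
  parts <- strsplit(x, "[–-]")[[1]]  # split on en dash or hyphen
  nums <- suppressWarnings(as.numeric(parts))
  nums <- nums[!is.na(nums)]
  if (length(nums) == 0) NA_real_ else mean(nums)
}

# Hypothetical approximation of generation length (years).
# z = 0.5 is a placeholder weight, not a recommended value.
generation_length <- function(reproductive_maturity, lifespan, z = 0.5) {
  first_reproduction <- range_midpoint(reproductive_maturity)
  max_age <- range_midpoint(lifespan)
  reproductive_period <- max(max_age - first_reproduction, 0)
  first_reproduction + z * reproductive_period
}

# Maturity 1–5 years (midpoint 3), lifespan 10–50 years (midpoint 30).
example_generation_length <- generation_length("1–5", "10–50")
```

With these inputs the sketch gives 3 + 0.5 × 27 = 16.5 years; the result is only as good as the midpoint and z assumptions behind it.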

full_list_of_traits <- summarise_database(austraits07, "trait_name")

traits_of_interest <- c(
  "plant_growth_form", "life_history", "flowering_time", "resprouting_capacity",
  "dispersers", "reproductive_maturity", "vegetative_reproduction_ability", "lifespan",
  "recruitment_time", "post_fire_recruitment",
  "resprouting_capacity_non_fire_disturbance",
  "resprouting_capacity_proportion_individuals",
  "seed_viability", "seedbank_longevity", "seedling_establishment_conditions",
  "fire_exposure_level"
)

traits_with_taxa <- austraits07$traits |>
  filter(trait_name %in% traits_of_interest) |>
  left_join(
    taxa_lookup |> select(taxon_name, binomial),
    by = "taxon_name"
  )
  
trait_rows_exact <- results_unique_matched |>
  filter(austraits_match_type == "exact_taxon_name") |>
  select(row_id, matched_taxon_name) |>
  inner_join(
    traits_with_taxa,
    by = c("matched_taxon_name" = "taxon_name")
  )

trait_rows_fuzzy <- results_unique_matched |>
  filter(austraits_match_type == "fuzzy_binomial") |>
  select(row_id, matched_binomial) |>
  inner_join(
    traits_with_taxa,
    by = c("matched_binomial" = "binomial")
  )

trait_rows_to_append <- bind_rows(
  trait_rows_exact,
  trait_rows_fuzzy
)

collapse_unique <- function(x) {
  x <- unique(na.omit(as.character(x)))
  if (length(x) == 0) NA_character_ else paste(x, collapse = "; ")
}

results_unique <- results_unique |>
  mutate(row_id = row_number())

trait_wide <- trait_rows_to_append |>
  mutate(
    value = as.character(value),
    unit = as.character(unit)
  ) |>
  group_by(row_id, trait_name) |>
  summarise(
    trait_value = collapse_unique(value),
    trait_unit = collapse_unique(unit),
    trait_source_id = collapse_unique(source_id),
    trait_dataset_id = collapse_unique(dataset_id),
    .groups = "drop"
  ) |>
  pivot_wider(
    names_from = trait_name,
    values_from = c(
      trait_value,
      trait_unit,
      trait_source_id,
      trait_dataset_id
    ),
    names_glue = "{trait_name}_{.value}"
  )

results_unique_with_traits <- results_unique |>
  left_join(trait_wide, by = "row_id")

austraits_summary <- tibble(
  summary_metric = c(
    "Project taxa",
    "Taxa with at least one selected AusTraits record",
    "Total AusTraits records extracted",
    "Selected traits found",
    "Selected traits requested"
  ),
  value = c(
    nrow(results_unique),
    trait_rows_to_append |> distinct(row_id) |> nrow(),
    nrow(trait_rows_to_append),
    trait_rows_to_append |> distinct(trait_name) |> nrow(),
    length(unique(traits_of_interest))
  )
)

austraits_summary |>
  knitr::kable(
    caption = "Summary of AusTraits trait extraction"
  )

trait_coverage_summary <- trait_rows_to_append |>
  group_by(trait_name) |>
  summarise(
    n_taxa = n_distinct(row_id),
    n_records = n(),
    n_unique_values = n_distinct(value, na.rm = TRUE),
    example_values = paste(
      head(unique(na.omit(as.character(value))), 5),
      collapse = "; "
    ),
    .groups = "drop"
  ) |>
  arrange(desc(n_taxa), trait_name)

trait_coverage_summary |>
  knitr::kable(
    caption = "Coverage of selected AusTraits traits"
  )

Coverage of selected AusTraits traits

trait_name                                   n_taxa  n_records  n_unique_values  example_values
plant_growth_form                               222       1419               20  shrub tree; shrub; tree; fern; graminoid herb
life_history                                    220        913               10  perennial; annual perennial; annual; annual short_lived_perennial; short_lived_perennial
lifespan                                        214        375               16  500–1000; 10–50; 0–1; 50–500; 50–100
reproductive_maturity                           214        337               11  5–20; 1–5; 0–1; 4–5; 2
seedbank_longevity                              214        337                6  2–; –2; 1–; 10–; 5–
vegetative_reproduction_ability                 214        352                2  vegetative; not_vegetative
resprouting_capacity                            212        563                6  fire_killed; resprouts; partial_resprouting; partial_resprouting resprouts; fire_killed partial_resprouting
dispersers                                      206        401               19  invertebrates; ants passive vertebrates; wind; abiotic animals; water
flowering_time                                  203        482               90  nnnnnnyyyynn; nnnnnnyyyyyn; nnnnnnnyyyyn; nnnnnnnnyynn; nnnnnnnyyynn
resprouting_capacity_non_fire_disturbance       117        161                1  resprouts_non_fire_disturbance
post_fire_recruitment                            46         69                2  post_fire_recruitment; post_fire_recruitment_absent
seedling_establishment_conditions                45         49                2  establish_anytime; establish_post_fire
resprouting_capacity_proportion_individuals       3          9                3  1; 0.8; 0.4
fire_exposure_level                               2          3                1  aquatic_taxon

3 Geographic range

This section speaks directly to the Geographic Range criterion of the nomination process, in particular the area of occupancy (AOO) and the extent of occurrence (EOO).
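
For orientation, EOO is usually measured as the area of the minimum convex polygon around all records, and AOO as the total area of occupied 2 km × 2 km grid cells. The block below is a minimal sketch of both, assuming the sf package, coordinates in WGS84, and the GDA94 / Australian Albers projection (EPSG:3577) for metric calculations. The helper name range_metrics() and the coordinates are hypothetical; real values should only be calculated from cleaned and audited records.

```r
library(sf)

# Hypothetical helper: EOO as convex-hull area, AOO as occupied grid cells.
range_metrics <- function(lon, lat, cell_m = 2000) {
  pts <- st_as_sf(
    data.frame(lon = lon, lat = lat),
    coords = c("lon", "lat"),
    crs = 4326
  )
  # Project to an equal-area CRS in metres before measuring anything.
  pts <- st_transform(pts, 3577)

  # EOO: area of the convex hull around all points (0 for fewer than 3 points).
  hull <- st_convex_hull(st_union(pts))
  eoo_km2 <- as.numeric(st_area(hull)) / 1e6

  # AOO: number of distinct grid cells containing a record, times the cell area.
  xy <- st_coordinates(pts)[, c("X", "Y"), drop = FALSE]
  n_cells <- nrow(unique(floor(xy / cell_m)))
  aoo_km2 <- n_cells * (cell_m / 1000)^2

  data.frame(n_records = length(lon), eoo_km2 = eoo_km2, aoo_km2 = aoo_km2)
}

# Three made-up, well-separated points and one duplicated point.
spread_out <- range_metrics(c(144.9, 145.0, 145.1), c(-37.8, -37.7, -37.9))
single_site <- range_metrics(c(145.0, 145.0), c(-37.8, -37.8))
```

Note that the AOO from a fixed grid depends on where the grid origin falls, so the figure is an estimate that can shift slightly with a different grid placement.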

3.1 ALA with the galah package

galah is the R package used to extract data from the ALA, which is the primary resource we use to determine geographic range and species distribution. Here, species records are restricted to Victoria (this can be changed if the project is expanded to a continental analysis).

3.1.1 ALA taxon matching helper functions

The taxon matching helper functions create a hierarchy of candidate names for each project species, following the same logic as the cleaned VicFlora search names described above.

For each candidate name, the script uses galah::search_taxa() to query the ALA taxonomic backbone. The returned matches are filtered to retain plant taxa and to avoid broad higher-level matches such as genera or families. This prevents the script from accidentally downloading all records for a whole genus when a species-level name cannot be resolved.

When several possible matches are returned, the helper function prioritises exact scientific-name matches and exact ALA match types. The best match is then stored with the original project name, the candidate name used for the successful query, the resolved ALA scientific name, taxon concept identifier, taxon rank, and match type. Species that cannot be resolved are retained in the output with a not_found status, allowing them to be reviewed manually.

This matching step produces an auditable table of taxonomic decisions before occurrence records are downloaded. This is useful because it separates uncertainty in name resolution from uncertainty in occurrence data, and makes it possible to check which species were matched exactly, which required simplified fallback names, and which were not found in ALA.
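
The filtering and prioritisation described above can be illustrated on a mock search result. This is a sketch only: mock_matches is an invented table (including the made-up name "Caladenia vulgata"), pick_best_match() is a hypothetical helper, and the columns returned by a real taxon search may differ.

```r
library(dplyr)
library(stringr)

# Invented stand-in for a taxon-search result.
mock_matches <- tibble(
  scientific_name = c("Caladenia", "Caladenia vulgata", "Caladenia vulgaris"),
  rank = c("genus", "species", "species"),
  match_type = c("exactMatch", "fuzzyMatch", "exactMatch")
)

# Hypothetical helper: drop higher-rank matches, then prefer exact names
# and exact match types.
pick_best_match <- function(matches, query_name) {
  higher_ranks <- c("kingdom", "phylum", "class", "order", "family", "genus")

  matches |>
    filter(is.na(rank) | !str_to_lower(rank) %in% higher_ranks) |>
    mutate(
      exact_name = str_to_lower(scientific_name) == str_to_lower(query_name),
      exact_match_type = str_detect(str_to_lower(match_type), "exact")
    ) |>
    arrange(desc(exact_name), desc(exact_match_type)) |>
    slice(1)
}

best <- pick_best_match(mock_matches, "Caladenia vulgaris")

# A genus-only result yields no usable match, so nothing is downloaded for it.
genus_only <- pick_best_match(mock_matches[1, ], "Caladenia")
```

Returning zero rows for a genus-only result is what lets the caller move on to the next candidate name instead of pulling records for the whole genus.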

victoria_exact <- victoria_sf |>
  st_make_valid() |>
  st_transform(4326) |>
  st_zm(drop = TRUE, what = "ZM")

victoria_exact <- st_as_sf(
  st_sfc(
    st_union(st_geometry(victoria_exact)),
    crs = st_crs(victoria_exact)
  )
)

# The exact polygon crop happens locally after download.
victoria_query_area <- victoria_exact

results_unique_ala_base <- results_unique_with_traits |>
  mutate(
    ala_row_id = row_number(),
    ala_input_name = as.character(.data[[species_col]]),
    ala_input_name = str_squish(ala_input_name)
  )

species_to_resolve <- results_unique_ala_base |>
  filter(!is.na(ala_input_name), ala_input_name != "") |>
  distinct(ala_input_name) |>
  mutate(ala_species_id = row_number())

# ------------------------------------------------------------
# 5. Candidate names for ALA search
# ------------------------------------------------------------
# Since vicflora_search_name is already cleaned, the exact value is tried first.
# Bracket removal and aff./var./s.s. handling are only fallback candidates.

make_ala_candidates <- function(x) {
  original <- str_squish(x)

  no_brackets <- original |>
    str_remove("\\s*\\([^)]*\\)\\s*$") |>
    str_squish()

  no_ss <- no_brackets |>
    str_remove("\\s+s\\.?s\\.?\\s*$") |>
    str_remove("\\s+ss\\s*$") |>
    str_squish()

  # Caladenia aff. vulgaris -> Caladenia vulgaris
  aff_name <- no_ss |>
    str_replace(
      "^([A-Z][A-Za-z-]+)\\s+(?:sp\\.?\\s*)?aff\\.?\\s+([a-z][a-z-]+).*$",
      "\\1 \\2"
    ) |>
    str_squish()

  # Acacia leprosa var. graveolens -> Acacia leprosa
  species_level <- no_ss |>
    str_replace(
      "^([A-Z][A-Za-z-]+\\s+[a-z][a-z-]+)\\s+(var\\.|subsp\\.|ssp\\.).*$",
      "\\1"
    ) |>
    str_squish()

  # Arthropodium sp. 2 (greenish flowers) -> Arthropodium sp. 2
  sp_number <- str_extract(
    no_brackets,
    "^[A-Z][A-Za-z-]+\\s+sp\\.?\\s*\\d+"
  )

  candidates <- c(
    original,
    no_brackets,
    no_ss,
    aff_name,
    sp_number,
    species_level
  )

  candidates |>
    unique() |>
    discard(~ is.na(.x) || .x == "")
}

# ------------------------------------------------------------
# 6. Helpers for robust ALA taxon matching
# ------------------------------------------------------------

find_col <- function(df, possible_names) {
  hit <- intersect(possible_names, names(df))
  if (length(hit) == 0) NA_character_ else hit[1]
}

safe_pull <- function(df, possible_names, default = NA_character_) {
  col <- find_col(df, possible_names)
  if (is.na(col)) {
    rep(default, nrow(df))
  } else {
    as.character(df[[col]])
  }
}

resolve_ala_taxon <- function(ala_input_name, ala_species_id) {
  candidates <- make_ala_candidates(ala_input_name)

  for (i in seq_along(candidates)) {
    query_name <- candidates[[i]]

    tax <- tryCatch(
      galah::search_taxa(query_name),
      error = function(e) NULL
    )

    if (is.null(tax) || nrow(tax) == 0) next

    tax <- as_tibble(tax)

    sci_col <- find_col(
      tax,
      c("scientific_name", "scientificName", "taxon_name", "name")
    )

    if (is.na(sci_col)) next

    tax <- tax |>
      mutate(
        .scientific_name = as.character(.data[[sci_col]]),
        .kingdom = safe_pull(tax, c("kingdom", "kingdom_name", "kingdomName")),
        .rank = safe_pull(tax, c("rank", "taxon_rank", "taxonRank")),
        .match_type = safe_pull(tax, c("match_type", "matchType"))
      )

    # Keep plant records when kingdom is supplied
    tax <- tax |>
      filter(
        is.na(.kingdom) |
          .kingdom == "" |
          str_to_lower(.kingdom) == "plantae"
      )

    # Avoid downloading whole genera/families when species-level matching fails
    tax <- tax |>
      filter(
        is.na(.rank) |
          !str_to_lower(.rank) %in% c(
            "kingdom", "phylum", "class", "order", "family", "genus"
          )
      )

    if (nrow(tax) == 0) next

    tax_best <- tax |>
      mutate(
        .exact_name = str_to_lower(.scientific_name) == str_to_lower(query_name),
        .exact_match_type = str_detect(str_to_lower(.match_type), "exact")
      ) |>
      arrange(
        desc(.exact_name),
        desc(.exact_match_type)
      ) |>
      slice(1)

    taxon_id_col <- find_col(
      tax_best,
      c("taxon_concept_id", "taxonConceptID", "taxonConceptId", "guid", "taxon_id")
    )

    return(tibble(
      ala_species_id = ala_species_id,
      ala_input_name = ala_input_name,
      ala_query_name = query_name,
      ala_candidate_number = i,
      ala_scientific_name = tax_best$.scientific_name[1],
      ala_taxon_concept_id = if (!is.na(taxon_id_col)) as.character(tax_best[[taxon_id_col]][1]) else NA_character_,
      ala_taxon_rank = tax_best$.rank[1],
      ala_match_type = tax_best$.match_type[1],
      ala_resolution_status = "matched"
    ))
  }

  tibble(
    ala_species_id = ala_species_id,
    ala_input_name = ala_input_name,
    ala_query_name = NA_character_,
    ala_candidate_number = NA_integer_,
    ala_scientific_name = NA_character_,
    ala_taxon_concept_id = NA_character_,
    ala_taxon_rank = NA_character_,
    ala_match_type = NA_character_,
    ala_resolution_status = "not_found"
  )
}

taxon_matches <- map2_dfr(
  species_to_resolve$ala_input_name,
  species_to_resolve$ala_species_id,
  resolve_ala_taxon
)

write_csv(
  taxon_matches,
  file.path(out_dir, "ala_taxon_matches.csv")
)

# Quick check
taxon_matches |>
  count(ala_resolution_status, ala_match_type)

# ------------------------------------------------------------
# 7. Helpers for occurrence downloads and exact Victoria crop
# ------------------------------------------------------------

safe_slug <- function(x) {
  x |>
    str_replace_all("[^A-Za-z0-9]+", "_") |>
    str_replace_all("^_|_$", "") |>
    str_to_lower()
}

get_count_value <- function(x) {
  if ("count" %in% names(x)) return(as.integer(x$count[1]))
  if ("totalRecords" %in% names(x)) return(as.integer(x$totalRecords[1]))
  NA_integer_
}

crop_occurrences_to_victoria <- function(occ, victoria_exact) {
  if (is.null(occ) || nrow(occ) == 0) {
    return(occ)
  }

  lon_col <- intersect(
    c("decimalLongitude", "decimal_longitude", "longitude"),
    names(occ)
  )[1]

  lat_col <- intersect(
    c("decimalLatitude", "decimal_latitude", "latitude"),
    names(occ)
  )[1]

  if (is.na(lon_col) || is.na(lat_col)) {
    warning("No longitude/latitude columns found. Returning uncropped records.")
    return(occ)
  }

  occ_sf <- occ |>
    filter(
      !is.na(.data[[lon_col]]),
      !is.na(.data[[lat_col]])
    ) |>
    st_as_sf(
      coords = c(lon_col, lat_col),
      crs = 4326,
      remove = FALSE
    )

  occ_vic <- occ_sf |>
    st_filter(victoria_exact, .predicate = st_intersects)

  occ_vic |>
    st_drop_geometry()
}

download_one_ala_species <- function(match_row,
                                     victoria_query_area,
                                     victoria_exact,
                                     out_dir,
                                     pause_seconds = 0.5) {

  if (match_row$ala_resolution_status != "matched") {
    return(tibble(
      ala_species_id = match_row$ala_species_id,
      ala_input_name = match_row$ala_input_name,
      ala_scientific_name = match_row$ala_scientific_name,
      download_status = "taxon_not_found",
      n_records_bbox = NA_integer_,
      n_records_victoria = NA_integer_,
      error_message = NA_character_
    ))
  }

  slug <- safe_slug(paste0(
    match_row$ala_species_id, "_",
    match_row$ala_input_name
  ))

  rds_file <- file.path(out_dir, paste0(slug, "_occurrences.rds"))
  csv_file <- file.path(out_dir, paste0(slug, "_occurrences.csv"))

  # Resume behaviour: skip if already downloaded
  if (file.exists(rds_file)) {
    occ <- readRDS(rds_file)

    return(tibble(
      ala_species_id = match_row$ala_species_id,
      ala_input_name = match_row$ala_input_name,
      ala_scientific_name = match_row$ala_scientific_name,
      download_status = "already_downloaded",
      n_records_bbox = NA_integer_,
      n_records_victoria = nrow(occ),
      error_message = NA_character_
    ))
  }

result <- tryCatch({

  base_query <- galah::request_data("occurrences") |>
    galah::identify(match_row$ala_scientific_name) |>
    galah::geolocate(victoria_query_area, type = "bbox")

  if (!inherits(base_query, "data_request")) {
    stop("base_query is not a galah data_request. Check galah query construction.")
  }

  bbox_count <- base_query |>
    dplyr::count() |>
    dplyr::collect() |>
    get_count_value()

  if (is.na(bbox_count) || bbox_count == 0) {
    return(tibble(
      ala_species_id = match_row$ala_species_id,
      ala_input_name = match_row$ala_input_name,
      ala_scientific_name = match_row$ala_scientific_name,
      download_status = "zero_records_bbox",
      n_records_bbox = bbox_count,
      n_records_victoria = 0L,
      error_message = NA_character_
    ))
  }

  occ_raw <- base_query |>
    dplyr::select(group = "basic") |>
    dplyr::collect()

  occ_vic <- crop_occurrences_to_victoria(
    occ = occ_raw,
    victoria_exact = victoria_exact
  )

  if (nrow(occ_vic) == 0) {
    saveRDS(occ_vic, rds_file)
    write_csv(occ_vic, csv_file)

    return(tibble(
      ala_species_id = match_row$ala_species_id,
      ala_input_name = match_row$ala_input_name,
      ala_scientific_name = match_row$ala_scientific_name,
      download_status = "zero_records_after_victoria_crop",
      n_records_bbox = bbox_count,
      n_records_victoria = 0L,
      error_message = NA_character_
    ))
  }

  occ_vic <- occ_vic |>
    mutate(
      ala_species_id = match_row$ala_species_id,
      ala_input_name = match_row$ala_input_name,
      ala_query_name = match_row$ala_query_name,
      ala_scientific_name_resolved = match_row$ala_scientific_name,
      ala_taxon_concept_id_resolved = match_row$ala_taxon_concept_id,
      ala_taxon_rank_resolved = match_row$ala_taxon_rank,
      ala_match_type = match_row$ala_match_type,
      ala_candidate_number = match_row$ala_candidate_number,
      .before = 1
    )

  saveRDS(occ_vic, rds_file)
  write_csv(occ_vic, csv_file)

  Sys.sleep(pause_seconds)

  tibble(
    ala_species_id = match_row$ala_species_id,
    ala_input_name = match_row$ala_input_name,
    ala_scientific_name = match_row$ala_scientific_name,
    download_status = "downloaded",
    n_records_bbox = bbox_count,
    n_records_victoria = nrow(occ_vic),
    error_message = NA_character_
  )

}, error = function(e) {
  tibble(
    ala_species_id = match_row$ala_species_id,
    ala_input_name = match_row$ala_input_name,
    ala_scientific_name = match_row$ala_scientific_name,
    download_status = "error",
    n_records_bbox = NA_integer_,
    n_records_victoria = NA_integer_,
    error_message = conditionMessage(e)
  )
})

  # Return the result explicitly; a bare assignment would return it invisibly.
  result
}

ala_summary <- tibble(
  summary_metric = c(
    "Taxa submitted to ALA",
    "Taxa matched to ALA taxonomy",
    "Taxa not found in ALA taxonomy",
    "Taxa with Victoria records",
    "Total Victoria occurrence records"
  ),
  value = c(
    taxon_matches |> distinct(ala_input_name) |> nrow(),
    taxon_matches |> filter(ala_resolution_status == "matched") |> distinct(ala_input_name) |> nrow(),
    taxon_matches |> filter(ala_resolution_status == "not_found") |> distinct(ala_input_name) |> nrow(),
    download_log |> filter(n_records_victoria > 0) |> distinct(ala_input_name) |> nrow(),
    if (exists("ala_occurrences_victoria")) nrow(ala_occurrences_victoria) else NA_integer_
  )
)

ala_summary |>
  knitr::kable(
    caption = "Summary of ALA taxon matching and Victoria occurrence downloads"
  )

Summary of ALA taxon matching and Victoria occurrence downloads

summary_metric                      value
Taxa submitted to ALA                 313
Taxa matched to ALA taxonomy          308
Taxa not found in ALA taxonomy          5
Taxa with Victoria records            294
Total Victoria occurrence records  123525

ala_download_status_summary <- download_log |>
  count(download_status, name = "n_taxa") |>
  left_join(
    download_log |>
      group_by(download_status) |>
      summarise(
        total_records_victoria = sum(n_records_victoria, na.rm = TRUE),
        .groups = "drop"
      ),
    by = "download_status"
  ) |>
  arrange(desc(n_taxa))

ala_download_status_summary |>
  knitr::kable(
    caption = "ALA download status summary"
  )

ALA download status summary

download_status                   n_taxa  total_records_victoria
downloaded                           294                  123525
zero_records_bbox                     12                       0
taxon_not_found                        5                       0
zero_records_after_victoria_crop       2                       0

3.1.2 Data compilation inventory

The summary table brings together all the information gathered from VicFlora, the AusTraits database and ALA. It has been exported as a separate .xlsx file because of its wide format.

4 Steps for next project phase

4.1 Threats informed by biogeographical regions

This information can be inferred from land-use mapping or other spatial layers related to land degradation. Threats layers from the SMP project could be an option.

4.2 Population

Threatened status is not often determined by population size, and information on population status is very difficult to gather. The only viable options are asking experts for their best estimates and examining herbarium specimen notes. David Cameron has collected this information over time, so digitising his notes and assigning the relevant information to each species should be the primary focus for acquiring population-level information. There will still be gaps after this time-consuming process. The alternative is to build a statistical model that estimates the probability of a species occurring in a plot; combined with the dimensions of the plant, this can feed into population estimates (e.g. a tree might be one individual per plot).

4.3 Translocation

This won’t be relevant to many poorly known species.

4.4 Subpopulations

Primarily relevant only if there is a single population. This rarely determines the outcome of an assessment.

4.5 Fragmentation

If a species occurs in a fragmented landscape, its population is likely to be highly fragmented.

4.6 Reduction

This is the most important criterion. As with Threats, it can be inferred from the habitat of the taxon. More specifically, the Bioregional Conservation Status of Ecological Vegetation Classes (EVCs) gives the amount of decline and/or reduction in habitat, which can be linked to a species through the habitat given in its description. It essentially compares pre-1750 habitat with 2005 habitat.
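
The comparison amounts to a simple percentage: reduction = 100 × (pre-1750 extent − 2005 extent) / pre-1750 extent. A minimal sketch with made-up numbers follows; habitat_reduction_pct(), the column names and the extents are hypothetical and do not come from any EVC dataset.

```r
# Hypothetical helper: percentage of habitat lost between two dates.
habitat_reduction_pct <- function(extent_pre1750, extent_2005) {
  stopifnot(all(extent_pre1750 > 0), all(extent_2005 >= 0))
  100 * (extent_pre1750 - extent_2005) / extent_pre1750
}

# Made-up extents (hectares) for two unnamed habitat types.
evc_extent <- data.frame(
  habitat = c("habitat_a", "habitat_b"),
  extent_pre1750_ha = c(12000, 800),
  extent_2005_ha = c(3000, 600)
)

evc_extent$reduction_pct <- habitat_reduction_pct(
  evc_extent$extent_pre1750_ha,
  evc_extent$extent_2005_ha
)
```

In this toy table the first habitat has lost 75% of its extent and the second 25%. A species tied to several habitat types would need these figures weighted or combined in some agreed way.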

4.7 Survey Effort

Never contributes to the assessment outcome.

4.8 Very Restricted

A species is very restricted if its geographic range comprises <5 locations or covers <20 km². These can be calculated after the ALA extract has been cleaned and audited.

4.9 Locations

This criterion is complex to determine. It will involve combining the prevailing threats with the geographic range of those threats (e.g. roadsides, tenure, historic land use, current land use).