We start the compilation by extracting species description and habitat code from VicFlora website. For this, we use the 316 species list to create queries from VicFlora. The section below writes an output table containing the available species descriptions, habitat or bioregion codes, match diagnostics, synonym fallback details.
The extract follows this hierarchy of rationales: 1. For ordinary
names, var., subsp., aff., and
names with bracketed notes, try the original full name first 2. For
s.s. / sensu stricto names, always query the name before
the s.s. marker 3. If a parenthetical-note name fails,
remove the trailing bracketed note and try again, for example
Cardamine tenuifolia (small-flower form) →
Cardamine tenuifolia 4. For named sp. X
records, fall back to the named record without the bracketed descriptor,
for example Arthropodium sp. 2 (greenish flowers) →
Arthropodium sp. 2 5. If a var. /
subsp. full-name match fails or has no profile, fall back
to the parent species 6. If an aff. full-name match fails
or has no profile, query Genus + text after aff., for
example Caladenia aff. vulgaris (Aireys Inlet) →
Caladenia vulgaris (Aireys Inlet), then
Caladenia vulgaris 7. If the result is
not_found or no_profile, inspect VicFlora
synonym fields and match names while ignoring authorship at the end of
the scientific name 8. Add plant_group from VicFlora higher
classification: bryophyte, flowering plant,
other vascular plant, or unknown
# -----------------------------
# Text helpers
# -----------------------------
clean_text <- function(x) {
if (is.null(x) || length(x) == 0) return(NA_character_)
x <- as.character(x)
x[is.na(x)] <- NA_character_
x <- stringr::str_replace_all(x, "\\u00A0", " ")
x <- stringr::str_replace_all(x, "\u00A0", " ")
x <- stringr::str_replace_all(x, "[[:space:]]+", " ")
x <- stringr::str_squish(x)
x[!is.na(x) & !nzchar(x)] <- NA_character_
x
}
name_key <- function(x) {
x <- clean_text(x)
x <- x[1]
if (length(x) == 0 || is.na(x)) {
return(NA_character_)
}
x <- enc2utf8(x)
# Replace hybrid symbol × with plain x
x <- stringr::str_replace_all(x, "\u00D7", "x")
x <- stringr::str_replace_all(x, "[[:punct:]]+", " ")
x <- stringr::str_replace_all(x, "[[:space:]]+", " ")
stringr::str_to_lower(stringr::str_squish(x))
}
different_name_key <- function(a, b) {
a_key <- name_key(a)
b_key <- name_key(b)
if (is.na(a_key) || is.na(b_key)) return(TRUE)
!identical(a_key, b_key)
}
first_sentence <- function(x) {
x <- clean_text(x)
if (length(x) == 0 || is.na(x[1])) return(NA_character_)
x <- x[1]
pieces <- stringi::stri_split_boundaries(x, type = "sentence")[[1]]
pieces <- clean_text(pieces)
pieces <- pieces[!is.na(pieces)]
if (length(pieces) == 0) x else pieces[1]
}
profile_to_paragraphs <- function(profile_html) {
if (is.null(profile_html) || length(profile_html) == 0 || is.na(profile_html) || !nzchar(profile_html)) {
return(character())
}
# VicFlora profile is HTML. Wrap in a body tag so xml2 can parse fragments safely.
doc <- tryCatch(
xml2::read_html(paste0("<html><body>", profile_html, "</body></html>")),
error = function(e) NULL
)
if (is.null(doc)) {
text <- clean_text(profile_html)
return(if (is.na(text)) character() else text)
}
paragraphs <- rvest::html_elements(doc, "p") |> rvest::html_text2()
# Fallback for profiles that do not use <p> tags.
if (length(paragraphs) == 0) {
text <- rvest::html_text2(doc)
paragraphs <- stringr::str_split(text, "\\n\\s*\\n")[[1]]
}
paragraphs <- purrr::map_chr(paragraphs, clean_text)
paragraphs <- paragraphs[!is.na(paragraphs)]
paragraphs
}
extract_codes_from_treatment <- function(second_paragraph) {
second_paragraph <- clean_text(second_paragraph)
if (is.na(second_paragraph)) return(NA_character_)
# Backup heuristic. This catches short region-style codes such as EGL, HSF, HNF, VAlp, WPro, GipP.
candidates <- stringr::str_extract_all(second_paragraph, "\\b[A-Z][A-Za-z]{1,4}\\b")[[1]]
exclude <- c(
month.abb, "Sept", "NSW", "ACT", "QLD", "Qld", "SA", "WA", "NT", "TAS", "Tas", "VIC", "Vic",
"Also", "Flowers", "Flower", "Fruit", "Fruits", "Mostly", "Grows", "Occurs", "Known", "Rare", "Common",
"New", "South", "Wales", "Victoria", "Victorian", "Australian", "Australia", "Plants", "Plant",
"East", "West", "North", "South", "Central", "Mt", "Mount", "River", "Creek", "Lake", "Near"
)
candidates <- candidates[!candidates %in% exclude]
candidates <- candidates[stringr::str_detect(candidates, "^([A-Z]{2,5}|[A-Z][a-z]{1,3}[A-Z]|[A-Z][a-z]?[A-Z][a-z]{0,2})$")]
candidates <- unique(candidates)
if (length(candidates) == 0) NA_character_ else paste(candidates, collapse = ", ")
}
collapse_bioregion_codes <- function(bioregions) {
if (is.null(bioregions) || length(bioregions) == 0) return(NA_character_)
codes <- purrr::map_chr(bioregions, ~ .x$bioregionCode %||% NA_character_)
codes <- clean_text(codes)
codes <- codes[!is.na(codes)]
if (length(codes) == 0) NA_character_ else paste(unique(codes), collapse = ", ")
}
classification_names <- function(concept) {
if (is.null(concept) || length(concept) == 0) return(character())
names <- c(
concept$taxonName$fullName %||% NA_character_,
concept$taxonName$fullNameWithAuthorship %||% NA_character_
)
hc <- concept$higherClassification
if (!is.null(hc) && length(hc) > 0) {
names <- c(
names,
purrr::map_chr(hc, ~ .x$taxonName$fullName %||% NA_character_),
purrr::map_chr(hc, ~ .x$taxonName$fullNameWithAuthorship %||% NA_character_)
)
}
names <- clean_text(names)
unique(names[!is.na(names)])
}
collapse_higher_classification <- function(concept) {
hc <- concept$higherClassification
if (is.null(hc) || length(hc) == 0) return(NA_character_)
pieces <- purrr::map_chr(hc, function(x) {
nm <- clean_text(x$taxonName$fullName %||% NA_character_)[1]
rk <- clean_text(x$taxonRank %||% NA_character_)[1]
if (is.na(nm)) return(NA_character_)
if (is.na(rk)) nm else paste0(rk, ": ", nm)
})
pieces <- pieces[!is.na(pieces)]
if (length(pieces) == 0) NA_character_ else paste(unique(pieces), collapse = " | ")
}
infer_plant_group <- function(concept) {
# Uses VicFlora's higherClassification to separate bryophytes from flowering plants.
# Keeps an audit trail in higher_classification so borderline cases can be checked.
names <- classification_names(concept)
if (length(names) == 0) return(NA_character_)
keys <- stringr::str_to_lower(names)
bryophyte_patterns <- c(
"bryophyta", "marchantiophyta", "anthocerotophyta",
"bryopsida", "sphagnopsida", "polytrichopsida",
"jungermanniopsida", "marchantiopsida", "anthocerotopsida",
"moss", "liverwort", "hornwort"
)
flowering_patterns <- c(
"magnoliophyta", "angiosperm", "flowering plant",
"magnoliopsida", "liliopsida", "monocot", "dicot", "eudicot",
"rosids", "asterids", "commelinids", "magnoliids"
)
other_vascular_patterns <- c(
"pteridophyta", "polypodiophyta", "lycopodiophyta", "pinophyta",
"cycadophyta", "ginkgophyta", "gnetophyta", "gymnosperm",
"fern", "clubmoss", "quillwort", "conifer"
)
has_pattern <- function(patterns) {
any(purrr::map_lgl(patterns, function(pat) {
any(stringr::str_detect(keys, stringr::fixed(pat, ignore_case = TRUE)))
}))
}
if (has_pattern(bryophyte_patterns)) return("bryophyte")
if (has_pattern(flowering_patterns)) return("flowering plant")
if (has_pattern(other_vascular_patterns)) return("other vascular plant")
"unknown"
}
For most of the 316 species, exact taxonomic name matches were found in Vicflora. But when species were matched to their closest affinities, species level descriptions applied to subspecies, or species information applied to variations, the match type was recorded. Examples of match types are recorded:
# -----------------------------
# Scientific-name parsing helpers
# -----------------------------
rank_regex <- "(?:subsp\\.?|ssp\\.?|var\\.?|subvar\\.?|forma|f\\.?)"
species_binomial <- function(x) {
x <- clean_text(x)
x <- x[1]
if (length(x) == 0 || is.na(x)) return(NA_character_)
x <- stringr::str_replace_all(x, "x", "x")
m <- stringr::str_match(x, "^([A-Z][A-Za-z-]+\\s+(?:x\\s+)?[a-z][A-Za-z-]+)\\b")
if (!is.na(m[1, 2])) clean_text(m[1, 2]) else NA_character_
}
remove_ss_marker <- function(x) {
x <- clean_text(x)
x <- x[1]
if (length(x) == 0 || is.na(x)) return(NA_character_)
x <- stringr::str_replace(
x,
stringr::regex("\\s+(s\\.?\\s*s\\.?|ss|sensu\\s+stricto)(\\s+.*|$)", ignore_case = TRUE),
""
)
clean_text(x)
}
botanical_name_without_authorship <- function(x) {
x <- clean_text(x)
x <- x[1]
if (length(x) == 0 || is.na(x)) return(NA_character_)
x <- stringr::str_replace_all(x, "x", "x")
x <- remove_ss_marker(x)
# Named forms like Geranium sp. 3 or Alternanthera sp. 1.
m_sp_number <- stringr::str_match(
x,
stringr::regex("^([A-Z][A-Za-z-]+\\s+sp\\.?\\s*[0-9A-Za-z-]+)\\b", ignore_case = FALSE)
)
if (!is.na(m_sp_number[1, 2])) return(clean_text(m_sp_number[1, 2]))
# Infraspecific rank. Keep only Genus species rank epithet, dropping authorship after that.
infra_pat <- paste0("^([A-Z][A-Za-z-]+\\s+(?:x\\s+)?[a-z][A-Za-z-]+\\s+", rank_regex, "\\s+(?:[a-z][A-Za-z-]+|[0-9]+))\\b")
m_infra <- stringr::str_match(x, stringr::regex(infra_pat, ignore_case = FALSE))
if (!is.na(m_infra[1, 2])) return(clean_text(m_infra[1, 2]))
# Ordinary binomial.
species_binomial(x)
}
authorless_key <- function(x) {
name_key(botanical_name_without_authorship(x))
}
name_matches_ignoring_authorship <- function(query_name, candidate_names) {
q_key <- authorless_key(query_name)
if (is.na(q_key)) return(FALSE)
cand_keys <- purrr::map_chr(candidate_names, authorless_key)
cand_keys <- cand_keys[!is.na(cand_keys)]
any(cand_keys == q_key)
}
detect_name_flags <- function(original_name) {
x <- clean_text(original_name)
x <- x[1]
if (length(x) == 0 || is.na(x)) return(list(flags = NA_character_, has_ss = FALSE, has_var = FALSE, has_aff = FALSE))
has_ss <- stringr::str_detect(x, stringr::regex("(^|\\s)(s\\.?\\s*s\\.?|ss|sensu\\s+stricto)(\\s|$)", ignore_case = TRUE))
has_aff <- stringr::str_detect(x, stringr::regex("(^|\\s)aff\\.?(\\s|$)", ignore_case = TRUE))
has_var <- stringr::str_detect(x, stringr::regex(paste0("(^|\\s)", rank_regex, "(\\s|$)"), ignore_case = TRUE))
flags <- c()
if (has_ss) flags <- c(flags, "ss")
if (has_var) flags <- c(flags, "var")
if (has_aff) flags <- c(flags, "aff")
list(
flags = if (length(flags) == 0) "exact match" else paste(flags, collapse = "; "),
has_ss = has_ss,
has_var = has_var,
has_aff = has_aff
)
}
final_ss_var_aff <- function(original_name, attempt_reason, scrape_status) {
status <- clean_text(scrape_status)[1]
if (is.na(status)) return(NA_character_)
if (!identical(status, "ok")) return(status)
reason <- clean_text(attempt_reason)[1]
if (is.na(reason)) return("exact match")
if (identical(reason, "ss_name_before_marker")) return("ss")
if (stringr::str_detect(reason, "^aff_genus_plus_text_after_aff")) return("aff")
if (identical(reason, "var_or_subsp_parent_species_fallback")) return("var")
"exact match"
}
failure_priority <- function(status) {
status <- clean_text(status)[1]
if (is.na(status)) return(0L)
dplyr::case_when(
status == "not_started" ~ 0L,
status == "blank_name" ~ 0L,
status == "not_found" ~ 1L,
status == "error" ~ 2L,
status == "no_profile" ~ 3L,
TRUE ~ 1L
)
}
aff_target_name <- function(original_name, keep_parenthetical = TRUE) {
x <- clean_text(original_name)
x <- x[1]
if (length(x) == 0 || is.na(x)) return(NA_character_)
# Keep the genus and replace the aff. phrase with the text after aff.
# Examples:
# Caladenia aff. vulgaris (Aireys Inlet) -> Caladenia vulgaris (Aireys Inlet)
# Caladenia sp. aff. iridescens (Chapple Vale) -> Caladenia iridescens (Chapple Vale)
# Geranium aff. sp. 3 -> Geranium sp. 3
m <- stringr::str_match(
x,
stringr::regex("^([A-Z][A-Za-z-]+)\\b.*?\\baff\\.?\\s+(.+)$", ignore_case = TRUE)
)
if (is.na(m[1, 2]) || is.na(m[1, 3])) return(NA_character_)
tail <- clean_text(m[1, 3])
tail <- stringr::str_replace(tail, stringr::regex("^of\\s+", ignore_case = TRUE), "")
target <- clean_text(paste(m[1, 2], tail))
if (!keep_parenthetical) {
target <- stringr::str_remove(target, "\\s*\\([^)]*\\)\\s*$")
target <- clean_text(target)
}
target
}
aff_place_name <- function(original_name) {
x <- clean_text(original_name)
x <- x[1]
if (length(x) == 0 || is.na(x)) return(NA_character_)
if (!stringr::str_detect(x, stringr::regex("(^|\\s)aff\\.?(\\s|$)", ignore_case = TRUE))) return(NA_character_)
m <- stringr::str_match(x, "\\(([^)]+)\\)")
if (!is.na(m[1, 2])) clean_text(m[1, 2]) else NA_character_
}
place_match_in_text <- function(place, text) {
place <- clean_text(place)
text <- clean_text(text)
if (is.na(place) || is.na(text)) return(FALSE)
place_options <- unique(c(
place,
stringr::str_remove(place, stringr::regex("\\s+variant$", ignore_case = TRUE))
))
place_options <- place_options[!is.na(place_options) & nzchar(place_options)]
any(purrr::map_lgl(place_options, function(p) {
stringr::str_detect(text, stringr::fixed(p, ignore_case = TRUE))
}))
}
named_sp_number_base <- function(original_name) {
x <- clean_text(original_name)
x <- x[1]
if (length(x) == 0 || is.na(x)) return(NA_character_)
# Named phrase records such as:
# Arthropodium sp. 2 (greenish flowers) -> Arthropodium sp. 2
# Do not use bracketed informal descriptors as part of the VicFlora query.
# This deliberately excludes aff./cf./nov. patterns such as Caladenia sp. aff. vulgaris.
m <- stringr::str_match(
x,
stringr::regex("^([A-Z][A-Za-z-]+\\s+sp\\.?\\s*([0-9A-Za-z-]+))\\b", ignore_case = FALSE)
)
if (is.na(m[1, 2]) || is.na(m[1, 3])) return(NA_character_)
token <- stringr::str_to_lower(stringr::str_remove_all(m[1, 3], "\\."))
if (token %in% c("aff", "cf", "nov", "indet")) return(NA_character_)
clean_text(m[1, 2])
}
strip_trailing_parenthetical <- function(original_name) {
x <- clean_text(original_name)
x <- x[1]
if (length(x) == 0 || is.na(x)) return(NA_character_)
stripped <- stringr::str_remove(x, "\\s*\\([^)]*\\)\\s*$")
stripped <- clean_text(stripped)
if (is.na(stripped) || !different_name_key(stripped, x)) NA_character_ else stripped
}
build_search_plan <- function(original_name) {
original_name <- clean_text(original_name)
original_name <- original_name[1]
flags <- detect_name_flags(original_name)
if (length(original_name) == 0 || is.na(original_name)) {
return(tibble::tibble(search_name = NA_character_, attempt_reason = "blank_name", allow_fuzzy = FALSE))
}
# Sensu stricto is the one exception to "try the full name first": it means
# use the taxon name as written before the s.s. marker.
if (flags$has_ss) {
ss_name <- remove_ss_marker(original_name)
return(tibble::tibble(
search_name = ss_name,
attempt_reason = "ss_name_before_marker",
allow_fuzzy = FALSE
))
}
# Always start with the species full. This preserves true exact matches for
# varieties, subspecies, informal forms, and parenthetical names when VicFlora has them.
plan <- tibble::tibble(
search_name = original_name,
attempt_reason = "original_full_name_exact_first",
allow_fuzzy = FALSE
)
# For ordinary parenthetical notes, only remove the bracketed text after the
# full-name search has failed. This handles examples such as:
# Cardamine tenuifolia (small-flower form) -> Cardamine tenuifolia
# Arthropodium sp. 2 (greenish flowers) -> Arthropodium sp. 2
# For aff. names, keep the specialised aff. hierarchy below because it converts
# Caladenia aff. vulgaris (Aireys Inlet) to Caladenia vulgaris first.
if (!flags$has_aff) {
parenthetical_stripped <- strip_trailing_parenthetical(original_name)
if (!is.na(parenthetical_stripped)) {
plan <- dplyr::bind_rows(
plan,
tibble::tibble(
search_name = parenthetical_stripped,
attempt_reason = "parenthetical_note_removed_fallback",
allow_fuzzy = FALSE
)
)
}
}
# Named sp. number fallback also covers non-parenthetical informal strings.
# If the stripped-parenthetical fallback already added the same name, distinct()
# below removes the duplicate.
sp_number_name <- named_sp_number_base(original_name)
if (!is.na(sp_number_name) && different_name_key(sp_number_name, original_name)) {
plan <- dplyr::bind_rows(
plan,
tibble::tibble(
search_name = sp_number_name,
attempt_reason = "sp_number_without_extra_text_fallback",
allow_fuzzy = FALSE
)
)
}
if (flags$has_aff) {
target_with_place <- aff_target_name(original_name, keep_parenthetical = TRUE)
target_without_place <- aff_target_name(original_name, keep_parenthetical = FALSE)
if (!is.na(target_with_place) && different_name_key(target_with_place, original_name)) {
plan <- dplyr::bind_rows(
plan,
tibble::tibble(
search_name = target_with_place,
attempt_reason = "aff_genus_plus_text_after_aff_with_place",
allow_fuzzy = FALSE
)
)
}
if (!is.na(target_without_place) &&
different_name_key(target_without_place, target_with_place) &&
different_name_key(target_without_place, original_name)) {
plan <- dplyr::bind_rows(
plan,
tibble::tibble(
search_name = target_without_place,
attempt_reason = "aff_genus_plus_text_after_aff_without_place",
allow_fuzzy = FALSE
)
)
}
}
if (flags$has_var) {
parent_species <- species_binomial(original_name)
if (!is.na(parent_species) && different_name_key(parent_species, original_name)) {
plan <- dplyr::bind_rows(
plan,
tibble::tibble(
search_name = parent_species,
attempt_reason = "var_or_subsp_parent_species_fallback",
allow_fuzzy = FALSE
)
)
}
}
plan |>
dplyr::mutate(plan_order = dplyr::row_number()) |>
dplyr::distinct(search_name, .keep_all = TRUE) |>
dplyr::select(search_name, attempt_reason, allow_fuzzy, plan_order)
}
# -----------------------------
# VicFlora GraphQL helpers
# -----------------------------
endpoint <- "https://vicflora.rbg.vic.gov.au/graphql"
lookup_query <- '
query TaxonLookup($q: String!) {
taxonConceptAutocomplete(q: $q) {
...LookupConceptFields
synonyms {
fullName
fullNameWithAuthorship
}
synonymUsages {
...LookupConceptFields
acceptedConcept {
...LookupConceptFields
}
}
acceptedConcept {
...LookupConceptFields
}
}
}
fragment LookupConceptFields on TaxonConcept {
id
taxonomicStatus
taxonRank
taxonName {
fullName
fullNameWithAuthorship
}
higherClassification {
taxonRank
taxonName {
fullName
fullNameWithAuthorship
}
}
currentProfile {
profile
modified
}
bioregions {
bioregionCode
bioregionName
occurrenceStatus
}
}
'
gql_post <- function(query, variables = list()) {
req <- httr2::request(endpoint) |>
httr2::req_user_agent("VicFlora species description extraction script; local research use") |>
httr2::req_headers(Accept = "application/json") |>
httr2::req_body_json(list(query = query, variables = variables), auto_unbox = TRUE) |>
httr2::req_timeout(45) |>
httr2::req_retry(max_tries = 4, backoff = ~ 1.5 ^ .x)
response <- httr2::req_perform(req)
parsed <- httr2::resp_body_json(response, simplifyVector = FALSE)
if (!is.null(parsed$errors)) {
stop(jsonlite::toJSON(parsed$errors, auto_unbox = TRUE), call. = FALSE)
}
parsed$data
}
concept_for_candidate <- function(candidate) {
if (!is.null(candidate$acceptedConcept) && !is.null(candidate$acceptedConcept$id)) {
candidate$acceptedConcept
} else {
candidate
}
}
concept_for_synonym_usage <- function(synonym_usage) {
if (!is.null(synonym_usage$acceptedConcept) && !is.null(synonym_usage$acceptedConcept$id)) {
synonym_usage$acceptedConcept
} else {
synonym_usage
}
}
candidate_display_names <- function(candidate) {
c(
candidate$taxonName$fullName %||% NA_character_,
candidate$taxonName$fullNameWithAuthorship %||% NA_character_
)
}
accepted_display_names <- function(candidate) {
c(
candidate$acceptedConcept$taxonName$fullName %||% NA_character_,
candidate$acceptedConcept$taxonName$fullNameWithAuthorship %||% NA_character_
)
}
candidate_synonym_names <- function(candidate) {
syn_names <- character()
if (!is.null(candidate$synonyms) && length(candidate$synonyms) > 0) {
syn_names <- c(
syn_names,
purrr::map_chr(candidate$synonyms, ~ .x$fullName %||% NA_character_),
purrr::map_chr(candidate$synonyms, ~ .x$fullNameWithAuthorship %||% NA_character_)
)
}
if (!is.null(candidate$synonymUsages) && length(candidate$synonymUsages) > 0) {
syn_names <- c(
syn_names,
purrr::map_chr(candidate$synonymUsages, ~ .x$taxonName$fullName %||% NA_character_),
purrr::map_chr(candidate$synonymUsages, ~ .x$taxonName$fullNameWithAuthorship %||% NA_character_)
)
}
syn_names <- clean_text(syn_names)
unique(syn_names[!is.na(syn_names)])
}
make_empty_result <- function(original_name) {
tibble::tibble(
query_name = clean_text(original_name)[1],
ss_var_aff = NA_character_,
plant_group = NA_character_,
higher_classification = NA_character_,
vicflora_search_name = NA_character_,
search_hierarchy = NA_character_,
match_type = NA_character_,
synonym_fallback_used = NA,
matched_synonym_name = NA_character_,
aff_place_name = aff_place_name(original_name),
aff_place_found_in_treatment = NA_character_,
matched_name = NA_character_,
matched_name_with_authorship = NA_character_,
matched_id = NA_character_,
matched_taxonomic_status = NA_character_,
used_concept_id = NA_character_,
used_scientific_name = NA_character_,
used_name_with_authorship = NA_character_,
used_accepted_concept = NA,
profile_modified = NA_character_,
species_description = NA_character_,
habitat_codes = NA_character_,
bioregion_codes_api = NA_character_,
second_paragraph_codes = NA_character_,
treatment_first_paragraph = NA_character_,
treatment_second_paragraph = NA_character_,
all_treatment_text = NA_character_,
vicflora_url = NA_character_,
scrape_status = "not_started",
error_message = NA_character_
)
}
build_result_row <- function(original_name, search_name, attempt_reason, candidate, concept, match_type,
synonym_fallback_used = FALSE, matched_synonym_name = NA_character_) {
base <- make_empty_result(original_name)
profile_html <- concept$currentProfile$profile %||% NA_character_
paragraphs <- profile_to_paragraphs(profile_html)
first_para <- if (length(paragraphs) >= 1) paragraphs[1] else NA_character_
second_para <- if (length(paragraphs) >= 2) paragraphs[2] else NA_character_
all_treatment <- if (length(paragraphs) > 0) paste(paragraphs, collapse = " ") else NA_character_
description <- first_sentence(first_para)
bioregion_codes <- collapse_bioregion_codes(concept$bioregions)
second_para_codes <- extract_codes_from_treatment(second_para)
# Fall back to the API bioregion codes if the second paragraph heuristic finds no codes.
habitat_codes <- if (!is.na(second_para_codes)) second_para_codes else bioregion_codes
plant_group_value <- infer_plant_group(concept)
higher_classification_value <- collapse_higher_classification(concept)
place <- aff_place_name(original_name)
place_found <- if (!is.na(place) && !is.na(all_treatment)) {
if (place_match_in_text(place, all_treatment)) place else NA_character_
} else {
NA_character_
}
final_status <- if (is.na(profile_html) || !nzchar(profile_html)) "no_profile" else "ok"
final_label <- final_ss_var_aff(original_name, attempt_reason, final_status)
# Avoid dplyr data-mask collisions: these argument names are also output-column names.
# If we wrote match_type = match_type, mutate would use the existing blank column
# from base instead of the function argument.
match_type_value <- match_type
synonym_fallback_used_value <- synonym_fallback_used
matched_synonym_name_value <- matched_synonym_name
base |>
dplyr::mutate(
ss_var_aff = final_label,
plant_group = plant_group_value,
higher_classification = higher_classification_value,
vicflora_search_name = search_name,
search_hierarchy = attempt_reason,
match_type = match_type_value,
synonym_fallback_used = synonym_fallback_used_value,
matched_synonym_name = matched_synonym_name_value,
aff_place_found_in_treatment = place_found,
matched_name = candidate$taxonName$fullName %||% NA_character_,
matched_name_with_authorship = candidate$taxonName$fullNameWithAuthorship %||% NA_character_,
matched_id = as.character(candidate$id %||% NA_character_),
matched_taxonomic_status = candidate$taxonomicStatus %||% NA_character_,
used_concept_id = as.character(concept$id %||% NA_character_),
used_scientific_name = concept$taxonName$fullName %||% NA_character_,
used_name_with_authorship = concept$taxonName$fullNameWithAuthorship %||% NA_character_,
used_accepted_concept = !is.null(concept$id) && !is.null(candidate$id) && as.character(concept$id) != as.character(candidate$id),
profile_modified = concept$currentProfile$modified %||% NA_character_,
species_description = description,
habitat_codes = habitat_codes,
bioregion_codes_api = bioregion_codes,
second_paragraph_codes = second_para_codes,
treatment_first_paragraph = first_para,
treatment_second_paragraph = second_para,
all_treatment_text = all_treatment,
vicflora_url = if (!is.null(concept$id)) paste0("https://vicflora.rbg.vic.gov.au/flora/taxon/", concept$id) else NA_character_,
scrape_status = final_status,
error_message = NA_character_
)
}
lookup_one_search_name <- function(original_name, search_name, attempt_reason, allow_fuzzy = FALSE) {
if (is_blank(search_name)) {
out <- make_empty_result(original_name)
out$scrape_status <- "blank_name"
out$ss_var_aff <- final_ss_var_aff(original_name, attempt_reason, out$scrape_status)
return(out)
}
out <- tryCatch({
data <- gql_post(lookup_query, list(q = search_name))
candidates <- data$taxonConceptAutocomplete
if (is.null(candidates) || length(candidates) == 0) {
row <- make_empty_result(original_name)
row$vicflora_search_name <- search_name
row$search_hierarchy <- attempt_reason
row$scrape_status <- "not_found"
row$ss_var_aff <- final_ss_var_aff(original_name, attempt_reason, row$scrape_status)
return(row)
}
search_key <- name_key(search_name)
no_profile_rows <- list()
choices <- list()
add_choice <- function(candidate, concept, match_type, synonym_fallback_used = FALSE, matched_synonym_name = NA_character_) {
choices[[length(choices) + 1]] <<- list(
candidate = candidate,
concept = concept,
match_type = match_type,
synonym_fallback_used = synonym_fallback_used,
matched_synonym_name = matched_synonym_name
)
}
# 1. Exact direct matches to the returned taxon name.
for (candidate in candidates) {
cand_keys <- purrr::map_chr(candidate_display_names(candidate), name_key)
cand_keys <- cand_keys[!is.na(cand_keys)]
if (any(cand_keys == search_key)) {
add_choice(candidate, concept_for_candidate(candidate), "exact_returned_name", FALSE, NA_character_)
}
}
# 2. Exact matches to an accepted concept returned by a synonym/misapplied result.
for (candidate in candidates) {
acc_keys <- purrr::map_chr(accepted_display_names(candidate), name_key)
acc_keys <- acc_keys[!is.na(acc_keys)]
if (length(acc_keys) > 0 && any(acc_keys == search_key)) {
add_choice(candidate, concept_for_candidate(candidate), "exact_accepted_concept_name", FALSE, NA_character_)
}
}
# 3. Synonym-tab matches, ignoring authorship at the end of the name.
for (candidate in candidates) {
syn_names <- candidate_synonym_names(candidate)
if (length(syn_names) > 0 && name_matches_ignoring_authorship(original_name, syn_names)) {
matched_syn <- syn_names[which(purrr::map_lgl(syn_names, ~ name_matches_ignoring_authorship(original_name, .x)))[1]]
add_choice(candidate, concept_for_candidate(candidate), "synonym_tab_match", TRUE, matched_syn)
}
# If any synonym usage itself has an accepted concept, use that accepted concept.
if (!is.null(candidate$synonymUsages) && length(candidate$synonymUsages) > 0) {
for (syn_usage in candidate$synonymUsages) {
usage_names <- c(
syn_usage$taxonName$fullName %||% NA_character_,
syn_usage$taxonName$fullNameWithAuthorship %||% NA_character_
)
if (name_matches_ignoring_authorship(original_name, usage_names)) {
matched_syn <- clean_text(usage_names)[!is.na(clean_text(usage_names))][1]
add_choice(candidate, concept_for_synonym_usage(syn_usage), "synonym_usage_match", TRUE, matched_syn)
}
}
}
}
# 4. Optional fuzzy fallback. Kept available, but the default search plan sets allow_fuzzy = FALSE to avoid subpar matches.
if (allow_fuzzy && length(choices) == 0) {
statuses <- purrr::map_chr(candidates, ~ .x$taxonomicStatus %||% NA_character_)
idx <- which(statuses == "ACCEPTED")
if (length(idx) == 0) idx <- seq_along(candidates)
candidate <- candidates[[idx[1]]]
add_choice(candidate, concept_for_candidate(candidate), "accepted_or_first_autocomplete_fallback", FALSE, NA_character_)
}
if (length(choices) == 0) {
row <- make_empty_result(original_name)
row$vicflora_search_name <- search_name
row$search_hierarchy <- attempt_reason
row$scrape_status <- "not_found"
row$ss_var_aff <- final_ss_var_aff(original_name, attempt_reason, row$scrape_status)
return(row)
}
for (choice in choices) {
row <- build_result_row(
original_name = original_name,
search_name = search_name,
attempt_reason = attempt_reason,
candidate = choice$candidate,
concept = choice$concept,
match_type = choice$match_type,
synonym_fallback_used = choice$synonym_fallback_used,
matched_synonym_name = choice$matched_synonym_name
)
if (identical(row$scrape_status[[1]], "ok")) return(row)
no_profile_rows[[length(no_profile_rows) + 1]] <- row
}
if (length(no_profile_rows) > 0) no_profile_rows[[1]] else {
row <- make_empty_result(original_name)
row$vicflora_search_name <- search_name
row$search_hierarchy <- attempt_reason
row$scrape_status <- "not_found"
row$ss_var_aff <- final_ss_var_aff(original_name, attempt_reason, row$scrape_status)
row
}
}, error = function(e) {
row <- make_empty_result(original_name)
row$vicflora_search_name <- search_name
row$search_hierarchy <- attempt_reason
row$scrape_status <- "error"
row$ss_var_aff <- final_ss_var_aff(original_name, attempt_reason, row$scrape_status)
row$error_message <- conditionMessage(e)
row
})
Sys.sleep(request_pause_seconds)
out
}
scrape_one_species <- function(query_name) {
query_name <- clean_text(query_name)[1]
result_if_all_fail <- make_empty_result(query_name)
if (is_blank(query_name)) {
result_if_all_fail$scrape_status <- "blank_name"
result_if_all_fail$ss_var_aff <- final_ss_var_aff(query_name, NA_character_, result_if_all_fail$scrape_status)
return(result_if_all_fail)
}
plan <- build_search_plan(query_name)
attempt_summaries <- character()
for (i in seq_len(nrow(plan))) {
attempt <- plan[i, ]
row <- lookup_one_search_name(
original_name = query_name,
search_name = attempt$search_name,
attempt_reason = attempt$attempt_reason,
allow_fuzzy = attempt$allow_fuzzy
)
attempt_summaries <- c(
attempt_summaries,
paste0(attempt$attempt_reason, " [", attempt$search_name, "]: ", row$scrape_status[[1]])
)
if (identical(row$scrape_status[[1]], "ok")) {
row$search_hierarchy <- paste(attempt_summaries, collapse = " -> ")
row$ss_var_aff <- final_ss_var_aff(query_name, attempt$attempt_reason, row$scrape_status[[1]])
return(row)
}
# Keep the most informative failure, but make sure that an all-not_found species
# returns not_found rather than the initial not_started row.
if (failure_priority(row$scrape_status[[1]]) >= failure_priority(result_if_all_fail$scrape_status[[1]])) {
result_if_all_fail <- row
}
}
result_if_all_fail$search_hierarchy <- paste(attempt_summaries, collapse = " -> ")
result_if_all_fail$ss_var_aff <- final_ss_var_aff(query_name, result_if_all_fail$search_hierarchy, result_if_all_fail$scrape_status[[1]])
result_if_all_fail
}
if (exists("final_tbl")) {
print(dplyr::count(final_tbl, scrape_status, ss_var_aff, sort = TRUE))
cat("
Rows written:", nrow(final_tbl), "
")
cat("CSV output:", normalizePath(output_csv, mustWork = FALSE), "
")
cat("XLSX output:", normalizePath(output_xlsx, mustWork = FALSE), "
")
}
## [1] "proj_scientific_name"
## [1] "vicflora_search_name"
vicflora_summary <- results_unique |>
summarise(
n_project_taxa = n(),
n_vicflora_matched = sum(scrape_status %in% c("success", "matched", "ok"), na.rm = TRUE),
n_vicflora_not_found = sum(scrape_status == "not_found", na.rm = TRUE),
n_vicflora_no_profile = sum(scrape_status == "no_profile", na.rm = TRUE),
n_with_description = sum(!is.na(species_description) & species_description != "", na.rm = TRUE),
n_with_habitat_codes = sum(!is.na(bioregion_codes_api) & bioregion_codes_api != "", na.rm = TRUE),
n_bryophytes = sum(plant_group == "bryophyte", na.rm = TRUE),
n_flowering_plants = sum(plant_group == "flowering plant", na.rm = TRUE),
n_other_or_unknown = sum(
is.na(plant_group) |
plant_group %in% c("unknown", "other vascular plant"),
na.rm = TRUE
)
) |>
pivot_longer(
cols = everything(),
names_to = "summary_metric",
values_to = "value"
)
vicflora_summary |>
knitr::kable(
caption = "Summary of VicFlora scraping results"
)
| summary_metric | value |
|---|---|
| n_project_taxa | 316 |
| n_vicflora_matched | 247 |
| n_vicflora_not_found | 46 |
| n_vicflora_no_profile | 23 |
| n_with_description | 247 |
| n_with_habitat_codes | 261 |
| n_bryophytes | 53 |
| n_flowering_plants | 212 |
| n_other_or_unknown | 51 |
Because VICFLORA focuses on vascular plants mostly, the information on most of the bryophytes, liverworts and hornworts have to be sourced elsewhere. Although ALA is the next obvious choice, there isn’t a species description section that is accessible. Most of the species description on bryophytes, liverworts and hornworts may have to be individually/manually compiled. The geopgraphic range information is however available for most of these species on ALA - and this information has been acquired in this phase.
AusTraits is a large, curated database of plant trait information for Australian plant species. A plant trait is a measurable characteristic of a species, such as growth form, plant height, lifespan, seed mass, leaf size, dispersal mode, or reproductive strategy. These traits are useful because they help describe how plants live, grow, reproduce, and respond to their environment.
The database brings together trait records from many different published and unpublished sources. This means that a single species can have multiple records for the same trait, sometimes from different studies, regions, or measurement methods. For this reason, the analysis keeps the original trait records as an audit table, but also creates a simplified species-level summary that can be joined back to the main species dataset.
Here, AusTraits is used to add ecological trait information focusing on life history related traits to the species list.
The AusTraits part of the workflow has four main purposes.
First, the script prepares the species names from the main
results_unique table. These names come from the VicFlora
scrape and have already been cleaned as much as possible. The script
then creates matching names that can be compared with names in the
AusTraits database.
Second, the script matches the project species list to AusTraits
taxonomy. It first tries to match the cleaned project species name
directly to the AusTraits taxon_name column. This is the
preferred match because it preserves the most specific name available,
including varieties or subspecies where they are present in AusTraits.
If no exact taxon-name match is found, the script then uses a
species-level name and performs a fuzzy match against the AusTraits
binomial column. This helps recover likely matches when
names differ slightly between sources.
Third, the script extracts only the traits of interest (16 traits listed below). Instead of using the full AusTraits database, which is very large (over 530 traits!), the code filters the data to keep only the project species and the chosen traits of interest.
Fourth, the script appends the trait information back to the main species table. Because AusTraits can contain several records for the same species and trait, the workflow keeps two outputs:
results_unique.The wide table is easier to read because each species remains as one row. If a species has several values for the same trait, these values are combined into a single cell separated by semicolons. The long audit table is kept so that the original records can still be checked if needed.
Downloading or loading large biodiversity datasets can take time and may require an internet connection. To avoid repeating the same steps every time the R Markdown document is knitted, the workflow saves the AusTraits outputs as local files after they are created. Later, the document can reload these saved files directly, instead of contacting the online database again.
This makes the document faster, more reproducible, and less likely to fail because of temporary internet or server issues. The trait database contains over 530 plant species traits. We are mainly interested in life history related traits, therefore extract trait information for a specific subset of traits (string called traits_of_interest) for the 316 poorly known species.
Here we specify 16 plant traits: plant growth form, life history, flowering time, resprouting capacity, dispersers, reproductive maturity, vegetation reproduction ability, life span, recruitment time, post fire recruitment, resprouting capacity non fire disturbance, resprouting capacity proportion individuals, seed viability, seedbank longevity, seedling establishment conditions, fire exposure level
For FFG nomination, there is a field called “Generation length”. Unfortuntaly AusTraits does not have this particular trait information. We may have to make assumptions about generation length based on reproductive maturity and lifespan (both of which are available).
full_list_of_traits<-summarise_database(austraits07,"trait_name")
traits_of_interest<-c("plant_growth_form","life_history","flowering_time","resprouting_capacity","dispersers","reproductive_maturity","vegetative_reproduction_ability","lifespan","recruitment_time","post_fire_recruitment","resprouting_capacity_non_fire_disturbance","resprouting_capacity_proportion_individuals","seed_viability","seedbank_longevity","seedling_establishment_conditions","fire_exposure_level")
traits_with_taxa <- austraits07$traits |>
filter(trait_name %in% traits_of_interest) |>
left_join(
taxa_lookup |> select(taxon_name, binomial),
by = "taxon_name"
)
trait_rows_exact <- results_unique_matched |>
filter(austraits_match_type == "exact_taxon_name") |>
select(row_id, matched_taxon_name) |>
inner_join(
traits_with_taxa,
by = c("matched_taxon_name" = "taxon_name")
)
trait_rows_fuzzy <- results_unique_matched |>
filter(austraits_match_type == "fuzzy_binomial") |>
select(row_id, matched_binomial) |>
inner_join(
traits_with_taxa,
by = c("matched_binomial" = "binomial")
)
trait_rows_to_append <- bind_rows(
trait_rows_exact,
trait_rows_fuzzy
)
collapse_unique <- function(x) {
x <- unique(na.omit(as.character(x)))
if (length(x) == 0) NA_character_ else paste(x, collapse = "; ")
}
results_unique <- results_unique |>
mutate(row_id = row_number())
trait_wide <- trait_rows_to_append |>
mutate(
value = as.character(value),
unit = as.character(unit)
) |>
group_by(row_id, trait_name) |>
summarise(
trait_value = collapse_unique(value),
trait_unit = collapse_unique(unit),
trait_source_id = collapse_unique(source_id),
trait_dataset_id = collapse_unique(dataset_id),
.groups = "drop"
) |>
pivot_wider(
names_from = trait_name,
values_from = c(
trait_value,
trait_unit,
trait_source_id,
trait_dataset_id
),
names_glue = "{trait_name}_{.value}"
)
results_unique_with_traits <- results_unique |>
left_join(trait_wide, by = "row_id")
austraits_summary <- tibble(
summary_metric = c(
"Project taxa",
"Taxa with at least one selected AusTraits record",
"Total AusTraits records extracted",
"Selected traits found",
"Selected traits requested"
),
value = c(
nrow(results_unique),
trait_rows_to_append |> distinct(row_id) |> nrow(),
nrow(trait_rows_to_append),
trait_rows_to_append |> distinct(trait_name) |> nrow(),
length(unique(traits_of_interest))
)
)
austraits_summary |>
knitr::kable(
caption = "Summary of AusTraits trait extraction"
)
trait_coverage_summary <- trait_rows_to_append |>
group_by(trait_name) |>
summarise(
n_taxa = n_distinct(row_id),
n_records = n(),
n_unique_values = n_distinct(value, na.rm = TRUE),
example_values = paste(
head(unique(na.omit(as.character(value))), 5),
collapse = "; "
),
.groups = "drop"
) |>
arrange(desc(n_taxa), trait_name)
trait_coverage_summary |>
knitr::kable(
caption = "Coverage of selected AusTraits traits"
)
| trait_name | n_taxa | n_records | n_unique_values | example_values |
|---|---|---|---|---|
| plant_growth_form | 222 | 1419 | 20 | shrub tree; shrub; tree; fern; graminoid herb |
| life_history | 220 | 913 | 10 | perennial; annual perennial; annual; annual short_lived_perennial; short_lived_perennial |
| lifespan | 214 | 375 | 16 | 500–1000; 10–50; 0–1; 50–500; 50–100 |
| reproductive_maturity | 214 | 337 | 11 | 5–20; 1–5; 0–1; 4–5; 2 |
| seedbank_longevity | 214 | 337 | 6 | 2–; –2; 1–; 10–; 5– |
| vegetative_reproduction_ability | 214 | 352 | 2 | vegetative; not_vegetative |
| resprouting_capacity | 212 | 563 | 6 | fire_killed; resprouts; partial_resprouting; partial_resprouting resprouts; fire_killed partial_resprouting |
| dispersers | 206 | 401 | 19 | invertebrates; ants passive vertebrates; wind; abiotic animals; water |
| flowering_time | 203 | 482 | 90 | nnnnnnyyyynn; nnnnnnyyyyyn; nnnnnnnyyyyn; nnnnnnnnyynn; nnnnnnnyyynn |
| resprouting_capacity_non_fire_disturbance | 117 | 161 | 1 | resprouts_non_fire_disturbance |
| post_fire_recruitment | 46 | 69 | 2 | post_fire_recruitment; post_fire_recruitment_absent |
| seedling_establishment_conditions | 45 | 49 | 2 | establish_anytime; establish_post_fire |
| resprouting_capacity_proportion_individuals | 3 | 9 | 3 | 1; 0.8; 0.4 |
| fire_exposure_level | 2 | 3 | 1 | aquatic_taxon |
This section speaks directly to the Geographic Range criteria of the nomination process, in particular, the AOO and EOO.
Galah package is the R package that helps extract data from the ALA, which is the primary resource we use to determine the geographic range/species distribution. Here, species records are bound to Victoria only (this can be changed if the project is expanded for a continental analysis).
The taxon matching helper functions create a hierarchy of candidate names for each project species with the same logic as cleaned VicFlora search name described above.
For each candidate name, the script uses
galah::search_taxa() to query the ALA taxonomic backbone.
The returned matches are filtered to retain plant taxa and to avoid
broad higher-level matches such as genera or families. This prevents the
script from accidentally downloading all records for a whole genus when
a species-level name cannot be resolved.
When several possible matches are returned, the helper function
prioritises exact scientific-name matches and exact ALA match types. The
best match is then stored with the original project name, the candidate
name used for the successful query, the resolved ALA scientific name,
taxon concept identifier, taxon rank, and match type. Species that
cannot be resolved are retained in the output with a
not_found status, allowing them to be reviewed
manually.
This matching step produces an auditable table of taxonomic decisions before occurrence records are downloaded. This is useful because it separates uncertainty in name resolution from uncertainty in occurrence data, and makes it possible to check which species were matched exactly, which required simplified fallback names, and which were not found in ALA.
victoria_exact <- victoria_sf |>
st_make_valid() |>
st_transform(4326) |>
st_zm(drop = TRUE, what = "ZM")
victoria_exact <- st_as_sf(
st_sfc(
st_union(st_geometry(victoria_exact)),
crs = st_crs(victoria_exact)
)
)
# The exact polygon crop happens locally after download.
victoria_query_area <- victoria_exact
results_unique_ala_base <- results_unique_with_traits |>
mutate(
ala_row_id = row_number(),
ala_input_name = as.character(.data[[species_col]]),
ala_input_name = str_squish(ala_input_name)
)
species_to_resolve <- results_unique_ala_base |>
filter(!is.na(ala_input_name), ala_input_name != "") |>
distinct(ala_input_name) |>
mutate(ala_species_id = row_number())
# ------------------------------------------------------------
# 5. Candidate names for ALA search
# ------------------------------------------------------------
# Since vicflora_search_name is already cleaned, the exact value is tried first.
# Bracket removal and aff./var./s.s. handling are only fallback candidates.
make_ala_candidates <- function(x) {
original <- str_squish(x)
no_brackets <- original |>
str_remove("\\s*\\([^)]*\\)\\s*$") |>
str_squish()
no_ss <- no_brackets |>
str_remove("\\s+s\\.?s\\.?\\s*$") |>
str_remove("\\s+ss\\s*$") |>
str_squish()
# Caladenia aff. vulgaris -> Caladenia vulgaris
aff_name <- no_ss |>
str_replace(
"^([A-Z][A-Za-z-]+)\\s+(?:sp\\.?\\s*)?aff\\.?\\s+([a-z][a-z-]+).*$",
"\\1 \\2"
) |>
str_squish()
# Acacia leprosa var. graveolens -> Acacia leprosa
species_level <- no_ss |>
str_replace(
"^([A-Z][A-Za-z-]+\\s+[a-z][a-z-]+)\\s+(var\\.|subsp\\.|ssp\\.).*$",
"\\1"
) |>
str_squish()
# Arthropodium sp. 2 (greenish flowers) -> Arthropodium sp. 2
sp_number <- str_extract(
no_brackets,
"^[A-Z][A-Za-z-]+\\s+sp\\.?\\s*\\d+"
)
candidates <- c(
original,
no_brackets,
no_ss,
aff_name,
sp_number,
species_level
)
candidates |>
unique() |>
discard(~ is.na(.x) || .x == "")
}
# ------------------------------------------------------------
# 6. Helpers for robust ALA taxon matching
# ------------------------------------------------------------
find_col <- function(df, possible_names) {
hit <- intersect(possible_names, names(df))
if (length(hit) == 0) NA_character_ else hit[1]
}
safe_pull <- function(df, possible_names, default = NA_character_) {
col <- find_col(df, possible_names)
if (is.na(col)) {
rep(default, nrow(df))
} else {
as.character(df[[col]])
}
}
resolve_ala_taxon <- function(ala_input_name, ala_species_id) {
candidates <- make_ala_candidates(ala_input_name)
for (i in seq_along(candidates)) {
query_name <- candidates[[i]]
tax <- tryCatch(
galah::search_taxa(query_name),
error = function(e) NULL
)
if (is.null(tax) || nrow(tax) == 0) next
tax <- as_tibble(tax)
sci_col <- find_col(
tax,
c("scientific_name", "scientificName", "taxon_name", "name")
)
if (is.na(sci_col)) next
tax <- tax |>
mutate(
.scientific_name = as.character(.data[[sci_col]]),
.kingdom = safe_pull(tax, c("kingdom", "kingdom_name", "kingdomName")),
.rank = safe_pull(tax, c("rank", "taxon_rank", "taxonRank")),
.match_type = safe_pull(tax, c("match_type", "matchType"))
)
# Keep plant records when kingdom is supplied
tax <- tax |>
filter(
is.na(.kingdom) |
.kingdom == "" |
str_to_lower(.kingdom) == "plantae"
)
# Avoid downloading whole genera/families when species-level matching fails
tax <- tax |>
filter(
is.na(.rank) |
!str_to_lower(.rank) %in% c(
"kingdom", "phylum", "class", "order", "family", "genus"
)
)
if (nrow(tax) == 0) next
tax_best <- tax |>
mutate(
.exact_name = str_to_lower(.scientific_name) == str_to_lower(query_name),
.exact_match_type = str_detect(str_to_lower(.match_type), "exact")
) |>
arrange(
desc(.exact_name),
desc(.exact_match_type)
) |>
slice(1)
taxon_id_col <- find_col(
tax_best,
c("taxon_concept_id", "taxonConceptID", "taxonConceptId", "guid", "taxon_id")
)
return(tibble(
ala_species_id = ala_species_id,
ala_input_name = ala_input_name,
ala_query_name = query_name,
ala_candidate_number = i,
ala_scientific_name = tax_best$.scientific_name[1],
ala_taxon_concept_id = if (!is.na(taxon_id_col)) as.character(tax_best[[taxon_id_col]][1]) else NA_character_,
ala_taxon_rank = tax_best$.rank[1],
ala_match_type = tax_best$.match_type[1],
ala_resolution_status = "matched"
))
}
tibble(
ala_species_id = ala_species_id,
ala_input_name = ala_input_name,
ala_query_name = NA_character_,
ala_candidate_number = NA_integer_,
ala_scientific_name = NA_character_,
ala_taxon_concept_id = NA_character_,
ala_taxon_rank = NA_character_,
ala_match_type = NA_character_,
ala_resolution_status = "not_found"
)
}
taxon_matches <- map2_dfr(
species_to_resolve$ala_input_name,
species_to_resolve$ala_species_id,
resolve_ala_taxon
)
write_csv(
taxon_matches,
file.path(out_dir, "ala_taxon_matches.csv")
)
# Quick check
taxon_matches |>
count(ala_resolution_status, ala_match_type)
# ------------------------------------------------------------
# 7. Helpers for occurrence downloads and exact Victoria crop
# ------------------------------------------------------------
safe_slug <- function(x) {
x |>
str_replace_all("[^A-Za-z0-9]+", "_") |>
str_replace_all("^_|_$", "") |>
str_to_lower()
}
get_count_value <- function(x) {
if ("count" %in% names(x)) return(as.integer(x$count[1]))
if ("totalRecords" %in% names(x)) return(as.integer(x$totalRecords[1]))
NA_integer_
}
crop_occurrences_to_victoria <- function(occ, victoria_exact) {
if (is.null(occ) || nrow(occ) == 0) {
return(occ)
}
lon_col <- intersect(
c("decimalLongitude", "decimal_longitude", "longitude"),
names(occ)
)[1]
lat_col <- intersect(
c("decimalLatitude", "decimal_latitude", "latitude"),
names(occ)
)[1]
if (is.na(lon_col) || is.na(lat_col)) {
warning("No longitude/latitude columns found. Returning uncropped records.")
return(occ)
}
occ_sf <- occ |>
filter(
!is.na(.data[[lon_col]]),
!is.na(.data[[lat_col]])
) |>
st_as_sf(
coords = c(lon_col, lat_col),
crs = 4326,
remove = FALSE
)
occ_vic <- occ_sf |>
st_filter(victoria_exact, .predicate = st_intersects)
occ_vic |>
st_drop_geometry()
}
download_one_ala_species <- function(match_row,
victoria_query_area,
victoria_exact,
out_dir,
pause_seconds = 0.5) {
if (match_row$ala_resolution_status != "matched") {
return(tibble(
ala_species_id = match_row$ala_species_id,
ala_input_name = match_row$ala_input_name,
ala_scientific_name = match_row$ala_scientific_name,
download_status = "taxon_not_found",
n_records_bbox = NA_integer_,
n_records_victoria = NA_integer_,
error_message = NA_character_
))
}
slug <- safe_slug(paste0(
match_row$ala_species_id, "_",
match_row$ala_input_name
))
rds_file <- file.path(out_dir, paste0(slug, "_occurrences.rds"))
csv_file <- file.path(out_dir, paste0(slug, "_occurrences.csv"))
# Resume behaviour: skip if already downloaded
if (file.exists(rds_file)) {
occ <- readRDS(rds_file)
return(tibble(
ala_species_id = match_row$ala_species_id,
ala_input_name = match_row$ala_input_name,
ala_scientific_name = match_row$ala_scientific_name,
download_status = "already_downloaded",
n_records_bbox = NA_integer_,
n_records_victoria = nrow(occ),
error_message = NA_character_
))
}
result <- tryCatch({
base_query <- galah::request_data("occurrences") |>
galah::identify(match_row$ala_scientific_name) |>
galah::geolocate(victoria_query_area, type = "bbox")
if (!inherits(base_query, "data_request")) {
stop("base_query is not a galah data_request. Check galah query construction.")
}
bbox_count <- base_query |>
dplyr::count() |>
dplyr::collect() |>
get_count_value()
if (is.na(bbox_count) || bbox_count == 0) {
return(tibble(
ala_species_id = match_row$ala_species_id,
ala_input_name = match_row$ala_input_name,
ala_scientific_name = match_row$ala_scientific_name,
download_status = "zero_records_bbox",
n_records_bbox = bbox_count,
n_records_victoria = 0L,
error_message = NA_character_
))
}
occ_raw <- base_query |>
dplyr::select(group = "basic") |>
dplyr::collect()
occ_vic <- crop_occurrences_to_victoria(
occ = occ_raw,
victoria_exact = victoria_exact
)
if (nrow(occ_vic) == 0) {
saveRDS(occ_vic, rds_file)
write_csv(occ_vic, csv_file)
return(tibble(
ala_species_id = match_row$ala_species_id,
ala_input_name = match_row$ala_input_name,
ala_scientific_name = match_row$ala_scientific_name,
download_status = "zero_records_after_victoria_crop",
n_records_bbox = bbox_count,
n_records_victoria = 0L,
error_message = NA_character_
))
}
occ_vic <- occ_vic |>
mutate(
ala_species_id = match_row$ala_species_id,
ala_input_name = match_row$ala_input_name,
ala_query_name = match_row$ala_query_name,
ala_scientific_name_resolved = match_row$ala_scientific_name,
ala_taxon_concept_id_resolved = match_row$ala_taxon_concept_id,
ala_taxon_rank_resolved = match_row$ala_taxon_rank,
ala_match_type = match_row$ala_match_type,
ala_candidate_number = match_row$ala_candidate_number,
.before = 1
)
saveRDS(occ_vic, rds_file)
write_csv(occ_vic, csv_file)
Sys.sleep(pause_seconds)
tibble(
ala_species_id = match_row$ala_species_id,
ala_input_name = match_row$ala_input_name,
ala_scientific_name = match_row$ala_scientific_name,
download_status = "downloaded",
n_records_bbox = bbox_count,
n_records_victoria = nrow(occ_vic),
error_message = NA_character_
)
}, error = function(e) {
tibble(
ala_species_id = match_row$ala_species_id,
ala_input_name = match_row$ala_input_name,
ala_scientific_name = match_row$ala_scientific_name,
download_status = "error",
n_records_bbox = NA_integer_,
n_records_victoria = NA_integer_,
error_message = conditionMessage(e)
)
})
}
ala_summary <- tibble(
summary_metric = c(
"Taxa submitted to ALA",
"Taxa matched to ALA taxonomy",
"Taxa not found in ALA taxonomy",
"Taxa with Victoria records",
"Total Victoria occurrence records"
),
value = c(
taxon_matches |> distinct(ala_input_name) |> nrow(),
taxon_matches |> filter(ala_resolution_status == "matched") |> distinct(ala_input_name) |> nrow(),
taxon_matches |> filter(ala_resolution_status == "not_found") |> distinct(ala_input_name) |> nrow(),
download_log |> filter(n_records_victoria > 0) |> distinct(ala_input_name) |> nrow(),
if (exists("ala_occurrences_victoria")) nrow(ala_occurrences_victoria) else NA_integer_
)
)
ala_summary |>
knitr::kable(
caption = "Summary of ALA taxon matching and Victoria occurrence downloads"
)
| summary_metric | value |
|---|---|
| Taxa submitted to ALA | 313 |
| Taxa matched to ALA taxonomy | 308 |
| Taxa not found in ALA taxonomy | 5 |
| Taxa with Victoria records | 294 |
| Total Victoria occurrence records | 123525 |
ala_download_status_summary <- download_log |>
count(download_status, name = "n_taxa") |>
left_join(
download_log |>
group_by(download_status) |>
summarise(
total_records_victoria = sum(n_records_victoria, na.rm = TRUE),
.groups = "drop"
),
by = "download_status"
) |>
arrange(desc(n_taxa))
ala_download_status_summary |>
knitr::kable(
caption = "ALA download status summary"
)
| download_status | n_taxa | total_records_victoria |
|---|---|---|
| downloaded | 294 | 123525 |
| zero_records_bbox | 12 | 0 |
| taxon_not_found | 5 | 0 |
| zero_records_after_victoria_crop | 2 | 0 |
The summary table is the the intersection of all information gathered from Vicflora, Austraits database and ALA. This has been exported as a separate .xlsx file due to it’s wide format.
This information can be inferred from land-use mapping or other spatial layers related to land degradation. Threats layers from the SMP project could be an option.
Threatened status doesn’t often get determined by population. But gathering information on population status is very difficult. The only viable options are talking to experts to elicit their best estimate, and looking at herbarium specimen notes. David Cameron’s has collected this information over time, and therefore digitising the notes and assigning relevant information to each species should be primary focus to acquire population level information. There will still be gaps after this time consuming process. The alternative is to build a statistical model to estimate the probability of a species occurring in a plot, and by considering the dimensions of a plant it can feed into population estimates (e.g. a tree might be one individual per plot).
This won’t be relevant to many poorly known species.
Primarily only relevant if one population. Rarely determines the outcome of most assessments.
If in a fragmented landscape, likely to be highly fragmented.
This is the most important criteria. Similarly to Threats, this can be inferred from habitat of taxon. More specifically, Bioregional Conservation Status for EVCs give decline and or reduction amounts for habitat which can be inferred for the species description. It basically compares pre-1750 to 2005 habitat.
Never contributes to the assessment outcome.
Species is restricted if geographic range is <5 locations or <20 km. These can be calculated after ALA extract has been cleaned and audited.
This criteria is a complex one to deduct. It will involve putting together a series of prevailing threats and the geographic range for those threats (e.g. roadsides, tenure, historic land use, current land use).