---
title: "Predicting the Best Times to Spot Wildlife"
subtitle: "A confidence-weighted scoring function for the ecotourism package"
format:
html:
toc: true
toc-depth: 3
code-fold: show
code-tools: true
theme: flatly
self-contained: true
execute:
warning: false
message: false
editor:
markdown:
wrap: 72
---
## The question
The hard task asked for a function that takes occurrence and weather
data and predicts the top five days and times to spot an organism.
The first question is what *predict* means here. The package contains no
future weather forecasts. What it contains is ten years of sighting
records with timestamps, the weather conditions on those days, and a set
of top weather stations matched to each organism. The approach taken
here is:
> Look at every combination of month and hour of day. Find the ones where the most sightings have been recorded historically. Check whether the weather on those sighting days was noticeably different from typical weather at those stations. Then report each recommendation with a confidence score that is honest about how many sightings actually back it up.
This is the prediction: if these seasonal and weather conditions repeat, this is when you are most likely to see the organism.
------------------------------------------------------------------------
## Setup
```{r}
#| label: setup
library(ecotourism)
library(dplyr)
library(tidyr)
library(lubridate)
library(kableExtra)
library(plotly)
library(conflicted)
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
conflicts_prefer(plotly::layout)
data("glowworms")
data("gouldian_finch")
data("manta_rays")
data("orchids")
data("weather")
data("top_stations")
```
------------------------------------------------------------------------
## Understanding the data
Before writing the function, four things needed to be established from
the data.
### Record types: why hour 0 is not really midnight
Inspecting `record_type` across all four organisms revealed an important
data quality issue:
```{r}
#| label: record-types
bind_rows(
glowworms |> mutate(organism = "Glowworms"),
gouldian_finch |> mutate(organism = "Gouldian Finch"),
manta_rays |> mutate(organism = "Manta Rays"),
orchids |> mutate(organism = "Orchids")
) |>
count(organism, record_type) |>
pivot_wider(names_from = record_type, values_from = n, values_fill = 0) |>
kable() |>
kable_styling(full_width = FALSE)
```
Manta rays stand out immediately: 675 of 953 records are
`MACHINE_OBSERVATION`. The following confirms these are not real
behavioural sightings:
```{r}
#| label: machine-obs-evidence
manta_rays %>%
group_by(record_type) %>%
summarise(
n = n(),
pct_hour_0 = round(mean(hour == 0) * 100, 1),
median_hour = median(hour),
.groups = "drop"
) %>%
kable(col.names = c("Record type", "Count", "% at hour 0", "Median hour")) %>%
kable_styling(full_width = FALSE)
```
**100% of `MACHINE_OBSERVATION` records are recorded at hour 0.** Checking the hour distribution for human observations of manta rays shows a spread across daylight hours, which is what you would expect from divers and snorkellers. The machine records show no such spread. Whatever the recording mechanism, the hour 0 pattern carries no behavioural signal and these records are excluded from all time-of-day analysis in the function.
We are flagging hour = 0 in `human observations` but keeping them because for nocturnal organisms like glowworms, a genuine midnight sighting
is biologically plausible. We cannot distinguish a real midnight
sighting from a missing timestamp, so rather than silently dropping
these records we keep them and let the plausibility window (explained later) decide whether midnight is a credible hour for this organism.
### The join strategy
The occurrence datasets each contain a `ws_id` column (the nearest
weather station to each sighting). Rather than joining directly on
`ws_id` and `date` (which produces few matches because occurrence
stations often differ from weather stations), the `top_stations` table
is used as a bridge. It maps each organism to its two or three most
relevant weather stations, giving much better weather coverage.
Where multiple stations cover the same date, weather variables are
averaged across stations before joining. This also resolves the
many-to-many join warning that arises when a single sighting date
matches multiple station records.
### Hour distributions and ecological plausibility
The hour distribution for each organism reveals its activity pattern:
```{r}
#| label: hour-distributions
bind_rows(
glowworms |> mutate(organism = "Glowworms"),
gouldian_finch |> mutate(organism = "Gouldian Finch"),
manta_rays |> filter(record_type == "HUMAN_OBSERVATION") |>
mutate(organism = "Manta Rays (human only)"),
orchids |> mutate(organism = "Orchids")
) |>
filter(hour != 0) |>
count(organism, hour) |>
ggplot(aes(x = hour, y = n)) +
geom_col(fill = "#2D5A27") +
facet_wrap(~organism, ncol = 2, scales = "free_y") +
scale_x_continuous(breaks = seq(2, 22, by = 2)) +
labs(x = "Hour of day", y = "Number of sightings") +
theme_bw(base_size = 12) +
theme(
panel.grid.minor = element_blank(),
strip.background = element_rect(fill = "#F0EAD8"),
strip.text = element_text(face = "bold", size = 11)
)
```
Rather than hardcoding an activity window per organism, the function derives the plausible hour range directly from the data: the central 80% of sightings (10th to 90th percentile of the hour distribution). Outside this window, scores drop off gradually rather than cutting off sharply at the boundary.
### Data sparsity
How spread out are the sightings across month and hour combinations? A sparse grid means most combinations have only one or two sightings recorded, which makes it hard to know whether a pattern is real or just chance. The table below shows for each organism: how many distinct month and hour combinations have at least one sighting, the typical number of sightings per combination, how many combinations have only a single sighting (too few to trust), and how many have five or more (enough to suggest a real pattern).
```{r}
#| label: sparsity
purrr::map_dfr(
list(
Glowworms = glowworms,
`Gouldian Finch` = gouldian_finch,
`Manta Rays` = manta_rays,
Orchids = orchids
),
function(df) {
df %>%
filter(hour != 0, record_type != "MACHINE_OBSERVATION") %>%
count(month, hour) %>%
summarise(
`Month x hour combinations` = n(),
`Median sightings per combination` = median(n),
`Combinations with only 1 sighting` = sum(n == 1),
`Combinations with 5 or more sightings` = sum(n >= 5),
.groups = "drop"
)
},
.id = "Organism"
) |>
kable() |>
kable_styling(full_width = FALSE)
```
Glowworms and manta rays are very sparse: most month and hour combinations contain only a single sighting. Orchids and Gouldian Finch are much denser. If you simply ranked combinations by how often they appear in the data, a combination that shows up once would look just as convincing as one that shows up fifty times. That is a problem because one sighting could easily be a coincidence, while fifty sightings in the same month and hour is a genuine pattern. The confidence score fixes this by scaling down recommendations that are backed by very few sightings, so the output is honest about how much data is actually behind each result.
### Weather coverage and name conflicts
Not every sighting record has a matching weather observation. A second issue is that the sighting and weather datasets both have columns with the same names, such as `month`, `weekday`, and `year`. If you join them without being careful, R will keep both versions of each duplicate column and rename them `month.x` and `month.y`, which breaks any code that tries to use those columns afterwards. The fix is simple: select only the weather columns you actually need before joining, so the duplicates never arise.
```{r}
#| label: weather-coverage
check_coverage <- function(occ, name) {
joined <- occ |>
left_join(
weather |> select(ws_id, date, temp, prcp),
by = c("ws_id", "date")
)
tibble(
Organism = name,
`Total sightings` = nrow(joined),
`With weather data` = sum(!is.na(joined$temp)),
`Weather coverage %` = round(100 * mean(!is.na(joined$temp)), 1)
)
}
bind_rows(
check_coverage(manta_rays, "Manta Rays"),
check_coverage(gouldian_finch, "Gouldian Finch"),
check_coverage(glowworms, "Glowworms"),
check_coverage(orchids, "Orchids")
) |>
kable() |>
kable_styling(full_width = FALSE)
```
Glowworms have very low weather coverage when you join directly on `ws_id` and `date`. This happens because the weather station recorded nearest to each glowworm sighting often has very little data. The function gets around this by using the `top_stations` table, which identifies the two or three weather stations with the best and most complete records for each organism's region. Weather data from those stations is averaged across each date before joining, which gives much better coverage than a direct match. It also sidesteps the duplicate column name problem, since only the columns actually needed are brought in from the weather data.
------------------------------------------------------------------------
## The function
The function combines three components into a final score for every
month × hour combination:
**Composite score** — A weighted average of three things: how common sightings are in that month (40%), how common sightings are at that hour of day (40%), and how favourable the weather conditions are (20%). The weather component adjusts itself automatically using a measure called Cohen's d, which compares the weather on days when sightings were recorded against the weather on all other days at the same stations. If sightings tend to cluster in noticeably warmer or drier conditions, weather gets meaningful input into the score. If sightings happen in all kinds of weather with no clear pattern, the weather component contributes almost nothing and the score is driven by month and hour alone.
**Confidence score** — The confidence score is calculated using the formula `1 - exp(-n / k)`, where `n` is the number of sightings recorded for that month and hour combination and `k = 5` by default. The idea is simple: confidence builds quickly when you have a few sightings and then levels off as you add more. A single sighting gives a confidence of 0.18, meaning low confidence. Five sightings gives 0.63, ten gives 0.86, and twenty or more gets close to 0.98. It never reaches 1.0 because no finite amount of historical data can give you complete certainty. The k value controls how quickly confidence builds: a smaller k rewards sparse data more generously, a larger k is stricter and requires more sightings before trusting a recommendation.
**Ecological plausibility score** — Derived from the data rather than set manually for each organism. The function looks at all the hours when sightings have been recorded and finds the range that covers the middle 80% of them. Any hour that falls inside this range gets a full plausibility score of 1.0. Hours outside the range are not simply thrown out: their scores drop off gradually the further they are from the window, bottoming out at 0.2. This means an unusual hour still gets some credit if there is real data behind it, rather than being penalised as if it were impossible.
**Final score** = `composite × (0.6 × confidence + 0.4 × plausibility)`
This means:
- High confidence + plausible hour: high score, recommended strongly
- Low confidence + plausible hour: moderate score, flagged as sparse but biologically consistent
- High confidence + implausible hour: penalised, flagged as caution
- Low confidence + implausible hour: penalised hard
```{r}
#| label: function-definition
predict_best_spotting_times <- function(occurrence_data,
weather_data,
top_stations,
organism_name,
n = 5,
conf_k = 5) {
#1. Remove machine observations (explained earlier)
n_total <- nrow(occurrence_data)
n_machine <- sum(occurrence_data$record_type == "MACHINE_OBSERVATION",
na.rm = TRUE)
occ <- occurrence_data %>%
filter(record_type != "MACHINE_OBSERVATION")
if (n_machine > 0) {
message("Excluded ", n_machine, "/", n_total,
" MACHINE_OBSERVATION records (all recorded at hour 0 by default)")
}
#Flag hour = 0 in human observations
n_zero <- sum(occ$hour == 0, na.rm = TRUE)
occ <- occ %>% filter(!is.na(hour))
if (n_zero > 0) {
message(n_zero, " human observation record(s) have hour = 0. ",
"These are kept. For nocturnal organisms this may be a real ",
"sighting; the plausibility score will weight it accordingly.")
}
#Derive ecological plausibility window from the data
hour_dist <- occ %>%
count(hour) %>%
arrange(hour) %>%
mutate(cum_pct = cumsum(n) / sum(n))
plausible_hours <- hour_dist %>%
filter(cum_pct >= 0.10, cum_pct <= 0.90) %>%
pull(hour)
plaus_min <- min(plausible_hours)
plaus_max <- max(plausible_hours)
message("Ecological plausibility window (central 80% of sightings): ",
sprintf("%02d:00", plaus_min), " - ", sprintf("%02d:00", plaus_max))
plausibility_score <- function(h) {
dist <- pmax(0, pmax(plaus_min - h, h - plaus_max))
ifelse(h >= plaus_min & h <= plaus_max,
1.0,
pmax(0.2, cos(dist / 3 * pi / 2)))
}
#Get weather for top stations, averaged per date
org_stations <- top_stations %>%
filter(organism == organism_name)
if (nrow(org_stations) == 0) {
stop("No top stations found for organism: ", organism_name)
}
weather_org <- weather_data %>%
filter(ws_id %in% org_stations$ws_id) %>%
group_by(date, month, dayofyear) %>%
summarise(
temp = mean(temp, na.rm = TRUE),
rh = mean(rh, na.rm = TRUE),
prcp = mean(prcp, na.rm = TRUE),
rainy = mean(rainy, na.rm = TRUE),
wind_speed = mean(wind_speed, na.rm = TRUE),
.groups = "drop"
) %>%
# mean(na.rm = TRUE) on all-NA groups produces NaN, converted to NA
mutate(across(where(is.numeric), ~ifelse(is.nan(.x), NA, .x)))
#Join occurrences to weather
occ_weather <- occ %>%
select(date, hour, month, weekday, dayofyear) %>%
inner_join(weather_org, by = "date")
#Weather weights via Cohen's d
cohen_d <- function(x_obs, x_all) {
x_obs <- x_obs[!is.na(x_obs)]
x_all <- x_all[!is.na(x_all)]
if (length(x_obs) < 3 || length(x_all) < 3) return(0)
s <- sd(x_all)
if (is.na(s) || s == 0) return(0)
abs(mean(x_obs) - mean(x_all)) / s
}
d_temp <- cohen_d(occ_weather$temp, weather_org$temp)
d_rain <- cohen_d(occ_weather$rainy, weather_org$rainy)
d_wind <- cohen_d(occ_weather$wind_speed, weather_org$wind_speed)
d_rh <- cohen_d(occ_weather$rh, weather_org$rh)
total_d <- d_temp + d_rain + d_wind + d_rh + 1e-6
ideal_temp <- mean(occ_weather$temp, na.rm = TRUE)
ideal_rain <- mean(occ_weather$rainy, na.rm = TRUE)
ideal_wind <- mean(occ_weather$wind_speed, na.rm = TRUE)
ideal_rh <- mean(occ_weather$rh, na.rm = TRUE)
# Score each historical weather day by closeness to ideal conditions
weather_scored <- weather_org %>%
mutate(
weather_score = 1 - (
(d_temp / total_d) * pmin(abs(temp - ideal_temp) / 10, 1) +
(d_rain / total_d) * pmin(abs(rainy - ideal_rain), 1) +
(d_wind / total_d) * pmin(abs(wind_speed - ideal_wind) / 10, 1) +
(d_rh / total_d) * pmin(abs(rh - ideal_rh) / 100, 1)
),
weather_score = ifelse(is.na(weather_score), 0.5, weather_score)
) %>%
select(date, month, weather_score)
monthly_weather <- weather_scored %>%
group_by(month) %>%
summarise(avg_weather_score = mean(weather_score, na.rm = TRUE),
.groups = "drop")
#Frequency scores
month_scores <- occ %>%
count(month, name = "n_month") %>%
mutate(month_score = n_month / sum(n_month))
hour_scores <- occ %>%
count(hour, name = "n_hour") %>%
mutate(hour_score = n_hour / sum(n_hour))
#Confidence: sightings per month x hour cell
month_hour_counts <- occ %>%
count(month, hour, name = "n_cell")
#Build month x hour grid and score every combination
grid <- merge(data.frame(month = 1:12), data.frame(hour = 1:23))
results <- grid %>%
left_join(month_scores, by = "month") %>%
left_join(monthly_weather, by = "month") %>%
left_join(hour_scores, by = "hour") %>%
left_join(month_hour_counts, by = c("month", "hour")) %>%
replace_na(list(
month_score = 0,
avg_weather_score = 0.5,
hour_score = 0,
n_cell = 0
)) %>%
mutate(
# Base composite score
composite_score = 0.40 * month_score +
0.40 * hour_score +
0.20 * avg_weather_score,
# Confidence: exponential saturation curve
confidence = 1 - exp(-n_cell / conf_k),
# Ecological plausibility
plausibility = plausibility_score(hour),
# Final score: composite scaled by confidence-plausibility blend
confidence_weight = 0.6 * confidence + 0.4 * plausibility,
final_score = composite_score * confidence_weight,
# Labels
month_name = month.name[month],
time_label = sprintf("%02d:00", hour),
period = case_when(
hour < 12 ~ "Morning",
hour < 17 ~ "Afternoon",
TRUE ~ "Evening/Night"
),
confidence_tag = case_when(
confidence >= 0.63 & plausibility >= 0.8 ~
"High: strong data, ecologically expected",
confidence >= 0.18 & plausibility >= 0.8 ~
"Moderate: sparse data but ecologically plausible",
confidence >= 0.63 & plausibility < 0.8 ~
"Caution: data present but unusual time",
TRUE ~
"Low: sparse data and atypical time"
)
) %>%
arrange(desc(final_score))
top_results <- results %>% slice_head(n = n)
#Print results
cat("\n")
cat(" Best times to spot:", organism_name, "\n")
cat(rep("-", 60), "\n", sep = "")
cat(sprintf(" %-22s %-6s %-6s %-6s %s\n",
"When", "Score", "Conf.", "Plaus.", "Tag"))
cat(rep("-", 60), "\n", sep = "")
for (i in seq_len(nrow(top_results))) {
row <- top_results[i, ]
cat(sprintf(" %d. %-10s %s %.3f %.2f %.2f %s\n",
i,
row$month_name,
row$time_label,
row$final_score,
row$confidence,
row$plausibility,
row$confidence_tag))
}
cat("\n")
cat(" Plausibility window:",
sprintf("%02d:00", plaus_min), "-", sprintf("%02d:00", plaus_max), "\n")
cat(" Cohen's d, temp:", round(d_temp, 3),
"| rain:", round(d_rain, 3),
"| wind:", round(d_wind, 3),
"| rh:", round(d_rh, 3), "\n")
cat(" Confidence k =", conf_k,
"(1 sighting = 0.18, 5 = 0.63, 10 = 0.86, 20+ ~ 0.98)\n")
invisible(list(
recommendations = top_results,
full_grid = results,
plausibility_window = c(plaus_min, plaus_max),
cohen_d = c(temp = d_temp, rain = d_rain, wind = d_wind, rh = d_rh)
))
}
```
------------------------------------------------------------------------
## Results
### Glowworms
```{r}
#| label: results-glowworms
results_glow <- predict_best_spotting_times(
glowworms, weather, top_stations, "glowworms"
)
```
December dominates, which reflects the combined signal from all three states. December 15:00 ranks first, which is initially surprising for a bioluminescent organism. Looking at the state-level data explains why:
```{r}
#| label: glowworm-state-hour
glowworms %>%
filter(month == 12, hour != 0) %>%
mutate(
time_type = case_when(
hour >= 20 | hour <= 4 ~ "Night",
hour >= 17 ~ "Evening",
TRUE ~ "Daytime"
)
) %>%
count(obs_state, time_type) %>%
arrange(obs_state, time_type) %>%
pivot_wider(names_from = time_type, values_from = n, values_fill = 0) %>%
kable() %>%
kable_styling(full_width = FALSE)
```
Tasmania accounts for the majority of December sightings, and 19 of those are daytime, almost certainly from organised cave and tunnel tours where artificial darkness makes glowworms visible regardless of the time outside. Queensland sightings cluster after 18:00, which is consistent with natural bioluminescence behaviour.
**Note on scope:** The function uses all states and all months, so the December 15:00 result reflects the full dataset. Restricting to Queensland only, as the accompanying tutorial does, shifts the picture to November after dark. These are not contradictory results. They are the same data viewed through different filters, each ecologically defensible. The December daytime signal is Tasmania cave tourism; the November evening signal is wild Queensland glowworms. Knowing which question you are answering determines which filter to apply.
The confidence scoring system surfaced this result rather than suppressing it. A simple frequency ranking would have ranked December 22:00 first and moved on. The high confidence score at 15:00 prompted the investigation that revealed two genuinely different spotting contexts within one organism dataset: cave tourism in Tasmania and wild sightings in Queensland.
The plausibility window of 09:00 to 21:00 is broader than expected for a nocturnal organism precisely because it is data-derived and both contexts are present in the data.
### Gouldian Finch
```{r}
#| label: results-finch
results_finch <- predict_best_spotting_times(
gouldian_finch, weather, top_stations, "gouldian_finch"
)
```
All five top recommendations are at 07:00 across the dry season months (May to September). Confidence is 1.00 across the board, as this is the most data-rich dataset with a median of 8 sightings per month and hour combination. The early morning pattern is unambiguous and the weather effect sizes confirm that dry season conditions are meaningfully associated with sightings: temperature and humidity both differ noticeably between sighting days and typical days at the same stations.
### Manta Rays
```{r}
#| label: results-manta
results_manta <- predict_best_spotting_times(
manta_rays, weather, top_stations, "manta_rays"
)
```
The function automatically excludes 675 machine observation records before any analysis. After exclusion, June emerges as the peak month with mid-morning hours (08:00 to 11:00) scoring highest, consistent with divers and snorkellers operating during daylight in winter feeding aggregations.
June 22:00 appears at rank 4 with a "Caution" flag (confidence 0.97 but plausibility 0.50). This is the system working correctly: there are real human sightings recorded at 22:00 in June, but the hour sits outside the central 80% of the distribution. The recommendation surfaces with a clear warning rather than being silently suppressed or silently promoted.
The temperature effect size is the highest of any organism, which confirms that weather is a genuine predictor for manta rays. This reflects their sensitivity to water temperature during feeding season: they are more likely to be spotted when conditions are right, not just when people happen to be looking.
### Orchids
```{r}
#| label: results-orchids
results_orchid <- predict_best_spotting_times(
orchids, weather, top_stations, "orchids"
)
```
September dominates with all five slots falling between 10:00 and 14:00, which is Australian spring bloom season and the part of the day when pollinators are most active. Confidence is 1.00 throughout as orchids have a median of 20.5 sightings per month and hour combination, making this the most data-backed set of recommendations of all four organisms. The weather effect sizes for wind and humidity are both 0, confirming that for orchids, the month and time of day carry the prediction entirely. What day it is weather-wise barely matters: if it is September mid-morning, orchids are there.
------------------------------------------------------------------------
## What makes this approach different
The three components are multiplied together rather than added. This matters because it means low confidence combined with low plausibility compounds the penalty: a result that is both data-poor and falls at an unusual hour gets ranked very low. But a low-confidence recommendation that falls within a plausible hour still surfaces, labelled honestly as "Moderate: sparse data but ecologically plausible." The plain-English label means the output is immediately readable without having to interpret the underlying numbers.
The manta ray June 22:00 result and the glowworm December 15:00 result
are both examples of the system behaving correctly: one surfaced with a
caution flag, one surfaced with high confidence and prompted an
investigation that revealed a genuine biological and ecological pattern
in the data.