Determining the Length of an ARU Recording that Contains the Same Amount of Information as a 10 Minute Point Count

Mark Schulist

2022-07-15

Loading Libraries

library(tidyverse)
library(data.table)
library(lubridate)
library(furrr)
library(hms)
library(here)
source(here("caples_functions.R")) # commonly used functions throughout project
options(dplyr.summarise.inform = FALSE) # suppress summarise() grouping messages

Ingesting Data

Here I am loading the ML (machine learning) and PC (point count) datasets. The ML dataset contains every species the machine learning model believes is present in the recordings. The PC dataset is the export from FileMakerPro, where the point count data was entered.

dataML <- fread(here("machine_learning/eda/input/dataML.csv")) %>% 
  filter(species != "UNKN") # getting rid of the unknown classification as it is just noise
dataPC <- fread(here("machine_learning/eda/input/dataPC.csv"))

Tidying the ML data

The end goal with the ML data is to determine the average number of species the model detects in the first x minutes of each recording between 0530 and 0945, the same time frame in which the point counts were conducted. This will be grouped by point and year so that we are comparing on a point-by-point basis, which should correct for the different numbers of species seen at different points and in different years. I will then find the value of x at which the recordings yield the same number of species as the 10 minute point count. Theoretically, this should be around 10 minutes. If x > 10 mins, then point count observers detect species more easily, possibly because they can both see and hear birds (instead of just hearing them); the ARUs could also be less sensitive to distant vocalizations that humans can hear. If x < 10 mins, then ARUs detect species more easily, possibly because the model can identify multiple birds calling at once, and because a human observer's presence may cause birds to stop vocalizing and become harder to detect.

I hypothesize that the ARU data will need about 10% more time than the point count data to detect the same number of species. This accounts for the small number of species that are only seen, never heard, such as flyovers and non-vocalizing birds (which are rarer than you would think).

Let’s begin the tidying by subsetting the data to recordings between 0530 and 0945. We are also going to use 0 as the logit cutoff: any time the ML model gave a species a logit > 0, that species is counted as observed. A logit maps a probability onto the whole real line via logit(p) = log(p / (1 - p)), so a logit of 0 corresponds to a probability of 0.5.
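
As a quick illustration, base R’s qlogis() and plogis() implement the logit transform and its inverse:

qlogis(0.5)      # probability -> logit: log(0.5 / 0.5) = 0
plogis(0)        # logit -> probability: 0.5
plogis(c(-2, 2)) # logits of -2 and 2 correspond to probabilities of about 0.12 and 0.88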

dataML_small <- dataML %>% 
  filter(logit > 0) %>% # keep only confident detections
  mutate(time = as_hms(Date_Time), year = year(Date_Time)) %>% 
  filter(time >= as_hms("05:30:00"), time <= as_hms("09:45:00")) # the point count window

Now we can summarize the data and determine the number of species detected at each point each year when only the first x seconds of each recording are used. I’ll step through the recordings in 15-second increments out to 15 minutes and watch the species total grow minute by minute (great song btw).

times <- seq(15, 900, by = 15) # 15-second steps out to 15 minutes

plan(multisession, workers = 32) # parallelize across 32 background R sessions
options(future.globals.maxSize = 8000 * 1024^2) # raise the cap on exported globals to ~8 GB

dataML_summary <- future_map_dfr(times, function(t) {
  dataML_small %>% 
    filter(Start_Time <= t) %>%                   # keep only the first t seconds of each recording
    group_by(point, year, Date_Time) %>% 
    summarize(nspecies = n_distinct(species)) %>% # species richness per recording
    group_by(point, year) %>% 
    summarize(species = mean(nspecies)) %>%       # average richness across recordings
    mutate(seconds = t)
})

Plotting ML Data

Now that we have the number of species detected at each point each year, we can plot how the number of species detected increases with time. Note that this maxes out at 15 minutes, so later we will need to extend the number of minutes and see when the number of species reaches an asymptote in relation to observation time.

ggplot(dataML_summary, aes(seconds, species)) +
  geom_jitter() +
  geom_smooth() +
  labs(title = 'Mean Number of Species on a Given Point Each Year \nWhen the Recording Time is Truncated')
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Increasing the Maximum Number of Minutes in the Plot Above

While the slope of the best-fit curve is much flatter at 15 minutes than at 5 minutes, it still does not give the complete picture of how the number of species increases with recording time. Ideally, we would extend the plot to longer periods so the curve has a chance to flatten out.

My first thought is to group_by() date and point, start from 0530, and add x minutes to that starting time. This might seem fine since there are multiple days of data at each point, but it is not an even representation of observation time: birds vocalize less later in the day, so the later minutes would contribute fewer species and bias the comparison.

Another option is to use slice_sample() and select random 2.5-second snippets. Done completely randomly, this is the better method: every snippet between 0530 and 0945 has an equal chance of selection, so time of day cannot bias the sample, and it lets us extend the observation time well beyond 15 minutes. It is important that we do NOT sample from the dataML_small df (dataframe), as it only contains the high-logit 2.5-second snippets; we need to sample from the entire df and THEN filter for high logits. Let’s try this method.
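
To see why the order of operations matters, here is a toy illustration with made-up numbers (not project data): filtering before sampling makes every sampled snippet a detection, wildly inflating the apparent richness for a given amount of "listening" time.

set.seed(42)
toy <- tibble(snippet = 1:100,
              logit = c(rnorm(10, mean = 2), rnorm(90, mean = -4))) # ~10 confident detections out of 100 snippets

# Correct order: sample 10 snippets from everything, then keep the detections
toy %>% slice_sample(n = 10) %>% filter(logit > 0) %>% nrow() # ~1 detection, as a 25-second listen should give

# Wrong order: pre-filter, then sample -- every sampled snippet is a detection
toy %>% filter(logit > 0) %>% slice_sample(n = 10) %>% nrow() # 10 detections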

dataML_large <- dataML %>% 
  filter(!is.na(point)) %>% # drop records with a missing point ID
  mutate(year = year(Date_Time), time = as_hms(Date_Time)) %>% 
  filter(time >= as_hms("05:30:00"), time <= as_hms("09:45:00")) # the point count window
  
plan(multisession, workers = 32) # parallelize across 32 background R sessions
options(future.globals.maxSize = 8000 * 1024^2) # raise the cap on exported globals to ~8 GB

dataML_large_summary <- future_map_dfr(1:2880, function(n) { # 2880 snippets * 2.5 s = 2 hours
  dataML_large %>% 
    group_by(point, year) %>% 
    slice_sample(n = n) %>%   # n random 2.5-second snippets per point/year
    filter(logit > 0) %>%     # keep confident detections AFTER sampling
    summarize(species = n_distinct(species)) %>% 
    mutate(seconds = n * 2.5)
}, .options = furrr_options(seed = TRUE)) # sound parallel RNG for slice_sample()

Plotting Extended ML data

ggplot() +
  geom_jitter(data = slice_sample(dataML_large_summary, prop = 0.01), aes(seconds, species)) +
  geom_smooth(data = dataML_large_summary, aes(seconds, species)) +
  labs(title = 'Number of Species on a Given Point Each Year \nWhen the Number of 2.5-Second Intervals is Varied', 
  subtitle = 'The data are randomly sampled from 0530 to 0945 on a given point each year')
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
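
As a rough check on where this curve levels off, we can fit an asymptotic regression to the summarized data and read off the plateau. This is just a sketch using base R’s self-starting SSasymp model, not part of the original analysis:

# Fit an asymptotic curve; Asym is the estimated plateau in species richness
asym_fit <- nls(species ~ SSasymp(seconds, Asym, R0, lrc),
                data = dataML_large_summary)
coef(asym_fit)[["Asym"]] # estimated richness at very long recording times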

Tidying PC Data

We are going to find the number of species detected at each point each year during a 10 minute point count. Because three separate point counts were conducted at each point each year, I am going to take the average number of species observed across all three.

dataPC_summary <- dataPC %>% 
  filter(eBird_6_code != "", point_ID_fk != 397) %>% # 397 was a practice point count at a point that was not used in the study
  rename(point = point_ID_fk) %>% 
  mutate(year = year(Date)) %>% 
  group_by(point, year, pointCount_ID_fk) %>%        # one group per individual point count
  summarize(nspecies = n_distinct(eBird_6_code)) %>% # species richness per point count
  group_by(point, year) %>% 
  summarize(species = round(mean(nspecies)))         # average richness across the three counts

Combining ML and PC Data

Now that we have the ML and PC data in the same format, we can merge them into a single dataframe and determine the number of minutes of ARU data that is equivalent to 10 minutes of point count data.

ML_PC_combined <- dataML_large_summary %>% 
  left_join(dataPC_summary, by = c("point", "year")) %>% 
  filter(species.x == species.y) %>% # keep rows where the ML richness equals the PC richness
  group_by(point, year) %>% 
  slice_min(order_by = seconds) %>%  # earliest time at which the counts match
  select(-species.y) %>% 
  rename(species = species.x) %>% 
  mutate(minutes = (seconds / 60))
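
One caveat with the exact-equality filter: a point/year whose ML species count jumps past the point count total without ever equaling it gets silently dropped. A minimal variant (my sketch, not the original method) would keep those cases by taking the first time the ML count reaches at least the point count total:

ML_PC_reached <- dataML_large_summary %>% 
  left_join(dataPC_summary, by = c("point", "year")) %>% 
  filter(species.x >= species.y) %>%                   # ML count has reached the PC total
  group_by(point, year) %>% 
  slice_min(order_by = seconds, with_ties = FALSE) %>% 
  mutate(minutes = seconds / 60)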

Plotting Combined Data

Now that we have combined the data, let’s graph it to see how close my estimate is. Remember, I guessed that the ARU data would need about 10% more time than point counts to detect the same number of species.

ggplot(ML_PC_combined, aes(minutes)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1, color = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#FF6666") + # Overlay with transparent density plot
  geom_vline(aes(xintercept = mean(minutes)), color = "red", linetype = "dashed", size = 1) +
  labs(
    title = "Number of Minutes of an ARU Recording to Detect the \nSame Number of Species as a 10 Minute Point Count", 
    caption = paste("Mean Number of Minutes =", round(mean(ML_PC_combined$minutes), 1))
  )
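
For context, a quick uncertainty estimate on that mean (my addition, not part of the original analysis):

t.test(ML_PC_combined$minutes)$conf.int # 95% confidence interval around the mean minutes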

Not bad: I estimated 11 minutes (10% more than 10 minutes), and the actual mean is 12.7 minutes, so the ARUs needed roughly 27% more time than the point counts to detect the same number of species.