Comparing the Amount of Time ARUs are Utilized and the Number of Species at a Point Each Year

Mark Schulist

2022-07-07

Loading libraries

library(tidyverse)
library(data.table)
library(here)
library(vegan)
library(fs)
library(lubridate)
library(furrr)
source(here("caples_functions.R")) # common functions used throughout project
options(dplyr.summarise.inform = FALSE) # silence summarise()'s regrouping messages

Ingesting the data

dataML is the ARU (automated recording unit) data that has been classified by the machine learning algorithm created by Google. We drop the rows whose species is UNKN, as those are mostly noise and do not represent a bird species. dataPC is the point count data that comes directly from FileMakerPro.

dataML <- fread(here("machine_learning/eda/input/dataML.csv"), showProgress = FALSE) %>% 
  filter(species != "UNKN") # drop unclassified detections
dataPC <- fread(here("machine_learning/eda/input/dataPC.csv"))

Subsetting the ML data

The ML algorithm works by looking at a 2.5-second snippet of the recording and then scoring how closely that snippet's spectrogram matches every species that could possibly be detected in the study. The model reports each score as a logit, which is a way of representing a probability on the whole real line; under the standard log-odds definition, a logit of 0 corresponds to a probability of 0.5. For our purposes, I'm going to use 0 as the logit cutoff, so we filter the ML df (dataframe) to only include logits that are greater than or equal to 0. I'm also going to remove the times when the ARU was recording but was not on a point (when we were moving it to a new point). Finally, I'll add one column for the year and another for the day (for later use).

dataML_small <- dataML %>% 
  filter(logit >= 0, !is.na(point)) %>% 
  mutate(year = year(Date_Time), day = as_date(Date_Time)) # calendar date, so day + 1 stays valid across month boundaries
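To make the logit cutoff concrete, here is a minimal sketch of converting between logits and probabilities with base R. It assumes the classifier's logit is the usual log-odds, log(p / (1 - p)), which the source does not confirm; `prob_to_logit` is a hypothetical helper and not part of caples_functions.R.

```r
# Assumption: the model's logit is the standard log-odds of a probability.
logit_cutoff <- 0

# Inverse logit: the probability that corresponds to the cutoff
cutoff_prob <- plogis(logit_cutoff)

# Hypothetical helper: turn a probability threshold into a logit cutoff
prob_to_logit <- function(p) log(p / (1 - p))

# Logit cutoffs for a few example probability thresholds
example_cutoffs <- prob_to_logit(c(0.25, 0.5, 0.9))
```

Under that assumption, raising the cutoff above 0 keeps only detections the model scores as more likely than not by a wider margin, at the cost of discarding more true detections.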

Now that we have a df that only includes signals (birds were present), we can further subset it to see how many species are detected with less and less data. The ARUs were on points for at least 3 full days, so we are going to make three separate dfs that contain 1, 2, or 3 days' worth of data from each point each year.

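# first_day is the day after the first detection at a point in a given year,
# so the calendar day on which recording started is skipped
dataML_small <- dataML_small %>% 
  group_by(point, year) %>% 
  arrange(point, Date_Time) %>% # arrange() ignores groups, so this sorts the whole df
  mutate(first_day = first(day) + 1, second_day = first_day + 1, third_day = second_day + 1) 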

dataML_oneday <- dataML_small %>% 
  filter(first_day == day)

dataML_twoday <- dataML_small %>% 
  filter(first_day == day | second_day == day)

dataML_threeday <- dataML_small %>% 
  filter(first_day == day | second_day == day | third_day == day)
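The windowing above can be checked on a small example. The sketch below uses made-up detections for one hypothetical point and works with calendar dates rather than day of month, so a deployment that starts on 30 June and runs into July is handled. The column names `date`, `first_full_day`, and `day_index` are introduced here for illustration only.

```r
library(dplyr)
library(lubridate)

# Made-up detections for one hypothetical point; recording starts on 30 June
toy <- tibble(
  point = 1L,
  species = c("AMRO", "MOCH", "AMRO", "DEJU", "HETH"),
  Date_Time = ymd_hm(c("2021-06-30 18:00", "2021-07-01 06:00",
                       "2021-07-02 06:00", "2021-07-03 06:00",
                       "2021-07-04 06:00"))
) %>%
  mutate(year = year(Date_Time), date = as_date(Date_Time))

# Number each day relative to the day after the first detection
toy_windows <- toy %>%
  group_by(point, year) %>%
  mutate(first_full_day = min(date) + 1,
         day_index = as.integer(date - first_full_day) + 1L) %>%
  ungroup()

toy_oneday   <- filter(toy_windows, day_index == 1)
toy_threeday <- filter(toy_windows, day_index %in% 1:3)
```

A single `day_index` column also makes it easy to build a window of any length with one filter instead of one comparison per day.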

Summarizing ML Data

Now that we have three dataframes that contain 1, 2, or 3 days of the dataset, we can summarize them to see how the number of species observed changes as we include more data.

# Count distinct species per point and year; this gives the same result as the
# longer pipeline it replaces, without multi-row summarize() or random slicing
dataML_one_summary <- dataML_oneday %>% 
  group_by(point, year) %>% 
  summarize(species = n_distinct(species), .groups = "drop") %>% 
  mutate(days = "1 day", mean = mean(species))

# Same summary for the two-day window
dataML_two_summary <- dataML_twoday %>% 
  group_by(point, year) %>% 
  summarize(species = n_distinct(species), .groups = "drop") %>% 
  mutate(days = "2 days", mean = mean(species))

# Same summary for the three-day window
dataML_three_summary <- dataML_threeday %>% 
  group_by(point, year) %>% 
  summarize(species = n_distinct(species), .groups = "drop") %>% 
  mutate(days = "3 days", mean = mean(species))
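Because the three summaries differ only in their input and label, they could be produced by one function. The following is a sketch on made-up data; `summarise_richness` is a hypothetical helper, not one of the functions in caples_functions.R, and it assumes the input has `point`, `year`, and `species` columns.

```r
library(dplyr)

# Hypothetical helper: number of distinct species per point and year,
# labelled with the length of the time window it came from
summarise_richness <- function(df, label) {
  df %>%
    group_by(point, year) %>%
    summarize(species = n_distinct(species), .groups = "drop") %>%
    mutate(days = label, mean = mean(species))
}

# Made-up detections: two species at point 1, one species at point 2
toy <- tibble(
  point = c(1, 1, 1, 2, 2),
  year = 2021,
  species = c("AMRO", "AMRO", "DEJU", "MOCH", "MOCH")
)

toy_summary <- summarise_richness(toy, "1 day")
```

With a helper like this, each of the three summaries reduces to one call, for example `summarise_richness(dataML_oneday, "1 day")`.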

Plotting the Distributions

Now that we have the number of species detected on each point each year with 1 to 3 days of data, we can plot the distributions and compare them. The dashed red line is the mean number of species detected with the corresponding number of days of ARU data.

# Binding all of the data together to allow for faceting in ggplot
dataML_summary <- rbind(dataML_one_summary, dataML_two_summary, dataML_three_summary) %>% ungroup()

# Plotting with ggplot2
ggplot(dataML_summary, aes(species)) +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)), binwidth = 2, colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#FF6666") + # Overlay with transparent density plot
  # linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_vline(aes(xintercept = mean), color = "red", linetype = "dashed", linewidth = 1) +
  facet_wrap("days")