Loading Libraries
library(tidyverse)
library(data.table)
library(lubridate)
library(furrr)
library(hms)
library(here)
library(randomcoloR)
source(here("caples_functions.R")) # commonly used functions throughout project
options(dplyr.summarise.inform = FALSE) # suppress the summarise() grouping messages
Ingesting Data
Here I am loading the ML (machine learning) and PC (point count) datasets. The ML dataset contains every bird species that the machine learning model believes is present in the recordings. The PC dataset is the output from FileMakerPro, where the point count data were entered.
dataML <- fread(here("machine_learning/eda/input/dataML_all.csv")) %>%
  filter(species != "UNKN", species != "nonbird", species != "human") # drop the non-species classifications (unknown, non-bird, human), which are just noise
dataPC <- fread(here("machine_learning/eda/input/dataPC.csv"))
Tidying the ML data
The end goal with the ML data is to determine the average number of species that it detects in the first x minutes of each recording from 0530 to 0945, the same time frame in which the point counts were conducted. The data will be grouped by point and year so that we are comparing on a point-by-point basis, which should correct for the different numbers of species seen at different points and in different years. I will try to find the value of x (in minutes) at which the recordings yield the same number of species as the 10-minute point count. Theoretically, this should be around 10 minutes. If x > 10 minutes, then point count observers detect species more easily, possibly because they can both see and hear birds (instead of just hearing them); the ARUs could also be less sensitive to distant vocalizations that humans are able to hear. If x < 10 minutes, then the ARUs detect species more easily, possibly because they can identify multiple birds calling at once and because a human observer may cause birds to vocalize less or otherwise become more difficult to detect.
I hypothesize that the ARU data will need about 10% more time than the point count data to detect the same number of species. This is because of the small number of species that are only seen, never heard, such as flyovers and non-vocalizing birds (which are rarer than you would think).
Let's begin the tidying by subsetting the data to only contain recordings between 0530 and 0945. We are also going to use 0 as the logit cutoff, meaning that any time the ML model assigns a species a logit > 0, that species is counted as observed. A logit is a way of representing a probability on the scale of all real numbers.
dataML_small <- dataML %>%
  filter(logit > 0) %>%
  mutate(time = as_hms(Date_Time), year = year(Date_Time)) %>%
  filter(time >= 19800 & time <= 35100) # 19800 s = 05:30, 35100 s = 09:45
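For reference, here is a quick illustration of how a logit maps back to a probability, using base R's plogis() and qlogis() (my addition, not part of the pipeline):
# illustration only: logit = log(p / (1 - p)), so plogis() converts a logit to a probability
plogis(0)     # 0.5   -- a logit of 0 corresponds to a probability of 0.5
plogis(-2.5)  # ~0.08
qlogis(0.5)   # 0     -- qlogis() is the inverse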
Now we can summarize the data and determine the number of species detected at each point each year when only the first x seconds of each recording are used. I'll begin with 15-second increments out to 15 minutes, so we can watch the species count grow minute by minute (great song btw).
times <- seq(15, 900, by = 15) # 15-second steps out to 15 minutes
plan(multisession, workers = 32)
options(future.globals.maxSize = 8000 * 1024^2) # allow ~8 GB of globals to be exported to the workers
dataML_summary <- future_map_dfr(times,
  ~ (dataML_small %>%
      filter(Start_Time <= .x) %>%
      group_by(point, year, Date_Time, species) %>%
      summarize(logit = max(.data[["logit"]]), species) %>% # keep the best logit per species per recording
      ungroup(species) %>%
      summarize(species, nspecies = n_distinct(species), logit) %>% # nspecies = species detected in that recording
      group_by(point, year, species, nspecies, logit) %>%
      slice_sample(n = 1) %>% # fixing issue where the summarize() calls above make duplicate rows within a group
      group_by(point, year) %>%
      summarize(species = mean(nspecies)) %>% # average species count per point/year
      mutate(seconds = (.x))
  ))
Plotting ML Data
Now that we have the number of species detected at each point each year, we can plot how the number of species detected increases with time. Note that this maxes out at 15 minutes, so later we will need to extend the number of minutes and see when the number of species reaches an asymptote in relation to observation time.
ggplot(dataML_summary, aes(seconds, species)) +
  geom_jitter() +
  geom_smooth() +
  labs(title = 'Mean Number of Species on a Given Point Each Year \nWhen The Recording Time is Truncated',
       subtitle = "Species are summed from sequential 2.5-second snippets within a single recording")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Increasing the Maximum Number of Minutes in the Plot Above
While the slope of the best-fit curve is much flatter at 15 minutes than at 5 minutes, it still does not give the complete picture of how the number of species increases with recording time. Ideally, we would extend the above plot to longer periods of time so that the curve has a chance to flatten out further.
My first thought is to group_by() date and point and simply start from 0530, adding x minutes to that starting time. This might be fine since there are multiple days of data at each point, but it is not an even representation of observation time: the later minutes will contain fewer species because birds vocalize less later in the day. That reduces the number of species detected at later times and does not allow for a fair comparison.
Another option is to use slice_sample() and select random 2.5-second snippets. If done completely randomly, this is the best method, as it is the most objective way to get a random sample from the recordings, and it lets us extend the number of minutes that are used. It is important that we do NOT use the dataML_small df (dataframe), as it only contains high-logit 2.5-second snippets; we need to sample from the entire df and THEN filter for high logits. Let's try this method.
Wait a minute. After careful consideration, I realized that this was still using -2.5 as a logit cutoff, which made the "random" sampling not very random: it would only ever look through snippets that already had a logit > -2.5. I am now going to use the entire dataset (which will take a while) to combat this problem. I need to do some hacking in order to sample the df evenly.
Let’s first just filter the df to only contain times between 0530 and 0945.
dataML_large <- dataML %>%
  filter(!is.na(point)) %>% # drop the one point that is NA
  mutate(year = year(Date_Time), time = as_hms(Date_Time)) %>%
  filter(time >= 19800 & time <= 35100) # 19800 s = 05:30, 35100 s = 09:45
Okay, now we can begin the hacking. I am going to make an index df that gives a unique number to each combination of point, Date_Time, and Start_Time. Then, I will join that df with dataML_large so that each 2.5-second snippet carries a unique, randomly assigned id.
ML_index <- crossing(
  unique(dataML_large$Date_Time),
  unique(dataML_large$Start_Time),
  unique(dataML_large$point)
)
ML_index <- ML_index[sample(1:nrow(ML_index)), ] %>% # shuffle the rows so that id is assigned at random
  rowid_to_column(var = "id") %>%
  rename(Date_Time = 2, Start_Time = 3, point = 4) # rename the auto-generated columns by position
dataML_id <- dataML_large %>%
  left_join(ML_index, by = c("point", "Date_Time", "Start_Time"))
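As an optional sanity check (my addition, not part of the original workflow), every point / Date_Time / Start_Time combination in dataML_id should now carry exactly one id:
dataML_id %>%
  distinct(point, Date_Time, Start_Time, id) %>%
  count(point, Date_Time, Start_Time) %>%
  filter(n > 1) # should return zero rows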
Now we can finally run the big code that will sample the data and summarize it for us. I am going to keep the rows with the highest ids within each group, and because id is assigned completely randomly, it does not matter that they are the highest (we could just as well keep the lowest and achieve the same outcome). slice_max() is the function that completes this task.
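A toy example of why the direction does not matter (hypothetical data, just to illustrate the slice_max() behavior):
toy <- tibble(snippet = c("a", "b", "c", "d"), id = sample(1:4)) # ids assigned at random
slice_max(toy, n = 2, order_by = id) # keeps 2 snippets chosen at random, since id is random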
plan(multisession, workers = 32)
options(future.globals.maxSize = 8000 * 8192^2)
dataML_large_summary <- future_map_dfr(1:240,
  ~ (dataML_id %>%
      group_by(point, year) %>%
      slice_max(n = (.x * 534), order_by = id) %>% # 89 species per 2.5-second snippet and six snippets per 15 seconds = 534 rows, so this pulls .x 15-second chunks of data
      group_by(point, year, species) %>%
      summarize(logit = max(.data[["logit"]]), species) %>% # keep the best logit per species
      filter(logit > 0) %>%
      ungroup(species) %>%
      summarize(species, nspecies = n_distinct(species), logit) %>%
      group_by(point, year, species, nspecies, logit) %>%
      slice_sample(n = 1) %>% # fixing issue where it makes duplicate rows within a group
      group_by(point, year) %>%
      summarize(species = mean(nspecies)) %>%
      mutate(seconds = (.x * 15))
  )
)
Plotting Extended ML data
ggplot() +
  geom_jitter(data = slice_sample(dataML_large_summary, prop = 0.01), aes(seconds, species)) +
  geom_smooth(data = dataML_large_summary, aes(seconds, species)) +
  labs(title = 'Number of Species on a Given Point Each Year \nWhen the Number of 2.5-second Intervals is Varied',
       subtitle = 'The data are randomly sampled from 0530 to 0945 on a given point each year')
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Tidying PC Data
We are going to find the number of species at each point each year when a 10 minute point count is conducted. Because there are three distinct point counts done at each point each year, I am going to take the average number of species observed across all three point counts.
dataPC_summary <- dataPC %>%
  filter(eBird_6_code != "", point_ID_fk != 397) %>% # 397 was a practice point count at a point that was not used in the study
  rename(species = birdCode_fk, point = point_ID_fk) %>%
  mutate(year = year(Date)) %>%
  group_by(point, year, pointCount_ID_fk) %>%
  summarize(nspecies = n_distinct(eBird_6_code)) %>%
  group_by(point, year) %>%
  summarize(species = round(mean(nspecies)))
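If you want to verify the "three point counts per point per year" assumption before averaging, something like this would do it (an optional check I am adding, not part of the original workflow):
dataPC %>%
  filter(eBird_6_code != "", point_ID_fk != 397) %>%
  mutate(year = year(Date)) %>%
  distinct(point_ID_fk, year, pointCount_ID_fk) %>%
  count(point_ID_fk, year, name = "n_counts") %>%
  filter(n_counts != 3) # any rows here are point/years without exactly three counts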
Combining ML and PC Data
Now that we have the ML and PC data in the same format, we can merge them into a single dataframe and determine how many minutes of ARU data are equivalent to 10 minutes of point count data.
ML_PC_combined <- dataML_large_summary %>%
  left_join(dataPC_summary, by = c("point", "year")) %>%
  filter(species.x == species.y) %>% # keep the truncation lengths where the ARU count matches the point count
  group_by(point, year) %>%
  slice_min(order_by = seconds) %>% # first time the two counts match
  select(-species.y) %>%
  rename(species = species.x) %>%
  mutate(minutes = (seconds / 60))
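Because the filter keeps only rows where the ARU count exactly equals the point count total, not every point/year necessarily appears in ML_PC_combined; a quick optional look (my addition) at how many point/years made it through:
nrow(distinct(ML_PC_combined, point, year)) # point/years with an exact match
nrow(distinct(dataPC_summary, point, year)) # point/years surveyed by point counts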
Plotting Combined Data
Now that we have combined the data, let’s graph it to see how close my estimate is. Remember, I guessed that the ARU data would need about 10% more time than point counts to detect the same number of species.
ggplot(ML_PC_combined, aes(minutes)) +
  geom_histogram(aes(y = ..density..), binwidth = 1, color = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#FF6666") + # overlay a transparent density plot
  geom_vline(aes(xintercept = mean(minutes)), color = "red", linetype = "dashed", size = 1) + # dashed line marks the mean (the caption reports the median)
  labs(
    title = "Number of Minutes of an ARU Recording to Detect the \nSame Number of Species as a 10 Minute Point Count",
    subtitle = "Using the random-sampling dataset described above",
    caption = paste("Median Number of Minutes =", round(median(ML_PC_combined$minutes), 1))
  )
Wow, the ARU recordings needed even less time than I expected. I estimated 11 minutes (10% more than the 10-minute point count), but the actual median is about 6 minutes.
Simpler Method of Comparing
For this final method, I will concatenate the recordings into pseudo-hour-long recordings. Then I can determine the number of species in a sequential manner, without the random sampling that may be causing issues and overestimating the number of species. This method will also let us see how the number of species changes from the early morning to the late morning.
dataML_small <- dataML_small %>%
  mutate(
    subtime = case_when(
      time <= 26100 ~ "early", # 26100 = 0715 in seconds
      time >= 26100 ~ "late"
    ),
    subsubtime = case_when( # position of each recording within its morning block
      time == 19800 ~ 0,
      time == 21600 ~ 1,
      time == 23400 ~ 2,
      time == 25200 ~ 3,
      time == 27000 ~ 0,
      time == 28800 ~ 1,
      time == 30600 ~ 2,
      time == 32400 ~ 3,
      time == 34200 ~ 4
    ),
    day = day(Date_Time),
    offset_time = Start_Time + (subsubtime * 892.5) # shift each recording by its position so the block reads as one continuous recording
  ) %>%
  filter(!(is.na(point)))
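As a quick worked example of the offset (illustrative numbers only, not part of the pipeline):
# a detection 30 s into the 06:30 recording (subsubtime = 2, early block) lands at
30 + 2 * 892.5 # = 1815 s into the concatenated early-morning block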
bigtimes <- seq(10, 3600, by = 10) # 10-second steps out to 1 hour
plan(multisession, workers = 32)
dataML_summary_hour <- future_map_dfr(bigtimes,
  ~ (dataML_small %>%
      filter(offset_time <= .x) %>%
      group_by(point, year, day, subtime) %>%
      summarize(species, nspecies = n_distinct(species)) %>%
      group_by(point, year, day, subtime, species, nspecies) %>%
      slice_sample(n = 1) %>% # fixing issue where it makes duplicate rows within a group
      group_by(point, year, day, subtime) %>%
      summarize(species = mean(nspecies)) %>%
      mutate(seconds = (.x))
  )
)
Plotting Simple Method
First by time, then by point
ggplot() +
  geom_jitter(data = slice_sample(dataML_summary_hour, prop = 0.005), aes(seconds, species)) +
  geom_smooth(data = filter(dataML_summary_hour, subtime == "early"), aes(seconds, species, color = "early")) +
  geom_smooth(data = filter(dataML_summary_hour, subtime == "late"), aes(seconds, species, color = "late")) +
  labs(
    title = "Number of Species as Recording Time is Varied",
    subtitle = "Recordings are grouped into one-hour long sections within each day. \nThere are two sections for each day, one from 05:30 - 07:15, \nthe other from 07:30 - 09:45"
  )
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Now plotting by point. ggplot2's default colors are not distinct enough for this many points, which makes the plot difficult to read, so I use randomcoloR's distinctColorPalette() instead.
ggplot(data = dataML_summary_hour, aes(seconds, species, color = as.factor(point))) +
  geom_jitter(data = slice_sample(dataML_summary_hour, prop = 0.005)) +
  geom_smooth(aes(group = point)) +
  scale_color_manual(values = distinctColorPalette(n_distinct(dataML_summary_hour$point))) +
  labs(
    title = "Number of Species as Recording Time is Varied",
    subtitle = "Recordings are grouped into one-hour long sections \nwithin each day. \nThere are two sections for each day, \none from 05:30 - 07:15, \nthe other from 07:30 - 09:45. \nEach color represents a different point.",
    color = "point"
  )
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Comparing Simple Method to Point Counts
Comparing this method to the point counts uses almost identical code to the previous comparison; the only difference is that the input comes from this simpler, sequential dataset instead of the random-sampling one.
ML_PC_combined_simple <- dataML_summary_hour %>%
  group_by(point, year, seconds) %>%
  summarize(species = round(mean(.data[["species"]]))) %>%
  left_join(dataPC_summary, by = c("point", "year")) %>%
  filter(species.x == species.y) %>%
  group_by(point, year) %>%
  slice_min(order_by = seconds) %>%
  select(-species.y) %>%
  rename(species = species.x) %>%
  mutate(minutes = (seconds / 60))
Plotting Simple Comparison
ggplot(ML_PC_combined_simple, aes(minutes)) +
  geom_histogram(aes(y = ..density..), binwidth = 1, color = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#FF6666") + # overlay a transparent density plot
  geom_vline(aes(xintercept = mean(minutes)), color = "red", linetype = "dashed", size = 1) + # dashed line marks the mean (the caption reports the median)
  labs(
    title = "Number of Minutes of an ARU Recording to Detect the \nSame Number of Species as a 10 Minute Point Count",
    subtitle = "Using the Continuous Sampling Described Above",
    caption = paste("Median Number of Minutes =", round(median(ML_PC_combined_simple$minutes), 1))
  )
Wow, that is a lot more time than the random sampling method required. The random sampling probably overestimates the number of species because it draws 2.5-second snippets from many different dates and times of day, which increases the probability that any given sample contains a new species.