Encountering the Red-eyed Vireo: Predicting Red-eyed Vireo Encounter Rates Across Land Cover Types

Project Description: The goal of this project is to understand and predict Red-eyed Vireo encounter rates across various land cover types using citizen science data and the National Land Cover Database (NLCD).

In this study, encounter rates are analyzed using eBird data collected through the Traveling Protocol, which records bird observations alongside spatial and temporal information. Complete checklists from eBird are used because they help mitigate the taxonomic, spatial, and temporal biases that can arise in citizen science data. Integrating these observations with NLCD data enables the exploration of how land cover influences Red-eyed Vireo distribution and habitat preferences (Johnston et al., 2021).

The Red-eyed Vireo is a small songbird known for its olive-colored plumage and red eyes. These birds are prolific singers, singing up to twenty thousand songs each day during the breeding season.

Learn more about the Red-eyed Vireo from All About Birds (https://www.allaboutbirds.org/guide/Red-eyed_Vireo/id)

2. Data Wrangling: Checklists

eBird data are organized into checklists, each representing the observations from a single birding event such as a walk or a backyard watch. Each checklist contains the species observed, a count of individuals for each species, and spatial and temporal information. The acquired eBird dataset contains both complete and incomplete checklists. Both types can be informative, but researchers should be aware of an important distinction between them. On a complete checklist the observer reports every species detected, so a count of zero can be inferred for any species that is not listed. On an incomplete checklist no such inference is possible: if a species is missing, it is not possible to know whether it was not detected or whether the observer simply did not report it. An example is a birder who reports only species that are new to a personal list, rather than every species observed (Strimas-Mackey, 2023).
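To make the zero-filling idea concrete, the following is a minimal base R sketch on made-up data. The table and column names (checklists, detections, count) are hypothetical and are not taken from the eBird files; in the actual workflow this step is handled by auk_zerofill().

```r
# Toy sampling table: one row per complete checklist (hypothetical IDs)
checklists <- data.frame(checklist_id = c("S1", "S2", "S3", "S4"))

# Toy detection table: only the checklists that reported the species
detections <- data.frame(
  checklist_id = c("S1", "S3"),
  count = c(2L, 1L)
)

# Left join keeps every checklist; checklists without a report get NA
zero_filled <- merge(checklists, detections, by = "checklist_id", all.x = TRUE)

# On complete checklists a missing report is treated as a count of zero
zero_filled$count[is.na(zero_filled$count)] <- 0L
zero_filled$species_observed <- zero_filled$count > 0

print(zero_filled)
```

The same inference would not be valid for incomplete checklists, because an absent row there could mean either "not detected" or "not reported".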

library(auk)
library(lubridate)
library(sf)
library(gridExtra)
library(tidyverse)
# resolve namespace conflicts
select <- dplyr::select

# setup 
dir.create("data", showWarnings = FALSE)

# Read eBird data using auk
ebd <- auk_ebd("C:/PENNSTATE/GEOG588_Analytical Approaches_/Term_Project/Feb_22_TermProject/Term_Project/Data/ebd_US-WI-007_200201_202501_smp_relJan-2025.txt",
               file_sampling = "C:/PENNSTATE/GEOG588_Analytical Approaches_/Term_Project/Feb_22_TermProject/Term_Project/Data/ebd_US-WI-007_200201_202501_smp_relJan-2025_sampling.txt")


ebd_filters <- ebd %>% 
  auk_species("Red-eyed Vireo") %>% 
  auk_date(date = c("2002-01-01", "2025-01-01")) %>% 
  # restrict to the standard traveling and stationary count protocols
  auk_protocol(protocol = c("Stationary", "Traveling")) %>% 
  auk_complete()

# Define and create the data directory
data_dir <- "data"
if (!dir.exists(data_dir)) {
  dir.create(data_dir)
}
f_ebd <- file.path(data_dir, "ebd_Red_Eyed_Vireo.txt")
f_sampling <- file.path(data_dir, "ebd_checklists_Red_eye_Vireo.txt")

#  run if the files don't already exist
if (!file.exists(f_ebd)) {
  auk_filter(ebd_filters, file = f_ebd, file_sampling = f_sampling)
}

#Importing and Zero-filling
ebd_zf <- auk_zerofill(f_ebd, f_sampling, collapse = TRUE)




# function to convert time observation to hours since midnight
time_to_decimal <- function(x) {
  x <- hms(x, quiet = TRUE)
  hour(x) + minute(x) / 60 + second(x) / 3600
}

# clean up variables
ebd_zf <- ebd_zf %>% 
  mutate(
    # convert X to NA
    observation_count = if_else(observation_count == "X", 
                                NA_character_, observation_count),
    observation_count = as.integer(observation_count),
    # effort_distance_km to 0 for non-travelling counts
    effort_distance_km = if_else(protocol_type != "Traveling", 
                                 0, effort_distance_km),
    # convert time to decimal hours since midnight
    time_observations_started = time_to_decimal(time_observations_started),
    # split date into year and day of year
    year = year(observation_date),
    day_of_year = yday(observation_date)
  )

# additional filtering
ebd_zf_filtered <- ebd_zf %>% 
  filter(
    # effort filters
    duration_minutes <= 5 * 60,
    effort_distance_km <= 5,
    # 10 or fewer observers
    number_observers <= 10)


ebird <- ebd_zf_filtered %>% 
  select(checklist_id, observer_id, sampling_event_identifier,
         scientific_name,
         observation_count, species_observed, 
         state_code, locality_id, latitude, longitude,
         protocol_type, all_species_reported,
         observation_date, year, day_of_year,
         time_observations_started, 
         duration_minutes, effort_distance_km,
         number_observers)
write_csv(ebird, "data/ebd_Red_Eyed_Vireo_zf.csv", na = "")

#Plotting Observations types-Checklist vs sightings.
library(ggplot2)
library(sf)
library(scales) # for alpha color transparency

# Prepare eBird data for mapping (already cleaned and filtered)
ebird_sf <- ebd_zf_filtered %>% 
  # Convert to spatial points
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326)

The plot below shows the complete checklists and the observations of the Red-eyed Vireo.

ggplot() +
  # Plot eBird observations
  geom_sf(data = ebird_sf, aes(color = species_observed), size = 1, alpha = 0.7) +
  scale_color_manual(
    values = c("FALSE" = alpha("#FF0000", 0.25), # Red for checklists
               "TRUE" = alpha("#4daf4a", 1)),   # Green for sightings
    labels = c("eBird checklists", "Red-eyed Vireo Observations")
  ) +
  theme_minimal() +
  labs(
    title = "Red-eyed Vireo eBird Observations",
    subtitle = "January 2002 to January 2025",
    color = "Observation Type",
    x = "Longitude",
    y = "Latitude"
  ) +
  theme(legend.position = "bottom")

Figure 1: Complete checklists from 2002 to 2025. Red points are complete checklists on which the Red-eyed Vireo was not reported; green points are checklists with a confirmed observation of the species at that location.

3. Exploring eBird Checklists

This section examines the differences between using complete checklists (Figure 2) and all checklists (Figure 3) in eBird data analysis. While including all checklists offers a simpler approach to uncovering general trends or gaining basic insights into bird populations, complete checklists provide a more robust dataset by mitigating biases introduced by recorder preferences. Although complete checklists require more detailed and time-intensive workflows, they yield data of higher value for research purposes. This distinction is effectively illustrated through a histogram analyzing distances traveled within the “Traveling” protocol.

# Filter for traveling protocol type
checklists_traveling <- filter(ebird, protocol_type == "Traveling")

# Calculate the 95th percentile for effort_distance_km within the Traveling protocol
distance_95th <- quantile(checklists_traveling$effort_distance_km, 0.95, na.rm = TRUE)

ggplot(checklists_traveling) +
  aes(x = effort_distance_km, fill = effort_distance_km <= distance_95th) +
  geom_histogram(binwidth = 1, boundary = 0, aes(y = after_stat(count / sum(count)))) +
  scale_fill_manual(values = c("TRUE" = "blue", "FALSE" = "red"), guide = "none") +  # guide = FALSE is deprecated
  scale_x_continuous(limits = c(0, NA)) +  # Ensure x-axis starts at 0
  scale_y_continuous(limits = c(0, NA), labels = scales::label_percent()) +
  labs(
    x = "Distance Traveled [km]",
    y = "% of Traveling Protocol Checklists",
    title = "Distribution of Distance Traveled on Traveling Protocol Checklists",
    fill = "Within 95% Range"
  ) +
  theme_minimal()

Figure 2: This histogram illustrates the distribution of distances traveled using only complete checklists. The x-axis represents the distance traveled (in kilometers) and the y-axis shows the percentage of these checklists. Most checklists (over 50%) cover 0 to 1 km, showing that the majority of participants conduct very short-distance birdwatching sessions. As the distance increases, the percentage decreases, with very few checklists covering 4 to 5 km. The blue bars represent distances within the 95th percentile range (up to approximately 3.2 km), while the red bars highlight the small fraction of checklists exceeding this range.
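The percentile cutoff and the binned percentages described in the caption can be sketched in base R as follows. The distances are invented for illustration and the variable names are hypothetical; the sketch only mirrors the arithmetic, not the plotting.

```r
# Hypothetical traveling distances in km (not real eBird values)
distance_km <- c(0.2, 0.4, 0.5, 0.8, 0.9, 1.1, 1.3, 1.8, 2.5, 4.7)

# 95th percentile of distance, using the same quantile() call as the report
cutoff <- quantile(distance_km, 0.95, names = FALSE)

# Flag checklists at or below the cutoff (the blue versus red split)
within_range <- distance_km <= cutoff

# Share of checklists falling in each 1 km bin
bins <- cut(distance_km, breaks = 0:5, include.lowest = TRUE)
share <- prop.table(table(bins))

print(cutoff)
print(share)
```

In this toy sample half of the checklists fall in the first bin and a single long checklist lies beyond the cutoff, which is the same qualitative shape the histogram shows.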

Figure 3 is a histogram that uses all of the checklists.

All Checklists

Figure 3: This histogram visualizes the percentage of eBird checklists based on the distance traveled. The x-axis spans distances from 0 to 80 kilometers, while the y-axis displays the percentage of checklists, ranging up to 30%. The majority of checklists are for short distances, as shown by the dense cluster of blue bars. In contrast, the red bars, representing longer distances, are sparse, indicating far fewer checklists for these greater travel ranges.

Why the types of checklists matter

Using all checklists provides a larger dataset that can reveal broad trends, whereas complete checklists offer higher-quality data by addressing reporting biases. This distinction underlines the importance of choosing the right dataset for specific research objectives.

4. Exploring Red-eyed Vireo observations from a complete checklist (2002-2025)

Figure 4 shows how the number of observations and the number of unique observers have changed over time (grouped by year). The “number of observations” refers to the count of rows in the dataset. Each row represents a single observation or record.

# Ensure the `observation_date` column is in Date format
ebird <- ebird %>%
  mutate(observation_date = as.Date(observation_date, format = "%Y-%m-%d"))

# Add a "Year" column based on `observation_date`
ebird <- ebird %>%
  mutate(Year = as.numeric(format(observation_date, "%Y")))

# Group by year and summarize the data
yearly_counts <- ebird %>%
  group_by(Year) %>%
  summarize(Count = n(), Unique_Observers = n_distinct(observer_id))

# Plot observations and unique observers 
ggplot(yearly_counts, aes(x = Year)) +
  geom_bar(aes(y = Count), stat = "identity", fill = "skyblue") +
  geom_line(aes(y = Unique_Observers), color = "darkred", linewidth = 1) +
  geom_point(aes(y = Unique_Observers), color = "darkred", size = 2) +
  labs(x = "Year", y = "Number of Observations", 
       title = "Observations and Unique Observers by Year") +
  scale_y_continuous(name = "Number of Observations",
                     sec.axis = sec_axis(~., name = "Number of Unique Observers")) +
  theme_minimal()

Figure 4: Trends in the total number of observations (blue bars) and unique observers (red line) over time. The left y-axis represents the number of records reported each year, while the right y-axis shows the count of unique observers. The data highlight variations in participation across years. Because the counts are taken from the zero-filled ebird table, each record is a complete checklist, whether or not the Red-eyed Vireo was reported on it.
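The difference between counting records and counting distinct observers can be sketched in base R on a small invented table. The observer IDs and years below are hypothetical, and tapply() is used here in place of the dplyr summarize() call.

```r
# Invented checklist records: one row per checklist
toy <- data.frame(
  year = c(2019, 2019, 2019, 2020, 2020),
  observer_id = c("obsA", "obsA", "obsB", "obsB", "obsC")
)

# Number of records per year (the bars)
n_records <- tapply(toy$observer_id, toy$year, length)

# Number of distinct observers per year (the line)
n_observers <- tapply(toy$observer_id, toy$year,
                      function(x) length(unique(x)))

print(rbind(n_records, n_observers))
```

In 2019 one observer submitted two checklists, so three records correspond to only two observers; this is why the two series in the figure can diverge.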

Figure 5 shows the counts of observations by month.

# Ensure the `observation_date` column is in Date format
ebird <- ebird %>%
  mutate(observation_date = as.Date(observation_date, format = "%Y-%m-%d"))

# Create a complete list of months in calendar order
all_months <- tibble(Month = factor(month.name, levels = month.name))  # Ensure months are treated as ordered factors

# Summarize the data by month, filtering out rows where `observation_count` is 0
monthly_counts <- ebird %>%
  filter(observation_count != 0) %>%  # Keep only rows with non-zero observation counts
  mutate(Month = factor(format(observation_date, "%B"), levels = month.name)) %>%  # Extract and order month names
  group_by(Month) %>%
  summarize(Count = n(), .groups = 'drop') %>%
  right_join(all_months, by = "Month") %>%  # Join with all months
  replace_na(list(Count = 0))  # Replace missing counts with 0

# Plot 
ggplot(monthly_counts, aes(x = Month, y = Count)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(x = "Month", y = "Observation Count", 
       title = "Count of Observations by Month (Non-Zero Counts Only)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate month labels

Figure 5: Number of Red-eyed Vireo observations by month, excluding records where the observation count equals zero. The numbers are the totals for each month summed across all years. Observations peak in June, with over 2,000 records, followed by July and August with fewer sightings. The majority of observations occur during late spring and summer (May to September), with no records reported from October through April. This pattern reflects the species’ breeding season and migratory behavior.
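Keeping months with no records in the summary, so that October through April appear as zeros rather than disappearing, can also be done with a factor that carries all twelve levels. The base R sketch below uses invented dates and is an alternative to the right_join() approach in the report, not a description of it. It derives the month from the date's numeric month index, on the assumption that this is safer than format(date, "%B"), whose output depends on the system locale.

```r
# Hypothetical detection dates (not real eBird records)
obs_dates <- as.Date(c("2020-05-28", "2020-06-03", "2020-06-15",
                       "2021-06-20", "2021-07-04", "2021-08-11"))

# Month index (1-12) taken from the date, independent of system locale
month_index <- as.POSIXlt(obs_dates)$mon + 1

# Factor with all twelve levels so months without records are kept
month_factor <- factor(month.name[month_index], levels = month.name)

# table() reports a zero for every unused level
monthly_counts <- table(month_factor)
print(monthly_counts)
```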

Section 4 Encounter Rates of the Red-eyed Vireo for the “Traveling” protocol.

The NLCD data, which provide detailed land cover classifications, were loaded and clipped to the Bayfield County region to match the scope of the eBird observations. A spatial join was performed to associate individual bird observations with their corresponding land cover classes, allowing for an analysis of habitat use. Encounter rates were calculated by grouping data by land cover class and dividing the total number of Red-eyed Vireo observations by the total effort (distance traveled in kilometers).

“Total Observations” refers to the sum of all recorded counts of the species within each land cover class, and “Total Effort” represents the cumulative distance traveled (in kilometers) by observers within that class. Records were filtered to include only Traveling Protocol data to ensure effort metrics were consistently available and relevant. This mathematical approach standardized observations relative to observer effort, providing insights into the habitat preferences and accounting for variations in spatial and observer behavior.
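As a worked illustration of this calculation, the base R sketch below computes total observations, total effort, and their ratio for two land cover classes. All values are invented and the data frame name toy is hypothetical; the column names follow those used in the report.

```r
# Hypothetical traveling checklists (values invented for illustration)
toy <- data.frame(
  ClassName = c("Mixed Forest", "Mixed Forest", "Woody Wetlands",
                "Woody Wetlands", "Woody Wetlands"),
  observation_count = c(3, 0, 1, NA, 2),
  effort_distance_km = c(1.5, 0.5, 2.0, 1.0, 1.0)
)

# Sum counts and effort within each land cover class
total_obs <- tapply(toy$observation_count, toy$ClassName, sum, na.rm = TRUE)
total_effort <- tapply(toy$effort_distance_km, toy$ClassName, sum)

# Encounter rate = birds counted per kilometer traveled
encounter_rate <- total_obs / total_effort
print(round(encounter_rate, 3))
```

Here Mixed Forest has 3 birds over 2.0 km (1.5 per km) and Woody Wetlands has 3 birds over 4.0 km (0.75 per km). Note that a checklist with a missing count still adds its distance to the total effort when na.rm = TRUE is applied to each sum separately, which slightly lowers the rate; whether to drop such rows first is a judgment call.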

Data Preparation

library(auk)
library(lubridate)
library(sf)
library(gridExtra)
library(tidyverse)

glimpse(ebird)

## Rows: 22,904
## Columns: 20
## $ checklist_id              <chr> "S71389058", "S71410330", "S71410088", "S699…
## $ observer_id               <chr> "obsr398748", "obsr652990", "obsr652990", "o…
## $ sampling_event_identifier <chr> "S71389058", "S71410330", "S71410088", "S699…
## $ scientific_name           <chr> "Vireo olivaceus", "Vireo olivaceus", "Vireo…
## $ observation_count         <int> 1, 2, 6, 4, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ species_observed          <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, …
## $ state_code                <chr> "US-WI", "US-WI", "US-WI", "US-WI", "US-WI",…
## $ locality_id               <chr> "L11881816", "L11885894", "L11885879", "L118…
## $ latitude                  <dbl> 46.67769, 46.22388, 46.22452, 46.57175, 46.2…
## $ longitude                 <dbl> -90.87994, -91.14767, -91.16111, -91.49045, …
## $ protocol_type             <chr> "Traveling", "Traveling", "Traveling", "Trav…
## $ all_species_reported      <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
## $ observation_date          <date> 2020-07-12, 2020-07-12, 2020-07-12, 2020-06…
## $ year                      <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 20…
## $ day_of_year               <dbl> 194, 194, 194, 154, 216, 216, 202, 203, 203,…
## $ time_observations_started <dbl> 10.016667, 16.166667, 15.416667, 9.750000, 9…
## $ duration_minutes          <int> 38, 120, 40, 20, 20, 11, 31, 58, 13, 22, 6, …
## $ effort_distance_km        <dbl> 1.094, 0.400, 0.500, 0.402, 1.282, 0.000, 2.…
## $ number_observers          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,…
## $ Year                      <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 20…

# `ebird` as data set
ebird_sf <- st_as_sf(ebird, coords = c("longitude", "latitude"), crs = 4326)

# Load NLCD data
nlcd_data <- st_read("C:/PENNSTATE/GEOG588_Analytical Approaches_/Term_Project/Feb_22_TermProject/Term_Project/Data/Term_Spatial_Data.gdb",
                     layer = "NLCD_Bayfield_Dissolve")

## Reading layer `NLCD_Bayfield_Dissolve' from data source 
##   `C:\PENNSTATE\GEOG588_Analytical Approaches_\Term_Project\Feb_22_TermProject\Term_Project\Data\Term_Spatial_Data.gdb' 
##   using driver `OpenFileGDB'

## Simple feature collection with 15 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 318006 ymin: 734596.1 xmax: 381486 ymax: 837736.1
## Projected CRS: North_America_Albers_Equal_Area_Conic

#Change coordinate system of ebird_sf to match the NLCD data
ebird_sf <- st_transform(ebird_sf, crs = st_crs(nlcd_data))

#Join NLCD with ebird Data
ebird_nlcd <- st_join(ebird_sf, nlcd_data, join = st_intersects)

# Filter for Traveling protocol type and calculate encounter rates
Traveling_encounter_rates <- ebird_nlcd %>%
  filter(protocol_type == "Traveling") %>%  # Include only Traveling protocol type
  group_by(ClassName) %>%
  summarize(TotalObservations = sum(observation_count, na.rm = TRUE),
            TotalEffort = sum(effort_distance_km, na.rm = TRUE),
            EncounterRate = TotalObservations / TotalEffort)

Figure 6 plots encounter rates by NLCD class using the Traveling Protocol. This chart displays the encounter rates (observations per kilometer) across different National Land Cover Database (NLCD) classes.

library(ggplot2)
ggplot(Traveling_encounter_rates, aes(x = reorder(ClassName, EncounterRate), y = EncounterRate)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +
  labs(
    title = "Encounter Rates by NLCD Class\n(Traveling Protocol)",
    x = "NLCD Class",
    y = "Encounter Rate (Observations per Km)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14)
  )

Figure 6: The x-axis represents the encounter rate, expressed as the number of observations per kilometer for each NLCD (National Land Cover Database) land cover class. Higher values (closer to 1.2 on the x-axis) indicate that more birds were observed per kilometer in those land cover types, suggesting higher activity or habitat suitability for Red-eyed Vireos. Lower values (closer to 0) indicate fewer observations per kilometer. Encounter rates are highest in Developed Low Intensity areas, followed by Grassland/Herbaceous and Mixed Forest. The chart shows how bird observations vary across land cover types under the Traveling protocol.

Encounter Rates using Random Forest Model

A random forest model was used to explore the relationship between land cover types and encounter rates. Data were prepared for modeling using the same metrics as the previous encounter rate calculation; the randomForest package was then used to train the model, with encounter rate as the response variable and land cover class and mean checklist duration as predictors. The model highlighted the relative importance of each predictor and was applied to predict encounter rates for additional land cover classes.

Data Preparation

library(dplyr)
library(sf)
library(randomForest)
library(ggplot2)

# Load NLCD data
nlcd_data <- st_read("C:/PENNSTATE/GEOG588_Analytical Approaches_/Term_Project/Feb_22_TermProject/Term_Project/Data/Term_Spatial_Data.gdb",
                     layer = "NLCD_Bayfield_Dissolve")

## Reading layer `NLCD_Bayfield_Dissolve' from data source 
##   `C:\PENNSTATE\GEOG588_Analytical Approaches_\Term_Project\Feb_22_TermProject\Term_Project\Data\Term_Spatial_Data.gdb' 
##   using driver `OpenFileGDB'
## Simple feature collection with 15 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 318006 ymin: 734596.1 xmax: 381486 ymax: 837736.1
## Projected CRS: North_America_Albers_Equal_Area_Conic

# transform it to an sf object
ebird_sf <- st_as_sf(ebird, coords = c("longitude", "latitude"), crs = 4326)

# Change coordinate system of ebird_sf to match NLCD data
ebird_sf <- st_transform(ebird_sf, crs = st_crs(nlcd_data))

# Join NLCD with ebird data
ebird_nlcd <- st_join(ebird_sf, nlcd_data, join = st_intersects)

# Prepare data for Random Forest
rf_data_Traveling <- ebird_nlcd %>%
  filter(protocol_type == "Traveling") %>%  
  filter(effort_distance_km > 0) %>%  # drop zero-distance checklists before dividing
  mutate(EncounterRate = observation_count / effort_distance_km)

# Glimpse the prepared data
glimpse(rf_data_Traveling)

## Rows: 10,811
## Columns: 23
## $ checklist_id              <chr> "S71389058", "S71410330", "S71410088", "S699…
## $ observer_id               <chr> "obsr398748", "obsr652990", "obsr652990", "o…
## $ sampling_event_identifier <chr> "S71389058", "S71410330", "S71410088", "S699…
## $ scientific_name           <chr> "Vireo olivaceus", "Vireo olivaceus", "Vireo…
## $ observation_count         <int> 1, 2, 6, 4, 0, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ species_observed          <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, …
## $ state_code                <chr> "US-WI", "US-WI", "US-WI", "US-WI", "US-WI",…
## $ locality_id               <chr> "L11881816", "L11885894", "L11885879", "L118…
## $ protocol_type             <chr> "Traveling", "Traveling", "Traveling", "Trav…
## $ all_species_reported      <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
## $ observation_date          <date> 2020-07-12, 2020-07-12, 2020-07-12, 2020-06…
## $ year                      <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 20…
## $ day_of_year               <dbl> 194, 194, 194, 154, 216, 202, 203, 215, 215,…
## $ time_observations_started <dbl> 10.016667, 16.166667, 15.416667, 9.750000, 9…
## $ duration_minutes          <int> 38, 120, 40, 20, 20, 31, 58, 1, 22, 48, 18, …
## $ effort_distance_km        <dbl> 1.094, 0.400, 0.500, 0.402, 1.282, 2.832, 0.…
## $ number_observers          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,…
## $ Year                      <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 20…
## $ geometry                  <POINT [m]> POINT (368762.3 798819.6), POINT (3522…
## $ ClassName                 <chr> "Developed Open Space", "Woody Wetlands", "M…
## $ Shape_Length              <dbl> 10048109, 14844551, 41588907, 10048109, 1484…
## $ Shape_Area                <dbl> 137810129, 622722888, 1489469752, 137810129,…
## $ EncounterRate             <dbl> 0.9140768, 5.0000000, 12.0000000, 9.9502488,…

# Summarize data for training
training_data <- rf_data_Traveling %>%
  group_by(ClassName) %>%
  summarize(
    TotalObservations = sum(observation_count, na.rm = TRUE),
    TotalEffort = sum(effort_distance_km, na.rm = TRUE),
    MeanDuration = mean(duration_minutes, na.rm = TRUE),
    EncounterRate = TotalObservations / TotalEffort,  # Response variable
    .groups = "drop"
  )

# Train Random Forest Model
set.seed(123)
rf_model <- randomForest(
  EncounterRate ~ ClassName + MeanDuration,
  data = training_data,
  importance = TRUE,
  ntree = 500
)
print(rf_model)

## 
## Call:
##  randomForest(formula = EncounterRate ~ ClassName + MeanDuration,      data = training_data, importance = TRUE, ntree = 500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 0.1042197
##                     % Var explained: -18.03

# Prepare prediction dataset
prediction_data <- nlcd_data %>%
  group_by(ClassName) %>%
  summarize(
    MeanArea = mean(Shape_Area, na.rm = TRUE),
    MeanLength = mean(Shape_Length, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(MeanDuration = mean(training_data$MeanDuration, na.rm = TRUE))

# Update Random Forest
set.seed(123)
rf_model_updated <- randomForest(
  EncounterRate ~ ClassName,
  data = training_data,
  importance = TRUE,
  ntree = 500
)
print(rf_model_updated)

## 
## Call:
##  randomForest(formula = EncounterRate ~ ClassName, data = training_data,      importance = TRUE, ntree = 500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 0.1634497
##                     % Var explained: -85.11

# Predict encounter rates using the model
prediction_data$PredictedEncounterRate <- predict(rf_model_updated, newdata = prediction_data)

# Plot
ggplot(prediction_data, aes(x = reorder(ClassName, PredictedEncounterRate), y = PredictedEncounterRate)) +
  geom_bar(stat = "identity", fill = "cornflowerblue") +
  coord_flip() +
  labs(
    title = "Predicted Encounter Rates by NLCD Class",
    x = "NLCD Class",
    y = "Predicted Encounter Rate"
  ) +
  theme_minimal()

# Evaluate variable importance
varImpPlot(rf_model_updated)

# Summarize encounter rates for the Traveling protocol
Traveling_encounter_rates <- rf_data_Traveling %>%
  group_by(ClassName) %>%
  summarize(
    EncounterRate = sum(observation_count, na.rm = TRUE) / sum(effort_distance_km, na.rm = TRUE),  # Observations per Km
    .groups = "drop"
  )

Figure 7 plots predicted encounter rates by NLCD class using the Traveling Protocol and the random forest model. This chart displays the encounter rates (observations per kilometer) across different National Land Cover Database (NLCD) classes.

# Bar plot of encounter rates for the Traveling protocol
library(ggplot2)

ggplot(Traveling_encounter_rates, aes(x = reorder(ClassName, EncounterRate), y = EncounterRate)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +
  labs(
    title = "Encounter Rates by NLCD Class (Traveling Protocol)",
    x = "NLCD Class",
    y = "Encounter Rate (Observations per Km)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14)
  )

Figure 7: The graph shows the results of using the random forest model to predict encounter rates across the NLCD land cover classes in Bayfield County, WI. The highest encounter rates are observed in Developed Low Intensity. The lowest encounter rates are in the Barren Land and Cultivated Crops classes.

5 Direct Observation and Predicted Encounter Rates

The encounter rate results derived from direct observations provide a summary of Red-eyed Vireo activity across different land cover classes. These rates are calculated as the number of observations per kilometer of effort for each land cover class. This is a straightforward approach to working with the data and provides some useful insights.

Encounter rates predicted by the Random Forest model integrate multiple variables and use machine learning to predict encounter rates. This allows the model to account for interactions among variables and provides predictions for habitat types that may not have sufficient observational data. While this is a more complicated approach to predicting encounter rates, its ability to predict encounters in areas where no data exist may be extremely beneficial for conservation management strategies.
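The random forest summaries printed earlier report % Var explained values of -18.03 and -85.11. For regression forests this figure is understood to be a pseudo R-squared based on out-of-bag error, roughly 100 × (1 − MSE / variance of the response), so a negative value means the out-of-bag predictions were less accurate than simply predicting the mean encounter rate. The base R sketch below illustrates that arithmetic only: it does not call randomForest, the function name pct_var_explained is hypothetical, and all numbers are invented.

```r
# Invented encounter rates for five land cover classes
y <- c(0.2, 0.5, 0.9, 1.1, 0.4)

# Pseudo R-squared in percent: 100 * (1 - MSE / variance of the response)
pct_var_explained <- function(observed, predicted) {
  mse <- mean((observed - predicted)^2)
  # Variance with divisor n, so predicting the mean scores exactly 0
  var_n <- mean((observed - mean(observed))^2)
  100 * (1 - mse / var_n)
}

# Baseline: always predicting the overall mean gives 0%
baseline <- pct_var_explained(y, rep(mean(y), length(y)))

# Predictions that miss by more than the mean does give a negative value
poor_predictions <- c(0.9, 0.3, 0.4, 0.5, 1.0)
poor <- pct_var_explained(y, poor_predictions)

print(c(baseline = baseline, poor = poor))
```

Because the training table holds only one row per land cover class, negative values of this kind may partly reflect the very small sample, and the predicted rates are best read with that limitation in mind.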

6 Conclusion

This project successfully demonstrated the value of integrating best practices for citizen science data with the National Land Cover Database to analyze and predict Red-eyed Vireo encounter rates across land cover types. Combining direct observations with advanced modeling approaches, such as Random Forest, allows us to model encounter rates in areas that do not have current reported checklists. This approach may provide valuable information for developing conservation strategies in land management plans.

James Paterson GEOG Term Project

7 References