Introduction

The aim of this project is to build a machine learning model that can predict whether or not a certain bird species is vulnerable, if not already extinct. We will be using data from BIRDBASE, a global dataset of avian biogeography, conservation, ecology, and life history traits. We will implement multiple techniques to yield the most accurate model for this binary classification problem. It’s time to flock!

Motive

As a child, I remember my favorite pastime was always searching my sandbox for dinosaur fossils. Though I haven’t made any dinosaur discoveries yet, I switched to something more accessible and practical: their closest living relatives, birds. Nowadays, I spend my time looking through backyard bird guides and perfecting my bird feeder. I love all types of birds and discovering the new ones in my backyard, Santa Barbara. That being said, it’s obvious to those who pay attention to the outdoors that many bird species are at risk of extinction. Mornings are starting to sound more silent and it’s heartbreaking to see the decline over the years. I hope that my project can contribute a little bit to the conservation of birds.

Project Roadmap

The question I’d like to pose is the following: “How can we predict the vulnerability of a bird species based on it’s ecological traits?” Now that we understand the relevancy of the model, let’s chat about how we are going to build it throughout this project. We’ll start with some data manipulation and tidying, and then perform exploratory data analysis to get a bird’s eye view of our predictors. We hope to use predictor variables to predict a binary class “threat_status”, which will be our response variable detailing whether a bird species faces threat or not. We will then perform a treating/testing split on our data, cook up a recipe, and set folds for the 10-fold cross validation. Our models will be the following: Random Forest, Logistic Regression, Linear Discriminant Analysis, Decision Tree, K-Nearest Neighbor, and Quadratic Discriminant Analysis. We will identify the best performing model, and fit it to our testing dataset to analyze the prediction performance. Let’s ruffle some feathers!

Exploratory Data Anlaysis

Data Description

# Loading the data
birdbase_raw <- read_excel("BIRDBASE v2025.1 Sekercioglu et al. Final.xlsx")

We will obtain our data for this project through BirdBase, a global avian trait database with data on 78 traits of all bird species (11,589). These traits are distributed within 10 major categories: conservation, geographic distribution, morphology, elevation distribution, habitat, diet, social behavior, reproductive behavior, demography, and mobility. Each row is an observational unit of a bird species, my favorite being the magnificent Belted Kingfisher. While some little-known species still have incomplete data, the dataset provides a foundation for ornithologists around the world to conduct analyses in ornithology, conservation biology, and macroecology.

Variable Selection

Let’s mess around with the data a little bit to see what we’re currently working with!

dim(birdbase_raw)
## [1] 11590    97

The dataset contains 11,590 rows and 97 columns. That’s a lot! Let’s strategically narrow that down and rename our variables. First, we will identify species using their Common English Name, and remove the other 13 Taxonomy columns because categories like “Genus” are human-constructed labels rather than inherent biological traits. Similarly we will remove the “Source” variable, as it describes the data’s orgin rather than the bird itself. We will also remove the “Realm” variable because it represents broad geographic boundaries that overlap with our more specific ecological predictors. To avoid redundancy, lets simplify our other metrics:

  • Mass: We will remove 6 detailed mass variables and use the consolidated “Average Mass” variable
  • Elevation: Instead of using 2 specific variables about Elevation Distribution, we’ll just use “NormMin”, “NormMax”, and “Elevational Range”.
  • Habitat: Rather than tracking 16 individual habitat types, our primary focus will be on the “Primary Habitat” and habitat breath (“HB”) variables.
  • Diet: For Diet, we’ll use diet breath (“DB”) instead of the 14 more granular diet variables
  • Life History: We will isolate the most critical survival survival indicators (monogamy, clutch sizes, flightlessness, and sedentary status) and remove the 20 other variables to prevent extra noise.
# Selecting only the variables we want
selected_vars <- c("English Name (BirdLife > IOC > Clements>AviList)", '2024 IUCN Red List category', 'RR', 'ISL', 'LAT', 'Average Mass', 'NormMin', 'NormMax', 'Elevational Range', 'Primary Habitat', 'HB', 'Primary Diet', 'DB', 'Social_1', 'Social_2', 'Social_3', 'Social_4', 'Social_5', 'Social_6', 'Mono', 'Poly', 'Clutch_Min', 'Clutch_Max', 'Flightlessness', 'Sed')

birdbase_raw <- read_excel("BIRDBASE v2025.1 Sekercioglu et al. Final.xlsx", skip=1) |> 
            select(all_of(selected_vars))  

# Renaming our variables
birdbase_raw <- birdbase_raw  |> 
  rename(species = `English Name (BirdLife > IOC > Clements>AviList)`,
         threat_level = '2024 IUCN Red List category', restricted_range = 'RR', island_breeding = 'ISL',        
         latitudinal_range = 'LAT', average_mass = 'Average Mass', norm_min_elevation = 'NormMin',
         norm_max_elevation = 'NormMax', elevation_range = 'Elevational Range', 
         primary_habitat = 'Primary Habitat', num_habitats = 'HB', 
         primary_diet = 'Primary Diet', num_diets = 'DB', 
         colonial_behavior = 'Social_1', social_behavior = 'Social_2', family_behavior = 'Social_3', 
         singly_behavior = 'Social_4', solitary_behavior = 'Social_5', lekking_behavior = 'Social_6',
         clutch_min = 'Clutch_Min', clutch_max = 'Clutch_Max', 
         flight = 'Flightlessness', sedentary = 'Sed') 

Tidying our Data

We’ll start by tidying one of our most critical variables, threat_level. The IUCN Red List is a critical indicator of the health of the world’s biodiversity, which BirdBase uses to identify the threat_level of various bird species. After recoding the technical codes (“EX”) to their descriptive labels (“Extinct”), we will filter out species marked as “Data Deficient” and convert the remaining statuses into an ordered factor and transform the categories into a scale from 1 to 9.

To create a binary classification problem, we will create a new variable, threat_status, where 1 corresponds with any species with a threat level of 3 or higher (Vulnerable through Extinct) and 0 responds with any species below that threshold (Near-threatened or Least Concern).

# Tidying Threat Level
birdbase_raw <- birdbase_raw |>
  mutate(threat_level = fct_recode(threat_level,
    "Extinct"                = "EX",
    "Extinct in the Wild"    = "EW",
    "Possibly Extinct"                = "CR (PE)",
    "Possibly Extinct in the Wild"    = "CR (PEW)",
    "Critically Endangered" = "CR",
    "Endangered"            = "EN",
    "Vulnerable"            = "VU",
    "Near-threatened"       = "NT",
    "Least Concern"         = "LC",
    "Data Deficient"        = "DD"
  ))

# Filter threat_level that is not data deficient 
birdbase_raw <- birdbase_raw %>% 
  filter(threat_level != "Data Deficient")

# Converting to ordered factor and a numeric scale
threat_level_status <- c("Least Concern", "Near-threatened", "Vulnerable", "Endangered", "Critically Endangered", "Possibly Extinct in the Wild", "Possibly Extinct", "Extinct in the Wild", "Extinct")

birdbase_raw <- birdbase_raw|>
  mutate(threat_level = factor(threat_level, levels = threat_level_status, ordered = TRUE))
birdbase_raw <- birdbase_raw %>%
  mutate(threat_level = as.numeric(threat_level))

# Creating the threat_status variale
birdbase_raw <- birdbase_raw %>%
  mutate(threat_status = if_else(threat_level >= 3, 1, 0))

We’ll continue by cleaning up our other predictor variables. Let’s recode latitudinal_range from descriptive categories into a numeric scale from 1 to 5. For flight status, we’ll map “Yes” to 1 and everything else to 0. For sedentary status, we’ll use coalesce() to fill any missing values with 0, assuming that species without a recorded sedentary status are not sedentary.

# latitudinal_range
birdbase_raw <- birdbase_raw |>
  mutate(latitudinal_range = fct_recode(as.character(latitudinal_range),
    "Tropical"           = "1",
    "Tropical-Temperate"         = "2",
    "Temperate"   = "3",
    "Temperate-Polar"         = "4",
    "Tropical-Polar"          = "5"
  ))

# flight
birdbase_raw <- birdbase_raw |>
  mutate(flight = if_else(flight == "Yes", 1, 0))

# sedentary 
birdbase_raw <- birdbase_raw %>% mutate(sedentary = coalesce(sedentary, 0))

Let’s also convert our character data to factors for ease of modeling. For elevation_range, norm_min_elevation, and norm_max_elevation, we’ll convert the “NA” string into the NA value, and ensure that the data is stored as numeric. We’ll also mutate across our binary predictors to ensure they’re factors for our imputation steps later on.

# Converting character data to factors 
birdbase_raw <- birdbase_raw %>%
  mutate(across(c(norm_min_elevation, norm_max_elevation, elevation_range), 
                ~ as.numeric(na_if(as.character(.), "NA")))) %>%
  mutate(across(where(is.character), as.factor)) %>%
  mutate(across(c(Mono, Poly, restricted_range, sedentary, flight, threat_status), as.factor))

Dealing with Missing Data

Before moving on, we must address missing data which is common in ecological datasets. After looking at a table of the proportion of missing values for each variable, 50% seems like a reasonable missingness threshold, so we’ll remove variables where over half the values are missing (Mono and Poly). Due to a significant lack of data from some species, we’ll remove around 5000 species that have more than 2 NA values. We’re left with around 6,000 better studied species to work with.

# Remove variables with a large amount missing (over 50%)
missing_table_raw <- colMeans(is.na(birdbase_raw)) %>% 
  enframe(name = "variable", value = "proportion_missing")
missing_table_raw
## # A tibble: 26 × 2
##    variable           proportion_missing
##    <chr>                           <dbl>
##  1 species                        0     
##  2 threat_level                   0     
##  3 restricted_range               0.0160
##  4 island_breeding                0     
##  5 latitudinal_range              0     
##  6 average_mass                   0.0773
##  7 norm_min_elevation             0.0388
##  8 norm_max_elevation             0.0411
##  9 elevation_range                0.0540
## 10 primary_habitat                0     
## # ℹ 16 more rows
missing_table_over0.5 <- missing_table_raw |> filter(proportion_missing > 0.5)
missing_table_over0.5
## # A tibble: 2 × 2
##   variable proportion_missing
##   <chr>                 <dbl>
## 1 Mono                  0.706
## 2 Poly                  0.706
birdbase <- birdbase_raw %>% select(where(~ mean(is.na(.)) < 0.5))

# Drop rows with more than 2 NA values
birdbase <- birdbase %>%
  filter(rowSums(is.na(.)) < 3)

dim(birdbase)
## [1] 6205   24

After tidying, our “birdbase” dataset has 6205 rows and 24 columns! So much better! Let’s visualize the data before and after addressing the missing values to see the transformation.

# Visualizing the missing data
birdbase_raw |> vis_miss(cluster = TRUE) + theme(axis.text.x = element_text(angle = 90, size = 8, hjust = 1)) + labs(title = "Before Addressing the Missing Data")
birdbase |> vis_miss(cluster = TRUE) + theme(axis.text.x = element_text(angle = 90, size = 8, hjust = 1)) + labs(title = "After Addressing the Missing Data")

Codebook

Variable name Description
species The English Common Name of the bird from BirdLife International.
threat_level The IUCN Red List Status of the Bird.
restricted_range Restricted range (global range size<50,000 km2 ) species get a 1.
island_breeding 1 = Species’ breeding is restricted to island(s), 0 = species not restricted to an island.
latitudinal_range Latitudinal range of species based on tropics and polar circles.
average_mass The average mass value across males, females, and unsexed individuals in the dataset.
norm_min_elevation Normal lowest elevational limit of species.
norm_max_elevation Normal highest elevational limit of species.
elevation_range The difference between a species’ normal lowest and highest elevational extents.
primary_habitat The main habitat type utilized by a species.
num_habitats Habitat breadth, the number of major habitats used.
primary_diet The main food source of the species.
colonial_behavior Species exhibits colonial behavior.
social_behavior Species is social (large numbers of birds, mixed species flocks, seasonal flocks of the same species, etc).
family_behavior Species in pairs and family groups.
singly_behavior Species exhibits singly behavior.
solitary_behavior Species exhibits solitary behavior.
lekking_behavior Species exhibits lekking behavior.
clutch_min Lower number of eggs laid (excluding rare clutch sizes).
clutch_max Upper number of eggs laid (excluding rare clutch sizes).
flight The flight status of the bird species.
sedentary Sedentary, largely does not migrate. Typically a resident species, with some local/seasonal movements.
threat_status Binary classification variable with 1 being that the bird faces threat.

Visual EDA

To get a better idea of the particular distributions of our response variable and our predictors, we’ll play around with some visualization plots using ggplot. Because I adore the beautiful Steller’s Jay, we’ll use blue and black for our color schemes. Steller’s Jays are master mimics of apex predictors like the Red-tailed Hawk and also local rodents like squirrels!

Distribution of Threat Levels

By exploring the distribution of our outcome variable, threat_status, we can see that about 10% of bird species threat. As a reminder, a threat_status score of 1 means that the species faces threat, whereas a score of 0 means otherwise. This 10% represents a critical ecological concern. Much of this threat is driven by anthropogenic factors such as habitat fragmentation, overexploitation, and other human-induced environmental changes.

ggplot(birdbase, aes(x = threat_level)) +
  geom_bar(fill = "steelblue") +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Distribution of Threat Levels",
       x = "Threat Levels",
       y = "Number of Species")

ggplot(birdbase, aes(x = threat_status)) +
  geom_bar(fill = "steelblue") +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
  theme_minimal() +
  labs(
    title = "Distribution of Threat Status",
    x = "Threat Status",
    y = "Number of Species"
  )

### Threat Distribution by Sedentary Status

By analyzing the proportion of sedentary birds across different threat classifications, we can observe the relationship between sedentary status and threat. With the exception of already Extinct species (Level 9), the proportion of sedentary birds seems to decrease as threat level increases. A similar conclusion can be drawn with the threat status graph, which shows that birds that face threat have a smaller proportion of sedentary birds compared to birds that do not face threat. The general trend for living birds is that a higher threat level is correlated with a lower proportion of sedentary species. This is intuitive because non-sedentary species may have to migrate across multiple geographic regions, exposing them to a wider variety of environmental stressors and habitat fragmentation.

ggplot(birdbase, aes(x = factor(threat_level), fill = factor(sedentary))) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("0" = "steelblue", "1" = "black"), 
                    labels = c("Non-sedentary", "Sedentary")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Proportion of Sedentary Birds by Threat Level",
    x = "Threat Level (1-9)",
    y = "Proportion",
    fill = "Status"
  )

ggplot(birdbase, aes(x = factor(threat_status), fill = factor(sedentary))) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("0" = "steelblue", "1" = "black"), 
                    labels = c("Non-sedentary", "Sedentary")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Proportion of Sedentary Birds by Threat Status",
    x = "threat status",
    y = "Proportion",
    fill = "Status"
  )

Threat Distribution by Island Breeding Status

Island breeders represent species whose breeding is restricted to islands, making them more vulnerable to facing threat because their isolated populations have nowhere to retreat when faced with disaster. Unlike mainland birds, island breeders are often more specialized and lack the diversity to adapt as rapidly. This is corroborated by both graphs which show that as threat risk increases, as does the proportion of island breeders.

ggplot(birdbase, aes(x = factor(threat_level), fill = factor(island_breeding))) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("0" = "steelblue", "1" = "black"), 
                    labels = c("Mainland", "Island Breeder")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Proportion of Island Breeders by Threat Level",
    x = "Threat Level (1-9)",
    y = "Proportion",
    fill = "Status"
  )

ggplot(birdbase, aes(x = factor(threat_status), fill = factor(island_breeding))) +
  geom_bar(position = "fill") +
  # Steelblue for Mainland (0), Black for Island Breeder (1)
  scale_fill_manual(values = c("0" = "steelblue", "1" = "black"), 
                    labels = c("Mainland", "Island Breeder")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Proportion of Island Breeders by Threat Status",
    x = "threat status",
    y = "Proportion",
    fill = "Status"
  )

Threat Distribution by Average Mass

Species with a larger average mass tends to have smaller clutch sizes and longer incubation periods. Thus when faced with habitat loss or other disaster, larger birds cannot rebound as quickly as smaller birds that lay many eggs multiple times a year. The California Condor, the largest North American land bird, usually lay only one egg every other year. In the 1980’s, there were only 22 California Condors left but thanks to intensive conservation efforts, there are now over 500 condors today. Besides slower reproductive rates, larger birds tends to require more resources and are a target for over exploitation. Our boxplots highlight that threatened species often skew towards higher body masses.

ggplot(birdbase, aes(x = factor(threat_level), y = log(average_mass))) +
  geom_boxplot(fill = "steelblue", alpha = 0.8) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Bird Body Mass vs. Numeric Threat Score",
    subtitle = "Log-transformed mass across threat levels 1-9",
    x = "Threat Level (1-9)",
    y = "Log of Average Mass"
  )

ggplot(birdbase, aes(x = factor(threat_status), y = average_mass)) +
  geom_boxplot(fill = "steelblue", alpha = 0.8) +
  scale_y_log10() + # Keeps the scale readable for very large/small birds
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Average Mass by Threat Status",
    x = "Threat Status",
    y = "Average Mass (log10 scale)"
  )

Threat Distribution by Diet and Habitat

The proportion of threatened species varies significantly across ecological niches, with birds that inhabit the sea or follow scavenger diets showing notably higher risk levels. Scavengers and marine birds are uniquely vulnerable to toxins, like lead from carcasses or plastic pollution in the ocean. These visualizations highlight that a species’ primary habitat and diet are critical biological predictors for understanding vulnerability.

ggplot(birdbase, aes(x = primary_diet, fill = factor(threat_status))) +
  geom_bar(position = "fill") +
  # Steelblue for Secure (0), Black for Threatened (1)
  scale_fill_manual(values = c("0" = "steelblue", "1" = "black"), 
                    labels = c("0", "1")) +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Proportion of Threat Status by Primary Diet",
    x = "Primary Diet",
    y = "Proportion",
    fill = "Threat Status"
  )

ggplot(birdbase, aes(x = primary_habitat, fill = factor(threat_status))) +
  geom_bar(position = "fill") +
  # Steelblue for Secure (0), Black for Threatened (1)
  scale_fill_manual(values = c("0" = "steelblue", "1" = "black"), 
                    labels = c("0", "1")) +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Proportion of Threat Status by Primary Habitat",
    x = "Primary Habitat",
    y = "Proportion",
    fill = "Threat Status"
  )

### Correlation Plot

The correlation matrix reveals that related biological traits, such as elevation ranges and clutch sizes, tend to cluster together. Notably, threat_level shows a clear positive correlation with island_breeding, confirming that geographic isolation is a primary driver of extinction risk. It makes sense that variables like norm_max_elevation and norm_min elevation, as well as clutch_min and clutch_max are highly correlated.

bird_numeric <- birdbase |>
  select(where(is.numeric)) |>
  select(where(~sd(.x, na.rm = TRUE) > 0)) 

bird_cor <- cor(bird_numeric, use = "complete.obs") 

corrplot(bird_cor, 
         method = "color",          
         type = "upper",             
         order = "hclust",           
         col = COL2('RdBu'),         
         tl.col = "black",        
         tl.srt = 45,           
         tl.cex = 0.7,            
         title = "Correlation Matrix of Avian Numeric Traits",
         mar = c(0,0,1,0))

# Setting Up Models Yay! It’s almost time to start fitting models to our data to see if we can predict the threat status of various bird species based on the ecological predictors we talked about. But before we start fitting, let’s first split out data, make a recipe, and create folds for k-fold cross validation.

Train/test Split

The first step is to split our bird data into a training and testing set. The training set will be used to train our models. Once we find our best model, we’ll use the testing set to see how that best model truly performs. By splitting our data into testing and training sets, we prevent over-fitting which may happen if the model uses all of the available data to learn. 70% of our data will be for the training set, and 30% of our data will be for the testing set. The split is stratified on the outcome variable, threat_status, to ensure that both the training and the testing data have an equal distribution of threat_status. Let’s set our seed to be 240 for reproducibility. 240 mph is the hunting speed of the fastest animal on Earth: the Peregrine Falcon.

# Setting a seed
set.seed(240)

# Splitting the Data
bird_split <- initial_split(birdbase, prop = 0.70, strata = threat_status)

bird_train <- training(bird_split)
bird_test  <- testing(bird_split)

# Looking at the Split Data
bind_rows(
  bird_train |> mutate(dataset = "Training"),
  bird_test  |>  mutate(dataset = "Testing")
) |> 
  count(dataset) |> 
  mutate(proportion = n / sum(n))
## # A tibble: 2 × 3
##   dataset      n proportion
##   <chr>    <int>      <dbl>
## 1 Testing   1862      0.300
## 2 Training  4343      0.700

There are 4343 observations in the training dataset and 1862 observations in the testing dataset!

Parallel Computing

For computational efficiency, I’ll use parallel computing to distribute the workload across many CPU cores.

registerDoFuture()
plan(multisession, workers = parallelly::availableCores(omit = 1))
ctrl <- control_grid(
  parallel_over = "resamples",
  save_pred = FALSE,
  verbose = TRUE
)

Recipe Building

Let’s create one recipe for all of our models by associating our predictors with our outcome variable. We’ll use all 21 of our predictors in our recipe. In order to deal with missingness, we will impute the missing data in norm_min_elevation, norm_max_elevation, clutch_min, and clutch_max using linear regression of the other relevant predictors. We’ll impute restricted_range with mode. Similarly, let’s impute average_mass, elevation_range, and lekking_behavior using median. We’ll then make all of our categorical predictors in dummy variables and normalize our variables. We’ll remove predictors with “near-zero variance” because they provide little for prediction and remove variables that have correlation higher than 0.9 with other predictors. This helps with multicollinearity.

# Creating the Recipe
bird_recipe <- recipe(threat_status ~ restricted_range + island_breeding + latitudinal_range + 
                      average_mass + norm_min_elevation + norm_max_elevation + elevation_range + 
                      primary_habitat + num_habitats + primary_diet + num_diets + 
                      colonial_behavior + social_behavior + family_behavior + 
                      singly_behavior + solitary_behavior + lekking_behavior + 
                      clutch_min + clutch_max + flight + sedentary, 
                      data = bird_train) %>%
  
  step_impute_mode(restricted_range) %>%
  
  step_impute_median(average_mass, elevation_range, lekking_behavior) %>%
  
  step_impute_linear(clutch_min, clutch_max, 
                     impute_with = imp_vars(average_mass, num_diets)) %>%
  
  step_impute_linear(norm_min_elevation, norm_max_elevation, 
                     impute_with = imp_vars(elevation_range)) %>%
  
  step_dummy(all_nominal_predictors()) %>%
  
  step_nzv(all_predictors()) %>%
  
  step_normalize(all_numeric_predictors()) %>%

  step_corr(all_predictors(), threshold = 0.9)

# Prepping and Baking the Recipe
bird_prep <- prep(bird_recipe)
bird_baked <- bake(bird_prep, new_data = NULL)

K-Fold Cross Validation

We will create 10 folds to conduct k-fold stratified cross validation. This will help with providing a better estimate of testing accuracy. We’ll stratify on our response variable, threat_status, to prevent imbalanced data.

bird_folds <- vfold_cv(bird_train, v = 10, strata = threat_status)

Building Prediction Models

It’s finally time to build our models! As I’ve mentioned at the start, we’ll be training six different machine learning techniques all with the recipe we’ve just created.

ROC AUC will be our primary metric of performance because the metric is effective at dealing at binary classification with imbalanced data. The metric measures the area under the curve for the receiver operating characteristic (ROC) curve. The process for the building the model is outlined below:

    1. Set up the model by defining the model type, setting the engine, and establishing the mode (classification).
    1. Initialize the workflow, add the new model, and add the established recipe. Skip steps 3-5 for Logistic Regression, LDA, and QDA.
    1. Define the tuning grid with specific parameters and levels.
    1. Tune the models.
    1. Identify the best performing model based on ROC AUC and update the workflow with best parameters.
    1. Reform the final fit to the training data set.
    1. Save our results to an RDS file.

Logistic Regression

glm_model <- logistic_reg(mode = "classification") %>%
  set_engine("glm")

glm_wf <- workflow() %>%
  add_recipe(bird_recipe) %>%
  add_model(glm_model)

glm_fit <- fit_resamples(glm_wf, resamples=bird_folds)

Linear Discriminant Analysis

lda_model <- discrim_linear(mode="classification") %>%
  set_engine("MASS")

lda_wf <- workflow() %>%
  add_recipe(bird_recipe) %>%
  add_model(lda_model)

lda_fit <- fit_resamples(lda_wf, resamples=bird_folds)

Quadratic Discriminant Analysis

qda_model <- discrim_quad(mode="classification") %>%
  set_engine("MASS")

qda_wf <- workflow() %>%
  add_recipe(bird_recipe) %>%
  add_model(qda_model)

qda_fit <- fit_resamples(qda_wf, resamples=bird_folds)

Random Forest

# Random Forest 
rf_model <- rand_forest(
  mtry = tune(),    
  min_n = tune(), 
  trees = tune()
) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_recipe(bird_recipe) %>%
  add_model(rf_model)

rf_grid <- grid_regular(mtry(
  range = c(1, 12)), 
  trees(range = c(200,1000)), 
  min_n(range = c(5,20)), 
  levels = 5)

rf_res <- tune_grid(
  rf_wf,
  resamples = bird_folds,
  grid = rf_grid, 
  control = ctrl
)
# Create a folder for the tuned models
dir.create("pstat131final/data/tuned_models", recursive = TRUE, showWarnings = FALSE)

write_rds(rf_res, file = "pstat131final/data/tuned_models/rf.rds")
rf_tuned <- read_rds(file = "pstat131final/data/tuned_models/rf.rds")

Decision Tree

# Decision Tree 
dt_model <- decision_tree(
  cost_complexity = tune(),
  tree_depth = tune(),   
  min_n = tune()            
) %>%
  set_engine("rpart") %>%
  set_mode("classification")

dt_wf <- workflow() %>%
  add_recipe(bird_recipe) %>%
  add_model(dt_model)

dt_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5
)

dt_res <- tune_grid(
  dt_wf,
  resamples = bird_folds,
  grid = dt_grid,
  control = ctrl
)
write_rds(dt_res, file = "pstat131final/data/tuned_models/dt.rds")
dt_tuned <- read_rds(file = "pstat131final/data/tuned_models/dt.rds")

K-Nearest Neighbors

# KNN
knn_model <- nearest_neighbor(
  neighbors = tune(),
) %>%
  set_engine("kknn") %>%
  set_mode("classification")


knn_wf <- workflow() %>%
  add_recipe(bird_recipe) %>%
  add_model(knn_model)


knn_grid <- grid_regular(
  neighbors(range = c(1, 40)),
  levels = 5
)

knn_res <- tune_grid(
  knn_wf,
  resamples = bird_folds,
  grid = knn_grid,
  control = ctrl
)
write_rds(knn_res, file = "pstat131final/data/tuned_models/knn.rds")
knn_tuned <- read_rds(file = "pstat131final/data/tuned_models/knn.rds")

Model Results

With our outcomes stored for our models, it’s time to take a view into how our models performed!

Model Autoplots

To understand how our tuning parameters affected the results, we’ll use one of the most useful tools for visualizing performance: autoplots. With autoplots, we can analyze how changes in certain parameters have affected our metric of choice, roc_auc.

Decision Tree Plot

The decision tree tuning parameters included 5 levels of tree depth, minimal node size, and cost complexity. A tree depth of 1 fails to capture any nuance whiles depths between 8 and 15 consistently achieve the highest ROC AUC.. A minimal node size of 30 resulted in better outcomes. The model exhibits a sharp drop in performance once the cost-complexity parameter exceeds a certain point.

dt_tuned %>% 
  autoplot(metric = "roc_auc")

Random Forest Plot

For the random forest, three parameters were tuned: the number of randomly selected predictors, the number of trees, and the minimum node size. The autoplot demonstrates that a higher minimum node size of 16 or 20 provides the best ROC AUC score. As the number of predictors reaches a midpoint of between 2.5 and 5, the outcome peaks. The number of trees was most optimal also at the midpoint, around 600 trees.

rf_tuned %>% 
  autoplot(metric = "roc_auc")

K-Nearest Neighbor Plot

For the K-Nearest Neighbor model, the tuning process focused on a single parameter: the number of nearest neighbors. It’s evident in the plot that as the number of nearest neighbors increases, as does the ROC AUC score. That being said, the ROC AUC increases at an almost exponential rate, so the benefits of having more neighbors decreass once the number is sufficiently large.

knn_tuned %>% 
  autoplot(metric = "roc_auc")

Accuracy of the Models

In order to summarize the best ROC AUC values from each of our 6 models, we will create a comparative tibble to display the estimated optimal ROC AUC for each of the fitted models.

# Retrieving the Results
log_auc <- collect_metrics(glm_fit) %>% filter(.metric == "roc_auc") %>% pull(mean)
lda_auc <- collect_metrics(lda_fit) %>% filter(.metric == "roc_auc") %>% pull(mean)
qda_auc <- collect_metrics(qda_fit) %>% filter(.metric == "roc_auc") %>% pull(mean)
rf_res <- read_rds("pstat131final/data/tuned_models/rf.rds")
dt_res <- read_rds("pstat131final/data/tuned_models/dt.rds")
knn_res <- read_rds("pstat131final/data/tuned_models/knn.rds")

rf_auc <- show_best(rf_res, n = 1, metric = "roc_auc") %>% pull(mean)
dt_auc <- show_best(dt_res, n = 1, metric = "roc_auc") %>% pull(mean)
knn_auc <- show_best(knn_res, n = 1, metric = "roc_auc") %>% pull(mean)

# Creating the Tibble
bird_results <- tibble(
  Model = c("Random Forest", "K-Nearest Neighbor", "Decision Tree", 
            "Logistic Regression", "LDA", "QDA"),
  ROC_AUC = c(rf_auc, knn_auc, dt_auc, log_auc, lda_auc, qda_auc)
) %>% 
  arrange(desc(ROC_AUC))

# Showing the Tibble
bird_results
## # A tibble: 6 × 2
##   Model               ROC_AUC
##   <chr>                 <dbl>
## 1 Random Forest         0.848
## 2 Logistic Regression   0.804
## 3 LDA                   0.804
## 4 K-Nearest Neighbor    0.784
## 5 QDA                   0.772
## 6 Decision Tree         0.768

Let’s also create a bar plot to help with visualization!

bird_comparison_plot <- bird_results %>%
  ggplot(aes(x = reorder(Model, ROC_AUC), y = ROC_AUC)) +
  geom_col(fill = "steelblue", width = 0.7) +
  geom_text(aes(label = round(ROC_AUC, 3)), vjust = -0.5, size = 3.5) +
  labs(
    title = "Comparing Performance between Models",
    x = "Model",
    y = "ROC AUC"
  ) +
  theme_minimal()

bird_comparison_plot

The top performer is the Random Forest model with an ROC AUC of 0.856, outperforming the other models. Logistic Regression and LDA performed nearly identically, while the Decision Tree model performed the worst. That being said, all of our models performed fairly well, with an ROC AUC of at least 0.76 across the board.

Results From the Best Model

Now that we know the random forest performed the best out of the models, let’s see which tuned parameters which chosen for the best random forest model.

# Showing the best Random Forest Model.
show_best(rf_tuned, metric = "roc_auc") %>% 
  select(-.estimator, .config) %>%
  slice(1)
## # A tibble: 1 × 8
##    mtry trees min_n .metric  mean     n std_err .config          
##   <int> <int> <int> <chr>   <dbl> <int>   <dbl> <chr>            
## 1     3   600    20 roc_auc 0.848    10 0.00804 pre0_mod040_post0

It looks like random forest model #65 is our winner winner chicken dinner! It has 6 predictors, 600 trees, and a minimal node size of 20.

Testing Random Forest Model #65

Let’s see how the best model from the tuned random forest performs when fitted to the training data and used to predict data that it hasn’t seen: the testing data!

# Selecting the best Random Forest parameters 
best_rf <- select_best(rf_tuned, metric = "roc_auc")

# Create the best workflow
final_rf_wf <- finalize_workflow(rf_wf, best_rf)

# Fitting the model to the entire training set
bird_rf_fit <- fit(final_rf_wf, data = bird_train)

bird_predict <- predict(bird_rf_fit, 
                        new_data = bird_test, 
                        type = "class")

# Bind the predictions to the actual data
bird_predict_with_actual <- bird_predict %>%
  bind_cols(bird_test) 

# Showing the results
bird_predict_with_actual
## # A tibble: 1,862 × 25
##    .pred_class species             threat_level restricted_range island_breeding
##    <fct>       <fct>                      <dbl> <fct>                      <dbl>
##  1 0           Common Ostrich                 1 0                              0
##  2 0           Somali Ostrich                 3 0                              0
##  3 0           Lesser Rhea                    1 0                              0
##  4 0           Solitary Tinamou               2 0                              0
##  5 0           Great Tinamou                  1 0                              0
##  6 0           Highland Tinamou               1 0                              0
##  7 0           Hooded Tinamou                 1 0                              0
##  8 0           Little Tinamou                 1 0                              0
##  9 0           Red-legged Tinamou             1 0                              0
## 10 0           Slaty-breasted Tin…            3 0                              0
## # ℹ 1,852 more rows
## # ℹ 20 more variables: latitudinal_range <fct>, average_mass <dbl>,
## #   norm_min_elevation <dbl>, norm_max_elevation <dbl>, elevation_range <dbl>,
## #   primary_habitat <fct>, num_habitats <dbl>, primary_diet <fct>,
## #   num_diets <dbl>, colonial_behavior <dbl>, social_behavior <dbl>,
## #   family_behavior <dbl>, singly_behavior <dbl>, solitary_behavior <dbl>,
## #   lekking_behavior <dbl>, clutch_min <dbl>, clutch_max <dbl>, flight <fct>, …
# Using the augment function
bird_augmented <- augment(bird_rf_fit, new_data = bird_test)
bird_augmented
## # A tibble: 1,862 × 27
##    .pred_class .pred_0 .pred_1 species             threat_level restricted_range
##    <fct>         <dbl>   <dbl> <fct>                      <dbl> <fct>           
##  1 0             0.871  0.129  Common Ostrich                 1 0               
##  2 0             0.874  0.126  Somali Ostrich                 3 0               
##  3 0             0.898  0.102  Lesser Rhea                    1 0               
##  4 0             0.825  0.175  Solitary Tinamou               2 0               
##  5 0             0.792  0.208  Great Tinamou                  1 0               
##  6 0             0.859  0.141  Highland Tinamou               1 0               
##  7 0             0.948  0.0522 Hooded Tinamou                 1 0               
##  8 0             0.914  0.0864 Little Tinamou                 1 0               
##  9 0             0.953  0.0473 Red-legged Tinamou             1 0               
## 10 0             0.864  0.136  Slaty-breasted Tin…            3 0               
## # ℹ 1,852 more rows
## # ℹ 21 more variables: island_breeding <dbl>, latitudinal_range <fct>,
## #   average_mass <dbl>, norm_min_elevation <dbl>, norm_max_elevation <dbl>,
## #   elevation_range <dbl>, primary_habitat <fct>, num_habitats <dbl>,
## #   primary_diet <fct>, num_diets <dbl>, colonial_behavior <dbl>,
## #   social_behavior <dbl>, family_behavior <dbl>, singly_behavior <dbl>,
## #   solitary_behavior <dbl>, lekking_behavior <dbl>, clutch_min <dbl>, …

ROC Curve

Let’s take a peak at the ROC curve from our best model! The plot demonstrates a strong predictive performance, as the curve bows significantly toward the top-left corner of the plot.

bird_roc_curve <- bird_augmented %>%
  roc_curve(threat_status, .pred_0)

autoplot(bird_roc_curve) +
  labs(title = "ROC Curve for Best Random Forest Model")

## Final ROC AUC Results Our random forest model was able to predict whether or not a bird species faces threat or not with 85.9% accuracy! Let’s celebrate!

bird_test_roc_auc <- augment(bird_rf_fit, new_data = bird_test) %>%
  roc_auc(threat_status, .pred_0) %>%
  select(.estimate)

bird_test_roc_auc
## # A tibble: 1 × 1
##   .estimate
##       <dbl>
## 1     0.859

Conclusion

Throughout this project, we have researched, explored, and analyzed our data and its variables in order to build and test a model that could predict whether or not a bird is facing extinction threat. After evaluating several machine learning approaches, the Random Forest model proved best at predicting, achieving a final test ROC AUC of 0.859. This came as no surprise because Random Forest models are non-parametric and highly flexible. That being said, the model is by no means perfect. In contrast, the Decision Tree performed the worst among the tested models. Interestingly, simpler linear models like Logistic Regression and LDA actually outperformed the Decision Tree model!

To improve this project, we could look into other characteristics associated with vulnerability. Incorporating habitat fragmentation information or climate change projections could help explain more of the variability in the outcome than our current model. We could also compare these results across different regions and continents. It would be interesting to see if the predictors for extinction risk in California differs significantly from those in the New York.

My favorite part of this project was being able to apply my academic learning to something I love: birds. People always say to appreciate the little things in life, and I found that through birds. Overall, attempting to predict the threat status of birds using this dataset provided a meaningful opportunity for me to apply my data analysis skills in R and learn more about machine learning. I enjoyed getting to learn more about the BIRDBASE data and the research that went into it. I’m happy that I was able to develop a model that explains a little bit about what causes threat to birds!

To top it all off, here is a picture of the Great Horned Owl that lives right in Santa Barbara.

Sources

This data was retrieved from the Springer Nature Figshare repository as part of the dataset, “BIRDBASE: A Global Database of Avian Biogeography, Conservation, Ecology and Life History Traits.” The BirdBase database was compiled and shared by lead author Çağan H. Şekercioğlu and a team of researchers to provide a comprehensive resource for avian conservation and life history analysis.