Predicting Endemic Plant Proportions in the Galapagos Islands Using Logistic Regression

Author

Dennis Baidoo

Published

June 20, 2025

Abstract

This study employs logistic regression to model the proportion of endemic plant species across 29 Galapagos Islands, using island characteristics (area, elevation, and distances to neighboring islands) as predictors. The final model, selected via stepwise AIC optimization, revealed significant negative effects of island area and distance to the nearest island on endemism, alongside complex interactions involving distance to Santa Cruz Island. While the model explained substantial deviance (residual deviance = 39.51, df = 18), the lack-of-fit test (p = 0.002) indicated unresolved variability. Predictions for three withheld islands showed mixed accuracy: the 95% interval captured the observed proportion for Wolf Island, but the model underestimated endemism on Gardner2 and Santa Fe. These results highlight the interplay of geographic isolation and habitat size in shaping endemic diversity, while underscoring the limitations of small-sample ecological models.

1. Introduction

The Galapagos Islands, a UNESCO World Heritage Site, have long served as a natural laboratory for studying evolution and biogeography. While past research has quantified species richness patterns, fewer studies have modeled the proportion of endemic flora—a critical metric for conservation prioritization. This study addresses this gap by:

  • Developing a predictive model for endemic plant proportions using island traits.
  • Evaluating the roles of area, elevation, and isolation in endemism.
  • Assessing model performance via out-of-sample validation.

2. Methods

2.1 Data

We analyzed 29 islands with variables including:

  • Response: Proportion of endemic plants (PlantEnd/Plants).

  • Predictors: Log2-transformed area (Area), distances to nearest island (Nearest) and Santa Cruz (StCruz), and adjacent island area (Adjacent).

Three islands (Gardner2, Santa Fe, Wolf) were withheld for validation, leaving 26 islands for model fitting.

2.2 Model Development

  • Data Exploration: Empirical logits revealed nonlinearity, prompting log transformation of Area.

  • Model Selection: A full logistic regression with pairwise interactions was refined via stepwise AIC reduction.

  • Lack-of-Fit: Assessed using deviance tests (α = 0.10).
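
The three steps above can be sketched in R as follows. This is a minimal illustration on simulated data, not the fitted analysis: the data frame `dat_sim`, its generating coefficients, and the seed are assumptions made only so that the block runs on its own, and only the column names mirror the Galapagos data set.

```r
# Minimal sketch on simulated data (hypothetical values, not the Galapagos data)
set.seed(1)
n_isl <- 26
dat_sim <- data.frame(
  Plants   = sample(10:300, n_isl, replace = TRUE),
  Area     = exp(runif(n_isl, log(0.01), log(4000))),
  Nearest  = runif(n_isl, 0.2, 50),
  StCruz   = runif(n_isl, 0, 290),
  Adjacent = exp(runif(n_isl, log(0.03), log(4000)))
)
eta_sim <- 0.2 - 0.1 * log(dat_sim$Area) - 0.02 * dat_sim$Nearest
dat_sim$PlantEnd <- rbinom(n_isl, size = dat_sim$Plants, prob = plogis(eta_sim))

# Full model: main effects plus all pairwise interactions
fit_full <- glm(
  cbind(PlantEnd, Plants - PlantEnd) ~ (log(Area) + Nearest + StCruz + Adjacent)^2,
  family = binomial,
  data   = dat_sim
)

# Backward stepwise selection by AIC
fit_red <- step(fit_full, direction = "backward", trace = 0)

# Deviance lack-of-fit test: a small p-value indicates lack of fit
p_lof <- pchisq(deviance(fit_red), df.residual(fit_red), lower.tail = FALSE)
```

The response is given as a two-column matrix of endemic and non-endemic counts, so each island is weighted by its total number of plant species.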

2.3 Validation

Predicted probabilities and 95% CIs were compared to observed proportions for withheld islands.
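
One common way to obtain such intervals, sketched below, is to build them on the logit scale and back-transform the limits. The report does not state how its intervals were computed, so this approach is an assumption, and the simulated data frame `dat_sim` is a hypothetical stand-in for the island data.

```r
# Simulated stand-in data (hypothetical; not the Galapagos data)
set.seed(2)
dat_sim <- data.frame(
  Plants  = sample(20:200, 20, replace = TRUE),
  Area    = exp(runif(20, 0, 8)),
  Nearest = runif(20, 1, 40)
)
dat_sim$PlantEnd <- rbinom(
  20,
  size = dat_sim$Plants,
  prob = plogis(0.3 - 0.15 * log(dat_sim$Area) - 0.02 * dat_sim$Nearest)
)

# Withhold the last three rows for validation
dat_fit  <- dat_sim[1:17, ]
dat_hold <- dat_sim[18:20, ]

fit <- glm(
  cbind(PlantEnd, Plants - PlantEnd) ~ log(Area) + Nearest,
  family = binomial,
  data   = dat_fit
)

# Predict on the link (logit) scale, then back-transform the interval limits
pred <- predict(fit, newdata = dat_hold, type = "link", se.fit = TRUE)
z    <- qnorm(0.975)
res  <- data.frame(
  observed  = dat_hold$PlantEnd / dat_hold$Plants,
  predicted = plogis(pred$fit),
  lower     = plogis(pred$fit - z * pred$se.fit),
  upper     = plogis(pred$fit + z * pred$se.fit)
)
res$covered <- res$observed >= res$lower & res$observed <= res$upper
```

An interval built this way describes uncertainty in the fitted mean probability only. It does not include binomial sampling variation in the observed proportion, so an observed value can fall outside it even when the model is adequate.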

3. Results

The logistic regression analysis of endemic plant proportions across the Galapagos Islands yielded several key findings. The initial exploratory analysis revealed nonlinear relationships between predictors and the empirical logits of endemic proportions, particularly for island area and elevation. To address this, a log2 transformation was applied to the area variable, which improved linearity in the model.
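
As a small worked illustration of the empirical logit used in this exploration, the sketch below applies the same formula as the analysis code to made-up counts. The helper name `emp_logit_fn` and the counts are hypothetical.

```r
# Empirical logit with the 0.5 / n adjustment used in the analysis code
emp_logit_fn <- function(y, n) {
  p_hat <- y / n
  log((p_hat + 0.5 / n) / (1 - p_hat + 0.5 / n))
}

# Made-up counts: 0, 5, and 10 endemic species out of 10
el <- emp_logit_fn(y = c(0, 5, 10), n = 10)
el
```

The adjustment keeps the logit finite when the observed proportion is 0 or 1, where the unadjusted logit would be infinite.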

library(erikmisc)
library(tidyverse)
ggplot2::theme_set(ggplot2::theme_bw())  # set theme_bw for all plots
# First, download the data to your computer,
#   save in the same folder as this qmd file.

# read the data
dat_gal <-
  read_csv(
    "ADA2_CL_18_Galapagos.csv"
  , skip = 27             # I was expecting to skip 28, not sure why it wants 27
  ) |>
  mutate(
    id = 1:n()
  )
Rows: 29 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): Island
dbl (10): Plants, PlantEnd, Finches, FinchEnd, FinchGenera, Area, Elevation,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# observed proportions
dat_gal <-
  dat_gal |>
  mutate(
    p_hat     = PlantEnd / Plants
    # empirical logits
  , emp_logit = log((p_hat + 0.5/Plants) / (1 - p_hat + 0.5/Plants))
  )
# list of islands to predict
island_pred_list <-
  c(
    "Gardner2"
  , "Santa.Fe"
  , "Wolf"
  )

## capture the observed probabilities
dat_gal_pred_true <-
  dat_gal |>
  filter(
    Island %in% island_pred_list
  )

# Set the response variables to missing for the islands to predict
dat_gal <-
  dat_gal |>
  mutate(
    across(
      c(Plants, PlantEnd, p_hat, emp_logit)
    , ~ if_else(Island %in% island_pred_list, NA_real_, .x)
    )
  )
## Subset and transform the data here (no islands are currently excluded)

dat_gal <-
  dat_gal |>
  mutate(
  ) |>
  filter(
    TRUE #!(id %in% c( 15,11 ))
  )

### Inspection of the data suggests there are no extreme outliers
# Create plots for proportion endemic for each variable
dat_gal_long <-
  dat_gal |>
  select(
    Island, id, p_hat, emp_logit, Area, Elevation, Nearest, StCruz, Adjacent
  ) |>
  pivot_longer(
    cols = c(Area, Elevation, Nearest, StCruz, Adjacent)
  , names_to  = "variable"
  , values_to = "value"
  )

# Plot the data using ggplot
library(ggplot2)
p <- ggplot(dat_gal_long, aes(x = value, y = p_hat, label = id))
p <- p + geom_hline(yintercept = c(0,1), alpha = 0.25)
p <- p + geom_text(hjust = 0.5, vjust = -0.5, alpha = 0.25, colour = 2)
p <- p + geom_point()
p <- p + scale_y_continuous(limits = c(0, 1))
p <- p + facet_wrap( ~ variable, scales = "free_x", nrow = 1)
print(p)
Warning: Removed 15 rows containing missing values or values outside the scale range
(`geom_text()`).
Warning: Removed 15 rows containing missing values or values outside the scale range
(`geom_point()`).

# Plot the data using ggplot
library(ggplot2)
p <- ggplot(dat_gal_long, aes(x = value, y = emp_logit, label = id))
p <- p + geom_text(hjust = 0.5, vjust = -0.5, alpha = 0.25, colour = 2)
p <- p + geom_point()
p <- p + facet_wrap( ~ variable, scales = "free_x", nrow = 1)
print(p)
Warning: Removed 15 rows containing missing values or values outside the scale range
(`geom_text()`).
Warning: Removed 15 rows containing missing values or values outside the scale range
(`geom_point()`).

# relationships between predictors

library(ggplot2)
library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
p <- ggpairs(dat_gal |> select(Area, Elevation, Nearest, StCruz, Adjacent))
print(p)

The fitted equation for the reduced model selected by stepwise AIC is \[ \hat{p} = \frac{\exp(\eta)}{1 + \exp(\eta)}, \] where

\[ \eta = 0.158 -0.131 \log(\text{Area}) -0.0321 \cdot \text{Nearest} -0.01027 \cdot \text{StCruz} + 0.000684 \cdot \text{Adjacent} + 0.000468 \cdot (\text{Nearest} \times \text{StCruz}) + 0.001072 \cdot (\log(\text{Area}) \times \text{StCruz}) -0.0000597 \cdot (\text{StCruz} \times \text{Adjacent}) \]
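
To make the equation concrete, the sketch below evaluates it in R using the coefficients exactly as written above. The function name `predict_endemic` and the island values passed to it are hypothetical. The sketch also assumes that log denotes the natural logarithm, as the equation is written; if the fit used log2 as described in Section 2.1, `log()` would need to be replaced by `log2()`.

```r
# Coefficients copied from the fitted equation
b <- c(
  intercept       =  0.158,
  logArea         = -0.131,
  Nearest         = -0.0321,
  StCruz          = -0.01027,
  Adjacent        =  0.000684,
  Nearest_StCruz  =  0.000468,
  logArea_StCruz  =  0.001072,
  StCruz_Adjacent = -0.0000597
)

# Predicted endemic proportion for given island characteristics
predict_endemic <- function(Area, Nearest, StCruz, Adjacent) {
  eta <- b[["intercept"]] +
    b[["logArea"]]         * log(Area) +
    b[["Nearest"]]         * Nearest +
    b[["StCruz"]]          * StCruz +
    b[["Adjacent"]]        * Adjacent +
    b[["Nearest_StCruz"]]  * Nearest * StCruz +
    b[["logArea_StCruz"]]  * log(Area) * StCruz +
    b[["StCruz_Adjacent"]] * StCruz * Adjacent
  plogis(eta)  # inverse logit: exp(eta) / (1 + exp(eta))
}

# Baseline case: every term except the intercept is zero
p_base <- predict_endemic(Area = 1, Nearest = 0, StCruz = 0, Adjacent = 0)

# Hypothetical island (illustrative values only)
p_example <- predict_endemic(Area = 10, Nearest = 5, StCruz = 50, Adjacent = 100)
```

Unlike `predict()` on the fitted model object, this calculation carries no standard errors, so it yields point predictions only.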

The full model, incorporating log-transformed area, distances to the nearest island and Santa Cruz, adjacent island area, and their pairwise interactions, exhibited significant lack-of-fit (deviance p-value = 1.21 × 10⁻⁵), indicating that these predictors alone do not fully explain the variability in endemic proportions. Stepwise model selection based on AIC identified a reduced model that included significant interaction terms:

  • log(Area) × StCruz (β = 0.00107, p = 0.025)
  • Nearest × StCruz (β = 0.000468, p < 0.001)
  • StCruz × Adjacent (β = −5.97 × 10⁻⁶, p = 0.042)

The reduced model showed improved but still significant lack-of-fit (deviance p-value = 0.002), suggesting some unexplained variability remained.
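
The reduced-model p-value can be checked from the residual deviance and degrees of freedom reported in the abstract (39.51 on 18 df). The sketch below assumes the test is the usual upper-tail chi-squared comparison of the residual deviance; the variable names are illustrative.

```r
# Values reported for the reduced model
dev_resid <- 39.51
df_resid  <- 18

# Deviance lack-of-fit test: upper-tail chi-squared probability
p_lof <- pchisq(dev_resid, df = df_resid, lower.tail = FALSE)

# Compare with the significance level stated in the Methods
alpha       <- 0.10
lack_of_fit <- p_lof < alpha
```

Because the p-value falls well below α = 0.10, the hypothesis of adequate fit is rejected for the reduced model.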

The main-effect coefficients were negative for log(Area) (β = −0.131, p < 0.001) and Nearest (β = −0.0321, p = 0.009). Because both variables also interact with StCruz, these coefficients describe the effects at StCruz = 0: there, larger islands and islands farther from their nearest neighbor tended to have lower proportions of endemic plants. The interaction terms show that these effects change with distance to Santa Cruz, and that the effect of StCruz itself depends on island area, distance to the nearest island, and adjacent island area.

Predictions for the three withheld islands (Gardner2, Santa Fe, and Wolf) demonstrated mixed accuracy. The model underestimated the observed endemic proportions for Gardner2 (observed = 0.800, predicted = 0.383, 95% CI: 0.322–0.446) and Santa Fe (observed = 0.452, predicted = 0.281, 95% CI: 0.238–0.327); in both cases the observed value lay above the upper limit of the interval. In contrast, the observed proportion for Wolf Island (observed = 0.571, predicted = 0.672, 95% CI: 0.447–0.838) fell within the confidence interval, indicating reasonable agreement.

Overall, the model identified island area and spatial isolation as significant predictors of endemic plant proportions, though the presence of interaction effects and residual lack-of-fit suggests additional ecological factors may influence these patterns. The predictive performance varied across islands, highlighting the challenges of modeling endemic diversity in heterogeneous environments.

4. Discussion

The logistic regression analysis provided valuable insights into the factors influencing endemic plant proportions across the Galapagos Islands, while also revealing important limitations in our modeling approach. The finding that larger islands exhibited lower proportions of endemic species contrasts with classical island biogeography theory, which typically predicts greater endemism in larger areas due to increased opportunities for speciation. This unexpected result may reflect the unique ecological dynamics of the Galapagos, where larger islands might experience higher rates of colonization by non-endemic species, thereby diluting the proportion of endemics. Alternatively, it could suggest that endemic species in the archipelago tend to specialize in the distinct habitats found on smaller islands.

The significant interaction effects involving distance to Santa Cruz Island (StCruz) highlight the complex spatial dynamics at play in this island system. The positive interaction between StCruz and log(Area) suggests that the negative effect of island size on endemism is attenuated for islands farther from Santa Cruz. This may indicate that Santa Cruz serves as both a source of colonizing species and a barrier to isolation-dependent speciation processes. The ecological interpretation of these spatial interactions warrants further investigation, possibly incorporating ocean current data and historical colonization patterns.
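
The attenuation described above can be made explicit by computing the slope of log(Area) on the logit scale as a function of StCruz, using the two relevant coefficients from the fitted equation. The function name `slope_logArea` and the distances evaluated are illustrative.

```r
# Coefficients from the fitted equation
b_logArea        <- -0.131
b_logArea_StCruz <-  0.001072

# Slope of log(Area) on the logit scale at a given distance to Santa Cruz
slope_logArea <- function(StCruz) b_logArea + b_logArea_StCruz * StCruz

# The slope moves toward zero as StCruz increases
slopes <- slope_logArea(c(0, 50, 100))

# Distance at which the estimated area effect changes sign
StCruz_zero <- -b_logArea / b_logArea_StCruz
```

Under these estimates the area effect reaches zero at StCruz ≈ 122 (in the distance units of the data) and is positive beyond that, so the negative area effect applies mainly to islands nearer Santa Cruz.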

Model diagnostics revealed persistent lack-of-fit despite the inclusion of interaction terms, suggesting our model may be missing key ecological predictors. Potential missing variables could include island age (related to evolutionary time for speciation), habitat heterogeneity, or climatic factors like precipitation patterns. The underprediction of endemism for Gardner2 and Santa Fe islands particularly emphasizes these limitations, as these islands may possess unique characteristics not captured by our spatial predictors.

The mixed predictive performance across islands has important implications for conservation applications. While the model performed reasonably well for Wolf Island, its systematic underestimation of endemism on other islands suggests caution is needed when using such models for conservation prioritization. This may be particularly important for islands like Gardner2, where the model substantially underestimated the observed endemic proportion (0.800 observed vs. 0.383 predicted).

Methodologically, the challenges we encountered with model fit and prediction accuracy underscore the difficulties in modeling proportional data, especially with small sample sizes and potential overdispersion. The significant deviance statistic (p = 0.002 for the reduced model) indicates that the binomial variance assumption may be violated, suggesting that alternative approaches like beta regression or quasi-likelihood methods might be worth exploring in future studies.
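
As a rough indication of the size of the problem, the reduced model's deviance-based dispersion estimate is 39.51 / 18 ≈ 2.2. The sketch below shows how a quasi-likelihood refit would work in R. It uses simulated overdispersed data, so every name and value in it is hypothetical and it does not reproduce the Galapagos analysis.

```r
# Simulated overdispersed counts (hypothetical; not the Galapagos data)
set.seed(3)
n      <- 26
x      <- runif(n, -2, 2)
Plants <- sample(20:200, n, replace = TRUE)

# Island-level probabilities vary around the mean (beta-binomial style)
p_mean   <- plogis(-0.5 + 0.4 * x)
p_isl    <- rbeta(n, p_mean * 10, (1 - p_mean) * 10)
PlantEnd <- rbinom(n, size = Plants, prob = p_isl)

# Same mean model, binomial versus quasibinomial variance
fit_bin   <- glm(cbind(PlantEnd, Plants - PlantEnd) ~ x, family = binomial)
fit_quasi <- glm(cbind(PlantEnd, Plants - PlantEnd) ~ x, family = quasibinomial)

# Estimated dispersion (Pearson chi-squared / residual df)
phi_hat <- summary(fit_quasi)$dispersion

# Standard errors are inflated by sqrt(phi_hat); coefficients are unchanged
se_bin   <- coef(summary(fit_bin))[, "Std. Error"]
se_quasi <- coef(summary(fit_quasi))[, "Std. Error"]
se_ratio <- se_quasi / se_bin
```

The quasibinomial fit leaves the point estimates unchanged and only widens the standard errors, so it addresses overstated precision rather than a misspecified mean structure.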

These findings contribute to ongoing discussions about the drivers of endemism in island systems, particularly the balance between isolation and area effects. The significant interaction terms in our model support more nuanced interpretations of island biogeography theory, where the effects of area and isolation may be contingent on other factors. From a conservation perspective, the results suggest that simple area-based predictions of endemism may be inadequate for the Galapagos, and that spatial configuration metrics need to be incorporated into conservation planning.

Future research directions should focus on expanding the predictor set to include habitat quality metrics and evolutionary history variables, as well as investigating alternative modeling approaches that might better handle the proportional nature of the response variable. Additionally, comparative studies across different archipelagos could help determine whether the patterns we observed are unique to the Galapagos or represent more general island biogeographic principles.

5. Conclusion

Our model identifies island area and spatial configuration as key predictors of plant endemism in the Galapagos, though predictive accuracy varies. The findings support a more nuanced reading of island biogeography theory, in which the effects of area and isolation depend on spatial context, while highlighting the need for larger datasets to resolve residual variability. This work provides a framework for modeling endemic patterns in fragmented ecosystems globally.

References

Erhardt, E. B., Bedrick, E. J., & Schrader, R. M. (2020). Lecture notes for Advanced Data Analysis 2 (ADA2) (Stat 428/528). University of New Mexico.