In this short article I demonstrate how to flag species occurrence records that are ‘likely’ to fall outside the species’ range. These are most often outliers or imprecisely georeferenced occurrence points. Such records are quite common in large datasets such as those from GBIF and other citizen science sources.
I will start by downloading climate data from WorldClim and the Kenya boundary shapefile from GADM, then define a few occurrence records within Kenya and convert them to a SpatialPointsDataFrame.
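library(raster)   # getData(), crop(), mask(), extract(); also attaches sp
library(rgeos)    # gBuffer()
library(dplyr)    # mutate(), case_when()
library(ggplot2)  # boxplot
# WorldClim bioclimatic variables at 10 arc-minute resolution
clim <- getData('worldclim', var = 'bio', res = 10)
# Kenya national boundary (administrative level 0) from GADM
KEN <- getData('GADM', country = 'KEN', level = 0)
df <- data.frame(longitude = c(40.029948, 39.031136, 35.587305),
                 latitude = c(2.627751, -1.269534, 1.072451),
                 sampling_sites = c('Wajir', 'Bura Tana', 'Chesoi'))
# Promote a copy of the data frame to a SpatialPointsDataFrame
df_spatial <- df
coordinates(df_spatial) <- ~longitude+latitude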
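The next step is to crop and mask the climate data to the boundary of Kenya.
# crop() trims to Kenya's extent; mask() blanks the cells outside the border
clim_mask <- mask(crop(clim, KEN), KEN)
# Layer 4 is bio4 (temperature seasonality)
plot(clim_mask[[4]])
plot(KEN, border = 'purple', lwd = 5, add = TRUE)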
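To make the difference between the two operations concrete, here is a minimal sketch on a toy raster. The raster values and the triangular `boundary` polygon are made up for illustration and merely stand in for `clim` and `KEN`.

```r
library(raster)  # also attaches sp

# Toy 10 x 10 raster spanning 0-10 degrees in both directions (made-up values)
r <- raster(nrows = 10, ncols = 10, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
values(r) <- 1:ncell(r)

# Hypothetical triangular "boundary" standing in for a country polygon
tri <- cbind(c(2, 8, 2, 2), c(2, 2, 8, 2))
boundary <- SpatialPolygons(list(Polygons(list(Polygon(tri)), ID = "toy")),
                            proj4string = crs(r))

r_crop <- crop(r, boundary)       # trims the raster to the polygon's bounding box
r_mask <- mask(r_crop, boundary)  # sets cells outside the polygon itself to NA

dim(r_crop)                  # rows, columns, layers after cropping
sum(is.na(values(r_crop)))   # cropping alone introduces no NA cells
sum(is.na(values(r_mask)))   # masking blanks the cells outside the triangle
```

Cropping first keeps the masking step cheap, because `mask()` then only has to rasterize the polygon over the smaller extent.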
The next phase is to create a buffer around each occurrence point and extract the raster values that fall within it. The buffer width of 0.5 is a radius in map units (degrees here), so each buffer is one map unit across.
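# One buffer polygon per point (byid = TRUE), labelled with the site name
set_buff <- gBuffer(df_spatial, width = 0.5,
                    byid = TRUE,
                    id = df_spatial$sampling_sites)
# One row per raster cell inside a buffer; the ID column follows buffer order
values_within_buffer <- raster::extract(clim_mask, set_buff, df = TRUE)
plot(clim_mask[[4]])
plot(KEN, border = 'purple', lwd = 5, add = TRUE)
plot(set_buff, add = TRUE)
plot(df_spatial, add = TRUE)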
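Before plotting, it can help to summarise the extracted values per buffer. The sketch below uses a simulated data frame, `toy_values`, that only mimics the `ID` and `bio4` columns of `values_within_buffer`. The numbers and the generic site labels are invented, not WorldClim values.

```r
# Simulated stand-in for values_within_buffer; all numbers are made up
set.seed(1)
toy_values <- data.frame(
  ID   = rep(1:3, each = 20),
  bio4 = c(rnorm(20, 100, 10), rnorm(20, 105, 10), rnorm(20, 60, 10))
)

# Map the numeric buffer ID to a label by position (hypothetical labels)
site_names <- c("site_A", "site_B", "site_C")
toy_values$group <- site_names[toy_values$ID]

# Median and interquartile range of bio4 within each buffer
site_summary <- aggregate(bio4 ~ group, data = toy_values,
                          FUN = function(x) c(median = median(x), IQR = IQR(x)))
site_summary
```

A site whose median sits well outside the interquartile ranges of the others is a candidate for closer inspection.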
Lastly, we draw a boxplot of the extracted values to reveal any point whose surrounding values clearly differ from the rest. Such a point may have been wrongly recorded and may be excluded when running an SDM.
values_within_buffer |>
  mutate(group = case_when(ID == 1 ~ "Wajir",
                           ID == 2 ~ "Bura Tana",
                           ID == 3 ~ "Chesoi")) |>
  ggplot(aes(x = group, y = bio4, fill = group)) +
  geom_boxplot() +
  geom_jitter(width = 0.1)
In this case, the Chesoi site appears to differ from the other two sites with regard to bio4 (temperature seasonality). If bio4 is one of the most important factors determining the distribution of the species, we might decide to leave the Chesoi occurrence record out of the modelling procedure and use only those for Bura Tana and Wajir. It is also possible to run probabilistic/Bayesian models to evaluate whether Chesoi is ‘really’ outside the range of the species. Frequentist approaches such as ANOVA can likewise be used to test whether the mean bio4 values around the occurrence records differ significantly. The code generating this HTML file can be sourced from the .Rmd file on GitHub. Happy sdm-ing!
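As a rough sketch of the frequentist check mentioned above, a one-way ANOVA followed by Tukey's HSD could look like the following. The data frame `toy` is simulated with invented values and generic site labels. With real data, the `group` and `bio4` columns built for the boxplot would take its place.

```r
# Simulated bio4 values for three sites; all numbers are made up
set.seed(42)
toy <- data.frame(
  group = factor(rep(c("site_A", "site_B", "site_C"), each = 25)),
  bio4  = c(rnorm(25, 100, 10), rnorm(25, 102, 10), rnorm(25, 60, 10))
)

# One-way ANOVA: does mean bio4 differ among sites?
fit <- aov(bio4 ~ group, data = toy)
summary(fit)

# Pairwise comparisons to see which site(s) drive the difference
TukeyHSD(fit)
```

Raster cells within a buffer are spatially autocorrelated rather than independent, so the p-values are best read as a screening aid and not as a strict test.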