Spatial rarification for imbalanced population structure analysis

Published

April 19, 2023

Motivation

(I’m referring to both of us in the third person, just so it’s clear who’s who).

The motivation for this comes from our conversation at the Biosphere 2 meeting. Antonio was concerned with oversampling a focal species versus one or more other species that are potentially hybridizing with the focal species. This could lead to unbalanced sampling of alleles from the focal species, leading to erroneous estimation of more population structure in the focal species versus the other species. It could also lead to undetected (or falsely detected) hybridization, where an allele may erroneously be inferred to be more frequent in the focal population relative to the non-focal species, or error propagation (e.g. genotyping error common in reduced representation data sets) could lead to noise in population structure inference. A likely scenario for the former is a species is oversampled at a few particular sites, leading to an oversampling of local allele frequencies and deviating away from expected patterns of isolation-by-distance or isolation-by-environment. A likely scenario for the latter is biased missing data, which could lead to the imputation method of the structure analysis (e.g. sNMF just considers missing data to be the reference/ancestral allele) replacing unknown missing alleles with an ancestral allele that is likely sampled more frequently in the focal population.

To mitigate this, Antonio originally proposed to pull repeated random samples from the oversampled species, conduct a population structure analysis with these samples, then summarized the structure results across repetitions to obtain a “super-STRUCTURE” plot that contains the average and error of structure estimates across replicates. One potential problem both identified is that the bias in sampling is still going to be reflected by an uninformed resampling technique, which by chance would resample the more densely sampled localities most frequently and retain biased structure in the data. Connor proposed an extension to this idea, where the resampling is spatially rarified to retain the spatial structure of the original data set, but limit sampling to one or a few samples in a given area. This would reduce the bias due to the above mechanisms, while retaining the important spatial signal in the data, assuming that the study area is well-represented by the data.

Connor’s proposed sampling method

Here, I’m proposing a workflow that uses the spThin R package to spatially rarify the input data repeatedly to return multiple resamples that can then be used to filter genetic data and input into a population structure inference method. I can write a thin wrapper around it later as well to facilitate input

Load packages and data

The packages can be installed from CRAN if you don’t have them. The data is example data from the spThin package- 201 georeferenced occurrence records for the Caribbean spiny pocket mouse, Heteromys anomalus.

library(spThin)
# for data wrangling
library(tidyverse)

data("Heteromys_anomalus_South_America")

head(Heteromys_anomalus_South_America)

      SPEC       LAT      LONG   REGION
1 anomalus  7.883333 -75.20000 mainland
2 anomalus  8.000000 -76.73333 mainland
3 anomalus 10.616667 -75.03333 mainland
4 anomalus  8.633333 -74.06667 mainland
5 anomalus  9.966667 -75.06667 mainland
6 anomalus 10.216667 -73.38333 mainland

Here’s a plot of the original localities.

Spatially thin data

spThin runs an algorithm that repeatedly samples a data set and rarifies according to a specified rarification distance, thin.par, which is in units of kilometers. To determine this distance, you could create a spatial correlogram and set it at the distance where spatial autocorrelation reduces to a minimum, or just give it a best-guess. Most folks who perform SDMs just give it a best guess, so…

The function returns a data.frame for each repetition with the column names “Longitude” and “Latitude”. The row numbers correspond with the row numbers in the original data.frame, so you can subset the original data to get metadata for the coordinates.

Here, I’m running 100 repetitions to spatially rarify the example data at a 20 km distance. I’m running this function with “verbose = TRUE” rather than writing to a log file for presentation purposes. The verbose output lists the number of occurrences retained after thinning, with the results binned if there are data sets that vary in the number of occurrences returned. This shouldn’t vary too much, but if it does, you can filter for data sets with specific sample sizes.

thin_df <- thin(
  loc.data = Heteromys_anomalus_South_America,
  lat.col = "LAT",
  long.col = "LONG",
  spec.col = "SPEC",
  thin.par = 20,
  reps = 100,
  locs.thinned.list.return = TRUE,
  write.files = FALSE,
  write.log.file = FALSE,
  verbose = TRUE
)

********************************************** 
 Beginning Spatial Thinning.
 Script Started at: Wed Apr 19 11:23:38 2023
lat.long.thin.count
 95 
100 
[1] "Maximum number of records after thinning: 95"
[1] "Number of data.frames with max records: 100"
[1] "No files written for this run."

We now need to link the thinned data back to the original data set to obtain metadata. I could make this one huge %>% fest, but I’m splitting it up for clarity.

# add row names to original data frame, rename the LAT and LONG columns to Latitude and Longitude, and rename the data frame for easier typing
hdf <- Heteromys_anomalus_South_America %>% 
  rename(Longitude = LONG,
         Latitude = LAT) %>% 
  mutate(rowname = rownames(Heteromys_anomalus_South_America))

# add row names to the thinned data frames
thin_df_named <- purrr::map(thin_df,  tibble::rownames_to_column)

# add metadata to thinned dfs

thin_df_meta <- purrr::map(thin_df_named, ~left_join(.x, hdf, by = c("rowname", "Longitude", "Latitude")))

Now, for each data frame, we have a spatially rarified sample of individuals for conducting structure analysis! Here’s an example map of the original samples in black and one of the rarified samples in red. Now these can be used to subset an existing genetic data set.