Spatial studies of genetic diversity within species

BOLD dataset

BOLD (Barcode of Life Data System) is a database of DNA barcode sequences from georeferenced specimens; these barcodes closely approximate species.

We use the bold package to download a set of georeferenced sequences for the taxon Pomacanthidae, a family of marine angelfishes.

library(bold)
taxonRequest <- "Pomacanthidae"
resBold <- bold::bold_seqspec(taxon=taxonRequest, sepfasta=TRUE)

Prepare dataset

We filter and mutate the georeferenced sequence dataset from boldsystems.org to produce a curated data frame in which each row is an individual specimen and each column holds specimen information. We also add a new column, sequence, that stores each DNA sequence as a string.

The function prepare_bold_res applies five filters:

Select specimens with the given marker_code
Remove specimens with no species_name information
Remove specimens with no lat or lon coordinate information
Remove specimens with IUPAC ambiguity codes in their DNA sequences
Select specimens whose DNA sequence length, in bp, falls within a given range

## filter and mutate
preResBold <- prepare_bold_res(resBold,
                                   marker_code="COI-5P",
                                   species_names=TRUE, 
                                   coordinates=TRUE, 
                                   ambiguities=TRUE, 
                                   min_length=420,
                                   max_length=720
                                  )
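
As a rough illustration of what these five filters amount to, the base R sketch below applies them to a toy data frame. The column names, the filter_specimens helper and the toy values are all hypothetical; this is not the implementation of prepare_bold_res, only one plausible way to express the same rules.

```r
# Toy specimen table. Column names are illustrative assumptions,
# not necessarily those returned by BOLD or used by the package.
specimens <- data.frame(
  marker_code  = c("COI-5P", "COI-5P", "COI-3P", "COI-5P", "COI-5P"),
  species_name = c("Pomacanthus imperator", "", "Pomacanthus imperator",
                   "Centropyge bicolor", "Centropyge bicolor"),
  lat          = c(-17.5, -17.5, -17.5, NA, 12.1),
  lon          = c(-149.8, -149.8, -149.8, 120.3, 120.3),
  sequence     = c("ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAC",
                   "ACGTACGTAC", "ACGTNCGTAC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows that pass all five filters.
filter_specimens <- function(df, marker, min_length, max_length) {
  len <- nchar(df$sequence)
  keep <- df$marker_code == marker &                        # 1. marker
    !is.na(df$species_name) & nzchar(trimws(df$species_name)) &  # 2. species name
    !is.na(df$lat) & !is.na(df$lon) &                       # 3. coordinates
    !grepl("[^ACGT]", toupper(df$sequence)) &               # 4. only A, C, G, T
    len >= min_length & len <= max_length                   # 5. length range
  # which() drops rows where a test evaluated to NA
  df[which(keep), , drop = FALSE]
}

curated <- filter_specimens(specimens, "COI-5P",
                            min_length = 5, max_length = 20)
```

Note that the pattern used for filter 4 rejects any character other than A, C, G or T, so gap characters are rejected along with IUPAC ambiguity codes; whether the package treats gaps this way is not stated here.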

Build grid world map

The grid is composed of square cells, which we call sites, each with a side length of siteSize meters. By default, the grid is built on a map in the Behrmann projection, a cylindrical equal-area projection, so every site covers the same area. In this example we build a grid whose sites are 260 kilometers wide.

grid.sp <- grid_spatialpolygons(siteSize=260000)
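
To give a feel for the size of such a grid, the sketch below projects longitude and latitude with a spherical Behrmann formula (cylindrical equal-area, standard parallels at 30 degrees) and counts how many 260 km sites span the globe. The functions behrmann_xy and site_index are hypothetical helpers, and the spherical Earth radius of 6371 km is an assumption; grid_spatialpolygons may rely on an ellipsoidal projection, so its numbers can differ slightly.

```r
# Spherical Behrmann projection: cylindrical equal-area with
# standard parallels at 30 degrees. Radius in meters (assumed sphere).
behrmann_xy <- function(lon, lat, radius = 6371000) {
  phi0 <- 30 * pi / 180
  x <- radius * (lon * pi / 180) * cos(phi0)
  y <- radius * sin(lat * pi / 180) / cos(phi0)
  cbind(x = x, y = y)
}

site_size <- 260000  # side length of a site, in meters

# Extent of the projected world and number of sites needed to cover it
corners <- behrmann_xy(c(-180, 180), c(-90, 90))
n_cols <- ceiling(diff(corners[, "x"]) / site_size)
n_rows <- ceiling(diff(corners[, "y"]) / site_size)

# Column and row of the site containing each point, counted from the
# south-west corner of the projected map
site_index <- function(lon, lat, site_size) {
  xy <- behrmann_xy(lon, lat)
  origin <- behrmann_xy(-180, -90)
  cbind(col = floor((xy[, "x"] - origin[, "x"]) / site_size) + 1,
        row = floor((xy[, "y"] - origin[, "y"]) / site_size) + 1)
}

idx <- site_index(lon = 0, lat = 0, site_size = site_size)
```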

We then generate the presence/absence matrix of specimens in the sites of the world map grid.

specimenIntersectSites <- specimen_intersect_site(specimen.df=preResBold, grid.sp=grid.sp)
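
For readers unfamiliar with the term, a presence/absence matrix has one row per specimen and one column per site, with TRUE where the specimen falls inside the site. The sketch below builds one from a toy table of already-assigned site identifiers; the identifiers and column names are hypothetical, and the actual structure returned by specimen_intersect_site may differ.

```r
# Toy assignment of specimens to sites (hypothetical identifiers)
specimen_sites <- data.frame(
  specimen_id = c("s1", "s2", "s3", "s4"),
  site_id     = c("site_12", "site_12", "site_40", "site_7"),
  stringsAsFactors = FALSE
)

# Cross-tabulate, then convert counts to TRUE/FALSE
presence <- table(specimen_sites$specimen_id, specimen_sites$site_id) > 0

# Number of specimens found in each site
specimens_per_site <- colSums(presence)
```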

Nucleotide diversity

By species

We group specimens of the same species that are located within the same site of the grid. The sequences of each group are then aligned, and nucleotide diversity is calculated for each species within each site.

nucdivSpecies <- nucleotide_diversity_species(specimen.df=preResBold, 
                             sequenceIntersectSites=specimenIntersectSites,
                             MinimumNumberOfSequencesBySpecies=3
                             )
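
Nucleotide diversity is commonly defined as the average proportion of nucleotide differences between all pairs of sequences in a sample. The base R sketch below computes it for a small set of already-aligned sequences of equal length. The helper name and the toy sequences are hypothetical, and the package may use a different estimator or a dedicated library, so treat this as an illustration of the quantity rather than of nucleotide_diversity_species itself.

```r
# Mean pairwise proportion of differing positions among aligned sequences.
# Assumes all sequences have the same length and contain no gaps.
nucleotide_diversity <- function(seqs) {
  stopifnot(length(seqs) >= 2, length(unique(nchar(seqs))) == 1)
  bases <- do.call(rbind, strsplit(toupper(seqs), ""))  # one row per sequence
  pairs <- combn(nrow(bases), 2)                         # all pairs i < j
  prop_diff <- apply(pairs, 2, function(p) {
    mean(bases[p[1], ] != bases[p[2], ])
  })
  mean(prop_diff)
}

aligned <- c("ACGTACGTAC",
             "ACGTACGTAT",
             "ACGAACGTAC")
pi_hat <- nucleotide_diversity(aligned)
```

With three sequences there are three pairs, which is the smallest sample for which the average is taken over more than one comparison; this is consistent with requiring at least three sequences per species in the call above.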

By sites

Once the nucleotide diversity of each species is known, we calculate the mean species nucleotide diversity for each site of the world map grid.

nucdivSites <- nucleotide_diversity_sites(nucdivSpecies)
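
The averaging step can be pictured with a toy table of per-species values. The sketch below uses base R's aggregate to take the mean by site; the column names and numbers are made up for illustration and do not reflect the structure of nucdivSpecies.

```r
# Toy per-species nucleotide diversity within sites (hypothetical values)
species_div <- data.frame(
  site    = c("site_7", "site_7", "site_12"),
  species = c("Pomacanthus imperator", "Centropyge bicolor",
              "Centropyge bicolor"),
  nucdiv  = c(0.002, 0.004, 0.010),
  stringsAsFactors = FALSE
)

# Mean species nucleotide diversity for each site
site_div <- aggregate(nucdiv ~ site, data = species_div, FUN = mean)
```

Each species contributes equally to the site mean, regardless of how many sequences it has in that site.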

Worldmap grid of mean species nucleotide diversity

We assign a mean species nucleotide diversity value to each site in the worldmap grid.

nucdivGrid <- nucleotide_diversity_grid(nucdivSites, grid.sp)

We can then plot the world map grid of nucleotide diversity.

gg <- plot_grid(nucdivGrid)

## OGR data source with driver: ESRI Shapefile 
## Source: "/tmp/RtmpyE1UC7", layer: "ne_50m_coastline"
## with 1428 features
## It has 3 fields
## Integer64 fields read as strings:  scalerank 
## OGR data source with driver: ESRI Shapefile 
## Source: "/tmp/RtmpyE1UC7", layer: "ne_50m_rivers_lake_centerlines"
## with 462 features
## It has 32 fields
## Integer64 fields read as strings:  ne_id 
## OGR data source with driver: ESRI Shapefile 
## Source: "/tmp/RtmpyE1UC7", layer: "ne_50m_lakes"
## with 275 features
## It has 35 fields
## Integer64 fields read as strings:  scalerank ne_id

gg