COST PalaeOpen - Aquatic proxies workshop

neotoma2 R exercise

Author

Petr Kuneš & Walter Finsinger

Published

May 19, 2025

Introduction

We will introduce you to the Neotoma Paleoecology Database environment, including explorer, Tilia and the neotoma2 R package. The database uses an application programming interface (API), which serves the explorer as well as Tilia but also lets you query the database in R. We will show you which information you can most easily extract in either of these choices and provide a hands-on R exercise on working with diatom, chironomid, and charcoal data.

The neotoma2package is available from CRAN and can be installed in R using install.package(neotoma2). We also load other packages needed for the exercise.

library(neotoma2)
library(riojaPlot)
library(rioja)
library(vegan)
library(ggplot2)
library(ggvegan)
library(ggrepel)
library(dplyr)

Get all sites with diatom data in Europe and plot on map

Sites in neotoma2 are primarily spatial objects. They have names, locations, and are found within the context of geopolitical units, but within the API and the package, the site itself does not have associated information about taxa, dataset types or ages. It is simply the container into which we add that information.

We can use several parameters to search sites, one of them is datasettype. Here we search all sites containing “diatom” dataset type in Europe.

europe <- list(geoJSON = '{"type": "Polygon",
        "coordinates": [[
            [-30, 30],
            [70, 30],
            [70, 90],
            [-30, 90],
            [-30, 30]]]}')

sites_diat <- get_sites(datasettype = "diatom", loc = europe$geoJSON, all_data = TRUE)
neotoma2::summary(head(sites_diat, 20))
   siteid                 sitename collectionunit chronologies datasets  types
1    3377                Řežabinec        REZABIN            0        1 diatom
2    3522           Zirbenwaldmoor        ZIRBEN1            0        1 diatom
3   14267                Didachara         DIDA-C            0        1 diatom
4   15702 Lago Grande di Avigliana        AVG0702            0        1 diatom
5   26991         Prášilské jezero         PRA-15            0        1 diatom
6   29340             Hallwilersee       HALL91-D            0        1 diatom
7   30451      Horseshoe Point 204        HOW0204            0        1 diatom
8   30452      Horseshoe Point 206        HOW0206            0        1 diatom
9   30453      Horseshoe Point 207        HOW0207            0        1 diatom
10  30454      Horseshoe Point 208        HOW0208            0        1 diatom
11  30614          VC35 (BP080094)         VC35DI            0        1 diatom
12  30983                 Sizewell       SZC_VC07            0        1 diatom
13  30983                 Sizewell       SZC_VC08            0        1 diatom
14  30983                 Sizewell       SZC_VC16            0        1 diatom
15  30983                 Sizewell       SZC_VC21            0        1 diatom

Now we get datasets to the sites, and filter charcoal dataset types and plot on the map.

datasets_diat <- sites_diat %>% get_datasets(all_data = TRUE)
neotoma2::plotLeaflet(datasets_diat)

Using a single site

Downloading diatom, chironomid and charcoal data

As an example, we will use the site ‘Prášilské jezero’, a mountain lake located in Czechia (Šumava Mts.). This site has been analyzed for several proxies, including diatoms, chironomids and macrocharcoal.

First, we obtain the siteid and other metadata including its datasets for the chosen site and plot its position:

site_ds <- get_datasets(get_sites(sitename = "Prášilské jezero"), all_data = TRUE)
plotLeaflet(site_ds)

We can list all available datasets for the site:

datasets(site_ds)
 datasetid                 database       datasettype age_range_old
     47517 European Pollen Database    geochronologic            NA
     47518 European Pollen Database plant macrofossil       11985.0
     47519 European Pollen Database          charcoal       11954.0
     47520 European Pollen Database            pollen       11954.0
     55585 European Pollen Database            diatom       11426.0
     55586 European Pollen Database        chironomid       11451.0
     47521 European Pollen Database    geochronologic            NA
     47522 European Pollen Database plant macrofossil       10641.1
     47523 European Pollen Database    geochronologic            NA
     47524 European Pollen Database plant macrofossil        9679.1
 age_range_young notes
              NA  <NA>
           -58.0  <NA>
           -58.0  <NA>
           -58.0  <NA>
           -60.0  <NA>
           -60.0  <NA>
              NA  <NA>
           -60.5  <NA>
              NA  <NA>
           -58.4  <NA>

Filter out the datasets ‘diatom’, ‘chironomid’ and ‘charcoal’

site_ds <- site_ds %>% neotoma2::filter(datasettype %in% c("diatom", "chironomid", "charcoal"))

And pull in and extract the sample data

site_rec <- site_ds %>% get_downloads()
site_samp <- site_rec %>% samples()

Filter samples

Now we will filter only charcoal samples with values as area (function filter of dplyr package). We also select only relevant columns, and merge variablename and elementtype

site_charcoal <- samples(site_rec) %>% dplyr::filter(ecologicalgroup == "CHAR" & elementtype == "area") %>%
  dplyr::select(depth, age, variablename, elementtype, units, value) %>%
  dplyr::mutate(variable = paste(variablename, elementtype, sep = " "), .keep = "unused") %>%
  dplyr::group_by(depth, age) %>%
  dplyr::arrange(depth)

In the next step, we pull in diatom samples.

site_samp_diat <- samples(site_rec) %>% 
  dplyr::filter(taxongroup == "Diatoms")

Data transformation

Finally, we need to transform the diatom frustule counts to proportions.

site_diat_perc <- site_samp_diat %>%
  dplyr::group_by(depth, age) %>%
  dplyr::mutate(diatcount = sum(value, na.rm = TRUE)) %>%
  dplyr::group_by(variablename) %>% 
  dplyr::mutate(prop = value / diatcount) %>% 
  dplyr::arrange(desc(age))

For further use in other packages, it is useful to pivot the data to a “wide” table, with variablenames as column headings. We do this for both diatom and charcoal datasets using tidyr (note that charcoal is in concentration units).

site_diat_perc <- site_diat_perc %>%
  dplyr::select(depth, age, variablename, prop) %>% 
  dplyr::mutate(prop = as.numeric(prop))

site_diat_wide <- tidyr::pivot_wider(site_diat_perc,
                                      id_cols = c(depth, age),
                                      names_from = variablename,
                                      values_from = prop,
                                      values_fill = 0)

site_charcoal_wide <- tidyr::pivot_wider(site_charcoal,
                                        id_cols = c(depth, age),
                                        names_from = variable,
                                        values_from = value,
                                        values_fill = 0)

Stratigraphic plotting

To plot a stratigraphic diagram, we can use riojaPlot. We select 15 most frequent diatom taxa to be plotted.

site_plot_taxa <- taxa(site_rec) %>% dplyr::filter(taxongroup == "Diatoms") %>%
  dplyr::arrange(desc(samples)) %>% 
  head(n = 15)

Zonation

Now, we calculate a constrained hierarchical cluster analysis using CONISS (constrained incremental sum-of-squares cluster analysis) using the chord distance. We further determine the significant zones (clusters) by the broken stick model (4 splits / 5 zones).

site_clust <- chclust(vegdist(site_diat_wide[,-1:-2], "chord"), method = "coniss")
plot(site_clust, hang = -1)

bstick(site_clust)

We plot the stratigraphic diagram with riojaPlot, and add significant zones.

riojaPlot(site_diat_wide[,-1:-2]*100, site_diat_wide[,1:2],
                       selVars = site_plot_taxa$variablename,
                       scale.percent = TRUE,
                       sec.yvar.name="age",
                       plot.sec.axis = TRUE,
                       srt.xlabel = 60,
                       xRight = 0.85) |>
addRPClust(site_clust) |>
addRPClustZone(site_clust, col = "red")

To further adapt your stratigraphic plot, please refer to riojaPlot gallery.

Comparing the trend in charcoal to variability in diatom composition

Show the trend in charcoal deposition

First, we plot log-transformed charcoal area data on a time axis. Then we add different trends: lowess (locally-weighted polynomial regression); loess (fits a polynomial surface determined by one or more numerical predictors, using local fitting); smooth.spline (fits a cubic smoothing spline).

plot(site_charcoal_wide$age, log(site_charcoal_wide$`Charcoal area`+1), ylim=c(0, 6), xlab = "Age cal BP", ylab = "CHAR")
lines(site_charcoal_wide$age, lowess(site_charcoal_wide$`Charcoal area`, f=0.1)$y, col="red")
lines(site_charcoal_wide$age, loess(site_charcoal_wide$`Charcoal area`~site_charcoal_wide$age, span = 0.2)$fitted, col="blue")
lines(site_charcoal_wide$age, smooth.spline(site_charcoal_wide$`Charcoal area`, df = 25)$y, col="green")
legend(1,6, legend = c("LOWESS", "LOESS", "smooth spline"), col = c("red", "blue", "green"), lty=1, cex=0.8)

To estimate the values for given depths/ages in the model, we need to apply predict(). We further use loess interpolation.

site_char_loess <- loess(site_charcoal_wide$`Charcoal area`~site_charcoal_wide$depth, span = 0.20)
site_char_pred <- stats::predict(site_char_loess, site_diat_wide$depth)

Multivariate analysis (ordination)

Principle component analysis

We apply the Principle component analysis (PCA), here rda(), a linear unconstrained ordination method. As the method uses Euclidean distance to calculate dissimilarity between samples, we need to transform the data first to prevent the double-zero problem by applying e.g., the Hellinger transformation (square-root) using decostand. We also extract the PCA scores with fortify(), and use only 15 most abundant species. Next, we fit charcoal as environmental variable to the PCA analysis using envfit().

site_pca <- rda(decostand(site_diat_wide[,-1:-2], method = "hellinger"))
site_PCA_species <- site_diat_wide[,-1:-2] %>%
  select(order(colSums(.), decreasing = T))
site_scores <- fortify(site_pca)
site_scores_species <- subset(site_scores, score == "species" & label %in% head(colnames(site_PCA_species), 15))
site_PCA_env <- envfit(site_pca, data.frame(CHAR = site_char_pred))
site_PCA_env_scores <- fortify(site_PCA_env)

Next, we plot the ordination using the ggplot function, adding a fitted environmental variable (charcoal) with envfit().

ggplot() + 
  geom_point(data = site_scores_species, aes(x=PC1, y=PC2), color = "#117733", size = 2) +
  geom_text_repel(data = site_scores_species, aes(x = PC1, y = PC2, label = label), color = "#117733", size = 5) +
  geom_segment(data = site_PCA_env_scores, aes(x = 0, y = 0, xend = PC1, yend = PC2),
               arrow = arrow(length = unit(0.2, "cm")), color = "#D55E00") +
  geom_text_repel(data = site_PCA_env_scores, aes(x = PC1, y = PC2, label = label), color = "#D55E00", size = 5) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "darkgray") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "darkgray") +
  theme_classic() +
  theme(
    panel.border = element_rect(color = "black", fill = NA, linewidth = 1), 
    axis.line = element_blank(),
    axis.text = element_text(size = 10)
  ) +
  labs(x = "PC1", y = "PC2")

Detrended correspondence analysis

Here we perform the Detrended correspondence analysis (DCA) with decorana. Correspondence analysis is a weighted averaging method. DCA detrends the first ordination axis by segments (hammer effect) often used to determine the extent of compositional turnover (4 SD units = complete turnover) and decide whether the species response are unimodal or linear. DCA also attempts to remove the arch effect from ordination (compare the PCA and DCA ordination plots).

Check the gradient length - look at the axis length of DCA1

site_dca <- decorana(decostand(site_diat_wide[,-1:-2], method = "hellinger"))
site_dca

Call:
decorana(veg = decostand(site_diat_wide[, -1:-2], method = "hellinger")) 

Detrended correspondence analysis with 26 segments.
Rescaling of axes with 4 iterations.
Total inertia (scaled Chi-square): 2.0524 

                       DCA1   DCA2    DCA3    DCA4
Eigenvalues          0.2985 0.1344 0.09716 0.07879
Additive Eigenvalues 0.2985 0.1343 0.09604 0.07763
Decorana values      0.3067 0.1425 0.11246 0.06256
Axis lengths         1.6684 1.9264 1.78166 1.31981
plot(site_dca, display = "sites")