library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4

PART 1: SAMPLE STANDARDIZATION

Many paleobiological analyses depend on examining temporal, environmental, and taxonomic trends in diversity. Producing accurate temporal “diversity curves” has been a central objective of paleobiology since the pioneering work of Phillips (1860). Tracking diversity through time based on the unevenly preserved, distributed, and sampled fossil record poses many unique challenges, some of which are not immediately obvious.

Today we’re going to start by working through some basic issues surrounding how to estimate diversity from a sample of individuals. Although diversity can be measured in many different ways, we’ll focus on the standard approach of looking at the distribution of individuals across taxonomic groups. Because of concerns about how consistently species are defined and recognized in many fossil groups, it is common practice to examine diversity trends at the genus level.

Global, or “gamma” diversity can be partitioned into two components: alpha, or within-sample diversity, and beta, or between-sample diversity. Beta diversity is strongly influenced by environmental gradients such as elevation, climate, etc. that can be very difficult to delineate in the stratigraphic record. So we’ll start by focusing on making diversity comparisons at the alpha level. In doing this, we face the problem that not all samples are equal. Variation in sampling intensity (the number of individuals that have been sampled from the community) complicates comparative diversity analyses.
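As a quick toy illustration of how these components relate (just a sketch with made-up samples, using one classic formulation, Whittaker’s multiplicative partition, in which beta is gamma divided by mean alpha):

# toy example (made-up data): two local samples drawn from a regional pool
site1 <- c("A", "B", "C", "D")                  # alpha for site 1 = 4 species
site2 <- c("C", "D", "E", "F", "G")             # alpha for site 2 = 5 species
gamma      <- n_distinct(c(site1, site2))       # pooled (regional) richness = 7
mean.alpha <- mean(c(n_distinct(site1), n_distinct(site2)))   # mean within-sample richness = 4.5
beta       <- gamma/mean.alpha                  # Whittaker's multiplicative beta, about 1.56
beta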

Let’s begin by creating some fake data to work with. Imagine a local species pool composed of 20 species. We’ll designate each species with a letter - the code below just makes a vector of 20 letters, which will show up in the ‘Values’ section of the Environment panel at right.

species <- c(LETTERS[1:20])

At the local level, diversity is a function of (1) the total number of species in the pool (Species Richness) and (2) the equitability with which individuals are distributed among those species (Evenness, and its inverse, Dominance). In a maximally even assemblage, every species is represented by the same number of individuals. Let’s begin with an equal sampling probability for every species in the pool; we will later modify this assumption. We want the sampling probabilities to sum to 1, so we’ll divide 1 by the number of species in the pool. The code below just makes a vector of 20 values of 0.05 probability, which is 1/20.

probs <- rep(1/length(species),length(species))

Combine the vectors of species names and sampling probabilities into a single data frame. Go take a look at it (it’s now listed under ‘Data’ in the Environment panel, since we made it a dataframe).

samplingprobs <- data.frame(species,probs)

Now let’s imagine that we sample an individual randomly from this pool. The code below tells R to sample 1 individual from the species column of our dataframe samplingprobs, and that the probability of drawing each species should be determined by the probs column of the same dataframe. Right now all the probabilities are equal, so for the moment this doesn’t do much, but as we’ll see later, and as you can imagine, the probability of finding one species over another isn’t always the same (for example, biomineralized taxa will have a higher probability than soft-bodied ones). For now we’ll assume that the total number of individuals is effectively infinite, so we are sampling WITH replacement (the code replace=TRUE), i.e., when we take an individual out of the pool it is immediately replaced by another individual of the same species. Try running the code below a few times to see what it does.

rock1<-sample(samplingprobs$species,1, prob=samplingprobs$probs, replace=TRUE)
rock1
## [1] "Q"

Now we want to find out the richness of this rock sample, i.e., how many unique species are in it. We use the command n_distinct, which counts the number of distinct values. Since we only pulled one individual from our pool, richness will come back as ‘1’ here.

rich1<-n_distinct(rock1)
rich1
## [1] 1

Now let’s add more samples from our pool. To do this, we’ll ask R to make a vector that concatenates (links together in a chain; that’s what the ‘c’ does) our rock1 sample with another sample that pulls two more individuals from our pool.

rock2<-c(rock1,sample(samplingprobs$species,2, prob=samplingprobs$probs, replace=TRUE))
rock2
## [1] "Q" "R" "C"
rich2<-n_distinct(rock2)
rich2
## [1] 3

Q1: Run this a few times - does richness always come back as ‘3’? What’s happening when it doesn’t?

Now we’ll do this with bigger sample sizes. Write new code that keeps drawing individuals and adding them to the previous sample, so that the total sample size roughly doubles each time: 4, 8, 16, 32, 64, and finally 128 individuals. Name each sample accordingly (rock3, rock4, etc.), and calculate richness for each sample as you go. As above, call your richness values rich3, rich4, etc.
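For example, here is one way to write the next step (a sketch; the same pattern, with the target size changed, carries through for rock4 up to rock8):

# draw however many additional individuals are needed to bring the total sample size up to 4
rock3<-c(rock2, sample(samplingprobs$species, 4-length(rock2), prob=samplingprobs$probs, replace=TRUE))
rich3<-n_distinct(rock3)
rich3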

Now let’s see how our sampled species richness (S) changes as a function of the number of individuals sampled (N) - we’re going to just make a data frame out of our richness values:

N <- c(1,2,4,8,16,32,64,128)
S <- c(rich1, rich2, rich3, rich4, rich5, rich6, rich7, rich8)
dat <- data.frame(N,S)

Now let’s make a simple plot

ggplot(dat,aes(N,S))+geom_point()

If we want to map a variable to the size, color, or type of the points, we need to include a statement to that effect within “aes()”:

p<-ggplot(dat,aes(N,S,colour = S)) + geom_point(size = 4)
p

ggplot makes it very easy to superimpose graphical layers on top of each other:

p +  geom_line(colour="black") + geom_point(size = 4)

Q2: (2a) Instead of mapping the color of the points to species richness (S), map it to sample size (N). (2b) Change the color scale using “scale_colour_gradient()”. (2c) Remove colour="black" from the “geom_line()” statement and see what happens to the line color.

Now let’s get back to the problem at hand. Clearly, and in accord with common sense, the number of individuals that we sample has a strong influence on the number of species that we discover. Even in this maximally even case, in which all species are equally common, the number of individuals counted must substantially exceed the number of species in the species pool before our species richness estimates reflect the true underlying species number (which is 20). Of course we have just randomly sampled from this distribution one time. Let’s see how variable our estimates might be if we were to do this many times. This will also provide us an opportunity to review how to write simple functions and loops in R. There are more elegant ways to accomplish this task, but for this example we’ll use a very basic and clunky approach to make it easy to understand what we’re doing.

subsample.pool <- function(df){

# make a list in which we will store our results
  
subsamps <- list()

# now loop through each number in the set 1:100 and draw individuals from the vector of species IDs (letters) based on the vector of sampling probabilities.  We start with a sample of 1, then sample another individual for a total sample size of 2, then 2 more individuals for a total sample size of 4, then 4 more individuals for a total sample size of 8, etc. up to a sample size of 128 individuals

for (i in 1:100){
  
# here we sample individuals, using the "sample()" function: 
  
rock1<-sample(df$species,1, prob=df$probs, replace=TRUE) 
rock2<-c(rock1,sample(df$species,1, prob=df$probs, replace=TRUE))
rock3<-c(rock2,sample(df$species,2, prob=df$probs, replace=TRUE))
rock4<-c(rock3,sample(df$species,4, prob=df$probs, replace=TRUE))
rock5<-c(rock4,sample(df$species,8, prob=df$probs, replace=TRUE))
rock6<-c(rock5,sample(df$species,16, prob=df$probs, replace=TRUE))
rock7<-c(rock6,sample(df$species,32, prob=df$probs, replace=TRUE))
rock8<-c(rock7,sample(df$species,64, prob=df$probs, replace=TRUE))

rich1<-n_distinct(rock1)
rich2<-n_distinct(rock2)
rich3<-n_distinct(rock3)
rich4<-n_distinct(rock4)
rich5<-n_distinct(rock5)
rich6<-n_distinct(rock6)
rich7<-n_distinct(rock7)
rich8<-n_distinct(rock8)
                
# finally we'll combine these outputs into a data frame.  We'll also include the number of individuals sampled and the iteration number so that we can use these for making plots.
                
N <- c(1,2,4,8,16,32,64,128)
S <- c(rich1, rich2, rich3, rich4, rich5, rich6, rich7, rich8)
# iteration just repeats the current run number (i) eight times, once for each of the eight samples (rich1 through rich8) we took on this run
iteration <- rep(i, 8)
out <- data.frame(iteration,N,S)
subsamps[[i]] <- out
                }

# the loop outputs a list of 100 data frames like the one we generated above.  To make the output easier to work with, we'll combine them into a single data frame.  The bind_rows function stacks the data frames in the list on top of one another, matching columns by name.

subsamples <- bind_rows(subsamps) 
return(subsamples)
}

Let’s apply this function to the data frame of sampling probabilities we already made:

sub.iterations <- subsample.pool(samplingprobs)

Now let’s generate a plot like the one we made, showing lines for all 100 subsampling iterations. To do this we’ll add a new statement which tells ggplot to map the colour of any graphical objects produced, such as points and lines, to the “iteration” column. Because this will be used in mapping a variable to the plot, we include it within the aes() statement. Now every sampling iteration is plotted as a different line, with a slightly different color.

ggplot(sub.iterations,aes(N,S,colour = iteration)) + geom_point()+geom_line(aes(group = iteration))

So now let’s look at an example that approaches the opposite extreme: 90% of individuals belong to a single species (species A). We need to replace the maximally even vector of sampling probabilities with one that distributes 90% of sampling probability among species A and the remaining 10% of sampling probability among the other 19 species.

samplingprobs$probs <- c(.9,rep(((1-.9)/19),19))

Let’s re-run the subsampling function we just created with the new sampling probabilities, and plot the results.

sub.iterations2 <- subsample.pool(samplingprobs)

p <- ggplot(sub.iterations2,aes(N,S,group=iteration,colour=iteration))
p + geom_line() + geom_point()

These look very different.

Q3: Approximately what percentage of the 20 species in the pool do we find, on average, in our maximum sample size of 128 individuals?

This nicely illustrates the influence of evenness on estimates of species richness: when many of the species in the species pool are relatively rare, we need to sample many more individuals before our species richness estimates begin to reach a plateau.

Finally, we’ll try an intermediate sampling probability distribution: we’ll give each species a number drawn from an exponential distribution using the “rexp()” function and then divide by the sum, so that the sampling probabilities add up to 1.

samplingprobs$probs <- rexp(20, rate = 1)
total <- sum(samplingprobs$probs)
samplingprobs$probs <- samplingprobs$probs/total

sub.iterations2 <- subsample.pool(samplingprobs)

p <- ggplot(sub.iterations2,aes(N,S,group=iteration,colour=iteration))
p + geom_line() + geom_point()

So already we should suspect that estimating the sizes of local species pools through time is going to be difficult, especially if the intensity of our sampling also varies through time.

PART 2: Looking at Actual PBDB Data

We’re going to look at all trilobites. First, we’ll call the data from PBDB, similarly to how we called data from Macrostrat.

trilos<-read.csv("https://paleobiodb.org/data1.2/occs/list.csv?&base_name=trilobita&taxon_reso=lump_genus&show=coords,loc,paleoloc")

Now there’s lots we can do with this. First, let’s figure out the FAD and LAD (first and last appearance datum) of each genus within the trilobites. We’ll use group_by and summarise to do this, plus learn how to bind together columns that have the same number of rows.

minagetrilo<-trilos %>% group_by(accepted_name) %>% summarise(min(min_ma))
 
maxagetrilo<-trilos %>% group_by(accepted_name) %>% summarise(max(max_ma))
 
Range<-bind_cols(minagetrilo, maxagetrilo)
## New names:
## • `accepted_name` -> `accepted_name...1`
## • `accepted_name` -> `accepted_name...3`
Range<-Range[,-3]

Range
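As an aside (not required), you can also get both ages in a single summarise() call, which avoids the duplicated accepted_name columns and the bind_cols() step. Range2, FAD, and LAD here are just placeholder names:

Range2<-trilos %>% group_by(accepted_name) %>% summarise(FAD=max(max_ma), LAD=min(min_ma))
Range2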

What else can we do?

Alpha diversity refers to the diversity of a local community (i.e. a single site or sample). The simplest form of alpha diversity is species richness, which is the number of species observed in the local community. However, there are many different metrics used to quantify alpha diversity in community ecology, and they can broadly be broken down into three categories: measures of species richness, measures of species evenness, and overall diversity metrics that consider both richness and evenness. Here we will explore three of the most common.

Shannon’s Index (H’): Shannon’s H’ is the most frequently used metric for quantifying the overall diversity of a site. It takes into account both the total number of species observed at a site and how evenly abundance is distributed among them, and it measures how difficult it is to predict the identity of a randomly chosen individual from the community.

Richness: Richness is simply the number of species at a specific site. Despite its simplicity, multiple measures exist, ranging from the raw number of species observed at the site to model-based estimates of true richness.

Pielou’s Evenness (J’): Evenness is an important component of diversity, and reflects how evenly abundances are distributed among the species present at a site. A common measure of evenness is Pielou’s J’.
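For concreteness, here is a small sketch of how these three quantities are computed, using a made-up abundance vector; the vegan functions diversity() and specnumber() that we use below do this arithmetic for us:

counts <- c(50, 30, 10, 5, 5)       # made-up abundances: individuals per species
p <- counts/sum(counts)             # relative abundances
H <- -sum(p*log(p))                 # Shannon's H' (natural log)
S <- sum(counts > 0)                # richness: number of species present
J <- H/log(S)                       # Pielou's evenness J' = H'/ln(S)
H; S; J
diversity(counts)                   # vegan gives the same H'
specnumber(counts)                  # and the same S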

Here, we’ll look at a few measures of diversity in Miocene bivalves. First we need to download the data, then count how many occurrences there are per genus, and get rid of rows where there is no genus name.

DataBivalvesM<-read.csv("https://paleobiodb.org/data1.2/occs/list.csv?base_name=bivalvia&taxon_reso=genus&interval=Miocene,Miocene&show=full")
DataBivalvesMX<-DataBivalvesM  %>% count(genus)
DataBivalvesMX<-DataBivalvesMX %>% filter(genus!='')

Now we need to re-format our data into a matrix where each column is a genus - in this matrix, there’s only one row, but later I’ll show you how to do this for many rows (like time periods, or geoplates)

matrix1<-spread(DataBivalvesMX, genus, n, fill = 0)
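One note: spread() still works, but newer versions of tidyr recommend pivot_wider() instead; an equivalent call would be something like this:

matrix1<-pivot_wider(DataBivalvesMX, names_from=genus, values_from=n, values_fill=0)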

now to calculate some diversity terms!

H <- diversity(matrix1)
richness <- specnumber(matrix1)  
evenness <- H/log(richness)

and now to put these three numbers into one data frame

alpha <- cbind(shannon = H, richness = richness, pielou = evenness)
head(alpha)
##       shannon richness    pielou
## [1,] 5.941126     1027 0.8567617

Q4: Do this for Pliocene bivalves and then compare the two (Miocene vs. Pliocene) in a plot.

To do this you’ll need to label each row as either Miocene or Pliocene, which you can do by adding a new column that records the interval; you can learn how to do that here: https://tibble.tidyverse.org/reference/add_column.html . Then you’ll need to bind the two dataframes together, which you can learn how to do here: https://dplyr.tidyverse.org/reference/bind.html . A rough sketch of that bookkeeping is shown below.
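This sketch assumes alphaMiocene and alphaPliocene are the names of your two summary tables (use whatever names you actually chose); cbind() returns a matrix, so we convert each to a tibble before adding the interval column.

alphaMio<-as_tibble(alphaMiocene) %>% add_column(interval="Miocene")
alphaPli<-as_tibble(alphaPliocene) %>% add_column(interval="Pliocene")
both<-bind_rows(alphaMio, alphaPli)
ggplot(both, aes(x=interval, y=shannon))+geom_point()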

As I mentioned above, you can also make a matrix where each row is some sort of locality or other variable. Below I show you how to adapt the code above to look at the diversity of each geoplate. Note that we have to do a few extra things, including deleting the first column (the one holding the geoplate numbers) and then adding it back in later - annoying, but necessary.

DataBivalvesMX<-DataBivalvesM %>% group_by(geoplate) %>% count(genus)
DataBivalvesMX<-DataBivalvesMX %>% filter(genus!='')

matrix1<-spread(DataBivalvesMX, genus, n, fill = 0)
matrix2<-matrix1[,-1]

now run the same calculations on the new matrix

H <- diversity(matrix2)
richness <- specnumber(matrix2)  
evenness <- H/log(richness)

then put it all back together!

alpha <- cbind(shannon = H, richness = richness, pielou = evenness, matrix1)
head(alpha)

Plate IDs can be found here: https://www.earthbyte.org/Resources/AuScope-Dec-2007/Global_EarthByte_PlateIDs_20071218.pdf

plot

alpha %>% ggplot(aes(x=geoplate, y=richness))+geom_point()

So now you have the ability to compare communities! However, what might be impacting the differences we see here? Is there a difference in the amount of time covered by the Pliocene and the Miocene? How might you get around this? You could filter the data after you download it so that you compare equal spans of time. For example, if we wanted to restrict the Miocene data to only a 5-million-year window between 5 Ma and 10 Ma, we could filter like this (and also make it a new dataframe).
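One possible version (a sketch, assuming we key on the PBDB age-estimate columns max_ma and min_ma so that only occurrences falling entirely within the 5-10 Ma window are kept):

DataBivalvesM5to10<-DataBivalvesM %>% filter(max_ma<=10, min_ma>=5)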

Q5: Compare Cambrian vs. Ordovician communities of Brachiopoda using the same measures of diversity as above (ignore geoplates for now). Make sure you compare the same amount of time. What are the differences? Why might they exist? What’s the difference between diversity, richness, and evenness? Why do we need all three?

Applying what we’ve learned to the House Range

Now we’ll focus in again on our field area. To download the occurrence data for a specific geological unit, we can ask the PBDB! Here’s the download for the Kanosh. To be safe, I’ve restricted the search to the Cambrian-Ordovician and North America, just in case there’s another unit with Kanosh in the name somewhere out there.

kanosh<-read.csv("https://paleobiodb.org/data1.2/occs/list.csv?interval=Cambrian,Ordovician&cc=NOA&strat=Kanosh")

Some of the fossils here are identified to species level, others only to genus level, so we’ll just count by accepted name to make things easier. Use the code from above to calculate Shannon’s index, richness, and evenness for the Kanosh.

kanoshX<-kanosh  %>% count(accepted_name)
kanoshX<-kanoshX %>% filter(accepted_name!='')

Now we need to re-format our data into a matrix where, as before, each column is a taxon (here, an accepted name) and there’s only one row.

matrixK<-spread(kanoshX, accepted_name, n, fill = 0)

now to calculate some diversity terms!

H <- diversity(matrixK)
richness <- specnumber(matrixK)  
evenness <- H/log(richness)
alpha <- cbind(shannon = H, richness = richness, pielou = evenness)
head(alpha)
##       shannon richness    pielou
## [1,] 3.386535       34 0.9603485

Now do the same for each fossiliferous Cambrian-Ordovician formation you found in your last lab.

Q6: Make figure(s) showing all three indices for all fossiliferous formations from the last lab (i.e., from our field locality). You can compile the information in a spreadsheet in Excel or Google Sheets if that’s easier for you (Excel isn’t totally evil), then re-import it using the read.csv command - just make sure to save the Excel file as a .csv. Then write a paragraph or so analyzing your figures in the context of the localities’ ages, lithologies, and any other information you want to incorporate.

Can we make rarefaction curves for our formations? Let’s try! First we need to make a dataframe that shows how many occurrences there are per taxon in our unit.

kanoshrich<-kanosh %>% count(accepted_name)

Now we need to add a column for sampling probabilities. We’ll do this by dividing the number of occurrences (n) by the total number of occurrences in the dataframe, which we can find in a number of ways, including:

kanoshrich %>% summarise(count=sum(n))

So now we need to add a new column, using mutate, that divides n by the total count of occurrences in the unit.

kanoshrich<-kanoshrich %>% mutate(probs=(n/52))
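Equivalently, if you’d rather not hard-code the total count (52 here), you can let R compute it, which also makes the line reusable for other formations:

kanoshrich<-kanoshrich %>% mutate(probs=(n/sum(n)))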

Go check to make sure that species with higher numbers of occurrences also have higher probabilities.

Now we can re-write the function we used earlier to make a rarefaction curve for these data! (If any of you want to get fancy and make a function that works for any set of data, go ahead, but I’m happy to have you do it a more basic way.) One way to speed this up is to use find-and-replace (Cmd/Ctrl-F) to swap in the name of your dataframe and any other variables that have different names, or just rename your dataframe to fit the function (though it’s harder to keep track of things that way).
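If you do want to try the fancier route, here is one possible sketch of a generalized version (not required). It assumes your dataframe has a taxon-name column and a probs column, and it samples with replacement just like our earlier function; the function and object names are placeholders:

subsample.any <- function(df, name_col="accepted_name", sizes=c(1,2,4,8,16,32,64,128), n_iter=100){
  subsamps <- list()
  for (i in 1:n_iter){
    pool <- character(0)                # individuals drawn so far
    S <- numeric(length(sizes))         # richness at each target sample size
    for (j in seq_along(sizes)){
      n_new <- sizes[j]-length(pool)    # how many more individuals we need to reach the next target
      pool <- c(pool, sample(df[[name_col]], n_new, prob=df$probs, replace=TRUE))
      S[j] <- n_distinct(pool)
    }
    subsamps[[i]] <- data.frame(iteration=i, N=sizes, S=S)
  }
  bind_rows(subsamps)
}

# for example, applied to the Kanosh dataframe we just built:
kanosh.curves <- subsample.any(kanoshrich)
ggplot(kanosh.curves, aes(N, S, group=iteration, colour=iteration)) + geom_line()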

Q7: Make rarefaction curves for 5 of your fossiliferous formations (those with richness greater than 10) and discuss any differences you see, and what might be causing them, using information you’ve calculated earlier in the lab, like evenness.