In today’s lab we will be heading back to the Macrostrat database, learning some new tools and tricks, and thinking more critically about the stratigraphic record. Things that require answers are in bold.
As usual, we need to start by loading our required libraries:
library(RCurl)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::complete() masks RCurl::complete()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Part 1: Big picture Macrostrat
First we’ll use tools we already know, then we’ll use some functions written by Dr. Macrotrat himself, Shanan Peters.
In this activity, we’ll be looking at specific lithologies. We’ll use a URL code that lets us specify our lithology up here, then code below will grab that and pop it into a URL for us when we want our data.
So, first off, let’s specify a lithology. I’m picking marble because, well, it’s not one that you’re going to look at (you’ll be looking at sedimentary lithologies only, and marble is metamorphic)
lithology_parameter="lith=marble"
Now let’s get your data! Notice the “lithoogy parameter” part - that’s pulling the lithology you specified above.
x=getURL(paste("https://macrostrat.org/api/units?",lithology_parameter,"&project_id=1,7&format=csv&response=long",sep=""))
yourdata=read.csv(text=x)
Let’s start off by looking at this data by grouping it by t_age.
xx<-yourdata %>% group_by(t_age) %>% tally()
Now let’s visualize this!
ggplot(xx, aes(x=t_age, y=n))+geom_point()+scale_x_reverse()
I personally love very simple plots, so here’s a way to make your ggplots look even sleeker:
ggplot(xx, aes(x=t_age, y=n))+geom_point()+scale_x_reverse()+theme_minimal()
As you’ll see, Macrostrat is pretty thin before the Phanerozoic, so it
can be more useful to zoooooom into the Phanerozoic. We can make a new
plot that’s just restricted to the Phanerozoic like this:
ggplot(xx, aes(x=t_age, y=n))+geom_point()+scale_x_reverse(limits=c(541,0))+theme_minimal()
## Warning: Removed 32 rows containing missing values (geom_point).
Q1: What does grouping and plotting by t_age do? What are the limitations to this approach? What would be a better way of doing this?
Play around doing the above with a few different (sedimentary) lithoolgies before moving on.
Q2A. Do the above again with a few lithlogies, but this time use b_age. What happens when you use b_age? Why? Which one do you think is better to use?
One way around the b_age t_age thing is to make a new variable that’s the mean of b_age and t_age, i.e., the midpoint of the unit’s time duration. To do this, we’ll use the tidyverse commant mutate. Can you figure out how to do this on your own? Rememer that we used this command in the last lab, so feel free to go look at that code and see what you did.
Q3 Make this new variable (call it meanage) and re-do the plot above with the new age variable.Does this seem better or worse? Why?
Now let’s also make a new variable that tells us the time range of each of our units, i.e, their duration. You’ll use simlar code as above, using the command mutate. Call your new variable range.
Once you’ve made this new time range variable -
Q4: Make a histogram of your new time range variable (we did this sort of plot last lab, so go look at that code if you can’t remember!)
Now we’ll use some fancy code to make timeseries that are binned. What does this mean? Instead of just relying on the t_age and b_age of each unit, we’re going to put them into bins that are a specified width long (like, 1 million years, or 5, or whatever).
package_timeseries=function(packages,bins=seq(1,3000,1)){
total_packages=numeric()
total_area=numeric()
flux=numeric()
prop=numeric()
prop[1:length(packages$b_age)]=1
if (is.numeric(packages$prop)) prop=packages$prop
for (i in 1:length(bins)){
total_packages[i]=length(intersect(which(packages$b_age>=bins[i]),which(packages$t_age<bins[i])))
total_area[i]=sum(packages$area[intersect(which(packages$b_age>=bins[i]),which(packages$t_age<bins[i]))])
a=intersect(which(packages$b_age>=bins[i]),which(packages$t_age<bins[i]))
total_area[i]=sum(packages$col_area[a]*prop[a])
flux[i]=sum(packages$col_area[a]*packages$max_thick[a]/1000*prop[a]/(packages$b_age[a]-packages$t_age[a]))
}
return(data.frame(cbind(bins,total_packages,total_area,flux)))
}
Run the code below, which will help us plot our new fancy timeseries with proper geological time scales pulled from Macrostrat
maketime=function(x,y_label,ages=FALSE,line_weight=1.2,ymin=FALSE,bins="international%20periods",direction="PALEO"){
library(RCurl)
tempx=getURL(paste("https://macrostrat.org/api/defs/intervals?timescale=",bins,"&format=csv",sep=''))
timescale_times=read.csv(text=tempx)
if (is.numeric(ages[1])){
begin=intersect(which(max(ages)>timescale_times$t_age),which(max(ages)<=timescale_times$b_age))
if (length(begin)==0) begin=length(timescale_times$t_age)
end=intersect(which(min(ages)>=timescale_times$t_age),which(min(ages)<timescale_times$b_age))
if (end==1) end=1
} else {
end=1
begin=length(timescale_times$t_age)
}
if (direction=='GEOCHEM') direct=begin:end
else direct=end:begin
interval.names=as.character(timescale_times$abbrev[direct])
interval.midpoint= timescale_times$b_age[direct]-(timescale_times$b_age[direct]-timescale_times$t_age[direct])/2
if (end==2) {
time.boundaries=timescale_times$b_age[direct]
} else time.boundaries=c(timescale_times$t_age[end],timescale_times$b_age[direct])
par(cex.lab=0.75,cex.axis=.6,font.lab=1,tcl=-0.2,mar=c(3,2.5,2,1)+0.1,mgp=c(1.2,.2,0))
########################################################################################
y.max=max(x)
x.max=timescale_times$b_age[begin]
x.min=timescale_times$t_age[end]
if (direction=='GEOCHEM') lims=c(x.min,x.max)
else lims=c(x.max,x.min)
if (ymin) y.min=min(x) else if (min(x)<=0) y.min=min(x) else y.min=0
########################################################################################
plot.min=y.min-(.05*(y.max-y.min))*1.25 ##trial and error fits for good plotting of scale lines
seg.min=plot.min
text.min=y.min-abs(y.min-plot.min)/2 ##"""
plot(1,1,xlim=lims,ylim=c(plot.min,(y.max)+y.max*.03),type="n",xlab="Geologic time (Ma)", ylab=y_label,yaxs="i")
abline(h=y.min)
segments(time.boundaries,y.min,time.boundaries,seg.min)
text(interval.midpoint,text.min,labels=interval.names,cex=.7)
if (is.numeric(ages[1]) && length(ages)==length(x)) lines(ages,x,lwd=line_weight)
####################################################################################
}
Set your bins - this gets called into the timeseries function below. The first number is the youngest age you’re interested in, the second is the oldest, and the third is the size of the bins (I think?)
bins=seq(250,541,1)
put this data into the timeseries function that you made above.
yourdata_timeseries=package_timeseries(yourdata,bins=bins)
Then apply the timescale code too!
maketime(yourdata_timeseries$total_packages,"all units",ages=bins)
OK now you know how to make some pretty fun figures.
Q6:What’s the advantage of using binned data? Are there any disadvantages to using binned data?
Q7A: For the Paleozoic, make figures showing the number of units containing shale and limestone. Are they different? What do you think could be causing this difference?
Q7B: For the Mesozoic, make figures showing the number of units containing shale and limestone. Are they different? What do you think could be causing this difference?
Q7C: For just one of the lithologies, make the above figure for the entire Phanerozoic. What patterns do you observe? How do they relate to the cannonical Sepkoski curve of marine diversity?(http://strata.geology.wisc.edu/jack/)
In the next part, we’re going to start looking at how the fossil record intersects with the rock record. To do this, we’ll rely on the pbdb_collections and pbdb_occurrences fields in the yourdata file that you made earlier in the lab.
Q8: Make a plot with meanage on the x-axis and pbdb_occurrences and pbdb_collections on the y-axis, and restict the x-axis of the plot to just the Phanerozoic. Do this for both shale and limstone.
Q9: What is the difference between your occurrences and collections plots? Why?
Q10: What is the difference between your limestone and shale plots? Why do you think they might be different?
Part 2: Where we’re going
First, go check out the COSUNA chart on the wall of the lab room. Find the column that we’ll be visiting on our field trip. Examine the distribution of packages through the column. Now we’ll look at the digital version of this in Macrostrat!
In a regular old web browser, head to https://macrostrat.org/sift/#/column/474 – this is the column for our field trip! Explore this page a bit - it’s a helpful way to visualize the data that we’re about to explore in R.
Once you’ve explored a bit, let’s import all of this data into R. To do this, we need to call the column ID.
x=getURL(paste("https://macrostrat.org/api/units?col_id=474&format=csv&response=long",sep=""))
col=read.csv(text=x)
Once you’ve got this data, look at it a little bit (easiest way is to click on the dataframe name in the upper right ‘global environment’ and it’ll pop up to the left)
There’s some fun things we can do with this dataframe now! First make a new variable as you did above for meanage.
Let’s try to make a figure of the ranges of each of the units within this column. Sort of like an R version ofa stratigraphic chart. We want the x-axis to be time, which we’re using meanage for. We want the names (each unit) along the y-axis. In order to get ranges we have to do something a little fiddly which is to pretend they are actually error bars, and use that part of ggplot to make this work, just telling it that our error bars go from our b_age to our t_age (centered around our meanage, which is our x-value!).
col<-col %>% mutate(meanage2=(t_age+b_age/2))
col %>% ggplot(aes(x = meanage2, y = unit_name)) + geom_point() +
geom_errorbarh(aes(xmin = b_age, xmax = t_age)) + scale_x_reverse()
OK this looks…messy. It would be way better if we could reorder the y-axis accorinding to the mean age so we could tell what’s what. We’ll use the function fct_reorder, which, as the name indicates, helps you reorder factors (like the unit names). Head here to learn more and see an example that you can use on this data: https://www.r-graph-gallery.com/267-reorder-a-variable-in-ggplot2.html
Q11: Make a new plot where the units are re-ordered according to their mean age. When are the two biggest hiatuses in this data? Approximately how long are they? What if you just focus on sedimentary rocks? How long is the longest hiatus then?
Cool! Now let’s think about the fossil record some more. We can look and see which units have PBDB occurrences in them and which don’t. Units with fossils we call fossiliferous, those without, we call barren. Let’s make a new version of this figure that colors the fossiliferous units one color, and the barren ones another color so that it’s easy to see where the fossils are and are not. In order to do that, we’ll make another variable that’s binary, i.e., yes or no. To do this, we’ll need to use mutate in combination with the command ifelse. Check out this guide which should get you there: https://rstudio-pubs-static.s3.amazonaws.com/116317_e6922e81e72e4e3f83995485ce686c14.html#/
Q12: Make a new plot where units are color coded according to wether or not they are fossiliferous
And finally, since we’re really only interested in the Cambrian-Ordovician units, let’s focus in on those. We could do this by changing the axis limits, but if we do that, it’ll leave all the unit names on the y axis and will look icky. So instead, we’ll use the filter command: https://dplyr.tidyverse.org/reference/filter.html
Now you should have one chunk of code that includes a bunch of pipes that filter, reorder, and graph your color-coded data.
Q13: Make a new plot that is restricted to the Cambrian-Ordovician units. What patterns (re: fossils or not), if any, do you see here? What do you think is causing them? Feel free to dig into other columns of the dataframe to help answer this question.
Pick another column of your choosing. Make sure your column has some fossils in it somewhere. The easiest way to find column ID’s is through the visual part of Macrostrat, aka Sift (what we looked at earlier). If you go to the main page, which is https://macrostrat.org/sift, and scroll down, you will see all of North America with grey polygons on it. Each one of these is a column. To make sure it has fossils, look under the ‘collections’ line on the main info panel when you pull up the column.
Q14: do Q’s 11 -13 for this new column
Q15: Using everything we’ve learned so far, (plus anything you’ve learned on your own, via google, or whatever) make another figure you think is interesting using data from Macrostrat, tell me what it means, and why you find it interesting.