Leaf mass per area (\(LMA\)) is a valuable measure of plant ecological strategy that can be estimated for fossil taxa based on the measured leaf area and petiole width of fossilized leaves. Here, we ask whether we can improve estimates of \(LMA\) by taking the taxonomic identity of a specimen into account. If the morphological and biomechanical characteristics of leaves are heritable and to some extent evolutionarily conserved, it makes sense that the residual variation in \(LMA\) based on petiole and leaf area measurements would carry a phylogenetic signal. However, placing fossil specimens into a phylogeny is both technically difficult and introduces uncertainty. Since recent revisions of the angiosperm phylogeny have yielded largely monophyletic named clades at least to the family level, we decided to use a taxonomic approach as a means of incorporating this potential phylogenetic signal into fossil LMA estimates. Specifically, we use taxonomically nested linear mixed-effects models to estimate fossil \(LMA\), and compare alternative model formulations using a likelihood framework.

Data

The calibration data for extant species of angiosperms are drawn from Royer et al. (2007), with the taxonomy checked using the Taxonomic Name Resolution Service (TNRS). We also utilize fossil data from Royer et al. (2007), augmented by some fossil petiole and leaf area measurements provided by Matt Butrim.

royer_tax_new<-read.csv(file="../data/processed/royer_tax_new.csv")
all_fossil_royer_pred<-read.csv(file="../data/processed/all_fossil_royer_pred.csv")

Model Selection

We must determine which mixed effects model is best suited to make \(LMA\) estimates, using \(log(LMA)\) as the response variable and \(log(w_p^2/A_L)\) as the fixed effect, where \(w_p\) is the width of the petiole and \(A_L\) is the area of the leaf blade. We then include taxonomic information (superorder, order, or family) as nested random effects on either the slope or intercept. Our most inclusive model uses slope and intercept effects for superorder, order, and family, and we consider all nested simpler alternative models. We then determine the most suitable model by running a maximum likelihood and estimating Akaiki’s Information Criterion. These are shown in the following table.

model1<-lmer(log_lma~log_pet_leafarea + (log_pet_leafarea|superorder/order/scrubbed_family), data=royer_tax_new)
## boundary (singular) fit: see ?isSingular
model2<-lmer(log_lma~log_pet_leafarea + (log_pet_leafarea|order/scrubbed_family), data=royer_tax_new)
model3<-lmer(log_lma~log_pet_leafarea + (log_pet_leafarea|scrubbed_family), data=royer_tax_new)
model4<-lmer(log_lma~log_pet_leafarea + (1|superorder/order/scrubbed_family), data=royer_tax_new)
## boundary (singular) fit: see ?isSingular
model5<-lmer(log_lma~log_pet_leafarea + (1|order/scrubbed_family), data=royer_tax_new)
## boundary (singular) fit: see ?isSingular
model6<-lmer(log_lma~log_pet_leafarea + (1|scrubbed_family), data=royer_tax_new)
model7<-lm(log_lma~log_pet_leafarea, data=royer_tax_new)
modcompare<-anova(model1, model2, model3, model4, model5, model6, model7)
## refitting model(s) with ML (instead of REML)
Rank Model Slope.Effects Intercept.Effects LogLik AIC dAIC
1 3 F F -184.00 380.00 0.00
2 6 F -187.25 382.49 2.49
3 5 O, F -187.20 384.40 4.40
4 2 O, F O, F -183.69 385.37 5.37
5 4 SO, O, F -187.20 386.40 6.40
6 1 SO, O, F SO, O, F -183.32 390.64 10.64
7 7 -217.84 441.68 61.68

Comparison of Taxonomic vs. Non-taxonomic models

The best model retains a slope and an intercept effect for family: model3 above. While the model ignoring taxonomy explains a decent fraction of the variation in \(log(LMA)\) (\(R^2 = 0.54\)), accounting for random slope and intercept effects increases the overall conditional \(R_c^2\) value to 0.65, with a marginal \(R_m^2 = 0.51\) made up by the fixed effect of \(log(w_p^2/A_L)\). The model using only an intercept term for family is only marginally worse, with \(R_c^2 =0.61\) and \(R_m^2 = 0.50\).

But whether we use family as a random effect on slope or intercept, the taxonomically informed model outperforms the taxonomy-free model.

summary(model7)$r.squared
## [1] 0.5446034
r.squaredGLMM(model3)
##            R2m       R2c
## [1,] 0.5068941 0.6464948
r.squaredGLMM(model6)
##            R2m       R2c
## [1,] 0.4993657 0.6107867

Random Effects

We can examine differences among families by looking at the slope and intercept effects estimates for each family. For the model with family specific effects on both slope and intercept, we find the following:

dotplot(ranef(model3,condVar=TRUE))
## $scrubbed_family

Here we can see variation in the slope (these are deviations from the fixed effect estimate of each parameter). Also note that the variation is positively correlated: families with more positive slopes also tend to have more positive intercepts.

If we look at the simpler model, with only the intercept term varying among families, we see the following:

dotplot(ranef(model6,condVar=TRUE))
## $scrubbed_family

Model Assessment

We can also compare the alternative models by examining the predicted LMA values as a function of the observed measurements.

First let’s look at the Royer et al. model ignoring taxonomy.

#Bind fitted values to the observations in a data frame

mod7fit<-as.data.frame(cbind(model7$fitted.values, royer_tax_new$log_lma, royer_tax_new$scrubbed_family))
names(mod7fit) <- c("predlogLMA","obslogLMA", "family")

ggplot(aes(x=obslogLMA, y=predlogLMA, color=royer_tax_new$scrubbed_family), data=mod7fit) +
  geom_point() +
  geom_abline(yintercept=0, slope=1, lty="dashed") +
  labs(color='Family')+
  xlab('Observed log(LMA)')+
  ylab('Predicted log(LMA)')+
  labs(title = 'LMA Predictions without Family Information')
## Warning: Ignoring unknown parameters: yintercept

We already know that was a pretty good model. Now let’s look at the model with family specific variation in slope and intercept:

#Bind fitted values to the observations in a data frame

mod3fit<-as.data.frame(cbind(fitted(model3), royer_tax_new$log_lma, royer_tax_new$scrubbed_family))
names(mod3fit) <- c("predlogLMA","obslogLMA", "family")

ggplot(aes(x=obslogLMA, y=predlogLMA, color=royer_tax_new$scrubbed_family), data=mod3fit) +
  geom_point() +
  geom_abline(yintercept=0, slope=1, lty="dashed") +
  labs(color='Family')+
  xlab('Observed log(LMA)')+
  ylab('Predicted log(LMA)')+
  labs(title = 'LMA Predictions with Family variation in slope')
## Warning: Ignoring unknown parameters: yintercept

Here it looks like the predictions do hug the 1:1 line a bit more closely and the residual variation. And finally, we can look at the somewhat simpler model with only variation in intercept:

#Bind fitted values to the observations in a data frame

mod6fit<-as.data.frame(cbind(fitted(model6), royer_tax_new$log_lma, royer_tax_new$scrubbed_family))
names(mod6fit) <- c("predlogLMA","obslogLMA", "family")

ggplot(aes(x=obslogLMA, y=predlogLMA, color=royer_tax_new$scrubbed_family), data=mod6fit) +
  geom_point() +
  geom_abline(yintercept=0, slope=1, lty="dashed") +
  labs(color='Family')+
  xlab('Observed log(LMA)')+
  ylab('Predicted log(LMA)')+
  labs(title = 'LMA Predictions with Family variation in slope')
## Warning: Ignoring unknown parameters: yintercept

This looks pretty similar, though it maybe backs up the AIC assessment that the more complex model is incrementally better.

Fossil predictions with and without Taxonomy

In order to compare our fossil data set to the living data set, we must limit the fossil data to families that are present in extant data.

options(max.print = 1000)
af<-royer_tax_new$scrubbed_family
bf<-all_fossil_royer_pred$scrubbed_family

##whats in all_fossil_royer_pred that isn't in #royer_tax_new

missing<-bf[!(bf%in%af)]
write.csv(missing, file="../data/missing.csv")

missing<-read.csv("../data/missing.csv")
missingdata<-subset(missing, select=-X)
colnames(missingdata)[colnames(missingdata)=="x"]<-"scrubbed_family"

tallymissing<-
  missingdata%>%
  group_by(scrubbed_family)%>%
  tally()
#print.data.frame(tallymissing)

new_fossil_royer_pred2<-all_fossil_royer_pred[!all_fossil_royer_pred$scrubbed_family %in% missingdata$scrubbed_family,]

Then, prediction intervals are created for log(LMA) in new_fossil_royer_pred2 based on model 3. The dataset contains observations for expected LMA from the taxonomy free model, but this gives prediction intervals for log LMA that give a more picture of what those values should be.

##Prediction Intervals 
PI3<-predictInterval(merMod=model3, newdata=new_fossil_royer_pred2, 
                    level=0.95, n.sims=1000,
                    stat="median", type="linear.prediction",
                    include.resid.var = TRUE)
#PI
fossilpredictions<-cbind(new_fossil_royer_pred2,PI3)
ggplot(aes(x=log(LMA), y=fit, ymin=lwr, ymax=upr, color=new_fossil_royer_pred2$scrubbed_family), data=fossilpredictions) +
  geom_point() +
  geom_linerange() +
  geom_abline(yintercept=0, slope=1, lty="dashed") +
  labs(color='Family')+
  xlab('Predicted log(LMA) without Family')+
  ylab('Family specific log(LMA)')+
  labs(title = 'Fossil LMA Predictions with Family Information')
## Warning: Ignoring unknown parameters: yintercept

###save as PDF
ggsave("PettoLMAFigure.pdf")
## Saving 7 x 5 in image

Save fossil predictions and eliminate unnecessary columns

fossilpredictionsfinal<-fossilpredictions[-c(2, 4, 5, 6, 7, 8)]
write.csv(fossilpredictionsfinal, file="../data/processed/PettoLMApredictionsfinal.csv")