Background
In this assignment, we will (re-)analyse one of the most influential databases in ecology: the glopnet database. This database was the first worldwide compilation of plant traits. Since then, plant traits (i.e. the measurable properties of plants) have been used to understand the ecological strategies of plants around the world, how these strategies evolved, and how plants respond to, for example, climate change. Traits have served as important inputs for improving so-called Dynamic Global Vegetation Models (the models that form the basis of the IPCC reports on climate change), and they underpin the concept of functional diversity as an alternative to species richness as a metric for biodiversity.
Aims
The aims of this assignment are:
- To (once more) test the assumptions of linear regression
- To create a better understanding of the independent variable
- To provide a solution for cases in which the independent variable has (measurement) uncertainty
The glopnet database was first published in Nature by Wright et al. 2004: https://www.nature.com/articles/nature02403. This is one of the most cited papers in ecology of the past two decades. Let me start by saying that there is nothing wrong with the statistical analyses presented in their paper. They used Standardized Major Axis regressions (SMAs) to analyse trait-trait relationships, which is also the correct way to analyse this dataset. Most of the discussion concerns the interpretation of the patterns (which have since been confirmed repeatedly): do they represent evolutionary patterns or 'coordinated' patterns driven by common ecological drivers? There is also discussion on whether expressing traits on a mass basis or an area basis is the better choice. This has led to one of the most aggressive papers published in an ecological journal: https://nph.onlinelibrary.wiley.com/doi/10.1111/nph.12281
As a general note for this assignment: consider previous assignments if you have forgotten how to do a particular action.
1. Read the data file of glopnet in Rstudio and evaluate the properties of the dataset with summary() and similar functions.
# loading data
glopnet <- read.table("glopnet-database.txt", sep="\t", header=TRUE)
# overview of data
head(glopnet)
## log.LLS log.LMA log.Nmass log.Narea log.Pmass log.Parea log.Amass log.Aarea
## 1 12.5530 24.54 0.069 0.5458 -0.9323 -0.4783 16.91 11.45
## 2 11.7150 21.88 0.094 0.2880 -0.9230 -0.7350 18.61 10.49
## 3 13.4700 21.45 0.014 0.1706 -0.9935 -0.8485 18.76 10.22
## 4 NA 19.79 NA NA NA NA NA NA
## 5 0.9513 20.31 0.388 0.4139 -0.8241 -0.7931 22.08 12.39
## 6 11.6690 23.56 0.255 0.6111 -0.9617 -0.6057 20.02 13.58
## log.Gs
## 1 23.526
## 2 21.776
## 3 22.031
## 4 NA
## 5 23.277
## 6 24.167
tail(glopnet)
## log.LLS log.LMA log.Nmass log.Narea log.Pmass log.Parea log.Amass
## 2543 NA 20.223 NA NA NA NA NA
## 2544 NA 20.506 0.4048 0.4554 NA NA NA
## 2545 NA 19.830 0.3324 0.3154 NA NA NA
## 2546 NA 20.706 NA NA NA NA NA
## 2547 NA 18.794 0.5011 0.3805 NA NA NA
## 2548 NA 18.239 0.4594 0.2833 NA NA NA
## log.Aarea log.Gs
## 2543 NA NA
## 2544 NA NA
## 2545 NA NA
## 2546 NA NA
## 2547 NA NA
## 2548 NA NA
str(glopnet)
## 'data.frame': 2548 obs. of 9 variables:
## $ log.LLS : num 12.553 11.715 13.47 NA 0.951 ...
## $ log.LMA : num 24.5 21.9 21.4 19.8 20.3 ...
## $ log.Nmass: num 0.069 0.094 0.014 NA 0.388 0.255 0.328 0.382 NA 0.412 ...
## $ log.Narea: num 0.546 0.288 0.171 NA 0.414 ...
## $ log.Pmass: num -0.932 -0.923 -0.994 NA -0.824 ...
## $ log.Parea: num -0.478 -0.735 -0.849 NA -0.793 ...
## $ log.Amass: num 16.9 18.6 18.8 NA 22.1 ...
## $ log.Aarea: num 11.4 10.5 10.2 NA 12.4 ...
## $ log.Gs : num 23.5 21.8 22 NA 23.3 ...
summary(glopnet)
## log.LLS log.LMA log.Nmass log.Narea
## Min. :-0.0252 Min. :11.58 Min. :-0.6055 Min. :-0.5836
## 1st Qu.: 0.6010 1st Qu.:17.91 1st Qu.: 0.0857 1st Qu.: 0.1174
## Median : 0.9255 Median :19.68 Median : 0.2601 Median : 0.2487
## Mean : 6.4051 Mean :19.91 Mean : 0.2275 Mean : 0.2437
## 3rd Qu.:12.8625 3rd Qu.:21.69 3rd Qu.: 0.3868 3rd Qu.: 0.3709
## Max. :24.5940 Max. :31.79 Max. : 0.8035 Max. : 0.9610
## NA's :1796 NA's :178 NA's :483 NA's :569
## log.Pmass log.Parea log.Amass log.Aarea
## Min. :-20.9690 Min. :-17.7140 Min. : 0.6812 Min. : 0.0014
## 1st Qu.:-13.9790 1st Qu.:-11.2605 1st Qu.:17.4240 1st Qu.: 0.8647
## Median :-10.1300 Median : -0.9564 Median :20.0205 Median :10.1630
## Mean : -7.5799 Mean : -5.6210 Mean :19.7818 Mean : 6.6419
## 3rd Qu.: -0.8021 3rd Qu.: -0.7922 3rd Qu.:22.0775 3rd Qu.:11.5840
## Max. : -0.2218 Max. : -0.0541 Max. :28.2100 Max. :16.2320
## NA's :1784 NA's :1792 NA's :1778 NA's :1723
## log.Gs
## Min. :15.44
## 1st Qu.:21.77
## Median :23.71
## Mean :23.86
## 3rd Qu.:25.92
## Max. :33.56
## NA's :2048
# checking distributions of data
par(mfrow=c(3,3))
hist(glopnet$log.LLS,main="LLS")
hist(glopnet$log.LMA,main="LMA")
hist(glopnet$log.Nmass,main="Nmass")
hist(glopnet$log.Narea,main="Narea")
hist(glopnet$log.Pmass,main="Pmass")
hist(glopnet$log.Parea,main="Parea")
hist(glopnet$log.Amass,main="Amass")
hist(glopnet$log.Aarea,main="Aarea")
hist(glopnet$log.Gs,main="Gs")
Conclusion: The distributions look good. The apparent deviations are caused by NAs, which will not be taken into account during the regression. A few data points deviate from normality, but bear in mind that these are only a handful of observations out of close to 800 (i.e. less than 8%), which may simply arise from random processes. They are therefore not problematic. Had you considered them problematic, you could have addressed this by bootstrapping.
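Had you wanted to bootstrap, a minimal sketch could look like the following (the number of resamples and the seed are illustrative only; the LMA-Amass regression analysed later in this assignment is used as an example):
# bootstrap sketch: resample rows with replacement and refit the regression each time
set.seed(1)                                            # for reproducibility
cc <- na.omit(glopnet[, c("log.LMA", "log.Amass")])    # complete cases only
boot_slopes <- replicate(1000, {
  i <- sample(nrow(cc), replace = TRUE)                # resample row indices
  coef(lm(log.Amass ~ log.LMA, data = cc[i, ]))[2]     # refit and keep the slope
})
quantile(boot_slopes, c(0.025, 0.975))                 # bootstrap 95% interval for the slope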
2. What is your conclusion on the relationships between traits?
Let’s first evaluate some of the patterns in the data using a correlation matrix, which we can then visualize. To do so, install the package ‘corrplot’.
Note: we need the argument "pairwise.complete.obs" so that the NAs in the dataset do not hamper the calculation of the correlation coefficients.
# defining correlation matrix
cor_matrix <- cor(glopnet, use="pairwise.complete.obs") # takes NAs into account
# loading the corrplot package
library(corrplot)
## corrplot 0.95 loaded
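The call that actually draws the matrix is not shown above; a minimal version could be (the method and type arguments are a matter of taste):
# visualizing the correlation matrix
corrplot(cor_matrix, method="ellipse", type="upper")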
Conclusion: Several traits are strongly correlated. Particularly strong correlations exist among LMA, Nmass (nitrogen content of leaves), Pmass (phosphorus content of leaves), Amass (photosynthesis rate) and LLS (leaf lifespan). Strong relationships also exist between Amass, Aarea and Gs (stomatal conductance).
3. First, explore the relationship by making a plot with Amass as the dependent variable. How would you describe the relationship between LMA and Amass?
Next, we analyse what would have gone wrong had the authors analysed the data using an ordinary linear regression. We take the relationship between LMA (Leaf Mass per Area) and Amass (the photosynthesis rate of a leaf on a mass basis). Take Amass as the dependent variable for a regression analysis.
# creating simple linear regression
simod <- lm(log.Amass~log.LMA,data=glopnet)
# plotting regression
plot(log.Amass~log.LMA, data=glopnet,pch=16,col="navyblue",lwd=2,main="LMA x Amass",
xlab="Leaf Mass per Area",ylab="Leaf Photosynthesis Rate (Mass Basis)")
abline(simod,col="steelblue")
Conclusion: There is a strong negative correlation between Amass and LMA.
4. Test the assumptions for this regression analysis. What is your conclusion on the regression assumptions?
# applying Kolmogorov-Smirnov test for normality of the residuals
ks.test(simod$residuals,pnorm,mean(simod$residuals),sd(simod$residuals))
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: simod$residuals
## D = 0.054835, p-value = 0.02021
## alternative hypothesis: two-sided
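The graphical checks referred to in the conclusion below are not shown above; a minimal sketch using R's standard lm diagnostics:
# visual checks: residuals vs fitted values (heterogeneity of variance) and a QQ-plot (normality)
par(mfrow=c(1,2))
plot(simod, which=1)   # residuals vs fitted values
plot(simod, which=2)   # normal QQ-plot of the residuals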
Conclusion: The model complies with most assumptions, though the normality assumption is not fully fulfilled. The distribution of the residuals is a bit left-skewed, which is confirmed by the Kolmogorov-Smirnov test and seems to be caused by the four outliers in Amass, possibly a unit problem in this database. This is mainly a problem if it translates into heterogeneity of variance. There is indeed some heterogeneity of variance, but the deviations seem limited. Hence, we may continue with the analysis.
5. Run the regression. Interpret the ANOVA table and the model coefficients.
anova(simod)
## Analysis of Variance Table
##
## Response: log.Amass
## Df Sum Sq Mean Sq F value Pr(>F)
## log.LMA 1 4720.9 4720.9 727.01 < 2.2e-16 ***
## Residuals 762 4948.1 6.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefficients(simod)
## (Intercept) log.LMA
## 39.4879217 -0.9768948
summary(simod)
##
## Call:
## lm(formula = log.Amass ~ log.LMA, data = glopnet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1316 -1.3586 0.2047 1.6052 5.6551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.48792 0.73563 53.68 <2e-16 ***
## log.LMA -0.97689 0.03623 -26.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.548 on 762 degrees of freedom
## (1784 observations deleted due to missingness)
## Multiple R-squared: 0.4882, Adjusted R-squared: 0.4876
## F-statistic: 727 on 1 and 762 DF, p-value: < 2.2e-16
Conclusion: The model is highly significant, and with every unit increase in log(LMA), log(Amass) decreases by 0.98 units.
Please note: the anova() output of any linear model, whether the analysis is called regression, ANOVA, ANCOVA or otherwise, is referred to as an ANOVA table. So far, you have focussed on the summary() table instead of the anova() table. This simply reflects a difference in taste between Marco and Peter. Peter prefers to see all the ins and outs of the model first; these are provided in the ANOVA table. Marco likes to see the parameter estimates, as he considers that you also get the df, F- and P-values there. The differences between summary() and anova() grow larger for more complicated models, where summary() provides estimates of the individual parameters (incl. the levels within a factor) while anova() provides the significance of the factor as a whole.
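To illustrate this difference for a model that contains a factor, here is a small sketch using R's built-in iris data (not part of the glopnet assignment):
# small illustration with a factor: summary() vs anova()
m <- lm(Sepal.Length ~ Species, data=iris)
summary(m)   # one estimate (and t-test) per species level, relative to the reference level
anova(m)     # a single F-test for the Species factor as a whole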
6. Repeat the regression while taking LMA as dependent. Do you have to test the assumptions again?
# expected slope of the inverted model if the fits were symmetric: 1 / slope of simod
1/coefficients(simod)[2]
## log.LMA
## -1.023652
# creating inverse of previous simple linear regression
invsimod <- lm(log.LMA~log.Amass,data=glopnet)
# plotting inverse regression
plot(log.LMA~log.Amass, data=glopnet,pch=16,col="navyblue",lwd=2,main="Amass x LMA",
xlab="Leaf Photosynthesis Rate (Mass Basis)",ylab="Leaf Mass per Area")
abline(invsimod,col="steelblue")
# ANOVA table of the inverted model
anova(invsimod)
## Analysis of Variance Table
##
## Response: log.LMA
## Df Sum Sq Mean Sq F value Pr(>F)
## log.Amass 1 2415.3 2415.28 727.01 < 2.2e-16 ***
## Residuals 762 2531.5 3.32
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefficients(invsimod)
## (Intercept) log.Amass
## 30.0445796 -0.4997976
summary(invsimod)
##
## Call:
## lm(formula = log.LMA ~ log.Amass, data = glopnet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9607 -1.0360 0.0243 1.1653 5.7789
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.04458 0.37307 80.53 <2e-16 ***
## log.Amass -0.49980 0.01854 -26.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.823 on 762 degrees of freedom
## (1784 observations deleted due to missingness)
## Multiple R-squared: 0.4882, Adjusted R-squared: 0.4876
## F-statistic: 727 on 1 and 762 DF, p-value: < 2.2e-16
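The re-check of the assumptions is not shown above; following the same approach as before, it could look like this:
# re-testing normality of the residuals of the inverted model
ks.test(invsimod$residuals,pnorm,mean(invsimod$residuals),sd(invsimod$residuals))
# quick visual check of homogeneity of variance
plot(invsimod$fitted.values, invsimod$residuals, xlab="Fitted values", ylab="Residuals")
abline(h=0, lty=2)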
Conclusion: Yes, the assumptions need to be tested again, because the assumptions of a regression concern the residuals, and the residuals depend entirely on the Y-variable. Hence, when inverting the relationship, a different Y-variable comes into play and the assumptions need to be tested again. Also in this case the assumptions are fulfilled (and even more convincingly, although there is again some heterogeneity of variance), but this was not necessarily the case!
7. When you compare the regression coefficients, does this give the expected result? What do you conclude on the robustness of the models?
# combining coefficients of both models onto one table
coefficients <- data.frame(row.names=c("Intercept","Slope"),
"Simple"=c(coefficients(simod)[1],coefficients(simod)[2]),
"Inverse"=c(coefficients(invsimod)[1],coefficients(invsimod)[2]))Conclusion: If the models would have been robust, then the coefficient of model 1 would equal 1/coefficient of model 2 (inverted model). In other words, the value of the coefficient of model 2 would have been: -1.024. Instead, a value of -0.50 was found, a difference of a factor of two! The differences in variance of the two variables causes the major differences in the behavior of the two models.
8. Run the SMA. Then, evaluate the model output and compare the coefficient of the slope with the estimates previously obtained.
Now, we will run the Standardized Major Axis (SMA) regression. First, install the ‘smatr’ package and load it with library(smatr).
# checking variance ratio
varratio <- var(glopnet$log.LMA,na.rm=TRUE)/var(glopnet$log.Amass,na.rm=TRUE)
# rule of thumb: if var(X) > 1/3 of var(Y), consider MA regression
var(glopnet$log.Amass, na.rm=TRUE) > (1/3 * var(glopnet$log.LMA, na.rm=TRUE)) # if TRUE, run MA
## [1] TRUE
# loading the smatr package
require(smatr)
## Loading required package: smatr
# running major axis regression
MAmod <- sma(formula=log.LMA~log.Amass, data=glopnet)
summary(MAmod)
## Call: sma(formula = log.LMA ~ log.Amass, data = glopnet)
##
## Fit using Standardized Major Axis
##
## ------------------------------------------------------------
## Coefficients:
## elevation slope
## estimate 34.31309 -0.7152752
## lower limit 33.57869 -0.7525887
## upper limit 35.04748 -0.6798118
##
## H0 : variables uncorrelated
## R-squared : 0.4882497
## P-value : < 2.22e-16
## slope
## -0.7152752
Conclusion: The slope lies in between the two slopes obtained previously, which is expected because the SMA accounts for, and weighs, the variance of both variables.
9. To evaluate model performance, you can make a scatterplot of the model. How do you interpret the plots?
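The plotting calls are not shown above; a minimal sketch, assuming the plot method provided by the smatr package (which = "residual" plots the residuals against the fitted values):
# scatterplot of the data with the fitted SMA line
plot(MAmod)
# residuals of the SMA against the fitted values
plot(MAmod, which="residual")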
# if model predicts well, residuals should be small and random
# horizontal line at zero represents "perfect prediction"
Conclusion: The residual plot and the fit look fine: the residuals are symmetric around the zero line, and their spread neither increases nor decreases with increasing fitted values. Four outliers are clearly visible, but they do not strongly affect the fit.
10. If you turn the dependent and independent around, what do you expect in terms of the coefficient? Test your prediction by running that model.
Turning the model around, you expect the slope to become 1/-0.715 = -1.40.
# running inverse major axis regression
invMAmod <- sma(formula=log.Amass~log.LMA, data=glopnet)
summary(invMAmod)
## Call: sma(formula = log.Amass ~ log.LMA, data = glopnet)
##
## Fit using Standardized Major Axis
##
## ------------------------------------------------------------
## Coefficients:
## elevation slope
## estimate 47.97187 -1.398063
## lower limit 46.52576 -1.470995
## upper limit 49.41797 -1.328747
##
## H0 : variables uncorrelated
## R-squared : 0.4882497
## P-value : < 2.22e-16
## slope
## -0.7152752
Conclusion: Indeed, the slope equals the predicted value. This implies that, in contrast to an ordinary regression, the fitted SMA does not depend on which variable you define as X and which as Y: the slope of the inverted model is simply the reciprocal of the original slope. This is because the SMA takes the uncertainty in both X and Y into account.