Background
In this assignment, we will (re-)analyse one of the most influential databases in ecology: the glopnet database. This database was the first worldwide compilation of plant traits. Since then, plant traits (i.e. the measurable properties of plants) have been used to understand the ecological strategies of plants around the world, how these strategies evolved, and how plants respond to, for example, climate change. Traits have served as important inputs for improving so-called Dynamic Global Vegetation Models (the models that form the basis of the IPCC reports on climate change), and they underpin the concept of functional diversity as an alternative to species richness as a metric for biodiversity.
Aims
The aims of this assignment are:
- To (once more) test the assumptions of linear regression
- To create a better understanding of the independent variable
- To provide a solution for cases in which the independent variable has (measurement) uncertainty
The glopnet database was first published in Nature by Wright et al. 2004: https://www.nature.com/articles/nature02403. This is one of the most cited papers in ecology of the past two decades. Let me start by saying that there is nothing wrong with the statistical analyses presented in their paper. They used Standardized Major Axis regressions (SMAs) to analyse trait-trait relationships, which is also the correct way to analyse this dataset. Most of the discussion concerns the interpretation of the patterns (which have since been confirmed repeatedly): do they represent evolutionary patterns or 'coordinated' patterns driven by common ecological drivers? There is also discussion on whether expressing traits on a mass basis or an area basis is the better choice. This has led to one of the most aggressive papers published in an ecological journal: https://nph.onlinelibrary.wiley.com/doi/10.1111/nph.12281
As a general note for this assignment: consider previous assignments if you have forgotten how to do a particular action.
1. Read the data file of glopnet in Rstudio and evaluate the properties of the dataset with summary() and similar functions.
# loading data
glopnet <- read.table("glopnet-database.txt", sep="\t", header=TRUE)
# overview of data
head(glopnet)
## log.LLS log.LMA log.Nmass log.Narea log.Pmass log.Parea log.Amass log.Aarea
## 1 12.5530 24.54 0.069 0.5458 -0.9323 -0.4783 16.91 11.45
## 2 11.7150 21.88 0.094 0.2880 -0.9230 -0.7350 18.61 10.49
## 3 13.4700 21.45 0.014 0.1706 -0.9935 -0.8485 18.76 10.22
## 4 NA 19.79 NA NA NA NA NA NA
## 5 0.9513 20.31 0.388 0.4139 -0.8241 -0.7931 22.08 12.39
## 6 11.6690 23.56 0.255 0.6111 -0.9617 -0.6057 20.02 13.58
## log.Gs
## 1 23.526
## 2 21.776
## 3 22.031
## 4 NA
## 5 23.277
## 6 24.167
tail(glopnet)
## log.LLS log.LMA log.Nmass log.Narea log.Pmass log.Parea log.Amass
## 2543 NA 20.223 NA NA NA NA NA
## 2544 NA 20.506 0.4048 0.4554 NA NA NA
## 2545 NA 19.830 0.3324 0.3154 NA NA NA
## 2546 NA 20.706 NA NA NA NA NA
## 2547 NA 18.794 0.5011 0.3805 NA NA NA
## 2548 NA 18.239 0.4594 0.2833 NA NA NA
## log.Aarea log.Gs
## 2543 NA NA
## 2544 NA NA
## 2545 NA NA
## 2546 NA NA
## 2547 NA NA
## 2548 NA NA
str(glopnet)
## 'data.frame': 2548 obs. of 9 variables:
## $ log.LLS : num 12.553 11.715 13.47 NA 0.951 ...
## $ log.LMA : num 24.5 21.9 21.4 19.8 20.3 ...
## $ log.Nmass: num 0.069 0.094 0.014 NA 0.388 0.255 0.328 0.382 NA 0.412 ...
## $ log.Narea: num 0.546 0.288 0.171 NA 0.414 ...
## $ log.Pmass: num -0.932 -0.923 -0.994 NA -0.824 ...
## $ log.Parea: num -0.478 -0.735 -0.849 NA -0.793 ...
## $ log.Amass: num 16.9 18.6 18.8 NA 22.1 ...
## $ log.Aarea: num 11.4 10.5 10.2 NA 12.4 ...
## $ log.Gs : num 23.5 21.8 22 NA 23.3 ...
summary(glopnet)
## log.LLS log.LMA log.Nmass log.Narea
## Min. :-0.0252 Min. :11.58 Min. :-0.6055 Min. :-0.5836
## 1st Qu.: 0.6010 1st Qu.:17.91 1st Qu.: 0.0857 1st Qu.: 0.1174
## Median : 0.9255 Median :19.68 Median : 0.2601 Median : 0.2487
## Mean : 6.4051 Mean :19.91 Mean : 0.2275 Mean : 0.2437
## 3rd Qu.:12.8625 3rd Qu.:21.69 3rd Qu.: 0.3868 3rd Qu.: 0.3709
## Max. :24.5940 Max. :31.79 Max. : 0.8035 Max. : 0.9610
## NA's :1796 NA's :178 NA's :483 NA's :569
## log.Pmass log.Parea log.Amass log.Aarea
## Min. :-20.9690 Min. :-17.7140 Min. : 0.6812 Min. : 0.0014
## 1st Qu.:-13.9790 1st Qu.:-11.2605 1st Qu.:17.4240 1st Qu.: 0.8647
## Median :-10.1300 Median : -0.9564 Median :20.0205 Median :10.1630
## Mean : -7.5799 Mean : -5.6210 Mean :19.7818 Mean : 6.6419
## 3rd Qu.: -0.8021 3rd Qu.: -0.7922 3rd Qu.:22.0775 3rd Qu.:11.5840
## Max. : -0.2218 Max. : -0.0541 Max. :28.2100 Max. :16.2320
## NA's :1784 NA's :1792 NA's :1778 NA's :1723
## log.Gs
## Min. :15.44
## 1st Qu.:21.77
## Median :23.71
## Mean :23.86
## 3rd Qu.:25.92
## Max. :33.56
## NA's :2048
# checking distributions of data
par(mfrow=c(3,3))
hist(glopnet$log.LLS,main="LLS")
hist(glopnet$log.LMA,main="LMA")
hist(glopnet$log.Nmass,main="Nmass")
hist(glopnet$log.Narea,main="Narea")
hist(glopnet$log.Pmass,main="Pmass")
hist(glopnet$log.Parea,main="Parea")
hist(glopnet$log.Amass,main="Amass")
hist(glopnet$log.Aarea,main="Aarea")
hist(glopnet$log.Gs,main="Gs")
Conclusion: The distributions look good. The apparent deviations are caused by NAs, which will not be taken into account during the regression. A few data points deviate from normality, but bear in mind that these are only a handful of observations out of close to 800 (i.e. less than 8%), which may simply arise from random processes. They are therefore not problematic. Had you considered them problematic, you could have addressed this by bootstrapping.
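Had you wanted to bootstrap, a minimal sketch could look like the following (the number of resamples and the seed are illustrative only; the LMA-Amass regression analysed later in this assignment is used as an example):
# bootstrap sketch: resample rows with replacement and refit the regression each time
set.seed(1)                                            # for reproducibility
cc <- na.omit(glopnet[, c("log.LMA", "log.Amass")])    # complete cases only
boot_slopes <- replicate(1000, {
  i <- sample(nrow(cc), replace = TRUE)                # resample row indices
  coef(lm(log.Amass ~ log.LMA, data = cc[i, ]))[2]     # refit and keep the slope
})
quantile(boot_slopes, c(0.025, 0.975))                 # bootstrap 95% interval for the slope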
2. What is your conclusion on the relationships between traits?
Let’s first evaluate some of the patterns in the data using a correlation matrix, which we can then visualize. To do so, install the package ‘corrplot’.
Note: we need the argument "pairwise.complete.obs" so that the NAs in the dataset do not hamper the calculation of the correlation coefficients.
# defining correlation matrix
cor_matrix <- cor(glopnet, use="pairwise.complete.obs") # takes NAs into account
# loading the corrplot package
library(corrplot)
## corrplot 0.95 loaded
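The call that actually draws the matrix is not shown above; a minimal version could be (the method and type arguments are a matter of taste):
# visualizing the correlation matrix
corrplot(cor_matrix, method="ellipse", type="upper")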
Conclusion: Several traits are strongly correlated. Particularly strong correlations exist among LMA, Nmass (nitrogen content of leaves), Pmass (phosphorus content of leaves), Amass (photosynthesis rate) and LLS (leaf lifespan). Strong relationships also exist between Amass, Aarea and Gs (stomatal conductance).
3. First, explore the relationship by making a plot with Amass as the dependent variable. How would you describe the relationship between LMA and Amass?
Next, we analyse what would have gone wrong had the authors analysed the data using an ordinary linear regression. We take the relationship between LMA (Leaf Mass per Area) and Amass (the photosynthesis rate of a leaf on a mass basis). Take Amass as the dependent variable for a regression analysis.
# creating simple linear regression
simod <- lm(log.Amass~log.LMA,data=glopnet)
# plotting regression
plot(log.Amass~log.LMA, data=glopnet,pch=16,col="navyblue",lwd=2,main="LMA x Amass",
xlab="Leaf Mass per Area",ylab="Leaf Photosynthesis Rate (Mass Basis)")
abline(simod,col="steelblue")
Conclusion: There is a strong negative correlation between Amass and LMA.
4. Test the assumptions for this regression analysis. What is your conclusion on the regression assumptions?
# applying Kolmogorov-Smirnov test for normality of the residuals
ks.test(simod$residuals,pnorm,mean(simod$residuals),sd(simod$residuals))
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: simod$residuals
## D = 0.054835, p-value = 0.02021
## alternative hypothesis: two-sided
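The graphical checks referred to in the conclusion below are not shown above; a minimal sketch using R's standard lm diagnostics:
# visual checks: residuals vs fitted values (heterogeneity of variance) and a QQ-plot (normality)
par(mfrow=c(1,2))
plot(simod, which=1)   # residuals vs fitted values
plot(simod, which=2)   # normal QQ-plot of the residuals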
Conclusion: The model complies with most assumptions, though the normality assumption is not fully fulfilled. The distribution of the residuals is a bit left-skewed, which is confirmed by the Kolmogorov-Smirnov test and seems to be caused by the four outliers in Amass, possibly a unit problem in this database. This is mainly a problem if it translates into heterogeneity of variance. There is indeed some heterogeneity of variance, but the deviations seem limited. Hence, we may continue with the analysis.
5. Run the regression. Interpret the ANOVA table and the model coefficients.
anova(simod)
## Analysis of Variance Table
##
## Response: log.Amass
## Df Sum Sq Mean Sq F value Pr(>F)
## log.LMA 1 4720.9 4720.9 727.01 < 2.2e-16 ***
## Residuals 762 4948.1 6.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefficients(simod)
## (Intercept) log.LMA
## 39.4879217 -0.9768948
summary(simod)
##
## Call:
## lm(formula = log.Amass ~ log.LMA, data = glopnet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1316 -1.3586 0.2047 1.6052 5.6551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.48792 0.73563 53.68 <2e-16 ***
## log.LMA -0.97689 0.03623 -26.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.548 on 762 degrees of freedom
## (1784 observations deleted due to missingness)
## Multiple R-squared: 0.4882, Adjusted R-squared: 0.4876
## F-statistic: 727 on 1 and 762 DF, p-value: < 2.2e-16
Conclusion: The model is highly significant, and with every unit increase in log(LMA), log(Amass) decreases by 0.98 units.
Please note: the anova() output of any linear model, whether the analysis is called regression, ANOVA, ANCOVA or otherwise, is referred to as an ANOVA table. So far, you have focussed on the summary() table instead of the anova() table. This simply reflects a difference in taste between Marco and Peter. Peter prefers to see all the ins and outs of the model first; these are provided in the ANOVA table. Marco likes to see the parameter estimates, as he considers that you also get the df, F- and P-values there. The differences between summary() and anova() grow larger for more complicated models, where summary() provides estimates of the individual parameters (incl. the levels within a factor) while anova() provides the significance of the factor as a whole.
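To illustrate this difference for a model that contains a factor, here is a small sketch using R's built-in iris data (not part of the glopnet assignment):
# small illustration with a factor: summary() vs anova()
m <- lm(Sepal.Length ~ Species, data=iris)
summary(m)   # one estimate (and t-test) per species level, relative to the reference level
anova(m)     # a single F-test for the Species factor as a whole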
6. Repeat the regression while taking LMA as dependent. Do you have to test the assumptions again?
# expected slope of the inverted model if the fits were symmetric: 1 / slope of simod
1/coefficients(simod)[2]
## log.LMA
## -1.023652
# creating inverse of previous simple linear regression
invsimod <- lm(log.LMA~log.Amass,data=glopnet)
# plotting inverse regression
plot(log.LMA~log.Amass, data=glopnet,pch=16,col="navyblue",lwd=2,main="Amass x LMA",
xlab="Leaf Photosynthesis Rate (Mass Basis)",ylab="Leaf Mass per Area")
abline(invsimod,col="steelblue")
# ANOVA table of the inverted model
anova(invsimod)
## Analysis of Variance Table
##
## Response: log.LMA
## Df Sum Sq Mean Sq F value Pr(>F)
## log.Amass 1 2415.3 2415.28 727.01 < 2.2e-16 ***
## Residuals 762 2531.5 3.32
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefficients(invsimod)
## (Intercept) log.Amass
## 30.0445796 -0.4997976
summary(invsimod)
##
## Call:
## lm(formula = log.LMA ~ log.Amass, data = glopnet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9607 -1.0360 0.0243 1.1653 5.7789
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.04458 0.37307 80.53 <2e-16 ***
## log.Amass -0.49980 0.01854 -26.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.823 on 762 degrees of freedom
## (1784 observations deleted due to missingness)
## Multiple R-squared: 0.4882, Adjusted R-squared: 0.4876
## F-statistic: 727 on 1 and 762 DF, p-value: < 2.2e-16
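The re-check of the assumptions is not shown above; following the same approach as before, it could look like this:
# re-testing normality of the residuals of the inverted model
ks.test(invsimod$residuals,pnorm,mean(invsimod$residuals),sd(invsimod$residuals))
# quick visual check of homogeneity of variance
plot(invsimod$fitted.values, invsimod$residuals, xlab="Fitted values", ylab="Residuals")
abline(h=0, lty=2)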
Conclusion: Yes, the assumptions need to be tested again, because the assumptions of a regression concern the residuals, and the residuals depend entirely on the Y-variable. Hence, when inverting the relationship, a different Y-variable comes into play and the assumptions need to be tested again. Also in this case the assumptions are fulfilled (and even more convincingly, although there is again some heterogeneity of variance), but this was not necessarily the case!
7. When you compare the regression coefficients, does this give the expected result? What do you conclude on the robustness of the models?
# combining coefficients of both models onto one table
coefficients <- data.frame(row.names=c("Intercept","Slope"),
"Simple"=c(coefficients(simod)[1],coefficients(simod)[2]),
"Inverse"=c(coefficients(invsimod)[1],coefficients(invsimod)[2]))Conclusion: If the models would have been robust, then the coefficient of model 1 would equal 1/coefficient of model 2 (inverted model). In other words, the value of the coefficient of model 2 would have been: -1.024. Instead, a value of -0.50 was found, a difference of a factor of two! The differences in variance of the two variables causes the major differences in the behavior of the two models.
8. Run the SMA. Then, evaluate the model output and compare the coefficient of the slope with the estimates previously obtained.
Now, we will run the Standardized Major Axis (SMA) regression. First, install the ‘smatr’ package and load it with library(smatr).
# checking variance ratio
varratio <- var(glopnet$log.LMA,na.rm=TRUE)/var(glopnet$log.Amass,na.rm=TRUE)
# rule of thumb: if var(X) > 1/3 of var(Y), consider MA regression
var(glopnet$log.Amass, na.rm=TRUE) > (1/3 * var(glopnet$log.LMA, na.rm=TRUE)) # if TRUE, run MA
## [1] TRUE
# loading the smatr package
require(smatr)
## Loading required package: smatr
# running major axis regression
MAmod <- sma(formula=log.LMA~log.Amass, data=glopnet)
summary(MAmod)
## Call: sma(formula = log.LMA ~ log.Amass, data = glopnet)
##
## Fit using Standardized Major Axis
##
## ------------------------------------------------------------
## Coefficients:
## elevation slope
## estimate 34.31309 -0.7152752
## lower limit 33.57869 -0.7525887
## upper limit 35.04748 -0.6798118
##
## H0 : variables uncorrelated
## R-squared : 0.4882497
## P-value : < 2.22e-16
## slope
## -0.7152752
Conclusion: The slope lies in between the two slopes obtained previously, which is expected because the SMA accounts for, and weighs, the variance of both variables.
9. To evaluate model performance, you can make a scatterplot of the model. How do you interpret the plots?
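The plotting calls are not shown above; a minimal sketch, assuming the plot method provided by the smatr package (which = "residual" plots the residuals against the fitted values):
# scatterplot of the data with the fitted SMA line
plot(MAmod)
# residuals of the SMA against the fitted values
plot(MAmod, which="residual")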
# if model predicts well, residuals should be small and random
# horizontal line at zero represents "perfect prediction"
Conclusion: The residual plot and the fit look fine: the residuals are symmetric around the zero line, and their spread neither increases nor decreases with increasing fitted values. Four outliers are clearly visible, but they do not strongly affect the fit.
10. If you turn the dependent and independent around, what do you expect in terms of the coefficient? Test your prediction by running that model.
Turning the model around, you expect the slope to become 1/-0.715 = -1.40.
# running inverse major axis regression
invMAmod <- sma(formula=log.Amass~log.LMA, data=glopnet)
summary(invMAmod)
## Call: sma(formula = log.Amass ~ log.LMA, data = glopnet)
##
## Fit using Standardized Major Axis
##
## ------------------------------------------------------------
## Coefficients:
## elevation slope
## estimate 47.97187 -1.398063
## lower limit 46.52576 -1.470995
## upper limit 49.41797 -1.328747
##
## H0 : variables uncorrelated
## R-squared : 0.4882497
## P-value : < 2.22e-16
## slope
## -0.7152752
Conclusion: Indeed, the slope equals the predicted value. This implies that, in contrast to an ordinary regression, the fitted SMA does not depend on which variable you define as X and which as Y: the slope of the inverted model is simply the reciprocal of the original slope. This is because the SMA takes the uncertainty in both X and Y into account.