1 Introduction

Biomass estimation models can inflcuence local, regional and global estimates. Current standard operating procedures for carbon measurement recommend the use of existing (previously published) models. However, it is difficult to know how accurate and aplicable a model can be for certain forests and species due to the wide variety of climates and microclimates.

In this study, different models will be tested, such as:

  • Linear regression models.
  • General linear models.
  • General additive models.
  • Decision trees.
  • Bagged trees.
  • Random forest.

And using the Root Mean Square Error (RMSE) metric, we can compare the error between the actual values and the estimations of the models.

2 Loading data and libraries

2.1 Data

# Data
trees <- read.csv("/Users/RStudio/Documents/Dropbox/Semilla/contrato/Personal/Walter Chanava/Tesis/Datos/trees.csv")
trees
  • arbol index of the tree.
  • diam.bas basal diameter of the tree in \(cm\) measured at \(30cm\) above ground.
  • altura total height of the tree, measured in \(m\).
  • area.copa cup area of the tree considering it a oval, measured in \(m^2\).
  • anillos age of the tree, measured by counting the number of rings a basal disk of the tree has.
  • pesos Above Ground Biomass (AGB), measured in \(kg\).

2.2 Libraries

# Libraries
library(corrplot)
library(cowplot)
library(gbm)
library(ggplot2)
library(ipred)
library(Metrics)
library(randomForest)
library(rpart)
library(rpart.plot)
  • corrplot to create a correlation diagram.
  • cowplot to visualize multiple plots in one figure.
  • gbm to do a Gradient Boosting Machine model.
  • ggplot to visualize the data.
  • ipred to do a bagged trees model.
  • Metrics to calculate the Root Mean Square Error (RMSE) of the models.
  • randomForest to do a Random Forest model.
  • rpart to do a decision tree model.
  • rpart.plot to visualize a decision tree model.

3 EDA

Exploratory Data Analysis is a process where we calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data. The goal of EDA is to learn what our data can tell us.

3.1 Correlation

# Correlation diagram
trees_cor <- cor(trees[,c("diam.bas","altura","area.copa","anillos","pesos")])
corrplot(trees_cor, method = "number", type = "upper", tl.col = "black", tl.srt = 0)

The diagram indicates that there is a high correlation between diam.bas, altura and area.copa with the explanatory variable pesos (i.e. AGB). And an almost null relationship between anillos (i.e. age of the trees) and pesos. By looking at this values, it is recommended to use the formula pesos ~ diam.bas + altura + area.copa.

3.2 Variables relationships and distributions

3.2.1 Diam.Bas

eda.diam1 <- ggplot(data = trees, aes(x = diam.bas)) +
  geom_histogram(bins = 10, fill = 'white', colour = 'black') +
  labs(x = "Diam.Bas (cm)", y = "Count") +
  theme

eda.diam2 <- ggplot(data = trees, aes(x = diam.bas)) +
  geom_point(aes(y = pesos), size = 3) +
  labs(x = "Diam.Bas (cm)", y = "AGB (kg)") +
  theme

plot_grid(eda.diam1,eda.diam2)

There is a higher amount of trees with basal diameter lower than 40, which have a AGB lower than 1000kg. The scatter plot indicates that there is an exponential relationship between these two variables.

3.2.2 Altura

eda.alt1 <- ggplot(data = trees, aes(x = altura)) +
  geom_histogram(bins = 10, fill = 'white', colour = 'black') +
  labs(x = "Height (m)", y = "Count") +
  theme

eda.alt2 <- ggplot(data = trees, aes(x = diam.bas)) +
  geom_point(aes(y = pesos), size = 3) +
  labs(x = "Height (m)", y = "AGB (kg)") +
  theme

plot_grid(eda.alt1,eda.alt2)

3.2.3 Area.Copa

eda.area1 <- ggplot(data = trees, aes(x = area.copa)) +
  geom_histogram(bins = 10, fill = 'white', colour = 'black') +
  labs(x = expression(paste("Cup area (", m^{2}, ")")), y = "Count") +
  theme

eda.area2 <- ggplot(data = trees, aes(x = diam.bas)) +
  geom_point(aes(y = pesos), size = 3) +
  labs(x = expression(paste("Cup area (", m^{2}, ")")), y = "AGB (kg)") +
  theme

plot_grid(eda.area1,eda.area2)

3.2.4 Pesos (AGB)

ggplot(data = trees, aes(x = pesos)) +
  geom_histogram(bins = 10, fill = 'white', colour = 'black') +
  labs(x = "AGB (kg)", y = "Count") +
  theme

3.3 Normality test

3.3.1 Diam.Bas

shapiro.test(trees$diam.bas)
## 
##  Shapiro-Wilk normality test
## 
## data:  trees$diam.bas
## W = 0.87051, p-value = 0.00172

3.3.2 Altura

shapiro.test(trees$altura)
## 
##  Shapiro-Wilk normality test
## 
## data:  trees$altura
## W = 0.92998, p-value = 0.04905

3.3.3 Area.Copa

shapiro.test(trees$area.copa)
## 
##  Shapiro-Wilk normality test
## 
## data:  trees$area.copa
## W = 0.91382, p-value = 0.01859

3.3.4 Pesos (AGB)

shapiro.test(trees$pesos)
## 
##  Shapiro-Wilk normality test
## 
## data:  trees$pesos
## W = 0.59699, p-value = 6.707e-08

None of the variables have a normal distribution, so for the models which require normal distributed variables we are going do a transformation to normalize them.

3.4 Logaritmic transformation

# Log transformation
trees_norm <- log(trees[,c("diam.bas","altura","area.copa","pesos")])
trees_norm
  • trees_norm is the new normalized dataframe.

4 Modeling

The goal of this study is to find the best model to predict the AGB of any tree (i.e. Prosopis Sp.) with high accuracy (i.e. lower RMSE).

4.1 Linear regression

As seen in 3.2 Variables relationships and distributions, it exists an exponential relationship between the variables and AGB. This relation can be explained by \(y=b_0x^{b_1}\), and by doing the logaritmic transfomation, we get the matrix linear form to model as follows: \[log(y)=log(b_0)+b_1log(x)\]

In which R will estimate the model: \[y=b_0+b_1x\]

And the multivariate form: \[y=b_0+b_1x_1+b_2x_2+...+b_kx_k\]

  • Being k the number of estimators.

Moreover, the \(y\) predictions are actually \(log(y)\), so by taking the exponent of the predictions (exp(pred)) we get the actual values in the right scale, and not the logaritmic transformed ones.

4.1.1 Univ. Reg.

4.1.1.1 Basal Diameter

Exponential form: \[AGB=b_0(diam.bas)^{b_1}\]

Linear form: \[log(AGB)=log(b_0)+b_1log(diam.bas)\]

# Model
model1_ols <- lm(pesos ~ diam.bas, data = trees_norm)
summary(model1_ols)
## 
## Call:
## lm(formula = pesos ~ diam.bas, data = trees_norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9408 -0.1697  0.0435  0.2297  1.2847 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.5354     0.5776  -4.389 0.000147 ***
## diam.bas      2.4979     0.1759  14.199 2.55e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4278 on 28 degrees of freedom
## Multiple R-squared:  0.8781, Adjusted R-squared:  0.8737 
## F-statistic: 201.6 on 1 and 28 DF,  p-value: 2.554e-14

The coefficients estimates of the output (Intercept = -2.5354 and diam.bas = 2.4979) are actually \(log(b_0)\) and \(b_1\) respectively. So a transformation for the first estimate will be needed to get the coefficients for the exponential form equation.

data.frame(coef = c(exp(coef(model1_ols))[1],
                    coef(model1_ols)[2]),
           row.names = c("b0","b1"))

The coefficients obtained give the equation for the exponential form: \[AGB=0.079(diam.bas)^{2.498}\]

# predictions
pred_model1 <- exp(predict(model1_ols))

# RMSE
rmse_model1 <- rmse(trees$pesos, pred_model1)
data.frame(RMSE = rmse_model1, row.names = "AGB ~ diam.bas")

The RMSE of the first model is 330.076.

# df
df_mod1 <- data.frame(diam.bas = trees$diam.bas,
                      AGB = trees$pesos,
                      pred_model1)
       
# plot              
ggplot(data = df_mod1, aes(x = diam.bas)) +
  geom_point(aes(y = AGB), size = 3) +
  geom_line(aes(y = pred_model1), color = "red1", size = 1.5, alpha = .75) +
  geom_text(label = "RMSE = 330.0760", x = 25, y = 3600) +
  labs(title = "AGB ~ diam.bas", x = "Diam.Bas") +
  theme

4.1.1.2 Height

Exponential form: \[AGB=b_0(height)^{b_1}\]

Linear form: \[log(AGB)=log(b_0)+b_1log(height)\]

# Model
model2_ols <- lm(pesos ~ altura, data = trees_norm)
summary(model2_ols)
## 
## Call:
## lm(formula = pesos ~ altura, data = trees_norm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.09020 -0.34016  0.04133  0.41138  1.07915 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.3934     0.8744  -3.881 0.000578 ***
## altura        4.0380     0.3903  10.345 4.53e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5579 on 28 degrees of freedom
## Multiple R-squared:  0.7926, Adjusted R-squared:  0.7852 
## F-statistic:   107 on 1 and 28 DF,  p-value: 4.526e-11

The coefficients estimates of the output (Intercept = -3.3934 and altura = 4.0380) are actually \(log(b_0)\) and \(b_1\) respectively. So a transformation for the first estimate will be needed to get the coefficients for the exponential form equation.

data.frame(coef = c(exp(coef(model2_ols))[1],
                    coef(model2_ols)[2]),
           row.names = c("b0","b1"))

The coefficients obtained give the equation for the exponential form: \[AGB=0.034(height)^{4.038}\]

# predictions
pred_model2 <- exp(predict(model2_ols))

# RMSE
rmse_model2 <- rmse(trees$pesos, pred_model2)
data.frame(RMSE = rmse_model2, row.names = "AGB ~ height")

The RMSE of the second model is 584.9777 (higher than the first one).

# df
df_mod2 <- data.frame(height = trees$altura,
                      AGB = trees$pesos,
                      pred_model2)
       
# plot              
ggplot(df_mod2, aes(x = height)) +
  geom_point(aes(y = AGB), size = 3) +
  geom_line(aes(y = pred_model2), color = "blue1", size = 1.5, alpha = .75) +
  geom_text(label = "RMSE = 584.9777", x = 8.5, y = 3600) +
  labs(title = "AGB ~ height", x = "Height") +
  theme

4.1.1.3 Cup Area

Exponential form: \[AGB=b_0(cup.area)^{b_1}\]

Linear form: \[log(AGB)=log(b_0)+b_1log(cup.area)\]

# Model
model3_ols <- lm(pesos ~ area.copa, data = trees_norm)
summary(model3_ols)
## 
## Call:
## lm(formula = pesos ~ area.copa, data = trees_norm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.32990 -0.49456  0.02218  0.56175  1.21648 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.2252     0.7045   0.320    0.752    
## area.copa     1.4388     0.1859   7.742 1.96e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6913 on 28 degrees of freedom
## Multiple R-squared:  0.6816, Adjusted R-squared:  0.6702 
## F-statistic: 59.94 on 1 and 28 DF,  p-value: 1.963e-08

The coefficients estimates of the output (Intercept = 0.2252 and diam.bas = 1.4388) are actually \(log(b_0)\) and \(b_1\) respectively. So a transformation for the first estimate will be needed to get the coefficients for the exponential form equation.

data.frame(coef = c(exp(coef(model3_ols))[1],
                    coef(model3_ols)[2]),
           row.names = c("b0","b1"))

The coefficients obtained give the equation for the exponential form: \[AGB=1.253(cup.area)^{1.439}\]

# predictions
pred_model3 <- exp(predict(model3_ols))

# RMSE
rmse_model3 <- rmse(trees$pesos, pred_model3)
data.frame(RMSE = rmse_model3, row.names = "AGB ~ diam.bas")

The RMSE of the third model is 646.0653 (worse than the previous ones).

# df
df_mod3 <- data.frame(cup.area = trees$area.copa,
                      AGB = trees$pesos,
                      pred_model3)
       
# plot              
ggplot(data = df_mod3, aes(x = cup.area)) +
  geom_point(aes(y = AGB), size = 3) +
  geom_line(aes(y = pred_model3), color = "orange1", size = 1.5, alpha = .75) +
  geom_text(label = "RMSE = 646.0653", x = 40, y = 3600) +
  labs(title = "AGB ~ area.copa", x = "Área de copa") +
  theme

4.1.2 Multiv. Reg.

4.1.2.1 Model 4

4.1.2.2 Model 5

4.1.2.3 Model 6

4.1.2.4 Model 7

# OLS
model4_ols <- lm(pesos ~ diam.bas + altura, data = trees_norm)
model5_ols <- lm(pesos ~ diam.bas + area.copa, data = trees_norm)
model6_ols <- lm(pesos ~ diam.bas + altura + area.copa, data = trees_norm)

# Predictions
pred_model4 <- exp(predict(model4_ols))
pred_model5 <- exp(predict(model5_ols))
pred_model6 <- exp(predict(model6_ols))

RMSE

# RMSE

rmse_model2 <- rmse(trees$pesos, pred_model2)
rmse_model3 <- rmse(trees$pesos, pred_model3)
rmse_model4 <- rmse(trees$pesos, pred_model4)
rmse_model5 <- rmse(trees$pesos, pred_model5)
rmse_model6 <- rmse(trees$pesos, pred_model6)

# rmse data frame
rmse_ols <- data.frame(RMSE = c(rmse_model1,rmse_model2,rmse_model3,
                                rmse_model4,rmse_model5,rmse_model6),
                       row.names = c("pesos ~ diam. bas",
                                     "pesos ~ altura",
                                     "pesos ~ area.copa",
                                     "pesos ~ diam.bas + altura",
                                     "pesos ~ diam.bas + area.copa",
                                     "pesos ~ diam.bas + altura + area.copa"))
rmse_ols

4.2 General linear models

4.3 General additive models

4.4 Bayesian regression

4.5 Recursive partitions

4.5.1 Decision trees

# regression tree
model7_rpart <- rpart(pesos ~ diam.bas + altura + area.copa, data = trees)

# predict function
pred_model7 <- predict(model7_rpart)

# RMSE
rmse_model7 <- rmse(trees$pesos, pred_model7)
rmse_model7
## [1] 585.8788
# data frame
df_rpart <- data.frame(diam.bas = trees$diam.bas,
                     AGB = trees$pesos,
                     pred_model7)
  
# plot
rpart.plot(model7_rpart)

ggplot(data = df_rpart, aes(x = diam.bas)) +
  geom_point(aes(y = AGB), size = 3) +
  geom_line(aes(y = pred_model7, color = "red4"), size = 1.5, alpha = .75) +
  labs(x = "Diam.Bas") +
  theme +
  theme(legend.position = "none")

4.5.2 Bagged trees

# Model
model8_bag <- bagging(pesos ~ diam.bas + altura + area.copa,
                      data = trees)
# predict function
pred_model8 <- predict(model8_bag)

# RMSE
rmse_model8 <- rmse(trees$pesos, pred_model8)
rmse_model8
## [1] 658.0772
# Dataframe
df_bag <- data.frame(diam.bas = trees$diam.bas,
                     AGB = trees$pesos,
                     pred_model8)

ggplot(data = df_bag, aes(x = diam.bas)) +
  geom_point(aes(y = AGB), size = 3) +
  geom_line(aes(y = pred_model8, color = "green"), size = 1.5, alpha = .75) +
  labs(x = "Diam.Bas") +
  theme +
  theme(legend.position = "none")

4.5.3 Random forest

# model
model9_randfor <- randomForest(pesos ~ diam.bas + altura + area.copa,
                               data = trees)
# predict function
pred_model9 <- predict(model9_randfor)

# RMSE
rmse_model9 <- rmse(trees$pesos, pred_model9)
rmse_model9
## [1] 561.8813
# plot
df_rafo <- data.frame(diam.bas = trees$diam.bas,
                      AGB = trees$pesos,
                      pred_model9)

ggplot(data = df_rafo, aes(x = diam.bas)) +
  geom_point(aes(y = AGB), size = 3) +
  geom_line(aes(y = pred_model9, color = "blue"), size = 1.5, alpha = .75) +
  labs(x = "Diam.Bas") +
  theme +
  theme(legend.position = "none")