Biomass estimation models can inflcuence local, regional and global estimates. Current standard operating procedures for carbon measurement recommend the use of existing (previously published) models. However, it is difficult to know how accurate and aplicable a model can be for certain forests and species due to the wide variety of climates and microclimates.
In this study, different models will be tested, such as:
And using the Root Mean Square Error (RMSE) metric, we can compare the error between the actual values and the estimations of the models.
# Data
trees <- read.csv("/Users/RStudio/Documents/Dropbox/Semilla/contrato/Personal/Walter Chanava/Tesis/Datos/trees.csv")
trees
arbol index of the tree.diam.bas basal diameter of the tree in \(cm\) measured at \(30cm\) above ground.altura total height of the tree, measured in \(m\).area.copa cup area of the tree considering it a oval, measured in \(m^2\).anillos age of the tree, measured by counting the number of rings a basal disk of the tree has.pesos Above Ground Biomass (AGB), measured in \(kg\).# Libraries
library(corrplot)
library(cowplot)
library(gbm)
library(ggplot2)
library(ipred)
library(Metrics)
library(randomForest)
library(rpart)
library(rpart.plot)
corrplot to create a correlation diagram.cowplot to visualize multiple plots in one figure.gbm to do a Gradient Boosting Machine model.ggplot to visualize the data.ipred to do a bagged trees model.Metrics to calculate the Root Mean Square Error (RMSE) of the models.randomForest to do a Random Forest model.rpart to do a decision tree model.rpart.plot to visualize a decision tree model.Exploratory Data Analysis is a process where we calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data. The goal of EDA is to learn what our data can tell us.
# Correlation diagram
trees_cor <- cor(trees[,c("diam.bas","altura","area.copa","anillos","pesos")])
corrplot(trees_cor, method = "number", type = "upper", tl.col = "black", tl.srt = 0)
The diagram indicates that there is a high correlation between diam.bas, altura and area.copa with the explanatory variable pesos (i.e. AGB). And an almost null relationship between anillos (i.e. age of the trees) and pesos. By looking at this values, it is recommended to use the formula pesos ~ diam.bas + altura + area.copa.
eda.diam1 <- ggplot(data = trees, aes(x = diam.bas)) +
geom_histogram(bins = 10, fill = 'white', colour = 'black') +
labs(x = "Diam.Bas (cm)", y = "Count") +
theme
eda.diam2 <- ggplot(data = trees, aes(x = diam.bas)) +
geom_point(aes(y = pesos), size = 3) +
labs(x = "Diam.Bas (cm)", y = "AGB (kg)") +
theme
plot_grid(eda.diam1,eda.diam2)
There is a higher amount of trees with basal diameter lower than 40, which have a AGB lower than 1000kg. The scatter plot indicates that there is an exponential relationship between these two variables.
eda.alt1 <- ggplot(data = trees, aes(x = altura)) +
geom_histogram(bins = 10, fill = 'white', colour = 'black') +
labs(x = "Height (m)", y = "Count") +
theme
eda.alt2 <- ggplot(data = trees, aes(x = diam.bas)) +
geom_point(aes(y = pesos), size = 3) +
labs(x = "Height (m)", y = "AGB (kg)") +
theme
plot_grid(eda.alt1,eda.alt2)
eda.area1 <- ggplot(data = trees, aes(x = area.copa)) +
geom_histogram(bins = 10, fill = 'white', colour = 'black') +
labs(x = expression(paste("Cup area (", m^{2}, ")")), y = "Count") +
theme
eda.area2 <- ggplot(data = trees, aes(x = diam.bas)) +
geom_point(aes(y = pesos), size = 3) +
labs(x = expression(paste("Cup area (", m^{2}, ")")), y = "AGB (kg)") +
theme
plot_grid(eda.area1,eda.area2)
ggplot(data = trees, aes(x = pesos)) +
geom_histogram(bins = 10, fill = 'white', colour = 'black') +
labs(x = "AGB (kg)", y = "Count") +
theme
shapiro.test(trees$diam.bas)
##
## Shapiro-Wilk normality test
##
## data: trees$diam.bas
## W = 0.87051, p-value = 0.00172
shapiro.test(trees$altura)
##
## Shapiro-Wilk normality test
##
## data: trees$altura
## W = 0.92998, p-value = 0.04905
shapiro.test(trees$area.copa)
##
## Shapiro-Wilk normality test
##
## data: trees$area.copa
## W = 0.91382, p-value = 0.01859
shapiro.test(trees$pesos)
##
## Shapiro-Wilk normality test
##
## data: trees$pesos
## W = 0.59699, p-value = 6.707e-08
None of the variables have a normal distribution, so for the models which require normal distributed variables we are going do a transformation to normalize them.
# Log transformation
trees_norm <- log(trees[,c("diam.bas","altura","area.copa","pesos")])
trees_norm
trees_norm is the new normalized dataframe.The goal of this study is to find the best model to predict the AGB of any tree (i.e. Prosopis Sp.) with high accuracy (i.e. lower RMSE).
As seen in 3.2 Variables relationships and distributions, it exists an exponential relationship between the variables and AGB. This relation can be explained by \(y=b_0x^{b_1}\), and by doing the logaritmic transfomation, we get the matrix linear form to model as follows: \[log(y)=log(b_0)+b_1log(x)\]
In which R will estimate the model: \[y=b_0+b_1x\]
And the multivariate form: \[y=b_0+b_1x_1+b_2x_2+...+b_kx_k\]
Moreover, the \(y\) predictions are actually \(log(y)\), so by taking the exponent of the predictions (exp(pred)) we get the actual values in the right scale, and not the logaritmic transformed ones.
Exponential form: \[AGB=b_0(diam.bas)^{b_1}\]
Linear form: \[log(AGB)=log(b_0)+b_1log(diam.bas)\]
# Model
model1_ols <- lm(pesos ~ diam.bas, data = trees_norm)
summary(model1_ols)
##
## Call:
## lm(formula = pesos ~ diam.bas, data = trees_norm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9408 -0.1697 0.0435 0.2297 1.2847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.5354 0.5776 -4.389 0.000147 ***
## diam.bas 2.4979 0.1759 14.199 2.55e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4278 on 28 degrees of freedom
## Multiple R-squared: 0.8781, Adjusted R-squared: 0.8737
## F-statistic: 201.6 on 1 and 28 DF, p-value: 2.554e-14
The coefficients estimates of the output (Intercept = -2.5354 and diam.bas = 2.4979) are actually \(log(b_0)\) and \(b_1\) respectively. So a transformation for the first estimate will be needed to get the coefficients for the exponential form equation.
data.frame(coef = c(exp(coef(model1_ols))[1],
coef(model1_ols)[2]),
row.names = c("b0","b1"))
The coefficients obtained give the equation for the exponential form: \[AGB=0.079(diam.bas)^{2.498}\]
# predictions
pred_model1 <- exp(predict(model1_ols))
# RMSE
rmse_model1 <- rmse(trees$pesos, pred_model1)
data.frame(RMSE = rmse_model1, row.names = "AGB ~ diam.bas")
The RMSE of the first model is 330.076.
# df
df_mod1 <- data.frame(diam.bas = trees$diam.bas,
AGB = trees$pesos,
pred_model1)
# plot
ggplot(data = df_mod1, aes(x = diam.bas)) +
geom_point(aes(y = AGB), size = 3) +
geom_line(aes(y = pred_model1), color = "red1", size = 1.5, alpha = .75) +
geom_text(label = "RMSE = 330.0760", x = 25, y = 3600) +
labs(title = "AGB ~ diam.bas", x = "Diam.Bas") +
theme
Exponential form: \[AGB=b_0(height)^{b_1}\]
Linear form: \[log(AGB)=log(b_0)+b_1log(height)\]
# Model
model2_ols <- lm(pesos ~ altura, data = trees_norm)
summary(model2_ols)
##
## Call:
## lm(formula = pesos ~ altura, data = trees_norm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.09020 -0.34016 0.04133 0.41138 1.07915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.3934 0.8744 -3.881 0.000578 ***
## altura 4.0380 0.3903 10.345 4.53e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5579 on 28 degrees of freedom
## Multiple R-squared: 0.7926, Adjusted R-squared: 0.7852
## F-statistic: 107 on 1 and 28 DF, p-value: 4.526e-11
The coefficients estimates of the output (Intercept = -3.3934 and altura = 4.0380) are actually \(log(b_0)\) and \(b_1\) respectively. So a transformation for the first estimate will be needed to get the coefficients for the exponential form equation.
data.frame(coef = c(exp(coef(model2_ols))[1],
coef(model2_ols)[2]),
row.names = c("b0","b1"))
The coefficients obtained give the equation for the exponential form: \[AGB=0.034(height)^{4.038}\]
# predictions
pred_model2 <- exp(predict(model2_ols))
# RMSE
rmse_model2 <- rmse(trees$pesos, pred_model2)
data.frame(RMSE = rmse_model2, row.names = "AGB ~ height")
The RMSE of the second model is 584.9777 (higher than the first one).
# df
df_mod2 <- data.frame(height = trees$altura,
AGB = trees$pesos,
pred_model2)
# plot
ggplot(df_mod2, aes(x = height)) +
geom_point(aes(y = AGB), size = 3) +
geom_line(aes(y = pred_model2), color = "blue1", size = 1.5, alpha = .75) +
geom_text(label = "RMSE = 584.9777", x = 8.5, y = 3600) +
labs(title = "AGB ~ height", x = "Height") +
theme
Exponential form: \[AGB=b_0(cup.area)^{b_1}\]
Linear form: \[log(AGB)=log(b_0)+b_1log(cup.area)\]
# Model
model3_ols <- lm(pesos ~ area.copa, data = trees_norm)
summary(model3_ols)
##
## Call:
## lm(formula = pesos ~ area.copa, data = trees_norm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.32990 -0.49456 0.02218 0.56175 1.21648
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2252 0.7045 0.320 0.752
## area.copa 1.4388 0.1859 7.742 1.96e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6913 on 28 degrees of freedom
## Multiple R-squared: 0.6816, Adjusted R-squared: 0.6702
## F-statistic: 59.94 on 1 and 28 DF, p-value: 1.963e-08
The coefficients estimates of the output (Intercept = 0.2252 and diam.bas = 1.4388) are actually \(log(b_0)\) and \(b_1\) respectively. So a transformation for the first estimate will be needed to get the coefficients for the exponential form equation.
data.frame(coef = c(exp(coef(model3_ols))[1],
coef(model3_ols)[2]),
row.names = c("b0","b1"))
The coefficients obtained give the equation for the exponential form: \[AGB=1.253(cup.area)^{1.439}\]
# predictions
pred_model3 <- exp(predict(model3_ols))
# RMSE
rmse_model3 <- rmse(trees$pesos, pred_model3)
data.frame(RMSE = rmse_model3, row.names = "AGB ~ diam.bas")
The RMSE of the third model is 646.0653 (worse than the previous ones).
# df
df_mod3 <- data.frame(cup.area = trees$area.copa,
AGB = trees$pesos,
pred_model3)
# plot
ggplot(data = df_mod3, aes(x = cup.area)) +
geom_point(aes(y = AGB), size = 3) +
geom_line(aes(y = pred_model3), color = "orange1", size = 1.5, alpha = .75) +
geom_text(label = "RMSE = 646.0653", x = 40, y = 3600) +
labs(title = "AGB ~ area.copa", x = "Área de copa") +
theme
# OLS
model4_ols <- lm(pesos ~ diam.bas + altura, data = trees_norm)
model5_ols <- lm(pesos ~ diam.bas + area.copa, data = trees_norm)
model6_ols <- lm(pesos ~ diam.bas + altura + area.copa, data = trees_norm)
# Predictions
pred_model4 <- exp(predict(model4_ols))
pred_model5 <- exp(predict(model5_ols))
pred_model6 <- exp(predict(model6_ols))
# RMSE
rmse_model2 <- rmse(trees$pesos, pred_model2)
rmse_model3 <- rmse(trees$pesos, pred_model3)
rmse_model4 <- rmse(trees$pesos, pred_model4)
rmse_model5 <- rmse(trees$pesos, pred_model5)
rmse_model6 <- rmse(trees$pesos, pred_model6)
# rmse data frame
rmse_ols <- data.frame(RMSE = c(rmse_model1,rmse_model2,rmse_model3,
rmse_model4,rmse_model5,rmse_model6),
row.names = c("pesos ~ diam. bas",
"pesos ~ altura",
"pesos ~ area.copa",
"pesos ~ diam.bas + altura",
"pesos ~ diam.bas + area.copa",
"pesos ~ diam.bas + altura + area.copa"))
rmse_ols
# regression tree
model7_rpart <- rpart(pesos ~ diam.bas + altura + area.copa, data = trees)
# predict function
pred_model7 <- predict(model7_rpart)
# RMSE
rmse_model7 <- rmse(trees$pesos, pred_model7)
rmse_model7
## [1] 585.8788
# data frame
df_rpart <- data.frame(diam.bas = trees$diam.bas,
AGB = trees$pesos,
pred_model7)
# plot
rpart.plot(model7_rpart)
ggplot(data = df_rpart, aes(x = diam.bas)) +
geom_point(aes(y = AGB), size = 3) +
geom_line(aes(y = pred_model7, color = "red4"), size = 1.5, alpha = .75) +
labs(x = "Diam.Bas") +
theme +
theme(legend.position = "none")
# Model
model8_bag <- bagging(pesos ~ diam.bas + altura + area.copa,
data = trees)
# predict function
pred_model8 <- predict(model8_bag)
# RMSE
rmse_model8 <- rmse(trees$pesos, pred_model8)
rmse_model8
## [1] 658.0772
# Dataframe
df_bag <- data.frame(diam.bas = trees$diam.bas,
AGB = trees$pesos,
pred_model8)
ggplot(data = df_bag, aes(x = diam.bas)) +
geom_point(aes(y = AGB), size = 3) +
geom_line(aes(y = pred_model8, color = "green"), size = 1.5, alpha = .75) +
labs(x = "Diam.Bas") +
theme +
theme(legend.position = "none")
# model
model9_randfor <- randomForest(pesos ~ diam.bas + altura + area.copa,
data = trees)
# predict function
pred_model9 <- predict(model9_randfor)
# RMSE
rmse_model9 <- rmse(trees$pesos, pred_model9)
rmse_model9
## [1] 561.8813
# plot
df_rafo <- data.frame(diam.bas = trees$diam.bas,
AGB = trees$pesos,
pred_model9)
ggplot(data = df_rafo, aes(x = diam.bas)) +
geom_point(aes(y = AGB), size = 3) +
geom_line(aes(y = pred_model9, color = "blue"), size = 1.5, alpha = .75) +
labs(x = "Diam.Bas") +
theme +
theme(legend.position = "none")