Biomass estimation models can inflcuence local, regional and global estimates. Current standard operating procedures for carbon measurement recommend the use of existing (previously published) models. However, it is difficult to know how accurate and aplicable a model can be for certain forests and species due to the wide variety of climates and microclimates.
In this study, different models will be tested, such as:
And using the Root Mean Square Error (RMSE) metric, we can compare the error between the actual values and the estimations of the models.
# Data
trees <- read.csv("/Users/RStudio/Documents/Dropbox/Semilla/contrato/Personal/Walter Chanava/Tesis/Datos/trees.csv")
names(trees) <- c("tree","bas.diam","height","cup.area","tree.rings","AGB")
trees
tree index of the tree.bas.diam basal diameter of the tree in \(cm\) measured at \(30cm\) above ground.height total height of the tree, measured in \(m\).cup.area cup area of the tree considering it a oval, measured in \(m^2\).tree.rings age of the tree, measured by counting the number of rings a basal disk of the tree has.AGB Above Ground Biomass (AGB), generally speaking, it includes the stem, stump, branches, bark, seeds and foliage. measured in \(kg\).# Libraries
library(corrplot)
library(cowplot)
library(gbm)
library(ggplot2)
library(ipred)
library(Metrics)
library(randomForest)
library(rpart)
library(rpart.plot)
corrplot to create a correlation diagram.cowplot to visualize multiple plots in one figure.gbm to do a Gradient Boosting Machine model.ggplot to visualize the data.ipred to do a bagged trees model.Metrics to calculate the Root Mean Square Error (RMSE) of the models.randomForest to do a Random Forest model.rpart to do a decision tree model.rpart.plot to visualize a decision tree model.Exploratory Data Analysis is a process where we calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data. The goal of EDA is to learn what our data can tell us.
# Correlation diagram
trees_cor <- cor(trees[,-1]) # not including the index
corrplot(trees_cor, method = "number", type = "upper", tl.col = "black", tl.srt = 0)
The diagram indicates that there is a high correlation between bas.diam, height and cup.area with the explanatory variable AGB. And an almost null relationship between tree.rings (i.e. age of the trees) and AGB.
# P-values of the correlations with AGB
cor_test1 <- cor.test(trees$bas.diam, trees$AGB)
cor_test2 <- cor.test(trees$height, trees$AGB)
cor_test3 <- cor.test(trees$cup.area, trees$AGB)
cor_test4 <- cor.test(trees$tree.rings, trees$AGB)
data.frame(Correlations = c("Basal diameter ~ AGB",
"Height ~ AGB",
"Cup area ~ AGB",
"Tree rings ~ AGB"),
Pearson.cor = c(cor_test1$estimate,
cor_test2$estimate,
cor_test3$estimate,
cor_test4$estimate),
p.values = c(cor_test1$p.value,
cor_test2$p.value,
cor_test3$p.value,
cor_test4$p.value))
The p-values for the correlations between bas.diam, height and cup.area with AGB which area less than the significance level \(\alpha=0.05\). We can conclude that the variables are significantly correlated with AGB. But for tree.rings, the p-value is higher than \(\alpha=0.05\), meaning that the variable is insignificantly correlated with AGB.
By looking at this values, it is recommended to use only the variables bas.diam, height and cup.area to explain AGB with the regressions.
eda.diam1 <- ggplot(data = trees, aes(x = bas.diam)) +
geom_histogram(bins = 10, fill = 'white', colour = 'black') +
labs(x = "Diam.Bas (cm)", y = "Count") +
theme
eda.diam2 <- ggplot(data = trees, aes(x = bas.diam)) +
geom_point(aes(y = AGB), size = 3) +
labs(x = "Diam.Bas (cm)", y = "AGB (kg)") +
theme
plot_grid(eda.diam1,eda.diam2)
There is a higher amount of trees with basal diameter lower than 40, which have a AGB lower than 1000kg. The scatter plot indicates that there is an exponential relationship between these two variables.
eda.heig1 <- ggplot(data = trees, aes(x = height)) +
geom_histogram(bins = 10, fill = 'white', colour = 'black') +
labs(x = "Height (m)", y = "Count") +
theme
eda.heig2 <- ggplot(data = trees, aes(x = height)) +
geom_point(aes(y = AGB), size = 3) +
labs(x = "Height (m)", y = "AGB (kg)") +
theme
plot_grid(eda.heig1,eda.heig2)
eda.area1 <- ggplot(data = trees, aes(x = cup.area)) +
geom_histogram(bins = 10, fill = 'white', colour = 'black') +
labs(x = expression(paste("Cup area (", m^{2}, ")")), y = "Count") +
theme
eda.area2 <- ggplot(data = trees, aes(x = cup.area)) +
geom_point(aes(y = AGB), size = 3) +
labs(x = expression(paste("Cup area (", m^{2}, ")")), y = "AGB (kg)") +
theme
plot_grid(eda.area1,eda.area2)
ggplot(data = trees, aes(x = AGB)) +
geom_histogram(bins = 10, fill = 'white', colour = 'black') +
labs(x = "AGB (kg)", y = "Count") +
theme
shapiro.test(trees$bas.diam)
##
## Shapiro-Wilk normality test
##
## data: trees$bas.diam
## W = 0.87051, p-value = 0.00172
shapiro.test(trees$height)
##
## Shapiro-Wilk normality test
##
## data: trees$height
## W = 0.92998, p-value = 0.04905
shapiro.test(trees$cup.area)
##
## Shapiro-Wilk normality test
##
## data: trees$cup.area
## W = 0.91382, p-value = 0.01859
shapiro.test(trees$AGB)
##
## Shapiro-Wilk normality test
##
## data: trees$AGB
## W = 0.59699, p-value = 6.707e-08
None of the variables have a normal distribution, so for the models which require normal distributed variables we are going do a transformation to normalize them.
# Log transformation
trees_norm <- log(trees[,c("bas.diam","height","cup.area","AGB")])
trees_norm
trees_norm is the new normalized dataframe.The goal of this study is to find the best model to predict the AGB of any tree (i.e. Prosopis Sp.) with high accuracy (i.e. lower RMSE).
As seen in 3.2 Variables relationships and distributions, it exists an exponential relationship between the variables and AGB. This relation can be explained by \(y=b_0x^{b_1}\), and by doing the logaritmic transfomation, we get the matrix linear form to model as follows: \[log(y)=log(b_0)+b_1log(x)\]
In which R will estimate the model: \[y=b_0+b_1x\]
And the multivariate form: \[y=b_0+b_1x_1+b_2x_2+...+b_kx_k\]
Moreover, the \(y\) predictions are actually \(log(y)\), so by taking the exponent of the predictions (exp(pred)) we get the actual values in the right scale, and not the logaritmic transformed ones.
Exponential form: \[AGB=b_0(bas.diam)^{b_1}\]
Linear form: \[log(AGB)=log(b_0)+b_1log(bas.diam)\]
# Model
model1_ols <- lm(AGB ~ bas.diam, data = trees_norm)
summary(model1_ols)
##
## Call:
## lm(formula = AGB ~ bas.diam, data = trees_norm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9408 -0.1697 0.0435 0.2297 1.2847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.5354 0.5776 -4.389 0.000147 ***
## bas.diam 2.4979 0.1759 14.199 2.55e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4278 on 28 degrees of freedom
## Multiple R-squared: 0.8781, Adjusted R-squared: 0.8737
## F-statistic: 201.6 on 1 and 28 DF, p-value: 2.554e-14
The coefficients estimates of the output (Intercept = -2.5354 and bas.diam = 2.4979) are actually \(log(b_0)\) and \(b_1\) respectively. So a transformation for the first estimate will be needed to get the coefficients for the exponential form equation.
data.frame(coef = c(exp(coef(model1_ols))[1],
coef(model1_ols)[2]),
row.names = c("b0","b1"))
The coefficients obtained give the equation for the exponential form: \[AGB=0.079(bas.diam)^{2.498}\]
# predictions
pred_model1 <- exp(predict(model1_ols))
# RMSE
rmse_model1 <- rmse(trees$AGB, pred_model1)
data.frame(RMSE = rmse_model1, row.names = "AGB ~ bas.diam")
The RMSE of the first model is 330.076.
# df
df_mod1 <- data.frame(bas.diam = trees$bas.diam,
AGB = trees$AGB,
pred_model1)
# plot
ggplot(data = df_mod1, aes(x = bas.diam)) +
geom_point(aes(y = AGB), size = 3) +
geom_line(aes(y = pred_model1), color = "red1", size = 1.5, alpha = .75) +
geom_text(label = "RMSE = 330.0760", x = 25, y = 3600) +
labs(title = "AGB ~ bas.diam", x = "Basal diameter (cm)") +
theme
Exponential form: \[AGB=b_0(height)^{b_1}\]
Linear form: \[log(AGB)=log(b_0)+b_1log(height)\]
# Model
model2_ols <- lm(AGB ~ height, data = trees_norm)
summary(model2_ols)
##
## Call:
## lm(formula = AGB ~ height, data = trees_norm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.09020 -0.34016 0.04133 0.41138 1.07915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.3934 0.8744 -3.881 0.000578 ***
## height 4.0380 0.3903 10.345 4.53e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5579 on 28 degrees of freedom
## Multiple R-squared: 0.7926, Adjusted R-squared: 0.7852
## F-statistic: 107 on 1 and 28 DF, p-value: 4.526e-11
The coefficients estimates of the output (Intercept = -3.3934 and height = 4.0380) are actually \(log(b_0)\) and \(b_1\) respectively. So a transformation for the first estimate will be needed to get the coefficients for the exponential form equation.
data.frame(coef = c(exp(coef(model2_ols))[1],
coef(model2_ols)[2]),
row.names = c("b0","b1"))
The coefficients obtained give the equation for the exponential form: \[AGB=0.034(height)^{4.038}\]
# predictions
pred_model2 <- exp(predict(model2_ols))
# RMSE
rmse_model2 <- rmse(trees$AGB, pred_model2)
data.frame(RMSE = rmse_model2, row.names = "AGB ~ height")
The RMSE of the second model is 584.9777 (higher than the first one).
# df
df_mod2 <- data.frame(height = trees$height,
AGB = trees$AGB,
pred_model2)
# plot
ggplot(df_mod2, aes(x = height)) +
geom_point(aes(y = AGB), size = 3) +
geom_line(aes(y = pred_model2), color = "blue1", size = 1.5, alpha = .75) +
geom_text(label = "RMSE = 584.9777", x = 8.5, y = 3600) +
labs(title = "AGB ~ height", x = "Height (m)") +
theme
Exponential form: \[AGB=b_0(cup.area)^{b_1}\]
Linear form: \[log(AGB)=log(b_0)+b_1log(cup.area)\]
# Model
model3_ols <- lm(AGB ~ cup.area, data = trees_norm)
summary(model3_ols)
##
## Call:
## lm(formula = AGB ~ cup.area, data = trees_norm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.32990 -0.49456 0.02218 0.56175 1.21648
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2252 0.7045 0.320 0.752
## cup.area 1.4388 0.1859 7.742 1.96e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6913 on 28 degrees of freedom
## Multiple R-squared: 0.6816, Adjusted R-squared: 0.6702
## F-statistic: 59.94 on 1 and 28 DF, p-value: 1.963e-08
The coefficients estimates of the output (Intercept = 0.2252 and cup.area = 1.4388) are actually \(log(b_0)\) and \(b_1\) respectively. So a transformation for the first estimate will be needed to get the coefficients for the exponential form equation.
data.frame(coef = c(exp(coef(model3_ols))[1],
coef(model3_ols)[2]),
row.names = c("b0","b1"))
The coefficients obtained give the equation for the exponential form: \[AGB=1.253(cup.area)^{1.439}\]
# predictions
pred_model3 <- exp(predict(model3_ols))
# RMSE
rmse_model3 <- rmse(trees$AGB, pred_model3)
data.frame(RMSE = rmse_model3, row.names = "AGB ~ cup.area")
The RMSE of the third model is 646.0653 (worse than the previous ones).
# df
df_mod3 <- data.frame(cup.area = trees$cup.area,
AGB = trees$AGB,
pred_model3)
# plot
ggplot(data = df_mod3, aes(x = cup.area)) +
geom_point(aes(y = AGB), size = 3) +
geom_line(aes(y = pred_model3), color = "orange1", size = 1.5, alpha = .75) +
geom_text(label = "RMSE = 646.0653", x = 40, y = 3600) +
labs(title = "AGB ~ cup.area", x = expression(paste("Cup area (", m^{2}, ")"))) +
theme
Exponential form:
Linear form: \[log(AGB)=log(b_0)+b_1log(bas.diam)+b_2log(height)\]
model4_ols <- lm(AGB ~ bas.diam + height, data = trees_norm)
summary(model4_ols)
##
## Call:
## lm(formula = AGB ~ bas.diam + height, data = trees_norm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78583 -0.12109 0.08681 0.19013 1.03328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.1637 0.6425 -4.924 3.73e-05 ***
## bas.diam 1.8639 0.3718 5.013 2.94e-05 ***
## height 1.2094 0.6326 1.912 0.0666 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4089 on 27 degrees of freedom
## Multiple R-squared: 0.8926, Adjusted R-squared: 0.8846
## F-statistic: 112.2 on 2 and 27 DF, p-value: 8.298e-14
Estimated model:
\[log(AGB)=0.042+1.8639(bas.diam)+1.2094(height)\]
pred_model4 <- exp(predict(model4_ols))
rmse_model4 <- rmse(trees$AGB, pred_model4)
The RMSE of the model is 367.015.
Exponential form:
Linear form: \[log(AGB)=log(b_0)+b_1log(bas.diam)+b_2log(cup.area)\]
model5_ols <- lm(AGB ~ bas.diam + cup.area, data = trees_norm)
summary(model5_ols)
##
## Call:
## lm(formula = AGB ~ bas.diam + cup.area, data = trees_norm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9794 -0.1683 0.0203 0.1938 1.2979
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.4448 0.5916 -4.133 0.000311 ***
## bas.diam 2.2638 0.3366 6.726 3.2e-07 ***
## cup.area 0.1799 0.2200 0.818 0.420744
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4304 on 27 degrees of freedom
## Multiple R-squared: 0.881, Adjusted R-squared: 0.8722
## F-statistic: 99.94 on 2 and 27 DF, p-value: 3.311e-13
The estimated model is:
\[log(AGB)=0.087+2.2638(bas.diam)+0.1799(cup.area)\]
Even tho, the cup.area estimate it is not significant in the ANOVA table.
pred_model5 <- exp(predict(model5_ols))
rmse_model5 <- rmse(trees$AGB, pred_model5)
The RMSE of the model is 328.5755.
Exponential form:
Linear form: \[log(AGB)=log(b_0)+b_1log(height)+b_2log(cup.area)\]
model6_ols <- lm(AGB ~ height + cup.area, data = trees_norm)
summary(model6_ols)
##
## Call:
## lm(formula = AGB ~ height + cup.area, data = trees_norm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.80097 -0.24976 0.04792 0.24505 0.77925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.1947 0.6893 -4.634 8.14e-05 ***
## height 2.7765 0.4259 6.519 5.46e-07 ***
## cup.area 0.6994 0.1637 4.273 0.000214 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4388 on 27 degrees of freedom
## Multiple R-squared: 0.8763, Adjusted R-squared: 0.8671
## F-statistic: 95.63 on 2 and 27 DF, p-value: 5.589e-13
\[log(AGB)=0.041+2.777(height)+0.699(cup.area)\]
pred_model6 <- exp(predict(model6_ols))
rmse_model6 <- rmse(trees$AGB, pred_model6)
The RMSE of the model is 474.5881.
Exponential form:
Linear form: \[log(AGB)=log(b_0)+b_1log(bas.diam)+b_2log(height)+b_3log(cup.area)\]
model7_ols <- lm(AGB ~ bas.diam + height + cup.area, data = trees_norm)
summary(model7_ols)
##
## Call:
## lm(formula = AGB ~ bas.diam + height + cup.area, data = trees_norm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.81334 -0.15459 0.06272 0.17729 1.00174
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.1418 0.6291 -4.994 3.42e-05 ***
## bas.diam 1.3182 0.5191 2.540 0.0174 *
## height 1.4713 0.6443 2.284 0.0308 *
## cup.area 0.3139 0.2129 1.474 0.1524
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4003 on 26 degrees of freedom
## Multiple R-squared: 0.9009, Adjusted R-squared: 0.8894
## F-statistic: 78.77 on 3 and 26 DF, p-value: 3.554e-13
\[log(AGB)=0.043+1.3182(bas.diam)+1.4713(height)+0.3139(cup.area)\]
pred_model7 <- exp(predict(model7_ols))
rmse_model7 <- rmse(trees$AGB, pred_model7)
The RMSE of the model is 369.6939.
# rmse data frame
rmse_ols <- data.frame(RMSE = c(rmse_model1,rmse_model2,rmse_model3,
rmse_model4,rmse_model5,rmse_model6),
row.names = c("AGB ~ bas.diam",
"AGB ~ height",
"AGB ~ cup.area",
"AGB ~ bas.diam + height",
"AGB ~ bas.diam + cup.area",
"AGB ~ bas.diam + height + cup.area"))
rmse_ols
In a generalized linear model, the mean is transformed, by the link function, instead of transforming the response itself. The two methods of transformation can lead to quite different results; for example, the mean of log-transformed responses is not the same as the logarithm of the mean response. Transforming the mean often allows the results to be more easily interpreted, especially in that mean parameters remain on the same scale as the measured responses.
# GLM model
model8_glm <- glm(AGB ~ bas.diam, data = trees, family = "gaussian"(link='log'))
summary(model8_glm)
##
## Call:
## glm(formula = AGB ~ bas.diam, family = gaussian(link = "log"),
## data = trees)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -744.55 -76.76 -35.23 54.21 527.64
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.939889 0.259203 15.20 4.68e-15 ***
## bas.diam 0.064755 0.004213 15.37 3.55e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 61920.74)
##
## Null deviance: 22784422 on 29 degrees of freedom
## Residual deviance: 1733776 on 28 degrees of freedom
## AIC: 420.07
##
## Number of Fisher Scoring iterations: 6
# regression tree
model7_rpart <- rpart(AGB ~ bas.diam + height + cup.area, data = trees)
# predict function
pred_model7 <- predict(model7_rpart)
# RMSE
rmse_model7 <- rmse(trees$AGB, pred_model7)
rmse_model7
## [1] 585.8788
# data frame
df_rpart <- data.frame(bas.diam = trees$bas.diam,
AGB = trees$AGB,
pred_model7)
# plot
rpart.plot(model7_rpart)
ggplot(data = df_rpart, aes(x = bas.diam)) +
geom_point(aes(y = AGB), size = 3) +
geom_line(aes(y = pred_model7, color = "red4"), size = 1.5, alpha = .75) +
labs(x = "Basal diameter (cm)") +
theme +
theme(legend.position = "none")
# Model
model8_bag <- bagging(AGB ~ bas.diam + height + cup.area,
data = trees)
# predict function
pred_model8 <- predict(model8_bag)
# RMSE
rmse_model8 <- rmse(trees$AGB, pred_model8)
rmse_model8
## [1] 593.7536
# Dataframe
df_bag <- data.frame(bas.diam = trees$bas.diam,
AGB = trees$AGB,
pred_model8)
ggplot(data = df_bag, aes(x = bas.diam)) +
geom_point(aes(y = AGB), size = 3) +
geom_line(aes(y = pred_model8, color = "green"), size = 1.5, alpha = .75) +
labs(x = "Basal diameter (cm)") +
theme +
theme(legend.position = "none")
# model
model9_randfor <- randomForest(AGB ~ bas.diam + height + cup.area,
data = trees)
# predict function
pred_model9 <- predict(model9_randfor)
# RMSE
rmse_model9 <- rmse(trees$AGB, pred_model9)
rmse_model9
## [1] 564.1568
# plot
df_rafo <- data.frame(bas.diam = trees$bas.diam,
AGB = trees$AGB,
pred_model9)
ggplot(data = df_rafo, aes(x = bas.diam)) +
geom_point(aes(y = AGB), size = 3) +
geom_line(aes(y = pred_model9, color = "blue"), size = 1.5, alpha = .75) +
labs(x = "Basal diameter (cm)") +
theme +
theme(legend.position = "none")