1 Introduction

Biomass estimation models can inflcuence local, regional and global estimates. Current standard operating procedures for carbon measurement recommend the use of existing (previously published) models. However, it is difficult to know how accurate and aplicable a model can be for certain forests and species due to the wide variety of climates and microclimates.

In this study, different models will be tested, such as:

Linear regression models.
Generalized linear models.
Generalized additive models.
Recursive partitioning models.

And using the Root Mean Square Error (RMSE) metric, we can compare the error between the actual values and the estimations of the models.

2 Loading data and libraries

2.1 Data

# Data
trees         <- read.csv("/Users/RStudio/Documents/Dropbox/Semilla/contrato/Personal/Walter Chanava/Tesis/Datos/trees.csv")
names(trees)  <- c("tree","bas.diam","height","cup.area","tree.rings","AGB")
trees

tree index of the tree.
bas.diam basal diameter of the tree in \(cm\) measured at \(30cm\) above ground.
height total height of the tree, measured in \(m\).
cup.area cup area of the tree considering it a oval, measured in \(m^2\).
tree.rings age of the tree, measured by counting the number of rings a basal disk of the tree has.
AGB Above Ground Biomass (AGB), generally speaking, it includes the stem, stump, branches, bark, seeds and foliage. measured in \(kg\).

2.2 Libraries

# Libraries
library(corrplot)
library(cowplot)
library(gbm)
library(ggplot2)
library(ipred)
library(Metrics)
library(randomForest)
library(rpart)
library(rpart.plot)

corrplot to create a correlation diagram.
cowplot to visualize multiple plots in one figure.
gbm to do a Gradient Boosting Machine model.
ggplot to visualize the data.
ipred to do a bagged trees model.
Metrics to calculate the Root Mean Square Error (RMSE) of the models.
randomForest to do a Random Forest model.
rpart to do a decision tree model.
rpart.plot to visualize a decision tree model.

3 EDA

Exploratory Data Analysis is a process where we calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data. The goal of EDA is to learn what our data can tell us.

3.1 Correlation

# Correlation diagram
trees_cor <- cor(trees[,-1]) # not including the index
corrplot(trees_cor, method = "number", type = "upper", tl.col = "black", tl.srt = 0)

The diagram indicates that there is a high correlation between bas.diam, height and cup.area with the explanatory variable AGB. And an almost null relationship between tree.rings (i.e. age of the trees) and AGB.

# P-values of the correlations with AGB
cor_test1 <- cor.test(trees$bas.diam, trees$AGB)
cor_test2 <- cor.test(trees$height, trees$AGB)
cor_test3 <- cor.test(trees$cup.area, trees$AGB)
cor_test4 <- cor.test(trees$tree.rings, trees$AGB)

data.frame(Correlations = c("Basal diameter ~ AGB",
                            "Height ~ AGB",
                            "Cup area ~ AGB",
                            "Tree rings ~ AGB"),
           Pearson.cor = c(cor_test1$estimate,
                           cor_test2$estimate,
                           cor_test3$estimate,
                           cor_test4$estimate),
           p.values = c(cor_test1$p.value,
                       cor_test2$p.value,
                       cor_test3$p.value,
                       cor_test4$p.value))

The p-values for the correlations between bas.diam, height and cup.area with AGB which area less than the significance level \(\alpha=0.05\). We can conclude that the variables are significantly correlated with AGB. But for tree.rings, the p-value is higher than \(\alpha=0.05\), meaning that the variable is insignificantly correlated with AGB.

By looking at this values, it is recommended to use only the variables bas.diam, height and cup.area to explain AGB with the regressions.

3.2 Variables relationships and distributions

3.2.1 Basal Diameter

eda.diam1 <- ggplot(data = trees, aes(x = bas.diam)) +
  geom_histogram(bins = 10, fill = 'white', colour = 'black') +
  labs(x = "Diam.Bas (cm)", y = "Count") +
  theme

eda.diam2 <- ggplot(data = trees, aes(x = bas.diam)) +
  geom_point(aes(y = AGB), size = 3) +
  labs(x = "Diam.Bas (cm)", y = "AGB (kg)") +
  theme

plot_grid(eda.diam1,eda.diam2)

There is a higher amount of trees with basal diameter lower than 40, which have a AGB lower than 1000kg. The scatter plot indicates that there is an exponential relationship between these two variables.

3.2.2 Height

eda.heig1 <- ggplot(data = trees, aes(x = height)) +
  geom_histogram(bins = 10, fill = 'white', colour = 'black') +
  labs(x = "Height (m)", y = "Count") +
  theme

eda.heig2 <- ggplot(data = trees, aes(x = height)) +
  geom_point(aes(y = AGB), size = 3) +
  labs(x = "Height (m)", y = "AGB (kg)") +
  theme

plot_grid(eda.heig1,eda.heig2)

3.2.3 Cup area

eda.area1 <- ggplot(data = trees, aes(x = cup.area)) +
  geom_histogram(bins = 10, fill = 'white', colour = 'black') +
  labs(x = expression(paste("Cup area (", m^{2}, ")")), y = "Count") +
  theme

eda.area2 <- ggplot(data = trees, aes(x = cup.area)) +
  geom_point(aes(y = AGB), size = 3) +
  labs(x = expression(paste("Cup area (", m^{2}, ")")), y = "AGB (kg)") +
  theme

plot_grid(eda.area1,eda.area2)

3.2.4 AGB

ggplot(data = trees, aes(x = AGB)) +
  geom_histogram(bins = 10, fill = 'white', colour = 'black') +
  labs(x = "AGB (kg)", y = "Count") +
  theme

3.3 Normality test

3.3.1 Basal diameter

shapiro.test(trees$bas.diam)

## 
##  Shapiro-Wilk normality test
## 
## data:  trees$bas.diam
## W = 0.87051, p-value = 0.00172

3.3.2 Height

shapiro.test(trees$height)

## 
##  Shapiro-Wilk normality test
## 
## data:  trees$height
## W = 0.92998, p-value = 0.04905

3.3.3 Cup area

shapiro.test(trees$cup.area)

## 
##  Shapiro-Wilk normality test
## 
## data:  trees$cup.area
## W = 0.91382, p-value = 0.01859

3.3.4 AGB

shapiro.test(trees$AGB)

## 
##  Shapiro-Wilk normality test
## 
## data:  trees$AGB
## W = 0.59699, p-value = 6.707e-08

None of the variables have a normal distribution, so for the models which require normal distributed variables we are going do a transformation to normalize them.

3.4 Logaritmic transformation

# Log transformation
trees_norm <- log(trees[,c("bas.diam","height","cup.area","AGB")])
trees_norm

trees_norm is the new normalized dataframe.

4 Modeling

The goal of this study is to find the best model to predict the AGB of any tree (i.e. Prosopis Sp.) with high accuracy (i.e. lower RMSE).

4.1 Linear regression

As seen in 3.2 Variables relationships and distributions, it exists an exponential relationship between the variables and AGB. This relation can be explained by \(y=b_0x^{b_1}\), and by doing the logaritmic transfomation, we get the matrix linear form to model as follows: \[log(y)=log(b_0)+b_1log(x)\]

In which R will estimate the model: \[y=b_0+b_1x\]

And the multivariate form: \[y=b_0+b_1x_1+b_2x_2+...+b_kx_k\]

Being k the number of estimators.

Moreover, the \(y\) predictions are actually \(log(y)\), so by taking the exponent of the predictions (exp(pred)) we get the actual values in the right scale, and not the logaritmic transformed ones.

4.1.1 Univ. Reg.

4.1.1.1 Basal Diameter

Exponential form: \[AGB=b_0(bas.diam)^{b_1}\]

Linear form: \[log(AGB)=log(b_0)+b_1log(bas.diam)\]

# Model
model1_ols <- lm(AGB ~ bas.diam, data = trees_norm)
summary(model1_ols)

## 
## Call:
## lm(formula = AGB ~ bas.diam, data = trees_norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9408 -0.1697  0.0435  0.2297  1.2847 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.5354     0.5776  -4.389 0.000147 ***
## bas.diam      2.4979     0.1759  14.199 2.55e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4278 on 28 degrees of freedom
## Multiple R-squared:  0.8781, Adjusted R-squared:  0.8737 
## F-statistic: 201.6 on 1 and 28 DF,  p-value: 2.554e-14

The coefficients estimates of the output (Intercept = -2.5354 and bas.diam = 2.4979) are actually \(log(b_0)\) and \(b_1\) respectively. So a transformation for the first estimate will be needed to get the coefficients for the exponential form equation.

data.frame(coef = c(exp(coef(model1_ols))[1],
                    coef(model1_ols)[2]),
           row.names = c("b0","b1"))

The coefficients obtained give the equation for the exponential form: \[AGB=0.079(bas.diam)^{2.498}\]

# predictions
pred_model1 <- exp(predict(model1_ols))

# RMSE
rmse_model1 <- rmse(trees$AGB, pred_model1)
data.frame(RMSE = rmse_model1, row.names = "AGB ~ bas.diam")

The RMSE of the first model is 330.076.

# df
df_mod1 <- data.frame(bas.diam = trees$bas.diam,
                      AGB = trees$AGB,
                      pred_model1)
       
# plot              
ggplot(data = df_mod1, aes(x = bas.diam)) +
  geom_point(aes(y = AGB), size = 3) +
  geom_line(aes(y = pred_model1), color = "red1", size = 1.5, alpha = .75) +
  geom_text(label = "RMSE = 330.0760", x = 25, y = 3600) +
  labs(title = "AGB ~ bas.diam", x = "Basal diameter (cm)") +
  theme

4.1.1.2 Height

Exponential form: \[AGB=b_0(height)^{b_1}\]

Linear form: \[log(AGB)=log(b_0)+b_1log(height)\]

# Model
model2_ols <- lm(AGB ~ height, data = trees_norm)
summary(model2_ols)

## 
## Call:
## lm(formula = AGB ~ height, data = trees_norm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.09020 -0.34016  0.04133  0.41138  1.07915 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.3934     0.8744  -3.881 0.000578 ***
## height        4.0380     0.3903  10.345 4.53e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5579 on 28 degrees of freedom
## Multiple R-squared:  0.7926, Adjusted R-squared:  0.7852 
## F-statistic:   107 on 1 and 28 DF,  p-value: 4.526e-11

The coefficients estimates of the output (Intercept = -3.3934 and height = 4.0380) are actually \(log(b_0)\) and \(b_1\) respectively. So a transformation for the first estimate will be needed to get the coefficients for the exponential form equation.

data.frame(coef = c(exp(coef(model2_ols))[1],
                    coef(model2_ols)[2]),
           row.names = c("b0","b1"))

The coefficients obtained give the equation for the exponential form: \[AGB=0.034(height)^{4.038}\]

# predictions
pred_model2 <- exp(predict(model2_ols))

# RMSE
rmse_model2 <- rmse(trees$AGB, pred_model2)
data.frame(RMSE = rmse_model2, row.names = "AGB ~ height")

The RMSE of the second model is 584.9777 (higher than the first one).

# df
df_mod2 <- data.frame(height = trees$height,
                      AGB = trees$AGB,
                      pred_model2)
       
# plot              
ggplot(df_mod2, aes(x = height)) +
  geom_point(aes(y = AGB), size = 3) +
  geom_line(aes(y = pred_model2), color = "blue1", size = 1.5, alpha = .75) +
  geom_text(label = "RMSE = 584.9777", x = 8.5, y = 3600) +
  labs(title = "AGB ~ height", x = "Height (m)") +
  theme

4.1.1.3 Cup Area

Exponential form: \[AGB=b_0(cup.area)^{b_1}\]

Linear form: \[log(AGB)=log(b_0)+b_1log(cup.area)\]

# Model
model3_ols <- lm(AGB ~ cup.area, data = trees_norm)
summary(model3_ols)

## 
## Call:
## lm(formula = AGB ~ cup.area, data = trees_norm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.32990 -0.49456  0.02218  0.56175  1.21648 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.2252     0.7045   0.320    0.752    
## cup.area      1.4388     0.1859   7.742 1.96e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6913 on 28 degrees of freedom
## Multiple R-squared:  0.6816, Adjusted R-squared:  0.6702 
## F-statistic: 59.94 on 1 and 28 DF,  p-value: 1.963e-08

The coefficients estimates of the output (Intercept = 0.2252 and cup.area = 1.4388) are actually \(log(b_0)\) and \(b_1\) respectively. So a transformation for the first estimate will be needed to get the coefficients for the exponential form equation.

data.frame(coef = c(exp(coef(model3_ols))[1],
                    coef(model3_ols)[2]),
           row.names = c("b0","b1"))

The coefficients obtained give the equation for the exponential form: \[AGB=1.253(cup.area)^{1.439}\]

# predictions
pred_model3 <- exp(predict(model3_ols))

# RMSE
rmse_model3 <- rmse(trees$AGB, pred_model3)
data.frame(RMSE = rmse_model3, row.names = "AGB ~ cup.area")

The RMSE of the third model is 646.0653 (worse than the previous ones).

# df
df_mod3 <- data.frame(cup.area = trees$cup.area,
                      AGB = trees$AGB,
                      pred_model3)
       
# plot              
ggplot(data = df_mod3, aes(x = cup.area)) +
  geom_point(aes(y = AGB), size = 3) +
  geom_line(aes(y = pred_model3), color = "orange1", size = 1.5, alpha = .75) +
  geom_text(label = "RMSE = 646.0653", x = 40, y = 3600) +
  labs(title = "AGB ~ cup.area", x = expression(paste("Cup area (", m^{2}, ")"))) +
  theme

4.1.2 Multiv. Reg.

4.1.2.1 Model 4

Exponential form:

Linear form: \[log(AGB)=log(b_0)+b_1log(bas.diam)+b_2log(height)\]

model4_ols <- lm(AGB ~ bas.diam + height, data = trees_norm)
summary(model4_ols)

## 
## Call:
## lm(formula = AGB ~ bas.diam + height, data = trees_norm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78583 -0.12109  0.08681  0.19013  1.03328 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.1637     0.6425  -4.924 3.73e-05 ***
## bas.diam      1.8639     0.3718   5.013 2.94e-05 ***
## height        1.2094     0.6326   1.912   0.0666 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4089 on 27 degrees of freedom
## Multiple R-squared:  0.8926, Adjusted R-squared:  0.8846 
## F-statistic: 112.2 on 2 and 27 DF,  p-value: 8.298e-14

Estimated model:

\[log(AGB)=0.042+1.8639(bas.diam)+1.2094(height)\]

pred_model4 <- exp(predict(model4_ols))
rmse_model4 <- rmse(trees$AGB, pred_model4)

The RMSE of the model is 367.015.

4.1.2.2 Model 5

Exponential form:

Linear form: \[log(AGB)=log(b_0)+b_1log(bas.diam)+b_2log(cup.area)\]

model5_ols <- lm(AGB ~ bas.diam + cup.area, data = trees_norm)
summary(model5_ols)

## 
## Call:
## lm(formula = AGB ~ bas.diam + cup.area, data = trees_norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9794 -0.1683  0.0203  0.1938  1.2979 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.4448     0.5916  -4.133 0.000311 ***
## bas.diam      2.2638     0.3366   6.726  3.2e-07 ***
## cup.area      0.1799     0.2200   0.818 0.420744    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4304 on 27 degrees of freedom
## Multiple R-squared:  0.881,  Adjusted R-squared:  0.8722 
## F-statistic: 99.94 on 2 and 27 DF,  p-value: 3.311e-13

The estimated model is:

\[log(AGB)=0.087+2.2638(bas.diam)+0.1799(cup.area)\]

Even tho, the cup.area estimate it is not significant in the ANOVA table.

pred_model5 <- exp(predict(model5_ols))
rmse_model5 <- rmse(trees$AGB, pred_model5)

The RMSE of the model is 328.5755.

4.1.2.3 Model 6

Exponential form:

Linear form: \[log(AGB)=log(b_0)+b_1log(height)+b_2log(cup.area)\]

model6_ols <- lm(AGB ~ height + cup.area, data = trees_norm)
summary(model6_ols)

## 
## Call:
## lm(formula = AGB ~ height + cup.area, data = trees_norm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.80097 -0.24976  0.04792  0.24505  0.77925 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.1947     0.6893  -4.634 8.14e-05 ***
## height        2.7765     0.4259   6.519 5.46e-07 ***
## cup.area      0.6994     0.1637   4.273 0.000214 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4388 on 27 degrees of freedom
## Multiple R-squared:  0.8763, Adjusted R-squared:  0.8671 
## F-statistic: 95.63 on 2 and 27 DF,  p-value: 5.589e-13

\[log(AGB)=0.041+2.777(height)+0.699(cup.area)\]

pred_model6 <- exp(predict(model6_ols))
rmse_model6 <- rmse(trees$AGB, pred_model6)

The RMSE of the model is 474.5881.

4.1.2.4 Model 7

Exponential form:

Linear form: \[log(AGB)=log(b_0)+b_1log(bas.diam)+b_2log(height)+b_3log(cup.area)\]

model7_ols <- lm(AGB ~ bas.diam + height + cup.area, data = trees_norm)
summary(model7_ols)

## 
## Call:
## lm(formula = AGB ~ bas.diam + height + cup.area, data = trees_norm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.81334 -0.15459  0.06272  0.17729  1.00174 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.1418     0.6291  -4.994 3.42e-05 ***
## bas.diam      1.3182     0.5191   2.540   0.0174 *  
## height        1.4713     0.6443   2.284   0.0308 *  
## cup.area      0.3139     0.2129   1.474   0.1524    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4003 on 26 degrees of freedom
## Multiple R-squared:  0.9009, Adjusted R-squared:  0.8894 
## F-statistic: 78.77 on 3 and 26 DF,  p-value: 3.554e-13

\[log(AGB)=0.043+1.3182(bas.diam)+1.4713(height)+0.3139(cup.area)\]

pred_model7 <- exp(predict(model7_ols))
rmse_model7 <- rmse(trees$AGB, pred_model7)

The RMSE of the model is 369.6939.

RMSE

# rmse data frame
rmse_ols <- data.frame(RMSE = c(rmse_model1,rmse_model2,rmse_model3,
                                rmse_model4,rmse_model5,rmse_model6),
                       row.names = c("AGB ~ bas.diam",
                                     "AGB ~ height",
                                     "AGB ~ cup.area",
                                     "AGB ~ bas.diam + height",
                                     "AGB ~ bas.diam + cup.area",
                                     "AGB ~ bas.diam + height + cup.area"))
rmse_ols

4.2 Generalized linear models

In a generalized linear model, the mean is transformed, by the link function, instead of transforming the response itself. The two methods of transformation can lead to quite different results; for example, the mean of log-transformed responses is not the same as the logarithm of the mean response. Transforming the mean often allows the results to be more easily interpreted, especially in that mean parameters remain on the same scale as the measured responses.

# GLM model
model8_glm <- glm(AGB ~ bas.diam, data = trees, family = "gaussian"(link='log'))
summary(model8_glm)

## 
## Call:
## glm(formula = AGB ~ bas.diam, family = gaussian(link = "log"), 
##     data = trees)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -744.55   -76.76   -35.23    54.21   527.64  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.939889   0.259203   15.20 4.68e-15 ***
## bas.diam    0.064755   0.004213   15.37 3.55e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 61920.74)
## 
##     Null deviance: 22784422  on 29  degrees of freedom
## Residual deviance:  1733776  on 28  degrees of freedom
## AIC: 420.07
## 
## Number of Fisher Scoring iterations: 6

4.3 General additive models

4.4 Bayesian regression

4.5 Recursive partitions

4.5.1 Decision trees

# regression tree
model7_rpart <- rpart(AGB ~ bas.diam + height + cup.area, data = trees)

# predict function
pred_model7 <- predict(model7_rpart)

# RMSE
rmse_model7 <- rmse(trees$AGB, pred_model7)
rmse_model7

## [1] 585.8788

# data frame
df_rpart <- data.frame(bas.diam = trees$bas.diam,
                       AGB = trees$AGB,
                       pred_model7)
  
# plot
rpart.plot(model7_rpart)

ggplot(data = df_rpart, aes(x = bas.diam)) +
  geom_point(aes(y = AGB), size = 3) +
  geom_line(aes(y = pred_model7, color = "red4"), size = 1.5, alpha = .75) +
  labs(x = "Basal diameter (cm)") +
  theme +
  theme(legend.position = "none")

4.5.2 Bagged trees

# Model
model8_bag <- bagging(AGB ~ bas.diam + height + cup.area,
                      data = trees)

# predict function
pred_model8 <- predict(model8_bag)

# RMSE
rmse_model8 <- rmse(trees$AGB, pred_model8)
rmse_model8

## [1] 593.7536

# Dataframe
df_bag <- data.frame(bas.diam = trees$bas.diam,
                     AGB = trees$AGB,
                     pred_model8)

ggplot(data = df_bag, aes(x = bas.diam)) +
  geom_point(aes(y = AGB), size = 3) +
  geom_line(aes(y = pred_model8, color = "green"), size = 1.5, alpha = .75) +
  labs(x = "Basal diameter (cm)") +
  theme +
  theme(legend.position = "none")

4.5.3 Random forest

# model
model9_randfor <- randomForest(AGB ~ bas.diam + height + cup.area,
                               data = trees)

# predict function
pred_model9 <- predict(model9_randfor)

# RMSE
rmse_model9 <- rmse(trees$AGB, pred_model9)
rmse_model9

## [1] 564.1568

# plot
df_rafo <- data.frame(bas.diam = trees$bas.diam,
                      AGB = trees$AGB,
                      pred_model9)

ggplot(data = df_rafo, aes(x = bas.diam)) +
  geom_point(aes(y = AGB), size = 3) +
  geom_line(aes(y = pred_model9, color = "blue"), size = 1.5, alpha = .75) +
  labs(x = "Basal diameter (cm)") +
  theme +
  theme(legend.position = "none")

Prosopis Sp. Biomass Estimation

Walter Chanavá

September 2019

1 Introduction

2 Loading data and libraries

2.1 Data

2.2 Libraries

3 EDA

3.1 Correlation

3.2 Variables relationships and distributions

3.2.1 Basal Diameter

3.2.2 Height

3.2.3 Cup area

3.2.4 AGB

3.3 Normality test

3.3.1 Basal diameter

3.3.2 Height

3.3.3 Cup area

3.3.4 AGB

3.4 Logaritmic transformation

4 Modeling

4.1 Linear regression

4.1.1 Univ. Reg.

4.1.1.1 Basal Diameter

4.1.1.2 Height

4.1.1.3 Cup Area

4.1.2 Multiv. Reg.

4.1.2.1 Model 4

4.1.2.2 Model 5

4.1.2.3 Model 6

4.1.2.4 Model 7

RMSE

4.2 Generalized linear models

4.3 General additive models

4.4 Bayesian regression

4.5 Recursive partitions

4.5.1 Decision trees

4.5.2 Bagged trees

4.5.3 Random forest