Exploring the Relationship Between Tree Height and Biomass in the Catuaba Region

Statistical data analysis project- MATH1324

Aleena Mariya Alex (s4158585)

Last updated: 19 October, 2025

Introduction

Introduction (Cont.)

Trees in Catuaba

Problem Statement

Main research question:

Can tree height be used to accurately predict above-ground biomass in the Catuaba forest dataset?

Statistical approach:

- Use descriptive statistics to summarise tree height and biomass.
- Apply correlation analysis to assess the strength of the relationship between height and biomass.
- Conduct hypothesis testing to test whether tree height significantly predicts biomass.
- If a strong, significant relationship exists, tree height can be used as a practical, non-destructive predictor of biomass in forest assessments.

Data

Data Source:

- The dataset used is LBA-ECO LC-02: Biophysical Measurements of Forests, Acre, Brazil (1999–2002), obtained from NASA Open Data (https://data.nasa.gov/dataset/lba-eco-lc-02-biophysical-measurements-of-forests-acre-brazil-1999-2002-90ecf).
- Measurements include tree diameter at breast height (DBH) and tree height for each tree within the study plots.

Sampling Method:

- The sampling design followed a plot-based systematic sampling method, where all trees above a minimum DBH threshold (e.g., 10 cm) were measured.

Data Cont.

Important Variables:

- Tree_ID: unique identifier for each tree measured (factor variable)
- Species: species name or code (factor variable)
- Plot_ID: identifier for the sample plot (factor variable)
- DBH_cm: diameter at breast height, 1.3 m (numeric variable)
- Height_m: tree height (numeric variable)
- Biomass_kg: estimated above-ground biomass (numeric variable)
- Site: sampling site name, e.g., Catuaba (factor variable)
- Status: tree condition, Alive / Dead (factor variable)
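
One way to set these variable types in R is sketched below. The data frame `trees` and its three rows are invented purely for illustration; only the column names come from the variable list, and the real file may name or store them differently.

```r
# Hypothetical three-row extract; values are made up for illustration only
trees <- data.frame(
  Tree_ID    = c("T001", "T002", "T003"),
  Species    = c("sp_a", "sp_b", "sp_a"),
  Plot_ID    = c("P1", "P1", "P2"),
  DBH_cm     = c(12.4, 35.0, 18.9),
  Height_m   = c(9.5, 24.0, 14.2),
  Biomass_kg = c(60.1, 1250.7, 210.3),
  Site       = c("Catuaba", "Catuaba", "Catuaba"),
  Status     = c("Alive", "Alive", "Dead"),
  stringsAsFactors = FALSE
)

# Columns described as factor variables in the list above
factor_cols <- c("Tree_ID", "Species", "Plot_ID", "Site", "Status")

# Convert the identifier and category columns to factors;
# the measurement columns stay numeric
trees[factor_cols] <- lapply(trees[factor_cols], factor)

# Inspect the resulting column classes
sapply(trees, class)
```

Storing the grouping columns as factors lets functions such as `table()` and `summary()` treat them as categories rather than free text.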

Explanation of Factors

Descriptive Statistics and Visualisation

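# Load the packages used throughout the analysis
library(dplyr)
library(ggplot2)
library(knitr)

# Remove rows with missing height or biomass
# (assumes the raw data has already been read into `catuba`)
Catuba_clean <- catuba %>%
  filter(!is.na(Height), !is.na(Biomass))

# Histogram of tree height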
ggplot(Catuba_clean, aes(x = Height)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Tree Height", x = "Tree Height (m)", y = "Count") +
  theme_minimal()

# Histogram of above-ground biomass
ggplot(Catuba_clean, aes(x = Biomass)) +
  geom_histogram(bins = 20, fill = "forestgreen", color = "white") +
  labs(title = "Distribution of Above-ground Biomass",
       x = "Biomass (Mg)", y = "Count") +
  theme_minimal()

Descriptive Statistics Cont.

Output format: The knitr::kable() function formats the results into a clean HTML table titled “Summary Statistics for Height and Biomass”.

# Summary statistics for Height and Biomass
summary_table <- Catuba_clean %>%
  summarise(
    Min_Height = min(Height),
    Q1_Height = quantile(Height, 0.25),
    Median_Height = median(Height),
    Q3_Height = quantile(Height, 0.75),
    Max_Height = max(Height),
    Mean_Height = mean(Height),
    SD_Height = sd(Height),
    Min_Biomass = min(Biomass),
    Q1_Biomass = quantile(Biomass, 0.25),
    Median_Biomass = median(Biomass),
    Q3_Biomass = quantile(Biomass, 0.75),
    Max_Biomass = max(Biomass),
    Mean_Biomass = mean(Biomass),
    SD_Biomass = sd(Biomass),
    n = n()
  )

kable(summary_table, caption = "Summary Statistics for Height and Biomass")
Summary Statistics for Height and Biomass

| Min_Height | Q1_Height | Median_Height | Q3_Height | Max_Height | Mean_Height | SD_Height | Min_Biomass | Q1_Biomass | Median_Biomass | Q3_Biomass | Max_Biomass | Mean_Biomass | SD_Biomass | n |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.8 | 15 | 22 | 27 | 223 | 22.03279 | 11.37982 | 47.18 | 146.48 | 1369.56 | 2607.02 | 12587.52 | 1837.21 | 2167.214 | 918 |
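
The quartiles reported in the summary can also be used for a quick screen for unusual values. The sketch below applies the common 1.5 × IQR rule to those quartiles; `tukey_fences()` is a hypothetical helper written for this illustration, not a function from any package, and the rule is one convention among several rather than a step taken in this analysis.

```r
# Hypothetical helper: lower and upper 1.5 * IQR fences from two quartiles
tukey_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles and maxima copied from the summary statistics
height_fences  <- tukey_fences(q1 = 15, q3 = 27)
biomass_fences <- tukey_fences(q1 = 146.48, q3 = 2607.02)

height_fences
biomass_fences

# Compare the reported maxima with the upper fences
max_height_flagged  <- 223 > height_fences["upper"]
max_biomass_flagged <- 12587.52 > biomass_fences["upper"]
```

With these quartiles the upper fences are 45 for height and about 6297.83 for biomass, so the reported maxima (223 and 12587.52) both lie above them and may be worth checking against the source data.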

Hypothesis Testing

-Purpose: To test whether there is a significant association between Tree Height Category and Above-ground Biomass Category.

Hypotheses:

- Null Hypothesis (H₀): There is no association between tree height and biomass categories; they are independent.
- Alternative Hypothesis (H₁): There is an association between tree height and biomass categories.

Method: Continuous variables (Height and Biomass) were divided into three groups (“Low”, “Medium”, “High”) using quantiles to create roughly equal-sized categories.

-A Chi-square test of independence was conducted on the contingency table of these categories.

Assumptions checked:

- Data are from independent observations.
- Expected frequencies in each cell are sufficiently large (generally ≥ 5).

Interpretation:

- If the p-value < 0.05, reject H₀ and conclude that there is a significant association between tree height and biomass categories, meaning taller trees are more likely to fall in higher biomass categories.
- If the p-value ≥ 0.05, fail to reject H₀, indicating no evidence of an association between the categorical groupings.

# Create categorical variables for Height and Biomass
Catuba_clean <- Catuba_clean %>%
  mutate(
    Height_Group = cut(Height,
                       breaks = quantile(Height, probs = c(0, 0.33, 0.66, 1)),
                       labels = c("Low", "Medium", "High"),
                       include.lowest = TRUE),
    Biomass_Group = cut(Biomass,
                        breaks = quantile(Biomass, probs = c(0, 0.33, 0.66, 1)),
                        labels = c("Low", "Medium", "High"),
                        include.lowest = TRUE)
  )

# Create a contingency table (rows: height group, columns: biomass group)
contingency_table <- table(Catuba_clean$Height_Group, Catuba_clean$Biomass_Group)
contingency_table
##         
##          Low Medium High
##   Low    272     51    6
##   Medium  31    181   92
##   High     1     70  214
# Perform Chi-square test of independence

chi_result <- chisq.test(contingency_table)
chi_result
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 729.53, df = 4, p-value < 2.2e-16
# Display expected counts and check assumptions

chi_result$expected
##         
##                Low    Medium      High
##   Low    108.94989 108.23312 111.81699
##   Medium 100.67102 100.00871 103.32026
##   High    94.37908  93.75817  96.86275
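
The expected counts and the test statistic reported above can be reproduced by hand from the contingency table, which may make the calculation easier to follow. In this sketch the matrix `observed` is typed in from the printed table, and the names `observed`, `expected`, `chi_sq` and `p_value` are chosen for illustration.

```r
# Observed counts copied from the contingency table
# (rows: height group, columns: biomass group)
observed <- matrix(
  c(272,  51,   6,
     31, 181,  92,
      1,  70, 214),
  nrow = 3, byrow = TRUE,
  dimnames = list(Height  = c("Low", "Medium", "High"),
                  Biomass = c("Low", "Medium", "High"))
)

n <- sum(observed)  # total number of trees

# Expected count under independence: (row total * column total) / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-square statistic: sum over cells of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability from the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(expected, 2)
round(chi_sq, 2)
df
```

For example, the expected count for the Low/Low cell is 329 × 304 / 918 ≈ 108.95, far below the 272 trees observed there; the large gaps on the diagonal are what drive the statistic to 729.53.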

Hypothesis Testing Cont.

For the Chi-square test of independence between tree height category and biomass category:

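\[H_0: P(\text{Height Category},\ \text{Biomass Category}) = P(\text{Height Category}) \times P(\text{Biomass Category})\]

\[H_1: P(\text{Height Category},\ \text{Biomass Category}) \ne P(\text{Height Category}) \times P(\text{Biomass Category})\]

The test statistic is

\[\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]

where \(O_{ij}\) and \(E_{ij}\) are the observed and expected counts in row \(i\) and column \(j\) of the contingency table, which has \(r\) rows and \(c\) columns.

Correlation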

-Purpose: To measure the strength and direction of the linear relationship between Tree Height (m) and Above-ground Biomass (Mg).

-Method: The Pearson correlation coefficient (r) is used since both variables are continuous and approximately normally distributed.

The value of r ranges from -1 to +1:

- r ≈ +1: Strong positive linear relationship.
- r ≈ 0: No linear relationship.
- r ≈ -1: Strong negative linear relationship.

The analysis assumes that both variables are numeric, that the observations are independent of one another, and that the relationship between the variables is linear.

Interpretation:

- If r is close to +1, taller trees tend to have higher biomass.
- If r is close to 0, there’s no clear linear relationship between height and biomass.
- If r is close to -1, taller trees tend to have lower biomass (unlikely in this biological context).

# Correlation analysis between tree height and biomass
# Calculate the Pearson correlation
correlation_result <- cor(Catuba_clean$Height, Catuba_clean$Biomass, method = "pearson")

# Display the result
correlation_result
## [1] 0.598601
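
To show what `cor()` computes, the sketch below builds Pearson's r from its definition and then calls `cor.test()`, which adds a p-value and a confidence interval. The vectors `height` and `biomass` are simulated stand-ins, not the Catuaba measurements, so the numbers they produce say nothing about the real data; the seed and simulation parameters are arbitrary assumptions.

```r
set.seed(1324)  # arbitrary seed so the simulated values are repeatable

# Simulated stand-in data (NOT the Catuaba measurements):
# biomass grows with height on a log-log scale, plus random noise
height  <- runif(200, min = 5, max = 40)
biomass <- exp(2 + 1.5 * log(height) + rnorm(200, sd = 0.5))

# Pearson r from its definition:
# sum of cross-deviations divided by the root of the product of squared deviations
dx <- height - mean(height)
dy <- biomass - mean(biomass)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# The built-in function gives the same value
r_builtin <- cor(height, biomass, method = "pearson")
all.equal(r_manual, r_builtin)

# cor.test() adds a t-based p-value and a 95% confidence interval for r
pearson_test <- cor.test(height, biomass, method = "pearson")
pearson_test$p.value
pearson_test$conf.int

# Spearman's rank correlation relies on monotonicity rather than linearity
spearman_rho <- cor(height, biomass, method = "spearman")
spearman_rho
```

Applied to the project data, the same calls would take `Catuba_clean$Height` and `Catuba_clean$Biomass` in place of the simulated vectors.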

Discussion

Key Findings:

- The Pearson correlation coefficient (r ≈ 0.60) indicates a moderate positive linear relationship between tree height and above-ground biomass: taller trees tend to have higher biomass, and a straight-line fit on height would account for about 36% of the variance in biomass (r² ≈ 0.36).

- The Chi-square test of independence produced a highly significant result (χ² = 729.53, df = 4, p < 2.2e−16), confirming that there is a statistically significant association between height and biomass categories.

- Together, these results support the research hypothesis that tree height can be used as a predictor of above-ground biomass in the Catuaba forest dataset, although the moderate correlation shows that height alone does not determine biomass.

Interpretation:

- The positive relationship between tree height and biomass aligns with ecological theory: as trees grow taller, they accumulate more woody material and carbon. The moderate correlation suggests that while height is an informative indicator, other factors (such as tree diameter and species) also influence biomass.

Strengths of the Study:

- Uses real-world field data from NASA’s LBA project.
- Applies robust statistical methods (Chi-square, correlation).
- Provides reproducible code and transparent data cleaning.

Limitations:

- Only one predictor (tree height) was analyzed; including DBH or wood density could improve model accuracy.
- The analysis assumes linearity and normality, which may not hold perfectly for biomass data.
- The dataset represents one forest site, limiting generalization to other forest types.

Future Research:

- Incorporate multiple regression models using DBH and species type.
- Explore non-linear models (e.g., allometric equations).
- Conduct cross-site comparisons to validate findings across other Amazonian forests.

Conclusion: -Tree height shows a significant positive relationship with above-ground biomass. This indicates that height can serve as a reliable, non-destructive indicator of forest biomass and carbon storage in the Catuaba region.

References

-Peng, R. D. (2015). R programming for data science. Leanpub, Baltimore, MD.

-R Core Team. (2025). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Available at: https://www.R-project.org/ (Accessed: 15 October 2025).

-Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag, New York.

-NASA. (2002). LBA-ECO LC-02: Biophysical measurements of forests, Acre, Brazil (1999–2002). NASA Data Portal. Available at: https://data.nasa.gov/dataset/lba-eco-lc-02-biophysical-measurements-of-forests-acre-brazil-1999-2002-90ecf (Accessed: 15 October 2025).