Introduction (Cont.)

Research Question:
Is there a statistically significant association between tree height and above-ground biomass in the Catuaba forest dataset?
Hypotheses:
- Null (H₀): Tree height and biomass are independent; there is no significant relationship between them.
- Alternative (H₁): Tree height and biomass are associated; there is a significant relationship between them.
Statistical Approach:
The analysis applies correlation analysis and a Chi-square test of independence to evaluate the strength and significance of the relationship.
Tools Used:
Analysis conducted in R using readr, ggplot2, dplyr, knitr, and base stats functions.
Purpose:
To determine whether tree height can serve as a meaningful indicator of above-ground biomass and forest carbon storage potential.

Trees in Catuaba

Problem Statement

Estimating forest biomass is essential for understanding carbon storage, forest productivity, and climate change mitigation.
Measuring biomass directly is time-consuming and destructive, so indirect estimation methods are preferred.

Main research question:

Can tree height be used to accurately predict above-ground biomass in the Catuaba forest dataset?

Statistical approach:

-Use descriptive statistics to summarise tree height and biomass.
-Apply correlation analysis to assess the strength of the relationship between height and biomass.
-Conduct hypothesis testing to test whether tree height is significantly associated with biomass.

If a strong, significant relationship exists, tree height can be used as a practical, non-destructive predictor of biomass in forest assessments.

Data

Data Source: - The dataset used is from the LBA-ECO LC-02: Biophysical Measurements of Forests, Acre, Brazil (1999–2002), obtained from NASA Open Data (https://data.nasa.gov/dataset/lba-eco-lc-02-biophysical-measurements-of-forests-acre-brazil-1999-2002-90ecf).

Data were collected from permanent forest plots in the Catuaba Experimental Farm located in Acre, Brazil.

-Measurements include tree diameter at breast height (DBH) and tree height for each tree within the study plots.

Researchers measured each tree’s DBH at 1.3 meters above the ground and recorded height using standard forestry instruments (e.g., clinometer or laser hypsometer).

Sampling Method: -The sampling design followed a plot-based systematic sampling method, where all trees above a minimum DBH threshold (e.g., 10 cm) were measured.

Data Cont.

Important Variables
Variable	Description	Type
Tree_ID	Unique identifier for each tree measured	Factor
Species	Species name or code	Factor
Plot_ID	Identifier for sample plot	Factor
DBH_cm	Diameter at Breast Height (1.3 m)	Numeric
Height_m	Tree height	Numeric
Biomass_kg	Estimated above-ground biomass	Numeric
Site	Sampling site name (e.g., Catuaba)	Factor
Status	Tree condition (Alive / Dead)	Factor

Explanation of Factors

Species: Each level represents a unique tree species (e.g., Inga sp., Vismia sp.).
Plot_ID: Identifies each forest sampling plot within the site.
Status: Indicates tree condition (used to filter out dead or broken trees).
Site: In this analysis, the main focus is on the “Catuaba” site.
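
As a minimal sketch of how these factor variables might be used to select trees, the example below filters a small made-up data frame. The column names follow the variable list above, but the values are invented and the exact factor levels in the real dataset are an assumption.

```r
library(dplyr)

# Toy data with invented values; column names follow the variable list above
trees <- data.frame(
  Tree_ID  = factor(c("T1", "T2", "T3", "T4")),
  Site     = factor(c("Catuaba", "Catuaba", "Other", "Catuaba")),
  Status   = factor(c("Alive", "Dead", "Alive", "Alive")),
  Height_m = c(18.5, 12.0, 25.1, NA)
)

# Keep living trees at the Catuaba site that have a recorded height
trees_kept <- trees %>%
  filter(Site == "Catuaba", Status == "Alive", !is.na(Height_m))

trees_kept
```

Only tree T1 passes all three conditions in this toy example.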

Descriptive Statistics and Visualisation

The main variables of interest are Tree Height (m) and Above-ground Biomass (Mg).
Descriptive statistics help understand the central tendency and spread of these variables.
Visualization highlights the relationship between tree height and biomass and any potential outliers.

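# Packages used below; `catuba` is assumed to have been read in already
library(dplyr)
library(ggplot2)
# Remove rows with missing height or biomass
Catuba_clean <- catuba %>%
  filter(!is.na(Height), !is.na(Biomass))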

# Histogram of tree height
ggplot(Catuba_clean, aes(x = Height)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Tree Height", x = "Tree Height (m)", y = "Count") +
  theme_minimal()

# Histogram of above-ground biomass
ggplot(Catuba_clean, aes(x = Biomass)) +
  geom_histogram(bins = 20, fill = "forestgreen", color = "white") +
  labs(title = "Distribution of Above-ground Biomass",
       x = "Biomass (Mg)", y = "Count") +
  theme_minimal()

Descriptive Statistics Cont.

Purpose: This code calculates key summary statistics for the two main variables in the dataset, Tree Height (m) and Above-ground Biomass (Mg), after cleaning the data.
Function used: The summarise() function from the dplyr package is used to compute numeric summaries for each variable.

Statistics included:

-Minimum (Min): The smallest observed value.
-First Quartile (Q1): The 25th percentile, showing the lower spread of the data.
-Median: The midpoint value, splitting the data into two equal halves.
-Third Quartile (Q3): The 75th percentile, indicating the upper spread.
-Maximum (Max): The largest observed value.
-Mean: The average value of the variable.
-Standard Deviation (SD): Measures how much the data vary around the mean.
-n: The number of valid observations used in the calculation.

Output format: The knitr::kable() function formats the results into a clean HTML table titled “Summary Statistics for Height and Biomass”.

# Summary statistics for Height and Biomass
summary_table <- Catuba_clean %>%
  summarise(
    Min_Height = min(Height),
    Q1_Height = quantile(Height, 0.25),
    Median_Height = median(Height),
    Q3_Height = quantile(Height, 0.75),
    Max_Height = max(Height),
    Mean_Height = mean(Height),
    SD_Height = sd(Height),
    Min_Biomass = min(Biomass),
    Q1_Biomass = quantile(Biomass, 0.25),
    Median_Biomass = median(Biomass),
    Q3_Biomass = quantile(Biomass, 0.75),
    Max_Biomass = max(Biomass),
    Mean_Biomass = mean(Biomass),
    SD_Biomass = sd(Biomass),
    n = n()
  )

knitr::kable(summary_table, caption = "Summary Statistics for Height and Biomass")

Summary Statistics for Height and Biomass
Min_Height	Q1_Height	Median_Height	Q3_Height	Max_Height	Mean_Height	SD_Height	Min_Biomass	Q1_Biomass	Median_Biomass	Q3_Biomass	Max_Biomass	Mean_Biomass	SD_Biomass	n
1.8	15	22	27	223	22.03279	11.37982	47.18	146.48	1369.56	2607.02	12587.52	1837.21	2167.214	918
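
The reported maximum height (223 m) lies far above the third quartile (27 m), so it may be worth checking for a recording error; the source does not comment on this value. The sketch below applies Tukey's rule (Q3 + 1.5 × IQR) using the quartiles from the summary table. The `heights` vector is invented for illustration and is not the Catuaba data.

```r
# Quartiles of Height as reported in the summary table above
q1 <- 15
q3 <- 27

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as potential outliers
upper_fence <- q3 + 1.5 * (q3 - q1)

# Toy heights (invented), including a value equal to the reported maximum
heights <- c(1.8, 15, 22, 27, 31, 223)
flagged <- heights[heights > upper_fence]

upper_fence
flagged
```

On the real data, the same threshold could be applied with `filter(Height > upper_fence)` on `Catuba_clean` to inspect the flagged rows before deciding whether to keep them.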

Hypothesis Testing

-Purpose: To test whether there is a significant association between Tree Height Category and Above-ground Biomass Category.

Hypotheses:

-Null Hypothesis (H₀): There is no association between tree height and biomass categories; they are independent.
-Alternative Hypothesis (H₁): There is an association between tree height and biomass categories.

Method: Continuous variables (Height and Biomass) were divided into three groups (“Low”, “Medium”, “High”) using quantiles to create roughly equal-sized categories.

-A Chi-square test of independence was conducted on the contingency table of these categories.

Assumptions checked:

-Data are from independent observations.
-Expected frequencies in each cell are sufficiently large (generally ≥ 5).

Interpretation:

-If the p-value < 0.05, reject H₀ and conclude that there is a significant association between tree height and biomass categories, meaning taller trees are more likely to have higher biomass.
-If the p-value ≥ 0.05, fail to reject H₀, indicating no evidence of an association between the categorical groupings.

# Create categorical variables for Height and Biomass

Catuba_clean <- Catuba_clean %>%
  mutate(
    Height_Group = cut(Height,
                       breaks = quantile(Height, probs = c(0, 0.33, 0.66, 1)),
                       labels = c("Low", "Medium", "High"),
                       include.lowest = TRUE),
    Biomass_Group = cut(Biomass,
                        breaks = quantile(Biomass, probs = c(0, 0.33, 0.66, 1)),
                        labels = c("Low", "Medium", "High"),
                        include.lowest = TRUE)
  )

# Create a contingency table

contingency_table <- table(Catuba_clean$Height_Group, Catuba_clean$Biomass_Group)
contingency_table

##         
##          Low Medium High
##   Low    272     51    6
##   Medium  31    181   92
##   High     1     70  214

# Perform Chi-square test of independence

chi_result <- chisq.test(contingency_table)
chi_result

## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 729.53, df = 4, p-value < 2.2e-16

# Display expected counts and check assumptions

chi_result$expected

##         
##                Low    Medium      High
##   Low    108.94989 108.23312 111.81699
##   Medium 100.67102 100.00871 103.32026
##   High    94.37908  93.75817  96.86275
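
The expected counts above can be reproduced by hand from the observed table, which makes the Chi-square calculation easier to follow. The sketch below rebuilds the contingency table from the printed counts, recomputes the expected counts and the test statistic, and also computes Cramér's V. Cramér's V is an effect-size measure that the source does not report; it is added here only as an illustration.

```r
# Observed counts copied from the contingency table above
# (rows: height group, columns: biomass group)
observed <- matrix(
  c(272,  51,   6,
     31, 181,  92,
      1,  70, 214),
  nrow = 3, byrow = TRUE,
  dimnames = list(Height  = c("Low", "Medium", "High"),
                  Biomass = c("Low", "Medium", "High"))
)

n <- sum(observed)

# Expected count for each cell: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson Chi-square statistic: sum of (O - E)^2 / E over all cells
chi_sq <- sum((observed - expected)^2 / expected)

# Cramer's V: sqrt(chi^2 / (n * (min(rows, cols) - 1)))
cramers_v <- sqrt(chi_sq / (n * (min(dim(observed)) - 1)))

round(expected, 2)
chi_sq
cramers_v
```

Because the p-value is driven partly by the large sample (n = 918), an effect size such as Cramér's V gives a complementary view of how strong the association between the categories is.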

Hypothesis Testing Cont.

For the Chi-square test of independence between tree height category and biomass category:

Null Hypothesis (H₀): There is no association between tree height and biomass categories (they are independent).
Alternative Hypothesis (H₁): There is an association between tree height and biomass categories (they are not independent).

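\[H_0: P(Height\ Category,\ Biomass\ Category) = P(Height\ Category) \times P(Biomass\ Category)\]
\[H_1: P(Height\ Category,\ Biomass\ Category) \ne P(Height\ Category) \times P(Biomass\ Category)\]

\[\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]

Here \(O_{ij}\) and \(E_{ij}\) are the observed and expected counts in row \(i\) and column \(j\) of the contingency table, and \(r\) and \(c\) are the numbers of rows and columns.

Correlation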

-Purpose: To measure the strength and direction of the linear relationship between Tree Height (m) and Above-ground Biomass (Mg).

-Method: The Pearson correlation coefficient (r) is used since both variables are continuous; the analysis treats them as approximately normally distributed.

The value of r ranges from -1 to +1:
-r ≈ +1: Strong positive linear relationship.
-r ≈ 0: No linear relationship.
-r ≈ -1: Strong negative linear relationship.

The test assumes that both variables are numeric, that the observations (trees) are independent of one another, and that the relationship between the variables is linear.

Interpretation:

-If r is close to +1, taller trees tend to have higher biomass.
-If r is close to 0, there is no clear linear relationship between height and biomass.
-If r is close to -1, taller trees tend to have lower biomass (unlikely in this biological context).

#Correlation Analysis between Tree Height and Biomass
#Calculating Pearson correlation

correlation_result <- cor(Catuba_clean$Height, Catuba_clean$Biomass, method = "pearson")

#Displaying result

correlation_result

## [1] 0.598601
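
The code above reports r but not its significance. As a hedged sketch, the t statistic, p-value, and an approximate 95% confidence interval can be derived from the reported r and the sample size alone, assuming all n = 918 trees in the cleaned data were used in the correlation.

```r
r <- 0.598601   # Pearson correlation reported above
n <- 918        # number of trees in the cleaned data (from the summary table)

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# Approximate 95% confidence interval via Fisher's z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

t_stat
p_value
ci
```

On the actual data, `cor.test(Catuba_clean$Height, Catuba_clean$Biomass)` from base R's stats package would be the usual way to obtain these quantities in one call.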

Discussion

Key Findings:

-The Pearson correlation coefficient (r ≈ 0.60) indicates a moderate positive linear relationship between tree height and above-ground biomass. This means taller trees tend to have higher biomass, with height accounting for roughly 36% of the variance in biomass (r² ≈ 0.36).

-The Chi-square test of independence produced a highly significant result (χ² = 729.53, df = 4, p < 2.2e−16), confirming that there is a statistically significant association between height and biomass categories.

-Together, these results support the research hypothesis that tree height is associated with above-ground biomass and can be used as a useful predictor of it in the Catuaba forest dataset.

Interpretation:
-The positive relationship between tree height and biomass aligns with ecological theory: as trees grow taller, they accumulate more woody material and carbon.
-The moderate correlation suggests that while height is a useful indicator, other factors (like tree diameter and species) also influence biomass.

Strengths of the Study:
-Uses real-world field data from NASA’s LBA project.
-Applies standard statistical methods (Chi-square, correlation).
-Provides reproducible code and transparent data cleaning.

Limitations:
-Only one predictor (tree height) was analyzed; including DBH or wood density could improve model accuracy.
-The analysis assumes linearity and normality, which may not hold perfectly for biomass data.
-The dataset represents one forest site, limiting generalization to other forest types.

Future Research:
-Incorporate multiple regression models using DBH and species type.
-Explore non-linear models (e.g., allometric equations).
-Conduct cross-site comparisons to validate findings across other Amazonian forests.
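
As an illustration of the non-linear direction mentioned above, the sketch below fits a power-law (allometric) model of the assumed form biomass = a × height^b by linear regression on the log-log scale. The data are simulated, not the Catuaba measurements, and the parameter values (a = 0.5, b = 2.5) are arbitrary choices for the example.

```r
set.seed(1)

# Simulated data (NOT the Catuaba measurements): biomass = a * height^b with noise
n_sim   <- 200
height  <- runif(n_sim, min = 5, max = 40)
biomass <- exp(log(0.5) + 2.5 * log(height) + rnorm(n_sim, sd = 0.3))

# A power law becomes a straight line on the log-log scale
fit <- lm(log(biomass) ~ log(height))

coef(fit)                # intercept estimates log(a); slope estimates b
summary(fit)$r.squared   # share of variance in log(biomass) explained
```

The same call could be tried on the real data with `Height` and `Biomass` from `Catuba_clean`, and extended with a DBH term once that variable is included.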

Conclusion:
-Tree height shows a significant positive relationship with above-ground biomass. This indicates that height can serve as a useful, non-destructive indicator of forest biomass and carbon storage in the Catuaba region, ideally alongside other predictors such as DBH.

References

-Peng, R. D. (2015). R programming for data science. Leanpub, Baltimore, MD.

-R Core Team. (2025). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Available at: https://www.R-project.org/ (Accessed: 15 October 2025).

-Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag, New York.

-NASA. (2002). LBA-ECO LC-02: Biophysical measurements of forests, Acre, Brazil (1999–2002). NASA Data Portal. Available at: https://data.nasa.gov/dataset/lba-eco-lc-02-biophysical-measurements-of-forests-acre-brazil-1999-2002-90ecf (Accessed: 15 October 2025).

Exploring the Relationship Between Tree Height and Biomass in the Catuaba Region

Statistical data analysis project- MATH1324

RPubs link information

Introduction