Aleena Mariya Alex (s4158585)
Last updated: 19 October, 2025
Research Question:
Is there a statistically significant association between tree height and
above-ground biomass in the Catuaba forest dataset?
Hypotheses:
Statistical Approach:
The analysis applies correlation analysis and Chi-square testing to
evaluate the strength and significance of the relationship.
Tools Used:
Analysis conducted in R using readr, ggplot2, dplyr, knitr, and base stats functions.
Purpose:
To determine whether tree height can serve as a meaningful indicator of
above-ground biomass and forest carbon storage potential.
Main research question:
Can tree height be used to accurately predict above-ground biomass in the Catuaba forest dataset?
Statistical approach:
-Use descriptive statistics to summarise tree height and biomass.
-Apply correlation analysis to assess the strength of the relationship between height and biomass.
-Conduct hypothesis testing to test whether tree height significantly predicts biomass.
If a strong, significant relationship exists, tree height can be used as a practical, non-destructive predictor of biomass in forest assessments.
Data Source:
-The dataset used is from the LBA-ECO LC-02: Biophysical Measurements of Forests, Acre, Brazil (1999–2002), obtained from NASA Open Data (https://data.nasa.gov/dataset/lba-eco-lc-02-biophysical-measurements-of-forests-acre-brazil-1999-2002-90ecf).
-Measurements include tree diameter at breast height (DBH) and tree height for each tree within the study plots.
Sampling Method:
-The sampling design followed a plot-based systematic sampling method, where all trees above a minimum DBH threshold (e.g., 10 cm) were measured.
Important Variables:

| Variable | Description | Type |
|---|---|---|
| Tree_ID | Unique identifier for each tree measured | Factor |
| Species | Species name or code | Factor |
| Plot_ID | Identifier for sample plot | Factor |
| DBH_cm | Diameter at Breast Height (1.3 m) | Numeric |
| Height_m | Tree height | Numeric |
| Biomass_kg | Estimated above-ground biomass | Numeric |
| Site | Sampling site name (e.g., Catuaba) | Factor |
| Status | Tree condition (Alive / Dead) | Factor |
Explanation of Factors
# Remove rows with missing height or biomass values
# (catuba is the raw data frame read in beforehand; requires dplyr and ggplot2)
Catuba_clean <- catuba %>%
  filter(!is.na(Height), !is.na(Biomass))

# Histogram of the distribution of tree height
ggplot(Catuba_clean, aes(x = Height)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Tree Height", x = "Tree Height (m)", y = "Count") +
  theme_minimal()

# Histogram of the distribution of above-ground biomass
ggplot(Catuba_clean, aes(x = Biomass)) +
  geom_histogram(bins = 20, fill = "forestgreen", color = "white") +
  labs(title = "Distribution of Above-ground Biomass",
       x = "Biomass (Mg)", y = "Count") +
  theme_minimal()

Purpose: This code calculates key summary statistics for the two main variables in the dataset, Tree Height (m) and Above-ground Biomass (Mg), after cleaning the data.
Function used: The summarise() function from the dplyr package is used to compute numeric summaries for each variable.
Statistics included:
-Minimum (Min): The smallest observed value.
-First Quartile (Q1): The 25th percentile, showing the lower spread of the data.
-Median: The midpoint value, splitting the data into two equal halves.
-Third Quartile (Q3): The 75th percentile, indicating the upper spread.
-Maximum (Max): The largest observed value.
-Mean: The average value of the variable.
-Standard Deviation (SD): Measures how much the data vary around the mean.
-n: The number of valid observations used in the calculation.
Output format: The knitr::kable() function formats the results into a clean HTML table titled “Summary Statistics for Height and Biomass”.
# Summary statistics for Height and Biomass
summary_table <- Catuba_clean %>%
  summarise(
    Min_Height     = min(Height),
    Q1_Height      = quantile(Height, 0.25),
    Median_Height  = median(Height),
    Q3_Height      = quantile(Height, 0.75),
    Max_Height     = max(Height),
    Mean_Height    = mean(Height),
    SD_Height      = sd(Height),
    Min_Biomass    = min(Biomass),
    Q1_Biomass     = quantile(Biomass, 0.25),
    Median_Biomass = median(Biomass),
    Q3_Biomass     = quantile(Biomass, 0.75),
    Max_Biomass    = max(Biomass),
    Mean_Biomass   = mean(Biomass),
    SD_Biomass     = sd(Biomass),
    n              = n()
  )

# Render the one-row summary as a captioned table
kable(summary_table, caption = "Summary Statistics for Height and Biomass")

| Min_Height | Q1_Height | Median_Height | Q3_Height | Max_Height | Mean_Height | SD_Height | Min_Biomass | Q1_Biomass | Median_Biomass | Q3_Biomass | Max_Biomass | Mean_Biomass | SD_Biomass | n |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.8 | 15 | 22 | 27 | 223 | 22.03279 | 11.37982 | 47.18 | 146.48 | 1369.56 | 2607.02 | 12587.52 | 1837.21 | 2167.214 | 918 |
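The maximum height in the table (223 m) lies far above the third quartile (27 m), which points to a possible data-entry error worth checking before the results are interpreted. A minimal sketch of Tukey's 1.5 × IQR screen, using only the quartiles reported in the table; the multiplier of 1.5 is a conventional choice and is not part of the original analysis:

```r
# Quartiles and maximum of Height as reported in the summary table
q1_height  <- 15
q3_height  <- 27
max_height <- 223

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as potential outliers
iqr_height  <- q3_height - q1_height
upper_fence <- q3_height + 1.5 * iqr_height

upper_fence               # 45
max_height > upper_fence  # TRUE
```

On the full data frame, the same fence could be used as `filter(Catuba_clean, Height > upper_fence)` to list the flagged trees for inspection rather than deleting them automatically.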
-Purpose: To test whether there is a significant association between Tree Height Category and Above-ground Biomass Category.
Hypotheses:
-Null Hypothesis (H₀): There is no association between tree height and biomass categories; they are independent.
-Alternative Hypothesis (H₁): There is an association between tree height and biomass categories.
Method: Continuous variables (Height and Biomass) were each divided into three groups (“Low”, “Medium”, “High”) at their 33rd and 66th percentiles, giving roughly equal-sized categories.
-A Chi-square test of independence was conducted on the contingency table of these categories.
Assumptions checked:
-Data are from independent observations.
-Expected frequencies in each cell are sufficiently large (generally ≥ 5).
Interpretation:
-If the p-value < 0.05, reject H₀ and conclude that there is a significant association between tree height and biomass categories, meaning taller trees are more likely to have higher biomass.
-If the p-value ≥ 0.05, fail to reject H₀, indicating no evidence of a relationship between the categorical groupings.
# Create categorical variables for Height and Biomass
Catuba_clean <- Catuba_clean %>%
  mutate(
    Height_Group = cut(Height,
                       breaks = quantile(Height, probs = c(0, 0.33, 0.66, 1)),
                       labels = c("Low", "Medium", "High"),
                       include.lowest = TRUE),
    Biomass_Group = cut(Biomass,
                        breaks = quantile(Biomass, probs = c(0, 0.33, 0.66, 1)),
                        labels = c("Low", "Medium", "High"),
                        include.lowest = TRUE)
  )

# Create a contingency table (rows: height group, columns: biomass group)
contingency_table <- table(Catuba_clean$Height_Group, Catuba_clean$Biomass_Group)
contingency_table
##
##          Low Medium High
##   Low    272     51    6
##   Medium  31    181   92
##   High     1     70  214
##
##  Pearson's Chi-squared test
##
## data:  contingency_table
## X-squared = 729.53, df = 4, p-value < 2.2e-16
##
##                Low    Medium      High
##   Low    108.94989 108.23312 111.81699
##   Medium 100.67102 100.00871 103.32026
##   High    94.37908  93.75817  96.86275
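The output above includes a Chi-squared test and a table of expected counts, but the calls that produced them are not shown. A minimal sketch that reproduces both from the printed contingency table, assuming they came from base R's `chisq.test()`; the object names `observed` and `chisq_result` are illustrative and do not appear in the original code:

```r
# Observed counts copied from the contingency table printed above
# (rows: height group, columns: biomass group)
observed <- matrix(
  c(272,  51,   6,
     31, 181,  92,
      1,  70, 214),
  nrow = 3, byrow = TRUE,
  dimnames = list(Height  = c("Low", "Medium", "High"),
                  Biomass = c("Low", "Medium", "High"))
)

# Pearson's Chi-squared test of independence
chisq_result <- chisq.test(observed)
chisq_result

# Expected counts under independence, and the check that every cell is >= 5
chisq_result$expected
all(chisq_result$expected >= 5)
```

With the data frame in memory, `chisq.test(contingency_table)` would be the equivalent call.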
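For the Chi-square test of independence between tree height category and biomass category:
\[H_0: P(\text{Height Category},\ \text{Biomass Category}) = P(\text{Height Category}) \times P(\text{Biomass Category})\]
\[H_1: P(\text{Height Category},\ \text{Biomass Category}) \ne P(\text{Height Category}) \times P(\text{Biomass Category})\]
The test statistic is
\[\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]
where \(O_{ij}\) and \(E_{ij}\) are the observed and expected counts in row \(i\) and column \(j\) of the \(r \times c\) contingency table.
Correlation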
-Purpose: To measure the strength and direction of the linear relationship between Tree Height (m) and Above-ground Biomass (Mg).
-Method: The Pearson correlation coefficient (r) is used since both variables are continuous and are treated as approximately normally distributed.
The value of r ranges from -1 to +1:
-r ≈ +1: Strong positive linear relationship.
-r ≈ 0: No linear relationship.
-r ≈ -1: Strong negative linear relationship.
The test assumes that both variables are numeric, that observations are independent of one another, and that the relationship is linear.
Interpretation:
-If r is close to +1, taller trees tend to have higher biomass.
-If r is close to 0, there is no clear linear relationship between height and biomass.
-If r is close to -1, taller trees tend to have lower biomass (unlikely in this biological context).
# Correlation analysis between tree height and biomass
# Calculate the Pearson correlation coefficient
correlation_result <- cor(Catuba_clean$Height, Catuba_clean$Biomass, method = "pearson")
# Display the result
correlation_result
## [1] 0.598601
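`cor()` returns only the coefficient, so on its own it does not address statistical significance. A hedged sketch of the usual t-test for a Pearson correlation, computed from the reported r and the n = 918 complete cases in the summary table; the variable names are illustrative, and the confidence interval uses the standard Fisher z approximation:

```r
# Values reported in this document
r <- 0.598601   # Pearson correlation between Height and Biomass
n <- 918        # complete cases after removing missing values

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# Approximate 95% confidence interval via the Fisher z-transformation
z    <- atanh(r)
se_z <- 1 / sqrt(n - 3)
ci   <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

t_stat   # about 22.6
p_value
ci
r^2      # about 0.36: share of variance in biomass explained by a linear fit on height
```

With the data frame in memory, `cor.test(Catuba_clean$Height, Catuba_clean$Biomass)` should report the same t statistic and a comparable interval directly.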
Key Findings:
-The Pearson correlation coefficient (r = 0.598601) indicates a moderate to strong positive relationship between tree height and above-ground biomass. This means taller trees tend to have higher biomass.
-The Chi-square test of independence produced a highly significant result (χ² = 729.53, df = 4, p < 2.2e−16), confirming that there is a statistically significant association between height and biomass categories.
-Together, these results support the research hypothesis that tree height is a meaningful predictor of above-ground biomass in the Catuaba forest dataset.
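With n = 918, a very small p-value shows that an association exists but not how strong it is. One common effect-size measure for a contingency table is Cramér's V; a minimal sketch from the reported test statistic, where `k` (the smaller of the number of rows and columns) and the other names are illustrative:

```r
# Values reported in this document
chi_sq <- 729.53   # Pearson Chi-squared statistic
n      <- 918      # total number of trees in the contingency table
k      <- 3        # min(number of rows, number of columns)

# Cramér's V = sqrt(chi^2 / (n * (k - 1))), ranging from 0 to 1
cramers_v <- sqrt(chi_sq / (n * (k - 1)))
cramers_v   # about 0.63
```

A value of this size is usually read as a strong association between the height and biomass categories, though such thresholds are rules of thumb.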
Interpretation:
-The positive relationship between tree height and biomass aligns with ecological theory: as trees grow taller, they accumulate more woody material and carbon. The moderate correlation suggests that while height is a useful indicator, other factors (like tree diameter and species) also influence biomass.
Strengths of the Study:
-Uses real-world field data from NASA’s LBA project.
-Applies robust statistical methods (Chi-square, correlation).
-Provides reproducible code and transparent data cleaning.
Limitations:
-Only one predictor (tree height) was analyzed; including DBH or wood density could improve model accuracy.
-The analysis assumes linearity and normality, which may not hold perfectly for biomass data.
-The dataset represents one forest site, limiting generalization to other forest types.
Future Research:
-Incorporate multiple regression models using DBH and species type.
-Explore non-linear models (e.g., allometric equations).
-Conduct cross-site comparisons to validate findings across other Amazonian forests.
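As a rough illustration of the non-linear direction, the sketch below fits a power-law (allometric) form on the log-log scale and computes a rank-based correlation that does not rely on normality. The data are simulated purely for illustration and are not the Catuaba measurements; the data frame `sim` and the coefficients used to generate it are arbitrary assumptions:

```r
set.seed(42)

# Simulated data for illustration only; NOT the Catuaba measurements
height  <- runif(200, min = 5, max = 40)
biomass <- exp(1 + 2 * log(height) + rnorm(200, sd = 0.5))
sim     <- data.frame(Height = height, Biomass = biomass)

# Rank-based (Spearman) correlation: no normality or linearity assumption
spearman_rho <- cor(sim$Height, sim$Biomass, method = "spearman")
spearman_rho

# Power-law form Biomass = a * Height^b, fitted as a linear model in logs:
# log(Biomass) = log(a) + b * log(Height)
allometric_fit <- lm(log(Biomass) ~ log(Height), data = sim)
coef(allometric_fit)
```

On the real data, `sim` would be replaced by `Catuba_clean`, and further predictors such as DBH could be added to the model formula.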
Conclusion:
-Tree height shows a significant positive relationship with above-ground biomass. This indicates that height can serve as a reliable, non-destructive indicator of forest biomass and carbon storage in the Catuaba region.
-Peng, R. D. (2015). R programming for data science. Leanpub, Baltimore, MD.
-R Core Team. (2025). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Available at: https://www.R-project.org/ (Accessed: 15 October 2025).
-Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag, New York.
-NASA. (2002). LBA-ECO LC-02: Biophysical measurements of forests, Acre, Brazil (1999–2002). NASA Data Portal. Available at: https://data.nasa.gov/dataset/lba-eco-lc-02-biophysical-measurements-of-forests-acre-brazil-1999-2002-90ecf (Accessed: 15 October 2025).