Mangroves provide vital ecosystem services which support biodiversity, carbon storage, and coastal protection. However, mangrove forests currently face a rapid decline due to severe land-use change caused by anthropogenic activities, especially in areas like Johor, Malaysia, where agriculture and urban development persist. Despite their importance, mangrove reproductive biology remains understudied. Propagules, the viviparous seedlings unique to mangroves, represents both reproductive success and future regeneration potential. Hence, this study aims to investigate how anthropogenic disturbance affects mangrove propagule counts and mass in Johor’s mangrove forests.
This script and code will answer the research question: How does anthropogenic disturbance affect the fecundity (measured as propagule production and maternal investment) of Avicennia alba in the Johor coastal region? The hypothesis is that disturbed mangrove stands have lower propagule production and/or lower propagule maturation success than less-disturbed stands due to environmental stress and reduced resource availability.
The definition of anthropogenic disturbance of this study is as follows: disturbed sites are sites with a mean human footprint of 4 or greater within a 1km buffer, and undisturbed sites have mean HFP of less than 4 within the 1km buffer. These sites were placed as far apart from one another as possible to count them as independent sites.
For the experimental design, there is a total of 6 disturbed and undisturbed sites chosen across the coast of Johor, and in each site, 10 Avicennia alba mangrove trees will be chosen. Avicennia alba is chosen as the focal species because it is most commonly found within this region, and can be found in both disturbed and undisturbed areas. Within each tree, 6 fruiting branches are selected, and the total propagule counts and average propagule mass will be recorded for each branch. Propagule production is measured in propagule counts, and maternal investment is measured through the average mass of propagule weights.
Load libraries.
library(dplyr)
library(tidyverse)
# Packages for plots
library(ggplot2)
library(ggfortify)
library(gridExtra)
library(ggdist) # for stat_halfeye()
library(ggbeeswarm) # for nicer jitter
library(scales) # for alpha()
library(gghalves)
library(patchwork) # for adding titles to patched plots
# Packages for modeling
library(lme4)
library(car)
library(lmerTest)
library(glmmTMB) # for negative binomial GLMM
library(DHARMa) # for GLMM model diagnostics
library(ggeffects) # for ggpredict()
library(emmeans)
library(arm)
library(performance) # for calculating R2 value for GLMM and LMM
A power analysis is run to determine the minimum sample size needed for a study to reliably detect a meaningful effect if one exists. It avoids an overpowered or underpowered study, and allows the researcher to plan their study by observing if there are significant results. A power of more than 80% means the study has a high probability of showing a significant effect.
For simplicity’s sake, we attempted to combine variance at each site and branch level (which unwittingly caused problems during the modelling section, but more on that later). Based on available literature, we fixed the following values for undisturbed sites:
Average count: 52 per tree (Raju et al., 2012).
Average mass: 2.3g (Coupland et al., 2002).
Standard deviation of mass:
The only site- and tree-level uncertainties I could find were in terms of proportions: 25—62% at a site-level and 7-19% at a tree-level (Clarke & Myerscough, 1993).
These proportions were then used to derive the site- and tree-level standard deviaions using a base standard deviation of ±0.2 SE (Coupland et al., 2002). We derived standard deviation from SE by assuming n = 60. Thus, \(SD = 0.2 \times √60 ≈ 1.5g\).
The values above are cited mostly from studies targeting Avicennia alba reproductive biology, whose experimental setups were conducted in relatively undisturbed locations far from human development.
Based on the above, we then varied counts and mass for disturbed
sites to the extent that we thought biologically reasonable. We also
kept standard deviations for site and tree-level the same for both
disturbed and undisturbed groups, again for simplicity’s sake. Finally,
we included possible zero-counts in the power analysis; if a row’s
total_prop was 0, its avg_mass would be become
NA.
Two dependent variables means two separate power analyses are needed,
from which we would then see which combination of replicate numbers
would achieve the highest power for both counts and mass. We used a Mann
Whitney-U test (wilcox.test()) for counts and a T-test for
mass. Putting all of these into the ChatGPT blender produced the
following power analysis.
# FOR SD MASS
sd_total <- 1.5 # total phenotypic SD in propagule mass
# choose midpoints from the reported ranges (no renormalization)
p_site <- mean(c(25, 62)) / 100 # 0.435
p_tree <- mean(c(7, 19)) / 100 # 0.13
# compute SDs directly from proportions
sd_site_mass <- sqrt(p_site) * sd_total
sd_tree_mass <- sqrt(p_tree) * sd_total
# Power analysis function
powerAnalysis <- function(
runs = 5000, alpha = 0.05,
n_site = 6, n_tree = 10, n_branch = 6, # possible to vary
mu_d_count = 13, mu_u_count = 15, phi = 1.5, # set mu_d_count arbitrarily
mean_d_mass = 2, mean_u_mass = 2.3, # set mean_d_mass arbitrarily
sd_site = 0.5588604, sd_tree = 1.022295, # base SDs on literature value above
lambda_inflo = 52/15 # base lambda on literature value
){
set.seed(123)
m <- n_tree * n_branch
p0 <- exp(-lambda_inflo) # probability of zero inflorescences
sd_mass_total <- sqrt(sd_site^2 + sd_tree^2)
size_d <- mu_d_count / (max(phi, 1 + 1e-9) - 1)
size_u <- mu_u_count / (max(phi, 1 + 1e-9) - 1)
pc <- 0L; pm <- 0L
for(i in seq_len(runs)){
# ----- Disturbed counts and mass -----
zD <- rbinom(n_site * m, 1, p0) # 1 = zero-inflo branch
cD <- rnbinom(n_site * m, size = size_d, mu = mu_d_count)
cD[zD == 1] <- 0 # no propagules if no inflo
mD <- rnorm(n_site * m, mean = mean_d_mass, sd = sd_mass_total)
mD[zD == 1] <- NA # NA for avg_mass if no inflo
sD_count <- rowMeans(matrix(cD, nrow = n_site, byrow = TRUE))
sD_mass <- rowMeans(matrix(mD, nrow = n_site, byrow = TRUE),
na.rm = TRUE)
# ----- Undisturbed counts and mass -----
zU <- rbinom(n_site * m, 1, p0)
cU <- rnbinom(n_site * m, size = size_u, mu = mu_u_count)
cU[zU == 1] <- 0
mU <- rnorm(n_site * m, mean = mean_u_mass, sd = sd_mass_total)
mU[zU == 1] <- NA
sU_count <- rowMeans(matrix(cU, nrow = n_site, byrow = TRUE))
sU_mass <- rowMeans(matrix(mU, nrow = n_site, byrow = TRUE), na.rm = TRUE)
# ----- Test for power on established effect sizes -----
if(!is.na(p <- wilcox.test(sU_count, sD_count, exact = FALSE)$p.value) && p < alpha)
pc <- pc + 1L
if(!is.na(p <- t.test(sU_mass, sD_mass, var.equal = TRUE)$p.value) && p < alpha)
pm <- pm + 1L
}
list(power_count = pc / runs, power_mass = pm / runs)
}
# Run the power analysis
res <- powerAnalysis(runs = 10000,
n_site = 6, n_tree = 10, n_branch = 6,
mu_d_count = 13, mu_u_count = 15, phi = 2,
mean_d_mass = 2, mean_u_mass = 2.3,
sd_site = 0.9893179, sd_tree = 0.5408327,
lambda_inflo = 52/15
)
# Power analysis results for count and mass
res$power_count # 0.9628
## [1] 0.9628
res$power_mass # 0.8881
## [1] 0.8881
The power for propagule count and propagule mass is 96.3% and 88.8% respectively, showing that this study will likely yield significant results for both these factors.
A simulated (dummy) dataset was created to represent branch-level observations of propagule counts and masses for Avicennia alba trees across disturbed and undisturbed mangrove sites. The simulation mirrors the intended experimental design: 6 disturbed and 6 undisturbed sites, each containing 10 trees per site and 6 reproductive branches per tree. We modelled mass on a normal distribution and counts on a negative binomial. We chose to use a negative binomial over a poisson distribution for counts because we wanted to capture the overdispersed nature of ecological data—it is biologically reasonable that some branches might not produce any propagules, while others may produce many.
set.seed(123)
n_site_grp = 6
n_tree = 10
n_branch = 6
mu_prop_d = 13 # arbitrary
mu_prop_u = 15
prop_disp = 4
mean_mass_d = 2.0 # arbitrary
mean_mass_u = 2.3
dummy3000 <- data.frame(
site_id = c(rep(paste0("D", 1:6), each = n_site_grp*n_tree),
rep(paste0("U", 1:6), each = n_site_grp*n_tree)),
disturbance = rep(c("Disturbed", "Undisturbed"), each = 360),
tree_id = rep(paste0("T", 1:(n_site_grp*2*n_tree)), each = n_branch),
branch_id = seq(1:(n_site_grp*2*n_tree*n_branch)),
total_inflorescence = rpois(n_site_grp*2*n_tree*n_branch, lambda = 52/15), # don't care so much about the accuracy of this parameter because it's not used in analysis, don't have time to figure out the exact math.
total_prop = c(
rnbinom(360, size = mu_prop_d / (prop_disp - 1), mu = mu_prop_d),
rnbinom(360, size = mu_prop_u / (prop_disp - 1), mu = mu_prop_u)),
avg_mass = pmax(
c(rnorm(n_site_grp*n_tree*n_branch, mean = mean_mass_d, sd = sqrt(sd_site_mass^2 + sd_tree_mass^2)),
rnorm(n_site_grp*n_tree*n_branch, mean = mean_mass_u, sd = sqrt(sd_site_mass^2 + sd_tree_mass^2))),
0.1)
)
# Applying the condition: if total_inflorescence == 0, then set both to 0
dummy3000$total_prop[dummy3000$total_inflorescence == 0] = 0
# Set avg_mass to NA when BOTH total_prop and total_inflorescence are 0
dummy3000$avg_mass[dummy3000$total_prop == 0 & dummy3000$total_inflorescence == 0] = NA
# Saving dummy dataset file:
write.csv(dummy3000, "mangrove_dummy.csv")
In this section, propagule counts is used as a measure of propagule production in Avicennia alba mangroves. Since propagule production is limited by fecundity (Van der Stocken et al., 2019), comparing the number of propagules that successfully develop and drop off each branch in disturbed and undisturbed sites reveals how much anthropogenic disturbance influences the biological capacity of this pioneer mangrove species to produce offspring. Thereby, we can also gauge the likelihood of seedling survival and establishment success in disturbed sites.
Load the dummy data.
propagule <- read.csv('/Users/joylynngoh/Files/Uni/Y4/ES3307/Mangrove project/mangrove code/mangrove_dummy.csv')
# Check data
str(propagule)
## 'data.frame': 720 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ site_id : chr "D1" "D1" "D1" "D1" ...
## $ disturbance : chr "Disturbed" "Disturbed" "Disturbed" "Disturbed" ...
## $ tree_id : chr "T1" "T1" "T1" "T1" ...
## $ branch_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ total_inflorescence: int 2 5 3 6 7 1 3 6 4 3 ...
## $ total_prop : int 13 17 3 19 6 4 18 4 12 2 ...
## $ avg_mass : num 3.126 2.235 1.427 2.737 0.861 ...
head(propagule)
## X site_id disturbance tree_id branch_id total_inflorescence total_prop
## 1 1 D1 Disturbed T1 1 2 13
## 2 2 D1 Disturbed T1 2 5 17
## 3 3 D1 Disturbed T1 3 3 3
## 4 4 D1 Disturbed T1 4 6 19
## 5 5 D1 Disturbed T1 5 7 6
## 6 6 D1 Disturbed T1 6 1 4
## avg_mass
## 1 3.1259111
## 2 2.2354829
## 3 1.4267317
## 4 2.7367008
## 5 0.8613041
## 6 2.2292478
summary(propagule)
## X site_id disturbance tree_id
## Min. : 1.0 Length:720 Length:720 Length:720
## 1st Qu.:180.8 Class :character Class :character Class :character
## Median :360.5 Mode :character Mode :character Mode :character
## Mean :360.5
## 3rd Qu.:540.2
## Max. :720.0
##
## branch_id total_inflorescence total_prop avg_mass
## Min. : 1.0 Min. : 0.000 Min. : 0.00 Min. :0.100
## 1st Qu.:180.8 1st Qu.: 2.000 1st Qu.: 8.00 1st Qu.:1.442
## Median :360.5 Median : 3.000 Median :12.00 Median :2.154
## Mean :360.5 Mean : 3.476 Mean :13.37 Mean :2.146
## 3rd Qu.:540.2 3rd Qu.: 5.000 3rd Qu.:18.00 3rd Qu.:2.861
## Max. :720.0 Max. :11.000 Max. :50.00 Max. :6.343
## NA's :18
# Make disturbance, site, and tree into factors.
propagule <- propagule %>%
mutate(disturbance = factor(disturbance),
site_id = factor(site_id),
tree_id = factor(tree_id))
# Check data again
str(propagule)
## 'data.frame': 720 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ site_id : Factor w/ 12 levels "D1","D2","D3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ disturbance : Factor w/ 2 levels "Disturbed","Undisturbed": 1 1 1 1 1 1 1 1 1 1 ...
## $ tree_id : Factor w/ 120 levels "T1","T10","T100",..: 1 1 1 1 1 1 33 33 33 33 ...
## $ branch_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ total_inflorescence: int 2 5 3 6 7 1 3 6 4 3 ...
## $ total_prop : int 13 17 3 19 6 4 18 4 12 2 ...
## $ avg_mass : num 3.126 2.235 1.427 2.737 0.861 ...
site_id, tree_id and
disturbance all needed to be factorised.
First, plot the raw data for propagule counts. We plot both by site to check if counts appear to vary significantly between the sites, and by disturbance level for a broader overview.
# COUNTS RAINCLOUD
ggplot(propagule, aes(x = site_id, y = total_prop, fill = disturbance)) +
stat_halfeye(adjust = 0.6, width = 0.6, justification = -0.25,
slab_alpha = 0.6, .width = 0, point_interval = NULL) +
geom_boxplot(width = 0.15, outlier.shape = NA, position = position_nudge(x = 0.15),
color = scales::alpha("black", 0.7)) +
geom_jitter(aes(color = disturbance), width = 0.1, height = 0.15, alpha = 0.7,
size = 1.3, shape = 16, stroke = 0) +
scale_fill_manual(name = "Disturbance level", values = c("#f4a582", "#92c5de"),
breaks = sort(unique(propagule$disturbance))) +
scale_color_manual(name = "Disturbance level",
values = c("#713715", "#10424C"), breaks = sort(unique(propagule$disturbance))) +
labs(x = "Site", y = "Propagule count per bra",
title = "Fig. 2.1: Raw data for propagule counts by site") +
theme_minimal(base_size = 13) +
theme(legend.position = "top", plot.title = element_text(face = "bold"),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank())
# Plot of AGGREGATED counts against disturbance:
ggplot(propagule, aes(x = disturbance, y = total_prop,
fill = disturbance, color = disturbance)) +
# half-eye / density on the left
stat_halfeye(adjust = 0.6, width = 0.6, justification = -0.25,
slab_alpha = 0.6, .width = 0, point_interval = NULL) +
geom_boxplot(width = 0.15, outlier.shape = NA,
position = position_nudge(x = 0.15),
color = scales::alpha("black", 0.7)) +
# jittered raw points
geom_jitter(width = 0.06, height = 0.35,
alpha = 0.7, size = 1.3, shape = 16, stroke = 0) +
scale_fill_manual(name = "Disturbance level",
values = c("Disturbed" = "#f4a582",
"Undisturbed" = "#92c5de")) +
scale_color_manual(name = "Disturbance level",
values = c("Disturbed" = "#713715",
"Undisturbed" = "#10424C")) +
labs(x = "Disturbance level",
y = "Propagule count per branch",
title = "Fig. 2.2: Raw propagule counts by disturbance category") +
theme_minimal(base_size = 13) +
theme(
legend.position = "top", # match first plot (or change to "none" if you prefer)
plot.title = element_text(face = "bold"),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank()
)
Fig. 2.1 and 2.2 show that the raw data does not visually support the hypothesis that disturbance reduces reproductive output. There is high within-site variability in both disturbance levels, and that the spread of both groups completely overlap each other. Furthermore, there is no visually discernible difference between the total propagule counts of disturbed and undisturbed Avicennia alba, as the median count also seems to be about the same for both. We also note the presence of quite a few 0 counts and a smaller handful of outliers which might skew the models later on. Fig. 2.2 shows roughly the same spread: the median of disturbed counts is very slightly lower than undisturbed counts, but we should expect no statistically significant difference between the two groups.
However, variability between sites seems to be quite low—all disturbed sites hover about the same raw mean of 12 to 13, and all undisturbed sites hover about a slightly higher raw mean.
total_propThe random effect structure of the models for count data will remain
the same for all the models we run, to reflect the nested structure of
the study design. With branch as the lowest level of replication, we
nest tree_id within site_id to capture any
inherent variances at these nested levels. Additionally, although it is
very likely that other environmental variables might explain the
differences between counts at disturbed vs undisturbed sites, we also
chose not to include additional fixed effects because we are only
interested in investigating anthropogenic disturbance, defined by mean
human footprint within a preset buffer.
Although we already know that a negative binomial distribution will fit our data the best, we will treat the data as though this is unknown to us. First, we try to fit a linear model, in case the data can actually be fitted with a simpler model.
count1 <- lmer(total_prop~disturbance + (1|site_id/tree_id), data = propagule)
summary(count1)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: total_prop ~ disturbance + (1 | site_id/tree_id)
## Data: propagule
##
## REML criterion at convergence: 4945.3
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.9260 -0.7104 -0.1557 0.6072 4.7354
##
## Random effects:
## Groups Name Variance Std.Dev.
## tree_id:site_id (Intercept) 0.0000 0.0000
## site_id (Intercept) 0.1438 0.3792
## Residual 56.3394 7.5060
## Number of obs: 720, groups: tree_id:site_id, 120; site_id, 12
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 12.4139 0.4248 9.9992 29.222 5.15e-11
## disturbanceUndisturbed 1.9083 0.6008 9.9992 3.176 0.00988
##
## Correlation of Fixed Effects:
## (Intr)
## dstrbncUnds -0.707
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
The model produces a singular fit, indicating that it is overfitted. The random effects also do not account for any of the total variance in the model. Strangely, the fixed effect shows strong evidence that the propagule count differs between each disturbance level (\(t_{1,718} = 3.14, P = 0.002\)). This seems to contradict the conclusions drawn from Fig. 2.1 and Fig. 2.2.
Run model diagnostics.
autoplot(lm(propagule$total_prop ~ propagule$disturbance),
which = c(1:3, 5), ncol = 2, label.size = 3, label.repel = TRUE) +
theme_minimal()
Surprisingly, the diagnostic plots show relatively linear and homoscedastic residuals, and outliers don’t seem to be too influential. However, the QQplot shows a heavy-tailed distribution of the residuals, violating the normality assumption of linear models.
We tried another LMM with log-transformed count data in line with Warton et al. (2016) who advise that it may be the case that the zeroes are normally distributed and do not contribute to an unequal mean-variance relationship—thus, logging the response variable might remove the need for a more complex GLM.
propagule$log_prop <- log10(propagule$total_prop + 1)
count2 <- lmer(log_prop~disturbance + (1|site_id/tree_id), data = propagule)
summary(count2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: log_prop ~ disturbance + (1 | site_id/tree_id)
## Data: propagule
##
## REML criterion at convergence: 266.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.8346 -0.3846 0.1597 0.6633 2.0659
##
## Random effects:
## Groups Name Variance Std.Dev.
## tree_id:site_id (Intercept) 6.095e-04 0.0246889
## site_id (Intercept) 7.606e-08 0.0002758
## Residual 8.285e-02 0.2878423
## Number of obs: 720, groups: tree_id:site_id, 120; site_id, 12
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 1.05657 0.01550 102.53433 68.156 <2e-16
## disturbanceUndisturbed 0.05138 0.02192 102.53433 2.344 0.021
##
## Correlation of Fixed Effects:
## (Intr)
## dstrbncUnds -0.707
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0026614 (tol = 0.002, component 1)
Once again, count2 has a singularity issue, and
tree-level random effects now accounts for \(0.0004482/0.0587808 = 0.00762\) of total
variance, which is still concerningly low. However, the statistical
significance of our fixed effects seems to better match the high
variability in Fig. 2.1 and 2.2, so that now we would
conclude that there is only moderate evidence that disturbance has an
effect on propagule counts (\(t_{1,118} =
2.19, P = 0.030\)).
Run model diagnostics.
autoplot(lm(propagule$log_prop ~ propagule$disturbance),
which = c(1:3, 5), ncol = 2, label.size = 3, label.repel = TRUE) +
theme_minimal()
Unfortunately, the diagnostic plots show that logging the data has caused the left tail of the QQplot to drag even more heavily, and has disproportionately skewed the residuals with respect to the first model. Thus, we move on to the GLMM family.
First, we try a Poisson distribution since this is the default family for count data.
count3 <- glmer(total_prop~disturbance + (1|site_id/tree_id), family = poisson, data = propagule)
summary(count3)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: poisson ( log )
## Formula: total_prop ~ disturbance + (1 | site_id/tree_id)
## Data: propagule
##
## AIC BIC logLik -2*log(L) df.resid
## 5962.2 5980.5 -2977.1 5954.2 716
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.8121 -1.3344 -0.2142 1.0462 7.5537
##
## Random effects:
## Groups Name Variance Std.Dev.
## tree_id:site_id (Intercept) 0.03978 0.1994
## site_id (Intercept) 0.00000 0.0000
## Number of obs: 720, groups: tree_id:site_id, 120; site_id, 12
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.49657 0.02992 83.429 < 2e-16
## disturbanceUndisturbed 0.14771 0.04188 3.527 0.00042
##
## Correlation of Fixed Effects:
## (Intr)
## dstrbncUnds -0.713
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
Glancing at the output of the model, we note that there is still singularity in the model. Note that the random effects still account for a very negligible portion of overall variance.
Unfortunately, the glmer() function does not output
information about null and residual deviances because the function is
unable to compute a ‘saturated model’ in the same manner as for a
regular GLM, in order to compare the fitted model to.
Thankfully, the DHARMa package allows us to check for
overdispersion, along with more GLM-specific diagnostic tests.
# Simulate standardized residuals
res_count3 = simulateResiduals(fittedModel = count3, n = 1000)
par(mfrow=c(2,2), oma = c(0, 0, 4, 0), mar = c(3, 2, 2, 2))
testDispersion(res_count3) # dispersion ≈ 2, p < 0.00001 very overdispersed
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 2.7384, p-value < 2.2e-16
## alternative hypothesis: two.sided
testUniformity(res_count3) # D = 0.0832, p < 0.00001 residuals not uniform
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.14812, p-value = 3.804e-14
## alternative hypothesis: two-sided
testZeroInflation(res_count3) # ratio of observed 0 count to simulated 0 distribution: 81.8, p < 0.00001
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 950, p-value < 2.2e-16
## alternative hypothesis: two.sided
testOutliers(res_count3) # number of outliers at both margins: 20, across 720 observations. p < 0.00001. observed frequency of outliers: 0.028; expected fequency of outliers: 0.002.
##
## DHARMa outlier test based on exact binomial test with approximate
## expectations
##
## data: res_count3
## outliers at both margin(s) = 44, observations = 720, p-value < 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.001998002
## 95 percent confidence interval:
## 0.04475088 0.08117357
## sample estimates:
## frequency of outliers (expected: 0.001998001998002 )
## 0.06111111
mtext("Model Diagnostics for count3 (Poisson GLMM)",
outer = TRUE, cex = 1, font = 2, line = 2)
mtext("DHARMa residual-based tests: uniformity, dispersion, zero-inflation, outliers",
outer = TRUE, cex = 0.8, line = 1)
Note that the red lines in the dispersion, zero-inflation, and outlier
diagnostic plots are the observed diagnostics of the fitted
model. The black lines or distributions are the simulated
distribution of expected diagnostics.
The diagnostics all show that the GLMM with poisson distribution is not a good fit for the data. From left to right, top to bottom:
The dispersion test returns a statistically significant result for overdispersion (dispersion = 2.00, p < 0.0001). Our diagnostic plot confirms this: the peak of the simulated distribution of dispersion ratios falls at around 1, while the true dispersion ratio falls out of the plot scale at about 2.
The QQplot shows heavy left and right tails for the residual distribution. From the Kolmogorov–Smirnov (KS) test, which checks for the overall uniformity of residuals, there is very strong evidence that residuals are not uniform (D = 0.0831, p < 0.0001).
There is significant, strong zero-inflation (ratio of observed to simulated zeros = 6000, p < 0.0001). Visually, the observed number of zeroes is so much larger than the simulated distribution that the entire distribution has collapsed into a single line.
Finally, the frequency of outliers is far greater than expected (p < 0.0001). Visually, the observed number of outliers (either zeroes or extreme counts) lies at both ends of the expected uniform distribution of outliers.
Finally, we try a negative binomial distribution to fix the heavy
overdispersion and non-uniformity issues in the previous model. We could
not use a quasipoisson distribution with glmer() as the
function was not built to model on quasi- distributions. We used the
glmmTMB() function instead of the glm.nb()
function because glm.nb() does not support mixed effect
models—as we will see later on, glmmTMB() is handy for
solving other count-specific issues.
count4 <- glmmTMB(total_prop ~ disturbance + (1|site_id/tree_id),
family = nbinom2, data = propagule)
summary(count4)
## Family: nbinom2 ( log )
## Formula: total_prop ~ disturbance + (1 | site_id/tree_id)
## Data: propagule
##
## AIC BIC logLik -2*log(L) df.resid
## NA NA NA NA 715
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## tree_id:site_id (Intercept) 2.302e-09 4.797e-05
## site_id (Intercept) 8.022e-17 8.956e-09
## Number of obs: 720, groups: tree_id:site_id, 120; site_id, 12
##
## Dispersion parameter for nbinom2 family (): 3.7
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.51882 0.03122 80.69 <2e-16
## disturbanceUndisturbed 0.14300 0.04381 3.26 0.0011
From the model summary, dispersion is now taken to be 8, which much better reflects the overdispersion (large variance) in our count data. Yet, the variance explained by the random effects is still negligible.
Run model diagnostics and check for overdispersion.
# Simulate standardized residuals
res_count4 = simulateResiduals(fittedModel = count4, n = 1000)
par(mfrow = c(2, 2), oma = c(0, 0, 4, 0), mar = c(3, 3, 3, 3))
testUniformity(res_count4) # D = 0.046, p = 0.0995. Residuals are uniform.
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.036487, p-value = 0.2932
## alternative hypothesis: two-sided
testDispersion(res_count4) # dispersion = 0.907, p = 0.13. Dispersion not significant.
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 0.91069, p-value = 0.186
## alternative hypothesis: two.sided
testZeroInflation(res_count4) # ratioObsSim = 81.8, p < 0.0001. Zero-inflation still significant.
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 7.3473, p-value < 2.2e-16
## alternative hypothesis: two.sided
testOutliers(res_count4) # p < 0.0001. Outliers still significant (due to zero-inflation).
##
## DHARMa outlier test based on exact binomial test with approximate
## expectations
##
## data: res_count4
## outliers at both margin(s) = 9, observations = 720, p-value = 1.94e-05
## alternative hypothesis: true probability of success is not equal to 0.001998002
## 95 percent confidence interval:
## 0.005731294 0.023595785
## sample estimates:
## frequency of outliers (expected: 0.001998001998002 )
## 0.0125
mtext("Model Diagnostics for count4 (Negative Binomial GLMM)",
outer = TRUE, cex = 1, font = 2, line = 2)
mtext("DHARMa residual-based tests: uniformity, dispersion, zero inflation, and outliers",
outer = TRUE, cex = 0.8, line = 1)
The model diagnostics have greatly improved with a negative binomial distribution. Residuals are no longer uniform (p = 0.0995), and dispersion is sufficiently explained well, although now there seems to be slight underdispersion (dispersion = 0.907, p = 0.13). The non-zero outliers also seem to have been resolved, based on bottom right plot. However, zero-inflation is still an issue (p < 0.0001), which is also reflected in the zeroes highlighted in the outlier diagnostic plot.
Thankfully, glmmTMB() can account for zero-inflation by
including a ziformula in the model structure.
count5 <- glmmTMB(total_prop ~ disturbance + (1|site_id/tree_id), ziformula = ~disturbance,
family = nbinom2, data = propagule)
Briefly running diagnostics again,
res_count5 = simulateResiduals(fittedModel = count5, n = 1000)
par(mfrow = c(2, 2), oma = c(0, 0, 4, 0), mar = c(3, 3, 3, 3))
testUniformity(res_count5) # D = 0.197, p = 0.942. Residuals are uniform.
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: simulationOutput$scaledResiduals
## D = 0.01534, p-value = 0.9958
## alternative hypothesis: two-sided
testDispersion(res_count5) # dispersion = 1.01, p = 0.916. Dispersion not significant.
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 1.0021, p-value = 0.962
## alternative hypothesis: two.sided
testZeroInflation(res_count5) # ratioObsSim = 1.00, p = 1. Zero-inflation not significant.
##
## DHARMa zero-inflation test via comparison to expected zeros with
## simulation under H0 = fitted model
##
## data: simulationOutput
## ratioObsSim = 1.0002, p-value = 1
## alternative hypothesis: two.sided
testOutliers(res_count5) # p = 0.058. Outliers not significant.
##
## DHARMa outlier test based on exact binomial test with approximate
## expectations
##
## data: res_count5
## outliers at both margin(s) = 5, observations = 720, p-value = 0.01572
## alternative hypothesis: true probability of success is not equal to 0.001998002
## 95 percent confidence interval:
## 0.002258574 0.016131084
## sample estimates:
## frequency of outliers (expected: 0.001998001998002 )
## 0.006944444
mtext("Fig. 2.3: Model diagnostics for count5 (ngative binomial GLMM with zero-inflation)",
outer = TRUE, cex = 1, font = 2, line = 2)
mtext("DHARMa residual-based tests: uniformity, dispersion, zero inflation, and outliers",
outer = TRUE, cex = 0.8, line = 1)
The model diagnostics show that a negative binomial distribution that
accounts for zero-inflation is the best fit for the count data.
Zero-inflation is well-accounted for, as the peak of the simulated
distribution aligns well with the observed number of zeroes. Accounting
for zero-inflation has also solved our outlier issue. Since our model
diagnostics look nearly perfect, we chose to use model
count5 to interpret our results.
summary(count5)
## Family: nbinom2 ( log )
## Formula: total_prop ~ disturbance + (1 | site_id/tree_id)
## Zero inflation: ~disturbance
## Data: propagule
##
## AIC BIC logLik -2*log(L) df.resid
## 4837.5 4869.6 -2411.7 4823.5 713
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## tree_id:site_id (Intercept) 3.873e-09 6.224e-05
## site_id (Intercept) 1.911e-09 4.371e-05
## Number of obs: 720, groups: tree_id:site_id, 120; site_id, 12
##
## Dispersion parameter for nbinom2 family (): 4.82
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.53642 0.02858 88.76 < 2e-16
## disturbanceUndisturbed 0.15814 0.04012 3.94 8.09e-05
##
## Zero-inflation model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.0307 0.4261 -9.460 <2e-16
## disturbanceUndisturbed 0.6282 0.5230 1.201 0.23
Unfortunately, the negligible variance in our random effects persists even with zero-inflation. This indicates that there might be something wrong with the structure of the model. More on this in section 4.
Within the glmmTMB() function, anova() is
used for model comparison. This means that it cannot decompose single
fitted models into type I or II sums of squares, so to test for the
significance of fixed effects, we must manually create a model without
the fixed effect before using anova() to compare the
two.
# create a "null" version of count5 model without the fixed effect
count5_null <- glmmTMB(total_prop ~ 1 + (1|site_id/tree_id), ziformula = ~1,
family = nbinom2, data = propagule)
anova(count5, count5_null, test = "Chisq")
## Data: propagule
## Models:
## count5_null: total_prop ~ 1 + (1 | site_id/tree_id), zi=~1, disp=~1
## count5: total_prop ~ disturbance + (1 | site_id/tree_id), zi=~disturbance, disp=~1
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## count5_null 5 4845.3 4868.2 -2417.7 4835.3
## count5 7 4837.5 4869.6 -2411.8 4823.5 11.851 2 0.002671
r2(count5)
## [1] NA
emm2 <- emmeans(count5, ~ disturbance, type = c("response"))
confint(emm2, level = 0.95)
## disturbance response SE df asymp.LCL asymp.UCL
## Disturbed 12.6 0.361 Inf 11.9 13.4
## Undisturbed 14.8 0.417 Inf 14.0 15.6
##
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
pred4 <- as.data.frame(emm2)
Since there is only one fixed effect, we used emmeans()
to average over all the blocking factors (ie. sites). We also used the
“response” type argument to automatically back-transform the means and
confidence intervals to the original scale of the raw counts.
R2 returns an ‘NA’ value, indicating that the null deviance must be 0, so that the R2 value cannot be calculated. More on this in section 4.
There is strong evidence that anthropogenic disturbance has a negative effect on total propagule counts per branch, where disturbed mangroves produced 1.538 propagules fewer than undisturbed mangroves (\(χ2_{2,7} = 12.2, P = 0.002\)), which supports our hypothesis. Taken at face value, this is biologically significant because we used 360 replicates per disturbance level, and it is reasonable that the reproductive capacity of mangroves would suffer if the soil and water were polluted by metals or excess nutrients.
With this, we plot the model-predicted means to visualise the model results.
ggplot(pred4, aes(x = disturbance, y = response, color = disturbance)) +
# raw data
geom_jitter(
data = propagule,
aes(x = disturbance, y = total_prop, color = disturbance),
width = 0.05, height = 0.1,
alpha = 0.6, size = 1.2
) +
# model CIs
geom_linerange(
aes(ymin = asymp.LCL, ymax = asymp.UCL),
position = position_nudge(x = 0.1),
linewidth = 0.7
) +
# model means
geom_point(
size = 2,
position = position_nudge(x = 0.1)
) +
scale_color_manual(
name = "Disturbance level",
values = c("Disturbed" = "#d17752", "Undisturbed" = "#5f95b0"),
breaks = levels(propagule$disturbance)
) +
labs(
x = "Disturbance level",
y = "Propagule count per branch",
title = "Fig. 2.4: Model-predicted means (95% CI) with NB GLMM + zero-inflation model",
subtitle = "count5 <- glmmTMB(total_prop ~ disturbance + (1|site_id/tree_id), \nziformula = ~ disturbance, family = nbinom2, data = propagule)"
) +
theme_minimal(base_size = 13) +
theme(
legend.position = "top",
plot.subtitle = element_text(family = "Monaco", size = 9),
plot.title = element_text(face = "bold", size = 13),
panel.grid.minor = element_blank()
)
From Fig. 2.4, the model-predicted means seem to fall roughly in the mean of the raw data, but the confidence intervals are glaringly small (so small that they appear invisible). To see the model-predicted confidence intervals, we plot them without the raw data:
ggplot(pred4, aes(x = disturbance, y = response, color = disturbance)) +
# model CIs
geom_linerange(aes(ymin = asymp.LCL, ymax = asymp.UCL), linewidth = 0.7) +
# model means
geom_point(size = 2) +
scale_color_manual(
name = "Disturbance level",
values = c("Disturbed" = "#d17752", "Undisturbed" = "#5f95b0"),
breaks = levels(propagule$disturbance)) +
labs(
x = "Disturbance level",
y = "Estimated mean propagule count per branch",
title = "Fig. 2.5: Model-predicted means (95% CI) without raw data",
subtitle = "count5 <- glmmTMB(total_prop ~ disturbance + (1|site_id/tree_id), \nziformula = ~ disturbance, family = nbinom2, data = propagule)") +
theme_minimal(base_size = 13) +
theme(
legend.position = "top",
plot.subtitle = element_text(family = "Monaco", size = 9),
plot.title = element_text(face = "bold", size = 13),
panel.grid.minor = element_blank()
)
Without the raw data, the confidence intervals clearly do not overlap (Fig. 2.5). If we were to take our model-predicted means at face value, they support the model results that show disturbed Avicennia alba to produce slightly fewer propagules on average than undisturbed mangroves.
This analysis investigates how anthropogenic disturbance affects propagule mass (a measure of maternal investment) in Avicennia alba mangroves. Propagule mass represents the energy and nutrient resources that a parent tree allocates to its offspring. It reflects maternal investment, where higher propagule mass can improve seedling survival and establishment success, particularly in stressful environments.
Plotting the raw dataset for propagule mass and disturbance:
# Plotting raw dataset (raincloud plot):
ggplot(na.omit(propagule), aes(x = site_id, y = avg_mass, fill = disturbance)) +
stat_halfeye(adjust = 0.6, width = 0.6, justification = -0.25,
slab_alpha = 0.6, .width = 0, point_interval = NULL) +
geom_boxplot(width = 0.15, outlier.shape = NA, position = position_nudge(x = 0.15),
color = scales::alpha("black", 0.7)) +
geom_jitter(aes(color = disturbance), width = 0.1, height = 0.15, alpha = 0.7,
size = 1.3, shape = 16, stroke = 0) +
scale_fill_manual(name = "Disturbance level",
values = c("#f4a582", "#92c5de"), breaks = sort(unique(propagule$disturbance))) +
scale_color_manual(name = "Disturbance level",
values = c("#713715", "#10424C"), breaks = sort(unique(propagule$disturbance))) +
labs(x = "Site", y = "Average propagule mass per branch",
title = "Fig. 3.1: Raw data for average mass by site") +
theme_minimal(base_size = 13) +
theme(legend.position = "top",
plot.title = element_text(face = "bold"),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank())
# Plot of AGGREGATED counts against disturbance:
ggplot(na.omit(propagule), aes(x = disturbance, y = avg_mass,
fill = disturbance, color = disturbance)) +
# half-eye / density on the left
stat_halfeye(adjust = 0.6, width = 0.6, justification = -0.25,
slab_alpha = 0.6, .width = 0, point_interval = NULL) +
geom_boxplot(width = 0.15, outlier.shape = NA,
position = position_nudge(x = 0.15),
color = scales::alpha("black", 0.7)) +
# jittered raw points
geom_jitter(width = 0.06, height = 0.35,
alpha = 0.7, size = 1.3, shape = 16, stroke = 0) +
scale_fill_manual(name = "Disturbance level",
values = c("Disturbed" = "#f4a582",
"Undisturbed" = "#92c5de")) +
scale_color_manual(name = "Disturbance level",
values = c("Disturbed" = "#713715",
"Undisturbed" = "#10424C")) +
labs(x = "Disturbance level",
y = "Average propagule mass per branch",
title = "Fig. 3.2: Raw average propagule mass by disturbance category") +
theme_minimal(base_size = 13) +
theme(
legend.position = "top", # match first plot (or change to "none" if you prefer)
plot.title = element_text(face = "bold"),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank()
)
The raw data plots do not support the hypothesis. From Fig. 3.1 and 3.2, both disturbed and undisturbed ranges overlap completely. Both groups show considerable within-site variation, likely reflecting individual tree or branch-level differences. Still, inferential statistics are needed to confirm if this difference is significant.
This dataset simulates close to real-life biological processes, and
there is a chance that there are no propagules or no inflorescence
within a branch. In this case, the average mass (avg_mass)
is indicated as NA, as mass cannot be measured in the absence of
propagules.
Accounting for NA values in average mass:
# Quick counts of NAs in avg_mass:
table(is.na(propagule$avg_mass)) # shows that there are 18 NA values for avg_mass
##
## FALSE TRUE
## 702 18
# Treating NA values in avg_mass:
# Excluding NA rows for mass models (mass is conditional on presence):
prop_mass_df = propagule %>% filter(!is.na(avg_mass))
# creating a dataframe without NA values
# Modelling and including presence separately:
propagule = propagule %>%
mutate(has_prop = as.integer(!is.na(avg_mass) & total_prop > 0)) # 1 if mass measured / propagule present (adding a column to indicate presence of propagule)
# Table showing the number of branches that do not have propagules for the respective disturbed/undisturbed site:
table(propagule$has_prop, propagule$disturbance)
##
## Disturbed Undisturbed
## 0 7 12
## 1 353 348
The prop_mass_df dataframe removes all NA rows for
propagule mass models. Using this might answer if disturbance affects
whether branches produce propagules at all. The added
has_prop variable analyses whether propagules are present
at all, and may answer a question like: among branches that do have
propagules, does disturbance affect their mass? Hence, these branches
were excluded from this propagule mass analysis but retained for the
separate analysis of propagule production, as a zero count is still
considered a data point for propagule counts.
This conditional analysis approach is appropriate because it distinguishes between two distinct aspects of reproductive output: (1) whether propagules are produced at all (reproductive success/failure), and (2) how much resource is allocated per propagule when they are produced (maternal investment). Including branches without propagules as zeros would inappropriately conflate these processes, as zero mass is biologically impossible for existing propagules.
avg_massCreating candidate models to determine which model (regarding propagule mass) is the most suitable: ### 3.2.1. Nested Branch-level LMM (linear mixed model)
# Branch-level nested LMM (branch-level masses, tree nested in site):
mass_nested = tryCatch(lmer(avg_mass ~ disturbance + (1|site_id/tree_id), data = prop_mass_df, REML = TRUE), error = function(e) { message("mass_nested failed: ", e$message); NULL })
# accounts for nested hierarchical structure: branches within trees within sites
# tryCatch() handles errors: if model fails, it returns NULL instead of crashing
summary(mass_nested)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: avg_mass ~ disturbance + (1 | site_id/tree_id)
## Data: prop_mass_df
##
## REML criterion at convergence: 2096.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.0232 -0.6386 0.0098 0.6724 3.9737
##
## Random effects:
## Groups Name Variance Std.Dev.
## tree_id:site_id (Intercept) 0.000000 0.00000
## site_id (Intercept) 0.004119 0.06418
## Residual 1.148105 1.07150
## Number of obs: 702, groups: tree_id:site_id, 120; site_id, 12
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 2.06723 0.06276 9.96970 32.938 1.66e-11
## disturbanceUndisturbed 0.15803 0.08897 10.06316 1.776 0.106
##
## Correlation of Fixed Effects:
## (Intr)
## dstrbncUnds -0.705
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
This first model is a branch-level nested linear mixed model. This
model treats each branch as one observation, and includes trees nested
within sites as random effects. (1|site_id/tree_id)
represents the random intercept for site, and random intercept for tree
within site. This accounts for the hierarchical, nested structure, where
branches are nested within trees, and trees are nested within sites.
From the summary, the random effect for tree nested within site has no variance, which means it contributes nothing to the study. There was also a singular fit warning, as the model tried to estimate the variation among trees, but there was effectively none once site-level variation was accounted for. In other words, propagule mass does not differ systematically among trees within the same site. Overall, adding tree as a random effect did not improve or add to the model as it only made it over-parameterised (model is too complex for the variation present).
# Branch-level site-only random intercept (simpler model):
mass_site = lmer(avg_mass ~ disturbance + (1|site_id), data = prop_mass_df, REML = TRUE)
# accounts for site-level clustering
# simpler model (few random effects, more stable estimation), but treats all branches within a site as equally correlated
summary(mass_site)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: avg_mass ~ disturbance + (1 | site_id)
## Data: prop_mass_df
##
## REML criterion at convergence: 2096.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.0232 -0.6386 0.0098 0.6724 3.9737
##
## Random effects:
## Groups Name Variance Std.Dev.
## site_id (Intercept) 0.004119 0.06418
## Residual 1.148105 1.07150
## Number of obs: 702, groups: site_id, 12
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 2.06723 0.06276 9.96970 32.938 1.66e-11
## disturbanceUndisturbed 0.15803 0.08897 10.06316 1.776 0.106
##
## Correlation of Fixed Effects:
## (Intr)
## dstrbncUnds -0.705
This second model is a branch-level linear mixed model with a site-level random intercept. This is a simpler model that still uses branch-level data, but now only accounts for site-level random variation. This treats all branches within a site as correlated (i.e. not fully independent), which is biologically representative.
From the summary, it shows that the variance of the site is \(0.0152\), with a residual of \(1.0824\), meaning that with this model structure, variation between sites accounts for \((1.0824-0.0152)/1.0824=0.99\) of model variation. Most variation is within sites (between branches) and not between sites. However, including site as a random effect still accounts for clustering and avoids pseudoreplication. The fixed effect for disturbance also shows to be borderline significant (\(p=0.0172\)). Hence, this model correctly accounts for the non-independence of branches within the same site without over-fitting.
# Aggregating to tree-level means (for inference on trees):
tree_df = prop_mass_df %>%
group_by(site_id, tree_id, disturbance) %>%
summarise(mean_mass = mean(avg_mass, na.rm = TRUE),
n_branches = n(), .groups = "drop")
# Creating model:
mass_tree = lmer(mean_mass ~ disturbance + (1|site_id), data = tree_df, REML = TRUE)
# this model aggregates branch-level data to tree-level averages; analyses trees as the unit of replication (not branches)
# .groups = "drop" removes grouping structure after summarising
# each tree contributes equally (not weighted by how many branches are measured)
summary(mass_tree)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: mean_mass ~ disturbance + (1 | site_id)
## Data: tree_df
##
## REML criterion at convergence: 138.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.52560 -0.69893 -0.02772 0.62435 2.66624
##
## Random effects:
## Groups Name Variance Std.Dev.
## site_id (Intercept) 0.006141 0.07836
## Residual 0.172389 0.41520
## Number of obs: 120, groups: site_id, 12
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 2.06948 0.06242 10.00000 33.153 1.47e-11
## disturbanceUndisturbed 0.15324 0.08828 10.00000 1.736 0.113
##
## Correlation of Fixed Effects:
## (Intr)
## dstrbncUnds -0.707
The third model is a linear mixed model that aggregates branch-level data to tree-level averages, treating each tree as an independent replicate. The random intercept for site accounts for site-level clustering. This model explicitly avoids pseudoreplication (multiple branches from the same tree are not independent), but has a lower sample size (\(n=120\) instead of \(n=702\)).
The summary shows that residual variance is smaller (\(0.15\) compared to \(1.0824\) from mass_site)
because of this smaller sample size. The effect of disturbance remains
borderline significant (\(p=0.013\))
and nearly identical in magnitude to branch-level models. The random
site variance is also consistent with the other models. However, while
this model gives the same biological conclusion of disturbance, there is
a reduced sample size and less statistical power.
To further observe these models before making a final decision, a few tests were run to observe their statistical inference.
Checking the singularity of these models:
# Singularity checks:
models = list(nested = mass_nested, site = mass_site, tree = mass_tree)
lapply(models, function(m) {
if (!is.null(m))
singular = isSingular(m)})
## $nested
## [1] TRUE
##
## $site
## [1] FALSE
##
## $tree
## [1] FALSE
These checks are to test whether each model suffers from singularity, which is if one or more random effects have zero or near-zero estimated variance. A singular model means at least one random effect level adds no meaningful variance, and a non-singular model means all random effects are contributing reasonably to explaining variance. Singularity indicates that the model is too complex for the data structure and cannot reliably estimate all variance components, leading to unstable parameter estimates and potentially invalid inference. Hence, models that are non-singular should be preferred.
From these singularity checks, mass_nested is shown to
have a singular fit with tree-level variance estimated at 0.000 and
site-level variance at \(0.015\). This
indicates that branches from different trees within the same site do not
differ systematically in propagule mass; all variation occurs either
between sites or between individual branches. This suggests that the
tree-level random effect is unnecessary and the nested structure is
overparameterised. In contrast, mass_site and
mass_tree have non-singular fits with all random effects
contributing meaningful variance.
Running diagnostic plots for the different candidate models:
## For mass_nested:
par(mfrow = c(2,2))
# QQ plot for normality
qqnorm(resid(mass_nested))
qqline(resid(mass_nested))
# Residuals vs fitted (already what plot(mass_nested) shows)
plot(fitted(mass_nested), resid(mass_nested),
xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs Fitted")
abline(h = 0, col = "red")
# Histogram of residuals
hist(resid(mass_nested), breaks = 20, col = "lightblue", main = "Histogram of residuals")
# Residuals by disturbance level
boxplot(resid(mass_nested) ~ prop_mass_df$disturbance,
main = "Residuals by Disturbance", ylab = "Residuals", col = c("#f4a582", "#92c5de"))
par(mfrow = c(1,1))
## For mass_site:
par(mfrow = c(2,2))
# QQ plot for normality
qqnorm(resid(mass_site))
qqline(resid(mass_site))
# Residuals vs fitted (already what plot(mass_site) shows)
plot(fitted(mass_site), resid(mass_site),
xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs Fitted")
abline(h = 0, col = "red")
# Histogram of residuals
hist(resid(mass_site), breaks = 20, col = "lightblue", main = "Histogram of residuals")
# Residuals by disturbance level
boxplot(resid(mass_site) ~ prop_mass_df$disturbance,
main = "Residuals by Disturbance", ylab = "Residuals", col = c("#f4a582", "#92c5de"))
par(mfrow = c(1,1))
## For mass_tree:
par(mfrow = c(2,2))
# QQ plot for normality
qqnorm(resid(mass_tree))
qqline(resid(mass_tree))
# Residuals vs fitted (already what plot(mass_tree) shows)
plot(fitted(mass_tree), resid(mass_tree),
xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs Fitted")
abline(h = 0, col = "red")
# Histogram of residuals
hist(resid(mass_tree), breaks = 20, col = "lightblue", main = "Histogram of residuals")
# Residuals by disturbance level
boxplot(resid(mass_tree) ~ tree_df$disturbance,
main = "Residuals by Disturbance", ylab = "Residuals", col = c("#f4a582", "#92c5de"))
par(mfrow = c(1,1))
The diagnostic plots for mass_nested and
mass_site are the exact same, but random effects are
measured differently. Additionally, mass_nested shows a
singular fit, and is hence disregarded for now. The Normal Q-Q
plot for mass_nested and mass_site shows that
the points follow the 1:1 closely, except for some slight deviations at
both tails (which is normal for ecological data). The residuals show
that they are approximately normally distributed and acceptable for LMM
assumptions. The Residuals vs Fitted plot checks for
homoscedasticity (constant variance) and model fit. The residuals are
scattered evenly around zero with no funnel shape or clear pattern,
showing that the variance is roughly constant across fitted values (no
major heteroscedasticity). The Histogram of Residuals show that
the data is roughly bell-shaped (centered near zero), meaning the
residuals are approximately normal. Lastly, the Residuals by
Disturbance plot compares whether model errors (residuals) differ
systematically between groups. The medians of both boxplots (Disturbed
vs Undisturbed) are near zero and have similar spread.
As for mass_tree, the diagnostic plots mostly express
similar fits to those of mass_nested and
mass_site, but with slight differences. The Normal
Q-Q plot showed that the points do not follow the line as closely,
with larger deviations at the tails. The residuals show approximate
normal distribution and homoscedasticity. The plot of Residuals by
Disturbance show that the Undisturbed boxplot had a larger error
bar and slightly higher median compared to the Disturbed boxplot.
Overall, the key assumptions look well satisfied with no extreme outliers, and the models appear not to be biased to one group.
Comparing variance values for the different models:
# Comparing variance:
VarCorr(mass_nested)
## Groups Name Std.Dev.
## tree_id:site_id (Intercept) 0.000000
## site_id (Intercept) 0.064181
## Residual 1.071497
VarCorr(mass_site)
## Groups Name Std.Dev.
## site_id (Intercept) 0.064181
## Residual 1.071497
VarCorr(mass_tree)
## Groups Name Std.Dev.
## site_id (Intercept) 0.078363
## Residual 0.415197
For mass_nested, this variance proves that tree-level
random effect had a standard deviation of \(0.0000\), meaning it explained no
additional variance. This implies that branches from the same tree do
not differ systematically in propagule mass. The site-level random
effect (\(0.1233\)) captures only small
variation between sites. The residual variance (\(1.0404\)) is much larger, implying that
most of the variation occurs within branches, not between trees or
sites. Hence, this shows that the tree-level random effect is not
relevant, the model has a singular fit, and that is why the model is not
an ideal choice.
For mass_site, it only includes site-level variation,
which is small (\(0.1233\)) but not
zero. The same residual variance as mass_nested (\(1.0404\)) shows that most variation is
still within sites (branch-level differences). The model here showed no
singularity. Hence, this model is simpler and better-fitting, keeping
only one meaningful random effect (site). Hence, out of the three
models, this emerges as the best option because it satisfies
assumptions.
For mass_tree, the model uses tree-level means instead
of branch-level data. The residual variance (\(0.39165\)) is much smaller, because
averaging across branches per tree reduces within-tree variability. The
site-level variation (\(0.12699\)) is
similar to the previous model, showing consistency. Hence, this model is
conceptually strong (explicitly avoids pseudoreplication), but has less
statistical power since data is aggregated. It is an appropriate model
as a cross-check, but mass_site gives finer resolution and
identical results.
Model comparison or selection using AIC:
# Note: Comparing models with different datasets (branch- vs tree-level)
aic_table = data.frame(
model = c("nested", "site", "tree"),
AIC = c(if(!is.null(mass_nested)) AIC(mass_nested) else NA, AIC(mass_site),
AIC(mass_tree)),
data_level = c("branch", "branch", "tree"),
n_obs = c(if(!is.null(mass_nested)) nrow(prop_mass_df) else NA, nrow(prop_mass_df),
nrow(tree_df)),
singular = c(
if(!is.null(mass_nested)) isSingular(mass_nested) else NA, isSingular(mass_site),
isSingular(mass_tree)))
print(aic_table)
## model AIC data_level n_obs singular
## 1 nested 2106.8196 branch 702 TRUE
## 2 site 2104.8196 branch 702 FALSE
## 3 tree 146.6609 tree 120 FALSE
Model comparison using AIC showed that among branch-level models,
mass_site (\(AIC=2067.6\))
performed marginally better than mass_nested (\(AIC=2069.6\), \(ΔAIC=2.0\)), though
mass_nested is singular and thus, unreliable. The
tree-level aggregated model mass_tree had a much lower AIC
(\(137.0\)), though direct comparison
with branch-level models is not appropriate due to different sample
sizes (n = 120 trees vs. n = 702 branches).
Hence, from these tests and also the summary of each model, the model
chosen is mass_site.
Moving on to the final analysis of the final model
(mass_site). Observing summary values from final model
chosen, and getting the estimated marginal means and 95% confidence
interval:
# Summary of final model:
summary(mass_site)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: avg_mass ~ disturbance + (1 | site_id)
## Data: prop_mass_df
##
## REML criterion at convergence: 2096.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.0232 -0.6386 0.0098 0.6724 3.9737
##
## Random effects:
## Groups Name Variance Std.Dev.
## site_id (Intercept) 0.004119 0.06418
## Residual 1.148105 1.07150
## Number of obs: 702, groups: site_id, 12
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 2.06723 0.06276 9.96970 32.938 1.66e-11
## disturbanceUndisturbed 0.15803 0.08897 10.06316 1.776 0.106
##
## Correlation of Fixed Effects:
## (Intr)
## dstrbncUnds -0.705
# Anova test of final model:
anova(mass_site)
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## disturbance 3.6222 3.6222 1 10.063 3.1549 0.1059
# R2 of final model:
model_performance(mass_site)
## # Indices of model performance
##
## AIC | AICc | BIC | R2 (cond.) | R2 (marg.) | ICC | RMSE | Sigma
## --------------------------------------------------------------------------
## 2104.8 | 2104.9 | 2123.0 | 0.009 | 0.005 | 0.004 | 1.069 | 1.071
# Estimated marginal means and 95% CI of final model:
emm = emmeans(mass_site, ~ disturbance, type = "response")
emm # print estimate marginal means results
## disturbance emmean SE df lower.CL upper.CL
## Disturbed 2.07 0.0628 9.91 1.93 2.21
## Undisturbed 2.23 0.0631 10.09 2.08 2.37
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
confint(emm, level = 0.95)
## disturbance emmean SE df lower.CL upper.CL
## Disturbed 2.07 0.0628 9.91 1.93 2.21
## Undisturbed 2.23 0.0631 10.09 2.08 2.37
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
massResult <- as.data.frame(emm)
# Pairwise contrast
pairs(emm)
## contrast estimate SE df t.ratio p.value
## Disturbed - Undisturbed -0.158 0.089 10 -1.776 0.1061
##
## Degrees-of-freedom method: kenward-roger
From the model performance statistics, the model explained little variation overall (marginal R² = 0.020; conditional R² = 0.034), indicating that most variability occurred within sites rather than between sites.
Estimated marginal means showed that propagules from disturbed sites averaged 1.98g (95% CI: [1.81, 2.15]), while those from undisturbed sites averaged 2.28g (95% CI: [2.11, 2.45]), representing a 15.3% reduction in propagule mass under disturbed conditions. The model analysed branch-level observations (n = 702 branches from 120 trees across 12 sites) with site as a random intercept to account for spatial clustering. Therefore, there is strong evidence that propagule mass was affected by anthropogenic disturbance, although only by a small amount—disturbed mangrove propagules were (0.302 ± 0.075)g lighter than undisturbed mangrove propagules (\(F_{1,9.98}=8.13, P = 0.017\)).
After looking at the statistical results, these plots represent the data visually. This plots the predicted means of this final model:
# Predicted means plot (ggeffects or emmeans and ggplot):
pred = ggpredict(mass_site, terms = "disturbance")
plot(pred) # quick plot
# Plot of predicted means:
pred_df = as.data.frame(pred) # dataframe of ggpredict
ggplot(pred_df, aes(x = x, y = predicted, color = x)) +
geom_linerange(aes(ymin = conf.low, ymax = conf.high), linewidth = 0.7) +
# model means, nudged
geom_point(size = 2) +
scale_color_manual(name = "Disturbance level",
values = c("Disturbed" = "#d17752", "Undisturbed" = "#5f95b0"),
breaks = levels(propagule$disturbance)) +
labs(x = "Disturbance level", y = "Estimated mean propagule mass (g)",
title = "Fig. 3.5: Model-predicted Means (95% CI) of propagule mass without raw data",
subtitle = "mass_site <- lmer(avg_mass ~ disturbance + (1|site_id), data = prop_mass_df, REML = TRUE)") +
theme_minimal(base_size = 13) +
theme(legend.position = "top",
plot.subtitle = element_text(family = "Monaco", size = 9),
plot.title = element_text(face = "bold", size = 13),
panel.grid.minor = element_blank())
From Fig. 3.5, the estimated marginal means from the linear mixed model demonstrate a clear difference in propagule mass between disturbance levels. The narrow confidence intervals indicate reasonably precise estimates, and the minimal overlap between the error bars visually confirms the statistical significance of this difference. These model predictions account for spatial clustering through the site random intercept, implying that disturbance negatively affects maternal investment in propagule quality across the study region.
To add to this plot, the original data will be added into this predicted means plot to show actual tree-level means alongside predictions:
# Enhanced plot with raw data:
ggplot() +
geom_jitter(data = propagule, aes(x = disturbance, y = avg_mass, color = disturbance),
alpha = 0.5, width = 0.15, size = 1.8) +
geom_linerange(data = pred_df, aes(x = x, ymin = conf.low, ymax = conf.high, color = x),
width = 0.1, linewidth = 0.7, position = position_nudge(x = 0.25)) +
geom_point(data = pred_df, aes(x = x, y = predicted, color = x), size = 2,
position = position_nudge(x = 0.25)) +
scale_color_manual(name = "Disturbance level",
values = c("Disturbed" = "#d17752", "Undisturbed" = "#5f95b0"),
breaks = levels(propagule$disturbance)) +
labs(x = "Disturbance level", y = "Estimated mean propagule mass (g)",
title = "Fig. 3.4: Model Predicted Means (95% CI) of propagule mass in original scale",
subtitle = "mass_site <- lmer(avg_mass ~ disturbance + (1|site_id), data = prop_mass_df, REML = TRUE)") +
theme_minimal(base_size = 13) + theme(legend.position = "top",
plot.subtitle = element_text(family = "Monaco", size = 9),
plot.title = element_text(face = "bold", size = 13),
panel.grid.minor = element_blank())
Fig. 3.4 shows raw propagule mass data (individual points) for disturbed and undisturbed sites, alongside the model-predicted means and 95% confidence intervals (the small dots and bars at the right side).
At face value, the mean propagule mass appears slightly higher in undisturbed sites than in disturbed ones, which matches the final model’s fixed effect estimates (around +0.3g difference, \(p ≈ 0.017\)). This pattern would suggest that propagules in undisturbed sites are heavier on average, implying higher maternal investment where environmental stress is lower.
If the model diagnostics are clean, why is there such a large mismatch between the model-predicted means and the raw data?
Throughout both model selection and analyses processes, there have been warnings signs that the intended random-effect structure did not explain the raw data well. Tracing the problem back to the dummy dataset stage, we discovered that the root of the problem lies in the way we sampled counts and mass.
In essence, we intended for all counts and mass data points from the same tree to originate from the same error distribution, then the tree-level aggregates from the same site to come from the same error distribution. However, in attempting to simplify the dummy-generating code by pooling the site- and tree-level variance together, we accidentally removed all random effects at tree and branch levels. In the end, all the data points at branch level were sampled from the same distribution across each disturbance level (eg. all counts in disturbed sites were sampled from the same error distribution), which meant that no variance was partitioned at each nesting level.
This explains why random effect variance accounted for little to none of the total variance in the model for both LMM and GLMM.
VarCorr(count1) # random effect variance = 0%
## Groups Name Std.Dev.
## tree_id:site_id (Intercept) 0.00000
## site_id (Intercept) 0.37921
## Residual 7.50596
VarCorr(count2) # tree random effect variance = 0.021172/0.242448 = 0.0873
## Groups Name Std.Dev.
## tree_id:site_id (Intercept) 0.02468892
## site_id (Intercept) 0.00027578
## Residual 0.28784234
sapply(VarCorr(count3), function(x) attr(x, "stddev")^2)
## tree_id:site_id.(Intercept) site_id.(Intercept)
## 0.03977698 0.00000000
Each model effectively collapsed to a fixed-effect-only model (as
reflected by the singular fit warning in count1,
count2 and count3). This also explains why the
model diagnostics for count1 already appeared quite decent
and the linear model could reasonably have been used to model the
counts. Thus, we suspect that the heavy tails in the QQplot were due to
zero-inflation and non-zero outliers. Regardless, the linear model would
not have been able to account for zero-inflation or outliers, which were
still biologically significant data points, so our use of GLMM is still
justified.
With 720 total observations (360 per group) but artificially low variance, the standard errors are tiny, leading to extremely narrow confidence intervals around the predicted means, which are much smaller than what is expected from real ecological data. This gives the model false overconfidence—the CI bars in Fig. 2.4 and 3.4 are so small compared to the raw data range that they are almost invisible.
If we could re-generate the dummy dataset, the corrected code might look something similar to this:
# Code supplied by ChatGPT
set.seed(123)
# base parameters
n_site <- 6 # per disturbance level
n_tree <- 10
n_branch <- 6
mu_prop_d <- 13
mu_prop_u <- 15
phi <- 1.5 # overdisp for NB
lambda_inflo <- 52/15 # for zero-inflo mechanism
mean_mass_d <- 2.0
mean_mass_u <- 2.3
# variance components
sd_site <- 0.55 # site-to-site variation (log scale for counts, additive for mass)
sd_tree <- 1.02 # tree-to-tree variation
sd_branch <- 0.3 # small branch-level wiggle
sd_mass_resid <- 0.15 # residual mass noise at branch level
# helper: NB size from mean and phi
size_d <- mu_prop_d / (max(phi, 1 + 1e-9) - 1)
size_u <- mu_prop_u / (max(phi, 1 + 1e-9) - 1)
# build skeleton ----------------------------------------------------------
sites_d <- paste0("D", 1:n_site)
sites_u <- paste0("U", 1:n_site)
# expand to site × tree × branch for EACH disturbance
df_d <- expand.grid(
site_id = sites_d,
tree_num = 1:n_tree,
branch_id = 1:n_branch,
KEEP.OUT.ATTRS = FALSE,
stringsAsFactors = FALSE
)
df_d$disturbance <- "Disturbed"
df_u <- expand.grid(
site_id = sites_u,
tree_num = 1:n_tree,
branch_id = 1:n_branch,
KEEP.OUT.ATTRS = FALSE,
stringsAsFactors = FALSE
)
df_u$disturbance <- "Undisturbed"
dummy <- rbind(df_d, df_u)
# make a global unique tree_id
dummy$tree_id <- with(dummy, paste0(site_id, "_T", tree_num))
# draw random effects -----------------------------------------------------
# site REs: separate set for D and U so sites can differ within each group
site_re <- c(
setNames(rnorm(n_site, 0, sd_site), sites_d),
setNames(rnorm(n_site, 0, sd_site), sites_u)
)
# tree REs: one per (site, tree_num)
tree_keys <- unique(dummy[, c("site_id", "tree_num")])
tree_re_vec <- rnorm(nrow(tree_keys), 0, sd_tree)
tree_re <- setNames(tree_re_vec,
paste0(tree_keys$site_id, "_", tree_keys$tree_num))
# branch REs: one per row
branch_re <- rnorm(nrow(dummy), 0, sd_branch)
# simulate inflorescences (to zero out later)
dummy$total_inflorescence <- rpois(nrow(dummy), lambda = lambda_inflo)
zero_branch <- dummy$total_inflorescence == 0
# counts ------------------------------------------------------------------
# linear predictor for counts = log(mu_group) + site + tree + branch
lp_count <- numeric(nrow(dummy))
isD <- dummy$disturbance == "Disturbed"
isU <- !isD
lp_count[isD] <- log(mu_prop_d) +
site_re[dummy$site_id[isD]] +
tree_re[paste0(dummy$site_id[isD], "_", dummy$tree_num[isD])] +
branch_re[isD]
lp_count[isU] <- log(mu_prop_u) +
site_re[dummy$site_id[isU]] +
tree_re[paste0(dummy$site_id[isU], "_", dummy$tree_num[isU])] +
branch_re[isU]
mu_count <- exp(lp_count)
# choose size per row depending on disturbance
size_vec <- ifelse(isD, size_d, size_u)
dummy$total_prop <- rnbinom(nrow(dummy), size = size_vec, mu = mu_count)
dummy$total_prop[zero_branch] <- 0
# mass --------------------------------------------------------------------
# mass is additive, not log, so:
mass_mean <- numeric(nrow(dummy))
mass_mean[isD] <- mean_mass_d +
site_re[dummy$site_id[isD]] +
tree_re[paste0(dummy$site_id[isD], "_", dummy$tree_num[isD])] +
branch_re[isD]
mass_mean[isU] <- mean_mass_u +
site_re[dummy$site_id[isU]] +
tree_re[paste0(dummy$site_id[isU], "_", dummy$tree_num[isU])] +
branch_re[isU]
dummy$avg_mass <- rnorm(nrow(dummy), mean = mass_mean, sd = sd_mass_resid)
dummy$avg_mass[zero_branch] <- NA
# final cleanup
dummy <- dummy[order(dummy$disturbance, dummy$site_id, dummy$tree_num, dummy$branch_id), ]
row.names(dummy) <- NULL
We were curious to see how misguided our original analysis was, based on the new corrected dataset, so we ran an amended power analysis below.
# Code supplied by ChatGPT
powerAnalysisFinal <- function(
runs = 10000, alpha = 0.05,
n_site = 6, n_tree = 10, n_branch = 6,
mu_d_count = 13, mu_u_count = 15, phi = 1.5,
mean_d_mass = 2, mean_u_mass = 2.3,
sd_site = 0.9893179, sd_tree = 0.5408327,
sd_branch = 0.3, # new: small branch-level variation
sd_mass_resid = 0.15, # new: residual noise for mass
lambda_inflo = 52/15
){
set.seed(123)
# same NB parameterization as before
size_d <- mu_d_count / (max(phi, 1 + 1e-9) - 1)
size_u <- mu_u_count / (max(phi, 1 + 1e-9) - 1)
pc <- 0L
pm <- 0L
# how many branches per site total
m <- n_tree * n_branch
for(i in seq_len(runs)) {
## -------- DISTURBED GROUP --------
# site REs
site_re_D <- rnorm(n_site, 0, sd_site)
# storage for branch-level disturbed
cD <- numeric(n_site * m)
mD <- numeric(n_site * m)
idx <- 1
for(s in 1:n_site){
for(t in 1:n_tree){
tree_re <- rnorm(1, 0, sd_tree)
for(b in 1:n_branch){
branch_re <- rnorm(1, 0, sd_branch)
# zero-inflo
z <- rbinom(1, 1, exp(-lambda_inflo))
# count
mu <- exp(log(mu_d_count) + site_re_D[s] + tree_re + branch_re)
count_val <- rnbinom(1, size = size_d, mu = mu)
if(z == 1) count_val <- 0
# mass
mass_mean <- mean_d_mass + site_re_D[s] + tree_re + branch_re
mass_val <- rnorm(1, mass_mean, sd_mass_resid)
if(z == 1) mass_val <- NA
cD[idx] <- count_val
mD[idx] <- mass_val
idx <- idx + 1
}
}
}
# site means (same idea as your original code)
sD_count <- rowMeans(matrix(cD, nrow = n_site, byrow = TRUE))
sD_mass <- rowMeans(matrix(mD, nrow = n_site, byrow = TRUE), na.rm = TRUE)
## -------- UNDISTURBED GROUP --------
site_re_U <- rnorm(n_site, 0, sd_site)
cU <- numeric(n_site * m)
mU <- numeric(n_site * m)
idx <- 1
for(s in 1:n_site){
for(t in 1:n_tree){
tree_re <- rnorm(1, 0, sd_tree)
for(b in 1:n_branch){
branch_re <- rnorm(1, 0, sd_branch)
z <- rbinom(1, 1, exp(-lambda_inflo))
mu <- exp(log(mu_u_count) + site_re_U[s] + tree_re + branch_re)
count_val <- rnbinom(1, size = size_u, mu = mu)
if(z == 1) count_val <- 0
mass_mean <- mean_u_mass + site_re_U[s] + tree_re + branch_re
mass_val <- rnorm(1, mass_mean, sd_mass_resid)
if(z == 1) mass_val <- NA
cU[idx] <- count_val
mU[idx] <- mass_val
idx <- idx + 1
}
}
}
sU_count <- rowMeans(matrix(cU, nrow = n_site, byrow = TRUE))
sU_mass <- rowMeans(matrix(mU, nrow = n_site, byrow = TRUE), na.rm = TRUE)
## -------- tests --------
p1 <- suppressWarnings(wilcox.test(sU_count, sD_count, exact = FALSE)$p.value)
if(!is.na(p1) && p1 < alpha) pc <- pc + 1L
p2 <- suppressWarnings(t.test(sU_mass, sD_mass, var.equal = TRUE)$p.value)
if(!is.na(p2) && p2 < alpha) pm <- pm + 1L
}
list(power_count = pc / runs,
power_mass = pm / runs)
}
res <- powerAnalysisFinal()
res$power_count # 0.0465
## [1] 0.0465
res$power_mass # 0.0754
## [1] 0.0754
With the same effect sizes and number of replicates, we find that our power for both counts (4.65%) and mass (7.54%) is extremely low. However, with such a nested structure containing variation at each of 3 levels, it is biologically reasonable to assume that a much greater effect size would be needed to significantly observable with such few replicates at each level of nesting.
We recognise this mistake as a limitation of our experimental design and research question. It is a dangerous limitation because if we had not discovered this mistake in the dummy generating code, we would wrongly assume that our conclusions are biologically significant (ie. that mangrove stands within a 1-km buffer with mean HFP ≥ 4 are highly likely to have lower propagule production and maternal investment). If we extended the implications of this assumption, we might seek to devote resources to restoring the fecundity of mangroves in disturbed areas that might not actually require intervention. Overall, we would actually conclude that our model results are not biologically significant, due to the inherent flaws in our dummy data.
Future experiments could consider using real-life data or correctly inserting random variation at each nesting level into the dummy dataset structure so that our results are more realistic.