arrival_results_first50.csv holds fire arrival-time
results (minutes to reach the target cells) for 50 ignitions per
landscape scenario, crossed over the full design: 8
fuel/moisture factors (invaded/non-invaded x live/dead x fuel
load/moisture, each Low/Medium/High -> 3^8 = 6,561 combinations), 11
cogongrass cover levels (0-100% by 10s), and up to 3 spatial
distributions (Random, ModClump, Clump; distribution is undefined at 0%
and 100% cover). That is 190,269 unique landscape scenarios x 50
ignitions = 9,513,450 rows.
The proposal specifies a multiple linear regression (MLR) of fire risk on cogongrass cover, distribution, and their interaction, controlling for weather/fuel conditions. This script (1) builds that model correctly at the right unit of analysis, (2) tests whether MLR’s assumptions actually hold for this response, and (3) if they don’t, fits and justifies a better-suited model, while still reporting the same effects the proposal asks about.
Why aggregate: each landscape scenario has 50
ignition points, but ignition location doesn’t change any of the design
variables (cover, distribution, fuel/moisture). Fitting a regression on
all 9.5M ignition-level rows would treat 50 pseudo-replicates as
independent observations for every landscape-level predictor, inflating
degrees of freedom and Type I error ~50-fold. The correct unit of
analysis for cover/distribution/fuel effects is the
scenario (n = 190,269), with the 50 ignitions
summarized into a mean (and SD, for reference) per scenario. This
mirrors the aggregation step already used in
First_50_Ign.Rmd.
We use DuckDB to aggregate directly against the CSV rather than
reading all 2 GB into memory with fread; it finishes in
well under a minute and only the small aggregated table is loaded into
R.
con <- dbConnect(duckdb())
scenario_dt <- dbGetQuery(con, sprintf("
SELECT cover_percent, distribution,
inv_moist_live, inv_moist_dead, ni_moist_live, ni_moist_dead,
inv_fuel_live, inv_fuel_dead, ni_fuel_live, ni_fuel_dead,
AVG(arrival_mean) AS arrival_avg,
STDDEV(arrival_mean) AS arrival_sd,
COUNT(*) AS n_ign
FROM read_csv_auto('%s')
GROUP BY ALL
", input_path))
dbDisconnect(con, shutdown = TRUE)
setDT(scenario_dt)
cat("Scenarios:", nrow(scenario_dt), "\n")
## Scenarios: 190269
cat("Replicate count per scenario:\n")
## Replicate count per scenario:
print(table(scenario_dt$n_ign)) # should all be 50
##
## 50
## 190269
# ---- Alternative if you don't have the duckdb package / prefer fread ----
# dt <- fread(input_path)
# group_cols <- c("cover_percent","distribution","inv_moist_live","inv_moist_dead",
# "ni_moist_live","ni_moist_dead","inv_fuel_live","inv_fuel_dead",
# "ni_fuel_live","ni_fuel_dead")
# scenario_dt <- dt[, .(arrival_avg = mean(arrival_mean), arrival_sd = sd(arrival_mean),
# n_ign = .N), by = group_cols]
stopifnot(all(scenario_dt$n_ign == 50))
stopifnot(all(scenario_dt$arrival_avg > 0))
stopifnot(sum(is.na(scenario_dt$arrival_avg)) == 0)
# distribution is NA only at the 0%/100% cover extremes (no pattern possible)
cover_for_na <- scenario_dt[is.na(distribution), unique(cover_percent)]
cat("cover_percent values with distribution == NA:", paste(cover_for_na, collapse = ", "), "\n")
## cover_percent values with distribution == NA: 0, 100
stopifnot(setequal(cover_for_na, range(scenario_dt$cover_percent)))
scenario_dt[, distribution := fifelse(is.na(distribution), "CompleteDominance", distribution)]
scenario_dt[, distribution := factor(distribution,
levels = c("CompleteDominance", "Random", "ModClump", "Clump"))]
mf_cols <- c("inv_moist_live","inv_moist_dead","ni_moist_live","ni_moist_dead",
"inv_fuel_live","inv_fuel_dead","ni_fuel_live","ni_fuel_dead")
scenario_dt[, (mf_cols) := lapply(.SD, factor, levels = c("L","M","H")), .SDcols = mf_cols]
# Center cover so the intercept = predicted arrival time at the midpoint (50%)
scenario_dt[, cover_c := cover_percent - 50]
# Sum-to-zero contrasts so Type III tests below are interpretable
options(contrasts = c("contr.sum", "contr.poly"))
p1 <- ggplot(scenario_dt, aes(arrival_avg)) +
geom_histogram(bins = 60, fill = "steelblue") +
labs(title = "Distribution of scenario-mean arrival time", x = "Arrival time (min)") +
theme_minimal()
p2 <- ggplot(scenario_dt, aes(sample = arrival_avg)) +
stat_qq() + stat_qq_line(color = "red") +
labs(title = "QQ plot (raw arrival time)") +
theme_minimal()
p1 + p2
cat("Skewness:", moments::skewness(scenario_dt$arrival_avg), "\n")
## Skewness: 2.124421
cat("Mean:", mean(scenario_dt$arrival_avg), " Median:", median(scenario_dt$arrival_avg), "\n")
## Mean: 75.386 Median: 58.46665
Arrival time is strictly positive and right-skewed (mean > median), which is typical for a duration/rate-type outcome and is already a warning sign for plain OLS.
ggplot(scenario_dt, aes(x = factor(cover_percent), y = arrival_avg, fill = distribution)) +
geom_boxplot(outlier.size = 0.3) +
labs(title = "Arrival time by cogongrass cover and distribution",
x = "Cogongrass cover (%)", y = "Arrival time (min)", fill = "Distribution") +
theme_minimal()
Formula follows the proposal: cover x distribution as the focal interaction, plus fuel-moisture and fuel-load conditions as covariates (these stand in for the “weather” term in the proposal — this dataset varies fuel moisture and fuel load directly across quantile levels rather than two discrete weather scenarios, giving finer resolution on the same idea).
form <- arrival_avg ~ cover_c * distribution +
cover_c * (inv_moist_live + inv_moist_dead + ni_moist_live + ni_moist_dead) +
distribution * (inv_moist_live + inv_moist_dead + ni_moist_live + ni_moist_dead) +
inv_fuel_live + inv_fuel_dead + ni_fuel_live + ni_fuel_dead
m_ols <- lm(form, data = scenario_dt)
summary(m_ols)$r.squared
## [1] 0.7684059
performance::check_model(m_ols, check = c("linearity","homogeneity","qq","outliers"))
# Normality — Shapiro's test is over-powered at n > 5000, so test a random
# subsample and report skew/kurtosis for effect size context.
set.seed(1)
resid_sample <- sample(residuals(m_ols), 5000)
shapiro.test(resid_sample)
##
## Shapiro-Wilk normality test
##
## data: resid_sample
## W = 0.91322, p-value < 2.2e-16
cat("Residual skew:", moments::skewness(residuals(m_ols)),
" kurtosis:", moments::kurtosis(residuals(m_ols)), "\n")
## Residual skew: 1.373485 kurtosis: 6.890738
# Homoscedasticity
car::ncvTest(m_ols)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 120490.2, Df = 1, p = < 2.22e-16
# Multicollinearity
car::vif(m_ols)
## GVIF Df GVIF^(1/(2*Df))
## cover_c 1.006250 1 1.003120
## distribution 1.000000 3 1.000000
## inv_moist_live 2.281359 2 1.228990
## inv_moist_dead 2.281359 2 1.228990
## ni_moist_live 2.281359 2 1.228990
## ni_moist_dead 2.281359 2 1.228990
## inv_fuel_live 1.000000 2 1.000000
## inv_fuel_dead 1.000000 2 1.000000
## ni_fuel_live 1.000000 2 1.000000
## ni_fuel_dead 1.000000 2 1.000000
## cover_c:distribution 1.006250 3 1.001039
## cover_c:inv_moist_live 1.000000 2 1.000000
## cover_c:inv_moist_dead 1.000000 2 1.000000
## cover_c:ni_moist_live 1.000000 2 1.000000
## cover_c:ni_moist_dead 1.000000 2 1.000000
## distribution:inv_moist_live 2.281359 6 1.071148
## distribution:inv_moist_dead 2.281359 6 1.071148
## distribution:ni_moist_live 2.281359 6 1.071148
## distribution:ni_moist_dead 2.281359 6 1.071148
Read: residuals are right-skewed with heavy tails,
the QQ plot bows away from the line, and the non-constant variance test
is strongly significant — variance clearly grows with the mean. VIFs are
all low (this is a balanced factorial design, so multicollinearity was
never expected to be the issue). Raw MLR violates the normality
and homoscedasticity assumptions it depends on, so p-values and
CIs from m_ols are not trustworthy as-is.
A standard fix for right-skewed, strictly-positive responses.
scenario_dt[, log_arrival := log(arrival_avg)]
form_log <- update(form, log_arrival ~ .)
m_log <- lm(form_log, data = scenario_dt)
summary(m_log)$r.squared
## [1] 0.8755868
performance::check_model(m_log, check = c("linearity","homogeneity","qq","outliers"))
set.seed(1)
shapiro.test(sample(residuals(m_log), 5000))
##
## Shapiro-Wilk normality test
##
## data: sample(residuals(m_log), 5000)
## W = 0.9666, p-value < 2.2e-16
car::ncvTest(m_log)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 20564.97, Df = 1, p = < 2.22e-16
Better (R² rises, skew and kurtosis shrink substantially), but residuals still show mild remaining skew and the interpretation now requires back-transforming log-scale coefficients (with a retransformation-bias correction) to talk about minutes — awkward for a dissertation chapter.
A Gamma GLM models a strictly-positive, right-skewed continuous
response directly, assumes variance proportional to the square of the
mean (a good match for what ncvTest found above), and the
log link keeps effects on an interpretable multiplicative scale — e.g.,
“cover increases arrival time by X%” — without manual
back-transformation.
m_gamma <- glm(update(form, arrival_avg ~ .), data = scenario_dt,
family = Gamma(link = "log"))
summary(m_gamma)
##
## Call:
## glm(formula = update(form, arrival_avg ~ .), family = Gamma(link = "log"),
## data = scenario_dt)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.198e+00 5.795e-04 7242.851 < 2e-16 ***
## cover_c -7.510e-03 1.680e-05 -447.096 < 2e-16 ***
## distribution1 7.350e-02 1.396e-03 52.658 < 2e-16 ***
## distribution2 -5.657e-02 8.332e-04 -67.901 < 2e-16 ***
## distribution3 -3.604e-02 8.332e-04 -43.262 < 2e-16 ***
## inv_moist_live1 -1.681e-01 8.196e-04 -205.092 < 2e-16 ***
## inv_moist_live2 3.712e-02 8.196e-04 45.291 < 2e-16 ***
## inv_moist_dead1 -1.819e-01 8.196e-04 -221.909 < 2e-16 ***
## inv_moist_dead2 -7.109e-02 8.196e-04 -86.736 < 2e-16 ***
## ni_moist_live1 -4.169e-02 8.196e-04 -50.865 < 2e-16 ***
## ni_moist_live2 -1.154e-02 8.196e-04 -14.083 < 2e-16 ***
## ni_moist_dead1 -2.472e-01 8.196e-04 -301.604 < 2e-16 ***
## ni_moist_dead2 -1.320e-01 8.196e-04 -161.013 < 2e-16 ***
## inv_fuel_live1 -6.669e-02 6.669e-04 -100.000 < 2e-16 ***
## inv_fuel_live2 -1.378e-02 6.669e-04 -20.668 < 2e-16 ***
## inv_fuel_dead1 2.167e-01 6.669e-04 324.908 < 2e-16 ***
## inv_fuel_dead2 -2.925e-02 6.669e-04 -43.854 < 2e-16 ***
## ni_fuel_live1 -1.419e-02 6.669e-04 -21.285 < 2e-16 ***
## ni_fuel_live2 7.695e-04 6.669e-04 1.154 0.24856
## ni_fuel_dead1 1.404e-01 6.669e-04 210.464 < 2e-16 ***
## ni_fuel_dead2 -1.162e-02 6.669e-04 -17.417 < 2e-16 ***
## cover_c:distribution1 2.265e-04 3.045e-05 7.438 1.03e-13 ***
## cover_c:distribution2 -5.752e-04 2.863e-05 -20.092 < 2e-16 ***
## cover_c:distribution3 4.748e-04 2.863e-05 16.586 < 2e-16 ***
## cover_c:inv_moist_live1 -3.474e-03 2.368e-05 -146.684 < 2e-16 ***
## cover_c:inv_moist_live2 7.071e-04 2.368e-05 29.859 < 2e-16 ***
## cover_c:inv_moist_dead1 -3.611e-03 2.368e-05 -152.469 < 2e-16 ***
## cover_c:inv_moist_dead2 -1.413e-03 2.368e-05 -59.650 < 2e-16 ***
## cover_c:ni_moist_live1 8.620e-04 2.368e-05 36.400 < 2e-16 ***
## cover_c:ni_moist_live2 2.406e-04 2.368e-05 10.160 < 2e-16 ***
## cover_c:ni_moist_dead1 5.300e-03 2.368e-05 223.818 < 2e-16 ***
## cover_c:ni_moist_dead2 2.764e-03 2.368e-05 116.721 < 2e-16 ***
## distribution1:inv_moist_live1 -7.355e-04 1.974e-03 -0.373 0.70942
## distribution2:inv_moist_live1 -9.764e-03 1.178e-03 -8.287 < 2e-16 ***
## distribution3:inv_moist_live1 -3.224e-03 1.178e-03 -2.737 0.00621 **
## distribution1:inv_moist_live2 -2.813e-03 1.974e-03 -1.425 0.15418
## distribution2:inv_moist_live2 3.407e-03 1.178e-03 2.892 0.00383 **
## distribution3:inv_moist_live2 1.695e-03 1.178e-03 1.439 0.15022
## distribution1:inv_moist_dead1 4.571e-03 1.974e-03 2.316 0.02057 *
## distribution2:inv_moist_dead1 -1.280e-02 1.178e-03 -10.860 < 2e-16 ***
## distribution3:inv_moist_dead1 -6.437e-03 1.178e-03 -5.463 4.69e-08 ***
## distribution1:inv_moist_dead2 2.207e-03 1.974e-03 1.118 0.26358
## distribution2:inv_moist_dead2 -2.580e-03 1.178e-03 -2.190 0.02852 *
## distribution3:inv_moist_dead2 -1.586e-03 1.178e-03 -1.346 0.17834
## distribution1:ni_moist_live1 5.575e-04 1.974e-03 0.282 0.77760
## distribution2:ni_moist_live1 5.878e-03 1.178e-03 4.989 6.08e-07 ***
## distribution3:ni_moist_live1 1.170e-03 1.178e-03 0.993 0.32083
## distribution1:ni_moist_live2 9.790e-04 1.974e-03 0.496 0.61993
## distribution2:ni_moist_live2 1.397e-03 1.178e-03 1.186 0.23577
## distribution3:ni_moist_live2 1.075e-04 1.178e-03 0.091 0.92733
## distribution1:ni_moist_dead1 -1.019e-02 1.974e-03 -5.163 2.44e-07 ***
## distribution2:ni_moist_dead1 3.314e-02 1.178e-03 28.125 < 2e-16 ***
## distribution3:ni_moist_dead1 1.363e-02 1.178e-03 11.570 < 2e-16 ***
## distribution1:ni_moist_dead2 1.458e-03 1.974e-03 0.739 0.46011
## distribution2:ni_moist_dead2 2.653e-02 1.178e-03 22.520 < 2e-16 ***
## distribution3:ni_moist_dead2 7.016e-03 1.178e-03 5.954 2.62e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.04231123)
##
## Null deviance: 61454.4 on 190268 degrees of freedom
## Residual deviance: 7130.5 on 190213 degrees of freedom
## AIC: 1499877
##
## Number of Fisher Scoring iterations: 6
set.seed(1)
sim_res <- simulateResiduals(m_gamma, n = 500)
plot(sim_res)
testDispersion(sim_res)
##
## DHARMa nonparametric dispersion test via sd of residuals fitted vs.
## simulated
##
## data: simulationOutput
## dispersion = 1.5714, p-value < 2.2e-16
## alternative hypothesis: two.sided
pseudo_r2 <- function(mod) {
null_mod <- update(mod, . ~ 1)
1 - (mod$deviance / null_mod$deviance)
}
comp_tbl <- data.table(
model = c("OLS (raw)", "OLS (log)", "Gamma GLM (log link)"),
AIC = c(AIC(m_ols), AIC(m_log), AIC(m_gamma)),
R2_or_pseudoR2 = c(summary(m_ols)$r.squared, summary(m_log)$r.squared, pseudo_r2(m_gamma)),
resid_skew = c(moments::skewness(residuals(m_ols)),
moments::skewness(residuals(m_log)),
moments::skewness(residuals(m_gamma, type = "deviance"))),
ncv_test_p = c(car::ncvTest(m_ols)$p, car::ncvTest(m_log)$p, NA)
)
print(comp_tbl)
## model AIC R2_or_pseudoR2 resid_skew ncv_test_p
## <char> <num> <num> <num> <num>
## 1: OLS (raw) 1755347.51 0.7684059 1.3734852 0
## 2: OLS (log) -93973.15 0.8755868 0.8036222 0
## 3: Gamma GLM (log link) 1499876.65 0.8839701 0.9166239 NA
fwrite(comp_tbl, file.path(out_dir, "model_comparison.csv"))
Conclusion: raw MLR is not a good fit for this
response — it fails both normality and constant-variance assumptions,
driven by the right skew and mean-variance relationship inherent to
arrival-time data. The Gamma GLM with a log link is the
best-justified choice: it has the highest deviance-explained pseudo-R²,
directly accommodates the skew/variance structure DHARMa confirms is
well-behaved, and needs no post-hoc transformation to interpret. This
also matches the direction already taken in
First_50_Ign.Rmd; this analysis adds the diagnostic
evidence for why that choice is correct. m_gamma is
used as the primary model going forward.
car::Anova(m_gamma, type = "III")
## Analysis of Deviance Table (Type III tests)
##
## Response: arrival_avg
## LR Chisq Df Pr(>Chisq)
## cover_c 197992 1 < 2.2e-16 ***
## distribution 7119 3 < 2.2e-16 ***
## inv_moist_live 44510 2 < 2.2e-16 ***
## inv_moist_dead 104597 2 < 2.2e-16 ***
## ni_moist_live 4705 2 < 2.2e-16 ***
## ni_moist_dead 230873 2 < 2.2e-16 ***
## inv_fuel_live 16667 2 < 2.2e-16 ***
## inv_fuel_dead 123879 2 < 2.2e-16 ***
## ni_fuel_live 573 2 < 2.2e-16 ***
## ni_fuel_dead 54445 2 < 2.2e-16 ***
## cover_c:distribution 590 3 < 2.2e-16 ***
## cover_c:inv_moist_live 24356 2 < 2.2e-16 ***
## cover_c:inv_moist_dead 48530 2 < 2.2e-16 ***
## cover_c:ni_moist_live 2439 2 < 2.2e-16 ***
## cover_c:ni_moist_dead 120833 2 < 2.2e-16 ***
## distribution:inv_moist_live 227 6 < 2.2e-16 ***
## distribution:inv_moist_dead 471 6 < 2.2e-16 ***
## distribution:ni_moist_live 120 6 < 2.2e-16 ***
## distribution:ni_moist_dead 6397 6 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans by default builds a reference grid that crosses
every factor in the model, including the 8 fuel/moisture
factors (3^8 = 6,561 combinations) — that blows well past its row limit
even though we only asked about cover and distribution. The fix is to
hold the fuel/moisture factors fixed at a single reference level (“M” =
medium, the middle quantile) via at, so the grid only
varies the two factors we actually want to plot. Predictions below are
therefore “at medium fuel-moisture conditions”; re-run with a different
ref_at to see how the picture shifts under wetter/drier or
higher/lower fuel-load conditions.
ref_at <- setNames(as.list(rep("M", length(mf_cols))), mf_cols)
emm_grid <- emmeans(m_gamma, ~ cover_c * distribution,
at = c(list(cover_c = seq(-50, 50, 10)), ref_at),
type = "response")
emm_df <- as.data.frame(emm_grid)
# emmeans names the CI columns differently depending on version/df method
# (e.g. "asymp.LCL"/"asymp.UCL" vs "lower.CL"/"upper.CL") -- detect rather
# than hardcode so this doesn't break across setups.
lcl_col <- grep("LCL$|lower\\.CL$", names(emm_df), value = TRUE)[1]
ucl_col <- grep("UCL$|upper\\.CL$", names(emm_df), value = TRUE)[1]
emm_df$LCL <- emm_df[[lcl_col]]
emm_df$UCL <- emm_df[[ucl_col]]
ggplot(emm_df, aes(x = cover_c + 50, y = response, color = distribution, fill = distribution)) +
geom_line(linewidth = 1) +
geom_ribbon(aes(ymin = LCL, ymax = UCL), alpha = 0.15, color = NA) +
labs(title = "Predicted fire arrival time by cogongrass cover and distribution",
x = "Cogongrass cover (%)", y = "Predicted arrival time (min)",
color = "Distribution", fill = "Distribution") +
theme_minimal()
ggsave(file.path(out_dir, "cover_by_distribution_effect.png"), width = 8, height = 5, dpi = 300)
Same issue applies here: for each single factor v, hold
every other fuel/moisture factor at “M” (cover_c is continuous,
so emmeans already just uses its mean rather than expanding it) so the
grid stays small.
fuel_terms <- c("inv_moist_live","inv_moist_dead","ni_moist_live","ni_moist_dead",
"inv_fuel_live","inv_fuel_dead","ni_fuel_live","ni_fuel_dead")
fuel_plots <- lapply(fuel_terms, function(v) {
at_list <- ref_at
at_list[[v]] <- NULL # let this factor vary over its own L/M/H levels
emm <- emmeans(m_gamma, as.formula(paste("~", v)), at = at_list, type = "response")
df <- as.data.frame(emm)
lcl_col <- grep("LCL$|lower\\.CL$", names(df), value = TRUE)[1]
ucl_col <- grep("UCL$|upper\\.CL$", names(df), value = TRUE)[1]
df$LCL <- df[[lcl_col]]
df$UCL <- df[[ucl_col]]
ggplot(df, aes(x = .data[[v]], y = response)) +
geom_pointrange(aes(ymin = LCL, ymax = UCL)) +
labs(title = v, x = NULL, y = "Arrival time (min)") +
theme_minimal(base_size = 9)
})
wrap_plots(fuel_plots, ncol = 4)
ggsave(file.path(out_dir, "fuel_moisture_main_effects.png"), width = 12, height = 6, dpi = 300)
The proposal describes distribution using a continuous clustering
parameter p (Random = 0, ModClump = 0.3, Clump = 0.55) rather
than a categorical factor. As a robustness check (restricted to 10-90%
cover, where a pattern exists), we refit with p continuous
to see whether the categorical model’s conclusions hold under a
linear-in-p assumption.
p_map <- c(Random = 0, ModClump = 0.3, Clump = 0.55)
sens_dt <- scenario_dt[distribution != "CompleteDominance"]
sens_dt[, clustering_p := p_map[as.character(distribution)]]
m_gamma_p <- glm(arrival_avg ~ cover_c * clustering_p +
inv_moist_live + inv_moist_dead + ni_moist_live + ni_moist_dead +
inv_fuel_live + inv_fuel_dead + ni_fuel_live + ni_fuel_dead,
data = sens_dt, family = Gamma(link = "log"))
summary(m_gamma_p)
##
## Call:
## glm(formula = arrival_avg ~ cover_c * clustering_p + inv_moist_live +
## inv_moist_dead + ni_moist_live + ni_moist_dead + inv_fuel_live +
## inv_fuel_dead + ni_fuel_live + ni_fuel_dead, family = Gamma(link = "log"),
## data = sens_dt)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.155e+00 1.156e-03 3593.796 <2e-16 ***
## cover_c -7.627e-03 4.478e-05 -170.311 <2e-16 ***
## clustering_p 1.335e-01 3.197e-03 41.766 <2e-16 ***
## inv_moist_live1 -1.665e-01 1.016e-03 -163.811 <2e-16 ***
## inv_moist_live2 3.663e-02 1.016e-03 36.036 <2e-16 ***
## inv_moist_dead1 -1.833e-01 1.016e-03 -180.374 <2e-16 ***
## inv_moist_dead2 -7.587e-02 1.016e-03 -74.639 <2e-16 ***
## ni_moist_live1 -4.284e-02 1.016e-03 -42.145 <2e-16 ***
## ni_moist_live2 -1.219e-02 1.016e-03 -11.989 <2e-16 ***
## ni_moist_dead1 -2.469e-01 1.016e-03 -242.897 <2e-16 ***
## ni_moist_dead2 -1.424e-01 1.016e-03 -140.049 <2e-16 ***
## inv_fuel_live1 -6.896e-02 1.016e-03 -67.840 <2e-16 ***
## inv_fuel_live2 -1.383e-02 1.016e-03 -13.610 <2e-16 ***
## inv_fuel_dead1 2.208e-01 1.016e-03 217.243 <2e-16 ***
## inv_fuel_dead2 -3.007e-02 1.016e-03 -29.583 <2e-16 ***
## ni_fuel_live1 -1.495e-02 1.016e-03 -14.704 <2e-16 ***
## ni_fuel_live2 9.315e-04 1.016e-03 0.916 0.3595
## ni_fuel_dead1 1.449e-01 1.016e-03 142.542 <2e-16 ***
## ni_fuel_dead2 -1.317e-02 1.016e-03 -12.959 <2e-16 ***
## cover_c:clustering_p 2.461e-04 1.238e-04 1.987 0.0469 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.09151222)
##
## Null deviance: 53985 on 177146 degrees of freedom
## Residual deviance: 13333 on 177127 degrees of freedom
## AIC: 1519017
##
## Number of Fisher Scoring iterations: 6
Compare the sign/significance of cover_c:clustering_p
here to the categorical interaction above — agreement means the
conclusion about how cover and distribution jointly affect arrival time
doesn’t hinge on which way distribution is coded.
Everything above works on scenario means, which is the statistically correct unit for testing landscape-level effects. As an optional robustness check, a Gamma GLMM with a random intercept per scenario models the full ignition-level data (9.5M rows) directly instead of aggregating first, and should recover the same fixed-effect conclusions.
This is computationally heavy (large N, many random-effect levels) — run only if you have time/RAM to spare, and consider subsetting to a manageable number of scenarios (e.g. one cover level) first.
con <- dbConnect(duckdb())
ign_dt <- setDT(dbGetQuery(con, sprintf("
SELECT cover_percent, distribution, inv_moist_live, inv_moist_dead,
ni_moist_live, ni_moist_dead, inv_fuel_live, inv_fuel_dead,
ni_fuel_live, ni_fuel_dead, ignition, arrival_mean
FROM read_csv_auto('%s')
", input_path)))
dbDisconnect(con, shutdown = TRUE)
ign_dt[, scenario_id := .GRP, by = .(cover_percent, distribution,
inv_moist_live, inv_moist_dead, ni_moist_live, ni_moist_dead,
inv_fuel_live, inv_fuel_dead, ni_fuel_live, ni_fuel_dead)]
ign_dt[, distribution := fifelse(is.na(distribution), "CompleteDominance", distribution)]
ign_dt[, distribution := factor(distribution, levels = c("CompleteDominance","Random","ModClump","Clump"))]
ign_dt[, (mf_cols) := lapply(.SD, factor, levels = c("L","M","H")), .SDcols = mf_cols]
ign_dt[, cover_c := cover_percent - 50]
m_glmm <- glmmTMB(arrival_mean ~ cover_c * distribution +
inv_moist_live + inv_moist_dead + ni_moist_live + ni_moist_dead +
inv_fuel_live + inv_fuel_dead + ni_fuel_live + ni_fuel_dead +
(1 | scenario_id),
data = ign_dt, family = Gamma(link = "log"))
summary(m_glmm)
saveRDS(m_gamma, file.path(out_dir, "gamma_glm_primary_model.rds"))
saveRDS(m_ols, file.path(out_dir, "ols_raw_model.rds"))
saveRDS(m_log, file.path(out_dir, "ols_log_model.rds"))
saveRDS(m_gamma_p, file.path(out_dir, "gamma_glm_continuous_p_sensitivity.rds"))
fwrite(scenario_dt, file.path(out_dir, "scenario_level_data.csv"))