1 Purpose

arrival_results_first50.csv holds fire arrival-time results (minutes to reach the target cells) for 50 ignitions per landscape scenario, crossed over the full design: 8 fuel/moisture factors (invaded/non-invaded x live/dead x fuel load/moisture, each Low/Medium/High -> 3^8 = 6,561 combinations), 11 cogongrass cover levels (0-100% by 10s), and up to 3 spatial distributions (Random, ModClump, Clump; distribution is undefined at 0% and 100% cover). That is 190,269 unique landscape scenarios x 50 ignitions = 9,513,450 rows.

The proposal specifies a multiple linear regression (MLR) of fire risk on cogongrass cover, distribution, and their interaction, controlling for weather/fuel conditions. This script (1) builds that model correctly at the right unit of analysis, (2) tests whether MLR’s assumptions actually hold for this response, and (3) if they don’t, fits and justifies a better-suited model, while still reporting the same effects the proposal asks about.

2 1. Load & Aggregate to the Scenario Level

Why aggregate: each landscape scenario has 50 ignition points, but ignition location doesn’t change any of the design variables (cover, distribution, fuel/moisture). Fitting a regression on all 9.5M ignition-level rows would treat 50 pseudo-replicates as independent observations for every landscape-level predictor, inflating degrees of freedom and Type I error ~50-fold. The correct unit of analysis for cover/distribution/fuel effects is the scenario (n = 190,269), with the 50 ignitions summarized into a mean (and SD, for reference) per scenario. This mirrors the aggregation step already used in First_50_Ign.Rmd.

We use DuckDB to aggregate directly against the CSV rather than reading all 2 GB into memory with fread; it finishes in well under a minute and only the small aggregated table is loaded into R.

con <- dbConnect(duckdb())

scenario_dt <- dbGetQuery(con, sprintf("
  SELECT cover_percent, distribution,
         inv_moist_live, inv_moist_dead, ni_moist_live, ni_moist_dead,
         inv_fuel_live,  inv_fuel_dead,  ni_fuel_live,  ni_fuel_dead,
         AVG(arrival_mean)    AS arrival_avg,
         STDDEV(arrival_mean) AS arrival_sd,
         COUNT(*)             AS n_ign
  FROM read_csv_auto('%s')
  GROUP BY ALL
", input_path))

dbDisconnect(con, shutdown = TRUE)
setDT(scenario_dt)
cat("Scenarios:", nrow(scenario_dt), "\n")
## Scenarios: 190269
cat("Replicate count per scenario:\n")
## Replicate count per scenario:
print(table(scenario_dt$n_ign))   # should all be 50
## 
##     50 
## 190269
# ---- Alternative if you don't have the duckdb package / prefer fread ----
# dt <- fread(input_path)
# group_cols <- c("cover_percent","distribution","inv_moist_live","inv_moist_dead",
#                  "ni_moist_live","ni_moist_dead","inv_fuel_live","inv_fuel_dead",
#                  "ni_fuel_live","ni_fuel_dead")
# scenario_dt <- dt[, .(arrival_avg = mean(arrival_mean), arrival_sd = sd(arrival_mean),
#                        n_ign = .N), by = group_cols]

2.1 Sanity checks

stopifnot(all(scenario_dt$n_ign == 50))
stopifnot(all(scenario_dt$arrival_avg > 0))
stopifnot(sum(is.na(scenario_dt$arrival_avg)) == 0)

# distribution is NA only at the 0%/100% cover extremes (no pattern possible)
cover_for_na <- scenario_dt[is.na(distribution), unique(cover_percent)]
cat("cover_percent values with distribution == NA:", paste(cover_for_na, collapse = ", "), "\n")
## cover_percent values with distribution == NA: 0, 100
stopifnot(setequal(cover_for_na, range(scenario_dt$cover_percent)))

2.2 Factor set-up

scenario_dt[, distribution := fifelse(is.na(distribution), "CompleteDominance", distribution)]
scenario_dt[, distribution := factor(distribution,
              levels = c("CompleteDominance", "Random", "ModClump", "Clump"))]

mf_cols <- c("inv_moist_live","inv_moist_dead","ni_moist_live","ni_moist_dead",
             "inv_fuel_live","inv_fuel_dead","ni_fuel_live","ni_fuel_dead")
scenario_dt[, (mf_cols) := lapply(.SD, factor, levels = c("L","M","H")), .SDcols = mf_cols]

# Center cover so the intercept = predicted arrival time at the midpoint (50%)
scenario_dt[, cover_c := cover_percent - 50]

# Sum-to-zero contrasts so Type III tests below are interpretable
options(contrasts = c("contr.sum", "contr.poly"))

3 2. Exploratory Look at the Response

p1 <- ggplot(scenario_dt, aes(arrival_avg)) +
  geom_histogram(bins = 60, fill = "steelblue") +
  labs(title = "Distribution of scenario-mean arrival time", x = "Arrival time (min)") +
  theme_minimal()

p2 <- ggplot(scenario_dt, aes(sample = arrival_avg)) +
  stat_qq() + stat_qq_line(color = "red") +
  labs(title = "QQ plot (raw arrival time)") +
  theme_minimal()

p1 + p2

cat("Skewness:", moments::skewness(scenario_dt$arrival_avg), "\n")
## Skewness: 2.124421
cat("Mean:", mean(scenario_dt$arrival_avg), " Median:", median(scenario_dt$arrival_avg), "\n")
## Mean: 75.386  Median: 58.46665

Arrival time is strictly positive and right-skewed (mean > median), which is typical for a duration/rate-type outcome and is already a warning sign for plain OLS.

ggplot(scenario_dt, aes(x = factor(cover_percent), y = arrival_avg, fill = distribution)) +
  geom_boxplot(outlier.size = 0.3) +
  labs(title = "Arrival time by cogongrass cover and distribution",
       x = "Cogongrass cover (%)", y = "Arrival time (min)", fill = "Distribution") +
  theme_minimal()

4 3. Model 1 — Multiple Linear Regression (as proposed)

Formula follows the proposal: cover x distribution as the focal interaction, plus fuel-moisture and fuel-load conditions as covariates (these stand in for the “weather” term in the proposal — this dataset varies fuel moisture and fuel load directly across quantile levels rather than two discrete weather scenarios, giving finer resolution on the same idea).

form <- arrival_avg ~ cover_c * distribution +
  cover_c * (inv_moist_live + inv_moist_dead + ni_moist_live + ni_moist_dead) +
  distribution * (inv_moist_live + inv_moist_dead + ni_moist_live + ni_moist_dead) +
  inv_fuel_live + inv_fuel_dead + ni_fuel_live + ni_fuel_dead

m_ols <- lm(form, data = scenario_dt)
summary(m_ols)$r.squared
## [1] 0.7684059

4.1 Diagnostics

performance::check_model(m_ols, check = c("linearity","homogeneity","qq","outliers"))

# Normality — Shapiro's test is over-powered at n > 5000, so test a random
# subsample and report skew/kurtosis for effect size context.
set.seed(1)
resid_sample <- sample(residuals(m_ols), 5000)
shapiro.test(resid_sample)
## 
##  Shapiro-Wilk normality test
## 
## data:  resid_sample
## W = 0.91322, p-value < 2.2e-16
cat("Residual skew:", moments::skewness(residuals(m_ols)),
    " kurtosis:", moments::kurtosis(residuals(m_ols)), "\n")
## Residual skew: 1.373485  kurtosis: 6.890738
# Homoscedasticity
car::ncvTest(m_ols)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 120490.2, Df = 1, p = < 2.22e-16
# Multicollinearity
car::vif(m_ols)
##                                 GVIF Df GVIF^(1/(2*Df))
## cover_c                     1.006250  1        1.003120
## distribution                1.000000  3        1.000000
## inv_moist_live              2.281359  2        1.228990
## inv_moist_dead              2.281359  2        1.228990
## ni_moist_live               2.281359  2        1.228990
## ni_moist_dead               2.281359  2        1.228990
## inv_fuel_live               1.000000  2        1.000000
## inv_fuel_dead               1.000000  2        1.000000
## ni_fuel_live                1.000000  2        1.000000
## ni_fuel_dead                1.000000  2        1.000000
## cover_c:distribution        1.006250  3        1.001039
## cover_c:inv_moist_live      1.000000  2        1.000000
## cover_c:inv_moist_dead      1.000000  2        1.000000
## cover_c:ni_moist_live       1.000000  2        1.000000
## cover_c:ni_moist_dead       1.000000  2        1.000000
## distribution:inv_moist_live 2.281359  6        1.071148
## distribution:inv_moist_dead 2.281359  6        1.071148
## distribution:ni_moist_live  2.281359  6        1.071148
## distribution:ni_moist_dead  2.281359  6        1.071148

Read: residuals are right-skewed with heavy tails, the QQ plot bows away from the line, and the non-constant variance test is strongly significant — variance clearly grows with the mean. VIFs are all low (this is a balanced factorial design, so multicollinearity was never expected to be the issue). Raw MLR violates the normality and homoscedasticity assumptions it depends on, so p-values and CIs from m_ols are not trustworthy as-is.

5 4. Model 2 — Log-Transformed MLR

A standard fix for right-skewed, strictly-positive responses.

scenario_dt[, log_arrival := log(arrival_avg)]
form_log <- update(form, log_arrival ~ .)
m_log <- lm(form_log, data = scenario_dt)
summary(m_log)$r.squared
## [1] 0.8755868
performance::check_model(m_log, check = c("linearity","homogeneity","qq","outliers"))

set.seed(1)
shapiro.test(sample(residuals(m_log), 5000))
## 
##  Shapiro-Wilk normality test
## 
## data:  sample(residuals(m_log), 5000)
## W = 0.9666, p-value < 2.2e-16
car::ncvTest(m_log)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 20564.97, Df = 1, p = < 2.22e-16

Better (R² rises, skew and kurtosis shrink substantially), but residuals still show mild remaining skew and the interpretation now requires back-transforming log-scale coefficients (with a retransformation-bias correction) to talk about minutes — awkward for a dissertation chapter.

7 6. Model Comparison

pseudo_r2 <- function(mod) {
  null_mod <- update(mod, . ~ 1)
  1 - (mod$deviance / null_mod$deviance)
}

comp_tbl <- data.table(
  model      = c("OLS (raw)", "OLS (log)", "Gamma GLM (log link)"),
  AIC        = c(AIC(m_ols), AIC(m_log), AIC(m_gamma)),
  R2_or_pseudoR2 = c(summary(m_ols)$r.squared, summary(m_log)$r.squared, pseudo_r2(m_gamma)),
  resid_skew = c(moments::skewness(residuals(m_ols)),
                 moments::skewness(residuals(m_log)),
                 moments::skewness(residuals(m_gamma, type = "deviance"))),
  ncv_test_p = c(car::ncvTest(m_ols)$p, car::ncvTest(m_log)$p, NA)
)
print(comp_tbl)
##                   model        AIC R2_or_pseudoR2 resid_skew ncv_test_p
##                  <char>      <num>          <num>      <num>      <num>
## 1:            OLS (raw) 1755347.51      0.7684059  1.3734852          0
## 2:            OLS (log)  -93973.15      0.8755868  0.8036222          0
## 3: Gamma GLM (log link) 1499876.65      0.8839701  0.9166239         NA
fwrite(comp_tbl, file.path(out_dir, "model_comparison.csv"))

Conclusion: raw MLR is not a good fit for this response — it fails both normality and constant-variance assumptions, driven by the right skew and mean-variance relationship inherent to arrival-time data. The Gamma GLM with a log link is the best-justified choice: it has the highest deviance-explained pseudo-R², directly accommodates the skew/variance structure DHARMa confirms is well-behaved, and needs no post-hoc transformation to interpret. This also matches the direction already taken in First_50_Ign.Rmd; this analysis adds the diagnostic evidence for why that choice is correct. m_gamma is used as the primary model going forward.

8 7. Effects of Interest (Primary Model)

8.1 Type III tests

car::Anova(m_gamma, type = "III")
## Analysis of Deviance Table (Type III tests)
## 
## Response: arrival_avg
##                             LR Chisq Df Pr(>Chisq)    
## cover_c                       197992  1  < 2.2e-16 ***
## distribution                    7119  3  < 2.2e-16 ***
## inv_moist_live                 44510  2  < 2.2e-16 ***
## inv_moist_dead                104597  2  < 2.2e-16 ***
## ni_moist_live                   4705  2  < 2.2e-16 ***
## ni_moist_dead                 230873  2  < 2.2e-16 ***
## inv_fuel_live                  16667  2  < 2.2e-16 ***
## inv_fuel_dead                 123879  2  < 2.2e-16 ***
## ni_fuel_live                     573  2  < 2.2e-16 ***
## ni_fuel_dead                   54445  2  < 2.2e-16 ***
## cover_c:distribution             590  3  < 2.2e-16 ***
## cover_c:inv_moist_live         24356  2  < 2.2e-16 ***
## cover_c:inv_moist_dead         48530  2  < 2.2e-16 ***
## cover_c:ni_moist_live           2439  2  < 2.2e-16 ***
## cover_c:ni_moist_dead         120833  2  < 2.2e-16 ***
## distribution:inv_moist_live      227  6  < 2.2e-16 ***
## distribution:inv_moist_dead      471  6  < 2.2e-16 ***
## distribution:ni_moist_live       120  6  < 2.2e-16 ***
## distribution:ni_moist_dead      6397  6  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

8.2 Cover x Distribution interaction

emmeans by default builds a reference grid that crosses every factor in the model, including the 8 fuel/moisture factors (3^8 = 6,561 combinations) — that blows well past its row limit even though we only asked about cover and distribution. The fix is to hold the fuel/moisture factors fixed at a single reference level (“M” = medium, the middle quantile) via at, so the grid only varies the two factors we actually want to plot. Predictions below are therefore “at medium fuel-moisture conditions”; re-run with a different ref_at to see how the picture shifts under wetter/drier or higher/lower fuel-load conditions.

ref_at <- setNames(as.list(rep("M", length(mf_cols))), mf_cols)

emm_grid <- emmeans(m_gamma, ~ cover_c * distribution,
                     at = c(list(cover_c = seq(-50, 50, 10)), ref_at),
                     type = "response")
emm_df <- as.data.frame(emm_grid)

# emmeans names the CI columns differently depending on version/df method
# (e.g. "asymp.LCL"/"asymp.UCL" vs "lower.CL"/"upper.CL") -- detect rather
# than hardcode so this doesn't break across setups.
lcl_col <- grep("LCL$|lower\\.CL$", names(emm_df), value = TRUE)[1]
ucl_col <- grep("UCL$|upper\\.CL$", names(emm_df), value = TRUE)[1]
emm_df$LCL <- emm_df[[lcl_col]]
emm_df$UCL <- emm_df[[ucl_col]]

ggplot(emm_df, aes(x = cover_c + 50, y = response, color = distribution, fill = distribution)) +
  geom_line(linewidth = 1) +
  geom_ribbon(aes(ymin = LCL, ymax = UCL), alpha = 0.15, color = NA) +
  labs(title = "Predicted fire arrival time by cogongrass cover and distribution",
       x = "Cogongrass cover (%)", y = "Predicted arrival time (min)",
       color = "Distribution", fill = "Distribution") +
  theme_minimal()

ggsave(file.path(out_dir, "cover_by_distribution_effect.png"), width = 8, height = 5, dpi = 300)

8.3 Fuel moisture / fuel load main effects

Same issue applies here: for each single factor v, hold every other fuel/moisture factor at “M” (cover_c is continuous, so emmeans already just uses its mean rather than expanding it) so the grid stays small.

fuel_terms <- c("inv_moist_live","inv_moist_dead","ni_moist_live","ni_moist_dead",
                 "inv_fuel_live","inv_fuel_dead","ni_fuel_live","ni_fuel_dead")
fuel_plots <- lapply(fuel_terms, function(v) {
  at_list <- ref_at
  at_list[[v]] <- NULL  # let this factor vary over its own L/M/H levels
  emm <- emmeans(m_gamma, as.formula(paste("~", v)), at = at_list, type = "response")
  df  <- as.data.frame(emm)
  lcl_col <- grep("LCL$|lower\\.CL$", names(df), value = TRUE)[1]
  ucl_col <- grep("UCL$|upper\\.CL$", names(df), value = TRUE)[1]
  df$LCL <- df[[lcl_col]]
  df$UCL <- df[[ucl_col]]
  ggplot(df, aes(x = .data[[v]], y = response)) +
    geom_pointrange(aes(ymin = LCL, ymax = UCL)) +
    labs(title = v, x = NULL, y = "Arrival time (min)") +
    theme_minimal(base_size = 9)
})
wrap_plots(fuel_plots, ncol = 4)

ggsave(file.path(out_dir, "fuel_moisture_main_effects.png"), width = 12, height = 6, dpi = 300)

9 8. Sensitivity Check — Distribution as a Continuous Clustering Parameter

The proposal describes distribution using a continuous clustering parameter p (Random = 0, ModClump = 0.3, Clump = 0.55) rather than a categorical factor. As a robustness check (restricted to 10-90% cover, where a pattern exists), we refit with p continuous to see whether the categorical model’s conclusions hold under a linear-in-p assumption.

p_map <- c(Random = 0, ModClump = 0.3, Clump = 0.55)
sens_dt <- scenario_dt[distribution != "CompleteDominance"]
sens_dt[, clustering_p := p_map[as.character(distribution)]]

m_gamma_p <- glm(arrival_avg ~ cover_c * clustering_p +
                    inv_moist_live + inv_moist_dead + ni_moist_live + ni_moist_dead +
                    inv_fuel_live + inv_fuel_dead + ni_fuel_live + ni_fuel_dead,
                  data = sens_dt, family = Gamma(link = "log"))
summary(m_gamma_p)
## 
## Call:
## glm(formula = arrival_avg ~ cover_c * clustering_p + inv_moist_live + 
##     inv_moist_dead + ni_moist_live + ni_moist_dead + inv_fuel_live + 
##     inv_fuel_dead + ni_fuel_live + ni_fuel_dead, family = Gamma(link = "log"), 
##     data = sens_dt)
## 
## Coefficients:
##                        Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)           4.155e+00  1.156e-03 3593.796   <2e-16 ***
## cover_c              -7.627e-03  4.478e-05 -170.311   <2e-16 ***
## clustering_p          1.335e-01  3.197e-03   41.766   <2e-16 ***
## inv_moist_live1      -1.665e-01  1.016e-03 -163.811   <2e-16 ***
## inv_moist_live2       3.663e-02  1.016e-03   36.036   <2e-16 ***
## inv_moist_dead1      -1.833e-01  1.016e-03 -180.374   <2e-16 ***
## inv_moist_dead2      -7.587e-02  1.016e-03  -74.639   <2e-16 ***
## ni_moist_live1       -4.284e-02  1.016e-03  -42.145   <2e-16 ***
## ni_moist_live2       -1.219e-02  1.016e-03  -11.989   <2e-16 ***
## ni_moist_dead1       -2.469e-01  1.016e-03 -242.897   <2e-16 ***
## ni_moist_dead2       -1.424e-01  1.016e-03 -140.049   <2e-16 ***
## inv_fuel_live1       -6.896e-02  1.016e-03  -67.840   <2e-16 ***
## inv_fuel_live2       -1.383e-02  1.016e-03  -13.610   <2e-16 ***
## inv_fuel_dead1        2.208e-01  1.016e-03  217.243   <2e-16 ***
## inv_fuel_dead2       -3.007e-02  1.016e-03  -29.583   <2e-16 ***
## ni_fuel_live1        -1.495e-02  1.016e-03  -14.704   <2e-16 ***
## ni_fuel_live2         9.315e-04  1.016e-03    0.916   0.3595    
## ni_fuel_dead1         1.449e-01  1.016e-03  142.542   <2e-16 ***
## ni_fuel_dead2        -1.317e-02  1.016e-03  -12.959   <2e-16 ***
## cover_c:clustering_p  2.461e-04  1.238e-04    1.987   0.0469 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.09151222)
## 
##     Null deviance: 53985  on 177146  degrees of freedom
## Residual deviance: 13333  on 177127  degrees of freedom
## AIC: 1519017
## 
## Number of Fisher Scoring iterations: 6

Compare the sign/significance of cover_c:clustering_p here to the categorical interaction above — agreement means the conclusion about how cover and distribution jointly affect arrival time doesn’t hinge on which way distribution is coded.

10 9. Optional Appendix — Mixed-Effects Cross-Check on Ignition-Level Data

Everything above works on scenario means, which is the statistically correct unit for testing landscape-level effects. As an optional robustness check, a Gamma GLMM with a random intercept per scenario models the full ignition-level data (9.5M rows) directly instead of aggregating first, and should recover the same fixed-effect conclusions.

This is computationally heavy (large N, many random-effect levels) — run only if you have time/RAM to spare, and consider subsetting to a manageable number of scenarios (e.g. one cover level) first.

con <- dbConnect(duckdb())
ign_dt <- setDT(dbGetQuery(con, sprintf("
  SELECT cover_percent, distribution, inv_moist_live, inv_moist_dead,
         ni_moist_live, ni_moist_dead, inv_fuel_live, inv_fuel_dead,
         ni_fuel_live, ni_fuel_dead, ignition, arrival_mean
  FROM read_csv_auto('%s')
", input_path)))
dbDisconnect(con, shutdown = TRUE)

ign_dt[, scenario_id := .GRP, by = .(cover_percent, distribution,
        inv_moist_live, inv_moist_dead, ni_moist_live, ni_moist_dead,
        inv_fuel_live, inv_fuel_dead, ni_fuel_live, ni_fuel_dead)]
ign_dt[, distribution := fifelse(is.na(distribution), "CompleteDominance", distribution)]
ign_dt[, distribution := factor(distribution, levels = c("CompleteDominance","Random","ModClump","Clump"))]
ign_dt[, (mf_cols) := lapply(.SD, factor, levels = c("L","M","H")), .SDcols = mf_cols]
ign_dt[, cover_c := cover_percent - 50]

m_glmm <- glmmTMB(arrival_mean ~ cover_c * distribution +
                     inv_moist_live + inv_moist_dead + ni_moist_live + ni_moist_dead +
                     inv_fuel_live + inv_fuel_dead + ni_fuel_live + ni_fuel_dead +
                     (1 | scenario_id),
                   data = ign_dt, family = Gamma(link = "log"))
summary(m_glmm)

11 10. Save Model Objects

saveRDS(m_gamma,   file.path(out_dir, "gamma_glm_primary_model.rds"))
saveRDS(m_ols,     file.path(out_dir, "ols_raw_model.rds"))
saveRDS(m_log,     file.path(out_dir, "ols_log_model.rds"))
saveRDS(m_gamma_p, file.path(out_dir, "gamma_glm_continuous_p_sensitivity.rds"))
fwrite(scenario_dt, file.path(out_dir, "scenario_level_data.csv"))

12 Summary

  • Raw MLR (as originally proposed) fails normality and constant-variance checks — driven by right skew and a mean-variance relationship in arrival time. It is not a trustworthy model for inference here.
  • A Gamma GLM with a log link, fit on scenario-mean arrival time (n = 190,269 landscape scenarios, correctly avoiding pseudoreplication across the 50 ignitions/scenario), is the best-supported alternative: best fit statistics, well-behaved DHARMa diagnostics, and coefficients that are directly interpretable as multiplicative (%) effects on arrival time.
  • The cover x distribution interaction, and the fuel moisture/load main effects, are reported via Type III tests and effect plots above so the results map directly onto the questions in the proposal’s Data Analysis section.
  • A continuous-p sensitivity check and an optional ignition-level GLMM are included so you can confirm the conclusions are robust to how distribution is coded and to the aggregation step.