#Libraries + the raw database that comes directly from the google-sheet shared by OAF (https://docs.google.com/spreadsheets/d/1ycoY3k0vQIEYVtDBqDFWpYZF_p4sMNF59y8QM6PkdGE/edit?gid=1553488959#gid=1553488959). You can also find description of the variables here(https://docs.google.com/spreadsheets/d/1ycoY3k0vQIEYVtDBqDFWpYZF_p4sMNF59y8QM6PkdGE/edit?gid=366994340#gid=366994340)

Data Cleaning; elimination some outliers for yield

I use a linear mixed model to identify outliers in yield (DC_Yield_t_ha), flagged by large residuals or influence diagnostics (e.g. standardized residuals, Cook’s distance, leverage), and then count how many outliers exist per season and treatment code.

Notes: I used standard residue of >4 to consider outlier (its high but on farm across the country and several seasons we should be more flexible).

# Data Cleaning; elimination some outliers for yield

# Load required libraries
library(lme4)
library(lmerTest)
library(broom.mixed)
library(dplyr)

# ---- 1. Prepare data ----
model_data <- ca_raw %>%
  filter(
    !is.na(DC_Yield_t_ha),
    !is.na(IQR_block),
    !is.na(IQF_environment),
    !is.na(Treat_code),
    !is.na(IQR_Season)
  ) %>%
  mutate(
    DC_Yield_t_ha = as.numeric(DC_Yield_t_ha),
    Treat_code = as.factor(Treat_code),
    IQF_environment = as.factor(IQF_environment),
    IQR_Season = as.factor(IQR_Season),
    IQR_block = as.factor(IQR_block)
  )

# ---- 2. Fit linear mixed model ----
model_resid <- lmer(
  DC_Yield_t_ha ~ Treat_code * IQR_Season + 
    (1 | IQF_environment) + (1 | IQR_block),
  data = model_data
)

# ---- 3. Get residuals and flag outliers ----
resid_data <- augment(model_resid) %>%
  mutate(
    std_resid = .resid / sd(.resid, na.rm = TRUE),
    is_outlier = abs(std_resid) > 4
  )

# 1. Compute residuals with augment() and join to model data
resid_data <- bind_cols(model_data, augment(model_resid))

# 2. Compute residual metrics
# ---- 3. Extract residuals (correctly aligned) ----
resid_data <- augment(model_resid, data = model_data) %>%
  mutate(
    std_resid = scale(.resid)[, 1],
    is_outlier = abs(std_resid) > 4
  )

# ---- 4. Count outliers per group ----
outlier_counts <- resid_data %>%
  filter(is_outlier) %>%
  group_by(IQR_Season, IQF_environment, Treat_code) %>%
  summarise(n_outliers = n(), .groups = "drop")


# ---- 6. Print results ----
print(outlier_counts)
## # A tibble: 13 × 4
##    IQR_Season IQF_environment Treat_code n_outliers
##    <fct>      <fct>           <fct>           <int>
##  1 23A        Mid Low Dry     A                   3
##  2 23A        Mid Low Dry     B                   1
##  3 23A        Mid Low Dry     C                   2
##  4 23A        Mid Low Dry     D                   1
##  5 24A        Highland        A                   5
##  6 24A        Highland        B                   4
##  7 24A        Highland        C                   4
##  8 24A        Highland        D                   4
##  9 24A        Mid_Low_Wet     B                   1
## 10 24A        Mid_Low_Wet     D                   1
## 11 24A        Transition      A                   1
## 12 25A        Highland        A                   1
## 13 25A        Transition      A                   1
# ---- 7. Total number of outliers ----
cat("Total outliers detected:", sum(resid_data$is_outlier), "\n")
## Total outliers detected: 29

Since we are pulling together CAs treatments lated, I eliminated outliers for CA but I kept the blocks, assuming the average of the 2 left will not ve very affected. However, where I wliminate treatment A (the conventional), then I eliminate the block for yield analysis: total of 12 in the total 6 seasons = 6/2000 and something

library(dplyr)

# 1. Create a unique identifier for Season + Block
resid_data <- resid_data %>%
  mutate(IQF_SeasonblockID = paste(IQR_block, IQR_Season, sep = "_"))


# 2. Identify Season-Block IDs where A was an outlier
remove_full_blocks <- resid_data %>%
  filter(is_outlier == TRUE, Treat_code == "A") %>%
  pull(IQF_SeasonblockID) %>%
  unique()

# 3. Identify rows to remove for B, C, D (but not whole blocks)
remove_specific_rows <- resid_data %>%
  filter(is_outlier == TRUE, Treat_code %in% c("B", "C", "D")) %>%
  dplyr::select(IQR_Season, IQR_block, Treat_code)


# 4. Start from ca_raw and remove:
#    - Full blocks where A was outlier
#    - Specific rows where B/C/D were outlier
yield_n_out <- ca_raw %>%
  # Remove blocks where A was an outlier
  filter(!(IQF_SeasonblockID %in% remove_full_blocks)) %>%
  # Remove specific rows (B, C, D outliers)
  anti_join(remove_specific_rows, by = c("IQR_Season", "IQR_block", "Treat_code")) %>%
  dplyr::select(-IQF_SeasonblockID)  # remove temp column

Which environmental zoning should we use: AEZ-OAF or the conventional Rwandan AEZ?

To compare zoning systems, we evaluated how much variance in yield is explained when modeling the effects of season, zone, and treatment, using either OAF_AEZ or the national AEZ classification. The zoning system that accounts for more variance provides the better environmental stratification for yield analysis.

Notes: We use OAF_AEZ because (now on called IQF_environment): 1- it better represents the biophysical and agronomic conditions that drive yield variation (e.g., moisture regime, elevation, temperature). 2- In our models, OAF_AEZ explains substantially more variance than the national AEZ classification (η² = 0.49 vs. 0.21). Moreover since it has 4 cathergories (AEZ has 8) the power to stufy effect of environmental zone and interactions will give us more power. 3 -OAF_AEZ is also closely aligned with crop-specific agronomic realities, including maize variety zones (highland vs. mid-altitude) and bean growth habit (bush vs. climbing). 4- While national AEZs may capture broader social, infrastructural, and historical patterns due to their geographic contiguity, we will analyze these socio-spatial drivers separately when modeling household-level effects on yield

#Fix mistake in DC_Yield_t_ha, were some Highland are spelled Higland
# Replace misspelling in IQF_environment
# Fix mistakes in IQF_environment
yield_n_out <- yield_n_out %>%
  mutate(
    IQF_environment = gsub("^<", "", IQF_environment),
    IQF_environment = case_when(
      tolower(IQF_environment) == "higland"   ~ "Highland",
      tolower(IQF_environment) == "highland"  ~ "Highland",
      IQF_environment == "mid_low_wet"        ~ "Mid_Low_Wet",
      IQF_environment == "mid_low_dry"        ~ "Mid_Low_Dry",
      IQF_environment == "Mid Low Dry"        ~ "Mid_Low_Dry",
      IQF_environment == "Mid Low Wet"        ~ "Mid_Low_Wet",
      IQF_environment == "transition"         ~ "Transition",
      TRUE                                    ~ IQF_environment
    )
  )

# Load libraries
library(lme4)
library(lmerTest)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
## 
##     some
## The following object is masked from 'package:dplyr':
## 
##     recode
library(effectsize)
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:mvtnorm':
## 
##     standardize
library(performance)

# Clean dataset
yield_n_out_clean <- yield_n_out %>%
  filter(!is.na(DC_Yield_t_ha), !is.na(Treat_code))

# Fit models
mod_agzone <- lmer(
  DC_Yield_t_ha ~ Treat_code * IQF_agzone + IQR_Season + (1 | IQR_block),
  data = yield_n_out_clean
)

mod_env <- lmer(
  DC_Yield_t_ha ~ Treat_code * IQF_environment + IQR_Season + (1 | IQR_block),
  data = yield_n_out_clean
)

# Type III ANOVA
anova_agzone <- Anova(mod_agzone, type = 3)
anova_env    <- Anova(mod_env,    type = 3)

# Partial Eta Squared (correct for current effectsize)
eta_agzone <- eta_squared(mod_agzone)
eta_env    <- eta_squared(mod_env)


# R² measures
r2_agzone <- r2(mod_agzone)
r2_env    <- r2(mod_env)

# Print results
cat("=== ANOVA: AgZone Model ===\n")
## === ANOVA: AgZone Model ===
print(anova_agzone)
## Analysis of Deviance Table (Type III Wald chisquare tests)
## 
## Response: DC_Yield_t_ha
##                          Chisq Df Pr(>Chisq)    
## (Intercept)            1368.77  1  < 2.2e-16 ***
## Treat_code              109.95  3  < 2.2e-16 ***
## IQF_agzone              187.62 10  < 2.2e-16 ***
## IQR_Season            16969.02  5  < 2.2e-16 ***
## Treat_code:IQF_agzone   112.95 30  1.513e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\n=== ANOVA: Environment Model ===\n")
## 
## === ANOVA: Environment Model ===
print(anova_env)
## Analysis of Deviance Table (Type III Wald chisquare tests)
## 
## Response: DC_Yield_t_ha
##                               Chisq Df Pr(>Chisq)    
## (Intercept)                 3684.70  1    < 2e-16 ***
## Treat_code                   226.31  3    < 2e-16 ***
## IQF_environment              171.13  3    < 2e-16 ***
## IQR_Season                 17246.87  5    < 2e-16 ***
## Treat_code:IQF_environment    21.48  9    0.01068 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\n=== Partial Eta²: AgZone Model ===\n")
## 
## === Partial Eta²: AgZone Model ===
print(eta_agzone)
## # Effect Size for ANOVA (Type III)
## 
## Parameter             | Eta2 (partial) |       95% CI
## -----------------------------------------------------
## Treat_code            |       2.82e-03 | [0.00, 1.00]
## IQF_agzone            |           0.19 | [0.14, 1.00]
## IQR_Season            |           0.61 | [0.61, 1.00]
## Treat_code:IQF_agzone |           0.01 | [0.01, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
cat("\n=== Partial Eta²: Environment Model ===\n")
## 
## === Partial Eta²: Environment Model ===
print(eta_env)
## # Effect Size for ANOVA (Type III)
## 
## Parameter                  | Eta2 (partial) |       95% CI
## ----------------------------------------------------------
## Treat_code                 |           0.07 | [0.06, 1.00]
## IQF_environment            |           0.06 | [0.05, 1.00]
## IQR_Season                 |           0.62 | [0.61, 1.00]
## Treat_code:IQF_environment |       2.03e-03 | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
cat("\n=== R²: AgZone Model ===\n")
## 
## === R²: AgZone Model ===
print(r2_agzone)
## # R2 for Mixed Models
## 
##   Conditional R2: 0.693
##      Marginal R2: 0.541
cat("\n=== R²: Environment Model ===\n")
## 
## === R²: Environment Model ===
print(r2_env)
## # R2 for Mixed Models
## 
##   Conditional R2: 0.719
##      Marginal R2: 0.495

Testing the three-way interaction

To assess whether the interaction among Treatment, Environment (OAF_AEZ), and Season improves the model, we compared a full model (including the 3-way interaction) with a reduced model (no 3-way term). Because lmer() does not provide p-values for interaction terms, we used a likelihood ratio test via anova() to compare the nested models.

Notes: The likelihood ratio test shows strong evidence that the three-way interaction (Treatment × Environment × Season) significantly improves model fit. Therefore, we proceed to open the interaction and evaluate 2 way interactions (Environmetn x treatment)

Models

Full model (with 3-way interaction):

model_full <- lmer(
  DC_Yield_t_ha ~ Treat_code * IQF_environment * IQR_Season +
    (1 | IQR_block) + (1 | IQF_environment) + (1 | IQR_Season),
  data = yield_n_out,
  REML = FALSE
)
## fixed-effect model matrix is rank deficient so dropping 4 columns / coefficients
## boundary (singular) fit: see help('isSingular')

Reduced model (no 3-way interaction):

model_no_3way <- lmer(
  DC_Yield_t_ha ~ Treat_code * IQF_environment + Treat_code * IQR_Season +
    (1 | IQR_block) + (1 | IQF_environment) + (1 | IQR_Season),
  data = yield_n_out,
  REML = FALSE
)
## boundary (singular) fit: see help('isSingular')

Likelihood ratio test

anova(model_no_3way, model_full)
## Data: yield_n_out
## Models:
## model_no_3way: DC_Yield_t_ha ~ Treat_code * IQF_environment + Treat_code * IQR_Season + (1 | IQR_block) + (1 | IQF_environment) + (1 | IQR_Season)
## model_full: DC_Yield_t_ha ~ Treat_code * IQF_environment * IQR_Season + (1 | IQR_block) + (1 | IQF_environment) + (1 | IQR_Season)
##               npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)    
## model_no_3way   40 36417 36710 -18169    36337                         
## model_full      96 35293 35997 -17551    35101 1236.1 56  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Test the AEZ x Treatment interaction, in each season

Now that we confirmed a significant 3-way interaction, next step is to analyze each season separately and test for Treat_code × IQF_environment interaction within each season.

Notes: Interaction is significant in all seasons but one season (25B). Then continue with post-hoc comparisons per agroecological zone (IQF_environment) within each season.

## 
## 
## ==== SEASON: 23A ====
## Type III Analysis of Variance Table with Satterthwaite's method
##                             Sum Sq Mean Sq NumDF  DenDF  F value  Pr(>F)    
## Treat_code                 315.910 105.303     3 1654.2 146.5477 < 2e-16 ***
## IQF_environment              4.243   1.414     3  686.7   1.9682 0.11741    
## Treat_code:IQF_environment  13.302   1.478     9 1654.6   2.0568 0.03032 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ==== SEASON: 23B ====
## Type III Analysis of Variance Table with Satterthwaite's method
##                            Sum Sq Mean Sq NumDF  DenDF  F value    Pr(>F)    
## Treat_code                 44.422 14.8075     3 1540.1 184.0492 < 2.2e-16 ***
## IQF_environment             1.045  0.3482     3  697.6   4.3285  0.004914 ** 
## Treat_code:IQF_environment  8.192  0.9102     9 1540.5  11.3135 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ==== SEASON: 24A ====
## Type III Analysis of Variance Table with Satterthwaite's method
##                            Sum Sq Mean Sq NumDF   DenDF  F value    Pr(>F)    
## Treat_code                 499.90 166.635     3 1383.52 260.0756 < 2.2e-16 ***
## IQF_environment             54.47  18.158     3  599.66  28.3404 < 2.2e-16 ***
## Treat_code:IQF_environment  48.49   5.388     9 1383.37   8.4091  2.64e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ==== SEASON: 24B ====
## Type III Analysis of Variance Table with Satterthwaite's method
##                            Sum Sq Mean Sq NumDF   DenDF  F value    Pr(>F)    
## Treat_code                 94.073 31.3577     3 1406.88 378.9046 < 2.2e-16 ***
## IQF_environment             2.195  0.7315     3  700.07   8.8395 9.323e-06 ***
## Treat_code:IQF_environment  1.806  0.2006     9 1407.59   2.4243  0.009863 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ==== SEASON: 25A ====
## fixed-effect model matrix is rank deficient so dropping 2 columns / coefficients
## Missing cells for: Treat_codeB:IQF_environmentHighland, Treat_codeD:IQF_environmentHighland.  
## Interpret type III hypotheses with care.
## Type III Analysis of Variance Table with Satterthwaite's method
##                             Sum Sq Mean Sq NumDF  DenDF  F value    Pr(>F)    
## Treat_code                 223.233  74.411     3 1143.0 112.6896 < 2.2e-16 ***
## IQF_environment             18.353   6.118     3  483.6   9.2648 5.739e-06 ***
## Treat_code:IQF_environment  17.293   2.470     7 1141.6   3.7412 0.0005143 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## ==== SEASON: 25B ====
## fixed-effect model matrix is rank deficient so dropping 2 columns / coefficients
## Missing cells for: Treat_codeB:IQF_environmentHighland, Treat_codeD:IQF_environmentHighland.  
## Interpret type III hypotheses with care.
## Type III Analysis of Variance Table with Satterthwaite's method
##                            Sum Sq Mean Sq NumDF   DenDF F value  Pr(>F)    
## Treat_code                 34.390 11.4634     3 1120.51 77.9677 < 2e-16 ***
## IQF_environment             1.501  0.5003     3  450.89  3.4025 0.01769 *  
## Treat_code:IQF_environment  1.259  0.1798     7 1120.48  1.2229 0.28687    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Testing the AEZ × Treatment interaction within each season

Given that the three-way interaction (Treatment × Environment × Season) is significant, the correct next step is to analyze each season separately and test for the Treatment × Environment interaction within that season. For each season, we fit a model including the Treat_code × IQF_environment interaction and evaluated its significance.

Notes: The interaction is significant in all seasons except one (25B). This indicates that treatment effects vary across environmental zones for nearly every season, supporting the need for zone-specific post-hoc comparisons.

Models per season

## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons

#Post-hoc comparisons within each Season × AEZ After confirming interaction patterns, we conducted post-hoc comparisons of treatment effects within every Season × AEZ combination. We estimated marginal means per treatment and tested whether the CA treatments (B, C, D) differed from each other.

Across 24 Season–AEZ combinations: -Treatment effect is significant in all 24 combinations. -In none of the 24 combinations do CA treatments (B, C, D) differ significantly from each other; all CA treatments form a statistically homogeneous group.

Notes: Because B, C, and D do not differ from each other in any environment–season context, they can be pooled into a single CA group to increase statistical power when comparing CA vs. Conventional systems in the next steps of the analysis. We will have to clearly explain that we are comparing cropping systems, which contain heterogeneity within them (varieties, management along the seasons, etc)

library(dplyr)
library(emmeans)
library(multcomp)
library(multcompView)

summary_results <- list()

# All combinations of Season × AEZ
season_aez_combos <- yield_n_out %>%
  distinct(IQR_Season, IQF_environment) %>%
  arrange(IQR_Season, IQF_environment)

for (i in 1:nrow(season_aez_combos)) {
  
  season <- season_aez_combos$IQR_Season[i]
  aez    <- season_aez_combos$IQF_environment[i]
  
  sub_data <- yield_n_out %>%
    filter(IQR_Season == season, IQF_environment == aez)
  
  if (length(unique(sub_data$Treat_code)) > 1 && nrow(sub_data) >= 5) {
    
    model <- lm(DC_Yield_t_ha ~ Treat_code, data = sub_data)
    
    # p-value
    anova_p <- anova(model)["Treat_code", "Pr(>F)"]
    
    # letters
    emm  <- emmeans(model, "Treat_code")
    pw   <- multcomp::cld(emm, adjust = "tukey")
    df_pw <- as.data.frame(pw)
    
    # Clean `.group` (" a b" -> "ab")
    df_pw$clean_group <- gsub(" ", "", df_pw$.group)
    
groups_bcd <- df_pw %>%
  filter(Treat_code %in% c("B", "C", "D")) %>%
  dplyr::select(Treat_code, clean_group)

    
    # Default
    bc_d_diff <- "No"
    
    # Only check differences if >= 2 treatments available
    if (nrow(groups_bcd) >= 2) {
      
      # All pairwise treatment letter sets
      combos <- combn(groups_bcd$clean_group, 2, simplify = FALSE)
      
      for (combo in combos) {
        
        g1 <- unlist(strsplit(combo[1], ""))  # split letters
        g2 <- unlist(strsplit(combo[2], ""))
        
        # If NO shared letters → significant difference
        if (length(intersect(g1, g2)) == 0) {
          bc_d_diff <- "Yes"
          break
        }
      }
    }
    
    summary_results[[paste(season, aez, sep = "_")]] <- data.frame(
      IQR_Season = season,
      IQF_environment = aez,
      Treat_pvalue = round(anova_p, 5),
      B_C_D_different = bc_d_diff
    )
  }
}
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
results_table <- bind_rows(summary_results)

How many times effect of trat was significant (p<=0.5) and how many timmes at least 2 CA treatmes were differenc eamong them:

## Significant p-values (<= 0.05): 23
## B/C/D differences detected: 0
## Total Season-AEZ combinations: 24

Plot results with letters and mean values

Combining CA Treatments

Because CA treatments B, C, and D showed no significant differences across any Season × AEZ combination, we combined them into a single CA group. Numerical variables (including yield) were averaged across the three treatments, while categorical variables were taken from treatment C, the most agronomically consistent option. This provides a cleaner dataset and increases power for comparing CA vs. Conventional systems.

Yield comparison of CA vs CONV by AEZ and season

We compared CA vs Conventional (CONV) yields using linear mixed models fitted separately for each Season × AEZ combination. For each model, we extracted; the p-value for the CA vs CONV contrast, and the relative yield defined as:

Relative yield = CA yield / CONV yield

Notes:

Average relative yield (CA / CONV): 0.758, i.e. CA yields are on average about 24% lower than CONV, and all comparisons are significant: 24 out of 24 Season × AEZ combinations show a significant difference (p ≤ 0.05) between CA and CONV. These results indicate that, across all tested environments and seasons, CA yields are consistently and significantly lower than those under conventional tillage.

Average difference and % of significant comparisons

# 1. Average relative yield
average_relative_yield <- mean(yield_comparison_table$relative_yield, na.rm = TRUE)

# 2. Count of significant p-values
n_significant <- sum(yield_comparison_table$p_value <= 0.05, na.rm = TRUE)

# 3. Total number of comparisons
n_total <- nrow(yield_comparison_table)

# Display the results
cat("Average Relative Yield (CA / CONV):", round(average_relative_yield, 3), "\n")
## Average Relative Yield (CA / CONV): 0.761
cat("Significant differences (p ≤ 0.05):", n_significant, "of", n_total, "\n")
## Significant differences (p ≤ 0.05): 24 of 24

#Cumulative distribution of relative yield (Season × AEZ level) We visualized the distribution of relative yield (CA / CONV) across Season × AEZ combinations using a cumulative frequency plot of relative yield (CA / CONV) for the average of season-agzone combinations with:X-axis: Relative yield (sorted increasing).

Notes:

This figure shows that even when targeting at the AEZ level, it is not possible to “pick out” zones where CA has no yield penalty. The signal is highly variable within these broad zones, suggesting that effective targeting would need to happen at the plot or farm level, not just by AEZ. Same but with color profression by season

#Plot-level relative yield (block × season)

We then moved from aggregated Season × AEZ comparisons to the plot level (each block = one farm in one season).We plotted the cumulative frequency of plot-level relative yield, adding a vertical reference line at relative yield = 1 (no yield penalty).

Notes:

At the plot level, a share of plots have little or no yield penalty under CA. Depending on the cut-off used, roughly 15–20% of plots show relative yields close to or above 1, indicating that CA can perform competitively on a subset of farms even though the overall average is negative.

Interpretation of yield comparison between CA and CONV (block–season combinations)

This figure shows the distribution of relative yield (CA / CONV) for each block–season combination. Each point represents the yield performance of one farm in one specific season, comparing CA to its paired CONV value.

Notes: Across AEZs, about 15–20% of block–season observations have relative yields close to or above 1. This means that although CA clearly underperforms on average at the ag-zone level, a meaningful subset of fields in individual seasons shows no yield penalty. This variation highlights that CA performance is highly plot-specific and cannot be reliably predicted using only AEZ-level information.

## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_point()`).

Another way to see it:

#Cumulative yield difference per farm (multi-season average).

Here we summarize the mid-term yield impact (3 years, 6 seasons) of CA at the farm level. For each block (farm), we first compute the relative yield (CA / CONV) at the block–season level and then calculate the average relative yield across seasons (average of the 6 seasonal rations). This gives one value per farm: its mean CA performance relative to Conventional over time.

Notes:

When looking at farms over multiple seasons, again most blocks (82%) have an average relative yield below 1, meaning a cumulative yield penalty under CA. Only a 18% share of farms in each AEZ reaches or exceeds yield parity. What looked as average is a non-viable practice, however it that fraction of farmers that benefit from CA can be targeter, then potential benefits of CA can be exploded.

## # A tibble: 577 × 4
##    IQR_block avg_relative_yield IQF_environment IQF_District
##    <chr>                  <dbl> <chr>           <chr>       
##  1 102_1                  0.828 Mid_Low_Wet     Bugarama    
##  2 102_10                 0.544 Mid_Low_Wet     Bugarama    
##  3 102_11                 0.919 Mid_Low_Wet     Bugarama    
##  4 102_12                 0.436 Mid_Low_Wet     Bugarama    
##  5 102_13                 0.768 Mid_Low_Wet     Bugarama    
##  6 102_14                 1.94  Mid_Low_Wet     Bugarama    
##  7 102_2                  0.726 Mid_Low_Wet     Bugarama    
##  8 102_3                  0.750 Mid_Low_Wet     Bugarama    
##  9 102_4                  0.753 Mid_Low_Wet     Bugarama    
## 10 102_5                  0.781 Mid_Low_Wet     Bugarama    
## # ℹ 567 more rows

Now a figure of cummulative frequency of relative performance

Another way to see frequency of performance of average (the 6 seasons averaged) relative yield of CA

## Warning: Removed 28 rows containing non-finite outside the scale range
## (`stat_density()`).

#Evolution of CA relative yield performance across six seasons

While looking at the cummulative yield difference between conventional and CA is useful from an agronomic perspective, it is missess relevance fro farmers. For example, farmers can’t afford large lossess in the initial seasons even if there is s compensation 2 or 3 years after. So its important to look at the effect of CA as a video and not as a picture.

We first try to understand whether farms consistently perform well or poorly under CA. For that we divide each season the blocks in good, mis and poor performers dividing the yield response (Realtive yield of CA = Ca/CONV) in three 3rds high, mid and low. Using these seasonal classifications, we create alluvial diagrams to visualize how each block’s performance group changes across seasons.

Note:

The alluvial plots show that blocks rarely stay in the same performance group from season to season. Many farms move between high, mid, and low categories, indicating that CA performance is highly unstable over time at the farm level. Only a small number of blocks consistently remain high-performing or consistently remain low-performing across all seasons.

## Warning: package 'ggalluvial' was built under R version 4.3.3
## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

Another way to explore the evolution of performance is the correlation of performance along the seasons, specially with season 1. That is having good performance is season 1 correlates better with the rest? And with overall?

## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.95 loaded

# Classification of seasonal CA performance using absolute thresholds As an alternative to tercile-based rankings, we also classify each block–season combination using a set of absolute thresholds for relative yield (CA / CONV). This provides a more interpretable scale of performance: -Very poor: < 0.90 -Poor: 0.90–0.95 -Acceptable: 0.95–1.00 -Good: ≥ 1.00 We add this classification to the block_yield dataset for each season and block. Using these categories, we generate alluvial diagrams to visualize how farms transition between performance groups across seasons.

## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.

Typology of multi-season CA performance (23A–25B)

To describe how farms perform under CA over time, we created a typology of block-level responses using their relative yield (CA / CONV) across six seasons (23A–25B). Only blocks with data in at least four seasons were retained to ensure stable classification.

For each block, we calculated its average relative yield, identified whether it ever experienced a yield penalty greater than 5%, and compiled its full sequence of seasonal performances. Based on these indicators, each farm was grouped into one of four long-term performance types: -Good: Average ≥ 0.95 and no season < 0.95 -Acceptable: Average ≥ 0.95 but at least one season < 0.95 -Poor: Average between 0.90 and 0.95 -Very Poor: Average ≤ 0.90

These thresholds were defined arbitrarily but intentionally, informed by field experience and our understanding of what yield reductions farmers may find acceptable if longer-term soil-health and economic benefits of CA are demonstrated and communicated clearly. Importantly, this typology is not final—it will be refined and validated through farmer consultations to ensure the categories and cut-off points align with farmers’ own perceptions of acceptable and unacceptable performance.

Notes:once again we see that there is around 18% of plots that could have an acceptable yield performance of the CA. This is of cource not considering all the soild benefits of this plots which are highly afftected by water erosion under conventional tillage, specially in maize-beans rotations. An important consideration is that this indictes that about 18 % of the plots coudl potentially be targeted with CA (if we just consider yield), however, farmers in Rwanda own on average 2.5 plots per hosehold (NISR 2023) so that would be about 39% of the farmers having at least one of their plots with acceptable performance of CA. Question remains…can we pre-identify this farmers and plots within the farm? Is it possible then to use that information to target the recommendatiosn to these farmers and plots within their farmes?… we will explore that below.

library(dplyr)
library(tidyr)

# Step 1: Define target seasons
seasons <- c("23A", "23B", "24A", "24B", "25A", "25B")

# Step 2: Compute number of valid seasons per block
blocks_with_enough_seasons <- block_yield %>%
  filter(IQR_Season %in% seasons, is.finite(Relative_Yield)) %>%
  group_by(IQR_block) %>%
  summarise(n_valid_seasons = n_distinct(IQR_Season), .groups = "drop") %>%
  filter(n_valid_seasons >= 4)

# Step 3: Filter main dataset to include only blocks with ≥4 seasons
filtered_yield <- block_yield %>%
  filter(IQR_block %in% blocks_with_enough_seasons$IQR_block,
         IQR_Season %in% seasons,
         is.finite(Relative_Yield))

# Step 4: Prepare wide-format Relative Yield table
RY_wide <- filtered_yield %>%
  group_by(IQR_block, IQR_Season) %>%
  summarise(Relative_Yield = mean(Relative_Yield, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(
    names_from = IQR_Season,
    values_from = Relative_Yield,
    names_prefix = "Relative_Yield_"
  )

# Step 5: Compute performance typology
Perf_Type <- filtered_yield %>%
  group_by(IQR_block) %>%
  summarise(
    n_seasons = n_distinct(IQR_Season),
    any_below_0.95 = any(Relative_Yield < 0.95),
    mean_ratio = mean(Relative_Yield, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    CA_typology = case_when(
      !any_below_0.95 ~ "Good",
      any_below_0.95 & mean_ratio >= 0.95 ~ "Acceptable",
      any_below_0.95 & mean_ratio < 0.95 & mean_ratio > 0.90 ~ "Poor",
      mean_ratio <= 0.90 ~ "Very Poor"
    )
  )

# Step 6 (fixed): Get block-level metadata using fallback from multiple seasons
block_meta <- block_yield %>%
  filter(IQR_Season %in% c("23A", "24A", "25A")) %>%
  arrange(IQR_block, match(IQR_Season, c("23A", "24A", "25A"))) %>%  # Prioritize 23A > 24A > 25A
  distinct(IQR_block, .keep_all = TRUE) %>%
  dplyr::select(IQR_block, IQF_environment, IQF_region, IQF_District)


# Step 7: Join all into Perf_Type
Perf_Type <- Perf_Type %>%
  left_join(RY_wide, by = "IQR_block") %>%
  left_join(block_meta, by = "IQR_block")

Now we analyse the frequency of each type

Same by environment:

Now we explore what are the drivers of performance in each season

With the selected variables we explore better one by one

ICR_N_perc, ICR_Org_C_perc, ICR_Avail_P_ppm, ICR_K_meq_100g, IQR_TF_tubura_client, ICR_age_hh_head, IQR_TF_role, ICR_HH_size, ICR_arable_land_owned, ICF_Lat, ICF_Long, ICF_Alt, IQR_plot_fert_quality, IQR_soil_texture, IQR_soil_color, IQF_Position_landscape, IQF_position_slope, IQF_Weed_specie, IQR_compost_quality, ICR_compost_tn_ha, IQR_types_compost, IQR_basic_crop, ICF_planting_date, ICR_avg_daily_rainfall

Nitrogen pre-planting - ICR_N_perc -

Colelcted only season 23A. Notes: Keep. has low N of missing values, high CV.

## # A tibble: 1 × 6
##   Total Missing Available  Mean    SD    CV
##   <int>   <int>     <int> <dbl> <dbl> <dbl>
## 1   556       4       552 0.221 0.137 0.621

Organic C - ICR_Org_C_perc -

Notes: Keep. for this one we have 3 seasons. Since they are all very complete, I will use the average of 23A and 24A measurements for the control treatment (Conventinal, A) I think this will give better estimate of the levels of OM of a block and in 2 years only very little change of OM in 0-30cm depth should be expected.

## # A tibble: 3 × 7
##   IQR_Season Total Missing Available  Mean    SD    CV
##   <chr>      <int>   <int>     <int> <dbl> <dbl> <dbl>
## 1 23A          556       4       552  2.10 0.878 0.419
## 2 24A          468      29       439  2.45 1.37  0.559
## 3 25A          445     160       285  3.19 1.54  0.484

K - ICR_K_meq_100g -

Notes: Keep. Colelcted only season 23A. Keep ir, has low N of missing values, high CV.

## # A tibble: 3 × 7
##   IQR_Season Total Missing Available    Mean     SD     CV
##   <chr>      <int>   <int>     <int>   <dbl>  <dbl>  <dbl>
## 1 23A          556       4       552   0.510  0.494  0.968
## 2 24A          468     468         0 NaN     NA     NA    
## 3 25A          445     445         0 NaN     NA     NA

Available P. - ICR_Avail_P_ppm -

Notes: Keep. We will use only TreatA as tillage may make more uniform the samplig within the plot. We will use average of the 3 samplings ti increase quality. Now sure whay so much difference among seasons of sampling. One would expect increase if any (becasue of DAP and compost applications)

## # A tibble: 3 × 9
##   IQR_Season Total Missing Available  Mean    SD    CV   Min   Max
##   <chr>      <int>   <int>     <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 23A          556       4       552  21.2  33.0  1.56  1.9   310 
## 2 24A          468      29       439  27.8  35.7  1.28  1.12  193.
## 3 25A          445     160       285  21.6  29.2  1.35  1.52  164.

HH Size - ICR_HH_size -

Notes: NOT Keep, not sure it is really a factor to consider, we just stay with biophisical+management only (or non-biophisical that we are sure relate to biophisicals + management we can not measure)

## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `is.finite(as.numeric(ICR_HH_size))`.
## Caused by warning:
## ! NAs introduced by coercion

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `ICR_HH_size = as.numeric(ICR_HH_size)`.
## Caused by warning:
## ! NAs introduced by coercion
## # A tibble: 3 × 10
##   IQR_Season Total Missing Available  Mean    SD    CV Non_NA_Non_Zero   Min
##   <chr>      <int>   <int>     <int> <dbl> <dbl> <dbl>           <int> <dbl>
## 1 23A          556       3       553  5.58  2.09 0.375             553     1
## 2 24A          468       0       468  5.68  2.00 0.352             468     1
## 3 25A          445       2       443  5.49  2.00 0.364             443     1
## # ℹ 1 more variable: Max <dbl>

Arable land Owned -ICR_arable_land_owned-

Comclusion: Keep. While not biophisical, may be related to them and to historical management, other crops, mulch pressure, etc. Something wrong with 23B. use average of all the rest

## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `is.finite(as.numeric(ICR_arable_land_owned))`.
## Caused by warning:
## ! NAs introduced by coercion

### ALtitude -ICF_Alt- Notes: Keep. Altitude was recorded each season. What is good as we can do average of recordingas and increase quality. We had a good CV, however, scores for 25A are a little off so for the averages we will exclude that season ### Rain (average daily) - ICR_avg_daily_rainfall - Notes. Keep. We will use the average of the 6 seasons but later we can explore better rain accumulated at different periods (ej.g., if we assume mucl can help early dorughts etc)

## # A tibble: 3 × 7
##   IQR_Season Total Missing Available  Mean    SD    CV
##   <chr>      <int>   <int>     <int> <dbl> <dbl> <dbl>
## 1 23A          556       0       556  5.04  1.67 0.332
## 2 24A          468      11       457  5.20  1.95 0.374
## 3 25A          445      12       433  4.15  1.49 0.358

Compost tons per ha - ICR_compost_tn_ha -

Notes: Keep. We can use average of the 6 season for the farmer as a management factor (biophisical + management). We also use quality and source of compost as corector factors for amount (see below)

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `ICR_compost_tn_ha = as.numeric(ICR_compost_tn_ha)`.
## Caused by warning:
## ! NAs introduced by coercion

Now we explore the qualitative variables

Tubura client - IQR_TF_tubura_client -

Notes: Keep. Not biophisical or management but related to them. Not perfectly balanced (ranging form 58 to 75 % of tubura clients) but goo denough as to beinclued considering how relevant it can be. A binary variable is often considered too imbalanced if one class is >90–95%, especially in ML models, so are well below that threshold.

Plot fertility as perceibed by the farmer - IQR_plot_fert_quality -

Notes: Keep with caution, “low” is very low represented (may consider to combine with medium), and Medium is very over-represented ### Now for soil texture - IQR_soil_texture - Notes: Keep. We have a nice representativeness of different soils and quite balanced althouth the 3-Clay may be better to combine in 1 as they have few reps in each cathegory. For the ones that are NA in season 23A, get the data from 24A of 24B ### Soil color -IQR_soil_color- Notes: Keep. We have a nice representativeness of different soils and quite balanced althouth better combine red less with red more. Also see if can combine the soil in KY with other. For the ones that are NA in season 23A, get the data from 24A of 24B ### Position in the lanscape - IQF_Position_landscape - Notes: Keep but extremely unbalanced so with caution. The category hills completely dominates (~90%), the other two categories are rare, each <5%. In mix models the low N cathegories will have high SE, nit very informative. RF handles imbalanced categorical predictors fine, but will provably have low relevance (even if it may have it). SO maybe include it in a first ran of mix and RF see the results and base on that remove. They can look indivudually at how the scores of relative yield are in each environment for each category (maybe looking for the blocks that are most similar for other factors Paired comparison?) ### Position on the slope - IQF_position_slope - Notes: Keep but try to find/impute the 10% missing data (not recorede in other seasons, but maybe can be imputed from other variables like position in the landscape or if we measure slope later). As is can be used for mixed models although limited as quite umbalanced and several cathegoreis low represented. For RF is of as is

## 🔎 Total NA blocks in 23A: 0
## ✅ Recovered from 24A: 0

Weed species IQR_Weedtype1_specie - IQR_Weedtype1_specie -

Notess: Keep. But Generally not suitable for mix models and even for RF not useful in its current form. Having too many rare categories (singletons) would cause unstable model estimation, produce large standard errors and potentially lead to singular fits or convergence issues. We can group the ones in minority by weed type (broad leaf and grasses). We Keep individually becuase they have large number: Couch_grass
Galinsoga_parviflora And weechweed because of its relevance (striga). It has N=19, what is not so bad. Correct urukuta = Commelina_benghalensis

### IQR_compost_quality - IQR_compost_quality - For compose we need to get average of the seasons, so we need to conver the qualitative scale to numeric and then do average. “poor” → 1 “average” → 2 “well_decomposed” → 3. While there is almost half of farmers that are very onsistent on having good, average of poor compost along the seaosns,another amount moves from one to another. So we better do average of the score along seasons and then use the numerical scale rather than the factorial Not sure how to use compost quality tough…it seems that it may no be an indicator strong enough on its own but we can also think about using it as a correction factor for compost amount; e.g. average compost leaves amount =, food it increases by a % and poor decreses it by another %??? *Note: interesting to seee how there seems to be a trent of better quality in zones with higuer temperatures…what could be expected by faster decomposition Here an analysis hut to check how cuality changes across environments and that areas with lower T have lower quality (less decomposition rate???)

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: compost_score ~ IQF_environment + (1 | IQR_block)
##    Data: block_yield
## 
## REML criterion at convergence: 4039.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.8125 -0.4308  0.1464  0.4646  3.4850 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  IQR_block (Intercept) 0.1213   0.3482  
##  Residual              0.1773   0.4211  
## Number of obs: 2886, groups:  IQR_block, 569
## 
## Fixed effects:
##                             Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)                2.349e+00  3.149e-02 9.780e+02  74.610  < 2e-16 ***
## IQF_environmentMid_Low_Dry 3.362e-01  3.516e-02 1.969e+03   9.561  < 2e-16 ***
## IQF_environmentMid_Low_Wet 1.699e-01  4.166e-02 1.111e+03   4.079 4.85e-05 ***
## IQF_environmentTransition  5.268e-03  4.574e-02 1.011e+03   0.115    0.908    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) IQF_M_L_D IQF_M_L_W
## IQF_nvM_L_D -0.716                    
## IQF_nvM_L_W -0.732  0.577             
## IQF_nvrnmnT -0.672  0.494     0.593
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
##  IQF_environment emmean     SE   df lower.CL upper.CL .group
##  Highland          2.35 0.0315 1060     2.27     2.43  1    
##  Transition        2.35 0.0339  960     2.27     2.44  1    
##  Mid_Low_Wet       2.52 0.0284  975     2.45     2.59   2   
##  Mid_Low_Dry       2.69 0.0254  922     2.62     2.75    3  
## 
## Degrees-of-freedom method: kenward-roger 
## Confidence level used: 0.95 
## Conf-level adjustment: sidak method for 4 estimates 
## P value adjustment: tukey method for comparing a family of 4 estimates 
## significance level used: alpha = 0.05 
## NOTE: If two or more means share the same grouping symbol,
##       then we cannot show them to be different.
##       But we also did not show them to be the same.

Now we explore if there is a trend on farmers that usually used good /poor compost or change a lot from season to season

## Warning in setup_data(...): Some differentiation aesthetics vary within alluvia, and will be diffused by their first value.
## Consider using `geom_flow()` instead.

## Warning in setup_data(...): Some differentiation aesthetics vary within alluvia, and will be diffused by their first value.
## Consider using `geom_flow()` instead.

## Warning in setup_data(...): Some differentiation aesthetics vary within alluvia, and will be diffused by their first value.
## Consider using `geom_flow()` instead.

### Compost types - IQR_types_compost - Notes: Do not keep, most on the same, too unbalanced. In any case can then evluate individually ### Basic crop Notes: Not to keep. Althous we can explore how buch and beans behave udner CA. Limitations is that there will be lots of confunding factors associeted to the specie, but maybe we can find some that are similar context with differnet crop?? ### Planting date Notes. Keep. We have a nice range within each district-season, what makes it very meaningful the planting dates comparison. Another relevant management practice. However, since planting date is relative, to others is same context, we will make it relative to the 1st planting date in the same district the same season. So we ID the first planting date and give a 1 and from there we add delas from that.

First we make sure we align formationgs of dates. Then we create a verable that is a date that is more meaningful; days from the time the first block was planted inthe same district-season combination.

## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Figure

So the overall Notess of variables we keep fro drivers analysis: N – ICR_N_perc. Collected only in 23A; low missing values but high CV.Keep as is; single-season snapshot. Organic C – ICR_Org_C_perc. Available for 3 seasons; very complete. Use the average of 23A and 24A for conventional plots to better represent block OM (minimal expected OM change over 2 years). K – ICR_K_meq_100g. Collected only in 23A; low missing values, high CV. Keep as is. Available P – ICR_Avail_P_ppm. Use only Treatment A (to avoid tillage sampling effects). Average the 3 sampling seasons to improve quality; unclear why values differ so much. Arable Land Owned – ICR_arable_land_owned. Socio-management variable; may relate to soil, crops, mulch, history. Exclude 23B (erroneous) and use average of remaining data. Altitude – ICF_Alt. Recorded each season; good CV. Average across seasons but exclude 25A (values off). Rainfall (avg daily) – ICR_avg_daily_rainfall. Use average across 6 seasons for now. Later explore rainfall accumulation during key crop periods. Compost (t/ha) – ICR_compost_tn_ha. Use 6-season average as a management factor. Include compost source/quality as correction factors. Tubura Client – IQR_TF_tubura_client. Binary, moderately imbalanced (58–75% “yes”); acceptable for models. Relevant socio-management factor; include. Perceived Plot Fertility – IQR_plot_fert_quality. “Low” barely represented; consider merging with “Medium.Medium over-represented, so use with caution. Soil Texture – IQR_soil_texture. Generally well-represented and balanced. Combine rare clay categories; fill 23A missing values using 24A/24B. Soil Color – IQR_soil_color. Good distribution; may merge “red less” with “red more.” Check whether KY combines with another group; fill missing as above. Position in the Landscape – IQF_Position_landscape. Extremely unbalanced (~90% hills). Include initially in models, but likely low information; inspect performance before final inclusion. Position on the Slope – IQF_position_slope. 10% missing; attempt imputation from related variables. Usable in mixed models but unbalanced; RF can handle as is. Weed Species – IQR_Weedtype1_specie. Too many rare categories for modeling; group minority species into grass/broadleaf. Keep key species separately: couch grass, Galinsoga parviflora, and witchweed (Striga, N=19). Compost Quality – IQR_compost_quality. Convert qualitative scale to numeric (1–3) and average across seasons. Potentially useful as correction factor for compost amount; quality trends relate to warmer zones.

Now we prepare the database that will have the selected drivers properly calculated

Calculations will be one per season-block combination (“seasonal”) and then one that will be calculated for the block along seasons and used when we evaluate divers of avege along seasons or the “dynamic performance” (“combined”)

We strat by creating a database called “yield_drivers” that we will then join to block_yield

Driver 1: ICR_N_perc We now extract the data from 23A (only collected that season) and add it to drivers

check

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0500  0.1500  0.1800  0.2213  0.2500  1.0800      18

Driver 2: ICR_Org_C_perc
for this one we have 3 seasons. Since they are all very complete, I will use the average of 23A and 24A measurements for the control treatment (Conventinal, A) I think this will give better estimate of the levels of OM of a block and in 2 years only very little change of OM in 0-30cm depth should be expected. So we will get it from the database yield_CA_comb, filtering for Treat_group = CONV

check

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.480   1.620   2.100   2.272   2.763   9.930       2

Driver 3: ICR_K_meq_100g We now extract the data from 23A (only collected that season) and add it to drivers

check

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0500  0.2000  0.3300  0.5101  0.6600  3.4100      18

Driver 4 ICR_Avail_P_ppm We will use only Treat A as tillage may make more uniform the samplig within the plot. We will use average of the 3 samplings ti increase quality. Now sure whay so much difference among seasons of sampling. One would expect increase if any (becasue of DAP and compost applications)

check

## Warning: Unknown or uninitialised column: `ICR_Avail_P_ppm`.
## Length  Class   Mode 
##      0   NULL   NULL

Driver 5 ICR_arable_land_owned
While not biophisical, may be related to them and to historical management, other crops, mulch pressure, etc. Something wrong with 23B. use average of all the rest

check

## Warning: Unknown or uninitialised column: `ICR_arable_land_owned`.
## Length  Class   Mode 
##      0   NULL   NULL

Driver 6 ICF_Alt Altitude was recorded each season. What is good as we can do average of recordingas and increase quality. We had a good CV, however, scores for 25A are a little off so for the averages we will exclude that season. Also values <500 are not included in the avergaes

Check

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1001    1462    1660    1692    1877    2434       1

Driver 7 ICR_avg_daily_rainfall We will use the average of the 6 seasons but later we can explore better rain accumulated at different periods (ej.g., if we assume mucl can help early dorughts etc). Use only CONV (Treat_group = CONV). If rainfall is 0 or NA, replace using: same season type (A with A, B with B) from previous or next year if available Example: Replace missing 24A with value from 23A or 25A Replace missing 24B with value from 23B or 25B Compute block-level average rainfall after imputation.Join into yield_drivers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6817  3.3583  3.9783  4.1446  4.8850  9.5367

Driver 8 ICR_compost_tn_ha We use average of the 6 season for the farmer as a management factor (biophisical + management). We also can use quality and source of compost as corector factors for amount, but for now we usem individually

Check:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   8.787  10.544  11.008  13.032  35.037

Driver 9 IQR_TF_tubura_client Not biophisical or management but related to them. Not perfectly balanced (ranging form 58 to 75 % of tubura clients) but good enough as to be inclued considering how relevant it can be. A binary variable is often considered too imbalanced if one class is >90–95%, especially in ML models, so are well below that threshold.

Check

## 
##  --- #N/A    0    1 
##    3    1  192  373

Driver 10 IQR_plot_fert_quality
WE will use with caution, “low” is very low represented (may consider to combine with medium). We use 23A data

cehck

## 
##         ---        high low_quality      medium        <NA> 
##           4          76          12         474           3

See if we can find the missing data in other seasons

## [1] "115_11" "115_12" "43_1"   "44_3"
library(dplyr)

blocks_to_check <- c("43_1", "44_3")

Fert_check <- yield_CA_comb %>%
  filter(IQR_block %in% blocks_to_check) %>%
  dplyr::select(IQR_block, IQR_Season, Treat_group, IQR_plot_fert_quality) %>%
  arrange(IQR_block, IQR_Season, Treat_group)

Fert_check
## # A tibble: 16 × 4
##    IQR_block IQR_Season Treat_group IQR_plot_fert_quality
##    <chr>     <chr>      <chr>       <chr>                
##  1 43_1      23A        CA          ---                  
##  2 43_1      23A        CONV        ---                  
##  3 43_1      23B        CA          medium               
##  4 43_1      23B        CONV        medium               
##  5 44_3      23A        CA          ---                  
##  6 44_3      23A        CONV        ---                  
##  7 44_3      23B        CA          medium               
##  8 44_3      23B        CONV        medium               
##  9 44_3      24A        CA          medium               
## 10 44_3      24A        CONV        medium               
## 11 44_3      24B        CA          medium               
## 12 44_3      24B        CONV        medium               
## 13 44_3      25A        CA          medium               
## 14 44_3      25A        CONV        medium               
## 15 44_3      25B        CA          medium               
## 16 44_3      25B        CONV        medium

Replase the 2 — in yield_drivers

Driver 11 IQR_soil_texture We have a nice representativeness of different soils and quite balanced althouth the 3-Clay may be better to combine in 1 as they have few reps in each cathegory. For the ones that are NA in season 23A, get the data from 24A of 24B

Now we try to see if we can find values for the NA is other seasons

## # A tibble: 116 × 4
##    IQR_block IQR_Season Treat_group IQR_soil_texture   
##    <chr>     <chr>      <chr>       <chr>              
##  1 132_2     24B        CA          F_(sandy_clay_loam)
##  2 132_2     24B        CONV        F_(sandy_clay_loam)
##  3 132_2     25A        CA          F_(sandy_clay_loam)
##  4 132_2     25A        CONV        F_(sandy_clay_loam)
##  5 132_2     25B        CA          F_(sandy_clay_loam)
##  6 132_2     25B        CONV        F_(sandy_clay_loam)
##  7 132_3     24B        CA          F_(sandy_clay_loam)
##  8 132_3     24B        CONV        F_(sandy_clay_loam)
##  9 132_3     25A        CA          F_(sandy_clay_loam)
## 10 132_3     25A        CONV        F_(sandy_clay_loam)
## # ℹ 106 more rows
## # A tibble: 70 × 12
##    IQR_block ICR_N_perc_23A ICR_Org_C_avg ICR_K_perc_23A ICR_Avail_P_avg
##    <chr>              <dbl>         <dbl>          <dbl>           <dbl>
##  1 10_12               0.19          2.02           2.7            148  
##  2 10_9                0.13          1.78           0.88            56  
##  3 102_1               0.26          2.69           0.73             8.2
##  4 102_8               0.13          1.46           0.29            20.2
##  5 102_9               0.24          2.88           0.96             8.2
##  6 103_12              0.45          2.12           0.36             2.3
##  7 103_4               0.21          2.14           1.03             2.7
##  8 110_11              0.34          2.69           0.25             8.5
##  9 111_2               0.46          3.96           0.16             7  
## 10 111_5               0.33          3.32           0.22             9.7
## # ℹ 60 more rows
## # ℹ 7 more variables: ICR_arable_land_avg <dbl>, ICF_Alt_avg <dbl>,
## #   ICR_rainfall_avg <dbl>, ICR_compost_avg <dbl>, IQR_TF_tubura_client <chr>,
## #   IQR_plot_fert_quality <chr>, IQR_soil_texture <chr>
## # A tibble: 70 × 12
##    IQR_block ICR_N_perc_23A ICR_Org_C_avg ICR_K_perc_23A ICR_Avail_P_avg
##    <chr>              <dbl>         <dbl>          <dbl>           <dbl>
##  1 10_12               0.19          2.02           2.7            148  
##  2 10_9                0.13          1.78           0.88            56  
##  3 102_1               0.26          2.69           0.73             8.2
##  4 102_8               0.13          1.46           0.29            20.2
##  5 102_9               0.24          2.88           0.96             8.2
##  6 103_12              0.45          2.12           0.36             2.3
##  7 103_4               0.21          2.14           1.03             2.7
##  8 110_11              0.34          2.69           0.25             8.5
##  9 111_2               0.46          3.96           0.16             7  
## 10 111_5               0.33          3.32           0.22             9.7
## # ℹ 60 more rows
## # ℹ 7 more variables: ICR_arable_land_avg <dbl>, ICF_Alt_avg <dbl>,
## #   ICR_rainfall_avg <dbl>, ICR_compost_avg <dbl>, IQR_TF_tubura_client <chr>,
## #   IQR_plot_fert_quality <chr>, IQR_soil_texture <chr>

chekc:

library(dplyr)

# Frequency table
soiltexture_table <- table(yield_drivers$IQR_soil_texture)

# Barplot
barplot(
  soiltexture_table,
  main = "Soil Texture Distribution",
  xlab = "Soil Texture",
  ylab = "Frequency",
  col = "burlywood",
  las = 2
)

# we combine by similar general texture into 3 cathegories

library(dplyr)

# Collapse IQR_soil_texture into Coarse / Mix / Fine
yield_drivers <- yield_drivers %>%
  mutate(
    IQR_soil_texture = case_when(
      grepl("^[A-Ca-c]", IQR_soil_texture) ~ "Coarse",
      grepl("^[D-Gd-g]", IQR_soil_texture) ~ "Mix",
      grepl("^[H-Kh-k]", IQR_soil_texture) ~ "Fine",
      TRUE ~ NA_character_
    )
  )

chekc:

library(dplyr)

# Frequency table
soiltexture_table <- table(yield_drivers$IQR_soil_texture)

# Barplot
barplot(
  soiltexture_table,
  main = "Soil Texture Distribution",
  xlab = "Soil Texture",
  ylab = "Frequency",
  col = "burlywood",
  las = 2
)

Driver 12 IQR_soil_color We have a nice representativeness of different soils and quite balanced althouth better combine red less with red more. For the ones that are NA in season 23A, get the data from 24A of 24B

## 
##                  #N/A                 black                 brown 
##                    86                   276                    33 
## ivu_busa_nibara_ryivu                   red               Unknown 
##                    17                   298                     1 
##                  <NA> 
##                     0

check

library(dplyr)

# Frequency table

soilcolor_table <- table(yield_drivers$IQR_soil_color)

# Barplot

barplot(
soilcolor_table,
main = "Soil Color Distribution",
xlab = "Soil Color",
ylab = "Frequency",
col = "tan",
las = 2
)

Driver 13 IQF_Position_landscape Extremely unbalanced so with caution. The category hills completely dominates (~90%), the other two categories are rare, each <5%. For now we leave it as si, but then we can combine them, or we just remove it since this will be somehow captured by position on the slope

## 
## FALSE  TRUE 
##   706     5
## [1] "hills" "hills" "hills" "hills" "hills" "hills"

now we complete missing data from 23 A CONV with other seasons and CA

## 
## FALSE 
##   711
## 
##              ---            hills Radical_terraces           Valley 
##                5              646               30               30

check

Driver 14 IQF_position_slope Find/impute the 10% missing data (not recorede in other seasons, but maybe can be imputed from other variables like position in the landscape or if we measure slope later). As is can be used for mixed models although limited as quite umbalanced and several cathegoreis low represented. For RF is of as is

check

Try to fill with data from other seasons/ and CA

## 
## Final slope distribution:
## 
##       --- Backslope      Flat Footslope  Shoulder    Summit 
##        65       383        54       101        62        46
## 
## Missing after fill:
## [1] 0

Try to imputate using data from position on the landscape; if valley or terraze = “Flat”

check Driver 15 IQR_Weedtype1_specie
Having too many rare categories (singletons) would cause unstable model estimation, produce large standard errors and potentially lead to singular fits or convergence issues. I grouped the ones in minority by weed type (broad leaf and grasses). We Keep individually becuase they have large number the Couch_grass and Galinsoga_parviflora, and weechweed because of its relevance (striga; with N=19, what is not so bad). I also corrected urukuta = Commelina_benghalensis

First for season A, getting from 23A

## 
## FALSE  TRUE 
##   692    19

Now we fill gaps for season A with 24A and for season B with 24B

Replace “urukuta” with “Commelina_benghalensis”

check Now we keep priginal specie ID for couch grass and Galinsoga and Stiga, but for the rest we clasiffy in grass or broadleaf

Check distribution again

Driver 16 IQR_compost_quality For compose we need to get average of the seasons, so we need to conver the qualitative scale to numeric and then do average. “poor” → 1 “average” → 2 “well_decomposed” → 3. Not sure how to use compost quality tough…it seems that it may no be an indicator strong enough on its own but we can also think about using it as a correction factor for compost amount; e.g. average compost leaves amount =, food it increases by a % and poor decreses it by another %??? For now we just keep it individual *Note: interesting to seee how there seems to be a trent of better quality in zones with higuer temperatures…what could be expected by faster decomposition

Check

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.000   2.250   2.667   2.587   3.000   3.000       1

We will create a compost quantyty corrected by compost quality to reduce the number of variables. This will be called Comp_amount_corr_quali in the database yield_drivers we will make a new variable called “Comp_amount_corr_quali” that will be calculated as yield_drivers:ICR_compost_avg corrected based on “yield_drivers:IQR_compost_quality_avg” as correction factor = (“yield_drivers:IQR_compost_quality_avg”/5 ) * 1.2

yield_drivers <- yield_drivers %>%
  mutate(
    Comp_amount_corr_quali =
      ICR_compost_avg * ((IQR_compost_quality_avg / 5) * 1.2)
  )

Driver 17 ICF_planting_date We have a nice range within each district-season, what makes it very meaningful the planting dates comparison. Another relevant management practice. However, since planting date is relative, to others is same context, we will make it relative to the 1st planting date in the same district the same season. So we ID the first planting date and give a 1 and from there we add delas from that.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    8.00   12.00   13.11   16.75   45.00       1

#Now we have the 17 drivers at the “Block” level we can evaluate some correlations and ponetial eliminatinos/modifications First Merge yield drivers into block_yield database;

## Warning in left_join(., block_yield %>% dplyr::select(IQR_block, CA_typology, : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 2 of `x` matches multiple rows in `y`.
## ℹ Row 2 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

Now we check for edundant predictors; identify multicollinearity and try to understand relationships among numeric AND categorical drivers

Split drivers into numeric and categorical

Correlation matrix for NUMERIC predictors Notes: - Compost_avg, compost_quality_avg, rainfall_avg, and altitude form a strong multicollinearity cluster. - Rain and altitude are a well known correlation, and altitude is also highly correlated to the “Environmental” class. So we will not use altitude and we will keep rain (as is for these specific seasons and not the historical) and the environmental class. - Compost quality cound be used a a factor to correct compost cuantity. I will us N is also highly corretated to soil C so will just keep soil C for now - Position in the slope and position in the landscape are also higuly correlated. Posiion in the slope gives us most information since the bast mayority of the farmers are in the hills. SO we will only keep position in the slope

Categorical vs categorical: Cramer’s V matrix Notes Position in the slope and position in the landscape my by redundat Numeric vs categorical: ANOVA or Eta²: Eta (η) measures how strongly a numeric variable differs across the categories of a categorical variable, indicating how much of the numeric variance is explained by group differences. Notes: as expected altitude and OAF-AEZ (Environment) are very correlated, maybe we wliminate altitude

## Warning: package 'DescTools' was built under R version 4.3.3
## 
## Attaching package: 'DescTools'
## The following object is masked from 'package:car':
## 
##     Recode
##                    Category                 Numeric         Eta2
## 1                 IQR_block          ICR_N_perc_23A 1.000000e+00
## 2                 IQR_block           ICR_Org_C_avg 1.000000e+00
## 3                 IQR_block          ICR_K_perc_23A 1.000000e+00
## 4                 IQR_block         ICR_Avail_P_avg 1.000000e+00
## 5                 IQR_block     ICR_arable_land_avg 1.000000e+00
## 6                 IQR_block             ICF_Alt_avg 1.000000e+00
## 7                 IQR_block        ICR_rainfall_avg 1.000000e+00
## 8                 IQR_block         ICR_compost_avg 1.000000e+00
## 9                 IQR_block IQR_compost_quality_avg 1.000000e+00
## 10                IQR_block  Comp_amount_corr_quali 1.000000e+00
## 11                IQR_block       ICF_planting_date 1.000000e+00
## 12                IQR_block      Relative_Yield_avg 1.000000e+00
## 13     IQR_TF_tubura_client          ICR_N_perc_23A 9.499414e-06
## 14     IQR_TF_tubura_client           ICR_Org_C_avg 1.307843e-04
## 15     IQR_TF_tubura_client          ICR_K_perc_23A 8.731820e-03
## 16     IQR_TF_tubura_client         ICR_Avail_P_avg 2.457738e-03
## 17     IQR_TF_tubura_client     ICR_arable_land_avg 8.263275e-02
## 18     IQR_TF_tubura_client             ICF_Alt_avg 2.570057e-03
## 19     IQR_TF_tubura_client        ICR_rainfall_avg 4.516771e-03
## 20     IQR_TF_tubura_client         ICR_compost_avg 5.183558e-04
## 21     IQR_TF_tubura_client IQR_compost_quality_avg 1.360943e-03
## 22     IQR_TF_tubura_client  Comp_amount_corr_quali 1.205228e-03
## 23     IQR_TF_tubura_client       ICF_planting_date 8.493785e-06
## 24     IQR_TF_tubura_client      Relative_Yield_avg 1.324763e-02
## 25    IQR_plot_fert_quality          ICR_N_perc_23A 5.047534e-03
## 26    IQR_plot_fert_quality           ICR_Org_C_avg 2.972169e-05
## 27    IQR_plot_fert_quality          ICR_K_perc_23A 2.236591e-02
## 28    IQR_plot_fert_quality         ICR_Avail_P_avg 2.587272e-03
## 29    IQR_plot_fert_quality     ICR_arable_land_avg 7.195146e-03
## 30    IQR_plot_fert_quality             ICF_Alt_avg 3.244234e-02
## 31    IQR_plot_fert_quality        ICR_rainfall_avg 4.841510e-03
## 32    IQR_plot_fert_quality         ICR_compost_avg 5.574060e-03
## 33    IQR_plot_fert_quality IQR_compost_quality_avg 1.140834e-02
## 34    IQR_plot_fert_quality  Comp_amount_corr_quali 3.443783e-03
## 35    IQR_plot_fert_quality       ICF_planting_date 5.471574e-03
## 36    IQR_plot_fert_quality      Relative_Yield_avg 7.641108e-03
## 37         IQR_soil_texture          ICR_N_perc_23A 5.665438e-02
## 38         IQR_soil_texture           ICR_Org_C_avg 4.545028e-02
## 39         IQR_soil_texture          ICR_K_perc_23A 7.320604e-02
## 40         IQR_soil_texture         ICR_Avail_P_avg 4.212296e-02
## 41         IQR_soil_texture     ICR_arable_land_avg 2.219589e-02
## 42         IQR_soil_texture             ICF_Alt_avg 8.530878e-04
## 43         IQR_soil_texture        ICR_rainfall_avg 1.605316e-01
## 44         IQR_soil_texture         ICR_compost_avg 8.193033e-02
## 45         IQR_soil_texture IQR_compost_quality_avg 5.572072e-02
## 46         IQR_soil_texture  Comp_amount_corr_quali 4.650686e-02
## 47         IQR_soil_texture       ICF_planting_date 2.947608e-04
## 48         IQR_soil_texture      Relative_Yield_avg 2.564590e-02
## 49           IQR_soil_color          ICR_N_perc_23A 4.018970e-02
## 50           IQR_soil_color           ICR_Org_C_avg 4.207381e-02
## 51           IQR_soil_color          ICR_K_perc_23A 2.974066e-02
## 52           IQR_soil_color         ICR_Avail_P_avg 4.308049e-02
## 53           IQR_soil_color     ICR_arable_land_avg 3.911583e-02
## 54           IQR_soil_color             ICF_Alt_avg 1.302196e-03
## 55           IQR_soil_color        ICR_rainfall_avg 6.138614e-02
## 56           IQR_soil_color         ICR_compost_avg 2.639538e-02
## 57           IQR_soil_color IQR_compost_quality_avg 6.385239e-02
## 58           IQR_soil_color  Comp_amount_corr_quali 9.934977e-03
## 59           IQR_soil_color       ICF_planting_date 4.523329e-02
## 60           IQR_soil_color      Relative_Yield_avg 1.784712e-03
## 61   IQF_Position_landscape          ICR_N_perc_23A 7.282323e-02
## 62   IQF_Position_landscape           ICR_Org_C_avg 5.307091e-02
## 63   IQF_Position_landscape          ICR_K_perc_23A 5.573645e-04
## 64   IQF_Position_landscape         ICR_Avail_P_avg 1.452066e-03
## 65   IQF_Position_landscape     ICR_arable_land_avg 9.644310e-04
## 66   IQF_Position_landscape             ICF_Alt_avg 9.164373e-03
## 67   IQF_Position_landscape        ICR_rainfall_avg 2.423319e-03
## 68   IQF_Position_landscape         ICR_compost_avg 5.658593e-03
## 69   IQF_Position_landscape IQR_compost_quality_avg 3.930128e-03
## 70   IQF_Position_landscape  Comp_amount_corr_quali 1.432317e-02
## 71   IQF_Position_landscape       ICF_planting_date 1.065922e-02
## 72   IQF_Position_landscape      Relative_Yield_avg 1.245701e-02
## 73       IQF_position_slope          ICR_N_perc_23A 4.835360e-02
## 74       IQF_position_slope           ICR_Org_C_avg 2.401151e-02
## 75       IQF_position_slope          ICR_K_perc_23A 4.099797e-02
## 76       IQF_position_slope         ICR_Avail_P_avg 6.078855e-03
## 77       IQF_position_slope     ICR_arable_land_avg 3.254918e-02
## 78       IQF_position_slope             ICF_Alt_avg 5.987319e-03
## 79       IQF_position_slope        ICR_rainfall_avg 2.729358e-02
## 80       IQF_position_slope         ICR_compost_avg 1.223088e-02
## 81       IQF_position_slope IQR_compost_quality_avg 3.582511e-02
## 82       IQF_position_slope  Comp_amount_corr_quali 2.214457e-02
## 83       IQF_position_slope       ICF_planting_date 1.700734e-02
## 84       IQF_position_slope      Relative_Yield_avg 5.213529e-03
## 85  Weed_species_A_combined          ICR_N_perc_23A 1.931955e-02
## 86  Weed_species_A_combined           ICR_Org_C_avg 6.532074e-03
## 87  Weed_species_A_combined          ICR_K_perc_23A 1.206952e-02
## 88  Weed_species_A_combined         ICR_Avail_P_avg 7.180210e-03
## 89  Weed_species_A_combined     ICR_arable_land_avg 4.405571e-02
## 90  Weed_species_A_combined             ICF_Alt_avg 2.723552e-02
## 91  Weed_species_A_combined        ICR_rainfall_avg 8.108810e-02
## 92  Weed_species_A_combined         ICR_compost_avg 2.328252e-02
## 93  Weed_species_A_combined IQR_compost_quality_avg 3.932069e-02
## 94  Weed_species_A_combined  Comp_amount_corr_quali 3.030242e-02
## 95  Weed_species_A_combined       ICF_planting_date 6.861499e-03
## 96  Weed_species_A_combined      Relative_Yield_avg 1.554917e-02
## 97              CA_typology          ICR_N_perc_23A 8.332342e-03
## 98              CA_typology           ICR_Org_C_avg 1.218981e-02
## 99              CA_typology          ICR_K_perc_23A 2.290736e-03
## 100             CA_typology         ICR_Avail_P_avg 5.332245e-03
## 101             CA_typology     ICR_arable_land_avg 7.368707e-03
## 102             CA_typology             ICF_Alt_avg 1.079567e-02
## 103             CA_typology        ICR_rainfall_avg 7.342257e-03
## 104             CA_typology         ICR_compost_avg 2.420866e-02
## 105             CA_typology IQR_compost_quality_avg 5.969219e-03
## 106             CA_typology  Comp_amount_corr_quali 3.516601e-02
## 107             CA_typology       ICF_planting_date 6.772195e-03
## 108             CA_typology      Relative_Yield_avg 5.967566e-01
## 109         IQF_environment          ICR_N_perc_23A 3.649840e-02
## 110         IQF_environment           ICR_Org_C_avg 2.872610e-02
## 111         IQF_environment          ICR_K_perc_23A 5.548726e-02
## 112         IQF_environment         ICR_Avail_P_avg 3.154359e-02
## 113         IQF_environment     ICR_arable_land_avg 9.617543e-02
## 114         IQF_environment             ICF_Alt_avg 6.115825e-01
## 115         IQF_environment        ICR_rainfall_avg 4.269173e-01
## 116         IQF_environment         ICR_compost_avg 1.784404e-01
## 117         IQF_environment IQR_compost_quality_avg 2.755039e-01
## 118         IQF_environment  Comp_amount_corr_quali 3.970855e-02
## 119         IQF_environment       ICF_planting_date 2.008632e-01
## 120         IQF_environment      Relative_Yield_avg 2.747147e-02
Eta² value Interpretation
0.00–0.01 Negligible effect
0.01–0.06 Small effect
0.06–0.14 Medium effect
> 0.14 Large effect

#Now we look at the drivers of performance dynamic Random forest for the output variable of average relative yield of CA vs CONV across the 6 seasons in each farm (Relative_yield_avg) with the list of potential drives as is

Prepare the dataset for Random Forest We explude N, altitude, position in the lanscape

Notes: Very high % Var explained: 95.7 makes us suspect something is wrong with the variable s(too many amd/or too many cathergores within each with few representations). This is provably because each of the 570 rows represents a unique block, and the predictors form a near-unique fingerprint for each block, allowing the forest to memorise block-level average performance rather than learn generalisable drivers.

RF data

library(dplyr)
library(forcats)
library(randomForest)

set.seed(123)

#────────────────────────────────────
# 1. Prepare data
#────────────────────────────────────

rf_data_clean <- rf_data %>%
  mutate(
    across(where(is.character), as.factor),
    across(where(is.numeric), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)),
    across(where(is.factor), ~ fct_na_value_to_level(.x, level = "Unknown"))
  )

#────────────────────────────────────
# 2. 80–20 train / test split
#────────────────────────────────────

n <- nrow(rf_data_clean)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

train_data <- rf_data_clean[train_idx, ]
test_data  <- rf_data_clean[-train_idx, ]

#────────────────────────────────────
# 3. Fit Random Forest on training set
#────────────────────────────────────

rf_model <- randomForest(
  Relative_Yield_avg ~ .,
  data = train_data,
  ntree = 1000,
  importance = TRUE
)

print(rf_model)
## 
## Call:
##  randomForest(formula = Relative_Yield_avg ~ ., data = train_data,      ntree = 1000, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 0.001492501
##                     % Var explained: 95.7
#────────────────────────────────────
# 4. Evaluate on test set
#────────────────────────────────────

pred <- predict(rf_model, newdata = test_data)

RMSE <- sqrt(mean((test_data$Relative_Yield_avg - pred)^2))
R2   <- cor(test_data$Relative_Yield_avg, pred)^2

cat("Test RMSE:", RMSE, "\n")
## Test RMSE: 0.02752971
cat("Test R²:", R2, "\n")
## Test R²: 0.9794577
#────────────────────────────────────
# 5. Variable importance (training-based)
#────────────────────────────────────

varImpPlot(rf_model, type = 1)

# IMproved Random Forest goruping by envoronment Respondes to :Which drivers of CA–CONV yield differences generalize across environments?Because each block was observed only once, cross-validation could not be grouped at the block level; therefore, we grouped by environment to evaluate whether drivers of CA–CONV yield differences were transferable across agro-environmental contexts rather than environment-specific

We used a random forest regression to identify the main drivers of the yield difference between conservation agriculture (CA) and conventional management (CONV). The model was trained using leave-environment-out cross-validation, grouping observations by agro-environmental zone, to prevent information leakage and ensure that results reflect drivers that generalize across environments rather than environment-specific effects. By design, this approach removes variance attributable purely to environmental differences, providing a conservative estimate of explanatory power. The model explained approximately 20–25% of the variance in CA–CONV yield differences across unseen environments, indicating that a limited set of soil and management variables systematically modulates CA performance. Variable importance analysis showed that planting date, rainfall, baseline soil fertility (organic carbon, potassium, and phosphorus), and compost application were the most influential drivers, while weed identity and fertilizer quality played a minor role. These results suggest that CA advantages are context-dependent and are primarily shaped by climatic conditions and soil fertility rather than by management labels alone.

library(dplyr)
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:DescTools':
## 
##     MAE, RMSE
## The following object is masked from 'package:purrr':
## 
##     lift
## The following object is masked from 'package:survival':
## 
##     cluster
library(ranger)
## Warning: package 'ranger' was built under R version 4.3.3
## 
## Attaching package: 'ranger'
## The following object is masked from 'package:randomForest':
## 
##     importance
library(forcats)

set.seed(123)

#────────────────────────────────────────
# 1. PREPARE DATA
#────────────────────────────────────────

rf_df <- rf_data_clean %>%
  mutate(
    # make sure factors are clean
    across(where(is.factor), ~ fct_na_value_to_level(.x, "Unknown"))
  )

# define response and predictors
y_var <- "Relative_Yield_avg"

# grouping variable (CRITICAL)
group_var <- rf_df$IQF_environment

#────────────────────────────────────────
# 2. REMOVE NEAR-ZERO VARIANCE PREDICTORS
#────────────────────────────────────────

pred_cols <- setdiff(names(rf_df), y_var)
nzv <- nearZeroVar(rf_df[, pred_cols], saveMetrics = TRUE)

rf_df <- rf_df[, c(y_var, rownames(nzv)[!nzv$nzv])]

#────────────────────────────────────────
# 3. GROUPED CROSS-VALIDATION (ENVIRONMENT)
#────────────────────────────────────────

k <- length(unique(group_var))  # leave-one-environment-out
folds <- groupKFold(group = group_var, k = k)

ctrl <- trainControl(
  method = "cv",
  index = folds,
  savePredictions = "final"
)

# tuning grid (simple, robust)
p <- ncol(rf_df) - 1
grid <- expand.grid(
  mtry = unique(pmax(1, round(c(sqrt(p), 0.25*p, 0.33*p)))),
  splitrule = "variance",
  min.node.size = c(5, 10)
)

#────────────────────────────────────────
# 4. FIT RANDOM FOREST (NO LEAKAGE)
#────────────────────────────────────────

rf_fit <- train(
  Relative_Yield_avg ~ .,
  data = rf_df,
  method = "ranger",
  trControl = ctrl,
  tuneGrid = grid,
  num.trees = 1000,
  importance = "permutation"
)

print(rf_fit)
## Random Forest 
## 
## 3703 samples
##   15 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 2928, 775 
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  RMSE       Rsquared   MAE      
##   4      5             0.1621534  0.3327244  0.1230093
##   4     10             0.1654177  0.3165887  0.1263355
##   5      5             0.1568202  0.3446975  0.1166651
##   5     10             0.1597875  0.3336508  0.1199271
## 
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 5, splitrule = variance
##  and min.node.size = 5.
# unbiased CV performance
rf_fit$results
##   mtry splitrule min.node.size      RMSE  Rsquared       MAE     RMSESD
## 1    4  variance             5 0.1621534 0.3327244 0.1230093 0.02800503
## 2    4  variance            10 0.1654177 0.3165887 0.1263355 0.02509390
## 3    5  variance             5 0.1568202 0.3446975 0.1166651 0.03465570
## 4    5  variance            10 0.1597875 0.3336508 0.1199271 0.03149256
##   RsquaredSD      MAESD
## 1  0.4313934 0.02724279
## 2  0.4151424 0.02432112
## 3  0.4438680 0.03503907
## 4  0.4318532 0.03174999
#────────────────────────────────────────
# 5. FINAL MODEL ON ALL DATA
#────────────────────────────────────────

final_rf <- ranger(
  Relative_Yield_avg ~ .,
  data = rf_df,
  num.trees = 1000,
  mtry = rf_fit$bestTune$mtry,
  min.node.size = rf_fit$bestTune$min.node.size,
  importance = "permutation"
)

# variable importance
vip <- sort(final_rf$variable.importance, decreasing = TRUE)
print(vip)
##             ICF_Alt_avg       ICF_planting_date  Comp_amount_corr_quali 
##             0.029632808             0.022962927             0.021966636 
##        ICR_rainfall_avg           ICR_Org_C_avg         ICR_Avail_P_avg 
##             0.021026641             0.020804884             0.019211755 
##          ICR_K_perc_23A     ICR_arable_land_avg Weed_species_A_combined 
##             0.018154042             0.014172073             0.007577408 
##         IQF_environment          IQR_soil_color      IQF_position_slope 
##             0.005759308             0.005010683             0.004377213 
##        IQR_soil_texture    IQR_TF_tubura_client   IQR_plot_fert_quality 
##             0.003689101             0.003030949             0.001829865

Another way to do the RF, considering that environment has large explanatory power

It estimates drivers of CA–CONV yield differences within each environment, avoiding confounding across environments and making interaction patterns explicit.By fitting separate random forest models per environment, we isolated environment-specific drivers of CA–CONV yield differences and explicitly revealed CA × environment interactions that are masked in pooled analyses.

##────────────────────────────────────────────
# RF PER ENVIRONMENT + WITHIN-ENV VALIDATION
#────────────────────────────────────────────

library(dplyr)
library(randomForest)
library(forcats)
library(tibble)
library(tidyr)

set.seed(123)

# 1. Clean data (same as before)
rf_data_clean2 <- rf_data_clean %>%
  mutate(
    across(where(is.character), as.factor),
    across(where(is.factor), ~ fct_na_value_to_level(.x, level = "Unknown"))
  )

# 2. Split data by environment
rf_by_env <- rf_data_clean2 %>%
  group_split(IQF_environment)

env_names <- sapply(rf_by_env, function(df) unique(df$IQF_environment))

#────────────────────────────────────────────
# 3. FIT ONE RF PER ENVIRONMENT (FULL DATA)
#────────────────────────────────────────────

rf_models <- lapply(rf_by_env, function(df) {

  df_env <- df %>%
    dplyr::select(-IQF_environment)

  randomForest(
    Relative_Yield_avg ~ .,
    data = df_env,
    ntree = 1000,
    importance = TRUE
  )
})

names(rf_models) <- env_names

# In-sample performance (expected to be high)
rf_in_sample_perf <- bind_rows(
  lapply(rf_models, function(m) {
    data.frame(
      MSE = m$mse[m$ntree],
      Rsquared = m$rsq[m$ntree]
    )
  }),
  .id = "Environment"
)

print(rf_in_sample_perf)
##   Environment         MSE  Rsquared
## 1    Highland 0.001400113 0.9639444
## 2 Mid_Low_Dry 0.002519209 0.9255612
## 3 Mid_Low_Wet 0.002492199 0.9118016
## 4  Transition 0.001978741 0.9444694
#────────────────────────────────────────────
# 4. WITHIN-ENVIRONMENT 80/20 VALIDATION
#────────────────────────────────────────────

set.seed(123)

rf_env_cv <- lapply(rf_by_env, function(df) {

  df_env <- df %>%
    dplyr::select(-IQF_environment)

  n <- nrow(df_env)
  idx <- sample(seq_len(n), size = floor(0.8 * n))

  train <- df_env[idx, ]
  test  <- df_env[-idx, ]

  m <- randomForest(
    Relative_Yield_avg ~ .,
    data = train,
    ntree = 1000
  )

  pred <- predict(m, newdata = test)

  data.frame(
    RMSE = sqrt(mean((test$Relative_Yield_avg - pred)^2)),
    R2   = cor(test$Relative_Yield_avg, pred)^2
  )
})

rf_env_cv_perf <- bind_rows(rf_env_cv, .id = "Environment")

print(rf_env_cv_perf)
##   Environment       RMSE        R2
## 1           1 0.06499952 0.9086406
## 2           2 0.04947863 0.9336802
## 3           3 0.02965064 0.9703302
## 4           4 0.06538699 0.8810449
#────────────────────────────────────────────
# 5. VARIABLE IMPORTANCE PER ENVIRONMENT
#────────────────────────────────────────────

var_imp_list <- lapply(rf_models, function(m) {

  randomForest::importance(m, type = 1) %>%
    as.data.frame() %>%
    rownames_to_column("Variable") %>%
    rename(IncMSE = `%IncMSE`) %>%
    arrange(desc(IncMSE))
})

# Top 10 drivers per environment
top_drivers <- bind_rows(
  lapply(names(var_imp_list), function(env) {
    var_imp_list[[env]] %>%
      slice(1:10) %>%
      mutate(Environment = env)
  })
)

print(top_drivers)
##                  Variable   IncMSE Environment
## 1             ICF_Alt_avg 49.89655    Highland
## 2          ICR_K_perc_23A 45.09363    Highland
## 3  Comp_amount_corr_quali 41.40628    Highland
## 4      IQF_position_slope 41.08541    Highland
## 5       ICF_planting_date 40.85498    Highland
## 6           ICR_Org_C_avg 39.19458    Highland
## 7         ICR_Avail_P_avg 38.83592    Highland
## 8     ICR_arable_land_avg 36.10111    Highland
## 9        ICR_rainfall_avg 32.59708    Highland
## 10         IQR_soil_color 28.88320    Highland
## 11            ICF_Alt_avg 68.71471 Mid_Low_Dry
## 12        ICR_Avail_P_avg 57.76390 Mid_Low_Dry
## 13      ICF_planting_date 55.79639 Mid_Low_Dry
## 14       ICR_rainfall_avg 55.10739 Mid_Low_Dry
## 15    ICR_arable_land_avg 52.09059 Mid_Low_Dry
## 16         ICR_K_perc_23A 51.19468 Mid_Low_Dry
## 17 Comp_amount_corr_quali 50.65468 Mid_Low_Dry
## 18          ICR_Org_C_avg 43.59543 Mid_Low_Dry
## 19     IQF_position_slope 43.40560 Mid_Low_Dry
## 20       IQR_soil_texture 29.19072 Mid_Low_Dry
## 21         ICR_K_perc_23A 53.49618 Mid_Low_Wet
## 22      ICF_planting_date 53.25065 Mid_Low_Wet
## 23    ICR_arable_land_avg 49.74409 Mid_Low_Wet
## 24 Comp_amount_corr_quali 49.61651 Mid_Low_Wet
## 25          ICR_Org_C_avg 47.21795 Mid_Low_Wet
## 26        ICR_Avail_P_avg 44.49566 Mid_Low_Wet
## 27            ICF_Alt_avg 38.45771 Mid_Low_Wet
## 28     IQF_position_slope 38.03755 Mid_Low_Wet
## 29         IQR_soil_color 37.62862 Mid_Low_Wet
## 30       ICR_rainfall_avg 32.48141 Mid_Low_Wet
## 31      ICF_planting_date 56.26644  Transition
## 32        ICR_Avail_P_avg 49.98004  Transition
## 33 Comp_amount_corr_quali 47.17661  Transition
## 34          ICR_Org_C_avg 46.65120  Transition
## 35            ICF_Alt_avg 43.47734  Transition
## 36         ICR_K_perc_23A 42.03532  Transition
## 37    ICR_arable_land_avg 40.01583  Transition
## 38       ICR_rainfall_avg 34.68373  Transition
## 39         IQR_soil_color 32.74855  Transition
## 40     IQF_position_slope 30.23287  Transition
#────────────────────────────────────────────
# 6. DRIVER RANK COMPARISON (INTERACTIONS)
#────────────────────────────────────────────

rank_table <- bind_rows(
  lapply(names(var_imp_list), function(env) {
    var_imp_list[[env]] %>%
      mutate(
        Environment = env,
        Rank = rank(-IncMSE)
      )
  })
)

rank_table %>%
  filter(Rank <= 5) %>%
  arrange(Variable, Rank) %>%
  print()
##                  Variable   IncMSE Environment Rank
## 1  Comp_amount_corr_quali 41.40628    Highland    3
## 2  Comp_amount_corr_quali 47.17661  Transition    3
## 3  Comp_amount_corr_quali 49.61651 Mid_Low_Wet    4
## 4             ICF_Alt_avg 49.89655    Highland    1
## 5             ICF_Alt_avg 68.71471 Mid_Low_Dry    1
## 6             ICF_Alt_avg 43.47734  Transition    5
## 7       ICF_planting_date 56.26644  Transition    1
## 8       ICF_planting_date 53.25065 Mid_Low_Wet    2
## 9       ICF_planting_date 55.79639 Mid_Low_Dry    3
## 10      ICF_planting_date 40.85498    Highland    5
## 11        ICR_Avail_P_avg 57.76390 Mid_Low_Dry    2
## 12        ICR_Avail_P_avg 49.98004  Transition    2
## 13         ICR_K_perc_23A 53.49618 Mid_Low_Wet    1
## 14         ICR_K_perc_23A 45.09363    Highland    2
## 15          ICR_Org_C_avg 46.65120  Transition    4
## 16          ICR_Org_C_avg 47.21795 Mid_Low_Wet    5
## 17    ICR_arable_land_avg 49.74409 Mid_Low_Wet    3
## 18    ICR_arable_land_avg 52.09059 Mid_Low_Dry    5
## 19       ICR_rainfall_avg 55.10739 Mid_Low_Dry    4
## 20     IQF_position_slope 41.08541    Highland    4

Evaluate drivers with mix modeles

We first need to prepare a database where we have the 2 treatments and averaged for Maize and for Beans separated. We will keep most of the variables as 1 per block along seasons as we did for RF, but with the expection of Rain, Compost and Planting data that wee be specific for maize and beans specifically (season A and B). Yields will come from “yield_CA_comb” the combined drivers from “drivers_ID” and the Rain, Compost and Planting averaged for season A and B from yield_CA_comb also

## tibble [2,318 × 17] (S3: tbl_df/tbl/data.frame)
##  $ IQR_block              : chr [1:2318] "102_1" "102_1" "102_1" "102_1" ...
##  $ Treat_group            : chr [1:2318] "CA" "CA" "CONV" "CONV" ...
##  $ crop                   : chr [1:2318] "beans" "maize" "beans" "maize" ...
##  $ Yield_avg              : num [1:2318] NA 5.17 NA 6.25 1.1 ...
##  $ Rain_avg               : num [1:2318] 3.1 4.39 3.1 4.39 2.95 ...
##  $ Compost_avg            : num [1:2318] NA 5.4 NA 5.4 6.08 ...
##  $ ICR_Org_C_avg          : num [1:2318] 2.69 2.69 2.69 2.69 2.01 2.01 2.01 2.01 1.75 1.75 ...
##  $ ICR_K_perc_23A         : num [1:2318] 0.73 0.73 0.73 0.73 0.75 0.75 0.75 0.75 0.32 0.32 ...
##  $ ICR_Avail_P_avg        : num [1:2318] 8.2 8.2 8.2 8.2 11.7 ...
##  $ ICR_arable_land_avg    : num [1:2318] 35 35 35 35 40.4 40.4 40.4 40.4 56.4 56.4 ...
##  $ IQR_TF_tubura_client   : chr [1:2318] "1" "1" "1" "1" ...
##  $ IQR_plot_fert_quality  : chr [1:2318] "high" "high" "high" "high" ...
##  $ IQR_soil_texture       : chr [1:2318] NA NA NA NA ...
##  $ IQR_soil_color         : Factor w/ 6 levels "#N/A","black",..: 1 1 1 1 2 2 2 2 2 2 ...
##  $ IQF_position_slope     : chr [1:2318] "Footslope" "Footslope" "Footslope" "Footslope" ...
##  $ Weed_species_A_combined: chr [1:2318] "Broadleaf" "Broadleaf" "Broadleaf" "Broadleaf" ...
##  $ IQF_environment        : Factor w/ 4 levels "Highland","Mid_Low_Dry",..: 3 3 3 3 3 3 3 3 3 3 ...

now we need to add the delay in planting average for the 3 maize and beans seasons

library(dplyr)
library(stringr)

#────────────────────────────────────────
# ADD PLANTING DELAY (per block × crop)
#────────────────────────────────────────

# 1. Prepare planting delay data and assign crop identity
block_yield <- block_yield %>%
  mutate(
    planting_delay_doy =
      as.numeric(suppressWarnings(as.character(planting_delay_doy))),
    crop = case_when(
      str_detect(IQR_Season, "A$") ~ "maize",
      str_detect(IQR_Season, "B$") ~ "beans",
      TRUE ~ NA_character_
    )
  )

# 2. Compute block × crop average planting delay
planting_summary <- block_yield %>%
  group_by(IQR_block, crop) %>%
  summarise(
    Planting_avg_doy = ifelse(
      all(is.na(planting_delay_doy)),
      NA_real_,
      mean(planting_delay_doy, na.rm = TRUE)
    ),
    .groups = "drop"
  )

# 3. Merge planting delay into model driver table
drivers_mm <- drivers_mm %>%
  left_join(planting_summary, by = c("IQR_block", "crop"))

# 4. Check output
summary(drivers_mm$Planting_avg_doy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   7.333  11.667  13.064  17.500  53.000     232

Now we can proceed with the analysis of a stepwise approach to ID most relevant trivers with mix models

library(dplyr)
library(tidyr)
library(lme4)
library(lmerTest)

#────────────────────────────────────────────
# 1) Create dataset with CA and CONV yield pairs
#────────────────────────────────────────────

diff_data <- drivers_mm %>%
  # keep necessary columns only
  dplyr::select(
    IQR_block, crop, Treat_group, Yield_avg,
    Rain_avg, Compost_avg,
    ICR_Org_C_avg, ICR_K_perc_23A, ICR_Avail_P_avg,
    ICR_arable_land_avg, 
    IQR_TF_tubura_client, IQR_plot_fert_quality,
    IQR_soil_texture, IQR_soil_color,
    IQF_position_slope, Weed_species_A_combined,
    IQF_environment, Planting_avg_doy
  ) %>%
  # spread treatments to columns
  pivot_wider(
    names_from  = Treat_group,
    values_from = Yield_avg
  ) %>%
  # compute yield difference (CONV – CA)
  mutate(
    Yield_diff = CONV - CA
  ) %>%
  # remove rows without both treatments
  filter(!is.na(Yield_diff))

#────────────────────────────────────────────
# 2) Convert grouping variables to factors
#────────────────────────────────────────────

diff_data <- diff_data %>%
  mutate(
    IQR_block            = factor(IQR_block),
    crop                 = factor(crop),
    IQR_TF_tubura_client = factor(IQR_TF_tubura_client),
    IQR_plot_fert_quality= factor(IQR_plot_fert_quality),
    IQR_soil_texture     = factor(ifelse(is.na(IQR_soil_texture),"Unknown",IQR_soil_texture)),
    IQR_soil_color       = factor(IQR_soil_color),
    IQF_position_slope   = factor(IQF_position_slope),
    Weed_species_A_combined = factor(Weed_species_A_combined),
    IQF_environment      = factor(IQF_environment)
  )

#────────────────────────────────────────────
# 3) Scale numeric predictors
#────────────────────────────────────────────

num_vars <- c(
  "Rain_avg","Compost_avg",
  "ICR_Org_C_avg","ICR_K_perc_23A","ICR_Avail_P_avg",
  "ICR_arable_land_avg",
  "Planting_avg_doy"
)

diff_data <- diff_data %>%
  mutate(across(
    all_of(num_vars),
    ~ as.numeric(scale(.)),
    .names = "z_{.col}"
  ))

Approach 1 for mix models stepwise

To assess the drivers of relative performance between conservation CA) and CONV (= Yeild CA/ Yield CONV), we modelled treatment effects using the log yield ratio, defined as log(CONV / CA), rather than absolute yield differences. With this we capture proportional treatment effects, stabilizes variance across yield levels, and avoids scale-dependent biases that arise when absolute yield differences are systematically difference among environments. Mixed-effects models were fitted to paired CA–CONV observations within blocks, with block included as a random effect and candidate drivers entered sequentially in thematic groups (biophysical, socio-economic, and management). Results show strong environmental modulation of relative performance: - CA was least disadvantaged in mid-altitude dry environments, where the CONV advantage was significantly reduced compared with wetter or transitional environments. - Management variables were also influential, with compost application reducing the relative advantage of CONV and later planting dates increasing it - In contrast, rainfall and most individual soil chemical indicators did not significantly affect relative performance once environment was accounted for, although soil texture exerted a significant collective influence. We may need to better assess the water lodging / drainage of the soils as this was reported as an incluencial facted in several publications

#────────────────────────────────────────────
# STEPWISE MIXED MODEL USING LOG RESPONSE RATIO
# Drivers of CONV–CA yield gap
#────────────────────────────────────────────

library(dplyr)
library(tidyr)
library(lme4)
library(lmerTest)
library(forcats)

#────────────────────────────────────────────
# 1) Prepare data and fix variable types
#────────────────────────────────────────────

mm0 <- drivers_mm %>%
  mutate(
    IQR_block = factor(IQR_block),
    Treat_group = factor(Treat_group),
    crop = factor(crop),

    IQR_TF_tubura_client = factor(IQR_TF_tubura_client),
    IQR_plot_fert_quality = factor(IQR_plot_fert_quality),

    IQR_soil_texture =
      fct_na_value_to_level(factor(IQR_soil_texture), "Unknown"),
    IQR_soil_color =
      fct_na_value_to_level(factor(IQR_soil_color), "Unknown"),
    IQF_position_slope =
      fct_na_value_to_level(factor(IQF_position_slope), "Unknown"),
    Weed_species_A_combined =
      fct_na_value_to_level(factor(Weed_species_A_combined), "Unknown"),

    IQF_environment = factor(IQF_environment)
  )

# Numeric predictors
num_vars <- c(
  "Rain_avg",
  "Compost_avg",
  "Planting_avg_doy",
  "ICR_Org_C_avg",
  "ICR_K_perc_23A",
  "ICR_Avail_P_avg",
  "ICR_arable_land_avg"
)

mm0 <- mm0 %>%
  mutate(across(all_of(num_vars), as.numeric))

#────────────────────────────────────────────
# 2) Construct CA–CONV paired dataset
#     Log response ratio = log(CONV / CA)
#────────────────────────────────────────────

diff_data <- mm0 %>%
  dplyr::select(
    IQR_block, crop, IQF_environment,
    all_of(num_vars),
    IQR_TF_tubura_client, IQR_plot_fert_quality,
    IQR_soil_texture, IQR_soil_color,
    IQF_position_slope, Weed_species_A_combined,
    Treat_group, Yield_avg
  ) %>%
  pivot_wider(
    names_from = Treat_group,
    values_from = Yield_avg
  ) %>%
  filter(!is.na(CA), !is.na(CONV), CA > 0, CONV > 0) %>%
  mutate(
    Log_Ratio = log(CONV / CA)
  )

#────────────────────────────────────────────
# 3) Median-impute numeric predictors
#────────────────────────────────────────────

diff_data <- diff_data %>%
  mutate(
    across(all_of(num_vars),
           ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))
  )

#────────────────────────────────────────────
# 4) Scale numeric predictors
#────────────────────────────────────────────

diff_data <- diff_data %>%
  mutate(
    across(all_of(num_vars),
           ~ as.numeric(scale(.x)),
           .names = "z_{.col}")
  )

#────────────────────────────────────────────
# 5) COMPLETE-CASE dataset
#    (ensures identical rows across all models)
#────────────────────────────────────────────

all_vars <- c(
  "Log_Ratio", "crop", "IQR_block",
  paste0("z_", num_vars),
  "IQR_TF_tubura_client", "IQR_plot_fert_quality",
  "IQR_soil_texture", "IQR_soil_color",
  "IQF_position_slope", "Weed_species_A_combined",
  "IQF_environment"
)

diff_data_cc <- diff_data %>%
  dplyr::select(all_of(all_vars)) %>%
  drop_na()

#────────────────────────────────────────────
# 6) Helper function for stepwise testing
#────────────────────────────────────────────

add_and_test <- function(prev_model, terms, label) {
  new_model <- update(prev_model,
                      as.formula(paste(". ~ . +", terms)))
  cat("\n==============================\n")
  cat("ADD BLOCK:", label, "\n")
  print(anova(prev_model, new_model))
  return(new_model)
}

#────────────────────────────────────────────
# 7) Stepwise mixed-effects modelling
#────────────────────────────────────────────

# Null model
m0 <- lmer(
  Log_Ratio ~ crop + (1 | IQR_block),
  data = diff_data_cc,
  REML = FALSE
)

# Biophysical: rainfall
m1 <- add_and_test(
  m0,
  "z_Rain_avg",
  "biophysical: rainfall"
)
## 
## ==============================
## ADD BLOCK: biophysical: rainfall 
## Data: diff_data_cc
## Models:
## prev_model: Log_Ratio ~ crop + (1 | IQR_block)
## new_model: Log_Ratio ~ crop + (1 | IQR_block) + z_Rain_avg
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## prev_model    4 357.18 377.11 -174.59   349.18                       
## new_model     5 353.40 378.32 -171.70   343.40 5.7762  1    0.01625 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Biophysical: soil (nutrients + physical)
m2 <- add_and_test(
  m1,
  paste(
    "z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg +",
    "IQR_soil_texture + IQR_soil_color"
  ),
  "biophysical: soil"
)
## 
## ==============================
## ADD BLOCK: biophysical: soil 
## Data: diff_data_cc
## Models:
## prev_model: Log_Ratio ~ crop + (1 | IQR_block) + z_Rain_avg
## new_model: Log_Ratio ~ crop + (1 | IQR_block) + z_Rain_avg + z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture + IQR_soil_color
##            npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)   
## prev_model    5 353.40 378.32 -171.70   343.40                       
## new_model    15 344.69 419.45 -157.35   314.69 28.71 10   0.001388 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Biophysical: landscape & weeds
m3 <- add_and_test(
  m2,
  "IQF_position_slope + Weed_species_A_combined",
  "biophysical: slope & weeds"
)
## 
## ==============================
## ADD BLOCK: biophysical: slope & weeds 
## Data: diff_data_cc
## Models:
## prev_model: Log_Ratio ~ crop + (1 | IQR_block) + z_Rain_avg + z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture + IQR_soil_color
## new_model: Log_Ratio ~ crop + (1 | IQR_block) + z_Rain_avg + z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture + IQR_soil_color + IQF_position_slope + Weed_species_A_combined
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## prev_model   15 344.69 419.45 -157.35   314.69                       
## new_model    24 341.86 461.48 -146.93   293.86 20.829  9    0.01343 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Biophysical: environment
m4 <- add_and_test(
  m3,
  "IQF_environment",
  "biophysical: environment"
)
## 
## ==============================
## ADD BLOCK: biophysical: environment 
## Data: diff_data_cc
## Models:
## prev_model: Log_Ratio ~ crop + (1 | IQR_block) + z_Rain_avg + z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture + IQR_soil_color + IQF_position_slope + Weed_species_A_combined
## new_model: Log_Ratio ~ crop + (1 | IQR_block) + z_Rain_avg + z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture + IQR_soil_color + IQF_position_slope + Weed_species_A_combined + IQF_environment
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)   
## prev_model   24 341.86 461.48 -146.93   293.86                        
## new_model    27 332.53 467.10 -139.27   278.54 15.329  3   0.001556 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Socio-economic
m5 <- add_and_test(
  m4,
  "z_ICR_arable_land_avg + IQR_TF_tubura_client",
  "socio-economic"
)
## 
## ==============================
## ADD BLOCK: socio-economic 
## Data: diff_data_cc
## Models:
## prev_model: Log_Ratio ~ crop + (1 | IQR_block) + z_Rain_avg + z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture + IQR_soil_color + IQF_position_slope + Weed_species_A_combined + IQF_environment
## new_model: Log_Ratio ~ crop + (1 | IQR_block) + z_Rain_avg + z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture + IQR_soil_color + IQF_position_slope + Weed_species_A_combined + IQF_environment + z_ICR_arable_land_avg + IQR_TF_tubura_client
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## prev_model   27 332.53 467.10 -139.27   278.54                       
## new_model    30 329.41 478.92 -134.70   269.41 9.1267  3    0.02765 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Management
m6 <- add_and_test(
  m5,
  "z_Compost_avg + z_Planting_avg_doy + IQR_plot_fert_quality",
  "management"
)
## 
## ==============================
## ADD BLOCK: management 
## Data: diff_data_cc
## Models:
## prev_model: Log_Ratio ~ crop + (1 | IQR_block) + z_Rain_avg + z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture + IQR_soil_color + IQF_position_slope + Weed_species_A_combined + IQF_environment + z_ICR_arable_land_avg + IQR_TF_tubura_client
## new_model: Log_Ratio ~ crop + (1 | IQR_block) + z_Rain_avg + z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture + IQR_soil_color + IQF_position_slope + Weed_species_A_combined + IQF_environment + z_ICR_arable_land_avg + IQR_TF_tubura_client + z_Compost_avg + z_Planting_avg_doy + IQR_plot_fert_quality
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## prev_model   30 329.41 478.92 -134.70   269.41                         
## new_model    35 283.81 458.25 -106.91   213.81 55.594  5  9.852e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#────────────────────────────────────────────
# 8) Backward pruning (diagnostic)
#────────────────────────────────────────────

drop1(m6, test = "Chisq")
## Single term deletions using Satterthwaite's method:
## 
## Model:
## Log_Ratio ~ crop + (1 | IQR_block) + z_Rain_avg + z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture + IQR_soil_color + IQF_position_slope + Weed_species_A_combined + IQF_environment + z_ICR_arable_land_avg + IQR_TF_tubura_client + z_Compost_avg + z_Planting_avg_doy + IQR_plot_fert_quality
##                          Sum Sq Mean Sq NumDF   DenDF F value    Pr(>F)    
## crop                    0.40154 0.40154     1  954.98  6.5034 0.0109219 *  
## z_Rain_avg              0.04024 0.04024     1  918.47  0.6518 0.4196952    
## z_ICR_Org_C_avg         0.10378 0.10378     1  565.20  1.6809 0.1953352    
## z_ICR_K_perc_23A        0.22000 0.22000     1  561.16  3.5631 0.0595937 .  
## z_ICR_Avail_P_avg       0.30631 0.30631     1  554.32  4.9611 0.0263250 *  
## IQR_soil_texture        0.53991 0.17997     3  560.37  2.9148 0.0337901 *  
## IQR_soil_color          0.24196 0.06049     4  534.81  0.9797 0.4180743    
## IQF_position_slope      0.40077 0.08015     5  563.72  1.2982 0.2630444    
## Weed_species_A_combined 1.22616 0.30654     4  656.51  4.9648 0.0006004 ***
## IQF_environment         0.55121 0.18374     3  610.35  2.9758 0.0310692 *  
## z_ICR_arable_land_avg   0.50115 0.50115     1  554.59  8.1168 0.0045487 ** 
## IQR_TF_tubura_client    0.46151 0.23075     2  534.78  3.7373 0.0244424 *  
## z_Compost_avg           1.57523 1.57523     1 1072.90 25.5126 5.162e-07 ***
## z_Planting_avg_doy      1.57173 1.57173     1 1031.08 25.4561 5.346e-07 ***
## IQR_plot_fert_quality   0.17179 0.05726     3  555.99  0.9274 0.4271561    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
best_model <- m6

#────────────────────────────────────────────
# 9) Final output
#────────────────────────────────────────────

summary(best_model)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: Log_Ratio ~ crop + (1 | IQR_block) + z_Rain_avg + z_ICR_Org_C_avg +  
##     z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture +  
##     IQR_soil_color + IQF_position_slope + Weed_species_A_combined +  
##     IQF_environment + z_ICR_arable_land_avg + IQR_TF_tubura_client +  
##     z_Compost_avg + z_Planting_avg_doy + IQR_plot_fert_quality
##    Data: diff_data_cc
## 
##      AIC      BIC   logLik deviance df.resid 
##    283.8    458.2   -106.9    213.8     1044 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.4973 -0.5219 -0.0336  0.4874  5.5208 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  IQR_block (Intercept) 0.01035  0.1017  
##  Residual              0.06174  0.2485  
## Number of obs: 1079, groups:  IQR_block, 566
## 
## Fixed effects:
##                                       Estimate Std. Error         df t value
## (Intercept)                          4.035e-01  3.038e-01  5.221e+02   1.328
## cropmaize                           -5.541e-02  2.173e-02  9.550e+02  -2.550
## z_Rain_avg                          -1.218e-02  1.509e-02  9.185e+02  -0.807
## z_ICR_Org_C_avg                      1.306e-02  1.008e-02  5.652e+02   1.296
## z_ICR_K_perc_23A                    -1.900e-02  1.006e-02  5.612e+02  -1.888
## z_ICR_Avail_P_avg                    2.140e-02  9.607e-03  5.543e+02   2.227
## IQR_soil_textureFine                 1.045e-01  3.541e-02  5.275e+02   2.952
## IQR_soil_textureMix                  4.963e-02  2.788e-02  5.304e+02   1.780
## IQR_soil_textureUnknown              6.585e-02  6.535e-02  6.065e+02   1.008
## IQR_soil_colorblack                  6.057e-02  5.699e-02  5.566e+02   1.063
## IQR_soil_colorbrown                  6.752e-02  6.868e-02  5.398e+02   0.983
## IQR_soil_colorivu_busa_nibara_ryivu -1.290e-02  8.775e-02  5.326e+02  -0.147
## IQR_soil_colorred                    2.860e-02  5.591e-02  5.556e+02   0.512
## IQF_position_slopeBackslope         -1.090e-01  2.619e-01  5.909e+02  -0.416
## IQF_position_slopeFlat              -1.213e-01  2.631e-01  5.905e+02  -0.461
## IQF_position_slopeFootslope         -9.348e-02  2.634e-01  5.902e+02  -0.355
## IQF_position_slopeShoulder          -3.320e-02  2.640e-01  5.901e+02  -0.126
## IQF_position_slopeSummit            -9.530e-02  2.648e-01  5.911e+02  -0.360
## Weed_species_A_combinedBroadleaf    -1.772e-02  1.472e-01  5.196e+02  -0.120
## Weed_species_A_combinedGrass         5.422e-02  1.477e-01  5.203e+02   0.367
## Weed_species_A_combinedunknown       6.488e-01  3.104e-01  9.595e+02   2.090
## Weed_species_A_combinedWitchweed    -4.426e-02  1.561e-01  5.235e+02  -0.284
## IQF_environmentMid_Low_Dry          -7.404e-02  3.184e-02  6.252e+02  -2.325
## IQF_environmentMid_Low_Wet          -8.092e-02  3.022e-02  6.459e+02  -2.677
## IQF_environmentTransition           -5.722e-02  3.082e-02  5.977e+02  -1.857
## z_ICR_arable_land_avg               -2.815e-02  9.882e-03  5.546e+02  -2.849
## IQR_TF_tubura_client0               -1.269e-01  2.058e-01  5.200e+02  -0.617
## IQR_TF_tubura_client1               -7.240e-02  2.053e-01  5.196e+02  -0.353
## z_Compost_avg                       -5.287e-02  1.047e-02  1.073e+03  -5.051
## z_Planting_avg_doy                   4.862e-02  9.637e-03  1.031e+03   5.045
## IQR_plot_fert_qualityhigh            3.796e-02  3.029e-01  5.724e+02   0.125
## IQR_plot_fert_qualitylow_quality     1.443e-01  3.075e-01  5.717e+02   0.469
## IQR_plot_fert_qualitymedium          6.355e-02  3.020e-01  5.723e+02   0.210
##                                     Pr(>|t|)    
## (Intercept)                          0.18468    
## cropmaize                            0.01092 *  
## z_Rain_avg                           0.41970    
## z_ICR_Org_C_avg                      0.19534    
## z_ICR_K_perc_23A                     0.05959 .  
## z_ICR_Avail_P_avg                    0.02633 *  
## IQR_soil_textureFine                 0.00330 ** 
## IQR_soil_textureMix                  0.07567 .  
## IQR_soil_textureUnknown              0.31407    
## IQR_soil_colorblack                  0.28833    
## IQR_soil_colorbrown                  0.32600    
## IQR_soil_colorivu_busa_nibara_ryivu  0.88314    
## IQR_soil_colorred                    0.60915    
## IQF_position_slopeBackslope          0.67728    
## IQF_position_slopeFlat               0.64479    
## IQF_position_slopeFootslope          0.72277    
## IQF_position_slopeShoulder           0.89996    
## IQF_position_slopeSummit             0.71905    
## Weed_species_A_combinedBroadleaf     0.90420    
## Weed_species_A_combinedGrass         0.71374    
## Weed_species_A_combinedunknown       0.03687 *  
## Weed_species_A_combinedWitchweed     0.77686    
## IQF_environmentMid_Low_Dry           0.02037 *  
## IQF_environmentMid_Low_Wet           0.00761 ** 
## IQF_environmentTransition            0.06386 .  
## z_ICR_arable_land_avg                0.00455 ** 
## IQR_TF_tubura_client0                0.53763    
## IQR_TF_tubura_client1                0.72454    
## z_Compost_avg                       5.16e-07 ***
## z_Planting_avg_doy                  5.35e-07 ***
## IQR_plot_fert_qualityhigh            0.90032    
## IQR_plot_fert_qualitylow_quality     0.63915    
## IQR_plot_fert_qualitymedium          0.83340    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation matrix not shown by default, as p = 33 > 12.
## Use print(x, correlation=TRUE)  or
##     vcov(x)        if you need it

#3rd approach to the mix models (https://agritrop.cirad.fr/596238/1/596238.pdf) We analysed the relative yield performance of conservation agriculture (CA) compared with conventional management (CONV) using the natural logarithm of the yield response ratio, ln(CONV/CA). This metric captures proportional treatment effects, stabilizes variance across environments, and avoids scale-dependent biases associated with absolute yield differences. Mixed-effects models were fitted to paired CA–CONV observations with block as a random effect, and candidate drivers were entered sequentially in thematic groups (biophysical, socio-economic, and management). Results show strong environmental modulation of relative performance: CA was least disadvantaged in mid-altitude dry environments, where the CONV advantage was significantly reduced. Management factors were the dominant drivers of variation in lnRR, with compost application reducing and delayed planting increasing the relative advantage of CONV. Soil texture exerted a significant collective influence, whereas rainfall and most soil chemical indicators were not significant once environmental context was accounted for. Because the response is ln(CONV / CA): Negative values mean CA performs relatively better More negative = smaller CONV advantage (or smaller CA penalty) Key results: -Environment: major contextual modifier; Mid-altitude dry environments are where CA is least disadvantaged relative to CONV. Makes sence…CA has been reported to perform better in dyr, well drained soils -Compost (−): reduces CONV advantage → CA relatively improves. Makes sence, high OC can help alleviate the nutrient inmobilization crated by CA -Planting date (+): later planting increases CONV advantage. Makes sence, early planting > chances of dry start then mulching and less soil disturbing > nifiltration and reduce evaporation -Soil texture (as a group): highly significant → individual levels vary, but collectively soil physical properties matter - Available P: small but significant effect - Slope & weeds: retained as groups, but weaker than management -Arable land size (−): larger land holdings reduce relative CONV advantage - Client status: retained as a group, modest effect - rainfall dropped out, but the assessment of rain could be improved by looking at diffeent stages of the crop (e.g., vegetative only, reproductive only, grain filling only/ frequency/ water balance and lodging combining texture and rain, etc)

#────────────────────────────────────────────
# PAPER-ALIGNED MIXED MODEL (ln response ratio)
# lnRR = log(CONV / CA)
# Stepwise add/drop by F-statistics + AIC, then refit with REML
# (NO interaction terms)
#────────────────────────────────────────────

library(dplyr)
library(tidyr)
library(lme4)
library(lmerTest)
library(forcats)

#────────────────────────────────────────────
# 1) Build paired dataset and lnRR
#────────────────────────────────────────────

mm0 <- drivers_mm %>%
  mutate(
    IQR_block = factor(IQR_block),
    Treat_group = factor(Treat_group),
    crop = factor(crop),
    IQF_environment = factor(IQF_environment),

    IQR_TF_tubura_client   = factor(IQR_TF_tubura_client),
    IQR_plot_fert_quality  = factor(IQR_plot_fert_quality),

    IQR_soil_texture =
      fct_na_value_to_level(factor(IQR_soil_texture), "Unknown"),
    IQR_soil_color =
      fct_na_value_to_level(factor(IQR_soil_color), "Unknown"),
    IQF_position_slope =
      fct_na_value_to_level(factor(IQF_position_slope), "Unknown"),
    Weed_species_A_combined =
      fct_na_value_to_level(factor(Weed_species_A_combined), "Unknown")
  )

num_vars <- c(
  "Rain_avg",
  "Compost_avg",
  "Planting_avg_doy",
  "ICR_Org_C_avg",
  "ICR_K_perc_23A",
  "ICR_Avail_P_avg",
  "ICR_arable_land_avg"
)

mm0 <- mm0 %>% mutate(across(all_of(num_vars), as.numeric))

diff_data <- mm0 %>%
  dplyr::select(
    IQR_block, crop, IQF_environment,
    all_of(num_vars),
    IQR_TF_tubura_client, IQR_plot_fert_quality,
    IQR_soil_texture, IQR_soil_color,
    IQF_position_slope, Weed_species_A_combined,
    Treat_group, Yield_avg
  ) %>%
  pivot_wider(names_from = Treat_group, values_from = Yield_avg) %>%
  filter(!is.na(CA), !is.na(CONV), CA > 0, CONV > 0) %>%
  mutate(lnRR = log(CONV / CA))

# Median impute numeric predictors
diff_data <- diff_data %>%
  mutate(across(all_of(num_vars), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))

# Scale numeric predictors
diff_data <- diff_data %>%
  mutate(across(all_of(num_vars), ~ as.numeric(scale(.x)), .names = "z_{.col}"))

# Complete-case dataset so every model uses identical rows
all_vars <- c(
  "lnRR","IQR_block","crop","IQF_environment",
  paste0("z_", num_vars),
  "IQR_TF_tubura_client","IQR_plot_fert_quality",
  "IQR_soil_texture","IQR_soil_color",
  "IQF_position_slope","Weed_species_A_combined"
)

d_cc <- diff_data %>%
  dplyr::select(all_of(all_vars)) %>%
  drop_na()

#────────────────────────────────────────────
# 2) Base model
#────────────────────────────────────────────

m0 <- lmer(
  lnRR ~ crop + (1 | IQR_block),
  data = d_cc,
  REML = FALSE
)

# Helper: add a term-block, print LRT + AIC
add_block <- function(prev_model, add_terms, label) {
  new_model <- update(prev_model, as.formula(paste(". ~ . +", add_terms)))
  cat("\n==============================\n")
  cat("ADD BLOCK:", label, "\n")
  print(anova(prev_model, new_model))
  cat("AIC(prev,new):", AIC(prev_model), "→", AIC(new_model), "\n")
  return(new_model)
}

# Helper: backward prune using paper-like rule (drop terms with F < 1),
# while preventing large AIC worsening
prune_Flt1 <- function(model) {
  repeat {
    d1 <- drop1(model, test = "F")
    d1 <- d1[!is.na(d1$`F value`), , drop = FALSE]
    if (nrow(d1) == 0) break
    minF <- min(d1$`F value`, na.rm = TRUE)
    if (!is.finite(minF) || minF >= 1) break

    term_to_drop <- rownames(d1)[which.min(d1$`F value`)]
    cat("\nDropping (F < 1):", term_to_drop, "  F =", minF, "\n")

    model2 <- update(model, as.formula(paste(". ~ . -", term_to_drop)))

    if (AIC(model2) <= AIC(model) + 2) {
      model <- model2
    } else {
      cat("Drop rejected (AIC got worse):", AIC(model), "→", AIC(model2), "\n")
      break
    }
  }
  model
}

#────────────────────────────────────────────
# 3) Sequential addition of covariate blocks (NO interactions)
#────────────────────────────────────────────


#────────────────────────────────────────────
# 3) Sequential addition of covariate blocks
#────────────────────────────────────────────

# (1) Site descriptor
m1 <- add_block(
  m0,
  "IQF_environment",
  "site descriptor: environment"
)
## 
## ==============================
## ADD BLOCK: site descriptor: environment 
## Data: d_cc
## Models:
## prev_model: lnRR ~ crop + (1 | IQR_block)
## new_model: lnRR ~ crop + (1 | IQR_block) + IQF_environment
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## prev_model    4 357.18 377.11 -174.59   349.18                         
## new_model     7 345.05 379.94 -165.53   331.05 18.125  3  0.0004145 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## AIC(prev,new): 357.179 → 345.0538
# (2) Biophysical (rainfall + soil + fertility status)
m2 <- add_block(
  m1,
  paste(
    "z_Rain_avg +
     z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg +
     IQR_soil_texture + IQR_soil_color +
     IQR_plot_fert_quality"
  ),
  "biophysical: rainfall + soil + fertility status"
)
## 
## ==============================
## ADD BLOCK: biophysical: rainfall + soil + fertility status 
## Data: d_cc
## Models:
## prev_model: lnRR ~ crop + (1 | IQR_block) + IQF_environment
## new_model: lnRR ~ crop + (1 | IQR_block) + IQF_environment + z_Rain_avg + z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture + IQR_soil_color + IQR_plot_fert_quality
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## prev_model    7 345.05 379.94 -165.53   331.05                         
## new_model    21 334.78 439.44 -146.39   292.78 38.272 14  0.0004721 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## AIC(prev,new): 345.0538 → 334.7819
# (3) Landscape & biotic pressure
m3 <- add_block(
  m2,
  "IQF_position_slope + Weed_species_A_combined",
  "biophysical: slope & weeds"
)
## 
## ==============================
## ADD BLOCK: biophysical: slope & weeds 
## Data: d_cc
## Models:
## prev_model: lnRR ~ crop + (1 | IQR_block) + IQF_environment + z_Rain_avg + z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture + IQR_soil_color + IQR_plot_fert_quality
## new_model: lnRR ~ crop + (1 | IQR_block) + IQF_environment + z_Rain_avg + z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture + IQR_soil_color + IQR_plot_fert_quality + IQF_position_slope + Weed_species_A_combined
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## prev_model   21 334.78 439.44 -146.39   292.78                       
## new_model    30 335.37 484.89 -137.69   275.37 17.409  9    0.04268 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## AIC(prev,new): 334.7819 → 335.3727
# (4) Socio-economic
m4 <- add_block(
  m3,
  "z_ICR_arable_land_avg + IQR_TF_tubura_client",
  "socio-economic"
)
## 
## ==============================
## ADD BLOCK: socio-economic 
## Data: d_cc
## Models:
## prev_model: lnRR ~ crop + (1 | IQR_block) + IQF_environment + z_Rain_avg + z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture + IQR_soil_color + IQR_plot_fert_quality + IQF_position_slope + Weed_species_A_combined
## new_model: lnRR ~ crop + (1 | IQR_block) + IQF_environment + z_Rain_avg + z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture + IQR_soil_color + IQR_plot_fert_quality + IQF_position_slope + Weed_species_A_combined + z_ICR_arable_land_avg + IQR_TF_tubura_client
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## prev_model   30 335.37 484.89 -137.69   275.37                       
## new_model    33 331.81 496.27 -132.90   265.81 9.5645  3    0.02265 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## AIC(prev,new): 335.3727 → 331.8082
# (5) Management (true farmer actions)
m5 <- add_block(
  m4,
  "z_Compost_avg + z_Planting_avg_doy",
  "management"
)
## 
## ==============================
## ADD BLOCK: management 
## Data: d_cc
## Models:
## prev_model: lnRR ~ crop + (1 | IQR_block) + IQF_environment + z_Rain_avg + z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture + IQR_soil_color + IQR_plot_fert_quality + IQF_position_slope + Weed_species_A_combined + z_ICR_arable_land_avg + IQR_TF_tubura_client
## new_model: lnRR ~ crop + (1 | IQR_block) + IQF_environment + z_Rain_avg + z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture + IQR_soil_color + IQR_plot_fert_quality + IQF_position_slope + Weed_species_A_combined + z_ICR_arable_land_avg + IQR_TF_tubura_client + z_Compost_avg + z_Planting_avg_doy
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## prev_model   33 331.81 496.27 -132.90   265.81                         
## new_model    35 283.81 458.25 -106.91   213.81 51.994  2  5.125e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## AIC(prev,new): 331.8082 → 283.8144
#────────────────────────────────────────────
# Backward pruning + REML refit
#────────────────────────────────────────────

m_final_ML   <- prune_Flt1(m5)
## 
## Dropping (F < 1): z_Rain_avg   F = 0.6517542 
## 
## Dropping (F < 1): IQR_plot_fert_quality   F = 0.9725905 
## 
## Dropping (F < 1): IQR_soil_color   F = 0.9193834
m_final_REML <- update(m_final_ML, REML = TRUE)

summary(m_final_REML)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: lnRR ~ crop + (1 | IQR_block) + IQF_environment + z_ICR_Org_C_avg +  
##     z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture +  
##     IQF_position_slope + Weed_species_A_combined + z_ICR_arable_land_avg +  
##     IQR_TF_tubura_client + z_Compost_avg + z_Planting_avg_doy
##    Data: d_cc
## 
## REML criterion at convergence: 356.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.2878 -0.5091 -0.0259  0.4798  5.5019 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  IQR_block (Intercept) 0.01256  0.1121  
##  Residual              0.06203  0.2491  
## Number of obs: 1079, groups:  IQR_block, 566
## 
## Fixed effects:
##                                    Estimate Std. Error         df t value
## (Intercept)                       5.069e-01  2.099e-01  5.557e+02   2.415
## cropmaize                        -6.808e-02  1.572e-02  5.480e+02  -4.330
## IQF_environmentMid_Low_Dry       -6.991e-02  3.030e-02  5.984e+02  -2.307
## IQF_environmentMid_Low_Wet       -7.953e-02  2.926e-02  5.928e+02  -2.718
## IQF_environmentTransition        -6.474e-02  3.078e-02  5.660e+02  -2.104
## z_ICR_Org_C_avg                   1.523e-02  1.013e-02  5.365e+02   1.503
## z_ICR_K_perc_23A                 -2.077e-02  1.018e-02  5.368e+02  -2.041
## z_ICR_Avail_P_avg                 2.385e-02  9.690e-03  5.366e+02   2.461
## IQR_soil_textureFine              1.074e-01  3.619e-02  5.051e+02   2.968
## IQR_soil_textureMix               4.593e-02  2.841e-02  5.103e+02   1.617
## IQR_soil_textureUnknown           1.871e-02  4.060e-02  6.958e+02   0.461
## IQF_position_slopeBackslope      -4.081e-02  1.311e-01  5.153e+02  -0.311
## IQF_position_slopeFlat           -4.970e-02  1.328e-01  5.163e+02  -0.374
## IQF_position_slopeFootslope      -1.806e-02  1.335e-01  5.153e+02  -0.135
## IQF_position_slopeShoulder        3.703e-02  1.352e-01  5.168e+02   0.274
## IQF_position_slopeSummit         -2.557e-02  1.359e-01  5.170e+02  -0.188
## Weed_species_A_combinedBroadleaf -3.831e-02  1.502e-01  4.990e+02  -0.255
## Weed_species_A_combinedGrass      2.863e-02  1.505e-01  4.987e+02   0.190
## Weed_species_A_combinedunknown    6.607e-01  3.154e-01  9.129e+02   2.095
## Weed_species_A_combinedWitchweed -7.290e-02  1.593e-01  5.022e+02  -0.458
## z_ICR_arable_land_avg            -2.625e-02  1.001e-02  5.314e+02  -2.622
## IQR_TF_tubura_client0            -1.619e-01  1.543e-01  5.390e+02  -1.050
## IQR_TF_tubura_client1            -1.119e-01  1.537e-01  5.392e+02  -0.728
## z_Compost_avg                    -5.675e-02  1.008e-02  1.010e+03  -5.630
## z_Planting_avg_doy                4.790e-02  9.659e-03  9.977e+02   4.959
##                                  Pr(>|t|)    
## (Intercept)                       0.01606 *  
## cropmaize                        1.77e-05 ***
## IQF_environmentMid_Low_Dry        0.02140 *  
## IQF_environmentMid_Low_Wet        0.00676 ** 
## IQF_environmentTransition         0.03586 *  
## z_ICR_Org_C_avg                   0.13348    
## z_ICR_K_perc_23A                  0.04172 *  
## z_ICR_Avail_P_avg                 0.01416 *  
## IQR_soil_textureFine              0.00314 ** 
## IQR_soil_textureMix               0.10651    
## IQR_soil_textureUnknown           0.64501    
## IQF_position_slopeBackslope       0.75578    
## IQF_position_slopeFlat            0.70844    
## IQF_position_slopeFootslope       0.89243    
## IQF_position_slopeShoulder        0.78434    
## IQF_position_slopeSummit          0.85085    
## Weed_species_A_combinedBroadleaf  0.79885    
## Weed_species_A_combinedGrass      0.84916    
## Weed_species_A_combinedunknown    0.03647 *  
## Weed_species_A_combinedWitchweed  0.64743    
## z_ICR_arable_land_avg             0.00899 ** 
## IQR_TF_tubura_client0             0.29425    
## IQR_TF_tubura_client1             0.46697    
## z_Compost_avg                    2.33e-08 ***
## z_Planting_avg_doy               8.33e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation matrix not shown by default, as p = 25 > 12.
## Use print(x, correlation=TRUE)  or
##     vcov(x)        if you need it

#Back-transforming log to % Model estimates were back-transformed from log response ratios using exp(lnRR) and expressed as percentage differences in yield between CONV and CA as (exp(lnRR) − 1) × 100, with 95% confidence intervals calculated on the log scale and back-transformed.

#────────────────────────────────────────────
# BACK-TRANSFORMATION OF ln RESPONSE RATIO
# lnRR = log(CONV / CA)
# Outputs: response ratio + % difference with 95% CI
#────────────────────────────────────────────

library(dplyr)
library(broom.mixed)
library(emmeans)

#────────────────────────────────────────────
# 1) Back-transform FIXED EFFECTS
#────────────────────────────────────────────

bt_fixed <- broom.mixed::tidy(
  m_final_REML,
  effects = "fixed",
  conf.int = TRUE,
  conf.method = "Wald"
) %>%
  mutate(
    RR = exp(estimate),
    RR_low = exp(conf.low),
    RR_high = exp(conf.high),

    pct_change = (RR - 1) * 100,
    pct_low = (RR_low - 1) * 100,
    pct_high = (RR_high - 1) * 100
  ) %>%
  dplyr::select(
    term,
    estimate,
    conf.low,
    conf.high,
    pct_change,
    pct_low,
    pct_high,
    p.value
  ) %>%
  mutate(
    across(c(pct_change, pct_low, pct_high), round, 1),
    p.value = signif(p.value, 3)
  )
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `across(c(pct_change, pct_low, pct_high), round, 1)`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))
bt_fixed
## # A tibble: 25 × 8
##    term          estimate conf.low conf.high pct_change pct_low pct_high p.value
##    <chr>            <dbl>    <dbl>     <dbl>      <dbl>   <dbl>    <dbl>   <dbl>
##  1 (Intercept)     0.507   0.0946   0.919          66       9.9    151.  1.61e-2
##  2 cropmaize      -0.0681 -0.0990  -0.0372         -6.6    -9.4     -3.7 1.77e-5
##  3 IQF_environm…  -0.0699 -0.129   -0.0104         -6.8   -12.1     -1   2.14e-2
##  4 IQF_environm…  -0.0795 -0.137   -0.0221         -7.6   -12.8     -2.2 6.76e-3
##  5 IQF_environm…  -0.0647 -0.125   -0.00429        -6.3   -11.8     -0.4 3.59e-2
##  6 z_ICR_Org_C_…   0.0152 -0.00468  0.0351          1.5    -0.5      3.6 1.33e-1
##  7 z_ICR_K_perc…  -0.0208 -0.0408  -0.000782       -2.1    -4       -0.1 4.17e-2
##  8 z_ICR_Avail_…   0.0238  0.00481  0.0429          2.4     0.5      4.4 1.42e-2
##  9 IQR_soil_tex…   0.107   0.0363   0.178          11.3     3.7     19.5 3.14e-3
## 10 IQR_soil_tex…   0.0459 -0.00988  0.102           4.7    -1       10.7 1.07e-1
## # ℹ 15 more rows

Same mix model analysis but for outpot the multi-season performance type of the block = Perf_Type; a cathegoric variable

##prepare the database

# Join CA_typology from Perf_Type, remove Yield_avg from drivers_mm
drivers_ptype_mm <- drivers_mm %>%
  left_join(Perf_Type %>% dplyr::select(IQR_block, CA_typology), by = "IQR_block") %>%
  dplyr::select(-Yield_avg)

##run the analysos

#Start from drivers_ID filtering variable to be used
data_mm_ptype <- drivers_ID %>%
  dplyr::select(-ICR_N_perc_23A, -IQF_Position_landscape,-IQR_compost_quality_avg,-ICR_compost_avg,-Relative_Yield_avg)


# Then we do average across seasons for the num and more frequent for the chr
library(dplyr)

# Helper to compute mode (most frequent value)
get_mode <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA)
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# 1. Identify column types
num_cols <- sapply(data_mm_ptype, is.numeric)
cat_cols <- sapply(data_mm_ptype, is.character) | sapply(data_mm_ptype, is.factor)

# 2. Collapse numeric columns using mean
numeric_summary <- data_mm_ptype %>%
  dplyr::select(IQR_block, which(num_cols)) %>%
  group_by(IQR_block) %>%
  summarise(across(everything(), mean, na.rm = TRUE), .groups = "drop")

# 3. Collapse categorical/factor columns using mode
categorical_summary <- data_mm_ptype %>%
  dplyr::select(IQR_block, which(cat_cols)) %>%
  group_by(IQR_block) %>%
  summarise(across(everything(), get_mode), .groups = "drop")

# 4. Join both summaries
data_mm_block <- left_join(numeric_summary, categorical_summary, by = "IQR_block")

#Filter out the blocks where performance type was not possible to be calculated (missing seasons)

# Filter out rows with NA in CA_typology
data_mm_block <- data_mm_block %>%
  filter(!is.na(CA_typology))

##Visualize the distributions

library(ggplot2)
library(gridExtra)  # for arranging plots in grids
## Warning: package 'gridExtra' was built under R version 4.3.3
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
# Split variables by type
num_vars <- names(data_mm_block)[sapply(data_mm_block, is.numeric)]
cat_vars <- names(data_mm_block)[sapply(data_mm_block, function(x) is.factor(x) | is.character(x))]

# Limit number of bars for categorical plots (for readability)
max_categories <- 20

# Plot numeric variables: Histograms
plot_list_num <- lapply(num_vars, function(var) {
  ggplot(data_mm_block, aes_string(x = var)) +
    geom_histogram(fill = "#69b3a2", color = "white", bins = 30) +
    theme_minimal() +
    labs(title = paste("Histogram of", var), x = var)
})
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Plot categorical variables: Bar plots
plot_list_cat <- lapply(cat_vars, function(var) {
  if (n_distinct(data_mm_block[[var]]) <= max_categories) {
    ggplot(data_mm_block, aes_string(x = var)) +
      geom_bar(fill = "#404080") +
      theme_minimal() +
      labs(title = paste("Frequency of", var), x = var) +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))
  } else {
    NULL  # Skip too many categories
  }
})

# Remove NULLs (from variables with too many levels)
plot_list_cat <- Filter(Negate(is.null), plot_list_cat)

# Show plots one-by-one or combine (if fewer than 9 at a time)
for (p in plot_list_num) print(p)

## Warning: Removed 14 rows containing non-finite outside the scale range
## (`stat_bin()`).

for (p in plot_list_cat) print(p)

mix models con variable performance

#Before lets combine some low# cathegories on the variables

library(dplyr)
library(forcats)

#Recode: Combine "Good" and "Acceptable" into "Acceptable"
data_mm_block <- data_mm_block %>%
  mutate(
   CA_typology = fct_collapse(
     CA_typology,
      Acceptable = c("Acceptable", "Good")
    ),
    CA_typology = fct_drop(CA_typology),  # drop unused level
    CA_typology = fct_relevel(CA_typology, "Very Poor")  # set reference
  )
#First we combine good and acceptable performance into acceptable becasue there is only few good
library(dplyr)
library(forcats)

# Recode: Combine "Good" and "Acceptable" into "Acceptable"
#data_mm_block <- data_mm_block %>%
#  mutate(
#    CA_typology = fct_collapse(
 #     CA_typology,
#      Acceptable = c("Acceptable", "Good")
#    ),
#    CA_typology = fct_drop(CA_typology),  # drop unused level
#    CA_typology = fct_relevel(CA_typology, "Very Poor")  # set reference
#  )

# Check the new distribution
#table(data_mm_block$CA_typology)



#────────────────────────────────────────────
# MULTINOMIAL MODEL: CA_typology as outcome
#────────────────────────────────────────────

library(nnet)
library(dplyr)
library(forcats)

# 1. Prepare the data
data_model <- data_mm_block %>%
  mutate(
    CA_typology = fct_drop(CA_typology),  # drop unused levels
    CA_typology = fct_relevel(CA_typology, "Very Poor"),  # reference level
    IQR_block = factor(IQR_block),
    IQR_TF_tubura_client  = factor(IQR_TF_tubura_client),
    IQR_plot_fert_quality = factor(IQR_plot_fert_quality),
    IQR_soil_texture      = fct_na_value_to_level(factor(IQR_soil_texture), "Unknown"),
    IQR_soil_color        = fct_na_value_to_level(factor(IQR_soil_color), "Unknown"),
    IQF_position_slope    = fct_na_value_to_level(factor(IQF_position_slope), "Unknown"),
    Weed_species_A_combined = fct_na_value_to_level(factor(Weed_species_A_combined), "Unknown")
  )

# 2. Define numeric variables to scale
num_vars <- c(
  "ICR_rainfall_avg",
  "Comp_amount_corr_quali",
  "ICF_planting_date",
  "ICR_Org_C_avg",
  "ICR_K_perc_23A",
  "ICR_Avail_P_avg",
  "ICR_arable_land_avg"
)

# 3. Median impute + scale numeric variables
data_model <- data_model %>%
  mutate(across(all_of(num_vars), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))) %>%
  mutate(across(all_of(num_vars), scale, .names = "z_{.col}"))

# 4. Final variable set
all_vars <- c(
  "CA_typology", "IQF_environment",
  paste0("z_", num_vars),
  "IQR_TF_tubura_client", "IQR_plot_fert_quality",
  "IQR_soil_texture", "IQR_soil_color",
  "IQF_position_slope", "Weed_species_A_combined"
)

data_model <- data_model %>%
  dplyr::select(all_of(all_vars)) %>%
  drop_na()

# 5. Fit base model
m0 <- multinom(CA_typology ~ 1, data = data_model, trace = FALSE)

# 6. Helper to add a block
add_block <- function(prev_model, add_terms, label) {
  formula_new <- as.formula(paste("CA_typology ~", deparse(formula(prev_model)[[3]]), "+", add_terms))
  new_model <- multinom(formula_new, data = data_model, trace = FALSE)
  cat("\n==============================\n")
  cat("ADD BLOCK:", label, "\n")
  print(anova(prev_model, new_model, test = "Chisq"))
  cat("AIC(prev,new):", AIC(prev_model), "→", AIC(new_model), "\n")
  return(new_model)
}

#────────────────────────────────────────────
# 7. Add covariate blocks step-by-step (REVISED)
#────────────────────────────────────────────

# (1) Site descriptor
m1 <- add_block(m0, "IQF_environment", "site descriptor: environment")
## 
## ==============================
## ADD BLOCK: site descriptor: environment 
## Likelihood ratio tests of Multinomial Models
## 
## Response: CA_typology
##                 Model Resid. df Resid. Dev   Test    Df LR stat.    Pr(Chi)
## 1                   1       940   634.8214                                 
## 2 1 + IQF_environment       934   619.7365 1 vs 2     6  15.0849 0.01960656
## AIC(prev,new): 638.8214 → 635.7365
# (2) Weather (only rainfall)
m2 <- add_block(
  m1,
  "z_ICR_rainfall_avg",
  "weather: rainfall"
)
## 
## ==============================
## ADD BLOCK: weather: rainfall 
## Likelihood ratio tests of Multinomial Models
## 
## Response: CA_typology
##                                      Model Resid. df Resid. Dev   Test    Df
## 1                      1 + IQF_environment       934   619.7365             
## 2 1 + IQF_environment + z_ICR_rainfall_avg       932   618.1107 1 vs 2     2
##   LR stat.   Pr(Chi)
## 1                   
## 2 1.625811 0.4435674
## AIC(prev,new): 635.7365 → 638.1107
# (3) Soil: fertility + texture + color
m3 <- add_block(
  m2,
  "z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture + IQR_soil_color + IQR_plot_fert_quality",
  "soil: fertility, texture, color"
)
## 
## ==============================
## ADD BLOCK: soil: fertility, texture, color 
## Likelihood ratio tests of Multinomial Models
## 
## Response: CA_typology
##                                                                                                                                                           Model
## 1                                                                                                                      1 + IQF_environment + z_ICR_rainfall_avg
## 2 1 + IQF_environment + z_ICR_rainfall_avg + z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture + IQR_soil_color + IQR_plot_fert_quality
##   Resid. df Resid. Dev   Test    Df LR stat.   Pr(Chi)
## 1       932   618.1107                                
## 2       908   592.5625 1 vs 2    24 25.54823 0.3764782
## AIC(prev,new): 638.1107 → 660.5625
# (4) Landscape
m4 <- add_block(
  m3,
  "IQF_position_slope",
  "landscape"
)
## Warning: Using formula(x) is deprecated when x is a character vector of length > 1.
##   Consider formula(paste(x, collapse = " ")) instead.
## 
## ==============================
## ADD BLOCK: landscape 
## Likelihood ratio tests of Multinomial Models
## 
## Response: CA_typology
##                                                                                                                                                           Model
## 1                                                                              1 + IQF_environment + z_ICR_rainfall_avg + z_ICR_Org_C_avg + +IQF_position_slope
## 2 1 + IQF_environment + z_ICR_rainfall_avg + z_ICR_Org_C_avg + z_ICR_K_perc_23A + z_ICR_Avail_P_avg + IQR_soil_texture + IQR_soil_color + IQR_plot_fert_quality
##   Resid. df Resid. Dev   Test    Df LR stat.  Pr(Chi)
## 1       920   609.6038                               
## 2       908   592.5625 1 vs 2    12 17.04134 0.148049
## AIC(prev,new): 660.5625 → 653.6038
# (5) Weeds
m5 <- add_block(
  m4,
  "Weed_species_A_combined",
  "weeds"
)
## Warning: Using formula(x) is deprecated when x is a character vector of length > 1.
##   Consider formula(paste(x, collapse = " ")) instead.
## 
## ==============================
## ADD BLOCK: weeds 
## Likelihood ratio tests of Multinomial Models
## 
## Response: CA_typology
##                                                                                   Model
## 1 1 + IQF_environment + z_ICR_rainfall_avg + z_ICR_Org_C_avg + +Weed_species_A_combined
## 2      1 + IQF_environment + z_ICR_rainfall_avg + z_ICR_Org_C_avg + +IQF_position_slope
##   Resid. df Resid. Dev   Test    Df LR stat.   Pr(Chi)
## 1       924   610.8611                                
## 2       920   609.6038 1 vs 2     4 1.257262 0.8685838
## AIC(prev,new): 653.6038 → 646.8611
# (6) Socio-economic
m6 <- add_block(
  m5,
  "z_ICR_arable_land_avg + IQR_TF_tubura_client",
  "socio-economic"
)
## Warning: Using formula(x) is deprecated when x is a character vector of length > 1.
##   Consider formula(paste(x, collapse = " ")) instead.
## 
## ==============================
## ADD BLOCK: socio-economic 
## Likelihood ratio tests of Multinomial Models
## 
## Response: CA_typology
##                                                                                                        Model
## 1                      1 + IQF_environment + z_ICR_rainfall_avg + z_ICR_Org_C_avg + +Weed_species_A_combined
## 2 1 + IQF_environment + z_ICR_rainfall_avg + z_ICR_Org_C_avg + +z_ICR_arable_land_avg + IQR_TF_tubura_client
##   Resid. df Resid. Dev   Test    Df LR stat. Pr(Chi)
## 1       924   610.8611                              
## 2       924   607.8603 1 vs 2     0  3.00079       0
## AIC(prev,new): 646.8611 → 643.8603
# (7) Management
m7 <- add_block(
  m6,
  "z_Comp_amount_corr_quali + z_ICF_planting_date",
  "management"
)
## Warning: Using formula(x) is deprecated when x is a character vector of length > 1.
##   Consider formula(paste(x, collapse = " ")) instead.
## 
## ==============================
## ADD BLOCK: management 
## Likelihood ratio tests of Multinomial Models
## 
## Response: CA_typology
##                                                                                                          Model
## 1 1 + IQF_environment + z_ICR_rainfall_avg + z_ICR_Org_C_avg + +z_Comp_amount_corr_quali + z_ICF_planting_date
## 2   1 + IQF_environment + z_ICR_rainfall_avg + z_ICR_Org_C_avg + +z_ICR_arable_land_avg + IQR_TF_tubura_client
##   Resid. df Resid. Dev   Test    Df  LR stat. Pr(Chi)
## 1       926   599.3893                               
## 2       924   607.8603 1 vs 2     2 -8.470988       1
## AIC(prev,new): 643.8603 → 631.3893
#────────────────────────────────────────────
# 8. Final model summary
#────────────────────────────────────────────

summary(m7)
## Call:
## multinom(formula = formula_new, data = data_model, trace = FALSE)
## 
## Coefficients:
##            (Intercept) IQF_environmentMid_Low_Dry IQF_environmentMid_Low_Wet
## Poor         -2.775988                 -0.2476626                  0.5831139
## Acceptable   -2.481242                  1.2299373                  1.0959155
##            IQF_environmentTransition z_ICR_rainfall_avg z_ICR_Org_C_avg
## Poor                       0.6059333         0.15329885      -0.1448562
## Acceptable                 0.6542255        -0.03045663      -0.1794830
##            z_Comp_amount_corr_quali z_ICF_planting_date
## Poor                     -0.0714487         -0.01694281
## Acceptable                0.5768253         -0.07242872
## 
## Std. Errors:
##            (Intercept) IQF_environmentMid_Low_Dry IQF_environmentMid_Low_Wet
## Poor         0.4712419                  0.7450346                  0.6221335
## Acceptable   0.3843984                  0.4889954                  0.4838385
##            IQF_environmentTransition z_ICR_rainfall_avg z_ICR_Org_C_avg
## Poor                       0.6443225          0.2716718       0.2069851
## Acceptable                 0.5140031          0.1953960       0.1535552
##            z_Comp_amount_corr_quali z_ICF_planting_date
## Poor                      0.2448871           0.2251178
## Acceptable                0.1628177           0.1607873
## 
## Residual Deviance: 599.3893 
## AIC: 631.3893

#Dive into drivers of performance

Make regression trees with the key variables leaving 15% fo the blcoks from each environment out for independent validation

Evaluate the % of farmers to be targeted if using the regression tree and the % of farmers that were acceptable or good

Compare a blanquet approach as compared to the targeted one:

##N of famers reached ##roportion of reached farmers with success ##Total N of farmers reached that will success / ## Modeling dissemination of blanket vs. targeted Using data from dissemination rates (% of potential dissemination based on success of target farmer) model how the intervantion impact woudl evolve in one and another case (from Intercrop trial and or from here if available) Some kind of modeling with dissemination theory and how if a farmer gets positive feedback X chance of adoption if only negative Y and if possitive and negative Z….then after N years, targeting could lead to even higher adoption and reduced costs