Country: Democratic Republic of Congo (DRC)
Survey: CODPHIA
Activity: Weighting
This notebook contains organised code for each step of the weighting
process for the CODPHIA data.
The major steps and their progress (as of 11/09/2025) are as follows:
[done] Load and organise/join the required datasets
[done] Calculate PSU weights
[done] Calculate household weights
[done] Calculate person-level interview weights
[done] Calculate person-level blood test weights
[done] Create jackknife replicate weights
Base weights compensate for unequal selection probabilities due to
multi-stage sampling designs. They are the inverse of the probability of
selection at each stage. With reference to the CODPHIA second-stage
sample, the base weight is psuwt0 = column ES = ADJ_PSU_WT, where
ADJ_PSU_WT = 1/column ER = 1/ADJ_PROB2.
\[ \text{PSU weight} = \frac{1}{\text{Selection Probability}} \]
Load the second-stage sample and remove duplicate EAs so that 396 unique EAs remain. The second-stage sample is used instead of the first-stage sample because it carries all the information from the first stage plus the household listing.
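As a quick sanity check on the inverse-probability idea, here is a minimal sketch with toy numbers (hypothetical, not CODPHIA values):

# Toy illustration: an EA selected with overall probability 0.0025 gets base weight 400
adj_prob <- c(0.0025, 0.0100, 0.0040)  # hypothetical EA selection probabilities
psu_base_wt <- 1 / adj_prob            # inverse-probability base weights
psu_base_wt                            # 400 100 250: each sampled EA stands in for 1/p EAs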
# This section imports packages and prepares for importing data
rm(list = ls())
# clear the workspace
#install.packages("pacman")
# we use pacman to load all packages
pacman::p_load(tidyverse,
               haven,
               rio,
               ggplot2,
               janitor,
               survey,
               lubridate,
               readxl,
               BART,
               future,
               furrr,
               progressr,
               caret,
               tinytex,
               srvyr
)
# create path to the datasets
cd_sample_dir <- "C:/Users/Kupamupindit/OneDrive - Columbia University Irving Medical Center/Documents/ICAP work/Countries/DRC/Sampling/2nd_stage_sampling/"
#cd_sample_dir <- "G:/Shared drives/DMA/PHIA/Weighting/CODPHIA2/Input_data/"
###################################################################
# read_excel comes from the readxl package; str_c concatenates strings as shown below
# Shift+Ctrl+M inserts a pipe (%>%); Ctrl+S saves your script
# You can google "RStudio shortcuts" for a full list of these shortcuts
# clean_names comes from the janitor package; it standardizes variable names
cd_zd_in <- read_excel(str_c(cd_sample_dir,"CODPHIA_second_stage_sampling_09_08_2024.xlsx")) %>%
  clean_names() %>%
  distinct(ea_id, .keep_all=TRUE)
psu_base_weights <- cd_zd_in |>
  mutate(segment_adjustment = if_else(is.na(segment), 1, segmentsize / easize),  # share of the EA covered by the segment (1 if not segmented)
         segadj = adj_prob2 / adj_prob1,                                         # ratio of stage-2 to stage-1 adjusted probabilities (check)
         psuwt0 = 1/(adj_prob1 * segment_adjustment))                            # PSU base weight = inverse overall selection probability
# Check
psu_base_weights |>
ggplot(aes(x = adj_psu_wt, y = psuwt0)) +
geom_point() +
geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'
psu_base_weights |>
ggplot(aes(x = factor(strata),
y = psuwt0)) +
geom_boxplot() +
labs(x = "Stratum/Region",
y = "PSU base weight") +
theme_bw()
psu_base_weights %>%
  group_by(id_province) |>
  mutate(psu_num_instrata = row_number()) |>
  ggplot(aes(x = psu_num_instrata,
             y = psuwt0)) +
  geom_col() +
  facet_wrap(~id_province) +
  theme_bw() +
  labs(x = "PSU number within stratum",
       y = "PSU base weight")
psu_weights <- psu_base_weights %>%
  mutate(EA_ID = ea_id) %>%
  select(EA_ID, psuwt0)
There are several steps for the household weights:
We need the counts of households listed and selected to compute the household selection probability and weight.
The sample design was to take a systematic random sample of \(h_i\) households from each EA. The household selection probability \(p_{hh,i}\) in EA \(i\) is then
\[ p_{hh,i} = \frac{h_i}{N_{\textrm{elig},i}}, \] where \(h_i\) is the target number of households for EA \(i\) and \(N_{\textrm{elig},i}\) is the number of eligible households in EA \(i\).
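For intuition, a minimal sketch with hypothetical counts (not CODPHIA values): sampling 30 of 450 eligible households gives a selection probability of 1/15 and a selection weight of 15.

h_i    <- 30             # hypothetical target number of sampled households in EA i
N_elig <- 450            # hypothetical number of eligible listed households in EA i
p_hh   <- h_i / N_elig   # household selection probability = 1/15
1 / p_hh                 # household selection weight = 15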
# ─────────────────────────────────────────────────────────────────────
# Step 1: Load listing and sampling files, then compute base household weights
# ─────────────────────────────────────────────────────────────────────
# Load the household listing data with the number of counted households per EA
listing <- read_dta(str_c(cd_sample_dir, "masterlist_prox1.dta")) %>%
  select(EA_ID, HH_NUM_CNT)  # Keep only the EA ID and the household count
# Load the second-stage sampling file and remove any redundant HH count column
sampled <- read_excel(str_c(cd_sample_dir, "CODPHIA_second_stage_sampling_09_08_2024.xlsx")) %>%
  select(-HH_NUM_CNT) %>%  # Remove the old household count variable if present
  # Join the listing data to attach HH counts to sampled EAs
  left_join(
    listing %>%
      group_by(EA_ID) %>%
      summarise(across(everything(), first), .groups = "drop"),  # Keep only one row per EA_ID
    by = "EA_ID"
  )
# ─────────────────────────────────────────────────────────────────────
# Step 2: Compute household selection weights
# ─────────────────────────────────────────────────────────────────────
hh_selection_weights <- sampled %>%
  # Join with PSU weights to compute combined household weights
  left_join(psu_weights, by = "EA_ID") %>%
  # Calculate household selection weight: max(1, HH_NUM_CNT / Sampled_HH)
  mutate(
    hh_selection_weight = pmax(1.0, HH_NUM_CNT / Sampled_HH),
    # Final base weight: product of the PSU weight and the household selection weight
    hhbwt0 = psuwt0 * hh_selection_weight
  ) %>%
  # Keep only one record per EA
  distinct(EA_ID, .keep_all = TRUE)
# ─────────────────────────────────────────────────────────────────────
# Step 3: Summarize and visualize the distribution of base household weights
# ─────────────────────────────────────────────────────────────────────
# Show summary statistics of the household base weights
summary(hh_selection_weights$hhbwt0)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 40.81 200.12 203.27 375.72 654.91 749.55
# Plot a histogram of the household weights to check their distribution
hist(hh_selection_weights$hhbwt0, breaks = 40)
# Grouped summary by province to explore weight variation
hh_selection_weights |>
group_by(Province) |>
summarise(
min = min(hhbwt0),
low_Q = quantile(hhbwt0, 0.25), # 1st quartile
medianwt = median(hhbwt0),
meanwt = mean(hhbwt0),
upp_Q = quantile(hhbwt0, 0.75), # 3rd quartile
max = max(hhbwt0)
)
## # A tibble: 2 × 7
## Province min low_Q medianwt meanwt upp_Q max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Haut-Katanga 408. 653. 657. 650. 661. 750.
## 2 Lualaba 40.8 199. 201. 214. 203. 634.
# Visualize household base weights by province using boxplots
hh_selection_weights |>
  ggplot(aes(y = hhbwt0)) +
  facet_wrap(~Province) +  # One boxplot per province
  geom_boxplot() +
  theme_bw() +
  labs(y = "Household base weight")  # Y-axis label; panels are labelled by province
The variable UPCODE_STAT_HH on the household interview file contains the final response status for each household. The codes are:

| UPCODE_STAT_HH | Definition |
|---|---|
| 1 | Respondent |
| 2 | Non-respondent |
| 3 | Ineligible |
| 4 | Unknown eligibility |
In this step, the weights of the households with unknown eligibility are re-distributed to the other households in the EA. The adjustment factor in EA \(i\) is computed as
\[ A^{\textrm{HH elig}}_i = \frac{\sum_{j=1}^{n_i} W_{ij}}{\sum_{j=1}^{n_{i,1} + n_{i,2} + n_{i,3}} W_{ij}}. \] The values \(n_{i,1-3}\) are the numbers of households in EA \(i\) with final status equal to 1, 2, or 3, as defined in the table above. The total number of selected households is \(n_i\), and \(W_{ij}\) is the household base weight of household \(j\) in EA \(i\). Note that because of the sample design, all households in each EA have the same selection probability/weight, so the value of this weight cancels out in computing the adjustment factor.
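For intuition, here is a minimal sketch of this adjustment for a single EA, with hypothetical statuses and equal base weights (as implied by the design):

# Toy EA: ten households, two with unknown eligibility (status 4)
toy <- data.frame(hhbwt0 = rep(100, 10),
                  status = c(1, 1, 1, 1, 1, 2, 2, 3, 4, 4))  # statuses 1-3 known, 4 unknown
adj <- sum(toy$hhbwt0) / sum(toy$hhbwt0[toy$status %in% 1:3])
adj  # 1000/800 = 1.25: the weight of the two unknown-eligibility households is spread over the other eight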
# ─────────────────────────────────────────────────────────────────────────────
# Load and process DRC household interview data to assign household status
# ─────────────────────────────────────────────────────────────────────────────
# Define file paths
# cd_sample_dir <- "C:/Users/Kupamupindit/OneDrive - Columbia University Irving Medical Center/Documents/ICAP work/Countries/DRC/Data/"
# indir <- "C:/Users/kupamupindit/OneDrive - Columbia University Irving Medical Center/Documents/ICAP work/Countries/DRC/weighting/Data/"
statdir <- "C:/Users/Kupamupindit/OneDrive - Columbia University Irving Medical Center/Documents/ICAP work/Countries/DRC/Weighting/Data/Stat_files/"
# Load the household interview dataset from the SAS file
# cd_hh <- import(str_c(indir, "CD_FFcorr_hh_int_DRAFT_20250628.xlsx")) %>%
#   # Exclude any household flagged for discarding
#   filter(is.na(DMFLAG) | DMFLAG != "DISCARD")
cd_hh <- import(str_c(statdir, "cd_ffstat_hh_int.sas7bdat"))
# Create derived household status variable
cd_hhstatus <- cd_hh
# GR 10-07: Use final stat file status instead of deriving
# mutate(
#   official_hhstatus = HH_STATUS,
#   # Combine resolution codes for flexibility
#   COMB_UPCODED_RESLTNDT = coalesce(as.numeric(UPCODE_RESLTNDT), as.numeric(RESULTNDT)),
#
#   # Classify households into status categories
#   UPCODE_STAT_HH = case_when(
#
#     # Status 1: Eligible Respondent
#     is.na(RESULTNDT) & (STARTINT == 1 & HHELIG == 1 & HHCONSTAT == 1 &
#       !is.na(HHQDTHSINS) & !is.na(ROSTER_START) & !is.na(HHQINSHH) &
#       !is.na(HHQASSIGN_INST)) ~ 1,
#
#     is.na(RESULTNDT) & (STARTINT == 4 & !is.na(ROSTER_START)) ~ 1,
#
#     # Status 2: Eligible Non-respondent
#     COMB_UPCODED_RESLTNDT %in% c(1, 2, 7, 8, 9) ~ 2,
#     is.na(RESULTNDT) & (HHELIG == 2 | HHCONSTAT %in% c(2, 3) |
#       (HHELIG == 1 & is.na(HHCONSTAT)) |
#       STARTINT == 4 & is.na(ROSTER_START)) ~ 2,
#
#     # Status 3: Ineligible Household
#     COMB_UPCODED_RESLTNDT %in% c(3, 4, 6) ~ 3,
#
#     # Status 4: Unknown Eligibility
#     is.na(COMB_UPCODED_RESLTNDT) | COMB_UPCODED_RESLTNDT %in% c(5, 96) ~ 4,
#
#     # Default to unknown if no rule applies
#     .default = 4
#   )
# )
# Create a frequency table of household statuses
cd_hhstatus |>
tabyl(HH_STATUS, UPCODE_STAT_HH)
## HH_STATUS 1 2 3 4
## 1 4218 0 0 0
## 2 0 465 0 0
## 3 0 0 265 0
## 4 0 0 0 26
# Visualize the proportion of households by response status within each province
cd_hhstatus |>
  # Rename column UPCODE_STAT_HH to hh_status for clarity
  rename(hh_status = UPCODE_STAT_HH) %>%
  # Build a stacked proportional bar chart
  ggplot(aes(x = factor(HHI_PROVINCE),                # X-axis: province (as a factor)
             fill = factor(hh_status, levels = 1:4,   # Fill color: household response status
                           labels = c("Eligible Respondent",      # 1
                                      "Eligible Non-respondent",  # 2
                                      "Ineligible Household",     # 3
                                      "Unknown Eligibility")))) + # 4
  # Use a stacked bar chart with proportions (each bar sums to 1)
  geom_bar(position = position_fill(reverse = TRUE)) +
  # Use a colorblind-friendly palette (viridis)
  scale_fill_viridis_d() +
  # Add axis and legend labels
  labs(x = "Region",          # Label for x-axis
       y = "Proportion",      # Label for y-axis
       fill = "hh_status") +  # Legend title for household status
  # Apply a clean black-and-white theme
  theme_bw()
# Prepare the household selection weights and create a standardized enumeration area code
selected_hh <- hh_selection_weights %>%
  # Keep only one unique observation per EA_ID (to avoid duplication)
  distinct(EA_ID, .keep_all = TRUE) %>%
  # Select only the enumeration area ID and the base household weight
  select(EA_ID, hhbwt0) %>%
  # Create a new variable HHI_EACODE by prefixing EA_ID with "CD"
  mutate(HHI_EACODE = paste0("CD", EA_ID))
# Step 1: Start preparing the household dataset for the unknown-eligibility adjustment
hh_uknwn_adj <- cd_hhstatus %>%
  # Filter out rows with missing enumeration area codes (HHI_EACODE)
  filter(!is.na(HHI_EACODE)) %>%
  # Keep only the relevant columns: enumeration area, province, and status code
  select(HHI_EACODE, EA_HHID_FIXED, HHI_PROVINCE, UPCODE_STAT_HH) %>%
  # Merge in the initial selection weights (hhbwt0) using the EA code
  left_join(selected_hh, by = "HHI_EACODE") %>%
  # Group by enumeration area to compute the adjustment factor at the EA level
  group_by(HHI_EACODE) %>%
  # Step 2: Create household response status and calculate adjustment factors
  mutate(
    # Assign the response disposition code to the new variable hhstatus
    hhstatus = UPCODE_STAT_HH,
    # Compute the adjustment factor (hh_adj_one): total weight over the weight of households
    # with known eligibility (statuses 1-3), which spreads the weight of the
    # unknown-eligibility households (status 4) over the rest of the EA
    hh_adj_one = sum(hhbwt0) / (
      sum(hhbwt0 * (hhstatus == 1)) +  # Respondents
      sum(hhbwt0 * (hhstatus == 2)) +  # Eligible non-respondents
      sum(hhbwt0 * (hhstatus == 3))    # Ineligible households
    ),
    # Assign adjusted weights based on response status
    hh_wt_one = case_when(
      hhstatus %in% c(1, 2) ~ hh_adj_one * hhbwt0,  # Apply the adjusted weight to eligible households
      hhstatus %in% c(3, 4) ~ 0                     # Zero weight for ineligible/unknown households
    )
  ) %>%
  # Ungroup to return to a regular data frame structure
  ungroup()
Non-response adjustment for households is done by dividing the responding household weights by the response rate in each EA. The non-response adjustment factor is
\[ A^{\textrm{HH NR}}_i = \frac{\sum_{j=1}^{n_{i,1} + n_{i,2}} W^{\textit{hh2}}_{ij}}{\sum_{j=1}^{n_{i,1}} W^{\textit{hh2}}_{ij}}, \] where \(W^{\textit{hh2}}_{ij}\) is the eligibility-adjusted weight of household \(j\) in EA \(i\), and the other quantities are as in the previous equation. The numerator sums over all eligible households (statuses 1 and 2) and the denominator over respondents only, matching the computation below.
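Again for intuition, a minimal sketch for a single EA with hypothetical eligibility-adjusted weights; the respondents absorb the non-respondent's weight and the EA total is preserved:

# Toy EA: three respondents, one eligible non-respondent (statuses 3/4 already carry weight 0)
toy <- data.frame(hh_wt_one = c(125, 125, 125, 125, 0),
                  hhstatus  = c(1, 1, 1, 2, 4))
adj_two <- sum(toy$hh_wt_one) / sum(toy$hh_wt_one[toy$hhstatus == 1])
adj_two                                                     # 500/375 = 1.33
sum(ifelse(toy$hhstatus == 1, adj_two * toy$hh_wt_one, 0))  # 500: total eligible weight preserved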
# Adjust household weights for non-response at the EA (Enumeration Area) level
hh_nr_adj <- hh_uknwn_adj %>%
  # Group data by enumeration area (HHI_EACODE)
  group_by(HHI_EACODE) %>%
  # Calculate adjusted weights and apply them only to responding households
  mutate(
    # Compute the adjustment factor: total eligible weight divided by the weight of responding households
    hh_adj_two = sum(hh_wt_one) / (sum(hh_wt_one * (hhstatus == 1))),
    # Apply the adjustment factor only to households with hhstatus == 1
    hh_wt_two = if_else(hhstatus == 1,
                        hh_adj_two * hh_wt_one,  # Adjusted weight
                        0)                       # Non-respondents get weight = 0
  ) %>%
  # Remove grouping to return to a normal (ungrouped) data frame
  ungroup()
# Prepare the final household weight dataset with standard variable names
final_hh_weights <- hh_nr_adj %>%
  # Rename variables for the final output: EAID for enumeration area, hhwt0 for the final household weight
  rename(EAID = HHI_EACODE,
         hhwt0 = hh_wt_two)
# mutate(EAID=HHI_EACODE)
# Create a boxplot showing the distribution of the second household adjustment factor by province
hh_nr_adj |>
  # Keep one unique observation per HHI_EACODE (to avoid duplication)
  distinct(HHI_EACODE, .keep_all = TRUE) |>
  # Define the x and y axes
  ggplot(aes(x = hh_adj_two,       # X-axis: the second household adjustment factor
             y = HHI_PROVINCE)) +  # Y-axis: the province or region
  # Add a boxplot layer to show the distribution of values by province
  geom_boxplot() +
  # Use a clean black-and-white theme
  theme_bw() +
  # Add axis labels for clarity
  labs(x = "HH adjustment factor two",  # Label for x-axis
       y = "Region")                    # Label for y-axis
# cd_csproroster <- import(str_c(indir, "CD_FFcorr_roster_DRAFT_20250628.xlsx"),
# guess_max = 10000) |>
# filter(is.na(DMFLAG) | DMFLAG != "DISCARD" | !is.na(EA_HHID_LN_FIXED)) %>%
# distinct(EA_HHID_LN_FIXED, .keep_all = TRUE)
# cd_csproindiv <- import(str_c(indir, "CD_FFcorr_ind_int_DRAFT_20250628.xlsx"),
# guess_max = 10000) |>
# filter(is.na(DMFLAG) | DMFLAG != "DISCARD" | is.na(EA_HHID_LN_FIXED)) %>%
# mutate(roster_elig=PRIMARY_ROSTER_STATUS)%>% # we need to check this
# distinct(EA_HHID_LN_FIXED, .keep_all = TRUE)
# GR 10-06 use final stat file instead of corrected draft
# These already have roster_elig on them as well, just renaming here.
cd_csproroster <- import(str_c(statdir, "cd_ffstat_roster.sas7bdat")) |>
mutate(roster_elig = ROSTER_ELIG)
cd_csproindiv <- import(str_c(statdir, "cd_ffstat_ind_int.sas7bdat")) |>
mutate(roster_elig = ROSTER_ELIG)
# mutate(roster_elig=PRIMARY_ROSTER_STATUS) # needs another look; could not find the same variable as in RUPHIA,
# so PRIMARY_ROSTER_STATUS was used as a placeholder for roster_elig
# Create a dataset called roster_status_vars, starting from the household roster data (cd_csproroster)
roster_status_vars <- cd_csproroster %>%
  # Create a new column indiv_status by copying values from ROSTERED_FINAL_STATUS
  mutate(indiv_status = ROSTERED_FINAL_STATUS) %>%
  # Join with the individual-level dataset cd_csproindiv using EA_HHID_LN_FIXED as the key
  left_join(select(cd_csproindiv, EA_HHID_LN_FIXED, INDSTATUS, roster_elig)) %>%
  # Keep only the relevant variables for analysis and rename some for clarity
  select(roster_sex = SEX,                 # Rename SEX to roster_sex
         roster_age = AGEYEARS,            # Rename AGEYEARS to roster_age
         RELATTOHH,                        # Relationship to head of household
         SLEEPHERE,                        # Whether the individual sleeps in the household
         EA_HHID_FIXED, EA_HHID_LN_FIXED,  # Household identifiers
         indiv_status, ELIGIBLE1,          # Individual status and eligibility
         HHR_EACODE, INDSTATUS             # Enumeration area code and individual status from the join
  )
## Joining with `by = join_by(EA_HHID_LN_FIXED, roster_elig)`
# Create a separate dataset for individual-level status variables
indiv_status_vars <- cd_csproindiv %>%
  # Select key geographic and demographic variables for individual analysis
  select(IND_PROVINCE, IND_EACODE,         # Province and enumeration area
         EA_HHID_FIXED, EA_HHID_LN_FIXED,  # Household identifiers
         INDISPSTATUS, CONFAGEY)           # Individual disposition status and confirmed age
# Merge roster and individual status information, and create classification variables
roster_ind_status_vars <- roster_status_vars %>%
  # Convert HHR_EACODE to character and rename it EAID for merging
  mutate(EAID = as.character(HHR_EACODE)) %>%
  # Join with the household weights dataset using EA_HHID_FIXED
  left_join(select(final_hh_weights, EA_HHID_FIXED, EAID, hhwt0), by = "EA_HHID_FIXED") %>%
  # Join with the individual-level status variables
  left_join(indiv_status_vars, by = c("EA_HHID_FIXED", "EA_HHID_LN_FIXED")) %>%
  # Create derived variables
  mutate(
    # Indicator: 1 if roster age is 15 or older, 0 otherwise, -9 if missing
    roster_age_gt_15 = if_else(roster_age >= 15, 1, 0, missing = -9),
    # Indicator: 1 if confirmed individual age >= 15, 0 otherwise, -9 if missing
    int_age_gt_15 = if_else(CONFAGEY >= 15, 1, 0, missing = -9),
    # Classify each person by residence and age group
    resident_status_roster = case_when(
      SLEEPHERE == 1 & roster_age >= 15 ~ "De facto person 15 years or older",
      SLEEPHERE != 1 & roster_age >= 15 ~ "Non-de facto person 15 years or older",
      roster_age < 15 ~ "Person under 15 years",
      TRUE ~ "OTHER"
    ),
    # Classification for the analysis table (used for tabulation)
    status_for_table = case_when(
      SLEEPHERE == 1 & roster_age >= 15 & int_age_gt_15 == 1 ~ 1,
      SLEEPHERE == 1 & roster_age >= 15 & int_age_gt_15 == 0 ~ 2,
      SLEEPHERE == 1 & roster_age >= 15 & int_age_gt_15 == -9 ~ 3,
      SLEEPHERE != 1 & roster_age >= 15 ~ 4,
      roster_age < 15 ~ 5,
      TRUE ~ 99999
    ),
    # Recode ineligible people: assign code 90 to INDSTATUS and 9 to indiv_status
    INDSTATUS = if_else(ELIGIBLE1 == 2, 90, INDSTATUS),
    indiv_status = if_else(ELIGIBLE1 == 2, 9, indiv_status)
  )
# Generate the summary table with counts and weighted estimates
table_counts <- roster_ind_status_vars %>%
  # Convert status_for_table to a factor with fixed levels
  mutate(
    status_for_table = factor(status_for_table, levels = 1:5),
    # Create an age-group description based on the confirmed interview age
    int_age_desc = case_when(
      status_for_table %in% c(4, 5) ~ "NA",
      int_age_gt_15 == 1 ~ "15+",
      int_age_gt_15 == 0 ~ "Under 15",
      int_age_gt_15 == -9 ~ "Unknown"
    )
  ) %>%
  # Group by the classification variables without dropping any factor levels
  group_by(status_for_table, resident_status_roster, int_age_desc, .drop = FALSE) %>%
  # Summarize: count people and sum household weights
  summarise(
    rostered_persons = n(),
    wgt_number_rostered = sum(hhwt0, na.rm = TRUE)
  ) %>%
  # Reassign labels for status 2 to keep resident_status and age group consistent
  mutate(
    resident_status_roster = if_else(status_for_table != 2, resident_status_roster, "De facto person 15 years or older"),
    int_age_desc = if_else(status_for_table != 2, int_age_desc, "Under 15")
  ) %>%
  # Add a total row to the summary table
  janitor::adorn_totals(where = "row")
## `summarise()` has grouped output by 'status_for_table',
## 'resident_status_roster'. You can override using the `.groups` argument.
# ─────────────────────────────────────────────────────────────────────────────
# Construct Phase One Interview Weights and Adjust for Eligibility
# ─────────────────────────────────────────────────────────────────────────────
# Step 1: Prepare input data for phase-one interview weighting
int_wgts_phase_one_input <- select(cd_csproroster,
                                   EA_HHID_FIXED, EA_HHID_LN_FIXED, HHR_EACODE, SEX, AGEYEARS, SLEEPHERE) %>%
  # Merge with the individual data to obtain the confirmed age (CONFAGEY)
  left_join(select(cd_csproindiv, EA_HHID_LN_FIXED, CONFAGEY), by = "EA_HHID_LN_FIXED") %>%
  # Create derived variables for eligibility and stratification
  mutate(
    EAID = as.character(HHR_EACODE),  # Convert the EA code to character for merging
    roster_age_gt_15 = case_when(     # Determine whether the individual is 15+
      AGEYEARS >= 0 & AGEYEARS < 15 ~ 0,
      AGEYEARS >= 15 ~ 1,
      TRUE ~ -9
    ),
    int_elig_status = case_when(      # Classify individual eligibility status
      roster_age_gt_15 == 1 & CONFAGEY >= 15 ~ 1,                 # Eligible and confirmed 15+
      roster_age_gt_15 == 1 & CONFAGEY >= 0 & CONFAGEY < 15 ~ 2,  # Misclassified as eligible
      roster_age_gt_15 == 1 ~ 3,                                  # Missing or unknown confirmed age
      roster_age_gt_15 == 0 ~ 4,                                  # Under 15
      TRUE ~ 99999
    ),
    # Assign individuals to age-sex cells
    age_sex_cell = case_when(
      SEX == 1 & roster_age_gt_15 == 1 & AGEYEARS < 18 ~ "Male 15-17",
      SEX == 1 & roster_age_gt_15 == 1 & AGEYEARS >= 18 ~ "Male 18+",
      SEX == 2 & roster_age_gt_15 == 1 & AGEYEARS < 18 ~ "Female 15-17",
      SEX == 2 & roster_age_gt_15 == 1 & AGEYEARS >= 18 ~ "Female 18+",
      AGEYEARS < 15 ~ "Under 15",
      TRUE ~ "Missing/Unknown"
    )
  ) %>%
  # Merge in the household base weights for each individual
  left_join(select(final_hh_weights, EA_HHID_FIXED, EAID, hhwt0), by = "EA_HHID_FIXED")
# Step 2: Calculate adjustment factors by age-sex cell and eligibility
int_wgts_phase_one_adj <- int_wgts_phase_one_input %>%
  # Keep only eligible and misclassified individuals
  filter(int_elig_status %in% c(1, 2, 3)) %>%
  # Group by age-sex cell and eligibility category
  group_by(age_sex_cell, int_elig_status) %>%
  # Count individuals and sum household weights
  summarise(
    count = n(),
    weighted_count = sum(hhwt0, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  # Reshape the table so each eligibility status becomes a column
  pivot_wider(
    values_from = c(count, weighted_count),
    names_from = int_elig_status,
    names_glue = "status_{int_elig_status}_{.value}",
    values_fill = 0
  ) %>%
  # Calculate the adjustment factor to correct for underestimation due to missing/misclassified data
  mutate(
    phase_one_int_adj_factor = (status_1_weighted_count + status_3_weighted_count) / status_1_weighted_count
  )
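A quick numeric check of the factor in the last mutate(), with hypothetical weighted counts for one age-sex cell (not CODPHIA values):

status_1_weighted_count <- 9000  # hypothetical: confirmed eligible, 15+
status_3_weighted_count <- 600   # hypothetical: eligible by roster, age unconfirmed
(status_1_weighted_count + status_3_weighted_count) / status_1_weighted_count  # factor ~= 1.067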
# Step 3: Apply the adjustment factor to each eligible individual
int_wgts_phase_one_adjusted <- int_wgts_phase_one_input %>%
  # Merge in the adjustment factor based on the age-sex cell
  left_join(select(int_wgts_phase_one_adj, age_sex_cell, phase_one_int_adj_factor),
            by = "age_sex_cell") %>%
  # Compute the adjusted individual interview weight
  mutate(
    phase_one_adj_int_wgt = hhwt0 * phase_one_int_adj_factor
  )
# Step 4: Summarise the adjusted weights by age-sex cell
int_wgts_phase_one_adjusted %>%
  group_by(age_sex_cell) %>%
  summarise(
    mean_weight = mean(phase_one_adj_int_wgt, na.rm = TRUE)
  )
## # A tibble: 5 × 2
## age_sex_cell mean_weight
## <chr> <dbl>
## 1 Female 15-17 536.
## 2 Female 18+ 530.
## 3 Male 15-17 578.
## 4 Male 18+ 578.
## 5 Under 15 NaN
roster_status_vars <- cd_csproroster %>%
#mutate(indiv_status = ROSTERED_FINAL_STATUS) %>%
# GR 10-07; just use existing indiv_status
mutate(indiv_status = INDIV_STATUS) |>
left_join(select(cd_csproindiv, EA_HHID_LN_FIXED, INDSTATUS, roster_elig)) %>%
select(roster_sex = SEX,
roster_age = AGEYEARS,
RELATTOHH, SLEEPHERE,
EA_HHID_FIXED, EA_HHID_LN_FIXED,
indiv_status, ELIGIBLE1, HHR_EACODE,INDSTATUS
)
## Joining with `by = join_by(EA_HHID_LN_FIXED, roster_elig)`
indiv_status_vars <- cd_csproindiv %>%
select(IND_PROVINCE, IND_EACODE,
EA_HHID_FIXED, EA_HHID_LN_FIXED, INDISPSTATUS, CONFAGEY)
# AGEYEARS, UPCODE_IND0040, roster_elig could not be found in this file
indv <- cd_csproindiv %>%
select(EA_HHID_LN_FIXED,INDIVCNTSTATUS)
roster_ind_status_vars <- roster_status_vars %>%
left_join(indv) %>%
mutate(EAID = as.character(HHR_EACODE)) %>%
left_join(select(final_hh_weights, EA_HHID_FIXED, EAID, hhwt0)) %>%
left_join(indiv_status_vars) %>%
mutate(
roster_age_gt_15 = if_else(roster_age >= 15, 1, 0, missing = -9),
int_age_gt_15 = if_else(CONFAGEY >= 15, 1, 0, missing = -9),
resident_status_roster = case_when(SLEEPHERE == 1 & roster_age >= 15 ~ "De facto person 15 years or older",
SLEEPHERE != 1 & roster_age >= 15 ~ "Non-de facto person 15 years or older",
roster_age < 15 ~ "Person under 15 years",
TRUE ~ "OTHER"),
status_for_table = case_when(SLEEPHERE == 1 & roster_age >= 15 & int_age_gt_15 == 1 ~ 1,
SLEEPHERE == 1 & roster_age >= 15 & int_age_gt_15 == 0 ~ 2,
SLEEPHERE == 1 & roster_age >= 15 & int_age_gt_15 == -9 ~ 3,
SLEEPHERE != 1 & roster_age >= 15 ~ 4,
roster_age < 15 ~ 5,
TRUE ~ 99999),
# Give ineligible people indiv_status = 9 and INDSTATUS = 9
INDSTATUS = if_else(ELIGIBLE1 == 2, 90, INDSTATUS),
indiv_status = if_else(ELIGIBLE1 == 2, 9, indiv_status))
## Joining with `by = join_by(EA_HHID_LN_FIXED)`
## Joining with `by = join_by(EA_HHID_FIXED, EAID)`
## Joining with `by = join_by(EA_HHID_FIXED, EA_HHID_LN_FIXED)`
table_counts <- roster_ind_status_vars %>%
mutate(status_for_table = factor(status_for_table, levels = 1:5),
int_age_desc = case_when(
status_for_table %in% c(4,5) ~ "NA",
int_age_gt_15 == 1 ~ "15+",
int_age_gt_15 == 0 ~ "Under 15",
int_age_gt_15 == -9 ~ "Unknown")) %>%
group_by(status_for_table, resident_status_roster, int_age_desc, .drop = FALSE) %>%
summarise(rostered_persons = n(),
wgt_number_rostered = sum(hhwt0)) %>%
mutate(resident_status_roster = if_else(status_for_table != 2, resident_status_roster, "De facto person 15 years or older"),
int_age_desc = if_else(status_for_table != 2, int_age_desc, "Under 15")) %>%
adorn_totals(where = "row")
## `summarise()` has grouped output by 'status_for_table',
## 'resident_status_roster'. You can override using the `.groups` argument.
print(table_counts)
## status_for_table resident_status_roster int_age_desc
## 1 De facto person 15 years or older 15+
## 2 De facto person 15 years or older Under 15
## 3 <NA> <NA>
## 4 Non-de facto person 15 years or older NA
## 5 Person under 15 years NA
## Total - -
## rostered_persons wgt_number_rostered
## 9844 4322788.849
## 7 2897.775
## 0 0.000
## 2452 1095777.178
## 6427 2844020.721
## 18730 8265484.524
# Create the Phase 1 interview weight input dataset:
# select the necessary individual-level variables and join the confirmed age from the individual interview
int_wgts_phase_one_input <- select(cd_csproroster, EA_HHID_FIXED, EA_HHID_LN_FIXED, HHR_EACODE, SEX, AGEYEARS, SLEEPHERE) %>%
  left_join(select(cd_csproindiv, EA_HHID_LN_FIXED, CONFAGEY)) %>%
  # Create the EAID string and define eligibility based on age
  mutate(EAID = as.character(HHR_EACODE),
         # Determine whether the roster member is 15 or older (potentially eligible)
         roster_age_gt_15 = case_when(AGEYEARS >= 0 & AGEYEARS < 15 ~ 0,
                                      AGEYEARS >= 15 ~ 1,
                                      TRUE ~ -9),
         # Assign interview eligibility status:
         # 1 = eligible and confirmed 15+, 2 = eligible by roster but confirmed under 15,
         # 3 = eligible but no confirmed age, 4 = under 15, 99999 = other/missing
         int_elig_status = case_when(roster_age_gt_15 == 1 & CONFAGEY >= 15 ~ 1,
                                     roster_age_gt_15 == 1 & CONFAGEY >= 0 & CONFAGEY < 15 ~ 2,
                                     roster_age_gt_15 == 1 ~ 3,
                                     roster_age_gt_15 == 0 ~ 4,
                                     TRUE ~ 99999),
         # Categorize individuals into age-sex groups for the adjustment factor computation
         age_sex_cell = case_when(
           SEX == 1 & roster_age_gt_15 == 1 & AGEYEARS < 18 ~ "Male 15-17",
           SEX == 1 & roster_age_gt_15 == 1 & AGEYEARS >= 18 ~ "Male 18+",
           SEX == 2 & roster_age_gt_15 == 1 & AGEYEARS < 18 ~ "Female 15-17",
           SEX == 2 & roster_age_gt_15 == 1 & AGEYEARS >= 18 ~ "Female 18+",
           AGEYEARS < 15 ~ "Under 15",
           TRUE ~ "Missing/Unknown")) %>%
  # Merge in the household base weights
  left_join(select(final_hh_weights, EA_HHID_FIXED, EAID, hhwt0))
## Joining with `by = join_by(EA_HHID_LN_FIXED)`
## Joining with `by = join_by(EA_HHID_FIXED, EAID)`
# Compute adjustment factors for Phase 1 interview nonresponse.
# This corrects for potential bias in age-sex groups where not everyone responded.
int_wgts_phase_one_adj1 <- int_wgts_phase_one_input %>%
  filter(int_elig_status %in% c(1,2,3)) %>%     # Keep only the potentially eligible
  group_by(age_sex_cell, int_elig_status) %>%
  summarise(count = n(),                        # Raw count
            weighted_count = sum(hhwt0)) %>%    # Sum of weights
  # Reshape to wide format with one column per status
  pivot_wider(values_from = c(count, weighted_count),
              names_from = int_elig_status,
              names_glue = "status_{int_elig_status}_{.value}",
              values_fill = 0) %>%
  # Adjustment factor corrects for eligible cases with unconfirmed age (status 3)
  mutate(phase_one_int_adj_factor = (status_1_weighted_count + status_3_weighted_count)/status_1_weighted_count)
## `summarise()` has grouped output by 'age_sex_cell'. You can override using the
## `.groups` argument.
# Apply the adjustment factor to create the Phase 1 adjusted interview weights
int_wgts_phase_one_adjusted <- int_wgts_phase_one_input %>%
  left_join(select(int_wgts_phase_one_adj1, age_sex_cell, phase_one_int_adj_factor)) %>%
  mutate(phase_one_adj_int_wgt = hhwt0 * phase_one_int_adj_factor)
## Joining with `by = join_by(age_sex_cell)`
# Summarize mean adjusted weights by age-sex cell.
# This helps verify whether certain groups were more affected by nonresponse.
int_wgts_phase_one_adjusted %>% group_by(age_sex_cell) %>%
summarise(mean_weight = mean(phase_one_adj_int_wgt))
## # A tibble: 5 × 2
## age_sex_cell mean_weight
## <chr> <dbl>
## 1 Female 15-17 536.
## 2 Female 18+ 530.
## 3 Male 15-17 578.
## 4 Male 18+ 578.
## 5 Under 15 NA
Interview non-response adjustment
First remove all the process variables, keeping only the important ones.
# Helper function to expand multiple-select answers like "ABD" into binary indicators for each letter.
# For example, if "B" is in "ABD", the B indicator becomes 1, otherwise 0.
# This is useful for transforming checkbox responses into an analyzable format.
expand_mult <- function(x, i) {
  if_else(grepl(as.character(i), x, fixed = TRUE), 1, 0)
}
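A few quick calls showing what expand_mult returns (run after loading dplyr, which supplies if_else):

expand_mult("ABD", "B")               # 1: "B" is among the selected codes
expand_mult("ABD", "C")               # 0: "C" was not selected
expand_mult(c("ABD", "CD", NA), "D")  # 1 1 0: vectorized; NA does not match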
# Select and clean the key household-level variables from the household interview.
# Focus on responding households, expand multiple-response questions, replace missing values, simplify counts.
hhvars <- cd_hhstatus %>%
  filter(UPCODE_STAT_HH == 1) %>%  # Only keep responding households
  select(36:113, SICK_HOUSEHOLD, contains("EA_HHID")) %>%  # Select useful columns including location identifiers
  mutate(
    DEATHCOUNT = replace_na(DEATHCOUNT, 0),  # Replace missing death counts with 0
    # Expand multi-response variables into binary flags for the answer codes
    across(c(HHQITEMS, HHQOWN), list(~expand_mult(.x, "A"),
                                     ~expand_mult(.x, "B"),
                                     ~expand_mult(.x, "C"),
                                     ~expand_mult(.x, "D"),
                                     ~expand_mult(.x, "E"),
                                     ~expand_mult(.x, "F"),
                                     ~expand_mult(.x, "G"),
                                     ~expand_mult(.x, "H"),
                                     ~expand_mult(.x, "I"),
                                     ~expand_mult(.x, "X"),
                                     ~expand_mult(.x, "Y"),
                                     ~expand_mult(.x, "Z")),
           .names = "{.col}{.fn}"),
    across(c(DEATHS), ~if_else(.x %in% c(-8, -9), 2, .x)),       # Recode "Don't know"/"Refused" to a generic 2
    across(ends_with("DUR"), ~if_else(.x %in% c(-8,-9), 0, .x)), # Set unknown durations to 0
    across(ends_with("NUM"), ~if_else(.x == -7, 1, .x)),         # Recode the special missing code (-7) to 1 for numeric counts
    across(c(HHEMANCMINOR, ends_with("DUR")), ~replace_na(.x, 0)), # Fill missing with 0
    across(c(TOILETSHARE), ~replace_na(.x, 2))                   # Assume '2' = other/shared if missing
  ) %>%
  # Drop irrelevant or redundant variables that won't be used
  select(-c(HHCNTSTATUS, HHQITEMS, HHQOWN, HHCAGREEDT, HHSTFSIG, HHRINS4,
            contains("HHCONST"),
            starts_with(c("DIED", "DTH", "INELIG")))) %>%
  # Unify missing-value codes: treat "Don't know" (-8) as "Refused" (-9)
  mutate(across(where(is.numeric), ~if_else(.x == -8, -9, .x))) %>%
  # Convert durations/counts into binary indicators (1 = yes/some, 0 = none)
  # to avoid low cell counts in modeling
  mutate(across(ends_with(c("DUR","NUM")), ~if_else(.x > 0, 1, 0))) %>%
  # Keep only variables that vary across records (remove constants)
  select(where(function(col) length(unique(col)) > 1))
# Prepare the roster data (individual-level composition of the household):
# clean, drop unnecessary columns, handle missing values.
roster_vars <- cd_csproroster %>%
  select(-c(
    HHR_SURVEYFORMSDT, HHR_HHI_UUID, HHR_DEVICEID, STAFFIDHH,
    HHRSTS, HHR_HHMEM_ID, HHRETS, HHDRSTS, INTRODR, AGEMONTHS,
    HHDRETS, MINORLINENUM, HHCRSTS, HHCRETS, DMFLAG, DMCOMMENT,
    HHRINS5, HHQOVCINS1, HHQSUPPSTART, HHQOVCEND, OVCLINENUM,
    LIVEAWAYM, LIVEAWAYY, LIVECOUNTRY,
    HHQASSIGN_INST, HHQASSIGN, WITHDRAWHH,
    EA_HHID_FIXED, HHR_EACODE, HHR_SHH, HHR_CT,
    HHR_PROVINCE, HHR_EACODE_ORIG, HHR_SHH_ORIG, HHR_TEAM_ID,
    NATMOMNM, MOMFEMNAME, DADMALENAME, MALEGUARDNAME,
    RELATTOHH, ROSTERED_FINAL_STATUS, ELIGIBLE1,
    HHCCONSEL, HHCEMANC, HHCCONT0)) %>%  # Drop identifier, date, and control columns
  mutate(
    across(c(SICK3MO), ~replace_na(.x, -8)),  # Replace missing illness with "Don't know"
    across(c(EMANCIPATED, LIVEAWAY, starts_with("SUPPORT")), ~replace_na(.x, 2)),  # Assume 2 = no/missing
    across(c(VULNERABLE_CHILD), ~replace_na(.x, 0)),  # No vulnerable child if missing
    across(where(is.numeric), ~if_else(.x == -8, -9, .x))  # Standardize missing codes
  ) %>%
  select(where(function(col) length(unique(col)) > 1)) %>%
  left_join(select(indiv_status_vars, EA_HHID_LN_FIXED))  # Add the line-level ID
## Joining with `by = join_by(EA_HHID_LN_FIXED)`
# Create a binary outcome for whether an individual responded to the individual interview
interview_response <- cd_csproindiv %>%
  filter(roster_elig == 1) %>%  # Keep eligible individuals
  #mutate(ind_responded = if_else(INDIVCNTSTATUS == 1, 1, 0)) %>%
  # GR 10-07: use the final INDIV_STATUS to determine response
  mutate(ind_responded = if_else(INDIV_STATUS == 1, 1, 0)) %>%  # 1 = responded, 0 = did not
  select(ind_responded, EA_HHID_FIXED, EA_HHID_LN_FIXED, roster_elig)
# Merge the household, interview, and roster datasets;
# drop irrelevant variables, especially large open-text or control vars.
nr_vars <- hhvars %>%
  left_join(interview_response) %>%
  left_join(roster_vars) %>%
  select(-c(LIVEREGIONLIVECOUNTRYOTH, INDIV_STATUS))
## Joining with `by = join_by(EA_HHID_FIXED)`
## Joining with `by = join_by(EA_HHID_LN_FIXED, roster_elig)`
# filter(roster_elig == 1) # Optional: restrict to eligible individuals
# Create a recode for teenage status: 1 = 15-17, 2 = 18+.
# Used later to stratify the nonresponse model.
nr_vars_recodes <- nr_vars %>%
  mutate(H_AGETEENYEARS = if_else(AGEYEARS >= 15 & AGEYEARS <= 17, 1, 2))
# View the variable names in the final dataset
# vars_to_remove <- c(
# "HHQITEMS1", "HHQITEMS2", "HHQITEMS3", "HHQITEMS4", "HHQITEMS5", "HHQITEMS6", "HHQITEMS11",
# "HHQOWN1", "HHQOWN2", "HHQOWN3", "HHQOWN4", "HHQOWN5", "HHQOWN11", "HHQOWN12", "ind_responded",
# "EA_HHID_LN_FIXED", "roster_elig", "NOM_COURT", "CODE_SECTEUR", "TYPE_SECTEUR", "CORRECT_LN",
# "CSPVERSNHH", "SEX", "LIVEHERE", "SLEEPHERE", "AGEYEARS", "EMANCIPATED", "LIVEAWAY",
# "LIVEPROVINCE", "LIVEREGIONLIVECOUNTRYOTH", "SICK3MO", "CHENSCH", "MOMALIVE", "MOMHHM",
# "FEMGUARDHHM", "DADALIVE", "DADHHM", "MALEGUARDHHM", "MOMSICK", "MOMAIDS", "DADSICK", "DADAIDS",
# "VULNERABLE_CHILD", "SUPPORTMED12", "SUPPORTEMOT12", "SUPPORTEMOT3", "SUPPORTMATER12",
# "SUPPORTMATER3", "SUPPORTSOCIAL12", "SUPPORTSOCIAL3", "SUPPORTSCHOL12", "HHR_HHI_UUID_ORIG",
# "HHI_ENDSURVEY", "HH_STATUS", "WATERSOURCEOT"
# )
# nr_vars <- nr_vars %>% select(-all_of(vars_to_remove))
names(nr_vars)
## [1] "HHAGEYEARS" "HHEMANCMINOR" "HHELANG"
## [4] "MOREPPL" "SICK_HOUSEHOLD" "DEATHS"
## [7] "DEATHCOUNT" "WATERSOURCE" "WATERSOURCEOT"
## [10] "TOILETTYPE" "TOILETTYPEOT" "TOILETSHARE"
## [13] "COOKINGFUEL" "COOKINGFUELOT" "MATFLOOR"
## [16] "MATFLOOROT" "MATROOF" "MATROOFOT"
## [19] "MATEXWALLS" "MATEXWALLSOT" "ROOMSLEEP"
## [22] "OWNCOWNUM" "OWNGOATNUM" "OWNCHIKNNUM"
## [25] "OWNDOGNUM" "OWNHORSENUM" "ECONSUP12"
## [28] "ECONSUP12OT" "EA_HHID_FIXED" "HHQITEMS1"
## [31] "HHQITEMS2" "HHQITEMS3" "HHQITEMS4"
## [34] "HHQITEMS5" "HHQITEMS6" "HHQITEMS11"
## [37] "HHQOWN1" "HHQOWN2" "HHQOWN3"
## [40] "HHQOWN4" "HHQOWN5" "HHQOWN11"
## [43] "HHQOWN12" "ind_responded" "EA_HHID_LN_FIXED"
## [46] "roster_elig" "NOM_COURT" "CODE_SECTEUR"
## [49] "TYPE_SECTEUR" "URBAN_RURAL" "CORRECT_LN"
## [52] "CSPVERSNHH" "SEX" "LIVEHERE"
## [55] "SLEEPHERE" "AGEYEARS" "EMANCIPATED"
## [58] "LIVEAWAY" "LIVEPROVINCE" "SICK3MO"
## [61] "CHENSCH" "MOMALIVE" "MOMHHM"
## [64] "FEMGUARDHHM" "DADALIVE" "DADHHM"
## [67] "MALEGUARDHHM" "MOMSICK" "MOMAIDS"
## [70] "DADSICK" "DADAIDS" "VULNERABLE_CHILD"
## [73] "SUPPORTMED12" "SUPPORTEMOT12" "SUPPORTEMOT3"
## [76] "SUPPORTMATER12" "SUPPORTMATER3" "SUPPORTSOCIAL12"
## [79] "SUPPORTSOCIAL3" "SUPPORTSCHOL12" "HHR_HHI_UUID_ORIG"
## [82] "HHI_ENDSURVEY" "ROSTER_ELIG"
# Prepare the dataset for BART model input by converting variables.
# Numeric variables are needed for modeling; categorical variables are turned into factors.
int_nr_data_in <- nr_vars_recodes %>%
  # Select the important columns and arrange by strata
  select(H_AGETEENYEARS, SEX, EA_HHID_FIXED, EA_HHID_LN_FIXED, ind_responded, everything()) %>%
  arrange(H_AGETEENYEARS, SEX, EA_HHID_FIXED, EA_HHID_LN_FIXED) %>%
  # Recode age into grouped categories for individuals and household heads
  mutate(age_group = case_when(AGEYEARS < 18 ~ AGEYEARS,
                               between(AGEYEARS, 18, 24) ~ 1,
                               between(AGEYEARS, 25, 29) ~ 2,
                               between(AGEYEARS, 30, 39) ~ 3,
                               between(AGEYEARS, 40, 49) ~ 4,
                               between(AGEYEARS, 50, 59) ~ 5,
                               AGEYEARS >= 60 ~ 6,
                               TRUE ~ 99),
         hh_age_group = case_when(between(HHAGEYEARS, 15, 24) ~ 1,
                                  between(HHAGEYEARS, 25, 29) ~ 2,
                                  between(HHAGEYEARS, 30, 39) ~ 3,
                                  between(HHAGEYEARS, 40, 49) ~ 4,
                                  between(HHAGEYEARS, 50, 59) ~ 5,
                                  HHAGEYEARS >= 60 ~ 6,
                                  TRUE ~ 99),
         # Convert the appropriate variables to factors
         across(-c(starts_with("TOTAL"), ends_with("NUM"), ends_with("DUR"),
                   DEATHCOUNT, AGEYEARS, ROOMSLEEP, HHAGEYEARS, ind_responded), as_factor),
         # Ensure the selected numeric columns stay numeric
         across(c(starts_with("TOTAL"), ends_with("NUM"), ends_with("DUR"),
                  DEATHCOUNT, AGEYEARS, ROOMSLEEP, HHAGEYEARS, ind_responded))) %>%
  # Drop variables that are not useful for modeling
  select(-c(SLEEPHERE, DEATHCOUNT, HHEMANCMINOR, LIVEAWAY,
            AGEYEARS, HHAGEYEARS),
         # GR 10-07: drop more variables
         -c(MOREPPL,
            WATERSOURCEOT, TOILETTYPEOT, COOKINGFUELOT, MATFLOOROT, MATROOFOT,
            MATEXWALLSOT, ECONSUP12OT,
            TYPE_SECTEUR, HHR_HHI_UUID_ORIG, HHI_ENDSURVEY, ROSTER_ELIG))
# Extract binary (2-level) variables for separate handling
int_binary_vars <- int_nr_data_in %>%
select(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX, where(function(col) length(unique(col)) == 2), -ind_responded) %>%
mutate(across(-c(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX), as.numeric)) %>%
mutate(across(-c(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX), ~if_else(.x == 2, 0, .x))) %>%
mutate(across(-c(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX), as.factor))
# Prepare the model matrix for all other (non-binary) variables
int_nr_data_mod_matrix_in <- int_nr_data_in %>%
  select(where(function(col) length(unique(col)) != 2), ind_responded) %>%
  select(-contains("EA_HHID"))
dmy_int_nr_data <- dummyVars("ind_responded ~ .", data = int_nr_data_mod_matrix_in)
df_int_nr_data <- int_nr_data_mod_matrix_in
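For readers unfamiliar with caret::dummyVars, here is a toy illustration (hypothetical data) of what it produces: each factor level becomes its own 0/1 column, which predict() then fills in.

toy <- data.frame(y = c(1, 0, 1),
                  grp = factor(c("a", "b", "a")))  # hypothetical toy frame
dmy_toy <- dummyVars("y ~ .", data = toy)
predict(dmy_toy, newdata = toy)  # columns grp.a and grp.b with 0/1 indicators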
# Create the final matrix using the dummy variables + binary vars
model_matrix_int_nr <- predict(dmy_int_nr_data, newdata = df_int_nr_data) %>%
as_tibble() %>%
cbind(ind_responded = int_nr_data_mod_matrix_in$ind_responded) %>%
cbind(EA_HHID_LN_FIXED = int_nr_data_in$EA_HHID_LN_FIXED) %>%
left_join(int_binary_vars) %>%
mutate(across(-c(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX, ind_responded), as.factor))
## Joining with `by = join_by(EA_HHID_LN_FIXED)`
# Identify low-frequency variables to exclude from the model
variable_response_props_int <- model_matrix_int_nr %>%
pivot_longer(cols = -c(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX, ind_responded),
names_to = "variable", values_to = "value") %>%
group_by(H_AGETEENYEARS, SEX, variable, value) %>%
summarise(respondents = sum(ind_responded),
totalN = n(),
rrate = respondents / totalN) %>%
mutate(category_proportion = totalN / sum(totalN))
## `summarise()` has grouped output by 'H_AGETEENYEARS', 'SEX', 'variable'. You
## can override using the `.groups` argument.
# Filter out variables with too few observations or categories
int_nr_vars_dropped <- variable_response_props_int %>%
mutate(drop = if_else(!grepl("age", variable) &
(totalN < 25 | category_proportion < 10/100), 1, NA_real_)) %>%
filter(drop == 1)
# Create the final nonresponse model matrix without the dropped vars
final_int_nr_model_matrix <- model_matrix_int_nr %>%
mutate(ind_responded = if_else(is.na(ind_responded), 0, ind_responded)) %>%
pivot_longer(cols = -c(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX, ind_responded),
names_to = "variable", values_to = "value") %>%
left_join(select(int_nr_vars_dropped, H_AGETEENYEARS, SEX, variable, drop)) %>%
filter(is.na(drop)) %>%
group_by(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX, ind_responded, variable) %>%
summarise(value = first(value), .groups = "drop") %>%
pivot_wider(id_cols = c(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX, ind_responded),
names_from = variable, values_from = value) %>%
mutate(across(-c(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX, ind_responded),
~replace_na(as.numeric(.x), 0)))
## Joining with `by = join_by(H_AGETEENYEARS, SEX, variable)`
# Function to fit a single BART model.
# The input needs to contain the predictors and the response variable.
fit_bart_model_int <- function(data, nskip = 1000, ndpost = 4000) {
  predictor_matrix <- data %>%
    select(where(function(col) length(unique(col)) > 1)) %>%
    remove_empty(which = "cols") %>%
    select(-c(EA_HHID_LN_FIXED, ind_responded)) %>%
    data.frame()
  response <- data$ind_responded
  # bartFit <- pbart(predictor_matrix, response, binaryOffset = mean(response),
  #                  nskip = nskip, ndpost = ndpost)
  bartFit <- lbart(predictor_matrix, response, binaryOffset = mean(response),
                   nskip = nskip, ndpost = ndpost)
  bartFit  # Return the fitted model explicitly
}
# Function to just output the fitted response probabilities
fit_response_probabilities_int <- function(data) {
bartFit <- fit_bart_model_int(data)
response <- data$ind_responded
response_probability <- bartFit$prob.train.mean
tibble(data$EA_HHID_LN_FIXED, response, response_probability)
}
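The commented-out block below shows how these helpers were meant to be mapped over the strata. As a hedged sketch (not the production adjustment code, which is not shown in this section), the fitted probabilities would typically feed a propensity-style non-response adjustment by dividing each respondent's weight by their estimated response probability; all object and column names here are illustrative:

# response_probs <- ...  # as produced by fit_response_probabilities_int() per stratum
# int_nr_adjusted <- response_probs %>%
#   rename(EA_HHID_LN_FIXED = `data$EA_HHID_LN_FIXED`) %>%
#   left_join(int_wgts_phase_one_adjusted, by = "EA_HHID_LN_FIXED") %>%
#   mutate(int_nr_wgt = if_else(response == 1,
#                               phase_one_adj_int_wgt / response_probability,  # inverse-propensity adjustment
#                               0))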
# nested <- final_int_nr_model_matrix %>%
# group_by(H_AGETEENYEARS, SEX) %>%
# tidyr::nest() %>%
# mutate(bart = lapply(data, fit_response_probabilities))
#
# response_probs <- nested %>%
# select(-data) %>%
# unnest(cols = bart) %>%
# rename(EA_HHID_LN_FIXED = `data$EA_HHID_LN_FIXED`)
nested_int <- final_int_nr_model_matrix %>%
filter(!is.na(H_AGETEENYEARS), !is.na(SEX)) %>%
group_by(H_AGETEENYEARS, SEX) %>%
nest() %>%
mutate(bart = lapply(data, fit_bart_model_int))
## *****Into main of lbart
## *****Data:
## data:n,p,np: 404, 65, 0
## y1,yn: 1, 1
## x1,x[n*p]: 2.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 1 ... 1
## *****burn and ndpost: 1000, 4000
## *****Prior:
## beta,alpha,tau: 2.000000,0.950000,0.129526
## *****Dirichlet:sparse,a,b,rho,augment: 0,0.5,1,65,0
## *****nkeeptrain,nkeeptest: 4000, 4000
## *****printevery: 100
##
## MCMC
## done 0 (out of 5000)
## ... (MCMC progress lines elided) ...
## done 4900 (out of 5000)
## time: 80s
## check counts
## trcnt,tecnt: 4000,0
## *****Into main of lbart
## *****Data:
## data:n,p,np: 522, 63, 0
## y1,yn: 1, 1
## x1,x[n*p]: 2.000000, 2.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 1 ... 1
## *****burn and ndpost: 1000, 4000
## *****Prior:
## beta,alpha,tau: 2.000000,0.950000,0.129526
## *****Dirichlet:sparse,a,b,rho,augment: 0,0.5,1,63,0
## *****nkeeptrain,nkeeptest: 4000, 4000
## *****printevery: 100
##
## MCMC
## done 0 (out of 5000)
## done 100 (out of 5000)
## done 200 (out of 5000)
## done 300 (out of 5000)
## done 400 (out of 5000)
## done 500 (out of 5000)
## done 600 (out of 5000)
## done 700 (out of 5000)
## done 800 (out of 5000)
## done 900 (out of 5000)
## done 1000 (out of 5000)
## done 1100 (out of 5000)
## done 1200 (out of 5000)
## done 1300 (out of 5000)
## done 1400 (out of 5000)
## done 1500 (out of 5000)
## done 1600 (out of 5000)
## done 1700 (out of 5000)
## done 1800 (out of 5000)
## done 1900 (out of 5000)
## done 2000 (out of 5000)
## done 2100 (out of 5000)
## done 2200 (out of 5000)
## done 2300 (out of 5000)
## done 2400 (out of 5000)
## done 2500 (out of 5000)
## done 2600 (out of 5000)
## done 2700 (out of 5000)
## done 2800 (out of 5000)
## done 2900 (out of 5000)
## done 3000 (out of 5000)
## done 3100 (out of 5000)
## done 3200 (out of 5000)
## done 3300 (out of 5000)
## done 3400 (out of 5000)
## done 3500 (out of 5000)
## done 3600 (out of 5000)
## done 3700 (out of 5000)
## done 3800 (out of 5000)
## done 3900 (out of 5000)
## done 4000 (out of 5000)
## done 4100 (out of 5000)
## done 4200 (out of 5000)
## done 4300 (out of 5000)
## done 4400 (out of 5000)
## done 4500 (out of 5000)
## done 4600 (out of 5000)
## done 4700 (out of 5000)
## done 4800 (out of 5000)
## done 4900 (out of 5000)
## time: 99s
## check counts
## trcnt,tecnt: 4000,0
## *****Into main of lbart
## *****Data:
## data:n,p,np: 4358, 47, 0
## y1,yn: 1, 1
## x1,x[n*p]: 2.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 1 ... 1
## *****burn and ndpost: 1000, 4000
## *****Prior:
## beta,alpha,tau: 2.000000,0.950000,0.129526
## *****Dirichlet:sparse,a,b,rho,augment: 0,0.5,1,47,0
## *****nkeeptrain,nkeeptest: 4000, 4000
## *****printevery: 100
##
## MCMC
## done 0 (out of 5000)
## done 100 (out of 5000)
## done 200 (out of 5000)
## done 300 (out of 5000)
## done 400 (out of 5000)
## done 500 (out of 5000)
## done 600 (out of 5000)
## done 700 (out of 5000)
## done 800 (out of 5000)
## done 900 (out of 5000)
## done 1000 (out of 5000)
## done 1100 (out of 5000)
## done 1200 (out of 5000)
## done 1300 (out of 5000)
## done 1400 (out of 5000)
## done 1500 (out of 5000)
## done 1600 (out of 5000)
## done 1700 (out of 5000)
## done 1800 (out of 5000)
## done 1900 (out of 5000)
## done 2000 (out of 5000)
## done 2100 (out of 5000)
## done 2200 (out of 5000)
## done 2300 (out of 5000)
## done 2400 (out of 5000)
## done 2500 (out of 5000)
## done 2600 (out of 5000)
## done 2700 (out of 5000)
## done 2800 (out of 5000)
## done 2900 (out of 5000)
## done 3000 (out of 5000)
## done 3100 (out of 5000)
## done 3200 (out of 5000)
## done 3300 (out of 5000)
## done 3400 (out of 5000)
## done 3500 (out of 5000)
## done 3600 (out of 5000)
## done 3700 (out of 5000)
## done 3800 (out of 5000)
## done 3900 (out of 5000)
## done 4000 (out of 5000)
## done 4100 (out of 5000)
## done 4200 (out of 5000)
## done 4300 (out of 5000)
## done 4400 (out of 5000)
## done 4500 (out of 5000)
## done 4600 (out of 5000)
## done 4700 (out of 5000)
## done 4800 (out of 5000)
## done 4900 (out of 5000)
## time: 900s
## check counts
## trcnt,tecnt: 4000,0
## *****Into main of lbart
## *****Data:
## data:n,p,np: 4567, 49, 0
## y1,yn: 1, 1
## x1,x[n*p]: 2.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 1 ... 1
## *****burn and ndpost: 1000, 4000
## *****Prior:
## beta,alpha,tau: 2.000000,0.950000,0.129526
## *****Dirichlet:sparse,a,b,rho,augment: 0,0.5,1,49,0
## *****nkeeptrain,nkeeptest: 4000, 4000
## *****printevery: 100
##
## MCMC
## done 0 (out of 5000)
## done 100 (out of 5000)
## done 200 (out of 5000)
## done 300 (out of 5000)
## done 400 (out of 5000)
## done 500 (out of 5000)
## done 600 (out of 5000)
## done 700 (out of 5000)
## done 800 (out of 5000)
## done 900 (out of 5000)
## done 1000 (out of 5000)
## done 1100 (out of 5000)
## done 1200 (out of 5000)
## done 1300 (out of 5000)
## done 1400 (out of 5000)
## done 1500 (out of 5000)
## done 1600 (out of 5000)
## done 1700 (out of 5000)
## done 1800 (out of 5000)
## done 1900 (out of 5000)
## done 2000 (out of 5000)
## done 2100 (out of 5000)
## done 2200 (out of 5000)
## done 2300 (out of 5000)
## done 2400 (out of 5000)
## done 2500 (out of 5000)
## done 2600 (out of 5000)
## done 2700 (out of 5000)
## done 2800 (out of 5000)
## done 2900 (out of 5000)
## done 3000 (out of 5000)
## done 3100 (out of 5000)
## done 3200 (out of 5000)
## done 3300 (out of 5000)
## done 3400 (out of 5000)
## done 3500 (out of 5000)
## done 3600 (out of 5000)
## done 3700 (out of 5000)
## done 3800 (out of 5000)
## done 3900 (out of 5000)
## done 4000 (out of 5000)
## done 4100 (out of 5000)
## done 4200 (out of 5000)
## done 4300 (out of 5000)
## done 4400 (out of 5000)
## done 4500 (out of 5000)
## done 4600 (out of 5000)
## done 4700 (out of 5000)
## done 4800 (out of 5000)
## done 4900 (out of 5000)
## time: 911s
## check counts
## trcnt,tecnt: 4000,0
#export(nested_int, "nested_int_model_updated.Rdata")
#nested_int <- import("nested_int_model_updated.Rdata")
response_probs_int <- nested_int %>%
mutate(response_probability_int = lapply(bart, function(x) x$prob.train.mean)) %>%
select(response_probability_int, data) %>%
unnest(cols = c(data, response_probability_int)) %>%
select(H_AGETEENYEARS, SEX, EA_HHID_LN_FIXED, ind_responded, response_probability_int) %>%
ungroup()
## Adding missing grouping variables: `H_AGETEENYEARS`, `SEX`
nr_adj_int_weights <- response_probs_int %>%
mutate(SEX = as.numeric(SEX)) %>%
left_join(int_wgts_phase_one_adjusted) %>%
mutate(nr_adj_int_wgt = if_else(ind_responded == 1, phase_one_adj_int_wgt / response_probability_int, NA_real_)) %>%
filter(!is.na(H_AGETEENYEARS), !is.na(SEX))
## Joining with `by = join_by(SEX, EA_HHID_LN_FIXED)`
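The adjustment divides each respondent's phase-one interview weight by that person's fitted response probability from the BART model; nonrespondents receive NA:
\[ \text{nr\_adj\_int\_wgt} = \frac{\text{phase\_one\_adj\_int\_wgt}}{\text{response\_probability\_int}} \]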
nr_adj_dataset_int <- int_nr_data_in %>%
select(-H_AGETEENYEARS, -SEX) %>%
left_join(nr_adj_int_weights, by = c("EA_HHID_LN_FIXED" = "EA_HHID_LN_FIXED")) %>%
filter(!is.na(H_AGETEENYEARS), !is.na(SEX))
nr_adj_dataset_int %>%
mutate(sex_f = factor(SEX, levels = c(1,2), labels = c("Male", "Female")),
teen_f = factor(H_AGETEENYEARS, levels = c(1,2), labels = c("15-17", "18+"))) %>%
ggplot(aes(x = factor(EAID), y = response_probability_int)) +
geom_boxplot() +
facet_grid(cols = ggplot2::vars(teen_f), rows = ggplot2::vars(sex_f)) +
theme_bw()
# Plot of distribution
response_probs_int %>%
ggplot(aes(x = response_probability_int, fill = factor(ind_responded))) +
geom_density(alpha = 0.5) +
theme_bw()
response_probs_int %>%
ggplot(aes(x = response_probability_int, color = factor(ind_responded),
after_stat(density))) +
geom_freqpoly(binwidth = 0.025) +
theme_bw()
response_probs_int %>%
ggplot(aes(x = factor(ind_responded), y = response_probability_int, fill = factor(ind_responded))) +
geom_violin() +
#geom_boxplot() +
theme_bw() +
scale_y_continuous(limits = c(0.5,1))
response_probs_int %>%
ungroup() %>%
group_by(ind_responded) %>%
summarise(mean = mean(response_probability_int),
median = median(response_probability_int),
q25 = quantile(response_probability_int, 0.25),
q75 = quantile(response_probability_int, 0.75),
min = min(response_probability_int),
max = max(response_probability_int))
## # A tibble: 2 × 7
## ind_responded mean median q25 q75 min max
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.878 0.897 0.819 0.954 0.579 0.996
## 2 1 0.955 0.974 0.941 0.988 0.502 0.998
# Create a survey design object using EAID as the cluster/primary sampling unit (PSU),
# and using the non-response adjusted individual weights (nr_adj_int_wgt).
# This defines the survey's complex sampling structure for analysis.
# Crée un objet de plan d'enquête en utilisant EAID comme unité primaire d'échantillonnage (PSU),
# et en utilisant les poids individuels ajustés pour la non-réponse (nr_adj_int_wgt).
# Cela définit la structure complexe de l'échantillon pour l'analyse.
cd_svy <- svydesign(ids = ~EAID,
weights = ~nr_adj_int_wgt,
data = nr_adj_int_weights %>% filter(!is.na(nr_adj_int_wgt)))
# Summarize the distribution of the original survey weights before trimming.
# Useful to inspect min, max, mean, and spread of the weights.
# Résume la distribution des poids d'enquête originaux avant le "trimming".
# Utile pour inspecter les valeurs min, max, moyenne et la dispersion des poids.
summary(weights(cd_svy))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 81.24 275.77 386.87 576.98 865.87 3274.72
# GR 10-07 investigating high weights
high_weights <- nr_adj_int_weights |>
filter(!is.na(nr_adj_int_wgt),
nr_adj_int_wgt > 3.5 * median(nr_adj_int_wgt, na.rm = TRUE))
high_weights |>
count(HHR_EACODE)
## # A tibble: 11 × 2
## HHR_EACODE n
## <chr> <int>
## 1 CD1810210100130 21
## 2 CD1810210100447 13
## 3 CD1810210100549 24
## 4 CD1810210101024 37
## 5 CD1810210101118 34
## 6 CD1810210101199 17
## 7 CD1810210101387 12
## 8 CD1810210101755 13
## 9 CD1810210102347 10
## 10 CD1810210102459 11
## 11 CD2020230101747 34
nr_adj_int_weights |>
filter(!is.na(nr_adj_int_wgt)) |>
mutate(high_weight = nr_adj_int_wgt > 3.5 * median(nr_adj_int_wgt, na.rm = TRUE)) |>
ggplot(aes(x = nr_adj_int_wgt, fill = high_weight)) +
geom_density(alpha = 0.5)
# GR 10-07:
# My conclusion is that the high weights aren't outliers; they are simply the result of some
# EAs having low selection probabilities etc., so it doesn't make much sense to trim them.
# We'll just have to live with the effect on variance from the variation in weights.
# Apply weight trimming to limit extreme high weights.
# The standard rule caps any weight above 3.5 times the median at that threshold,
# and 'strict = TRUE' ensures the new weights respect the upper limit exactly.
# Note: per the investigation above, the threshold here is deliberately set extremely
# high (35000 times the median), so in practice no weights are trimmed in this run.
# Applique une réduction ("trimming") des poids pour limiter les valeurs extrêmes.
# La règle standard plafonne tout poids supérieur à 3,5 fois la médiane à ce seuil,
# et 'strict = TRUE' garantit que les nouveaux poids respectent strictement cette limite.
# Remarque : suite à l'analyse ci-dessus, le seuil est ici volontairement fixé très haut
# (35000 fois la médiane), donc en pratique aucun poids n'est tronqué.
cd_svy_trimmed <- trimWeights(cd_svy,
lower = -Inf,
# GR 10-07 set a super high limit to avoid having to re-work later code
upper = 35000 * median(weights(cd_svy)),
#upper = 3.5 * median(weights(cd_svy)),
strict = TRUE)
# Summarize the distribution of the weights after trimming.
# Because the threshold was set far above the maximum weight, the distribution
# is unchanged from the pre-trimming summary.
# Résume la distribution des poids après le "trimming".
# Comme le seuil a été fixé bien au-dessus du poids maximal, la distribution
# est identique à celle d'avant le "trimming".
summary(weights(cd_svy_trimmed))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 81.24 275.77 386.87 576.98 865.87 3274.72
# Extract the trimmed weights and the associated data into a new tibble (dataframe).
# Adds a new column 'trmpnr1w0' containing the final trimmed weights.
# Useful for merging with other data or for further analysis.
# Extrait les poids ajustés après "trimming" et les données associées dans un nouveau tibble (dataframe).
# Ajoute une nouvelle colonne 'trmpnr1w0' contenant les poids finaux réduits.
# Utile pour fusionner avec d'autres données ou pour des analyses ultérieures.
trimmed_int_weights <- tibble(cd_svy_trimmed$variables, trmpnr1w0 = weights(cd_svy_trimmed))
#
# We do not want extreme weights, because we do not want individuals with very large weights to dominate the survey, or to dominate within an EA. So what counts as too high? By rule of thumb, anything above 3.5 times the median is considered extreme and is capped at that threshold. For example, with a median weight of, say, 520, the cap would be roughly 3.5 x 520 = 1,822, so a weight of 3,357 would be reduced to that cap. Remember that we will also poststratify afterwards to make sure the population totals balance correctly.
# Nous ne souhaitons pas de pondérations extrêmes, car nous ne voulons pas que des individus à forte pondération dominent l'enquête, ni leur zone de recensement. Qu'est-ce qu'une valeur trop élevée ? Selon la règle empirique, toute valeur supérieure à 3,5 fois la médiane est considérée comme extrême et est plafonnée à ce seuil. Par exemple, avec une médiane d'environ 520, le plafond serait d'environ 3,5 x 520 = 1 822, et une pondération de 3 357 serait ramenée à ce plafond. N'oubliez pas que nous effectuerons également une post-stratification pour garantir le bon équilibre des populations.
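A minimal sketch of the capping rule on hypothetical weights (the code above applies the same rule via the survey package's trimWeights() on the design object):
w <- c(300, 520, 740, 3357) # hypothetical weights
cap <- 3.5 * median(w) # 3.5 x 630 = 2205 here
pmin(w, cap) # only 3357 is capped, to 2205; the rest are unchanged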
# After trimming we must poststratify the weights. This means aligning the population estimated from the survey with the population believed to be correct, i.e. the population obtained from population projections. What we need, therefore, are the 2024 projected populations of Haut-Katanga (HK) and Lualaba (LUA), by sex, for ages 15 and over.
#
# Après l'ajustage, nous devons poststratifier les pondérations. Cela signifie aligner la population estimée par l'enquête sur la population considérée comme correcte, c'est-à-dire celle obtenue à partir des projections démographiques. Nous avons donc besoin des populations projetées de 2024 du Haut-Katanga (HK) et du Lualaba (LUA), par sexe, pour les 15 ans et plus.
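The post-stratification factor for each province-by-sex cell is the projected population divided by the weighted survey total in that cell, and the adjusted weight is the trimmed weight multiplied by that factor:
\[ \text{intwt0} = \text{trmpnr1w0} \times \frac{\text{population}_{\text{cell}}}{\sum_{i \in \text{cell}} \text{trmpnr1w0}_{i}} \]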
#pop_projections_RDC_2024 <- import(str_c(cd_sample_dir, "pop_projections_RDC_2024.xlsx"))
pop_projections_RDC_2024 <- read_excel("C:/Users/Kupamupindit/OneDrive - Columbia University Irving Medical Center/Documents/ICAP work/Countries/DRC/Weighting/Data/pop_projections_RDC_2024.xlsx")
ref_totals <- pop_projections_RDC_2024 %>%
mutate(province = case_when(
province == "Haut-Katanga" ~ "18",
province == "Lualaba" ~ "20",
TRUE ~ NA_character_ # fallback if name doesn't match
)) %>%
mutate(sex = case_when(
sex == "male" ~ 1,
sex == "female" ~ 2,
TRUE ~ NA_real_
))
individual_reference_totals_adj <- ref_totals
#summarise actual totals
intwt_totals <- trimmed_int_weights %>%
mutate(province = str_sub(EAID, 3, 4)) %>%
filter(!is.na(trmpnr1w0)) %>%
#left_join(geo_level_codes) %>%
# mutate(age_group = case_when(between(CONFAGEY, 15, 19) ~ "15-19",
# between(CONFAGEY, 20, 24) ~ "20-24",
# between(CONFAGEY, 25, 29) ~ "25-29",
# between(CONFAGEY, 30, 34) ~ "30-34",
# between(CONFAGEY, 35, 39) ~ "35-39",
# between(CONFAGEY, 40, 44) ~ "40-44",
# between(CONFAGEY, 45, 49) ~ "45-49",
# between(CONFAGEY, 50, 54) ~ "50-54",
# between(CONFAGEY, 55, 59) ~ "55-59",
# between(CONFAGEY, 60, 64) ~ "60-64",
# CONFAGEY >= 65 ~ "65+",
# TRUE ~ "ERROR"),
# sex = SEX) %>%
#group_by(L1_Name, sex, age_group) %>% # By L1 area, age, and sex the cells are very small
rename(sex=SEX) %>%
group_by(province, sex) %>%
#group_by(sex, age_group) %>%
summarise(totalN = n(),
totalWgtN = sum(trmpnr1w0))
## `summarise()` has grouped output by 'province'. You can override using the
## `.groups` argument.
#compute adjustment factor
int_poststrat_adj <- intwt_totals %>%
left_join(individual_reference_totals_adj) %>%
mutate(int_postrat_adj_factor = population / totalWgtN, SEX=sex)
## Joining with `by = join_by(province, sex)`
#UN_Count_adj / totalWgtN)
# GR 10-07: all factors are pretty similar and not too large (1.43-1.77)
# join to indiv level file and calculate weight
poststrat_int_wgts <- trimmed_int_weights %>%
mutate(province = str_sub(EAID, 3, 4)) %>%
# left_join(geo_level_codes) %>%
# mutate(age_group = case_when(between(CONFAGEY, 15, 19) ~ "15-19",
# between(CONFAGEY, 20, 24) ~ "20-24",
# between(CONFAGEY, 25, 29) ~ "25-29",
# between(CONFAGEY, 30, 34) ~ "30-34",
# between(CONFAGEY, 35, 39) ~ "35-39",
# between(CONFAGEY, 40, 44) ~ "40-44",
# between(CONFAGEY, 45, 49) ~ "45-49",
# between(CONFAGEY, 50, 54) ~ "50-54",
# between(CONFAGEY, 55, 59) ~ "55-59",
# between(CONFAGEY, 60, 64) ~ "60-64",
# CONFAGEY >= 65 ~ "65+",
# TRUE ~ "ERROR"),
# sex = SEX) %>%
left_join(int_poststrat_adj, by=c("province", "SEX")) %>%
mutate(intwt0 = int_postrat_adj_factor * trmpnr1w0)
# Post-stratification means that the weights are adjusted so that the weighted totals within mutually exclusive cells equal the known population totals. The term is often misused: in practice it is frequently applied to weighting methods for which only the margins of a multivariate contingency table are known, not the higher-order cells. Those methods should be called calibration, with the raking algorithm the most common specific choice. Calibrated weights aim to align the sample with known population figures, whereas non-response-adjusted weights aim to align the responding sample with the original sample.
# La post-stratification signifie que les poids sont ajustés de sorte que les totaux pondérés au sein de cellules mutuellement exclusives égalent les totaux de population connus. Le terme est souvent mal employé : en pratique, il est fréquemment attribué à des méthodes de pondération pour lesquelles seules les marges d'un tableau de contingence multivarié sont connues, mais pas les cellules d'ordre supérieur. Ces méthodes devraient être qualifiées de calage (calibration), l'algorithme de raking étant le choix le plus courant. Les poids calés visent à aligner l'échantillon sur les chiffres de population connus, tandis que les poids ajustés pour la non-réponse visent à aligner l'échantillon répondant sur l'échantillon initial.
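A minimal sketch contrasting the two approaches with the survey package, on hypothetical toy data (the province codes and totals below are made up): postStratify() needs the full cross-classified cell totals, while rake() only needs the one-way margins.
library(survey)
toy <- data.frame(province = rep(c("18", "20"), each = 4),
sex = rep(c(1, 2), times = 4),
w = runif(8, 100, 500))
des <- svydesign(ids = ~1, weights = ~w, data = toy)
# Full province x sex cell totals -> true post-stratification
cells <- data.frame(province = c("18", "18", "20", "20"),
sex = c(1, 2, 1, 2),
Freq = c(3000, 3200, 1500, 1600))
ps <- postStratify(des, ~province + sex, cells)
# One-way margins only -> calibration by raking
rk <- rake(des, list(~province, ~sex),
list(data.frame(province = c("18", "20"), Freq = c(6200, 3100)),
data.frame(sex = c(1, 2), Freq = c(4500, 4800))))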
# Load biomarker data
#biomarker <- import(str_c(cd_sample_dir, "fullMerge_RD_Congo2.xlsx"), guess_max = 10000) |>
biomarker <- read_excel("C:/Users/kupamupindit/OneDrive - Columbia University Irving Medical Center/Documents/ICAP work/Countries/drc/Weighting/Data/fullMerge_RD_Congo2.xlsx") %>%
filter(!is.na(ea_hhid_ln_fixed))
#names(biomarker) <- toupper(names(biomarker))
#DISCLOSE, CONDOMWHERE, CONDOMNOTEASYRSN, ADHIVPREV, TBSYMPASSESS, CHRONICCOND, CHRONICMED, - Multi select
# ARVAMTNUM - number
# Interview variables
blood_nr_variables <- cd_csproindiv %>%
# GR 10-07 just use indiv_status and roster_elig from final stat file
left_join(select(cd_csproroster, EA_HHID_LN_FIXED, SLEEPHERE, LIVEHERE, ELIGIBLE1, ROSTERED_FINAL_STATUS, AGEYEARS,
INDIV_STATUS)) %>%
mutate(roster_elig = ROSTER_ELIG,
indiv_status = INDIV_STATUS) |> #ELIGIBLE1, #if_else(AGEYEARS >= 15 & SLEEPHERE == 1, 1, 0, missing = 0),
# indiv_status = case_when(
# #roster_elig == 1 & SLEEPHERE == 2 & LIVEHERE == 1 & AGEYEARS >= 15 ~ 9,
# roster_elig == 0 ~ 5,
# CONFAGEY >= 15 & ENDMSG1 == "A" ~ 1,
# CONFAGEY >= 15 & ENDMSG1 != "A" ~ 2,
# CONFAGEY < 15 ~ 3,
# is.na(CONFAGEY) ~ 4,
# TRUE ~ 999999)) %>%
# select(EA_HHID_LN_FIXED, SLEEPHERE, LIVEHERE, AGEYEARS, CONFAGEY, ENDMSG1, INDSTATUS, roster_elig, indiv_status) %>%
# Note: the join message reports joining by LN_FIXED and AGEYEARS; verify this is intended.
# left_join(select(cd_csproroster, EA_HHID_LN_FIXED, SLEEPHERE)) %>%
filter(indiv_status == 1,
SLEEPHERE == 1,
CONFAGEY >= 15) %>%
# select(30:448, 537:552) %>%
# Expand multiple selection
mutate(across(c(DISCLOSE, CONDOMWHERE, CONDOMNOTEASYRSN,
ADHIVPREV, TBSYMPASSESS, CHRONICCOND, CHRONICMED,
HIVTSTNVRRSN, CMETHOD),
list(~expand_mult(.x, "A"),
~expand_mult(.x, "B"),
~expand_mult(.x, "C"),
~expand_mult(.x, "D"),
~expand_mult(.x, "E"),
~expand_mult(.x, "F"),
~expand_mult(.x, "G"),
~expand_mult(.x, "H"),
~expand_mult(.x, "I"),
~expand_mult(.x, "J"),
~expand_mult(.x, "K"),
~expand_mult(.x, "L"),
~expand_mult(.x, "M"),
~expand_mult(.x, "N"),
~expand_mult(.x, "W"),
~expand_mult(.x, "X"),
~expand_mult(.x, "Y"),
~expand_mult(.x, "Z")),
.names = "{.col}{.fn}")) %>%
select(-c(ends_with(c("TS", "DT", "_HOLD", "PARAG", "START", "INST", "INSTR")),
contains(c("GEND", "AGE", "STAFFID", "SIG", "INSTART", "INDCONSTFDT",
"CONSTAT", "INFUTCON", "ASYBIOGT", "ASYBLST", "ASYFUTR")),
starts_with(c("INCON", "INSTR", "PPRM", "INAGREEINST", "WIFENM", "NEWNAME", "PARTHHLINE", "SXPREPS", "CHILDMORE"))),
-c(DISCLOSE, CONDOMWHERE, CONDOMNOTEASYRSN, HIVTSTNVRRSN, CMETHOD,
ADHIVPREV, TBSYMPASSESS, CHRONICCOND, CHRONICMED,
ADELIGEDT, INECONS, PPELIGSDT,
CONFAGEY, CONFGEND, INEEMAN, INECHILD, ELIGSTAT_PARENT, PPGUARD, PPGUARDREL, PPCHEEMAN, PPCHEEMANC, PPCHEHEARING, PPCHEHEARACM, PPCHELANGACC, PPCHECOGN, PPCHADULT, PPCHECONS, ELIGSTAT_CHILD, CHEEMAN, CHEEMANC, CHEHEARING, CHEHEARACM, CHELANGACC, CHECOGN, CHECONS, CONSTAT_ADULT, INAGREEIN, CONSTAT_PARENT, INFUTCONST2,
PPCHCONSTAT, CONSTAT_CHILD, ASOAGREEST1, ASOAGREEST2, ASYBIOGTST2, ASYFUTRST2,ASYASCONFIRM, ASYBLSTST2, ASOCONSTAT, CONSTCHID, ENROLL, PTIDSCAN, PTIDCOMF, INPAPER, ENELIGTHX, CONFEMAN_INELIG, CONSTAT_INELIG, LIVETIMEDK, MONTHWHENM, MONTHWHENY, NEWHSELECT, BIRTHDAY, BIRTHMON, BIRTHYR, HIVTFPOSM, HIVTFPOSY, HIVLASTNEGM, HIVLASTNEGY, HIVLASTNEGDK, HIVPOSPROVM, HIVPOSPROVY, HIVCLINIC_GV1, HIVCLINIC, HIVCLINICOTH, HIVCLM, HIVCLY, ARVFTM, ARVFTY, VLTESTLSTM, VLTESTLSTY, CERVCNTSM, CERVCNTSY, ARVAMDK, DMFLAG, DMCOMMENT, HIVTESTM, HIVTESTY)) %>%
# Fill NAs with no/missing where appropriate
mutate(
across(c(HIVSELFTST), ~replace_na(.x, 1)),
across(c(ACCHEAR, ACCOM1, ACCLANG, ACCCOGNITIVE, ASOAGREE, WORK12MO, WORK7DAYS, CURMAR,
starts_with("SUPPORT")), ~replace_na(.x, 2)),
# numeric replacement
across(where(is.numeric) & (starts_with("ARV") | starts_with("VL") | starts_with("MEDIN") |
starts_with("TB") | starts_with("CERCN") | starts_with("PART") |
starts_with("CHILD")), ~replace_na(.x, 0)),
# character replacement
across(where(is.character) & (starts_with("ARV") | starts_with("VL") | starts_with("MEDIN") |
starts_with("TB") | starts_with("CERCN") | starts_with("PART") |
starts_with("CHILD")), ~replace_na(.x, "missing")),
# the rest
across(c(SCHLCUR, SCHCOM, OUTREGIONTYPE, OUTREGIONWHRC, MONTHOUTEVER, MONTHTIMES,
WHEREOUTC, REASONAWAY, WORKIND, NORMWORK, NUMWIF, WIFLIVEEW, WIFEWHERE,
HUSLIVEW, HUSOTWIF, HUSNWIF, CHILDA2012, PRGTWIN, LIVEB, ALCNUMDAY, ALCSIXMORE, CONDOMGET,
ADTPSX, ADDISHIV, ADHIVSCHMTG, HFHIVTSTOFFER, HIVTSTLOCATION, HIVTSTRSN, HIVTSTRSLT,
HIVCARE, HIVCNOTRSN, HIVCRLTC, CLINCHANGE, TRAVELTIME, TRAVELDIFF),
~replace_na(.x, 0))
)%>%
# Combine don't know and missing values
mutate(across(where(is.numeric), ~if_else(.x == -8, -9, .x))) %>%
remove_empty(which = "cols") %>%
# Remove variables with only one value
select(where(function(col) length(unique(col)) > 1))
## Joining with `by = join_by(EA_HHID_LN_FIXED, AGEYEARS, INDIV_STATUS)`
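expand_mult() is defined earlier in this notebook. Assuming it simply flags whether a given letter code appears in a multi-select response string, a hypothetical stand-in for illustration would be:
# Hypothetical stand-in for illustration only; the notebook's own definition applies.
expand_mult_demo <- function(x, letter) {
as.integer(str_detect(replace_na(x, ""), fixed(letter)))
}
expand_mult_demo(c("ABD", "C", NA), "A") # 1 0 0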
# Bring in variables used for interview adjustment
# int_nr_data_in
int_nr_data_vars <- int_nr_data_in %>%
mutate(SEX = as.double(SEX))
blood_nr_variables_selected <- blood_nr_variables %>%
select(-EMANCIPATED,-LIVEHERE, -CODE_SECTEUR, -CORRECT_LN, -URBAN_RURAL) %>%
left_join(int_nr_data_vars) %>%
# Remove any variables that aren't helpful
select(-INDSTATUS) %>%
# Replace remaining missings with -9
mutate(across(where(is.numeric), ~as.factor(replace_na(.x, -9)))) %>%
# Remove variables with only one value
select(where(function(col) length(unique(col)) > 1))
## Joining with `by = join_by(NOM_COURT, EA_HHID_LN_FIXED, EA_HHID_FIXED, SEX)`
blood_response <- blood_nr_variables_selected %>%
left_join(biomarker, by = c("EA_HHID_LN_FIXED" = "ea_hhid_ln_fixed")) %>%
mutate(bt_status = if_else(hiv1statusfinalsurvey %in% c("Positive", "Negative"), 1, 2, missing = 2)) %>%
select(EA_HHID_LN_FIXED, bt_status)
blood_nr_data_in <- blood_nr_variables_selected %>%
left_join(blood_response)
## Joining with `by = join_by(EA_HHID_LN_FIXED)`
names(blood_nr_data_in)
## [1] "NOM_COURT" "EA_HHID_LN_FIXED" "EA_HHID_FIXED"
## [4] "IND_EACODE" "IND_SHH" "IND_PROVINCE"
## [7] "IND_CT" "IND_EACODE_ORIG" "IND_SHH_ORIG"
## [10] "TYPE_SECTEUR" "IND_TEAM_ID" "IND_HHMEM_ID"
## [13] "IND_HHI_UUID" "IND_HHI_UUID_ORIG" "IND_DEVICEID"
## [16] "IND_HUBUUID" "DEVICEID_SECONDARY" "INDIVHHRCOLL"
## [19] "INSTATUSPW" "INFORMSTATUSREASON" "CONSENTCHANGE"
## [22] "INDIVCNTSTATUS" "INDIVCHANGECONSENTPW1" "ELIGSTAT_ADULT"
## [25] "ACCHEAR" "ACCOM1" "ACCLANG"
## [28] "ACCCOGNITIVE" "PPGUARDRELOTH" "INLANGAD"
## [31] "ASOLANG" "ASOAGREE" "CONFEMAN"
## [34] "CONSTCHRESN1" "CONSTCHRESN2" "CONSTCHRESN3"
## [37] "CONSTCHRESN4" "LNGVQX_LNG" "LNGVINT_LNG"
## [40] "LNGNAT_LNG" "SCHLAT" "SCHLCUR"
## [43] "SCHCOM" "LIVETIMENUM" "LIVETIME"
## [46] "OUTREGIONTYPE" "OUTREGIONWHRP" "OUTREGIONWHRC"
## [49] "OUTREGIONWHROTH" "OUTREGIONWHRCT" "OUTREGIONWHRCTOTH"
## [52] "MONTHOUTEVER" "MONTHTIMES" "WHEREOUTP"
## [55] "WHEREOUTC" "WHEREOUTCT" "WHEREOUTCTOTH"
## [58] "REASONAWAY" "REASONAWATOTH" "WORK12MO"
## [61] "WORK7DAYS" "WORKIND" "WORKINDOTH"
## [64] "NORMWORK" "EVERMAR" "CURMAR"
## [67] "NUMWIF" "WIFLIVEEW" "WIFEWHERE"
## [70] "HUSLIVEW" "HUSOTWIF" "HUSNWIF"
## [73] "LIVEB" "CHILDA2012" "PRGTWIN"
## [76] "REPCHNUM" "PRGCARE" "HIVTSBP"
## [79] "HIVPSBP" "ARVFVST" "HIVTPRG"
## [82] "HIVRTPG" "HIVTSAD" "HIVRTAD"
## [85] "PREGNANT" "AVOIDPREG" "CMETHODOTH"
## [88] "PRGTWINPLUS1" "PRGTWINPLUS2" "PRGTWINPLUS3"
## [91] "PRGTWINPLUS4" "PRGTWINPLUS5" "CHILDALIVE1"
## [94] "CHILDALIVE2" "CHILDALIVE3" "CHILDALIVE4"
## [97] "CHILDALIVE5" "CHILDBRSTFD1" "CHILDBRSTFD2"
## [100] "CHILDBRSTFD3" "CHILDBRSTFD4" "CHILDBRSTFD5"
## [103] "CHILDBRSTFDNOW1" "CHILDBRSTFDNOW2" "CHILDBRSTFDNOW3"
## [106] "CHILDBRSTFDNOW4" "CHILDBRSTFDNOW5" "CHTSTHIVBIRTH1"
## [109] "CHTSTHIVBIRTH2" "CHTSTHIVBIRTH3" "CHTSTHIVBIRTH4"
## [112] "CHTSTHIVBIRTH5" "CHTSTHIVRESULT1" "CHTSTHIVBRSTFD1"
## [115] "CHTSTHIVRESULTLAST1" "MCSTATUS" "MCPLANS"
## [118] "MCWHOTRAD" "MCWHOMED" "LIFETIMESEX"
## [121] "PART12MONUM" "PARTLIVEW1" "PARTLIVEW2"
## [124] "PARTLIVEW3" "PARTINIT1" "PARTINIT2"
## [127] "PARTINIT3" "PARTRECENT1" "PARTRELATION1"
## [130] "PARTRELATION2" "PARTRELATION3" "PARTRELATIONOTH1"
## [133] "PARTRELATIONOTH3" "PARTLASTCNDM1" "PARTLASTCNDM2"
## [136] "PARTLASTCNDM3" "PARTLASTETOH1" "PARTLASTETOH2"
## [139] "PARTLASTETOH3" "PARTKNOWHIV1" "PARTKNOWHIV2"
## [142] "PARTKNOWHIV3" "PARTHIVSAT1" "PARTHIVSAT2"
## [145] "PARTHIVSAT3" "HFLAST12MO" "HFHIVTSTOFFER"
## [148] "HIVTSTEVER" "HIVTSTNVRRSNOTH" "HIVTSTLOCATION"
## [151] "HIVTSTLOCATIONOTH" "HIVTSTRSN" "HIVTSTRSNOTH"
## [154] "HIVTSTRSLT" "HIVPOSPROV" "HIVSELFTST"
## [157] "DISCLOSEOTH" "PRPEVRHDR" "PREPEVER"
## [160] "PREPCURNT" "PREPWDTK" "HIVCARE"
## [163] "HIVCNOTRSN" "HIVCRLTC" "HIVCLINIC_GV2"
## [166] "CLINCHANGE" "TRAVELTIME" "TRAVELDIFF"
## [169] "TRAVELMODE" "ARVSTAKENEV" "ARVSNOTTAKE"
## [172] "ARVSCURRENT" "ARVSNOTCURRSN" "ARVLOC"
## [175] "ARVAMTNUM" "ARVAMT" "ARVSWITCH"
## [178] "ARVSWITCHWHY" "ARVINTERR" "ARVSMISSDAYS"
## [181] "VLTEST" "VLTESTRESULT" "TBTPT"
## [184] "MEDINHCURR" "MEDINHMONTHS" "TBCLINVISIT"
## [187] "TBCLINHIVTST" "TBDIAGN" "TBTREATED"
## [190] "TBTTREATCURR" "TBTREAT6MOFULL" "CERVCNTST"
## [193] "CERNCNRSLT" "CERNCNTRT" "HPVVACC"
## [196] "HPVVACCDOSE" "LITTLEINTEREST" "DEPRESSED"
## [199] "ANXIETY" "WORRY" "CHRONICCOND_OTH"
## [202] "CHRONICMED_OTH" "ALCFREQ" "ALCNUMDAY"
## [205] "ALCSIXMORE" "DRUGEVER" "DRUGCURR"
## [208] "DRUGTYPE" "DRUGFREQ" "CONDOMWHEREOTH"
## [211] "CONDOMGET" "CONDOMNOTEASYRSNOTH" "ADTPSX"
## [214] "ADDISHIV" "ADHIVPREVOTH" "ADHIVSCHMTG"
## [217] "LOCPTID" "LOCGPSPREP" "LOCGPS_STARTTIME"
## [220] "LOCGPS_SATELITES" "LOCGPS_ACCURACY" "LOCGPS_READTIME"
## [223] "LOCGPS_ELAPSEDTIME" "LOCGPS" "LOCINSTR1"
## [226] "LOCINSTR2" "HIVCLINICC" "HIVCLINIC2"
## [229] "LOCPROV" "LOCPROV2" "LOCFAC"
## [232] "LOCCNTPH" "LOCOWSAPNR" "LOCMOBONR"
## [235] "LOCLANDONR" "LOCOTHONR" "LOCPHPRF"
## [238] "LOCPHTM" "LOCPHMSG" "LOCVISIT"
## [241] "LOCINSTR3" "LOCOTHC" "LOCEND"
## [244] "LOCINSTR15" "LOCINSTR115" "LOCINSTR215"
## [247] "LOCPROV15" "LOCPROV215" "LOCFAC15"
## [250] "LOCCNTPH15" "LOCOWSAPNR15" "LOCMOBONR15"
## [253] "LOCLANDONR15" "LOCOTHONR15" "LOCPHPRF15"
## [256] "LOCINSTR1500" "LOCPHTM15" "LOCPHMSG15"
## [259] "LOCVISIT15" "LOCINSTR315" "LOCOTHC15"
## [262] "LOCEND15" "HIVSTATUS" "PTIDS"
## [265] "CONFBC" "BIOPTIDMATCH" "BIOPTIDMATCH2"
## [268] "SPECDATE" "BIOTEAMID" "COLTYPEAD"
## [271] "BIOCLCTFL" "HIVSTATUS_3T" "BIOHIVTST1"
## [274] "BIOHIVTST1LT" "BIOHIVTST1EX" "BIOHIVTST1RS"
## [277] "BIOHIVTST1LT2" "BIOHIVTST1EX2" "BIOHIVTST1RS2"
## [280] "BIOHIVTST2LT" "BIOHIVTST2EX" "BIOHIVTST2RS"
## [283] "BIOHIVTST1ALT" "BIOHIVTST1AEX" "BIOHIVTST1ARS"
## [286] "BIOHIVTST3LT" "BIOHIVTST3EX" "BIOHIVTST3RS"
## [289] "BIOHIVPOS" "BIOINVALID" "BIOHIVNEG"
## [292] "BIOHIVIND" "BIOCOMMENT" "LTCSTATUS"
## [295] "LTCCNTSTATUS" "LTCPTID" "LTCSTRT"
## [298] "LTCRETRSLT" "LTCRETRSLT1" "LTCRETRSLT2"
## [301] "LTCRETRSLT15" "LTCRETRSLT115" "LTCRETRSLT215"
## [304] "LTCVRBL" "LTCVRBLST1" "LTCVRBLST2"
## [307] "LTCHOW" "LTCNOCONFIRM" "LTCPHPRF"
## [310] "LTCCONCONFIRM" "LTCCONSTCHRESN1" "LTCCONSTCHRESN2"
## [313] "LTCCONSTCHRESN3" "LTCCONSTCHNM" "LTCCONSTCHID"
## [316] "ARVSCURRENTINST1" "ARVSCURRENTINST2" "ARVSCURRENTINST3"
## [319] "ARVSCURRENTINST4" "WITHDRAWAL" "WITHDRAWRSN"
## [322] "ENDSURVEY" "ENDSUMMARY" "INDSTATUS_CALC"
## [325] "ENDTHANKS" "ENDCOMMENT" "CSPVERSNINT"
## [328] "SEX" "KNOWN_HIV_STATUS" "DISCLOSE1"
## [331] "DISCLOSE2" "DISCLOSE3" "DISCLOSE5"
## [334] "DISCLOSE16" "DISCLOSE17" "CONDOMWHERE1"
## [337] "CONDOMWHERE2" "CONDOMWHERE3" "CONDOMWHERE4"
## [340] "CONDOMWHERE5" "CONDOMWHERE6" "CONDOMWHERE16"
## [343] "CONDOMWHERE17" "CONDOMWHERE18" "CONDOMNOTEASYRSN1"
## [346] "CONDOMNOTEASYRSN2" "CONDOMNOTEASYRSN3" "CONDOMNOTEASYRSN4"
## [349] "CONDOMNOTEASYRSN5" "CONDOMNOTEASYRSN6" "CONDOMNOTEASYRSN16"
## [352] "CONDOMNOTEASYRSN17" "CONDOMNOTEASYRSN18" "ADHIVPREV1"
## [355] "ADHIVPREV2" "ADHIVPREV3" "ADHIVPREV4"
## [358] "ADHIVPREV5" "ADHIVPREV6" "ADHIVPREV7"
## [361] "ADHIVPREV8" "ADHIVPREV9" "ADHIVPREV10"
## [364] "ADHIVPREV11" "ADHIVPREV15" "ADHIVPREV16"
## [367] "ADHIVPREV17" "ADHIVPREV18" "TBSYMPASSESS1"
## [370] "TBSYMPASSESS2" "TBSYMPASSESS3" "TBSYMPASSESS4"
## [373] "TBSYMPASSESS5" "TBSYMPASSESS17" "CHRONICCOND1"
## [376] "CHRONICCOND2" "CHRONICCOND3" "CHRONICCOND4"
## [379] "CHRONICCOND5" "CHRONICCOND6" "CHRONICCOND7"
## [382] "CHRONICCOND9" "CHRONICCOND16" "CHRONICCOND17"
## [385] "CHRONICCOND18" "CHRONICMED1" "CHRONICMED2"
## [388] "CHRONICMED3" "CHRONICMED4" "CHRONICMED5"
## [391] "CHRONICMED6" "CHRONICMED7" "CHRONICMED9"
## [394] "CHRONICMED16" "CHRONICMED17" "CHRONICMED18"
## [397] "HIVTSTNVRRSN1" "HIVTSTNVRRSN2" "HIVTSTNVRRSN3"
## [400] "HIVTSTNVRRSN4" "HIVTSTNVRRSN5" "HIVTSTNVRRSN6"
## [403] "HIVTSTNVRRSN7" "HIVTSTNVRRSN8" "HIVTSTNVRRSN9"
## [406] "HIVTSTNVRRSN10" "HIVTSTNVRRSN11" "HIVTSTNVRRSN12"
## [409] "HIVTSTNVRRSN16" "HIVTSTNVRRSN17" "HIVTSTNVRRSN18"
## [412] "CMETHOD1" "CMETHOD2" "CMETHOD3"
## [415] "CMETHOD4" "CMETHOD5" "CMETHOD6"
## [418] "CMETHOD7" "CMETHOD8" "CMETHOD9"
## [421] "CMETHOD10" "CMETHOD11" "CMETHOD16"
## [424] "CMETHOD17" "H_AGETEENYEARS" "HHELANG"
## [427] "SICK_HOUSEHOLD" "DEATHS" "WATERSOURCE"
## [430] "TOILETTYPE" "TOILETSHARE" "COOKINGFUEL"
## [433] "MATFLOOR" "MATROOF" "MATEXWALLS"
## [436] "ROOMSLEEP" "OWNCOWNUM" "OWNGOATNUM"
## [439] "OWNCHIKNNUM" "OWNDOGNUM" "OWNHORSENUM"
## [442] "ECONSUP12" "HHQITEMS1" "HHQITEMS2"
## [445] "HHQITEMS3" "HHQITEMS4" "HHQITEMS5"
## [448] "HHQITEMS6" "HHQITEMS11" "HHQOWN1"
## [451] "HHQOWN2" "HHQOWN3" "HHQOWN4"
## [454] "HHQOWN5" "HHQOWN11" "HHQOWN12"
## [457] "CODE_SECTEUR" "URBAN_RURAL" "CORRECT_LN"
## [460] "CSPVERSNHH" "LIVEHERE" "EMANCIPATED"
## [463] "LIVEPROVINCE" "SICK3MO" "CHENSCH"
## [466] "MOMALIVE" "MOMHHM" "FEMGUARDHHM"
## [469] "DADALIVE" "DADHHM" "MALEGUARDHHM"
## [472] "MOMSICK" "MOMAIDS" "DADSICK"
## [475] "DADAIDS" "VULNERABLE_CHILD" "SUPPORTMED12"
## [478] "SUPPORTEMOT12" "SUPPORTEMOT3" "SUPPORTMATER12"
## [481] "SUPPORTMATER3" "SUPPORTSOCIAL12" "SUPPORTSCHOL12"
## [484] "age_group" "hh_age_group" "bt_status"
target_vars <- c("H_AGETEENYEARS", "SEX", "EA_HHID_FIXED", "EA_HHID_LN_FIXED", "bt_status",
"ACCHEAR", "ACCLITER", "ACCVISUAL", "ADDISHIV", "ADHIVSCHMTG", "ADTPSX",
"AGE", "AGE_GROUP", "AGEMAR_GRP", "ALCFREQ", "ALCLOC", "ALCNUMDAY",
"ALCSIXMORE", "ANXIETY", "ARVAMT", "ARVAMTCOVID", "ARVAMTNUM", "ARVFTM",
"ARVFTY", "ARVFVST", "ARVINTERR", "ARVINTERRCOVID", "ARVLOC", "ARVLOCCOVID",
"ARVNRPG", "ARVSCURRENT", "ARVSMSISDAYS", "ARVSNOTCURRSN", "ARVSNOTTAKE",
"ARVSTAKENEV", "ARVSWITCH", "ARVSWITCHWHY", "ARVTKPG", "ASOLANG", "AVOIDPREG",
"BIO_RESPONDED", "BIRTHDAY", "BIRTHMON", "BIRTHYR", "BLOODTRANS", "CERNCRSLT",
"CERNCNTRT", "CERVCNTSM", "CERVCNTST", "CERVCNTSY", "CHEAGE", "CHEAGEC",
"CHEGEND", "CHEGENDC", "CHELITER", "CHENSCH", "CHEVISUAL", "CHILDA2012",
"CHILDALIVE1", "CHILDALIVE2", "CHILDALIVE4", "CHILDBRSTFD1", "CHILDBRSTFD2",
"CHILDBRSTFDNOW1", "CHILDBRSTFDNOW2", "CHILDBRSTFDNOW3", "CHILDBRSTFDNOW4",
"CHILDBRSTFDNOW5", "CHTSTHIVAGE1", "CHTSTHIVAGE2", "CHTSTHIVAGELAST1",
"CHTSTHIVAGELAST2", "CHTSTHIVAGELASTDK1", "CHTSTHIVAGELASTNUM1",
"CHTSTHIVAGELASTNUM2", "CHTSTHIVAGENUM1", "CHTSTHIVAGENUM2", "CHTSTHIVBIRTH1",
"CHTSTHIVBIRTH2", "CHTSTHIVBIRTH3", "CHTSTHIVBIRTH4", "CHTSTHIVBIRTH5",
"CHTSTHIVBRSTFD1", "CHTSTHIVBRSTFD2", "CHTSTHIVRESULT1", "CHTSTHIVRESULTLAST1",
"CLINCHANGE", "CONDOMGET", "CONFEMAN", "CONFEMAN_INELIG", "CONSTAT_ADULT",
"CONSTAT_CHILD", "CONSTAT_PARENT", "COOKINGFUEL", "CURMAR", "DADAIDS",
"DADALIVE", "DADHHM", "DADSICK", "DEATHAGEMO1", "DEATHAGEMO2", "DEATHAGEYR1",
"DEATHAGEYR2", "DEATHCOUNT", "DEPRESSED", "DRUG", "DWELLBELONG", "ECONSUPCOVID",
"ELIGSTAT_CHILD", "ELIGSTAT_PARENT", "EVERMAR", "FEMGUARDHHM", "FIRSTSXAGE_GRP",
"FIRSTSXAGEC", "HEPATITIS", "HFHIVTSTOFFER", "HFLAST12MO", "HH_AGEGROUP",
"HHEHEARACM", "HHELANG", "HIVCARE", "HIVCLM", "HIVCLY", "HIVCNOTRSN",
"HIVCRTLC", "HIVLASTNEGM", "HIVLASTNEGY", "HIVPOSROV", "HIVPOSROVM",
"HIVPOSROVY", "HIVPSBP", "HIVRTAD", "HIVRTPG", "HIVSELFTST", "HIVTESTM",
"HIVTESTY", "HIVTFPOSM", "HIVTFPOSY", "HIVTPRG", "HIVTSAD", "HIVTSBP",
"HIVTSTEVER", "HIVTSTLOCATION", "HIVTSTRSLT", "HIVTSTRSN", "HPVVACC",
"HUSLIVEW", "HUSNWIF", "HUSOTWIF", "INEEMAN", "INLANGAD", "LIFETIMESEX",
"LIGHTINGFUEL", "LITTLEINTEREST", "LIVEB", "LIVETIME", "LIVETIMEDK",
"LIVETIMENUM", "LNGNAT_LNG", "LNGVINT_LNG", "MALEGUARDHHM", "MATEXWALLS",
"MATFLOOR", "MATROOF", "MCAGE_GRP", "MCILLTYPE", "MCLOC", "MCPLANS",
"MCSTATUS", "MCWHO", "MEDINHCURR", "MEDINHMONTHS", "MOMAIDS", "MOMALIVE",
"MOMHHM", "MOMSICK", "MONTHOUTEVER", "MONTHTIMES", "MONTHWHENM", "MONTHWHENY",
"NORMWORK", "NUMWIF", "OUTREGIONTYPE", "OUTREGIONWHR", "PART12MONUM",
"PARTAGE1_GRP", "PARTAGE2_GRP", "PARTAGE3_GRP", "PARTGEND1", "PARTGEND2",
"PARTGEND3", "PARTHIVSAT1", "PARTHIVSAT2", "PARTHIVSAT3", "PARTKNOWHIV1",
"PARTKNOWHIV2", "PARTKNOWHIV3", "PARTLASTCNDM1", "PARTLASTCNDM2", "PARTLASTCNDM3",
"PARTLASTCNDMRSN1", "PARTLASTCNDMRSN2", "PARTLASTCNDMRSN3", "PARTLASTETOH1",
"PARTLASTETOH2", "PARTLASTETOH3", "PARTLIVEW1", "PARTLIVEW2", "PARTLIVEW3",
"PARTRECENT1", "PARTRELATION1", "PARTRELATION2", "PARTRELATION3", "PPGUARDREL",
"PREGNANT", "PREPCURNT", "PREPEVER", "PREPWDTK", "PRGCARE", "PRGSYTHOFFER1",
"PRGSYTHOFFER2", "PRGSYTHOFFER3", "PRGSYTHPOS1", "PRGSYTHPOS2", "PRGSYPTHTST1",
"PRGSYPTHTST2", "PRGSYPTHTST3", "PRGSYPTHTREAT1", "PRGTWIN", "PRGTWINPLUS1",
"PRGTWINPLUS2", "PRGTWINPLUS3", "PRPEVRHDR", "REASONAWAY", "RELATTOHH",
"ROOMSLEEP", "SCHCOM", "SCHCOMC1", "SCHLAT", "SCHLCUR", "SUPPORTEMOT12",
"SUPPORTEMOT3", "SUPPORTMATER12", "SUPPORTMED12", "SUPPORTSCHOL12",
"SUPPORTSOCIAL12", "TBCLINHIVTST", "TBCLINVISIT", "TBDIAGN", "TBTPT",
"TBTREAT6MOFULL", "TBTREATED", "TBTTREATCURR", "TOILETSHARE", "TOILETTYPE",
"TRAVELDIFF", "TRAVELTIME", "VARSTRAT", "VLTEST", "VLTESTLSTM", "VLTESTLSTY",
"VLTESTRESULT", "VULNERABLE_CHILD", "WATERSOURCE", "WHEREOUT", "WIFWHERE",
"WIFLIVEEW", "WORK12MO", "WORK7DAYS", "WORKIND", "WORRY", "WIFEWHERE", "PRGTWINPLUS5", "CHILDALIVE5", "TOILETTYPEOT", "COOKINGFUELOT","OUTREGIONWHRC", "WATERSOURCEOT", "HIVCRLTC", "CHTSTHIVRESULT2", "WHEREOUTC", "CHILDALIVE3", "MATEXWALLSOT", "CHILDBRSTFD5", "AGE_GROUP", "HIVPOSPROV",
"CHILDBRSTFD4", "CERNCNRSLT", "OUTREGIONWHRSP", "CHILDBRSTFD3", "PRGTWINPLUS4",
"HH_AGE_GROUP", "ARVSMISSDAYS", "HIVTSTLOCATIONOTH", "CHTSTHIVRESULTLAST2"
)
blood_nr_data_in <- blood_nr_data_in %>%
select(any_of(target_vars))
names(blood_nr_data_in)
## [1] "H_AGETEENYEARS" "SEX" "EA_HHID_FIXED"
## [4] "EA_HHID_LN_FIXED" "bt_status" "ACCHEAR"
## [7] "ADDISHIV" "ADHIVSCHMTG" "ADTPSX"
## [10] "ALCFREQ" "ALCNUMDAY" "ALCSIXMORE"
## [13] "ANXIETY" "ARVAMT" "ARVAMTNUM"
## [16] "ARVFVST" "ARVINTERR" "ARVLOC"
## [19] "ARVSCURRENT" "ARVSNOTCURRSN" "ARVSNOTTAKE"
## [22] "ARVSTAKENEV" "ARVSWITCH" "ARVSWITCHWHY"
## [25] "ASOLANG" "AVOIDPREG" "CERNCNTRT"
## [28] "CERVCNTST" "CHENSCH" "CHILDA2012"
## [31] "CHILDALIVE1" "CHILDALIVE2" "CHILDALIVE4"
## [34] "CHILDBRSTFD1" "CHILDBRSTFD2" "CHILDBRSTFDNOW1"
## [37] "CHILDBRSTFDNOW2" "CHILDBRSTFDNOW3" "CHILDBRSTFDNOW4"
## [40] "CHILDBRSTFDNOW5" "CHTSTHIVBIRTH1" "CHTSTHIVBIRTH2"
## [43] "CHTSTHIVBIRTH3" "CHTSTHIVBIRTH4" "CHTSTHIVBIRTH5"
## [46] "CHTSTHIVBRSTFD1" "CHTSTHIVRESULT1" "CHTSTHIVRESULTLAST1"
## [49] "CLINCHANGE" "CONDOMGET" "CONFEMAN"
## [52] "COOKINGFUEL" "CURMAR" "DADAIDS"
## [55] "DADALIVE" "DADHHM" "DADSICK"
## [58] "DEPRESSED" "EVERMAR" "FEMGUARDHHM"
## [61] "HFHIVTSTOFFER" "HFLAST12MO" "HHELANG"
## [64] "HIVCARE" "HIVCNOTRSN" "HIVPSBP"
## [67] "HIVRTAD" "HIVRTPG" "HIVSELFTST"
## [70] "HIVTPRG" "HIVTSAD" "HIVTSBP"
## [73] "HIVTSTEVER" "HIVTSTLOCATION" "HIVTSTRSLT"
## [76] "HIVTSTRSN" "HPVVACC" "HUSLIVEW"
## [79] "HUSNWIF" "HUSOTWIF" "INLANGAD"
## [82] "LIFETIMESEX" "LITTLEINTEREST" "LIVEB"
## [85] "LIVETIME" "LIVETIMENUM" "LNGNAT_LNG"
## [88] "LNGVINT_LNG" "MALEGUARDHHM" "MATEXWALLS"
## [91] "MATFLOOR" "MATROOF" "MCPLANS"
## [94] "MCSTATUS" "MEDINHCURR" "MEDINHMONTHS"
## [97] "MOMAIDS" "MOMALIVE" "MOMHHM"
## [100] "MOMSICK" "MONTHOUTEVER" "MONTHTIMES"
## [103] "NORMWORK" "NUMWIF" "OUTREGIONTYPE"
## [106] "PART12MONUM" "PARTHIVSAT1" "PARTHIVSAT2"
## [109] "PARTHIVSAT3" "PARTKNOWHIV1" "PARTKNOWHIV2"
## [112] "PARTKNOWHIV3" "PARTLASTCNDM1" "PARTLASTCNDM2"
## [115] "PARTLASTCNDM3" "PARTLASTETOH1" "PARTLASTETOH2"
## [118] "PARTLASTETOH3" "PARTLIVEW1" "PARTLIVEW2"
## [121] "PARTLIVEW3" "PARTRECENT1" "PARTRELATION1"
## [124] "PARTRELATION2" "PARTRELATION3" "PREGNANT"
## [127] "PREPCURNT" "PREPEVER" "PREPWDTK"
## [130] "PRGCARE" "PRGTWIN" "PRGTWINPLUS1"
## [133] "PRGTWINPLUS2" "PRGTWINPLUS3" "PRPEVRHDR"
## [136] "REASONAWAY" "ROOMSLEEP" "SCHCOM"
## [139] "SCHLAT" "SCHLCUR" "SUPPORTEMOT12"
## [142] "SUPPORTEMOT3" "SUPPORTMATER12" "SUPPORTMED12"
## [145] "SUPPORTSCHOL12" "SUPPORTSOCIAL12" "TBCLINHIVTST"
## [148] "TBCLINVISIT" "TBDIAGN" "TBTPT"
## [151] "TBTREAT6MOFULL" "TBTREATED" "TBTTREATCURR"
## [154] "TOILETSHARE" "TOILETTYPE" "TRAVELDIFF"
## [157] "TRAVELTIME" "VLTEST" "VLTESTRESULT"
## [160] "VULNERABLE_CHILD" "WATERSOURCE" "WIFLIVEEW"
## [163] "WORK12MO" "WORK7DAYS" "WORKIND"
## [166] "WORRY" "WIFEWHERE" "PRGTWINPLUS5"
## [169] "CHILDALIVE5" "OUTREGIONWHRC" "HIVCRLTC"
## [172] "WHEREOUTC" "CHILDALIVE3" "CHILDBRSTFD5"
## [175] "HIVPOSPROV" "CHILDBRSTFD4" "CERNCNRSLT"
## [178] "CHILDBRSTFD3" "PRGTWINPLUS4" "ARVSMISSDAYS"
## [181] "HIVTSTLOCATIONOTH"
# Make the appropriate variables numeric and the rest factors.
# This is required for the BART model fitting to handle the
# data correctly.
nr_data_in_bio <- blood_nr_data_in %>%
#filter(!is.na(H_AGETEENYEARS), !is.na(SEX)) %>%
select(H_AGETEENYEARS, SEX, EA_HHID_FIXED, EA_HHID_LN_FIXED, bt_status, everything()) %>%
arrange(H_AGETEENYEARS, SEX, EA_HHID_FIXED, EA_HHID_LN_FIXED) %>%
mutate(response_bio = if_else(bt_status == 1, 1, 0)) %>%
select(-bt_status) %>%
mutate(across(where(is.numeric), ~if_else(as.numeric(.x) > 0, 1, 0)),
across(-response_bio, as_factor))
# For variables which are already two-leveled
# bio_binary_vars <- nr_data_in_bio %>%
# select(EA_HHID_LN_FIXED, where(function(col) length(unique(col)) == 2), -response_bio) %>%
# mutate(across(-c(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX), as.numeric)) %>%
# mutate(across(-c(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX), ~if_else(.x == 2, 0, .x))) %>%
# mutate(across(-c(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX), as.factor))
bio_binary_vars <- nr_data_in_bio %>%
select(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX,
where(~ length(unique(na.omit(.))) == 2), -response_bio) %>%
mutate(across(-c(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX), as.numeric)) %>%
mutate(across(-c(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX), ~if_else(.x == 2, 0, .x))) %>%
mutate(across(-c(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX), as.factor))
# Non-binary variables
bio_nr_data_matrix_in <- nr_data_in_bio %>%
select(where(function(col) length(unique(col)) != 2), response_bio) %>%
select(-contains("EA_HHID"))
dmy_bio_nr_data <- dummyVars("response_bio ~ .", data = bio_nr_data_matrix_in)
df_bio_nr_data <- bio_nr_data_matrix_in
model_matrix_nr_bio <- predict(dmy_bio_nr_data, newdata = df_bio_nr_data) %>%
as_tibble() %>%
# mutate(H_AGETEENYEARS = if_else(H_AGETEENYEARS.2 == 1, 2, 1),
# SEX = if_else(SEX.2 == 1, 2, 1)) %>%
# select(-H_AGETEENYEARS.1, -H_AGETEENYEARS.2, -SEX.1, -SEX.2) %>%
cbind(response_bio = nr_data_in_bio$response_bio) %>%
cbind(EA_HHID_LN_FIXED = nr_data_in_bio$EA_HHID_LN_FIXED) %>%
left_join(bio_binary_vars) %>%
mutate(across(-c(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX, response_bio), as.factor))
## Joining with `by = join_by(EA_HHID_LN_FIXED)`
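caret's dummyVars() one-hot encodes the factor predictors into 0/1 indicator columns, which is what feeds the model matrix above. A minimal sketch on toy data (the names are hypothetical):
library(caret)
toy <- data.frame(y = c(1, 0, 1), f = factor(c("a", "b", "a")))
dmy <- dummyVars("y ~ .", data = toy)
predict(dmy, newdata = toy) # returns indicator columns f.a and f.b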
variable_response_props_bio <- model_matrix_nr_bio %>%
pivot_longer(cols = -c(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX, response_bio),
names_to = "variable", values_to = "value") %>%
group_by(H_AGETEENYEARS, SEX, variable, value) %>%
summarise(respondents = sum(response_bio),
totalN = n(),
rrate = respondents / totalN) %>%
mutate(category_proportion = totalN / sum(totalN))
## `summarise()` has grouped output by 'H_AGETEENYEARS', 'SEX', 'variable'. You
## can override using the `.groups` argument.
bio_nr_variables_dropped <- variable_response_props_bio %>%
mutate(drop = if_else(!grepl("age", variable) &
(totalN < 25 | category_proportion < 10/100), 1, NA_real_)) %>%
filter(drop == 1)
final_bio_nr_model_matrix <- model_matrix_nr_bio %>%
pivot_longer(cols = -c(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX, response_bio),
names_to = "variable", values_to = "value") %>%
left_join(select(bio_nr_variables_dropped, H_AGETEENYEARS, SEX, variable, drop)) %>%
filter(is.na(drop)) %>%
pivot_wider(id_cols = c(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX, response_bio),
names_from = variable, values_from = value) %>%
rename_with(~gsub(" ", ".", .x, fixed = TRUE)) %>%
rename_with(~gsub("-", ".", .x, fixed = TRUE)) %>%
mutate(across(-c(EA_HHID_LN_FIXED, H_AGETEENYEARS, SEX, response_bio),
~replace_na(as.numeric(.x), 0)))
## Joining with `by = join_by(H_AGETEENYEARS, SEX, variable)`
# Model by sex & teenage years variables
# Prepare to fit separate Bayesian Additive Regression Tree (BART) models for biomarker response probability within each combination of H_AGETEENYEARS (teenage age group) and SEX. This allows the model to capture different response patterns by demographic group — improving prediction accuracy and reducing bias in inverse probability weighting.
# Préparez-vous à ajuster des modèles séparés d'arbres de régression additifs bayésiens (BART) pour la probabilité de réponse des biomarqueurs au sein de chaque combinaison de H_AGETEENYEARS (groupe d'âge adolescent) et SEX. Cela permet au modèle de capturer différents schémas de réponse par groupe démographique — améliorant la précision de prédiction et réduisant le biais dans la pondération par probabilité inverse.
# Function to fit a single model
# Input needs to contain the predictors and the response variable
# Define a function to fit a BART model for biomarker response. Input: 'data' = dataset containing predictors and response_bio (binary: 1=responded, 0=did not). Optional: nskip (burn-in iterations), ndpost (posterior draws). Uses 'lbart' (logistic BART for binary outcomes) to model response probability. The binaryOffset is set to mean(response_bio) — a common default to center priors.
# Définissez une fonction pour ajuster un modèle BART pour la réponse des biomarqueurs. Entrée : 'data' = jeu de données contenant les prédicteurs et response_bio (binaire : 1=a répondu, 0=n'a pas répondu). Optionnel : nskip (itérations de burn-in), ndpost (tirages a posteriori). Utilise 'lbart' (BART logistique pour résultats binaires) pour modéliser la probabilité de réponse. Le binaryOffset est défini à mean(response_bio) — un choix par défaut courant pour centrer les a priori.
fit_bart_model_bio <- function(data, nskip = 1000, ndpost = 4000) {
# Build predictor matrix: select only columns with more than one unique value (remove constants), remove any empty columns, then drop identifier (EA_HHID_LN_FIXED) and response (response_bio) to keep only predictors. Convert to data.frame for BART compatibility.
# Construisez la matrice de prédicteurs : sélectionnez uniquement les colonnes avec plus d'une valeur unique (supprimez les constantes), supprimez les colonnes vides, puis supprimez l'identifiant (EA_HHID_LN_FIXED) et la réponse (response_bio) pour ne garder que les prédicteurs. Convertissez en data.frame pour compatibilité avec BART.
predictor_matrix <- data %>%
select(where(function(col) length(unique(col)) > 1)) %>%
remove_empty(which = "cols") %>%
select(-c(EA_HHID_LN_FIXED, response_bio)) %>%
data.frame()
# Extract response vector
# Extrayez le vecteur de réponse
response_bio <- data$response_bio
# Fit logistic BART model: lbart() models binary outcomes. 'binaryOffset' centers prior on observed response rate. 'nskip' discards initial MCMC samples; 'ndpost' keeps this many posterior samples for inference. Output: fitted BART object containing predicted probabilities.
# Ajustez le modèle BART logistique : lbart() modélise les résultats binaires. 'binaryOffset' centre l'a priori sur le taux de réponse observé. 'nskip' élimine les premiers échantillons MCMC ; 'ndpost' conserve ce nombre d'échantillons a posteriori pour l'inférence. Sortie : objet BART ajusté contenant les probabilités prédites.
bartFit <- lbart(predictor_matrix, response_bio, binaryOffset = mean(response_bio),
nskip = nskip, ndpost = ndpost)
}
# Function to just output the fitted response probabilities
# Define a wrapper function that fits the BART model and returns a tibble with person ID, response status, and predicted response probability (mean of posterior draws).
# Définissez une fonction enveloppe qui ajuste le modèle BART et renvoie un tibble avec l'identifiant de la personne, le statut de réponse et la probabilité de réponse prédite (moyenne des tirages a posteriori).
get_response_probabilities_bio <- function(data) {
bartFit <- fit_bart_model_bio(data)
response_bio <- data$response_bio
response_probability_bio <- bartFit$prob.train.mean
tibble(data$EA_HHID_LN_FIXED, response_bio, response_probability_bio)
}
# Group the modeling dataset (final_bio_nr_model_matrix) by H_AGETEENYEARS and SEX, then nest all other columns into a list-column 'data'. This creates one row per (H_AGETEENYEARS, SEX) group, with all individual records in that group stored as a data frame inside 'data'. Then, for each group, fit a BART model using lapply() over the nested data — result stored in list-column 'bart'. This allows separate models per demographic group.
# Groupez le jeu de données de modélisation (final_bio_nr_model_matrix) par H_AGETEENYEARS et SEX, puis imbriquez toutes les autres colonnes dans une colonne-liste 'data'. Cela crée une ligne par groupe (H_AGETEENYEARS, SEX), avec tous les enregistrements individuels de ce groupe stockés sous forme de trame de données dans 'data'. Ensuite, pour chaque groupe, ajustez un modèle BART en utilisant lapply() sur les données imbriquées — résultat stocké dans la colonne-liste 'bart'. Cela permet des modèles séparés par groupe démographique.
nested_bio <- final_bio_nr_model_matrix %>%
group_by(H_AGETEENYEARS, SEX) %>%
nest() %>%
mutate(bart = lapply(data, fit_bart_model_bio))
## lbart console output, condensed: one fit per H_AGETEENYEARS x SEX group;
## the repeated "done N (out of 5000)" MCMC progress lines are omitted.
## All fits: 200 trees; burn and ndpost: 1000, 4000; prior beta,alpha,tau: 2.0, 0.95, 0.1295.
## Fit 1: data n,p: 359, 97; time: 75s
## Fit 2: data n,p: 471, 103; time: 96s
## Fit 3: data n,p: 4088, 120; time: 839s
## Fit 4: data n,p: 4298, 140; time: 1316s
# (Commented) — Export the nested model object for reuse (avoids refitting). Can be imported later to extract probabilities without re-running models.
# (Commenté) — Exportez l'objet modèle imbriqué pour réutilisation (évite le réajustement). Peut être importé plus tard pour extraire les probabilités sans réexécuter les modèles.
#export(nested_bio, "nested_bio_model_updated.Rdata")
#nested_bio <- import("nested_bio_model_updated.Rdata")
# Extract predicted response probabilities: for each group, extract the mean posterior predicted probability (prob.train.mean) from the BART model. Store as list-column, then unnest() both the original data and the probabilities — this aligns each individual with their predicted probability. Select key variables and ungroup for downstream use. Result: response_probs_bio — dataset with person ID, response status, and predicted probability by demographic group.
# Extrayez les probabilités de réponse prédites : pour chaque groupe, extrayez la probabilité prédite moyenne a posteriori (prob.train.mean) du modèle BART. Stockez sous forme de colonne-liste, puis unnest() à la fois les données originales et les probabilités — cela aligne chaque individu avec sa probabilité prédite. Sélectionnez les variables clés et dégroupez pour une utilisation en aval. Résultat : response_probs_bio — jeu de données avec identifiant de personne, statut de réponse et probabilité prédite par groupe démographique.
response_probs_bio <- nested_bio %>%
mutate(response_probability_bio = lapply(bart, function(x) x$prob.train.mean)) %>%
select(response_probability_bio, data) %>%
unnest(cols = c(data, response_probability_bio)) %>%
select(H_AGETEENYEARS, SEX, EA_HHID_LN_FIXED, response_bio, response_probability_bio) %>%
ungroup()
## Adding missing grouping variables: `H_AGETEENYEARS`, `SEX`
# (Commented alternative approach) — Another way to structure the modeling and extraction — equivalent logic.
# (Approche alternative commentée) — Une autre façon de structurer la modélisation et l'extraction — logique équivalente.
# nested_bio <- final_bio_nr_model_matrix %>%
# group_by(H_AGETEENYEARS, SEX) %>%
# nest() %>%
# mutate(bart = lapply(data, fit_response_probabilities_bio))
#
# # Plots and analysis
# response_probs_bio <- nested_bio %>%
# select(-data) %>%
# unnest(cols = bart) %>%
# rename(EA_HHID_LN_FIXED = `data$EA_HHID_LN_FIXED`)
# Compute nonresponse-adjusted biomarker weights: convert H_AGETEENYEARS to factor and SEX to numeric (if needed), then join with trimmed individual weights dataset (contains trmpnr1w0 — trimmed individual weight). For biomarker respondents (response_bio == 1), compute inverse probability weight: trmpnr1w0 / response_probability_bio; for nonrespondents, set to NA. Result: nr_adj_bio_weights — dataset with final biomarker nonresponse-adjusted weights.
# Calculez les poids ajustés pour la non-réponse des biomarqueurs : convertissez H_AGETEENYEARS en facteur et SEX en numérique (si nécessaire), puis joignez avec le jeu de données des poids individuels tronqués (contient trmpnr1w0 — poids individuel tronqué). Pour les répondants de biomarqueurs (response_bio == 1), calculez le poids par probabilité inverse : trmpnr1w0 / response_probability_bio ; pour les non-répondants, mettez NA. Résultat : nr_adj_bio_weights — jeu de données avec les poids finaux ajustés pour la non-réponse des biomarqueurs.
nr_adj_bio_weights <- response_probs_bio %>%
mutate(H_AGETEENYEARS = factor(H_AGETEENYEARS),
SEX = as.numeric(SEX)) %>%
left_join(trimmed_int_weights) %>%
mutate(nr_adj_bio_wgt = if_else(response_bio == 1, trmpnr1w0 / response_probability_bio, NA_real_))
## Joining with `by = join_by(H_AGETEENYEARS, SEX, EA_HHID_LN_FIXED)`
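# In symbols, this is the standard inverse-probability-of-response adjustment
# (a restatement of the mutate() above, not a new step):
\[ \text{nr\_adj\_bio\_wgt}_i = \frac{\text{trmpnr1w0}_i}{\hat{p}^{\text{bio}}_i} \quad \text{for respondents } (\text{response\_bio}_i = 1), \text{ NA otherwise} \]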
# Attach biomarker nonresponse-adjusted weights to main biomarker analysis dataset: remove redundant H_AGETEENYEARS and SEX (already in weights), then left_join by person ID (EA_HHID_LN_FIXED). Result: nr_adj_dataset_bio — analysis-ready dataset with weights attached.
# Attachez les poids ajustés pour la non-réponse des biomarqueurs au jeu de données d'analyse principal des biomarqueurs : supprimez H_AGETEENYEARS et SEX redondants (déjà dans les poids), puis left_join par identifiant de personne (EA_HHID_LN_FIXED). Résultat : nr_adj_dataset_bio — jeu de données prêt pour l'analyse avec poids attachés.
nr_adj_dataset_bio <- nr_data_in_bio %>%
select(-H_AGETEENYEARS, -SEX) %>%
left_join(nr_adj_bio_weights, by = c("EA_HHID_LN_FIXED" = "EA_HHID_LN_FIXED"))
# Plot density of predicted response probabilities, colored by actual response (1 or 0). Shows how well the model discriminates between respondents and nonrespondents — ideally, respondents (response_bio=1) have higher predicted probabilities.
# Tracez la densité des probabilités de réponse prédites, colorée par réponse réelle (1 ou 0). Montre à quel point le modèle discrimine entre répondants et non-répondants — idéalement, les répondants (response_bio=1) ont des probabilités prédites plus élevées.
response_probs_bio %>%
ggplot(aes(x = response_probability_bio, fill = factor(response_bio))) +
geom_density(alpha = 0.5) +
theme_bw()
# Violin plot of predicted probabilities by response status. Shows distribution shape and spread — respondents should cluster at higher probabilities. Y-axis limited to 0.5–1 for focus on upper range.
# Diagramme en violon des probabilités prédites par statut de réponse. Montre la forme et la dispersion de la distribution — les répondants devraient se regrouper à des probabilités plus élevées. Axe des y limité à 0,5–1 pour se concentrer sur la plage supérieure.
response_probs_bio %>%
ggplot(aes(x = factor(response_bio), y = response_probability_bio, fill = factor(response_bio))) +
geom_violin() +
theme_bw() +
scale_y_continuous(limits = c(0.5,1))
# Summary statistics of predicted probabilities for respondents only (response_bio == 1). Shows min, median, mean, max — useful for checking for extreme or near-zero probabilities that could cause weight inflation.
# Statistiques sommaires des probabilités prédites pour les répondants uniquement (response_bio == 1). Montre min, médiane, moyenne, max — utile pour vérifier les probabilités extrêmes ou quasi nulles qui pourraient causer une inflation des poids.
response_probs_bio %>%
ungroup() %>%
filter(response_bio == 1) %>%
select(response_probability_bio) %>%
summary()
## response_probability_bio
## Min. :0.6749
## 1st Qu.:0.9568
## Median :0.9753
## Mean :0.9661
## 3rd Qu.:0.9864
## Max. :0.9988
# GR 10-07 looking at distribution
nr_adj_bio_weights %>%
filter(!is.na(nr_adj_bio_wgt)) %>%
mutate(high_weight = nr_adj_bio_wgt > 3.5 * median(nr_adj_bio_wgt)) |>
ggplot(aes(x = nr_adj_bio_wgt, fill = high_weight)) +
geom_density(alpha = 0.5)
# Same kind of pattern - again suggest we don't trim.
cd_svy_bio <- nr_adj_bio_weights %>%
filter(!is.na(nr_adj_bio_wgt)) %>%
as_survey_design(ids = EAID,
weights = nr_adj_bio_wgt)
summary(weights(cd_svy_bio))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 82.96 282.36 404.89 596.24 896.84 3323.65
max(weights(cd_svy_bio))/median(weights(cd_svy_bio))
## [1] 8.208726
cd_svy_trimmed_bio <- trimWeights(cd_svy_bio,
lower = -Inf,
                                  # GR 10-07: replaced with a very high limit so that effectively nothing is trimmed
upper = 3500 * median(weights(cd_svy_bio)),
#upper = 3.5 * median(weights(cd_svy_bio)),
strict = TRUE)
summary(weights(cd_svy_trimmed_bio))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 82.96 282.36 404.89 596.24 896.84 3323.65
trimmed_bio_weights <- tibble(cd_svy_trimmed_bio$variables, trmbtnr1w0 = weights(cd_svy_trimmed_bio))
max(weights(cd_svy_trimmed_bio))/median(weights(cd_svy_trimmed_bio))
## [1] 8.208726
# GR 10-07: already loaded!
# pop_projections_RDC_2024 <- read_excel("C:/Users/Kupamupindit/OneDrive - Columbia University Irving Medical Center/Documents/ICAP work/Countries/DRC/Weighting/Data/pop_projections_RDC_2024.xlsx")
#
# ref_totals <- pop_projections_RDC_2024 %>%
# mutate(province = case_when(
# province == "Haut-Katanga" ~ "18",
# province == "Lualaba" ~ "20",
# TRUE ~ NA_character_ # fallback if name doesn't match
# )) %>%
# mutate(sex = case_when(
# sex == "male" ~ 1,
# sex == "female" ~ 2,
# TRUE ~ NA_real_
# ))
#
#
#
# individual_reference_totals_adj <- ref_totals
#summarise actual totals
biotwt_totals <- trimmed_bio_weights %>%
mutate(province = str_sub(EAID, 3, 4)) %>%
filter(!is.na(trmbtnr1w0)) %>%
#left_join(geo_level_codes) %>%
# mutate(age_group = case_when(between(CONFAGEY, 15, 19) ~ "15-19",
# between(CONFAGEY, 20, 24) ~ "20-24",
# between(CONFAGEY, 25, 29) ~ "25-29",
# between(CONFAGEY, 30, 34) ~ "30-34",
# between(CONFAGEY, 35, 39) ~ "35-39",
# between(CONFAGEY, 40, 44) ~ "40-44",
# between(CONFAGEY, 45, 49) ~ "45-49",
# between(CONFAGEY, 50, 54) ~ "50-54",
# between(CONFAGEY, 55, 59) ~ "55-59",
# between(CONFAGEY, 60, 64) ~ "60-64",
# CONFAGEY >= 65 ~ "65+",
# TRUE ~ "ERROR"),
# sex = SEX) %>%
#group_by(L1_Name, sex, age_group) %>% # By L1 area, age, and sex the cells are very small
rename(sex=SEX) %>%
group_by(province, sex) %>%
#group_by(sex, age_group) %>%
summarise(totalNBio = n(),
totalWgtNBio = sum(trmbtnr1w0))
## `summarise()` has grouped output by 'province'. You can override using the
## `.groups` argument.
#compute adjustment factor
bio_poststrat_adj <- biotwt_totals %>%
left_join(individual_reference_totals_adj) %>%
mutate(bio_postrat_adj_factor = population / totalWgtNBio, SEX=sex)
## Joining with `by = join_by(province, sex)`
#UN_Count_adj / totalWgtN)
# GR 10-07: again factors are pretty small/reasonable - OK!
# join to indiv level file and calculate weight
poststrat_bio_wgts <- trimmed_bio_weights %>%
mutate(province = str_sub(EAID, 3, 4)) %>%
# left_join(geo_level_codes) %>%
# mutate(age_group = case_when(between(CONFAGEY, 15, 19) ~ "15-19",
# between(CONFAGEY, 20, 24) ~ "20-24",
# between(CONFAGEY, 25, 29) ~ "25-29",
# between(CONFAGEY, 30, 34) ~ "30-34",
# between(CONFAGEY, 35, 39) ~ "35-39",
# between(CONFAGEY, 40, 44) ~ "40-44",
# between(CONFAGEY, 45, 49) ~ "45-49",
# between(CONFAGEY, 50, 54) ~ "50-54",
# between(CONFAGEY, 55, 59) ~ "55-59",
# between(CONFAGEY, 60, 64) ~ "60-64",
# CONFAGEY >= 65 ~ "65+",
# TRUE ~ "ERROR"),
# sex = SEX) %>%
left_join(bio_poststrat_adj, by=c("province", "SEX")) %>%
mutate(btwt0 = bio_postrat_adj_factor * trmbtnr1w0)
# Post-stratification means that the weights are adjusted so that the weighted totals within mutually exclusive cells equal the known population totals. The term is often misused: in practice it is frequently applied to weighting methods for which only the margins of a multivariate contingency table are known, not the higher-order cells. Those methods should be referred to as calibration, and the specific algorithm usually used is raking. While calibrated weights are based on aligning the sample with known population figures, nonresponse-adjusted weights are based on aligning the responding sample with the original sample.
# La post-stratification signifie que les poids sont ajustés de sorte que les totaux pondérés au sein de cellules mutuellement exclusives égalent les totaux de population connus. Ce terme est souvent mal utilisé, car en pratique, il est souvent attribué à des méthodes de pondération pour lesquelles seuls les marges d'un tableau de contingence multivarié sont connus, mais pas les cellules d'ordre supérieur. Ces méthodes devraient être qualifiées de calibration, et l'algorithme de raking spécifique est généralement utilisé. Alors que les poids calibrés sont basés sur l'atteinte de l'alignement entre l'échantillon et les chiffres de population connus, les poids ajustés pour non-réponse sont basés sur l'atteinte de l'alignement entre l'échantillon répondant et l'échantillon original.
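# To make the distinction concrete, here is a minimal sketch on toy data
# (every object below is illustrative only, not a CODPHIA input):
# postStratify() needs the full province-by-sex cell totals, while rake()
# needs only the separate margins. Both functions come from the survey
# package loaded at the top of this notebook.
toy <- data.frame(province = rep(c("18", "20"), each = 4),
                  sex      = factor(rep(c(1, 2), times = 4)),
                  wt       = 100)
toy_des <- svydesign(ids = ~1, weights = ~wt, data = toy)
# Full cells known -> post-stratification
pop_cells <- data.frame(province = rep(c("18", "20"), each = 2),
                        sex  = factor(rep(c(1, 2), 2)),
                        Freq = c(500, 450, 300, 350))
ps_des <- postStratify(toy_des, ~province + sex, pop_cells)
# Only margins known -> raking (a calibration method)
pop_prov <- data.frame(province = c("18", "20"), Freq = c(950, 650))
pop_sex  <- data.frame(sex = factor(c(1, 2)), Freq = c(800, 800))
rk_des <- rake(toy_des, list(~province, ~sex), list(pop_prov, pop_sex))
svytable(~province + sex, ps_des) # reproduces pop_cells exactly
svytable(~province + sex, rk_des) # matches only the margins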
# GR 10-08: Just exporting some of the final results to save time/keep track
# tempdir <- "G:/Shared drives/DMA/PHIA/Weighting/CODPHIA2/Output_data/temp/"
# export(poststrat_bio_wgts, str_c(tempdir, "poststrat_bio_wgts.csv"))
# export(poststrat_int_wgts, str_c(tempdir, "poststrat_int_wgts.csv"))
# export(final_hh_weights, str_c(tempdir, "final_hh_weights.csv"))
##############################
# In this section we create the
# varstrat and varunit variables
##############################
# vous souhaitez mettre les ZD par 2, ou par paires,
# rappelez-vous lors de l'atelier de pondération, l'idée est de jumeler les ZD
# you want to group the ZDs in twos, i.e. into pairs; recall
# from the weighting workshop that the idea is to pair the ZDs
# Tacking 3s just on the end
cd_varunits <- cd_zd_in %>% # shift control M = shortcut for pipe %>%
mutate(psuid = format(ea_id, scientific = FALSE)) %>%
select(id_province, province, order, psuid) |>
arrange(-id_province, order) |>
filter(id_province!=1)|> # when using filter, remember comparisons need a double ==
group_by(id_province) |>
mutate(varunit = 1 +
case_when(row_number() == n() & row_number() %% 2 != 0 ~ 2,
.default = (row_number() - 1) %% 2)) %>%
rename(strata=province) %>%
ungroup()
#la dernière partie du code consiste à générer les paires
#fonction modulaire, si je divise quelque chose
#par 2 et que la réponse n'est pas zéro alors je mets 2.
#the last part of the code generates the pairs
#modulo function: if I divide something by 2
#and the remainder is not zero then I put 2.
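# A quick toy illustration of the pairing rule used above (toy data only):
# with 5 PSUs in a province, rows 1-4 form two pairs and the leftover odd
# PSU at the end is "tacked on" as varunit 3.
tibble(order = 1:5) |>
  mutate(varunit = 1 +
           case_when(row_number() == n() & row_number() %% 2 != 0 ~ 2,
                     .default = (row_number() - 1) %% 2))
# varunit comes out as 1, 2, 1, 2, 3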
cd_varstrats <- cd_varunits |>
filter(varunit == 1) |>
select(strata, psuid, varunit) |>
mutate(varstrat = row_number()) %>%
arrange(strata, varstrat, varunit)
cd_varstrat_varunit <- cd_varunits |>
left_join(cd_varstrats) |>
fill(varstrat)
## Joining with `by = join_by(strata, psuid, varunit)`
set.seed(877014)
# Random deletions
deleted <- cd_varstrat_varunit |>
group_by(varstrat) |>
sample_n(1)
# Base weight multiplier and deletion flag
cd_base_weight_mult <- cd_varstrat_varunit |>
group_by(varstrat) |>
mutate(psu_wt_mult0 = 1,
deletion_factor = if_else(psuid %in% deleted$psuid, 0, n() / (n() - 1)))
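# Written out, the multiplier applied within each variance stratum h (with
# n_h PSUs, here pairs or triples) is the usual delete-one jackknife factor:
\[ m_{hi} = \begin{cases} 0 & \text{if PSU } i \text{ is the deleted unit in stratum } h \\ n_h/(n_h - 1) & \text{otherwise} \end{cases} \]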
# Make one replicate
make_base_rep <- function(data, rep_number) {
wtname <- str_pad(rep_number, 3, pad = 0)
data |>
ungroup() |>
mutate("psu_wt_mult{wtname}" :=
case_when(varstrat == rep_number ~ psu_wt_mult0 * deletion_factor,
.default = psu_wt_mult0)) |>
select(starts_with("psu_wt_mult"), -psu_wt_mult0)
}
test <- make_base_rep(cd_base_weight_mult, 2)
cd_mult_wts_out <- lapply(seq(1:max(cd_base_weight_mult$varstrat)),
\(x) make_base_rep(cd_base_weight_mult, x)) |>
reduce(cbind)
# Apply the `make_base_rep` function to each value in the sequence from 1 to the maximum value of `varstrat`
# in the `cd_base_weight_mult` dataframe. This is achieved using `lapply`, where `seq(1:max(cd_base_weight_mult$varstrat))`
# generates the sequence of numbers for each stratum (varstrat).
# The function `make_base_rep` is executed for each stratum (`x`), and the output for all strata is collected
# into a list. The lambda syntax (\(x)) is used for a concise inline function.
# The resulting list of outputs from `lapply` is then combined into a single dataframe or matrix by reducing
# the list with `cbind` (column binding). This ensures all outputs are aligned as columns.
# Finally, the entire pipeline is stored in the object `cd_mult_wts_out`.
# Appliquer la fonction `make_base_rep` à chaque valeur de la séquence de 1 à la valeur maximale de `varstrat`
# dans le dataframe `cd_base_weight_mult`. Ceci est réalisé en utilisant `lapply`, où `seq(1:max(cd_base_weight_mult$varstrat))`
# génère la séquence de nombres pour chaque strate (varstrat).
# La fonction `make_base_rep` est exécutée pour chaque strate (`x`) et la sortie pour toutes les strates est collectée
# dans une liste. La syntaxe lambda (\(x)) est utilisée pour une fonction en ligne concise.
# La liste résultante des sorties de `lapply` est ensuite combinée en une seule trame de données ou matrice en réduisant
# la liste avec `cbind` (reliure de colonnes). Cela garantit que toutes les sorties sont alignées sous forme de colonnes.
# Enfin, l'intégralité du pipeline est stockée dans l'objet `cd_mult_wts_out`
# wtname <- str_pad(3, 3, pad = 0)
#
# test1 <- make_base_rep(zw2_base_weight_mult, 2)
# test2 <- make_base_rep(zw2_base_weight_mult, 3)
#
# test <- cbind(test1, test2)
psu_rep_mults <- cbind(cd_base_weight_mult, cd_mult_wts_out)
psu_basewt_reps <- psu_base_weights |>
# GR 10-08: joining wasn't working correctly (I think joining by province?): re-worked
#select(-strata, -order) %>%
select(psuid = ea_id,
psuwt0) |>
left_join(psu_rep_mults) |>
mutate(across(c(starts_with("psu_wt_mult"), -psu_wt_mult0), \(x) psuwt0 * x,
.names = "newwt{.col}")) |>
rename_with(.cols = contains("newwt"),
\(x) gsub("newwtpsu_wt_mult", "psuwt", x)) |>
arrange(varstrat, varunit)
## Joining with `by = join_by(psuid)`
# Use dataset file from the computation of the
# original hhbwt0 weight variable as input
# Start with psu_basewt_reps — a dataset containing PSU-level replicate weights (the jackknife replicate weights for primary sampling units constructed above). Use distinct(ea_id, .keep_all = TRUE) to ensure only one row per EA (Enumeration Area), keeping all other columns. This prevents duplication if multiple households per EA exist in this file. Then select essential columns: ea_id (EA identifier), varunit (variance unit), varstrat (variance stratum), and all columns starting with "psuwt" (PSU replicate weights). Finally, create a new column EAID by prepending "CD" to ea_id — to match the naming convention in other datasets (e.g., "CD0102" from "0102").
# Commencez avec psu_basewt_reps — un jeu de données contenant les poids répliqués au niveau des UP (les poids de réplique jackknife pour les unités primaires d'échantillonnage construits ci-dessus). Utilisez distinct(ea_id, .keep_all = TRUE) pour garantir une seule ligne par ZD (zone de dénombrement), en conservant toutes les autres colonnes. Cela évite les duplications si plusieurs ménages par ZD existent dans ce fichier. Ensuite, sélectionnez les colonnes essentielles : ea_id (identifiant de ZD), varunit (unité de variance), varstrat (strate de variance), et toutes les colonnes commençant par "psuwt" (poids répliqués des UP). Enfin, créez une nouvelle colonne EAID en préfixant "CD" à ea_id — pour correspondre à la convention de nommage dans d'autres jeux de données (par exemple, "CD0102" à partir de "0102").
psuwt_reps <- psu_basewt_reps |>
# GR 10-08: Adjusting because of name change above
mutate(ea_id = psuid) |>
distinct(ea_id, .keep_all = TRUE) %>%
select(ea_id,varunit,varstrat, starts_with("psuwt")) |>
mutate(EAID = paste0("CD", ea_id))
# Join PSU replicate weights with household dataset. 'input' starts as final_hh_weights (household dataset with adjusted replicate weights), then left_join() with psuwt_reps by implicit key (likely EAID or ea_id). Result: each household now has PSU-level replicate weights attached — necessary for computing household replicate weights that respect PSU-level perturbations.
# Joignez les poids répliqués des UP avec le jeu de données des ménages. 'input' commence comme final_hh_weights (jeu de données des ménages avec poids répliqués ajustés), puis left_join() avec psuwt_reps par clé implicite (probablement EAID ou ea_id). Résultat : chaque ménage a maintenant les poids répliqués au niveau des UP attachés — nécessaire pour calculer les poids répliqués des ménages qui respectent les perturbations au niveau des UP.
input <- final_hh_weights %>%
left_join(psuwt_reps)
## Joining with `by = join_by(EAID)`
# Function which outputs the replicate weight
# for the specified replicate number
# Define a function to compute one household replicate weight (e.g., hhbwt003) based on the corresponding PSU replicate weight (e.g., psuwt003). Input: data (not used — function uses 'input' from parent environment), rep_number (integer, e.g., 3). The function computes the household replicate weight by scaling the base household weight (hhbwt0) using the ratio of PSU base weight (psuwt0) to PSU replicate weight (psuwtXXX). Formula: hhbwtXXX = 1 / ( (1 / psuwt_rep) * (psuwt0 / hhbwt0) ) — this preserves the relationship between PSU and household weights while perturbing according to PSU replicate.
# Définissez une fonction pour calculer un poids répliqué de ménage (par exemple, hhbwt003) basé sur le poids répliqué correspondant des UP (par exemple, psuwt003). Entrée : data (non utilisé — la fonction utilise 'input' de l'environnement parent), rep_number (entier, par exemple, 3). La fonction calcule le poids répliqué de ménage en mettant à l'échelle le poids de base du ménage (hhbwt0) en utilisant le rapport du poids de base des UP (psuwt0) au poids répliqué des UP (psuwtXXX). Formule : hhbwtXXX = 1 / ( (1 / psuwt_rep) * (psuwt0 / hhbwt0) ) — cela préserve la relation entre les poids des UP et des ménages tout en perturbant selon la réplique des UP.
compute_hhbwt_rep <- function(data, rep_number) {
# Padded weight number suffix, eg. 3 becomes "003"
# Convert rep_number to 3-digit zero-padded string (e.g., 1 → "001", 12 → "012", 100 → "100") using str_pad(). This ensures consistent naming for replicate weight columns.
# Convertissez rep_number en chaîne de 3 chiffres avec zéro de remplissage (par exemple, 1 → "001", 12 → "012", 100 → "100") en utilisant str_pad(). Cela garantit une dénomination cohérente pour les colonnes de poids répliqués.
wtnum <- str_pad(rep_number, 3, pad = 0)
# Name of the replicate weight to be used
# eg. "psuwt003"
# Construct the name of the PSU replicate weight column by concatenating "psuwt" with the padded number (e.g., "psuwt003").
# Construisez le nom de la colonne de poids répliqué des UP en concaténant "psuwt" avec le numéro rempli (par exemple, "psuwt003").
repwt <- str_c("psuwt", wtnum)
# Take the 'input' dataset (from parent environment) and add two new columns:
# 1. psuwt_rep = value from the PSU replicate weight column (e.g., psuwt003), extracted using .data[[repwt]] (tidy eval for dynamic column name)
# 2. hhbwt{wtnum} = computed household replicate weight using formula: 1 / ( (1 / psuwt_rep) * (psuwt0 / hhbwt0) )
# This formula ensures that the household replicate weight is proportional to the PSU replicate weight, preserving design relationships.
# The := operator creates a column with a programmatically generated name (e.g., "hhbwt003").
# Then select only the newly created replicate weight column for return.
# Prenez le jeu de données 'input' (de l'environnement parent) et ajoutez deux nouvelles colonnes :
# 1. psuwt_rep = valeur de la colonne de poids répliqué des UP (par exemple, psuwt003), extraite en utilisant .data[[repwt]] (évaluation tidy pour nom de colonne dynamique)
# 2. hhbwt{wtnum} = poids répliqué de ménage calculé en utilisant la formule : 1 / ( (1 / psuwt_rep) * (psuwt0 / hhbwt0) )
# Cette formule garantit que le poids répliqué de ménage est proportionnel au poids répliqué des UP, en préservant les relations de plan.
# L'opérateur := crée une colonne avec un nom généré programmatiquement (par exemple, "hhbwt003").
# Ensuite, sélectionnez uniquement la colonne de poids répliqué nouvellement créée pour le retour.
input |>
mutate(psuwt_rep = .data[[repwt]],
"hhbwt{wtnum}" := 1/ ( (1 / psuwt_rep) * (psuwt0/hhbwt0))) |>
select(str_c("hhbwt", wtnum))
}
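# Note that the expression inside compute_hhbwt_rep() simplifies algebraically:
# each household replicate weight is just the base household weight rescaled
# by the PSU replicate-to-base ratio.
\[ \text{hhbwt}_r = \frac{1}{\frac{1}{\text{psuwt}_r} \cdot \frac{\text{psuwt}_0}{\text{hhbwt}_0}} = \text{hhbwt}_0 \times \frac{\text{psuwt}_r}{\text{psuwt}_0} \]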
# Generate replicate weights for all strata. Use lapply() to apply compute_hhbwt_rep() to each integer from 1 to max(psuwt_reps$varstrat) — assuming number of replicates equals number of strata (common in jackknife). Result: a list where each element is a single-column data frame containing one replicate weight (e.g., hhbwt001, hhbwt002, ...).
# Générez des poids répliqués pour toutes les strates. Utilisez lapply() pour appliquer compute_hhbwt_rep() à chaque entier de 1 à max(psuwt_reps$varstrat) — en supposant que le nombre de répliques égale le nombre de strates (courant en jackknife). Résultat : une liste où chaque élément est une trame de données à une colonne contenant un poids répliqué (par exemple, hhbwt001, hhbwt002, ...).
rep_weights <- lapply(seq(1:max(psuwt_reps$varstrat)), \(x) compute_hhbwt_rep(input, x))
# Convert the list of single-column data frames into one wide data frame. Then assign column names: "hhbwt" + 3-digit zero-padded sequence number (e.g., hhbwt001, hhbwt002, ...), ensuring names match expected format.
# Convertissez la liste de trames de données à une colonne en une seule trame de données large. Ensuite, attribuez des noms de colonnes : "hhbwt" + numéro de séquence à 3 chiffres avec zéro de remplissage (par exemple, hhbwt001, hhbwt002, ...), en garantissant que les noms correspondent au format attendu.
rep_weights_df <- as.data.frame(rep_weights)
names(rep_weights_df) <- paste0("hhbwt", str_pad(seq_along(rep_weights_df), 3, pad = "0"))
# Combine original input dataset with newly computed replicate weights using bind_cols() (column-wise binding — assumes same row order). Then remove unnecessary columns: HHI_PROVINCE, hhstatus, hh_adj_one, hh_wt_one, hh_adj_two, EAID, and all columns containing "psuwt" (intermediate PSU weights no longer needed). Result: hhbwts_reps — clean dataset of households with final replicate weights ready for individual expansion.
# Combinez le jeu de données d'entrée original avec les poids répliqués nouvellement calculés en utilisant bind_cols() (liaison par colonnes — suppose le même ordre de lignes). Ensuite, supprimez les colonnes inutiles : HHI_PROVINCE, hhstatus, hh_adj_one, hh_wt_one, hh_adj_two, EAID, et toutes les colonnes contenant "psuwt" (poids intermédiaires des UP plus nécessaires). Résultat : hhbwts_reps — jeu de données propre des ménages avec poids répliqués finaux prêts pour l'expansion individuelle.
hhbwts_reps <- bind_cols(input, rep_weights_df) %>%
select(-c(HHI_PROVINCE, hhstatus,hh_adj_one,hh_wt_one,hh_adj_two,EAID, contains("psuwt"))) |>
# GR 10-08: sorting by varstrat/varunit so it's easier to visually see if the replicates show the right pattern.
arrange(varstrat, varunit)
# Prepare input dataset for household replicate weight adjustment: start with hhbwts_reps (dataset of household replicate weights), and join with key household variables from cd_hh — specifically EA_HHID_FIXED (unique household ID), HHI_EACODE (Enumeration Area code, used as adjustment domain), HH_STATUS (household eligibility/response status), and HHI_PROVINCE (for potential diagnostics or later use). Join by implicit key (likely EA_HHID_FIXED). Then sort rows by sampling stratum (varstrat) and PSU (varunit) to ensure consistent ordering. Group by HHI_EACODE (to prepare for within-EA adjustments) then immediately ungroup — likely for code clarity or debugging, as no grouped mutate follows immediately.
# Préparez le jeu de données d'entrée pour l'ajustement des poids répliqués des ménages : commencez avec hhbwts_reps (jeu de données des poids répliqués des ménages), et joignez avec les variables clés des ménages issues de cd_hh — spécifiquement EA_HHID_FIXED (identifiant unique du ménage), HHI_EACODE (code de zone de dénombrement, utilisé comme domaine d'ajustement), HH_STATUS (statut d'éligibilité/réponse du ménage), et HHI_PROVINCE (pour diagnostics potentiels ou usage ultérieur). Joignez par clé implicite (probablement EA_HHID_FIXED). Ensuite, triez les lignes par strate d'échantillonnage (varstrat) et UP (varunit) pour garantir un ordre cohérent. Groupez par HHI_EACODE (pour préparer les ajustements intra-ZD) puis dégroupez immédiatement — probablement pour la clarté du code ou le débogage, car aucun mutate groupé ne suit immédiatement.
hh_adj_rep_in <- hhbwts_reps |>
left_join(select(cd_hh, EA_HHID_FIXED, HHI_EACODE,HH_STATUS, HHI_PROVINCE)) |>
arrange(varstrat, varunit) |>
group_by(HHI_EACODE) |>
ungroup()
## Joining with `by = join_by(EA_HHID_FIXED)`
# Define a function to adjust one household replicate weight vector for eligibility and nonresponse within each Enumeration Area (EA). Input: weight_vec = a numeric vector of replicate weights (e.g., hhbwt001 for all households).
# Définissez une fonction pour ajuster un vecteur de poids répliqué de ménage pour l'éligibilité et la non-réponse au sein de chaque zone de dénombrement (ZD). Entrée : weight_vec = un vecteur numérique de poids répliqués (par exemple, hhbwt001 pour tous les ménages).
hhwt_rep_adj <- function(weight_vec) {
# Phase 1
# Create a temporary dataset: add the current replicate weight as column 'repwt', then select only essential variables for adjustment (EA code, household ID, province, status, design vars, weight). Group by HHI_EACODE to perform adjustments within each EA. Then compute adjustment factors and weights in a single mutate() block.
# Créez un jeu de données temporaire : ajoutez le poids répliqué actuel comme colonne 'repwt', puis sélectionnez uniquement les variables essentielles pour l'ajustement (code ZD, identifiant du ménage, province, statut, variables de plan, poids). Groupez par HHI_EACODE pour effectuer les ajustements au sein de chaque ZD. Ensuite, calculez les facteurs d'ajustement et les poids dans un seul bloc mutate().
hh_uknwn_adj <- hh_adj_rep_in %>%
mutate(repwt = weight_vec) |>
select(HHI_EACODE, EA_HHID_FIXED, HHI_PROVINCE, HH_STATUS, varstrat, varunit, repwt) %>%
group_by(HHI_EACODE) %>%
mutate(
# Calculate first adjustment factor: total weight in EA divided by sum of weights for HH_STATUS 1, 2, and 3 (considered "in-scope" households). This redistributes the EA total weight only among eligible or in-scope households, excluding HH_STATUS 4 (out-of-scope).
# Calculez le premier facteur d'ajustement : poids total dans la ZD divisé par la somme des poids pour HH_STATUS 1, 2 et 3 (considérés comme ménages "dans le champ"). Cela redistribue le poids total de la ZD uniquement parmi les ménages éligibles ou dans le champ, en excluant HH_STATUS 4 (hors champ).
hh_adj_one = sum(repwt) / ( sum(repwt * (HH_STATUS == 1)) +
sum(repwt * (HH_STATUS == 2)) +
sum(repwt * (HH_STATUS == 3))),
# Apply first adjustment: if original weight is 0, keep 0; if HH_STATUS is 1 or 2, multiply by hh_adj_one; if HH_STATUS is 3 or 4, set to 0. This retains weight only for status 1 & 2 after adjustment, zeroing out 3 & 4.
# Appliquez le premier ajustement : si le poids d'origine est 0, gardez 0 ; si HH_STATUS est 1 ou 2, multipliez par hh_adj_one ; si HH_STATUS est 3 ou 4, mettez à 0. Cela conserve le poids uniquement pour les statuts 1 et 2 après ajustement, en mettant à zéro les 3 et 4.
hh_wt_one = case_when(repwt == 0 ~ 0,
HH_STATUS %in% c(1, 2) ~ hh_adj_one * repwt,
HH_STATUS %in% c(3, 4) ~ 0),
# Calculate second adjustment factor: sum of first-adjusted weights (hh_wt_one) divided by sum of those weights where HH_STATUS == 1 (responding eligible households). This shifts all weight onto actual respondents (status 1), assuming status 2 are nonrespondents.
# Calculez le deuxième facteur d'ajustement : somme des poids ajustés en premier (hh_wt_one) divisée par la somme de ces poids où HH_STATUS == 1 (ménages éligibles répondants). Cela transfère tout le poids sur les répondants réels (statut 1), en supposant que le statut 2 représente des non-répondants.
hh_adj_two = sum(hh_wt_one) / ( sum(hh_wt_one * (HH_STATUS == 1))),
# Apply second adjustment: if HH_STATUS == 1 and original weight > 0, multiply first-adjusted weight by hh_adj_two; else set to 0. Final result: only HH_STATUS == 1 households retain non-zero weights.
# Appliquez le deuxième ajustement : si HH_STATUS == 1 et poids d'origine > 0, multipliez le poids ajusté en premier par hh_adj_two ; sinon, mettez à 0. Résultat final : seuls les ménages avec HH_STATUS == 1 conservent des poids non nuls.
hh_wt_two = if_else(HH_STATUS == 1 & repwt > 0,
hh_adj_two * hh_wt_one,
0)) %>%
ungroup()
# Return the final adjusted weight vector (hh_wt_two) for this replicate. All other households (non-status-1) will have weight 0.
# Renvoyez le vecteur de poids ajusté final (hh_wt_two) pour cette réplique. Tous les autres ménages (non statut 1) auront un poids de 0.
return(hh_uknwn_adj$hh_wt_two)
}
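# In equation form, the two within-EA factors computed above are
\[ a_1 = \frac{\sum_{\text{EA}} w}{\sum_{\text{EA},\, \text{status} \in \{1,2,3\}} w}, \qquad a_2 = \frac{\sum_{\text{EA},\, \text{status} \in \{1,2\}} a_1 w}{\sum_{\text{EA},\, \text{status} = 1} a_1 w} \]
# so only status-1 (responding) households keep a nonzero final weight, a_1 * a_2 * w.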
# Quick test
# (Commented) — Example of how to test the function on first two replicates: join with main weight, compute adjusted versions, sort, and inspect. Useful for debugging or validation before applying to all replicates.
# (Commenté) — Exemple de test de la fonction sur les deux premières répliques : joignez avec le poids principal, calculez les versions ajustées, triez et inspectez. Utile pour le débogage ou la validation avant d'appliquer à toutes les répliques.
# test <- hh_adj_rep_in |>
# left_join(select(final_hh_weights, EA_HHID_FIXED, hhwt0)) |>
# mutate(hhwt001_new = hhwt_rep_adj(hhbwt001),
# hhwt002_new = hhwt_rep_adj(hhbwt002)) |>
# arrange(varstrat, varunit) |>
# select(ends_with("new"), everything())
#
# test2 <- test |>
# filter(varstrat == 1)
# All reps
# Apply the hhwt_rep_adj() function to every column in hh_adj_rep_in matching the pattern "^hhbwt\\d{3}$" — i.e., any column starting with "hhbwt" followed by exactly 3 digits (e.g., hhbwt001, hhbwt002, ...). For each, compute the fully adjusted household weight and store it in a new column named "adj_" + original column name (e.g., adj_hhbwt001). This creates adjusted versions of all replicate weights.
# Appliquez la fonction hhwt_rep_adj() à chaque colonne de hh_adj_rep_in correspondant au motif "^hhbwt\\d{3}$" — c'est-à-dire toute colonne commençant par "hhbwt" suivie exactement de 3 chiffres (par exemple, hhbwt001, hhbwt002, ...). Pour chacune, calculez le poids de ménage entièrement ajusté et stockez-le dans une nouvelle colonne nommée "adj_" + nom de la colonne d'origine (par exemple, adj_hhbwt001). Cela crée des versions ajustées de tous les poids répliqués.
final_hh_weights <- hh_adj_rep_in |>
mutate(across(matches("^hhbwt\\d{3}$"), hhwt_rep_adj, .names = "adj_{.col}"))
# Rename adjusted columns: sort dataset again by design (varstrat, varunit) for consistency, then rename all columns starting with "adj_hhbwt" to "hhwt" (e.g., adj_hhbwt001 → hhwt001). This standardizes naming for downstream use and matches expected column names in analysis datasets.
# Renommez les colonnes ajustées : triez à nouveau le jeu de données par plan (varstrat, varunit) pour la cohérence, puis renommez toutes les colonnes commençant par "adj_hhbwt" en "hhwt" (par exemple, adj_hhbwt001 → hhwt001). Cela standardise la dénomination pour une utilisation en aval et correspond aux noms de colonnes attendus dans les jeux de données d'analyse.
final_hh_weights_rename <- final_hh_weights |>
arrange(varstrat, varunit) |>
rename_with(\(x) str_replace(x, "adj_hhbwt", "hhwt"))
# Random sample of households to check
# (Commented diagnostic) — If uncommented, this would take a random sample of 10 households, reshape their weights (main + replicates) to long format, and plot them as jittered points by household ID. Useful for visually checking weight distributions and spotting extreme values or inconsistencies.
# (Diagnostic commenté) — Si décommenté, cela prendrait un échantillon aléatoire de 10 ménages, remodèlerait leurs poids (principal + répliques) au format long, et les tracerait sous forme de points jitterés par identifiant de ménage. Utile pour vérifier visuellement les distributions de poids et repérer les valeurs extrêmes ou les incohérences.
# hh_sample <- final_hh_weights_rename |>
# slice_sample(n = 10)
#
# hh_sample_long <- hh_sample |>
# pivot_longer(cols = starts_with("hhwt"), names_to = "weight_num", values_to = "weight_val")
#
# hh_sample_long |>
# ggplot(aes(x = EA_HHID_FIXED, y = weight_val)) +
# geom_jitter(alpha = 0.5)
# Attaching to each person
int_wgts_phase_one_input_reps <- int_wgts_phase_one_input %>%
left_join(
#final_hhwts_reps %>%
final_hh_weights_rename |>
#select(EA_HHID_FIXED, varunit, varstrat, starts_with("hhbwt")),
select(EA_HHID_FIXED, varunit, varstrat, starts_with("hhwt"), -hhwt0),
by = "EA_HHID_FIXED"
) |>
arrange(varstrat, varunit)
# phase 1 replicates
# Define a function to adjust one individual replicate weight vector for eligibility within age/sex cells. This function redistributes weights from ineligible or unresolved individuals to eligible respondents, preserving totals within each demographic cell. Input: wt_vector = a numeric vector of replicate weights (e.g., hhwt001 for all individuals).
# Définissez une fonction pour ajuster un vecteur de poids répliqué individuel pour l'éligibilité au sein des cellules âge/sexe. Cette fonction redistribue les poids des individus inéligibles ou non résolus vers les répondants éligibles, en préservant les totaux au sein de chaque cellule démographique. Entrée : wt_vector = un vecteur numérique de poids répliqués (par exemple, hhwt001 pour tous les individus).
int_firststage_wt_rep <- function(wt_vector) {
# Create a lookup table (adj_tbl) that calculates the adjustment factor for each age_sex_cell. Start by adding the current replicate weight as a new column 'repwt'. Filter to include only individuals with int_elig_status in {1,2,3} (considered "in scope") and positive weights. Group by age_sex_cell and int_elig_status, then compute the sum of weights within each group. '.groups = "drop_last"' drops the last grouping level (int_elig_status) but keeps age_sex_cell grouped for pivot_wider.
# Créez une table de correspondance (adj_tbl) qui calcule le facteur d'ajustement pour chaque age_sex_cell. Commencez par ajouter le poids répliqué actuel comme nouvelle colonne 'repwt'. Filtrez pour inclure uniquement les individus avec int_elig_status dans {1,2,3} (considérés comme "dans le champ") et des poids positifs. Groupez par age_sex_cell et int_elig_status, puis calculez la somme des poids au sein de chaque groupe. '.groups = "drop_last"' supprime le dernier niveau de regroupement (int_elig_status) mais conserve age_sex_cell groupé pour pivot_wider.
adj_tbl <- int_wgts_phase_one_input_reps %>%
mutate(repwt = wt_vector) %>%
filter(int_elig_status %in% c(1,2,3), repwt > 0) %>%
group_by(age_sex_cell, int_elig_status) %>%
summarise(weighted_count = sum(repwt), .groups = "drop_last") %>%
# Reshape from long to wide format: create separate columns for each int_elig_status (1,2,3) — named "status_1", "status_2", "status_3" — containing the summed weights. Fill missing combinations with 0. This allows easy computation of adjustment factors across eligibility statuses within each age/sex cell.
# Remodelez du format long au format large : créez des colonnes séparées pour chaque int_elig_status (1,2,3) — nommées "status_1", "status_2", "status_3" — contenant les poids sommés. Remplissez les combinaisons manquantes avec 0. Cela permet un calcul facile des facteurs d'ajustement entre les statuts d'éligibilité au sein de chaque cellule âge/sexe.
tidyr::pivot_wider(values_from = weighted_count,
names_from = int_elig_status,
names_glue = "status_{int_elig_status}",
values_fill = 0) %>%
# Compute adjustment factor: (status_1 + status_3) / status_1. Assumes status 1 = eligible respondents, status 3 = eligible non-selected or resolved eligible — their weight should be redistributed to status 1 respondents. Status 2 (e.g., nonrespondents) is not included in numerator — only status 1 and 3 are considered "effective population".
# Calculez le facteur d'ajustement : (status_1 + status_3) / status_1. Suppose que status 1 = répondants éligibles, status 3 = éligibles non sélectionnés ou éligibles résolus — leur poids devrait être redistribué aux répondants de status 1. Le status 2 (par exemple, non-répondants) n'est pas inclus dans le numérateur — seuls les status 1 et 3 sont considérés comme "population effective".
mutate(phase_one_int_adj_factor = (status_1 + status_3) / status_1) %>%
# Keep only age_sex_cell and the computed adjustment factor for joining back to individual records.
# Conservez uniquement age_sex_cell et le facteur d'ajustement calculé pour la jointure avec les enregistrements individuels.
select(age_sex_cell, phase_one_int_adj_factor)
# Apply the adjustment to each individual: re-add the replicate weight, join with the adjustment table by age_sex_cell, multiply each person’s weight by their cell’s adjustment factor, then extract the resulting vector of adjusted weights. Non-matching or filtered individuals will get NA — handled naturally by left_join.
# Appliquez l'ajustement à chaque individu : réajoutez le poids répliqué, joignez avec la table d'ajustement par age_sex_cell, multipliez le poids de chaque personne par le facteur d'ajustement de sa cellule, puis extrayez le vecteur résultant de poids ajustés. Les individus sans correspondance ou filtrés obtiendront NA — géré naturellement par left_join.
int_wgts_phase_one_input_reps %>%
mutate(repwt = wt_vector) %>%
left_join(adj_tbl, by = "age_sex_cell") %>%
mutate(phase_one_adj_int_wgt = repwt * phase_one_int_adj_factor) %>%
pull(phase_one_adj_int_wgt)
}
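# The cell-level factor computed in adj_tbl is, writing S_k for the weighted
# count of int_elig_status == k within an age/sex cell:
\[ a_{\text{cell}} = \frac{S_1 + S_3}{S_1} \]
# i.e. the weight of status-3 individuals is shifted onto the status-1
# respondents in the same age/sex cell, leaving status 2 out of the numerator.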
# Compute phase-one adjusted interview replicate columns: phase_one_adj_wt001 …
# Apply the int_firststage_wt_rep() function to every column in the dataset matching the pattern "^hhwt\\d{3}$" — i.e., any column starting with "hhwt" followed by exactly 3 digits (e.g., hhwt001, hhwt002, ...). These are the household-derived replicate weights now being adjusted at individual level. For each, compute the phase-one eligibility-adjusted weight and store it in a new column named "phase_one_adj_wt" + last 3 digits of original column name (e.g., phase_one_adj_wt001). This creates the full set of first-stage adjusted replicate weights for individuals.
# Calculez les colonnes de poids répliqués ajustés en phase un pour les entretiens : phase_one_adj_wt001... Appliquez la fonction int_firststage_wt_rep() à chaque colonne du jeu de données correspondant au motif "^hhwt\\d{3}$" — c'est-à-dire toute colonne commençant par "hhwt" suivie exactement de 3 chiffres (par exemple, hhwt001, hhwt002, ...). Ce sont les poids répliqués dérivés des ménages maintenant ajustés au niveau individuel. Pour chacun, calculez le poids ajusté pour l'éligibilité en phase un et stockez-le dans une nouvelle colonne nommée "phase_one_adj_wt" + les 3 derniers chiffres du nom de la colonne d'origine (par exemple, phase_one_adj_wt001). Cela crée l'ensemble complet des poids répliqués ajustés en première étape pour les individus.
int_phase1_reps_start <- int_wgts_phase_one_input_reps %>%
mutate(across(matches("^hhwt\\d{3}$"),
int_firststage_wt_rep,
.names = "phase_one_adj_wt{str_sub(.col, -3)}"))
# Joining fixed response probabilities
# Attach individual-level response data to the dataset containing phase-one adjusted replicate weights. Specifically, join with 'response_probs_int' to bring in: EA_HHID_LN_FIXED (person ID for matching), ind_responded (1 if individual completed interview, 0 otherwise), and response_probability_int (predicted probability of responding, from the response-propensity model fitted earlier). The join is done by person ID to ensure correct record linkage.
# Attachez les données de réponse au niveau individuel au jeu de données contenant les poids répliqués ajustés en phase un. Plus précisément, joignez avec 'response_probs_int' pour apporter : EA_HHID_LN_FIXED (identifiant de la personne pour la correspondance), ind_responded (1 si l'individu a terminé l'entretien, 0 sinon), et response_probability_int (probabilité prédite de répondre, issue du modèle de propension à répondre ajusté plus haut). La jointure est effectuée par identifiant de personne pour garantir une liaison correcte des enregistrements.
int_phase1_reps <- int_phase1_reps_start %>%
left_join(select(response_probs_int,
EA_HHID_LN_FIXED, ind_responded,
response_probability_int),
by = "EA_HHID_LN_FIXED")
# this cap is just arbitrary
# Compute a cap value for individual nonresponse-adjusted weights: 3.5 times the median of the nonresponse-adjusted weights (from a presumably pre-existing dataset 'nr_adj_int_weights'). 'nr_adj_int_weights' may not be defined yet — this line will fail if executed before those weights exist. Intended logic: cap extreme weights to reduce their influence on estimates. The multiplier 3.5 is arbitrary but common — it limits weights to 3.5x the typical weight, reducing outlier impact while preserving most adjustment.
# Calculez une valeur de plafond pour les poids ajustés pour la non-réponse individuelle : 3,5 fois la médiane des poids ajustés pour la non-réponse (d'un jeu de données préexistant 'nr_adj_int_weights'). REMARQUE : 'nr_adj_int_weights' peut ne pas être défini encore — cette ligne échouera si elle est exécutée avant que ces poids n'existent. Logique intentionnelle : plafonner les poids extrêmes pour réduire leur influence sur les estimations. Le multiplicateur 3,5 est arbitraire mais courant — il limite les poids à 3,5x le poids typique, réduisant l'impact des valeurs aberrantes tout en préservant la plupart des ajustements.
#cap_int <- 3.5 * median(nr_adj_int_weights$nr_adj_int_wgt, na.rm = TRUE)
# GR 10-08 remove trimming for reasons noted above by setting super-high limit
cap_int <- 3500 * median(nr_adj_int_weights$nr_adj_int_wgt, na.rm = TRUE)
# Apply nonresponse adjustment and trimming in one mutate() step. First 'across()': for every column matching "^phase_one_adj_wt\\d{3}$" (e.g., phase_one_adj_wt001), apply inverse probability weighting — if ind_responded == 1, divide weight by response_probability_int (to upweight low-probability respondents); if not responded, set weight to NA (they don’t contribute to estimates). Create new columns named "nr_adj_int_wgt" + last 3 digits of original column (e.g., nr_adj_int_wgt001). Second 'across()': for every newly created "nr_adj_int_wgt\\d{3}$" column, cap the weight at 'cap_int' using pmin() — replaces any value above cap_int with cap_int. This stabilizes weights and prevents extreme values from dominating estimates.
# Appliquez l'ajustement pour la non-réponse et le troncage en une seule étape mutate(). Premier 'across()' : pour chaque colonne correspondant à "^phase_one_adj_wt\\d{3}$" (par exemple, phase_one_adj_wt001), appliquez la pondération par probabilité inverse — si ind_responded == 1, divisez le poids par response_probability_int (pour surpondérer les répondants à faible probabilité) ; si non répondant, mettez le poids à NA (ils ne contribuent pas aux estimations). Créez de nouvelles colonnes nommées "nr_adj_int_wgt" + les 3 derniers chiffres de la colonne d'origine (par exemple, nr_adj_int_wgt001). Deuxième 'across()' : pour chaque colonne nouvellement créée "nr_adj_int_wgt\\d{3}$", plafonnez le poids à 'cap_int' en utilisant pmin() — remplace toute valeur supérieure à cap_int par cap_int. Cela stabilise les poids et empêche les valeurs extrêmes de dominer les estimations.
int_nr_reps <- int_phase1_reps %>%
mutate(
across(matches("^phase_one_adj_wt\\d{3}$"),
#~ if_else(ind_responded == 1, .x / response_probability_int, NA_real_),
# GR 10-08 set missing weights to 0.0 to make it cleaner
~ if_else(ind_responded == 1, .x / response_probability_int, 0.0),
.names = "nr_adj_int_wgt{str_sub(.col, -3)}"),
across(matches("^nr_adj_int_wgt\\d{3}$"),
~ pmin(.x, cap_int))
)
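# Per replicate r, the two across() calls amount to
\[ \text{nr\_adj\_int\_wgt}_r = \min\left( \frac{\text{phase\_one\_adj\_wt}_r}{\hat{p}^{\text{int}}},\ \text{cap\_int} \right) \]
# for respondents, and 0 for nonrespondents (per the GR 10-08 change above).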
# int_nr_reps |>
# group_by(ind_responded) |>
# summarise(max = max(nr_adj_int_wgt001, na.rm = TRUE))
# (Commented diagnostic) — If uncommented, this would group the dataset by response status (responded vs not) and compute the maximum value of the first nonresponse-adjusted replicate weight (nr_adj_int_wgt001) within each group. Useful for checking whether nonrespondents somehow retained weights (they should be 0 after the GR 10-08 change above) or whether respondents have extreme weights needing further trimming.
# (Diagnostic commenté) — Si décommenté, cela regrouperait le jeu de données par statut de réponse (répondant vs non) et calculerait la valeur maximale de la première réplique de poids ajustée pour la non-réponse (nr_adj_int_wgt001) dans chaque groupe. Utile pour vérifier si les non-répondants ont conservé des poids (ils devraient être 0 après la modification GR 10-08 ci-dessus) ou si les répondants ont des poids extrêmes nécessitant un troncage supplémentaire.
# poststrat replicates
# Define a function to post-stratify one individual replicate weight column to match known population totals by province and sex. This ensures survey estimates are demographically representative. The function takes three arguments: df = dataset, repcol = name of the replicate weight column to adjust, targets_df = data frame containing target population counts by province and sex.
# Définissez une fonction pour post-stratifier une colonne de poids répliqué individuel afin de correspondre aux totaux de population connus par province et sexe. Cela garantit que les estimations de l'enquête sont représentatives sur le plan démographique. La fonction prend trois arguments : df = jeu de données, repcol = nom de la colonne de poids répliqué à ajuster, targets_df = trame de données contenant les décomptes de population cibles par province et sexe.
poststrat_one_intrep <- function(df, repcol, targets_df) {
# Create a temporary dataset containing only: province (extracted as characters 3–4 from EAID, e.g., "03" from "CD03ABCD"), sex (from SEX column), and the current replicate weight (w = value from column named in 'repcol'). 'transmute()' is like 'mutate()' but drops all other columns — keeping only what’s needed. Then remove rows where weight is NA (nonrespondents or ineligible individuals).
# Créez un jeu de données temporaire contenant uniquement : province (extraite comme caractères 3–4 de EAID, par exemple, "03" de "CD03ABCD"), sexe (de la colonne SEX), et le poids répliqué actuel (w = valeur de la colonne nommée dans 'repcol'). 'transmute()' est comme 'mutate()' mais supprime toutes les autres colonnes — ne gardant que ce qui est nécessaire. Ensuite, supprimez les lignes où le poids est NA (non-répondants ou individus inéligibles).
tmp <- df %>%
transmute(
province = stringr::str_sub(EAID, 3, 4),
sex = SEX,
w = .data[[repcol]]
) %>%
filter(!is.na(w))
# Group the temporary dataset by province and sex, then compute the total current weight (sum of replicate weights) in each group. '.groups = "drop"' ensures the grouping is removed after summarise to prevent unexpected behavior in later operations.
# Groupez le jeu de données temporaire par province et sexe, puis calculez le poids total actuel (somme des poids répliqués) dans chaque groupe. '.groups = "drop"' garantit que le regroupement est supprimé après summarise pour éviter des comportements inattendus dans les opérations ultérieures.
tots <- tmp %>% group_by(province, sex) %>%
summarise(totalW = sum(w), .groups = "drop")
# Join the current weighted totals (tots) with the target population counts (targets_df) by province and sex. Compute the adjustment factor 'f' = target population / current weighted total. This factor scales weights up (if underrepresented) or down (if overrepresented). Keep only province, sex, and f for the next step.
# Joignez les totaux pondérés actuels (tots) avec les décomptes de population cibles (targets_df) par province et sexe. Calculez le facteur d'ajustement 'f' = population cible / total pondéré actuel. Ce facteur met à l'échelle les poids vers le haut (si sous-représentés) ou vers le bas (si sur-représentés). Conservez uniquement province, sexe et f pour l'étape suivante.
adj <- tots %>% left_join(targets_df, by = c("province","sex")) %>%
mutate(f = population / totalW) %>% select(province, sex, f)
# Join the adjustment factors (adj) back to the individual-level temporary dataset (tmp) by province and sex. Multiply each individual’s weight (w) by their group’s adjustment factor (f) to get their post-stratified weight. 'pull(final)' extracts the resulting vector of adjusted weights.
# Joignez les facteurs d'ajustement (adj) au jeu de données temporaire au niveau individuel (tmp) par province et sexe. Multipliez le poids de chaque individu (w) par le facteur d'ajustement de leur groupe (f) pour obtenir leur poids post-stratifié. 'pull(final)' extrait le vecteur résultant de poids ajustés.
out <- tmp %>% left_join(adj, by = c("province","sex")) %>%
mutate(final = w * f) %>% pull(final)
# Reconstruct a weight vector of the same length as the original dataset, initialized with NA. Then, only for rows where the original replicate weight (in 'repcol') was non-NA, insert the adjusted weight from 'out'. This preserves the original row order and NA structure — critical for replicate weights to remain aligned with the dataset.
# Reconstruisez un vecteur de poids de la même longueur que le jeu de données d'origine, initialisé avec NA. Ensuite, uniquement pour les lignes où le poids répliqué d'origine (dans 'repcol') était non-NA, insérez le poids ajusté de 'out'. Cela préserve l'ordre original des lignes et la structure NA — essentiel pour que les poids répliqués restent alignés avec le jeu de données.
res <- rep(NA_real_, nrow(df))
res[!is.na(df[[repcol]])] <- out
res
}
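# Compactly, for each province p and sex s the function computes
\[ f_{p,s} = \frac{N_{p,s}}{\sum_{i \in (p,s)} w_i}, \qquad w_i^{\text{post}} = f_{p,s} \, w_i \]
# where N_{p,s} is the target count ('population') from individual_reference_totals_adj.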
# We already have EAID, SEX, CONFAGEY inside int_phase1_reps via earlier joins
# Identify all columns in int_nr_reps whose names match the pattern "^nr_adj_int_wgt\\d{3}$" — i.e., any column starting with "nr_adj_int_wgt" followed by exactly 3 digits (e.g., nr_adj_int_wgt001, nr_adj_int_wgt002, ...). 'grep(..., value = TRUE)' returns the actual column names (not positions) as a character vector.
# Identifiez toutes les colonnes dans int_nr_reps dont les noms correspondent au motif "^nr_adj_int_wgt\\d{3}$" — c'est-à-dire toute colonne commençant par "nr_adj_int_wgt" suivie exactement de 3 chiffres (par exemple, nr_adj_int_wgt001, nr_adj_int_wgt002, ...). 'grep(..., value = TRUE)' renvoie les noms réels des colonnes (pas les positions) sous forme de vecteur de caractères.
int_rep_cols <- grep("^nr_adj_int_wgt\\d{3}$", names(int_nr_reps), value = TRUE)
# For each identified replicate weight column (e.g., nr_adj_int_wgt001), generate a new column name by replacing "nr_adj_int_wgt" with "intwt" (→ intwt001). Then apply the poststrat_one_intrep() function to that column, using population targets (individual_reference_totals_adj), and store the result in the new column. This creates final post-stratified individual replicate weights.
# Pour chaque colonne de poids répliqué identifiée (par exemple, nr_adj_int_wgt001), générez un nouveau nom de colonne en remplaçant "nr_adj_int_wgt" par "intwt" (→ intwt001). Appliquez ensuite la fonction poststrat_one_intrep() à cette colonne, en utilisant les cibles de population (individual_reference_totals_adj), et stockez le résultat dans la nouvelle colonne. Cela crée les poids répliqués individuels finaux post-stratifiés.
for (rc in int_rep_cols) {
newname <- sub("^nr_adj_int_wgt", "intwt", rc)
int_nr_reps[[newname]] <- poststrat_one_intrep(int_nr_reps, rc, individual_reference_totals_adj)
}
# Create the final individual weights dataset: join with the main post-stratified weight (intwt0 from poststrat_int_wgts) using both EA_HHID_LN_FIXED (person ID) and EAID (enumeration area ID) to ensure correct matching. Then select only essential columns: person ID, EAID, sampling design variables (varunit, varstrat), main weight (intwt0), and all replicate weights (intwt001, intwt002, ...). This dataset is now clean and analysis-ready.
# Créez le jeu de données final des poids individuels : joignez avec le poids principal post-stratifié (intwt0 de poststrat_int_wgts) en utilisant à la fois EA_HHID_LN_FIXED (identifiant de la personne) et EAID (identifiant de la zone de dénombrement) pour garantir une correspondance correcte. Ensuite, sélectionnez uniquement les colonnes essentielles : identifiant de la personne, EAID, variables de plan d'échantillonnage (varunit, varstrat), poids principal (intwt0), et tous les poids répliqués (intwt001, intwt002, ...). Ce jeu de données est maintenant propre et prêt pour l'analyse.
final_indwts_reps <- int_nr_reps %>%
left_join(
poststrat_int_wgts %>%
select(EA_HHID_LN_FIXED, EAID, intwt0),
by = c("EA_HHID_LN_FIXED", "EAID")
) %>%
select(EA_HHID_LN_FIXED, EAID, varunit, varstrat, intwt0, matches("^intwt\\d{3}$"))
# Attach the final individual weights (main + replicates) to the main individual dataset (cd_csproindiv) by matching on person ID (EA_HHID_LN_FIXED). Result: every individual now has their calibrated analysis weight and replicate weights attached — ready for estimation and variance calculation.
# Attachez les poids individuels finaux (principal + répliques) au jeu de données individuel principal (cd_csproindiv) en faisant correspondre l'identifiant de la personne (EA_HHID_LN_FIXED). Résultat : chaque individu a maintenant son poids d'analyse calibré et ses poids répliqués attachés — prêt pour l'estimation et le calcul de la variance.
indiv_data_weights <- cd_csproindiv %>%
left_join(final_indwts_reps)
## Joining with `by = join_by(EA_HHID_LN_FIXED)`
# creating replicates for biomarker weights
# Start building biomarker replicate weights by joining biomarker response data (who gave blood, and their predicted probability of doing so) to the individual-level dataset that already contains nonresponse-adjusted replicate weights. The join is done using the unique person identifier EA_HHID_LN_FIXED to ensure correct matching.
# Commencez à construire les poids répliqués pour les biomarqueurs en joignant les données de réponse des biomarqueurs (qui a donné du sang et leur probabilité prédite de le faire) au jeu de données au niveau individuel qui contient déjà les poids répliqués ajustés pour la non-réponse. La jointure est effectuée à l'aide de l'identifiant unique de la personne EA_HHID_LN_FIXED pour garantir une correspondance correcte.
bio_base <- int_nr_reps %>%
left_join(select(response_probs_bio,
EA_HHID_LN_FIXED, response_bio, response_probability_bio),
by = "EA_HHID_LN_FIXED")
# Apply trimming (capping) to all individual nonresponse-adjusted replicate weight columns (e.g., nr_adj_int_wgt001, nr_adj_int_wgt002, ...). The 'across()' function applies the same operation to multiple columns that match the regex pattern "^nr_adj_int_wgt\\d{3}$" — meaning any column name starting with "nr_adj_int_wgt" followed by exactly 3 digits. For each value in those columns, pmin(.x, cap_int) replaces it with the smaller of the current weight or cap_int (a pre-defined cap, e.g., 3.5 * median weight). The .names argument creates new column names: "trmpnr1w" + last 3 digits of original column (e.g., trmpnr1w001). This prevents extreme weights from blowing up further during biomarker adjustment.
# Appliquez un troncage (plafonnement) à toutes les colonnes de poids répliqués individuels ajustés pour la non-réponse (par exemple, nr_adj_int_wgt001, nr_adj_int_wgt002, ...). La fonction 'across()' applique la même opération à plusieurs colonnes correspondant à l'expression régulière "^nr_adj_int_wgt\\d{3}$" — c'est-à-dire tout nom de colonne commençant par "nr_adj_int_wgt" suivi exactement de 3 chiffres. Pour chaque valeur dans ces colonnes, pmin(.x, cap_int) la remplace par la plus petite entre le poids actuel et cap_int (un plafond prédéfini, par exemple 3,5 * poids médian). L'argument .names crée de nouveaux noms de colonnes : "trmpnr1w" + les 3 derniers chiffres de la colonne d'origine (par exemple, trmpnr1w001). Cela empêche les poids extrêmes de devenir encore plus grands lors de l'ajustement des biomarqueurs.
bio_base <- bio_base %>%
mutate(across(matches("^nr_adj_int_wgt\\d{3}$"),
~ pmin(.x, cap_int),
.names = "trmpnr1w{str_sub(.col, -3)}"))
# For each trimmed replicate weight column (names matching "^trmpnr1w\\d{3}$"), identified using 'grep' — which scans column names and returns those matching the pattern — we create a new column for biomarker nonresponse adjustment. 'grep(..., value = TRUE)' returns the actual column names (not positions). For each such column (e.g., trmpnr1w001), we generate a new name by replacing "trmpnr1w" with "nr_adj_bio_wgt" (→ nr_adj_bio_wgt001). Then, using ifelse(), we adjust the weight: if the person gave a biomarker (response_bio == 1), divide their trimmed weight by their predicted biomarker response probability (response_probability_bio) — this is inverse probability weighting (IPW); if they didn't give a biomarker, set weight to NA (they won’t contribute to biomarker estimates).
for (rc in grep("^trmpnr1w\\d{3}$", names(bio_base), value = TRUE)) {
newname <- sub("^trmpnr1w", "nr_adj_bio_wgt", rc)
bio_base[[newname]] <- ifelse(
bio_base$response_bio == 1,
bio_base[[rc]] / bio_base$response_probability_bio,
NA_real_
)
}
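# The same IPW adjustment could be written without an explicit loop; a sketch equivalent to the
# for-loop above, kept commented out so the loop remains the single source of truth:
# bio_base <- bio_base %>%
#   mutate(across(matches("^trmpnr1w\\d{3}$"),
#                 ~ if_else(response_bio == 1, .x / response_probability_bio, NA_real_),
#                 .names = "nr_adj_bio_wgt{str_sub(.col, -3)}"))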
# Define a cap for the biomarker-adjusted replicate weights, anchored on the median of the main
# biomarker nonresponse-adjusted weight (nr_adj_bio_wgt in nr_adj_bio_weights, created in the
# main-weight step above). IPW can inflate weights sharply when predicted response
# probabilities are small, so a cap keeps extreme values from propagating into the replicates.
#cap_bio <- 3.5 * median(nr_adj_bio_weights$nr_adj_bio_wgt, na.rm = TRUE)
# GR 10-08: set a deliberately high trimming threshold (3500x the median, versus the usual
# 3.5x) as discussed above, so the replicate weights are effectively left untrimmed
cap_bio <- 3500 * median(nr_adj_bio_weights$nr_adj_bio_wgt, na.rm = TRUE)
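# Sanity check (a minimal sketch, assuming the nr_adj_bio_wgtXXX columns created above): with a
# 3500x-median cap essentially nothing should be trimmed, so the maximum share above the cap
# across all replicates should be 0.
bio_base %>%
summarise(across(matches("^nr_adj_bio_wgt\\d{3}$"), ~ mean(.x > cap_bio, na.rm = TRUE))) %>%
pivot_longer(everything(), names_to = "replicate", values_to = "share_above_cap") %>%
summarise(max_share_above_cap = max(share_above_cap))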
nr_adj_bio_weights |>
ggplot(aes(x = nr_adj_bio_wgt)) +
geom_boxplot()
## Warning: Removed 456 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# Apply the biomarker cap to all biomarker nonresponse-adjusted replicate weights (columns matching "^nr_adj_bio_wgt\\d{3}$"). Again, 'across()' selects all such columns, applies pmin(.x, cap_bio) to cap extreme values, and creates new columns named "trmbtnr1w" + last 3 digits of original column name (e.g., trmbtnr1w001). This stabilizes weights before post-stratification.
bio_base <- bio_base %>%
mutate(across(matches("^nr_adj_bio_wgt\\d{3}$"),
~ pmin(.x, cap_bio),
.names = "trmbtnr1w{str_sub(.col, -3)}"))
# Post-stratify each replicate to get btwtXXX
# Define a function that post-stratifies one replicate weight column to match known population totals by province and sex. This ensures biomarker estimates are demographically representative. The function takes: df = dataset, repcol = name of weight column to adjust, targets_df = data frame with population totals by province and sex.
poststrat_one_biorep <- function(df, repcol, targets_df) {
# Extract province (characters 3-4 from EA_HHID_FIXED, e.g., "03" from "CI03..."), sex, and the current replicate weight. 'transmute()' creates a new dataset with only these variables. Then filter out rows where weight is NA (non-biomarker respondents).
tmp <- df %>%
transmute(
province = stringr::str_sub(EA_HHID_FIXED, 3, 4),
sex = SEX,
w = .data[[repcol]]
) %>%
filter(!is.na(w))
# Group by province and sex, then compute the total current weight (sum of replicate weights) in each group. '.groups = "drop"' ensures grouping is removed after summarise to avoid unexpected behavior downstream.
tots <- tmp %>% group_by(province, sex) %>%
summarise(totalW = sum(w), .groups = "drop")
# Join with population targets (targets_df) by province and sex. Compute adjustment factor 'f' = target population / current weighted total. This factor will scale weights up (if undercounted) or down (if overcounted). Keep only province, sex, and f.
adj <- tots %>% left_join(targets_df, by = c("province","sex")) %>%
mutate(f = population / totalW) %>% select(province, sex, f)
# Join adjustment factors back to individual-level data, then multiply each person’s weight by their group’s adjustment factor 'f'. 'pull(final)' extracts the adjusted weight vector.
out <- tmp %>% left_join(adj, by = c("province","sex")) %>%
mutate(final = w * f) %>% pull(final)
# Reconstruct a weight vector of original dataset length, initialized with NA. Then, only for rows where the original replicate weight was non-NA, insert the adjusted weight. This preserves row order and NA structure — critical for replicate weights.
res <- rep(NA_real_, nrow(df))
res[!is.na(df[[repcol]])] <- out
res
}
# Identify all columns in bio_base that match the pattern "^trmbtnr1w\\d{3}$" — i.e., all trimmed biomarker nonresponse-adjusted replicate weights (e.g., trmbtnr1w001, trmbtnr1w002, ...). 'grep(..., value = TRUE)' returns the actual column names as a character vector.
bio_rep_cols <- grep("^trmbtnr1w\\d{3}$", names(bio_base), value = TRUE)
# For each trimmed biomarker replicate weight column (e.g., trmbtnr1w001), generate a new column name by replacing "trmbtnr1w" with "btwt" (→ btwt001). Then apply the poststrat_one_biorep() function to that column, using population targets (individual_reference_totals_adj), and store the result in the new column. This creates final post-stratified biomarker replicate weights.
for (rc in bio_rep_cols) {
newname <- sub("^trmbtnr1w", "btwt", rc)
bio_base[[newname]] <- poststrat_one_biorep(bio_base, rc, individual_reference_totals_adj)
}
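# Verification (a sketch): after post-stratification, the weighted totals for any one replicate
# (here btwt001) should reproduce the province-by-sex population targets; 'diff' should be zero
# up to floating-point noise in every cell.
bio_base %>%
transmute(province = stringr::str_sub(EA_HHID_FIXED, 3, 4),
sex = SEX,
w = btwt001) %>%
filter(!is.na(w)) %>%
group_by(province, sex) %>%
summarise(totalW = sum(w), .groups = "drop") %>%
left_join(individual_reference_totals_adj, by = c("province", "sex")) %>%
mutate(diff = totalW - population)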
# Create final clean biomarker weights dataset: join with main biomarker weight (btwt0 from poststrat_bio_wgts) using EA_HHID_LN_FIXED, then select only essential columns — unique person ID, household ID, sampling design variables (varunit, varstrat), main weight (btwt0), and all replicate weights (btwt001, btwt002, ...). This is ready for analysis.
final_bloodwts_reps <- bio_base %>%
left_join(
poststrat_bio_wgts %>% select(EA_HHID_LN_FIXED, btwt0),
by = "EA_HHID_LN_FIXED" ) %>%
select(EA_HHID_LN_FIXED, EA_HHID_FIXED, varunit, varstrat, btwt0, matches("^btwt\\d{3}$")) |>
# GR 10-08 set missing weights to 0.0 to make it cleaner
mutate(across(starts_with("btwt"), \(x) replace_na(x, 0.0)))
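# Sanity check (sketch): after the replace_na() above, rows with btwt0 == 0 should correspond
# exactly to interview respondents who did not give blood.
final_bloodwts_reps %>%
count(zero_weight = btwt0 == 0)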
# Attach final household weights (hhwt0, hhwt001, ...) to the main household dataset (cd_hh) by matching on household ID. Result: every household has its analysis and replicate weights attached.
hh_data_weights <- cd_hh %>%
left_join(final_hh_weights_rename)
## Joining with `by = join_by(EA_HHID_FIXED, HHI_EACODE, HHI_PROVINCE,
## UPCODE_STAT_HH, HH_STATUS)`
# Attach final individual weights (intwt0, intwt001, ...) to the main individual dataset (cd_csproindiv) by matching on person ID. Result: every individual has analysis and replicate weights for estimating individual-level indicators.
indiv_data_weights <- cd_csproindiv %>%
left_join(final_indwts_reps)
## Joining with `by = join_by(EA_HHID_LN_FIXED)`
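# With main and replicate weights attached, downstream variance estimation can use a
# replicate-weight design from the survey package (loaded in the setup chunk). A minimal
# sketch, assuming JK1-style jackknife replicates — the 'type'/'scale' settings must match how
# the replicates were actually constructed — with 'age' as a hypothetical analysis variable:
# ind_design <- svrepdesign(data = indiv_data_weights,
#                           weights = ~intwt0,
#                           repweights = "intwt[0-9]{3}",
#                           type = "JK1",
#                           combined.weights = TRUE)
# svymean(~age, ind_design, na.rm = TRUE)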
# Rename the biomarker dataset’s person ID column from 'ea_hhid_ln_fixed' to standard 'EA_HHID_LN_FIXED', then attach final biomarker weights (btwt0, btwt001, ...). Result: every biomarker respondent has weights calibrated for nonresponse and post-stratified to population totals.
bio_data_weights <- biomarker %>%
rename(EA_HHID_LN_FIXED=ea_hhid_ln_fixed) %>%
left_join(final_bloodwts_reps)
## Joining with `by = join_by(EA_HHID_LN_FIXED)`
output_dir <- "C:/Users/kupamupindit/OneDrive - Columbia University Irving Medical Center/Documents/ICAP work/Countries/DRC/Weighting/Output_data/"
export(hh_data_weights, str_c(output_dir, "cd_household_weights_16102025.csv"))
export(indiv_data_weights, str_c(output_dir, "cd_individual_weights_16102025.csv"))
export(bio_data_weights, str_c(output_dir, "cd_biomarker_weights_16102025.csv"))
# output_dir <- "G:/Shared drives/DMA/PHIA/Weighting/CODPHIA2/Output_data/"
# export(hh_data_weights, str_c(output_dir, "cd_household_weights_", today(), ".csv"))
# export(indiv_data_weights, str_c(output_dir, "cd_individual_weights_", today(), ".csv"))
# export(bio_data_weights, str_c(output_dir, "cd_biomarker_weights_", today(), ".csv"))
# Random sample of households
# Randomly select 10 households from the final household weights dataset for visual inspection. 'slice_sample(n = 10)' randomly picks 10 rows without replacement, giving a quick snapshot of weight behavior across different households.
hh_sample <- hh_data_weights |>
slice_sample(n = 10)
# Reshape the 10 sampled households from wide format (each weight in its own column: hhwt0, hhwt001, ...) to long format, where each row represents one weight value for one household. 'pivot_longer()' collapses all columns starting with "hhwt" into two new columns: 'weight_num' (the name of the weight, e.g., "hhwt001") and 'weight_val' (the numeric value). This is required for ggplot to handle multiple weights in a single plot.
hh_sample_long <- hh_sample |>
pivot_longer(cols = starts_with("hhwt"), names_to = "weight_num", values_to = "weight_val")
# Create a boxplot where each household (x-axis = EA_HHID_FIXED) has a box showing the distribution of its weights (main + replicates). This helps identify households with extreme weights, or where replicate weights vary wildly — potential red flags for weighting errors or instability.
hh_sample_long |>
ggplot(aes(x = EA_HHID_FIXED, y = weight_val)) +
geom_boxplot()
# Select only the main household weight (hhwt0) and first two replicates (hhwt001, hhwt002) from the full dataset. Reshape them to long format (same as above), then create overlapping density plots — one for each weight type. The 'fill = weight_num' colors each density curve differently, and 'alpha = 0.25' makes them semi-transparent so you can see overlaps. This checks whether replicate weights follow similar distributions to the main weight — they should.
hh_data_weights |>
select(hhwt0, hhwt001, hhwt002) |>
pivot_longer(cols = starts_with("hhwt"), names_to = "weight_num", values_to = "weight_val") |>
ggplot(aes(x = weight_val, fill = weight_num)) +
geom_density(alpha = 0.25)
# Random sample of individuals
# Randomly select 10 individuals from the individual weights dataset to inspect their weight distributions. Again, 'slice_sample(n = 10)' ensures random, unbiased selection.
ind_sample <- indiv_data_weights |>
slice_sample(n = 10)
# Reshape the 10 sampled individuals from wide to long format: collapse all columns starting with "intwt" into 'weight_num' and 'weight_val'. This prepares the data for plotting multiple weights per person.
ind_sample_long <- ind_sample |>
pivot_longer(cols = starts_with("intwt"), names_to = "weight_num", values_to = "weight_val")
# Boxplot of weights for each sampled individual (x-axis = unique person ID). Shows how much replicate weights vary for each person — large spreads may indicate instability or extreme adjustment factors.
ind_sample_long |>
ggplot(aes(x = EA_HHID_LN_FIXED, y = weight_val)) +
geom_boxplot()
# Density plot comparing distributions of main individual weight (intwt0) and first two replicates (intwt001, intwt002). Overlapping, similarly shaped curves indicate replicates were properly generated and reflect the same underlying weighting adjustments.
indiv_data_weights |>
select(intwt0, intwt001, intwt002) |>
pivot_longer(cols = starts_with("intwt"), names_to = "weight_num", values_to = "weight_val") |>
ggplot(aes(x = weight_val, fill = weight_num)) +
geom_density(alpha = 0.25)
## Warning: Removed 635 rows containing non-finite outside the scale range
## (`stat_density()`).
# Compute and display the maximum value found in the first individual replicate weight column (intwt001), ignoring NA values. This is a quick diagnostic to check for extreme weights that might distort estimates or indicate errors in weighting logic.
max(indiv_data_weights$intwt001, na.rm = TRUE)
## [1] 5508.577
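# A complementary tail diagnostic (sketch): the upper quantiles of the main weight and the
# first replicate should be of similar magnitude.
indiv_data_weights %>%
summarise(across(c(intwt0, intwt001),
~ quantile(.x, probs = 0.99, na.rm = TRUE),
.names = "p99_{.col}"))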
# Create a subset of individuals whose main weight (intwt0) exceeds 20,000 — these are potential outliers. You can inspect them manually to see if weights are justified (e.g., rare demographic in underrepresented area) or if they indicate a problem (e.g., model failure, data error).
bigwt <- indiv_data_weights |>
filter(intwt0 > 20000)
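# Inspect the flagged records (sketch), largest weights first:
bigwt %>%
select(EA_HHID_LN_FIXED, intwt0) %>%
arrange(desc(intwt0))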
# Random sample of bioweights
# Randomly select 10 biomarker respondents to inspect their final calibrated weights. These individuals participated in both interview and biomarker collection.
bio_sample <- bio_data_weights |>
slice_sample(n = 10)
# Reshape the 10 sampled biomarker respondents from wide to long format: collapse all "btwt" columns into 'weight_num' and 'weight_val' for plotting.
bio_sample_long <- bio_sample |>
pivot_longer(cols = starts_with("btwt"), names_to = "weight_num", values_to = "weight_val")
# Boxplot of biomarker weights for each sampled individual. Since biomarker weights often undergo multiple adjustments (individual NR, biomarker NR, post-stratification), they can be more volatile — this plot helps spot problematic cases.
bio_sample_long |>
ggplot(aes(x = EA_HHID_LN_FIXED, y = weight_val)) +
geom_boxplot()
# Density plot comparing main biomarker weight (btwt0) and first two replicates (btwt001, btwt002). Ensures that after all adjustments, replicate weights still resemble the main weight in distribution — critical for valid variance estimation.
bio_data_weights |>
select(btwt0, btwt001, btwt002) |>
pivot_longer(cols = starts_with("btwt"), names_to = "weight_num", values_to = "weight_val") |>
ggplot(aes(x = weight_val, fill = weight_num)) +
geom_density(alpha = 0.25)
# Spot-check the weight for one specific individual (ID = "CI010100214067300106"). This is useful for debugging, auditing, or validating that weights were correctly calculated and joined. Pull their record from the post-stratified individual weights dataset to see all their weight values in context.
check <- poststrat_int_wgts %>%
filter(EA_HHID_LN_FIXED=="CI010100214067300106")