This notebook contains organised code for each step of the weighting process for the CODPHIA data. Ce carnet contient du code organisé pour chaque étape du processus de pondération des données de l’enquête CODPHIA.

The major steps and progress (at 16/05/2025) are as follows: Les étapes principales et leur état d’avancement (au 16/05/2025) sont les suivantes :

[done] Load and organise/join the required datasets [terminé] Charger et organiser/fusionner les ensembles de données requis
[done] Calculate PSU weights [terminé] Calculer les poids des ZD (unités primaires de sondage)
[done] Calculate household weights [terminé] Calculer les poids des ménages
[started] Calculate person-level interview weights [en cours] Calculer les poids des individus pour les entretiens
[not started] Calculate person-level blood test weights [non commencé] Calculer les poids des individus pour les tests sanguins
[started] Create jackknife replicate weights [en cours] Créer les poids de réplique jackknife

1 Calculate EA weights

1.1 What Are Base Weights?

Base weights compensate for unequal selection probabilities due to multi-stage sampling designs.
They are the inverse of the probability of selection at each stage. Make reference to the CODPHIA second stage sample, the base weight is psuwt0 = column(ES) = ADJ_PSU_WT, while the ADJ_PSU_WT = 1/column(ER) = 1/ADJ_PROB2

\[ \text{PSU weight} = \frac{1}{\text{Selection Probability}} \]

1.2 Loading data to calculate base weights

load second stage sample, remove duplicates in terms of EAs so that you are left with 396 EAs. The reason for using second stage sample instead of 1st stage sample is that the 2nd stage sample has all information from 1st stage and from the household listing

# This section is to import packages
# and prepare for importing data
rm(list = ls())



# clean/remove everything
#install.packages("pacman")
# we use pacman to import all packages
pacman::p_load(tidyverse,
               haven,
               rio, 
               ggplot2, 
               janitor, 
               survey, 
               lubridate, 
               readxl,
               BART,
               future,
               furrr,
               progressr,
               caret,
               progressr,
               tinytex
               
)


# create path to the datasets
drc_sample_dir <- "C:/Users/Kupamupindit/OneDrive - Columbia University Irving Medical Center/Documents/ICAP work/Countries/DRC/Sampling/2nd_stage_sampling/"



###################################################################

# read_excel comes from the package readxl, str_c means concatenate 2 strings as shown below
# if you press shift+ctr+M you will get a pipe = %>% 
# if you press ctr+s you will save your script, its a shortcut for saving 
# you can google shortcuts for Rstudio, it will give you a whole list of all these shortcuts
# clean_name comes from the package janitor, the purpose of clean_name is to make your variable names proper
drc_zd_in <- read_excel(str_c(drc_sample_dir,"CODPHIA_second_stage_sampling_09_08_2024.xlsx")) %>%
  clean_names() %>% 
  distinct(ea_id, .keep_all=TRUE)

1.2.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.1

2 calculating base weights

2.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.1

psu_base_weights <- drc_zd_in |> 
  mutate(segment_adjustment = if_else(is.na(segment), 1, segmentsize / easize),
         segadj = adj_prob2 / adj_prob1,
         psuwt0 = 1/(adj_prob1 * segment_adjustment))

# Check
psu_base_weights |> 
  ggplot(aes(x = adj_psu_wt, y = psuwt0)) +
  geom_point() +
  geom_smooth(method = lm)

## `geom_smooth()` using formula = 'y ~ x'

psu_base_weights |> 
  ggplot(aes(x = factor(strata),
             y = psuwt0)) +
  geom_boxplot() +
  labs(x = "Stratum/Region",
       y = "PSU base weight") +
  theme_bw()

psu_base_weights %>% 
  group_by(province) |> 
  mutate(psu_num_instrata = row_number()) |> 
  ggplot(aes(x = psu_num_instrata,
             y = psuwt0, 
             fill = factor(type_secteur, levels = c(1,2), 
                           labels = c("Urban", "Rural")))) +
  geom_col() +
  facet_wrap(~province) +
  theme_bw() +
  labs(fill = "Urban/Rural",
       x = "PSU number within stratum",
       y = "PSU base weight")

psu_weights <- psu_base_weights %>% 
  mutate(EA_ID=ea_id) %>% 
  select(EA_ID, -ea_id, psuwt0)

3 Household base weights

There are several steps for the household weights:

Calculate the within-EA household sampling weight.
Calculate the weight adjustment for households of unknown eligibility.
Calculate the weight adjustment for non-responding households.

3.1 Calculate the within-EA household sampling weight.

We need the counts of households listed and selected to compute the household selection probability and weight.

The sample design was to take a systematic random sample \({h_i}\) households from each EA. The household selection probability, \(p_{hh,i}\) in EA \(i\) is then

\[ p_{hh,i} = \frac{h_i}{N_{\textrm{elig},i}}, \] where \({h_i}\) is the target number of household per each EA \({i}\) and \(N_{\textrm{elig},i}\) is the number of eligible households in EA \(i\).

# ─────────────────────────────────────────────────────────────────────
# Step 1: Load listing and sampling files, then compute base household weights
# Étape 1 : Charger les fichiers de dénombrement et d’échantillonnage, puis calculer les poids de base
# ─────────────────────────────────────────────────────────────────────

# Load the household listing data with number of counted households per EA
# Charger les données de dénombrement avec le nombre de ménages comptés par ZD
listing <- read_dta(str_c(drc_sample_dir, "masterlist_prox1.dta")) %>%
  select(EA_ID, HH_NUM_CNT)  # Keep only EA ID and household count
  # Ne garder que EA_ID et le nombre de ménages (HH_NUM_CNT)

# Load the second-stage sampling file and remove any redundant HH count column
# Charger le fichier d’échantillonnage de deuxième phase et supprimer la colonne HH_NUM_CNT si présente
sampled <- read_excel(str_c(drc_sample_dir, "CODPHIA_second_stage_sampling_09_08_2024.xlsx")) %>%
  select(-HH_NUM_CNT) %>%
  
  # Join the listing data to attach HH counts to sampled EAs
  # Joindre les données de dénombrement pour associer le nombre de ménages aux ZD échantillonnées
  left_join(listing, by = "EA_ID")




# ─────────────────────────────────────────────────────────────────────
# Step 2: Compute household selection weights
# Étape 2 : Calculer les poids de sélection des ménages
# ─────────────────────────────────────────────────────────────────────

hh_selection_weights <- sampled %>%
  
  # Join with PSU weights to compute combined household weights
  # Joindre avec les poids des grappes (PSU) pour calculer les poids combinés
  left_join(psu_weights, by = "EA_ID") %>%
  
  # Calculate household selection weight: max(1, HH_NUM_CNT / Sampled_HH)
  # Calculer le poids de sélection du ménage : max(1, HH_NUM_CNT / Sampled_HH)
  mutate(
    hh_selection_weight = pmax(1.0, HH_NUM_CNT / Sampled_HH),
    
    # Final base weight: product of PSU weight and household selection weight
    # Poids de base final : produit du poids PSU et du poids de sélection du ménage
    hhbwt0 = psuwt0 * hh_selection_weight
  ) %>%
  
  # Keep only one record per EA
  # Garder une seule ligne par ZD
  distinct(EA_ID, .keep_all = TRUE)


# ─────────────────────────────────────────────────────────────────────
# Step 3: Summarize and visualize the distribution of base household weights
# Étape 3 : Résumer et visualiser la distribution des poids de base des ménages
# ─────────────────────────────────────────────────────────────────────

# Show summary statistics of the household base weights
# Afficher les statistiques résumées des poids de base des ménages
summary(hh_selection_weights$hhbwt0)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   40.81  200.12  203.27  375.72  654.91  749.55

# Plot histogram of the household weights to check distribution
# Tracer un histogramme des poids des ménages pour vérifier leur distribution
hist(hh_selection_weights$hhbwt0, breaks = 40)

# Grouped summary by province to explore weight variation
# Résumé groupé par province pour explorer la variation des poids

hh_selection_weights |>
  group_by(Province) |>
  summarise(
    min = min(hhbwt0),
    low_Q = quantile(hhbwt0, 0.25),  # 1st quartile
    medianwt = median(hhbwt0),
    meanwt = mean(hhbwt0),
    upp_Q = quantile(hhbwt0, 0.75),  # 3rd quartile
    max = max(hhbwt0)
  )

## # A tibble: 2 × 7
##   Province       min low_Q medianwt meanwt upp_Q   max
##   <chr>        <dbl> <dbl>    <dbl>  <dbl> <dbl> <dbl>
## 1 Haut-Katanga 408.   653.     657.   650.  661.  750.
## 2 Lualaba       40.8  199.     201.   214.  203.  634.

# Visualize household base weights by province using boxplots
# Visualiser les poids de base des ménages par province avec des diagrammes en boîte

hh_selection_weights |>
  ggplot(aes(y = hhbwt0)) +
  facet_wrap(~Province) +  # One boxplot per province
  geom_boxplot() +
  theme_bw() +
  labs(x = "Household Base Weight",     # X-axis label
       y = "Province")                  # Y-axis label

3.2 Calculate the weight adjustment for households of unknown eligibility

The variable UPCODE_STAT_HH on the household interview file contains the final response status for each household. The codes are:

UPCODE_STAT_HH	Definition
1	Respondent
2	Non-respondent
3	Ineligible
4	Unknown eligibility

In this step, the weights of the households with unknown eligibility are re-distributed to the other households in the EA. The adjustment factor in EA \(i\) is computed as

\[ A^{\textrm{HH elig}}_i = \Sigma_{j=1}^{n_i} W_{ij} / \Sigma_{j = 1}^{n_{i,1} + n_{i,2} + n_{i,3}} W_{ij}. \] The values \(n_{i,1-3}\) are the number of households in EA \(i\) with final status equal to 1, 2, or 3, as defined in the table above. The total number of selected households is \(n_i\), and \(W_{ij}\) is the household base weight of household \(j\) in EA \(i\). Note that because of the sample design, all households in each EA have the same selection probability/weight, so the value of this weight cancels out in computing the adjustment factor.

# ─────────────────────────────────────────────────────────────────────────────
# Load and process DRC household interview data to assign household status
# Charger et traiter les données d'entretien des ménages en RDC pour attribuer un statut
# ─────────────────────────────────────────────────────────────────────────────

# Define file paths
# Définir les chemins d’accès aux fichiers
drc_sample_dir <- "C:/Users/Kupamupindit/OneDrive - Columbia University Irving Medical Center/Documents/ICAP work/Countries/DRC/Data/"
indir <- "C:/Users/kupamupindit/OneDrive - Columbia University Irving Medical Center/Documents/ICAP work/Countries/DRC/weighting/Data/"

# Load the household interview dataset from SAS file
# Charger le fichier d’entretien des ménages à partir d’un fichier SAS
drc_hh <- read_sas(str_c(indir, "cd_ffcorr_hh_int_draft_20241004.sas7bdat")) %>%
  
  # Exclude any household flagged for discarding
  # Exclure tout ménage marqué comme à rejeter
  filter(is.na(DMFLAG) | DMFLAG != "DISCARD")

# Create derived household status variable
# Créer une variable dérivée du statut du ménage
drc_hhstatus <- drc_hh %>%
  mutate(
    # Combine resolution codes for flexibility
    # Combiner les codes de résultat pour plus de flexibilité
    COMB_UPCODED_RESLTNDT = coalesce(as.numeric(UPCODE_RESLTNDT), as.numeric(RESULTNDT)),
    
    # Classify households into status categories
    # Classer les ménages en catégories de statut
    UPCODE_STAT_HH = case_when(
      
      # Status 1: Eligible Respondent
      # Statut 1 : Répondant éligible
      is.na(RESULTNDT) & (STARTINT == 1 & HHELIG == 1 & HHCONSTAT == 1 &
                          !is.na(HHQDTHSINS) & !is.na(ROSTER_START) & !is.na(HHQINSHH) &
                          !is.na(HHQASSIGN_INST)) ~ 1,
      
      is.na(RESULTNDT) & (STARTINT == 4 & !is.na(ROSTER_START)) ~ 1,
      
      # Status 2: Eligible Non-respondent
      # Statut 2 : Ménage éligible non répondant
      COMB_UPCODED_RESLTNDT %in% c(1, 2, 7, 8, 9) ~ 2,
      is.na(RESULTNDT) & (HHELIG == 2 | HHCONSTAT %in% c(2, 3) |
                          (HHELIG == 1 & is.na(HHCONSTAT)) |
                          STARTINT == 4 & is.na(ROSTER_START)) ~ 2,
      
      # Status 3: Ineligible Household
      # Statut 3 : Ménage non éligible
      COMB_UPCODED_RESLTNDT %in% c(3, 4, 6) ~ 3,
      
      # Status 4: Unknown Eligibility
      # Statut 4 : Éligibilité inconnue
      is.na(COMB_UPCODED_RESLTNDT) | COMB_UPCODED_RESLTNDT %in% c(5, 96) ~ 4,
      
      # Default to unknown if no rule applies
      # Par défaut : statut inconnu si aucune règle ne s’applique
      .default = 4
    )
  )

# Create a frequency table of household statuses
# Créer un tableau de fréquence des statuts des ménages
drc_hhstatus |> 
  tabyl(UPCODE_STAT_HH)

##  UPCODE_STAT_HH    n    percent
##               1 1788 0.76475620
##               2  332 0.14200171
##               3  181 0.07741660
##               4   37 0.01582549

# Visualize the proportion of households by response status within each enumeration area
# Visualiser la proportion des ménages par statut de réponse dans chaque zone de dénombrement

drc_hhstatus |> 
  
  # Rename column UPCODE_STAT_HH to hh_status for clarity
  # Renommer la colonne UPCODE_STAT_HH en hh_status pour plus de clarté
  rename(hh_status = UPCODE_STAT_HH) %>%
  
  # Start a ggplot to create a stacked proportional bar chart
  # Démarrer un ggplot pour créer un diagramme en barres empilées proportionnelles
  ggplot(aes(x = factor(HHI_EACODE),                 # X-axis: enumeration area (as a factor)
             fill = factor(hh_status, levels = 1:4,  # Fill color: household response status
                           labels = c("Eligible Respondent",         # 1
                                      "Eligible Non-respondent",     # 2
                                      "Ineligible Household",        # 3
                                      "Unknown Eligibility")))) +    # 4
  
  # Use a stacked bar chart with proportions (each bar sums to 1)
  # Utiliser un diagramme en barres empilées avec des proportions (chaque barre fait 100 %)
  geom_bar(position = position_fill(reverse = TRUE)) +
  
  # Use a colorblind-friendly color palette (viridis)
  # Utiliser une palette de couleurs adaptée au daltonisme (viridis)
  scale_fill_viridis_d() +
  
  # Add axis and legend labels
  # Ajouter des étiquettes aux axes et à la légende
  labs(x = "Region",           # Label for x-axis
       y = "Proportion",       # Label for y-axis
       fill = "hh_status") +   # Legend title for household status
  
  # Apply a clean black-and-white theme for the plot
  # Appliquer un thème noir et blanc pour un rendu épuré
  theme_bw()

# Prepare the household selection weights and create a standardized enumeration area code
# Préparer les poids de sélection des ménages et créer un code de zone de dénombrement standardisé

selected_hh <- hh_selection_weights %>%
  
  # Keep only one unique observation per EA_ID (to avoid duplication)
  # Conserver une seule observation unique par EA_ID (pour éviter les doublons)
  distinct(EA_ID, .keep_all = TRUE) %>%
  
  # Select only the enumeration area ID and the base household weight
  # Sélectionner uniquement l’identifiant de la zone de dénombrement et le poids de base du ménage
  select(EA_ID, hhbwt0) %>%
  
  # Create a new variable HHI_EACODE by prefixing EA_ID with "CD"
  # Créer une nouvelle variable HHI_EACODE en ajoutant le préfixe "CD" à EA_ID
  mutate(HHI_EACODE = paste0("CD", EA_ID))

hh_adj_one is the first-stage household non-response adjustment accounting for eligible and unknown eligibility.

Only status 1 (respondent) and status 2 (eligible non-respondent) receive non-zero adjusted weights.

Statuses 3 and 4 (unknown or ineligible) are given a weight of 0 in hh_wt_one.

# Step 1: Start preparing the household dataset for unknown eligibility adjustment
# Étape 1 : Préparer les données des ménages pour l'ajustement d’éligibilité inconnue

hh_uknwn_adj <- drc_hhstatus %>%
  
  # Filter out rows with missing enumeration area codes (HHI_EACODE)
  # Exclure les lignes avec des codes de zone de dénombrement manquants (HHI_EACODE)
  filter(!is.na(HHI_EACODE)) %>%
  
  # Keep only the relevant columns: enumeration area, province, and status code
  # Conserver uniquement les colonnes pertinentes : zone de dénombrement, province et code de statut
  select(HHI_EACODE, HHI_PROVINCE, UPCODE_STAT_HH) %>%
  
  # Merge in the initial selection weights (hhbwt0) using the EA code
  # Joindre les poids initiaux de sélection (hhbwt0) à l’aide du code de zone de dénombrement
  left_join(selected_hh, by = "HHI_EACODE") %>%
  
  # Group by enumeration area to compute the adjustment factor at the EA level
  # Regrouper par zone de dénombrement pour calculer le facteur d'ajustement à ce niveau
  group_by(HHI_EACODE) %>%
  
  # Step 2: Create household response status and calculate adjustment factors
  # Étape 2 : Créer le statut de réponse des ménages et calculer les facteurs d’ajustement
  mutate(
    # Assign response disposition code to new variable hhstatus
    # Attribuer le code de disposition de réponse à la nouvelle variable hhstatus
    hhstatus = UPCODE_STAT_HH,
    
    # Compute adjustment factor (hh_adj_one) based on eligible or unknown households
    # Calculer le facteur d'ajustement (hh_adj_one) basé sur les ménages éligibles ou à éligibilité inconnue
    hh_adj_one = sum(hhbwt0) / (
      sum(hhbwt0 * (hhstatus == 1)) +  # Respondents
      sum(hhbwt0 * (hhstatus == 2)) +  # Eligible non-respondents
      sum(hhbwt0 * (hhstatus == 3))    # Unknown eligibility
    ),
    
    # Assign adjusted weights based on response status
    # Attribuer les poids ajustés en fonction du statut de réponse
    hh_wt_one = case_when(
      hhstatus %in% c(1, 2) ~ hh_adj_one * hhbwt0,  # Apply adjusted weight to eligible households
      hhstatus %in% c(3, 4) ~ 0                    # Zero weight for unknown/ineligible households
    )
  ) %>%
  
  # Ungroup to return to a regular data frame structure
  # Supprimer le regroupement pour revenir à une structure de données régulière
  ungroup()

3.3 Compute non-response adjusted household weight

Non-response adjustment for households is done by dividing the responding household weights by the response rate in each EA. The non response adjustment factor is

\[ A^{\textrm{HH NR}}_i = \Sigma_{j = 1}^{n_{i,1} + n_{i,2}} W^{\textit{hh2}}_{ij} / \Sigma_{j=1}^{n_i} W^{hh2}_{ij}, \] where \(W^{\textit{hh2}}_{ij}\) is the eligibility-adjusted weight of household \(j\) in EA \(i\), and the other quantities are as in the previous equation.

# Adjust household weights for non-response at the EA (Enumeration Area) level
# Ajuster les poids des ménages pour la non-réponse au niveau des ZD (Zones de Dénombrement)

hh_nr_adj <- hh_uknwn_adj %>%
  
  # Group data by enumeration area (HHI_EACODE)
  # Regrouper les données par zone de dénombrement (HHI_EACODE)
  group_by(HHI_EACODE) %>%
  
  # Calculate adjusted weights and apply them only to responding households
  # Calculer les poids ajustés et les appliquer uniquement aux ménages répondants
  mutate(
    # Compute the adjustment factor: total weight divided by weight of responding households
    # Calculer le facteur d’ajustement : poids total divisé par le poids des ménages répondants
    hh_adj_two = sum(hh_wt_one) / (sum(hh_wt_one * (hhstatus == 1))),
    
    # Apply the adjustment factor only to households with hhstatus == 1
    # Appliquer le facteur d’ajustement uniquement aux ménages avec hhstatus == 1
    hh_wt_two = if_else(hhstatus == 1, 
                        hh_adj_two * hh_wt_one,  # Adjusted weight
                        0)                       # Non-respondents get weight = 0
  ) %>%
  
  # Remove grouping to return to normal (ungrouped) data frame
  # Supprimer le regroupement pour revenir à un tableau de données non groupé
  ungroup()


# Prepare final household weight dataset with standard variable names
# Préparer le jeu de données final des poids des ménages avec des noms de variables standardisés

final_hh_weights <- hh_nr_adj %>%
  
  # Rename variables for final output: EAID for enumeration area, hhwt0 for final household weight
  # Renommer les variables pour le fichier final : EAID pour zone de dénombrement, hhwt0 pour poids final
  rename(EAID = HHI_EACODE, 
         hhwt0 = hh_wt_two)

# Create a boxplot showing distribution of the second household adjustment factor by province
# Créer un diagramme en boîte montrant la distribution du deuxième facteur d'ajustement des ménages par province

hh_nr_adj |> 
  
  # Keep one unique observation per HHI_EACODE (to avoid duplication)
  # Garder une seule observation unique par HHI_EACODE (pour éviter les doublons)
  distinct(HHI_EACODE, .keep_all = TRUE) |>
  
  # Start a ggplot: define x and y axes
  # Créer le graphique avec ggplot : définir les axes x et y
  ggplot(aes(x = hh_adj_two,              # X-axis: the second household adjustment factor
             y = HHI_PROVINCE)) +         # Y-axis: the province or region
  
  # Add a boxplot layer to show the distribution of values by province
  # Ajouter un diagramme en boîte pour montrer la distribution des valeurs par province
  geom_boxplot() +
  
  # Use a clean black-and-white theme for the plot
  # Utiliser un thème en noir et blanc pour un rendu clair
  theme_bw() +
  
  # Add axis labels and a title for clarity
  # Ajouter des étiquettes d’axes pour plus de clarté
  labs(x = "HH adjustment factor two",    # Label for x-axis
       y = "Region")                      # Label for y-axis

4 Calculating person-level interview weights

4.1 Load roster and individual data

cd_csproroster <- import(str_c(indir, "cd_ffcorr_roster_draft_20241004.sas7bdat"),
                  guess_max = 10000) |> 
  filter(is.na(DMFLAG) | DMFLAG != "DISCARD") 

cd_csproindiv <- import(str_c(indir, "cd_ffcorr_ind_int_draft_20241004.sas7bdat"),
                  guess_max = 10000) |> 
  filter(is.na(DMFLAG) | DMFLAG != "DISCARD") %>% 
  mutate(roster_elig=PRIMARY_ROSTER_STATUS) # we need to check this
  
#  mutate(roster_elig=PRIMARY_ROSTER_STATUS) # need to relook a this, could not find the same variable as RUPHIA
# just used PRIMARY_ROSTER_STATUS as a placeholder for roster_elig

# Create a dataset called roster_status_vars by starting from the household roster data (cd_csproroster)
# Crée un jeu de données nommé roster_status_vars à partir des données du registre des ménages (cd_csproroster)
roster_status_vars <- cd_csproroster %>%
  
  # Create a new column indiv_status by copying values from ROSTERED_FINAL_STATUS
  # Crée une nouvelle colonne indiv_status à partir des valeurs de ROSTERED_FINAL_STATUS
  mutate(indiv_status = ROSTERED_FINAL_STATUS) %>%
  
  # Join with the individual-level dataset cd_csproindiv using EA_HHID_LN_FIXED as the key
  # Joindre avec le jeu de données individuel cd_csproindiv en utilisant EA_HHID_LN_FIXED comme clé
  left_join(select(cd_csproindiv, EA_HHID_LN_FIXED, INDSTATUS, roster_elig)) %>%
  
  # Keep only relevant variables for analysis and rename some for clarity
  # Garder uniquement les variables pertinentes pour l'analyse et renommer certaines pour plus de clarté
  select(roster_sex = SEX,                  # Rename SEX to roster_sex
         roster_age = AGEYEARS,             # Rename AGEYEARS to roster_age
         RELATTOHH,                         # Relationship to head of household
         SLEEPHERE,                         # Whether the individual sleeps in the household
         EA_HHID_FIXED, EA_HHID_LN_FIXED,   # Household identifiers
         indiv_status, ELIGIBLE1,           # Individual status and eligibility
         HHR_EACODE, INDSTATUS              # Enumeration area code and individual status from join
  )

## Joining with `by = join_by(EA_HHID_LN_FIXED)`

indiv_status_vars <- cd_csproindiv %>%
  select(IND_PROVINCE, IND_EACODE, 
         EA_HHID_FIXED, EA_HHID_LN_FIXED, INDISPSTATUS, CONFAGEY)

# Create a separate dataset for individual-level status variables
# Crée un jeu de données distinct pour les variables de statut individuel
indiv_status_vars <- cd_csproindiv %>%
  
  # Select key geographic and demographic variables for individual analysis
  # Sélectionne des variables géographiques et démographiques clés pour l'analyse individuelle
  select(IND_PROVINCE, IND_EACODE,         # Province and enumeration area
         EA_HHID_FIXED, EA_HHID_LN_FIXED,  # Household identifiers
         INDISPSTATUS, CONFAGEY)           # Individual disposition status and confirmed age

# Merge roster and individual status information, and create classification variables
# Fusionner les informations du registre et du statut individuel, et créer des variables de classification

roster_ind_status_vars <- roster_status_vars %>%
  
  # Convert HHR_EACODE to character and rename as EAID for merging
  # Convertir HHR_EACODE en chaîne de caractères et renommer en EAID pour la fusion
  mutate(EAID = as.character(HHR_EACODE)) %>%
  
  # Join with household weights dataset using EAID
  # Joindre avec les poids des ménages en utilisant EAID
  left_join(select(final_hh_weights, EAID, hhwt0), by = "EAID") %>%
  
  # Join with individual-level status variables
  # Joindre avec les variables de statut au niveau individuel
  left_join(indiv_status_vars, by = c("EA_HHID_FIXED", "EA_HHID_LN_FIXED")) %>%
  
  # Create derived variables
  # Créer des variables dérivées
  mutate(
    # Indicator: 1 if roster age is 15 or older, 0 otherwise, -9 if missing
    # Indicateur : 1 si l'âge dans le registre ≥ 15 ans, 0 sinon, -9 si manquant
    roster_age_gt_15 = if_else(roster_age >= 15, 1, 0, missing = -9),
    
    # Indicator: 1 if confirmed individual age ≥ 15, 0 otherwise, -9 if missing
    # Indicateur : 1 si CONFAGEY ≥ 15 ans, 0 sinon, -9 si manquant
    int_age_gt_15 = if_else(CONFAGEY >= 15, 1, 0, missing = -9),
    
    # Classify each person by residence and age group
    # Classer chaque personne selon la résidence et le groupe d'âge
    resident_status_roster = case_when(
      SLEEPHERE == 1 & roster_age >= 15 ~ "De facto person 15 years or older",
      SLEEPHERE != 1 & roster_age >= 15 ~ "Non-de facto person 15 years or older",
      roster_age < 15 ~ "Person under 15 years",
      TRUE ~ "OTHER"
    ),
    
    # Classification for analysis table (used for tabulation)
    # Classification pour le tableau d'analyse (utilisé pour le regroupement)
    status_for_table = case_when(
      SLEEPHERE == 1 & roster_age >= 15 & int_age_gt_15 == 1 ~ 1,
      SLEEPHERE == 1 & roster_age >= 15 & int_age_gt_15 == 0 ~ 2,
      SLEEPHERE == 1 & roster_age >= 15 & int_age_gt_15 == -9 ~ 3,
      SLEEPHERE != 1 & roster_age >= 15 ~ 4,
      roster_age < 15 ~ 5,
      TRUE ~ 99999
    ),
    
    # Recode ineligible people: assign code 90 to INDSTATUS and 9 to indiv_status
    # Recoder les personnes non éligibles : attribuer 90 à INDSTATUS et 9 à indiv_status
    INDSTATUS = if_else(ELIGIBLE1 == 2, 90, INDSTATUS),
    indiv_status = if_else(ELIGIBLE1 == 2, 9, indiv_status)
  )

# Generate the summary table with counts and weighted estimates
# Générer le tableau récapitulatif avec les effectifs et les estimations pondérées

table_counts <- roster_ind_status_vars %>% 
  
  # Convert status_for_table to a factor with fixed levels
  # Convertir status_for_table en facteur avec des niveaux fixes
  mutate(
    status_for_table = factor(status_for_table, levels = 1:5),
    
    # Create age group description for interview age
    # Créer une description du groupe d'âge basé sur l'âge confirmé
    int_age_desc = case_when(
      status_for_table %in% c(4, 5) ~ "NA",
      int_age_gt_15 == 1 ~ "15+",
      int_age_gt_15 == 0 ~ "Under 15",
      int_age_gt_15 == -9 ~ "Unknown"
    )
  ) %>%
  
  # Group by classification variables and do not drop any factor levels
  # Regrouper par les variables de classification sans supprimer les niveaux
  group_by(status_for_table, resident_status_roster, int_age_desc, .drop = FALSE) %>%
  
  # Summarize: count people and sum household weights
  # Résumer : compter les personnes et additionner les poids des ménages
  summarise(
    rostered_persons = n(),
    wgt_number_rostered = sum(hhwt0, na.rm = TRUE)
  ) %>%
  
  # Reassign label for status 2 to fix consistency in resident_status and age group
  # Réassigner les libellés pour status 2 pour assurer la cohérence
  mutate(
    resident_status_roster = if_else(status_for_table != 2, resident_status_roster, "De facto person 15 years or older"),
    int_age_desc = if_else(status_for_table != 2, int_age_desc, "Under 15")
  ) %>%
  
  # Add a total row to the summary table
  # Ajouter une ligne de total au tableau récapitulatif
  janitor::adorn_totals(where = "row")

## `summarise()` has grouped output by 'status_for_table',
## 'resident_status_roster'. You can override using the `.groups` argument.

4.2 Phase One Interview Weights and Adjust for Eligibility

# ─────────────────────────────────────────────────────────────────────────────
# Construct Phase One Interview Weights and Adjust for Eligibility
# Construction des poids d'entretien phase 1 et ajustement selon l'éligibilité
# ─────────────────────────────────────────────────────────────────────────────

# Step 1: Prepare input data for phase one interview weighting
# Étape 1 : Préparer les données d'entrée pour le calcul des poids de la phase 1

int_wgts_phase_one_input <- select(cd_csproroster, 
                                   EA_HHID_FIXED, EA_HHID_LN_FIXED, HHR_EACODE, SEX, AGEYEARS, SLEEPHERE) %>%
  
  # Merge with individual data to obtain confirmed age (CONFAGEY)
  # Fusionner avec les données individuelles pour obtenir l'âge confirmé (CONFAGEY)
  left_join(select(cd_csproindiv, EA_HHID_LN_FIXED, CONFAGEY), by = "EA_HHID_LN_FIXED") %>%
  
  # Create derived variables for eligibility and stratification
  # Créer des variables dérivées pour l’éligibilité et la stratification
  mutate(
    EAID = as.character(HHR_EACODE),  # Convert EA code to character for merging
    # Convertir le code EA en caractère pour la jointure
    
    roster_age_gt_15 = case_when(     # Determine if individual is 15+
      AGEYEARS >= 0 & AGEYEARS < 15 ~ 0,
      AGEYEARS >= 15 ~ 1,
      TRUE ~ -9
    ),
    
    int_elig_status = case_when(      # Classify individual eligibility status
      roster_age_gt_15 == 1 & CONFAGEY >= 15 ~ 1,   # Eligible and confirmed 15+
      roster_age_gt_15 == 1 & CONFAGEY >= 0 & CONFAGEY < 15 ~ 2, # Misclassified as eligible
      roster_age_gt_15 == 1 ~ 3,      # Missing or unknown confirmed age
      roster_age_gt_15 == 0 ~ 4,      # Under 15
      TRUE ~ 99999
    ),
    
    # Assign individuals into age-sex cells
    # Affecter les individus à des cellules âge-sexe
    age_sex_cell = case_when(
      SEX == 1 & roster_age_gt_15 == 1 & AGEYEARS < 18 ~ "Male 15-17",
      SEX == 1 & roster_age_gt_15 == 1 & AGEYEARS >= 18 ~ "Male 18+",
      SEX == 2 & roster_age_gt_15 == 1 & AGEYEARS < 18 ~ "Female 15-17",
      SEX == 2 & roster_age_gt_15 == 1 & AGEYEARS >= 18 ~ "Female 18+",
      AGEYEARS < 15 ~ "Under 15",
      TRUE ~ "Missing/Unknown"
    )
  ) %>%
  
  # Merge in household base weights for each individual
  # Ajouter les poids de base des ménages à chaque individu
  left_join(select(final_hh_weights, EAID, hhwt0), by = "EAID")



# Step 2: Calculate adjustment factors by age-sex cell and eligibility
# Étape 2 : Calculer les facteurs d'ajustement par cellule âge-sexe et éligibilité

int_wgts_phase_one_adj <- int_wgts_phase_one_input %>%
  
  # Keep only eligible and misclassified individuals
  # Garder uniquement les individus éligibles ou mal classifiés
  filter(int_elig_status %in% c(1, 2, 3)) %>%
  
  # Group by age-sex cell and eligibility category
  # Regrouper par cellule âge-sexe et statut d’éligibilité
  group_by(age_sex_cell, int_elig_status) %>%
  
  # Count individuals and sum household weights
  # Compter les individus et sommer les poids des ménages
  summarise(
    count = n(),
    weighted_count = sum(hhwt0, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  
  # Reshape the table so each eligibility status becomes a column
  # Réorganiser le tableau pour que chaque statut d’éligibilité devienne une colonne
  pivot_wider(
    values_from = c(count, weighted_count),
    names_from = int_elig_status,
    names_glue = "status_{int_elig_status}_{.value}",
    values_fill = 0
  ) %>%
  
  # Calculate the adjustment factor to correct for underestimation due to missing/misclassified data
  # Calculer le facteur d’ajustement pour corriger la sous-estimation due aux données manquantes ou mal classées
  mutate(
    phase_one_int_adj_factor = (status_1_weighted_count + status_3_weighted_count) / status_1_weighted_count
  )


# Step 3: Apply adjustment factor to each eligible individual
# Étape 3 : Appliquer le facteur d’ajustement à chaque individu éligible

int_wgts_phase_one_adjusted <- int_wgts_phase_one_input %>%
  
  # Merge in the adjustment factor based on age-sex cell
  # Ajouter le facteur d'ajustement en fonction de la cellule âge-sexe
  left_join(select(int_wgts_phase_one_adj, age_sex_cell, phase_one_int_adj_factor), 
            by = "age_sex_cell") %>%
  
  # Compute the adjusted individual interview weight
  # Calculer le poids d’entretien individuel ajusté
  mutate(
    phase_one_adj_int_wgt = hhwt0 * phase_one_int_adj_factor
  )


# Step 4: Summarise adjusted weights by age-sex cell
# Étape 4 : Résumer les poids ajustés par cellule âge-sexe

int_wgts_phase_one_adjusted %>%
  group_by(age_sex_cell) %>%
  summarise(
    mean_weight = mean(phase_one_adj_int_wgt, na.rm = TRUE)
  )

## # A tibble: 6 × 2
##   age_sex_cell    mean_weight
##   <chr>                 <dbl>
## 1 Female 15-17           364.
## 2 Female 18+             369.
## 3 Male 15-17             393.
## 4 Male 18+               410.
## 5 Missing/Unknown        NaN 
## 6 Under 15               NaN

5 Calculating person-level interview weights

5.1 Load roster and individual data

indir <- "C:/Users/Kupamupindit/OneDrive - Columbia University Irving Medical Center/Documents/ICAP work/Countries/DRC/Weighting/Data/"
drc_csproroster <- read_sas(str_c(indir, "cd_ffcorr_roster_draft_20241004.sas7bdat")) %>% 
  filter(is.na(DMFLAG) | DMFLAG != "DISCARD") 

drc_csproindiv <- read_sas(str_c(indir, "cd_ffcorr_ind_int_draft_20241004.sas7bdat")) |> 
  filter(is.na(DMFLAG) | DMFLAG != "DISCARD") %>% 
  mutate(roster_elig=PRIMARY_ROSTER_STATUS) # we need to check this
  
#  mutate(roster_elig=PRIMARY_ROSTER_STATUS) # need to relook a this, could not find the same variable as RUPHIA
# just used PRIMARY_ROSTER_STATUS as a placeholder for roster_elig

CODPHIA Weighting