Applying DHS Sample Weights and Survey Design: Water Insecurity Analysis in R

Author

Jesse McDevitt-Irwin

Overview

This worksheet demonstrates the correct use of DHS sample weights and survey design to produce population-representative results, and compares weighted and unweighted analyses for both the JMP Drinking Water Ladder and the WISE (Water Insecurity Experiences) Scale.

Introduction

The DHS uses complex, two-stage stratified sampling. Each sampled household or person therefore represents a different number of households or people in the full population. This is encoded in the sample weights (hv005 for households).

Estimates must account for:

Sample weight: how much of the population each sampled unit represents (use hv005 / 1e6).
PSU: survey cluster or “primary sampling unit” (hv021 in the household file).
Strata: groupings used for sampling (hv023 in the household file).

Proper use gives correct population results and valid confidence intervals.
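
As a rough illustration of why the weight matters, the toy sample below over-samples urban households. The column names mirror the DHS ones, but every value is made up so that the arithmetic is easy to follow; it is not DHS data.

```r
# Toy sample (hypothetical values): 6 urban and 4 rural households.
# Urban households are over-sampled, so each carries a smaller weight.
toy <- data.frame(
  hv025       = c(rep("urban", 6), rep("rural", 4)),
  hv005       = c(rep(500000, 6), rep(1750000, 4)),  # raw DHS-style weights
  basic_water = c(rep(1, 6), rep(0, 4))              # 1 = at least basic water
)

# DHS stores weights multiplied by one million, so rescale first
toy$hh_weight <- toy$hv005 / 1e6

unweighted <- mean(toy$basic_water)
weighted   <- weighted.mean(toy$basic_water, w = toy$hh_weight)

unweighted  # 0.6: share of sampled households
weighted    # 0.3: estimated share of households in the population
```

A weighted mean only corrects the point estimate. Standard errors and confidence intervals still need the PSU and strata information.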

To implement this easily, install the survey package.

Setup

library(tidyverse)
library(haven)
library(survey)

# Clear environment (optional)
rm(list = ls())

# Select relevant household variables (include hv021 and hv023 for the design!)
vars_hr <- c(
  "hhid", "hv001", "hv002", "hv005", "hv008",
  "hv009", "hv024", "hv025", "hv204", "hv201",
  "sh108g", "sh108o", "sh108h", "sh108i", "sh108j", "sh108k",
  "sh108l", "sh108m", "sh108n", "sh108p", "sh108q", "sh108r",
  "hv021", "hv023", "hv271"
)

hr_raw <- read_dta("Raw/MZHR81FL.DTA", col_select = all_of(vars_hr)) %>% 
  mutate(hh_weight = hv005/1e6)

# Rescale the wealth index (hv271) by two standard deviations
sd_wealth <- sd(hr_raw$hv271)
hr_raw$wealth <- hr_raw$hv271 / (2 * sd_wealth)
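
Dividing the wealth index by two standard deviations means that a one-unit change in wealth corresponds to a two-standard-deviation change in the original index. The sketch below shows the effect on a regression slope using simulated data; the variable names and values are illustrative assumptions, not the DHS file.

```r
set.seed(42)

# Simulated stand-ins for the wealth index and an outcome (hypothetical data)
index   <- rnorm(200, mean = 0, sd = 90000)
outcome <- 5 - 0.00002 * index + rnorm(200)

# Same rescaling as above: divide by two standard deviations
scaled <- index / (2 * sd(index))

slope_raw    <- coef(lm(outcome ~ index))[["index"]]
slope_scaled <- coef(lm(outcome ~ scaled))[["scaled"]]

# The rescaled slope is the raw slope multiplied by 2 * sd(index)
all.equal(slope_scaled, slope_raw * 2 * sd(index))

# The rescaled variable has a standard deviation of one half
sd(scaled)
```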

JMP Drinking Ladder

Recoding

Note: as of 24-07-2025, here is new code for creating the JMP drinking ladder.¹ Key changes:

Collapsing “Safely Managed” and “Basic” into one category;
Streamlining hv201 categories.

hr_raw <- hr_raw %>%
  mutate(
    # Minutes to the water source; 996 means water is on the premises
    time_to_water = case_when(
      hv204 == 996 ~ 0,
      hv204 < 996 ~ hv204,
      TRUE ~ NA
    ),
    # Improved (TRUE) vs unimproved (FALSE) source; other codes are left as NA
    w_imp = case_when(
      hv201 %in% c(11, 12, 13, 14, 21, 31, 41, 51, 61, 71) ~ TRUE,
      hv201 %in% c(32, 42, 43) ~ FALSE,
      TRUE ~ NA
    ),
    # Conditions are checked in order, so the first match wins
    jmp_water = case_when(
      w_imp & time_to_water < 31 ~ "At Least Basic",
      w_imp ~ "Limited",
      hv201 == 43 ~ "Surface Water",
      !is.na(w_imp) ~ "Unimproved",
      TRUE ~ NA
    ),
    jmp_water = factor(
      jmp_water,
      levels = c("At Least Basic", "Limited", "Unimproved", "Surface Water")
    )
  )

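A quick way to check the recode is to run the same logic on a handful of made-up households, one for each branch. The tibble below is hypothetical (plain numeric codes rather than the labelled DHS columns), and 96 simply stands in for a source code that appears in neither list.

```r
library(dplyr)

# Hypothetical households covering each branch of the recode
toy <- tibble(
  hv201 = c(11, 21, 32, 43, 96),  # water source codes
  hv204 = c(996, 45, 10, 20, 15)  # minutes to water; 996 = on premises
)

toy <- toy %>%
  mutate(
    time_to_water = case_when(
      hv204 == 996 ~ 0,
      hv204 < 996 ~ hv204,
      TRUE ~ NA_real_
    ),
    w_imp = case_when(
      hv201 %in% c(11, 12, 13, 14, 21, 31, 41, 51, 61, 71) ~ TRUE,
      hv201 %in% c(32, 42, 43) ~ FALSE,
      TRUE ~ NA
    ),
    jmp_water = case_when(
      w_imp & time_to_water < 31 ~ "At Least Basic",
      w_imp ~ "Limited",
      hv201 == 43 ~ "Surface Water",
      !is.na(w_imp) ~ "Unimproved",
      TRUE ~ NA_character_
    )
  )

toy$jmp_water
# "At Least Basic" "Limited" "Unimproved" "Surface Water" NA
```

Because the conditions are checked in order, a household with an improved source but a missing travel time falls through to "Limited": the first condition evaluates to NA and the second matches.
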
Unweighted vs Weighted/Design-based Distribution

Unweighted Bar Plot

hr_raw %>%
  drop_na(jmp_water) %>%
  ggplot(aes(x = jmp_water)) +
    geom_bar() +
    coord_flip() +
    labs(title = "JMP Drinking Water Ladder (Unweighted)", x = "", y = "Sample Households")

Weighted & Survey-Adjusted Counts (using survey package)

# Create survey design object
des <- svydesign(
  ids = ~hv021,
  strata = ~hv023,
  weights = ~hh_weight,
  nest = TRUE,
  data = hr_raw
)

# Weighted counts of households by JMP category
jmp_tab <- svytable(~jmp_water, design = des)
jmp_df <- as.data.frame(jmp_tab)

# Plot: weighted number of households by JMP category (using design)
ggplot(jmp_df, aes(x = jmp_water, y = Freq)) +
  geom_col() +
  coord_flip() +
  labs(title = "JMP Drinking Water Ladder (Weighted + Survey Design)",
       x = "", y = "Weighted N of households")
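
svytable() returns weighted counts. If proportions with design-based standard errors are wanted instead, svymean() on the factor is one option. The sketch below uses a small made-up design rather than the DHS file; the data frame toy and its columns psu, stratum, wt and jmp are illustrative assumptions.

```r
library(survey)

# Hypothetical data: 2 strata, 4 PSUs per stratum, 5 households per PSU
toy <- data.frame(
  psu     = rep(1:8, each = 5),
  stratum = rep(c("A", "B"), each = 20),
  wt      = rep(c(0.5, 1.5), each = 20),
  jmp     = factor(rep_len(
    c("At Least Basic", "Limited", "Unimproved", "At Least Basic",
      "Limited", "Limited", "Unimproved"),
    length.out = 40
  ))
)

des_toy <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt,
                     nest = TRUE, data = toy)

# Weighted proportions with design-based standard errors
props <- svymean(~jmp, design = des_toy, na.rm = TRUE)
props
confint(props)

# The point estimates match the normalized weighted counts
counts <- svytable(~jmp, design = des_toy)
all.equal(unname(coef(props)), as.numeric(prop.table(counts)))
```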

WISE Scale

Recoding

wise_vars <- c("sh108g", "sh108o", "sh108h", "sh108i", "sh108j", "sh108k",
               "sh108l", "sh108m", "sh108n", "sh108p", "sh108q", "sh108r")

hr_raw <- hr_raw %>%
  mutate(across(all_of(wise_vars), ~ case_when(
    .x == 1 ~ 0,
    .x == 2 ~ 1,
    .x == 3 ~ 2,
    .x %in% c(4,5) ~ 3,
    TRUE ~ NA_real_
  ), .names = "recoded_{.col}")) %>%
  mutate(
    WISE_score = rowSums(across(starts_with("recoded_"))),
    WISE_category = case_when(
      WISE_score <= 2 ~ "No-to-marginal",
      WISE_score <= 11 ~ "Low",
      WISE_score <= 23 ~ "Moderate",
      WISE_score >= 24 ~ "High",
      TRUE ~ NA_character_
    ),
    WISE_category = factor(
      WISE_category,
      levels = c("No-to-marginal", "Low", "Moderate", "High"),
      ordered = TRUE
    )
  )
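
One detail of the scoring is worth checking by hand: rowSums() keeps na.rm = FALSE by default, so a household with any missing item gets a missing WISE score. The sketch below uses made-up item scores and, for brevity, base R's cut() in place of case_when() to apply the same cutoffs.

```r
# Hypothetical recoded item scores (0-3) for five households, 12 items each
items <- rbind(
  hh1 = rep(0, 12),               # total 0
  hh2 = c(rep(1, 6), rep(0, 6)),  # total 6
  hh3 = rep(1, 12),               # total 12
  hh4 = rep(2, 12),               # total 24
  hh5 = c(NA, rep(1, 11))         # one item missing
)

# na.rm = FALSE by default, so hh5 gets NA rather than 11
score <- rowSums(items)

# Same cutoffs as above: 0-2, 3-11, 12-23, 24-36
category <- cut(
  score,
  breaks = c(-Inf, 2, 11, 23, Inf),
  labels = c("No-to-marginal", "Low", "Moderate", "High"),
  ordered_result = TRUE
)

data.frame(score, category)
```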

Unweighted vs Weighted/Design-based Distribution

Unweighted

hr_raw %>%
  drop_na(WISE_category) %>%
  ggplot(aes(x = WISE_category)) +
    geom_bar() +
    coord_flip() +
    labs(title = "WISE Score Distribution (Unweighted)", x = "WISE category", y = "Sample Households")

Design-based Distribution (weighted)

des <- svydesign(
  ids = ~hv021,
  strata = ~hv023,
  weights = ~hh_weight,
  nest = TRUE,
  data = hr_raw
)
wise_tab <- svytable(~WISE_category, design = des)
wise_df <- as.data.frame(wise_tab)

ggplot(wise_df, aes(x = WISE_category, y = Freq)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "WISE Score Distribution (Weighted + Survey Design)",
    x = "WISE category",
    y = "Weighted N of households"
  )

Examples

Weighted Means: Comparing Urban vs Rural (with Standard Errors)

Above we see that water insecurity is greater when we account for sample weights. This is likely because urban households are over-sampled relative to rural ones, meaning they have lower weights.

hr_raw %>% group_by(hv025) %>% summarise(mean(hh_weight))

# A tibble: 2 × 2
  hv025     `mean(hh_weight)`
  <dbl+lbl>             <dbl>
1 1 [urban]             0.873
2 2 [rural]             1.08

Now let’s do a basic comparison of the WISE score in urban vs rural areas. In order to make statistical inferences (standard errors and confidence intervals), we need to account for the hierarchical sampling process of the DHS.

# des was defined above on the recoded data, so it already contains WISE_score

# Design-based means
svyby(
  ~WISE_score,
  ~hv025,
  des,
  svymean,
  na.rm = TRUE,
  vartype = c("se", "ci")
)

  hv025 WISE_score        se     ci_l     ci_u
1     1   3.725686 0.2477948 3.240017 4.211355
2     2   6.299479 0.2922632 5.726653 6.872304
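
To test whether two group means differ, the survey package also provides svyttest(), which respects the design. Here is a minimal sketch on made-up data; every name and value in it is an assumption for illustration.

```r
library(survey)

set.seed(7)

# Hypothetical design: 2 strata, 4 PSUs each; whole PSUs are urban or rural
toy <- data.frame(
  psu     = rep(1:8, each = 10),
  stratum = rep(c("A", "B"), each = 40),
  wt      = rep(c(0.8, 1.2), each = 40),
  rural   = rep(c(0, 1, 0, 1, 0, 1, 0, 1), each = 10)
)
toy$score <- rpois(80, lambda = ifelse(toy$rural == 1, 6, 4))

des_toy <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt,
                     nest = TRUE, data = toy)

# Group means with standard errors and confidence intervals
svyby(~score, ~rural, des_toy, svymean, vartype = c("se", "ci"))

# Design-based t-test of the rural vs urban difference in means
tt <- svyttest(score ~ rural, des_toy)
tt
```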

Regression Standard Errors

When running a regression, use the svyglm() function from the survey package. It calculates the correct standard errors, taking into account the hierarchical sampling of the DHS.

# Regression of WISE score on wealth index
# (des already contains WISE_score and wealth, so no update is needed)

ols_model <-  svyglm(WISE_score ~ wealth,
                design = des)

jtools::summ(ols_model)

MODEL INFO:
Observations: 14250
Dependent Variable: WISE_score
Type: Survey-weighted linear regression 

MODEL FIT:
R² = 0.04
Adj. R² = 0.04 

Standard errors: Robust
------------------------------------------------
                     Est.   S.E.   t val.      p
----------------- ------- ------ -------- ------
(Intercept)          5.13   0.20    25.77   0.00
wealth              -3.64   0.34   -10.64   0.00
------------------------------------------------

Estimated dispersion parameter = 64.27
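
To see what the design adds beyond the weights, the sketch below fits the same model twice on simulated clustered data: once with svyglm() and once with lm() using weights only. The data, names and effect sizes are all hypothetical.

```r
library(survey)

set.seed(3)

# Hypothetical clustered data: households in a PSU share a common shock
toy <- data.frame(
  psu     = rep(1:10, each = 8),
  stratum = rep(c("A", "B"), each = 40),
  wt      = rep(c(0.7, 1.3), each = 40)
)
psu_shock  <- rnorm(10)[toy$psu]
toy$wealth <- rnorm(80) + psu_shock
toy$score  <- 5 - 2 * toy$wealth + 2 * psu_shock + rnorm(80)

des_toy <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt,
                     nest = TRUE, data = toy)

fit_design  <- svyglm(score ~ wealth, design = des_toy)
fit_weights <- lm(score ~ wealth, data = toy, weights = wt)

# Same point estimates ...
all.equal(as.numeric(coef(fit_design)), as.numeric(coef(fit_weights)))

# ... but, in general, different standard errors
cbind(
  design_based = SE(fit_design),
  weights_only = summary(fit_weights)$coefficients[, "Std. Error"]
)
```

The weights alone reproduce the coefficients, but only the design-based fit accounts for clustering and stratification when estimating uncertainty.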

Subpopulation Weights

When you analyze a subpopulation (e.g., only rural households, or only households where the main respondent is female), do not recalculate weights or re-design the survey object on the filtered data. Instead, use your full survey design and subset the survey design object. This maintains correct weights, clustering, and stratification.

For example, this gives population-representative estimates for rural households only (replace hv025 == 2 as needed for any other subpopulation):

# Subset to rural households only (hv025: 1=urban, 2=rural)
des_rural <- subset(des, hv025 == 2)

# Example: Proportion of rural households with at least basic water service (replace with your variable)
svymean(~I(jmp_water == "At Least Basic"), design = des_rural, na.rm = TRUE)

                                         mean     SE
I(jmp_water == "At Least Basic")FALSE 0.62993 0.0186
I(jmp_water == "At Least Basic")TRUE  0.37007 0.0186

# Example: Mean value for a continuous variable among rural households
svymean(~hv009, design = des_rural, na.rm = TRUE)

        mean     SE
hv009 4.6401 0.0416
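
The difference between subsetting the design and filtering the data can be sketched on a small made-up survey. The names toy, female_head and hh_size are hypothetical. The subpopulation is deliberately absent from two PSUs, which is the situation where the two approaches can give different standard errors.

```r
library(survey)

set.seed(11)

# Hypothetical design: 2 strata, 4 PSUs each, 6 households per PSU
toy <- data.frame(
  psu         = rep(1:8, each = 6),
  stratum     = rep(c("A", "B"), each = 24),
  wt          = rep(c(0.6, 1.4), each = 24),
  female_head = rep(c(TRUE, FALSE, FALSE), times = 16),
  hh_size     = rpois(48, lambda = 5)
)
# No female-headed households in PSUs 4 and 8
toy$female_head[toy$psu %in% c(4, 8)] <- FALSE

des_full <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt,
                      nest = TRUE, data = toy)

# Recommended: subset the design object built on the full data
des_sub <- subset(des_full, female_head)
est_sub <- svymean(~hh_size, des_sub)

# Not recommended: filter the data first, then rebuild the design
des_filtered <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt,
                          nest = TRUE, data = toy[toy$female_head, ])
est_filtered <- svymean(~hh_size, des_filtered)

# Point estimates agree; compare the standard errors by eye
est_sub
est_filtered
```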

Summary

Always define the survey design on the full data with correct weights, clusters, and strata.
Use subset() on the survey design for any subpopulation (e.g., age group, sex, urban/rural, region).
This gives correct point estimates and uncertainty for your target subpopulation and respects the complex survey design.

Adjust the filtering condition in subset() for your exact analysis.
Replace jmp_water == "At Least Basic" or hv009 above with your variable(s) of interest.

Notes

Always use both weights and the survey design (PSU, strata variables) for population-representative analyses and for standard errors/confidence intervals.
Unweighted or weights-only analyses will give incorrect inferences, especially about uncertainty.
For analysis of women and children, use v005 as the weight (still divided by one million, 1e6), with v021 and v023 as the PSU and strata variables.

Refer to the DHS Guide to Statistics, Section 1.7.2 for more details.

Footnotes

¹ I created this following the practices of JMP researchers, as provided by Joshua Miller.