Handling Missing Values in DHS Data
Overview
In DHS datasets, missing values are often coded with special numbers (like 997, 998, 999). Sometimes special cases like “on premises” are assigned a different value (e.g., 996).
In this handout, we will learn how to:
- Recode missing values properly
- Recode special cases like “on premises”
- Visualize changes using histograms
We will use the example of time to water source (v115) from the Mozambique 2022 DHS Individual Recode file (IR file).
Step 1: Load the Data
library(tidyverse)
library(haven)
# Define variables to load
vars_ir <- c("v115") # Only one variable for this illustration
# Load only the necessary columns
ir_raw <- read_dta("Raw/MZIR81FL.DTA",
                   col_select = all_of(vars_ir))
Step 2: Recode v115
attr(ir_raw$v115, "labels")
          on premises not a dejure resident            don't know
                  996                   997                   998
- Values 997 and 998 are types of missing data
- Value 996 means “On premises” and we want to set it to 0 minutes
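Before recoding, it can help to see how many observations actually carry each special code. The following is a minimal sketch on made-up data: the `toy` tibble and its values are assumptions for illustration, not figures from the Mozambique file, and only the label names mirror the output above.

```r
library(dplyr)
library(haven)

# Made-up stand-in for ir_raw$v115 (illustrative values only)
toy <- tibble(
  v115 = labelled(
    c(5, 30, 996, 996, 997, 998, 120, NA),
    labels = c("on premises" = 996,
               "not a dejure resident" = 997,
               "don't know" = 998)
  )
)

# Named numeric vector mapping each label to its code
labs <- attr(toy$v115, "labels")

# Keep only labelled (special) codes, count them, and attach the label text
special_counts <- toy %>%
  mutate(code = as.numeric(v115)) %>%
  filter(code %in% labs) %>%
  count(code) %>%
  mutate(label = names(labs)[match(code, labs)])

print(special_counts)
```

On the real data, the same pipeline could be run on `ir_raw` in place of `toy`.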
ir_raw <- ir_raw %>%
  mutate(time_to_water = case_when(
    v115 %in% c(997, 998, 999) ~ NA_real_,        # Set special codes to NA (numeric NA matches the other branches)
    v115 == 996                ~ 0,               # "On premises" = 0 minutes
    TRUE                       ~ as.numeric(v115) # Otherwise keep the original value
  ))
Step 3: Visualize Before and After
# Histogram of original v115
hist(ir_raw$v115,
     main = "Original v115 (Time to Water)",
     xlab = "Minutes",
     col = "lightblue")
The figure shows a large number of observations clustered near 1000. These values do not reflect time to water; they are the codes the DHS uses to represent special cases and missing values.
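The effect of leaving these codes in place can also be seen in simple summary statistics. The sketch below uses made-up values (the vector `v115_toy` is an assumption for illustration, not real DHS data) and base R only. It drops the special codes purely to show the distortion; it does not treat 996 as 0 the way the recode does.

```r
# Made-up minutes plus DHS-style special codes (illustrative values only)
v115_toy <- c(5, 10, 15, 30, 45, 996, 996, 997, 998)

special_codes <- c(996, 997, 998)

# Share of observations that are special codes rather than minutes
share_special <- mean(v115_toy %in% special_codes)

# Mean with the codes left in versus with them dropped
mean_with_codes    <- mean(v115_toy)
mean_without_codes <- mean(v115_toy[!v115_toy %in% special_codes])

print(c(share_special = share_special,
        mean_with_codes = mean_with_codes,
        mean_without_codes = mean_without_codes))
```

In this toy vector, a handful of coded values is enough to pull the mean far above any real travel time.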
# Histogram after recoding
hist(ir_raw$time_to_water,
     main = "Recoded Time to Water",
     xlab = "Minutes",
     col = "lightgreen")
Summary
- Always check for special missing value codes in DHS
- Use case_when() to recode cleanly
- Visualizing before and after can help confirm your transformations are correct!
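A recode can also be confirmed numerically, alongside the histograms. The following is a minimal sketch on made-up data (the `toy` tibble is an assumption for illustration). It applies the same rules as Step 2 and checks that the counts line up; the check on zeros assumes the original variable contains no genuine 0 values.

```r
library(dplyr)

# Made-up stand-in for v115 (illustrative values only)
toy <- tibble(v115 = c(5, 30, 996, 996, 997, 998, 120))

# Same recoding rules as in Step 2
toy <- toy %>%
  mutate(time_to_water = case_when(
    v115 %in% c(997, 998, 999) ~ NA_real_,
    v115 == 996                ~ 0,
    TRUE                       ~ as.numeric(v115)
  ))

# Compare counts before and after the recode
checks <- toy %>%
  summarise(
    n_missing_codes = sum(v115 %in% c(997, 998, 999)),
    n_na_after      = sum(is.na(time_to_water)),
    n_on_premises   = sum(v115 == 996),
    n_zero_after    = sum(time_to_water == 0, na.rm = TRUE),
    max_after       = max(time_to_water, na.rm = TRUE)
  )

# Stop with an error if any expectation fails
stopifnot(
  checks$n_missing_codes == checks$n_na_after,  # every missing code became NA
  checks$n_on_premises == checks$n_zero_after,  # every 996 became 0
  checks$max_after < 996                        # no special code survived
)
```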