Table 1 Example: Cancer Testing
Introduction
This example shows how to clean and summarize Demographic and Health Survey (DHS) data using R. We’ll import a dataset, recode a few variables, and produce a simple summary statistics table.
Load Required Packages
# Load packages for data manipulation and file reading
library(tidyverse) # For data wrangling
library(haven) # To read .dta (Stata) files
Clear the Environment
# Remove all objects from the environment to avoid conflicts
rm(list = ls())
Select and Import Relevant Variables
We limit the import to only the variables we need.
# List of individual-level variables of interest
vars_ir <- c("s1105b", "v115", "v012", "v025",
             "v190", "v501", "v136", "v106",
             "v191")
# Read the .dta file and keep only selected variables
ir_raw <- read_dta("Raw/MZIR81FL.DTA",
                   col_select = all_of(vars_ir))
Recode Cancer Test Outcome
We create a new variable that groups test results into categories.
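ir_raw <- ir_raw %>%
  mutate(cancer_test = case_when(
    s1105b == 1 ~ "Negative",
    s1105b >= 2 & s1105b < 4 ~ "Positive/Suspicious",
    s1105b >= 5 ~ NA_character_ # Typed NA keeps the column character
    # Values matched by no condition (code 4, never-tested women) also become NA
  ))
Recode Time to Water Source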
Special values in the DHS data are converted to meaningful values or NA.
ir_raw <- ir_raw %>%
  mutate(time_to_water = case_when(
    v115 %in% c(997, 998, 999) ~ NA_real_, # Special codes, set to missing
    v115 == 996 ~ 0, # "On premises" water = 0 minutes
    TRUE ~ as.numeric(v115) # All other values are minutes
  ))
Create a Binary Education Variable
We’ll create a binary indicator for having at least secondary education.
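ir_raw <- ir_raw %>%
  mutate(education = case_when(
    is.na(v106) ~ NA_real_, # Keep missing education as missing, not 0
    v106 >= 2 ~ 1, # Secondary or higher
    .default = 0 # No education or primary only
  ))
Check for Missing Values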
Before summarizing, it’s helpful to understand where we may lose data due to missingness.
ir_raw %>%
  summarise(
    missing_outcome = sum(is.na(cancer_test)),
    missing_main_predictor = sum(is.na(time_to_water))
  )
# A tibble: 1 × 2
  missing_outcome missing_main_predictor
            <int>                  <int>
1           11955                    697
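The count of missing outcomes is easier to interpret alongside the raw codes that produced it. The sketch below applies the same cancer-test recode to a small, made-up tibble (`toy` and its codes are hypothetical, not taken from the DHS file) and cross-tabulates raw codes against the recoded variable, so it is easy to see which codes end up as NA. It assumes only that dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data; these codes are made up for illustration only
toy <- tibble(s1105b = c(1, 1, 2, 3, 4, 8, NA, NA))

# Same recode logic as in the cancer test section
toy_recoded <- toy %>%
  mutate(cancer_test = case_when(
    s1105b == 1 ~ "Negative",
    s1105b >= 2 & s1105b < 4 ~ "Positive/Suspicious",
    s1105b >= 5 ~ NA_character_
  ))

# Cross-tabulate raw codes against the recoded variable
recode_check <- toy_recoded %>%
  count(s1105b, cancer_test)

# Share of rows with a missing outcome
share_missing <- mean(is.na(toy_recoded$cancer_test))

print(recode_check)
print(share_missing)
```

In this toy example, code 4 matches no condition and is therefore recoded to NA together with codes 5 and above and the rows where the raw value is already missing. Running the same `count()` on `ir_raw` would show how the 11955 missing outcomes split across raw codes.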
Summary Statistics by Cancer Test Result
Now we compute summary statistics stratified by the outcome.
ir_raw %>%
  group_by(cancer_test) %>%
  summarise(
    n = n(),
    m_time_to_water = mean(time_to_water, na.rm = TRUE),
    m_age = mean(v012, na.rm = TRUE),
    prop_rural = mean(v025 == 2, na.rm = TRUE),
    prop_educ = mean(education, na.rm = TRUE)
  )
# A tibble: 3 × 6
  cancer_test             n m_time_to_water m_age prop_rural prop_educ
  <chr>               <int>           <dbl> <dbl>      <dbl>     <dbl>
1 Negative             1155            9.63  34.7      0.274     0.567
2 Positive/Suspicious    73           21.5   36.7      0.425     0.329
3 <NA>                11955           23.8   27.6      0.597     0.338
Notice that only 73 women have a positive or suspicious test result. This group is likely too small to be the main focus of a study. Instead, we could compare women who have been tested with women who have not, which maps onto the concept of access to medical care.
ir_raw %>%
  mutate(cancer_tested = case_when(
    !is.na(s1105b) ~ 1,
    .default = 0
  )) %>%
  group_by(cancer_tested) %>%
  summarise(
    n = n(),
    m_time_to_water = mean(time_to_water, na.rm = TRUE),
    m_age = mean(v012, na.rm = TRUE),
    prop_rural = mean(v025 == 2, na.rm = TRUE),
    prop_educ = mean(education, na.rm = TRUE)
  )
# A tibble: 2 × 6
  cancer_tested     n m_time_to_water m_age prop_rural prop_educ
          <dbl> <int>           <dbl> <dbl>      <dbl>     <dbl>
1             0 11918            23.8  27.6      0.598     0.338
2             1  1265            10.2  34.8      0.285     0.550
We see major differences between women who have had a cancer test and women who have not. On average, women who were tested are older (34.8 vs. 27.6 years), less likely to live in a rural area (28.5% vs. 59.8%), more likely to have at least secondary education (55.0% vs. 33.8%), and closer to their water source (10.2 vs. 23.8 minutes).
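Both summary tables use the same `summarise()` call and differ only in the grouping variable. One way to avoid repeating it is a small helper, sketched below. The function name `summarise_by` and the toy data are hypothetical and not part of the original analysis; the sketch assumes the data frame already contains `time_to_water`, `v012`, `v025`, and `education`.

```r
library(dplyr)

# Hypothetical helper: the same summary statistics for any grouping column
summarise_by <- function(data, group) {
  data %>%
    group_by({{ group }}) %>%
    summarise(
      n = n(),
      m_time_to_water = mean(time_to_water, na.rm = TRUE),
      m_age = mean(v012, na.rm = TRUE),
      prop_rural = mean(v025 == 2, na.rm = TRUE),
      prop_educ = mean(education, na.rm = TRUE)
    )
}

# Made-up toy data with the same column names as the recoded DHS data
toy <- tibble(
  cancer_tested = c(0, 0, 1, 1),
  time_to_water = c(30, NA, 10, 0),
  v012 = c(20, 30, 35, 45),
  v025 = c(2, 2, 1, 2),
  education = c(0, 1, 1, 1)
)

toy_summary <- summarise_by(toy, cancer_tested)
print(toy_summary)
```

With the real data, `summarise_by(ir_raw, cancer_test)` should give the table stratified by test result. The tested versus not tested comparison would first need `cancer_tested` added to the data frame with `mutate()`.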