Table 1 Example: Cancer Testing

Author

Jesse McDevitt-Irwin

Introduction

This example shows how to clean and summarize Demographic and Health Survey (DHS) data using R. We’ll import a dataset, recode a few variables, and produce a simple summary statistics table.

Load Required Packages

# Load packages for data manipulation and file reading
library(tidyverse)  # For data wrangling
library(haven)      # To read .dta (Stata) files

Clear the Environment

# Remove all objects from the environment to avoid conflicts
rm(list = ls())

Select and Import Relevant Variables

We limit the import to only the variables we need.

# List of individual-level variables of interest
vars_ir <- c("s1105b","v115","v012","v025",
             "v190", "v501","v136","v106",
             "v191")

# Read the .dta file and keep only selected variables
ir_raw <- read_dta("Raw/MZIR81FL.DTA",
                   col_select = all_of(vars_ir))
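Columns read with read_dta() carry Stata value labels, and it helps to look at them before recoding by numeric code. The sketch below uses a small invented labelled vector rather than the survey file, so the codes and labels shown are assumptions for illustration only, not the actual DHS codebook.

```r
library(haven)

# Toy stand-in for a DHS column; codes and labels are invented for illustration
toy <- labelled(
  c(0, 1, 2, 3, 1),
  labels = c("no education" = 0, "primary" = 1, "secondary" = 2, "higher" = 3),
  label = "Highest educational level (toy example)"
)

attr(toy, "label")    # variable label
attr(toy, "labels")   # named vector mapping labels to numeric codes
as_factor(toy)        # same data with codes replaced by their labels
```

The same calls can be pointed at a real column, for example attr(ir_raw$s1105b, "labels"), to confirm what each code means before writing the recode.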

Recode Cancer Test Outcome

We create a new variable that groups test results into categories.

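ir_raw <- ir_raw %>%
  mutate(cancer_test = case_when(
    s1105b == 1 ~ "Negative",
    s1105b >= 2 & s1105b < 4 ~ "Positive/Suspicious",
    s1105b >= 5 ~ NA_character_  # Typed NA keeps the column a character vector
    # Any other value (including 4 and missing) matches no condition and is also NA
  ))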
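A quick way to check a recode like this is to cross-tabulate the original codes against the new categories. The sketch below does so on a toy tibble; the values in toy are made up to exercise every branch and are not taken from the survey.

```r
library(dplyr)

# Toy codes standing in for s1105b; real values come from the survey file
toy <- tibble(s1105b = c(1, 2, 3, 4, 5, 8, NA))

toy_recoded <- toy %>%
  mutate(cancer_test = case_when(
    s1105b == 1 ~ "Negative",
    s1105b >= 2 & s1105b < 4 ~ "Positive/Suspicious",
    s1105b >= 5 ~ NA_character_
  ))

# Each original code should map to exactly one category
toy_recoded %>% count(s1105b, cancer_test)
```

Codes that match no condition, such as 4 here, fall through to NA, so the cross-tab makes it visible which codes end up without a category.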

Recode Time to Water Source

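DHS stores special codes in the same column as the real values. We convert code 996 (water on premises) to 0 minutes and set the remaining special codes (997 to 999) to NA.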

ir_raw <- ir_raw %>%
  mutate(time_to_water = case_when(
    v115 %in% c(997, 998, 999) ~ NA_real_,  # Special codes; typed NA matches the numeric branches
    v115 == 996 ~ 0,                        # "On premises" water = 0 minutes
    TRUE ~ as.numeric(v115)                 # All other values are minutes as recorded
  ))
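To see why this recode matters, the sketch below compares the mean before and after cleaning on four invented values. The numbers are purely illustrative and chosen so the arithmetic is easy to follow.

```r
library(dplyr)

# Two real times, one "on premises" code, one special code (all invented)
toy <- tibble(v115 = c(5, 10, 996, 998))

toy_clean <- toy %>%
  mutate(time_to_water = case_when(
    v115 %in% c(997, 998, 999) ~ NA_real_,
    v115 == 996 ~ 0,
    TRUE ~ as.numeric(v115)
  ))

raw_mean   <- mean(toy_clean$v115)                          # (5 + 10 + 996 + 998) / 4 = 502.25
clean_mean <- mean(toy_clean$time_to_water, na.rm = TRUE)   # (5 + 10 + 0) / 3 = 5
```

Left unrecoded, the special codes would be treated as minutes and inflate the average by two orders of magnitude in this toy case.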

Create a Binary Education Variable

We’ll create a binary indicator for having at least secondary education.

ir_raw <- ir_raw %>%
  mutate(education = case_when(
    v106 >= 2 ~ 1,   # Secondary education or higher
    .default = 0     # Everything else, including any missing v106, is coded 0
  ))
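One detail of this pattern is that .default also catches missing values, so a woman with no recorded education level would be counted as 0. The sketch below contrasts that behavior with a variant that keeps NA, using an invented v106 vector; education_keep_na is a hypothetical alternative, not the variable used in the rest of this example.

```r
library(dplyr)

# Invented education codes, including one missing value
toy <- tibble(v106 = c(0, 1, 2, 3, NA))

toy_educ <- toy %>%
  mutate(
    # Same logic as above: NA falls through to .default and becomes 0
    education_default = case_when(v106 >= 2 ~ 1, .default = 0),
    # Hypothetical variant: handle NA first so it stays missing
    education_keep_na = case_when(
      is.na(v106) ~ NA_real_,
      v106 >= 2 ~ 1,
      .default = 0
    )
  )
```

Whether the two versions differ in practice depends on how many values of v106 are missing in the data at hand.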

Check for Missing Values

Before summarizing, it’s helpful to understand where we may lose data due to missingness.

ir_raw %>%
  summarise(
    missing_outcome = sum(is.na(cancer_test)),
    missing_main_predictor = sum(is.na(time_to_water))
  )

# A tibble: 1 × 2
  missing_outcome missing_main_predictor
            <int>                  <int>
1           11955                    697
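The same check can be extended to every column at once with across(). This is a minimal sketch on a toy tibble whose values are invented; with the real data, the pipeline would start from ir_raw instead of toy.

```r
library(dplyr)

# Invented data with a few missing values
toy <- tibble(
  cancer_test = c("Negative", NA, NA, "Positive/Suspicious"),
  time_to_water = c(10, NA, 0, 30),
  v012 = c(25, 31, 40, 19)
)

# Count missing values in every column
toy_missing <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
```

The result is a one-row tibble with one missing-value count per column, which scales better than naming each variable by hand.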

Summary Statistics by Cancer Test Result

Now we compute summary statistics stratified by the outcome.

ir_raw %>%
  group_by(cancer_test) %>%
  summarise(
    n = n(),
    m_time_to_water = mean(time_to_water, na.rm = TRUE),
    m_age = mean(v012, na.rm = TRUE),
    prop_rural = mean(v025 == 2, na.rm = TRUE),
    prop_educ = mean(education, na.rm = TRUE)
  )

# A tibble: 3 × 6
  cancer_test             n m_time_to_water m_age prop_rural prop_educ
  <chr>               <int>           <dbl> <dbl>      <dbl>     <dbl>
1 Negative             1155            9.63  34.7      0.274     0.567
2 Positive/Suspicious    73           21.5   36.7      0.425     0.329
3 <NA>                11955           23.8   27.6      0.597     0.338

Notice that only 73 women have a positive or suspicious test result. This group is likely too small to be the main focus of the study. Instead, we could compare women who have been tested with women who have not, which maps onto the concept of access to medical care.

ir_raw %>%
  mutate(cancer_tested = case_when(
    !is.na(s1105b) ~ 1,
    .default = 0
  )) %>%
  group_by(cancer_tested) %>%
  summarise(
    n = n(),
    m_time_to_water = mean(time_to_water, na.rm = TRUE),
    m_age = mean(v012, na.rm = TRUE),
    prop_rural = mean(v025 == 2, na.rm = TRUE),
    prop_educ = mean(education, na.rm = TRUE)
  )

# A tibble: 2 × 6
  cancer_tested     n m_time_to_water m_age prop_rural prop_educ
          <dbl> <int>           <dbl> <dbl>      <dbl>     <dbl>
1             0 11918            23.8  27.6      0.598     0.338
2             1  1265            10.2  34.8      0.285     0.550

There are major differences between women who have been tested for cancer and women who have not. On average, tested women are older (34.8 vs. 27.6 years), less likely to live in rural areas (28.5% vs. 59.8%), more likely to have at least secondary education (55.0% vs. 33.8%), and closer to their water source (10.2 vs. 23.8 minutes).