Data wrangling in R: From raw data to analysis-ready datasets

Author

Cameline N. Orlendo and Thomas M. Mawora

Code
library(tidyverse)
theme_set(theme_minimal(base_size = 13))

Introduction

As the volume and complexity of data grow, having high-quality, well-structured datasets becomes essential for trustworthy insights and sound decisions. When data are disorganised or flawed, they can quietly distort results and waste precious time, money, and effort for people working on the front lines.

For epidemiologists, programme managers, clinicians, and analysts, this is a familiar burden: long hours spent hunting through Excel files, reconciling conflicting case definitions, or re-checking numbers that do not quite add up before a briefing or report.

This is where data wrangling comes in — the practical work of cleaning, structuring, and transforming raw data so they are ready for analysis and interpretation.

Real-world datasets are rarely ready for analysis. They are often:

  • spread across multiple files,
  • inconsistent in variable names and coding,
  • affected by missing values,
  • and full of choices that must be made before analysis can begin.

1 Introduction to the DHS datasets

Kenya National Bureau of Statistics (KNBS), Ministry of Health (MoH), and ICF. (2023). Kenya Demographic and Health Survey 2022. Rockville, Maryland, USA: KNBS and ICF.

This analysis utilizes data from the 2022 Kenya Demographic and Health Survey (KDHS). The dataset comprises a Household Recode (HR) containing information on living conditions, a Household Member Recode (PR) providing individual demographics for every person in the home, and a Pregnancies Recode (GR) capturing detailed maternal and reproductive histories. These are augmented with Geospatial Covariates (GC) and Geographic Data (GE) to allow for cluster-level spatial analysis.

1.1 Variable Selection

For this malaria-focused training, we select variables that track prevention (mosquito nets), risk factors (wealth, location, and altitude), and health outcomes (test results and pregnancy impact).

1.1.1 Linking Variables (The “Keys”)

To join these files together in R, you will need the following common variables:

  • Cluster Number: hv001 (in HR/PR) or v001 (in GR)
  • Household Number: hv002 (in HR/PR) or v002 (in GR)
  • Respondent’s Line Number: hvidx (in PR) or v003 (in GR) — Only needed if linking individuals to their specific pregnancy history.

We will subset the variables below. for use in the training.

Dataset Variable Name Description Suggested New Name Value Labels (Categorical Codes)
Household (HR) hv001, hv002 Cluster and Household ID (Keys) (Numeric IDs)
hv270 Wealth Index (Social determinant of health) wealth_index 1: Poorest, 2: Poorer, 3: Middle, 4: Richer, 5: Richest
hv025 Type of place of residence (Urban/Rural) residence_type 1: Urban, 2: Rural
hv040 Cluster altitude in meters (High altitude = lower risk) altitude_meters (Continuous/Numeric)
hv227 Household has any mosquito net has_any_net 0: No, 1: Yes
hml1 Number of insecticide-treated nets (ITNs) owned itn_count (Numeric Count)
hv214 Main wall material (Risk of mosquito entry) wall_material 10: Natural, 20: Rudimentary, 30: Finished
hv201 Source of drinking water water_source 10: Piped, 20: Well, 30: Surface, 41: Protected
Member (PR) hv001, hv002 Cluster and Household ID (Keys) (Numeric ID)
hvidx Line number of the individual member (Numeric ID)
hv105 Age of household member (Children <5 are high risk) member_age 0-94: Years, 95: 95+, 98: Don’t know
hml12 Type of net person slept under last night net_type_used 0: None, 1: ITN, 2: Treated (non-ITN), 3: Untreated
hml32 Final result of malaria test (RDT/Microscopy) malaria_test_result 0: Negative, 1: Positive
hml20 Hemoglobin level (Malaria-induced anemia indicator) hemoglobin_level (Numeric g/dL, 1 decimal)
hml35 Had fever in the last 2 weeks fever_last_2weeks 0: No, 1: Yes, 8: Don’t know
hv104 Sex of household member member_sex 1: Male, 2: Female
Pregnancy (GR) v001, v002 Cluster and Household ID (Keys) (Numeric ID)
v003 Respondent’s line number (Links to PR file) (Numeric ID)
v012 Current age of the mother mother_age (Numeric Years)
m49a Took any antimalarial during pregnancy (IPTp) iptp_antimalarial 0: No, 1: Yes, 8: Don’t know
ml1 Number of doses of Fansidar (SP) taken sp_fansidar_doses 1: 1 dose, 2: 2 doses, 3: 3+ doses
v453 Maternal Hemoglobin level (Anemia risk) maternal_hemoglobin (Numeric g/dL)
m19 Birth weight (Malaria can cause low birth weight) birth_weight_g Numeric), 9996: Not weighed, 9998: Don’t know

2 Objectives for this Session

The overacrching objective for this session is to build participants’ capacity to clean, transform, integrate, and prepare malaria-related datasets for accurate and reproducible analysis using R.

By the end of the training, participants will be able to:

  1. Identify common data quality challenges in malaria analytics, including missing values, inconsistent formats, duplicates, outliers, and invalid codes.
  2. Apply core data wrangling workflows in R using the tidyverse, including importing, cleaning, recoding, filtering, summarizing, and reshaping datasets.
  3. Integrate multiple malaria-related datasets using appropriate common identifiers, such as household, individual, cluster, or geographic keys, to create analysis-ready data.

3 Importing the Data

The codes below read in the data so we can use them. We have sampled 1000 households out of the 37911, but for each dataset, we are selecting only ten variables as listed above, with identity variables that can help us link the household to the member and the pregnant mother.

Code
# Load necessary libraries
library(haven)   # For .dta files
library(readr)   # For .csv files
library(sf)      # For .shp files
library(dplyr)   # For data wrangling

# Set seed for reproducibility
set.seed(2026)

# 1. READ & SAMPLE HOUSEHOLD RECODE (HR)
# We select variables and then sample 100 random households
hr_data <- read_dta("data/KE_2022_DHS_04162026_1946_185432/KEHR8CDT/KEHR8CFL.dta") %>%
  select(hv001, hv002, hv270, hv025, hv040, hv227, hml1, hv214, hv201) %>%
  sample_n(1000)

# 2. READ HOUSEHOLD MEMBER RECODE (PR)
pr_data <- read_dta("data/KE_2022_DHS_04162026_1946_185432/KEPR8CDT/KEPR8CFL.dta") %>%
  select(hv001, hv002, hvidx, hv105, hml12, hml32, hml20, hml35, hv104)

# 3. READ PREGNANCIES RECODE (GR)
gr_data <- read_dta("data/KE_2022_DHS_04162026_1946_185432/KEGR8CDT/KEGR8CFL.dta") %>%
  select(v001, v002, v003, v012, m49a, ml1, v453, m19)

# 4. READ GEOSPATIAL DATA
geo_covar <- read_csv("data/KE_2022_DHS_04162026_1946_185432/KEGC8AFL/KEGC8AFL.csv") # CSV file
geo_shape <- st_read("data/KE_2022_DHS_04162026_1946_185432/KEGE8AFL/KEGE8AFL.shp")  # Shapefile
Reading layer `KEGE8AFL' from data source 
  `G:\My Drive\Projects\2026\2026.04 AMNET\Hackathon_data_wrangling_ext\data\KE_2022_DHS_04162026_1946_185432\KEGE8AFL\KEGE8AFL.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 1691 features and 20 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 33.96339 ymin: -4.633603 xmax: 41.87537 ymax: 4.930862
Geodetic CRS:  WGS 84
Code
# EXAMPLE: Linking PR to the Sampled Households
# This keeps only members belonging to our 100 sampled households
# linked_data <- pr_data %>%
#   inner_join(hr_data, by = c("hv001", "hv002"))

# Preview the results
head(hr_data)
# A tibble: 6 × 9
  hv001 hv002 hv270       hv025     hv040 hv227     hml1      hv214     hv201   
  <dbl> <dbl> <dbl+lbl>   <dbl+lbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lb>
1  1622    21 3 [middle]  2 [rural]  1696 1 [yes]   2         31 [ceme… 41 [pro…
2   848     6 4 [richer]  1 [urban]  1944 0 [no]    0         32 [ston… 12 [pip…
3    75    49 1 [poorest] 2 [rural]    74 1 [yes]   3         21 [bamb… 13 [pip…
4  1053     2 1 [poorest] 2 [rural]  1532 0 [no]    0         21 [bamb… 43 [riv…
5     4    34 5 [richest] 1 [urban]    60 1 [yes]   2         31 [ceme… 14 [pub…
6    52    68 3 [middle]  2 [rural]    26 1 [yes]   2         32 [ston… 12 [pip…
Code
head(pr_data)
# A tibble: 6 × 9
  hv001 hv002 hvidx hv105     hml12                  hml32 hml20   hml35 hv104  
  <dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl>              <dbl> <dbl+l> <dbl> <dbl+l>
1     1     4     1 35        1 [only treated (itn)… NA    1 [yes] NA    1 [mal…
2     1     4     2 33        1 [only treated (itn)… NA    1 [yes] NA    2 [fem…
3     1     4     3 13        1 [only treated (itn)… NA    1 [yes] NA    1 [mal…
4     1     4     4 11        1 [only treated (itn)… NA    1 [yes] NA    2 [fem…
5     1     4     5  8        1 [only treated (itn)… NA    1 [yes] NA    1 [mal…
6     1     4     6  0        1 [only treated (itn)… NA    1 [yes] NA    1 [mal…
Code
head(gr_data)
# A tibble: 6 × 8
   v001  v002  v003  v012 m49a      ml1       v453      m19      
  <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
1     1     4     2    34  1 [yes]   2        NA        4000     
2     1     4     2    34 NA        NA        NA          NA     
3     1     4     2    34 NA        NA        NA          NA     
4     1     4     2    34 NA        NA        NA          NA     
5     1     7     2    38 NA        NA        NA          NA     
6     1     7     2    38 NA        NA        NA          NA     

4 Data cleaning – Names, Types, Missingness, Outliers

4.1 Understanding the Data Architecture

Before we perform any analysis, we must inspect the “skeleton” of our data frames.

  • All the datasets we imported are stored as rectilinear data structures (data frames). Each row represents an observation (a household, a person, or a pregnancy), and each column represents a variable.
  • One immediate challenge you will notice is that the column names are non-informative (e.g., hv270). While these are standardized across all DHS countries, they are difficult for humans to interpret during analysis.
  • We will first explore the technical structure (data types and dimensions) and then immediately transition to Renaming, transforming those cryptic codes into descriptive labels to make our code self-documenting.

4.2 Renaming Variables for Clarity

The first step is to get a sense of the scale of your data. Using the dim() function provides a quick snapshot of the total number of observations and variables. This is important as it helps one confirm the data loaded correctly and helps you keep track of how many variables you are managing.

DHS datasets use a standardized coding system. While hv270 might not mean much at first glance, the labelled package allows you to peek at the underlying description (e.g., “Wealth Index”). By listing the labels, you can create a mapping for your renaming process. For instance, once you see that hv001 is the “Cluster number,” you can confidently rename it to cluster_id to make your code easier for others (and your future self) to read.

Code
# 1. Check dimensions (Rows and Columns)
dim(hr_data)
[1] 1000    9
Code
# 2. Comprehensive overview 
# This shows variable names, types (dbl, int, fct), and the first few values
glimpse(hr_data)
Rows: 1,000
Columns: 9
$ hv001 <dbl> 1622, 848, 75, 1053, 4, 52, 1014, 180, 557, 425, 528, 49, 1252, …
$ hv002 <dbl> 21, 6, 49, 2, 34, 68, 74, 115, 105, 31, 62, 37, 35, 58, 51, 35, …
$ hv270 <dbl+lbl> 3, 4, 1, 1, 5, 3, 2, 2, 1, 2, 3, 4, 2, 3, 2, 3, 4, 1, 2, 1, …
$ hv025 <dbl+lbl> 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, …
$ hv040 <dbl> 1696, 1944, 74, 1532, 60, 26, 1706, 992, 856, 1305, 1468, 101, 1…
$ hv227 <dbl+lbl> 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, …
$ hml1  <dbl+lbl> 2, 0, 3, 0, 2, 2, 3, 2, 2, 0, 0, 1, 1, 2, 4, 0, 1, 1, 0, 2, …
$ hv214 <dbl+lbl> 31, 32, 21, 21, 31, 32, 13, 21, 31, 36, 34, 32, 13, 32, 21, …
$ hv201 <dbl+lbl> 41, 12, 13, 43, 14, 12, 41, 14, 43, 13, 51, 12, 41, 14, 41, …
Code
# 3. Viewing the "Labelled" attributes
# DHS data often has 'labels' hidden behind the codes. 
# This helps us see what 'hv270' actually stands for before we rename it.
library(labelled)
var_label(hr_data)
$hv001
[1] "cluster number"

$hv002
[1] "household number"

$hv270
[1] "wealth index combined"

$hv025
[1] "type of place of residence"

$hv040
[1] "cluster altitude in meters"

$hv227
[1] "has mosquito bed net for sleeping"

$hml1
[1] "number of mosquito bed nets"

$hv214
[1] "main wall material"

$hv201
[1] "source of drinking water"

Now, we apply the mapping we discussed to make the datasets intuitive. We will use the rename() function from the tidyverse.

Code
library(dplyr)

# 1. Renaming Household Recode (HR)
hr_data <- hr_data %>%
  rename(
    wealth_index    = hv270,
    residence_type  = hv025,
    altitude_meters = hv040,
    has_any_net     = hv227,
    itn_count       = hml1,
    wall_material   = hv214,
    water_source    = hv201
  )

# 2. Renaming Member Recode (PR)
pr_data <- pr_data %>%
  rename(
    member_age          = hv105,
    net_type_used       = hml12,
    malaria_test_result = hml32,
    hemoglobin_level    = hml20,
    fever_last_2weeks   = hml35,
    member_sex          = hv104
  )

# 3. Renaming Pregnancy Recode (GR)
gr_data <- gr_data %>%
  rename(
    mother_age          = v012,
    iptp_antimalarial   = m49a,
    sp_fansidar_doses   = ml1,
    maternal_hemoglobin = v453,
    birth_weight_g      = m19
  )

# Verify the changes
names(hr_data)
[1] "hv001"           "hv002"           "wealth_index"    "residence_type" 
[5] "altitude_meters" "has_any_net"     "itn_count"       "wall_material"  
[9] "water_source"   

4.3 Data Types

While functions like head() or tail() are great for a quick visual “sanity check” of the first or last few rows, they can be deceptive. A column might look like it contains numbers, but R might be treating it as a character string, which would prevent you from performing any mathematical operations. Checking the structure ensures that your variables are stored in a format compatible with the functions you plan to use later.

Why Structure Matters Over Just Previewing

Feature head() / tail() str() / glimpse()
Visibility Shows raw data values. Shows data types and metadata.
Precision Hard to distinguish between a “5” (text) and a 5 (number). Explicitly states chr, int, or num.
Scope Only shows the first/last 6 rows. Provides a summary of all variables at once.
Error Prevention Easy to miss trailing spaces or incorrect formatting. Identifies why a column might not be calculating correctly.

There are two primary ways to do this. The base R function str() is the standard, while dplyr::glimpse() provides a cleaner, transposed view that is often easier to read for datasets with many columns (like DHS data).

Code
library(dplyr)

# Glimpse provides a denser summary that fits more columns on your screen
glimpse(hr_data)
Rows: 1,000
Columns: 9
$ hv001           <dbl> 1622, 848, 75, 1053, 4, 52, 1014, 180, 557, 425, 528, …
$ hv002           <dbl> 21, 6, 49, 2, 34, 68, 74, 115, 105, 31, 62, 37, 35, 58…
$ wealth_index    <dbl+lbl> 3, 4, 1, 1, 5, 3, 2, 2, 1, 2, 3, 4, 2, 3, 2, 3, 4,…
$ residence_type  <dbl+lbl> 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1,…
$ altitude_meters <dbl> 1696, 1944, 74, 1532, 60, 26, 1706, 992, 856, 1305, 14…
$ has_any_net     <dbl+lbl> 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1,…
$ itn_count       <dbl+lbl> 2, 0, 3, 0, 2, 2, 3, 2, 2, 0, 0, 1, 1, 2, 4, 0, 1,…
$ wall_material   <dbl+lbl> 31, 32, 21, 21, 31, 32, 13, 21, 31, 36, 34, 32, 13…
$ water_source    <dbl+lbl> 41, 12, 13, 43, 14, 12, 41, 14, 43, 13, 51, 12, 41…
Code
glimpse(pr_data)
Rows: 156,571
Columns: 9
$ hv001               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ hv002               <dbl> 4, 4, 4, 4, 4, 4, 7, 7, 7, 10, 10, 13, 13, 13, 13,…
$ hvidx               <dbl> 1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 2, 1, 2, 3, 4, 5, 6,…
$ member_age          <dbl+lbl> 35, 33, 13, 11,  8,  0, 38, 35, 16, 33, 11, 48…
$ net_type_used       <dbl+lbl> 1, 1, 1, 1, 1, 1, 3, 3, 0, 1, 3, 1, 1, 1, 3, 3…
$ malaria_test_result <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ hemoglobin_level    <dbl+lbl> 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
$ fever_last_2weeks   <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ member_sex          <dbl+lbl> 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 1…
Code
glimpse(gr_data)
Rows: 82,687
Columns: 8
$ v001                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ v002                <dbl> 4, 4, 4, 4, 7, 7, 10, 13, 13, 13, 13, 13, 20, 20, …
$ v003                <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ mother_age          <dbl> 34, 34, 34, 34, 38, 38, 33, 39, 39, 39, 39, 39, 30…
$ iptp_antimalarial   <dbl+lbl>  1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sp_fansidar_doses   <dbl+lbl>  2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ maternal_hemoglobin <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ birth_weight_g      <dbl+lbl> 4000,   NA,   NA,   NA,   NA,   NA,   NA,   NA…

4.4 Missing Values

In R, missing information is represented by the logical constant NA (Not Available), but its behavior shifts slightly depending on the data type. For numeric variables, an NA acts as a mathematical “hole”—any calculation involving it (like 1 + NA or mean(data)) will result in NA unless you explicitly tell R to ignore it using the na.rm = TRUE argument. For character variables, NA represents a missing string, which is distinct from an empty quote (""). While an empty string is technically data (a word with zero characters), an NA is the total absence of a value. Identifying these gaps is typically done using the is.na() function, which returns a logical TRUE for every missing entry regardless of whether the column is numeric or text-based.

You can use the following commands to get a quick count of how much data is missing across your three datasets:

Code
# Count total missing values in each dataset
sum(is.na(hr_data))
[1] 0
Code
sum(is.na(pr_data))
[1] 313142
Code
sum(is.na(gr_data))
[1] 317431
Code
nrow(gr_data)
[1] 82687
Code
# Summary view to see NAs per specific column
summary(gr_data)
      v001             v002             v003          mother_age   
 Min.   :   1.0   Min.   :   0.0   Min.   : 1.000   Min.   :15.00  
 1st Qu.: 358.0   1st Qu.:  19.0   1st Qu.: 1.000   1st Qu.:30.00  
 Median : 861.0   Median :  39.0   Median : 2.000   Median :36.00  
 Mean   : 839.6   Mean   :  43.4   Mean   : 1.889   Mean   :35.69  
 3rd Qu.:1275.0   3rd Qu.:  63.0   3rd Qu.: 2.000   3rd Qu.:42.00  
 Max.   :1692.0   Max.   :1098.0   Max.   :22.000   Max.   :49.00  
                                                                   
 iptp_antimalarial sp_fansidar_doses maternal_hemoglobin birth_weight_g 
 Min.   :0.000     Min.   : 0.000    Min.   : NA         Min.   : 600   
 1st Qu.:0.000     1st Qu.: 1.000    1st Qu.: NA         1st Qu.:3000   
 Median :0.000     Median : 2.000    Median : NA         Median :3400   
 Mean   :0.327     Mean   : 4.274    Mean   :NaN         Mean   :4725   
 3rd Qu.:1.000     3rd Qu.: 3.000    3rd Qu.: NA         3rd Qu.:4500   
 Max.   :8.000     Max.   :90.000    Max.   : NA         Max.   :9998   
 NA's   :77111     NA's   :81094     NA's   :82687       NA's   :76539  

From the above, the variable iptp_antimalarial (Intermittent Preventive Treatment during pregnancy) reveals a significant gap between the total number of respondents and those who received specific antimalarial care. Out of 82,687 total entries, 77,111 contain missing values (NA ) for the use of SP/Fansidar. This high frequency of missingness likely indicates that the vast majority of respondents were either not pregnant during the survey window, did not attend antenatal care where these drugs are distributed, or did not contract malaria.

However, we have identified a core subset of 5,576 cases who did receive these malaria drugs. To ensure our subsequent analysis is precise and focuses specifically on maternal health interventions, we will refine our dataset. By excluding the rows with missing values, we transition from a general population survey to a targeted “pregnancy-intervention” sub-dataset. This allows for a more meaningful evaluation of dosage, timing, and efficacy among those who actually interacted with the malaria prevention protocol.

Code
# Filter for non-missing values and overwrite or save as a new object
gr_data_clean <- gr_data %>%
  filter(!is.na(iptp_antimalarial))

# Verify the new dimensions (should be 5,576 rows)
dim(gr_data_clean)
[1] 5576    8
Code
# Check the first few rows to ensure NAs are gone
head(gr_data_clean$iptp_antimalarial)
<labelled<double>[6]>: during pregnancy took: sp/fansidar for malaria
[1] 1 1 1 0 1 1

Labels:
 value      label
     0         no
     1        yes
     8 don't know

4.5 Check Outliers and Suspicious Values

When analyzing continuous variables like birth weight, identifying outliers is essential for ensuring the integrity of your results. Outliers are extreme values that can disproportionately influence your statistics, often arising from measurement errors or rare clinical cases.

A quick comparison of your summary statistics already hints at a potential issue: the mean is 4,725g, while the median is significantly lower at 3,400g. In a perfectly balanced distribution, these two numbers would be nearly identical. Because the mean is much higher than the median, it suggests that the data is being “pulled” to the right by some exceptionally high values—classic evidence of outliers.

4.5.1 Visualizing the Distribution

By stacking a histogram and a boxplot, we can see exactly where the data clusters and where the “stragglers” lie. Adding lines for the mean and median will visually demonstrate the “skew” caused by these extreme values.

Code
# Calculate values for the plot
b_mean <- 4725
b_median <- 3400

# Set up the plotting area: 2 rows, 1 column
par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))

library(ggplot2)
library(patchwork) # For stacking the plots

# 1. The Histogram
p1 <- ggplot(gr_data, aes(x = birth_weight_g)) +
  geom_histogram(fill = "lightgray", color = "white", bins = 30) +
  # Add the vertical lines
  geom_vline(aes(xintercept = b_median), color = "darkgreen", size = 1.2) +
  geom_vline(aes(xintercept = b_mean), color = "blue", size = 1.2) +
  # Add the text box (label)
  # x and y coordinates may need adjustment based on your specific data height
  # Use separate annotations if you want the text itself to be colored
  annotate("text", x = 6500, y = 550, label = paste("Median:", b_median), color = "darkgreen", fontface = "bold") +
  annotate("text", x = 6500, y = 450, label = paste("Mean:", b_mean), color = "blue", fontface = "bold") +
  labs(title = "Distribution of Birth Weight", x = NULL,y="Count") +
  theme_classic() +
  theme(axis.text.x = element_blank())

# 2. The Boxplot
p2 <- ggplot(gr_data, aes(x = birth_weight_g)) +
  geom_boxplot(fill = "tomato", outlier.color = "red", alpha = 0.7) +
  labs(x = "Birth Weight (grams)") +
  theme_classic() +
  theme(axis.text.y = element_blank())

# Combine them (stacking vertically)
p1 / p2

Code
# Reset plotting parameters
par(mfrow = c(1, 1))

4.5.2 Suspicious values

Let us investigate the birth weights by looking at all the unique values recorded.

Code
unique(gr_data$birth_weight_g)
<labelled<double>[96]>: birth weight in kilograms (3 decimals)
 [1] 4000   NA 3600 3200 3500 2900 9996 3400 3700 3800 3000 2000 2800 2700 4500
[16] 2300 2400 3300 2100 2200 2500 9998 3900 4400 1500 3100 1980 2600 5000 4600
[31] 1000 3260 1200 4300 4700 4200 3745 2790 3250 1900 4800 2570 2480 4110 2520
[46] 3460 2960 1800 1600 1700 2980 3150 3650 3510 6000 5500 1750 5600 3450 1100
[61] 5200 4100 2250 5300 1400 1550  900  750  700 3180 3680 3050 4900 2470 2160
[76] 3360 5800 1300  600  800 2080 2450 2850 2350 2020 2460 2950 5400 3010 3420
[91] 3380 3060 3720 5900 3570 2550

Labels:
 value                label
  9996 not weighed at birth
  9998           don't know

Large-scale surveys like the DHS often use placeholder codes to represent specific types of missing data that a standard NA cannot capture. In your output, the values 9996 and 9998 are not actual weights; they are metadata labels indicating “not weighed at birth” and “don’t know,” respectively.

The presence of these “9-series” codes explains why your initial mean was a staggering 4,725g. Because R treats these placeholders as actual numbers, a few instances of “9998” (nearly 10kg) will aggressively pull the average upward, creating a highly distorted picture of newborn health. While extreme biological values can occur, these specific figures are clearly categorical codes and must be handled—either by recoding them to NA or excluding them—before any statistical analysis can be trusted.

4.5.3 Refining the Data

We will now filter the dataset to include only plausible birth weights by excluding these special codes.

Code
library(ggplot2)
library(patchwork)

# 1. Exclude the placeholder codes (9996 and 9998)
# We also keep NAs out for this specific calculation
gr_data <- subset(gr_data, birth_weight_g < 9000)

# 2. Calculate the NEW Mean and Median
new_mean <- round(mean(gr_data$birth_weight_g, na.rm = TRUE),2)
new_median <- median(gr_data$birth_weight_g, na.rm = TRUE)


# 3. Redo Visualizations with ggplot2
# Histogram
p1 <- ggplot(gr_data, aes(x = birth_weight_g)) +
  geom_histogram(fill = "lightgray", color = "white", bins = 40) +
  # Add the vertical lines
  geom_vline(aes(xintercept = b_median), color = "darkgreen", size = 1.2) +
  geom_vline(aes(xintercept = b_mean), color = "blue", size = 1.2) +
  # Add the text box (label)
  # x and y coordinates may need adjustment based on your specific data height
  # Use separate annotations if you want the text itself to be colored
  annotate("text", x = 1000, y = 550, label = paste("Median:", new_median), color = "darkgreen", fontface = "bold") +
  annotate("text", x = 1000, y = 450, label = paste("Mean:", new_mean), color = "blue", fontface = "bold") +
  labs(title = "Cleaned Birth Weight Distribution", x = NULL,y="Count") +
  theme_classic() +
  theme(axis.text.x = element_blank())

# Boxplot
p2 <- ggplot(gr_data, aes(x = birth_weight_g)) +
  geom_boxplot(fill = "tomato", alpha = 0.7) +
  labs(x = "Birth Weight (grams)") +
  theme_classic() +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

# Combine plots
p1 / p2

It is important to maintain a skeptical but scientific eye. While 9998 is a known placeholder code, a value like 6000g (6kg) might appear as an outlier but could be a legitimate case of fetal macrosomia. When you encounter values that are biologically possible but extreme (e.g., > 5000g or < 500g), they should be investigated against other variables—such as delivery method or maternal health—rather than being deleted immediately. However, purely symbolic codes like 9996 must always be removed from numeric calculations.

Code
nrow(gr_data)
[1] 4755

With the above, our gr_data has reduced 4755 down from 82,687.

5 Data transformation – Join and Reshape

To ensure our analysis is both representative and comprehensive, we must consolidate our information through a strategic Relational Join.

The necessity for this merge is driven by three key factors:

  1. Preserving the Sampling Frame: Since we are focusing on a specific random sample of 100 households from the hr_data, we must use this as our “Master” dataset. Merging ensures that we maintain the context of these specific households, even if some don’t have matching records in the specialized malaria subset.

  2. Handling Targeted Subsets: Our cleaning of gr_data reduced the records to 4,755 specific malaria-related cases. By joining this with our households, we can identify which women within our target sample experienced malaria during pregnancy without losing the denominator of the total sampled population.

  3. Cross-Variable Insights: Information is currently siloed. To understand the relationship between prevention (bednet use from pr_data) and outcomes (malaria during pregnancy from gr_data), we need a single unified row for each individual.

5.1 The Merging Strategy

We will perform a Left Join, using the hr_data (Households) as the foundation. This ensures we keep all 1000 sampled households and “pull in” malaria and bednet data where available. In DHS data, records are typically linked by a combination of Cluster Number (v001), Household Number (v002), and sometimes Line Number (v003) for individuals.

Important Considerations

  • Duplicate Rows: Some persons may be repeated in the gr_data (e.g., if they had multiple pregnancies). A join will create a new row for each match. One should verify if you want to analyze “per pregnancy” or “per woman.”

  • Key Matching: Ensure the variable names for IDs are identical across sets (e.g., hv001 in one might be v001 in another). You can specify this using by = c("hv001" = "v001").

  • Missing Values after Merge: After a Left Join, households that did not have a matching record in the malaria data will have NA in those columns. This is expected and correct—it simply means those specific households did not report a malaria case in the gr_data subset.

Code
library(dplyr)

# 1. First Join: Households (hr) + Malaria Data (gr)
# We use a left_join to keep all 1000 households even if no malaria case was recorded
combined_data <- pr_data %>%
  left_join(gr_data, by = c("hv001"="v001", "hv002"="v002","hvidx"="v003"))

# 3. Verification of the Merge
# Check the dimensions to ensure we haven't lost our primary household focus
dim(combined_data)
[1] 156974     14
Code
# 2. Second Join: Adding Bednet Data (pr)
# This adds the individual-level behavior to our household/malaria records
# Note: hv001, hv002, and hv003 are standard DHS identifiers
final_dataset <- hr_data %>%
  left_join(combined_data, by = c("hv001", "hv002"))
  

# 3. Verification of the Merge
# Check the dimensions to ensure we haven't lost our primary household focus
dim(final_dataset)
[1] 4227   21
Code
glimpse(final_dataset)
Rows: 4,227
Columns: 21
$ hv001               <dbl> 1622, 1622, 1622, 848, 848, 75, 75, 75, 75, 1053, …
$ hv002               <dbl> 21, 21, 21, 6, 6, 49, 49, 49, 49, 2, 2, 2, 34, 34,…
$ wealth_index        <dbl+lbl> 3, 3, 3, 4, 4, 1, 1, 1, 1, 1, 1, 1, 5, 5, 3, 3…
$ residence_type      <dbl+lbl> 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2…
$ altitude_meters     <dbl> 1696, 1696, 1696, 1944, 1944, 74, 74, 74, 74, 1532…
$ has_any_net         <dbl+lbl> 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1…
$ itn_count           <dbl+lbl> 2, 2, 2, 0, 0, 3, 3, 3, 3, 0, 0, 0, 2, 2, 2, 2…
$ wall_material       <dbl+lbl> 31, 31, 31, 32, 32, 21, 21, 21, 21, 21, 21, 21…
$ water_source        <dbl+lbl> 41, 41, 41, 12, 12, 13, 13, 13, 13, 43, 43, 43…
$ hvidx               <dbl> 1, 2, 3, 1, 2, 1, 2, 3, 4, 1, 2, 3, 1, 2, 1, 2, 3,…
$ member_age          <dbl+lbl> 42, 14,  9, 35, 12, 75, 71, 10, 12, 30,  7,  2…
$ net_type_used       <dbl+lbl> 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 3, 1, 1, 1…
$ malaria_test_result <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ hemoglobin_level    <dbl+lbl> 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1…
$ fever_last_2weeks   <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ member_sex          <dbl+lbl> 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2…
$ mother_age          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ iptp_antimalarial   <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sp_fansidar_doses   <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ maternal_hemoglobin <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ birth_weight_g      <dbl+lbl>   NA,   NA,   NA,   NA,   NA,   NA,   NA,   NA…

5.1.1 Skimming the Final Dataset

With the datasets now merged, the glimpse of final_dataset reveals the “joined” reality: we have a rich collection of household characteristics, individual behaviors, and clinical outcomes. However, a significant portion of the data contains NA values or redundant identifiers that may clutter our analysis.

Performing Summary Statistics is our next logical step. This allows us to assess the distribution of each variable, identify the prevalence of missing data, and determine which columns provide enough variation to be useful for modeling. By scanning these summaries, we can distinguish between “anchor” variables—like wealth index or net usage—and those that are too sparsely populated to support a robust conclusion.

To achieve this, we will use the skimr package. Unlike the standard summary() function, skim() provides a much more detailed and visually intuitive preview, including missing value counts (n_missing), completion rates, and even “sparkline” histograms to show the shape of your data at a glance. It is particularly powerful for large datasets where you need to quickly spot which variables are mostly empty or skewed.

Code
# Load the package
library(skimr)

# Generate a detailed summary of the final dataset
# This will group variables by type (numeric, factor, etc.)
skim(final_dataset)
Data summary
Name final_dataset
Number of rows 4227
Number of columns 21
_______________________
Column type frequency:
numeric 21
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
hv001 0 1.00 849.46 488.46 1 408 864 1297.5 1688 ▇▆▇▇▇
hv002 0 1.00 43.28 28.93 1 19 39 63.0 139 ▇▆▅▂▁
wealth_index 0 1.00 2.72 1.40 1 1 3 4.0 5 ▇▆▅▆▃
residence_type 0 1.00 1.67 0.47 1 1 2 2.0 2 ▃▁▁▁▇
altitude_meters 0 1.00 1317.76 703.64 1 836 1412 1842.0 3342 ▅▅▇▂▁
has_any_net 0 1.00 0.69 0.46 0 0 1 1.0 1 ▃▁▁▁▇
itn_count 0 1.00 1.90 1.75 0 0 2 3.0 7 ▇▃▅▁▁
wall_material 0 1.00 27.47 12.76 11 21 31 32.0 96 ▇▇▁▁▁
water_source 0 1.00 31.26 17.47 11 14 31 43.0 96 ▇▇▂▁▁
hvidx 0 1.00 3.35 2.24 1 2 3 5.0 16 ▇▂▁▁▁
member_age 0 1.00 23.82 19.56 0 9 18 35.0 98 ▇▃▂▁▁
net_type_used 0 1.00 0.60 0.78 0 0 0 1.0 3 ▇▆▁▁▁
malaria_test_result 4227 0.00 NaN NA NA NA NA NA NA
hemoglobin_level 0 1.00 0.41 0.49 0 0 0 1.0 1 ▇▁▁▁▆
fever_last_2weeks 4227 0.00 NaN NA NA NA NA NA NA
member_sex 0 1.00 1.52 0.50 1 1 2 2.0 2 ▇▁▁▁▇
mother_age 4110 0.03 27.57 5.61 16 23 27 31.0 44 ▂▇▅▂▁
iptp_antimalarial 4122 0.02 0.41 0.88 0 0 0 1.0 8 ▇▁▁▁▁
sp_fansidar_doses 4192 0.01 8.43 16.91 1 2 4 5.0 90 ▇▁▁▁▁
maternal_hemoglobin 4227 0.00 NaN NA NA NA NA NA NA
birth_weight_g 4110 0.03 3162.22 682.83 1300 2800 3200 3500.0 5000 ▁▃▇▃▁

Key Observations from the Output

The “Dead” Variables (100% Missing)

  • malaria_test_result, fever_last_2weeks, and maternal_hemoglobin variables have a complete_rate of 0.00. They contain zero data points for the sampled 1000 households.
  • Corrective Action: Remove immediately. They add dimensionality without any information.

High-Missingness Pregnancy Data

  • iptp_antimalarial, sp_fansidar_doses, mother_age, and birth_weight_g show a complete_rate between 0.01 and 0.03.
  • Context: This isn’t necessarily “bad” data; it confirms that only a small subset of your household members (the pregnant women) qualify for these questions.
  • Corrective Action: Keep these, but be aware that any analysis involving them will only use a tiny fraction of your rows.

Strange Values & Data Types

  • sp_fansidar_doses: The mean is 8.43 with a max of 90. Standard malaria protocols (SP/Fansidar) usually involve 3 doses. A value of 90 is likely a placeholder code (similar to 9998) or a significant entry error.

  • birth_weight_g: Now that we removed the 9000+ codes, the distribution looks much better (p50 = 3200g). However, the min of 1300g and max of 5000g are extreme but biologically possible.

  • itn_count: The max is 7. While high, it is plausible for large households.

2. Corrective Cleaning Strategy

We need to strip away the “zero-info” columns and handle the extreme placeholder in the doses variable.

Code
# 1. Clean and Subset
final_dataset <- final_dataset %>%
  # Remove variables with 0% completion rate and redundant IDs
  select(-malaria_test_result, -fever_last_2weeks, -maternal_hemoglobin) %>%
  # Filter out the suspicious '90' doses if they are placeholders
  mutate(sp_fansidar_doses = ifelse(sp_fansidar_doses > 10, NA, sp_fansidar_doses))

# 2. Final check of the structure
str(final_dataset)
tibble [4,227 × 18] (S3: tbl_df/tbl/data.frame)
 $ hv001            : num [1:4227] 1622 1622 1622 848 848 ...
  ..- attr(*, "label")= chr "cluster number"
  ..- attr(*, "format.stata")= chr "%8.0g"
 $ hv002            : num [1:4227] 21 21 21 6 6 49 49 49 49 2 ...
  ..- attr(*, "label")= chr "household number"
  ..- attr(*, "format.stata")= chr "%8.0g"
 $ wealth_index     : dbl+lbl [1:4227] 3, 3, 3, 4, 4, 1, 1, 1, 1, 1, 1, 1, 5, 5, 3, 3, 3, 3,...
   ..@ label       : chr "wealth index combined"
   ..@ format.stata: chr "%8.0g"
   ..@ labels      : Named num [1:5] 1 2 3 4 5
   .. ..- attr(*, "names")= chr [1:5] "poorest" "poorer" "middle" "richer" ...
 $ residence_type   : dbl+lbl [1:4227] 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2,...
   ..@ label       : chr "type of place of residence"
   ..@ format.stata: chr "%8.0g"
   ..@ labels      : Named num [1:2] 1 2
   .. ..- attr(*, "names")= chr [1:2] "urban" "rural"
 $ altitude_meters  : num [1:4227] 1696 1696 1696 1944 1944 ...
  ..- attr(*, "label")= chr "cluster altitude in meters"
  ..- attr(*, "format.stata")= chr "%8.0g"
 $ has_any_net      : dbl+lbl [1:4227] 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,...
   ..@ label       : chr "has mosquito bed net for sleeping"
   ..@ format.stata: chr "%8.0g"
   ..@ labels      : Named num [1:2] 0 1
   .. ..- attr(*, "names")= chr [1:2] "no" "yes"
 $ itn_count        : dbl+lbl [1:4227] 2, 2, 2, 0, 0, 3, 3, 3, 3, 0, 0, 0, 2, 2, 2, 2, 2, 2,...
   ..@ label       : chr "number of mosquito bed nets"
   ..@ format.stata: chr "%8.0g"
   ..@ labels      : Named num 98
   .. ..- attr(*, "names")= chr "don't know"
 $ wall_material    : dbl+lbl [1:4227] 31, 31, 31, 32, 32, 21, 21, 21, 21, 21, 21, 21, 31, 3...
   ..@ label       : chr "main wall material"
   ..@ format.stata: chr "%8.0g"
   ..@ labels      : Named num [1:20] 10 11 12 13 20 21 22 23 24 25 ...
   .. ..- attr(*, "names")= chr [1:20] "natural" "no walls" "cane/palm/trunks" "dirt" ...
 $ water_source     : dbl+lbl [1:4227] 41, 41, 41, 12, 12, 13, 13, 13, 13, 43, 43, 43, 14, 1...
   ..@ label       : chr "source of drinking water"
   ..@ format.stata: chr "%8.0g"
   ..@ labels      : Named num [1:19] 10 11 12 13 14 20 21 30 31 32 ...
   .. ..- attr(*, "names")= chr [1:19] "piped water" "piped into dwelling" "piped to yard/plot" "piped to neighbor" ...
 $ hvidx            : num [1:4227] 1 2 3 1 2 1 2 3 4 1 ...
  ..- attr(*, "label")= chr "line number"
  ..- attr(*, "format.stata")= chr "%8.0g"
 $ member_age       : dbl+lbl [1:4227] 42, 14,  9, 35, 12, 75, 71, 10, 12, 30,  7,  2, 58, 2...
   ..@ label       : chr "age of household members"
   ..@ format.stata: chr "%8.0g"
   ..@ labels      : Named num [1:2] 95 98
   .. ..- attr(*, "names")= chr [1:2] "95+" "don't know"
 $ net_type_used    : dbl+lbl [1:4227] 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 3, 1, 1, 1, 1, 1,...
   ..@ label       : chr "type of mosquito bed net(s) person slept under last night"
   ..@ format.stata: chr "%8.0g"
   ..@ labels      : Named num [1:4] 0 1 2 3
   .. ..- attr(*, "names")= chr [1:4] "did not sleep under a net" "only treated (itn) nets" "both treated (itn) and untreated nets" "only untreated nets"
 $ hemoglobin_level : dbl+lbl [1:4227] 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1,...
   ..@ label       : chr "person slept under an llin net"
   ..@ format.stata: chr "%8.0g"
   ..@ labels      : Named num [1:2] 0 1
   .. ..- attr(*, "names")= chr [1:2] "no" "yes"
 $ member_sex       : dbl+lbl [1:4227] 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1,...
   ..@ label       : chr "sex of household member"
   ..@ format.stata: chr "%8.0g"
   ..@ labels      : Named num [1:2] 1 2
   .. ..- attr(*, "names")= chr [1:2] "male" "female"
 $ mother_age       : num [1:4227] NA NA NA NA NA NA NA NA NA NA ...
  ..- attr(*, "label")= chr "respondent's current age"
  ..- attr(*, "format.stata")= chr "%8.0g"
 $ iptp_antimalarial: dbl+lbl [1:4227] NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
   ..@ label       : chr "during pregnancy took: sp/fansidar for malaria"
   ..@ format.stata: chr "%8.0g"
   ..@ labels      : Named num [1:3] 0 1 8
   .. ..- attr(*, "names")= chr [1:3] "no" "yes" "don't know"
 $ sp_fansidar_doses: num [1:4227] NA NA NA NA NA NA NA NA NA NA ...
 $ birth_weight_g   : dbl+lbl [1:4227]   NA,   NA,   NA,   NA,   NA,   NA,   NA,   NA,   NA,...
   ..@ label       : chr "birth weight in kilograms (3 decimals)"
   ..@ format.stata: chr "%8.0g"
   ..@ labels      : Named num [1:2] 9996 9998
   .. ..- attr(*, "names")= chr [1:2] "not weighed at birth" "don't know"

5.2 Reshape Data for Analysis and Visualisation

The data is currently in what we call a long format. This can be useful when conducting faceted analysis. Consider the below analysis.

5.2.1 Faceted Analysis: The “Prevention Gap”

A faceted plot allows you to see how the relationship between two variables changes across different categories. Consider this question: “Does the effectiveness of bednet ownership on birth weight differ between Rural and Urban areas across different Wealth levels?”

By using facet_grid(residence_type ~ wealth_index), we create a matrix of plots. Each “cell” shows the distribution of birth weights for people with vs. without nets in that specific socio-economic context.

Code
library(ggplot2)

ggplot(final_dataset, aes(x = factor(has_any_net), y = birth_weight_g, fill = factor(has_any_net))) +
  geom_boxplot() +
  facet_grid(residence_type ~ wealth_index) + # Faceting by Residence and Wealth
  labs(title = "Birth Weight by Net Ownership across Wealth and Residence",
       x = "Has Any Bednet (0=No, 1=Yes)",
       y = "Birth Weight (g)",
       fill = "Net Ownership") +
  theme_bw()

5.2.2 Maintaining the data types

As we will do some analysis going forward, we first ensure all the categorical variables are in the correct format for analysis in R.

Converting numeric codes into Factors in R is a critical step for analysis. It ensures that R treats these variables as categories rather than continuous numbers. For variables like wealth_index, we will use Ordered Factors, which tells R that “Poorest” is lower than “Poorer,” allowing for better trend analysis in your models.

Code
final_dataset <- final_dataset %>%
  mutate(
    # --- Ordered Factors (Hierarchical) ---
    wealth_index = factor(wealth_index, 
                          levels = c(1, 2, 3, 4, 5), 
                          labels = c("Poorest", "Poorer", "Middle", "Richer", "Richest"), 
                          ordered = TRUE),
    
    sp_fansidar_doses = factor(sp_fansidar_doses, 
                               levels = c(0, 1, 2, 3), 
                               labels = c("None", "1 dose", "2 doses", "3+ doses"), 
                               ordered = TRUE),

    # --- Nominal Factors (Unordered Categories) ---
    residence_type = factor(residence_type, 
                            levels = c(1, 2), 
                            labels = c("Urban", "Rural")),
    
    has_any_net = factor(has_any_net, 
                         levels = c(0, 1), 
                         labels = c("No", "Yes")),
    
    member_sex = factor(member_sex, 
                        levels = c(1, 2), 
                        labels = c("Male", "Female")),
    
    iptp_antimalarial = factor(iptp_antimalarial, 
                               levels = c(0, 1, 8), 
                               labels = c("No", "Yes", "Don't Know")),
    
    net_type_used = factor(net_type_used, 
                           levels = c(0, 1, 2, 3), 
                           labels = c("None", "ITN", "Treated (non-ITN)", "Untreated"))
  )

We redo the analysis above with the categorical variables well constituted. Compare the difference.

Code
ggplot(final_dataset, aes(x = factor(has_any_net), y = birth_weight_g, fill = factor(has_any_net))) +
  geom_boxplot() +
  facet_grid(residence_type ~ wealth_index) + # Faceting by Residence and Wealth
  labs(title = "Birth Weight by Net Ownership across Wealth and Residence",
       x = "Has Any Bednet",
       y = "Birth Weight (g)",
       fill = "Net Ownership") +
  theme_bw() +
  theme(legend.position = "none")

5.2.3 Wide Format Analysis: Comparing Health Indicators

Sometimes we want to compare two different numeric variables for the same household on the same scale. To do this effectively in ggplot2, we “pivot” the data into a long format (which R ironically calls “tidying”), but the reason we do it is to compare variables that were originally sitting side-by-side in wide format. The Question: “How does the distribution of Mother’s Age compare to the distribution of Birth Weight (scaled) for each household?”

In your current wide format, mother_age and birth_weight_g are in separate columns. If you want to plot them as two different colored lines on the same chart, you pivot them.

Code
library(tidyr)

# 1. Prepare the data by scaling and pivoting
comparison_data <- final_dataset %>%
  # Select only the relevant columns and unique identifiers
  select(hv001, hv002, mother_age, birth_weight_g) %>%
  # Drop rows where either is missing for a fair comparison
  filter(!is.na(mother_age), !is.na(birth_weight_g)) %>%
  # Scale variables so they share a mean of 0 and SD of 1
  mutate(
    scaled_age = as.numeric(scale(mother_age)),
    scaled_weight = as.numeric(scale(birth_weight_g))
  ) %>%
  # Pivot from WIDE (two columns) to LONG (one column for plot mapping)
  pivot_longer(
    cols = c(scaled_age, scaled_weight),
    names_to = "health_metric",
    values_to = "standardized_value"
  )

# 2. Visualize the comparison
ggplot(comparison_data, aes(x = standardized_value, fill = health_metric)) +
  geom_density(alpha = 0.5) + # Density plot shows the "shape" of the data
  scale_fill_manual(
    values = c("scaled_age" = "darkgreen", "scaled_weight" = "blue"),
    labels = c("Mother's Age", "Birth Weight")
  ) +
  labs(
    title = "Comparison of Normalized Maternal Age and Birth Weight",
    subtitle = "Data pivoted from wide to long for distribution mapping",
    x = "Standardized Value (Z-Score)",
    y = "Density",
    fill = "Metric"
  ) +
  theme_minimal()

Widee-to-long transformation is essential whenever you want to map multiple columns to a single aesthetic (like color or fill). This comparison tells you if the “spread” of maternal ages is as wide as the “spread” of birth weights in your sample.