Code
library(tidyverse)
theme_set(theme_minimal(base_size = 13))library(tidyverse)
theme_set(theme_minimal(base_size = 13))As the volume and complexity of data grow, having high-quality, well-structured datasets becomes essential for trustworthy insights and sound decisions. When data are disorganised or flawed, they can quietly distort results and waste precious time, money, and effort for people working on the front lines.
For epidemiologists, programme managers, clinicians, and analysts, this is a familiar burden: long hours spent hunting through Excel files, reconciling conflicting case definitions, or re-checking numbers that do not quite add up before a briefing or report.
This is where data wrangling comes in — the practical work of cleaning, structuring, and transforming raw data so they are ready for analysis and interpretation.
Real-world datasets are rarely ready for analysis. They are often:
Kenya National Bureau of Statistics (KNBS), Ministry of Health (MoH), and ICF. (2023). Kenya Demographic and Health Survey 2022. Rockville, Maryland, USA: KNBS and ICF.
This analysis utilizes data from the 2022 Kenya Demographic and Health Survey (KDHS). The dataset comprises a Household Recode (HR) containing information on living conditions, a Household Member Recode (PR) providing individual demographics for every person in the home, and a Pregnancies Recode (GR) capturing detailed maternal and reproductive histories. These are augmented with Geospatial Covariates (GC) and Geographic Data (GE) to allow for cluster-level spatial analysis.
For this malaria-focused training, we select variables that track prevention (mosquito nets), risk factors (wealth, location, and altitude), and health outcomes (test results and pregnancy impact).
To join these files together in R, you will need the following common variables:
hv001 (in HR/PR) or v001 (in GR)hv002 (in HR/PR) or v002 (in GR)hvidx (in PR) or v003 (in GR) — Only needed if linking individuals to their specific pregnancy history.We will subset the variables below. for use in the training.
| Dataset | Variable Name | Description | Suggested New Name | Value Labels (Categorical Codes) |
|---|---|---|---|---|
| Household (HR) | hv001, hv002 |
Cluster and Household ID (Keys) | (Numeric IDs) | |
hv270 |
Wealth Index (Social determinant of health) | wealth_index |
1: Poorest, 2: Poorer, 3: Middle, 4: Richer, 5: Richest | |
hv025 |
Type of place of residence (Urban/Rural) | residence_type |
1: Urban, 2: Rural | |
hv040 |
Cluster altitude in meters (High altitude = lower risk) | altitude_meters |
(Continuous/Numeric) | |
hv227 |
Household has any mosquito net | has_any_net |
0: No, 1: Yes | |
hml1 |
Number of insecticide-treated nets (ITNs) owned | itn_count |
(Numeric Count) | |
hv214 |
Main wall material (Risk of mosquito entry) | wall_material |
10: Natural, 20: Rudimentary, 30: Finished | |
hv201 |
Source of drinking water | water_source |
10: Piped, 20: Well, 30: Surface, 41: Protected | |
| Member (PR) | hv001, hv002 |
Cluster and Household ID (Keys) | (Numeric ID) | |
hvidx |
Line number of the individual member | (Numeric ID) | ||
hv105 |
Age of household member (Children <5 are high risk) | member_age |
0-94: Years, 95: 95+, 98: Don’t know | |
hml12 |
Type of net person slept under last night | net_type_used |
0: None, 1: ITN, 2: Treated (non-ITN), 3: Untreated | |
hml32 |
Final result of malaria test (RDT/Microscopy) | malaria_test_result |
0: Negative, 1: Positive | |
hml20 |
Hemoglobin level (Malaria-induced anemia indicator) | hemoglobin_level |
(Numeric g/dL, 1 decimal) | |
hml35 |
Had fever in the last 2 weeks | fever_last_2weeks |
0: No, 1: Yes, 8: Don’t know | |
hv104 |
Sex of household member | member_sex |
1: Male, 2: Female | |
| Pregnancy (GR) | v001, v002 |
Cluster and Household ID (Keys) | (Numeric ID) | |
v003 |
Respondent’s line number (Links to PR file) | (Numeric ID) | ||
v012 |
Current age of the mother | mother_age |
(Numeric Years) | |
m49a |
Took any antimalarial during pregnancy (IPTp) | iptp_antimalarial |
0: No, 1: Yes, 8: Don’t know | |
ml1 |
Number of doses of Fansidar (SP) taken | sp_fansidar_doses |
1: 1 dose, 2: 2 doses, 3: 3+ doses | |
v453 |
Maternal Hemoglobin level (Anemia risk) | maternal_hemoglobin |
(Numeric g/dL) | |
m19 |
Birth weight (Malaria can cause low birth weight) | birth_weight_g |
Numeric), 9996: Not weighed, 9998: Don’t know |
The overacrching objective for this session is to build participants’ capacity to clean, transform, integrate, and prepare malaria-related datasets for accurate and reproducible analysis using R.
By the end of the training, participants will be able to:
The codes below read in the data so we can use them. We have sampled 1000 households out of the 37911, but for each dataset, we are selecting only ten variables as listed above, with identity variables that can help us link the household to the member and the pregnant mother.
# Load necessary libraries
library(haven) # For .dta files
library(readr) # For .csv files
library(sf) # For .shp files
library(dplyr) # For data wrangling
# Set seed for reproducibility
set.seed(2026)
# 1. READ & SAMPLE HOUSEHOLD RECODE (HR)
# We select variables and then sample 100 random households
hr_data <- read_dta("data/KE_2022_DHS_04162026_1946_185432/KEHR8CDT/KEHR8CFL.dta") %>%
select(hv001, hv002, hv270, hv025, hv040, hv227, hml1, hv214, hv201) %>%
sample_n(1000)
# 2. READ HOUSEHOLD MEMBER RECODE (PR)
pr_data <- read_dta("data/KE_2022_DHS_04162026_1946_185432/KEPR8CDT/KEPR8CFL.dta") %>%
select(hv001, hv002, hvidx, hv105, hml12, hml32, hml20, hml35, hv104)
# 3. READ PREGNANCIES RECODE (GR)
gr_data <- read_dta("data/KE_2022_DHS_04162026_1946_185432/KEGR8CDT/KEGR8CFL.dta") %>%
select(v001, v002, v003, v012, m49a, ml1, v453, m19)
# 4. READ GEOSPATIAL DATA
geo_covar <- read_csv("data/KE_2022_DHS_04162026_1946_185432/KEGC8AFL/KEGC8AFL.csv") # CSV file
geo_shape <- st_read("data/KE_2022_DHS_04162026_1946_185432/KEGE8AFL/KEGE8AFL.shp") # ShapefileReading layer `KEGE8AFL' from data source
`G:\My Drive\Projects\2026\2026.04 AMNET\Hackathon_data_wrangling_ext\data\KE_2022_DHS_04162026_1946_185432\KEGE8AFL\KEGE8AFL.shp'
using driver `ESRI Shapefile'
Simple feature collection with 1691 features and 20 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 33.96339 ymin: -4.633603 xmax: 41.87537 ymax: 4.930862
Geodetic CRS: WGS 84
# EXAMPLE: Linking PR to the Sampled Households
# This keeps only members belonging to our 100 sampled households
# linked_data <- pr_data %>%
# inner_join(hr_data, by = c("hv001", "hv002"))
# Preview the results
head(hr_data)# A tibble: 6 × 9
hv001 hv002 hv270 hv025 hv040 hv227 hml1 hv214 hv201
<dbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lb>
1 1622 21 3 [middle] 2 [rural] 1696 1 [yes] 2 31 [ceme… 41 [pro…
2 848 6 4 [richer] 1 [urban] 1944 0 [no] 0 32 [ston… 12 [pip…
3 75 49 1 [poorest] 2 [rural] 74 1 [yes] 3 21 [bamb… 13 [pip…
4 1053 2 1 [poorest] 2 [rural] 1532 0 [no] 0 21 [bamb… 43 [riv…
5 4 34 5 [richest] 1 [urban] 60 1 [yes] 2 31 [ceme… 14 [pub…
6 52 68 3 [middle] 2 [rural] 26 1 [yes] 2 32 [ston… 12 [pip…
head(pr_data)# A tibble: 6 × 9
hv001 hv002 hvidx hv105 hml12 hml32 hml20 hml35 hv104
<dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl+l> <dbl> <dbl+l>
1 1 4 1 35 1 [only treated (itn)… NA 1 [yes] NA 1 [mal…
2 1 4 2 33 1 [only treated (itn)… NA 1 [yes] NA 2 [fem…
3 1 4 3 13 1 [only treated (itn)… NA 1 [yes] NA 1 [mal…
4 1 4 4 11 1 [only treated (itn)… NA 1 [yes] NA 2 [fem…
5 1 4 5 8 1 [only treated (itn)… NA 1 [yes] NA 1 [mal…
6 1 4 6 0 1 [only treated (itn)… NA 1 [yes] NA 1 [mal…
head(gr_data)# A tibble: 6 × 8
v001 v002 v003 v012 m49a ml1 v453 m19
<dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
1 1 4 2 34 1 [yes] 2 NA 4000
2 1 4 2 34 NA NA NA NA
3 1 4 2 34 NA NA NA NA
4 1 4 2 34 NA NA NA NA
5 1 7 2 38 NA NA NA NA
6 1 7 2 38 NA NA NA NA
Before we perform any analysis, we must inspect the “skeleton” of our data frames.
hv270). While these are standardized across all DHS countries, they are difficult for humans to interpret during analysis.The first step is to get a sense of the scale of your data. Using the dim() function provides a quick snapshot of the total number of observations and variables. This is important as it helps one confirm the data loaded correctly and helps you keep track of how many variables you are managing.
DHS datasets use a standardized coding system. While hv270 might not mean much at first glance, the labelled package allows you to peek at the underlying description (e.g., “Wealth Index”). By listing the labels, you can create a mapping for your renaming process. For instance, once you see that hv001 is the “Cluster number,” you can confidently rename it to cluster_id to make your code easier for others (and your future self) to read.
# 1. Check dimensions (Rows and Columns)
dim(hr_data)[1] 1000 9
# 2. Comprehensive overview
# This shows variable names, types (dbl, int, fct), and the first few values
glimpse(hr_data)Rows: 1,000
Columns: 9
$ hv001 <dbl> 1622, 848, 75, 1053, 4, 52, 1014, 180, 557, 425, 528, 49, 1252, …
$ hv002 <dbl> 21, 6, 49, 2, 34, 68, 74, 115, 105, 31, 62, 37, 35, 58, 51, 35, …
$ hv270 <dbl+lbl> 3, 4, 1, 1, 5, 3, 2, 2, 1, 2, 3, 4, 2, 3, 2, 3, 4, 1, 2, 1, …
$ hv025 <dbl+lbl> 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, …
$ hv040 <dbl> 1696, 1944, 74, 1532, 60, 26, 1706, 992, 856, 1305, 1468, 101, 1…
$ hv227 <dbl+lbl> 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, …
$ hml1 <dbl+lbl> 2, 0, 3, 0, 2, 2, 3, 2, 2, 0, 0, 1, 1, 2, 4, 0, 1, 1, 0, 2, …
$ hv214 <dbl+lbl> 31, 32, 21, 21, 31, 32, 13, 21, 31, 36, 34, 32, 13, 32, 21, …
$ hv201 <dbl+lbl> 41, 12, 13, 43, 14, 12, 41, 14, 43, 13, 51, 12, 41, 14, 41, …
# 3. Viewing the "Labelled" attributes
# DHS data often has 'labels' hidden behind the codes.
# This helps us see what 'hv270' actually stands for before we rename it.
library(labelled)
var_label(hr_data)$hv001
[1] "cluster number"
$hv002
[1] "household number"
$hv270
[1] "wealth index combined"
$hv025
[1] "type of place of residence"
$hv040
[1] "cluster altitude in meters"
$hv227
[1] "has mosquito bed net for sleeping"
$hml1
[1] "number of mosquito bed nets"
$hv214
[1] "main wall material"
$hv201
[1] "source of drinking water"
Now, we apply the mapping we discussed to make the datasets intuitive. We will use the rename() function from the tidyverse.
library(dplyr)
# 1. Renaming Household Recode (HR)
hr_data <- hr_data %>%
rename(
wealth_index = hv270,
residence_type = hv025,
altitude_meters = hv040,
has_any_net = hv227,
itn_count = hml1,
wall_material = hv214,
water_source = hv201
)
# 2. Renaming Member Recode (PR)
pr_data <- pr_data %>%
rename(
member_age = hv105,
net_type_used = hml12,
malaria_test_result = hml32,
hemoglobin_level = hml20,
fever_last_2weeks = hml35,
member_sex = hv104
)
# 3. Renaming Pregnancy Recode (GR)
gr_data <- gr_data %>%
rename(
mother_age = v012,
iptp_antimalarial = m49a,
sp_fansidar_doses = ml1,
maternal_hemoglobin = v453,
birth_weight_g = m19
)
# Verify the changes
names(hr_data)[1] "hv001" "hv002" "wealth_index" "residence_type"
[5] "altitude_meters" "has_any_net" "itn_count" "wall_material"
[9] "water_source"
While functions like head() or tail() are great for a quick visual “sanity check” of the first or last few rows, they can be deceptive. A column might look like it contains numbers, but R might be treating it as a character string, which would prevent you from performing any mathematical operations. Checking the structure ensures that your variables are stored in a format compatible with the functions you plan to use later.
Why Structure Matters Over Just Previewing
| Feature | head() / tail() |
str() / glimpse() |
| Visibility | Shows raw data values. | Shows data types and metadata. |
| Precision | Hard to distinguish between a “5” (text) and a 5 (number). | Explicitly states chr, int, or num. |
| Scope | Only shows the first/last 6 rows. | Provides a summary of all variables at once. |
| Error Prevention | Easy to miss trailing spaces or incorrect formatting. | Identifies why a column might not be calculating correctly. |
There are two primary ways to do this. The base R function str() is the standard, while dplyr::glimpse() provides a cleaner, transposed view that is often easier to read for datasets with many columns (like DHS data).
library(dplyr)
# Glimpse provides a denser summary that fits more columns on your screen
glimpse(hr_data)Rows: 1,000
Columns: 9
$ hv001 <dbl> 1622, 848, 75, 1053, 4, 52, 1014, 180, 557, 425, 528, …
$ hv002 <dbl> 21, 6, 49, 2, 34, 68, 74, 115, 105, 31, 62, 37, 35, 58…
$ wealth_index <dbl+lbl> 3, 4, 1, 1, 5, 3, 2, 2, 1, 2, 3, 4, 2, 3, 2, 3, 4,…
$ residence_type <dbl+lbl> 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1,…
$ altitude_meters <dbl> 1696, 1944, 74, 1532, 60, 26, 1706, 992, 856, 1305, 14…
$ has_any_net <dbl+lbl> 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1,…
$ itn_count <dbl+lbl> 2, 0, 3, 0, 2, 2, 3, 2, 2, 0, 0, 1, 1, 2, 4, 0, 1,…
$ wall_material <dbl+lbl> 31, 32, 21, 21, 31, 32, 13, 21, 31, 36, 34, 32, 13…
$ water_source <dbl+lbl> 41, 12, 13, 43, 14, 12, 41, 14, 43, 13, 51, 12, 41…
glimpse(pr_data)Rows: 156,571
Columns: 9
$ hv001 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ hv002 <dbl> 4, 4, 4, 4, 4, 4, 7, 7, 7, 10, 10, 13, 13, 13, 13,…
$ hvidx <dbl> 1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 2, 1, 2, 3, 4, 5, 6,…
$ member_age <dbl+lbl> 35, 33, 13, 11, 8, 0, 38, 35, 16, 33, 11, 48…
$ net_type_used <dbl+lbl> 1, 1, 1, 1, 1, 1, 3, 3, 0, 1, 3, 1, 1, 1, 3, 3…
$ malaria_test_result <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ hemoglobin_level <dbl+lbl> 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0…
$ fever_last_2weeks <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ member_sex <dbl+lbl> 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 1…
glimpse(gr_data)Rows: 82,687
Columns: 8
$ v001 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ v002 <dbl> 4, 4, 4, 4, 7, 7, 10, 13, 13, 13, 13, 13, 20, 20, …
$ v003 <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ mother_age <dbl> 34, 34, 34, 34, 38, 38, 33, 39, 39, 39, 39, 39, 30…
$ iptp_antimalarial <dbl+lbl> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sp_fansidar_doses <dbl+lbl> 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ maternal_hemoglobin <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ birth_weight_g <dbl+lbl> 4000, NA, NA, NA, NA, NA, NA, NA…
In R, missing information is represented by the logical constant NA (Not Available), but its behavior shifts slightly depending on the data type. For numeric variables, an NA acts as a mathematical “hole”—any calculation involving it (like 1 + NA or mean(data)) will result in NA unless you explicitly tell R to ignore it using the na.rm = TRUE argument. For character variables, NA represents a missing string, which is distinct from an empty quote (""). While an empty string is technically data (a word with zero characters), an NA is the total absence of a value. Identifying these gaps is typically done using the is.na() function, which returns a logical TRUE for every missing entry regardless of whether the column is numeric or text-based.
You can use the following commands to get a quick count of how much data is missing across your three datasets:
# Count total missing values in each dataset
sum(is.na(hr_data))[1] 0
sum(is.na(pr_data))[1] 313142
sum(is.na(gr_data))[1] 317431
nrow(gr_data)[1] 82687
# Summary view to see NAs per specific column
summary(gr_data) v001 v002 v003 mother_age
Min. : 1.0 Min. : 0.0 Min. : 1.000 Min. :15.00
1st Qu.: 358.0 1st Qu.: 19.0 1st Qu.: 1.000 1st Qu.:30.00
Median : 861.0 Median : 39.0 Median : 2.000 Median :36.00
Mean : 839.6 Mean : 43.4 Mean : 1.889 Mean :35.69
3rd Qu.:1275.0 3rd Qu.: 63.0 3rd Qu.: 2.000 3rd Qu.:42.00
Max. :1692.0 Max. :1098.0 Max. :22.000 Max. :49.00
iptp_antimalarial sp_fansidar_doses maternal_hemoglobin birth_weight_g
Min. :0.000 Min. : 0.000 Min. : NA Min. : 600
1st Qu.:0.000 1st Qu.: 1.000 1st Qu.: NA 1st Qu.:3000
Median :0.000 Median : 2.000 Median : NA Median :3400
Mean :0.327 Mean : 4.274 Mean :NaN Mean :4725
3rd Qu.:1.000 3rd Qu.: 3.000 3rd Qu.: NA 3rd Qu.:4500
Max. :8.000 Max. :90.000 Max. : NA Max. :9998
NA's :77111 NA's :81094 NA's :82687 NA's :76539
From the above, the variable iptp_antimalarial (Intermittent Preventive Treatment during pregnancy) reveals a significant gap between the total number of respondents and those who received specific antimalarial care. Out of 82,687 total entries, 77,111 contain missing values (NA ) for the use of SP/Fansidar. This high frequency of missingness likely indicates that the vast majority of respondents were either not pregnant during the survey window, did not attend antenatal care where these drugs are distributed, or did not contract malaria.
However, we have identified a core subset of 5,576 cases who did receive these malaria drugs. To ensure our subsequent analysis is precise and focuses specifically on maternal health interventions, we will refine our dataset. By excluding the rows with missing values, we transition from a general population survey to a targeted “pregnancy-intervention” sub-dataset. This allows for a more meaningful evaluation of dosage, timing, and efficacy among those who actually interacted with the malaria prevention protocol.
# Filter for non-missing values and overwrite or save as a new object
gr_data_clean <- gr_data %>%
filter(!is.na(iptp_antimalarial))
# Verify the new dimensions (should be 5,576 rows)
dim(gr_data_clean)[1] 5576 8
# Check the first few rows to ensure NAs are gone
head(gr_data_clean$iptp_antimalarial)<labelled<double>[6]>: during pregnancy took: sp/fansidar for malaria
[1] 1 1 1 0 1 1
Labels:
value label
0 no
1 yes
8 don't know
When analyzing continuous variables like birth weight, identifying outliers is essential for ensuring the integrity of your results. Outliers are extreme values that can disproportionately influence your statistics, often arising from measurement errors or rare clinical cases.
A quick comparison of your summary statistics already hints at a potential issue: the mean is 4,725g, while the median is significantly lower at 3,400g. In a perfectly balanced distribution, these two numbers would be nearly identical. Because the mean is much higher than the median, it suggests that the data is being “pulled” to the right by some exceptionally high values—classic evidence of outliers.
By stacking a histogram and a boxplot, we can see exactly where the data clusters and where the “stragglers” lie. Adding lines for the mean and median will visually demonstrate the “skew” caused by these extreme values.
# Calculate values for the plot
b_mean <- 4725
b_median <- 3400
# Set up the plotting area: 2 rows, 1 column
par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
library(ggplot2)
library(patchwork) # For stacking the plots
# 1. The Histogram
p1 <- ggplot(gr_data, aes(x = birth_weight_g)) +
geom_histogram(fill = "lightgray", color = "white", bins = 30) +
# Add the vertical lines
geom_vline(aes(xintercept = b_median), color = "darkgreen", size = 1.2) +
geom_vline(aes(xintercept = b_mean), color = "blue", size = 1.2) +
# Add the text box (label)
# x and y coordinates may need adjustment based on your specific data height
# Use separate annotations if you want the text itself to be colored
annotate("text", x = 6500, y = 550, label = paste("Median:", b_median), color = "darkgreen", fontface = "bold") +
annotate("text", x = 6500, y = 450, label = paste("Mean:", b_mean), color = "blue", fontface = "bold") +
labs(title = "Distribution of Birth Weight", x = NULL,y="Count") +
theme_classic() +
theme(axis.text.x = element_blank())
# 2. The Boxplot
p2 <- ggplot(gr_data, aes(x = birth_weight_g)) +
geom_boxplot(fill = "tomato", outlier.color = "red", alpha = 0.7) +
labs(x = "Birth Weight (grams)") +
theme_classic() +
theme(axis.text.y = element_blank())
# Combine them (stacking vertically)
p1 / p2# Reset plotting parameters
par(mfrow = c(1, 1))Let us investigate the birth weights by looking at all the unique values recorded.
unique(gr_data$birth_weight_g)<labelled<double>[96]>: birth weight in kilograms (3 decimals)
[1] 4000 NA 3600 3200 3500 2900 9996 3400 3700 3800 3000 2000 2800 2700 4500
[16] 2300 2400 3300 2100 2200 2500 9998 3900 4400 1500 3100 1980 2600 5000 4600
[31] 1000 3260 1200 4300 4700 4200 3745 2790 3250 1900 4800 2570 2480 4110 2520
[46] 3460 2960 1800 1600 1700 2980 3150 3650 3510 6000 5500 1750 5600 3450 1100
[61] 5200 4100 2250 5300 1400 1550 900 750 700 3180 3680 3050 4900 2470 2160
[76] 3360 5800 1300 600 800 2080 2450 2850 2350 2020 2460 2950 5400 3010 3420
[91] 3380 3060 3720 5900 3570 2550
Labels:
value label
9996 not weighed at birth
9998 don't know
Large-scale surveys like the DHS often use placeholder codes to represent specific types of missing data that a standard NA cannot capture. In your output, the values 9996 and 9998 are not actual weights; they are metadata labels indicating “not weighed at birth” and “don’t know,” respectively.
The presence of these “9-series” codes explains why your initial mean was a staggering 4,725g. Because R treats these placeholders as actual numbers, a few instances of “9998” (nearly 10kg) will aggressively pull the average upward, creating a highly distorted picture of newborn health. While extreme biological values can occur, these specific figures are clearly categorical codes and must be handled—either by recoding them to NA or excluding them—before any statistical analysis can be trusted.
We will now filter the dataset to include only plausible birth weights by excluding these special codes.
library(ggplot2)
library(patchwork)
# 1. Exclude the placeholder codes (9996 and 9998)
# We also keep NAs out for this specific calculation
gr_data <- subset(gr_data, birth_weight_g < 9000)
# 2. Calculate the NEW Mean and Median
new_mean <- round(mean(gr_data$birth_weight_g, na.rm = TRUE),2)
new_median <- median(gr_data$birth_weight_g, na.rm = TRUE)
# 3. Redo Visualizations with ggplot2
# Histogram
p1 <- ggplot(gr_data, aes(x = birth_weight_g)) +
geom_histogram(fill = "lightgray", color = "white", bins = 40) +
# Add the vertical lines
geom_vline(aes(xintercept = b_median), color = "darkgreen", size = 1.2) +
geom_vline(aes(xintercept = b_mean), color = "blue", size = 1.2) +
# Add the text box (label)
# x and y coordinates may need adjustment based on your specific data height
# Use separate annotations if you want the text itself to be colored
annotate("text", x = 1000, y = 550, label = paste("Median:", new_median), color = "darkgreen", fontface = "bold") +
annotate("text", x = 1000, y = 450, label = paste("Mean:", new_mean), color = "blue", fontface = "bold") +
labs(title = "Cleaned Birth Weight Distribution", x = NULL,y="Count") +
theme_classic() +
theme(axis.text.x = element_blank())
# Boxplot
p2 <- ggplot(gr_data, aes(x = birth_weight_g)) +
geom_boxplot(fill = "tomato", alpha = 0.7) +
labs(x = "Birth Weight (grams)") +
theme_classic() +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank())
# Combine plots
p1 / p2It is important to maintain a skeptical but scientific eye. While 9998 is a known placeholder code, a value like 6000g (6kg) might appear as an outlier but could be a legitimate case of fetal macrosomia. When you encounter values that are biologically possible but extreme (e.g., > 5000g or < 500g), they should be investigated against other variables—such as delivery method or maternal health—rather than being deleted immediately. However, purely symbolic codes like 9996 must always be removed from numeric calculations.
nrow(gr_data)[1] 4755
With the above, our gr_data has reduced 4755 down from 82,687.
To ensure our analysis is both representative and comprehensive, we must consolidate our information through a strategic Relational Join.
The necessity for this merge is driven by three key factors:
Preserving the Sampling Frame: Since we are focusing on a specific random sample of 100 households from the hr_data, we must use this as our “Master” dataset. Merging ensures that we maintain the context of these specific households, even if some don’t have matching records in the specialized malaria subset.
Handling Targeted Subsets: Our cleaning of gr_data reduced the records to 4,755 specific malaria-related cases. By joining this with our households, we can identify which women within our target sample experienced malaria during pregnancy without losing the denominator of the total sampled population.
Cross-Variable Insights: Information is currently siloed. To understand the relationship between prevention (bednet use from pr_data) and outcomes (malaria during pregnancy from gr_data), we need a single unified row for each individual.
We will perform a Left Join, using the hr_data (Households) as the foundation. This ensures we keep all 1000 sampled households and “pull in” malaria and bednet data where available. In DHS data, records are typically linked by a combination of Cluster Number (v001), Household Number (v002), and sometimes Line Number (v003) for individuals.
Important Considerations
Duplicate Rows: Some persons may be repeated in the gr_data (e.g., if they had multiple pregnancies). A join will create a new row for each match. One should verify if you want to analyze “per pregnancy” or “per woman.”
Key Matching: Ensure the variable names for IDs are identical across sets (e.g., hv001 in one might be v001 in another). You can specify this using by = c("hv001" = "v001").
Missing Values after Merge: After a Left Join, households that did not have a matching record in the malaria data will have NA in those columns. This is expected and correct—it simply means those specific households did not report a malaria case in the gr_data subset.
library(dplyr)
# 1. First Join: Households (hr) + Malaria Data (gr)
# We use a left_join to keep all 1000 households even if no malaria case was recorded
combined_data <- pr_data %>%
left_join(gr_data, by = c("hv001"="v001", "hv002"="v002","hvidx"="v003"))
# 3. Verification of the Merge
# Check the dimensions to ensure we haven't lost our primary household focus
dim(combined_data)[1] 156974 14
# 2. Second Join: Adding Bednet Data (pr)
# This adds the individual-level behavior to our household/malaria records
# Note: hv001, hv002, and hv003 are standard DHS identifiers
final_dataset <- hr_data %>%
left_join(combined_data, by = c("hv001", "hv002"))
# 3. Verification of the Merge
# Check the dimensions to ensure we haven't lost our primary household focus
dim(final_dataset)[1] 4227 21
glimpse(final_dataset)Rows: 4,227
Columns: 21
$ hv001 <dbl> 1622, 1622, 1622, 848, 848, 75, 75, 75, 75, 1053, …
$ hv002 <dbl> 21, 21, 21, 6, 6, 49, 49, 49, 49, 2, 2, 2, 34, 34,…
$ wealth_index <dbl+lbl> 3, 3, 3, 4, 4, 1, 1, 1, 1, 1, 1, 1, 5, 5, 3, 3…
$ residence_type <dbl+lbl> 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2…
$ altitude_meters <dbl> 1696, 1696, 1696, 1944, 1944, 74, 74, 74, 74, 1532…
$ has_any_net <dbl+lbl> 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1…
$ itn_count <dbl+lbl> 2, 2, 2, 0, 0, 3, 3, 3, 3, 0, 0, 0, 2, 2, 2, 2…
$ wall_material <dbl+lbl> 31, 31, 31, 32, 32, 21, 21, 21, 21, 21, 21, 21…
$ water_source <dbl+lbl> 41, 41, 41, 12, 12, 13, 13, 13, 13, 43, 43, 43…
$ hvidx <dbl> 1, 2, 3, 1, 2, 1, 2, 3, 4, 1, 2, 3, 1, 2, 1, 2, 3,…
$ member_age <dbl+lbl> 42, 14, 9, 35, 12, 75, 71, 10, 12, 30, 7, 2…
$ net_type_used <dbl+lbl> 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 3, 1, 1, 1…
$ malaria_test_result <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ hemoglobin_level <dbl+lbl> 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1…
$ fever_last_2weeks <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ member_sex <dbl+lbl> 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2…
$ mother_age <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ iptp_antimalarial <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ sp_fansidar_doses <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ maternal_hemoglobin <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ birth_weight_g <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA…
With the datasets now merged, the glimpse of final_dataset reveals the “joined” reality: we have a rich collection of household characteristics, individual behaviors, and clinical outcomes. However, a significant portion of the data contains NA values or redundant identifiers that may clutter our analysis.
Performing Summary Statistics is our next logical step. This allows us to assess the distribution of each variable, identify the prevalence of missing data, and determine which columns provide enough variation to be useful for modeling. By scanning these summaries, we can distinguish between “anchor” variables—like wealth index or net usage—and those that are too sparsely populated to support a robust conclusion.
To achieve this, we will use the skimr package. Unlike the standard summary() function, skim() provides a much more detailed and visually intuitive preview, including missing value counts (n_missing), completion rates, and even “sparkline” histograms to show the shape of your data at a glance. It is particularly powerful for large datasets where you need to quickly spot which variables are mostly empty or skewed.
# Load the package
library(skimr)
# Generate a detailed summary of the final dataset
# This will group variables by type (numeric, factor, etc.)
skim(final_dataset)| Name | final_dataset |
| Number of rows | 4227 |
| Number of columns | 21 |
| _______________________ | |
| Column type frequency: | |
| numeric | 21 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| hv001 | 0 | 1.00 | 849.46 | 488.46 | 1 | 408 | 864 | 1297.5 | 1688 | ▇▆▇▇▇ |
| hv002 | 0 | 1.00 | 43.28 | 28.93 | 1 | 19 | 39 | 63.0 | 139 | ▇▆▅▂▁ |
| wealth_index | 0 | 1.00 | 2.72 | 1.40 | 1 | 1 | 3 | 4.0 | 5 | ▇▆▅▆▃ |
| residence_type | 0 | 1.00 | 1.67 | 0.47 | 1 | 1 | 2 | 2.0 | 2 | ▃▁▁▁▇ |
| altitude_meters | 0 | 1.00 | 1317.76 | 703.64 | 1 | 836 | 1412 | 1842.0 | 3342 | ▅▅▇▂▁ |
| has_any_net | 0 | 1.00 | 0.69 | 0.46 | 0 | 0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| itn_count | 0 | 1.00 | 1.90 | 1.75 | 0 | 0 | 2 | 3.0 | 7 | ▇▃▅▁▁ |
| wall_material | 0 | 1.00 | 27.47 | 12.76 | 11 | 21 | 31 | 32.0 | 96 | ▇▇▁▁▁ |
| water_source | 0 | 1.00 | 31.26 | 17.47 | 11 | 14 | 31 | 43.0 | 96 | ▇▇▂▁▁ |
| hvidx | 0 | 1.00 | 3.35 | 2.24 | 1 | 2 | 3 | 5.0 | 16 | ▇▂▁▁▁ |
| member_age | 0 | 1.00 | 23.82 | 19.56 | 0 | 9 | 18 | 35.0 | 98 | ▇▃▂▁▁ |
| net_type_used | 0 | 1.00 | 0.60 | 0.78 | 0 | 0 | 0 | 1.0 | 3 | ▇▆▁▁▁ |
| malaria_test_result | 4227 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| hemoglobin_level | 0 | 1.00 | 0.41 | 0.49 | 0 | 0 | 0 | 1.0 | 1 | ▇▁▁▁▆ |
| fever_last_2weeks | 4227 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| member_sex | 0 | 1.00 | 1.52 | 0.50 | 1 | 1 | 2 | 2.0 | 2 | ▇▁▁▁▇ |
| mother_age | 4110 | 0.03 | 27.57 | 5.61 | 16 | 23 | 27 | 31.0 | 44 | ▂▇▅▂▁ |
| iptp_antimalarial | 4122 | 0.02 | 0.41 | 0.88 | 0 | 0 | 0 | 1.0 | 8 | ▇▁▁▁▁ |
| sp_fansidar_doses | 4192 | 0.01 | 8.43 | 16.91 | 1 | 2 | 4 | 5.0 | 90 | ▇▁▁▁▁ |
| maternal_hemoglobin | 4227 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| birth_weight_g | 4110 | 0.03 | 3162.22 | 682.83 | 1300 | 2800 | 3200 | 3500.0 | 5000 | ▁▃▇▃▁ |
Key Observations from the Output
The “Dead” Variables (100% Missing)
malaria_test_result, fever_last_2weeks, and maternal_hemoglobin variables have a complete_rate of 0.00. They contain zero data points for the sampled 1000 households.High-Missingness Pregnancy Data
iptp_antimalarial, sp_fansidar_doses, mother_age, and birth_weight_g show a complete_rate between 0.01 and 0.03.Strange Values & Data Types
sp_fansidar_doses: The mean is 8.43 with a max of 90. Standard malaria protocols (SP/Fansidar) usually involve 3 doses. A value of 90 is likely a placeholder code (similar to 9998) or a significant entry error.
birth_weight_g: Now that we removed the 9000+ codes, the distribution looks much better (p50 = 3200g). However, the min of 1300g and max of 5000g are extreme but biologically possible.
itn_count: The max is 7. While high, it is plausible for large households.
2. Corrective Cleaning Strategy
We need to strip away the “zero-info” columns and handle the extreme placeholder in the doses variable.
# 1. Clean and Subset
final_dataset <- final_dataset %>%
# Remove variables with 0% completion rate and redundant IDs
select(-malaria_test_result, -fever_last_2weeks, -maternal_hemoglobin) %>%
# Filter out the suspicious '90' doses if they are placeholders
mutate(sp_fansidar_doses = ifelse(sp_fansidar_doses > 10, NA, sp_fansidar_doses))
# 2. Final check of the structure
str(final_dataset)tibble [4,227 × 18] (S3: tbl_df/tbl/data.frame)
$ hv001 : num [1:4227] 1622 1622 1622 848 848 ...
..- attr(*, "label")= chr "cluster number"
..- attr(*, "format.stata")= chr "%8.0g"
$ hv002 : num [1:4227] 21 21 21 6 6 49 49 49 49 2 ...
..- attr(*, "label")= chr "household number"
..- attr(*, "format.stata")= chr "%8.0g"
$ wealth_index : dbl+lbl [1:4227] 3, 3, 3, 4, 4, 1, 1, 1, 1, 1, 1, 1, 5, 5, 3, 3, 3, 3,...
..@ label : chr "wealth index combined"
..@ format.stata: chr "%8.0g"
..@ labels : Named num [1:5] 1 2 3 4 5
.. ..- attr(*, "names")= chr [1:5] "poorest" "poorer" "middle" "richer" ...
$ residence_type : dbl+lbl [1:4227] 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2,...
..@ label : chr "type of place of residence"
..@ format.stata: chr "%8.0g"
..@ labels : Named num [1:2] 1 2
.. ..- attr(*, "names")= chr [1:2] "urban" "rural"
$ altitude_meters : num [1:4227] 1696 1696 1696 1944 1944 ...
..- attr(*, "label")= chr "cluster altitude in meters"
..- attr(*, "format.stata")= chr "%8.0g"
$ has_any_net : dbl+lbl [1:4227] 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,...
..@ label : chr "has mosquito bed net for sleeping"
..@ format.stata: chr "%8.0g"
..@ labels : Named num [1:2] 0 1
.. ..- attr(*, "names")= chr [1:2] "no" "yes"
$ itn_count : dbl+lbl [1:4227] 2, 2, 2, 0, 0, 3, 3, 3, 3, 0, 0, 0, 2, 2, 2, 2, 2, 2,...
..@ label : chr "number of mosquito bed nets"
..@ format.stata: chr "%8.0g"
..@ labels : Named num 98
.. ..- attr(*, "names")= chr "don't know"
$ wall_material : dbl+lbl [1:4227] 31, 31, 31, 32, 32, 21, 21, 21, 21, 21, 21, 21, 31, 3...
..@ label : chr "main wall material"
..@ format.stata: chr "%8.0g"
..@ labels : Named num [1:20] 10 11 12 13 20 21 22 23 24 25 ...
.. ..- attr(*, "names")= chr [1:20] "natural" "no walls" "cane/palm/trunks" "dirt" ...
$ water_source : dbl+lbl [1:4227] 41, 41, 41, 12, 12, 13, 13, 13, 13, 43, 43, 43, 14, 1...
..@ label : chr "source of drinking water"
..@ format.stata: chr "%8.0g"
..@ labels : Named num [1:19] 10 11 12 13 14 20 21 30 31 32 ...
.. ..- attr(*, "names")= chr [1:19] "piped water" "piped into dwelling" "piped to yard/plot" "piped to neighbor" ...
$ hvidx : num [1:4227] 1 2 3 1 2 1 2 3 4 1 ...
..- attr(*, "label")= chr "line number"
..- attr(*, "format.stata")= chr "%8.0g"
$ member_age : dbl+lbl [1:4227] 42, 14, 9, 35, 12, 75, 71, 10, 12, 30, 7, 2, 58, 2...
..@ label : chr "age of household members"
..@ format.stata: chr "%8.0g"
..@ labels : Named num [1:2] 95 98
.. ..- attr(*, "names")= chr [1:2] "95+" "don't know"
$ net_type_used : dbl+lbl [1:4227] 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 3, 1, 1, 1, 1, 1,...
..@ label : chr "type of mosquito bed net(s) person slept under last night"
..@ format.stata: chr "%8.0g"
..@ labels : Named num [1:4] 0 1 2 3
.. ..- attr(*, "names")= chr [1:4] "did not sleep under a net" "only treated (itn) nets" "both treated (itn) and untreated nets" "only untreated nets"
$ hemoglobin_level : dbl+lbl [1:4227] 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1,...
..@ label : chr "person slept under an llin net"
..@ format.stata: chr "%8.0g"
..@ labels : Named num [1:2] 0 1
.. ..- attr(*, "names")= chr [1:2] "no" "yes"
$ member_sex : dbl+lbl [1:4227] 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1,...
..@ label : chr "sex of household member"
..@ format.stata: chr "%8.0g"
..@ labels : Named num [1:2] 1 2
.. ..- attr(*, "names")= chr [1:2] "male" "female"
$ mother_age : num [1:4227] NA NA NA NA NA NA NA NA NA NA ...
..- attr(*, "label")= chr "respondent's current age"
..- attr(*, "format.stata")= chr "%8.0g"
$ iptp_antimalarial: dbl+lbl [1:4227] NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
..@ label : chr "during pregnancy took: sp/fansidar for malaria"
..@ format.stata: chr "%8.0g"
..@ labels : Named num [1:3] 0 1 8
.. ..- attr(*, "names")= chr [1:3] "no" "yes" "don't know"
$ sp_fansidar_doses: num [1:4227] NA NA NA NA NA NA NA NA NA NA ...
$ birth_weight_g : dbl+lbl [1:4227] NA, NA, NA, NA, NA, NA, NA, NA, NA,...
..@ label : chr "birth weight in kilograms (3 decimals)"
..@ format.stata: chr "%8.0g"
..@ labels : Named num [1:2] 9996 9998
.. ..- attr(*, "names")= chr [1:2] "not weighed at birth" "don't know"
The data is currently in what we call a long format. This can be useful when conducting faceted analysis. Consider the below analysis.
A faceted plot allows you to see how the relationship between two variables changes across different categories. Consider this question: “Does the effectiveness of bednet ownership on birth weight differ between Rural and Urban areas across different Wealth levels?”
By using facet_grid(residence_type ~ wealth_index), we create a matrix of plots. Each “cell” shows the distribution of birth weights for people with vs. without nets in that specific socio-economic context.
library(ggplot2)
ggplot(final_dataset, aes(x = factor(has_any_net), y = birth_weight_g, fill = factor(has_any_net))) +
geom_boxplot() +
facet_grid(residence_type ~ wealth_index) + # Faceting by Residence and Wealth
labs(title = "Birth Weight by Net Ownership across Wealth and Residence",
x = "Has Any Bednet (0=No, 1=Yes)",
y = "Birth Weight (g)",
fill = "Net Ownership") +
theme_bw()As we will do some analysis going forward, we first ensure all the categorical variables are in the correct format for analysis in R.
Converting numeric codes into Factors in R is a critical step for analysis. It ensures that R treats these variables as categories rather than continuous numbers. For variables like wealth_index, we will use Ordered Factors, which tells R that “Poorest” is lower than “Poorer,” allowing for better trend analysis in your models.
final_dataset <- final_dataset %>%
mutate(
# --- Ordered Factors (Hierarchical) ---
wealth_index = factor(wealth_index,
levels = c(1, 2, 3, 4, 5),
labels = c("Poorest", "Poorer", "Middle", "Richer", "Richest"),
ordered = TRUE),
sp_fansidar_doses = factor(sp_fansidar_doses,
levels = c(0, 1, 2, 3),
labels = c("None", "1 dose", "2 doses", "3+ doses"),
ordered = TRUE),
# --- Nominal Factors (Unordered Categories) ---
residence_type = factor(residence_type,
levels = c(1, 2),
labels = c("Urban", "Rural")),
has_any_net = factor(has_any_net,
levels = c(0, 1),
labels = c("No", "Yes")),
member_sex = factor(member_sex,
levels = c(1, 2),
labels = c("Male", "Female")),
iptp_antimalarial = factor(iptp_antimalarial,
levels = c(0, 1, 8),
labels = c("No", "Yes", "Don't Know")),
net_type_used = factor(net_type_used,
levels = c(0, 1, 2, 3),
labels = c("None", "ITN", "Treated (non-ITN)", "Untreated"))
)We redo the analysis above with the categorical variables well constituted. Compare the difference.
ggplot(final_dataset, aes(x = factor(has_any_net), y = birth_weight_g, fill = factor(has_any_net))) +
geom_boxplot() +
facet_grid(residence_type ~ wealth_index) + # Faceting by Residence and Wealth
labs(title = "Birth Weight by Net Ownership across Wealth and Residence",
x = "Has Any Bednet",
y = "Birth Weight (g)",
fill = "Net Ownership") +
theme_bw() +
theme(legend.position = "none")Sometimes we want to compare two different numeric variables for the same household on the same scale. To do this effectively in ggplot2, we “pivot” the data into a long format (which R ironically calls “tidying”), but the reason we do it is to compare variables that were originally sitting side-by-side in wide format. The Question: “How does the distribution of Mother’s Age compare to the distribution of Birth Weight (scaled) for each household?”
In your current wide format, mother_age and birth_weight_g are in separate columns. If you want to plot them as two different colored lines on the same chart, you pivot them.
library(tidyr)
# 1. Prepare the data by scaling and pivoting
comparison_data <- final_dataset %>%
# Select only the relevant columns and unique identifiers
select(hv001, hv002, mother_age, birth_weight_g) %>%
# Drop rows where either is missing for a fair comparison
filter(!is.na(mother_age), !is.na(birth_weight_g)) %>%
# Scale variables so they share a mean of 0 and SD of 1
mutate(
scaled_age = as.numeric(scale(mother_age)),
scaled_weight = as.numeric(scale(birth_weight_g))
) %>%
# Pivot from WIDE (two columns) to LONG (one column for plot mapping)
pivot_longer(
cols = c(scaled_age, scaled_weight),
names_to = "health_metric",
values_to = "standardized_value"
)
# 2. Visualize the comparison
ggplot(comparison_data, aes(x = standardized_value, fill = health_metric)) +
geom_density(alpha = 0.5) + # Density plot shows the "shape" of the data
scale_fill_manual(
values = c("scaled_age" = "darkgreen", "scaled_weight" = "blue"),
labels = c("Mother's Age", "Birth Weight")
) +
labs(
title = "Comparison of Normalized Maternal Age and Birth Weight",
subtitle = "Data pivoted from wide to long for distribution mapping",
x = "Standardized Value (Z-Score)",
y = "Density",
fill = "Metric"
) +
theme_minimal()Widee-to-long transformation is essential whenever you want to map multiple columns to a single aesthetic (like color or fill). This comparison tells you if the “spread” of maternal ages is as wide as the “spread” of birth weights in your sample.