This report shows the initial finding of the descriptive analysis of child malnutrition in Pakistan,focusing distribution of stunting, wasting and underweight across different wealth groups. ## 2.Setup and data loading: This section prepares the environment by loading necessary R libraries and importing the data-set, ensuring variables are in the correct format for analysis. ### 2.1 Load Libraries:
library(haven)
## Warning: package 'haven' was built under R version 4.5.1
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.5.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(psych)
## Warning: package 'psych' was built under R version 4.5.1
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(tidyr)
library(naniar)
## Warning: package 'naniar' was built under R version 4.5.1
In this section Kids data, Household data and Women data is loading in R.
kids <- read_dta("C:/Users/ameen/OneDrive - Higher Education Commission/Documents/Talha csc scholarship china/open door/itmo docs/introduction to R and more/1 Term/reaserch in ternship/Pakistan reaserch/PKKR71DT kids/PKKR71FL.DTA")
household <- read_dta("C:/Users/ameen/OneDrive - Higher Education Commission/Documents/Talha csc scholarship china/open door/itmo docs/introduction to R and more/1 Term/reaserch in ternship/Pakistan reaserch/PKHR71DT household/PKHR71FL.DTA")
women <- read_dta("C:/Users/ameen/OneDrive - Higher Education Commission/Documents/Talha csc scholarship china/open door/itmo docs/introduction to R and more/1 Term/reaserch in ternship/Pakistan reaserch/PKIR71DT women/PKIR71FL.DTA")
In this step variables that are useful for analysis are seperated from the raw data.
kids_selective <- kids %>%
select(v001, v002, v003, b16, midx,
hw70, hw71, hw72, # nutrition outcomes
hw1, hw13, # child age + reason for missing
v190, h2:h9, m70)
household_selective <- household %>%
select(hv001, hv002, hv201, hv202, hv204, hv205, hv226) %>%
rename(v001 = hv001, v002 = hv002)
women_selective <- women %>%
select(v001, v002, v003, v467a:v467f)
This step check while merging Kids and women data is data loss during merging?
# --- 2.1. Linked check ---
kids_selective$linked <- ifelse(
paste(kids_selective$v001, kids_selective$v002, kids_selective$v003) %in%
paste(women_selective$v001, women_selective$v002, women_selective$v003),
1, 0
)
cat("\nLinked vs Unlinked children (pre-merge):\n")
##
## Linked vs Unlinked children (pre-merge):
print(table(kids_selective$linked))
##
## 1
## 12708
The results show that all 12708 children merged successfully with their respective mothers no data is lost during merging.
In this step all selective data sets are merged.
kids_selective <- kids_selective %>% mutate(b16 = as.numeric(b16))
merged_datasets <- kids_selective %>%
left_join(household_selective, by = c("v001","v002")) %>%
left_join(women_selective, by = c("v001","v002","v003"))
Summarize the variable hw1 (child’s age in months) to understand its distribution and check for missing values.
summary(merged_datasets$hw1)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 14.00 29.00 29.16 44.00 59.00 8209
The variable hw1 (children’s age in months) ranges from 0 to 59 months. The median age is 29 months, close to the mean of 29.16, suggesting a roughly symmetric distribution. The first and third quartiles are 14 and 44 months, respectively. There are 8,209 missing values, indicating a substantial amount of incomplete data. ### 4.2 Histogram of Children’s Age:
hist(merged_datasets$hw1, main = "Age distribution of children", xlab = "Age in months")
#### Interpretation: This histogram shows the number of children at
different ages (in months). Overall, the sample is well-distributed
across the full 0–60 month range, with slightly more infants (0–5
months) and slightly fewer children in the oldest age group (55–60
months). This indicates a large and fairly even sample, which is
suitable for studying age-related changes in malnutrition. ### 4.3 Check
for kids without a household match: This counts how many children don’t
have household information (hv201), showing if the merge missed any
matches.
sum(is.na(merged_datasets$hv201))
## [1] 0
The result [1] 0 means that all children in the data-set have matching household information. There are no missing values in hv201, so the merge worked correctly. ## 5 Data Cleaning and Variable Preparation: In this step, we cleaned and prepared the merged dat-aset. Variables were renamed to meaningful names for easier interpretation. Children were grouped into age categories (0–11 months, 12–23 months, etc.), and anthropometry scores (stunting, underweight, wasting) were converted back to z-scores and cleaned for implausible values. Wealth index was recoded into 5 ordered categories from poorest to richest for analysis.
merged_datasets <- merged_datasets %>%
rename(
cluster_number = v001,
household_number = v002,
child_index = b16,
mother_index = v003,
stunting = hw70,
underweight = hw71,
wasting = hw72,
wealth_index = v190,
water_source = hv201,
time_to_water = hv202,
electricity = hv204,
toilet_facility = hv205,
cooking_fuel = hv226,
postnatal_care = m70,
know_where_to_go = v467a,
getting_permission_to_go = v467b,
money_needed = v467c,
distance_ti_health = v467d,
transport = v467e,
wait_to_go_alone = v467f
) %>%
mutate(
# 5.1 Child age groups
age_group = cut(hw1,
breaks = seq(0, 60, by = 12),
labels = c("0–11m", "12–23m", "24–35m", "36–47m", "48–59m"),
right = FALSE),
#5.2 Convert anthropometry scores back to z-scores
stunting = stunting / 100,
underweight = underweight / 100,
wasting = wasting / 100,
#5.3 Clean implausible values
stunting = ifelse(stunting < -6 | stunting > 6, NA, stunting),
underweight = ifelse(underweight < -6 | underweight > 6, NA, underweight),
wasting = ifelse(wasting < -5 | wasting > 5, NA, wasting),
# Wealth categories (5 groups)
wealth_cat = case_when(
wealth_index == 1 ~ "poorest",
wealth_index == 2 ~ "poorer",
wealth_index == 3 ~ "middle",
wealth_index == 4 ~ "richer",
wealth_index == 5 ~ "richest",
TRUE ~ NA_character_
),
wealth_cat = factor(wealth_cat, levels = c("poorest","poorer","middle","richer","richest"))
)
In this section, we examine how child malnutrition indicators vary across age groups (0–11 months, 12–23 months, etc.). Three key nutrition outcomes are considered: stunting (HAZ), underweight (WAZ), and wasting (WHZ). ### 6.1 Boxplots by age group.
ggplot(merged_datasets, aes(x = age_group, y = stunting, fill = age_group)) +
geom_boxplot() + theme_minimal() +
labs(title = "Stunting (HAZ) distribution by child age group", x="Age group (months)", y="HAZ")
## Warning: Removed 8567 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
ggplot(merged_datasets, aes(x = age_group, y = underweight, fill = age_group)) +
geom_boxplot() + theme_minimal() +
labs(title = "Underweight (WAZ) distribution by child age group", x="Age group (months)", y="WAZ")
## Warning: Removed 8443 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
ggplot(merged_datasets, aes(x = age_group, y = wasting, fill = age_group)) +
geom_boxplot() + theme_minimal() +
labs(title = "Wasting (WHZ) distribution by child age group", x="Age group (months)", y="WHZ")
## Warning: Removed 8557 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
#### Interpretation: Stunting (HAZ) – Chronic Malnutrition: Stunting
increases with age. Infants (0–11 months) are close to the healthy
range, but from 12–35 months, median HAZ drops below -1.5. For older
children (36–59 months), stunting remains low, showing that once it
occurs, it persists.
Underweight (WAZ) – Combined Malnutrition: Underweight also worsens with age. Median WAZ is below 0 for all groups, and the lower quartiles are near -2, indicating that many children are underweight across all ages.
Wasting (WHZ) – Acute Malnutrition: Wasting is more stable, with median WHZ near 0. However, some children in all age groups fall below -2, showing ongoing risk of acute malnutrition, often due to short-term illness or food shortage. ### 6.2 line chart by age group:
make_line_chart <- function(data, var, var_label){
line_data <- data %>%
group_by(age_group) %>%
summarise(mean_val = mean(.data[[var]], na.rm = TRUE),
sd_val = sd(.data[[var]], na.rm = TRUE),
n = n()) %>%
mutate(se_val = sd_val / sqrt(n))
ggplot(line_data, aes(x = age_group, y = mean_val, group = 1)) +
geom_line(color = "blue", size = 1) +
geom_point(size = 3, color = "red") +
geom_errorbar(aes(ymin = mean_val - se_val, ymax = mean_val + se_val),
width = 0.1, color = "black") +
theme_minimal() +
labs(title = paste("Average", var_label, "Z-score by Child Age Group"),
x = "Age group (months)",
y = paste("Mean", var_label, "Z-score")) +
geom_hline(yintercept = -2, linetype = "dashed", color = "darkred") +
annotate("text", x = 2.5, y = -2.1,
label = paste(var_label, "cutoff (Z = -2)"),
color = "darkred", hjust = 0)
}
# Run charts
p_stunting_line <- make_line_chart(merged_datasets, "stunting", "Stunting (HAZ)")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
p_underweight_line <- make_line_chart(merged_datasets, "underweight", "Underweight (WAZ)")
p_wasting_line <- make_line_chart(merged_datasets, "wasting", "Wasting (WHZ)")
print(p_stunting_line)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
print(p_underweight_line)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
print(p_wasting_line)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
#### Interpretation: Stunting (HAZ): Stunting gets worse with age. The
mean HAZ drops sharply from infancy and crosses the −2 cutoff around
36–47 months, meaning older children are on average chronically
malnourished.
Underweight (WAZ): Underweight also declines with age but less severely. The mean WAZ stays below the healthy average (0) in all groups but does not cross the severe −2 cutoff.
Wasting (WHZ): Wasting remains fairly stable across age groups, with
the mean WHZ close to 0 and well above the −2 cutoff. This shows that
acute malnutrition is less of a population-wide problem and occurs more
as short-term cases. ## 7 Nutritional Outcomes by Wealth Category
(Heatmaps): We used heatmaps to examine how child nutrition outcomes
differ across wealth categories.
The plots show the percentage distribution of stunting, underweight, and
wasting across different levels of household wealth.
This helps visualize socioeconomic disparities in child
malnutrition.
make_single_heatmap <- function(data, var, var_label){
df <- data %>%
filter(!is.na(!!rlang::sym(var))) %>% # drop NA values
mutate(value100 = round(!!rlang::sym(var) * 100))
# DHS-style binning (-600 to 600)
breaks <- seq(-600, 600, by = 100)
labels <- paste0(breaks[-length(breaks)], "-", (breaks[-1]-1))
df <- df %>%
mutate(value_bin = cut(value100, breaks = breaks, labels = labels,
include.lowest = TRUE, right = TRUE))
heat_data <- df %>%
group_by(wealth_cat, value_bin) %>%
summarise(n = n(), .groups = "drop") %>%
complete(wealth_cat, value_bin, fill = list(n = 0)) %>%
group_by(wealth_cat) %>%
mutate(total_in_row = sum(n),
pct_within_wealth = ifelse(total_in_row==0, NA, 100*n/total_in_row)) %>%
ungroup()
ggplot(heat_data, aes(x=value_bin, y=wealth_cat, fill=pct_within_wealth)) +
geom_tile(color="white", size=0.35) +
geom_text(aes(label = ifelse(is.na(pct_within_wealth), "", sprintf("%.1f%%", pct_within_wealth))),
color="black", size=3.1) +
geom_text(aes(label = ifelse(is.na(pct_within_wealth), "", sprintf("%.1f%%", pct_within_wealth))),
color="white", size=2.6) +
scale_fill_viridis_c(name="% within wealth", option="viridis", na.value="white") +
labs(title=var_label, x=paste(var_label,"(binned)"), y="Wealth") +
theme_minimal() +
theme(axis.text.x = element_text(angle=45, hjust=1),
panel.grid = element_blank()) +
scale_y_discrete(limits = rev(levels(heat_data$wealth_cat)))
}
# Run heatmaps
p_stunting <- make_single_heatmap(merged_datasets, "stunting", "Stunting")
p_underweight <- make_single_heatmap(merged_datasets, "underweight", "Underweight")
p_wasting <- make_single_heatmap(merged_datasets, "wasting", "Wasting")
print(p_stunting)
print(p_underweight)
print(p_wasting)
Stunting (HAZ) – Clear Wealth Gradient
Stunting shows the strongest wealth gradient. Children from poorer
households are much more likely to have severe stunting, while richer
households cluster closer to the healthy range.
Underweight (WAZ) – Strong Wealth Gradient
Underweight also declines as wealth increases. Poorer children are more
concentrated in the severe underweight bins, while richer children are
closer to normal values.
Wasting (WHZ) – Weak Wealth Gradient
Wasting shows the weakest wealth gradient. Most children in all wealth
groups fall near normal values, though the poorest still face slightly
higher risks of severe wasting. ## 8 Missingness Analysis: We examined
missing data across all variables. The table below highlights the
variables with the highest percentage of missing values to identify
potential data quality issues.
missing_counts <- sapply(merged_datasets, function(x) sum(is.na(x)))
missing_percent <- round(sapply(merged_datasets, function(x) mean(is.na(x))*100), 2)
missing_table <- data.frame(
Variable = names(missing_counts),
Missing_Count = missing_counts,
Missing_Percent = missing_percent
)
missing_table[order(-missing_table$Missing_Percent), ]
## Variable Missing_Count Missing_Percent
## know_where_to_go know_where_to_go 12708 100.00
## transport transport 12708 100.00
## time_to_water time_to_water 12457 98.02
## h7d h7d 9930 78.14
## h7m h7m 9930 78.14
## h7y h7y 9930 78.14
## h8d h8d 9896 77.87
## h8m h8m 9896 77.87
## h8y h8y 9896 77.87
## h5d h5d 9604 75.57
## h5m h5m 9604 75.57
## h5y h5y 9604 75.57
## h6d h6d 9574 75.34
## h6m h6m 9574 75.34
## h6y h6y 9574 75.34
## h3d h3d 9272 72.96
## h3m h3m 9272 72.96
## h3y h3y 9272 72.96
## h4d h4d 9233 72.66
## h4m h4m 9233 72.66
## h4y h4y 9233 72.66
## h2d h2d 8961 70.51
## h2m h2m 8961 70.51
## h2y h2y 8961 70.51
## stunting stunting 8567 67.41
## wasting wasting 8557 67.34
## underweight underweight 8443 66.44
## hw1 hw1 8209 64.60
## age_group age_group 8209 64.60
## h5 h5 5557 43.73
## h7 h7 5557 43.73
## h4 h4 5556 43.72
## h6 h6 5556 43.72
## h8 h8 5556 43.72
## h2 h2 5555 43.71
## h3 h3 5555 43.71
## h9 h9 5555 43.71
## postnatal_care postnatal_care 4433 34.88
## child_index child_index 719 5.66
## distance_ti_health distance_ti_health 12 0.09
## getting_permission_to_go getting_permission_to_go 10 0.08
## money_needed money_needed 10 0.08
## wait_to_go_alone wait_to_go_alone 10 0.08
## cooking_fuel cooking_fuel 4 0.03
## hw13 hw13 3 0.02
## cluster_number cluster_number 0 0.00
## household_number household_number 0 0.00
## mother_index mother_index 0 0.00
## midx midx 0 0.00
## wealth_index wealth_index 0 0.00
## linked linked 0 0.00
## water_source water_source 0 0.00
## electricity electricity 0 0.00
## toilet_facility toilet_facility 0 0.00
## wealth_cat wealth_cat 0 0.00
In this section we check reason for messingness.
merged_datasets <- merged_datasets %>%
mutate(hw13_reason = case_when(
hw13 == 0 ~ "Measured",
hw13 == 1 ~ "Child died",
hw13 == 2 ~ "Child sick",
hw13 == 3 ~ "Not present",
hw13 == 4 ~ "Refused",
hw13 == 5 ~ "Mother refused",
hw13 == 6 ~ "Other",
hw13 == 7 ~ "No measurement in household",
TRUE ~ NA_character_
))
cat("\nReasons for missing stunting/underweight data (hw13):\n")
##
## Reasons for missing stunting/underweight data (hw13):
print(table(merged_datasets$hw13_reason, useNA="ifany"))
##
## Child died Measured
## 719 4227
## No measurement in household Not present
## 7490 111
## Other Refused
## 72 86
## <NA>
## 3
Most missing cases are due to children not being measured, followed by child deaths, refusals, or absence. ## 10 Visual Missingness Check (naniar plots): In this section different plot help to visualize missingness pattern.
# Missingness across variables
gg_miss_var(merged_datasets)
# Upset plot for overlapping missingness
gg_miss_upset(merged_datasets)
# Missingness by wealth group
gg_miss_var(merged_datasets, facet = "wealth_cat")
To explore how household and healthcare-related factors cluster together, we calculated Spearman correlations among key structural variables (e.g., water source, electricity, sanitation, cooking fuel). Correlations were also examined separately by wealth groups to assess whether these relationships vary by socioeconomic status.
library(dplyr)
library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.5.1
# --- Select only required variables ---
structural_variables <- merged_datasets %>%
select(toilet_facility, water_source, postnatal_care, cooking_fuel)
# Convert to numeric
struct_vars_num <- structural_variables %>% mutate(across(everything(), as.numeric))
# --- Overall correlation ---
corr_matrix <- cor(struct_vars_num, use = "pairwise.complete.obs", method = "spearman")
# Replace NA values with 0 (to avoid clustering errors)
corr_matrix[is.na(corr_matrix)] <- 0
# Plot overall correlation heatmap
ggcorrplot(corr_matrix,
lab = TRUE, lab_size = 4,
hc.order = TRUE,
type = "lower",
colors = c("red", "white", "blue"),
title = "Overall Correlation of Structural Variables",
ggtheme = theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.text.y = element_text(size = 10)))
# --- Correlation by wealth group ---
for (grp in levels(merged_datasets$wealth_cat)) {
subset_data <- struct_vars_num[merged_datasets$wealth_cat == grp, ]
# Correlation for this wealth group
corr_matrix_grp <- cor(subset_data, use = "pairwise.complete.obs", method = "spearman")
corr_matrix_grp[is.na(corr_matrix_grp)] <- 0 # replace NAs
print(
ggcorrplot(corr_matrix_grp,
lab = TRUE, lab_size = 4,
hc.order = TRUE,
type = "lower",
colors = c("red", "white", "blue"),
title = paste("Correlation by Wealth Group:", grp),
ggtheme = theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.text.y = element_text(size = 10)))
)
}
Strongest Link (0.52): Toilet facility and cooking fuel are moderately correlated. Households with better toilets are more likely to use cleaner fuels (e.g., gas or electricity).
Weak Positive Links (0.14–0.15): Water source has a small positive correlation with toilet facility and cooking fuel. Better water sources are slightly associated with better sanitation and cleaner fuels.
Weak Negative Links (-0.09 to -0.13): Postnatal care shows weak negative correlations with toilet facility and cooking fuel. Structural improvements do not strongly translate into higher use of postnatal care.
No Link (≈0): Water source and postnatal care show no relationship.
Middle Group: Shows the strongest clustering. Toilet and cooking fuel are most related (0.21), and water source links with toilet (0.17). Postnatal care correlations are near zero.
Poorer & Richer Groups: Weak positive links. In the Poorer group, toilet and cooking fuel (0.09) are weakly related. In the Richer group, water and cooking fuel (0.09) are most related. Postnatal care remains unrelated.
Poorest & Richest Extremes: Very weak or negative links. In the Poorest group, correlations are near zero. In the Richest group, some negative values (e.g., toilet–fuel = -0.16) suggest no clear clustering.