1.Introduction:

This report shows the initial finding of the descriptive analysis of child malnutrition in Pakistan,focusing distribution of stunting, wasting and underweight across different wealth groups. ## 2.Setup and data loading: This section prepares the environment by loading necessary R libraries and importing the data-set, ensuring variables are in the correct format for analysis. ### 2.1 Load Libraries:

library(haven)

## Warning: package 'haven' was built under R version 4.5.1

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.5.1

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(psych)

## Warning: package 'psych' was built under R version 4.5.1

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

library(tidyr)
library(naniar)

## Warning: package 'naniar' was built under R version 4.5.1

2.2 Load Datasets:

In this section Kids data, Household data and Women data is loading in R.

kids <- read_dta("C:/Users/ameen/OneDrive - Higher Education Commission/Documents/Talha csc scholarship china/open door/itmo docs/introduction to R and more/1 Term/reaserch in ternship/Pakistan reaserch/PKKR71DT kids/PKKR71FL.DTA")
household <- read_dta("C:/Users/ameen/OneDrive - Higher Education Commission/Documents/Talha csc scholarship china/open door/itmo docs/introduction to R and more/1 Term/reaserch in ternship/Pakistan reaserch/PKHR71DT household/PKHR71FL.DTA")
women <- read_dta("C:/Users/ameen/OneDrive - Higher Education Commission/Documents/Talha csc scholarship china/open door/itmo docs/introduction to R and more/1 Term/reaserch in ternship/Pakistan reaserch/PKIR71DT women/PKIR71FL.DTA")

3 Select variables:

In this step variables that are useful for analysis are seperated from the raw data.

kids_selective <- kids %>%
  select(v001, v002, v003, b16, midx, 
         hw70, hw71, hw72,   # nutrition outcomes
         hw1, hw13,          # child age + reason for missing
         v190, h2:h9, m70)

household_selective <- household %>%
  select(hv001, hv002, hv201, hv202, hv204, hv205, hv226) %>%
  rename(v001 = hv001, v002 = hv002)

women_selective <- women %>%
  select(v001, v002, v003, v467a:v467f)

3.1 Linked check:

This step check while merging Kids and women data is data loss during merging?

# --- 2.1. Linked check ---
kids_selective$linked <- ifelse(
  paste(kids_selective$v001, kids_selective$v002, kids_selective$v003) %in%
    paste(women_selective$v001, women_selective$v002, women_selective$v003),
  1, 0
)
cat("\nLinked vs Unlinked children (pre-merge):\n")

## 
## Linked vs Unlinked children (pre-merge):

print(table(kids_selective$linked))

## 
##     1 
## 12708

Interpretation:

The results show that all 12708 children merged successfully with their respective mothers no data is lost during merging.

4 Merge datasets All:

In this step all selective data sets are merged.

kids_selective <- kids_selective %>% mutate(b16 = as.numeric(b16))

merged_datasets <- kids_selective %>%
  left_join(household_selective, by = c("v001","v002")) %>%
  left_join(women_selective, by = c("v001","v002","v003"))

4.1 Descriptive Statistics of Children’s Age:

Summarize the variable hw1 (child’s age in months) to understand its distribution and check for missing values.

summary(merged_datasets$hw1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   14.00   29.00   29.16   44.00   59.00    8209

Interpretation:

The variable hw1 (children’s age in months) ranges from 0 to 59 months. The median age is 29 months, close to the mean of 29.16, suggesting a roughly symmetric distribution. The first and third quartiles are 14 and 44 months, respectively. There are 8,209 missing values, indicating a substantial amount of incomplete data. ### 4.2 Histogram of Children’s Age:

hist(merged_datasets$hw1, main = "Age distribution of children", xlab = "Age in months")

#### Interpretation: This histogram shows the number of children at different ages (in months). Overall, the sample is well-distributed across the full 0–60 month range, with slightly more infants (0–5 months) and slightly fewer children in the oldest age group (55–60 months). This indicates a large and fairly even sample, which is suitable for studying age-related changes in malnutrition. ### 4.3 Check for kids without a household match: This counts how many children don’t have household information (hv201), showing if the merge missed any matches.

sum(is.na(merged_datasets$hv201))

## [1] 0

Interpretation:

The result [1] 0 means that all children in the data-set have matching household information. There are no missing values in hv201, so the merge worked correctly. ## 5 Data Cleaning and Variable Preparation: In this step, we cleaned and prepared the merged dat-aset. Variables were renamed to meaningful names for easier interpretation. Children were grouped into age categories (0–11 months, 12–23 months, etc.), and anthropometry scores (stunting, underweight, wasting) were converted back to z-scores and cleaned for implausible values. Wealth index was recoded into 5 ordered categories from poorest to richest for analysis.

merged_datasets <- merged_datasets %>%
  rename(
    cluster_number   = v001,
    household_number = v002,
    child_index      = b16,
    mother_index     = v003,
    stunting         = hw70,
    underweight      = hw71,
    wasting          = hw72,
    wealth_index     = v190,
    water_source     = hv201,
    time_to_water    = hv202,   
    electricity      = hv204,
    toilet_facility  = hv205,
    cooking_fuel     = hv226,
    postnatal_care   = m70,
    know_where_to_go = v467a,
    getting_permission_to_go = v467b,
    money_needed = v467c,
    distance_ti_health = v467d,
    transport  = v467e,
    wait_to_go_alone  = v467f
  ) %>%
  mutate(
    # 5.1 Child age groups
    age_group = cut(hw1,
                    breaks = seq(0, 60, by = 12),
                    labels = c("0–11m", "12–23m", "24–35m", "36–47m", "48–59m"),
                    right = FALSE),
    #5.2  Convert anthropometry scores back to z-scores
    stunting    = stunting / 100,
    underweight = underweight / 100,
    wasting     = wasting / 100,
    #5.3  Clean implausible values
    stunting    = ifelse(stunting < -6 | stunting > 6, NA, stunting),
    underweight = ifelse(underweight < -6 | underweight > 6, NA, underweight),
    wasting     = ifelse(wasting < -5 | wasting > 5, NA, wasting),
    # Wealth categories (5 groups)
    wealth_cat = case_when(
      wealth_index == 1 ~ "poorest",
      wealth_index == 2 ~ "poorer",
      wealth_index == 3 ~ "middle",
      wealth_index == 4 ~ "richer",
      wealth_index == 5 ~ "richest",
      TRUE ~ NA_character_
    ),
    wealth_cat = factor(wealth_cat, levels = c("poorest","poorer","middle","richer","richest"))
  )

6 Nutrition outcomes by age group:

In this section, we examine how child malnutrition indicators vary across age groups (0–11 months, 12–23 months, etc.). Three key nutrition outcomes are considered: stunting (HAZ), underweight (WAZ), and wasting (WHZ). ### 6.1 Boxplots by age group.

ggplot(merged_datasets, aes(x = age_group, y = stunting, fill = age_group)) +
  geom_boxplot() + theme_minimal() +
  labs(title = "Stunting (HAZ) distribution by child age group", x="Age group (months)", y="HAZ")

## Warning: Removed 8567 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

ggplot(merged_datasets, aes(x = age_group, y = underweight, fill = age_group)) +
  geom_boxplot() + theme_minimal() +
  labs(title = "Underweight (WAZ) distribution by child age group", x="Age group (months)", y="WAZ")

## Warning: Removed 8443 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

ggplot(merged_datasets, aes(x = age_group, y = wasting, fill = age_group)) +
  geom_boxplot() + theme_minimal() +
  labs(title = "Wasting (WHZ) distribution by child age group", x="Age group (months)", y="WHZ")

## Warning: Removed 8557 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

#### Interpretation: Stunting (HAZ) – Chronic Malnutrition: Stunting increases with age. Infants (0–11 months) are close to the healthy range, but from 12–35 months, median HAZ drops below -1.5. For older children (36–59 months), stunting remains low, showing that once it occurs, it persists.

Underweight (WAZ) – Combined Malnutrition: Underweight also worsens with age. Median WAZ is below 0 for all groups, and the lower quartiles are near -2, indicating that many children are underweight across all ages.

Wasting (WHZ) – Acute Malnutrition: Wasting is more stable, with median WHZ near 0. However, some children in all age groups fall below -2, showing ongoing risk of acute malnutrition, often due to short-term illness or food shortage. ### 6.2 line chart by age group:

make_line_chart <- function(data, var, var_label){
  line_data <- data %>%
    group_by(age_group) %>%
    summarise(mean_val = mean(.data[[var]], na.rm = TRUE),
              sd_val   = sd(.data[[var]], na.rm = TRUE),
              n        = n()) %>%
    mutate(se_val = sd_val / sqrt(n))
  
  ggplot(line_data, aes(x = age_group, y = mean_val, group = 1)) +
    geom_line(color = "blue", size = 1) +
    geom_point(size = 3, color = "red") +
    geom_errorbar(aes(ymin = mean_val - se_val, ymax = mean_val + se_val),
                  width = 0.1, color = "black") +
    theme_minimal() +
    labs(title = paste("Average", var_label, "Z-score by Child Age Group"),
         x = "Age group (months)",
         y = paste("Mean", var_label, "Z-score")) +
    geom_hline(yintercept = -2, linetype = "dashed", color = "darkred") +
    annotate("text", x = 2.5, y = -2.1,
             label = paste(var_label, "cutoff (Z = -2)"),
             color = "darkred", hjust = 0)
}

# Run charts
p_stunting_line    <- make_line_chart(merged_datasets, "stunting", "Stunting (HAZ)")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

p_underweight_line <- make_line_chart(merged_datasets, "underweight", "Underweight (WAZ)")
p_wasting_line     <- make_line_chart(merged_datasets, "wasting", "Wasting (WHZ)")

print(p_stunting_line)

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

print(p_underweight_line)

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

print(p_wasting_line)

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

#### Interpretation: Stunting (HAZ): Stunting gets worse with age. The mean HAZ drops sharply from infancy and crosses the −2 cutoff around 36–47 months, meaning older children are on average chronically malnourished.

Underweight (WAZ): Underweight also declines with age but less severely. The mean WAZ stays below the healthy average (0) in all groups but does not cross the severe −2 cutoff.

Wasting (WHZ): Wasting remains fairly stable across age groups, with the mean WHZ close to 0 and well above the −2 cutoff. This shows that acute malnutrition is less of a population-wide problem and occurs more as short-term cases. ## 7 Nutritional Outcomes by Wealth Category (Heatmaps): We used heatmaps to examine how child nutrition outcomes differ across wealth categories.
The plots show the percentage distribution of stunting, underweight, and wasting across different levels of household wealth.
This helps visualize socioeconomic disparities in child malnutrition.

make_single_heatmap <- function(data, var, var_label){
  df <- data %>%
    filter(!is.na(!!rlang::sym(var))) %>%     # drop NA values
    mutate(value100 = round(!!rlang::sym(var) * 100))
  
  # DHS-style binning (-600 to 600)
  breaks <- seq(-600, 600, by = 100)
  labels <- paste0(breaks[-length(breaks)], "-", (breaks[-1]-1))
  
  df <- df %>%
    mutate(value_bin = cut(value100, breaks = breaks, labels = labels,
                           include.lowest = TRUE, right = TRUE))
  
  heat_data <- df %>%
    group_by(wealth_cat, value_bin) %>%
    summarise(n = n(), .groups = "drop") %>%
    complete(wealth_cat, value_bin, fill = list(n = 0)) %>%
    group_by(wealth_cat) %>%
    mutate(total_in_row = sum(n),
           pct_within_wealth = ifelse(total_in_row==0, NA, 100*n/total_in_row)) %>%
    ungroup()
  
  ggplot(heat_data, aes(x=value_bin, y=wealth_cat, fill=pct_within_wealth)) +
    geom_tile(color="white", size=0.35) +
    geom_text(aes(label = ifelse(is.na(pct_within_wealth), "", sprintf("%.1f%%", pct_within_wealth))),
              color="black", size=3.1) +
    geom_text(aes(label = ifelse(is.na(pct_within_wealth), "", sprintf("%.1f%%", pct_within_wealth))),
              color="white", size=2.6) +
    scale_fill_viridis_c(name="% within wealth", option="viridis", na.value="white") +
    labs(title=var_label, x=paste(var_label,"(binned)"), y="Wealth") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle=45, hjust=1),
          panel.grid = element_blank()) +
    scale_y_discrete(limits = rev(levels(heat_data$wealth_cat)))
}

# Run heatmaps
p_stunting <- make_single_heatmap(merged_datasets, "stunting", "Stunting")
p_underweight <- make_single_heatmap(merged_datasets, "underweight", "Underweight")
p_wasting <- make_single_heatmap(merged_datasets, "wasting", "Wasting")

print(p_stunting)

print(p_underweight)

print(p_wasting)

Interpretation:

Stunting (HAZ) – Clear Wealth Gradient
Stunting shows the strongest wealth gradient. Children from poorer households are much more likely to have severe stunting, while richer households cluster closer to the healthy range.
Underweight (WAZ) – Strong Wealth Gradient
Underweight also declines as wealth increases. Poorer children are more concentrated in the severe underweight bins, while richer children are closer to normal values.
Wasting (WHZ) – Weak Wealth Gradient
Wasting shows the weakest wealth gradient. Most children in all wealth groups fall near normal values, though the poorest still face slightly higher risks of severe wasting. ## 8 Missingness Analysis: We examined missing data across all variables. The table below highlights the variables with the highest percentage of missing values to identify potential data quality issues.

missing_counts <- sapply(merged_datasets, function(x) sum(is.na(x)))
missing_percent <- round(sapply(merged_datasets, function(x) mean(is.na(x))*100), 2)

missing_table <- data.frame(
  Variable = names(missing_counts),
  Missing_Count = missing_counts,
  Missing_Percent = missing_percent
)
missing_table[order(-missing_table$Missing_Percent), ]

##                                          Variable Missing_Count Missing_Percent
## know_where_to_go                 know_where_to_go         12708          100.00
## transport                               transport         12708          100.00
## time_to_water                       time_to_water         12457           98.02
## h7d                                           h7d          9930           78.14
## h7m                                           h7m          9930           78.14
## h7y                                           h7y          9930           78.14
## h8d                                           h8d          9896           77.87
## h8m                                           h8m          9896           77.87
## h8y                                           h8y          9896           77.87
## h5d                                           h5d          9604           75.57
## h5m                                           h5m          9604           75.57
## h5y                                           h5y          9604           75.57
## h6d                                           h6d          9574           75.34
## h6m                                           h6m          9574           75.34
## h6y                                           h6y          9574           75.34
## h3d                                           h3d          9272           72.96
## h3m                                           h3m          9272           72.96
## h3y                                           h3y          9272           72.96
## h4d                                           h4d          9233           72.66
## h4m                                           h4m          9233           72.66
## h4y                                           h4y          9233           72.66
## h2d                                           h2d          8961           70.51
## h2m                                           h2m          8961           70.51
## h2y                                           h2y          8961           70.51
## stunting                                 stunting          8567           67.41
## wasting                                   wasting          8557           67.34
## underweight                           underweight          8443           66.44
## hw1                                           hw1          8209           64.60
## age_group                               age_group          8209           64.60
## h5                                             h5          5557           43.73
## h7                                             h7          5557           43.73
## h4                                             h4          5556           43.72
## h6                                             h6          5556           43.72
## h8                                             h8          5556           43.72
## h2                                             h2          5555           43.71
## h3                                             h3          5555           43.71
## h9                                             h9          5555           43.71
## postnatal_care                     postnatal_care          4433           34.88
## child_index                           child_index           719            5.66
## distance_ti_health             distance_ti_health            12            0.09
## getting_permission_to_go getting_permission_to_go            10            0.08
## money_needed                         money_needed            10            0.08
## wait_to_go_alone                 wait_to_go_alone            10            0.08
## cooking_fuel                         cooking_fuel             4            0.03
## hw13                                         hw13             3            0.02
## cluster_number                     cluster_number             0            0.00
## household_number                 household_number             0            0.00
## mother_index                         mother_index             0            0.00
## midx                                         midx             0            0.00
## wealth_index                         wealth_index             0            0.00
## linked                                     linked             0            0.00
## water_source                         water_source             0            0.00
## electricity                           electricity             0            0.00
## toilet_facility                   toilet_facility             0            0.00
## wealth_cat                             wealth_cat             0            0.00

9 Reason for Messingness:

In this section we check reason for messingness.

merged_datasets <- merged_datasets %>%
  mutate(hw13_reason = case_when(
    hw13 == 0 ~ "Measured",
    hw13 == 1 ~ "Child died",
    hw13 == 2 ~ "Child sick",
    hw13 == 3 ~ "Not present",
    hw13 == 4 ~ "Refused",
    hw13 == 5 ~ "Mother refused",
    hw13 == 6 ~ "Other",
    hw13 == 7 ~ "No measurement in household",
    TRUE      ~ NA_character_
  ))

cat("\nReasons for missing stunting/underweight data (hw13):\n")

## 
## Reasons for missing stunting/underweight data (hw13):

print(table(merged_datasets$hw13_reason, useNA="ifany"))

## 
##                  Child died                    Measured 
##                         719                        4227 
## No measurement in household                 Not present 
##                        7490                         111 
##                       Other                     Refused 
##                          72                          86 
##                        <NA> 
##                           3

Interpretation:

Most missing cases are due to children not being measured, followed by child deaths, refusals, or absence. ## 10 Visual Missingness Check (naniar plots): In this section different plot help to visualize missingness pattern.

# Missingness across variables
gg_miss_var(merged_datasets)

# Upset plot for overlapping missingness
gg_miss_upset(merged_datasets)

# Missingness by wealth group
gg_miss_var(merged_datasets, facet = "wealth_cat")

10 Structural Variables and Correlations

To explore how household and healthcare-related factors cluster together, we calculated Spearman correlations among key structural variables (e.g., water source, electricity, sanitation, cooking fuel). Correlations were also examined separately by wealth groups to assess whether these relationships vary by socioeconomic status.

library(dplyr)
library(ggcorrplot)

## Warning: package 'ggcorrplot' was built under R version 4.5.1

# --- Select only required variables ---
structural_variables <- merged_datasets %>%
  select(toilet_facility, water_source, postnatal_care, cooking_fuel)

# Convert to numeric
struct_vars_num <- structural_variables %>% mutate(across(everything(), as.numeric))

# ---  Overall correlation ---
corr_matrix <- cor(struct_vars_num, use = "pairwise.complete.obs", method = "spearman")

# Replace NA values with 0 (to avoid clustering errors)
corr_matrix[is.na(corr_matrix)] <- 0

# Plot overall correlation heatmap
ggcorrplot(corr_matrix, 
           lab = TRUE, lab_size = 4,
           hc.order = TRUE,                 
           type = "lower",
           colors = c("red", "white", "blue"),
           title = "Overall Correlation of Structural Variables",
           ggtheme = theme_minimal() +
             theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
                   axis.text.y = element_text(size = 10)))

# ---  Correlation by wealth group ---
for (grp in levels(merged_datasets$wealth_cat)) {
  subset_data <- struct_vars_num[merged_datasets$wealth_cat == grp, ]
  
  # Correlation for this wealth group
  corr_matrix_grp <- cor(subset_data, use = "pairwise.complete.obs", method = "spearman")
  corr_matrix_grp[is.na(corr_matrix_grp)] <- 0   # replace NAs
  
  print(
    ggcorrplot(corr_matrix_grp,
               lab = TRUE, lab_size = 4,
               hc.order = TRUE,
               type = "lower",
               colors = c("red", "white", "blue"),
               title = paste("Correlation by Wealth Group:", grp),
               ggtheme = theme_minimal() +
                 theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
                       axis.text.y = element_text(size = 10)))
  )
}

Interpretation:

Overall correlation:

Strongest Link (0.52): Toilet facility and cooking fuel are moderately correlated. Households with better toilets are more likely to use cleaner fuels (e.g., gas or electricity).

Weak Positive Links (0.14–0.15): Water source has a small positive correlation with toilet facility and cooking fuel. Better water sources are slightly associated with better sanitation and cleaner fuels.

Weak Negative Links (-0.09 to -0.13): Postnatal care shows weak negative correlations with toilet facility and cooking fuel. Structural improvements do not strongly translate into higher use of postnatal care.

No Link (≈0): Water source and postnatal care show no relationship.

Correlation by Wealth Group:

Middle Group: Shows the strongest clustering. Toilet and cooking fuel are most related (0.21), and water source links with toilet (0.17). Postnatal care correlations are near zero.

Poorer & Richer Groups: Weak positive links. In the Poorer group, toilet and cooking fuel (0.09) are weakly related. In the Richer group, water and cooking fuel (0.09) are most related. Postnatal care remains unrelated.

Poorest & Richest Extremes: Very weak or negative links. In the Poorest group, correlations are near zero. In the Richest group, some negative values (e.g., toilet–fuel = -0.16) suggest no clear clustering.

Descriptive Analysis Report of PDHS data

Ameen Talha

2025-10-02

1.Introduction:

2.2 Load Datasets:

3 Select variables:

3.1 Linked check:

Interpretation:

4 Merge datasets All:

4.1 Descriptive Statistics of Children’s Age:

Interpretation:

Interpretation:

6 Nutrition outcomes by age group:

Interpretation:

9 Reason for Messingness:

Interpretation:

10 Structural Variables and Correlations

Interpretation:

Overall correlation:

Correlation by Wealth Group: