Data

The intent of this report is to present the rationale of how the analysis was conducted, variables were selected, transformed and lastly used for the analysis of effect. Furthermore, this report uses the dataframe generated from the raw dataset generated by merge_data_from_source.Rmd. The code is available upon request, given it contain many unstructured exploration in order to find the variables, and more importantly contain more sensitive data which was used for finding out the variables. The most important piece is the linkage between the code book and the survey variables, which is defined as follows:

df <- sqldf("SELECT 
            household.HV001 as house_cluster,
            household.HV007 as house_survey_year,
            household.HV006 as house_survey_mon,
            household.HV016 as house_survey_day,
            household.HV015 as house_survey_progress,
            household.HV024 as house_region,
            household.HV025 as house_rural_or_urban,
            household.HV205 as house_sanitation_type,
            hv205_107.text as t_house_sanitation_type,            
            household.HV210 as house_has_bicycle,
            household.HV211 as house_has_motocycle,
            household.HV212 as house_has_car,
            household.HV009 as n_house_members,
            household.HV014 as house_n_children_under_5,
            household.HV270 as house_wealth_index,
            household.HV238 as n_house_sharing_toilet,  
            household.HV235 as house_location_of_source_water,
            hv235_103.text as t_house_location_of_source_water,
            household.HV201 as house_source_type_drinking_water,
            hv201_102.text as t_house_source_type_drinking_water,
            women.V502 as mother_martial_status,
            women.V012 as mother_age_years,
            women.V106 as mother_highest_education,
            children.B4 as child_sex,
            children.HW1 as child_age_months,   
            children.HW2 as child_weight_kg,
            children.HW3 as child_height_cm,
            children.m5 as child_dur_breastfeeding_months,
            children.HW70 as child_height_for_age,
            children.HW71 as child_weight_for_age,
            children.HW72 as child_weight_for_height,
            children.HW73 as child_BMI
            FROM household INNER JOIN children ON 
            household.HV001 = children.V001 AND 
            household.HV002 = children.V002
            INNER JOIN women ON 
            household.HV001 = women.V001 AND
            household.HV002 = women.V002 AND
            children.HW51 = women.V003
            INNER JOIN hv201_102 on household.HV201 = hv201_102.code
            INNER JOIN hv235_103 on household.HV235 = hv235_103.code
            INNER JOIN hv205_107 on household.HV205 = hv205_107.code")

It is important to note that the final analysis of effect (i.e. the analysis below) did not consider the women information in the end. This is due to the joint of the women table against children and house leading to over 4000 rows being lost due to a lack of mapping and missing points.

Load Data from merge_data_from_source.Rmd and libraries and show the first few rows with all columns:

library(ggplot2)
library(knitr)
require(grDevices)
setwd("~/OneDrive/Academia/Projects/Sanitation\ and\ Malnutrition Project/Class\ Project/r_analysis")
science_theme = theme(panel.background=element_blank(), panel.grid.major = element_line(size = 0.5, color = "grey"), 
    axis.line = element_line(size = 0.7, color = "black"), legend.position = c(0.9, 
        0.2), text = element_text(size = 14))
df <- readRDS(file="data/all_dataN.rds")

kable(head(df,10))
house_cluster house_survey_year house_survey_mon house_survey_day house_survey_progress house_region house_rural_or_urban house_sanitation_type t_house_sanitation_type house_has_bicycle house_has_motocycle house_has_car n_house_members house_n_children_under_5 house_wealth_index n_house_sharing_toilet house_location_of_source_water t_house_location_of_source_water house_source_type_drinking_water t_house_source_type_drinking_water mother_martial_status mother_age_years mother_highest_education child_sex child_age_months child_weight_kg child_height_cm child_dur_breastfeeding_months child_height_for_age child_weight_for_age child_weight_for_height child_BMI
1 2003 8 9 completed amhara rural 31 Composite Toilet 0 0 0 4 2 poorest NA 3 Elsewhere 32 Unprotected Dug Well currently in union/living with a man 24 no education female 42 99 837 42 -379 -329 -129 -93
1 2003 8 11 completed amhara rural 31 Composite Toilet 0 0 0 6 2 poorer NA 3 Elsewhere 41 Protected Water From Spring currently in union/living with a man 32 no education male 41 132 924 93 -168 -110 -23 -1
1 2003 8 9 completed amhara rural 31 Composite Toilet 0 0 0 4 2 poorest NA 3 Elsewhere 42 Unprotected Water From Spring never in union 35 no education male 42 123 874 42 -316 -182 4 52
1 2003 8 11 completed amhara rural 31 Composite Toilet 0 0 0 4 1 poorest NA 3 Elsewhere 21 Borehole formerly in union/living with a man 35 primary female 51 123 921 93 -279 -223 -77 -55
2 2003 7 27 completed addis ababa urban 21 Pit Latrine Ventilated 0 0 0 6 2 richest 95 3 Elsewhere 13 Piped Water Public Tap/Standpipe formerly in union/living with a man 46 primary female 22 96 795 93 -177 -125 -45 -21
2 2003 7 28 completed addis ababa urban 22 Pit Latrine with Slab 0 0 0 6 2 richest 5 3 Elsewhere 13 Piped Water Public Tap/Standpipe currently in union/living with a man 35 primary male 40 131 946 93 -108 -114 -80 -70
2 2003 7 28 completed addis ababa urban 22 Pit Latrine with Slab 0 0 0 3 1 richest 95 3 Elsewhere 13 Piped Water Public Tap/Standpipe formerly in union/living with a man 25 no education female 22 109 836 22 -47 -20 2 10
3 2003 5 31 completed harari rural 23 Pit Latrine Without Slab/Open Pit 0 0 0 6 2 richest 3 3 Elsewhere 21 Borehole formerly in union/living with a man 30 no education female 26 102 821 93 -179 -138 -54 -37
3 2003 5 31 completed harari rural 23 Pit Latrine Without Slab/Open Pit 0 0 0 6 2 richest 3 3 Elsewhere 21 Borehole formerly in union/living with a man 30 no education male 43 156 1100 93 228 -1 -201 -226
3 2003 6 1 completed harari rural 31 Composite Toilet 0 0 0 4 2 richer NA 3 Elsewhere 21 Borehole currently in union/living with a man 20 no education male 35 138 940 93 -25 -14 -2 -3

Coding

Chronic Malnutrition - Height-for-age (WHO Reference Population)

qplot(child_height_for_age,data=df,geom="histogram",binwidth=10,color=df$sanitation, xlab="Height-for-age (WHO)",ylab="Frequency",main="Height-for-age (WHO) Histogram") + science_theme 

Observing the histogram above, it is noteworthy a high frequence of points on 10k. This is most likely outliers due to error on their collected data process, or an indicator that those children height was not possible to be measured. These outliers should not be included on the analysis, and setting the threshold to be less than 2500 seems reasonable to have all the correct registered data points, which is shown on the graph below:

qplot(child_height_for_age,data=df[df$child_height_for_age < 2500,],geom="histogram",binwidth=10,color=df$sanitation, xlab="Child Height-for-age (WHO)",ylab="Frequency",main="Child Height-for-age (WHO) Filtered Histogram") + science_theme

Notice considering only those data points will reduce the total amount of data points remaining for the analysis after filtering them.

df <- df[df$child_height_for_age < 2500,]

It is also noteworthy that the values are too high. A second inspection on the data dictionary reveal that height-for-age before comparing to the reference population (which yields the variable used here) was multiplied by 10. The correct value should be henceforthdivided by 10 for the following steps.

df$child_height_for_age <- df$child_height_for_age/10

And as such from this point and onwards the analysis disregard data points whose height-for-age was not registered.

Sanitation Level (Independent Variable)

ggplot(df, aes(factor(t_house_sanitation_type))) + geom_bar() + science_theme + labs(x = "Sanitation Type", y = "Frequency",title="Sanitation Type Histogram") + theme(axis.text.x = element_text(angle = 60, hjust = 1)) 

Observing this graph, it makes little sense to use the sanitation type variable as is, as most of the categories barely occur. Furthermore, in considering an intervention investment, we would be interested on knowing if moving the target population from Unimproved Sanitation to higher levels of service, i.e. Intermediate Sanitation or Improved Sanitation lead to a reduce of Chronic Malnutrition (Measured by Height-for-age). As such, it seems appropriate to group the sanitation variables as done in previous work in Esrey (1996). On the study, the following classification is used:

  • Unimproved Sanitation: Holes in the ground, bushes, and other places where human waste is not contained to prevent it from contaminating the environment (Bucket Toilet)
  • Intermediate Sanitation: Pit Latrine or Similar Fecal Disposal System (Pit Latrine Types)
  • Optimal Sanitation: Water-based system or flush toilet (Flush or Pour Flush Toilet Types ; Also Composite Toilet).

Others category are discarded for the purpose of the analysis. The redefinition of the variables defines the following histogram:

#Code Improved Sanitation 
df$t2_house_sanitation_type <- df$house_sanitation_type
df$t2_house_sanitation_type[df$t2_house_sanitation_type == 11 | df$t2_house_sanitation_type == 12 | df$t2_house_sanitation_type == 13 | df$t2_house_sanitation_type == 14 | df$t2_house_sanitation_type == 15] <- "Improved Sanitation"

#Code Intermediate Sanitation 
df$t2_house_sanitation_type[df$t2_house_sanitation_type == 21 | df$t2_house_sanitation_type == 22 | df$t2_house_sanitation_type == 23 | df$t2_house_sanitation_type == 31] <- "Intermediate Sanitation"

#Code Unimproved Sanitation
df$t2_house_sanitation_type[df$t2_house_sanitation_type == 41 | df$t2_house_sanitation_type == 51 | df$t2_house_sanitation_type == 61] <- "Unimproved Sanitation"

#Remove Code 96, Other as it is not described anywehre. 
df<-df[!(df$t2_house_sanitation_type==96),]

ggplot(df, aes(factor(t2_house_sanitation_type))) + geom_bar() + science_theme + labs(x = "Sanitation Level", y = "Frequency",title="Sanitation Types grouped by Levels of Service") + theme(axis.text.x = element_text(angle = 60, hjust = 1)) 

Analysis of Effect

Before fitting the regression model, it is interesting to visualize the patterns on the data “as-is”. Let’s consider the association between Height-for-age and the coded Sanitation Levels.

ggplot(df[df$child_height_for_age < 2500,], aes(factor(t2_house_sanitation_type), child_height_for_age)) + geom_boxplot() + science_theme + xlab("Sanitation Level") + ylab("Child Height-for-age (WHO)") +
ggtitle("Height-for-age vs Sanitation Level")

We may be mislead that there is no association, but if we consider separating by region, the association for urban becomes clear:

ggplot(df[df$child_height_for_age < 2500,], aes(factor(t2_house_sanitation_type), child_height_for_age)) + geom_boxplot(aes(fill = house_rural_or_urban)) + science_theme2 + xlab("Sanitation Level") + ylab("Child Height-for-age (WHO)") +
ggtitle("Height-for-age vs Sanitation Level")

In fact, we can consider if the association exists on a satistical level of significance for urban:

df_urban <- df[df$house_rural_or_urban == "urban",]
model <- lm(df_urban$child_height_for_age ~ 
                relevel(factor(df_urban$t2_house_sanitation_type),ref="Unimproved Sanitation"))
summary <- summary(model)
kable(summary$coefficients)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11.522222 3.522093 -3.2714135 0.0011363
relevel(factor(df_urban$t2_house_sanitation_type), ref = “Unimproved Sanitation”)Improved Sanitation 5.722222 5.224108 1.0953491 0.2738371
relevel(factor(df_urban$t2_house_sanitation_type), ref = “Unimproved Sanitation”)Intermediate Sanitation -2.743432 3.581738 -0.7659502 0.4440304

The regression model does not observe any statistical significance level under 5% from the reference level (Unimproved Sanitation) versus Urban, however there is a difference between the t values from Unimproved to Improved, which only reflects the intuition shown before on the graph.

Lastly, we also verified if controlling for other variables the association could be shown. We can not split on Urban and Rural this time, as there is insufficient data from Urban:

ggplot(df, aes(factor(house_rural_or_urban))) + geom_bar() + science_theme + labs(x = "House Region", y = "Frequency",title="Children's House Region in Ethiopia") + theme(axis.text.x = element_text(angle = 60, hjust = 1)) 

We also coded the level of service of water according to previous work to use it as control as follows:

df$t2_house_source_type_drinking_water <- df$house_source_type_drinking_water

df$t2_house_source_type_drinking_water[df$t2_house_source_type_drinking_water == 11 | df$t2_house_source_type_drinking_water == 12 | df$t2_house_source_type_drinking_water == 13] <- "Improved Water"

df$t2_house_source_type_drinking_water[df$t2_house_source_type_drinking_water == 21 | df$t2_house_source_type_drinking_water == 31 | df$t2_house_source_type_drinking_water == 32 | df$t2_house_source_type_drinking_water == 41 | df$t2_house_source_type_drinking_water == 42 | df$t2_house_source_type_drinking_water == 51 | df$t2_house_source_type_drinking_water == 61 | df$t2_house_source_type_drinking_water == 71 | df$t2_house_source_type_drinking_water == 81 | df$t2_house_source_type_drinking_water == 91] <- "Unimproved Water"

df<-df[!(df$t2_house_source_type_drinking_water==96),]

And as such, the final model is:

df$t2_house_sanitation_type <- factor(df$t2_house_sanitation_type)
model <- lm(df$child_height_for_age/10 ~ 
                relevel(factor(df$t2_house_sanitation_type),ref="Unimproved Sanitation") + 
                relevel(factor(df$house_rural_or_urban),ref="rural") + 
                df$n_house_members + 
                relevel(factor(df$house_wealth_index),ref="poorest") + 
                relevel(factor(df$t2_house_source_type_drinking_water),ref="Unimproved Water") + 
                df$mother_age_years + 
                relevel(factor(df$mother_highest_education),ref="no education") + 
                relevel(factor(df$mother_martial_status),ref="never in union") +
                relevel(factor(df$child_sex),ref="male") + 
                df$child_dur_breastfeeding_months
            )
summary <- summary(model)
kable(summary$coefficients)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.8480779 0.3415176 -5.4113699 0.0000001
relevel(factor(df$t2_house_sanitation_type), ref = “Unimproved Sanitation”)Improved Sanitation 0.2064321 0.2350486 0.8782527 0.3798561
relevel(factor(df$t2_house_sanitation_type), ref = “Unimproved Sanitation”)Intermediate Sanitation 0.0101981 0.1274261 0.0800313 0.9362161
relevel(factor(df$house_rural_or_urban), ref = “rural”)urban 0.2595291 0.1000932 2.5928744 0.0095503
df$n_house_members 0.0390472 0.0117889 3.3122113 0.0009333
relevel(factor(df$house_wealth_index), ref = “poorest”)middle 0.0000983 0.0749509 0.0013117 0.9989535
relevel(factor(df$house_wealth_index), ref = “poorest”)poorer -0.0066209 0.0734401 -0.0901535 0.9281695
relevel(factor(df$house_wealth_index), ref = “poorest”)richer -0.0046898 0.0750766 -0.0624670 0.9501939
relevel(factor(df$house_wealth_index), ref = “poorest”)richest 0.1768342 0.1054572 1.6768342 0.0936483
relevel(factor(df$t2_house_source_type_drinking_water), ref = “Unimproved Water”)Improved Water 0.1212615 0.0621430 1.9513299 0.0510833
df$mother_age_years -0.0064113 0.0041232 -1.5549394 0.1200348
relevel(factor(df$mother_highest_education), ref = “no education”)higher 0.9362637 0.2635816 3.5520829 0.0003863
relevel(factor(df$mother_highest_education), ref = “no education”)primary 0.2059067 0.0582204 3.5366793 0.0004095
relevel(factor(df$mother_highest_education), ref = “no education”)secondary 0.4272295 0.1843788 2.3171294 0.0205439
relevel(factor(df$mother_martial_status), ref = “never in union”)currently in union/living with a man 0.0564546 0.3008529 0.1876485 0.8511611
relevel(factor(df$mother_martial_status), ref = “never in union”)formerly in union/living with a man 0.0047566 0.3113549 0.0152770 0.9878119
relevel(factor(df$child_sex), ref = “male”)female 0.1038863 0.0488807 2.1253021 0.0336188
df$child_dur_breastfeeding_months -0.0052139 0.0006662 -7.8263684 0.0000000

For rural and urban, it is clear that it is unlikely the association exist. This seems counter-intuitive given previous work. On verifying the possible cause, a couple reasons can exist:

A less considered hypothesis is, of course, the fact that the association simple don’t occur for 2011 on the sampled region for some other unknown factor particular to the place.