The intent of this report is to present the rationale of how the analysis was conducted, variables were selected, transformed and lastly used for the analysis of effect. Furthermore, this report uses the dataframe generated from the raw dataset generated by merge_data_from_source.Rmd. The code is available upon request, given it contain many unstructured exploration in order to find the variables, and more importantly contain more sensitive data which was used for finding out the variables. The most important piece is the linkage between the code book and the survey variables, which is defined as follows:
df <- sqldf("SELECT
household.HV001 as house_cluster,
household.HV007 as house_survey_year,
household.HV006 as house_survey_mon,
household.HV016 as house_survey_day,
household.HV015 as house_survey_progress,
household.HV024 as house_region,
household.HV025 as house_rural_or_urban,
household.HV205 as house_sanitation_type,
hv205_107.text as t_house_sanitation_type,
household.HV210 as house_has_bicycle,
household.HV211 as house_has_motocycle,
household.HV212 as house_has_car,
household.HV009 as n_house_members,
household.HV014 as house_n_children_under_5,
household.HV270 as house_wealth_index,
household.HV238 as n_house_sharing_toilet,
household.HV235 as house_location_of_source_water,
hv235_103.text as t_house_location_of_source_water,
household.HV201 as house_source_type_drinking_water,
hv201_102.text as t_house_source_type_drinking_water,
women.V502 as mother_martial_status,
women.V012 as mother_age_years,
women.V106 as mother_highest_education,
children.B4 as child_sex,
children.HW1 as child_age_months,
children.HW2 as child_weight_kg,
children.HW3 as child_height_cm,
children.m5 as child_dur_breastfeeding_months,
children.HW70 as child_height_for_age,
children.HW71 as child_weight_for_age,
children.HW72 as child_weight_for_height,
children.HW73 as child_BMI
FROM household INNER JOIN children ON
household.HV001 = children.V001 AND
household.HV002 = children.V002
INNER JOIN women ON
household.HV001 = women.V001 AND
household.HV002 = women.V002 AND
children.HW51 = women.V003
INNER JOIN hv201_102 on household.HV201 = hv201_102.code
INNER JOIN hv235_103 on household.HV235 = hv235_103.code
INNER JOIN hv205_107 on household.HV205 = hv205_107.code")
It is important to note that the final analysis of effect (i.e. the analysis below) did not consider the women information in the end. This is due to the joint of the women table against children and house leading to over 4000 rows being lost due to a lack of mapping and missing points.
Load Data from merge_data_from_source.Rmd and libraries and show the first few rows with all columns:
library(ggplot2)
library(knitr)
require(grDevices)
setwd("~/OneDrive/Academia/Projects/Sanitation\ and\ Malnutrition Project/Class\ Project/r_analysis")
science_theme = theme(panel.background=element_blank(), panel.grid.major = element_line(size = 0.5, color = "grey"),
axis.line = element_line(size = 0.7, color = "black"), legend.position = c(0.9,
0.2), text = element_text(size = 14))
df <- readRDS(file="data/all_dataN.rds")
kable(head(df,10))
| house_cluster | house_survey_year | house_survey_mon | house_survey_day | house_survey_progress | house_region | house_rural_or_urban | house_sanitation_type | t_house_sanitation_type | house_has_bicycle | house_has_motocycle | house_has_car | n_house_members | house_n_children_under_5 | house_wealth_index | n_house_sharing_toilet | house_location_of_source_water | t_house_location_of_source_water | house_source_type_drinking_water | t_house_source_type_drinking_water | mother_martial_status | mother_age_years | mother_highest_education | child_sex | child_age_months | child_weight_kg | child_height_cm | child_dur_breastfeeding_months | child_height_for_age | child_weight_for_age | child_weight_for_height | child_BMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2003 | 8 | 9 | completed | amhara | rural | 31 | Composite Toilet | 0 | 0 | 0 | 4 | 2 | poorest | NA | 3 | Elsewhere | 32 | Unprotected Dug Well | currently in union/living with a man | 24 | no education | female | 42 | 99 | 837 | 42 | -379 | -329 | -129 | -93 |
| 1 | 2003 | 8 | 11 | completed | amhara | rural | 31 | Composite Toilet | 0 | 0 | 0 | 6 | 2 | poorer | NA | 3 | Elsewhere | 41 | Protected Water From Spring | currently in union/living with a man | 32 | no education | male | 41 | 132 | 924 | 93 | -168 | -110 | -23 | -1 |
| 1 | 2003 | 8 | 9 | completed | amhara | rural | 31 | Composite Toilet | 0 | 0 | 0 | 4 | 2 | poorest | NA | 3 | Elsewhere | 42 | Unprotected Water From Spring | never in union | 35 | no education | male | 42 | 123 | 874 | 42 | -316 | -182 | 4 | 52 |
| 1 | 2003 | 8 | 11 | completed | amhara | rural | 31 | Composite Toilet | 0 | 0 | 0 | 4 | 1 | poorest | NA | 3 | Elsewhere | 21 | Borehole | formerly in union/living with a man | 35 | primary | female | 51 | 123 | 921 | 93 | -279 | -223 | -77 | -55 |
| 2 | 2003 | 7 | 27 | completed | addis ababa | urban | 21 | Pit Latrine Ventilated | 0 | 0 | 0 | 6 | 2 | richest | 95 | 3 | Elsewhere | 13 | Piped Water Public Tap/Standpipe | formerly in union/living with a man | 46 | primary | female | 22 | 96 | 795 | 93 | -177 | -125 | -45 | -21 |
| 2 | 2003 | 7 | 28 | completed | addis ababa | urban | 22 | Pit Latrine with Slab | 0 | 0 | 0 | 6 | 2 | richest | 5 | 3 | Elsewhere | 13 | Piped Water Public Tap/Standpipe | currently in union/living with a man | 35 | primary | male | 40 | 131 | 946 | 93 | -108 | -114 | -80 | -70 |
| 2 | 2003 | 7 | 28 | completed | addis ababa | urban | 22 | Pit Latrine with Slab | 0 | 0 | 0 | 3 | 1 | richest | 95 | 3 | Elsewhere | 13 | Piped Water Public Tap/Standpipe | formerly in union/living with a man | 25 | no education | female | 22 | 109 | 836 | 22 | -47 | -20 | 2 | 10 |
| 3 | 2003 | 5 | 31 | completed | harari | rural | 23 | Pit Latrine Without Slab/Open Pit | 0 | 0 | 0 | 6 | 2 | richest | 3 | 3 | Elsewhere | 21 | Borehole | formerly in union/living with a man | 30 | no education | female | 26 | 102 | 821 | 93 | -179 | -138 | -54 | -37 |
| 3 | 2003 | 5 | 31 | completed | harari | rural | 23 | Pit Latrine Without Slab/Open Pit | 0 | 0 | 0 | 6 | 2 | richest | 3 | 3 | Elsewhere | 21 | Borehole | formerly in union/living with a man | 30 | no education | male | 43 | 156 | 1100 | 93 | 228 | -1 | -201 | -226 |
| 3 | 2003 | 6 | 1 | completed | harari | rural | 31 | Composite Toilet | 0 | 0 | 0 | 4 | 2 | richer | NA | 3 | Elsewhere | 21 | Borehole | currently in union/living with a man | 20 | no education | male | 35 | 138 | 940 | 93 | -25 | -14 | -2 | -3 |
qplot(child_height_for_age,data=df,geom="histogram",binwidth=10,color=df$sanitation, xlab="Height-for-age (WHO)",ylab="Frequency",main="Height-for-age (WHO) Histogram") + science_theme
Observing the histogram above, it is noteworthy a high frequence of points on 10k. This is most likely outliers due to error on their collected data process, or an indicator that those children height was not possible to be measured. These outliers should not be included on the analysis, and setting the threshold to be less than 2500 seems reasonable to have all the correct registered data points, which is shown on the graph below:
qplot(child_height_for_age,data=df[df$child_height_for_age < 2500,],geom="histogram",binwidth=10,color=df$sanitation, xlab="Child Height-for-age (WHO)",ylab="Frequency",main="Child Height-for-age (WHO) Filtered Histogram") + science_theme
Notice considering only those data points will reduce the total amount of data points remaining for the analysis after filtering them.
df <- df[df$child_height_for_age < 2500,]
It is also noteworthy that the values are too high. A second inspection on the data dictionary reveal that height-for-age before comparing to the reference population (which yields the variable used here) was multiplied by 10. The correct value should be henceforthdivided by 10 for the following steps.
df$child_height_for_age <- df$child_height_for_age/10
And as such from this point and onwards the analysis disregard data points whose height-for-age was not registered.
ggplot(df, aes(factor(t_house_sanitation_type))) + geom_bar() + science_theme + labs(x = "Sanitation Type", y = "Frequency",title="Sanitation Type Histogram") + theme(axis.text.x = element_text(angle = 60, hjust = 1))
Observing this graph, it makes little sense to use the sanitation type variable as is, as most of the categories barely occur. Furthermore, in considering an intervention investment, we would be interested on knowing if moving the target population from Unimproved Sanitation to higher levels of service, i.e. Intermediate Sanitation or Improved Sanitation lead to a reduce of Chronic Malnutrition (Measured by Height-for-age). As such, it seems appropriate to group the sanitation variables as done in previous work in Esrey (1996). On the study, the following classification is used:
Others category are discarded for the purpose of the analysis. The redefinition of the variables defines the following histogram:
#Code Improved Sanitation
df$t2_house_sanitation_type <- df$house_sanitation_type
df$t2_house_sanitation_type[df$t2_house_sanitation_type == 11 | df$t2_house_sanitation_type == 12 | df$t2_house_sanitation_type == 13 | df$t2_house_sanitation_type == 14 | df$t2_house_sanitation_type == 15] <- "Improved Sanitation"
#Code Intermediate Sanitation
df$t2_house_sanitation_type[df$t2_house_sanitation_type == 21 | df$t2_house_sanitation_type == 22 | df$t2_house_sanitation_type == 23 | df$t2_house_sanitation_type == 31] <- "Intermediate Sanitation"
#Code Unimproved Sanitation
df$t2_house_sanitation_type[df$t2_house_sanitation_type == 41 | df$t2_house_sanitation_type == 51 | df$t2_house_sanitation_type == 61] <- "Unimproved Sanitation"
#Remove Code 96, Other as it is not described anywehre.
df<-df[!(df$t2_house_sanitation_type==96),]
ggplot(df, aes(factor(t2_house_sanitation_type))) + geom_bar() + science_theme + labs(x = "Sanitation Level", y = "Frequency",title="Sanitation Types grouped by Levels of Service") + theme(axis.text.x = element_text(angle = 60, hjust = 1))
Before fitting the regression model, it is interesting to visualize the patterns on the data “as-is”. Let’s consider the association between Height-for-age and the coded Sanitation Levels.
ggplot(df[df$child_height_for_age < 2500,], aes(factor(t2_house_sanitation_type), child_height_for_age)) + geom_boxplot() + science_theme + xlab("Sanitation Level") + ylab("Child Height-for-age (WHO)") +
ggtitle("Height-for-age vs Sanitation Level")
We may be mislead that there is no association, but if we consider separating by region, the association for urban becomes clear:
ggplot(df[df$child_height_for_age < 2500,], aes(factor(t2_house_sanitation_type), child_height_for_age)) + geom_boxplot(aes(fill = house_rural_or_urban)) + science_theme2 + xlab("Sanitation Level") + ylab("Child Height-for-age (WHO)") +
ggtitle("Height-for-age vs Sanitation Level")
In fact, we can consider if the association exists on a satistical level of significance for urban:
df_urban <- df[df$house_rural_or_urban == "urban",]
model <- lm(df_urban$child_height_for_age ~
relevel(factor(df_urban$t2_house_sanitation_type),ref="Unimproved Sanitation"))
summary <- summary(model)
kable(summary$coefficients)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -11.522222 | 3.522093 | -3.2714135 | 0.0011363 |
| relevel(factor(df_urban$t2_house_sanitation_type), ref = “Unimproved Sanitation”)Improved Sanitation | 5.722222 | 5.224108 | 1.0953491 | 0.2738371 |
| relevel(factor(df_urban$t2_house_sanitation_type), ref = “Unimproved Sanitation”)Intermediate Sanitation | -2.743432 | 3.581738 | -0.7659502 | 0.4440304 |
The regression model does not observe any statistical significance level under 5% from the reference level (Unimproved Sanitation) versus Urban, however there is a difference between the t values from Unimproved to Improved, which only reflects the intuition shown before on the graph.
Lastly, we also verified if controlling for other variables the association could be shown. We can not split on Urban and Rural this time, as there is insufficient data from Urban:
ggplot(df, aes(factor(house_rural_or_urban))) + geom_bar() + science_theme + labs(x = "House Region", y = "Frequency",title="Children's House Region in Ethiopia") + theme(axis.text.x = element_text(angle = 60, hjust = 1))
We also coded the level of service of water according to previous work to use it as control as follows:
df$t2_house_source_type_drinking_water <- df$house_source_type_drinking_water
df$t2_house_source_type_drinking_water[df$t2_house_source_type_drinking_water == 11 | df$t2_house_source_type_drinking_water == 12 | df$t2_house_source_type_drinking_water == 13] <- "Improved Water"
df$t2_house_source_type_drinking_water[df$t2_house_source_type_drinking_water == 21 | df$t2_house_source_type_drinking_water == 31 | df$t2_house_source_type_drinking_water == 32 | df$t2_house_source_type_drinking_water == 41 | df$t2_house_source_type_drinking_water == 42 | df$t2_house_source_type_drinking_water == 51 | df$t2_house_source_type_drinking_water == 61 | df$t2_house_source_type_drinking_water == 71 | df$t2_house_source_type_drinking_water == 81 | df$t2_house_source_type_drinking_water == 91] <- "Unimproved Water"
df<-df[!(df$t2_house_source_type_drinking_water==96),]
And as such, the final model is:
df$t2_house_sanitation_type <- factor(df$t2_house_sanitation_type)
model <- lm(df$child_height_for_age/10 ~
relevel(factor(df$t2_house_sanitation_type),ref="Unimproved Sanitation") +
relevel(factor(df$house_rural_or_urban),ref="rural") +
df$n_house_members +
relevel(factor(df$house_wealth_index),ref="poorest") +
relevel(factor(df$t2_house_source_type_drinking_water),ref="Unimproved Water") +
df$mother_age_years +
relevel(factor(df$mother_highest_education),ref="no education") +
relevel(factor(df$mother_martial_status),ref="never in union") +
relevel(factor(df$child_sex),ref="male") +
df$child_dur_breastfeeding_months
)
summary <- summary(model)
kable(summary$coefficients)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -1.8480779 | 0.3415176 | -5.4113699 | 0.0000001 |
| relevel(factor(df$t2_house_sanitation_type), ref = “Unimproved Sanitation”)Improved Sanitation | 0.2064321 | 0.2350486 | 0.8782527 | 0.3798561 |
| relevel(factor(df$t2_house_sanitation_type), ref = “Unimproved Sanitation”)Intermediate Sanitation | 0.0101981 | 0.1274261 | 0.0800313 | 0.9362161 |
| relevel(factor(df$house_rural_or_urban), ref = “rural”)urban | 0.2595291 | 0.1000932 | 2.5928744 | 0.0095503 |
| df$n_house_members | 0.0390472 | 0.0117889 | 3.3122113 | 0.0009333 |
| relevel(factor(df$house_wealth_index), ref = “poorest”)middle | 0.0000983 | 0.0749509 | 0.0013117 | 0.9989535 |
| relevel(factor(df$house_wealth_index), ref = “poorest”)poorer | -0.0066209 | 0.0734401 | -0.0901535 | 0.9281695 |
| relevel(factor(df$house_wealth_index), ref = “poorest”)richer | -0.0046898 | 0.0750766 | -0.0624670 | 0.9501939 |
| relevel(factor(df$house_wealth_index), ref = “poorest”)richest | 0.1768342 | 0.1054572 | 1.6768342 | 0.0936483 |
| relevel(factor(df$t2_house_source_type_drinking_water), ref = “Unimproved Water”)Improved Water | 0.1212615 | 0.0621430 | 1.9513299 | 0.0510833 |
| df$mother_age_years | -0.0064113 | 0.0041232 | -1.5549394 | 0.1200348 |
| relevel(factor(df$mother_highest_education), ref = “no education”)higher | 0.9362637 | 0.2635816 | 3.5520829 | 0.0003863 |
| relevel(factor(df$mother_highest_education), ref = “no education”)primary | 0.2059067 | 0.0582204 | 3.5366793 | 0.0004095 |
| relevel(factor(df$mother_highest_education), ref = “no education”)secondary | 0.4272295 | 0.1843788 | 2.3171294 | 0.0205439 |
| relevel(factor(df$mother_martial_status), ref = “never in union”)currently in union/living with a man | 0.0564546 | 0.3008529 | 0.1876485 | 0.8511611 |
| relevel(factor(df$mother_martial_status), ref = “never in union”)formerly in union/living with a man | 0.0047566 | 0.3113549 | 0.0152770 | 0.9878119 |
| relevel(factor(df$child_sex), ref = “male”)female | 0.1038863 | 0.0488807 | 2.1253021 | 0.0336188 |
| df$child_dur_breastfeeding_months | -0.0052139 | 0.0006662 | -7.8263684 | 0.0000000 |
For rural and urban, it is clear that it is unlikely the association exist. This seems counter-intuitive given previous work. On verifying the possible cause, a couple reasons can exist:
A less considered hypothesis is, of course, the fact that the association simple don’t occur for 2011 on the sampled region for some other unknown factor particular to the place.