This is a dashboard created with R Shiny. The source code can be found on GitHub. The analysis is based on the illuminate package, which can be found here.
Q1. Explain p-values in layman's terms. Feel free to use analogies or examples. Keep it simple, but make sure to stay technically accurate.
A p-value is a statistical measure that tells us how surprising a result would be if nothing unusual were going on. More precisely, it is the probability of observing a result at least as extreme as the one we got, assuming the null hypothesis is true. Imagine you flip a coin and get heads four times in a row. You might wonder whether the coin is biased towards heads. The p-value is the probability of getting four heads in a row assuming the coin is fair, which is 0.5^4 = 0.0625. If the p-value is low, the result would be unusual for a fair coin, which gives us grounds to reject the null hypothesis (that the coin is not biased). Note that the p-value is not the probability that the null hypothesis is true.
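As a minimal sketch of the coin example, the p-value can be computed in base R with `binom.test`. The one-sided alternative (bias towards heads) is an assumption made here for illustration.

```r
# Coin example: 4 heads in 4 flips, null hypothesis = fair coin (p = 0.5).
# One-sided test, because the question is whether the coin favours heads.
res <- binom.test(x = 4, n = 4, p = 0.5, alternative = "greater")
p_value <- res$p.value

# Under the null hypothesis, P(4 heads in 4 flips) = 0.5^4
print(p_value)  # 0.0625
print(0.5^4)    # 0.0625
```

At the conventional 0.05 threshold, four heads in a row is therefore not quite enough evidence to reject the hypothesis of a fair coin.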
a. Why are they important?
P-values are important because they let us assess whether the findings of a study or experiment are statistically significant. In other words, they help us judge whether the differences or correlations we observe between variables could plausibly be explained by chance alone. Establishing this matters in research because it affects how much confidence we can place in the conclusions. If a finding is not statistically significant, the data do not provide enough evidence of an association between the variables under study; this does not prove that no association exists. If a finding is statistically significant, the observed effect would be unlikely under chance alone, which gives us more reason to believe that the variables are related. Statistical significance does not by itself mean that the effect is large or important in practice.
b. How can they be interpreted?
By convention, a significance threshold of 0.05 (5%) is often used. If the p-value is less than 0.05, we reject the null hypothesis and call the result statistically significant. If the p-value is greater than 0.05, we fail to reject the null hypothesis, and the result is not statistically significant.
c. What are some common pitfalls/misunderstandings in their use and interpretation?
The common pitfalls are given below:
- Treating a p-value as a definitive answer, rather than one piece of evidence.
- Focusing solely on the p-value and ignoring effect sizes and confidence intervals.
- Using p-values to make causal claims, when they only show correlation or association.
- Misinterpreting a p-value greater than 0.05 as evidence for the null hypothesis, rather than simply failing to reject it.
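The second pitfall can be illustrated with a small sketch in base R. The counts below are hypothetical and chosen only to show that, with a large sample, a very small p-value can accompany a small effect, which is why the estimate and its confidence interval should be reported alongside it.

```r
# HYPOTHETICAL data: 5,200 heads in 10,000 flips of a coin.
res <- binom.test(x = 5200, n = 10000, p = 0.5)

p_value  <- res$p.value           # very small: "statistically significant"
estimate <- unname(res$estimate)  # observed proportion of heads
conf_int <- res$conf.int          # 95% confidence interval for the proportion

print(p_value)
print(estimate)
print(conf_int)
```

Here the result is statistically significant, yet the estimated proportion of heads is only 0.52, a small departure from a fair coin.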
Q2. When would a Mosaic plot be an appropriate visualization?
A mosaic plot is a graph showing the joint distribution of two or more categorical variables: the plot area is divided into tiles whose areas are proportional to the frequency of each combination of categories. It is particularly useful when exploring relationships between categorical variables, such as interactions or dependencies. It allows several variables to be visualised simultaneously, although it becomes hard to read once more than three or four variables, or variables with many categories, are included. It can also help in identifying patterns in the data that may not be immediately apparent in a traditional bar or pie chart. For example, a mosaic plot could be used to visualise the distribution of a particular disease across different age groups and genders, or the distribution of different types of car accidents across different times of the day and weather conditions.
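A minimal sketch of a mosaic plot, using base R's `mosaicplot` and the built-in `Titanic` contingency table rather than the assessment dataset:

```r
# Titanic is a built-in 4-way contingency table (Class, Sex, Age, Survived).
# Collapse it to a 2-way table of Sex by Survived.
tab <- margin.table(Titanic, margin = c(2, 4))
print(tab)

# Each tile's area is proportional to the count in that cell.
mosaicplot(tab, main = "Survival by sex (Titanic)", color = TRUE)
```

The width of each column reflects the share of each sex among passengers and crew, and the split within a column reflects the share that survived.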
Q3. What is personally identifiable information (PII)? Provide an example. When is it ok to collect PII?
Any information that may be used to identify a particular person is referred to as personally identifiable information (PII). This may include a person’s name, date of birth, address, phone number, email address, social security number, or any other data that may be used to identify them.
In most cases it is preferable not to collect PII. However, it can be collected when there is a legitimate organisational purpose, such as healthcare, financial transactions or, in some cases, protection. PII is also needed when a series of research rounds must follow the same respondents, so that they can be re-contacted. In all cases, it is essential to ensure that PII is collected and stored securely and that individuals know how their information is being used and protected. In many cases, individuals must give their consent before their PII is collected or used.
Q1. There are errors in the dataset. Please identify at least four errors by highlighting them in yellow in the Excel sheets. In the cleaning log tab, report the cell IDs, the variable name and, in the comment column, a short explanation of why you think this value may be an error.
Step 1: Reading the dataset
library(tidyverse)
library(illuminate)
library(srvyr)
df <- read.csv("03_questions/Data Analysis Officer test (150 min)/Annex 1 - REACH Assessment Test Database_DataAnalyst_v2.csv")
df <- df |> fix_data_type() ## fix the data type for each column
Step 2: Check for duplicates
cleaning_log <- list() ## Creating list object
cleaning_log[["duplicated_survey"]] <- tibble(
variable = "All",
`meta/instanceID` = df$InterviewID[duplicated(df$InterviewID)],
comment = "Duplicated survey"
)
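One detail of the step above is worth illustrating: `duplicated()` flags only the second and later occurrences of an ID, so the first copy of each duplicated survey is not listed in the log. The sketch below uses hypothetical toy data (only the column name `InterviewID` is taken from the code above) and shows one way to flag every copy if that is preferred.

```r
# Hypothetical toy data; InterviewID mirrors the column name used above.
toy <- data.frame(
  InterviewID = c("a1", "b2", "a1", "c3", "b2", "a1"),
  stringsAsFactors = FALSE
)

# duplicated() marks only the second and later occurrences of each ID
later_copies <- toy$InterviewID[duplicated(toy$InterviewID)]

# To flag every record involved in a duplication, scan from both ends
all_copies <- duplicated(toy$InterviewID) |
  duplicated(toy$InterviewID, fromLast = TRUE)

print(later_copies)       # "a1" "b2" "a1"
print(which(all_copies))  # 1 2 3 5 6
```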
Step 3: Check the total number of household members
hh_cols <- c("Number.household.member.boy.under5.years.old",
"Number.household.member._girl_under5.years.old", "Number.household.member.boy_5_17.years.old",
"household_girl_5_17", "number.adult.household.members.years.old")
hh_cols2 <- paste0(hh_cols, collapse = ", ") ## readable list of columns for the comment
df <- df |> mutate(
hh_size = rowSums(across(all_of(hh_cols)), na.rm = TRUE) ## NA counts are treated as 0
)
df_hh_inconsistency <- df |> filter(hh_size != Total.household.number)
cleaning_log[["df_hh_inconsistency"]] <- df_hh_inconsistency |> mutate(
comment = glue::glue("Sum of {hh_cols2} ({hh_size}) does not match Total household number ({Total.household.number})")
) |> select(InterviewID,comment,Total.household.number,hh_size) |> mutate(variable = "Total.household.number") |> rename(
`meta/instanceID` = InterviewID,
value = Total.household.number,
new_value= hh_size
)
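The consistency check in this step can be sketched on a small hypothetical data frame in base R. The column names below are shortened and made up for illustration; they are not the dataset's real names.

```r
# Hypothetical toy data with shortened, made-up column names.
toy <- data.frame(
  id     = c("h1", "h2", "h3"),
  boys   = c(1, 0, 2),
  girls  = c(0, NA, 1),
  adults = c(2, 2, 2),
  total  = c(3, 2, 6),
  stringsAsFactors = FALSE
)
member_cols <- c("boys", "girls", "adults")

# Missing counts are treated as zero, as in the step above
toy$hh_size <- rowSums(toy[member_cols], na.rm = TRUE)

# Keep only households whose reported total disagrees with the sum
mismatch <- toy[toy$hh_size != toy$total, c("id", "total", "hh_size")]
print(mismatch)
```

Only household `h3` is flagged: its members sum to 5 while the reported total is 6.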
Step 4: Check for outliers
cleaning_log[["outliers"]] <- illuminate::identify_outliers(df = df,cols_to_report = "InterviewID") |> rename(
variable = question,
`meta/instanceID` = InterviewID,
comment = issue,
value = old_value
) |> mutate(
comment = paste0("(",value, ") / Suspicious value, should be double-checked with the team")
)
[1] "checking_Number.household.member.boy.under5.years.old"
[1] "checking_Number.household.member._girl_under5.years.old"
[1] "checking_Number.household.member.boy_5_17.years.old"
[1] "checking_household_girl_5_17"
[1] "checking_number.adult.household.members.years.old"
[1] "checking_Total.household.number"
[1] "checking_hh_size"
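The rule that `illuminate::identify_outliers` applies is not shown here. As a rough illustration of what an outlier check can look like, the sketch below uses Tukey's 1.5 × IQR rule in base R on hypothetical values; this is a common convention and not necessarily the method the package uses.

```r
# Hypothetical household sizes; 30 is the suspicious value.
x <- c(3, 4, 5, 4, 6, 5, 4, 30)

q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Tukey's fences: values beyond 1.5 * IQR from the quartiles are flagged
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
flagged <- x[x < lower | x > upper]

print(c(lower = lower, upper = upper))
print(flagged)  # 30
```

Flagged values are candidates for follow-up with the field team rather than automatic deletions.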
Step 5: Bind the cleaning logs
cleaning_log <- bind_rows(cleaning_log) |> select(variable,`meta/instanceID`,value,new_value,comment)
2. Using the programming language of R, create a new variable characterizing the household drinking water source into improved / unimproved source following the classification below. Paste the code / function you used below.
# creating classification -------------------------------------------------
improve_source <- c("Protected dug well", "Piped water to yard or plot",
"Piped water into dwelling (house)", "Bottled water", "Tube well or borehole",
"Public tap or standpipe", "Protected spring")
unimprove_source<- c("Cart with small tank or drum","Unprotected dug well",
"Unprotected spring", "Surface water", "Tanker-truck",
"Rainwater collection")
df_with_class <- df |> mutate(
drinking_water_class = case_when(drinking_water_source %in% improve_source ~ "Improved Source",
drinking_water_source %in% unimprove_source ~ "Unimproved Source")
)
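Because the classification relies on exact string matches, any source label that appears in neither list silently becomes `NA`. A minimal base R sketch of how such labels could be detected, using hypothetical toy values (the vector names below are made up for illustration):

```r
# Hypothetical toy values; "Protected Spring" (capital S) deliberately
# fails to match the reference label "Protected spring".
source_seen <- c("Bottled water", "Surface water", "Protected Spring", NA)
improved    <- c("Bottled water", "Protected spring")
unimproved  <- c("Surface water")

water_class <- ifelse(source_seen %in% improved, "Improved Source",
               ifelse(source_seen %in% unimproved, "Unimproved Source",
                      NA_character_))

# Non-missing labels that matched neither list
unmatched <- setdiff(source_seen[!is.na(source_seen)], c(improved, unimproved))

print(water_class)
print(unmatched)  # "Protected Spring"
```

An empty `unmatched` vector would indicate that every non-missing label was classified.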
3. This exercise requires the results of the previous exercise. Use any tools, statistics and visualizations that you see fit to analyze the questions below regarding how access to improved water sources changed between the baseline (first data collection round) and the endline (second data collection round, after a water improvement project has been implemented). Records for both rounds are in the same dataset; The column “data_collection_round” is “Baseline” for records of the first round, and “Endline” for records from the second round. Please share all code/files used for the analysis.
ana_by_single_header_round <- survey_analysis(df = df_with_class,weights = FALSE,vars_to_analyze = "drinking_water_class",
disag = c("single_headed_household","data_collection_round")) |> rename(
single_headed_household = subset_1_val,
data_collection_round = subset_2_val
)
ana_by_diarhea_round <- survey_analysis(df = df_with_class,weights = FALSE,vars_to_analyze = "drinking_water_class",
disag = c("diarrhea_under_5","data_collection_round")) |> rename(
diarrhea_under_5 = subset_1_val,
data_collection_round = subset_2_val
) |> filter(!is.na(diarrhea_under_5))
names(ana_by_single_header_round) <- names(ana_by_single_header_round) |> snakecase::to_sentence_case()
names(ana_by_diarhea_round) <- names(ana_by_diarhea_round) |> snakecase::to_sentence_case()
#### Creating graph for single_headedhousehold
ana_by_single_header_round$`Single headed household` <-
factor(ana_by_single_header_round$`Single headed household`,levels = c("Yes","No"),
labels = c("Yes","No"))
plot_single <- ggplot(ana_by_single_header_round,
aes(fill= Choice,
y=Stat*100, x= `Single headed household`)) +
geom_bar(position="dodge", stat="identity")+
theme(panel.background = element_rect(fill = "white",
colour = "black",
linewidth = 0.5, linetype = "solid"), ## linewidth replaces size for lines in ggplot2 >= 3.4.0
panel.grid.major.x = element_line(),
legend.position="bottom"
) +
ylab("% of HH by drinking water source class")+
facet_wrap(~`Data collection round`)
## Creating graph for diarrhea
ana_by_diarhea_round$`Diarrhea under 5` <-
factor(ana_by_diarhea_round$`Diarrhea under 5`,levels = c("Yes","No"),
labels = c("Yes","No"))
plot_improved_diaria <- ggplot(ana_by_diarhea_round,
aes(fill= Choice,
y=Stat*100, x= `Diarrhea under 5`)) +
geom_bar(position="dodge", stat="identity")+
theme(panel.background = element_rect(fill = "white",
colour = "black",
linewidth = 0.5, linetype = "solid"), ## linewidth replaces size for lines in ggplot2 >= 3.4.0
panel.grid.major.x = element_line(),
legend.position="bottom"
) +
ylab("% of HH by drinking water source class")+
facet_wrap(~`Data collection round`)
I. Did single headed households receive more/less improvements? (relevant data column: “single_headed_household”)
- The following graphs illustrate the percentage of HH reporting each
type of drinking water source, by single-headed household status, for
both baseline and endline. During the baseline, single-headed
households had relatively less access to improved drinking water
sources. By the endline, however, access had improved to 97%, up from
91% at baseline.
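Whether a change between rounds is larger than chance variation would explain can be checked with a two-sample test of proportions. The sketch below uses base R's `prop.test`; the counts are hypothetical placeholders chosen to mirror the percentages above and are not taken from the dataset, so the real counts would need to be tabulated from `df_with_class` first.

```r
# HYPOTHETICAL counts for illustration only (not taken from the dataset):
# households with an improved source / households surveyed, per round.
improved <- c(baseline = 91, endline = 97)
surveyed <- c(baseline = 100, endline = 100)

res <- prop.test(x = improved, n = surveyed)

print(res$estimate)  # observed proportion per round
print(res$p.value)   # p-value for the difference between rounds
print(res$conf.int)  # confidence interval for the difference in proportions
```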
II. Did the improvements affect cases of diarrhea in children under 5? (relevant data column: “diarrhea_under_5”)
- Overall, among diarrhea-affected households, 91% reported having access to an improved drinking water source, compared with 95% among non-affected households. In other words, the share of households without such access is higher among affected households. Additionally, the percentage of households with access to an improved drinking water source was lower at baseline than at endline. The following graphs illustrate the percentage of HH reporting an improved drinking water source, by whether a child under 5 was affected by diarrhea, for both baseline and endline.
Q1. Using the same dataset (Annex 1) create dashboard composed of test statistics and 3-4 graphs to explain changes regarding water and sanitation of households assessed between baseline and end-line. As in all other exercises, please do not forget to include your code as free text including any functions/packages/sources used to complete this. Bonus points if these are well documented and re-usable. Complement the graphs with a short interpretation, explaining the main message of your dashboard. Note down any assumptions you made based on the information available.
Step 1: Identifying water, sanitation and hygiene related variables and analyzing them
cols_to_analysis <- c("Household.treating.water", "Improvedsanitationfacility", "handwashingfull", "Household.praticing.open.defecation", "Frequency.respondant.report.handwhashing.a.day")
wash_analysis <- survey_analysis(df = df_with_class,weights = FALSE,vars_to_analyze = cols_to_analysis,
disag = "data_collection_round" )
wash_analysis$main_variable <- snakecase::to_sentence_case(wash_analysis$main_variable)
wash_analysis$main_variable <- wash_analysis$main_variable |>
str_replace_all(pattern = "Improvedsanitationfacility",replacement = "Improved sanitation facility") |>
str_replace_all(pattern = "Handwashingfull",replacement = "Handwashing full")
Step 2: Creating Visualization
variable_name <- wash_analysis$main_variable |> unique()
plot_name <-list()
for (i in variable_name) {
plot_df <- wash_analysis |> filter(main_variable ==i) ## separate name so the dataset `df` is not overwritten
name <- paste0("plot_",i) |> snakecase::to_snake_case()
plot_name[[name]]<- ggplot(plot_df,
aes(fill= choice,
y=stat*100, x= subset_1_val)) +
geom_bar(position="dodge", stat="identity")+
theme(panel.background = element_rect(fill = "white",
colour = "black",
linewidth = 0.5, linetype = "solid"), ## linewidth replaces size for lines in ggplot2 >= 3.4.0
panel.grid.major.x = element_line(),
legend.position="bottom"
) +
ylab(paste0("% of ",i))+
xlab("")
}
The graph below compares the percentage of HH treating their water at baseline and endline. The result suggests a slight improvement after the water improvement project: around 65% of households reported treating their water before drinking at endline, compared with 62% at baseline.
plot_name$plot_household_treating_water
The graph below compares the percentage of HH that reported having handwashing facilities at baseline and endline.
plot_name$plot_handwashing_full
The graph below compares the percentage of HH that reported practicing open defecation at baseline and endline.
plot_name$plot_household_praticing_open_defecation
The graph below compares how frequently respondents reported washing their hands per day at baseline and endline.
plot_name$plot_frequency_respondant_report_handwhashing_a_day
Thanks for your time!