hypegrammaR: Instructions v1

IMPACT Initiatives - Iraq (Mar 2021)

Preface

The hypegrammaR package was created by IMPACT HQ to support quantitative analysis in R. It implements the IMPACT quantitative data analysis guidelines.

Folder Structure

The basic idea of hypegrammaR is to take a set of inputs and turn them into outputs (based on instructions from an R script).

Accordingly, your hypegrammaR working folder contains the following folders and files:

  1. input: Folder containing all the input files needed for the analysis:
    • choices.csv: The choices tab from your KoBo tool.
    • data.csv: Your clean, anonymized dataset.
    • eap.csv: The extended analysis plan, which specifies how hypegrammaR is supposed to analyse your data.
    • questions.csv: The questions tab from your KoBo tool.
    • sampling_frame.csv: Your sampling frame, specifying the strata and corresponding population sizes. This is needed to weight your results; it is not needed for unweighted analysis.
  2. output: Folder containing all the output files displaying the results of the analysis:
    • results.csv: File listing all the results from the analysis.
  3. hypegrammaR.Rproj: R project file. Always open this first, and then run the script ‘run_me.R’ from within the project to ensure the working directory is set correctly.
  4. read_me.html: Instructions file you are currently looking at.
  5. run_me.R: R script pulling all the inputs together, doing calculations and saving outputs.

Inputs

Before you can run your analysis, you need to make sure your input files are set up the right way. If any of the input files is missing or contains a mistake, hypegrammaR will return an error and the analysis will stop.

All your input files need to be saved in the ‘input’ folder as .csv files, with file names spelled exactly as specified above. If there are files from previous analyses in the folder, simply replace them with the new ones.
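
A quick way to catch a missing or misspelled file before running the analysis is to check the ‘input’ folder from R. The snippet below is a minimal sketch using base R only; it assumes the working directory is the project folder and uses the file names listed above. For an unweighted analysis, drop ‘sampling_frame.csv’ from the list.

```r
# Files this guide expects in the 'input' folder (names taken from the list above)
required_files <- c("choices.csv", "data.csv", "eap.csv",
                    "questions.csv", "sampling_frame.csv")

# TRUE/FALSE for each expected file, relative to the working directory
found <- file.exists(file.path("input", required_files))
names(found) <- required_files

# Report anything that is missing or misspelled
missing_files <- required_files[!found]
if (length(missing_files) > 0) {
  message("Missing from 'input': ", paste(missing_files, collapse = ", "))
} else {
  message("All input files found.")
}
```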

Dataset

Format

Your dataset must adhere to standard KoBo XML format:

  • It must not contain labelled values. Make sure you always export the data from KoBo using ‘XML values and headers’ as format.
  • It must have a single row for column headers (unchanged as they come out of KoBo).
  • It may contain additional columns that were not in the original questionnaire. If you added new variables in your dataset post data collection, it is good practice to add them as additional rows to the questionnaire, specifying variable type, choices etc.
  • Please make sure the clean dataset is anonymized at this stage. Delete sensitive data (e.g. names, phone numbers, registration numbers, GPS coordinates etc.).
  • The dataset must be saved as .csv file, put in the ‘input’ folder and named ‘data.csv’.
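
The anonymization step can also be done in R before saving the file. The sketch below uses a toy data frame in place of a real export; the list of sensitive columns is an assumption based on the example column names in this guide and must be adapted to your own tool. It writes to a temporary folder; in practice you would write to ‘input/data.csv’.

```r
# Toy stand-in for a raw export; column names follow the example in this guide
raw <- data.frame(
  ben_name_en      = c("A", "B"),
  telephone_number = c("0750", "0751"),
  dist_location    = c("al_sulaymaniyah", "al_sulaymaniyah"),
  stringsAsFactors = FALSE
)

# Columns treated as sensitive here (an assumption; adapt to your own tool)
sensitive <- c("ben_name_en", "telephone_number")

# Keep every column that is not on the sensitive list
clean <- raw[, !(names(raw) %in% sensitive), drop = FALSE]

# Write without row names so no extra index column is added to the file
out_file <- file.path(tempdir(), "data.csv")
write.csv(clean, out_file, row.names = FALSE)
```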

Example

start end today ben_name_en telephone_number dist_location stratification camp_no_camp camp_name idp_ref
2020-08-16T12:47:14.516+03:00 2020-08-16T14:51:00.692+03:00 8/16/2020 NA NA al_sulaymaniyah al_sulaymaniyah.idp.incamp camp Ashti_IDP refugee
2020-08-16T11:06:51.022+03:00 2020-08-16T16:48:32.724+03:00 8/16/2020 NA NA al_sulaymaniyah al_sulaymaniyah.idp.incamp camp Ashti_IDP refugee
2020-08-16T12:18:34.231+03:00 2020-08-16T12:48:12.089+03:00 8/16/2020 NA NA al_sulaymaniyah al_sulaymaniyah.idp.incamp camp Ashti_IDP refugee
2020-08-16T13:26:00.384+03:00 2020-08-16T13:58:22.406+03:00 8/16/2020 NA NA al_sulaymaniyah al_sulaymaniyah.idp.incamp camp Ashti_IDP refugee
2020-08-16T11:03:43.355+03:00 2020-08-16T12:47:56.403+03:00 8/16/2020 NA NA al_sulaymaniyah al_sulaymaniyah.idp.incamp camp Ashti_IDP refugee
2020-08-16T15:00:22.460+03:00 2020-08-16T15:21:38.619+03:00 8/16/2020 NA NA al_sulaymaniyah al_sulaymaniyah.idp.incamp camp Ashti_IDP refugee

Questionnaire

The questionnaire needs to be included in the toolbox so that hypegrammaR can distinguish between multiple and single choice questions, and look up the corresponding labels of the variables. Simply save the two tabs from your KoBo tool as ‘choices.csv’ and ‘questions.csv’, respectively, and put the files in the ‘input’ folder.

Sampling Frame

Format

Your sampling frame must be complete, meaning that there needs to be a population figure for each stratum in your data. The format should be as in the example below:

  • Two columns named ‘strata.names’ and ‘population’.
  • One row per stratum with name and population estimate.
  • The values in strata.names must match exactly the values of the stratification column in the dataset (the script creates this column as ‘strata’ in step 3; see below for instructions).
  • It must be saved as a .csv file, placed in the ‘input’ folder and named ‘sampling_frame.csv’.
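
Mismatched strata names are easy to miss by eye. The sketch below shows one way to list strata that occur in the data but not in the sampling frame, using base R and toy values. The dataset column is called ‘strata’ here as an assumption; use whatever name your stratification column has.

```r
# Toy sampling frame and dataset; values are illustrative only
sampling_frame <- data.frame(
  strata.names = c("duhok.refugee.camp", "erbil.refugee.camp"),
  population   = c(13118, 6440)
)
assessment_data <- data.frame(
  strata = c("duhok.refugee.camp", "erbil.refugee.camp", "Erbil.refugee.camp")
)

# Strata present in the data but absent from the sampling frame
unmatched <- setdiff(unique(assessment_data$strata), sampling_frame$strata.names)
unmatched
# "Erbil.refugee.camp" (the capital letter breaks the match)
```

An empty result means every stratum in the data has a population figure.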

Example

strata.names population
duhok.refugee.camp 13118
erbil.refugee.camp 6440
al_sulaymaniyah.refugee.camp 2123
al_sulaymaniyah.idp.camp 2057
baghdad.idp.camp 148
diyala.idp.camp 989

Extended Analysis Plan (EAP)

Last but not least, the toolbox requires an extended analysis plan (EAP) in order to work. The EAP is an extended version of the data analysis plan (DAP) and lists all the indicators you wish to analyze, and defines how indicators are disaggregated. Without it, the script does not know what to calculate.

Format

The EAP must have exactly 9 columns (each of which is explained below). Each row specifies the analysis for one indicator. You can add as many rows as you need for your analysis. The script will then go through the EAP and do calculations row by row.

Columns:

  1. research.question: Insert the indicator’s corresponding research question from the DAP. You may also leave this column empty (or write ‘NA’).
  2. sub.research.question: Same as above, but for sub-research questions.
  3. hypothesis: Specify the hypothesis as defined in the DAP.
  4. hypothesis.type: Put ‘direct_reporting’ for each row. Other types are not currently implemented in hypegrammaR.
  5. dependent.variable: The name of the variable you wish to analyze (as specified in the KoBo tool).
  6. dependent.variable.type: The type of the variable you wish to analyze, either ‘numerical’ (for numbers) or ‘categorical’ (for characters).
  7. independent.variable: If you wish to disaggregate your results, specify the variable you wish to disaggregate by here. Leave empty if no disaggregation is needed.
  8. independent.variable.type: Specify the type (‘numerical’ or ‘categorical’) of the disaggregation variable. Leave empty if no disaggregation is needed.
  9. repeat.for.variable: If you wish to do disaggregated analysis by group (two-level disaggregation), specify the name of the categorical variable here. Leave empty if no further disaggregation is needed.

There is no need to specify answer options for multiple choice questions as hypegrammaR will recognize these automatically.
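
Before running the analysis, it can help to verify that the EAP has the expected columns and values. The following is a minimal sketch with a one-row toy EAP built in R; the hypothesis text is a placeholder, and in practice you would read your own file with read.csv("input/eap.csv") instead.

```r
# The nine column names listed above, in order
expected_cols <- c("research.question", "sub.research.question", "hypothesis",
                   "hypothesis.type", "dependent.variable",
                   "dependent.variable.type", "independent.variable",
                   "independent.variable.type", "repeat.for.variable")

# Toy EAP with one row; the hypothesis text is a placeholder
eap <- data.frame(
  research.question         = NA,
  sub.research.question     = NA,
  hypothesis                = "placeholder hypothesis",
  hypothesis.type           = "direct_reporting",
  dependent.variable        = "collection_hygiene",
  dependent.variable.type   = "categorical",
  independent.variable      = "idp_ref",
  independent.variable.type = "categorical",
  repeat.for.variable       = NA
)

# Column names and order match the specification
cols_ok  <- identical(names(eap), expected_cols)
# Only the two documented variable types are used
types_ok <- all(eap$dependent.variable.type %in% c("numerical", "categorical"))
# Only the hypothesis type this guide describes is used
hyp_ok   <- all(eap$hypothesis.type == "direct_reporting")
c(cols_ok, types_ok, hyp_ok)
```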

Example

research.question sub.research.question hypothesis hypothesis.type dependent.variable dependent.variable.type independent.variable independent.variable.type repeat.for.variable
NA NA Proportion of beneficiary households by priority needs … direct_reporting needs_before categorical NA
NA NA Proportion of beneficiary households reporting measures being taken … direct_reporting collection_hygiene categorical idp_ref categorical NA

R Script

Install R & RStudio: Before you can use hypegrammaR, make sure you have a recent version of R installed (version 4.0.2 or newer). Download R and RStudio, and re-install R if your version is older. Next, install the hypegrammaR package (only once per computer) by running the following code in R:

install.packages("devtools")  # skip this line if devtools is already installed
devtools::install_github("https://github.com/impact-initiatives/hypegrammaR")

Always open the project file (‘hypegrammaR.Rproj’) first, and then open the script (‘run_me.R’) from within RStudio. The script consists of the six components below. If you use the script for the first time in a new assessment, some of them may need adjusting (in particular step 3). Click ‘Source’ on the top right of the script pane to run the script.

1. Load Package

The first step is to simply call for R to load the hypegrammaR package.

library(hypegrammaR)

2. Load Files

Next you load in all the input files using the following hypegrammaR functions:

assessment_data <- load_data(file = "input/data.csv")
sampling_frame  <- load_samplingframe("input/sampling_frame.csv")
questionnaire   <- load_questionnaire(data = assessment_data,
                                      questions = "input/questions.csv",
                                      choices = "input/choices.csv",
                                      choices.label.column.to.use = "label::English"
                                      )
analysisplan    <- load_analysisplan(file = "input/eap.csv")

3. Define Stratification Variable

You then define a variable that includes the strata names by pasting together the character strings from the variables you stratify by. Make sure the strata names in the sampling frame follow the exact same logic (same spelling, divided by “.” etc.).

In the example below, strata were defined along three variables (location, population group & camp status). Change/add/remove variables as needed.

assessment_data$strata <- paste(assessment_data$dist_location,
                                assessment_data$idp_ref,
                                assessment_data$camp_no_camp,
                                sep = "."
                                )

4. Apply Weights from Stratification

In the next step, you map the weights from the sampling frame to the dataset:

weights <- map_to_weighting(sampling.frame = sampling_frame,
                            data.stratum.column = "strata",
                            sampling.frame.population.column = "population",
                            sampling.frame.stratum.column = "strata.names"
                            )
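
To make the idea of stratified weights concrete, here is a small base R illustration with toy numbers. It shows the general principle (population share divided by sample share) and is not necessarily the exact computation hypegrammaR performs internally; all object names are hypothetical.

```r
# Toy sampling frame: population per stratum
sampling_frame <- data.frame(
  strata.names = c("a", "b"),
  population   = c(900, 100)
)

# Toy sample: 10 interviews in each stratum
strata <- rep(c("a", "b"), each = 10)

# Share of each stratum in the population and in the sample
pop_share <- sampling_frame$population / sum(sampling_frame$population)
names(pop_share) <- sampling_frame$strata.names
sample_share <- table(strata) / length(strata)

# Weight per stratum: population share divided by sample share
stratum_weight <- pop_share[names(sample_share)] / as.numeric(sample_share)

# One weight per interview, looked up by stratum name
w <- unname(stratum_weight[strata])
```

Stratum ‘a’ holds 90% of the population but only 50% of the sample, so its interviews are weighted up (1.8), while those in stratum ‘b’ are weighted down (0.2).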

Unweighted calculations: If your assessment is not stratified/weighted, skip loading the sampling frame in step 2, delete steps 3 and 4, and remove the ‘weighting’ line in step 5.

Additional code: You may add additional code to your script before running the analysis (step 5) if you wish to do so (e.g. filter dataset, create additional variables etc.). See Annex below for guidance on how to filter your dataset or add additional variables.

5. Run Analysis

Now, you are all set to run the analysis, like so:

resultlist <- from_analysisplan_map_to_output(data = assessment_data,
                                              analysisplan = analysisplan,
                                              weighting = weights,
                                              #labeled = TRUE,
                                              questionnaire = questionnaire
                                              )

Uncomment the line labeled = TRUE if you want your output to display variable labels rather than names.

Depending on the size of your EAP, running the analysis may take up to a couple of minutes. The progress is displayed in the console.

6. Export Results

Once R has gone through all the rows in the EAP and calculated the results, you then only need to export them as a table in a .csv file.

map_to_master_table(resultlist$results, "output/results.csv")

Outputs

Once you have run the script successfully, your results are saved in the ‘output’ folder.

Results

The results file includes 11 columns. The number of rows depends on your EAP and choice list. The file follows the structure of the EAP: each line from the EAP is calculated and then saved in the results table one after the other.

Example

X dependent.var independent.var dependent.var.value independent.var.value numbers se min max repeat.var repeat.var.value
1 needs_before NA rent NA 0.0113577 NA 0.0051126 0.0176028 NA NA
2 needs_before NA food_drink NA 0.9903370 NA 0.9801565 1.0000000 NA NA
3 needs_before NA utilities NA 0.5738210 NA 0.5300790 0.6175629 NA NA
4 collection_hygiene idp_ref covid_small_number idp 0.5954306 NA 0.5305742 0.6602870 NA NA
5 collection_hygiene idp_ref covid_small_number refugee 0.7364652 NA 0.6857173 0.7872132 NA NA
6 collection_hygiene idp_ref covid_social_dist idp 0.6307426 NA 0.5665483 0.6949368 NA NA

How to read the results:

  • If a variable specified in the EAP is categorical (as opposed to numerical), each answer option will be displayed on a separate row in the results table.
  • If you ran a disaggregated analysis, there is also a separate row for each value that the disaggregation variable (independent.var) takes. The same logic applies to two-level disaggregation (‘repeat.var’).
  • Your estimated (population) proportions for each indicator (and disaggregation) are displayed as decimal fractions in column ‘numbers’. Example: 57.4% of the target population reported ‘utilities’ under ‘needs_before’ (line 3).
  • ‘min’ and ‘max’ indicate the lower and upper bounds of the confidence interval. (Ignore the ‘se’ column.)
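
For reporting, the decimal fractions can be turned into percentages. The sketch below does this in base R for line 3 of the example table; the data frame is typed in by hand for illustration, and the ‘label’ column is a hypothetical addition, not part of the hypegrammaR output.

```r
# Line 3 of the example results table above
res <- data.frame(
  dependent.var       = "needs_before",
  dependent.var.value = "utilities",
  numbers             = 0.5738210,
  min                 = 0.5300790,
  max                 = 0.6175629
)

# Convert proportions to percentages with one decimal place
res$label <- sprintf("%.1f%% (CI: %.1f%% to %.1f%%)",
                     100 * res$numbers, 100 * res$min, 100 * res$max)
res$label
# "57.4% (CI: 53.0% to 61.8%)"
```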

Annex: Additional Functions

The above R script outlines the basic code chunks needed to make hypegrammaR work. It can be expanded with additional code as needed. In its basic version, the script requires that your dataset already is in exactly the right format, which may not always be the case. In some cases, you may want to filter out certain surveys or create additional variables.

Here are some basic functions that you may find useful, and which you could add to the script after step 2 (after loading in the data).

Filter Dataset

Let’s say you want to filter your dataset and only include surveys with a certain attribute (e.g. only IDP households, or only answered phone calls). This can easily be done by calling the filter() function from the mighty dplyr package.

First, install the dplyr package (if you have not already done so) by typing the following:

install.packages("dplyr")

You only have to install the package on your computer once. However, whenever you use it in one of your R sessions you have to ‘activate’ it before using any of its functions, like so:

library(dplyr)

Let’s say you want to filter your dataset (let’s call it data) by a variable indicating population group (let’s call it pop_group), and only include IDPs (cell value idp):

data <- data %>% filter(pop_group == "idp")

Here is how you read this line of code: Define object data as (<-) data and (%>%) filter it (filter()) by variable pop_group, only keeping surveys that have the value idp.

Other examples of filtering your data:

data <- data %>% filter(pop_group == "idp" | pop_group == "refugee")
data <- data %>% filter(pop_group == "idp" & gov == "al-basrah")
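
When keeping several values of the same variable, the %in% operator is a compact alternative to chaining conditions with |. The sketch below uses a toy dataset with illustrative values and stores the result in a new, hypothetical object (subset_in) so the original data stays untouched.

```r
library(dplyr)

# Toy dataset; values are illustrative only
data <- data.frame(
  pop_group = c("idp", "refugee", "host", "idp"),
  gov       = c("al-basrah", "erbil", "erbil", "duhok")
)

# Same result as pop_group == "idp" | pop_group == "refugee"
subset_in <- data %>% filter(pop_group %in% c("idp", "refugee"))
nrow(subset_in)
# 3
```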

Add New Variables

You may also want to create additional variables for your analysis. Use the mutate() function from the dplyr package to do that.

Let’s assume you want to create an additional indicator, called exp_total, which is the sum of different expenditure categories (exp_food, exp_rent and exp_nfi):

data <- data %>% mutate(exp_total = exp_food + exp_rent + exp_nfi)

You could insert any function after the = sign depending on what you want. Here is an example of how to create the stratification variable as in step 3, but with mutate() from the dplyr package:

# Inside mutate(), refer to columns by name (no assessment_data$ prefix).
# The new column is called strata so that it matches step 4.
data <- data %>%
  mutate(strata = paste(dist_location,
                        idp_ref,
                        camp_no_camp,
                        sep = "."
                        )
         )

You may even combine the filter() and the mutate() functions with the %>% operator:

# Create the strata column first, then keep only IDP surveys
data <- data %>%
  mutate(strata = paste(dist_location,
                        idp_ref,
                        camp_no_camp,
                        sep = "."
                        )
         ) %>%
  filter(pop_group == "idp")
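
One caveat for the exp_total example above: adding columns with + returns NA for any survey where one of the categories is missing. The base R sketch below, on toy data, contrasts this with rowSums(), which can skip missing values. Treating a missing expenditure as zero is an analytical assumption that may or may not suit your assessment; exp_total_strict is a hypothetical name used only for the comparison.

```r
# Toy expenditure data with one missing value
data <- data.frame(
  exp_food = c(100, 50),
  exp_rent = c(200, NA),
  exp_nfi  = c(10, 5)
)

# Plain addition: any NA makes the whole total NA
data$exp_total_strict <- data$exp_food + data$exp_rent + data$exp_nfi

# rowSums with na.rm = TRUE treats missing categories as zero
data$exp_total <- rowSums(data[, c("exp_food", "exp_rent", "exp_nfi")],
                          na.rm = TRUE)
data
```

The first survey gets 310 in both columns; the second gets NA in exp_total_strict and 55 in exp_total.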
