hypegrammaR: Instructions v1

Pretext

The hypegrammaR package was created by IMPACT HQ to support quantitative analysis in R. It implements the IMPACT quantitative data analysis guidelines.

Folder Structure

The basic idea of hypegrammaR is to take a set of inputs and turn them into outputs (based on instructions from an R script).

Accordingly, your hypegrammaR working folder contains the following files and folders:

input: Folder containing all the input files needed for the analysis:
- choices.csv: The choices tab from your KoBo tool.
- data.csv: Your clean, anonymized dataset.
- eap.csv: The extended analysis plan, which specifies how hypegrammaR is supposed to analyse your data.
- questions.csv: The questions tab from your KoBo tool.
- sampling_frame.csv: Your sampling frame, specifying the strata and corresponding population sizes. This is needed to weight your results and is not required for unweighted analysis.
output: Folder containing all the output files displaying the results of the analysis:
- results.csv: File listing all the results from the analysis.
hypegrammaR.Rproj: R project file. Always open this first, and then run the script ‘run_me.R’ from within the project to ensure the working directory is set correctly.
read_me.html: Instructions file you are currently looking at.
run_me.R: R script pulling all the inputs together, doing calculations and saving outputs.

Inputs

Before you can run your analysis, you need to make sure your input files are set up correctly. If any of the input files is missing or contains a mistake, hypegrammaR will return an error and the analysis will stop.

All your input files need to be saved in the ‘input’ folder as .csv files and named exactly as specified above. If there are files from previous analyses in the folder, simply replace them with the new ones.
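
A quick check before running the script can catch missing or misnamed input files early. The sketch below uses only base R; the helper name missing_inputs is hypothetical and not part of hypegrammaR.

```r
# Hypothetical helper (not part of hypegrammaR): list the expected input
# files that are absent from the given folder.
missing_inputs <- function(input_dir = "input", weighted = TRUE) {
  expected <- c("choices.csv", "data.csv", "eap.csv", "questions.csv")
  # The sampling frame is only needed for weighted analysis.
  if (weighted) expected <- c(expected, "sampling_frame.csv")
  expected[!file.exists(file.path(input_dir, expected))]
}

# Returns character(0) when every expected file is present.
missing_inputs("input")
```

Calling missing_inputs("input", weighted = FALSE) skips the sampling frame, which matches the unweighted case described in this document.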

Dataset

Format

Your dataset must adhere to standard KoBo XML format:

It must not contain labelled values. Make sure you always export the data from KoBo using ‘XML values and headers’ as format.
It must have a single row for column headers (unchanged as they come out of KoBo).
It may contain additional columns that were not in the original questionnaire. If you added new variables to your dataset after data collection, it is good practice to add them as additional rows to the questionnaire, specifying variable type, choices etc.
Please make sure the clean dataset is anonymized at this stage. Delete sensitive data (e.g. names, phone numbers, registration numbers, GPS coordinates etc.).
The dataset must be saved as a .csv file, placed in the ‘input’ folder and named ‘data.csv’.
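
One way to anonymize the dataset is to drop the sensitive columns in R before saving it. The following is a minimal base R sketch on toy values; the column names are taken from the example dataset in this document, and the list of sensitive columns is an assumption you would adapt to your own tool.

```r
# Toy dataset; column names follow the example dataset in this document.
raw <- data.frame(
  ben_name_en      = c("A", "B"),
  telephone_number = c("111", "222"),
  dist_location    = c("al_sulaymaniyah", "al_sulaymaniyah"),
  stringsAsFactors = FALSE
)

# Assumed list of sensitive columns; adapt it to your own questionnaire.
sensitive <- c("ben_name_en", "telephone_number")

# Keep every column that is not on the sensitive list.
anonymized <- raw[, !(names(raw) %in% sensitive), drop = FALSE]
names(anonymized)
```

The result could then be written to the ‘input’ folder with write.csv(anonymized, "input/data.csv", row.names = FALSE).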

Example

start	end	today	ben_name_en	telephone_number	dist_location	stratification	camp_no_camp	camp_name	idp_ref
2020-08-16T12:47:14.516+03:00	2020-08-16T14:51:00.692+03:00	8/16/2020	NA	NA	al_sulaymaniyah	al_sulaymaniyah.idp.incamp	camp	Ashti_IDP	refugee
2020-08-16T11:06:51.022+03:00	2020-08-16T16:48:32.724+03:00	8/16/2020	NA	NA	al_sulaymaniyah	al_sulaymaniyah.idp.incamp	camp	Ashti_IDP	refugee
2020-08-16T12:18:34.231+03:00	2020-08-16T12:48:12.089+03:00	8/16/2020	NA	NA	al_sulaymaniyah	al_sulaymaniyah.idp.incamp	camp	Ashti_IDP	refugee
2020-08-16T13:26:00.384+03:00	2020-08-16T13:58:22.406+03:00	8/16/2020	NA	NA	al_sulaymaniyah	al_sulaymaniyah.idp.incamp	camp	Ashti_IDP	refugee
2020-08-16T11:03:43.355+03:00	2020-08-16T12:47:56.403+03:00	8/16/2020	NA	NA	al_sulaymaniyah	al_sulaymaniyah.idp.incamp	camp	Ashti_IDP	refugee
2020-08-16T15:00:22.460+03:00	2020-08-16T15:21:38.619+03:00	8/16/2020	NA	NA	al_sulaymaniyah	al_sulaymaniyah.idp.incamp	camp	Ashti_IDP	refugee

Questionnaire

The questionnaire needs to be included in the toolbox so that hypegrammaR can distinguish between multiple-choice and single-choice questions, and look up the corresponding labels of the variables. Simply save the two tabs from your KoBo tool as ‘choices.csv’ and ‘questions.csv’, respectively, and put the files in the ‘input’ folder.

Sampling Frame

Format

Your sampling frame must be complete, meaning that there needs to be a population figure for each stratum in your data. The format should be as in the example below:

Two columns named ‘strata.names’ and ‘population’.
One row per stratum with name and population estimate.
The values in strata.names must match exactly the values of the stratification column in the dataset (created in step 3 of the R script below).
It must be saved as a .csv file, placed in the ‘input’ folder and named ‘sampling_frame.csv’.

Example

strata.names	population
duhok.refugee.camp	13118
erbil.refugee.camp	6440
al_sulaymaniyah.refugee.camp	2123
al_sulaymaniyah.idp.camp	2057
baghdad.idp.camp	148
diyala.idp.camp	989

Extended Analysis Plan (EAP)

Last but not least, the toolbox requires an extended analysis plan (EAP) in order to work. The EAP is an extended version of the data analysis plan (DAP) and lists all the indicators you wish to analyze, and defines how indicators are disaggregated. Without it, the script does not know what to calculate.

Format

The EAP must have exactly 9 columns (each of which is explained below). Each row specifies the analysis for one indicator. You can add as many rows as you need for your analysis. The script will then go through the EAP and do calculations row by row.

Columns:

research.question: Insert the research question from the DAP that corresponds to the indicator. You may also leave this column empty (or write ‘NA’).
sub.research.question: Same as above, but for sub-research questions.
hypothesis: Specify the hypothesis as defined in the DAP.
hypothesis.type: Put ‘direct_reporting’ for each row. Other types are not currently implemented in hypegrammaR.
dependent.variable: The name of the variable you wish to analyze (as specified in the KoBo tool).
dependent.variable.type: The type of the variable you wish to analyze, either ‘numerical’ (for numbers) or ‘categorical’ (for characters).
independent.variable: If you wish to disaggregate your results, specify the variable you wish to disaggregate by here. Leave empty if no disaggregation is needed.
independent.variable.type: Specify the type (‘numerical’ or ‘categorical’) of the disaggregation variable. Leave empty if no disaggregation is needed.
repeat.for.variable: If you wish to do disaggregated analysis by group (two-level disaggregation), specify the name of the categorical variable here. Leave empty if no further disaggregation is needed.

There is no need to specify answer options for multiple choice questions as hypegrammaR will recognize these automatically.
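
Because the EAP needs exactly these nine columns, a short check of the column names can save a failed run. The sketch below uses base R only; check_eap_columns is a hypothetical helper, not a hypegrammaR function, and the toy EAP is deliberately missing one column.

```r
# The nine column names listed above.
eap_columns <- c("research.question", "sub.research.question", "hypothesis",
                 "hypothesis.type", "dependent.variable",
                 "dependent.variable.type", "independent.variable",
                 "independent.variable.type", "repeat.for.variable")

# Hypothetical helper (not part of hypegrammaR): report missing and
# unexpected column names in an EAP data frame.
check_eap_columns <- function(eap) {
  list(missing    = setdiff(eap_columns, names(eap)),
       unexpected = setdiff(names(eap), eap_columns))
}

# Toy EAP with only the first eight columns.
toy_eap <- as.data.frame(setNames(rep(list(NA), 8), eap_columns[1:8]))
check_eap_columns(toy_eap)
```

For a real plan, the same helper could be applied to read.csv("input/eap.csv").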

Example

research.question	sub.research.question	hypothesis	hypothesis.type	dependent.variable	dependent.variable.type	independent.variable	independent.variable.type	repeat.for.variable
NA	NA	Proportion of beneficiary households by priority needs …	direct_reporting	needs_before	categorical			NA
NA	NA	Proportion of beneficiary households reporting measures being taken …	direct_reporting	collection_hygiene	categorical	idp_ref	categorical	NA

R Script

Install R & R Studio: Before you can use hypegrammaR, make sure you have a recent version of R installed (version 4.0.2 or newer). Download R & R Studio, and re-install R if needed.

Next, you need to install the hypegrammaR package (only once per computer). Run the following code (in R):
# Requires the devtools package; if needed, run install.packages("devtools") first.
library(devtools)
devtools::install_github("https://github.com/impact-initiatives/hypegrammaR")

Always open the project file (‘hypegrammaR.Rproj’) first, and then open the script (‘run_me.R’) from within R. The script consists of the six components below. Some of them (step 3 in particular) may need adjusting the first time you use the script in a new assessment. Click ‘Source’ on the top right of the script pane to run the script.

1. Load Package

The first step is to simply call for R to load the hypegrammaR package.

library(hypegrammaR)

2. Load Files

Next you load in all the input files using the following hypegrammaR functions:

assessment_data <- load_data(file = "input/data.csv")
sampling_frame  <- load_samplingframe("input/sampling_frame.csv")
questionnaire   <- load_questionnaire(data = assessment_data,
                                      questions = "input/questions.csv",
                                      choices = "input/choices.csv",
                                      choices.label.column.to.use = "label::English"
                                      )
analysisplan    <- load_analysisplan(file = "input/eap.csv")

3. Define Stratification Variable

You then define a variable that includes the strata names by pasting together the character strings from the variables you stratify by. Make sure the strata names in the sampling frame follow the exact same logic (same spelling, divided by “.” etc.).

In the example below, strata were defined along three variables (location, population group & camp status). Change/add/remove variables as needed.

assessment_data$strata <- paste(assessment_data$dist_location,
                                assessment_data$idp_ref,
                                assessment_data$camp_no_camp,
                                sep = "."
                                )
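
Before mapping weights, it can help to confirm that every stratum name built this way also appears in the sampling frame. The sketch below uses base R and toy objects (toy_data and toy_frame are illustrative names, with one deliberately misspelled location) so that it does not overwrite the objects loaded in step 2.

```r
# Toy sampling frame; names and figures follow the example in this document.
toy_frame <- data.frame(
  strata.names = c("duhok.refugee.camp", "erbil.refugee.camp"),
  population   = c(13118, 6440),
  stringsAsFactors = FALSE
)

# Toy dataset; the third row has a capitalised location on purpose.
toy_data <- data.frame(
  dist_location = c("duhok", "erbil", "Erbil"),
  idp_ref       = c("refugee", "refugee", "refugee"),
  camp_no_camp  = c("camp", "camp", "camp"),
  stringsAsFactors = FALSE
)
toy_data$strata <- paste(toy_data$dist_location,
                         toy_data$idp_ref,
                         toy_data$camp_no_camp,
                         sep = ".")

# Strata present in the data but absent from the sampling frame.
unmatched <- setdiff(unique(toy_data$strata), toy_frame$strata.names)
unmatched
```

An empty result means all strata in the data have a population figure; any name returned needs to be corrected in the data or in the sampling frame.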

4. Apply Weights from Stratification

In the next step, you map the weights from the sampling frame to the dataset:

weights <- map_to_weighting(sampling.frame = sampling_frame,
                            data.stratum.column = "strata",
                            sampling.frame.population.column = "population",
                            sampling.frame.stratum.column = "strata.names"
                            )
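
To illustrate the general idea behind stratum weights, the base R sketch below compares each stratum's share of the population with its share of the sample. This is an illustration under assumed toy sample sizes, not hypegrammaR's internal code; map_to_weighting may scale its weights differently.

```r
# Toy sampling frame (figures from the example in this document).
frame <- data.frame(
  strata.names = c("duhok.refugee.camp", "erbil.refugee.camp"),
  population   = c(13118, 6440),
  stringsAsFactors = FALSE
)

# Assumed toy sample: 100 surveys in each stratum.
sample_strata <- c(rep("duhok.refugee.camp", 100),
                   rep("erbil.refugee.camp", 100))

# Share of each stratum in the population and in the sample.
pop_share    <- frame$population / sum(frame$population)
sample_n     <- as.vector(table(sample_strata)[frame$strata.names])
sample_share <- sample_n / sum(sample_n)

# A weight above 1 means the stratum is under-represented in the sample.
stratum_weight <- setNames(pop_share / sample_share, frame$strata.names)
stratum_weight
```

With weights defined this way, the weighted number of surveys equals the actual sample size, so only the relative influence of each stratum changes.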

Unweighted calculations: If your assessment is not stratified/weighted, there is no need to load in a sampling frame in step 2. You would also need to delete steps 3 and 4, and remove the ‘weighting’ line in step 5.

Additional code: You may add additional code to your script before running the analysis (step 5) if you wish to do so (e.g. filter dataset, create additional variables etc.). See the Annex below for guidance on how to filter your dataset or add additional variables.

5. Run Analysis

Now, you are all set to run the analysis, like so:

resultlist <- from_analysisplan_map_to_output(data = assessment_data,
                                              analysisplan = analysisplan,
                                              weighting = weights,
                                              #labeled = TRUE,
                                              questionnaire = questionnaire
                                              )

You may include the line labeled = TRUE (remove the # in front of it) if you want your output to display variable labels rather than names.

Depending on the size of your EAP, running the analysis may take up to a couple of minutes. The progress is displayed in the console.

6. Export Results

Once R has gone through all the rows in the EAP and calculated the results, you then only need to export them as a table in a .csv file.

map_to_master_table(resultlist$results, "output/results.csv")

Outputs

Once you have run the script successfully, your results are saved in the ‘output’ folder.

Results

The results file includes 11 columns. The number of rows depends on your EAP and choice list. The file follows the structure of the EAP: each line of the EAP is calculated and then saved in the results table, one after the other.

Example

X	dependent.var	independent.var	dependent.var.value	independent.var.value	numbers	se	min	max	repeat.var	repeat.var.value
1	needs_before	NA	rent	NA	0.0113577	NA	0.0051126	0.0176028	NA	NA
2	needs_before	NA	food_drink	NA	0.9903370	NA	0.9801565	1.0000000	NA	NA
3	needs_before	NA	utilities	NA	0.5738210	NA	0.5300790	0.6175629	NA	NA
4	collection_hygiene	idp_ref	covid_small_number	idp	0.5954306	NA	0.5305742	0.6602870	NA	NA
5	collection_hygiene	idp_ref	covid_small_number	refugee	0.7364652	NA	0.6857173	0.7872132	NA	NA
6	collection_hygiene	idp_ref	covid_social_dist	idp	0.6307426	NA	0.5665483	0.6949368	NA	NA

How to read the results:

If a variable specified in the EAP is categorical (as opposed to numerical), each answer option will be displayed on a separate row in the results table.
If you ran a disaggregated analysis, there is also a separate row for each value that the disaggregation variable (independent.var) takes. The same logic applies to two-level disaggregation (‘repeat.var’).
Your estimated (population) proportions for each indicator (and disaggregation) are displayed as decimal fractions in column ‘numbers’. Example: 57.4% of the target population reported ‘utilities’ for ‘needs_before’ (line 3).
‘min’ and ‘max’ indicate the lower and upper bounds of the confidence interval, respectively. (Ignore the ‘se’ column.)
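
For reporting, the proportions can be turned into percentages with their confidence interval bounds. The base R sketch below uses three rows copied from the results example above; the summary column is an illustrative addition, not something hypegrammaR produces.

```r
# Three rows copied from the results example in this document.
results <- data.frame(
  dependent.var.value = c("rent", "food_drink", "utilities"),
  numbers = c(0.0113577, 0.9903370, 0.5738210),
  min     = c(0.0051126, 0.9801565, 0.5300790),
  max     = c(0.0176028, 1.0000000, 0.6175629),
  stringsAsFactors = FALSE
)

# Express the proportion and its interval bounds as percentages.
results$summary <- sprintf("%.1f%% (%.1f%% to %.1f%%)",
                           100 * results$numbers,
                           100 * results$min,
                           100 * results$max)
results[, c("dependent.var.value", "summary")]
```

For a real run, the same step could be applied to the table read back with read.csv("output/results.csv").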

Annex: Additional Functions

The above R script outlines the basic code chunks needed to make hypegrammaR work. It can be expanded with additional code as needed. In its basic version, the script requires that your dataset already is in exactly the right format, which may not always be the case. In some cases, you may want to filter out certain surveys or create additional variables.

Here are some basic functions that you may find useful, and which you could add to the script after step 2 (after loading in the data).

Filter Dataset

Let’s say you want to filter your dataset and only include surveys with a certain attribute (e.g. only IDP households, or only answered phone calls). This can easily be done by calling the filter() function from the mighty dplyr package.

First, install the dplyr package (if you have not already done so) by typing the following:

install.packages("dplyr")

You only have to install the package on your computer once. However, whenever you use it in one of your R sessions you have to ‘activate’ it before using any of its functions, like so:

library(dplyr)

Let’s say you want to filter your dataset (let’s call it data) by a variable indicating population group (let’s call it pop_group), and only include IDPs (cell value idp):

data <- data %>% filter(pop_group == "idp")

Here is how you read this line of code: Define object data as (<-) data and (%>%) filter it (filter()) by variable pop_group, only keeping surveys that have the value idp.

Other examples of filtering your data:

data <- data %>% filter(pop_group == "idp" | pop_group == "refugee")
data <- data %>% filter(pop_group == "idp" & gov == "al-basrah")

If you want to include IDPs and refugees (refugee), add an OR operator (|).
If you want the dataset to only include IDPs from governorate (gov) Al-Basrah (al-basrah), use the AND operator (&).
Other logical operators you may find useful:
- != : is not (opposite of ==)
- >= or < : is greater than or equal to / is less than
- is.na(variable) : variable value is not available (write the variable name without quotes; use !is.na(variable) for the opposite)
You may combine logical statements (using parentheses) as needed.
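
To see how a combined condition evaluates row by row, the sketch below builds it in base R on a toy data frame (the values are made up for illustration). The same condition could be placed inside filter().

```r
# Toy data frame; variable names follow the examples above.
toy <- data.frame(
  pop_group = c("idp", "refugee", "host", "idp"),
  gov       = c("al-basrah", NA, "al-basrah", "duhok"),
  stringsAsFactors = FALSE
)

# IDPs or refugees, but only where the governorate is recorded.
keep <- (toy$pop_group == "idp" | toy$pop_group == "refugee") & !is.na(toy$gov)
keep

# Base R equivalent of filtering on that condition.
toy[keep, ]
```

The parentheses matter: without them, & is evaluated before |, so the missing-value check would only apply to refugees.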

Add New Variables

You may want to create additional variables for your analysis. You can use the mutate() function from the dplyr package to do that.

Let’s assume you want to create an additional indicator, called exp_total, which is the sum of different expenditure categories (exp_food, exp_rent and exp_nfi):

data <- data %>% mutate(exp_total = exp_food + exp_rent + exp_nfi)
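
Note that adding columns with + returns NA for any survey where one of the components is missing. The base R sketch below shows this on toy values, together with one possible alternative that ignores missing components; whether counting a missing value as zero is appropriate is an assumption that depends on your data.

```r
# Toy expenditure data with one missing value.
toy <- data.frame(exp_food = c(100, 50),
                  exp_rent = c(200, NA),
                  exp_nfi  = c(10, 5))

# Plain addition: the total is NA whenever one component is NA.
toy$exp_total <- toy$exp_food + toy$exp_rent + toy$exp_nfi

# Alternative (an assumption about how to treat gaps):
# ignore missing components, which counts them as zero.
toy$exp_total_na_rm <- rowSums(toy[, c("exp_food", "exp_rent", "exp_nfi")],
                               na.rm = TRUE)
toy
```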

You could insert any function after the = sign depending on what you want. Here is an example of how to create the stratification variable as in step 3, but with mutate() from the dplyr package:

# Inside mutate(), refer to the columns of data directly by name.
data <- data %>%
  mutate(stratification = paste(dist_location,
                                idp_ref,
                                camp_no_camp,
                                sep = "."
                                )
         )

You may even combine the filter() and the mutate() functions with the %>% operator:

# Create the stratification variable first, then keep only IDP surveys.
data <- data %>%
  mutate(stratification = paste(dist_location,
                                idp_ref,
                                camp_no_camp,
                                sep = "."
                                )
         ) %>%
  filter(pop_group == "idp")

IMPACT Initiatives - Iraq (Mar 2021)