The hypegrammaR package was created by IMPACT HQ to support quantitative analysis in R. It implements the IMPACT quantitative data analysis guidelines.
The basic idea of hypegrammaR is to take a set of inputs and turn them into outputs (based on instructions from an R script).
Hence, your hypegrammaR working folder contains the following files:
Before you can run your analysis, you need to make sure your input files are set up the right way. If any of the input files is missing or contains a mistake, hypegrammaR will return an error and the analysis will stop.
All your input files need to be saved in the ‘input’ folder as .csv files, and follow the exact same spelling as specified above. If there are files from previous analyses in the folder, simply replace them with the new ones.
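A quick check before running the script can catch a missing file early. The sketch below uses base R only. The helper name `check_input_files` is hypothetical (it is not part of hypegrammaR), and the file names are assumed to be the ones the script in this guide loads from the ‘input’ folder.

```r
# Hypothetical helper, not a hypegrammaR function: returns the names of
# expected input files that are not present in the given folder.
check_input_files <- function(folder = "input") {
  # File names assumed from the load_* calls used later in this guide
  expected <- c("data.csv", "sampling_frame.csv", "questions.csv",
                "choices.csv", "eap.csv")
  missing <- expected[!file.exists(file.path(folder, expected))]
  if (length(missing) > 0) {
    message("Missing input files: ", paste(missing, collapse = ", "))
  }
  invisible(missing)
}

missing_files <- check_input_files("input")
```

An empty result means all five files were found; it does not check their contents.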
Your dataset must adhere to standard KoBo XML format:
| start | end | today | ben_name_en | telephone_number | dist_location | stratification | camp_no_camp | camp_name | idp_ref |
|---|---|---|---|---|---|---|---|---|---|
| 2020-08-16T12:47:14.516+03:00 | 2020-08-16T14:51:00.692+03:00 | 8/16/2020 | NA | NA | al_sulaymaniyah | al_sulaymaniyah.idp.incamp | camp | Ashti_IDP | refugee |
| 2020-08-16T11:06:51.022+03:00 | 2020-08-16T16:48:32.724+03:00 | 8/16/2020 | NA | NA | al_sulaymaniyah | al_sulaymaniyah.idp.incamp | camp | Ashti_IDP | refugee |
| 2020-08-16T12:18:34.231+03:00 | 2020-08-16T12:48:12.089+03:00 | 8/16/2020 | NA | NA | al_sulaymaniyah | al_sulaymaniyah.idp.incamp | camp | Ashti_IDP | refugee |
| 2020-08-16T13:26:00.384+03:00 | 2020-08-16T13:58:22.406+03:00 | 8/16/2020 | NA | NA | al_sulaymaniyah | al_sulaymaniyah.idp.incamp | camp | Ashti_IDP | refugee |
| 2020-08-16T11:03:43.355+03:00 | 2020-08-16T12:47:56.403+03:00 | 8/16/2020 | NA | NA | al_sulaymaniyah | al_sulaymaniyah.idp.incamp | camp | Ashti_IDP | refugee |
| 2020-08-16T15:00:22.460+03:00 | 2020-08-16T15:21:38.619+03:00 | 8/16/2020 | NA | NA | al_sulaymaniyah | al_sulaymaniyah.idp.incamp | camp | Ashti_IDP | refugee |
The questionnaire needs to be included in the toolbox so that hypegrammaR can distinguish between multiple and single choice questions, and look up the corresponding labels of the variables. Simply save the choices and survey tabs from your KoBo tool as ‘choices.csv’ and ‘questions.csv’, respectively, and put the files in the ‘input’ folder.
Your sampling frame must be complete, meaning that there needs to be a population figure for each stratum in your data. The format should be as in the example below:
| strata.names | population |
|---|---|
| duhok.refugee.camp | 13118 |
| erbil.refugee.camp | 6440 |
| al_sulaymaniyah.refugee.camp | 2123 |
| al_sulaymaniyah.idp.camp | 2057 |
| baghdad.idp.camp | 148 |
| diyala.idp.camp | 989 |
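Completeness of the sampling frame can be checked with a set difference. This is a minimal base R sketch on toy data; it assumes the dataset already has a column holding the strata names (called `strata` here, which is an assumption) and that the sampling frame uses the `strata.names` column shown above.

```r
# Toy data for illustration only; in practice both objects come from
# your input files.
sampling_frame <- data.frame(
  strata.names = c("duhok.refugee.camp", "erbil.refugee.camp"),
  population   = c(13118, 6440)
)
assessment_data <- data.frame(
  strata = c("duhok.refugee.camp", "erbil.refugee.camp", "baghdad.idp.camp")
)

# Strata that occur in the data but have no row in the sampling frame
missing_strata <- setdiff(unique(assessment_data$strata),
                          sampling_frame$strata.names)
missing_strata
```

Any name returned here is a stratum without a population figure, which would need to be added to the sampling frame (or corrected in the data) before weighting.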
Last but not least, the toolbox requires an extended analysis plan (EAP) in order to work. The EAP is an extended version of the data analysis plan (DAP) and lists all the indicators you wish to analyze, and defines how indicators are disaggregated. Without it, the script does not know what to calculate.
The EAP must have exactly 9 columns (each of which is explained below). Each row specifies the analysis for one indicator. You can add as many rows as you need for your analysis. The script will then go through the EAP and do the calculations row by row.
Rows:
There is no need to specify answer options for multiple choice questions as hypegrammaR will recognize these automatically.
| research.question | sub.research.question | hypothesis | hypothesis.type | dependent.variable | dependent.variable.type | independent.variable | independent.variable.type | repeat.for.variable |
|---|---|---|---|---|---|---|---|---|
| NA | NA | Proportion of beneficiary households by priority needs … | direct_reporting | needs_before | categorical | NA | | |
| NA | NA | Proportion of beneficiary households reporting measures being taken … | direct_reporting | collection_hygiene | categorical | idp_ref | categorical | NA |
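Because the EAP must have exactly these 9 columns, it can be worth comparing the column names before running the analysis. The sketch below is base R on a toy one-row EAP; the object names `eap`, `expected_cols`, `missing_cols` and `extra_cols` are illustrative, and in practice `eap` would be read from your own file.

```r
# The nine column names shown in the example table above
expected_cols <- c("research.question", "sub.research.question", "hypothesis",
                   "hypothesis.type", "dependent.variable",
                   "dependent.variable.type", "independent.variable",
                   "independent.variable.type", "repeat.for.variable")

# Toy EAP for illustration; normally something like read.csv("input/eap.csv")
eap <- data.frame(
  research.question         = NA,
  sub.research.question     = NA,
  hypothesis                = "Proportion of beneficiary households by priority needs …",
  hypothesis.type           = "direct_reporting",
  dependent.variable        = "needs_before",
  dependent.variable.type   = "categorical",
  independent.variable      = NA,
  independent.variable.type = NA,
  repeat.for.variable       = NA
)

# Both vectors should be empty if the EAP has the right columns
missing_cols <- setdiff(expected_cols, names(eap))
extra_cols   <- setdiff(names(eap), expected_cols)
```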
Install R & RStudio: Before you can use hypegrammaR, make sure you have a recent version of R installed (version 4.0.2 or newer). Download R and RStudio, and re-install R if needed. Next, you need to install the hypegrammaR package (only once per computer). Run the following code (in R):
Always open the project file (‘hypegrammaR.Rproj’) first, and then open the script (‘run_me.R’) from within R. The script consists of the 6 components below, some of which (step 3 in particular) may need adjusting when you use the script for the first time in a new assessment. Click ‘Source’ on the top right of the script pane to run the script.
The first step is simply to tell R to load the hypegrammaR package with `library(hypegrammaR)`.
Next you load in all the input files using the following hypegrammaR functions:
```r
assessment_data <- load_data(file = "input/data.csv")
sampling_frame <- load_samplingframe("input/sampling_frame.csv")
questionnaire <- load_questionnaire(data = assessment_data,
                                    questions = "input/questions.csv",
                                    choices = "input/choices.csv",
                                    choices.label.column.to.use = "label::English"
)
analysisplan <- load_analysisplan(file = "input/eap.csv")
```

You then define a variable that holds the strata names by pasting together the character strings from the variables you stratify by. Make sure the strata names in the sampling frame follow exactly the same logic (same spelling, same “.” separator, etc.).
In the example below, strata were defined along three variables (location, population group & camp status). Change/add/remove variables as needed.
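A minimal sketch of this step in base R, shown on toy rows. The variable names `dist_location`, `idp_ref` and `camp_no_camp` are taken from the dataset example above; the name of the new column (`strata`) is an assumption chosen to match the `data.stratum.column` argument used in the next step.

```r
# Toy rows for illustration; in the script, assessment_data is the
# dataset loaded in step 2.
assessment_data <- data.frame(
  dist_location = c("al_sulaymaniyah", "duhok"),
  idp_ref       = c("idp", "refugee"),
  camp_no_camp  = c("camp", "camp")
)

# Paste the three stratification variables together, separated by "."
assessment_data$strata <- paste(assessment_data$dist_location,
                                assessment_data$idp_ref,
                                assessment_data$camp_no_camp,
                                sep = ".")

assessment_data$strata
```

The resulting values (for example `al_sulaymaniyah.idp.camp`) are what must match the `strata.names` column of the sampling frame.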
In the next step, you map the weights from the sampling frame to the dataset:
```r
weights <- map_to_weighting(sampling.frame = sampling_frame,
                            data.stratum.column = "strata",
                            sampling.frame.population.column = "population",
                            sampling.frame.stratum.column = "strata.names"
)
```

Unweighted calculations: If your assessment is not stratified/weighted, there is no need to load a sampling frame in step 2. You would also need to delete steps 3 and 4, and remove the ‘weighting’ line in step 5.
Additional code: You may add code to your script before running the analysis (step 5) if you wish to do so, e.g. to filter the dataset or create additional variables. See the Annex below for guidance on how to do this.
Now, you are all set to run the analysis, like so:
```r
resultlist <- from_analysisplan_map_to_output(data = assessment_data,
                                              analysisplan = analysisplan,
                                              weighting = weights,
                                              #labeled = TRUE,
                                              questionnaire = questionnaire
)
```

You may include the line `labeled = TRUE` (by removing the `#` in front of it) if you want your output to display variable labels rather than names.
Depending on the size of your EAP, running the analysis may take up to a couple of minutes. The progress is displayed in the console.
Once you have run the script successfully, your results are saved in the ‘output’ folder.
The results file has 11 columns; the number of rows depends on your EAP and choice list. It follows the structure of the EAP: each line of the EAP is calculated in turn, and its results are added to the results table one after the other.
| X | dependent.var | independent.var | dependent.var.value | independent.var.value | numbers | se | min | max | repeat.var | repeat.var.value |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | needs_before | NA | rent | NA | 0.0113577 | NA | 0.0051126 | 0.0176028 | NA | NA |
| 2 | needs_before | NA | food_drink | NA | 0.9903370 | NA | 0.9801565 | 1.0000000 | NA | NA |
| 3 | needs_before | NA | utilities | NA | 0.5738210 | NA | 0.5300790 | 0.6175629 | NA | NA |
| 4 | collection_hygiene | idp_ref | covid_small_number | idp | 0.5954306 | NA | 0.5305742 | 0.6602870 | NA | NA |
| 5 | collection_hygiene | idp_ref | covid_small_number | refugee | 0.7364652 | NA | 0.6857173 | 0.7872132 | NA | NA |
| 6 | collection_hygiene | idp_ref | covid_social_dist | idp | 0.6307426 | NA | 0.5665483 | 0.6949368 | NA | NA |
How to read the results:
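One way to make the table easier to read is to format each result as a percentage with its bounds. The sketch below is base R on two rows copied from the example table. It assumes that `numbers` holds a proportion and that `min` and `max` are the lower and upper bounds of its confidence interval; this interpretation is inferred from the example values and is not stated in this guide. The `summary` column name is illustrative.

```r
# Two rows copied from the example results table above
results <- data.frame(
  dependent.var       = c("needs_before", "needs_before"),
  dependent.var.value = c("rent", "food_drink"),
  numbers             = c(0.0113577, 0.9903370),
  min                 = c(0.0051126, 0.9801565),
  max                 = c(0.0176028, 1.0000000)
)

# Express the proportion and its bounds as percentages with one decimal
results$summary <- sprintf("%.1f%% (%.1f%% - %.1f%%)",
                           100 * results$numbers,
                           100 * results$min,
                           100 * results$max)

results[, c("dependent.var", "dependent.var.value", "summary")]
```

Under these assumptions, the first row would read as roughly 1.1% of households selecting `rent` for `needs_before`, with bounds of 0.5% and 1.8%.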
The above R script outlines the basic code chunks needed to make hypegrammaR work. It can be expanded with additional code as needed. In its basic version, the script requires that your dataset is already in exactly the right format, which may not always be the case. In some cases, you may want to filter out certain surveys or create additional variables.
Here are some basic functions that you may find useful, and which you could add to the script after step 2 (after loading the data).
Let’s say you want to filter your dataset and only include surveys with a certain attribute (e.g. only IDP households, or only answered phone calls). This can easily be done by calling the filter() function from the mighty dplyr package.
First, install the dplyr package (if you have not already done so) by running `install.packages("dplyr")` in the R console.
You only have to install the package on your computer once. However, whenever you use it in one of your R sessions, you have to ‘activate’ it with `library(dplyr)` before using any of its functions.
Let’s say you want to filter your dataset (let’s call it data) by a variable indicating population group (let’s call it pop_group), and only include IDPs (cell value idp):
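A minimal sketch of that filter, run on a toy dataset. The names `data`, `pop_group` and the value `idp` are the placeholders used in the text; the `id` column and the other values are made up for illustration.

```r
library(dplyr)

# Toy dataset for illustration only
data <- data.frame(
  id        = 1:4,
  pop_group = c("idp", "refugee", "idp", "host")
)

# Keep only the surveys where pop_group equals "idp"
data <- data %>% filter(pop_group == "idp")
data
```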
Here is how you read this line of code: Define object data as (<-) data and (%>%) filter it (filter()) by variable pop_group, only keeping surveys that have the value idp.
Other examples of filtering your data:
```r
data <- data %>% filter(pop_group == "idp" | pop_group == "refugee")
data <- data %>% filter(pop_group == "idp" & gov == "al-basrah")
```

- To also include refugees (cell value `refugee`), add an OR operator (`|`).
- To keep only IDPs in the governorate (`gov`) Al-Basrah (`al-basrah`), use the AND operator (`&`).

Other useful operators:

- `!=` : is not (opposite of `==`)
- `>=` or `<` : is larger or equal / is smaller
- `is.na(variable)` : variable value is not available

You may also want to create additional variables to do analysis with. You can use the mutate() function from the dplyr package to do that.
Let’s assume you want to create an additional indicator, called exp_total, which is the sum of different expenditure categories (exp_food, exp_rent and exp_nfi):
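A minimal sketch with mutate(), run on toy values. The column names `exp_food`, `exp_rent`, `exp_nfi` and `exp_total` come from the text; the numbers are made up for illustration.

```r
library(dplyr)

# Toy expenditure data for illustration only
data <- data.frame(
  exp_food = c(100, 50),
  exp_rent = c(200, 0),
  exp_nfi  = c(25, 10)
)

# Add exp_total as the row-wise sum of the three expenditure categories
data <- data %>% mutate(exp_total = exp_food + exp_rent + exp_nfi)
data
```

Note that adding columns with `+` returns NA for a household if any of the three categories is missing, so missing values may need to be handled first.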
You could insert any function after the = sign depending on what you want. Here is an example of how to create the stratification variable as in step 3, but with mutate() from the dplyr package:
```r
# Inside mutate(), refer to columns of `data` by their bare names
data <- data %>%
  mutate(stratification = paste(dist_location,
                                idp_ref,
                                camp_no_camp,
                                sep = "."
                                )
  )
```

You may even combine the filter() and the mutate() functions with the %>% operator:
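A minimal sketch of such a chain, reusing the placeholder names from the examples above (`pop_group`, `exp_food`, `exp_rent`, `exp_nfi`) on toy values made up for illustration.

```r
library(dplyr)

# Toy dataset for illustration only
data <- data.frame(
  pop_group = c("idp", "refugee", "idp"),
  exp_food  = c(100, 50, 80),
  exp_rent  = c(200, 0, 20),
  exp_nfi   = c(25, 10, 0)
)

# First keep only IDP surveys, then add the total expenditure variable
data <- data %>%
  filter(pop_group == "idp") %>%
  mutate(exp_total = exp_food + exp_rent + exp_nfi)
data
```

The steps run in the order they are written, so the new variable is only calculated for the rows that pass the filter.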