Introduction
Due to the current social, economic, and environmental crisis, and the structural lack of opportunities in formal markets, the citizens of the world are facing greater challenges, and every year millions of people are falling into vulnerability and poverty. For that reason, social assistance programmes as part of the social protection umbrella are becoming a core part of the development planning of the countries. On one side, programmes are designed to support those who, due to unexpected events such as natural disasters and sudden economic crisis, have seen their livelihood affected, and if they do not receive support, the consequences of the shock can last for many years. Similarly, the programmes also aim to target those groups that are facing structural challenges in society (e.g. single mothers, people with disabilities, and the elderly) so that their problems can be reduced or at least attenuated and, they will have the minimum tools to develop their personal and professional careers.
In a period where resources are scarce, and people’s needs are rising, governments need to develop strategies that allow them to benefit the people who need them the most. With these strategies, policymakers can improve their chances of choosing the right programmes to use resources efficiently, thus covering, as much as possible, the needs of the population. Unfortunately, there is no magic recipe for the success of this aspect, since the local reality of each country, its culture, environment, institutions, and history make each case unique in many ways. For that reason, it is not reasonable to claim that there is a solution that fits all the countries in the same way. Nevertheless, it’s important to recognize that some structural elements of these programmes have similar patterns. For example, given the limited resources, some countries have opted for targeted social assistance programme. In this case, the programmes must include a mechanism that collects information from the individuals and decides if they have the characteristics needed in order for them to become beneficiaries, as many cash transfer programs in the Arab region and across the world have it. Considering this, the objective of this manual is to guide the policymakers on how to organize and analyse beneficiaries data to characterize and identify possible areas for improvement.
if you are interested in R tutorials for beginners please refer to the introductory manual for R beginners. For data management of social assistance admin records refer to Dictionaries and data management for administrative records in social assistance programs.
Social Assistance Programs, which are primarily non-contributory forms of Social Protection, aim to alleviate poverty and vulnerability. Social Assistance is typically recognized through Cash Transfer Programs, either unconditional or conditional, that provide monetary transfers to target the most vulnerable households and/or individuals. Identifying poverty and vulnerability, as evidenced by various measures of monetary and multidimensional poverty, is one of the key challenges these programs face, as beneficiaries cannot self-select.
The program under evaluation for pedagogical purposes is a Cash Transfer Program implemented in 2023 in the country of Meney. The Ministry of Social Solidarity has provided administrative records for the first cohort of 2023, which includes data from 10,000 households (with 43 household-level variables) and 46,360 individuals (with 24 individual-level variables). The collected data is essential for calculating the Proxy Means Test (PMT), which assigns a score to each household. This score determines eligibility for the program.
Context of the Information
The dataset was gathered from a social protection program in the country of Meney. It includes information from 10,000 households, each with 43 variables, and data on 46,360 individuals, each with 24 variables.
Key Variables in the Dataset:
Relationship to Head of Household: Values range from 1-9, representing different relationships such as head of the family, wife/husband, daughter/son, mother/father, etc.
Gender: Indicates the individual’s gender, either “Male” or “Female.”
Date of Birth: Provides the individual’s age in years.
Marital Status: Specifies whether the individual is single, married, divorced, or widowed.
School Attendance: Indicates whether the individual is currently enrolled in school, has dropped out, never enrolled, or graduated.
Schooling Level: Describes the highest education level achieved, ranging from illiterate to university level.
Chronic Disease: Indicates if the individual suffers from diseases like cardiovascular disease, cancer, or kidney failure (Yes/No).
Nutrition Disease: Specifies if the individual suffers from permanent malnutrition, stunting, or wasting (Yes/No).
Common Disease: Indicates if the individual suffers from other diseases (e.g., measles, diphtheria, or polio) (Yes/No).
Disability: Categorized by severity: total disability (unable to perform chores independently), partial disability (able to work), or other disabilities (Yes/No).
Health Insurance: Indicates whether the individual has access to health insurance (Yes/No).
Employment Status: Describes the individual’s employment status: employed, unemployed, or inactive.
Job Type: Specifies the individual’s job type (e.g., self-employed, paid in cash, paid in-kind, etc.).
Contract Type: Indicates the duration of the employment contract: permanent, temporary, seasonal, or not applicable.
Employment Sector: Specifies whether the individual works in the public or private sector (or does not apply if unemployed).
Father Alive: Indicates whether the individual’s father is alive (Yes/No).
Household Variables:
Household Members: The total number of members in the household.
House Type: Describes the type of housing (e.g., villa, independent modern house, apartment, farmwork/wood/zinc, etc.).
Contract Type: The dwelling ownership type (owned, rented, endowed, or other).
House Size: The size of the house in square meters.
Number of Rooms: The total number of rooms available for household members to sleep in.
Wall/Floor/Roof Material: Describes the materials used for the walls, roof, and floor.
Wall/Floor/Roof Quality: Indicates the level of damage to these structures, categorized from 1 (highest damage) to 3 (least damage).
Cooking Fuel: The type of fuel used for cooking (gas, kerosene, firewood/animal waste, etc.).
Toilet Type: Specifies if the toilet is an Arab toilet or a syringe and whether it is shared or not.
Water and Electricity: Specifies the sources of drinking water, water, electricity, and sewage water.
Duration to Nearest Market/School/Hospital: The time (in minutes) it takes to reach the nearest market, school, and hospital.
Household Properties: Indicates ownership of various household items (e.g., kitchen, TV, mobile phone, car, fridge, etc.).
Livestock: Indicates whether the household owns livestock (Yes/No).
Agricultural Land: Specifies the area of agricultural land owned by the household.
Real Estate: Indicates whether the household owns real estate (Yes/No).
Governorate: Meney is divided into eight governorates, represented by the letters A-H, along with districts, municipalities, and villages/cities.
Location: Specifies if the household is located in an urban, rural, or remote area.
The government calculates a score based on the variables above, and eligibility for the program is determined by this score.
The previous chapter delved into the details of connecting Excel and
R to produce rapid reports. This chapter implements these techniques
extensively to begin the generation of the SPP-RAF report. Keen readers
may wonder why this chapter is not included in Part 2. The reason is
that Part 2 is reserved for statistical techniques, while this chapter
serves as an introduction to that, i.e., producing relevant descriptive
statistics. Accordingly, this chapter, when placed in a report, can be
referred to as Chapter 0 or an introductory chapter. Here, the dataset
is described through graphs and tables, setting the stage for the
subsequent analysis. For this chapter, as well as for the following
chapters, the Excel files used are DATA_BASES.xlsx and
Beneficiaries.xlsx (located in Inputs/Data)
and Dictionary.xlsx (located in
Inputs/Dictionary).
Full code available at: Chapter 05 250225. RScript
Before starting any project, it is highly recommended to have an
organized folder structure. The suggested arrangement, as found in the
teaching aid material, includes separate folders for inputs, codes, and
outputs. Once everything is organized, you can create an R project and
save it in the root folder (in the teaching aid, this project is named
ESCWA). For general cases, you can save a project using the
icon highlighted in the illustration.
A project, for practical purposes, is like a folder where you can
group multiple codes together. These codes are aware that they interact
with the same computer files. While you can work with individual
scripts, as we have done so far, splitting the code into manageable
parts (e.g., chapters) is a good practice for larger projects. This
facilitates debugging and keeps the code clean. In this context, the
project serves as a powerful structure that integrates these separate
files through a well-synchronized master file. By calling the master
file within the project, it gathers information about the folder
structure and improves the organization of results, as demonstrated in
this section. The exact code for the master file is located in the
code folder of the teaching aid. However, this section
walks you through its components. Unlike other sections, several
functions here are covered superficially, as they do not significantly
contribute to the reader’s general understanding, and their exact
functionalities can be easily found online.
rm(list = ls())
graphics.off()
gc()
# Paste start time
startTime = Sys.time()
The first four lines of the master file ensure that the software is clean, with no graphs or variables saved from previous exercises. This helps speed up calculations and avoids errors caused by previously stored data. The last line acts as a time controller, recording when the code begins to run. Later, at the end of the master file, a complementary line records when the code stops running, allowing the user to determine the duration of the process. This is particularly useful for estimating running times and planning work accordingly.
# 1| Preparation --------------------------------------------------------
# 1.1| Libraries --------------------------------------------------------
# Install used packages
myPackages = c('broom','caret', 'cluster', 'clValid', 'cobalt', 'colorspace', 'data.table', 'descr', 'dplyr', 'extrafont', 'factoextra', 'FactoMineR', 'fastDummies', 'foreign', 'fpc', 'gbm', 'geosphere', 'ggdendro', 'ggparty', 'ggplot2', 'ggpubr', 'ggspatial', 'ggmap', 'glmnet', 'gridExtra', 'gtools', 'haven', 'here', 'Hmisc', 'igraph', 'Metrics', 'openxlsx', 'partykit', 'PCAmixdata', 'ppcor', 'questionr', 'raster', 'RColorBrewer', 'readr', 'readxl', 'reshape2', 'rpart', 'rpart.plot', 'scales', 'sf', 'shadowtext', 'spatstat', 'stars', 'StatMatch', 'stringr', 'survey', 'tidyr', 'tidyverse', 'treemapify', 'writexl', 'eeptools', 'lubridate', 'lattice', 'sfsmisc')
notInstalled = myPackages[!(myPackages %in% rownames(installed.packages()))]
if(length(notInstalled)) {
install.packages(notInstalled)
}
invisible(sapply(myPackages, library, character.only = TRUE, quietly = TRUE))
loadfonts(device = 'win', quiet = T)
options(scipen = 999) # Disable scientific notation.
# 1.2| Initial values, locations and folders ---------------------------------------------------
# 1: English.
# 0: Another.
language = 1
# Specify location of file
userLocation = enc2native(here()) # Replace by your own path.
# Create output folders if not available to save the output in
if(!file.exists("Output")){
dir.create("Output")
}
if(!file.exists("Output/Chapter 05")){
dir.create("Output/Chapter 05")
}
if(!file.exists("Output/Profiling")){
dir.create("Output/Profiling")
}
if(!file.exists("Output/2")){
dir.create("Output/Chapter 02")
}
if(!file.exists("Output/Chapter 03")){
dir.create("Output/Chapter 03")
}
# Location of Input and Code
scriptLocation = paste0(userLocation, '/Code/')
inputLocation = paste0(userLocation, '/Input/')
# Call output location for this chapter
outputLocation = paste0(userLocation, '/Output/Chapter 05/')
# Paste end time
endTime = Sys.time()
endTime - startTime
The final part of the master file runs each of the relevant scripts associated with the report’s chapters and concludes the time counter. This allows the calculation of the total time taken by the computer to process the entire report. This chapter of the manual only covers Chapter 00 of the report, as explained earlier. Part 2 will cover the remaining chapters.
As mentioned before, the program we aim to evaluate for the pedagogical purposes is a Cash Transfer Programs implemented since 2023 in the country of Meney. The datasets analyzed correspond to 10,000 households with 43 variables and 46,360 individuals and 24 variables. All the variables collected are needed for the PMT that once estimated for every household will provide an score, as will be mentioned in following sections.
##1.3 Uploading main datasets: Individual and Household Data ---------------------------------------------------
IndividualData=read_excel(paste0(inputLocation, 'Bases/DATA_BASES.xlsx'), sheet = 'Individuals',na = "NULL")
HH_Data=read_excel(paste0(inputLocation, 'Bases/DATA_BASES.xlsx'), sheet = 'Households',na = "NULL")
IndividualData = IndividualData %>%
mutate(`Schooling level`=if_else(`Schooling level`%in%c("illiterat"), "illiterate",`Schooling level`))
IndividualData = IndividualData %>%
mutate(Age = trunc((as.Date(`Date of birth`) %--% as.Date(Sys.Date()) / years(1) )))
The Dictionary is one of the key features that facilitates the cleaning and standardization of datasets where labels are not clear and/or need to be translated. The dictionary attached to these exercises has three sheets that will be explained in detail next to an example of the chunk of code recommended to read it and upload it in R. The dictionary has three sheets:
# 1.4 Uploading dictionary - Labels ---------------------------------------------------
dataPlots <- read_excel(paste0(inputLocation, 'Dictionary/Dictionary.xlsx'),
sheet = 'Labels')
dataPlots[is.na(dataPlots)] <- '' # Replace NAs with characters
# 1.5 Uploading Dictionary - Category ------------------------------------------------------
# Referring each graph to its specific labels in the dictionary
IndVarNames <- as.matrix(read_excel(paste0(inputLocation, 'Dictionary/Dictionary.xlsx'), sheet = 'Category', range = 'A8:D32'))
HHVarNames <- as.matrix(read_excel(paste0(inputLocation, 'Dictionary/Dictionary.xlsx'), sheet = 'Category', range = 'F8:I51'))
GenderLabels <- as.matrix(read_excel(paste0(inputLocation, 'Dictionary/Dictionary.xlsx'), sheet = 'Category', range = 'A1:D3'))
YesNoLabels <- as.matrix(read_excel(paste0(inputLocation, 'Dictionary/Dictionary.xlsx'), sheet = 'Category', range = 'F1:I3'))
Dwelling Type and
Wall Type. This format will definitely change according to
the weights format, variables, iterations, among others, so this code is
only indicative to follow the full example provided.## 1.6 Uploading Dictionary - Weights
DwellingTypeWeights <- read_excel(paste0(inputLocation, 'Dictionary/Dictionary.xlsx'), sheet = 'Weights', range = 'A2:C46') %>% #from to
rename(`House type`=Dwelling_type, `Contract type`= Dwelling_ownership, var_Dwelling_type=weight)
WallTypeWeights <- read_excel(paste0(inputLocation, 'Dictionary/Dictionary.xlsx'), sheet = 'Weights', range = 'E2:G20')%>% #from to
rename(`Wall material`=Wall_type, `Wall quality`= Damage_level, var_Wall_type=weight)
Once the dictionaries are loaded, the datasets need to be transformed to replace the names of variables that might not be clear with the full names provided in categories. The same applies to categorical data that needs to be transformed into factors.
# 1.7 Update names of the columns of main datasets
------------------------------------------------------
colnames(IndividualData) = as.character(factor(colnames(IndividualData),
levels = IndVarNames[,2],
labels = IndVarNames[,4-1]))
colnames(HH_Data) = as.character(factor(colnames(HH_Data),
levels = HHVarNames[,2],
labels = HHVarNames[,4-1]))
HH_Data = HH_Data[,!is.na(colnames(HH_Data))]
# 1.8 Convert categorical variables of individuals dataset into factors. 1.10 section repeats the process for the househol ddataset
------------------------------------------------------
# Also, in case another language wants to be used as a label for the data, this code will update to the language selected (in this case, language 1 is English, 0 is Arabic).
Individual_Data$`Gender` = as.character(factor(Individual_Data$`Gender`,
levels = GenderLabels[,2],
labels = GenderLabels[,4-language]))
Individual_Data$`School attendance` = as.character(factor(Individual_Data$`School attendance`,
levels = EducationAttainment[,2],
labels = EducationAttainment[,4-language]))
Based on individual characteristics, it is possible to create household-level variables that provide information such as the illiteracy rate at the household level, whether there are NEET members, or if there are kids out of school. The information would need to be transformed in many different steps. For pedagogical purposes, we will show the transformations needed to obtain the illiteracy rate within the household. However, it is necessary to check the full code available for a complete understanding.
#1.9 Creation of Relevant Variables from Individual to Household Datasets
------------------------------------------------------
Test_Iliteracy_rate = IndividualData %>% group_by(HHID) %>%
mutate(illiterate = if_else(Age > 15 & `Schooling level` %in% c("illiterate"), 1, 0),
members_15above = if_else(Age > 15, 1, 0)) %>%
summarise(var_illiteracy_rate = 100 * sum(illiterate) / (sum(members_15above))) %>%
mutate(var_illiteracy_rate = if_else(is.nan(var_illiteracy_rate), 100, var_illiteracy_rate))
To address self-selection, many countries select beneficiaries based on certain observable criteria, such as age and disability, and then apply social surveys (registered) to comprehensively evaluate the socioeconomic status of household individuals. This evaluation is converted into a set of observable variables that proxy the level of poverty or vulnerability of the households. Based on statistical exercises, each variable has a certain “relevance” when explaining poverty or vulnerability. For example, owning a house with a roof made of brick or asbestos is less important in explaining vulnerability compared to the employment status of the head of the household. While both variables can explain vulnerability to some extent, the employment status is one of the most relevant factors. The combination of variables and weights assigns a score to applicants or selected households, which ranks and determines their relative level of vulnerability.
The weights of the variables were explained earlier in the chapter, particularly for non-binary variables such as wall type (iterated by damage) or dwelling property type (weights of the variable’s categories). For binary variables, such as car ownership (1 or 0), the weights are determined in the code: 80 for households without a car and 0 for those with a car.
#2.1 Link the PMT weights with the Variables' categories
------------------------------------------------------
Indicators_example = HH_Data %>%
left_join(DwellingTypeWeights) %>%
left_join(WallTypeWeights) %>%
mutate(
carr = if_else(`Car` > 0, 0, 80)
)
Once all the variables are identified and linked to the household ID
(HHID), the variables selected according to the PMT formula
are chosen for the estimation of the PMT, following the order of the
table below.
#2.2 PMT Indicators selection
------------------------------------------------------
PMTIndicators = completeData_example %>%
dplyr::select(c("HHID"), "var_overcrowdingIssue", "var_Wall_type", "var_Floor_type", "var_Toilet_Type", "var_Drinking_water", "var_sewage",
"var_Property_score", "var_AgriLand",
"var_chronic", "var_common", "var_measels", "var_tot_disab", "var_other_disab", "var_duration_hospital",
"var_enrolment", "var_high_education_rate", "var_duration_school",
"var_NEET", "var_unemployment_rate", "var_child_labour",
"var_female_ratio", "var_young_dependency_ratio")
The scoring is estimated according to the formula below.
# 2.3 Calculate the score
------------------------------------------------------
SCORING = PMTIndicators %>%
mutate(
s_var_overcrowdingIssue = (1/6) * (1/7) * var_overcrowdingIssue,
s_var_Floor = (1/6) * (1/7) * var_Floor_type,
s_var_Toilet = (1/6) * (1/7) * var_Toilet_Type,
s_var_DrinkWater = (1/6) * (1/7) * var_Drinking_water,
s_var_Sewage = (1/6) * (1/7) * var_sewage,
s_var_Property = (1/2) * (1/7) * var_Property_score,
s_var_agriLand = (1/2) * (1/7) * var_AgriLand,
s_var_chronic = (1/6) * (1/7) * var_chronic,
s_var_common = (1/6) * (1/7) * var_common,
s_var_measels = (1/6) * (1/7) * var_measels,
s_var_totdisab = (1/6) * (1/7) * var_tot_disab,
s_var_otherdisab = (1/6) * (1/7) * var_other_disab,
s_var_durationHospital = (1/6) * (1/7) * var_duration_hospital,
s_var_enrolment = (1/3) * (1/7) * var_enrolment,
s_var_higheducation = (1/3) * (1/7) * var_high_education_rate,
s_var_durationSchool = (1/3) * (1/7) * var_duration_school,
s_var_NEET = (1/3) * (1/7) * var_NEET,
s_var_unemp = (1/3) * (1/7) * var_unemployment_rate,
s_var_childLabour = (1/3) * (1/7) * var_child_labour,
s_var_female = (1/2) * (1/7) * var_female_ratio,
s_var_youngdependency = (1/2) * (1/7) * var_young_dependency_ratio,
score = s_var_overcrowdingIssue + s_var_Wall + s_var_Floor + s_var_Toilet + s_var_DrinkWater + s_var_Sewage +
s_var_Property + s_var_agriLand + s_var_chronic + s_var_common + s_var_measels + s_var_totdisab + s_var_otherdisab +
s_var_durationHospital + s_var_enrolment + s_var_higheducation + s_var_durationSchool + s_var_NEET + s_var_unemp +
s_var_childLabour + s_var_female + s_var_youngdependency
) %>%
drop_na()
The score will appear as an estimated value for household data. At the individual level, the score will appear in the dataset for the Household dataset and the head of household
#2.4 Add the new SCORING variable to the different datasets
------------------------------------------------------
filtered_Head = IndividualData %>%
filter(`Relationship with HH` %in% c(1)) %>%
dplyr::select(c("HHID", "Gender", "Marital status", "Schooling level", "Employment Status"))