This tutorial aims to help analysts and local policymakers develop evidence-based policies and promote a mutually beneficial relationship between Venezuelan migrants and the host communities in Chile, Colombia, Peru, and Ecuador. We present an overview of open-source technologies for analyzing survey data and a step-by-step guide to using R to compute descriptive statistics for host populations and Venezuelan migrants.
The World Bank report Venezuelan Migrants and Refugees in Chile, Colombia, Ecuador, and Peru provides a detailed socio-economic profile of Venezuelans in these four countries to help guide the policy and institutional response. The study uses official data from several surveys with the adult population (18 years or older) of Venezuelan migrants and national residents. The Joint Data Center on Forced Displacement supported the study and has created this tutorial to help you navigate the data sources and create your own analysis.
The study uses data from eight different surveys in four countries. The table below summarizes key features of each survey. More details, such as the collection date, representativeness, and sample size of each survey, are provided in the full report.
| Country | Survey | Modality |
|---|---|---|
| Chile | Encuesta de Migración | Telephone |
| Chile | Labor Survey | In-person |
| Colombia | Gran Encuesta Integrada de Hogares (GEIH) | In-person |
| Colombia | Migration Pulse (Round 4) | Telephone |
| Ecuador | Encuesta a Personas en Movilidad Humana y en Comunidades Receptoras en Ecuador (EPEC) | In-person |
| Ecuador | High-Frequency Phone Surveys (HFPS) | Telephone |
| Peru | Encuesta Nacional de Hogares (ENAHO) | In-person |
| Peru | Encuesta Dirigida a la Población Venezolana (ENPOVE) | In-person |
We have organized these surveys into four CSV table files, one per country. The datasets are cleaned and harmonized, meaning all tables follow a similar structure and have the same variable names and values. Having cleaned datasets facilitates comparing countries and populations, and publishing them ensures the transparency and reproducibility of our findings.
Survey data have unique characteristics that set them apart from other data sources. Fundamentally, surveys are designed to gather information from a sample (a subset of a population) and then infer characteristics or attitudes of the entire population. Weighting is a crucial method for this leap from the sample (Venezuelans and hosts reached by the surveys) to the population (all Venezuelans and hosts). Weighted survey data assign a weight to each record to improve the representativeness of conclusions drawn from a limited and biased sample.
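To make the effect of weights concrete, here is a minimal sketch in base R. The data frame `toy` and its values are hypothetical and are not taken from the surveys; the sketch only illustrates how a weighted mean can differ from an unweighted one when some respondents stand for more people than others.

```r
# Hypothetical toy sample: ages of five respondents and their survey weights.
# Each weight says how many people in the population a respondent stands for.
toy <- data.frame(
  age    = c(25, 30, 35, 60, 65),
  weight = c(400, 400, 400, 50, 50)
)

# Unweighted mean: every respondent counts the same
unweighted_mean <- mean(toy$age)

# Weighted mean: sum(age * weight) / sum(weight)
weighted_mean <- weighted.mean(toy$age, w = toy$weight)

print(unweighted_mean)
print(weighted_mean)
```

In this toy sample, the older respondents carry small weights, so the weighted mean is pulled toward the younger respondents who represent most of the population.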
Although proprietary software offers tools for analyzing weighted surveys with graphical interfaces, the open-source landscape presents more challenges for beginners and requires basic coding skills. Jamovi, an open-source application, has plans to support weights in statistical analysis, but its implementation is still in the early stages. The most common approach for analyzing weighted survey data using open-source tools involves programming languages like Python and R. R, in particular, is designed for statistical analysis and provides a wide range of options for survey data exploration.
This tutorial uses R to explore the microdata used in the report Venezuelan Migrants and Refugees in Chile, Colombia, Ecuador, and Peru. It is tailored for beginners and will cover basic descriptive analysis rather than exploratory or inferential statistics.
You will see R code and outputs below, along with regular text like this. For example, the command print shows the message quoted as an output.
print("Coding is easy!")
[1] "Coding is easy!"
You can copy and paste the code to run your own analysis. We assume prior experience running R code. If you are unfamiliar with R, you can download RStudio and check one of the many introductory videos on installing and running R.
First, we will load the libraries needed for our tutorial. We explain the main purpose of each one as comments placed after the hashtag (#).
library(survey) # to handle survey weights
library(tidyverse) # to manipulate data easily
library(visdat) # to visualize missing values
library(knitr) # to format output nicely
Now, we will read the cleaned survey data and inspect the information available. You can download the datasets from the World Bank Microdata Library using the following links: [INSERT LINK HERE]
file_name = "../microdata/ecu_host_mig.csv" # you can change this line
survey = read.csv(file_name)
The path in file_name is relative to the folder where the R code runs. You can change the variable file_name to point to a different file path. We will use the dataset for Ecuador as an example in this tutorial.
The result of the command below shows the number of rows (observations), the number of columns (variables), their respective data types, and some sample values. It shows we have 3922 rows and 34 columns. The output also shows the column names after the dollar sign, along with the data type (int for integers, chr for text strings, and num for floating-point numbers) and values from the first rows.
Notice that some variables have missing values, which are represented by NA for numeric variables and empty quotes for categorical. We will come back to this issue soon.
str(survey) # print the survey STRucture
'data.frame': 3922 obs. of 34 variables:
$ year : int 2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
$ survey : chr "EPEC" "EPEC" "EPEC" "EPEC" ...
$ samp : chr "Hosts" "Hosts" "Hosts" "Hosts" ...
$ wave : int 1 1 1 1 1 1 1 1 1 1 ...
$ code_province : chr "Azuay" "Azuay" "Azuay" "Azuay" ...
$ age : int NA NA NA NA NA NA NA NA NA NA ...
$ sex : chr "" "" "" "" ...
$ female : int NA NA NA NA NA NA NA NA NA NA ...
$ male : int NA NA NA NA NA NA NA NA NA NA ...
$ hh_size : int NA NA NA NA NA NA NA NA NA NA ...
$ marital_status : chr "" "" "" "" ...
$ edu_years : int NA NA NA NA NA NA NA NA NA NA ...
$ edu_level : chr "" "" "" "" ...
$ employed : chr "" "" "" "" ...
$ unemployed : chr "" "" "" "" ...
$ inactive : chr "" "" "" "" ...
$ weight : num 703 949 844 844 718 ...
$ occupation : chr "Services" "Crafts" "Crafts" "Elementary/basic" ...
$ health_insurance : chr "" "" "" "" ...
$ homologate : chr "" "" "" "" ...
$ reason_nohomologate: chr "" "" "" "" ...
$ company : chr "" "" "" "" ...
$ plan_residency : chr "" "" "" "" ...
$ reason_migra : chr "" "" "" "" ...
$ discriminated : chr "" "" "" "" ...
$ integration_3_1 : int NA NA NA NA NA NA NA NA NA NA ...
$ integration_3_2 : int NA NA NA NA NA NA NA NA NA NA ...
$ integration_3_3 : int NA NA NA NA NA NA NA NA NA NA ...
$ integration_3_4 : int NA NA NA NA NA NA NA NA NA NA ...
$ integration_3_5 : int NA NA NA NA NA NA NA NA NA NA ...
$ integration_3_6 : int NA NA NA NA NA NA NA NA NA NA ...
$ integration_3_7 : int NA NA NA NA NA NA NA NA NA NA ...
$ integration_3_8 : int NA NA NA NA NA NA NA NA NA NA ...
$ entry_register : chr "" "" "" "" ...
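Because missing values appear both as NA and as empty strings, counting them takes one extra step. The following is a minimal sketch in base R on a hypothetical data frame `toy` that mimics the layout of the files; the values are made up for illustration.

```r
# Hypothetical toy data mimicking the file layout: NA for numeric columns,
# empty strings for text columns
toy <- data.frame(
  age            = c(34L, NA, 51L, NA),
  marital_status = c("Single", "", "Other", ""),
  weight         = c(703, 949, 844, 718),
  stringsAsFactors = FALSE
)

# Treat empty strings in character columns as missing values
is_chr <- sapply(toy, is.character)
toy[is_chr] <- lapply(toy[is_chr], function(x) {
  x[x %in% ""] <- NA_character_
  x
})

# Count missing values per column
missing_counts <- colSums(is.na(toy))
print(missing_counts)
```

Alternatively, `read.csv` accepts an `na.strings` argument (for example, `na.strings = c("NA", "")`), which converts empty strings to NA when the file is loaded.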
To understand how each country file is structured, let’s review some core variables. You can find detailed descriptions for all available variables in our data dictionary.
survey: the name of the survey;
wave: the wave of the survey. Surveys might have more than one edition, known as waves. Each wave occurs in a distinct period of time.
samp: indicates whether the response comes from the Venezuelan or the host population;
weight: the weight assigned to each record to produce unbiased estimates.
Each file aggregates different surveys from the same country. Therefore, you should use the variables survey and wave to filter the data and pick the right source depending on your research question. Because distinct surveys cover different questions, some rows have missing values.
Let’s plot the missing values for each of the surveys available. The chart shows more records from the EPEC survey than HFPS. The highlighted regions make it easy to spot which variables have missing values.
visdat::vis_miss(survey, facet = survey)
To keep it simple, our tutorial analyzes only information on age, marital status, region, and population type (host or Venezuelan migrant) in the HFPS survey. As the chart shows, these variables have no missing values in that survey, so we will not need to handle them here. Nevertheless, handling missing values is crucial to the data preparation phase. You might need to drop missing values or impute values to conduct other analyses. Your chosen strategy depends on why values are missing, the extent of missingness, and your analytical goals. Please refer to the data dictionary and documentation to understand the reasons for missing values.
Before applying the weights, we will check the total number of respondents by survey, wave and population.
summary_df <- survey %>%
  group_by(survey, wave, samp) %>%
  summarise(total = n(), .groups = 'drop')
# Output the table
kable(summary_df, caption = "Number of records by survey, wave and population")

| survey | wave | samp | total |
|---|---|---|---|
| EPEC | 1 | Hosts | 1807 |
| EPEC | 1 | Migrants | 1256 |
| HFPS | 2 | Hosts | 503 |
| HFPS | 2 | Migrants | 356 |
Keep in mind these values reflect the number of responses from an unweighted sample of Venezuelan migrants and host population, not the actual migrant and local population. Next, we’ll demonstrate how to use weights to calculate estimates that are more representative.
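The difference between counting respondents and estimating population sizes can be sketched in base R. The data frame `toy` and its weights are hypothetical, and the sketch assumes that the weights are scaled so that they sum to the population size, which is not the case for every survey.

```r
# Hypothetical toy records; column names follow the tutorial's files
toy <- data.frame(
  samp   = c("Hosts", "Hosts", "Migrants", "Migrants", "Migrants"),
  weight = c(1000, 1500, 200, 300, 250)
)

# Unweighted: number of respondents per group
n_respondents <- table(toy$samp)

# Weighted: sum of weights per group, read here as an estimated population
# size (assumption: weights are expansion weights, not normalized weights)
est_population <- tapply(toy$weight, toy$samp, sum)

print(n_respondents)
print(est_population)
```

In this toy example, migrants make up most of the respondents but a minority of the estimated population, which is why unweighted counts should not be read as population shares.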
As different surveys present distinct questions, you should select the survey according to the goals of your analysis. For instance, the data for Ecuador relies on the High-Frequency Phone Survey (HFPS) and the 2019 Human Mobility and Host Communities Survey (EPEC, for its acronym in Spanish). Most of the indicators used in the report come from the HFPS, except those referring to job occupations and health insurance, which come from the EPEC survey.
We will select the HFPS survey to compare the average age of Venezuelan migrants and the host population in Ecuador. Let’s start by filtering the dataset to keep only answers from the HFPS.
survey_filter <- survey %>%
  filter(survey == "HFPS")
Next, we define the survey design and the weights associated with each response. There are a variety of ways to implement weighted data analysis using R. For convenience, we use the function svydesign from survey, an R package with pre-built features tailored for survey analysis.
survey_ecu <- svydesign(ids = ~1, # ~1 means the survey has no clusters
                        data = survey_filter,
                        weights = survey_filter$weight)
Now, we are ready to produce more accurate estimates about the populations of interest.
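As a quick sanity check on what the design object does, the sketch below builds a design on a hypothetical data frame `toy` (made-up values) and compares the survey-weighted mean with base R's `weighted.mean`. It also passes the weights as a formula (`~weight`), an alternative way of pointing svydesign to a column of `data`.

```r
library(survey)

# Hypothetical toy data; column names follow the tutorial's files
toy <- data.frame(
  age    = c(25, 30, 35, 60, 65),
  weight = c(400, 400, 400, 50, 50)
)

# Same kind of design as in the tutorial, with the weights given as a
# formula (~weight) that refers to a column of `data`
toy_design <- svydesign(ids = ~1, data = toy, weights = ~weight)

# The survey-weighted mean should match base R's weighted.mean()
design_mean <- coef(svymean(~age, design = toy_design))
base_mean   <- weighted.mean(toy$age, w = toy$weight)

print(design_mean)
print(base_mean)
```

The point estimates agree; what the survey package adds on top of the weighted mean is a standard error that accounts for the survey design.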
Creating basic summary statistics using the survey package is straightforward. Our tutorial shows how to calculate the mean for numeric values and cross-tabulation for categorical variables.
We will group (svyby) the records by the population type (~samp) and calculate the mean age (svymean). The output shows the mean and the standard error (se) for each estimate. Standard errors are expressed in the same unit of measurement as the mean (here, years). They indicate how much the estimated mean is expected to vary from the actual population mean across repeated samples.
# Group by and calculate the mean age
svyby(formula = ~age, by = ~samp, design = survey_ecu, svymean)
             samp      age        se
Hosts       Hosts 39.88760 0.9410057
Migrants Migrants 36.09989 0.5309258
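One common way to read a standard error is to turn it into an approximate confidence interval. The sketch below uses the means and standard errors from the svyby output and assumes a normal approximation (mean plus or minus about 1.96 standard errors); the column names `ci_lower` and `ci_upper` are chosen here for illustration.

```r
# Estimates copied from the svyby output for mean age by population
estimates <- data.frame(
  samp = c("Hosts", "Migrants"),
  age  = c(39.88760, 36.09989),
  se   = c(0.9410057, 0.5309258)
)

# Approximate 95% confidence interval under a normal approximation
z <- qnorm(0.975)
estimates$ci_lower <- estimates$age - z * estimates$se
estimates$ci_upper <- estimates$age + z * estimates$se

print(estimates)
```

Under this approximation, the interval for hosts is wider than the one for migrants because its standard error is larger. The survey package also provides a `confint()` method that can be applied to svyby results, which may be preferable to computing the interval by hand.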
We can employ the svytable command to analyze categorical variables. The following code cross-tabulates the population by marital status. Using prop.table(crosstab, 1), we present the values as percentages within each population, so that each row sums to 100 (using 2 instead would make each column sum to 100). Additionally, we round the values to two decimal places.
# Cross-tabulates values
crosstab <- svytable(~samp + marital_status, design = survey_ecu)
# Calculate percentages
crosstab_percentages <- round(prop.table(crosstab, 1) * 100, 2)
# Output the table
knitr::kable(crosstab_percentages,
             caption = "Crosstab of Population by Marital Status (%)")

| | Married/Cohabitation | Other | Single |
|---|---|---|---|
| Hosts | 54.32 | 10.64 | 35.04 |
| Migrants | 51.14 | 4.29 | 44.57 |
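The role of the second argument of prop.table can be checked on a small matrix. This is a minimal sketch in base R with hypothetical counts; the row and column labels are made up and only loosely mirror the tutorial's table.

```r
# Hypothetical weighted counts: rows are populations, columns are categories
counts <- matrix(
  c(60, 40,
    20, 80),
  nrow = 2, byrow = TRUE,
  dimnames = list(samp = c("Hosts", "Migrants"),
                  status = c("Married", "Single"))
)

# margin = 1: percentages within each row (each row sums to 100)
row_pct <- prop.table(counts, 1) * 100

# margin = 2: percentages within each column (each column sums to 100)
col_pct <- prop.table(counts, 2) * 100

print(row_pct)
print(col_pct)
```

Row percentages answer "what share of each population falls in each category", while column percentages answer "what share of each category belongs to each population".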
You can also use the column code_province to produce estimates for each region.
mean_age <- svyby(~age, ~samp + code_province, survey_ecu, svymean)
knitr::kable(mean_age, caption = "Mean age by population", row.names = FALSE)

| samp | code_province | age | se |
|---|---|---|---|
| Hosts | Azuay | 34.28751 | 2.4472912 |
| Migrants | Azuay | 38.44558 | 1.6925252 |
| Hosts | Bolivar | 35.38478 | 4.8992903 |
| Hosts | Cañar | 41.93000 | 9.4223198 |
| Migrants | Cañar | 31.39452 | 4.6921062 |
| Hosts | Carchi | 45.61607 | 7.7941828 |
| Migrants | Carchi | 48.50000 | 4.5988717 |
| Hosts | Chimborazo | 39.31556 | 5.1718474 |
| Migrants | Chimborazo | 31.05517 | 2.5762621 |
| Hosts | Cotopaxi | 48.75018 | 7.3282210 |
| Migrants | Cotopaxi | 38.00000 | 0.0000000 |
| Hosts | El Oro | 42.98743 | 5.1123709 |
| Migrants | El Oro | 34.21006 | 2.3494369 |
| Hosts | Esmeraldas | 46.99697 | 8.6409451 |
| Hosts | Galápagos | 58.39114 | 2.0476748 |
| Hosts | Guayas | 39.80792 | 1.9244430 |
| Migrants | Guayas | 33.44946 | 1.0339923 |
| Hosts | Imbabura | 42.52794 | 8.1943990 |
| Migrants | Imbabura | 31.84827 | 1.9839182 |
| Hosts | Loja | 39.73884 | 5.0654610 |
| Migrants | Loja | 48.00000 | 0.0000000 |
| Hosts | Los Rios | 31.94942 | 2.5716242 |
| Migrants | Los Rios | 36.79796 | 2.1185825 |
| Hosts | Manabi | 42.21102 | 3.0035721 |
| Migrants | Manabi | 33.56610 | 1.7669829 |
| Hosts | Morona Santiago | 34.26451 | 6.6840835 |
| Migrants | Morona Santiago | 48.00000 | 0.0000000 |
| Hosts | Napo | 42.66667 | 0.9818785 |
| Migrants | Napo | 41.47055 | 3.3052052 |
| Hosts | Orellana | 43.77406 | 7.5836483 |
| Migrants | Orellana | 32.44793 | 1.6388851 |
| Hosts | Pastaza | 55.01462 | 5.8836227 |
| Hosts | Pichincha | 39.85590 | 1.9423484 |
| Migrants | Pichincha | 37.38319 | 0.7775836 |
| Hosts | Santa Elena | 44.04101 | 7.0941311 |
| Migrants | Santa Elena | 39.45244 | 3.8401079 |
| Hosts | Santo Domingo | 46.94187 | 4.9430434 |
| Migrants | Santo Domingo | 34.19018 | 3.4954833 |
| Hosts | Sucumbíos | 25.05930 | 2.1051501 |
| Hosts | Tungurahua | 36.16449 | 3.8003223 |
| Migrants | Tungurahua | 37.68505 | 3.2999673 |
| Hosts | Zamora Chinchipe | 30.66675 | 2.6596754 |
| Migrants | Zamora Chinchipe | 44.00000 | 0.0000000 |
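Note that the standard error tends to increase in regions with fewer observations. A standard error of exactly 0, as for migrants in Cotopaxi or Loja, most likely reflects a single respondent (or identical ages) in that group rather than a precise estimate. Bear in mind the limitations of using weights calculated at the national level for regional analysis: they might not accurately reflect the unique characteristics of each region, which can result in biased estimates.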
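Before interpreting regional estimates, it can help to check how many respondents stand behind each one. The following is a minimal sketch in base R on a hypothetical data frame `toy`; the threshold `min_n` is an arbitrary illustration, not a recommendation from the report.

```r
# Hypothetical toy records; column names follow the tutorial's files
toy <- data.frame(
  samp          = c("Hosts", "Hosts", "Hosts", "Migrants", "Migrants", "Migrants"),
  code_province = c("Azuay", "Azuay", "Loja", "Azuay", "Azuay", "Loja")
)

# Unweighted number of respondents in each population/province group
cell_counts <- as.data.frame(table(toy$samp, toy$code_province),
                             stringsAsFactors = FALSE)
names(cell_counts) <- c("samp", "code_province", "n")

# Flag groups below a hypothetical minimum sample size
min_n <- 2
small_cells <- cell_counts[cell_counts$n < min_n, ]
print(small_cells)
```

Groups flagged this way are candidates for being merged with neighboring regions or reported with an explicit caveat.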
This tutorial has offered a glimpse into the initial steps for leveraging open-source tools to interpret data on Venezuelan migration. While we have covered essential techniques, the scope for further exploration is vast. We invite you to share any open-source solutions to analyze weighted survey data that might have been overlooked or suggest topics for future tutorials on forced displacement data. You can contact us by email or social media networks (Twitter and LinkedIn).
https://github.com/pewresearch/pewmethods: R package developed by the Pew Research Center Methods team for working with survey data.
https://github.com/quantipy/quantipy3/: Python package to read survey data.