Venezuelan Migrants and Refugees in Chile, Colombia, Ecuador, and Peru

This tutorial aims to help analysts and local policymakers develop evidence-based policies and promote a mutually beneficial relationship between Venezuelan migrants and host communities in Chile, Colombia, Ecuador, and Peru. We present an overview of open-source technologies for analyzing survey data and a step-by-step guide to using R to compute descriptive statistics for host populations and Venezuelan migrants.

Introduction

The World Bank report Venezuelan Migrants and Refugees in Chile, Colombia, Ecuador, and Peru provides a detailed socio-economic profile of Venezuelans in these four countries to help guide the policy and institutional response. The study uses official data from several surveys of the adult population (18 years or older) of Venezuelan migrants and national residents. The Joint Data Center on Forced Displacement supported the study and has created this tutorial to help you navigate the data sources and create your own analysis.

Data sources

The study uses data from eight different surveys across four countries. The table below summarizes key features of each survey. More details, such as the collection date, representativeness, and sample size of each survey, are provided in the full report.

Country	Survey	Modality
Chile	Encuesta de Migración	Telephone
Chile	Labor Survey	In-person
Colombia	Gran Encuesta Integrada de Hogares (GEIH)	In-person
Colombia	Migration Pulse (Round 4)	Telephone
Ecuador	Encuesta a Personas en Movilidad Humana y en Comunidades Receptoras en Ecuador (EPEC)	In-person
Ecuador	High-Frequency Phone Surveys (HFPS)	Telephone
Peru	Encuesta Nacional de Hogares (ENAHO)	In-person
Peru	Encuesta Dirigida a la Población Venezolana (ENPOVE)	In-person

We have organized these surveys into four CSV files, one per country. The datasets are cleaned and harmonized, meaning all tables follow a similar structure and have the same variable names and values. Having cleaned datasets facilitates comparing countries and populations, and publishing them ensures the transparency and reproducibility of our findings.

How to analyze survey data using open-source solutions

Survey data have unique characteristics that set them apart from other data sources. Fundamentally, surveys are designed to gather information from a sample, a subset of a population, and then infer characteristics or attitudes of the entire population. Weighting is a crucial method for this leap from the sample (Venezuelans and hosts reached by the surveys) to the population (all Venezuelans and hosts). Weighted survey data assign a weight to each record to improve the representativeness of conclusions drawn from a limited and possibly biased sample.
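To make the effect of weighting concrete, here is a minimal sketch in base R. The ages and weights are made-up values for illustration only, not figures from the surveys.

```r
# Toy sample (hypothetical values): ages of five respondents and their weights.
# A weight of 300 means the respondent stands in for about 300 people.
age    <- c(25, 30, 35, 40, 60)
weight <- c(300, 300, 200, 100, 100)

# Unweighted mean: every respondent counts equally
unweighted_mean <- mean(age)

# Weighted mean: sum(weight * age) / sum(weight)
weighted_mean <- weighted.mean(age, weight)

print(unweighted_mean)
print(weighted_mean)
```

Because the younger respondents carry larger weights in this toy sample, the weighted mean is lower than the unweighted one.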

Although proprietary software offers tools for analyzing weighted surveys with graphical interfaces, the open-source landscape presents more challenges for beginners and requires basic coding skills. Jamovi, an open-source application, has plans to support weights in statistical analysis, but its implementation is still in the early stages. The most common approach for analyzing weighted survey data using open-source tools involves programming languages like Python and R. R, in particular, is designed for statistical analysis and provides a wide range of options for survey data exploration.

Analyzing the data using R

This tutorial uses R to explore the microdata used in the report Venezuelan Migrants and Refugees in Chile, Colombia, Ecuador, and Peru. It is tailored for beginners and will cover basic descriptive analysis rather than exploratory or inferential statistics.

You will see R code and outputs below, along with regular text like this. For example, the print command displays the quoted message as output.

print("Coding is easy!")

[1] "Coding is easy!"

You can copy and paste the code to run your own analysis. We assume some prior experience running R code. If you are unfamiliar with R, you can download R and RStudio and check one of the many introductory videos on installing and running R.

Load libraries

First, we will load the libraries needed for our tutorial. We explain the main purpose of each one in the comments placed after the hash sign (#).

library(survey) # to handle survey weights
library(tidyverse) # to manipulate data easily
library(visdat) # to visualize missing values
library(knitr) # to format output nicely

Read the data

Now, we will read the cleaned survey data and inspect the information available. You can download the datasets using the following links of the World Bank Microdata Library: [INSERT LINK HERE]

file_name = "../microdata/ecu_host_mig.csv" # you can change this line

survey = read.csv(file_name)

We assume the datasets are stored in a folder named microdata, located next to the folder from which the R code runs. If your files are elsewhere, change the path in the variable file_name. We will use the dataset for Ecuador as an example in this tutorial.

The result of the command below shows the number of rows (observations) and columns (variables), along with each column's data type and some sample values. It shows we have 3922 rows and 34 columns. The output lists each column name after the dollar sign, followed by its data type (int for integers, chr for text strings, and num for floating-point numbers) and the values from the first rows.

Notice that some variables have missing values, which are represented by NA for numeric variables and by empty strings ("") for categorical variables. We will come back to this issue soon.

str(survey) # print the survey STRucture

'data.frame':   3922 obs. of  34 variables:
 $ year               : int  2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
 $ survey             : chr  "EPEC" "EPEC" "EPEC" "EPEC" ...
 $ samp               : chr  "Hosts" "Hosts" "Hosts" "Hosts" ...
 $ wave               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ code_province      : chr  "Azuay" "Azuay" "Azuay" "Azuay" ...
 $ age                : int  NA NA NA NA NA NA NA NA NA NA ...
 $ sex                : chr  "" "" "" "" ...
 $ female             : int  NA NA NA NA NA NA NA NA NA NA ...
 $ male               : int  NA NA NA NA NA NA NA NA NA NA ...
 $ hh_size            : int  NA NA NA NA NA NA NA NA NA NA ...
 $ marital_status     : chr  "" "" "" "" ...
 $ edu_years          : int  NA NA NA NA NA NA NA NA NA NA ...
 $ edu_level          : chr  "" "" "" "" ...
 $ employed           : chr  "" "" "" "" ...
 $ unemployed         : chr  "" "" "" "" ...
 $ inactive           : chr  "" "" "" "" ...
 $ weight             : num  703 949 844 844 718 ...
 $ occupation         : chr  "Services" "Crafts" "Crafts" "Elementary/basic" ...
 $ health_insurance   : chr  "" "" "" "" ...
 $ homologate         : chr  "" "" "" "" ...
 $ reason_nohomologate: chr  "" "" "" "" ...
 $ company            : chr  "" "" "" "" ...
 $ plan_residency     : chr  "" "" "" "" ...
 $ reason_migra       : chr  "" "" "" "" ...
 $ discriminated      : chr  "" "" "" "" ...
 $ integration_3_1    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_2    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_3    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_4    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_5    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_6    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_7    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_8    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ entry_register     : chr  "" "" "" "" ...
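Since missing values appear both as NA and as empty strings in the output above, it can be convenient to treat them uniformly when reading a file. The sketch below uses the na.strings argument of read.csv on a small inline extract; the three rows are hypothetical and only mimic the structure of the country files.

```r
# Hypothetical three-row extract written inline so the example is self-contained
csv_text <- "samp,age,marital_status
Hosts,NA,
Migrants,34,Single
Migrants,41,"

# na.strings tells read.csv which raw values to treat as missing
toy <- read.csv(text = csv_text, na.strings = c("", "NA"))

# Count missing values per column
missing_counts <- colSums(is.na(toy))
print(missing_counts)
```

With this option, functions such as is.na() detect missing categorical values as well as numeric ones.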

Preliminary analysis

To understand how each country file is structured, let’s review some core variables. You can find detailed descriptions for all available variables in our data dictionary.

survey: the name of the survey;
wave: the wave of the survey. Surveys might have more than one edition, known as waves, each conducted in a distinct period;
samp: indicates whether the response comes from a Venezuelan migrant or a member of the host population;
weight: the weight assigned to each record to produce unbiased estimates.

Each file aggregates different surveys from the same country. Therefore, you should use the variables survey and wave to filter the data and pick the right source depending on your research question. Because distinct surveys cover different questions, some rows have missing values.

Inspect missing values

Let’s plot the missing values for each of the surveys available. The chart shows more records from the EPEC survey than HFPS. The highlighted regions make it easy to spot which variables have missing values.

visdat::vis_miss(survey,facet = survey)

To keep it simple, our tutorial analyzes only information on age, marital status, region, and population type (host or Venezuelan migrant) in the HFPS survey. As the image shows, these variables have no missing values, so we will not need to handle them here. Nevertheless, handling missing values is crucial to the data preparation phase. You might need to drop missing values or impute values to conduct other analyses. Your chosen strategy depends on why values are missing, the extent of missingness, and your analytical goals. Please refer to the data dictionary and documentation to understand the reasons for missing values.
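If you need numbers rather than a chart, the extent of missingness can also be tabulated. The following base R sketch computes the share of missing values per variable within each survey and then drops incomplete rows for one variable. The toy data frame is hypothetical, and dropping rows is only one of several possible strategies.

```r
# Hypothetical toy data mimicking the structure of the country files
toy <- data.frame(
  survey = c("EPEC", "EPEC", "EPEC", "HFPS", "HFPS"),
  age = c(NA, NA, 30, 41, 27),
  occupation = c("Services", "Crafts", NA, NA, NA)
)

# Share of missing values (%) per variable, computed separately for each survey
missing_share <- sapply(
  split(toy[c("age", "occupation")], toy$survey),
  function(df) round(colMeans(is.na(df)) * 100, 1)
)
print(missing_share)

# One simple strategy: keep only rows with an observed age
toy_complete <- toy[!is.na(toy$age), ]
```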

Records by surveys and waves

Before applying the weights, we will check the total number of respondents by survey, wave and population.

summary_df <- survey %>% 
  group_by(survey, wave, samp) %>% 
  summarise(total = n(), .groups = 'drop')

# Output the table
kable(summary_df, caption = "Number of records by survey, wave and population")

Number of records by survey, wave and population
survey	wave	samp	total
EPEC	1	Hosts	1807
EPEC	1	Migrants	1256
HFPS	2	Hosts	503
HFPS	2	Migrants	356

Keep in mind these values reflect the number of responses from an unweighted sample of Venezuelan migrants and host population, not the actual migrant and local population. Next, we’ll demonstrate how to use weights to calculate estimates that are more representative.
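As a first intuition, summing the weights within each group gives the number of people that the respondents of that group stand for. The sketch below assumes the weights are expansion weights (each weight is the number of people a respondent represents), which you should confirm in the data documentation. The toy values are made up.

```r
# Hypothetical toy data: six respondents with made-up weights
toy <- data.frame(
  survey = c("HFPS", "HFPS", "HFPS", "HFPS", "HFPS", "HFPS"),
  samp   = c("Hosts", "Hosts", "Hosts", "Migrants", "Migrants", "Migrants"),
  weight = c(5000, 7000, 8000, 400, 250, 350)
)

# Unweighted: number of respondents per group
respondents <- aggregate(weight ~ survey + samp, data = toy, FUN = length)

# Weighted: sum of weights per group, i.e. the estimated population size
population <- aggregate(weight ~ survey + samp, data = toy, FUN = sum)
print(population)
```

In this toy example both groups have three respondents, yet they represent very different population sizes.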

Configure the survey design

As different surveys present distinct questions, you should select the survey according to the goals of your analysis. For instance, the data for Ecuador relies on the High-Frequency Phone Survey (HFPS) and the 2019 Human Mobility and Host Communities Survey (EPEC, for its acronym in Spanish). Most of the indicators used in the report come from the HFPS, except those referring to job occupations and health insurance, which come from the EPEC survey.

We will select the HFPS to compare the average age of Venezuelan migrants and the host population in Ecuador. Let’s start by filtering the dataset to keep only answers from the HFPS.

survey_filter <- survey %>% 
  filter(survey == "HFPS")

Next, we define the survey design and the weights associated with each response. There are a variety of ways to implement weighted data analysis using R. For convenience, we use the function svydesign from survey, an R package with pre-built features tailored for survey analysis.

survey_ecu <- svydesign(ids = ~1, # ~1 means the survey has no clusters
                        data = survey_filter,
                        weights = ~weight) # column holding the survey weights

Now, we are ready to produce more accurate estimates about the populations of interest.

Descriptive statistics

Creating basic summary statistics using the survey package is straightforward. Our tutorial shows how to calculate the mean for numeric values and cross-tabulation for categorical variables.

Numeric values

We will group (svyby) the records by population type (~samp) and calculate the mean age (svymean). The output shows the mean and the standard error (se) of each estimate. Standard errors are expressed in the same unit as the mean (here, years). They indicate how much the estimated mean would be expected to vary from sample to sample, and therefore how far it may lie from the actual population mean.

# Group by and calculate the mean age
svyby(formula = ~age, by = ~samp, design = survey_ecu, svymean)

             samp      age        se
Hosts       Hosts 39.88760 0.9410057
Migrants Migrants 36.09989 0.5309258
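Standard errors are often easier to read as confidence intervals. The sketch below applies confint() from the survey package to a svyby result. It uses a hypothetical toy data frame with made-up ages and weights so that it runs on its own; with the tutorial data you would pass your own svyby output instead.

```r
library(survey)

# Hypothetical toy data; ages and weights are made up for illustration
toy <- data.frame(
  samp   = rep(c("Hosts", "Migrants"), each = 4),
  age    = c(30, 40, 50, 60, 25, 30, 35, 40),
  weight = c(100, 100, 100, 100, 10, 10, 10, 10)
)

toy_design <- svydesign(ids = ~1, data = toy, weights = ~weight)

# Weighted mean age by population, as in the tutorial
mean_age_toy <- svyby(~age, ~samp, toy_design, svymean)

# 95% confidence interval: roughly mean +/- 1.96 * se
ci <- confint(mean_age_toy, level = 0.95)
print(ci)
```

If the intervals of two groups overlap substantially, the difference between their means should be interpreted with caution.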

Categorical variables

We can use the svytable command to analyze categorical variables. The following code cross-tabulates population type by marital status. Using prop.table(crosstab, 1), we present the values as row percentages, so each population sums to 100 (using 2 instead would make each column sum to 100). Additionally, we round the values to two decimal places.

# Cross-tabulates values
crosstab <- svytable(~samp + marital_status, design = survey_ecu)

# Calculate percentages
crosstab_percentages <- round(prop.table(crosstab,1) * 100,2) 
 
# Output the table
knitr::kable(crosstab_percentages, 
             caption = "Crosstab of Population by Marital Status (%)")

Crosstab of Population by Marital Status (%)
	Married/Cohabitation	Other	Single
Hosts	54.32	10.64	35.04
Migrants	51.14	4.29	44.57
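To make the role of the margin argument of prop.table concrete, here is a minimal base R sketch. The counts are made-up numbers, not survey results.

```r
# Hypothetical weighted counts (made-up numbers, not survey results)
counts <- matrix(
  c(60, 40,
    30, 70),
  nrow = 2, byrow = TRUE,
  dimnames = list(samp = c("Hosts", "Migrants"),
                  marital_status = c("Married/Cohabitation", "Single"))
)

# margin = 1: each row (population) sums to 100
row_pct <- prop.table(counts, margin = 1) * 100

# margin = 2: each column (marital status) sums to 100
col_pct <- prop.table(counts, margin = 2) * 100

print(row_pct)
print(round(col_pct, 2))
```

Row percentages answer "what share of each population is married?", while column percentages answer "what share of married people belongs to each population?".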

Detailing by region

You can also use the column code_province to produce estimates for each region.

mean_age <- svyby(~age, ~samp + code_province, survey_ecu, svymean)

knitr::kable(mean_age, caption = "Mean age by population and province", row.names=FALSE)

Mean age by population and province
samp	code_province	age	se
Hosts	Azuay	34.28751	2.4472912
Migrants	Azuay	38.44558	1.6925252
Hosts	Bolivar	35.38478	4.8992903
Hosts	Cañar	41.93000	9.4223198
Migrants	Cañar	31.39452	4.6921062
Hosts	Carchi	45.61607	7.7941828
Migrants	Carchi	48.50000	4.5988717
Hosts	Chimborazo	39.31556	5.1718474
Migrants	Chimborazo	31.05517	2.5762621
Hosts	Cotopaxi	48.75018	7.3282210
Migrants	Cotopaxi	38.00000	0.0000000
Hosts	El Oro	42.98743	5.1123709
Migrants	El Oro	34.21006	2.3494369
Hosts	Esmeraldas	46.99697	8.6409451
Hosts	Galápagos	58.39114	2.0476748
Hosts	Guayas	39.80792	1.9244430
Migrants	Guayas	33.44946	1.0339923
Hosts	Imbabura	42.52794	8.1943990
Migrants	Imbabura	31.84827	1.9839182
Hosts	Loja	39.73884	5.0654610
Migrants	Loja	48.00000	0.0000000
Hosts	Los Rios	31.94942	2.5716242
Migrants	Los Rios	36.79796	2.1185825
Hosts	Manabi	42.21102	3.0035721
Migrants	Manabi	33.56610	1.7669829
Hosts	Morona Santiago	34.26451	6.6840835
Migrants	Morona Santiago	48.00000	0.0000000
Hosts	Napo	42.66667	0.9818785
Migrants	Napo	41.47055	3.3052052
Hosts	Orellana	43.77406	7.5836483
Migrants	Orellana	32.44793	1.6388851
Hosts	Pastaza	55.01462	5.8836227
Hosts	Pichincha	39.85590	1.9423484
Migrants	Pichincha	37.38319	0.7775836
Hosts	Santa Elena	44.04101	7.0941311
Migrants	Santa Elena	39.45244	3.8401079
Hosts	Santo Domingo	46.94187	4.9430434
Migrants	Santo Domingo	34.19018	3.4954833
Hosts	Sucumbíos	25.05930	2.1051501
Hosts	Tungurahua	36.16449	3.8003223
Migrants	Tungurahua	37.68505	3.2999673
Hosts	Zamora Chinchipe	30.66675	2.6596754
Migrants	Zamora Chinchipe	44.00000	0.0000000

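Note that the standard error increases where we have fewer observations in a region. A standard error of zero, as for migrants in Cotopaxi or Loja, most likely means the estimate rests on a single respondent (or on respondents of identical age) and should not be read as a sign of precision. Bear in mind the limitations of using weights calculated at the national level in regional analysis: they might not accurately reflect the unique characteristics of each region, resulting in biased estimates.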
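One way to spot fragile regional estimates is to count the unweighted respondents behind each cell. The base R sketch below does this on a hypothetical toy data frame. The threshold min_n is an illustrative assumption, not a rule from the report, and should be set according to your own quality criteria.

```r
# Hypothetical toy data: one row per respondent
toy <- data.frame(
  samp = c("Hosts", "Hosts", "Hosts", "Migrants", "Migrants", "Hosts", "Migrants"),
  code_province = c("Guayas", "Guayas", "Guayas", "Guayas", "Guayas", "Loja", "Loja")
)

# Unweighted number of respondents behind each estimate
cell_counts <- as.data.frame(table(samp = toy$samp,
                                   code_province = toy$code_province),
                             responseName = "n")

# Flag cells below an illustrative threshold (min_n is an assumption,
# not a rule from the report)
min_n <- 2
cell_counts$too_small <- cell_counts$n < min_n
print(cell_counts)
```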

Conclusion

This tutorial has offered a glimpse into the initial steps for leveraging open-source tools to interpret data on Venezuelan migration. While we have covered essential techniques, the scope for further exploration is vast. We invite you to share any open-source solutions to analyze weighted survey data that might have been overlooked or suggest topics for future tutorials on forced displacement data. You can contact us by email or social media networks (Twitter and LinkedIn).

References and resources

https://github.com/pewresearch/pewmethods: R package developed by the Pew Research Center Methods team for working with survey data.

https://github.com/quantipy/quantipy3/: Python package to read survey data.