Venezuelan Migrants and Refugees in Chile, Colombia, Ecuador, and Peru

This tutorial aims to help analysts and local policymakers develop evidence-based policies and promote a mutually beneficial relationship between Venezuelan migrants and host communities in Chile, Colombia, Ecuador, and Peru. We present an overview of open-source technologies for analyzing survey data and a step-by-step guide to using R to produce descriptive statistics on host populations and Venezuelan migrants.

Introduction

The World Bank report Venezuelan Migrants and Refugees in Chile, Colombia, Ecuador, and Peru provides a detailed socio-economic profile of Venezuelans in these four countries to help guide the policy and institutional response. The study uses official data from several surveys of the adult population (18 years or older) of Venezuelan migrants and national residents. The Joint Data Center on Forced Displacement supported the study and has created this tutorial to help you navigate the data sources and create your own analysis.

Data sources

The study uses data from eight different surveys across the four countries. The table below summarizes key features of each survey. More details, such as collection dates, representativeness, and sample sizes, are provided in the full report.

Country Survey Modality
Chile Encuesta de Migración Telephone
Chile Labor Survey In-person
Colombia Gran Encuesta Integrada de Hogares (GEIH) In-person
Colombia Migration Pulse (Round 4) Telephone
Ecuador Encuesta a Personas en Movilidad Humana y en Comunidades Receptoras en Ecuador (EPEC) In-person
Ecuador High-Frequency Phone Surveys (HFPS) Telephone
Peru Encuesta Nacional de Hogares (ENAHO) In-person
Peru Encuesta Dirigida a la Población Venezolana (ENPOVE) In-person

We have organized these surveys into four CSV files, one per country. The datasets are cleaned and harmonized, meaning all tables follow a similar structure and use the same variable names and values. Having cleaned datasets facilitates comparing countries and populations, and publishing them ensures the transparency and reproducibility of our findings.

How to analyze survey data using open-source solutions?

Survey data have unique characteristics that set them apart from other data sources. Fundamentally, surveys are designed to gather information from a sample, a subset of a population, and then infer characteristics or attitudes of the entire population. Weighting is a crucial method for making this leap from the sample (Venezuelans and hosts reached by the surveys) to the population (all Venezuelans and hosts). Weighted survey data assign a weight to each record to improve the representativeness of conclusions drawn from a limited and potentially biased sample.
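To see what weighting means in practice, here is a minimal, made-up example (three fictional respondents, not drawn from any of the surveys): each weight says how many people in the population a respondent represents, so the weighted mean counts respondents in proportion to their weights.

age <- c(25, 40, 60)       # ages reported by three fictional respondents
weight <- c(100, 300, 50)  # number of people each respondent represents

mean(age)                  # unweighted mean: every respondent counts equally
[1] 41.66667
weighted.mean(age, weight) # weighted mean: respondents count in proportion to their weight
[1] 38.88889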

Although proprietary software offers tools for analyzing weighted surveys with graphical interfaces, the open-source landscape presents more challenges for beginners and requires basic coding skills. Jamovi, an open-source application, has plans to support weights in statistical analysis, but its implementation is still in the early stages. The most common approach for analyzing weighted survey data using open-source tools involves programming languages like Python and R. R, in particular, is designed for statistical analysis and provides a wide range of options for survey data exploration.

Analyzing the data using R

This tutorial uses R to explore the microdata used in the report Venezuelan Migrants and Refugees in Chile, Colombia, Ecuador, and Peru. It is tailored for beginners and will cover basic descriptive analysis rather than exploratory or inferential statistics.

You will see R code and outputs below, along with regular text like this. For example, the print command below displays the quoted message as output.

print("Coding is easy!")
[1] "Coding is easy!"

You can copy and paste the code to run your own analysis. We assume prior experience running R code. If you are unfamiliar with R, you can download RStudio and check one of the many introductory videos on installing and running R.
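If R is installed but the packages used in this tutorial are not, you can install them once with install.packages (a one-time setup step; the package names below match the libraries loaded in the next section).

install.packages(c("survey", "tidyverse", "visdat", "knitr")) # run once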

Load libraries

First, we will load the libraries needed for our tutorial. We explain the main purpose of each one in comments after the hash sign (#).

library(survey) # to handle survey weights
library(tidyverse) # to manipulate data easily
library(visdat) # to visualize missing values
library(knitr) # to format output nicely

Read the data

Now, we will read the cleaned survey data and inspect the information available. You can download the datasets from the World Bank Microdata Library using the following links: [INSERT LINK HERE]

file_name <- "../microdata/ecu_host_mig.csv" # you can change this line

survey <- read.csv(file_name)

We assume the R code runs from the same folder where the datasets are located. You can change the variable file_name to point to a different file path. We will use the dataset for Ecuador as an example in this tutorial.

The command below shows the number of rows (observations) and columns (variables), their data types, and some sample values. We have 3922 rows and 34 columns. The output lists each column name after the dollar sign, along with its data type (int for integers, chr for text strings, and num for floating-point numbers) and values from the first rows.

Notice that some variables have missing values, which are represented by NA for numeric variables and empty strings for categorical ones. We will come back to this issue soon.

str(survey) # print the survey STRucture
'data.frame':   3922 obs. of  34 variables:
 $ year               : int  2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
 $ survey             : chr  "EPEC" "EPEC" "EPEC" "EPEC" ...
 $ samp               : chr  "Hosts" "Hosts" "Hosts" "Hosts" ...
 $ wave               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ code_province      : chr  "Azuay" "Azuay" "Azuay" "Azuay" ...
 $ age                : int  NA NA NA NA NA NA NA NA NA NA ...
 $ sex                : chr  "" "" "" "" ...
 $ female             : int  NA NA NA NA NA NA NA NA NA NA ...
 $ male               : int  NA NA NA NA NA NA NA NA NA NA ...
 $ hh_size            : int  NA NA NA NA NA NA NA NA NA NA ...
 $ marital_status     : chr  "" "" "" "" ...
 $ edu_years          : int  NA NA NA NA NA NA NA NA NA NA ...
 $ edu_level          : chr  "" "" "" "" ...
 $ employed           : chr  "" "" "" "" ...
 $ unemployed         : chr  "" "" "" "" ...
 $ inactive           : chr  "" "" "" "" ...
 $ weight             : num  703 949 844 844 718 ...
 $ occupation         : chr  "Services" "Crafts" "Crafts" "Elementary/basic" ...
 $ health_insurance   : chr  "" "" "" "" ...
 $ homologate         : chr  "" "" "" "" ...
 $ reason_nohomologate: chr  "" "" "" "" ...
 $ company            : chr  "" "" "" "" ...
 $ plan_residency     : chr  "" "" "" "" ...
 $ reason_migra       : chr  "" "" "" "" ...
 $ discriminated      : chr  "" "" "" "" ...
 $ integration_3_1    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_2    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_3    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_4    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_5    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_6    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_7    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_8    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ entry_register     : chr  "" "" "" "" ...

Preliminary analysis

To understand how each country file is structured, let’s review some core variables. You can find detailed descriptions for all available variables in our data dictionary.

  • survey: the name of the survey;

  • wave: the wave of the survey. Surveys might have more than one edition, known as waves, each collected in a distinct period of time.

  • samp: indicates whether the response comes from the Venezuelan migrant or the host population;

  • weight: the weight assigned to each record to produce unbiased estimates.

Each file aggregates different surveys from the same country. Therefore, you should use the variables survey and wave to filter the data and pick the right source depending on your research question. Because distinct surveys cover different questions, some rows have missing values.
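For example, to keep only the records from wave 1 of the EPEC survey, you could filter on both variables (a short sketch; we apply the same pattern to the HFPS survey later in this tutorial).

# Keep only responses from wave 1 of the EPEC survey
epec_wave1 <- survey %>% 
  filter(survey == "EPEC", wave == 1)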

Inspect missing values

Let’s plot the missing values for each of the surveys available. The chart shows more records from the EPEC survey than HFPS. The highlighted regions make it easy to spot which variables have missing values.

visdat::vis_miss(survey, facet = survey)

To keep it simple, our tutorial analyzes only information on age, marital status, region, and population type (host or Venezuelan migrant) in the HFPS survey. As the image shows, these variables have no missing values, so we do not need to handle them here. Nevertheless, handling missing values is a crucial part of the data preparation phase. You might need to drop missing values or impute them to conduct other analyses. Your chosen strategy depends on why values are missing, the extent of missingness, and your analytical goals. Please refer to the data dictionary and documentation to understand the reasons for missing values.
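If your own analysis does require handling missing values, a simple first step is to count them per variable and, where appropriate, drop incomplete rows. The sketch below uses the tidyverse functions already loaded and treats both NA and empty strings as missing.

# Count missing values (NA or empty string) for each variable
survey %>% 
  summarise(across(everything(), ~ sum(is.na(.) | . == "")))

# Keep only rows with a reported age before an age-based analysis
survey_age <- survey %>% 
  filter(!is.na(age))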

Records by surveys and waves

Before applying the weights, we will check the total number of respondents by survey, wave and population.

summary_df <- survey %>% 
  group_by(survey, wave, samp) %>% 
  summarise(total = n(), .groups = 'drop')

# Output the table
kable(summary_df, caption = "Number of records by survey, wave and population")
Number of records by survey, wave and population
survey wave samp total
EPEC 1 Hosts 1807
EPEC 1 Migrants 1256
HFPS 2 Hosts 503
HFPS 2 Migrants 356

Keep in mind these values reflect the number of responses from an unweighted sample of Venezuelan migrants and host population, not the actual migrant and local population. Next, we’ll demonstrate how to use weights to calculate estimates that are more representative.

Configure the survey design

As different surveys present distinct questions, you should select the survey according to the goals of your analysis. For instance, the data for Ecuador relies on the High-Frequency Phone Survey (HFPS) and the 2019 Human Mobility and Host Communities Survey (EPEC, for its acronym in Spanish). Most of the indicators used in the report come from the HFPS, except those referring to job occupations and health insurance, which come from the EPEC survey.

We will select the HFPS survey to compare the average age of Venezuelan migrants and the host population in Ecuador. Let’s start by filtering the dataset to keep only responses from the HFPS.

survey_filter <- survey %>% 
  filter(survey == "HFPS")

Next, we load the survey design and the weights associated with each response. There are a variety of ways to implement weighted data analysis using R. For convenience, we use the function svydesign from survey, an R package with pre-built features tailored for survey analysis.

survey_ecu <- svydesign(ids = ~1, # ~1 means the survey has no clusters
                       data= survey_filter, 
                       weights = survey_filter$weight)

Now, we are ready to produce more accurate estimates about the populations of interest.
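For instance, you can use the design object to estimate the weighted size of each population group, which contrasts with the unweighted respondent counts shown earlier (a brief sketch using svytable).

# Estimated number of people represented by each group (weighted counts)
svytable(~samp, design = survey_ecu)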

Descriptive statistics

Creating basic summary statistics using the survey package is straightforward. Our tutorial shows how to calculate the mean for numeric values and cross-tabulation for categorical variables.

Numeric values

We will group (svyby) the records by the population type (~samp) and calculate the mean age (svymean). The output shows the mean and the standard error (se) for each estimate. Standard error values use the same unit of measurement as the mean and represent how much the calculated sample mean is expected to vary from the actual population mean.

# Group by and calculate the mean age
svyby(formula = ~age, by = ~samp, design = survey_ecu, svymean)
             samp      age        se
Hosts       Hosts 39.88760 0.9410057
Migrants Migrants 36.09989 0.5309258

Categorical variables

We can employ the svytable command to analyze categorical variables. The following code cross-tabulates the population by marital status. Using prop.table(crosstab, 1), we present the values as percentages at the population/row level, so each row sums to 100 (using 2 instead would make each column sum to 100). Additionally, we round the values to two decimal places.

# Cross-tabulates values
crosstab <- svytable(~samp + marital_status, design = survey_ecu)

# Calculate percentages
crosstab_percentages <- round(prop.table(crosstab,1) * 100,2) 
 
# Output the table
knitr::kable(crosstab_percentages, 
             caption = "Crosstab of Population by Marital Status (%)")
Crosstab of Population by Marital Status (%)
Married/Cohabitation Other Single
Hosts 54.32 10.64 35.04
Migrants 51.14 4.29 44.57
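If you instead want each marital status column to sum to 100, change the margin argument of prop.table from 1 to 2, as mentioned above.

# Percentages within each marital status category (columns sum to 100)
round(prop.table(crosstab, 2) * 100, 2)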

Detailing by region

You can also use the column code_province to produce estimates for each region.

mean_age <- svyby(~age, ~samp + code_province, survey_ecu, svymean)

knitr::kable(mean_age, caption = "Mean age by population", row.names=FALSE)
Mean age by population
samp code_province age se
Hosts Azuay 34.28751 2.4472912
Migrants Azuay 38.44558 1.6925252
Hosts Bolivar 35.38478 4.8992903
Hosts Cañar 41.93000 9.4223198
Migrants Cañar 31.39452 4.6921062
Hosts Carchi 45.61607 7.7941828
Migrants Carchi 48.50000 4.5988717
Hosts Chimborazo 39.31556 5.1718474
Migrants Chimborazo 31.05517 2.5762621
Hosts Cotopaxi 48.75018 7.3282210
Migrants Cotopaxi 38.00000 0.0000000
Hosts El Oro 42.98743 5.1123709
Migrants El Oro 34.21006 2.3494369
Hosts Esmeraldas 46.99697 8.6409451
Hosts Galápagos 58.39114 2.0476748
Hosts Guayas 39.80792 1.9244430
Migrants Guayas 33.44946 1.0339923
Hosts Imbabura 42.52794 8.1943990
Migrants Imbabura 31.84827 1.9839182
Hosts Loja 39.73884 5.0654610
Migrants Loja 48.00000 0.0000000
Hosts Los Rios 31.94942 2.5716242
Migrants Los Rios 36.79796 2.1185825
Hosts Manabi 42.21102 3.0035721
Migrants Manabi 33.56610 1.7669829
Hosts Morona Santiago 34.26451 6.6840835
Migrants Morona Santiago 48.00000 0.0000000
Hosts Napo 42.66667 0.9818785
Migrants Napo 41.47055 3.3052052
Hosts Orellana 43.77406 7.5836483
Migrants Orellana 32.44793 1.6388851
Hosts Pastaza 55.01462 5.8836227
Hosts Pichincha 39.85590 1.9423484
Migrants Pichincha 37.38319 0.7775836
Hosts Santa Elena 44.04101 7.0941311
Migrants Santa Elena 39.45244 3.8401079
Hosts Santo Domingo 46.94187 4.9430434
Migrants Santo Domingo 34.19018 3.4954833
Hosts Sucumbíos 25.05930 2.1051501
Hosts Tungurahua 36.16449 3.8003223
Migrants Tungurahua 37.68505 3.2999673
Hosts Zamora Chinchipe 30.66675 2.6596754
Migrants Zamora Chinchipe 44.00000 0.0000000

Note that the standard error increases in regions with fewer observations. Also bear in mind the limitations of using weights calculated at the national level for regional analysis: they might not accurately reflect the unique characteristics of each region, resulting in biased estimates.
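One way to make this uncertainty explicit is to report confidence intervals rather than standard errors alone. A brief sketch, using the generic confint function on the svyby result (95% intervals by default):

# 95% confidence intervals for the regional mean ages
confint(mean_age)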

Conclusion

This tutorial has offered a glimpse into the initial steps for leveraging open-source tools to interpret data on Venezuelan migration. While we have covered essential techniques, the scope for further exploration is vast. We invite you to share any open-source solutions to analyze weighted survey data that might have been overlooked or suggest topics for future tutorials on forced displacement data. You can contact us by email or social media networks (Twitter and LinkedIn).

References and resources

https://github.com/pewresearch/pewmethods: R package developed by the Pew Research Center Methods team for working with survey data.

https://github.com/quantipy/quantipy3/: Python package to read survey data.