JDC Data Tutorial

This tutorial is a resource to help local policymakers develop better-informed policies that promote a mutually beneficial relationship between Venezuelan migrants and the host community.

Introduction

About the report and the microdata

Summary of data sources

We have cleaned and harmonized the data, so you can jump right into the analysis.

How to analyze survey data using open source solutions?

Special features of survey data.

Overview of no-code solutions and programming languages.

Because this tutorial is aimed at beginners, we will focus on descriptive analysis, not exploratory or inferential analysis.

Analyzing the data using R

Introduce the purpose of the following sections. Summarize each one.

Reference resources explaining how to run R code for beginners.

How to interpret this document? Text, code, and outputs.

Setting up

Load libraries

First, we will load the libraries needed for our tutorial.

library(survey)
library(tidyverse)
library(visdat)
library(DataExplorer)
library(dplyr)
library(knitr)
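
If one of the library() calls above fails with "there is no package called ...", that package has to be installed first. The sketch below is one way to list what is missing. The helper name missing_packages is made up for this example, and the install line is left commented out so nothing is downloaded by accident.

```r
# Packages loaded in this tutorial (dplyr is installed as part of tidyverse)
required <- c("survey", "tidyverse", "visdat", "DataExplorer", "knitr")

# Return the packages from `pkgs` that are not installed yet.
# missing_packages is a hypothetical helper, not part of any library.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```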

Read the data

Now, we will read the data and inspect the information available.

You can change the variable file_name to run this code for other countries. We will use the dataset for COUNTRY_X in this tutorial.

The result of the command below shows the number of rows (observations), columns (variables), their respective data types and some sample values.

Explain the table below in detail. Note we have missing values.

file_name <- "../microdata/ecu_host_mig.csv"

survey <- read.csv(file_name)

# Print the structure of the data frame: dimensions, column types and sample values

str(survey)

'data.frame':   3922 obs. of  34 variables:
 $ year               : int  2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
 $ survey             : chr  "EPEC" "EPEC" "EPEC" "EPEC" ...
 $ samp               : chr  "Hosts" "Hosts" "Hosts" "Hosts" ...
 $ wave               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ code_province      : chr  "Azuay" "Azuay" "Azuay" "Azuay" ...
 $ age                : int  NA NA NA NA NA NA NA NA NA NA ...
 $ sex                : chr  "" "" "" "" ...
 $ female             : int  NA NA NA NA NA NA NA NA NA NA ...
 $ male               : int  NA NA NA NA NA NA NA NA NA NA ...
 $ hh_size            : int  NA NA NA NA NA NA NA NA NA NA ...
 $ marital_status     : chr  "" "" "" "" ...
 $ edu_years          : int  NA NA NA NA NA NA NA NA NA NA ...
 $ edu_level          : chr  "" "" "" "" ...
 $ employed           : chr  "" "" "" "" ...
 $ unemployed         : chr  "" "" "" "" ...
 $ inactive           : chr  "" "" "" "" ...
 $ weight             : num  703 949 844 844 718 ...
 $ occupation         : chr  "Services" "Crafts" "Crafts" "Elementary/basic" ...
 $ health_insurance   : chr  "" "" "" "" ...
 $ homologate         : chr  "" "" "" "" ...
 $ reason_nohomologate: chr  "" "" "" "" ...
 $ company            : chr  "" "" "" "" ...
 $ plan_residency     : chr  "" "" "" "" ...
 $ reason_migra       : chr  "" "" "" "" ...
 $ discriminated      : chr  "" "" "" "" ...
 $ integration_3_1    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_2    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_3    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_4    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_5    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_6    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_7    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_8    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ entry_register     : chr  "" "" "" "" ...
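
In the output above, missing numeric values appear as NA, while missing text values appear as empty strings (""). R does not treat an empty string as missing, so the two kinds can be counted differently. The sketch below uses a small made-up CSV (not the real file) to show how the na.strings argument of read.csv changes this. Whether empty strings in the real data should all be read as missing is an assumption to check against the documentation.

```r
# Toy CSV standing in for the real file (values are made up)
tmp <- tempfile(fileext = ".csv")
writeLines(c("samp,sex,age",
             "Hosts,,NA",
             "Migrants,Female,31"), tmp)

# Default behaviour: an empty text field is read as "" (not missing)
raw <- read.csv(tmp)

# Treat both "" and "NA" as missing values
clean <- read.csv(tmp, na.strings = c("", "NA"))

# Number of missing values per column under each reading
colSums(is.na(raw))
colSums(is.na(clean))
```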

Preliminary analysis

Let’s first understand how each file is structured.

We have aggregated different surveys for each country. You can use the variables survey and wave to filter by source. The variable samp indicates whether a record belongs to the host population or to migrants.

Detail other “core” variables.

Inspect missing values

Let’s plot the missing values for each of the surveys available.

visdat::vis_miss(survey, facet = survey)

Interpret the chart.

Refer to the documentation to further understand missing values.
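
A table can complement the chart. The sketch below computes the percentage of missing values per column within each survey, using a small made-up data frame in place of the real data. Note that only NA values are counted, so empty strings in text columns are not included.

```r
library(dplyr)

# Toy data standing in for the real survey (values are made up)
toy <- data.frame(
  survey = c("EPEC", "EPEC", "HFPS", "HFPS"),
  age    = c(NA, NA, 35, 41),
  weight = c(703, 949, 120, NA)
)

# Percentage of NA values in every column, by survey
missing_share <- toy %>%
  group_by(survey) %>%
  summarise(across(everything(), ~ mean(is.na(.x)) * 100), .groups = "drop")

missing_share
```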

Records by surveys and waves

Now, let’s count the records by survey, wave and population.

summary_df <- survey %>% 
  group_by(survey, wave, samp) %>% 
  summarise(total = n(), .groups = 'drop')

# Output the table
kable(summary_df, caption = "Records by survey, wave and population")

Records by survey, wave and population
survey	wave	samp	total
EPEC	1	Hosts	1807
EPEC	1	Migrants	1256
HFPS	2	Hosts	503
HFPS	2	Migrants	356

Select a weighted survey

Different surveys ask different questions, so select the survey that matches the goals of your analysis. For instance, in COUNTRY_X survey_1 is used to ____, and survey_2 is used to ____.

Here, we will compare the average age of hosts and migrants using the HFPS survey.

survey_filter <- survey %>% 
  filter(survey == "HFPS")

Now, we will declare the survey design so that the weight associated with each response is used to produce unbiased estimates.

Briefly explain what survey weights are.

survey_ecu <- svydesign(ids = ~1,          # ~1 means no cluster variable is declared
                        data = survey_filter,
                        weights = ~weight) # column holding the survey weights
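
To see why weights matter, the toy sketch below compares a plain mean with a weighted mean. The numbers are made up: each weight is read as the number of people in the population that one respondent represents. The weighted mean is sum(weight * age) / sum(weight), computed here with the base R function weighted.mean.

```r
# Three made-up respondents
ages    <- c(20, 40, 60)
weights <- c(100, 100, 800)  # the third respondent represents 800 people

# Plain mean: every respondent counts the same
mean(ages)                    # 40

# Weighted mean: respondents count in proportion to their weight
weighted.mean(ages, weights)  # 54

# The same result written out by hand
sum(weights * ages) / sum(weights)
```

The weighted mean is pulled toward 60 because the oldest respondent stands for most of the population.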

Descriptive statistics

Numeric variables

Average age by population.

svyby(~age, ~samp, survey_ecu, svymean)

             samp      age        se
Hosts       Hosts 39.88760 0.9410057
Migrants Migrants 36.09989 0.5309258

Categorical variables

crosstab <- svytable(~samp + marital_status, design = survey_ecu)

# Calculate row percentages (each population sums to 100)
crosstab_percentages <- round(prop.table(crosstab, 1) * 100, 2)

# Display the crosstab
knitr::kable(crosstab_percentages, caption = "Marital status by population (row percentages)")

Marital status by population (row percentages)
	Married/Cohabitation	Other	Single
Hosts	54.32	10.64	35.04
Migrants	51.14	4.29	44.57
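
The second argument of prop.table decides how the percentages are read: 1 divides each cell by its row total, so each population (row) sums to 100. The sketch below shows this on a small made-up table of counts, not on the survey estimates.

```r
# Made-up counts: rows are populations, columns are marital status
counts <- matrix(c(27, 5, 18,
                   20, 2, 18),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(samp = c("Hosts", "Migrants"),
                                 marital_status = c("Married", "Other", "Single")))

# Row percentages: each cell divided by its row total
row_pct <- round(prop.table(counts, 1) * 100, 2)
row_pct

# Each row adds up to 100
rowSums(row_pct)
```

Using 2 instead of 1 would give column percentages, which answer a different question: within each marital status, what share are hosts or migrants.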

Disaggregating by region

Average age by population and region.

mean_age <- svyby(~age, ~samp + code_province, survey_ecu, svymean)

knitr::kable(mean_age, caption = "Mean age by population and province")

Mean age by population and province
	samp	code_province	age	se
Hosts.Azuay	Hosts	Azuay	34.28751	2.4472912
Migrants.Azuay	Migrants	Azuay	38.44558	1.6925252
Hosts.Bolivar	Hosts	Bolivar	35.38478	4.8992903
Hosts.Cañar	Hosts	Cañar	41.93000	9.4223198
Migrants.Cañar	Migrants	Cañar	31.39452	4.6921062
Hosts.Carchi	Hosts	Carchi	45.61607	7.7941828
Migrants.Carchi	Migrants	Carchi	48.50000	4.5988717
Hosts.Chimborazo	Hosts	Chimborazo	39.31556	5.1718474
Migrants.Chimborazo	Migrants	Chimborazo	31.05517	2.5762621
Hosts.Cotopaxi	Hosts	Cotopaxi	48.75018	7.3282210
Migrants.Cotopaxi	Migrants	Cotopaxi	38.00000	0.0000000
Hosts.El Oro	Hosts	El Oro	42.98743	5.1123709
Migrants.El Oro	Migrants	El Oro	34.21006	2.3494369
Hosts.Esmeraldas	Hosts	Esmeraldas	46.99697	8.6409451
Hosts.Galápagos	Hosts	Galápagos	58.39114	2.0476748
Hosts.Guayas	Hosts	Guayas	39.80792	1.9244430
Migrants.Guayas	Migrants	Guayas	33.44946	1.0339923
Hosts.Imbabura	Hosts	Imbabura	42.52794	8.1943990
Migrants.Imbabura	Migrants	Imbabura	31.84827	1.9839182
Hosts.Loja	Hosts	Loja	39.73884	5.0654610
Migrants.Loja	Migrants	Loja	48.00000	0.0000000
Hosts.Los Rios	Hosts	Los Rios	31.94942	2.5716242
Migrants.Los Rios	Migrants	Los Rios	36.79796	2.1185825
Hosts.Manabi	Hosts	Manabi	42.21102	3.0035721
Migrants.Manabi	Migrants	Manabi	33.56610	1.7669829
Hosts.Morona Santiago	Hosts	Morona Santiago	34.26451	6.6840835
Migrants.Morona Santiago	Migrants	Morona Santiago	48.00000	0.0000000
Hosts.Napo	Hosts	Napo	42.66667	0.9818785
Migrants.Napo	Migrants	Napo	41.47055	3.3052052
Hosts.Orellana	Hosts	Orellana	43.77406	7.5836483
Migrants.Orellana	Migrants	Orellana	32.44793	1.6388851
Hosts.Pastaza	Hosts	Pastaza	55.01462	5.8836227
Hosts.Pichincha	Hosts	Pichincha	39.85590	1.9423484
Migrants.Pichincha	Migrants	Pichincha	37.38319	0.7775836
Hosts.Santa Elena	Hosts	Santa Elena	44.04101	7.0941311
Migrants.Santa Elena	Migrants	Santa Elena	39.45244	3.8401079
Hosts.Santo Domingo	Hosts	Santo Domingo	46.94187	4.9430434
Migrants.Santo Domingo	Migrants	Santo Domingo	34.19018	3.4954833
Hosts.Sucumbíos	Hosts	Sucumbíos	25.05930	2.1051501
Hosts.Tungurahua	Hosts	Tungurahua	36.16449	3.8003223
Migrants.Tungurahua	Migrants	Tungurahua	37.68505	3.2999673
Hosts.Zamora Chinchipe	Hosts	Zamora Chinchipe	30.66675	2.6596754
Migrants.Zamora Chinchipe	Migrants	Zamora Chinchipe	44.00000	0.0000000

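Note that the standard error tends to be larger in provinces with fewer observations. A standard error of exactly 0 (for example, Migrants in Cotopaxi or Loja) most likely reflects a single observation or identical ages in that cell, not a precise estimate. Bear in mind the limitations of using weights calculated at the national level for regional analysis.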
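
One way to spot unreliable cells is to count the unweighted observations behind each estimate. The sketch below does this on a small made-up data frame. The threshold min_obs is a hypothetical value chosen for illustration, not a recommendation from the survey documentation.

```r
library(dplyr)

# Toy records standing in for the real survey (values are made up)
toy <- data.frame(
  samp          = c("Hosts", "Hosts", "Hosts", "Migrants", "Hosts", "Migrants"),
  code_province = c("Guayas", "Guayas", "Guayas", "Guayas", "Loja", "Loja")
)

# Hypothetical minimum number of observations for a cell to be reported
min_obs <- 2

# Unweighted number of records per population and province
cell_sizes <- toy %>%
  count(samp, code_province, name = "n_obs")

# Cells with too few observations to interpret with confidence
small_cells <- cell_sizes %>%
  filter(n_obs < min_obs)

small_cells
```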

References and resources

https://github.com/pewresearch/pewmethods R package developed by the Pew Research Center Methods team for working with survey data.

https://github.com/quantipy/quantipy3/ Python package to read survey data.