JDC Data Tutorial

This tutorial is a resource to help local policymakers develop better-informed policies that promote a mutually beneficial relationship between Venezuelan migrants and their host communities.

Introduction

About the report and the microdata

Summary of data sources

We have cleaned and harmonized the data, so you can jump right into the analysis.

How to analyze survey data using open source solutions?

Special features of survey data.

Overview of no-code solutions and programming languages.

Because this tutorial is aimed at beginners, we will cover descriptive analysis only, not exploratory or inferential analysis.

Analyzing the data using R

Introduce the purpose of the following sections. Summarize each one.

Reference resources explaining how to run R code for beginners.

How to interpret this document? Text, code, and outputs.

Setting up

Load libraries

First, we will load the libraries needed for our tutorial.

library(survey)
library(tidyverse)
library(visdat)
library(DataExplorer)
library(dplyr)
library(knitr)
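
If one of these packages is not installed yet, library() stops with an error. A minimal base-R sketch for finding out which ones are missing before loading them; the helper name missing_packages is hypothetical and not part of any package:

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("survey", "tidyverse", "visdat", "DataExplorer", "knitr")
to_install <- missing_packages(needed)

# Uncomment to install whatever is missing (requires an internet connection).
# if (length(to_install) > 0) install.packages(to_install)
print(to_install)
```

An empty result means every package in the list is already available.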

Read the data

Now, we will read the data and inspect the information available.

You can change the variable file_name to run this code with the dataset of another country. We will use the dataset for COUNTRY_X in this tutorial.

The result of the command below shows the number of rows (observations), columns (variables), their respective data types and some sample values.

Explain the table below in detail. Note we have missing values.

file_name <- "../microdata/ecu_host_mig.csv"

survey <- read.csv(file_name)

# Print the structure of the survey data frame
str(survey)
'data.frame':   3922 obs. of  34 variables:
 $ year               : int  2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
 $ survey             : chr  "EPEC" "EPEC" "EPEC" "EPEC" ...
 $ samp               : chr  "Hosts" "Hosts" "Hosts" "Hosts" ...
 $ wave               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ code_province      : chr  "Azuay" "Azuay" "Azuay" "Azuay" ...
 $ age                : int  NA NA NA NA NA NA NA NA NA NA ...
 $ sex                : chr  "" "" "" "" ...
 $ female             : int  NA NA NA NA NA NA NA NA NA NA ...
 $ male               : int  NA NA NA NA NA NA NA NA NA NA ...
 $ hh_size            : int  NA NA NA NA NA NA NA NA NA NA ...
 $ marital_status     : chr  "" "" "" "" ...
 $ edu_years          : int  NA NA NA NA NA NA NA NA NA NA ...
 $ edu_level          : chr  "" "" "" "" ...
 $ employed           : chr  "" "" "" "" ...
 $ unemployed         : chr  "" "" "" "" ...
 $ inactive           : chr  "" "" "" "" ...
 $ weight             : num  703 949 844 844 718 ...
 $ occupation         : chr  "Services" "Crafts" "Crafts" "Elementary/basic" ...
 $ health_insurance   : chr  "" "" "" "" ...
 $ homologate         : chr  "" "" "" "" ...
 $ reason_nohomologate: chr  "" "" "" "" ...
 $ company            : chr  "" "" "" "" ...
 $ plan_residency     : chr  "" "" "" "" ...
 $ reason_migra       : chr  "" "" "" "" ...
 $ discriminated      : chr  "" "" "" "" ...
 $ integration_3_1    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_2    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_3    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_4    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_5    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_6    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_7    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ integration_3_8    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ entry_register     : chr  "" "" "" "" ...
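
In the output above, character columns such as sex and marital_status hold empty strings ("") rather than NA, because by default read.csv() treats blank fields as missing only in logical and numeric columns. Functions that count NA values will therefore overlook them. One possible way to handle this, sketched on a small made-up data frame (toy and empty_to_na are illustrative names, not part of the tutorial data):

```r
# Toy data mimicking the pattern above: "" in character columns, NA in numeric ones.
toy <- data.frame(
  samp = c("Hosts", "Hosts", "Migrants"),
  sex  = c("", "", "Female"),
  age  = c(NA, NA, 31L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn empty strings into NA in every character column.
empty_to_na <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(x) replace(x, which(x == ""), NA_character_))
  df
}

cleaned <- empty_to_na(toy)
colSums(is.na(cleaned))  # number of missing values per column
```

Alternatively, read.csv(file_name, na.strings = c("", "NA")) applies the same rule while the file is being read.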

Preliminary analysis

Let’s first understand how each file is structured.

We have aggregated different surveys for each country. You can use the variables survey and wave to filter by source. The variable samp indicates whether a record comes from the host population or from migrants.
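
As a sketch of how these three variables can be combined, the example below keeps only migrant records from one survey and wave. It runs on a small made-up data frame (toy) so that it is self-contained; with the real data, the same filter() call would be applied to survey.

```r
library(dplyr)

# Made-up records using the same three identifying variables.
toy <- data.frame(
  survey = c("EPEC", "EPEC", "HFPS", "HFPS"),
  wave   = c(1L, 1L, 2L, 2L),
  samp   = c("Hosts", "Migrants", "Hosts", "Migrants"),
  stringsAsFactors = FALSE
)

# Keep migrant records from wave 2 of the HFPS survey.
hfps_migrants <- toy %>%
  filter(survey == "HFPS", wave == 2, samp == "Migrants")

hfps_migrants
```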

Detail other “core” variables.

Inspect missing values

Let’s plot the missing values for each of the surveys available.

visdat::vis_miss(survey, facet = survey)
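
The same information can also be summarized as numbers. A minimal base-R sketch, on made-up data, that computes the share of missing values per variable within each survey. It only counts NA, so empty strings in character columns are not included:

```r
# Made-up data: age is entirely missing in one survey and complete in the other.
toy <- data.frame(
  survey = c("EPEC", "EPEC", "HFPS", "HFPS"),
  age    = c(NA, NA, 40, 35),
  weight = c(700, 900, 120, 80),
  stringsAsFactors = FALSE
)

# For each survey, the proportion of NA values in every column.
missing_share <- sapply(split(toy, toy$survey), function(d) colMeans(is.na(d)))

round(missing_share * 100, 1)  # rows are variables, columns are surveys
```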

Interpret the chart.

Refer to the documentation to further understand missing values.

Records by surveys and waves

Now, let’s count the records by survey, wave, and population.

summary_df <- survey %>% 
  group_by(survey, wave, samp) %>% 
  summarise(total = n(), .groups = 'drop')

# Output the table
kable(summary_df, caption = "Records by survey, wave and population")
Records by survey, wave and population

survey  wave  samp      total
EPEC       1  Hosts      1807
EPEC       1  Migrants   1256
HFPS       2  Hosts       503
HFPS       2  Migrants    356

Select a weighted survey

Different surveys ask different questions, so choose the survey that matches the goals of your analysis. For instance, in COUNTRY_X survey_1 is used to ____ , survey_2 is used to ____.

Here, we will compare the average age.

survey_filter <- survey %>% 
  filter(survey == "HFPS")

Now, we will declare the survey design so that the weight attached to each response is used when computing estimates. Without the weights, the estimates would describe the sample rather than the population.

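Survey weights indicate how many people in the target population each respondent represents. Using them corrects for groups that are over- or under-represented in the sample.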

survey_ecu <- svydesign(ids = ~1,           # ~1 means the survey has no clusters
                        data = survey_filter,
                        weights = ~weight)  # column holding the weight of each response
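
To see what the weights do in practice, consider a made-up sample of four respondents. The ages and weights below are purely illustrative and do not come from the survey:

```r
age    <- c(20, 30, 40, 50)
weight <- c(100, 100, 100, 700)  # the last respondent stands for 700 people

mean(age)                   # unweighted: every respondent counts equally (35)
weighted.mean(age, weight)  # weighted: respondents count in proportion to their weight (44)
```

svymean() applies the same idea and additionally uses the declared design to compute standard errors.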

Descriptive statistics

Numeric variables

Average age by population.

svyby(~age, ~samp, survey_ecu, svymean)
             samp      age        se
Hosts       Hosts 39.88760 0.9410057
Migrants Migrants 36.09989 0.5309258
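
The se column can be turned into confidence intervals with confint(), which the survey package provides for svyby() results. A small sketch on made-up data; the toy data frame and its weights are assumptions for illustration only:

```r
library(survey)

# Made-up respondents: four hosts and four migrants with illustrative weights.
toy <- data.frame(
  samp   = rep(c("Hosts", "Migrants"), each = 4),
  age    = c(30, 40, 50, 60, 25, 30, 35, 40),
  weight = c(1, 2, 1, 2, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

design <- svydesign(ids = ~1, data = toy, weights = ~weight)

est <- svyby(~age, ~samp, design, svymean)  # weighted mean and standard error per group
ci  <- confint(est)                         # 95% confidence interval per group

est
ci
```

With the tutorial data, the equivalent call would be confint(svyby(~age, ~samp, survey_ecu, svymean)).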

Categorical variables

crosstab <- svytable(~samp + marital_status, design = survey_ecu)

# Calculate percentages
crosstab_percentages <- round(prop.table(crosstab,1) * 100,2)

# Display the crosstab
knitr::kable(crosstab_percentages, caption = "Marital status by population (row percentages)")
Marital status by population (row percentages)

          Married/Cohabitation  Other  Single
Hosts                    54.32  10.64   35.04
Migrants                 51.14   4.29   44.57
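
The second argument of prop.table() controls the direction of the percentages: 1 divides each cell by its row total, so the values for each population group add up to 100. A base-R sketch on made-up data, where xtabs() sums the weights in each cell, much as svytable() does:

```r
# Made-up respondents with illustrative weights.
toy <- data.frame(
  samp   = c("Hosts", "Hosts", "Hosts", "Migrants", "Migrants"),
  status = c("Single", "Married", "Married", "Single", "Single"),
  weight = c(10, 20, 10, 5, 15),
  stringsAsFactors = FALSE
)

# Sum of weights in each samp x status cell.
weighted_counts <- xtabs(weight ~ samp + status, data = toy)

# Row percentages: each cell divided by its row total.
row_pct <- round(prop.table(weighted_counts, 1) * 100, 2)

row_pct
rowSums(row_pct)  # each row adds up to 100
```

Passing 2 instead of 1 would give column percentages, which answer a different question: the share of hosts and migrants within each marital status.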

Grouping by region

Average age by population and region.

mean_age <- svyby(~age, ~samp + code_province, survey_ecu, svymean)

knitr::kable(mean_age, caption = "Mean age by population and province")
Mean age by population and province

                           samp      code_province          age         se
Hosts.Azuay                Hosts     Azuay             34.28751  2.4472912
Migrants.Azuay             Migrants  Azuay             38.44558  1.6925252
Hosts.Bolivar              Hosts     Bolivar           35.38478  4.8992903
Hosts.Cañar                Hosts     Cañar             41.93000  9.4223198
Migrants.Cañar             Migrants  Cañar             31.39452  4.6921062
Hosts.Carchi               Hosts     Carchi            45.61607  7.7941828
Migrants.Carchi            Migrants  Carchi            48.50000  4.5988717
Hosts.Chimborazo           Hosts     Chimborazo        39.31556  5.1718474
Migrants.Chimborazo        Migrants  Chimborazo        31.05517  2.5762621
Hosts.Cotopaxi             Hosts     Cotopaxi          48.75018  7.3282210
Migrants.Cotopaxi          Migrants  Cotopaxi          38.00000  0.0000000
Hosts.El Oro               Hosts     El Oro            42.98743  5.1123709
Migrants.El Oro            Migrants  El Oro            34.21006  2.3494369
Hosts.Esmeraldas           Hosts     Esmeraldas        46.99697  8.6409451
Hosts.Galápagos            Hosts     Galápagos         58.39114  2.0476748
Hosts.Guayas               Hosts     Guayas            39.80792  1.9244430
Migrants.Guayas            Migrants  Guayas            33.44946  1.0339923
Hosts.Imbabura             Hosts     Imbabura          42.52794  8.1943990
Migrants.Imbabura          Migrants  Imbabura          31.84827  1.9839182
Hosts.Loja                 Hosts     Loja              39.73884  5.0654610
Migrants.Loja              Migrants  Loja              48.00000  0.0000000
Hosts.Los Rios             Hosts     Los Rios          31.94942  2.5716242
Migrants.Los Rios          Migrants  Los Rios          36.79796  2.1185825
Hosts.Manabi               Hosts     Manabi            42.21102  3.0035721
Migrants.Manabi            Migrants  Manabi            33.56610  1.7669829
Hosts.Morona Santiago      Hosts     Morona Santiago   34.26451  6.6840835
Migrants.Morona Santiago   Migrants  Morona Santiago   48.00000  0.0000000
Hosts.Napo                 Hosts     Napo              42.66667  0.9818785
Migrants.Napo              Migrants  Napo              41.47055  3.3052052
Hosts.Orellana             Hosts     Orellana          43.77406  7.5836483
Migrants.Orellana          Migrants  Orellana          32.44793  1.6388851
Hosts.Pastaza              Hosts     Pastaza           55.01462  5.8836227
Hosts.Pichincha            Hosts     Pichincha         39.85590  1.9423484
Migrants.Pichincha         Migrants  Pichincha         37.38319  0.7775836
Hosts.Santa Elena          Hosts     Santa Elena       44.04101  7.0941311
Migrants.Santa Elena       Migrants  Santa Elena       39.45244  3.8401079
Hosts.Santo Domingo        Hosts     Santo Domingo     46.94187  4.9430434
Migrants.Santo Domingo     Migrants  Santo Domingo     34.19018  3.4954833
Hosts.Sucumbíos            Hosts     Sucumbíos         25.05930  2.1051501
Hosts.Tungurahua           Hosts     Tungurahua        36.16449  3.8003223
Migrants.Tungurahua        Migrants  Tungurahua        37.68505  3.2999673
Hosts.Zamora Chinchipe     Hosts     Zamora Chinchipe  30.66675  2.6596754
Migrants.Zamora Chinchipe  Migrants  Zamora Chinchipe  44.00000  0.0000000

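Note that the standard error tends to be larger in provinces with fewer observations. A standard error of exactly 0 (for example, Migrants in Cotopaxi) most likely means the estimate rests on a single respondent or on respondents of identical age, so it should not be read as a precise estimate. Bear in mind, too, the limitations of using weights calculated at the national level in a regional analysis.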
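
One way to spot fragile estimates is to count the unweighted observations behind each group. A base-R sketch on made-up data; the cutoff min_n is a hypothetical choice for illustration, not a recommendation from the survey documentation:

```r
# Made-up records: Loja has a single observation for each population.
toy <- data.frame(
  samp          = c("Hosts", "Hosts", "Hosts", "Migrants", "Migrants", "Migrants", "Migrants"),
  code_province = c("Guayas", "Guayas", "Loja", "Guayas", "Guayas", "Guayas", "Loja"),
  stringsAsFactors = FALSE
)

min_n <- 2  # hypothetical cutoff; choose one suited to your analysis

# Unweighted number of records in each samp x province cell.
cell_sizes <- as.data.frame(
  table(samp = toy$samp, code_province = toy$code_province),
  responseName = "n"
)

# Cells with fewer than min_n records.
small_cells <- cell_sizes[cell_sizes$n < min_n, ]
small_cells
```

Estimates for the flagged groups can then be reported with a caveat or left out of the table.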

References and resources

https://github.com/pewresearch/pewmethods R package developed by the Pew Research Center Methods team for working with survey data.

https://github.com/quantipy/quantipy3/ Python package to read survey data.