library(survey)
library(tidyverse)
library(visdat)
library(DataExplorer)
library(dplyr)
library(knitr)
JDC Data Tutorial
This tutorial is a resource to help local policy makers develop better-informed policies that promote a mutually beneficial relationship between Venezuelan migrants and the host community.
Introduction
About the report and the microdata
Summary of data sources
We have cleaned and harmonized the data, so you can jump right into the analysis.
How to analyze survey data using open source solutions?
Special features of survey data.
Overview of no-code solutions and programming languages.
Because this tutorial is aimed at beginners, we will focus on descriptive analysis, not exploratory or inferential analysis.
Analyzing the data using R
Introduce the purpose of the following sections. Summarize each one.
Reference resources explaining how to run R code for beginners.
How to interpret this document? Text, code, and outputs.
Setting up
Load libraries
First, we will load the libraries needed for our tutorial.
Read the data
Now, we will read the data and inspect the information available.
You can change the variable file_name to run this code with the datasets of other countries. We will use the dataset for COUNTRY_X in this tutorial.
The result of the command below shows the number of rows (observations), columns (variables), their respective data types and some sample values.
Explain the table below in detail. Note we have missing values.
file_name = "../microdata/ecu_host_mig.csv"
survey = read.csv(file_name)
# Print the survey STRucture
str(survey)
'data.frame': 3922 obs. of 34 variables:
$ year : int 2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
$ survey : chr "EPEC" "EPEC" "EPEC" "EPEC" ...
$ samp : chr "Hosts" "Hosts" "Hosts" "Hosts" ...
$ wave : int 1 1 1 1 1 1 1 1 1 1 ...
$ code_province : chr "Azuay" "Azuay" "Azuay" "Azuay" ...
$ age : int NA NA NA NA NA NA NA NA NA NA ...
$ sex : chr "" "" "" "" ...
$ female : int NA NA NA NA NA NA NA NA NA NA ...
$ male : int NA NA NA NA NA NA NA NA NA NA ...
$ hh_size : int NA NA NA NA NA NA NA NA NA NA ...
$ marital_status : chr "" "" "" "" ...
$ edu_years : int NA NA NA NA NA NA NA NA NA NA ...
$ edu_level : chr "" "" "" "" ...
$ employed : chr "" "" "" "" ...
$ unemployed : chr "" "" "" "" ...
$ inactive : chr "" "" "" "" ...
$ weight : num 703 949 844 844 718 ...
$ occupation : chr "Services" "Crafts" "Crafts" "Elementary/basic" ...
$ health_insurance : chr "" "" "" "" ...
$ homologate : chr "" "" "" "" ...
$ reason_nohomologate: chr "" "" "" "" ...
$ company : chr "" "" "" "" ...
$ plan_residency : chr "" "" "" "" ...
$ reason_migra : chr "" "" "" "" ...
$ discriminated : chr "" "" "" "" ...
$ integration_3_1 : int NA NA NA NA NA NA NA NA NA NA ...
$ integration_3_2 : int NA NA NA NA NA NA NA NA NA NA ...
$ integration_3_3 : int NA NA NA NA NA NA NA NA NA NA ...
$ integration_3_4 : int NA NA NA NA NA NA NA NA NA NA ...
$ integration_3_5 : int NA NA NA NA NA NA NA NA NA NA ...
$ integration_3_6 : int NA NA NA NA NA NA NA NA NA NA ...
$ integration_3_7 : int NA NA NA NA NA NA NA NA NA NA ...
$ integration_3_8 : int NA NA NA NA NA NA NA NA NA NA ...
$ entry_register : chr "" "" "" "" ...
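In the output above, character columns such as sex show empty strings ("") rather than NA, so functions that count NA values will not treat them as missing. A minimal sketch of one way to handle this, assuming the empty strings do stand for unanswered questions, is to pass na.strings to read.csv. The toy file and its values below are hypothetical and only stand in for the microdata.

```r
# Toy CSV standing in for the microdata file (hypothetical values)
toy_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "samp,age,sex",
  "Hosts,,",
  "Hosts,34,Female",
  "Migrants,29,",
  "Migrants,41,Male"
), toy_csv)

# Default read: blank numeric fields become NA, blank text fields stay ""
raw <- read.csv(toy_csv)

# Treat empty strings as missing when reading the file
clean <- read.csv(toy_csv, na.strings = c("", "NA"))

# Percentage of missing values per column
missing_raw <- round(colMeans(is.na(raw)) * 100, 1)
missing_clean <- round(colMeans(is.na(clean)) * 100, 1)
print(missing_raw)
print(missing_clean)
```

With the default read, the sex column reports no missing values even though half of its entries are blank; with na.strings the blanks are counted.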
Preliminary analysis
Let’s first understand how each file is structured.
We have aggregated different surveys for each country. You can use the variables survey and wave to filter different sources. The variable samp indicates whether the records are from the host population or from migrants.
Detail other “core” variables.
Inspect missing values
Let’s plot the missing values for each of the surveys available.
visdat::vis_miss(survey, facet = survey)
Interpret the chart.
Refer to the documentation to further understand missing values.
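A numeric summary can complement the chart. The sketch below uses base R and a small hypothetical data frame (not the real microdata) to compute the percentage of missing values per column within each survey. The column names mirror the tutorial's, but the values are invented for illustration.

```r
# Hypothetical records: one survey lacks age, the other lacks occupation
toy <- data.frame(
  survey = c("EPEC", "EPEC", "HFPS", "HFPS"),
  age = c(NA, NA, 35, 42),
  occupation = c("Services", "Crafts", NA, NA)
)

# Split the columns of interest by survey, then compute the
# percentage of NA values in each column of each piece
by_survey <- split(toy[, c("age", "occupation")], toy$survey)
pct_missing <- sapply(by_survey, function(d) colMeans(is.na(d)) * 100)

# Rows are variables, columns are surveys
print(pct_missing)
```

A table like this makes it easy to see which variables are usable in which survey before choosing one for the analysis.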
Records by surveys and waves
Now, let’s check the total of records by survey, wave and population.
summary_df <- survey %>%
  group_by(survey, wave, samp) %>%
  summarise(total = n(), .groups = 'drop')
# Output the table
kable(summary_df, caption = "Records by survey, wave and population")
| survey | wave | samp | total |
|---|---|---|---|
| EPEC | 1 | Hosts | 1807 |
| EPEC | 1 | Migrants | 1256 |
| HFPS | 2 | Hosts | 503 |
| HFPS | 2 | Migrants | 356 |
Select a weighted survey
Different surveys ask different questions, so choose the survey that matches the goals of your analysis. For instance, in COUNTRY_X survey_1 is used to ____ , survey_2 is used to ____.
Here, we will compare the average age.
survey_filter <- survey %>%
  filter(survey == "HFPS")
Now, we will load the survey design and make use of the weights associated with each response to produce unbiased estimates.
Briefly explain what survey weights are.
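As a rough illustration of the idea: a survey weight indicates how many people in the population a respondent stands for, so a weighted mean counts each answer in proportion to its weight. The ages and weights below are made up for the example and do not come from the microdata.

```r
# Three hypothetical respondents; the third represents many more people
age <- c(30, 40, 50)
weight <- c(100, 100, 800)

# Unweighted mean treats every respondent equally
unweighted <- mean(age)

# Weighted mean: sum of (value * weight) divided by the sum of weights
weighted <- sum(age * weight) / sum(weight)

# Base R's weighted.mean() performs the same calculation
stopifnot(isTRUE(all.equal(weighted, weighted.mean(age, weight))))

print(c(unweighted = unweighted, weighted = weighted))
```

The weighted mean is pulled toward the respondent with the largest weight, which is why ignoring weights can misrepresent the population.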
survey_ecu <- svydesign(ids = ~1, # ~1 means the survey has no clusters
                        data = survey_filter,
                        weights = survey_filter$weight)
Descriptive statistics
Numeric values
Average age by population.
svyby(~age, ~samp, survey_ecu, svymean)
             samp      age        se
Hosts       Hosts 39.88760 0.9410057
Migrants Migrants 36.09989 0.5309258
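The se column can be turned into an approximate 95% confidence interval. The sketch below re-enters the estimates printed above and uses a normal approximation (the estimate plus or minus about 1.96 standard errors). This is a simplification for illustration, not the only way to build intervals for survey estimates.

```r
# Estimates and standard errors copied from the output above
estimates <- data.frame(
  samp = c("Hosts", "Migrants"),
  age  = c(39.88760, 36.09989),
  se   = c(0.9410057, 0.5309258)
)

# Critical value for a two-sided 95% interval under a normal approximation
z <- qnorm(0.975)

estimates$ci_lower <- estimates$age - z * estimates$se
estimates$ci_upper <- estimates$age + z * estimates$se

print(estimates)
```

With these numbers, the lower bound for hosts lies above the upper bound for migrants, so the two intervals do not overlap.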
Categorical variables
crosstab <- svytable(~samp + marital_status, design = survey_ecu)
# Calculate percentages
crosstab_percentages <- round(prop.table(crosstab,1) * 100,2)
# Display the crosstab
knitr::kable(crosstab_percentages, caption = "Crosstab of Population by Marital Status")
| | Married/Cohabitation | Other | Single |
|---|---|---|---|
| Hosts | 54.32 | 10.64 | 35.04 |
| Migrants | 51.14 | 4.29 | 44.57 |
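The second argument of prop.table decides how the percentages are read: 1 makes each row sum to 100 (the share of each marital status within hosts and within migrants), while 2 makes each column sum to 100. The sketch below shows the difference on hypothetical data, using base R's xtabs to sum weights per cell as a simple stand-in; it does not use the survey design object from the tutorial.

```r
# Hypothetical respondents with survey weights
toy <- data.frame(
  samp = c("Hosts", "Hosts", "Hosts", "Migrants", "Migrants"),
  marital_status = c("Single", "Married/Cohabitation", "Married/Cohabitation",
                     "Single", "Married/Cohabitation"),
  weight = c(100, 200, 100, 150, 50)
)

# Weighted crosstab: each cell holds the sum of weights
weighted_tab <- xtabs(weight ~ samp + marital_status, data = toy)

# Row percentages: distribution of marital status within each population
row_pct <- round(prop.table(weighted_tab, 1) * 100, 2)

# Column percentages: share of each population within each marital status
col_pct <- round(prop.table(weighted_tab, 2) * 100, 2)

print(row_pct)
print(col_pct)
```

Row percentages answer "what share of migrants are single?", whereas column percentages answer "what share of single people are migrants?".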
Filtering by region
Average age by population and region.
mean_age <- svyby(~age, ~samp + code_province, survey_ecu, svymean)
knitr::kable(mean_age, caption = "Mean age by population and province")
| | samp | code_province | age | se |
|---|---|---|---|---|
| Hosts.Azuay | Hosts | Azuay | 34.28751 | 2.4472912 |
| Migrants.Azuay | Migrants | Azuay | 38.44558 | 1.6925252 |
| Hosts.Bolivar | Hosts | Bolivar | 35.38478 | 4.8992903 |
| Hosts.Cañar | Hosts | Cañar | 41.93000 | 9.4223198 |
| Migrants.Cañar | Migrants | Cañar | 31.39452 | 4.6921062 |
| Hosts.Carchi | Hosts | Carchi | 45.61607 | 7.7941828 |
| Migrants.Carchi | Migrants | Carchi | 48.50000 | 4.5988717 |
| Hosts.Chimborazo | Hosts | Chimborazo | 39.31556 | 5.1718474 |
| Migrants.Chimborazo | Migrants | Chimborazo | 31.05517 | 2.5762621 |
| Hosts.Cotopaxi | Hosts | Cotopaxi | 48.75018 | 7.3282210 |
| Migrants.Cotopaxi | Migrants | Cotopaxi | 38.00000 | 0.0000000 |
| Hosts.El Oro | Hosts | El Oro | 42.98743 | 5.1123709 |
| Migrants.El Oro | Migrants | El Oro | 34.21006 | 2.3494369 |
| Hosts.Esmeraldas | Hosts | Esmeraldas | 46.99697 | 8.6409451 |
| Hosts.Galápagos | Hosts | Galápagos | 58.39114 | 2.0476748 |
| Hosts.Guayas | Hosts | Guayas | 39.80792 | 1.9244430 |
| Migrants.Guayas | Migrants | Guayas | 33.44946 | 1.0339923 |
| Hosts.Imbabura | Hosts | Imbabura | 42.52794 | 8.1943990 |
| Migrants.Imbabura | Migrants | Imbabura | 31.84827 | 1.9839182 |
| Hosts.Loja | Hosts | Loja | 39.73884 | 5.0654610 |
| Migrants.Loja | Migrants | Loja | 48.00000 | 0.0000000 |
| Hosts.Los Rios | Hosts | Los Rios | 31.94942 | 2.5716242 |
| Migrants.Los Rios | Migrants | Los Rios | 36.79796 | 2.1185825 |
| Hosts.Manabi | Hosts | Manabi | 42.21102 | 3.0035721 |
| Migrants.Manabi | Migrants | Manabi | 33.56610 | 1.7669829 |
| Hosts.Morona Santiago | Hosts | Morona Santiago | 34.26451 | 6.6840835 |
| Migrants.Morona Santiago | Migrants | Morona Santiago | 48.00000 | 0.0000000 |
| Hosts.Napo | Hosts | Napo | 42.66667 | 0.9818785 |
| Migrants.Napo | Migrants | Napo | 41.47055 | 3.3052052 |
| Hosts.Orellana | Hosts | Orellana | 43.77406 | 7.5836483 |
| Migrants.Orellana | Migrants | Orellana | 32.44793 | 1.6388851 |
| Hosts.Pastaza | Hosts | Pastaza | 55.01462 | 5.8836227 |
| Hosts.Pichincha | Hosts | Pichincha | 39.85590 | 1.9423484 |
| Migrants.Pichincha | Migrants | Pichincha | 37.38319 | 0.7775836 |
| Hosts.Santa Elena | Hosts | Santa Elena | 44.04101 | 7.0941311 |
| Migrants.Santa Elena | Migrants | Santa Elena | 39.45244 | 3.8401079 |
| Hosts.Santo Domingo | Hosts | Santo Domingo | 46.94187 | 4.9430434 |
| Migrants.Santo Domingo | Migrants | Santo Domingo | 34.19018 | 3.4954833 |
| Hosts.Sucumbíos | Hosts | Sucumbíos | 25.05930 | 2.1051501 |
| Hosts.Tungurahua | Hosts | Tungurahua | 36.16449 | 3.8003223 |
| Migrants.Tungurahua | Migrants | Tungurahua | 37.68505 | 3.2999673 |
| Hosts.Zamora Chinchipe | Hosts | Zamora Chinchipe | 30.66675 | 2.6596754 |
| Migrants.Zamora Chinchipe | Migrants | Zamora Chinchipe | 44.00000 | 0.0000000 |
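Note that the standard error grows in provinces with fewer observations. A standard error of exactly 0 (for example, Migrants in Cotopaxi, Loja, Morona Santiago and Zamora Chinchipe) most likely means the estimate rests on a single respondent, so it should not be read as a precise value. Also bear in mind the limitations of using weights calculated at the national level in a regional analysis.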
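One way to act on this caution is to count the unweighted number of respondents behind each estimate and flag small cells before interpreting them. The sketch below uses base R on hypothetical records; the threshold min_n is an illustrative assumption, set very low to suit the toy data, and a real analysis would need a larger, justified value.

```r
# Hypothetical records: population group and province of each respondent
toy <- data.frame(
  samp = c("Hosts", "Hosts", "Hosts", "Migrants", "Migrants", "Migrants"),
  code_province = c("Azuay", "Azuay", "Azuay", "Azuay", "Azuay", "Loja")
)

# Count unweighted respondents per population and province
counts <- aggregate(n ~ samp + code_province,
                    data = transform(toy, n = 1L),
                    FUN = sum)

# Illustrative threshold (assumption): flag cells with fewer than min_n records
min_n <- 2
counts$small_sample <- counts$n < min_n

print(counts)
```

Estimates for flagged cells can then be suppressed or reported with an explicit warning.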
References and resources
https://github.com/pewresearch/pewmethods : R package developed by the Pew Research Center Methods team for working with survey data.
https://github.com/quantipy/quantipy3/ : Python package to read survey data.