Introduction

Given the extraordinary surge in migration in recent years, it is interesting to find out whether the demographics of migrant groups have changed. The data presented here does not account for refugees or population outflows, but it does provide the opportunity to explore some macro trends in migration. In this case, the trend of interest is gender.

The guiding research question is: Is there convincing evidence that the world has seen a change in its gender proportion of migrants between 1990 and 2019?

In order to address this question, we postulate the following hypotheses:

\[H_0: \text{The gender proportion has not changed} \ (p_{1990} = p_{2019} \rightarrow p_{2019} - p_{1990} = 0 \rightarrow p_{diff} = 0)\]

\[H_a: \text{The gender proportion has changed} \ (p_{1990} \neq p_{2019} \rightarrow p_{2019} - p_{1990} \neq 0 \rightarrow p_{diff} \neq 0)\]

Data

Data Description

We are looking at data collected by the United Nations found here.

In particular, we are looking at Table 1.

The estimates are based on official statistics of foreign-born populations. The data is collected every year, but only five-year increments and the current year (2019) are given.

Each case represents an estimate of a country's international migrant population for a particular gender and year. There are 3216 observations in the data set.

The response variable is the count of the migrant population, which is discrete and numerical. We will reduce it to the gender proportion of migrants, which is continuous and bounded between 0 and 1.

The independent variables are Year, which is a quantitative discretization of time, and Gender, which is qualitative. We should note that Year can be a different type of variable depending on how it is used. While in this context it makes sense to think of it as a numerical time series, it does not make sense to apply numerical operations to it, for example taking the average of two years or adding two years.

This study is observational because the data collectors had no control over the variables.

Regarding generalizability, the collected data represents the entire population of migrants, so there was no random sampling. However, we will randomly sample the data so that we can generalize from the sample. There is no notion of causality in this study since there is no random assignment.

Data Import

The document consists of multiple tabs of data. We will focus on Table 1 which contains International migrant stock at mid-year by sex and by major area, region, country or area, 1990-2019. We will look at the non-aggregated data for each country for both males and females to see what trends we can derive from this dataset.

Here is a quick look at the data. There are rows for every country as well as for aggregations by region. We can only find data by year and gender for the range 1990-2019.

Raw Data


Data Transformation/Preparation

We start by selecting the rows and columns of interest. We will ignore the aggregated data and select the country rows and the migrant stock data for the individual sexes.

Let's take a look at a random selection of observations from this tidy data table, shown in both long and wide format.

Long Format

Country                 Year  MigrantCount  Sex
Brunei Darussalam       1995         47832  Male
Norway                  2005        176444  Male
British Virgin Islands  1995          5012  Female
Lebanon                 2019        967998  Female
North Macedonia         1995         45581  Male
Indonesia               1990        229885  Male
Morocco                 2005         27543  Male
Liberia                 2005         36005  Female
Senegal                 2010        135879  Male
Seychelles              2010          7804  Male

Wide Format

Country                       Sex        1990     1995     2000     2005     2010     2015     2019
Uzbekistan                    Male     723618   657490   606163   589690   558379   545721   545048
Cambodia                      Male      19084    45361    71638    57511    43385    39854    42379
Uganda                        Male     283337   308281   290859   292130   241903   415845   836620
United States Virgin Islands  Female    27146    28481    29881    29921    29962    29993    30005
Rwanda                        Female    79261   105952   171335   209908   212853   258320   270015
Nauru                         Female     1309     1210     1111     1045      530      819      908
Mauritania                    Female    54081    42547    26203    25531    36074    72482    75221
Guadeloupe                    Male      32718    36248    39778    41273    42768    43858    44537
Slovenia                      Female    88563    84164    99736   100477   112501   103156   110496
Bahrain                       Male     122730   144227   165344   288614   475905   508966   535728
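
The relationship between the long and wide layouts above can be sketched in a few lines. This is a Python illustration using only the standard library, not the code that produced the report; `long_to_wide` is a hypothetical helper name, and the rows are a subset of the Uganda and Nauru values from the wide table, rewritten in long form.

```python
from collections import defaultdict

# Long-format rows: (Country, Year, MigrantCount, Sex).
# Values come from the Uganda/Male and Nauru/Female rows of the wide table.
long_rows = [
    ("Uganda", 1990, 283337, "Male"),
    ("Uganda", 1995, 308281, "Male"),
    ("Uganda", 2000, 290859, "Male"),
    ("Nauru", 1990, 1309, "Female"),
    ("Nauru", 1995, 1210, "Female"),
    ("Nauru", 2000, 1111, "Female"),
]


def long_to_wide(rows):
    """Pivot long rows into one record per (Country, Sex), keyed by year."""
    wide = defaultdict(dict)
    for country, year, count, sex in rows:
        wide[(country, sex)][year] = count
    return dict(wide)


wide = long_to_wide(long_rows)
for (country, sex), by_year in wide.items():
    # One line per country/sex pair, years in ascending order
    print(country, sex, [by_year[y] for y in sorted(by_year)])
```

The long layout suits grouping and plotting by year, while the wide layout makes it easy to read one country's trajectory across a row.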

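We perform a series of transformations for downstream analysis. First we compute the overall proportion of each sex by year. Then, for each country and year, we compute Delta (male count minus female count), Delta_m (Delta expressed in millions), Total (male plus female count), and Prop (the male share of Total).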

Year  Sex     Proportion
1990  Female      0.4924
1995  Female      0.4936
2000  Female      0.4929
2005  Female      0.4893
2010  Female      0.4834
2015  Female      0.4822
2019  Female      0.4791
1990  Male        0.5076
1995  Male        0.5064
2000  Male        0.5071
2005  Male        0.5107
2010  Male        0.5166
2015  Male        0.5178
2019  Male        0.5209

Country                 Year     Male   Female    Delta    Delta_m    Total    Prop
Costa Rica              2000   156798   154148     2650   0.002650   310946  0.5043
Belarus                 1990   572021   676956  -104935  -0.104935  1248977  0.4580
Tajikistan              2019   118207   155864   -37657  -0.037657   274071  0.4313
Nauru                   2000     1283     1111      172   0.000172     2394  0.5359
Saint Lucia             2000     4960     4911       49   0.000049     9871  0.5025
Eswatini                1990    39905    35086     4819   0.004819    74991  0.5321
Burundi                 1990   163267   169843    -6576  -0.006576   333110  0.4901
Barbados                2000    12405    16019    -3614  -0.003614    28424  0.4364
Samoa                   2015     2148     2104       44   0.000044     4252  0.5052
Bosnia and Herzegovina  1990    26538    29462    -2924  -0.002924    56000  0.4739
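
The derived columns in the table above follow directly from the male and female counts. Here is a minimal Python sketch of that arithmetic, checked against the Costa Rica row; `derive_columns` is a hypothetical name, and the definitions are inferred from the table values rather than taken from the original code.

```python
def derive_columns(male, female):
    """Return (Delta, Delta_m, Total, Prop) for one country-year."""
    delta = male - female          # surplus of male over female migrants
    delta_m = delta / 1_000_000    # the same surplus, in millions
    total = male + female          # total migrant stock
    prop = male / total            # proportion of migrants who are male
    return delta, delta_m, total, prop


# Costa Rica, 2000 (counts from the table above)
delta, delta_m, total, prop = derive_columns(156798, 154148)
print(delta, delta_m, total, round(prop, 4))
```

Prop is the quantity carried forward into the inference section: a value above 0.5 means more male than female migrants in that country and year.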

Here we randomly sample around 9% of the observations for use in our statistical analysis. We choose 9% to respect the independence condition for inference stated below (a sample of less than 10% of the population). The following two tables show the randomly selected countries for 1990 and 2019.

Sampled Data for 1990

Country                 Year     Male   Female    Delta    Delta_m    Total    Prop
Grenada                 1990     2106     2157      -51  -0.000051     4263  0.4940
Armenia                 1990   270535   388254  -117719  -0.117719   658789  0.4107
Algeria                 1990   150234   123720    26514   0.026514   273954  0.5484
Afghanistan             1990    32558    25128     7430   0.007430    57686  0.5644
Anguilla                1990     1258     1312      -54  -0.000054     2570  0.4895
Samoa                   1990     1771     1586      185   0.000185     3357  0.5276
El Salvador             1990    22218    25142    -2924  -0.002924    47360  0.4691
Finland                 1990    31682    31573      109   0.000109    63255  0.5009
Saint Helena            1990      110       68       42   0.000042      178  0.6180
Turkmenistan            1990   134174   172326   -38152  -0.038152   306500  0.4378
Gibraltar               1990     4528     4181      347   0.000347     8709  0.5199
Haiti                   1990    10606     8478     2128   0.002128    19084  0.5558
Portugal                1990   209922   225860   -15938  -0.015938   435782  0.4817
Syrian Arab Republic    1990   364077   350063    14014   0.014014   714140  0.5098
Lebanon                 1990   267922   255771    12151   0.012151   523693  0.5116
Cyprus                  1990    20463    23342    -2879  -0.002879    43805  0.4671
Kuwait                  1990   654878   419513   235365   0.235365  1074391  0.6095
Georgia                 1990   133285   171185   -37900  -0.037900   304470  0.4378
Sweden                  1990   383085   405682   -22597  -0.022597   788767  0.4857
Serbia                  1990    46712    52557    -5845  -0.005845    99269  0.4706
Zimbabwe                1990   356240   278381    77859   0.077859   634621  0.5613

Sampled Data for 2019

Country                             Year     Male   Female    Delta    Delta_m    Total    Prop
Serbia                              2019   361216   459096   -97880  -0.097880   820312  0.4403
Guadeloupe                          2019    44537    55493   -10956  -0.010956   100030  0.4452
Mali                                2019   237428   230802     6626   0.006626   468230  0.5071
Slovenia                            2019   142626   110496    32130   0.032130   253122  0.5635
Belize                              2019    30180    29818      362   0.000362    59998  0.5030
Seychelles                          2019     9049     3877     5172   0.005172    12926  0.7001
Timor-Leste                         2019     5086     3331     1755   0.001755     8417  0.6043
Niue                                2019      319      269       50   0.000050      588  0.5425
Haiti                               2019    10426     8330     2096   0.002096    18756  0.5559
Colombia                            2019   575805   566514     9291   0.009291  1142319  0.5041
Cook Islands                        2019     1748     1743        5   0.000005     3491  0.5007
Malawi                              2019   117932   129720   -11788  -0.011788   247652  0.4762
Venezuela (Bolivarian Republic of)  2019   685975   689715    -3740  -0.003740  1375690  0.4986
Tokelau                             2019      242      262      -20  -0.000020      504  0.4802
Algeria                             2019   131596   117479    14117   0.014117   249075  0.5283
Anguilla                            2019     2694     2985     -291  -0.000291     5679  0.4744
Benin                               2019   183593   206519   -22926  -0.022926   390112  0.4706
Burkina Faso                        2019   341830   376508   -34678  -0.034678   718338  0.4759
Iraq                                2019   214288   153774    60514   0.060514   368062  0.5822
India                               2019  2640513  2514224   126289   0.126289  5154737  0.5122
Bulgaria                            2019    82718    85798    -3080  -0.003080   168516  0.4909
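
The sampling step that produced the two tables above can be sketched as a simple random sample without replacement. This is a Python illustration, not the report's own code. The country list is a placeholder, its length of 230 is an assumption chosen only so that 9% rounds to the 21 countries shown above, and `sample_countries` is a hypothetical helper.

```python
import random


def sample_countries(countries, fraction=0.09, seed=None):
    """Draw a simple random sample of countries without replacement."""
    rng = random.Random(seed)              # seeded for reproducibility
    k = round(len(countries) * fraction)   # sample size implied by the fraction
    return rng.sample(countries, k)


# Placeholder names standing in for the real country list (assumed length).
countries = [f"country_{i:03d}" for i in range(230)]

# The two tables list different countries, which suggests one draw per year.
sample_1990 = sample_countries(countries, seed=1)
sample_2019 = sample_countries(countries, seed=2)
print(len(sample_1990), len(sample_2019))
```

Fixing the seed makes the draw repeatable, which matters here because every downstream number depends on which countries were selected.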

Exploratory Data Analysis

Population

The following analysis is based on the true population data.

Below is a summary of the data. In particular, the greatest net surplus of female migrants in a single observation was ~1.7m, while for male migrants it was nearly 4.9m. The parameter of interest, the proportion of male migrants, was as high as 88% and as low as 29%. These summary statistics are computed over all observations, so each extreme is an individual observation that could come from any country and any year.

##    Country              Year                Male         
##  Length:1608        Length:1608        Min.   :      61  
##  Class :character   Class :character   1st Qu.:   13160  
##  Mode  :character   Mode  :character   Median :   63032  
##                                        Mean   :  453763  
##                                        3rd Qu.:  286557  
##                                        Max.   :24488382  
##      Female             Delta             Delta_m         
##  Min.   :      47   Min.   :-1684385   Min.   :-1.684385  
##  1st Qu.:   12203   1st Qu.:   -5937   1st Qu.:-0.005937  
##  Median :   61813   Median :     224   Median : 0.000224  
##  Mean   :  429830   Mean   :   23933   Mean   : 0.023933  
##  3rd Qu.:  279594   3rd Qu.:    6722   3rd Qu.: 0.006722  
##  Max.   :26172767   Max.   : 4880026   Max.   : 4.880026  
##      Total               Prop       
##  Min.   :     108   Min.   :0.2930  
##  1st Qu.:   26197   1st Qu.:0.4800  
##  Median :  125646   Median :0.5079  
##  Mean   :  883593   Mean   :0.5163  
##  3rd Qu.:  588974   3rd Qu.:0.5392  
##  Max.   :50661149   Max.   :0.8767

The plots below reveal that the total population of migrants is increasing over time. We notice that the difference between the number of male and female migrants is increasing. Also shown is the change in the gender proportion over time.

Here we see how the difference between the male and female migrant populations is distributed over time. The maximum and minimum observations from the summary above can both be identified in 2019.

As seen below, there is an overall increase in the proportion of male migrants.

The side-by-side plot shown below looks at the cases of interest for the statistical study.

By looking at the population data, we can say that there has been an increase in the number of male migrants relative to females.

Sampled Data

The side-by-side plot below shows the proportions from the sampled data. We observe only a very minor change from 1990 to 2019. The summaries that follow describe the 1990 sample first, then the 2019 sample.

##    Country              Year                Male            Female      
##  Length:21          Length:21          Min.   :   110   Min.   :    68  
##  Class :character   Class :character   1st Qu.: 10606   1st Qu.:  8478  
##  Mode  :character   Mode  :character   Median : 46712   Median : 52557  
##                                        Mean   :147541   Mean   :141251  
##                                        3rd Qu.:267922   3rd Qu.:255771  
##                                        Max.   :654878   Max.   :419513  
##      Delta            Delta_m              Total              Prop       
##  Min.   :-117719   Min.   :-0.117719   Min.   :    178   Min.   :0.4107  
##  1st Qu.:  -5845   1st Qu.:-0.005845   1st Qu.:  19084   1st Qu.:0.4706  
##  Median :     42   Median : 0.000042   Median :  99269   Median :0.5009  
##  Mean   :   6290   Mean   : 0.006290   Mean   : 288792   Mean   :0.5082  
##  3rd Qu.:   7430   3rd Qu.: 0.007430   3rd Qu.: 523693   3rd Qu.:0.5484  
##  Max.   : 235365   Max.   : 0.235365   Max.   :1074391   Max.   :0.6180

##    Country              Year                Male             Female       
##  Length:21          Length:21          Min.   :    242   Min.   :    262  
##  Class :character   Class :character   1st Qu.:   9049   1st Qu.:   3877  
##  Mode  :character   Mode  :character   Median : 117932   Median : 110496  
##                                        Mean   : 277133   Mean   : 273655  
##                                        3rd Qu.: 237428   3rd Qu.: 230802  
##                                        Max.   :2640513   Max.   :2514224  
##      Delta           Delta_m              Total              Prop       
##  Min.   :-97880   Min.   :-0.097880   Min.   :    504   Min.   :0.4403  
##  1st Qu.: -3740   1st Qu.:-0.003740   1st Qu.:  12926   1st Qu.:0.4762  
##  Median :    50   Median : 0.000050   Median : 247652   Median :0.5030  
##  Mean   :  3478   Mean   : 0.003478   Mean   : 550788   Mean   :0.5170  
##  3rd Qu.:  6626   3rd Qu.: 0.006626   3rd Qu.: 468230   3rd Qu.:0.5425  
##  Max.   :126289   Max.   : 0.126289   Max.   :5154737   Max.   :0.7001

Inference

Methodology

The distribution of the sample proportion is described by the following:

\(\hat{p} \sim N \left( mean = p, SE = \sqrt{\frac{p(1-p)}{n}} \right)\)

The confidence interval is described by the following:

\(\text{confidence interval} = \hat{p} \pm z^{*} \times SE\)

In this study, we will be using a 95% confidence level (a significance level of \(\alpha = 0.05\)).

Since we are looking at the distribution of the difference of two independent sample proportions, the standard error takes the following form:

\(SE_{(\hat{p}_1 - \hat{p}_2)} = \sqrt{\frac{ \hat{p}_1 (1 - \hat{p}_1)}{n_1} + \frac{ \hat{p}_2 (1 - \hat{p}_2)}{n_2} }\)
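
For reference, the confidence interval for the difference follows the same pattern as the single-proportion interval above. This is the standard construction, written with the two years of this study standing in for samples 1 and 2:

\[(\hat{p}_{2019} - \hat{p}_{1990}) \pm z^{*} \times SE_{(\hat{p}_{2019} - \hat{p}_{1990})}\]

If this interval excludes 0, the data are inconsistent with \(p_{diff} = 0\) at the chosen confidence level.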

Conditions

We then verify the conditions for inference on proportions.

  1. Individual observations should be independent (the sample size should be < 10% of the population if sampling without replacement)

This condition is respected as we sampled 9% of the population.

  2. The sampling distribution should be approximately normal.

We are at the border of what is acceptable, as the check below shows:

\(np = 21 \times 0.5 = 10.5 > 10\)
\(n(1-p) = 21 \times 0.5 = 10.5 > 10\)

  3. Randomness: the data must come from a random sample or a randomized experiment

The data was randomly sampled.
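
The success-failure check used for the normality condition above can be written as a small function. This is a Python sketch with a hypothetical helper name, `success_failure`, using the same n = 21 and p = 0.5 as the check in the text and the conventional threshold of 10.

```python
def success_failure(n, p, threshold=10):
    """Return expected successes, expected failures, and whether both pass."""
    successes = n * p
    failures = n * (1 - p)
    passes = successes >= threshold and failures >= threshold
    return successes, failures, passes


successes, failures, passes = success_failure(n=21, p=0.5)
print(successes, failures, passes)

# Two fewer countries would already fail the check (9.5 < 10)
print(success_failure(n=19, p=0.5)[2])
```

The second call shows how little margin there is: with p near 0.5, any sample below 20 countries would not meet the condition.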

Confidence Intervals

To determine whether there is a statistically significant difference in proportions, we study the overlap of the confidence intervals for the years of interest.

Here is a summary of the analysis. It reveals that there is no overlap between the confidence intervals of the male migrant proportion for 1990 and 2019. At the 95% confidence level, we reject the null hypothesis and conclude that the gender proportion has changed.

Confidence Intervals

year  proportion  se        ci_lower   ci_upper
1990  0.5108898   0.000203  0.5104919  0.5112876
2019  0.5031577   0.000147  0.5028696  0.5034459
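
The table above can be reproduced to the precision shown with a short Python sketch (this is not the author's code). It rests on an assumption inferred from the numbers rather than stated in the text: that each proportion pools the 21 sampled countries (sum of Male divided by sum of Total) and that n in the standard error is the pooled number of migrants. The pooled counts below are column sums of the two sampled tables, and `proportion_ci` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

# (sum of Male, sum of Total) over the 21 sampled countries of each year,
# added up from the "Sampled Data" tables
pooled = {
    1990: (3_098_364, 6_064_643),
    2019: (5_819_801, 11_566_554),
}

z_star = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value, about 1.96


def proportion_ci(successes, n, z=z_star):
    """Interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n) for one proportion."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se, p_hat - z * se, p_hat + z * se


intervals = {year: proportion_ci(m, t) for year, (m, t) in pooled.items()}
for year, (p_hat, se, lower, upper) in intervals.items():
    print(year, round(p_hat, 7), round(se, 6), round(lower, 7), round(upper, 7))
```

Under this reading, the standard errors are small because n counts millions of migrants rather than 21 countries, which is why the two intervals are so narrow.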

Here we reach the same conclusion differently by looking at the difference in proportions (2019 minus 1990) and recognizing that its 95% confidence interval does not contain the null hypothesis value of 0.

## [1] -0.008223262 -0.007240805
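
The interval above can be recovered from the proportions and standard errors reported in the confidence interval table. This Python sketch takes those reported values as inputs and combines them with the standard error formula for a difference of two independent proportions; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Proportions and standard errors as reported in the table above
p_1990, se_1990 = 0.5108898, 0.000203
p_2019, se_2019 = 0.5031577, 0.000147

z_star = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value

diff = p_2019 - p_1990
# Variances of independent estimates add, so the SE of the difference is
# the square root of the sum of the squared standard errors
se_diff = sqrt(se_1990**2 + se_2019**2)
ci_diff = (diff - z_star * se_diff, diff + z_star * se_diff)
contains_zero = ci_diff[0] <= 0 <= ci_diff[1]
print(ci_diff, contains_zero)
```

Both endpoints are negative, so within this sample the male proportion is lower in 2019 than in 1990.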

Conclusion

We return to the research question: Is there convincing evidence that the world has seen a change in its gender proportion of migrants between 1990 and 2019?

From the exploration and analysis above, we conclude that based on our sample, there is enough evidence to reject the null hypothesis and conclude that the gender proportions have changed from 1990 to 2019.

We must note that with our sample size of 21 countries we are at the lower bound of what the normality condition allows.

A 95% confidence level means that, if we repeatedly drew samples and built an interval from each, we would expect about 95% of those intervals to capture the true population proportion. This can be confirmed by simulation.
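
A simulation of this kind can be sketched as follows. It is a Python illustration under assumed settings, not part of the original analysis: the true proportion (0.5), sample size (400), and number of repetitions (2000) are hypothetical values chosen only to demonstrate the idea, and `coverage` is a hypothetical helper.

```python
import random
from math import sqrt
from statistics import NormalDist


def coverage(p_true, n, reps, level=0.95, seed=0):
    """Fraction of intervals, over repeated samples, that contain p_true."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    hits = 0
    for _ in range(reps):
        # Draw one sample of n Bernoulli(p_true) trials and count successes
        successes = sum(rng.random() < p_true for _ in range(n))
        p_hat = successes / n
        se = sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - z * se <= p_true <= p_hat + z * se:
            hits += 1
    return hits / reps


rate = coverage(p_true=0.5, n=400, reps=2000)
print(rate)
```

The printed rate should land close to 0.95, with some run-to-run variation that shrinks as the number of repetitions grows.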

To improve upon this study we could pursue the following research:

  • Run a simulation to collect many samples.
  • Instead of the overall proportion, investigate the number of countries in a sample that have a male proportion greater than 50%.
  • Investigate the average male proportion of the sampled countries instead of the male proportion of the total migrant population of the sampled countries.
  • Investigate the year-to-year differential instead of the cumulative differences.