For this exercise, I chose the United Nations migration dataset from 2017. The idea of using this dataset came from a post by Juanelle Marks on the Week 5 Discussion Board. The source of the dataset can be found at: http://www.un.org/en/development/desa/population/migration/data/estimates2/estimates17.shtml
In analyzing the data, I was interested in addressing two questions:
What is the level and trend of the migrant population, relative to the overall population? What regions show the highest and lowest migrant populations?
What is the relative proportion of male vs. female migrants as a percentage of the overall male and female population? What areas show the largest disparity between male and female migrant populations?
The UN provides the migration dataset in downloadable form as an Excel spreadsheet. I saved two tables from the migration dataset as CSV files:
Table 1: “International migrant stock at mid-year by sex and by major area, region, country or area, 1990-2017”
Table 2: “Total population at mid-year by sex and by major area, region, country or area, 1990-2017 (thousands)”
The two tables are saved as CSV files on GitHub.
First, read in the data from the CSV files stored on GitHub. Note that we need to skip the first block of lines as they contain header information, and then we also need to give R parsing instructions for certain columns. Without the parsing instructions, many of the columns with population numbers that are formatted with separating commas (e.g., “123,456”) are treated as character data instead of numerical data.
library(tidyverse)
library(knitr)
# migrant population
file1 <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Proj2/master/UN_Migration_table1.csv"
# total population in 000s
file2 <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Proj2/master/UN_Migration_table2.csv"
# ensure correct parsing of number columns
raw1 <- read_csv(file1, skip = 16,
col_types = cols(
.default = col_number(),
X1 = col_integer(),
X2 = col_character(),
X3 = col_character(),
X4 = col_integer(),
X5 = col_character()
)
)
head(raw1, 20)
## # A tibble: 20 x 51
## X1 X2 X3 X4 X5 `1990` `1995` `2000` `2005` `2010` `2015`
## <int> <chr> <chr> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 WORLD <NA> 900 <NA> 1.53e8 1.61e8 1.73e8 1.91e8 2.20e8 2.48e8
## 2 2 More~ b 901 <NA> 8.24e7 9.23e7 1.03e8 1.16e8 1.31e8 1.40e8
## 3 3 Less~ c 902 <NA> 7.02e7 6.84e7 6.92e7 7.42e7 8.93e7 1.07e8
## 4 4 Leas~ d 941 <NA> 1.11e7 1.17e7 1.01e7 9.80e6 9.99e6 1.38e7
## 5 5 Less~ <NA> 934 <NA> 5.91e7 5.67e7 5.92e7 6.45e7 7.94e7 9.36e7
## 6 6 High~ e 1503 <NA> 7.52e7 8.66e7 1.00e8 1.18e8 1.42e8 1.57e8
## 7 7 Midd~ e 1517 <NA> 6.85e7 6.44e7 6.40e7 6.47e7 7.02e7 7.98e7
## 8 8 Uppe~ e 1502 <NA> 3.34e7 3.32e7 3.38e7 3.58e7 3.97e7 4.82e7
## 9 9 Lowe~ e 1501 <NA> 3.51e7 3.12e7 3.02e7 2.88e7 3.05e7 3.16e7
## 10 10 Low-~ e 1500 <NA> 8.53e6 9.29e6 7.73e6 7.60e6 7.55e6 1.05e7
## 11 11 Sub-~ f 947 <NA> 1.47e7 1.53e7 1.37e7 1.43e7 1.57e7 2.17e7
## 12 12 AFRI~ <NA> 903 <NA> 1.57e7 1.64e7 1.48e7 1.55e7 1.70e7 2.34e7
## 13 13 East~ <NA> 910 <NA> 5.96e6 5.02e6 4.84e6 4.75e6 4.66e6 6.92e6
## 14 14 Buru~ <NA> 108 B R 3.33e5 2.55e5 1.26e5 1.73e5 2.35e5 2.90e5
## 15 15 Como~ <NA> 174 B 1.41e4 1.39e4 1.38e4 1.32e4 1.26e4 1.26e4
## 16 16 Djib~ <NA> 262 B R 1.22e5 9.98e4 1.01e5 9.21e4 1.02e5 1.12e5
## 17 17 Erit~ <NA> 232 I 1.18e4 1.24e4 1.30e4 1.43e4 1.57e4 1.59e4
## 18 18 Ethi~ <NA> 231 B R 1.16e6 8.07e5 6.11e5 5.14e5 5.68e5 1.16e6
## 19 19 Kenya <NA> 404 B R 2.97e5 6.19e5 6.99e5 7.57e5 9.27e5 1.08e6
## 20 20 Mada~ <NA> 450 C 2.39e4 2.12e4 2.35e4 2.61e4 2.89e4 3.21e4
## # ... with 40 more variables: `2017` <dbl>, `1990_1` <dbl>,
## # `1995_1` <dbl>, `2000_1` <dbl>, `2005_1` <dbl>, `2010_1` <dbl>,
## # `2015_1` <dbl>, `2017_1` <dbl>, `1990_2` <dbl>, `1995_2` <dbl>,
## # `2000_2` <dbl>, `2005_2` <dbl>, `2010_2` <dbl>, `2015_2` <dbl>,
## # `2017_2` <dbl>, X27 <dbl>, X28 <dbl>, X29 <dbl>, X30 <dbl>, X31 <dbl>,
## # X32 <dbl>, X33 <dbl>, X34 <dbl>, X35 <dbl>, X36 <dbl>, X37 <dbl>,
## # X38 <dbl>, X39 <dbl>, X40 <dbl>, X41 <dbl>, X42 <dbl>, X43 <dbl>,
## # X44 <dbl>, X45 <dbl>, X46 <dbl>, X47 <dbl>, X48 <dbl>, X49 <dbl>,
## # X50 <dbl>, X51 <dbl>
# ensure correct parsing of number columns
raw2 <- read_csv(file2, skip = 16,
col_types = cols(
.default = col_number(),
X1 = col_integer(),
X2 = col_character(),
X3 = col_character(),
X4 = col_integer()
)
)
head(raw2, 20)
## # A tibble: 20 x 25
## X1 X2 X3 X4 `1990` `1995` `2000` `2005` `2010` `2015` `2017`
## <int> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 WORLD <NA> 900 5.33e6 5.75e6 6.15e6 6.54e6 6.96e6 7.38e6 7.55e6
## 2 2 More~ b 901 1.15e6 1.17e6 1.19e6 1.21e6 1.24e6 1.25e6 1.26e6
## 3 3 Less~ c 902 4.18e6 4.58e6 4.95e6 5.33e6 5.72e6 6.13e6 6.29e6
## 4 4 Leas~ d 941 5.11e5 5.85e5 6.65e5 7.54e5 8.49e5 9.57e5 1.00e6
## 5 5 Less~ <NA> 934 3.67e6 3.99e6 4.29e6 4.58e6 4.87e6 5.17e6 5.29e6
## 6 6 High~ e 1503 9.80e5 1.01e6 1.05e6 1.08e6 1.13e6 1.16e6 1.17e6
## 7 7 Midd~ e 1517 4.03e6 4.36e6 4.67e6 4.97e6 5.27e6 5.58e6 5.70e6
## 8 8 Uppe~ e 1502 2.10e6 2.23e6 2.33e6 2.42e6 2.51e6 2.61e6 2.65e6
## 9 9 Lowe~ e 1501 1.93e6 2.13e6 2.34e6 2.55e6 2.76e6 2.97e6 3.05e6
## 10 10 Low-~ e 1500 3.23e5 3.72e5 4.25e5 4.88e5 5.60e5 6.42e5 6.77e5
## 11 11 Sub-~ f 947 5.13e5 5.89e5 6.72e5 7.68e5 8.80e5 1.01e6 1.06e6
## 12 12 AFRI~ <NA> 903 6.35e5 7.23e5 8.18e5 9.25e5 1.05e6 1.19e6 1.26e6
## 13 13 East~ <NA> 910 1.99e5 2.26e5 2.61e5 3.01e5 3.47e5 3.99e5 4.22e5
## 14 14 Buru~ <NA> 108 5.42e3 5.96e3 6.40e3 7.42e3 8.77e3 1.02e4 1.09e4
## 15 15 Como~ <NA> 174 4.12e2 4.75e2 5.42e2 6.12e2 6.90e2 7.77e2 8.14e2
## 16 16 Djib~ <NA> 262 5.90e2 6.30e2 7.18e2 7.83e2 8.51e2 9.27e2 9.57e2
## 17 17 Erit~ <NA> 232 3.11e3 3.09e3 3.39e3 3.97e3 4.39e3 4.85e3 5.07e3
## 18 18 Ethi~ <NA> 231 4.81e4 5.73e4 6.65e4 7.67e4 8.77e4 9.99e4 1.05e5
## 19 19 Kenya <NA> 404 2.34e4 2.73e4 3.14e4 3.60e4 4.14e4 4.72e4 4.97e4
## 20 20 Mada~ <NA> 450 1.16e4 1.35e4 1.58e4 1.83e4 2.12e4 2.42e4 2.56e4
## # ... with 14 more variables: `1990_1` <dbl>, `1995_1` <dbl>,
## # `2000_1` <dbl>, `2005_1` <dbl>, `2010_1` <dbl>, `2015_1` <dbl>,
## # `2017_1` <dbl>, `1990_2` <dbl>, `1995_2` <dbl>, `2000_2` <dbl>,
## # `2005_2` <dbl>, `2010_2` <dbl>, `2015_2` <dbl>, `2017_2` <dbl>
The data is pretty messy as it stands, so let’s select the relevant columns to narrow down the dataset, and then we can proceed to tidy and clean-up the dataset. We select the Index and Area columns, along with the population and migration columns. Note that the population and migration data by year are stored in a wide table format, so we need to use gather to pull the yearly data into separate observation rows. Also, because years 1990 through 2017 are referenced 3 times in the table (for total, male, and female), let’s separate the data into 3 subsets for the yearly total, male, and female data before using gather; otherwise we run the risk of geometrically expanding the table when we gather the yearly columns (i.e., [1990 - 2017] * [1990 - 2017] * [1990 - 2017]), if we’re not careful. Once the 3 subsets are prepared, then we’ll use an inner join to combine them into 1 data frame. We do this first for able 1 (migrant population).
########################################
### select data and clean-up for table1
########################################
df1 <- raw1 %>% select(1:2, 6:26)
names(df1)[1:2] <- c("Index", "Area")
df1
## # A tibble: 270 x 23
## Index Area `1990` `1995` `2000` `2005` `2010` `2015` `2017` `1990_1`
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 WORLD 1.53e8 1.61e8 1.73e8 1.91e8 2.20e8 2.48e8 2.58e8 77456347
## 2 2 More~ 8.24e7 9.23e7 1.03e8 1.16e8 1.31e8 1.40e8 1.46e8 40270398
## 3 3 Less~ 7.02e7 6.84e7 6.92e7 7.42e7 8.93e7 1.07e8 1.12e8 37185949
## 4 4 Leas~ 1.11e7 1.17e7 1.01e7 9.80e6 9.99e6 1.38e7 1.44e7 5556516
## 5 5 Less~ 5.91e7 5.67e7 5.92e7 6.45e7 7.94e7 9.36e7 9.73e7 31650243
## 6 6 High~ 7.52e7 8.66e7 1.00e8 1.18e8 1.42e8 1.57e8 1.65e8 38788286
## 7 7 Midd~ 6.85e7 6.44e7 6.40e7 6.47e7 7.02e7 7.98e7 8.14e7 34252236
## 8 8 Uppe~ 3.34e7 3.32e7 3.38e7 3.58e7 3.97e7 4.82e7 4.97e7 16950275
## 9 9 Lowe~ 3.51e7 3.12e7 3.02e7 2.88e7 3.05e7 3.16e7 3.17e7 17301961
## 10 10 Low-~ 8.53e6 9.29e6 7.73e6 7.60e6 7.55e6 1.05e7 1.09e7 4264450
## # ... with 260 more rows, and 13 more variables: `1995_1` <dbl>,
## # `2000_1` <dbl>, `2005_1` <dbl>, `2010_1` <dbl>, `2015_1` <dbl>,
## # `2017_1` <dbl>, `1990_2` <dbl>, `1995_2` <dbl>, `2000_2` <dbl>,
## # `2005_2` <dbl>, `2010_2` <dbl>, `2015_2` <dbl>, `2017_2` <dbl>
# separate total , male, and female data into subsets to avoid problems with gathering columns
df1_a <- df1 %>% select(1:9)
df1_b <- df1 %>% select(1:2, 10:16)
df1_c <- df1 %>% select(1:2, 17:23)
# rename columns to common labels so yearly data can be combined
names(df1_b)[3:9] <- c(1990 + (0:5)*5, 2017)
names(df1_c)[3:9] <- c(1990 + (0:5)*5, 2017)
# gather column data and fix year data type
df1_a <- df1_a %>% gather(3:9, key = "Year", value = "Mig_Total")
df1_a$Year <- as.integer(df1_a$Year)
df1_a
## # A tibble: 1,890 x 4
## Index Area Year Mig_Total
## <int> <chr> <int> <dbl>
## 1 1 WORLD 1990 152542373
## 2 2 More developed regions 1990 82391619
## 3 3 Less developed regions 1990 70150754
## 4 4 Least developed countries 1990 11073226
## 5 5 Less developed regions, excluding least develope~ 1990 59116432
## 6 6 High-income countries 1990 75239010
## 7 7 Middle-income countries 1990 68475181
## 8 8 Upper-middle-income countries 1990 33412332
## 9 9 Lower-middle-income countries 1990 35062849
## 10 10 Low-income countries 1990 8533974
## # ... with 1,880 more rows
df1_b <- df1_b %>% gather(3:9, key = "Year", value = "Mig_Male")
df1_b$Year <- as.integer(df1_b$Year)
df1_b
## # A tibble: 1,890 x 4
## Index Area Year Mig_Male
## <int> <chr> <int> <dbl>
## 1 1 WORLD 1990 77456347
## 2 2 More developed regions 1990 40270398
## 3 3 Less developed regions 1990 37185949
## 4 4 Least developed countries 1990 5556516
## 5 5 Less developed regions, excluding least developed~ 1990 31650243
## 6 6 High-income countries 1990 38788286
## 7 7 Middle-income countries 1990 34252236
## 8 8 Upper-middle-income countries 1990 16950275
## 9 9 Lower-middle-income countries 1990 17301961
## 10 10 Low-income countries 1990 4264450
## # ... with 1,880 more rows
df1_c <- df1_c %>% gather(3:9, key = "Year", value = "Mig_Female")
df1_c$Year <- as.integer(df1_c$Year)
df1_c
## # A tibble: 1,890 x 4
## Index Area Year Mig_Female
## <int> <chr> <int> <dbl>
## 1 1 WORLD 1990 75086026
## 2 2 More developed regions 1990 42121221
## 3 3 Less developed regions 1990 32964805
## 4 4 Least developed countries 1990 5516710
## 5 5 Less developed regions, excluding least develop~ 1990 27466189
## 6 6 High-income countries 1990 36450724
## 7 7 Middle-income countries 1990 34222945
## 8 8 Upper-middle-income countries 1990 16462057
## 9 9 Lower-middle-income countries 1990 17760888
## 10 10 Low-income countries 1990 4269524
## # ... with 1,880 more rows
# final table1 dataframe
df1 <- df1_a %>% inner_join(df1_b) %>% inner_join(df1_c)
## Joining, by = c("Index", "Area", "Year")
## Joining, by = c("Index", "Area", "Year")
df1
## # A tibble: 1,890 x 6
## Index Area Year Mig_Total Mig_Male Mig_Female
## <int> <chr> <int> <dbl> <dbl> <dbl>
## 1 1 WORLD 1990 152542373 77456347 75086026
## 2 2 More developed regions 1990 82391619 40270398 42121221
## 3 3 Less developed regions 1990 70150754 37185949 32964805
## 4 4 Least developed countries 1990 11073226 5556516 5516710
## 5 5 Less developed regions, excl~ 1990 59116432 31650243 27466189
## 6 6 High-income countries 1990 75239010 38788286 36450724
## 7 7 Middle-income countries 1990 68475181 34252236 34222945
## 8 8 Upper-middle-income countries 1990 33412332 16950275 16462057
## 9 9 Lower-middle-income countries 1990 35062849 17301961 17760888
## 10 10 Low-income countries 1990 8533974 4264450 4269524
## # ... with 1,880 more rows
str(df1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1890 obs. of 6 variables:
## $ Index : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Area : chr "WORLD" "More developed regions" "Less developed regions" "Least developed countries" ...
## $ Year : int 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Mig_Total : num 1.53e+08 8.24e+07 7.02e+07 1.11e+07 5.91e+07 ...
## $ Mig_Male : num 77456347 40270398 37185949 5556516 31650243 ...
## $ Mig_Female: num 75086026 42121221 32964805 5516710 27466189 ...
Next, we follow the same procedure for table 2 (total population). Remember that the total population numbers are expressed in 000s.
########################################
### select data and clean-up for table2
########################################
df2 <- raw2 %>% select(1:2, 5:25)
names(df2)[1:2] <- c("Index", "Area")
df2
## # A tibble: 270 x 23
## Index Area `1990` `1995` `2000` `2005` `2010` `2015` `2017` `1990_1`
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 WORLD 5.33e6 5.75e6 6.15e6 6.54e6 6.96e6 7.38e6 7.55e6 2680804
## 2 2 More~ 1.15e6 1.17e6 1.19e6 1.21e6 1.24e6 1.25e6 1.26e6 556533
## 3 3 Less~ 4.18e6 4.58e6 4.95e6 5.33e6 5.72e6 6.13e6 6.29e6 2124272
## 4 4 Leas~ 5.11e5 5.85e5 6.65e5 7.54e5 8.49e5 9.57e5 1.00e6 254357
## 5 5 Less~ 3.67e6 3.99e6 4.29e6 4.58e6 4.87e6 5.17e6 5.29e6 1869915
## 6 6 High~ 9.80e5 1.01e6 1.05e6 1.08e6 1.13e6 1.16e6 1.17e6 481655
## 7 7 Midd~ 4.03e6 4.36e6 4.67e6 4.97e6 5.27e6 5.58e6 5.70e6 2038685
## 8 8 Uppe~ 2.10e6 2.23e6 2.33e6 2.42e6 2.51e6 2.61e6 2.65e6 1058877
## 9 9 Lowe~ 1.93e6 2.13e6 2.34e6 2.55e6 2.76e6 2.97e6 3.05e6 979808
## 10 10 Low-~ 3.23e5 3.72e5 4.25e5 4.88e5 5.60e5 6.42e5 6.77e5 159580
## # ... with 260 more rows, and 13 more variables: `1995_1` <dbl>,
## # `2000_1` <dbl>, `2005_1` <dbl>, `2010_1` <dbl>, `2015_1` <dbl>,
## # `2017_1` <dbl>, `1990_2` <dbl>, `1995_2` <dbl>, `2000_2` <dbl>,
## # `2005_2` <dbl>, `2010_2` <dbl>, `2015_2` <dbl>, `2017_2` <dbl>
# separate total , male, and female data into subsets to avoid problems with gathering columns
df2_a <- df2 %>% select(1:9)
df2_b <- df2 %>% select(1:2, 10:16)
df2_c <- df2 %>% select(1:2, 17:23)
# rename columns to common labels so yearly data can be combined
names(df2_b)[3:9] <- c(1990 + (0:5)*5, 2017)
names(df2_c)[3:9] <- c(1990 + (0:5)*5, 2017)
# gather column data and fix year data type
df2_a <- df2_a %>% gather(3:9, key = "Year", value = "Pop_Total")
df2_a$Year <- as.integer(df2_a$Year)
df2_a
## # A tibble: 1,890 x 4
## Index Area Year Pop_Total
## <int> <chr> <int> <dbl>
## 1 1 WORLD 1990 5330943
## 2 2 More developed regions 1990 1146999
## 3 3 Less developed regions 1990 4183944
## 4 4 Least developed countries 1990 510828
## 5 5 Less developed regions, excluding least develope~ 1990 3673171
## 6 6 High-income countries 1990 979798
## 7 7 Middle-income countries 1990 4026594
## 8 8 Upper-middle-income countries 1990 2099726
## 9 9 Lower-middle-income countries 1990 1926868
## 10 10 Low-income countries 1990 322686
## # ... with 1,880 more rows
df2_b <- df2_b %>% gather(3:9, key = "Year", value = "Pop_Male")
df2_b$Year <- as.integer(df2_b$Year)
df2_b
## # A tibble: 1,890 x 4
## Index Area Year Pop_Male
## <int> <chr> <int> <dbl>
## 1 1 WORLD 1990 2680804
## 2 2 More developed regions 1990 556533
## 3 3 Less developed regions 1990 2124272
## 4 4 Least developed countries 1990 254357
## 5 5 Less developed regions, excluding least developed~ 1990 1869915
## 6 6 High-income countries 1990 481655
## 7 7 Middle-income countries 1990 2038685
## 8 8 Upper-middle-income countries 1990 1058877
## 9 9 Lower-middle-income countries 1990 979808
## 10 10 Low-income countries 1990 159580
## # ... with 1,880 more rows
df2_c <- df2_c %>% gather(3:9, key = "Year", value = "Pop_Female")
df2_c$Year <- as.integer(df2_c$Year)
df2_c
## # A tibble: 1,890 x 4
## Index Area Year Pop_Female
## <int> <chr> <int> <dbl>
## 1 1 WORLD 1990 2649294
## 2 2 More developed regions 1990 590061
## 3 3 Less developed regions 1990 2059233
## 4 4 Least developed countries 1990 256462
## 5 5 Less developed regions, excluding least develop~ 1990 1802772
## 6 6 High-income countries 1990 497564
## 7 7 Middle-income countries 1990 1987725
## 8 8 Upper-middle-income countries 1990 1040666
## 9 9 Lower-middle-income countries 1990 947060
## 10 10 Low-income countries 1990 163106
## # ... with 1,880 more rows
# final table2 dataframe
df2 <- df2_a %>% inner_join(df2_b) %>% inner_join(df2_c)
## Joining, by = c("Index", "Area", "Year")
## Joining, by = c("Index", "Area", "Year")
df2
## # A tibble: 1,890 x 6
## Index Area Year Pop_Total Pop_Male Pop_Female
## <int> <chr> <int> <dbl> <dbl> <dbl>
## 1 1 WORLD 1990 5330943 2680804 2649294
## 2 2 More developed regions 1990 1146999 556533 590061
## 3 3 Less developed regions 1990 4183944 2124272 2059233
## 4 4 Least developed countries 1990 510828 254357 256462
## 5 5 Less developed regions, excl~ 1990 3673171 1869915 1802772
## 6 6 High-income countries 1990 979798 481655 497564
## 7 7 Middle-income countries 1990 4026594 2038685 1987725
## 8 8 Upper-middle-income countries 1990 2099726 1058877 1040666
## 9 9 Lower-middle-income countries 1990 1926868 979808 947060
## 10 10 Low-income countries 1990 322686 159580 163106
## # ... with 1,880 more rows
str(df2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1890 obs. of 6 variables:
## $ Index : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Area : chr "WORLD" "More developed regions" "Less developed regions" "Least developed countries" ...
## $ Year : int 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
## $ Pop_Total : num 5330943 1146999 4183944 510828 3673171 ...
## $ Pop_Male : num 2680804 556533 2124272 254357 1869915 ...
## $ Pop_Female: num 2649294 590061 2059233 256462 1802772 ...
Now that we have two tidy data frames, we can proceed to analyze the data and see how to address the questions. First let’s combine the two data frames, and reduce the number of columns to a manageable and relevant variable set. In the process, we’ll compute several new variables including:
Pop_M_Pct: percentage of male to total populationPop_F_Pct: percentage of female to total populationMig_Tot_Pct: percentage of total migrant to total population (note denominator is in 000s)Mig_M_Pct: percentage of male migrant to male population (note denominator is in 000s)Mig_F_Pct: percentage of female migrant to female population (note denominator is in 000s)Mig_MF_Pct: percentage difference between Mig_M_Pct and Mig_F_Pct# combine table1 and table2
df <- inner_join(df2, df1)
## Joining, by = c("Index", "Area", "Year")
head(df, 20)
## # A tibble: 20 x 9
## Index Area Year Pop_Total Pop_Male Pop_Female Mig_Total Mig_Male
## <int> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 WORLD 1990 5330943 2680804 2649294 152542373 77456347
## 2 2 More~ 1990 1146999 556533 590061 82391619 40270398
## 3 3 Less~ 1990 4183944 2124272 2059233 70150754 37185949
## 4 4 Leas~ 1990 510828 254357 256462 11073226 5556516
## 5 5 Less~ 1990 3673171 1869915 1802772 59116432 31650243
## 6 6 High~ 1990 979798 481655 497564 75239010 38788286
## 7 7 Midd~ 1990 4026594 2038685 1987725 68475181 34252236
## 8 8 Uppe~ 1990 2099726 1058877 1040666 33412332 16950275
## 9 9 Lowe~ 1990 1926868 979808 947060 35062849 17301961
## 10 10 Low-~ 1990 322686 159580 163106 8533974 4264450
## 11 11 Sub-~ 1990 513477 255310 258161 14690319 7745306
## 12 12 AFRI~ 1990 634567 316255 318307 15690623 8279564
## 13 13 East~ 1990 198646 98436 100210 5964031 3071189
## 14 14 Buru~ 1990 5415 2644 2772 333110 163267
## 15 15 Como~ 1990 412 206 205 14079 6717
## 16 16 Djib~ 1990 590 297 293 122221 64242
## 17 17 Erit~ 1990 3113 1557 1556 11848 6228
## 18 18 Ethi~ 1990 48087 23981 24105 1155390 607284
## 19 19 Kenya 1990 23403 11649 11753 297292 160852
## 20 20 Mada~ 1990 11599 5791 5808 23917 13348
## # ... with 1 more variable: Mig_Female <dbl>
# create new variables; note population numbers are in 000s
df <- df %>% mutate(Pop_M_Pct = Pop_Male / Pop_Total * 100, Pop_F_Pct = Pop_Female / Pop_Total * 100,
Mig_Tot_Pct = Mig_Total / Pop_Total / 10, Mig_M_Pct = Mig_Male / Pop_Male / 10,
Mig_F_Pct = Mig_Female / Pop_Female / 10, Mig_MF_Pct = (Mig_M_Pct / Mig_F_Pct - 1) * 100)
# select reduced variable set for display
df_brief <- select(df, Area, Year, Pop_Total, Pop_M_Pct, Pop_F_Pct, Mig_Tot_Pct,
Mig_M_Pct, Mig_F_Pct, Mig_MF_Pct)
head(df_brief, 20)
## # A tibble: 20 x 9
## Area Year Pop_Total Pop_M_Pct Pop_F_Pct Mig_Tot_Pct Mig_M_Pct
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 WORLD 1990 5330943 50.3 49.7 2.86 2.89
## 2 More~ 1990 1146999 48.5 51.4 7.18 7.24
## 3 Less~ 1990 4183944 50.8 49.2 1.68 1.75
## 4 Leas~ 1990 510828 49.8 50.2 2.17 2.18
## 5 Less~ 1990 3673171 50.9 49.1 1.61 1.69
## 6 High~ 1990 979798 49.2 50.8 7.68 8.05
## 7 Midd~ 1990 4026594 50.6 49.4 1.70 1.68
## 8 Uppe~ 1990 2099726 50.4 49.6 1.59 1.60
## 9 Lowe~ 1990 1926868 50.8 49.2 1.82 1.77
## 10 Low-~ 1990 322686 49.5 50.5 2.64 2.67
## 11 Sub-~ 1990 513477 49.7 50.3 2.86 3.03
## 12 AFRI~ 1990 634567 49.8 50.2 2.47 2.62
## 13 East~ 1990 198646 49.6 50.4 3.00 3.12
## 14 Buru~ 1990 5415 48.8 51.2 6.15 6.18
## 15 Como~ 1990 412 50 49.8 3.42 3.26
## 16 Djib~ 1990 590 50.3 49.7 20.7 21.6
## 17 Erit~ 1990 3113 50.0 50.0 0.381 0.4
## 18 Ethi~ 1990 48087 49.9 50.1 2.40 2.53
## 19 Kenya 1990 23403 49.8 50.2 1.27 1.38
## 20 Mada~ 1990 11599 49.9 50.1 0.206 0.230
## # ... with 2 more variables: Mig_F_Pct <dbl>, Mig_MF_Pct <dbl>
Let’s return to our two questions, and see if the combined dataset can provide any answers.
From the tables below:
The overall migrant population as a percentage of the total population is roughly 3% across the world. The migrant percentage has risen steadily over time, from 2.9% in 1990 to 3.4% in 2017.
df_brief %>% filter(Area == "WORLD") %>% kable(digits = 1, caption = "WORLD TIME SERIES")
| Area | Year | Pop_Total | Pop_M_Pct | Pop_F_Pct | Mig_Tot_Pct | Mig_M_Pct | Mig_F_Pct | Mig_MF_Pct |
|---|---|---|---|---|---|---|---|---|
| WORLD | 1990 | 5330943 | 50.3 | 49.7 | 2.9 | 2.9 | 2.8 | 1.9 |
| WORLD | 1995 | 5751474 | 50.3 | 49.7 | 2.8 | 2.8 | 2.8 | 1.3 |
| WORLD | 2000 | 6145007 | 50.3 | 49.7 | 2.8 | 2.8 | 2.8 | 1.5 |
| WORLD | 2005 | 6542159 | 50.4 | 49.6 | 2.9 | 3.0 | 2.9 | 2.7 |
| WORLD | 2010 | 6958169 | 50.4 | 49.6 | 3.2 | 3.2 | 3.1 | 4.9 |
| WORLD | 2015 | 7383009 | 50.4 | 49.6 | 3.4 | 3.4 | 3.3 | 4.5 |
| WORLD | 2017 | 7550262 | 50.4 | 49.5 | 3.4 | 3.5 | 3.3 | 4.9 |
df_brief %>% filter(Area == "AFRICA") %>% kable(digits = 1, caption = "AFRICA TIME SERIES")
| Area | Year | Pop_Total | Pop_M_Pct | Pop_F_Pct | Mig_Tot_Pct | Mig_M_Pct | Mig_F_Pct | Mig_MF_Pct |
|---|---|---|---|---|---|---|---|---|
| AFRICA | 1990 | 634567 | 49.8 | 50.2 | 2.5 | 2.6 | 2.3 | 12.4 |
| AFRICA | 1995 | 722922 | 49.8 | 50.2 | 2.3 | 2.4 | 2.1 | 12.2 |
| AFRICA | 2000 | 817566 | 49.8 | 50.2 | 1.8 | 1.9 | 1.7 | 13.9 |
| AFRICA | 2005 | 924758 | 49.8 | 50.2 | 1.7 | 1.8 | 1.5 | 20.0 |
| AFRICA | 2010 | 1049446 | 49.9 | 50.1 | 1.6 | 1.7 | 1.5 | 17.1 |
| AFRICA | 2015 | 1194370 | 49.9 | 50.1 | 2.0 | 2.1 | 1.8 | 13.4 |
| AFRICA | 2017 | 1256268 | 49.9 | 50.1 | 2.0 | 2.1 | 1.8 | 12.4 |
df_brief %>% filter(Area == "ASIA") %>% kable(digits = 1, caption = "ASIA TIME SERIES")
| Area | Year | Pop_Total | Pop_M_Pct | Pop_F_Pct | Mig_Tot_Pct | Mig_M_Pct | Mig_F_Pct | Mig_MF_Pct |
|---|---|---|---|---|---|---|---|---|
| ASIA | 1990 | 3221342 | 51.1 | 48.9 | 1.5 | 1.6 | 1.4 | 10.1 |
| ASIA | 1995 | 3489306 | 51.1 | 48.9 | 1.3 | 1.4 | 1.3 | 10.3 |
| ASIA | 2000 | 3730371 | 51.1 | 48.9 | 1.3 | 1.4 | 1.2 | 11.6 |
| ASIA | 2005 | 3964343 | 51.1 | 48.9 | 1.3 | 1.4 | 1.2 | 15.7 |
| ASIA | 2010 | 4194425 | 51.2 | 48.8 | 1.6 | 1.8 | 1.4 | 28.0 |
| ASIA | 2015 | 4419898 | 51.2 | 48.8 | 1.7 | 1.9 | 1.5 | 28.1 |
| ASIA | 2017 | 4504428 | 51.2 | 48.8 | 1.8 | 2.0 | 1.5 | 29.7 |
df_brief %>% filter(Area == "EUROPE") %>% kable(digits = 1, caption = "EUROPE TIME SERIES")
| Area | Year | Pop_Total | Pop_M_Pct | Pop_F_Pct | Mig_Tot_Pct | Mig_M_Pct | Mig_F_Pct | Mig_MF_Pct |
|---|---|---|---|---|---|---|---|---|
| EUROPE | 1990 | 721699 | 48.2 | 51.8 | 6.8 | 6.9 | 6.8 | 1.9 |
| EUROPE | 1995 | 728085 | 48.2 | 51.8 | 7.3 | 7.3 | 7.2 | 1.3 |
| EUROPE | 2000 | 727201 | 48.2 | 51.8 | 7.7 | 7.8 | 7.7 | 0.8 |
| EUROPE | 2005 | 730290 | 48.2 | 51.8 | 8.7 | 8.7 | 8.6 | 0.7 |
| EUROPE | 2010 | 737164 | 48.2 | 51.7 | 9.6 | 9.6 | 9.6 | -0.2 |
| EUROPE | 2015 | 740814 | 48.3 | 51.7 | 10.1 | 10.0 | 10.1 | -1.6 |
| EUROPE | 2017 | 742074 | 48.3 | 51.7 | 10.5 | 10.4 | 10.6 | -1.3 |
df_brief %>% filter(Area == "LATIN AMERICA AND THE CARIBBEAN") %>% kable(digits = 1, caption = "LATAM TIME SERIES")
| Area | Year | Pop_Total | Pop_M_Pct | Pop_F_Pct | Mig_Tot_Pct | Mig_M_Pct | Mig_F_Pct | Mig_MF_Pct |
|---|---|---|---|---|---|---|---|---|
| LATIN AMERICA AND THE CARIBBEAN | 1990 | 445919 | 49.6 | 50.3 | 1.6 | 1.6 | 1.6 | 2.0 |
| LATIN AMERICA AND THE CARIBBEAN | 1995 | 486863 | 49.6 | 50.4 | 1.4 | 1.4 | 1.4 | 1.8 |
| LATIN AMERICA AND THE CARIBBEAN | 2000 | 525795 | 49.5 | 50.4 | 1.3 | 1.3 | 1.2 | 1.5 |
| LATIN AMERICA AND THE CARIBBEAN | 2005 | 561656 | 49.5 | 50.5 | 1.3 | 1.3 | 1.3 | 0.8 |
| LATIN AMERICA AND THE CARIBBEAN | 2010 | 597562 | 49.4 | 50.5 | 1.4 | 1.4 | 1.4 | 0.7 |
| LATIN AMERICA AND THE CARIBBEAN | 2015 | 632381 | 49.4 | 50.5 | 1.5 | 1.5 | 1.5 | 0.8 |
| LATIN AMERICA AND THE CARIBBEAN | 2017 | 645593 | 49.4 | 50.5 | 1.5 | 1.5 | 1.5 | 0.8 |
df_brief %>% filter(Area == "NORTHERN AMERICA") %>% kable(digits = 1, caption = "NORAM TIME SERIES")
| Area | Year | Pop_Total | Pop_M_Pct | Pop_F_Pct | Mig_Tot_Pct | Mig_M_Pct | Mig_F_Pct | Mig_MF_Pct |
|---|---|---|---|---|---|---|---|---|
| NORTHERN AMERICA | 1990 | 280345 | 49.0 | 50.9 | 9.8 | 9.8 | 9.9 | -0.7 |
| NORTHERN AMERICA | 1995 | 295114 | 49.1 | 50.8 | 11.3 | 11.3 | 11.3 | 0.1 |
| NORTHERN AMERICA | 2000 | 312845 | 49.3 | 50.7 | 12.9 | 13.0 | 12.8 | 1.0 |
| NORTHERN AMERICA | 2005 | 327546 | 49.3 | 50.6 | 13.8 | 13.9 | 13.8 | 1.3 |
| NORTHERN AMERICA | 2010 | 342937 | 49.4 | 50.6 | 14.9 | 14.7 | 15.0 | -1.9 |
| NORTHERN AMERICA | 2015 | 356004 | 49.5 | 50.5 | 15.7 | 15.4 | 16.0 | -3.8 |
| NORTHERN AMERICA | 2017 | 361208 | 49.5 | 50.5 | 16.0 | 15.6 | 16.3 | -3.9 |
df_brief %>% filter(Area == "OCEANIA") %>% kable(digits = 1, caption = "OCEANIA TIME SERIES")
| Area | Year | Pop_Total | Pop_M_Pct | Pop_F_Pct | Mig_Tot_Pct | Mig_M_Pct | Mig_F_Pct | Mig_MF_Pct |
|---|---|---|---|---|---|---|---|---|
| OCEANIA | 1990 | 27071 | 49.7 | 49.6 | 17.5 | 18.0 | 17.3 | 4.1 |
| OCEANIA | 1995 | 29185 | 49.6 | 49.6 | 17.2 | 17.5 | 17.2 | 2.0 |
| OCEANIA | 2000 | 31229 | 49.6 | 49.6 | 17.2 | 17.3 | 17.3 | -0.5 |
| OCEANIA | 2005 | 33568 | 49.7 | 49.6 | 17.9 | 17.9 | 18.2 | -1.7 |
| OCEANIA | 2010 | 36636 | 49.8 | 49.5 | 19.4 | 19.4 | 19.8 | -1.9 |
| OCEANIA | 2015 | 39543 | 49.7 | 49.7 | 20.4 | 20.1 | 20.8 | -3.3 |
| OCEANIA | 2017 | 40691 | 49.7 | 49.7 | 20.7 | 20.4 | 21.2 | -4.0 |
From the tables above:
From the “WORLD TIME SERIES” table, male migrants as a proportion of the male population (3.5% in 2017) is greater across the world than female migrants as a proportion of the female population (3.3% in 2017). Furthermore, the relative difference between the two proportions has been growing, from 1.9% in 1990 to 4.9% in 2017. What this suggests is that the migrant population is becoming more heavily weighted toward male migrants than female migrants.
Just as for the first question, there are large regional differences in the male and female migrant proportions. For instance, Asia has a +30% relative difference between male and female migrant proportions (i.e., disproportionately more male than female migrants) in 2017, whereas Oceania has a -4% relative difference (i.e., slightly more female than male migrants) in 2017.
In terms of the areas with the largest disparities between male and female migrant populations, we can query the data frame based on population size (say > 10 million) and male-vs-female migrant percentage (say >= 50% or <= -20%). It is apparent that:
Countries / regions that have much higher proportions of male migrants than female migrants include countries that have large transient worker populations associated with certain industries (e.g., the petroleum industry in Saudi Arabia) or countries during war time (e.g., Iraq).
Countries that have much higher proportions of female migrants than male migrants include Nepal and some Eastern European countries (Poland, Romania), among others.
# what's the range of the male-female migrant disparity?
summary(df_brief$Mig_MF_Pct)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -57.928 -4.453 3.132 9.874 16.327 445.776 236
# which countries of 10MM or more have the highest disparity of men vs women migrants?
# higher male migrant % than female
df_brief %>% filter(Pop_Total > 10000, Mig_MF_Pct >= 50) %>% arrange(desc(Mig_MF_Pct), Area, Year) %>% kable(digits = 1, caption = "Areas with Greatest Disparity of Male > Female Migrant Percentage")
| Area | Year | Pop_Total | Pop_M_Pct | Pop_F_Pct | Mig_Tot_Pct | Mig_M_Pct | Mig_F_Pct | Mig_MF_Pct |
|---|---|---|---|---|---|---|---|---|
| Iraq | 1990 | 17469 | 50.5 | 49.5 | 0.5 | 0.6 | 0.3 | 96.3 |
| Saudi Arabia | 2010 | 27426 | 56.2 | 43.8 | 30.7 | 38.5 | 20.8 | 85.6 |
| United Republic of Tanzania | 2005 | 39411 | 49.3 | 50.7 | 2.0 | 2.5 | 1.4 | 76.0 |
| Saudi Arabia | 2005 | 23906 | 55.7 | 44.3 | 27.2 | 33.6 | 19.2 | 75.3 |
| Western Asia | 2010 | 232738 | 51.9 | 48.1 | 13.2 | 16.5 | 9.5 | 73.4 |
| Saudi Arabia | 2015 | 31557 | 56.7 | 43.3 | 34.1 | 41.7 | 24.2 | 72.2 |
| Saudi Arabia | 2000 | 20764 | 54.5 | 45.5 | 25.3 | 31.1 | 18.5 | 68.5 |
| South Africa | 1990 | 37561 | 49.4 | 50.6 | 3.1 | 3.9 | 2.3 | 64.6 |
| Saudi Arabia | 2017 | 32938 | 57.1 | 42.9 | 37.0 | 44.4 | 27.2 | 63.2 |
| Western Asia | 2015 | 258124 | 52.2 | 47.8 | 15.4 | 18.9 | 11.6 | 63.0 |
| Western Asia | 2017 | 267660 | 52.3 | 47.7 | 16.0 | 19.6 | 12.1 | 62.9 |
| South Africa | 1995 | 42088 | 49.1 | 50.9 | 2.4 | 3.0 | 1.8 | 61.2 |
| Iraq | 2010 | 30763 | 50.5 | 49.5 | 0.4 | 0.5 | 0.3 | 60.7 |
| Saudi Arabia | 1995 | 18736 | 55.5 | 44.5 | 27.3 | 32.9 | 20.5 | 60.7 |
| Southern Africa | 1990 | 42818 | 49.3 | 50.7 | 3.3 | 4.0 | 2.5 | 57.0 |
| Saudi Arabia | 1990 | 16327 | 56.0 | 44.0 | 30.6 | 36.4 | 23.3 | 56.0 |
| Southern Africa | 1995 | 48035 | 49.1 | 50.9 | 2.5 | 3.0 | 2.0 | 55.2 |
| Dominican Republic | 2017 | 10767 | 49.8 | 50.2 | 3.9 | 4.8 | 3.1 | 55.2 |
| Viet Nam | 2005 | 84309 | 49.2 | 50.8 | 0.1 | 0.1 | 0.0 | 55.1 |
| South Africa | 2000 | 45728 | 49.1 | 50.9 | 2.2 | 2.7 | 1.7 | 55.0 |
| Dominican Republic | 2015 | 10528 | 49.8 | 50.2 | 3.9 | 4.8 | 3.1 | 55.0 |
| Western Asia | 2005 | 205731 | 51.3 | 48.7 | 10.5 | 12.6 | 8.2 | 53.8 |
| Iraq | 2005 | 27008 | 50.5 | 49.5 | 0.5 | 0.6 | 0.4 | 52.1 |
| Czechia | 2010 | 10536 | 49.1 | 50.9 | 3.8 | 4.6 | 3.0 | 50.9 |
# higher female migrant % than male
df_brief %>% filter(Pop_Total > 10000, Mig_MF_Pct <= -20) %>% arrange(Mig_MF_Pct, Area, Year) %>% kable(digits = 1, caption = "Areas with Greatest Disparity of Female > Male Migrant Percentage")
| Area | Year | Pop_Total | Pop_M_Pct | Pop_F_Pct | Mig_Tot_Pct | Mig_M_Pct | Mig_F_Pct | Mig_MF_Pct |
|---|---|---|---|---|---|---|---|---|
| Nepal | 1990 | 18749 | 49.6 | 50.4 | 2.3 | 1.4 | 3.2 | -57.9 |
| Nepal | 1995 | 21396 | 50.5 | 49.5 | 3.2 | 2.0 | 4.5 | -55.0 |
| Nepal | 2017 | 29305 | 48.5 | 51.5 | 1.7 | 1.1 | 2.3 | -53.2 |
| Nepal | 2015 | 28656 | 48.5 | 51.5 | 1.8 | 1.1 | 2.4 | -53.0 |
| Nepal | 2010 | 27023 | 48.8 | 51.2 | 2.1 | 1.4 | 2.8 | -48.9 |
| Nepal | 2000 | 23741 | 49.7 | 50.3 | 3.0 | 2.0 | 4.0 | -48.6 |
| Nepal | 2005 | 25640 | 49.6 | 50.4 | 2.6 | 1.8 | 3.4 | -46.2 |
| Poland | 2000 | 38550 | 48.5 | 51.5 | 2.1 | 1.8 | 2.5 | -26.4 |
| Poland | 2010 | 38323 | 48.3 | 51.7 | 1.7 | 1.4 | 1.9 | -26.3 |
| Poland | 2005 | 38363 | 48.4 | 51.6 | 1.9 | 1.6 | 2.2 | -25.6 |
| Romania | 1990 | 23489 | 49.3 | 50.7 | 0.6 | 0.5 | 0.7 | -24.3 |
| Poland | 1995 | 38453 | 48.6 | 51.4 | 2.5 | 2.2 | 2.8 | -23.9 |
| Uzbekistan | 2000 | 24849 | 49.7 | 50.3 | 5.7 | 4.9 | 6.4 | -23.4 |
| Guatemala | 1995 | 10408 | 49.3 | 50.7 | 1.5 | 1.3 | 1.7 | -23.0 |
| Poland | 2015 | 38265 | 48.3 | 51.7 | 1.6 | 1.4 | 1.8 | -22.7 |
| Uzbekistan | 1995 | 22863 | 49.7 | 50.3 | 6.6 | 5.8 | 7.4 | -22.3 |
| Cuba | 2000 | 11151 | 50.2 | 49.8 | 0.2 | 0.1 | 0.2 | -22.0 |
| Poland | 1990 | 37955 | 48.7 | 51.3 | 3.0 | 2.6 | 3.3 | -21.3 |
| Cuba | 2005 | 11284 | 50.2 | 49.8 | 0.2 | 0.1 | 0.2 | -21.0 |
| Uzbekistan | 1990 | 20462 | 49.5 | 50.5 | 8.1 | 7.1 | 9.0 | -20.5 |