DATA 607 - Project 2 - UN Migration Dataset

Introduction
Create a CSV file that includes all the information in the dataset
Read the dataset and use tidyr and dplyr to tidy and transform the data
Use the tidy data to address the questions
- Q1: What is the level and trend of the migrant population, relative to the overall population? What regions show the highest and lowest migrant populations?
- Q2: What is the relative proportion of male vs. female migrants as a percentage of the overall male and female population? What areas show the largest disparity between male and female migrant populations?

Introduction

For this exercise, I chose the United Nations migration dataset from 2017. The idea of using this dataset came from a post by Juanelle Marks on the Week 5 Discussion Board. The source of the dataset can be found at: http://www.un.org/en/development/desa/population/migration/data/estimates2/estimates17.shtml

In analyzing the data, I was interested in addressing two questions:

What is the level and trend of the migrant population, relative to the overall population? What regions show the highest and lowest migrant populations?
What is the relative proportion of male vs. female migrants as a percentage of the overall male and female population? What areas show the largest disparity between male and female migrant populations?

Create a CSV file that includes all the information in the dataset

The UN provides the migration dataset in downloadable form as an Excel spreadsheet. I saved two tables from the migration dataset as CSV files:

Table 1: “International migrant stock at mid-year by sex and by major area, region, country or area, 1990-2017”
Table 2: “Total population at mid-year by sex and by major area, region, country or area, 1990-2017 (thousands)”

The two tables are saved as CSV files on GitHub.

Read the dataset and use tidyr and dplyr to tidy and transform the data

First, read in the data from the CSV files stored on GitHub. Note that we need to skip the first block of lines as they contain header information, and then we also need to give R parsing instructions for certain columns. Without the parsing instructions, many of the columns with population numbers that are formatted with separating commas (e.g., “123,456”) are treated as character data instead of numerical data.

library(tidyverse)
library(knitr)
# migrant population
file1 <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Proj2/master/UN_Migration_table1.csv"
# total population in 000s
file2 <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Proj2/master/UN_Migration_table2.csv"

# ensure correct parsing of number columns
raw1 <- read_csv(file1, skip = 16,
                 col_types = cols(
                     .default = col_number(),
                     X1 = col_integer(),
                     X2 = col_character(),
                     X3 = col_character(),
                     X4 = col_integer(),
                     X5 = col_character()
                 )
)
head(raw1, 20)

## # A tibble: 20 x 51
##       X1 X2    X3       X4 X5    `1990` `1995` `2000` `2005` `2010` `2015`
##    <int> <chr> <chr> <int> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1     1 WORLD <NA>    900 <NA>  1.53e8 1.61e8 1.73e8 1.91e8 2.20e8 2.48e8
##  2     2 More~ b       901 <NA>  8.24e7 9.23e7 1.03e8 1.16e8 1.31e8 1.40e8
##  3     3 Less~ c       902 <NA>  7.02e7 6.84e7 6.92e7 7.42e7 8.93e7 1.07e8
##  4     4 Leas~ d       941 <NA>  1.11e7 1.17e7 1.01e7 9.80e6 9.99e6 1.38e7
##  5     5 Less~ <NA>    934 <NA>  5.91e7 5.67e7 5.92e7 6.45e7 7.94e7 9.36e7
##  6     6 High~ e      1503 <NA>  7.52e7 8.66e7 1.00e8 1.18e8 1.42e8 1.57e8
##  7     7 Midd~ e      1517 <NA>  6.85e7 6.44e7 6.40e7 6.47e7 7.02e7 7.98e7
##  8     8 Uppe~ e      1502 <NA>  3.34e7 3.32e7 3.38e7 3.58e7 3.97e7 4.82e7
##  9     9 Lowe~ e      1501 <NA>  3.51e7 3.12e7 3.02e7 2.88e7 3.05e7 3.16e7
## 10    10 Low-~ e      1500 <NA>  8.53e6 9.29e6 7.73e6 7.60e6 7.55e6 1.05e7
## 11    11 Sub-~ f       947 <NA>  1.47e7 1.53e7 1.37e7 1.43e7 1.57e7 2.17e7
## 12    12 AFRI~ <NA>    903 <NA>  1.57e7 1.64e7 1.48e7 1.55e7 1.70e7 2.34e7
## 13    13 East~ <NA>    910 <NA>  5.96e6 5.02e6 4.84e6 4.75e6 4.66e6 6.92e6
## 14    14 Buru~ <NA>    108 B R   3.33e5 2.55e5 1.26e5 1.73e5 2.35e5 2.90e5
## 15    15 Como~ <NA>    174 B     1.41e4 1.39e4 1.38e4 1.32e4 1.26e4 1.26e4
## 16    16 Djib~ <NA>    262 B R   1.22e5 9.98e4 1.01e5 9.21e4 1.02e5 1.12e5
## 17    17 Erit~ <NA>    232 I     1.18e4 1.24e4 1.30e4 1.43e4 1.57e4 1.59e4
## 18    18 Ethi~ <NA>    231 B R   1.16e6 8.07e5 6.11e5 5.14e5 5.68e5 1.16e6
## 19    19 Kenya <NA>    404 B R   2.97e5 6.19e5 6.99e5 7.57e5 9.27e5 1.08e6
## 20    20 Mada~ <NA>    450 C     2.39e4 2.12e4 2.35e4 2.61e4 2.89e4 3.21e4
## # ... with 40 more variables: `2017` <dbl>, `1990_1` <dbl>,
## #   `1995_1` <dbl>, `2000_1` <dbl>, `2005_1` <dbl>, `2010_1` <dbl>,
## #   `2015_1` <dbl>, `2017_1` <dbl>, `1990_2` <dbl>, `1995_2` <dbl>,
## #   `2000_2` <dbl>, `2005_2` <dbl>, `2010_2` <dbl>, `2015_2` <dbl>,
## #   `2017_2` <dbl>, X27 <dbl>, X28 <dbl>, X29 <dbl>, X30 <dbl>, X31 <dbl>,
## #   X32 <dbl>, X33 <dbl>, X34 <dbl>, X35 <dbl>, X36 <dbl>, X37 <dbl>,
## #   X38 <dbl>, X39 <dbl>, X40 <dbl>, X41 <dbl>, X42 <dbl>, X43 <dbl>,
## #   X44 <dbl>, X45 <dbl>, X46 <dbl>, X47 <dbl>, X48 <dbl>, X49 <dbl>,
## #   X50 <dbl>, X51 <dbl>

# ensure correct parsing of number columns
raw2 <- read_csv(file2, skip = 16,
                 col_types = cols(
                     .default = col_number(),
                     X1 = col_integer(),
                     X2 = col_character(),
                     X3 = col_character(),
                     X4 = col_integer()
                 )
)
head(raw2, 20)

## # A tibble: 20 x 25
##       X1 X2    X3       X4 `1990` `1995` `2000` `2005` `2010` `2015` `2017`
##    <int> <chr> <chr> <int>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1     1 WORLD <NA>    900 5.33e6 5.75e6 6.15e6 6.54e6 6.96e6 7.38e6 7.55e6
##  2     2 More~ b       901 1.15e6 1.17e6 1.19e6 1.21e6 1.24e6 1.25e6 1.26e6
##  3     3 Less~ c       902 4.18e6 4.58e6 4.95e6 5.33e6 5.72e6 6.13e6 6.29e6
##  4     4 Leas~ d       941 5.11e5 5.85e5 6.65e5 7.54e5 8.49e5 9.57e5 1.00e6
##  5     5 Less~ <NA>    934 3.67e6 3.99e6 4.29e6 4.58e6 4.87e6 5.17e6 5.29e6
##  6     6 High~ e      1503 9.80e5 1.01e6 1.05e6 1.08e6 1.13e6 1.16e6 1.17e6
##  7     7 Midd~ e      1517 4.03e6 4.36e6 4.67e6 4.97e6 5.27e6 5.58e6 5.70e6
##  8     8 Uppe~ e      1502 2.10e6 2.23e6 2.33e6 2.42e6 2.51e6 2.61e6 2.65e6
##  9     9 Lowe~ e      1501 1.93e6 2.13e6 2.34e6 2.55e6 2.76e6 2.97e6 3.05e6
## 10    10 Low-~ e      1500 3.23e5 3.72e5 4.25e5 4.88e5 5.60e5 6.42e5 6.77e5
## 11    11 Sub-~ f       947 5.13e5 5.89e5 6.72e5 7.68e5 8.80e5 1.01e6 1.06e6
## 12    12 AFRI~ <NA>    903 6.35e5 7.23e5 8.18e5 9.25e5 1.05e6 1.19e6 1.26e6
## 13    13 East~ <NA>    910 1.99e5 2.26e5 2.61e5 3.01e5 3.47e5 3.99e5 4.22e5
## 14    14 Buru~ <NA>    108 5.42e3 5.96e3 6.40e3 7.42e3 8.77e3 1.02e4 1.09e4
## 15    15 Como~ <NA>    174 4.12e2 4.75e2 5.42e2 6.12e2 6.90e2 7.77e2 8.14e2
## 16    16 Djib~ <NA>    262 5.90e2 6.30e2 7.18e2 7.83e2 8.51e2 9.27e2 9.57e2
## 17    17 Erit~ <NA>    232 3.11e3 3.09e3 3.39e3 3.97e3 4.39e3 4.85e3 5.07e3
## 18    18 Ethi~ <NA>    231 4.81e4 5.73e4 6.65e4 7.67e4 8.77e4 9.99e4 1.05e5
## 19    19 Kenya <NA>    404 2.34e4 2.73e4 3.14e4 3.60e4 4.14e4 4.72e4 4.97e4
## 20    20 Mada~ <NA>    450 1.16e4 1.35e4 1.58e4 1.83e4 2.12e4 2.42e4 2.56e4
## # ... with 14 more variables: `1990_1` <dbl>, `1995_1` <dbl>,
## #   `2000_1` <dbl>, `2005_1` <dbl>, `2010_1` <dbl>, `2015_1` <dbl>,
## #   `2017_1` <dbl>, `1990_2` <dbl>, `1995_2` <dbl>, `2000_2` <dbl>,
## #   `2005_2` <dbl>, `2010_2` <dbl>, `2015_2` <dbl>, `2017_2` <dbl>

The data is pretty messy as it stands, so let’s select the relevant columns to narrow down the dataset, and then we can proceed to tidy and clean-up the dataset. We select the Index and Area columns, along with the population and migration columns. Note that the population and migration data by year are stored in a wide table format, so we need to use gather to pull the yearly data into separate observation rows. Also, because years 1990 through 2017 are referenced 3 times in the table (for total, male, and female), let’s separate the data into 3 subsets for the yearly total, male, and female data before using gather; otherwise we run the risk of geometrically expanding the table when we gather the yearly columns (i.e., [1990 - 2017] * [1990 - 2017] * [1990 - 2017]), if we’re not careful. Once the 3 subsets are prepared, then we’ll use an inner join to combine them into 1 data frame. We do this first for able 1 (migrant population).

########################################
### select data and clean-up for table1
########################################
df1 <- raw1 %>% select(1:2, 6:26)
names(df1)[1:2] <- c("Index", "Area")
df1

## # A tibble: 270 x 23
##    Index Area  `1990` `1995` `2000` `2005` `2010` `2015` `2017` `1990_1`
##    <int> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>    <dbl>
##  1     1 WORLD 1.53e8 1.61e8 1.73e8 1.91e8 2.20e8 2.48e8 2.58e8 77456347
##  2     2 More~ 8.24e7 9.23e7 1.03e8 1.16e8 1.31e8 1.40e8 1.46e8 40270398
##  3     3 Less~ 7.02e7 6.84e7 6.92e7 7.42e7 8.93e7 1.07e8 1.12e8 37185949
##  4     4 Leas~ 1.11e7 1.17e7 1.01e7 9.80e6 9.99e6 1.38e7 1.44e7  5556516
##  5     5 Less~ 5.91e7 5.67e7 5.92e7 6.45e7 7.94e7 9.36e7 9.73e7 31650243
##  6     6 High~ 7.52e7 8.66e7 1.00e8 1.18e8 1.42e8 1.57e8 1.65e8 38788286
##  7     7 Midd~ 6.85e7 6.44e7 6.40e7 6.47e7 7.02e7 7.98e7 8.14e7 34252236
##  8     8 Uppe~ 3.34e7 3.32e7 3.38e7 3.58e7 3.97e7 4.82e7 4.97e7 16950275
##  9     9 Lowe~ 3.51e7 3.12e7 3.02e7 2.88e7 3.05e7 3.16e7 3.17e7 17301961
## 10    10 Low-~ 8.53e6 9.29e6 7.73e6 7.60e6 7.55e6 1.05e7 1.09e7  4264450
## # ... with 260 more rows, and 13 more variables: `1995_1` <dbl>,
## #   `2000_1` <dbl>, `2005_1` <dbl>, `2010_1` <dbl>, `2015_1` <dbl>,
## #   `2017_1` <dbl>, `1990_2` <dbl>, `1995_2` <dbl>, `2000_2` <dbl>,
## #   `2005_2` <dbl>, `2010_2` <dbl>, `2015_2` <dbl>, `2017_2` <dbl>

# separate total , male, and female data into subsets to avoid problems with gathering columns
df1_a <- df1 %>% select(1:9)
df1_b <- df1 %>% select(1:2, 10:16)
df1_c <- df1 %>% select(1:2, 17:23)

# rename columns to common labels so yearly data can be combined
names(df1_b)[3:9] <- c(1990 + (0:5)*5, 2017)
names(df1_c)[3:9] <- c(1990 + (0:5)*5, 2017)

# gather column data and fix year data type
df1_a <- df1_a %>% gather(3:9, key = "Year", value = "Mig_Total") 
df1_a$Year <- as.integer(df1_a$Year)
df1_a

## # A tibble: 1,890 x 4
##    Index Area                                               Year Mig_Total
##    <int> <chr>                                             <int>     <dbl>
##  1     1 WORLD                                              1990 152542373
##  2     2 More developed regions                             1990  82391619
##  3     3 Less developed regions                             1990  70150754
##  4     4 Least developed countries                          1990  11073226
##  5     5 Less developed regions, excluding least develope~  1990  59116432
##  6     6 High-income countries                              1990  75239010
##  7     7 Middle-income countries                            1990  68475181
##  8     8 Upper-middle-income countries                      1990  33412332
##  9     9 Lower-middle-income countries                      1990  35062849
## 10    10 Low-income countries                               1990   8533974
## # ... with 1,880 more rows

df1_b <- df1_b %>% gather(3:9, key = "Year", value = "Mig_Male") 
df1_b$Year <- as.integer(df1_b$Year)
df1_b

## # A tibble: 1,890 x 4
##    Index Area                                                Year Mig_Male
##    <int> <chr>                                              <int>    <dbl>
##  1     1 WORLD                                               1990 77456347
##  2     2 More developed regions                              1990 40270398
##  3     3 Less developed regions                              1990 37185949
##  4     4 Least developed countries                           1990  5556516
##  5     5 Less developed regions, excluding least developed~  1990 31650243
##  6     6 High-income countries                               1990 38788286
##  7     7 Middle-income countries                             1990 34252236
##  8     8 Upper-middle-income countries                       1990 16950275
##  9     9 Lower-middle-income countries                       1990 17301961
## 10    10 Low-income countries                                1990  4264450
## # ... with 1,880 more rows

df1_c <- df1_c %>% gather(3:9, key = "Year", value = "Mig_Female") 
df1_c$Year <- as.integer(df1_c$Year)
df1_c

## # A tibble: 1,890 x 4
##    Index Area                                              Year Mig_Female
##    <int> <chr>                                            <int>      <dbl>
##  1     1 WORLD                                             1990   75086026
##  2     2 More developed regions                            1990   42121221
##  3     3 Less developed regions                            1990   32964805
##  4     4 Least developed countries                         1990    5516710
##  5     5 Less developed regions, excluding least develop~  1990   27466189
##  6     6 High-income countries                             1990   36450724
##  7     7 Middle-income countries                           1990   34222945
##  8     8 Upper-middle-income countries                     1990   16462057
##  9     9 Lower-middle-income countries                     1990   17760888
## 10    10 Low-income countries                              1990    4269524
## # ... with 1,880 more rows

# final table1 dataframe
df1 <- df1_a %>% inner_join(df1_b) %>% inner_join(df1_c)

## Joining, by = c("Index", "Area", "Year")
## Joining, by = c("Index", "Area", "Year")

df1

## # A tibble: 1,890 x 6
##    Index Area                           Year Mig_Total Mig_Male Mig_Female
##    <int> <chr>                         <int>     <dbl>    <dbl>      <dbl>
##  1     1 WORLD                          1990 152542373 77456347   75086026
##  2     2 More developed regions         1990  82391619 40270398   42121221
##  3     3 Less developed regions         1990  70150754 37185949   32964805
##  4     4 Least developed countries      1990  11073226  5556516    5516710
##  5     5 Less developed regions, excl~  1990  59116432 31650243   27466189
##  6     6 High-income countries          1990  75239010 38788286   36450724
##  7     7 Middle-income countries        1990  68475181 34252236   34222945
##  8     8 Upper-middle-income countries  1990  33412332 16950275   16462057
##  9     9 Lower-middle-income countries  1990  35062849 17301961   17760888
## 10    10 Low-income countries           1990   8533974  4264450    4269524
## # ... with 1,880 more rows

str(df1)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1890 obs. of  6 variables:
##  $ Index     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Area      : chr  "WORLD" "More developed regions" "Less developed regions" "Least developed countries" ...
##  $ Year      : int  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Mig_Total : num  1.53e+08 8.24e+07 7.02e+07 1.11e+07 5.91e+07 ...
##  $ Mig_Male  : num  77456347 40270398 37185949 5556516 31650243 ...
##  $ Mig_Female: num  75086026 42121221 32964805 5516710 27466189 ...

Next, we follow the same procedure for table 2 (total population). Remember that the total population numbers are expressed in 000s.

########################################
### select data and clean-up for table2
########################################
df2 <- raw2 %>% select(1:2, 5:25)
names(df2)[1:2] <- c("Index", "Area")
df2

## # A tibble: 270 x 23
##    Index Area  `1990` `1995` `2000` `2005` `2010` `2015` `2017` `1990_1`
##    <int> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>    <dbl>
##  1     1 WORLD 5.33e6 5.75e6 6.15e6 6.54e6 6.96e6 7.38e6 7.55e6  2680804
##  2     2 More~ 1.15e6 1.17e6 1.19e6 1.21e6 1.24e6 1.25e6 1.26e6   556533
##  3     3 Less~ 4.18e6 4.58e6 4.95e6 5.33e6 5.72e6 6.13e6 6.29e6  2124272
##  4     4 Leas~ 5.11e5 5.85e5 6.65e5 7.54e5 8.49e5 9.57e5 1.00e6   254357
##  5     5 Less~ 3.67e6 3.99e6 4.29e6 4.58e6 4.87e6 5.17e6 5.29e6  1869915
##  6     6 High~ 9.80e5 1.01e6 1.05e6 1.08e6 1.13e6 1.16e6 1.17e6   481655
##  7     7 Midd~ 4.03e6 4.36e6 4.67e6 4.97e6 5.27e6 5.58e6 5.70e6  2038685
##  8     8 Uppe~ 2.10e6 2.23e6 2.33e6 2.42e6 2.51e6 2.61e6 2.65e6  1058877
##  9     9 Lowe~ 1.93e6 2.13e6 2.34e6 2.55e6 2.76e6 2.97e6 3.05e6   979808
## 10    10 Low-~ 3.23e5 3.72e5 4.25e5 4.88e5 5.60e5 6.42e5 6.77e5   159580
## # ... with 260 more rows, and 13 more variables: `1995_1` <dbl>,
## #   `2000_1` <dbl>, `2005_1` <dbl>, `2010_1` <dbl>, `2015_1` <dbl>,
## #   `2017_1` <dbl>, `1990_2` <dbl>, `1995_2` <dbl>, `2000_2` <dbl>,
## #   `2005_2` <dbl>, `2010_2` <dbl>, `2015_2` <dbl>, `2017_2` <dbl>

# separate total , male, and female data into subsets to avoid problems with gathering columns
df2_a <- df2 %>% select(1:9)
df2_b <- df2 %>% select(1:2, 10:16)
df2_c <- df2 %>% select(1:2, 17:23)

# rename columns to common labels so yearly data can be combined
names(df2_b)[3:9] <- c(1990 + (0:5)*5, 2017)
names(df2_c)[3:9] <- c(1990 + (0:5)*5, 2017)

# gather column data and fix year data type
df2_a <- df2_a %>% gather(3:9, key = "Year", value = "Pop_Total") 
df2_a$Year <- as.integer(df2_a$Year)
df2_a

## # A tibble: 1,890 x 4
##    Index Area                                               Year Pop_Total
##    <int> <chr>                                             <int>     <dbl>
##  1     1 WORLD                                              1990   5330943
##  2     2 More developed regions                             1990   1146999
##  3     3 Less developed regions                             1990   4183944
##  4     4 Least developed countries                          1990    510828
##  5     5 Less developed regions, excluding least develope~  1990   3673171
##  6     6 High-income countries                              1990    979798
##  7     7 Middle-income countries                            1990   4026594
##  8     8 Upper-middle-income countries                      1990   2099726
##  9     9 Lower-middle-income countries                      1990   1926868
## 10    10 Low-income countries                               1990    322686
## # ... with 1,880 more rows

df2_b <- df2_b %>% gather(3:9, key = "Year", value = "Pop_Male") 
df2_b$Year <- as.integer(df2_b$Year)
df2_b

## # A tibble: 1,890 x 4
##    Index Area                                                Year Pop_Male
##    <int> <chr>                                              <int>    <dbl>
##  1     1 WORLD                                               1990  2680804
##  2     2 More developed regions                              1990   556533
##  3     3 Less developed regions                              1990  2124272
##  4     4 Least developed countries                           1990   254357
##  5     5 Less developed regions, excluding least developed~  1990  1869915
##  6     6 High-income countries                               1990   481655
##  7     7 Middle-income countries                             1990  2038685
##  8     8 Upper-middle-income countries                       1990  1058877
##  9     9 Lower-middle-income countries                       1990   979808
## 10    10 Low-income countries                                1990   159580
## # ... with 1,880 more rows

df2_c <- df2_c %>% gather(3:9, key = "Year", value = "Pop_Female") 
df2_c$Year <- as.integer(df2_c$Year)
df2_c

## # A tibble: 1,890 x 4
##    Index Area                                              Year Pop_Female
##    <int> <chr>                                            <int>      <dbl>
##  1     1 WORLD                                             1990    2649294
##  2     2 More developed regions                            1990     590061
##  3     3 Less developed regions                            1990    2059233
##  4     4 Least developed countries                         1990     256462
##  5     5 Less developed regions, excluding least develop~  1990    1802772
##  6     6 High-income countries                             1990     497564
##  7     7 Middle-income countries                           1990    1987725
##  8     8 Upper-middle-income countries                     1990    1040666
##  9     9 Lower-middle-income countries                     1990     947060
## 10    10 Low-income countries                              1990     163106
## # ... with 1,880 more rows

# final table2 dataframe
df2 <- df2_a %>% inner_join(df2_b) %>% inner_join(df2_c)

## Joining, by = c("Index", "Area", "Year")
## Joining, by = c("Index", "Area", "Year")

df2

## # A tibble: 1,890 x 6
##    Index Area                           Year Pop_Total Pop_Male Pop_Female
##    <int> <chr>                         <int>     <dbl>    <dbl>      <dbl>
##  1     1 WORLD                          1990   5330943  2680804    2649294
##  2     2 More developed regions         1990   1146999   556533     590061
##  3     3 Less developed regions         1990   4183944  2124272    2059233
##  4     4 Least developed countries      1990    510828   254357     256462
##  5     5 Less developed regions, excl~  1990   3673171  1869915    1802772
##  6     6 High-income countries          1990    979798   481655     497564
##  7     7 Middle-income countries        1990   4026594  2038685    1987725
##  8     8 Upper-middle-income countries  1990   2099726  1058877    1040666
##  9     9 Lower-middle-income countries  1990   1926868   979808     947060
## 10    10 Low-income countries           1990    322686   159580     163106
## # ... with 1,880 more rows

str(df2)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1890 obs. of  6 variables:
##  $ Index     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Area      : chr  "WORLD" "More developed regions" "Less developed regions" "Least developed countries" ...
##  $ Year      : int  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Pop_Total : num  5330943 1146999 4183944 510828 3673171 ...
##  $ Pop_Male  : num  2680804 556533 2124272 254357 1869915 ...
##  $ Pop_Female: num  2649294 590061 2059233 256462 1802772 ...

Use the tidy data to address the questions

Now that we have two tidy data frames, we can proceed to analyze the data and see how to address the questions. First let’s combine the two data frames, and reduce the number of columns to a manageable and relevant variable set. In the process, we’ll compute several new variables including:

Pop_M_Pct: percentage of male to total population
Pop_F_Pct: percentage of female to total population
Mig_Tot_Pct: percentage of total migrant to total population (note denominator is in 000s)
Mig_M_Pct: percentage of male migrant to male population (note denominator is in 000s)
Mig_F_Pct: percentage of female migrant to female population (note denominator is in 000s)
Mig_MF_Pct: percentage difference between Mig_M_Pct and Mig_F_Pct

# combine table1 and table2
df <- inner_join(df2, df1)

## Joining, by = c("Index", "Area", "Year")

head(df, 20)

## # A tibble: 20 x 9
##    Index Area   Year Pop_Total Pop_Male Pop_Female Mig_Total Mig_Male
##    <int> <chr> <int>     <dbl>    <dbl>      <dbl>     <dbl>    <dbl>
##  1     1 WORLD  1990   5330943  2680804    2649294 152542373 77456347
##  2     2 More~  1990   1146999   556533     590061  82391619 40270398
##  3     3 Less~  1990   4183944  2124272    2059233  70150754 37185949
##  4     4 Leas~  1990    510828   254357     256462  11073226  5556516
##  5     5 Less~  1990   3673171  1869915    1802772  59116432 31650243
##  6     6 High~  1990    979798   481655     497564  75239010 38788286
##  7     7 Midd~  1990   4026594  2038685    1987725  68475181 34252236
##  8     8 Uppe~  1990   2099726  1058877    1040666  33412332 16950275
##  9     9 Lowe~  1990   1926868   979808     947060  35062849 17301961
## 10    10 Low-~  1990    322686   159580     163106   8533974  4264450
## 11    11 Sub-~  1990    513477   255310     258161  14690319  7745306
## 12    12 AFRI~  1990    634567   316255     318307  15690623  8279564
## 13    13 East~  1990    198646    98436     100210   5964031  3071189
## 14    14 Buru~  1990      5415     2644       2772    333110   163267
## 15    15 Como~  1990       412      206        205     14079     6717
## 16    16 Djib~  1990       590      297        293    122221    64242
## 17    17 Erit~  1990      3113     1557       1556     11848     6228
## 18    18 Ethi~  1990     48087    23981      24105   1155390   607284
## 19    19 Kenya  1990     23403    11649      11753    297292   160852
## 20    20 Mada~  1990     11599     5791       5808     23917    13348
## # ... with 1 more variable: Mig_Female <dbl>

# create new variables; note population numbers are in 000s
df <- df %>% mutate(Pop_M_Pct = Pop_Male / Pop_Total * 100, Pop_F_Pct = Pop_Female / Pop_Total * 100, 
                    Mig_Tot_Pct = Mig_Total / Pop_Total / 10, Mig_M_Pct = Mig_Male / Pop_Male / 10,
                    Mig_F_Pct = Mig_Female / Pop_Female / 10, Mig_MF_Pct = (Mig_M_Pct / Mig_F_Pct - 1) * 100)

# select reduced variable set for display
df_brief <- select(df, Area, Year, Pop_Total, Pop_M_Pct, Pop_F_Pct, Mig_Tot_Pct, 
                   Mig_M_Pct, Mig_F_Pct, Mig_MF_Pct) 
head(df_brief, 20)

## # A tibble: 20 x 9
##    Area   Year Pop_Total Pop_M_Pct Pop_F_Pct Mig_Tot_Pct Mig_M_Pct
##    <chr> <int>     <dbl>     <dbl>     <dbl>       <dbl>     <dbl>
##  1 WORLD  1990   5330943      50.3      49.7       2.86      2.89 
##  2 More~  1990   1146999      48.5      51.4       7.18      7.24 
##  3 Less~  1990   4183944      50.8      49.2       1.68      1.75 
##  4 Leas~  1990    510828      49.8      50.2       2.17      2.18 
##  5 Less~  1990   3673171      50.9      49.1       1.61      1.69 
##  6 High~  1990    979798      49.2      50.8       7.68      8.05 
##  7 Midd~  1990   4026594      50.6      49.4       1.70      1.68 
##  8 Uppe~  1990   2099726      50.4      49.6       1.59      1.60 
##  9 Lowe~  1990   1926868      50.8      49.2       1.82      1.77 
## 10 Low-~  1990    322686      49.5      50.5       2.64      2.67 
## 11 Sub-~  1990    513477      49.7      50.3       2.86      3.03 
## 12 AFRI~  1990    634567      49.8      50.2       2.47      2.62 
## 13 East~  1990    198646      49.6      50.4       3.00      3.12 
## 14 Buru~  1990      5415      48.8      51.2       6.15      6.18 
## 15 Como~  1990       412      50        49.8       3.42      3.26 
## 16 Djib~  1990       590      50.3      49.7      20.7      21.6  
## 17 Erit~  1990      3113      50.0      50.0       0.381     0.4  
## 18 Ethi~  1990     48087      49.9      50.1       2.40      2.53 
## 19 Kenya  1990     23403      49.8      50.2       1.27      1.38 
## 20 Mada~  1990     11599      49.9      50.1       0.206     0.230
## # ... with 2 more variables: Mig_F_Pct <dbl>, Mig_MF_Pct <dbl>

Let’s return to our two questions, and see if the combined dataset can provide any answers.

Q1: What is the level and trend of the migrant population, relative to the overall population? What regions show the highest and lowest migrant populations?

From the tables below:

The overall migrant population as a percentage of the total population is roughly 3% across the world. The migrant percentage has risen steadily over time, from 2.9% in 1990 to 3.4% in 2017.
There are large regional variations in the migrant population. Affluent regions such as Oceania, North America, and Europe have the largest migrant populations in percentage terms, which have continued to grow over time:
- Oceania: 17% - 21% range
- Northern America: 10% - 16% range
- Europe: 7% - 11% range
On the other hand, less affluent regions have lower migrant populations in percentage terms, which have been stable to declining over the last 25 years:
- Latin America: 1.3% - 1.6% range
- Asia: 1.3% - 1.8% range
- Africa: 1.6% - 2.5% range

df_brief %>% filter(Area == "WORLD") %>% kable(digits = 1, caption = "WORLD TIME SERIES")

WORLD TIME SERIES
Area	Year	Pop_Total	Pop_M_Pct	Pop_F_Pct	Mig_Tot_Pct	Mig_M_Pct	Mig_F_Pct	Mig_MF_Pct
WORLD	1990	5330943	50.3	49.7	2.9	2.9	2.8	1.9
WORLD	1995	5751474	50.3	49.7	2.8	2.8	2.8	1.3
WORLD	2000	6145007	50.3	49.7	2.8	2.8	2.8	1.5
WORLD	2005	6542159	50.4	49.6	2.9	3.0	2.9	2.7
WORLD	2010	6958169	50.4	49.6	3.2	3.2	3.1	4.9
WORLD	2015	7383009	50.4	49.6	3.4	3.4	3.3	4.5
WORLD	2017	7550262	50.4	49.5	3.4	3.5	3.3	4.9

df_brief %>% filter(Area == "AFRICA") %>% kable(digits = 1, caption = "AFRICA TIME SERIES")

AFRICA TIME SERIES
Area	Year	Pop_Total	Pop_M_Pct	Pop_F_Pct	Mig_Tot_Pct	Mig_M_Pct	Mig_F_Pct	Mig_MF_Pct
AFRICA	1990	634567	49.8	50.2	2.5	2.6	2.3	12.4
AFRICA	1995	722922	49.8	50.2	2.3	2.4	2.1	12.2
AFRICA	2000	817566	49.8	50.2	1.8	1.9	1.7	13.9
AFRICA	2005	924758	49.8	50.2	1.7	1.8	1.5	20.0
AFRICA	2010	1049446	49.9	50.1	1.6	1.7	1.5	17.1
AFRICA	2015	1194370	49.9	50.1	2.0	2.1	1.8	13.4
AFRICA	2017	1256268	49.9	50.1	2.0	2.1	1.8	12.4

df_brief %>% filter(Area == "ASIA") %>% kable(digits = 1, caption = "ASIA TIME SERIES")

ASIA TIME SERIES
Area	Year	Pop_Total	Pop_M_Pct	Pop_F_Pct	Mig_Tot_Pct	Mig_M_Pct	Mig_F_Pct	Mig_MF_Pct
ASIA	1990	3221342	51.1	48.9	1.5	1.6	1.4	10.1
ASIA	1995	3489306	51.1	48.9	1.3	1.4	1.3	10.3
ASIA	2000	3730371	51.1	48.9	1.3	1.4	1.2	11.6
ASIA	2005	3964343	51.1	48.9	1.3	1.4	1.2	15.7
ASIA	2010	4194425	51.2	48.8	1.6	1.8	1.4	28.0
ASIA	2015	4419898	51.2	48.8	1.7	1.9	1.5	28.1
ASIA	2017	4504428	51.2	48.8	1.8	2.0	1.5	29.7

df_brief %>% filter(Area == "EUROPE") %>% kable(digits = 1, caption = "EUROPE TIME SERIES")

EUROPE TIME SERIES
Area	Year	Pop_Total	Pop_M_Pct	Pop_F_Pct	Mig_Tot_Pct	Mig_M_Pct	Mig_F_Pct	Mig_MF_Pct
EUROPE	1990	721699	48.2	51.8	6.8	6.9	6.8	1.9
EUROPE	1995	728085	48.2	51.8	7.3	7.3	7.2	1.3
EUROPE	2000	727201	48.2	51.8	7.7	7.8	7.7	0.8
EUROPE	2005	730290	48.2	51.8	8.7	8.7	8.6	0.7
EUROPE	2010	737164	48.2	51.7	9.6	9.6	9.6	-0.2
EUROPE	2015	740814	48.3	51.7	10.1	10.0	10.1	-1.6
EUROPE	2017	742074	48.3	51.7	10.5	10.4	10.6	-1.3

df_brief %>% filter(Area == "LATIN AMERICA AND THE CARIBBEAN") %>% kable(digits = 1, caption = "LATAM TIME SERIES")

LATAM TIME SERIES
Area	Year	Pop_Total	Pop_M_Pct	Pop_F_Pct	Mig_Tot_Pct	Mig_M_Pct	Mig_F_Pct	Mig_MF_Pct
LATIN AMERICA AND THE CARIBBEAN	1990	445919	49.6	50.3	1.6	1.6	1.6	2.0
LATIN AMERICA AND THE CARIBBEAN	1995	486863	49.6	50.4	1.4	1.4	1.4	1.8
LATIN AMERICA AND THE CARIBBEAN	2000	525795	49.5	50.4	1.3	1.3	1.2	1.5
LATIN AMERICA AND THE CARIBBEAN	2005	561656	49.5	50.5	1.3	1.3	1.3	0.8
LATIN AMERICA AND THE CARIBBEAN	2010	597562	49.4	50.5	1.4	1.4	1.4	0.7
LATIN AMERICA AND THE CARIBBEAN	2015	632381	49.4	50.5	1.5	1.5	1.5	0.8
LATIN AMERICA AND THE CARIBBEAN	2017	645593	49.4	50.5	1.5	1.5	1.5	0.8

df_brief %>% filter(Area == "NORTHERN AMERICA") %>% kable(digits = 1, caption = "NORAM TIME SERIES")

NORAM TIME SERIES
Area	Year	Pop_Total	Pop_M_Pct	Pop_F_Pct	Mig_Tot_Pct	Mig_M_Pct	Mig_F_Pct	Mig_MF_Pct
NORTHERN AMERICA	1990	280345	49.0	50.9	9.8	9.8	9.9	-0.7
NORTHERN AMERICA	1995	295114	49.1	50.8	11.3	11.3	11.3	0.1
NORTHERN AMERICA	2000	312845	49.3	50.7	12.9	13.0	12.8	1.0
NORTHERN AMERICA	2005	327546	49.3	50.6	13.8	13.9	13.8	1.3
NORTHERN AMERICA	2010	342937	49.4	50.6	14.9	14.7	15.0	-1.9
NORTHERN AMERICA	2015	356004	49.5	50.5	15.7	15.4	16.0	-3.8
NORTHERN AMERICA	2017	361208	49.5	50.5	16.0	15.6	16.3	-3.9

df_brief %>% filter(Area == "OCEANIA") %>% kable(digits = 1, caption = "OCEANIA TIME SERIES")

OCEANIA TIME SERIES
Area	Year	Pop_Total	Pop_M_Pct	Pop_F_Pct	Mig_Tot_Pct	Mig_M_Pct	Mig_F_Pct	Mig_MF_Pct
OCEANIA	1990	27071	49.7	49.6	17.5	18.0	17.3	4.1
OCEANIA	1995	29185	49.6	49.6	17.2	17.5	17.2	2.0
OCEANIA	2000	31229	49.6	49.6	17.2	17.3	17.3	-0.5
OCEANIA	2005	33568	49.7	49.6	17.9	17.9	18.2	-1.7
OCEANIA	2010	36636	49.8	49.5	19.4	19.4	19.8	-1.9
OCEANIA	2015	39543	49.7	49.7	20.4	20.1	20.8	-3.3
OCEANIA	2017	40691	49.7	49.7	20.7	20.4	21.2	-4.0

Q2: What is the relative proportion of male vs. female migrants as a percentage of the overall male and female population? What areas show the largest disparity between male and female migrant populations?

From the tables above:

From the “WORLD TIME SERIES” table, male migrants as a proportion of the male population (3.5% in 2017) is greater across the world than female migrants as a proportion of the female population (3.3% in 2017). Furthermore, the relative difference between the two proportions has been growing, from 1.9% in 1990 to 4.9% in 2017. What this suggests is that the migrant population is becoming more heavily weighted toward male migrants than female migrants.
Just as for the first question, there are large regional differences in the male and female migrant proportions. For instance, Asia has a +30% relative difference between male and female migrant proportions (i.e., disproportionately more male than female migrants) in 2017, whereas Oceania has a -4% relative difference (i.e., slightly more female than male migrants) in 2017.

In terms of the areas with the largest disparities between male and female migrant populations, we can query the data frame based on population size (say > 10 million) and male-vs-female migrant percentage (say >= 50% or <= -20%). It is apparent that:

Countries / regions that have much higher proportions of male migrants than female migrants include countries that have large transient worker populations associated with certain industries (e.g., the petroleum industry in Saudi Arabia) or countries during war time (e.g., Iraq).
Countries that have much higher proportions of female migrants than male migrants include Nepal and some Eastern European countries (Poland, Romania), among others.

# what's the range of the male-female migrant disparity?
summary(df_brief$Mig_MF_Pct)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -57.928  -4.453   3.132   9.874  16.327 445.776     236

# which countries of 10MM or more have the highest disparity of men vs women migrants?
# higher male migrant % than female
df_brief %>% filter(Pop_Total > 10000, Mig_MF_Pct >= 50) %>% arrange(desc(Mig_MF_Pct), Area, Year) %>% kable(digits = 1, caption = "Areas with Greatest Disparity of Male > Female Migrant Percentage")

Areas with Greatest Disparity of Male > Female Migrant Percentage
Area	Year	Pop_Total	Pop_M_Pct	Pop_F_Pct	Mig_Tot_Pct	Mig_M_Pct	Mig_F_Pct	Mig_MF_Pct
Iraq	1990	17469	50.5	49.5	0.5	0.6	0.3	96.3
Saudi Arabia	2010	27426	56.2	43.8	30.7	38.5	20.8	85.6
United Republic of Tanzania	2005	39411	49.3	50.7	2.0	2.5	1.4	76.0
Saudi Arabia	2005	23906	55.7	44.3	27.2	33.6	19.2	75.3
Western Asia	2010	232738	51.9	48.1	13.2	16.5	9.5	73.4
Saudi Arabia	2015	31557	56.7	43.3	34.1	41.7	24.2	72.2
Saudi Arabia	2000	20764	54.5	45.5	25.3	31.1	18.5	68.5
South Africa	1990	37561	49.4	50.6	3.1	3.9	2.3	64.6
Saudi Arabia	2017	32938	57.1	42.9	37.0	44.4	27.2	63.2
Western Asia	2015	258124	52.2	47.8	15.4	18.9	11.6	63.0
Western Asia	2017	267660	52.3	47.7	16.0	19.6	12.1	62.9
South Africa	1995	42088	49.1	50.9	2.4	3.0	1.8	61.2
Iraq	2010	30763	50.5	49.5	0.4	0.5	0.3	60.7
Saudi Arabia	1995	18736	55.5	44.5	27.3	32.9	20.5	60.7
Southern Africa	1990	42818	49.3	50.7	3.3	4.0	2.5	57.0
Saudi Arabia	1990	16327	56.0	44.0	30.6	36.4	23.3	56.0
Southern Africa	1995	48035	49.1	50.9	2.5	3.0	2.0	55.2
Dominican Republic	2017	10767	49.8	50.2	3.9	4.8	3.1	55.2
Viet Nam	2005	84309	49.2	50.8	0.1	0.1	0.0	55.1
South Africa	2000	45728	49.1	50.9	2.2	2.7	1.7	55.0
Dominican Republic	2015	10528	49.8	50.2	3.9	4.8	3.1	55.0
Western Asia	2005	205731	51.3	48.7	10.5	12.6	8.2	53.8
Iraq	2005	27008	50.5	49.5	0.5	0.6	0.4	52.1
Czechia	2010	10536	49.1	50.9	3.8	4.6	3.0	50.9

# higher female migrant % than male
df_brief %>% filter(Pop_Total > 10000, Mig_MF_Pct <= -20) %>% arrange(Mig_MF_Pct, Area, Year) %>% kable(digits = 1, caption = "Areas with Greatest Disparity of Female > Male Migrant Percentage")

Areas with Greatest Disparity of Female > Male Migrant Percentage
Area	Year	Pop_Total	Pop_M_Pct	Pop_F_Pct	Mig_Tot_Pct	Mig_M_Pct	Mig_F_Pct	Mig_MF_Pct
Nepal	1990	18749	49.6	50.4	2.3	1.4	3.2	-57.9
Nepal	1995	21396	50.5	49.5	3.2	2.0	4.5	-55.0
Nepal	2017	29305	48.5	51.5	1.7	1.1	2.3	-53.2
Nepal	2015	28656	48.5	51.5	1.8	1.1	2.4	-53.0
Nepal	2010	27023	48.8	51.2	2.1	1.4	2.8	-48.9
Nepal	2000	23741	49.7	50.3	3.0	2.0	4.0	-48.6
Nepal	2005	25640	49.6	50.4	2.6	1.8	3.4	-46.2
Poland	2000	38550	48.5	51.5	2.1	1.8	2.5	-26.4
Poland	2010	38323	48.3	51.7	1.7	1.4	1.9	-26.3
Poland	2005	38363	48.4	51.6	1.9	1.6	2.2	-25.6
Romania	1990	23489	49.3	50.7	0.6	0.5	0.7	-24.3
Poland	1995	38453	48.6	51.4	2.5	2.2	2.8	-23.9
Uzbekistan	2000	24849	49.7	50.3	5.7	4.9	6.4	-23.4
Guatemala	1995	10408	49.3	50.7	1.5	1.3	1.7	-23.0
Poland	2015	38265	48.3	51.7	1.6	1.4	1.8	-22.7
Uzbekistan	1995	22863	49.7	50.3	6.6	5.8	7.4	-22.3
Cuba	2000	11151	50.2	49.8	0.2	0.1	0.2	-22.0
Poland	1990	37955	48.7	51.3	3.0	2.6	3.3	-21.3
Cuba	2005	11284	50.2	49.8	0.2	0.1	0.2	-21.0
Uzbekistan	1990	20462	49.5	50.5	8.1	7.1	9.0	-20.5