Introduction

For this exercise, I chose the United Nations migration dataset from 2017. The idea of using this dataset came from a post by Juanelle Marks on the Week 5 Discussion Board. The source of the dataset can be found at: http://www.un.org/en/development/desa/population/migration/data/estimates2/estimates17.shtml

In analyzing the data, I was interested in addressing two questions:

Create a CSV file that includes all the information in the dataset

The UN provides the migration dataset in downloadable form as an Excel spreadsheet. I saved two tables from the migration dataset as CSV files:

The two tables are saved as CSV files on GitHub.

Read the dataset and use tidyr and dplyr to tidy and transform the data

First, read in the data from the CSV files stored on GitHub. Note that we need to skip the first block of lines as they contain header information, and then we also need to give R parsing instructions for certain columns. Without the parsing instructions, many of the columns with population numbers that are formatted with separating commas (e.g., “123,456”) are treated as character data instead of numerical data.

library(tidyverse)
library(knitr)
# migrant population
file1 <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Proj2/master/UN_Migration_table1.csv"
# total population in 000s
file2 <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Proj2/master/UN_Migration_table2.csv"

# ensure correct parsing of number columns
raw1 <- read_csv(file1, skip = 16,
                 col_types = cols(
                     .default = col_number(),
                     X1 = col_integer(),
                     X2 = col_character(),
                     X3 = col_character(),
                     X4 = col_integer(),
                     X5 = col_character()
                 )
)
head(raw1, 20)
## # A tibble: 20 x 51
##       X1 X2    X3       X4 X5    `1990` `1995` `2000` `2005` `2010` `2015`
##    <int> <chr> <chr> <int> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1     1 WORLD <NA>    900 <NA>  1.53e8 1.61e8 1.73e8 1.91e8 2.20e8 2.48e8
##  2     2 More~ b       901 <NA>  8.24e7 9.23e7 1.03e8 1.16e8 1.31e8 1.40e8
##  3     3 Less~ c       902 <NA>  7.02e7 6.84e7 6.92e7 7.42e7 8.93e7 1.07e8
##  4     4 Leas~ d       941 <NA>  1.11e7 1.17e7 1.01e7 9.80e6 9.99e6 1.38e7
##  5     5 Less~ <NA>    934 <NA>  5.91e7 5.67e7 5.92e7 6.45e7 7.94e7 9.36e7
##  6     6 High~ e      1503 <NA>  7.52e7 8.66e7 1.00e8 1.18e8 1.42e8 1.57e8
##  7     7 Midd~ e      1517 <NA>  6.85e7 6.44e7 6.40e7 6.47e7 7.02e7 7.98e7
##  8     8 Uppe~ e      1502 <NA>  3.34e7 3.32e7 3.38e7 3.58e7 3.97e7 4.82e7
##  9     9 Lowe~ e      1501 <NA>  3.51e7 3.12e7 3.02e7 2.88e7 3.05e7 3.16e7
## 10    10 Low-~ e      1500 <NA>  8.53e6 9.29e6 7.73e6 7.60e6 7.55e6 1.05e7
## 11    11 Sub-~ f       947 <NA>  1.47e7 1.53e7 1.37e7 1.43e7 1.57e7 2.17e7
## 12    12 AFRI~ <NA>    903 <NA>  1.57e7 1.64e7 1.48e7 1.55e7 1.70e7 2.34e7
## 13    13 East~ <NA>    910 <NA>  5.96e6 5.02e6 4.84e6 4.75e6 4.66e6 6.92e6
## 14    14 Buru~ <NA>    108 B R   3.33e5 2.55e5 1.26e5 1.73e5 2.35e5 2.90e5
## 15    15 Como~ <NA>    174 B     1.41e4 1.39e4 1.38e4 1.32e4 1.26e4 1.26e4
## 16    16 Djib~ <NA>    262 B R   1.22e5 9.98e4 1.01e5 9.21e4 1.02e5 1.12e5
## 17    17 Erit~ <NA>    232 I     1.18e4 1.24e4 1.30e4 1.43e4 1.57e4 1.59e4
## 18    18 Ethi~ <NA>    231 B R   1.16e6 8.07e5 6.11e5 5.14e5 5.68e5 1.16e6
## 19    19 Kenya <NA>    404 B R   2.97e5 6.19e5 6.99e5 7.57e5 9.27e5 1.08e6
## 20    20 Mada~ <NA>    450 C     2.39e4 2.12e4 2.35e4 2.61e4 2.89e4 3.21e4
## # ... with 40 more variables: `2017` <dbl>, `1990_1` <dbl>,
## #   `1995_1` <dbl>, `2000_1` <dbl>, `2005_1` <dbl>, `2010_1` <dbl>,
## #   `2015_1` <dbl>, `2017_1` <dbl>, `1990_2` <dbl>, `1995_2` <dbl>,
## #   `2000_2` <dbl>, `2005_2` <dbl>, `2010_2` <dbl>, `2015_2` <dbl>,
## #   `2017_2` <dbl>, X27 <dbl>, X28 <dbl>, X29 <dbl>, X30 <dbl>, X31 <dbl>,
## #   X32 <dbl>, X33 <dbl>, X34 <dbl>, X35 <dbl>, X36 <dbl>, X37 <dbl>,
## #   X38 <dbl>, X39 <dbl>, X40 <dbl>, X41 <dbl>, X42 <dbl>, X43 <dbl>,
## #   X44 <dbl>, X45 <dbl>, X46 <dbl>, X47 <dbl>, X48 <dbl>, X49 <dbl>,
## #   X50 <dbl>, X51 <dbl>
# ensure correct parsing of number columns
raw2 <- read_csv(file2, skip = 16,
                 col_types = cols(
                     .default = col_number(),
                     X1 = col_integer(),
                     X2 = col_character(),
                     X3 = col_character(),
                     X4 = col_integer()
                 )
)
head(raw2, 20)
## # A tibble: 20 x 25
##       X1 X2    X3       X4 `1990` `1995` `2000` `2005` `2010` `2015` `2017`
##    <int> <chr> <chr> <int>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1     1 WORLD <NA>    900 5.33e6 5.75e6 6.15e6 6.54e6 6.96e6 7.38e6 7.55e6
##  2     2 More~ b       901 1.15e6 1.17e6 1.19e6 1.21e6 1.24e6 1.25e6 1.26e6
##  3     3 Less~ c       902 4.18e6 4.58e6 4.95e6 5.33e6 5.72e6 6.13e6 6.29e6
##  4     4 Leas~ d       941 5.11e5 5.85e5 6.65e5 7.54e5 8.49e5 9.57e5 1.00e6
##  5     5 Less~ <NA>    934 3.67e6 3.99e6 4.29e6 4.58e6 4.87e6 5.17e6 5.29e6
##  6     6 High~ e      1503 9.80e5 1.01e6 1.05e6 1.08e6 1.13e6 1.16e6 1.17e6
##  7     7 Midd~ e      1517 4.03e6 4.36e6 4.67e6 4.97e6 5.27e6 5.58e6 5.70e6
##  8     8 Uppe~ e      1502 2.10e6 2.23e6 2.33e6 2.42e6 2.51e6 2.61e6 2.65e6
##  9     9 Lowe~ e      1501 1.93e6 2.13e6 2.34e6 2.55e6 2.76e6 2.97e6 3.05e6
## 10    10 Low-~ e      1500 3.23e5 3.72e5 4.25e5 4.88e5 5.60e5 6.42e5 6.77e5
## 11    11 Sub-~ f       947 5.13e5 5.89e5 6.72e5 7.68e5 8.80e5 1.01e6 1.06e6
## 12    12 AFRI~ <NA>    903 6.35e5 7.23e5 8.18e5 9.25e5 1.05e6 1.19e6 1.26e6
## 13    13 East~ <NA>    910 1.99e5 2.26e5 2.61e5 3.01e5 3.47e5 3.99e5 4.22e5
## 14    14 Buru~ <NA>    108 5.42e3 5.96e3 6.40e3 7.42e3 8.77e3 1.02e4 1.09e4
## 15    15 Como~ <NA>    174 4.12e2 4.75e2 5.42e2 6.12e2 6.90e2 7.77e2 8.14e2
## 16    16 Djib~ <NA>    262 5.90e2 6.30e2 7.18e2 7.83e2 8.51e2 9.27e2 9.57e2
## 17    17 Erit~ <NA>    232 3.11e3 3.09e3 3.39e3 3.97e3 4.39e3 4.85e3 5.07e3
## 18    18 Ethi~ <NA>    231 4.81e4 5.73e4 6.65e4 7.67e4 8.77e4 9.99e4 1.05e5
## 19    19 Kenya <NA>    404 2.34e4 2.73e4 3.14e4 3.60e4 4.14e4 4.72e4 4.97e4
## 20    20 Mada~ <NA>    450 1.16e4 1.35e4 1.58e4 1.83e4 2.12e4 2.42e4 2.56e4
## # ... with 14 more variables: `1990_1` <dbl>, `1995_1` <dbl>,
## #   `2000_1` <dbl>, `2005_1` <dbl>, `2010_1` <dbl>, `2015_1` <dbl>,
## #   `2017_1` <dbl>, `1990_2` <dbl>, `1995_2` <dbl>, `2000_2` <dbl>,
## #   `2005_2` <dbl>, `2010_2` <dbl>, `2015_2` <dbl>, `2017_2` <dbl>

The data is pretty messy as it stands, so let’s select the relevant columns to narrow down the dataset, and then we can proceed to tidy and clean-up the dataset. We select the Index and Area columns, along with the population and migration columns. Note that the population and migration data by year are stored in a wide table format, so we need to use gather to pull the yearly data into separate observation rows. Also, because years 1990 through 2017 are referenced 3 times in the table (for total, male, and female), let’s separate the data into 3 subsets for the yearly total, male, and female data before using gather; otherwise we run the risk of geometrically expanding the table when we gather the yearly columns (i.e., [1990 - 2017] * [1990 - 2017] * [1990 - 2017]), if we’re not careful. Once the 3 subsets are prepared, then we’ll use an inner join to combine them into 1 data frame. We do this first for able 1 (migrant population).

########################################
### select data and clean-up for table1
########################################
df1 <- raw1 %>% select(1:2, 6:26)
names(df1)[1:2] <- c("Index", "Area")
df1
## # A tibble: 270 x 23
##    Index Area  `1990` `1995` `2000` `2005` `2010` `2015` `2017` `1990_1`
##    <int> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>    <dbl>
##  1     1 WORLD 1.53e8 1.61e8 1.73e8 1.91e8 2.20e8 2.48e8 2.58e8 77456347
##  2     2 More~ 8.24e7 9.23e7 1.03e8 1.16e8 1.31e8 1.40e8 1.46e8 40270398
##  3     3 Less~ 7.02e7 6.84e7 6.92e7 7.42e7 8.93e7 1.07e8 1.12e8 37185949
##  4     4 Leas~ 1.11e7 1.17e7 1.01e7 9.80e6 9.99e6 1.38e7 1.44e7  5556516
##  5     5 Less~ 5.91e7 5.67e7 5.92e7 6.45e7 7.94e7 9.36e7 9.73e7 31650243
##  6     6 High~ 7.52e7 8.66e7 1.00e8 1.18e8 1.42e8 1.57e8 1.65e8 38788286
##  7     7 Midd~ 6.85e7 6.44e7 6.40e7 6.47e7 7.02e7 7.98e7 8.14e7 34252236
##  8     8 Uppe~ 3.34e7 3.32e7 3.38e7 3.58e7 3.97e7 4.82e7 4.97e7 16950275
##  9     9 Lowe~ 3.51e7 3.12e7 3.02e7 2.88e7 3.05e7 3.16e7 3.17e7 17301961
## 10    10 Low-~ 8.53e6 9.29e6 7.73e6 7.60e6 7.55e6 1.05e7 1.09e7  4264450
## # ... with 260 more rows, and 13 more variables: `1995_1` <dbl>,
## #   `2000_1` <dbl>, `2005_1` <dbl>, `2010_1` <dbl>, `2015_1` <dbl>,
## #   `2017_1` <dbl>, `1990_2` <dbl>, `1995_2` <dbl>, `2000_2` <dbl>,
## #   `2005_2` <dbl>, `2010_2` <dbl>, `2015_2` <dbl>, `2017_2` <dbl>
# separate total , male, and female data into subsets to avoid problems with gathering columns
df1_a <- df1 %>% select(1:9)
df1_b <- df1 %>% select(1:2, 10:16)
df1_c <- df1 %>% select(1:2, 17:23)

# rename columns to common labels so yearly data can be combined
names(df1_b)[3:9] <- c(1990 + (0:5)*5, 2017)
names(df1_c)[3:9] <- c(1990 + (0:5)*5, 2017)

# gather column data and fix year data type
df1_a <- df1_a %>% gather(3:9, key = "Year", value = "Mig_Total") 
df1_a$Year <- as.integer(df1_a$Year)
df1_a
## # A tibble: 1,890 x 4
##    Index Area                                               Year Mig_Total
##    <int> <chr>                                             <int>     <dbl>
##  1     1 WORLD                                              1990 152542373
##  2     2 More developed regions                             1990  82391619
##  3     3 Less developed regions                             1990  70150754
##  4     4 Least developed countries                          1990  11073226
##  5     5 Less developed regions, excluding least develope~  1990  59116432
##  6     6 High-income countries                              1990  75239010
##  7     7 Middle-income countries                            1990  68475181
##  8     8 Upper-middle-income countries                      1990  33412332
##  9     9 Lower-middle-income countries                      1990  35062849
## 10    10 Low-income countries                               1990   8533974
## # ... with 1,880 more rows
df1_b <- df1_b %>% gather(3:9, key = "Year", value = "Mig_Male") 
df1_b$Year <- as.integer(df1_b$Year)
df1_b
## # A tibble: 1,890 x 4
##    Index Area                                                Year Mig_Male
##    <int> <chr>                                              <int>    <dbl>
##  1     1 WORLD                                               1990 77456347
##  2     2 More developed regions                              1990 40270398
##  3     3 Less developed regions                              1990 37185949
##  4     4 Least developed countries                           1990  5556516
##  5     5 Less developed regions, excluding least developed~  1990 31650243
##  6     6 High-income countries                               1990 38788286
##  7     7 Middle-income countries                             1990 34252236
##  8     8 Upper-middle-income countries                       1990 16950275
##  9     9 Lower-middle-income countries                       1990 17301961
## 10    10 Low-income countries                                1990  4264450
## # ... with 1,880 more rows
df1_c <- df1_c %>% gather(3:9, key = "Year", value = "Mig_Female") 
df1_c$Year <- as.integer(df1_c$Year)
df1_c
## # A tibble: 1,890 x 4
##    Index Area                                              Year Mig_Female
##    <int> <chr>                                            <int>      <dbl>
##  1     1 WORLD                                             1990   75086026
##  2     2 More developed regions                            1990   42121221
##  3     3 Less developed regions                            1990   32964805
##  4     4 Least developed countries                         1990    5516710
##  5     5 Less developed regions, excluding least develop~  1990   27466189
##  6     6 High-income countries                             1990   36450724
##  7     7 Middle-income countries                           1990   34222945
##  8     8 Upper-middle-income countries                     1990   16462057
##  9     9 Lower-middle-income countries                     1990   17760888
## 10    10 Low-income countries                              1990    4269524
## # ... with 1,880 more rows
# final table1 dataframe
df1 <- df1_a %>% inner_join(df1_b) %>% inner_join(df1_c)
## Joining, by = c("Index", "Area", "Year")
## Joining, by = c("Index", "Area", "Year")
df1
## # A tibble: 1,890 x 6
##    Index Area                           Year Mig_Total Mig_Male Mig_Female
##    <int> <chr>                         <int>     <dbl>    <dbl>      <dbl>
##  1     1 WORLD                          1990 152542373 77456347   75086026
##  2     2 More developed regions         1990  82391619 40270398   42121221
##  3     3 Less developed regions         1990  70150754 37185949   32964805
##  4     4 Least developed countries      1990  11073226  5556516    5516710
##  5     5 Less developed regions, excl~  1990  59116432 31650243   27466189
##  6     6 High-income countries          1990  75239010 38788286   36450724
##  7     7 Middle-income countries        1990  68475181 34252236   34222945
##  8     8 Upper-middle-income countries  1990  33412332 16950275   16462057
##  9     9 Lower-middle-income countries  1990  35062849 17301961   17760888
## 10    10 Low-income countries           1990   8533974  4264450    4269524
## # ... with 1,880 more rows
str(df1)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1890 obs. of  6 variables:
##  $ Index     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Area      : chr  "WORLD" "More developed regions" "Less developed regions" "Least developed countries" ...
##  $ Year      : int  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Mig_Total : num  1.53e+08 8.24e+07 7.02e+07 1.11e+07 5.91e+07 ...
##  $ Mig_Male  : num  77456347 40270398 37185949 5556516 31650243 ...
##  $ Mig_Female: num  75086026 42121221 32964805 5516710 27466189 ...

Next, we follow the same procedure for table 2 (total population). Remember that the total population numbers are expressed in 000s.

########################################
### select data and clean-up for table2
########################################
df2 <- raw2 %>% select(1:2, 5:25)
names(df2)[1:2] <- c("Index", "Area")
df2
## # A tibble: 270 x 23
##    Index Area  `1990` `1995` `2000` `2005` `2010` `2015` `2017` `1990_1`
##    <int> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>    <dbl>
##  1     1 WORLD 5.33e6 5.75e6 6.15e6 6.54e6 6.96e6 7.38e6 7.55e6  2680804
##  2     2 More~ 1.15e6 1.17e6 1.19e6 1.21e6 1.24e6 1.25e6 1.26e6   556533
##  3     3 Less~ 4.18e6 4.58e6 4.95e6 5.33e6 5.72e6 6.13e6 6.29e6  2124272
##  4     4 Leas~ 5.11e5 5.85e5 6.65e5 7.54e5 8.49e5 9.57e5 1.00e6   254357
##  5     5 Less~ 3.67e6 3.99e6 4.29e6 4.58e6 4.87e6 5.17e6 5.29e6  1869915
##  6     6 High~ 9.80e5 1.01e6 1.05e6 1.08e6 1.13e6 1.16e6 1.17e6   481655
##  7     7 Midd~ 4.03e6 4.36e6 4.67e6 4.97e6 5.27e6 5.58e6 5.70e6  2038685
##  8     8 Uppe~ 2.10e6 2.23e6 2.33e6 2.42e6 2.51e6 2.61e6 2.65e6  1058877
##  9     9 Lowe~ 1.93e6 2.13e6 2.34e6 2.55e6 2.76e6 2.97e6 3.05e6   979808
## 10    10 Low-~ 3.23e5 3.72e5 4.25e5 4.88e5 5.60e5 6.42e5 6.77e5   159580
## # ... with 260 more rows, and 13 more variables: `1995_1` <dbl>,
## #   `2000_1` <dbl>, `2005_1` <dbl>, `2010_1` <dbl>, `2015_1` <dbl>,
## #   `2017_1` <dbl>, `1990_2` <dbl>, `1995_2` <dbl>, `2000_2` <dbl>,
## #   `2005_2` <dbl>, `2010_2` <dbl>, `2015_2` <dbl>, `2017_2` <dbl>
# separate total , male, and female data into subsets to avoid problems with gathering columns
df2_a <- df2 %>% select(1:9)
df2_b <- df2 %>% select(1:2, 10:16)
df2_c <- df2 %>% select(1:2, 17:23)

# rename columns to common labels so yearly data can be combined
names(df2_b)[3:9] <- c(1990 + (0:5)*5, 2017)
names(df2_c)[3:9] <- c(1990 + (0:5)*5, 2017)

# gather column data and fix year data type
df2_a <- df2_a %>% gather(3:9, key = "Year", value = "Pop_Total") 
df2_a$Year <- as.integer(df2_a$Year)
df2_a
## # A tibble: 1,890 x 4
##    Index Area                                               Year Pop_Total
##    <int> <chr>                                             <int>     <dbl>
##  1     1 WORLD                                              1990   5330943
##  2     2 More developed regions                             1990   1146999
##  3     3 Less developed regions                             1990   4183944
##  4     4 Least developed countries                          1990    510828
##  5     5 Less developed regions, excluding least develope~  1990   3673171
##  6     6 High-income countries                              1990    979798
##  7     7 Middle-income countries                            1990   4026594
##  8     8 Upper-middle-income countries                      1990   2099726
##  9     9 Lower-middle-income countries                      1990   1926868
## 10    10 Low-income countries                               1990    322686
## # ... with 1,880 more rows
df2_b <- df2_b %>% gather(3:9, key = "Year", value = "Pop_Male") 
df2_b$Year <- as.integer(df2_b$Year)
df2_b
## # A tibble: 1,890 x 4
##    Index Area                                                Year Pop_Male
##    <int> <chr>                                              <int>    <dbl>
##  1     1 WORLD                                               1990  2680804
##  2     2 More developed regions                              1990   556533
##  3     3 Less developed regions                              1990  2124272
##  4     4 Least developed countries                           1990   254357
##  5     5 Less developed regions, excluding least developed~  1990  1869915
##  6     6 High-income countries                               1990   481655
##  7     7 Middle-income countries                             1990  2038685
##  8     8 Upper-middle-income countries                       1990  1058877
##  9     9 Lower-middle-income countries                       1990   979808
## 10    10 Low-income countries                                1990   159580
## # ... with 1,880 more rows
df2_c <- df2_c %>% gather(3:9, key = "Year", value = "Pop_Female") 
df2_c$Year <- as.integer(df2_c$Year)
df2_c
## # A tibble: 1,890 x 4
##    Index Area                                              Year Pop_Female
##    <int> <chr>                                            <int>      <dbl>
##  1     1 WORLD                                             1990    2649294
##  2     2 More developed regions                            1990     590061
##  3     3 Less developed regions                            1990    2059233
##  4     4 Least developed countries                         1990     256462
##  5     5 Less developed regions, excluding least develop~  1990    1802772
##  6     6 High-income countries                             1990     497564
##  7     7 Middle-income countries                           1990    1987725
##  8     8 Upper-middle-income countries                     1990    1040666
##  9     9 Lower-middle-income countries                     1990     947060
## 10    10 Low-income countries                              1990     163106
## # ... with 1,880 more rows
# final table2 dataframe
df2 <- df2_a %>% inner_join(df2_b) %>% inner_join(df2_c)
## Joining, by = c("Index", "Area", "Year")
## Joining, by = c("Index", "Area", "Year")
df2
## # A tibble: 1,890 x 6
##    Index Area                           Year Pop_Total Pop_Male Pop_Female
##    <int> <chr>                         <int>     <dbl>    <dbl>      <dbl>
##  1     1 WORLD                          1990   5330943  2680804    2649294
##  2     2 More developed regions         1990   1146999   556533     590061
##  3     3 Less developed regions         1990   4183944  2124272    2059233
##  4     4 Least developed countries      1990    510828   254357     256462
##  5     5 Less developed regions, excl~  1990   3673171  1869915    1802772
##  6     6 High-income countries          1990    979798   481655     497564
##  7     7 Middle-income countries        1990   4026594  2038685    1987725
##  8     8 Upper-middle-income countries  1990   2099726  1058877    1040666
##  9     9 Lower-middle-income countries  1990   1926868   979808     947060
## 10    10 Low-income countries           1990    322686   159580     163106
## # ... with 1,880 more rows
str(df2)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1890 obs. of  6 variables:
##  $ Index     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Area      : chr  "WORLD" "More developed regions" "Less developed regions" "Least developed countries" ...
##  $ Year      : int  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
##  $ Pop_Total : num  5330943 1146999 4183944 510828 3673171 ...
##  $ Pop_Male  : num  2680804 556533 2124272 254357 1869915 ...
##  $ Pop_Female: num  2649294 590061 2059233 256462 1802772 ...

Use the tidy data to address the questions

Now that we have two tidy data frames, we can proceed to analyze the data and see how to address the questions. First let’s combine the two data frames, and reduce the number of columns to a manageable and relevant variable set. In the process, we’ll compute several new variables including:

# combine table1 and table2
df <- inner_join(df2, df1)
## Joining, by = c("Index", "Area", "Year")
head(df, 20)
## # A tibble: 20 x 9
##    Index Area   Year Pop_Total Pop_Male Pop_Female Mig_Total Mig_Male
##    <int> <chr> <int>     <dbl>    <dbl>      <dbl>     <dbl>    <dbl>
##  1     1 WORLD  1990   5330943  2680804    2649294 152542373 77456347
##  2     2 More~  1990   1146999   556533     590061  82391619 40270398
##  3     3 Less~  1990   4183944  2124272    2059233  70150754 37185949
##  4     4 Leas~  1990    510828   254357     256462  11073226  5556516
##  5     5 Less~  1990   3673171  1869915    1802772  59116432 31650243
##  6     6 High~  1990    979798   481655     497564  75239010 38788286
##  7     7 Midd~  1990   4026594  2038685    1987725  68475181 34252236
##  8     8 Uppe~  1990   2099726  1058877    1040666  33412332 16950275
##  9     9 Lowe~  1990   1926868   979808     947060  35062849 17301961
## 10    10 Low-~  1990    322686   159580     163106   8533974  4264450
## 11    11 Sub-~  1990    513477   255310     258161  14690319  7745306
## 12    12 AFRI~  1990    634567   316255     318307  15690623  8279564
## 13    13 East~  1990    198646    98436     100210   5964031  3071189
## 14    14 Buru~  1990      5415     2644       2772    333110   163267
## 15    15 Como~  1990       412      206        205     14079     6717
## 16    16 Djib~  1990       590      297        293    122221    64242
## 17    17 Erit~  1990      3113     1557       1556     11848     6228
## 18    18 Ethi~  1990     48087    23981      24105   1155390   607284
## 19    19 Kenya  1990     23403    11649      11753    297292   160852
## 20    20 Mada~  1990     11599     5791       5808     23917    13348
## # ... with 1 more variable: Mig_Female <dbl>
# create new variables; note population numbers are in 000s
df <- df %>% mutate(Pop_M_Pct = Pop_Male / Pop_Total * 100, Pop_F_Pct = Pop_Female / Pop_Total * 100, 
                    Mig_Tot_Pct = Mig_Total / Pop_Total / 10, Mig_M_Pct = Mig_Male / Pop_Male / 10,
                    Mig_F_Pct = Mig_Female / Pop_Female / 10, Mig_MF_Pct = (Mig_M_Pct / Mig_F_Pct - 1) * 100)

# select reduced variable set for display
df_brief <- select(df, Area, Year, Pop_Total, Pop_M_Pct, Pop_F_Pct, Mig_Tot_Pct, 
                   Mig_M_Pct, Mig_F_Pct, Mig_MF_Pct) 
head(df_brief, 20)
## # A tibble: 20 x 9
##    Area   Year Pop_Total Pop_M_Pct Pop_F_Pct Mig_Tot_Pct Mig_M_Pct
##    <chr> <int>     <dbl>     <dbl>     <dbl>       <dbl>     <dbl>
##  1 WORLD  1990   5330943      50.3      49.7       2.86      2.89 
##  2 More~  1990   1146999      48.5      51.4       7.18      7.24 
##  3 Less~  1990   4183944      50.8      49.2       1.68      1.75 
##  4 Leas~  1990    510828      49.8      50.2       2.17      2.18 
##  5 Less~  1990   3673171      50.9      49.1       1.61      1.69 
##  6 High~  1990    979798      49.2      50.8       7.68      8.05 
##  7 Midd~  1990   4026594      50.6      49.4       1.70      1.68 
##  8 Uppe~  1990   2099726      50.4      49.6       1.59      1.60 
##  9 Lowe~  1990   1926868      50.8      49.2       1.82      1.77 
## 10 Low-~  1990    322686      49.5      50.5       2.64      2.67 
## 11 Sub-~  1990    513477      49.7      50.3       2.86      3.03 
## 12 AFRI~  1990    634567      49.8      50.2       2.47      2.62 
## 13 East~  1990    198646      49.6      50.4       3.00      3.12 
## 14 Buru~  1990      5415      48.8      51.2       6.15      6.18 
## 15 Como~  1990       412      50        49.8       3.42      3.26 
## 16 Djib~  1990       590      50.3      49.7      20.7      21.6  
## 17 Erit~  1990      3113      50.0      50.0       0.381     0.4  
## 18 Ethi~  1990     48087      49.9      50.1       2.40      2.53 
## 19 Kenya  1990     23403      49.8      50.2       1.27      1.38 
## 20 Mada~  1990     11599      49.9      50.1       0.206     0.230
## # ... with 2 more variables: Mig_F_Pct <dbl>, Mig_MF_Pct <dbl>

Let’s return to our two questions, and see if the combined dataset can provide any answers.

Q1: What is the level and trend of the migrant population, relative to the overall population? What regions show the highest and lowest migrant populations?

From the tables below:

  • The overall migrant population as a percentage of the total population is roughly 3% across the world. The migrant percentage has risen steadily over time, from 2.9% in 1990 to 3.4% in 2017.

  • There are large regional variations in the migrant population. Affluent regions such as Oceania, North America, and Europe have the largest migrant populations in percentage terms, which have continued to grow over time:
    • Oceania: 17% - 21% range
    • Northern America: 10% - 16% range
    • Europe: 7% - 11% range
  • On the other hand, less affluent regions have lower migrant populations in percentage terms, which have been stable to declining over the last 25 years:
    • Latin America: 1.3% - 1.6% range
    • Asia: 1.3% - 1.8% range
    • Africa: 1.6% - 2.5% range
df_brief %>% filter(Area == "WORLD") %>% kable(digits = 1, caption = "WORLD TIME SERIES")
WORLD TIME SERIES
Area Year Pop_Total Pop_M_Pct Pop_F_Pct Mig_Tot_Pct Mig_M_Pct Mig_F_Pct Mig_MF_Pct
WORLD 1990 5330943 50.3 49.7 2.9 2.9 2.8 1.9
WORLD 1995 5751474 50.3 49.7 2.8 2.8 2.8 1.3
WORLD 2000 6145007 50.3 49.7 2.8 2.8 2.8 1.5
WORLD 2005 6542159 50.4 49.6 2.9 3.0 2.9 2.7
WORLD 2010 6958169 50.4 49.6 3.2 3.2 3.1 4.9
WORLD 2015 7383009 50.4 49.6 3.4 3.4 3.3 4.5
WORLD 2017 7550262 50.4 49.5 3.4 3.5 3.3 4.9
df_brief %>% filter(Area == "AFRICA") %>% kable(digits = 1, caption = "AFRICA TIME SERIES")
AFRICA TIME SERIES
Area Year Pop_Total Pop_M_Pct Pop_F_Pct Mig_Tot_Pct Mig_M_Pct Mig_F_Pct Mig_MF_Pct
AFRICA 1990 634567 49.8 50.2 2.5 2.6 2.3 12.4
AFRICA 1995 722922 49.8 50.2 2.3 2.4 2.1 12.2
AFRICA 2000 817566 49.8 50.2 1.8 1.9 1.7 13.9
AFRICA 2005 924758 49.8 50.2 1.7 1.8 1.5 20.0
AFRICA 2010 1049446 49.9 50.1 1.6 1.7 1.5 17.1
AFRICA 2015 1194370 49.9 50.1 2.0 2.1 1.8 13.4
AFRICA 2017 1256268 49.9 50.1 2.0 2.1 1.8 12.4
df_brief %>% filter(Area == "ASIA") %>% kable(digits = 1, caption = "ASIA TIME SERIES")
ASIA TIME SERIES
Area Year Pop_Total Pop_M_Pct Pop_F_Pct Mig_Tot_Pct Mig_M_Pct Mig_F_Pct Mig_MF_Pct
ASIA 1990 3221342 51.1 48.9 1.5 1.6 1.4 10.1
ASIA 1995 3489306 51.1 48.9 1.3 1.4 1.3 10.3
ASIA 2000 3730371 51.1 48.9 1.3 1.4 1.2 11.6
ASIA 2005 3964343 51.1 48.9 1.3 1.4 1.2 15.7
ASIA 2010 4194425 51.2 48.8 1.6 1.8 1.4 28.0
ASIA 2015 4419898 51.2 48.8 1.7 1.9 1.5 28.1
ASIA 2017 4504428 51.2 48.8 1.8 2.0 1.5 29.7
df_brief %>% filter(Area == "EUROPE") %>% kable(digits = 1, caption = "EUROPE TIME SERIES")
EUROPE TIME SERIES
Area Year Pop_Total Pop_M_Pct Pop_F_Pct Mig_Tot_Pct Mig_M_Pct Mig_F_Pct Mig_MF_Pct
EUROPE 1990 721699 48.2 51.8 6.8 6.9 6.8 1.9
EUROPE 1995 728085 48.2 51.8 7.3 7.3 7.2 1.3
EUROPE 2000 727201 48.2 51.8 7.7 7.8 7.7 0.8
EUROPE 2005 730290 48.2 51.8 8.7 8.7 8.6 0.7
EUROPE 2010 737164 48.2 51.7 9.6 9.6 9.6 -0.2
EUROPE 2015 740814 48.3 51.7 10.1 10.0 10.1 -1.6
EUROPE 2017 742074 48.3 51.7 10.5 10.4 10.6 -1.3
df_brief %>% filter(Area == "LATIN AMERICA AND THE CARIBBEAN") %>% kable(digits = 1, caption = "LATAM TIME SERIES")
LATAM TIME SERIES
Area Year Pop_Total Pop_M_Pct Pop_F_Pct Mig_Tot_Pct Mig_M_Pct Mig_F_Pct Mig_MF_Pct
LATIN AMERICA AND THE CARIBBEAN 1990 445919 49.6 50.3 1.6 1.6 1.6 2.0
LATIN AMERICA AND THE CARIBBEAN 1995 486863 49.6 50.4 1.4 1.4 1.4 1.8
LATIN AMERICA AND THE CARIBBEAN 2000 525795 49.5 50.4 1.3 1.3 1.2 1.5
LATIN AMERICA AND THE CARIBBEAN 2005 561656 49.5 50.5 1.3 1.3 1.3 0.8
LATIN AMERICA AND THE CARIBBEAN 2010 597562 49.4 50.5 1.4 1.4 1.4 0.7
LATIN AMERICA AND THE CARIBBEAN 2015 632381 49.4 50.5 1.5 1.5 1.5 0.8
LATIN AMERICA AND THE CARIBBEAN 2017 645593 49.4 50.5 1.5 1.5 1.5 0.8
df_brief %>% filter(Area == "NORTHERN AMERICA") %>% kable(digits = 1, caption = "NORAM TIME SERIES")
NORAM TIME SERIES
Area Year Pop_Total Pop_M_Pct Pop_F_Pct Mig_Tot_Pct Mig_M_Pct Mig_F_Pct Mig_MF_Pct
NORTHERN AMERICA 1990 280345 49.0 50.9 9.8 9.8 9.9 -0.7
NORTHERN AMERICA 1995 295114 49.1 50.8 11.3 11.3 11.3 0.1
NORTHERN AMERICA 2000 312845 49.3 50.7 12.9 13.0 12.8 1.0
NORTHERN AMERICA 2005 327546 49.3 50.6 13.8 13.9 13.8 1.3
NORTHERN AMERICA 2010 342937 49.4 50.6 14.9 14.7 15.0 -1.9
NORTHERN AMERICA 2015 356004 49.5 50.5 15.7 15.4 16.0 -3.8
NORTHERN AMERICA 2017 361208 49.5 50.5 16.0 15.6 16.3 -3.9
df_brief %>% filter(Area == "OCEANIA") %>% kable(digits = 1, caption = "OCEANIA TIME SERIES")
OCEANIA TIME SERIES
Area Year Pop_Total Pop_M_Pct Pop_F_Pct Mig_Tot_Pct Mig_M_Pct Mig_F_Pct Mig_MF_Pct
OCEANIA 1990 27071 49.7 49.6 17.5 18.0 17.3 4.1
OCEANIA 1995 29185 49.6 49.6 17.2 17.5 17.2 2.0
OCEANIA 2000 31229 49.6 49.6 17.2 17.3 17.3 -0.5
OCEANIA 2005 33568 49.7 49.6 17.9 17.9 18.2 -1.7
OCEANIA 2010 36636 49.8 49.5 19.4 19.4 19.8 -1.9
OCEANIA 2015 39543 49.7 49.7 20.4 20.1 20.8 -3.3
OCEANIA 2017 40691 49.7 49.7 20.7 20.4 21.2 -4.0

Q2: What is the relative proportion of male vs. female migrants as a percentage of the overall male and female population? What areas show the largest disparity between male and female migrant populations?

From the tables above:

  • From the “WORLD TIME SERIES” table, male migrants as a proportion of the male population (3.5% in 2017) is greater across the world than female migrants as a proportion of the female population (3.3% in 2017). Furthermore, the relative difference between the two proportions has been growing, from 1.9% in 1990 to 4.9% in 2017. What this suggests is that the migrant population is becoming more heavily weighted toward male migrants than female migrants.

  • Just as for the first question, there are large regional differences in the male and female migrant proportions. For instance, Asia has a +30% relative difference between male and female migrant proportions (i.e., disproportionately more male than female migrants) in 2017, whereas Oceania has a -4% relative difference (i.e., slightly more female than male migrants) in 2017.

In terms of the areas with the largest disparities between male and female migrant populations, we can query the data frame based on population size (say > 10 million) and male-vs-female migrant percentage (say >= 50% or <= -20%). It is apparent that:

  • Countries / regions that have much higher proportions of male migrants than female migrants include countries that have large transient worker populations associated with certain industries (e.g., the petroleum industry in Saudi Arabia) or countries during war time (e.g., Iraq).

  • Countries that have much higher proportions of female migrants than male migrants include Nepal and some Eastern European countries (Poland, Romania), among others.

# what's the range of the male-female migrant disparity?
summary(df_brief$Mig_MF_Pct)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -57.928  -4.453   3.132   9.874  16.327 445.776     236
# which countries of 10MM or more have the highest disparity of men vs women migrants?
# higher male migrant % than female
df_brief %>% filter(Pop_Total > 10000, Mig_MF_Pct >= 50) %>% arrange(desc(Mig_MF_Pct), Area, Year) %>% kable(digits = 1, caption = "Areas with Greatest Disparity of Male > Female Migrant Percentage")
Areas with Greatest Disparity of Male > Female Migrant Percentage
Area Year Pop_Total Pop_M_Pct Pop_F_Pct Mig_Tot_Pct Mig_M_Pct Mig_F_Pct Mig_MF_Pct
Iraq 1990 17469 50.5 49.5 0.5 0.6 0.3 96.3
Saudi Arabia 2010 27426 56.2 43.8 30.7 38.5 20.8 85.6
United Republic of Tanzania 2005 39411 49.3 50.7 2.0 2.5 1.4 76.0
Saudi Arabia 2005 23906 55.7 44.3 27.2 33.6 19.2 75.3
Western Asia 2010 232738 51.9 48.1 13.2 16.5 9.5 73.4
Saudi Arabia 2015 31557 56.7 43.3 34.1 41.7 24.2 72.2
Saudi Arabia 2000 20764 54.5 45.5 25.3 31.1 18.5 68.5
South Africa 1990 37561 49.4 50.6 3.1 3.9 2.3 64.6
Saudi Arabia 2017 32938 57.1 42.9 37.0 44.4 27.2 63.2
Western Asia 2015 258124 52.2 47.8 15.4 18.9 11.6 63.0
Western Asia 2017 267660 52.3 47.7 16.0 19.6 12.1 62.9
South Africa 1995 42088 49.1 50.9 2.4 3.0 1.8 61.2
Iraq 2010 30763 50.5 49.5 0.4 0.5 0.3 60.7
Saudi Arabia 1995 18736 55.5 44.5 27.3 32.9 20.5 60.7
Southern Africa 1990 42818 49.3 50.7 3.3 4.0 2.5 57.0
Saudi Arabia 1990 16327 56.0 44.0 30.6 36.4 23.3 56.0
Southern Africa 1995 48035 49.1 50.9 2.5 3.0 2.0 55.2
Dominican Republic 2017 10767 49.8 50.2 3.9 4.8 3.1 55.2
Viet Nam 2005 84309 49.2 50.8 0.1 0.1 0.0 55.1
South Africa 2000 45728 49.1 50.9 2.2 2.7 1.7 55.0
Dominican Republic 2015 10528 49.8 50.2 3.9 4.8 3.1 55.0
Western Asia 2005 205731 51.3 48.7 10.5 12.6 8.2 53.8
Iraq 2005 27008 50.5 49.5 0.5 0.6 0.4 52.1
Czechia 2010 10536 49.1 50.9 3.8 4.6 3.0 50.9
# higher female migrant % than male
df_brief %>% filter(Pop_Total > 10000, Mig_MF_Pct <= -20) %>% arrange(Mig_MF_Pct, Area, Year) %>% kable(digits = 1, caption = "Areas with Greatest Disparity of Female > Male Migrant Percentage")
Areas with Greatest Disparity of Female > Male Migrant Percentage
Area Year Pop_Total Pop_M_Pct Pop_F_Pct Mig_Tot_Pct Mig_M_Pct Mig_F_Pct Mig_MF_Pct
Nepal 1990 18749 49.6 50.4 2.3 1.4 3.2 -57.9
Nepal 1995 21396 50.5 49.5 3.2 2.0 4.5 -55.0
Nepal 2017 29305 48.5 51.5 1.7 1.1 2.3 -53.2
Nepal 2015 28656 48.5 51.5 1.8 1.1 2.4 -53.0
Nepal 2010 27023 48.8 51.2 2.1 1.4 2.8 -48.9
Nepal 2000 23741 49.7 50.3 3.0 2.0 4.0 -48.6
Nepal 2005 25640 49.6 50.4 2.6 1.8 3.4 -46.2
Poland 2000 38550 48.5 51.5 2.1 1.8 2.5 -26.4
Poland 2010 38323 48.3 51.7 1.7 1.4 1.9 -26.3
Poland 2005 38363 48.4 51.6 1.9 1.6 2.2 -25.6
Romania 1990 23489 49.3 50.7 0.6 0.5 0.7 -24.3
Poland 1995 38453 48.6 51.4 2.5 2.2 2.8 -23.9
Uzbekistan 2000 24849 49.7 50.3 5.7 4.9 6.4 -23.4
Guatemala 1995 10408 49.3 50.7 1.5 1.3 1.7 -23.0
Poland 2015 38265 48.3 51.7 1.6 1.4 1.8 -22.7
Uzbekistan 1995 22863 49.7 50.3 6.6 5.8 7.4 -22.3
Cuba 2000 11151 50.2 49.8 0.2 0.1 0.2 -22.0
Poland 1990 37955 48.7 51.3 3.0 2.6 3.3 -21.3
Cuba 2005 11284 50.2 49.8 0.2 0.1 0.2 -21.0
Uzbekistan 1990 20462 49.5 50.5 8.1 7.1 9.0 -20.5