Visualizing human migration patterns with chord diagrams in R using the circlize package

Setting the environment

These are the R libraries we will need for this demo:
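The original chunk is not shown in this copy. Judging from the functions used later in the demo (`%>%`, `unnest()` and the chord diagram itself), a plausible set is dplyr, tidyr and circlize. This list is an assumption, not the author's exact code.

```r
# Assumed package list, inferred from the functions used later in this demo
library(dplyr)    # %>% and the data-manipulation verbs
library(tidyr)    # unnest() and reshaping helpers
library(circlize) # chordDiagram()
```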



Accessing the data

This demo uses the United Nations International Migration Dataset.
The data is openly available to download as an .xlsx file. The workbook has several sheets, and we will work with ‘Table 1: Total migrant stock at mid-year by origin and major area, region, country or area of destination’. The table was exported to .csv and uploaded to the author’s GitHub.
Access it directly via the raw link and load it into an R data.frame:

## [1] 1982  241
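The loading step above might look roughly like the sketch below. The raw link is not reproduced in this copy, so `csv_source` is a hypothetical placeholder; here it points at a tiny temporary file with made-up values so the snippet runs on its own. `read.csv()` accepts a URL in the same argument.

```r
# Hypothetical stand-in for the raw GitHub link, which is not shown in this copy.
# A tiny temporary CSV (made-up values) keeps the sketch runnable offline.
csv_source <- tempfile(fileext = ".csv")
writeLines(c("Year,Code,Total",
             "2019,1,100",
             "2019,2,200"), csv_source)

# read.csv() takes a local path or a URL;
# stringsAsFactors = FALSE keeps text columns as character
migrant_df <- read.csv(csv_source, stringsAsFactors = FALSE)

# Number of rows and columns, as in the check above
dim(migrant_df)
```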

There are several basic steps we will take to clean the data before we ‘tidy’ the data:

  1. The column names are a mess, so let’s fix them.
  2. Subset the data to work with the data from 2019.

The first ten column names, before and after the fix:
##  [1] "Year"                                              
##  [2] "Sort.order"                                        
##  [3] "Major.area..region..country.or.area.of.destination"
##  [4] "Notes"                                             
##  [5] "Code"                                              
##  [6] "Type.of.data..a."                                  
##  [7] "Total"                                             
##  [8] "Other South"                                       
##  [9] "Other North"                                       
## [10] "Afghanistan"
##  [1] "Year"        "Sort.order"  "DestArea"    "Notes"       "Code"       
##  [6] "DataType"    "Total"       "Other South" "Other North" "Afghanistan"
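A minimal sketch of these two cleaning steps, using dplyr on a toy data.frame. The toy values are made up; only the old and new column names are taken from the output above. The name `toy2019_df` is hypothetical.

```r
library(dplyr)

# Toy data frame with the same awkward names (values are made up for illustration)
toy_df <- data.frame(
  Year = c(2015, 2019, 2019),
  Major.area..region..country.or.area.of.destination = c("A", "A", "B"),
  Type.of.data..a. = c("B", "B", "C"),
  stringsAsFactors = FALSE
)

toy2019_df <- toy_df %>%
  # 1. give the messy columns short, readable names (new = old)
  rename(DestArea = Major.area..region..country.or.area.of.destination,
         DataType = Type.of.data..a.) %>%
  # 2. keep only the rows for 2019
  filter(Year == 2019)

names(toy2019_df)
```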

The column ‘DestArea’ contains heterogeneous data. Most rows are specific to a country, whereas a few contain information for a region. The rows for countries are grouped underneath the region they belong to.
Here we will disambiguate this information by creating new columns in the data.frame with region information for each country.

First we will make a data.frame that holds the information to relate country to region:

#data.frame that defines regions
subRegions_df <- data.frame( ID=c(910, 911, 913, 914, 912, 922, 5500, 5501, 906, 920, 915, 916, 931, 927, 928, 954, 957, 923, 924, 925, 926, 918), #ID = code for a region
                             Region=c("E_Africa", "Mid_Africa", "S_Africa", "W_Africa", "N_Africa", "W_Asia", "C_Asia", "S_Asia", "E_Asia", "SE_Asia", "Caribbean", "C_America", "S_America", "Australia_NZ", "Melanesia", "Micronesia", "Polynesia", "E_Europe", "N_Europe", "S_Europe", "W_Europe", "N_America")) 
subRegions_df$Code= list(c(108, 174, 262, 232, 231, 404, 450, 454, 480, 175, 508, 638, 646, 690, 706, 728, 800, 834, 894, 716),
        c(24, 120, 140, 148, 178, 180, 226, 266, 678), 
        c(72, 748, 426, 516, 710), 
        c(204, 854, 132, 384, 270, 288, 324, 624, 430, 466, 478, 562, 566, 654, 686, 694, 768), 
        c(12, 818, 434, 504, 729, 788, 732), 
        c(51, 31, 48, 196, 268, 368, 376, 400, 414, 422, 512, 634, 682, 275, 760, 792, 784, 887), 
        c(398, 417, 762, 795, 860), 
        c(4, 50, 64, 356, 364, 462, 524, 586, 144), 
        c(156, 344, 446, 408, 392, 496, 410), 
        c(96, 116, 360, 418, 458, 104, 608, 702, 764, 626, 704), 
        c(660, 28, 533, 44, 52, 92, 535, 136, 192, 531, 212, 214, 308, 312, 332, 388, 474, 500, 630, 659, 662, 670, 534, 780, 796, 850), 
        c(84, 188, 222, 320, 340, 484, 558, 591), 
        c(32, 68, 76, 152, 170, 218, 238, 254, 328, 600, 604, 740, 858, 862), 
        c(36, 554), 
        c(242, 540, 598, 90, 548), 
        c(316, 296, 584, 583, 520, 580, 585), 
        c(16, 184, 258, 570, 882, 772, 776, 798, 876), 
        c(112, 100, 203, 348, 616, 498, 642, 643, 703, 804), 
        c(830, 208, 233, 234, 246, 352, 372, 833, 428, 440, 578, 752, 826), 
        c(8, 20, 70, 191, 292, 300, 336, 380, 470, 499, 807, 620, 674, 688, 705, 724),
        c(40, 56, 250, 276, 438, 442, 492, 528, 756), 
        c(60, 124, 304, 666, 840)) #country codes for each region in "Region"

#use tidyr to unnest the lists to give each value a row
subRegions_df <- subRegions_df %>% unnest( Code )
head( subRegions_df )
## # A tibble: 6 x 3
##      ID Region    Code
##   <dbl> <fct>    <dbl>
## 1   910 E_Africa   108
## 2   910 E_Africa   174
## 3   910 E_Africa   262
## 4   910 E_Africa   232
## 5   910 E_Africa   231
## 6   910 E_Africa   404

Now join the information from subRegions_df to migrant2019_df.
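One way to do this join is `dplyr::left_join()` on the shared `Code` column. The sketch below assumes that `Code` is the key in both tables, and uses small toy stand-ins (`subRegions_toy`, `migrant2019_toy`, both hypothetical names) rather than the real data.

```r
library(dplyr)

# Toy stand-ins for the two tables named in the text
subRegions_toy <- data.frame(
  ID     = c(910, 910, 912),
  Region = c("E_Africa", "E_Africa", "N_Africa"),
  Code   = c(108, 174, 12),
  stringsAsFactors = FALSE
)
migrant2019_toy <- data.frame(
  DestArea = c("Burundi", "Comoros", "Algeria", "Aggregate_row"),
  Code     = c(108, 174, 12, 999), # 999 is a made-up code for a non-country row
  stringsAsFactors = FALSE
)

# left_join() keeps every migration row; rows whose Code is not in the
# lookup (for example regional aggregates) get NA for ID and Region
migrant2019_toy <- left_join(migrant2019_toy, subRegions_toy, by = "Code")
migrant2019_toy
```

A left join is used so that no migration rows are silently dropped; the unmatched rows can then be filtered out explicitly with `filter(!is.na(Region))` if only countries are wanted.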

Visualize UN Migration Data by Country

Data-to-viz.com has prepared an excellent tutorial on chord diagrams with circlize, which is used here as a template to plot the data:
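The plotting code is not shown in this copy. As a rough sketch of the general approach, `circlize::chordDiagram()` accepts a data.frame whose first two columns are the origin and destination and whose third column is the flow size. The toy flows below are made up, and the argument choices are illustrative rather than the author's settings.

```r
# Toy origin-destination flows (made-up counts) in the long format
# chordDiagram() accepts: origin, destination, value
flows_toy <- data.frame(
  from  = c("A", "A", "B", "C"),
  to    = c("B", "C", "C", "A"),
  value = c(10, 5, 8, 3),
  stringsAsFactors = FALSE
)

# Only draw if circlize is installed
if (requireNamespace("circlize", quietly = TRUE)) {
  circlize::circos.clear() # reset any previous circular layout
  # directional = 1 treats the first column as the source of each chord
  circlize::chordDiagram(flows_toy, directional = 1, transparency = 0.4)
}
```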

That does not look good at all.
There are far too many data chords to make sense of in this figure. Here the data is plotted again by region instead of by country:

Warning: you are about to see a for loop in R. I couldn’t find any other way ¯\_(ツ)_/¯

Tidy the data to find the values in terms of origin region instead of origin country:

  1. Make the data.frame long.
  2. Replace country strings with the appropriate region strings.
  3. Reshape the data to plot migration flow.

The first rows of the resulting table:
## # A tibble: 6 x 3
##   Region    DRegion       count
##   <fct>     <chr>         <dbl>
## 1 C_America Australia_NZ  21664
## 2 C_America Caribbean     14585
## 3 C_America C_America    641408
## 4 C_America C_Asia            0
## 5 C_America E_Africa          0
## 6 C_America E_Asia         2416
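The three tidying steps can be sketched on a toy table as follows. This version swaps the for loop mentioned above for a named-vector lookup, so it is an alternative illustration and not the author's code. The counts are made up, and `wide_toy`, `origin_lookup` and `region_flows` are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Toy wide table (made-up counts): one row per destination country,
# one column per origin country, plus the destination's region
wide_toy <- data.frame(
  DestArea = c("Kenya", "Egypt"),
  DRegion  = c("E_Africa", "N_Africa"),
  Kenya    = c(0, 4),
  Egypt    = c(7, 0),
  Sudan    = c(2, 9),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from origin country name to region
origin_lookup <- c(Kenya = "E_Africa", Egypt = "N_Africa", Sudan = "N_Africa")

region_flows <- wide_toy %>%
  # 1. make the data long: one row per (destination, origin) pair
  pivot_longer(cols = c(Kenya, Egypt, Sudan),
               names_to = "Origin", values_to = "count") %>%
  # 2. replace each origin country with its region
  mutate(Region = unname(origin_lookup[Origin])) %>%
  # 3. sum the counts for every origin-region / destination-region pair
  group_by(Region, DRegion) %>%
  summarise(count = sum(count), .groups = "drop")

region_flows
```

The result has the same three columns as the table above (`Region`, `DRegion`, `count`) and can be passed directly to `chordDiagram()`.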

In Closing

In this demo we created a chord diagram from the raw migration data provided by the United Nations. In the figure above, we can see some interesting trends: some regions appear to have more immigrants than emigrants (e.g. North America, Western Europe & Western Asia), whereas other regions have more emigrants (e.g. South East Asia & Central America).
The data needed many processing steps before a meaningful visualization could be generated. The R libraries ‘dplyr’ & ‘tidyr’ did most of our heavy lifting.