Why Bother?

In Spring 2021, I taught a course entitled “Telling Stories with Data” (TSD), which introduced non-STEM majors to the Tidyverse and basic data visualization and analysis. I had bright students, but ones who typically had NO prior experience with either statistical analysis or computer programming. So TSD was designed as a soft entry, beginner-level guide to working with data. For our stories, we started with various data sets available in R packages – the usual suspects – and then progressed to data sets in the wild.

Our capstone project required the students to use two data sets from the Gapminder.org foundation to build a dashboard and tell a coherent story with visualizations, summary stats, text contextualization and analyses, and at least one basic model or hypothesis test. (Three capstone project examples: Bonnie, Chanley, and Bethia).

library(tidyverse)
library(here)
library(visdat)

Choropleth Maps Made Easy?

For their capstone projects, nearly all the students wanted to include Choropleth maps. Understandable. The students were working with global data, and Choropleth maps both look impressive and are useful. But we ran into some problems.

Gapminder.org uses the nation-state as a primary unit of analysis: the country variable in their data sets. But they do NOT include the standard ISO country codes. The R ecosystem has various mapping tools, some well outside the Tidyverse, but all of which require remapping at least some of the Gapminder country names to the geo-data units; or, vice versa. So until we have all the primary units remapped, we get something like this:

load(here::here("data", "tidy_data", "cmap.rda") )
bad_ex

Just ggplot?

To simplify this as the course required, and to stay largely within the Tidyverse ecosystem so as to avoid cognitive overload, we went with the world map from ggplot2::map_data("world"). To ensure compatibility with the Gapminder.org data, we created a new data set with the geo-mapping information: world_map2. We fixed most of the flaws, and did “good enough” quick and dirty Choropleth maps – but I wanted to finish what we started.

If data was missing for a given nation for a given year, we wanted to know that. We also wanted our mapping data compatible across all the Gapminder.org data sets. We wanted a result more like this:

plotly::ggplotly(good_ex)
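
A minimal sketch of the behaviour we were after, using two toy square "countries" rather than real map data. The objects toy_map and toy_values are hypothetical stand-ins, not part of the project. The idea is that a left join from the map to the data keeps the polygons of countries without data, and ggplot2 can then fill them with a neutral na.value colour.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Toy polygons: two unit squares standing in for two countries (made-up data)
toy_map <- tribble(
  ~long, ~lat, ~group, ~order, ~country,
  0, 0, 1, 1, "Country A",
  1, 0, 1, 2, "Country A",
  1, 1, 1, 3, "Country A",
  0, 1, 1, 4, "Country A",
  2, 0, 2, 5, "Country B",
  3, 0, 2, 6, "Country B",
  3, 1, 2, 7, "Country B",
  2, 1, 2, 8, "Country B"
)

# Only Country A has a value; Country B is missing from the data
toy_values <- tibble(country = "Country A", value = 71.3)

# Left join FROM the map: every polygon row survives, unmatched rows get NA
toy_joined <- toy_map %>%
  left_join(toy_values, by = "country")

# Countries with NA values are drawn in grey rather than dropped
toy_plot <- ggplot(toy_joined,
                   aes(x = long, y = lat, group = group, fill = value)) +
  geom_polygon(colour = "white") +
  scale_fill_gradient(low = "lightblue", high = "darkblue",
                      na.value = "grey80") +
  coord_fixed()
```

Joining in the other direction (data to map) would silently drop any country that has no data, which is exactly the information we wanted to keep visible.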

The Project

We would not need to rely on matching names if we had a version of ggplot2::map_data("world") which was updated and contained the standard ISO country codes: Alpha-2 code, Alpha-3 code, and Numeric code. So that is what the data set world_map2 offers.

The remainder of this document describes the process of creating world_map2, our new data set directly derived from ggplot2::map_data("world"). The data set world_map2, this RMD, and a supporting case study are available at github.com/Thom-J-H/map_Gap_2_Tidy. I include in this document below all the steps and rationale involved for full transparency and in hopes that other people can improve upon this effort or offer a better solution for working with Global Studies data sets (like the Gapminder.org data) in the Tidyverse.

Two Core Problems

When using ggplot2::map_data("world") (hereafter world_map) with the Gapminder.org or other Global Studies data sets, we have two core problems. First, the names for country are not consistent across data sets. The informal names of the nations can vary greatly; the formal names are often too long for appropriate labeling and generally not even recorded. In the Gapminder.org data sets, which largely share a common source, for Sint Maarten, we have two values: “Sint Maarten” and “Sint Maarten (Dutch part)”. This is because the same island also contains the Collectivity of Saint Martin, more commonly known as the French “Saint Martin”. When we move from the Gapminder.org data sets to others, the country name values can vary greatly. In world_map, the preferred “Eswatini” is the older “Swaziland”; “North Macedonia” as of 2019, the older “Macedonia”; and so on.

The obvious solution to this problem of inconsistent country nomenclature is to use the ISO codes: the three-letter designation, the three-digit UN numeric code, or both. Neither the Gapminder.org data sets nor world_map does so.
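
To illustrate why codes help, here is a small sketch with two toy tables. The column name iso_a3 and the value column are assumptions made for the example; neither world_map nor the Gapminder files actually carry such a column, which is the gap this project sets out to fill.

```r
library(dplyr)
library(tibble)

# Toy map-side table: older names, as found in world_map
map_side <- tibble(
  region = c("Swaziland", "Macedonia", "Laos"),
  iso_a3 = c("SWZ", "MKD", "LAO")
)

# Toy data-side table: Gapminder-style names (values are made up)
data_side <- tibble(
  country = c("Eswatini", "North Macedonia", "Lao"),
  iso_a3  = c("SWZ", "MKD", "LAO"),
  value   = c(1, 2, 3)
)

# Joining on names finds no matches; joining on the code matches every row
by_name <- inner_join(map_side, data_side, by = c("region" = "country"))
by_code <- inner_join(map_side, data_side, by = "iso_a3")

nrow(by_name)
nrow(by_code)
```

With a shared code column, the display name becomes a labeling choice rather than the join key.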

What comprises a country?

Second, in practical terms, we have no simple definition of what comprises a country. As of 4 September 2020, Kosovo was recognized by 97 out of 193 (50.26%) United Nations member states; as of July 2021, Western Sahara was recognized by 45 out of a total of 193 United Nations member states. Both have region values with the corresponding polygon coordinates in world_map. They may or may not appear in various Global Studies data collections.

Likewise, we also have existing designations that do not distinguish clearly between geographical boundaries and political boundaries. Some of the Gapminder data sets, for example, report on the “Channel Islands”: more properly, the two Crown dependencies, the Bailiwick of Jersey, and the Bailiwick of Guernsey. But as Wikipedia correctly reports: “‘Channel Islands’ is a geographical term, not a political unit. The two bailiwicks have been administered separately since the late 13th century. Each has its own independent laws, elections, and representative bodies…. Any institution common to both is the exception rather than the rule.” Jersey, for example, is “a self-governing parliamentary democracy under a constitutional monarchy, with its own financial, legal and judicial systems, and the power of self-determination”. For mapping purposes, both Jersey (JE; JEY; 832) and Guernsey (GG; GGY; 831) have their own ISO codes. In truth, it makes more sense NOT to lump Jersey and Guernsey together for the purposes of economic, social, and public health data analysis, even if Gapminder.org and/or the World Bank did so for some data collections.

Conversely, but appropriately, world_map places Hong Kong and Macao in the subregion column as a region of China. This is geographically and politically correct, but for decades of practice continuing to the present, economic and public health data for Hong Kong and Macao have been gathered separately, not aggregated back to China. Each former city-state, now a “Special Administrative Region”, effectively has a country level status, as the Gapminder.org and other Global Studies data sets show.

Historical countries

Finally, on this point, some of the Gapminder data sets also include as a country value the dissolved Netherlands Antilles. If we keep this historical designation, which is needed for only a limited number of data analyses, we must otherwise ignore the data for the now separate constituent countries of Aruba and Curacao, as well as the political regroupings of the remaining islands. So although I want a mapping data set highly compatible with the Gapminder.org data sets, it should also work with any Global Studies data set. The value “Netherlands Antilles” will be dropped.

Ideally, world_map2 should not only work better with the Gapminder.org data set: it should be interoperable with all reasonably similar Global Studies data sets. So beyond adding, deleting and changing names, and in three cases, adding new polygon coordinates, world_map2 also contains (when they exist) the Alpha-2 code, Alpha-3 code, and Numeric code for each entity represented as a country value.

The Data Sets

The Gapminder.org data sets available for download are generally sourced to the World Bank and available under a Creative Commons Attribution 4.0 International license. They cover global trends with the nation-state, the variable country, as a primary level of analysis. The data is also organized chronologically, by year.

Between data sets, the names for countries are generally consistent. Some sets do cover more nations (and territories and sub-national units).

We will use four sets below to test differences in coverage, and to build a country names reference.

# Data sets from Gapminder.org --------------------------------------------
life_expectancy_years <- read_csv(here::here("data", 
                                             "raw_data", 
                                             "life_expectancy_years.csv") ,
                      show_col_types = FALSE)

total_fertility <- read_csv(here::here("data", 
                                       "raw_data",
                                       "children_per_woman_total_fertility.csv"),
                      show_col_types = FALSE)

energy_use_per_person <- read_csv(here::here("data", 
                                             "raw_data", 
                                             "energy_use_per_person.csv"),
                      show_col_types = FALSE)

demox_eiu <- read_csv(here::here("data", 
                                 "raw_data", 
                                 "demox_eiu.csv"),
                      show_col_types = FALSE)

Basic EDA

The Gapminder.org data sets are untidy, and in wide format: one row per country, one column per year. We’ll deal with those issues later. Depending on the primary variable of interest, we have a different range of nations and years covered. For example, the data set for Life Expectancy (years) has 189 designated countries; the data set for Total Fertility, 202 countries; the data set for Energy Use per Capita, 169 nations; and the data set for Democracy Index (EIU), 166 nations. But Total Fertility, to take one comparison, does not simply have 13 more listed countries than Life Expectancy (years): we have meaningful set differences in coverage between the sets.

life_expectancy_years %>% head()
## # A tibble: 6 x 302
##   country  `1800` `1801` `1802` `1803` `1804` `1805` `1806` `1807` `1808` `1809`
##   <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Afghani~   28.2   28.2   28.2   28.2   28.2   28.2   28.1   28.1   28.1   28.1
## 2 Angola     27     27     27     27     27     27     27     27     27     27  
## 3 Albania    35.4   35.4   35.4   35.4   35.4   35.4   35.4   35.4   35.4   35.4
## 4 Andorra    NA     NA     NA     NA     NA     NA     NA     NA     NA     NA  
## 5 United ~   30.7   30.7   30.7   30.7   30.7   30.7   30.7   30.7   30.7   30.7
## 6 Argenti~   33.2   33.2   33.2   33.2   33.2   33.2   33.2   33.2   33.2   33.2
## # ... with 291 more variables: 1810 <dbl>, 1811 <dbl>, 1812 <dbl>, 1813 <dbl>,
## #   1814 <dbl>, 1815 <dbl>, 1816 <dbl>, 1817 <dbl>, 1818 <dbl>, 1819 <dbl>,
## #   1820 <dbl>, 1821 <dbl>, 1822 <dbl>, 1823 <dbl>, 1824 <dbl>, 1825 <dbl>,
## #   1826 <dbl>, 1827 <dbl>, 1828 <dbl>, 1829 <dbl>, 1830 <dbl>, 1831 <dbl>,
## #   1832 <dbl>, 1833 <dbl>, 1834 <dbl>, 1835 <dbl>, 1836 <dbl>, 1837 <dbl>,
## #   1838 <dbl>, 1839 <dbl>, 1840 <dbl>, 1841 <dbl>, 1842 <dbl>, 1843 <dbl>,
## #   1844 <dbl>, 1845 <dbl>, 1846 <dbl>, 1847 <dbl>, 1848 <dbl>, 1849 <dbl>, ...
demox_eiu %>% vis_dat()

life_expectancy_years %>% 
  arrange(country) %>% 
  select(country)
## # A tibble: 189 x 1
##    country            
##    <chr>              
##  1 Afghanistan        
##  2 Albania            
##  3 Algeria            
##  4 Andorra            
##  5 Angola             
##  6 Antigua and Barbuda
##  7 Argentina          
##  8 Armenia            
##  9 Australia          
## 10 Austria            
## # ... with 179 more rows
demox_eiu %>% 
  arrange(country) %>% 
  select(country)
## # A tibble: 166 x 1
##    country    
##    <chr>      
##  1 Afghanistan
##  2 Albania    
##  3 Algeria    
##  4 Angola     
##  5 Argentina  
##  6 Armenia    
##  7 Australia  
##  8 Austria    
##  9 Azerbaijan 
## 10 Bahrain    
## # ... with 156 more rows
life_expectancy_years$country %>% 
  n_distinct()
## [1] 189
energy_use_per_person$country %>% 
  n_distinct() 
## [1] 169
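
As a preview of the tidying step, the sketch below reshapes a toy table with the same layout as the files above (a country column followed by one column per year) into one row per country and year. The object names and the life_expectancy column name are hypothetical; the real files have many more year columns.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy stand-in for a Gapminder file: one row per country, one column per year
toy_wide <- tribble(
  ~country,      ~`1800`, ~`1801`, ~`1802`,
  "Afghanistan",    28.2,    28.2,    28.2,
  "Andorra",          NA,      NA,      NA
)

# Gather the year columns into a year/value pair; keep NA rows so that
# missing data stays visible
toy_long <- toy_wide %>%
  pivot_longer(cols = -country,
               names_to = "year",
               values_to = "life_expectancy") %>%
  mutate(year = as.integer(year))

toy_long
```

Keeping the NA rows (rather than passing values_drop_na = TRUE) is a deliberate choice here, since the maps are meant to show where data is missing.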

Key Differences

Let’s explore the similarities and differences in coverage for the country variable between the sets.

### Fertility  vs. Life coverage
setdiff(total_fertility$country, life_expectancy_years$country) %>% 
  enframe(name = NULL, value = "diff") %>% 
  knitr::kable(caption = "Fertility  vs. Life coverage" ,
               row.names = TRUE)
Fertility vs. Life coverage
diff
1 Aruba
2 Netherlands Antilles
3 Channel Islands
4 Western Sahara
5 Guadeloupe
6 Greenland
7 French Guiana
8 Guam
9 Macao, China
10 Martinique
11 Mayotte
12 New Caledonia
13 Puerto Rico
14 French Polynesia
15 Reunion
16 Virgin Islands (U.S.)
### Life vs. Fertility  coverage
setdiff(life_expectancy_years$country, total_fertility$country)  %>% 
  enframe(name = NULL, value = "diff") %>% 
  knitr::kable(caption = "Life vs. Fertility  coverage" ,
               row.names = TRUE)
Life vs. Fertility coverage
diff
1 Andorra
2 Dominica
3 Marshall Islands
### Fertility  vs. Energy coverage
setdiff(total_fertility$country, energy_use_per_person$country)  %>% 
  enframe(name = NULL, value = "diff") %>% 
  knitr::kable(caption = "Fertility  vs. Energy coverage",
               row.names = TRUE)
Fertility vs. Energy coverage
diff
1 Aruba
2 Afghanistan
3 Netherlands Antilles
4 Burundi
5 Burkina Faso
6 Central African Republic
7 Channel Islands
8 Western Sahara
9 Micronesia, Fed. Sts.
10 Guinea
11 Guadeloupe
12 Greenland
13 French Guiana
14 Guam
15 Hong Kong, China
16 Lao
17 Liberia
18 Macao, China
19 Madagascar
20 Mali
21 Mauritania
22 Martinique
23 Malawi
24 Mayotte
25 New Caledonia
26 Papua New Guinea
27 Puerto Rico
28 Palestine
29 French Polynesia
30 Reunion
31 Rwanda
32 Sierra Leone
33 Somalia
34 Chad
35 Taiwan
36 Uganda
37 Virgin Islands (U.S.)
### Energy vs. Fertility coverage
setdiff(energy_use_per_person$country, total_fertility$country) %>% 
  enframe(name = NULL, value = "diff") %>% 
  knitr::kable(caption = "Energy vs. Fertility coverage",
               row.names = TRUE)
Energy vs. Fertility coverage
diff
1 Dominica
2 Marshall Islands
3 Palau
4 St. Kitts and Nevis
### Life vs. Energy coverage
setdiff(life_expectancy_years$country, energy_use_per_person$country) %>% 
  enframe(name = NULL, value = "diff") %>% 
  knitr::kable(caption = "Life vs. Energy coverage",
               row.names = TRUE)
Life vs. Energy coverage
diff
1 Afghanistan
2 Andorra
3 Burundi
4 Burkina Faso
5 Central African Republic
6 Micronesia, Fed. Sts.
7 Guinea
8 Hong Kong, China
9 Lao
10 Liberia
11 Madagascar
12 Mali
13 Mauritania
14 Malawi
15 Papua New Guinea
16 Palestine
17 Rwanda
18 Sierra Leone
19 Somalia
20 Chad
21 Taiwan
22 Uganda
### Energy  vs. Life coverage
setdiff(energy_use_per_person$country,life_expectancy_years$country) %>% 
  enframe(name = NULL, value = "diff") %>% 
  knitr::kable(caption = "Energy  vs. Life coverage",
               row.names = TRUE)
Energy vs. Life coverage
diff
1 Palau
2 St. Kitts and Nevis
### Energy  vs. Democracy coverage
setdiff(energy_use_per_person$country,demox_eiu$country) %>% 
  enframe(name = NULL, value = "diff") %>% 
  knitr::kable(caption = "Energy  vs. Democracy coverage",
               row.names = TRUE)
Energy vs. Democracy coverage
diff
1 Antigua and Barbuda
2 Bahamas
3 Barbados
4 Belize
5 Brunei
6 Dominica
7 Georgia
8 Grenada
9 Kiribati
10 Maldives
11 Marshall Islands
12 Palau
13 Samoa
14 Sao Tome and Principe
15 Seychelles
16 Solomon Islands
17 South Sudan
18 St. Kitts and Nevis
19 St. Lucia
20 St. Vincent and the Grenadines
21 Tonga
22 Vanuatu
### Democracy vs. Energy coverage
setdiff(demox_eiu$country,energy_use_per_person$country) %>% 
  enframe(name = NULL, value = "diff") %>%
  knitr::kable(caption = "Democracy vs. Energy coverage",
               row.names = TRUE)
Democracy vs. Energy coverage
diff
1 Afghanistan
2 Burundi
3 Burkina Faso
4 Central African Republic
5 Guinea
6 Hong Kong, China
7 Lao
8 Liberia
9 Madagascar
10 Mali
11 Mauritania
12 Malawi
13 Papua New Guinea
14 Palestine
15 Rwanda
16 Sierra Leone
17 Chad
18 Taiwan
19 Uganda
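
The repeated setdiff() and enframe() pattern above could be wrapped in a small helper. The function name coverage_diff is hypothetical, and the sort() call is an addition of this sketch (the tables above are unsorted).

```r
library(tibble)

# Hypothetical helper: values present in `x` but absent from `y`,
# returned as a sorted one-column tibble
coverage_diff <- function(x, y) {
  enframe(sort(setdiff(x, y)), name = NULL, value = "diff")
}

# Toy vectors standing in for two `country` columns
set_a <- c("Aruba", "Chad", "Guam")
set_b <- c("Chad", "Palau")

coverage_diff(set_a, set_b)  # in set_a only
coverage_diff(set_b, set_a)  # in set_b only
```

Because setdiff() is not symmetric, each pair of data sets needs both directions, as in the tables above.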

Overall, we have a fairly complete vector of country values. Since the law of diminishing returns has set in for our Gapminder data set comparisons, let’s build a country name list (a data frame, actually) to test against our map data coverage.

Gapminder Country Reference List

# Gapminder Country Name Reference DF -------------------------------------

country_names <- demox_eiu %>% 
  select(country) %>% 
  full_join(energy_use_per_person, by = "country") %>%
  select(country) %>%
  full_join(total_fertility, by = "country") %>%
  select(country)  %>% 
  full_join(life_expectancy_years, by = "country") %>% 
  select(country) %>% 
  arrange(country) 

## Current working total
country_names$country %>%
  n_distinct() 
## [1] 207
country_names %>% head(n = 10) %>% 
  knitr::kable(caption = "First Ten Country Designations",
               row.names = TRUE)
First Ten Country Designations
country
1 Afghanistan
2 Albania
3 Algeria
4 Andorra
5 Angola
6 Antigua and Barbuda
7 Argentina
8 Armenia
9 Aruba
10 Australia
country_names %>% tail(n = 10) %>% 
  knitr::kable(caption = "Last Ten Country Designations",
               row.names = TRUE)
Last Ten Country Designations
country
1 Uruguay
2 Uzbekistan
3 Vanuatu
4 Venezuela
5 Vietnam
6 Virgin Islands (U.S.)
7 Western Sahara
8 Yemen
9 Zambia
10 Zimbabwe

So at this point we have 207 unique country level units of analysis. Please note that some country designations are better understood as regions within a nation-state, or as overseas territories belonging to a nation-state, rather than as distinct nation-states as recognized by the United Nations or the international community.
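
For comparison, the same kind of reference list can be sketched with base R set operations instead of chained joins. This is an alternative under toy data, not the method used above; the vectors c1, c2, and c3 are hypothetical stand-ins for the country columns.

```r
library(tibble)

# Toy stand-ins for the `country` columns of three data sets
c1 <- c("Chad", "Aruba")
c2 <- c("Chad", "Palau")
c3 <- c("Aruba", "Zambia")

# Fold union() over the list, then sort and wrap as a one-column tibble
toy_country_names <- tibble(
  country = sort(Reduce(union, list(c1, c2, c3)))
)

toy_country_names
```

Either approach yields each distinct name once; the join version has the advantage of staying within dplyr verbs the students already knew.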

WorldMap Regions

In the data set world_map, derived from ggplot2::map_data("world"), the region variable generally corresponds with the Gapminder country variable: but it can also define geographical rather than political entities. We need to dig into the map data subregion to obtain a proper match with country.

Let’s have a look.

world_map <- ggplot2::map_data("world")

world_map %>% vis_dat()

## Basic unit is region; subregion mostly NA
world_map %>% glimpse()
## Rows: 99,338
## Columns: 6
## $ long      <dbl> -69.89912, -69.89571, -69.94219, -70.00415, -70.06612, -70.0~
## $ lat       <dbl> 12.45200, 12.42300, 12.43853, 12.50049, 12.54697, 12.59707, ~
## $ group     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ order     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, 1~
## $ region    <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba~
## $ subregion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## group or groups belong to regions
## order refers to the long and lat coordinates for mapping
## long == longitude lat == latitude

world_map %>% 
  skimr::skim(region, subregion)
Data summary
Name Piped data
Number of rows 99338
Number of columns 6
_______________________
Column type frequency:
character 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
region 0 1.00 2 35 0 252 0
subregion 63154 0.36 1 33 0 1069 0
## more regions than gapminder countries 
## difference in emphasis
## subregion contains some units which gapminder treats as a country

The ggplot2 map data for a global Mercator projection uses region as the primary unit. A region can have subregions, but always consists of at least one group. The group marks out the polygon to be mapped and filled, by longitude (long) and latitude (lat) coordinates drawn in the sequence given by the order variable: some regions and even subregions require multiple groups to draw the appropriate shape. The subregion “Hong Kong”, for example, has three distinct groups: 668, 669, and 670.
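
The roles of group and order can be seen in a toy example: one made-up region, "Toyland", consisting of two triangular "islands". All data here is invented for illustration.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Toy region made of two triangles, each with its own group id
toy_region <- tribble(
  ~long, ~lat, ~group, ~order, ~region,
  0,     0,    1,      1,      "Toyland",
  1,     0,    1,      2,      "Toyland",
  0.5,   1,    1,      3,      "Toyland",
  2,     0,    2,      4,      "Toyland",
  3,     0,    2,      5,      "Toyland",
  2.5,   1,    2,      6,      "Toyland"
)

# One region, two groups, three vertices each
toy_group_counts <- toy_region %>% count(region, group)

# The group aesthetic keeps the two outlines separate; rows are drawn
# in the sequence they appear, so we sort by order first
toy_islands <- toy_region %>%
  arrange(order) %>%
  ggplot(aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey70", colour = "black") +
  coord_fixed()
```

Without the group aesthetic, geom_polygon() would connect all six points into a single shape, which is why any restacking of map rows has to preserve both group and order.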

Tools for Matching

In the majority of cases, we have either an existing or an obvious match between the Gapminder country and the map data region variables. But for that minority, we need to dig through the vectors. Below are some tools for that task.

### Check for example South Sudan
world_map %>% 
  filter(stringr::str_detect(region, "Sudan") ) %>% 
  distinct(region) 
##        region
## 1       Sudan
## 2 South Sudan
### Check for example South Sudan
world_map %>% 
  filter(stringr::str_detect(region, "South") ) %>% 
  distinct(region) 
##                                region
## 1 French Southern and Antarctic Lands
## 2                         South Korea
## 3                         South Sudan
## 4              South Sandwich Islands
## 5                       South Georgia
## 6                        South Africa
### Check for example South Sudan
country_names %>% 
  filter(stringr::str_detect(country, "Sudan") ) 
## # A tibble: 2 x 1
##   country    
##   <chr>      
## 1 South Sudan
## 2 Sudan
### Check for example South Sudan
country_names %>% 
  filter(stringr::str_detect(country, "South") )
## # A tibble: 3 x 1
##   country     
##   <chr>       
## 1 South Africa
## 2 South Korea 
## 3 South Sudan
### Check for example Hong Kong
world_map %>% 
  filter(stringr::str_detect(region, "Hong Kong") ) %>% 
  distinct(region)  # NO!
## [1] region
## <0 rows> (or 0-length row.names)
### Check for example Hong Kong
world_map %>% 
  filter(stringr::str_detect(subregion, "Hong Kong") ) %>% 
  distinct(region, subregion)  # YES!
##   region subregion
## 1  China Hong Kong
### Group IDs for coordinates data
world_map %>% 
  filter(stringr::str_detect(subregion, "Hong Kong") ) %>%
  select(group) %>% 
  distinct()
##   group
## 1   668
## 2   669
## 3   670
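
The region and subregion searches above could be combined into one helper. The function find_place and the toy table below are hypothetical; the real rows would come from world_map.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy slice of map data with the same two name columns as world_map
toy_names <- tribble(
  ~region,       ~subregion,
  "China",       "Hong Kong",
  "China",       NA,
  "South Sudan", NA,
  "Sudan",       NA
)

# Hypothetical helper: look for a pattern in either column.
# str_detect() returns NA for a missing subregion, so coalesce() turns
# those into FALSE before the two tests are combined.
find_place <- function(map_df, pattern) {
  map_df %>%
    filter(str_detect(region, pattern) |
             coalesce(str_detect(subregion, pattern), FALSE)) %>%
    distinct(region, subregion)
}

find_place(toy_names, "Hong Kong")
find_place(toy_names, "Sudan")
```

Searching both columns at once avoids the false negative seen above, where "Hong Kong" is absent from region but present in subregion.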

Map vs. Gap

Let’s identify the mismatches and work to reconcile as many as possible.

####Identify key differences --------------------------------------

map_vs_gap <- setdiff(world_map$region, country_names$country) %>%
  enframe(name = NULL, value = "desn") %>% 
  arrange(desn)

gap_vs_map <- setdiff(country_names$country, world_map$region) %>%
  enframe(name = NULL, value = "desn") %>% 
  arrange(desn)

map_vs_gap %>% 
  knitr::kable(caption = "Map regions vs. Gap countries: Coverage diff",
               row.names = TRUE)
Map regions vs. Gap countries: Coverage diff
desn
1 American Samoa
2 Anguilla
3 Antarctica
4 Antigua
5 Ascension Island
6 Azores
7 Barbuda
8 Bermuda
9 Bonaire
10 Canary Islands
11 Cayman Islands
12 Chagos Archipelago
13 Christmas Island
14 Cocos Islands
15 Cook Islands
16 Curacao
17 Democratic Republic of the Congo
18 Falkland Islands
19 Faroe Islands
20 French Southern and Antarctic Lands
21 Grenadines
22 Guernsey
23 Heard Island
24 Isle of Man
25 Ivory Coast
26 Jersey
27 Kosovo
28 Kyrgyzstan
29 Laos
30 Liechtenstein
31 Macedonia
32 Madeira Islands
33 Micronesia
34 Monaco
35 Montserrat
36 Nauru
37 Nevis
38 Niue
39 Norfolk Island
40 Northern Mariana Islands
41 Pitcairn Islands
42 Republic of Congo
43 Saba
44 Saint Barthelemy
45 Saint Helena
46 Saint Kitts
47 Saint Lucia
48 Saint Martin
49 Saint Pierre and Miquelon
50 Saint Vincent
51 San Marino
52 Siachen Glacier
53 Sint Eustatius
54 Sint Maarten
55 Slovakia
56 South Georgia
57 South Sandwich Islands
58 Swaziland
59 Tobago
60 Trinidad
61 Turks and Caicos Islands
62 UK
63 USA
64 Vatican
65 Virgin Islands
66 Wallis and Futuna
gap_vs_map  %>% 
  knitr::kable(caption = "Gap countries vs. Map regions: Coverage diff",
               row.names = TRUE)
Gap countries vs. Map regions: Coverage diff
desn
1 Antigua and Barbuda
2 Channel Islands
3 Congo, Dem. Rep.
4 Congo, Rep.
5 Cote d’Ivoire
6 Eswatini
7 Hong Kong, China
8 Kyrgyz Republic
9 Lao
10 Macao, China
11 Micronesia, Fed. Sts.
12 Netherlands Antilles
13 North Macedonia
14 Slovak Republic
15 St. Kitts and Nevis
16 St. Lucia
17 St. Vincent and the Grenadines
18 Trinidad and Tobago
19 United Kingdom
20 United States
21 Virgin Islands (U.S.)

When going from the map region variable to Gapminder country values, we find 66 differences. Some of these are geographical entities or national subregions or overseas territories that we would not expect to find considered in the Gapminder data. Others are simple mismatches easily reconciled. Another group is a bit more tricky for coding, but logically straightforward. For example, the country is “Trinidad and Tobago”: the two primary geographical entities are islands “Trinidad” and “Tobago”, both region values in the map data.

When going from the Gapminder country to the map region values, our primary concern for reconciliation, we find 21 differences. These break down into four rough categories: 1. Easy Cases, 2. Island Nations, 3. Subregion Promotion, and 4. Do Not Restore.

1. Easy Cases

Dealing with the “Easy Cases”, the first group of mismatches, is straightforward.

### Easy cases -- see tools above for digging out names

world_map2 <- world_map %>% 
  rename(country = region) %>%
  mutate(country = case_when(country == "Macedonia" ~ "North Macedonia" ,
                             country == "Ivory Coast"  ~ "Cote d'Ivoire",
                             country == "Democratic Republic of the Congo"  ~ "Congo, Dem. Rep.",
                             country == "Republic of Congo" ~  "Congo, Rep.",
                             country == "UK" ~  "United Kingdom",
                             country == "USA" ~  "United States",
                             country == "Laos" ~  "Lao",
                             country == "Slovakia" ~  "Slovak Republic",
                             country == "Saint Lucia" ~  "St. Lucia",
                             country == "Kyrgyzstan"  ~  "Kyrgyz Republic",
                             country == "Micronesia" ~ "Micronesia, Fed. Sts.",
                             country == "Swaziland"  ~ "Eswatini", 
                             country == "Virgin Islands"  ~ "Virgin Islands (U.S.)", 
                             TRUE ~ country))





### Progress check
setdiff(country_names$country, world_map2$country)  %>% 
  enframe(name = NULL, value = "diff") %>% 
  knitr::kable(caption = "Remaining Cases", 
               row.names = TRUE)
Remaining Cases
diff
1 Antigua and Barbuda
2 Channel Islands
3 Hong Kong, China
4 Macao, China
5 Netherlands Antilles
6 St. Kitts and Nevis
7 St. Vincent and the Grenadines
8 Trinidad and Tobago

We now have eight remaining cases.
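
As the list of renames grows, a lookup table joined onto the map data may be easier to maintain than a long case_when(). The sketch below uses three of the pairs from above on a toy table; name_lookup and the toy objects are hypothetical names, and this is an alternative rather than the approach taken in this project.

```r
library(dplyr)
library(tibble)

# Lookup table: map name on the left, Gapminder name on the right
name_lookup <- tribble(
  ~map_name,   ~gap_name,
  "UK",        "United Kingdom",
  "USA",       "United States",
  "Swaziland", "Eswatini"
)

# Toy stand-in for the renamed map data
toy_countries <- tibble(country = c("UK", "France", "Swaziland"))

# Attach the new name where one exists, otherwise keep the old one
toy_countries2 <- toy_countries %>%
  left_join(name_lookup, by = c("country" = "map_name")) %>%
  mutate(country = coalesce(gap_name, country)) %>%
  select(-gap_name)

toy_countries2
```

The lookup table could also be saved alongside the data, so the full set of renames is documented in one place.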

2. Island Nations

The cases of Antigua and Barbuda, St. Kitts and Nevis, Trinidad and Tobago, and St. Vincent and the Grenadines are all similar: combine the related map region designations to the appropriate new country designation. In each instance, we can re-organize the existing group, order, long and lat values under the new country value.

## Get data for Island nations
match_names <- c("Antigua" , "Barbuda", "Nevis", 
                 "Saint Kitts", "Trinidad" , 
                 "Tobago", "Grenadines" , "Saint Vincent")

### Island nations data set
map_match <- world_map2 %>% 
  filter(country %in% match_names) 

map_match %>% distinct(country)
##         country
## 1       Antigua
## 2       Barbuda
## 3         Nevis
## 4   Saint Kitts
## 5      Trinidad
## 6        Tobago
## 7    Grenadines
## 8 Saint Vincent
### Group IDs for the countries
ant_bar <- c(137 ,138 )
kit_nev <- c(930 , 931)
tri_tog <- c(1425, 1426)
vin_gre <- c(1575, 1576, 1577)
# chan_isl <- c(594, 861)
# neth_ant <- c(1055, 1056)

new_names_ref <- c("Antigua and Barbuda", "St. Kitts and Nevis",
                   "Trinidad and Tobago", "St. Vincent and the Grenadines")


### assign new country names to match Gapminder
map_match <- map_match %>% 
  mutate(country = case_when(group %in% ant_bar ~ "Antigua and Barbuda" ,
                             group %in% kit_nev  ~ "St. Kitts and Nevis" ,
                             group %in% tri_tog  ~ "Trinidad and Tobago" ,
                             group %in% vin_gre ~ "St. Vincent and the Grenadines") 
  ) %>% 
  tibble()

### Quick checks

map_match %>% head()
## # A tibble: 6 x 6
##    long   lat group order country             subregion
##   <dbl> <dbl> <dbl> <int> <chr>               <chr>    
## 1 -61.7  17.0   137  7243 Antigua and Barbuda <NA>     
## 2 -61.7  17.0   137  7244 Antigua and Barbuda <NA>     
## 3 -61.9  17.0   137  7245 Antigua and Barbuda <NA>     
## 4 -61.9  17.1   137  7246 Antigua and Barbuda <NA>     
## 5 -61.9  17.1   137  7247 Antigua and Barbuda <NA>     
## 6 -61.8  17.2   137  7248 Antigua and Barbuda <NA>
map_match %>% 
  distinct(country)%>% 
  knitr::kable(caption = "Add to World Map")
Add to World Map
country
Antigua and Barbuda
St. Kitts and Nevis
Trinidad and Tobago
St. Vincent and the Grenadines
map_match %>% 
  group_by(country) %>% 
  count(group)  %>% 
  knitr::kable(caption = "Add to World Map")
Add to World Map
country group n
Antigua and Barbuda 137 12
Antigua and Barbuda 138 10
St. Kitts and Nevis 930 7
St. Kitts and Nevis 931 13
St. Vincent and the Grenadines 1575 16
St. Vincent and the Grenadines 1576 23
St. Vincent and the Grenadines 1577 10
Trinidad and Tobago 1425 30
Trinidad and Tobago 1426 8
#### Structure check for merge
map_match %>% 
  str()
## tibble [129 x 6] (S3: tbl_df/tbl/data.frame)
##  $ long     : num [1:129] -61.7 -61.7 -61.9 -61.9 -61.9 ...
##  $ lat      : num [1:129] 17 17 17 17.1 17.1 ...
##  $ group    : num [1:129] 137 137 137 137 137 137 137 137 137 137 ...
##  $ order    : int [1:129] 7243 7244 7245 7246 7247 7248 7249 7250 7251 7252 ...
##  $ country  : chr [1:129] "Antigua and Barbuda" "Antigua and Barbuda" "Antigua and Barbuda" "Antigua and Barbuda" ...
##  $ subregion: chr [1:129] NA NA NA NA ...
world_map2 %>% 
  str()
## 'data.frame':    99338 obs. of  6 variables:
##  $ long     : num  -69.9 -69.9 -69.9 -70 -70.1 ...
##  $ lat      : num  12.5 12.4 12.4 12.5 12.5 ...
##  $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ country  : chr  "Aruba" "Aruba" "Aruba" "Aruba" ...
##  $ subregion: chr  NA NA NA NA ...
#### Time to Slice, Dice, and Restack

world_map2 <-  world_map2 %>%
  filter(!country %in% match_names)


world_map2 <- world_map2 %>% 
  bind_rows(map_match) %>%
  arrange(country)  %>%
  tibble()

### Safety check -- should return empty set
world_map2 %>% 
  filter(country %in% match_names)
## # A tibble: 0 x 6
## # ... with 6 variables: long <dbl>, lat <dbl>, group <dbl>, order <int>,
## #   country <chr>, subregion <chr>
### Safety check - should return one complete row each
world_map2 %>% 
  filter(country %in% new_names_ref) %>%
  group_by(country) %>%
  slice_max(order, n = 1)
## # A tibble: 4 x 6
## # Groups:   country [4]
##    long   lat group order country                        subregion
##   <dbl> <dbl> <dbl> <int> <chr>                          <chr>    
## 1 -61.7  17.6   138  7265 Antigua and Barbuda            <NA>     
## 2 -62.6  17.2   931 58081 St. Kitts and Nevis            <NA>     
## 3 -61.2  13.2  1577 98189 St. Vincent and the Grenadines <NA>     
## 4 -60.8  11.2  1426 89453 Trinidad and Tobago            <NA>

Safety checks passed. The island nations are now included in world_map2.
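
A possible alternative, sketched here on toy rows, is to relabel the islands by name in place rather than by group ID. This avoids hard-coding the group numbers, on the assumption that those numbers could differ if the underlying map data changes. It is not the method used above, and the toy table is invented.

```r
library(dplyr)
library(tibble)

# Toy rows standing in for the island regions in the map data
toy_island_rows <- tibble(
  country = c("Antigua", "Barbuda", "Trinidad", "Tobago", "France"),
  group   = c(1, 2, 3, 4, 5)
)

# Relabel by name; group (and, in real data, order) are left untouched,
# so each island is still drawn as its own polygon
toy_merged <- toy_island_rows %>%
  mutate(country = case_when(
    country %in% c("Antigua", "Barbuda") ~ "Antigua and Barbuda",
    country %in% c("Trinidad", "Tobago") ~ "Trinidad and Tobago",
    TRUE                                 ~ country
  ))

toy_merged %>% count(country)
```

Because the rows never leave the data frame, no slice and restack step is needed, though the explicit version above makes each intermediate result easier to inspect.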

3. Subregion Promotion

The cases of Macao, China, and Hong Kong, China differ again: in the map data set, each is a subregion of the region China. But economic and public health data for both former city-states, now Special Administrative Regions in China, have for decades been treated separately from those of mainland China (PRC), and continue to be. Each, for the purposes of Global Studies, has country level status (which is not the same as nation-state status). So we should follow practice and treat them as country-level entities in terms of the map data set.

####
### Hong Kong and Macao
####  Pull from subregion; slice out; restack

sub_sleeps <- c("Hong Kong", "Macao")

hk_mc <- world_map2 %>% 
  filter(subregion %in% sub_sleeps)

hk_mc <- hk_mc %>%
  mutate(country = case_when(subregion == "Hong Kong" ~ "Hong Kong, China" ,
                             subregion == "Macao" ~ "Macao, China" ) )


### Safety check for bind_rows()
hk_mc %>% 
  slice(38:41) %>% 
  knitr::kable(caption = "Check structure")
Check structure
long lat group order country subregion
114.0067 22.48403 670 45801 Hong Kong, China Hong Kong
114.0154 22.51191 670 45802 Hong Kong, China Hong Kong
113.4789 22.19556 960 59893 Macao, China Macao
113.4810 22.21748 960 59894 Macao, China Macao
### Slice out old info
world_map2 <-   world_map2 %>%
  filter(!subregion %in% sub_sleeps)

### Stack in new info
world_map2 <- world_map2 %>% 
  bind_rows(hk_mc) %>%
  select(-subregion) %>% 
  tibble()


### Progress check
setdiff(country_names$country, world_map2$country)  %>% 
  enframe(name = NULL, value = "diff") %>% 
  knitr::kable(caption = "Remaining Cases", 
               row.names = TRUE)
Remaining Cases
diff
1 Channel Islands
2 Netherlands Antilles

We’ve added Hong Kong and Macao, and now have only two outstanding cases.
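
As a cross-check on the idea, the same promotion can be sketched in a single pass with a lookup vector, instead of pulling the rows out and stacking them back in. This is only an illustration on made-up data: `toy_map`, `promote`, and `toy_promoted` are hypothetical names, and the coordinates are placeholders rather than real polygons.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the map data (placeholder coordinates, not real polygons)
toy_map <- tribble(
  ~long, ~lat, ~group, ~order, ~country, ~subregion,
  114.0, 22.4,      1,      1, "China",  "Hong Kong",
  113.5, 22.2,      2,      2, "China",  "Macao",
  116.4, 39.9,      3,      3, "China",  NA,
    2.3, 48.9,      4,      4, "France", NA
)

# Lookup: subregion label -> country-level name used in the Gapminder data
promote <- c("Hong Kong" = "Hong Kong, China", "Macao" = "Macao, China")

# Overwrite country only where the subregion appears in the lookup
toy_promoted <- toy_map %>%
  mutate(country = if_else(subregion %in% names(promote),
                           unname(promote[subregion]),
                           country)) %>%
  select(-subregion)

toy_promoted %>% count(country)
```

One difference from the approach above: here the promoted rows stay where they were, rather than moving to the bottom of the data set. For drawing polygons, that should not matter as long as `group` and `order` are untouched.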

4. Do Not Restore

Finally, we have two cases we arguably should not reconcile. The Netherlands Antilles was dissolved in 2010. It consisted of the islands Curaçao, Bonaire, Aruba (until 1986), Saba, Sint Eustatius, and Sint Maarten. Aruba, which has a country designation in the Gapminder data, is “a constituent country of the Kingdom of the Netherlands”; Curaçao and Sint Maarten, likewise. Each has its own ISO code. Bonaire, Saba, and Sint Eustatius are special municipalities within the country of the Netherlands: all share the same ISO code. Recombining these constituent countries and special municipalities into the historical Netherlands Antilles, itself once a constituent country of the Kingdom of the Netherlands, would come at the cost of current (since 2010) and future compatibility with data collection and analysis.

Likewise, for the reasons discussed earlier, we should pass on restoring the historical designation of the Channel Islands. The Channel Islands primarily consist of the Bailiwick of Guernsey and the Bailiwick of Jersey. Guernsey and Jersey both have their own ISO codes and have real-world effective country-level status.

Map Country List
world_map2 %>% distinct(country) %>%
  DT::datatable(caption = "Map Country List")
### No Tuvalu  in map -- add coordinates
world_map2 %>% 
  filter(stringr::str_detect(country, "Tu") ) %>%
  distinct(country)
## # A tibble: 4 x 1
##   country                 
##   <chr>                   
## 1 Tunisia                 
## 2 Turkey                  
## 3 Turkmenistan            
## 4 Turks and Caicos Islands

But as it turns out, we are missing Tuvalu! This nation was represented in some of the Gapminder data sets.

Missing Coordinates

We now have a new problem. Our map data lacks coordinates – indeed, entries – for countries or subregions which have ISO codes: the nation Tuvalu, for example, and the territories of Gibraltar and the British Virgin Islands. None of these currently shows in our list of countries for the map data. Here is a quick check, using the original world_map data. We will check both the region and subregion variables.

# Tuvalu
world_map %>% 
  filter(stringr::str_detect(region, "Tu") ) %>% 
  distinct(region, subregion)
##                      region             subregion
## 1  Turks and Caicos Islands Providenciales Island
## 2  Turks and Caicos Islands   Grand Caicos Island
## 3  Turks and Caicos Islands   North Caicos Island
## 4              Turkmenistan            Ogurja Ada
## 5              Turkmenistan                  <NA>
## 6                   Tunisia                 Jerba
## 7                   Tunisia        Shergui Island
## 8                   Tunisia                  <NA>
## 9                    Turkey              Gokceada
## 10                   Turkey                  <NA>
## 11                   Turkey                 North
# Tuvalu again
world_map %>% 
  filter(stringr::str_detect(subregion, "Tu") ) %>% 
  distinct(region, subregion) 
##           region       subregion
## 1 American Samoa         Tutuila
## 2         Belize Turneffe Island
## 3         Canada  Tukarak Island
## 4      Indonesia         Tuangku
# Gibraltar
world_map %>% 
  filter(stringr::str_detect(region, "Gib") ) %>% 
  distinct(region, subregion) 
## [1] region    subregion
## <0 rows> (or 0-length row.names)
# Gibraltar again
world_map %>% 
  filter(stringr::str_detect(subregion, "Gib") ) %>% 
  distinct(region, subregion) 
## [1] region    subregion
## <0 rows> (or 0-length row.names)

For these cases – but regrettably, not for all – we can download the polygon coordinates from OpenDataSoft. Some hacking around (not on display here) will get us compatible data sets.
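
The paired region and subregion checks can also be folded into one small helper. The sketch below runs on a made-up miniature of the map data: `toy_world` and `find_in_map()` are hypothetical names, and only the column names `region` and `subregion` are taken from the original world_map.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical miniature of the map data: same column names, a few sample rows
toy_world <- tribble(
  ~region,          ~subregion,
  "Turkey",         NA,
  "Tunisia",        "Jerba",
  "American Samoa", "Tutuila",
  "Spain",          NA
)

# Hypothetical helper: search region and subregion together for a literal string
find_in_map <- function(map_df, pattern) {
  map_df %>%
    filter(str_detect(region, fixed(pattern)) |
             coalesce(str_detect(subregion, fixed(pattern)), FALSE)) %>%
    distinct(region, subregion)
}

find_in_map(toy_world, "Tuvalu")  # zero rows: no entry in either column
find_in_map(toy_world, "Tu")      # partial matches from both columns
```

A zero-row result from both columns at once is the signal that the coordinates have to come from somewhere else.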

Adding to the Map

### From https://public.opendatasoft.com/ 
tuvalu_coords <- readRDS(here::here("data", 
                                    "tidy_data", 
                                    "tuvalu_coords.rds") )


tuvalu_coords %>% head()  ## check structure
## # A tibble: 6 x 5
##    long   lat group  order country
##   <dbl> <dbl> <dbl>  <dbl> <chr>  
## 1  179. -8.55  2010 110000 Tuvalu 
## 2  179. -8.56  2010 110001 Tuvalu 
## 3  179. -8.47  2010 110003 Tuvalu 
## 4  179. -8.48  2010 110004 Tuvalu 
## 5  179. -8.49  2010 110005 Tuvalu 
## 6  179. -8.50  2011 110006 Tuvalu
## Add to map
world_map2 <- world_map2 %>%
  bind_rows(tuvalu_coords) %>% 
  arrange(country)

## Check!
world_map2 %>% 
  filter(stringr::str_detect(country, "Tu") ) %>%
  distinct(country)
## # A tibble: 5 x 1
##   country                 
##   <chr>                   
## 1 Tunisia                 
## 2 Turkey                  
## 3 Turkmenistan            
## 4 Turks and Caicos Islands
## 5 Tuvalu
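
Before stacking outside coordinates onto the map, it may be worth confirming that the new rows carry the map's columns and that their `group` ids do not collide with existing ones: ggplot2 uses `group` to decide which points belong to the same polygon, so a shared id would join two unrelated shapes. A minimal sketch, assuming made-up data and a hypothetical helper `check_before_bind()`:

```r
library(dplyr)
library(tibble)

# Hypothetical miniature data: existing map rows and incoming polygon rows
toy_map <- tibble(long = c(1, 2), lat = c(1, 2),
                  group = c(1, 1), order = c(1L, 2L),
                  country = "A")
new_rows <- tibble(long = c(5, 6), lat = c(5, 6),
                   group = c(2010, 2010), order = c(110000, 110001),
                   country = "Tuvalu")

# Hypothetical helper: stop early if the new rows would not stack cleanly
check_before_bind <- function(map_df, new_df) {
  missing_cols  <- setdiff(names(map_df), names(new_df))
  shared_groups <- intersect(map_df$group, new_df$group)
  stopifnot(
    "new rows lack some map columns" = length(missing_cols) == 0,
    "group ids collide with existing polygons" = length(shared_groups) == 0
  )
  invisible(TRUE)
}

check_before_bind(toy_map, new_rows)
toy_combined <- bind_rows(toy_map, new_rows)
```

Note that `bind_rows()` converts an integer `order` column to double when the incoming rows store it as double, as happens in this toy case.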

Adding more

We’ve successfully added Tuvalu. Now, for Gibraltar and the British Virgin Islands.

### Missing also Gibraltar &  Virgin Islands (British)
### From https://public.opendatasoft.com/ 

Gib_BVI_coords <- readRDS(file = here::here("data",
                          "tidy_data",
                          "Gib_BVI_coords.rds"))

Gib_BVI_coords %>% head()
## # A tibble: 6 x 5
##     long   lat group  order country                 
##    <dbl> <dbl> <dbl>  <dbl> <chr>                   
## 1  -5.36  36.1  2014 110035 Gibraltar               
## 2  -5.34  36.1  2014 110036 Gibraltar               
## 3  -5.34  36.1  2014 110037 Gibraltar               
## 4  -5.35  36.1  2014 110038 Gibraltar               
## 5  -5.36  36.1  2014 110039 Gibraltar               
## 6 -64.6   18.3  2021 110045 Virgin Islands (British)
world_map2 <- world_map2 %>%
  bind_rows(Gib_BVI_coords) %>% 
  arrange(country)


world_map2 %>% 
  filter(stringr::str_detect(country, "Gib") ) %>%
  distinct(country)
## # A tibble: 1 x 1
##   country  
##   <chr>    
## 1 Gibraltar
world_map2 %>% 
  filter(stringr::str_detect(country, "Vir") ) %>%
  distinct(country)
## # A tibble: 2 x 1
##   country                 
##   <chr>                   
## 1 Virgin Islands (British)
## 2 Virgin Islands (U.S.)

ISO Country Codes

We now have a map which provides near-complete coverage of the Gapminder.org data sets, and will work for other Global Studies data sets. We now need to add the ISO 3166-1 Country Codes to our map data: in particular, the Alpha-2 code, the Alpha-3 code, and the Numeric code. This will ensure compatibility with a greater range of Global Studies data sets.

Please note that the country_ISO_codes data set below was compiled and cross-checked using various open sources. But as ISO 3166-1 is a moving target (an ongoing process), this data set will need checking and updating.

country_ISO_codes <- readRDS(file = here::here("data", 
                          "tidy_data", 
                          "country_ISO_codes2.rds") )

country_ISO_codes %>% head()
## # A tibble: 6 x 5
##   s_name         code_2 code_3 code_num form_name                               
##   <chr>          <chr>  <chr>     <dbl> <chr>                                   
## 1 Afghanistan    AF     AFG           4 Islamic Republic of Afghanistan         
## 2 Aland Islands  AX     ALA         248 Åland                                   
## 3 Albania        AL     ALB           8 The Republic of Albania                 
## 4 Algeria        DZ     DZA          12 The People's Democratic Republic of Alg~
## 5 American Samoa AS     ASM          16 The Territory of American Samoa         
## 6 Andorra        AD     AND          20 The Principality of Andorra

Missing Norfolk!

It turns out, however, that like our map data, our master list of ISO Country Codes was also not complete. Finding a free and reliable open-source version is not easy – and I do not have access to the commercial version. So, below, is how to update country_ISO_codes:

### Missing Norfolk Island
norfolk_codes <- tibble(s_name = "Norfolk Island",
                        code_2 = "NF", 
                        code_3 = "NFK",
                        code_num = 574,
                        form_name = "Territory of Norfolk Island, Australia")

norfolk_codes %>% head()
## # A tibble: 1 x 5
##   s_name         code_2 code_3 code_num form_name                             
##   <chr>          <chr>  <chr>     <dbl> <chr>                                 
## 1 Norfolk Island NF     NFK         574 Territory of Norfolk Island, Australia
country_ISO_codes2 <- country_ISO_codes %>%
  bind_rows(norfolk_codes) %>% 
  arrange(s_name)

country_ISO_codes2 %>% 
  filter(code_2 == "NF") %>% 
  slice_head(n = 1)
## # A tibble: 1 x 5
##   s_name         code_2 code_3 code_num form_name                             
##   <chr>          <chr>  <chr>     <dbl> <chr>                                 
## 1 Norfolk Island NF     NFK         574 Territory of Norfolk Island, Australia
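
Because the code list is hand-maintained, a quick sanity check after each manual addition can catch typing slips. The sketch below tests only the shape of the codes (two letters, three letters, a number from 1 to 999) and looks for duplicates; it cannot confirm that a code is the right one. `toy_codes`, `bad_shape`, and `dupes` are hypothetical names; the column names follow country_ISO_codes.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical miniature of the code table, same column names as country_ISO_codes
toy_codes <- tribble(
  ~s_name,          ~code_2, ~code_3, ~code_num,
  "Afghanistan",    "AF",    "AFG",     4,
  "Norfolk Island", "NF",    "NFK",   574
)

# Rows whose codes do not have the expected ISO 3166-1 shape
bad_shape <- toy_codes %>%
  filter(!str_detect(code_2, "^[A-Z]{2}$") |
           !str_detect(code_3, "^[A-Z]{3}$") |
           !between(code_num, 1, 999))

# Names or alpha-3 codes that appear more than once
dupes <- toy_codes %>%
  add_count(s_name, name = "n_name") %>%
  add_count(code_3, name = "n_code") %>%
  filter(n_name > 1 | n_code > 1)

bad_shape
dupes
```

Both results should come back with zero rows. A duplicated `s_name` matters most, since it would multiply map rows in a later join by country name.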

Reconciliation Check

Now that we have our ISO Country Codes loaded and updated, we are almost ready to add them to the map data. First, one last set of checks for differences.

### Remaining Gapminder cases -- the two historical entities
setdiff(country_names$country, world_map2$country)  %>% 
  enframe(name = NULL, value = "diff") %>% 
  knitr::kable(caption = "Gap vs Map: Remaining Cases", 
               row.names = TRUE)
Gap vs Map: Remaining Cases
diff
1 Channel Islands
2 Netherlands Antilles
setdiff(country_ISO_codes2$s_name , world_map2$country)  %>% 
  enframe(name = NULL, value = "diff") %>% 
  knitr::kable(caption = "ISO vs Map: Remaining Cases", 
               row.names = TRUE)
ISO vs Map: Remaining Cases
diff
1 Aland Islands
2 Bouvet Island
3 British Indian Ocean Territory
4 Svalbard and Jan Mayen
5 Tokelau
6 United States Minor Outlying Islands
setdiff(world_map2$country, country_ISO_codes2$s_name)  %>% 
  enframe(name = NULL, value = "diff") %>% 
  knitr::kable(caption = "Map vs. ISO:  Remaining Cases", 
               row.names = TRUE)
Map vs. ISO: Remaining Cases
diff
1 Siachen Glacier

So our checks indicate success. First, we declined to restore the two defunct designations, one of which reflected a historical country-level entity, and the other, a geographically convenient label. Our map data set, by design, will not account for the Netherlands Antilles or the Channel Islands.

Second, of the ISO vs Map cases, only the sparsely populated Tokelau possibly matters, but OpenDataSoft does not have the polygon coordinates for it. The Chagos Archipelago, included in our map, makes up the most important part of the British Indian Ocean Territory. The remaining four cases comprise seasonally inhabited regions, remote military bases, or both. These produce negligible data in terms of economic or public health statistics, and can be safely ignored for such purposes.

Third and finally, the original map makers included the Siachen Glacier. This is a geographical entity and a disputed territory, but it does not have an individual ISO code, does not have civilian residents (only military), and does not produce the relevant economic or public health data. If we remove it from world_map2, however, we will get a small but annoying empty space (usually portrayed as a white dot). So it stays in.
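
For repeated use, the one-directional `setdiff()` checks can be wrapped so that each comparison reports both directions in a single table. This is a sketch under stated assumptions: the three name vectors are short made-up stand-ins for the Gapminder, ISO, and map name lists, and `name_gaps()` is a hypothetical helper.

```r
library(dplyr)
library(tibble)

# Hypothetical name vectors standing in for the three sources
gap_names <- c("Aruba", "Channel Islands", "France")
iso_names <- c("Aruba", "France", "Tokelau")
map_names <- c("Aruba", "France", "Siachen Glacier")

# Hypothetical helper: one row per unmatched name, labelled by direction
name_gaps <- function(x, y, x_label, y_label) {
  bind_rows(
    tibble(name = setdiff(x, y), found_in = x_label, missing_from = y_label),
    tibble(name = setdiff(y, x), found_in = y_label, missing_from = x_label)
  )
}

gaps <- bind_rows(
  name_gaps(gap_names, map_names, "Gapminder", "map"),
  name_gaps(iso_names, map_names, "ISO", "map")
)

gaps
```

A name that shows up in every comparison, such as the glacier in this toy case, is a quick hint that the entry exists only in the map data.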

Add ISO to Map

world_map2 <- world_map2 %>%
  left_join(country_ISO_codes2, by = c("country" = "s_name")) %>%
  tibble() 


world_map2 %>% vis_dat()

world_map2 %>% glimpse()
## Rows: 99,442
## Columns: 9
## $ long      <dbl> 74.89131, 74.84023, 74.76738, 74.73896, 74.72666, 74.66895, ~
## $ lat       <dbl> 37.23164, 37.22505, 37.24917, 37.28564, 37.29072, 37.26670, ~
## $ group     <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ order     <dbl> 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, ~
## $ country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", ~
## $ code_2    <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", ~
## $ code_3    <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG~
## $ code_num  <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ~
## $ form_name <chr> "Islamic Republic of Afghanistan", "Islamic Republic of Afgh~

The slight apparent glitch, the NA space in the data set, comes from the Canary Islands, which have an alpha-2 code “IC” still in use, but no alpha-3 or numeric code. Instead, they have been recoded as ES-CN: a subdivision of Spain.
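
To see by name which entries are left without codes after a join like this, one option is to filter for missing values and list the distinct countries. The sketch below uses a made-up miniature of both tables; `toy_map`, `toy_codes`, `toy_joined`, and `missing_codes` are hypothetical names.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature of the map (country column only) and of the code table
toy_map <- tibble(country = c("Spain", "Spain", "Canary Islands", "Siachen Glacier"))
toy_codes <- tribble(
  ~s_name,          ~code_2, ~code_3, ~code_num,
  "Spain",          "ES",    "ESP",   724,
  "Canary Islands", "IC",    NA,      NA
)

toy_joined <- toy_map %>%
  left_join(toy_codes, by = c("country" = "s_name"))

# Entries with at least one missing code after the join
missing_codes <- toy_joined %>%
  filter(if_any(c(code_2, code_3, code_num), is.na)) %>%
  distinct(country, code_2, code_3, code_num)

missing_codes
```

Comparing `nrow()` before and after the join is a useful companion check: a left join against a code table with unique names should leave the row count unchanged.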

Save and Done

We now have mapping data with the following standard ISO country codes: Alpha-2 code, Alpha-3 code, and Numeric code. We can match by country name to the majority of Gapminder data sets, and we can match by code to any Global Studies data set which likewise uses one or more of the above ISO country codes. So the data set world_map2_ISO offers an update on ggplot2::map_data("world") with improved interoperability.
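
To illustrate what matching by code buys us, here is a minimal Choropleth sketch that joins indicator values to polygons by alpha-3 code rather than by country name. Everything in it is made up for illustration: the two "countries" are unit squares, the indicator values are invented, and `toy_map`, `toy_data`, `toy_joined`, and `toy_plot` are hypothetical names.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Hypothetical two-polygon "map": one square per country, with alpha-3 codes
toy_map <- tribble(
  ~long, ~lat, ~group, ~order, ~country, ~code_3,
      0,    0,      1,      1, "Spain",  "ESP",
      1,    0,      1,      2, "Spain",  "ESP",
      1,    1,      1,      3, "Spain",  "ESP",
      0,    1,      1,      4, "Spain",  "ESP",
      2,    0,      2,      5, "France", "FRA",
      3,    0,      2,      6, "France", "FRA",
      3,    1,      2,      7, "France", "FRA",
      2,    1,      2,      8, "France", "FRA"
)

# Invented indicator values, keyed by alpha-3 code instead of country name
toy_data <- tibble(code_3 = c("ESP", "FRA"), value = c(10, 20))

# Join on the code, then restore drawing order before plotting
toy_joined <- toy_map %>%
  left_join(toy_data, by = "code_3") %>%
  arrange(order)

toy_plot <- ggplot(toy_joined,
                   aes(x = long, y = lat, group = group, fill = value)) +
  geom_polygon(colour = "white") +
  coord_quickmap() +
  theme_void()

toy_plot
```

Joining on a code sidesteps spelling differences in country names between data sets, which is the interoperability gain described above.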

save_data <- c("world_map2",
               "country_ISO_codes2")

# Save Data! --------------------------
save(list = save_data, file = here::here("data",
                                         "tidy_data", 
                                         "maps", 
                                         "world_map2_project.rda" ))
## Just the map data
saveRDS(world_map2, file = here::here("data",
                                         "tidy_data", 
                                         "maps", 
                                         "world_map2.rds" ))

Download from

Or, please improve and share: github.com/Thom-J-H/map_Gap_2_Tidy


Thomas J. Haslam
2021-08-13


Session Info