In Spring 2021, I taught a course entitled “Telling Stories with Data” (TSD), which introduced non-STEM majors to the Tidyverse and basic data visualization and analysis. I had bright students, but ones who typically had NO prior experience with either statistical analysis or computer programming. So TSD was designed as a soft entry, beginner-level guide to working with data. For our stories, we started with various data sets available in R packages – the usual suspects – and then progressed to data sets in the wild.
Our capstone project required the students to use two data sets from the Gapminder.org foundation to build a dashboard and tell a coherent story with visualizations, summary stats, text contextualization and analyses, and at least one basic model or hypothesis test. (Three capstone project examples: Bonnie, Chanley, and Bethia).
library(tidyverse)
library(here)
library(visdat)
For their capstone projects, nearly all the students wanted to include Choropleth maps. Understandable. The students were working with global data, and Choropleth maps both look impressive and are useful. But we ran into some problems.
Gapminder.org uses the nation-state as a primary unit of analysis: the country variable in their data sets. But they do NOT include the standard ISO country codes. The R ecosystem has various mapping tools, some well outside the Tidyverse, but all of them require remapping at least some of the Gapminder country names to the geo-data units, or vice versa. So until we have all the primary units remapped, we get something like this:
load(here::here("data", "tidy_data", "cmap.rda") )
bad_ex
To keep this as simple as the course required, and to stay largely within the Tidyverse ecosystem so as to avoid cognitive overload, we went with the world map from ggplot2::map_data("world"). To ensure compatibility with the Gapminder.org data, we created a new data set with the geo-mapping information: world_map2. We fixed most of the flaws, and made “good enough” quick and dirty Choropleth maps, but I wanted to finish what we started.
If data was missing for a given nation for a given year, we wanted to know that. We also wanted our mapping data compatible across all the Gapminder.org data sets. We wanted a result more like this:
plotly::ggplotly(good_ex)
We would not need to rely on matching names if we had a version of ggplot2::map_data("world") which was updated and contained the standard ISO country codes: Alpha-2 code, Alpha-3 code, and Numeric code. So that is what the data set world_map2 offers.
The remainder of this document describes the process of creating world_map2, our new data set directly derived from ggplot2::map_data("world"). The data set world_map2, this RMD, and a supporting case study are available at github.com/Thom-J-H/map_Gap_2_Tidy. I include in this document below all the steps and rationale involved for full transparency and in hopes that other people can improve upon this effort or offer a better solution for working with Global Studies data sets (like the Gapminder.org data) in the Tidyverse.
When using ggplot2::map_data("world") (hereafter world_map) with the Gapminder.org or other Global Studies data sets, we have two core problems. First, the names for country are not consistent across data sets. The informal names of the nations can vary greatly; the formal names are often too long for appropriate labeling and generally not even recorded. In the Gapminder.org data sets, which largely share a common source, for Sint Maarten, we have two values: “Sint Maarten” and “Sint Maarten (Dutch part)”. This is because the same island also contains the Collectivity of Saint Martin, more commonly known as the French “Saint Martin”. When we move from the Gapminder.org data sets to others, the country name values can vary greatly. In world_map, the preferred “Eswatini” appears as the older “Swaziland”; “North Macedonia” (the official name as of 2019), as the older “Macedonia”; and so on.
The obvious solution to this problem of inconsistent country nomenclature: use the ISO 3166-1 codes, either the three-letter (Alpha-3) designation, the three-digit numeric code (the same number the UN uses), or both. Neither the Gapminder.org data sets nor world_map does so.
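To make the point concrete, here is a minimal sketch of why a shared code is more robust than a shared name. The two small tables, their column names (iso_a3, life_exp), and the values are hypothetical and invented purely for illustration; only the name variants and the ISO Alpha-3 codes themselves are real.

```r
library(dplyr)

# Hypothetical toy data: two sources that spell the same countries differently
# but carry the same ISO 3166-1 Alpha-3 code.
health <- tibble(
  country  = c("Eswatini", "Lao", "Cote d'Ivoire"),
  iso_a3   = c("SWZ", "LAO", "CIV"),
  life_exp = c(60.2, 68.5, 58.1)   # made-up values, illustration only
)

map_units <- tibble(
  region = c("Swaziland", "Laos", "Ivory Coast"),
  iso_a3 = c("SWZ", "LAO", "CIV")
)

# Joining on names: no row finds a partner, so life_exp is NA throughout
by_name <- left_join(map_units, health, by = c("region" = "country"))

# Joining on the code: every row finds its partner
by_code <- left_join(map_units, health, by = "iso_a3")

sum(is.na(by_name$life_exp))
sum(is.na(by_code$life_exp))
```

The name-based join silently produces missing values, which is what shows up as grey holes on a Choropleth map; the code-based join does not depend on spelling at all.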
Second, in practical terms, we have no simple definition of what comprises a country. As of 4 September 2020, Kosovo was recognized by 97 out of 193 (50.26%) United Nations member states; as of July 2021, Western Sahara was recognized by 45 out of a total of 193 United Nations member states. Both have region values with the corresponding polygon coordinates in world_map. They may or may not appear in various Global Studies data collections.
Likewise, we also have existing designations that do not distinguish clearly between geographical boundaries and political boundaries. Some of the Gapminder data sets, for example, report on the “Channel Islands”: more properly, the two Crown dependencies, the Bailiwick of Jersey, and the Bailiwick of Guernsey. But as Wikipedia correctly reports: “‘Channel Islands’ is a geographical term, not a political unit. The two bailiwicks have been administered separately since the late 13th century. Each has its own independent laws, elections, and representative bodies…. Any institution common to both is the exception rather than the rule.” Jersey, for example, is “a self-governing parliamentary democracy under a constitutional monarchy, with its own financial, legal and judicial systems, and the power of self-determination”. For mapping purposes, both Jersey (JE; JEY; 832) and Guernsey (GG; GGY; 831) have their own ISO codes. In truth, it makes more sense NOT to lump Jersey and Guernsey together for the purposes of economic, social, and public health data analysis, even if Gapminder.org and/or the World Bank did so for some data collections.
Conversely, but appropriately, world_map places Hong Kong and Macao in the subregion column, under the region China. This is geographically and politically correct, but for decades of practice continuing to the present, economic and public health data for Hong Kong and Macao have been gathered separately, not aggregated back to China. Each former city-state, now a “Special Administrative Region”, effectively has a country level status, as the Gapminder.org and other Global Studies data sets show.
Finally, on this point, some of the Gapminder data sets also include as a country value the dissolved Netherlands Antilles. If we keep this historical designation, which is needed for only a limited number of data analyses, we must otherwise ignore the data for the now separate countries of Aruba and Curacao, as well as the political regroupings of the remaining islands. So although I want a mapping data set highly compatible with the Gapminder.org data sets, it should also work with any Global Studies data set. The value “Netherlands Antilles” will be dropped.
Ideally, world_map2 should not only work better with the Gapminder.org data set: it should be interoperable with all reasonably similar Global Studies data sets. So beyond adding, deleting and changing names, and in three cases, adding new polygon coordinates, world_map2 also contains (when they exist) the Alpha-2 code, Alpha-3 code, and Numeric code for each entity represented as a country value.
The Gapminder.org data sets available for download are generally sourced to the World Bank and available under a Creative Commons Attribution 4.0 International license. They cover global trends with the nation-state, the variable country, as a primary level of analysis. The data is also organized chronologically, by year.
Between data sets, the names for countries are generally consistent. Some sets do cover more nations (and territories and sub-national units).
We will use four sets below to test differences in coverage, and to build a country names reference.
# Data sets from Gapminder.org --------------------------------------------
life_expectancy_years <- read_csv(here::here("data",
"raw_data",
"life_expectancy_years.csv") ,
show_col_types = FALSE)
total_fertility <- read_csv(here::here("data",
"raw_data",
"children_per_woman_total_fertility.csv"),
show_col_types = FALSE)
energy_use_per_person <- read_csv(here::here("data",
"raw_data",
"energy_use_per_person.csv"),
show_col_types = FALSE)
demox_eiu <- read_csv(here::here("data",
"raw_data",
"demox_eiu.csv"),
show_col_types = FALSE)
The Gapminder.org data sets are untidy, and in wide format (one column per year). We’ll deal with those issues later. Depending on the primary variable of interest, we have a different range of nations and years covered. For example, the data set for Life Expectancy (years) has 189 designated countries; the data set for Total Fertility, 202 countries; the data set for Energy Use per Capita, 169 nations; and the data set for Democracy Index (EIU), 166 nations. But Total Fertility, to take one comparison, does not simply have 13 more listed countries than Life Expectancy (years): we have meaningful set differences in coverage between the sets.
life_expectancy_years %>% head()
## # A tibble: 6 x 302
## country `1800` `1801` `1802` `1803` `1804` `1805` `1806` `1807` `1808` `1809`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghani~ 28.2 28.2 28.2 28.2 28.2 28.2 28.1 28.1 28.1 28.1
## 2 Angola 27 27 27 27 27 27 27 27 27 27
## 3 Albania 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4
## 4 Andorra NA NA NA NA NA NA NA NA NA NA
## 5 United ~ 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7
## 6 Argenti~ 33.2 33.2 33.2 33.2 33.2 33.2 33.2 33.2 33.2 33.2
## # ... with 291 more variables: 1810 <dbl>, 1811 <dbl>, 1812 <dbl>, 1813 <dbl>,
## # 1814 <dbl>, 1815 <dbl>, 1816 <dbl>, 1817 <dbl>, 1818 <dbl>, 1819 <dbl>,
## # 1820 <dbl>, 1821 <dbl>, 1822 <dbl>, 1823 <dbl>, 1824 <dbl>, 1825 <dbl>,
## # 1826 <dbl>, 1827 <dbl>, 1828 <dbl>, 1829 <dbl>, 1830 <dbl>, 1831 <dbl>,
## # 1832 <dbl>, 1833 <dbl>, 1834 <dbl>, 1835 <dbl>, 1836 <dbl>, 1837 <dbl>,
## # 1838 <dbl>, 1839 <dbl>, 1840 <dbl>, 1841 <dbl>, 1842 <dbl>, 1843 <dbl>,
## # 1844 <dbl>, 1845 <dbl>, 1846 <dbl>, 1847 <dbl>, 1848 <dbl>, 1849 <dbl>, ...
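The printout above has one column per year. As a hedged sketch of the reshaping that a tidy workflow would eventually need, the following uses tidyr::pivot_longer() on a two-country miniature; the object names life_wide and life_long and the column name life_expectancy are assumptions for illustration, not names used elsewhere in this document.

```r
library(dplyr)
library(tidyr)

# Miniature of a Gapminder-style table: one row per country,
# one column per year (values copied from the printout above)
life_wide <- tibble(
  country = c("Albania", "Angola"),
  `1800`  = c(35.4, 27.0),
  `1801`  = c(35.4, 27.0)
)

# One row per country-year; the year column names become integer values
life_long <- life_wide %>%
  pivot_longer(cols = -country,
               names_to = "year",
               values_to = "life_expectancy",
               names_transform = list(year = as.integer))

life_long
```

With the data in this shape, filtering on a single year before joining to map coordinates becomes a one-line filter().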
demox_eiu %>% vis_dat()
life_expectancy_years %>%
arrange(country) %>%
select(country)
## # A tibble: 189 x 1
## country
## <chr>
## 1 Afghanistan
## 2 Albania
## 3 Algeria
## 4 Andorra
## 5 Angola
## 6 Antigua and Barbuda
## 7 Argentina
## 8 Armenia
## 9 Australia
## 10 Austria
## # ... with 179 more rows
demox_eiu %>%
arrange(country) %>%
select(country)
## # A tibble: 166 x 1
## country
## <chr>
## 1 Afghanistan
## 2 Albania
## 3 Algeria
## 4 Angola
## 5 Argentina
## 6 Armenia
## 7 Australia
## 8 Austria
## 9 Azerbaijan
## 10 Bahrain
## # ... with 156 more rows
life_expectancy_years$country %>%
n_distinct()
## [1] 189
energy_use_per_person$country %>%
n_distinct()
## [1] 169
Let’s explore the similarities and differences in coverage for the country variable between the sets.
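Each comparison that follows runs setdiff() in one direction at a time. As a compact alternative, a small helper can report both directions together. This is only a sketch: compare_coverage() is a hypothetical function, and the two vectors are toy stand-ins for the country columns.

```r
# Hypothetical helper: coverage differences in both directions, plus overlap
compare_coverage <- function(a, b) {
  list(
    only_in_a = sort(setdiff(a, b)),      # present in a, absent from b
    only_in_b = sort(setdiff(b, a)),      # present in b, absent from a
    shared    = length(intersect(a, b))   # number of values in both
  )
}

# Toy stand-ins for two country vectors
set_a <- c("Aruba", "Andorra", "Angola")
set_b <- c("Angola", "Andorra", "Dominica")

compare_coverage(set_a, set_b)
```

Because setdiff() is not symmetric, seeing both directions side by side makes it harder to mistake "13 more countries" for "the same countries plus 13".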
### Fertility vs. Life coverage
setdiff(total_fertility$country, life_expectancy_years$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Fertility vs. Life coverage" ,
row.names = TRUE)
row | diff |
---|---|
1 | Aruba |
2 | Netherlands Antilles |
3 | Channel Islands |
4 | Western Sahara |
5 | Guadeloupe |
6 | Greenland |
7 | French Guiana |
8 | Guam |
9 | Macao, China |
10 | Martinique |
11 | Mayotte |
12 | New Caledonia |
13 | Puerto Rico |
14 | French Polynesia |
15 | Reunion |
16 | Virgin Islands (U.S.) |
### Life vs. Fertility coverage
setdiff(life_expectancy_years$country, total_fertility$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Life vs. Fertility coverage" ,
row.names = TRUE)
row | diff |
---|---|
1 | Andorra |
2 | Dominica |
3 | Marshall Islands |
### Fertility vs. Energy coverage
setdiff(total_fertility$country, energy_use_per_person$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Fertility vs. Energy coverage",
row.names = TRUE)
row | diff |
---|---|
1 | Aruba |
2 | Afghanistan |
3 | Netherlands Antilles |
4 | Burundi |
5 | Burkina Faso |
6 | Central African Republic |
7 | Channel Islands |
8 | Western Sahara |
9 | Micronesia, Fed. Sts. |
10 | Guinea |
11 | Guadeloupe |
12 | Greenland |
13 | French Guiana |
14 | Guam |
15 | Hong Kong, China |
16 | Lao |
17 | Liberia |
18 | Macao, China |
19 | Madagascar |
20 | Mali |
21 | Mauritania |
22 | Martinique |
23 | Malawi |
24 | Mayotte |
25 | New Caledonia |
26 | Papua New Guinea |
27 | Puerto Rico |
28 | Palestine |
29 | French Polynesia |
30 | Reunion |
31 | Rwanda |
32 | Sierra Leone |
33 | Somalia |
34 | Chad |
35 | Taiwan |
36 | Uganda |
37 | Virgin Islands (U.S.) |
### Energy vs. Fertility coverage
setdiff(energy_use_per_person$country, total_fertility$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Energy vs. Fertility coverage",
row.names = TRUE)
row | diff |
---|---|
1 | Dominica |
2 | Marshall Islands |
3 | Palau |
4 | St. Kitts and Nevis |
### Life vs. Energy coverage
setdiff(life_expectancy_years$country, energy_use_per_person$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Life vs. Energy coverage",
row.names = TRUE)
row | diff |
---|---|
1 | Afghanistan |
2 | Andorra |
3 | Burundi |
4 | Burkina Faso |
5 | Central African Republic |
6 | Micronesia, Fed. Sts. |
7 | Guinea |
8 | Hong Kong, China |
9 | Lao |
10 | Liberia |
11 | Madagascar |
12 | Mali |
13 | Mauritania |
14 | Malawi |
15 | Papua New Guinea |
16 | Palestine |
17 | Rwanda |
18 | Sierra Leone |
19 | Somalia |
20 | Chad |
21 | Taiwan |
22 | Uganda |
### Energy vs. Life coverage
setdiff(energy_use_per_person$country,life_expectancy_years$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Energy vs. Life coverage",
row.names = TRUE)
row | diff |
---|---|
1 | Palau |
2 | St. Kitts and Nevis |
### Energy vs. Democracy coverage
setdiff(energy_use_per_person$country,demox_eiu$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Energy vs. Democracy coverage",
row.names = TRUE)
row | diff |
---|---|
1 | Antigua and Barbuda |
2 | Bahamas |
3 | Barbados |
4 | Belize |
5 | Brunei |
6 | Dominica |
7 | Georgia |
8 | Grenada |
9 | Kiribati |
10 | Maldives |
11 | Marshall Islands |
12 | Palau |
13 | Samoa |
14 | Sao Tome and Principe |
15 | Seychelles |
16 | Solomon Islands |
17 | South Sudan |
18 | St. Kitts and Nevis |
19 | St. Lucia |
20 | St. Vincent and the Grenadines |
21 | Tonga |
22 | Vanuatu |
### Democracy vs. Energy coverage
setdiff(demox_eiu$country,energy_use_per_person$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Democracy vs. Energy coverage",
row.names = TRUE)
row | diff |
---|---|
1 | Afghanistan |
2 | Burundi |
3 | Burkina Faso |
4 | Central African Republic |
5 | Guinea |
6 | Hong Kong, China |
7 | Lao |
8 | Liberia |
9 | Madagascar |
10 | Mali |
11 | Mauritania |
12 | Malawi |
13 | Papua New Guinea |
14 | Palestine |
15 | Rwanda |
16 | Sierra Leone |
17 | Chad |
18 | Taiwan |
19 | Uganda |
Overall, we have a fairly complete vector of country values. Since the law of diminishing returns has set in for our Gapminder data set comparisons, let’s build a country name list (a data frame, actually) to test against our map data coverage.
# Gapminder Country Name Reference DF -------------------------------------
country_names <- demox_eiu %>%
select(country) %>%
full_join(energy_use_per_person, by = "country") %>%
select(country) %>%
full_join(total_fertility, by = "country") %>%
select(country) %>%
full_join(life_expectancy_years, by = "country") %>%
select(country) %>%
arrange(country)
## Current working total
country_names$country %>%
n_distinct()
## [1] 207
country_names %>% head(n = 10) %>%
knitr::kable(caption = "First Ten Country Designations",
row.names = TRUE)
row | country |
---|---|
1 | Afghanistan |
2 | Albania |
3 | Algeria |
4 | Andorra |
5 | Angola |
6 | Antigua and Barbuda |
7 | Argentina |
8 | Armenia |
9 | Aruba |
10 | Australia |
country_names %>% tail(n = 10) %>%
knitr::kable(caption = "Last Ten Country Designations",
row.names = TRUE)
row | country |
---|---|
1 | Uruguay |
2 | Uzbekistan |
3 | Vanuatu |
4 | Venezuela |
5 | Vietnam |
6 | Virgin Islands (U.S.) |
7 | Western Sahara |
8 | Yemen |
9 | Zambia |
10 | Zimbabwe |
So at this point we have 207 unique country level units of analysis. Please note that some country designations are better understood as regions within a nation-state, or as overseas territories belonging to a nation-state, rather than as distinct nation-states as recognized by the United Nations or the international community.
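The country name reference was built above with a chain of full_join() and select() calls. A shorter route to the same kind of result, sketched here with base R on toy data, is to fold the country vectors together with union(). The list sets and its three small data frames are hypothetical stand-ins for the Gapminder tables.

```r
# Toy stand-ins for the Gapminder tables; only the country column matters
sets <- list(
  demox  = data.frame(country = c("Albania", "Angola")),
  energy = data.frame(country = c("Angola", "Palau")),
  life   = data.frame(country = c("Albania", "Andorra"))
)

# Pull each country vector, then fold them together pairwise with union()
all_countries <- sort(Reduce(union, lapply(sets, function(df) df$country)))

all_countries
length(all_countries)
```

The design trade-off: the join chain keeps everything inside a data frame pipeline, while the union approach states the intent (the set of all names seen anywhere) more directly and scales to any number of tables.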
In the data set world_map, derived from ggplot2::map_data("world"), the region variable generally corresponds with the Gapminder country variable: but it can also define geographical rather than political entities. We need to dig into the map data subregion to obtain a proper match with country.
Let’s have a look.
world_map <- ggplot2::map_data("world")
world_map %>% vis_dat()
## Basic unit is region; subregion mostly NA
world_map %>% glimpse()
## Rows: 99,338
## Columns: 6
## $ long <dbl> -69.89912, -69.89571, -69.94219, -70.00415, -70.06612, -70.0~
## $ lat <dbl> 12.45200, 12.42300, 12.43853, 12.50049, 12.54697, 12.59707, ~
## $ group <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ order <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, 1~
## $ region <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba~
## $ subregion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## group or groups belong to regions
## order refers to the long and lat coordinates for mapping
## long == longitude lat == latitude
world_map %>%
skimr::skim(region, subregion)
Name | Piped data |
Number of rows | 99338 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
character | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
region | 0 | 1.00 | 2 | 35 | 0 | 252 | 0 |
subregion | 63154 | 0.36 | 1 | 33 | 0 | 1069 | 0 |
## more regions than gapminder countries
## difference in emphasis
## subregion contains some units which gapminder treats as a country
The ggplot2 map data for a global map uses region as the primary unit. A region can have subregions, but always consists of at least one group. The group marks out the polygon to be mapped and filled, by longitude (long) and latitude (lat) coordinates in their appropriate order (order): some regions and even subregions require multiple groups to draw the appropriate shape. The subregion “Hong Kong”, for example, has three distinct groups: 668, 669, and 670.
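To illustrate what group does, here is a minimal sketch with an invented region made of two separate squares. The data frame toy_map and the region name "Toyland" are hypothetical; the column layout simply mirrors the map data described above.

```r
library(ggplot2)

# Hypothetical region drawn as two separate squares, i.e. two groups
toy_map <- data.frame(
  long   = c(0, 1, 1, 0,   2, 3, 3, 2),
  lat    = c(0, 0, 1, 1,   0, 0, 1, 1),
  group  = rep(c(1, 2), each = 4),
  order  = 1:8,
  region = "Toyland"
)

# The group aesthetic keeps the squares as separate polygons; without it,
# all eight points would be connected into a single shape
p <- ggplot(toy_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey70", colour = "black") +
  coord_quickmap()

p
```

This is why any relabeling of countries later on has to leave group and order untouched: they carry the geometry, while region or country only carries the label.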
In the majority of cases, we have either an existing or an obvious match between the Gapminder country and the map data region variables. But for that minority, we need to dig through the vectors. Below are some tools for that task.
### Check for example South Sudan
world_map %>%
filter(stringr::str_detect(region, "Sudan") ) %>%
distinct(region)
## region
## 1 Sudan
## 2 South Sudan
### Check for example South Sudan
world_map %>%
filter(stringr::str_detect(region, "South") ) %>%
distinct(region)
## region
## 1 French Southern and Antarctic Lands
## 2 South Korea
## 3 South Sudan
## 4 South Sandwich Islands
## 5 South Georgia
## 6 South Africa
### Check for example South Sudan
country_names %>%
filter(stringr::str_detect(country, "Sudan") )
## # A tibble: 2 x 1
## country
## <chr>
## 1 South Sudan
## 2 Sudan
### Check for example South Sudan
country_names %>%
filter(stringr::str_detect(country, "South") )
## # A tibble: 3 x 1
## country
## <chr>
## 1 South Africa
## 2 South Korea
## 3 South Sudan
### Check for example Hong Kong
world_map %>%
filter(stringr::str_detect(region, "Hong Kong") ) %>%
distinct(region) # NO!
## [1] region
## <0 rows> (or 0-length row.names)
### Check for example Hong Kong
world_map %>%
filter(stringr::str_detect(subregion, "Hong Kong") ) %>%
distinct(region, subregion) # YES!
## region subregion
## 1 China Hong Kong
### Group IDs for coordinates data
world_map %>%
filter(stringr::str_detect(subregion, "Hong Kong") ) %>%
select(group) %>%
distinct()
## group
## 1 668
## 2 669
## 3 670
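The checks above search region and subregion separately. A small wrapper can search both at once; the following is a sketch only, with find_place() a hypothetical helper and toy_map a four-row stand-in for the map data. Missing subregion values are treated as empty strings so they never match.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: look for a pattern in region OR subregion
find_place <- function(map_df, pattern) {
  map_df %>%
    filter(str_detect(region, pattern) |
             str_detect(coalesce(subregion, ""), pattern)) %>%
    distinct(region, subregion)
}

# Toy stand-in for the map data
toy_map <- tibble(
  region    = c("China", "China", "Sudan", "South Sudan"),
  subregion = c("Hong Kong", NA, NA, NA)
)

find_place(toy_map, "Hong Kong")
find_place(toy_map, "Sudan")
```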
Let’s identify the mismatches and work to reconcile as many as possible.
####Identify key differences --------------------------------------
map_vs_gap <- setdiff(world_map$region, country_names$country) %>%
enframe(name = NULL, value = "desn") %>%
arrange(desn)
gap_vs_map <- setdiff(country_names$country, world_map$region) %>%
enframe(name = NULL, value = "desn") %>%
arrange(desn)
map_vs_gap %>%
knitr::kable(caption = "Map regions vs. Gap countries: Coverage diff",
row.names = TRUE)
row | desn |
---|---|
1 | American Samoa |
2 | Anguilla |
3 | Antarctica |
4 | Antigua |
5 | Ascension Island |
6 | Azores |
7 | Barbuda |
8 | Bermuda |
9 | Bonaire |
10 | Canary Islands |
11 | Cayman Islands |
12 | Chagos Archipelago |
13 | Christmas Island |
14 | Cocos Islands |
15 | Cook Islands |
16 | Curacao |
17 | Democratic Republic of the Congo |
18 | Falkland Islands |
19 | Faroe Islands |
20 | French Southern and Antarctic Lands |
21 | Grenadines |
22 | Guernsey |
23 | Heard Island |
24 | Isle of Man |
25 | Ivory Coast |
26 | Jersey |
27 | Kosovo |
28 | Kyrgyzstan |
29 | Laos |
30 | Liechtenstein |
31 | Macedonia |
32 | Madeira Islands |
33 | Micronesia |
34 | Monaco |
35 | Montserrat |
36 | Nauru |
37 | Nevis |
38 | Niue |
39 | Norfolk Island |
40 | Northern Mariana Islands |
41 | Pitcairn Islands |
42 | Republic of Congo |
43 | Saba |
44 | Saint Barthelemy |
45 | Saint Helena |
46 | Saint Kitts |
47 | Saint Lucia |
48 | Saint Martin |
49 | Saint Pierre and Miquelon |
50 | Saint Vincent |
51 | San Marino |
52 | Siachen Glacier |
53 | Sint Eustatius |
54 | Sint Maarten |
55 | Slovakia |
56 | South Georgia |
57 | South Sandwich Islands |
58 | Swaziland |
59 | Tobago |
60 | Trinidad |
61 | Turks and Caicos Islands |
62 | UK |
63 | USA |
64 | Vatican |
65 | Virgin Islands |
66 | Wallis and Futuna |
gap_vs_map %>%
knitr::kable(caption = "Gap countries vs. Map regions: Coverage diff",
row.names = TRUE)
row | desn |
---|---|
1 | Antigua and Barbuda |
2 | Channel Islands |
3 | Congo, Dem. Rep. |
4 | Congo, Rep. |
5 | Cote d’Ivoire |
6 | Eswatini |
7 | Hong Kong, China |
8 | Kyrgyz Republic |
9 | Lao |
10 | Macao, China |
11 | Micronesia, Fed. Sts. |
12 | Netherlands Antilles |
13 | North Macedonia |
14 | Slovak Republic |
15 | St. Kitts and Nevis |
16 | St. Lucia |
17 | St. Vincent and the Grenadines |
18 | Trinidad and Tobago |
19 | United Kingdom |
20 | United States |
21 | Virgin Islands (U.S.) |
When going from the map region variable to Gapminder country values, we find 66 differences. Some of these are geographical entities or national subregions or overseas territories that we would not expect to find considered in the Gapminder data. Others are simple mismatches easily reconciled. Another group is a bit more tricky for coding, but logically straightforward. For example, the country is “Trinidad and Tobago”: the two primary geographical entities are islands “Trinidad” and “Tobago”, both region values in the map data.
When going from the Gapminder country to the map region values, our primary concern for reconciliation, we find 21 differences. These break down into four rough categories: 1. Easy Cases, 2. Island Nations, 3. Subregion Promotion, and 4. Do Not Restore.
Dealing with the “Easy Cases”, the first group of mismatches, is straightforward.
### Easy cases -- see tools above for digging out names
world_map2 <- world_map %>%
rename(country = region) %>%
mutate(country = case_when(country == "Macedonia" ~ "North Macedonia" ,
country == "Ivory Coast" ~ "Cote d'Ivoire",
country == "Democratic Republic of the Congo" ~ "Congo, Dem. Rep.",
country == "Republic of Congo" ~ "Congo, Rep.",
country == "UK" ~ "United Kingdom",
country == "USA" ~ "United States",
country == "Laos" ~ "Lao",
country == "Slovakia" ~ "Slovak Republic",
country == "Saint Lucia" ~ "St. Lucia",
country == "Kyrgyzstan" ~ "Kyrgyz Republic",
country == "Micronesia" ~ "Micronesia, Fed. Sts.",
country == "Swaziland" ~ "Eswatini",
country == "Virgin Islands" ~ "Virgin Islands (U.S.)",
TRUE ~ country))
### Progress check
setdiff(country_names$country, world_map2$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Remaining Cases",
row.names = TRUE)
row | diff |
---|---|
1 | Antigua and Barbuda |
2 | Channel Islands |
3 | Hong Kong, China |
4 | Macao, China |
5 | Netherlands Antilles |
6 | St. Kitts and Nevis |
7 | St. Vincent and the Grenadines |
8 | Trinidad and Tobago |
We now have eight remaining cases.
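The renaming step for the easy cases can also be expressed as a lookup table, which some readers may find easier to extend than a long case_when(). This is a sketch of that alternative on toy data, not the method used to build world_map2; name_lookup and toy_map are hypothetical names.

```r
library(dplyr)

# A few of the renames from the easy cases, stored as old = new pairs
name_lookup <- c(
  "Macedonia"   = "North Macedonia",
  "Ivory Coast" = "Cote d'Ivoire",
  "Swaziland"   = "Eswatini"
)

# Toy stand-in for the map data
toy_map <- tibble(country = c("Macedonia", "Aruba", "Swaziland"))

# Index the lookup by name: unmatched names come back NA,
# and coalesce() falls through to the original value
toy_map <- toy_map %>%
  mutate(country = coalesce(unname(name_lookup[country]), country))

toy_map$country
```

Keeping the pairs in one named vector also means the same lookup can be reused for other data sets that follow the older naming.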
The cases of Antigua and Barbuda, St. Kitts and Nevis, Trinidad and Tobago, and St. Vincent and the Grenadines are all similar: combine the related map region designations under the appropriate new country designation. In each instance, we can re-organize the existing group, order, long and lat values under the new country value.
## Get data for Island nations
match_names <- c("Antigua" , "Barbuda", "Nevis",
"Saint Kitts", "Trinidad" ,
"Tobago", "Grenadines" , "Saint Vincent")
### Island nations data set
map_match <- world_map2 %>%
filter(country %in% match_names)
map_match %>% distinct(country)
## country
## 1 Antigua
## 2 Barbuda
## 3 Nevis
## 4 Saint Kitts
## 5 Trinidad
## 6 Tobago
## 7 Grenadines
## 8 Saint Vincent
### Group IDs for the countries
ant_bar <- c(137 ,138 )
kit_nev <- c(930 , 931)
tri_tog <- c(1425, 1426)
vin_gre <- c(1575, 1576, 1577)
# chan_isl <- c(594, 861)
# neth_ant <- c(1055, 1056)
new_names_ref <- c("Antigua and Barbuda", "St. Kitts and Nevis",
"Trinidad and Tobago", "St. Vincent and the Grenadines")
### assign new country names to match Gapminder
map_match <- map_match %>%
mutate(country = case_when(group %in% ant_bar ~ "Antigua and Barbuda" ,
group %in% kit_nev ~ "St. Kitts and Nevis" ,
group %in% tri_tog ~ "Trinidad and Tobago" ,
group %in% vin_gre ~ "St. Vincent and the Grenadines")
) %>%
tibble()
### Quick checks
map_match %>% head()
## # A tibble: 6 x 6
## long lat group order country subregion
## <dbl> <dbl> <dbl> <int> <chr> <chr>
## 1 -61.7 17.0 137 7243 Antigua and Barbuda <NA>
## 2 -61.7 17.0 137 7244 Antigua and Barbuda <NA>
## 3 -61.9 17.0 137 7245 Antigua and Barbuda <NA>
## 4 -61.9 17.1 137 7246 Antigua and Barbuda <NA>
## 5 -61.9 17.1 137 7247 Antigua and Barbuda <NA>
## 6 -61.8 17.2 137 7248 Antigua and Barbuda <NA>
map_match %>%
distinct(country)%>%
knitr::kable(caption = "Add to World Map")
country |
---|
Antigua and Barbuda |
St. Kitts and Nevis |
Trinidad and Tobago |
St. Vincent and the Grenadines |
map_match %>%
group_by(country) %>%
count(group) %>%
knitr::kable(caption = "Add to World Map")
country | group | n |
---|---|---|
Antigua and Barbuda | 137 | 12 |
Antigua and Barbuda | 138 | 10 |
St. Kitts and Nevis | 930 | 7 |
St. Kitts and Nevis | 931 | 13 |
St. Vincent and the Grenadines | 1575 | 16 |
St. Vincent and the Grenadines | 1576 | 23 |
St. Vincent and the Grenadines | 1577 | 10 |
Trinidad and Tobago | 1425 | 30 |
Trinidad and Tobago | 1426 | 8 |
#### Structure check for merge
map_match %>%
str()
## tibble [129 x 6] (S3: tbl_df/tbl/data.frame)
## $ long : num [1:129] -61.7 -61.7 -61.9 -61.9 -61.9 ...
## $ lat : num [1:129] 17 17 17 17.1 17.1 ...
## $ group : num [1:129] 137 137 137 137 137 137 137 137 137 137 ...
## $ order : int [1:129] 7243 7244 7245 7246 7247 7248 7249 7250 7251 7252 ...
## $ country : chr [1:129] "Antigua and Barbuda" "Antigua and Barbuda" "Antigua and Barbuda" "Antigua and Barbuda" ...
## $ subregion: chr [1:129] NA NA NA NA ...
world_map2 %>%
str()
## 'data.frame': 99338 obs. of 6 variables:
## $ long : num -69.9 -69.9 -69.9 -70 -70.1 ...
## $ lat : num 12.5 12.4 12.4 12.5 12.5 ...
## $ group : num 1 1 1 1 1 1 1 1 1 1 ...
## $ order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ country : chr "Aruba" "Aruba" "Aruba" "Aruba" ...
## $ subregion: chr NA NA NA NA ...
#### Time to Slice, Dice, and Restack
world_map2 <- world_map2 %>%
filter(!country %in% match_names)
world_map2 <- world_map2 %>%
bind_rows(map_match) %>%
arrange(country) %>%
tibble()
### Safety check -- should return empty set
world_map2 %>%
filter(country %in% match_names)
## # A tibble: 0 x 6
## # ... with 6 variables: long <dbl>, lat <dbl>, group <dbl>, order <int>,
## # country <chr>, subregion <chr>
### Safety check - should return one complete row each
world_map2 %>%
filter(country %in% new_names_ref) %>%
group_by(country) %>%
slice_max(order, n = 1)
## # A tibble: 4 x 6
## # Groups: country [4]
## long lat group order country subregion
## <dbl> <dbl> <dbl> <int> <chr> <chr>
## 1 -61.7 17.6 138 7265 Antigua and Barbuda <NA>
## 2 -62.6 17.2 931 58081 St. Kitts and Nevis <NA>
## 3 -61.2 13.2 1577 98189 St. Vincent and the Grenadines <NA>
## 4 -60.8 11.2 1426 89453 Trinidad and Tobago <NA>
Safety checks passed. The island nations are now included in world_map2.
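One further check that may be worth running after this kind of slice-and-restack is that no polygon group ended up shared between two countries. The sketch below shows the idea on a toy table; toy_map2 and split_groups are hypothetical names, and the group IDs are borrowed from the island examples only for flavor.

```r
library(dplyr)

# Toy stand-in for the restacked map data
toy_map2 <- tibble(
  country = c("Antigua and Barbuda", "Antigua and Barbuda", "Aruba"),
  group   = c(137, 138, 1)
)

# Each polygon group should belong to exactly one country
split_groups <- toy_map2 %>%
  distinct(country, group) %>%
  count(group, name = "n_countries") %>%
  filter(n_countries > 1)

# Zero rows means no group is split across countries
nrow(split_groups)
```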
The cases of Macao, China, and Hong Kong, China differ again: in the map data set, each is a subregion of the region China. But economic and public health data for both former city-states, now Special Administrative Regions in China, has for decades been treated separately from that of mainland China (PRC), and continues to be. Each, for the purposes of Global Studies, has country level status (which is not the same as nation-state status). So we should follow practice and treat them as country-level entities in terms of the map data set.
####
### Hong Kong and Macao
#### Pull from subregion; slice out; restack
sub_sleeps <- c("Hong Kong", "Macao")
hk_mc <- world_map2 %>%
filter(subregion %in% sub_sleeps)
hk_mc <- hk_mc %>%
mutate(country = case_when(subregion == "Hong Kong" ~ "Hong Kong, China" ,
subregion == "Macao" ~ "Macao, China" ) )
### Safety check for bind_rows()
hk_mc %>%
slice(38:41) %>%
knitr::kable(caption = "Check structure")
long | lat | group | order | country | subregion |
---|---|---|---|---|---|
114.0067 | 22.48403 | 670 | 45801 | Hong Kong, China | Hong Kong |
114.0154 | 22.51191 | 670 | 45802 | Hong Kong, China | Hong Kong |
113.4789 | 22.19556 | 960 | 59893 | Macao, China | Macao |
113.4810 | 22.21748 | 960 | 59894 | Macao, China | Macao |
### Slice out old info
world_map2 <- world_map2 %>%
filter(!subregion %in% sub_sleeps)
### Stack in new info
world_map2 <- world_map2 %>%
bind_rows(hk_mc) %>%
select(-subregion) %>%
tibble()
### Progress check
setdiff(country_names$country, world_map2$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Remaining Cases",
row.names = TRUE)
diff | |
---|---|
1 | Channel Islands |
2 | Netherlands Antilles |
We’ve added Hong Kong and Macao, and now have only two outstanding cases.
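The promotion of a subregion to country level was done above by pulling rows out, relabeling them, and stacking them back. The same relabeling can be sketched as a single mutate() that leaves every row in place. This is an alternative formulation on toy data; toy_map and promote are hypothetical names.

```r
library(dplyr)

# Toy stand-in for the map data
toy_map <- tibble(
  country   = c("China", "China", "China", "Tunisia"),
  subregion = c("Hong Kong", "Macao", NA, "Jerba")
)

promote <- c("Hong Kong", "Macao")

# %in% returns FALSE for NA, so rows without a subregion keep their country
toy_map <- toy_map %>%
  mutate(country = if_else(subregion %in% promote,
                           paste0(subregion, ", China"),
                           country))

toy_map$country
```

Because no rows are removed or re-added, the original row order is preserved, which may or may not matter depending on whether later steps sort the data anyway.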
Finally, we have two cases we arguably should not reconcile. The Netherlands Antilles was dissolved in 2010. It consisted of the islands Curaçao, Bonaire, Aruba (until 1986), Saba, Sint Eustatius, and Sint Maarten. Aruba, which has a country designation in the Gapminder data, is “a constituent country of the Kingdom of the Netherlands”; Curaçao and Sint Maarten, likewise. Each has its own ISO code. Bonaire, Saba, and Sint Eustatius are special municipalities within the country of the Netherlands: all share the same ISO code. Recombining these various constituent countries and special municipalities back into the historical Netherlands Antilles, itself once a constituent country of the Kingdom of the Netherlands, would come at the cost of current (since 2010) and future compatibility with data collection and analysis.
Likewise, for the reasons discussed earlier, we should pass on restoring the historical designation the Channel Islands. The Channel Islands primarily consist of the Bailiwick of Guernsey and the Bailiwick of Jersey. Guernsey and Jersey both have their own ISO codes and have real-world effective country-level status.
world_map2 %>% distinct(country) %>%
DT::datatable(caption = "Map Country List")
### No Tuvalu in map -- add coordinates
world_map2 %>%
filter(stringr::str_detect(country, "Tu") ) %>%
distinct(country)
## # A tibble: 4 x 1
## country
## <chr>
## 1 Tunisia
## 2 Turkey
## 3 Turkmenistan
## 4 Turks and Caicos Islands
But as it turns out, we are missing Tuvalu! This nation was represented in some of the Gapminder data sets.
We now have a new problem. Our map data lacks coordinates – indeed, entries – for countries or subregions which have ISO codes: the nation Tuvalu, for example, and the territories of Gibraltar and the British Virgin Islands. None of these currently shows in our list of countries for the map data. Here is a quick check, using the original world_map
data. We will check both the region
and subregion
vars.
# Tuvalu
world_map %>%
filter(stringr::str_detect(region, "Tu") ) %>%
distinct(region, subregion)
## region subregion
## 1 Turks and Caicos Islands Providenciales Island
## 2 Turks and Caicos Islands Grand Caicos Island
## 3 Turks and Caicos Islands North Caicos Island
## 4 Turkmenistan Ogurja Ada
## 5 Turkmenistan <NA>
## 6 Tunisia Jerba
## 7 Tunisia Shergui Island
## 8 Tunisia <NA>
## 9 Turkey Gokceada
## 10 Turkey <NA>
## 11 Turkey North
# Tuvalu again
world_map %>%
filter(stringr::str_detect(subregion, "Tu") ) %>%
distinct(region, subregion)
## region subregion
## 1 American Samoa Tutuila
## 2 Belize Turneffe Island
## 3 Canada Tukarak Island
## 4 Indonesia Tuangku
# Gibraltar
world_map %>%
filter(stringr::str_detect(region, "Gib") ) %>%
distinct(region, subregion)
## [1] region subregion
## <0 rows> (or 0-length row.names)
# Gibraltar again
world_map %>%
filter(stringr::str_detect(subregion, "Gib") ) %>%
distinct(region, subregion)
## [1] region subregion
## <0 rows> (or 0-length row.names)
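The four checks above repeat the same pattern, so they could be folded into one small helper that searches both columns at once. This is only a sketch: the helper name `find_place` and the toy data are illustrative assumptions, and the toy rows stand in for the real world_map data.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy stand-in for world_map, with the same two column names.
toy_world <- tibble(
  region    = c("Tunisia", "Tunisia", "American Samoa", "Canada"),
  subregion = c(NA, "Jerba", "Tutuila", NA)
)

# Hypothetical helper: match a pattern in either region or subregion.
# coalesce() turns the NA produced by a missing subregion into FALSE,
# so rows are kept or dropped on an explicit TRUE/FALSE.
find_place <- function(map_df, pattern) {
  map_df %>%
    filter(str_detect(region, pattern) |
             coalesce(str_detect(subregion, pattern), FALSE)) %>%
    distinct(region, subregion)
}

tu_hits  <- find_place(toy_world, "Tu")   # Tunisia rows plus Tutuila
gib_hits <- find_place(toy_world, "Gib")  # no matches in the toy data
```

A zero-row result for a pattern, as with "Gib" here, is the signal that the place is missing from both columns.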
For these cases – but regrettably, not for all – we can download the polygon coordinates from OpenDataSoft. Some hacking around (not on display here) will get us compatible data sets.
### From https://public.opendatasoft.com/
tuvalu_coords <- readRDS(here::here("data",
"tidy_data",
"tuvalu_coords.rds") )
tuvalu_coords %>% head() ## check structure
## # A tibble: 6 x 5
## long lat group order country
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 179. -8.55 2010 110000 Tuvalu
## 2 179. -8.56 2010 110001 Tuvalu
## 3 179. -8.47 2010 110003 Tuvalu
## 4 179. -8.48 2010 110004 Tuvalu
## 5 179. -8.49 2010 110005 Tuvalu
## 6 179. -8.50 2011 110006 Tuvalu
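The conversion step itself is not shown in this post. As a rough sketch only, assuming the downloaded polygons have already been read into a list of two-column (long, lat) matrices, one per ring, a small helper could stack them into the same five-column layout. The helper name `rings_to_map_df`, the toy coordinates, and the starting `group` and `order` values are all illustrative assumptions, not the code actually used.

```r
library(dplyr)
library(tibble)

# Hypothetical input: one (long, lat) matrix per polygon ring.
# These coordinates are made up; they are NOT Tuvalu's real outline.
toy_rings <- list(
  matrix(c(179.10, -8.50,
           179.20, -8.50,
           179.20, -8.60,
           179.10, -8.50), ncol = 2, byrow = TRUE),
  matrix(c(179.80, -9.30,
           179.90, -9.30,
           179.90, -9.40,
           179.80, -9.30), ncol = 2, byrow = TRUE)
)

# Stack rings into the long / lat / group / order / country layout.
# first_group and first_order should exceed every value already in the
# target map, so the new polygons do not collide with existing ones.
rings_to_map_df <- function(rings, country_name, first_group, first_order) {
  pieces <- lapply(seq_along(rings), function(i) {
    tibble(long  = rings[[i]][, 1],
           lat   = rings[[i]][, 2],
           group = first_group + i - 1)  # one group id per ring
  })
  bind_rows(pieces) %>%
    mutate(order   = first_order + row_number() - 1,  # running draw order
           country = country_name)
}

toy_coords <- rings_to_map_df(toy_rings, "Tuvalu",
                              first_group = 2010, first_order = 110000)
```

Keeping `group` unique per ring matters because geom_polygon() draws one closed shape per group; reusing an existing group id would connect unrelated islands.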
## Add to map
world_map2 <- world_map2 %>%
bind_rows(tuvalu_coords) %>%
arrange(country)
## Check!
world_map2 %>%
filter(stringr::str_detect(country, "Tu") ) %>%
distinct(country)
## # A tibble: 5 x 1
## country
## <chr>
## 1 Tunisia
## 2 Turkey
## 3 Turkmenistan
## 4 Turks and Caicos Islands
## 5 Tuvalu
We’ve successfully added Tuvalu. Now, for Gibraltar and the British Virgin Islands.
### Missing also Gibraltar & Virgin Islands (British)
### From https://public.opendatasoft.com/
Gib_BVI_coords <- readRDS(file = here::here("data",
"tidy_data",
"Gib_BVI_coords.rds"))
Gib_BVI_coords %>% head()
## # A tibble: 6 x 5
## long lat group order country
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 -5.36 36.1 2014 110035 Gibraltar
## 2 -5.34 36.1 2014 110036 Gibraltar
## 3 -5.34 36.1 2014 110037 Gibraltar
## 4 -5.35 36.1 2014 110038 Gibraltar
## 5 -5.36 36.1 2014 110039 Gibraltar
## 6 -64.6 18.3 2021 110045 Virgin Islands (British)
world_map2 <- world_map2 %>%
bind_rows(Gib_BVI_coords) %>%
arrange(country)
world_map2 %>%
filter(stringr::str_detect(country, "Gib") ) %>%
distinct(country)
## # A tibble: 1 x 1
## country
## <chr>
## 1 Gibraltar
world_map2 %>%
filter(stringr::str_detect(country, "Vir") ) %>%
distinct(country)
## # A tibble: 2 x 1
## country
## <chr>
## 1 Virgin Islands (British)
## 2 Virgin Islands (U.S.)
We now have a map which provides near-complete coverage of the Gapminder.org data sets, and will work for other Global Studies data sets. We now need to add the ISO 3166-1 Country Codes to our map data: in particular, the Alpha-2 code, the Alpha-3 code, and the Numeric code. This will ensure compatibility with a greater range of Global Studies data sets.
Please note that the country_ISO_codes
data set below was compiled and cross-checked using various open sources. But as ISO 3166-1 is a moving target (an ongoing process), this data set will need checking and updating.
country_ISO_codes <- readRDS(file = here::here("data",
"tidy_data",
"country_ISO_codes2.rds") )
country_ISO_codes %>% head()
## # A tibble: 6 x 5
## s_name code_2 code_3 code_num form_name
## <chr> <chr> <chr> <dbl> <chr>
## 1 Afghanistan AF AFG 4 Islamic Republic of Afghanistan
## 2 Aland Islands AX ALA 248 Åland
## 3 Albania AL ALB 8 The Republic of Albania
## 4 Algeria DZ DZA 12 The People's Democratic Republic of Alg~
## 5 American Samoa AS ASM 16 The Territory of American Samoa
## 6 Andorra AD AND 20 The Principality of Andorra
It turns out, however, that like our map data, our master list of ISO Country Codes was also not complete. Finding a free and reliable Open Source version is not easy – and I do not have access to the commercial version. So here is how to update country_ISO_codes
:
### Missing Norfolk Island
norfolk_codes <- tibble(s_name = "Norfolk Island",
code_2 = "NF",
code_3 = "NFK",
code_num = 574,
form_name = "Territory of Norfolk Island, Australia")
norfolk_codes %>% head()
## # A tibble: 1 x 5
## s_name code_2 code_3 code_num form_name
## <chr> <chr> <chr> <dbl> <chr>
## 1 Norfolk Island NF NFK 574 Territory of Norfolk Island, Australia
country_ISO_codes2 <- country_ISO_codes %>%
bind_rows(norfolk_codes) %>%
arrange(s_name)
country_ISO_codes2 %>%
filter(code_2 == "NF") %>%
slice_head(n = 1)
## # A tibble: 1 x 5
## s_name code_2 code_3 code_num form_name
## <chr> <chr> <chr> <dbl> <chr>
## 1 Norfolk Island NF NFK 574 Territory of Norfolk Island, Australia
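Because the code list is maintained by hand, a few format checks can catch typos before the join. The sketch below is an assumption-laden illustration, not part of the original workflow: the helper name `check_iso_codes` and the toy rows are hypothetical, and the rules only test the shape of the codes (two letters, three letters, a whole number from 1 to 999, no duplicated Alpha-2 code), not whether they are the officially assigned ones.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy stand-in for country_ISO_codes2, using the same column names.
toy_codes <- tibble(
  s_name   = c("Afghanistan", "Albania", "Norfolk Island"),
  code_2   = c("AF", "AL", "NF"),
  code_3   = c("AFG", "ALB", "NFK"),
  code_num = c(4, 8, 574)
)

# Hypothetical helper: return the rows that fail any basic format rule.
# coalesce(..., FALSE) treats a missing code as a failed check.
check_iso_codes <- function(codes) {
  codes %>%
    mutate(
      bad_code_2   = !coalesce(str_detect(code_2, "^[A-Z]{2}$"), FALSE),
      bad_code_3   = !coalesce(str_detect(code_3, "^[A-Z]{3}$"), FALSE),
      bad_code_num = !coalesce(code_num >= 1 & code_num <= 999 &
                                 code_num == round(code_num), FALSE),
      dup_code_2   = duplicated(code_2) | duplicated(code_2, fromLast = TRUE)
    ) %>%
    filter(bad_code_2 | bad_code_3 | bad_code_num | dup_code_2)
}

clean_result <- check_iso_codes(toy_codes)  # zero rows: nothing flagged

# A row with missing codes is flagged for review.
toy_bad <- bind_rows(
  toy_codes,
  tibble(s_name = "Canary Islands", code_2 = "IC",
         code_3 = NA_character_, code_num = NA_real_)
)
flagged <- check_iso_codes(toy_bad)
```

Flagged rows are candidates for review rather than certain errors, since some entries may legitimately lack a code.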
Now that we have our ISO Country Codes loaded and updated, we are almost ready to add them to the map data. First, one last set of checks for differences.
### Remaining Gapminder cases -- the two historical entities
setdiff(country_names$country, world_map2$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Gap vs Map: Remaining Cases",
row.names = TRUE)
| | diff |
|---|---|
| 1 | Channel Islands |
| 2 | Netherlands Antilles |
setdiff(country_ISO_codes2$s_name , world_map2$country) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "ISO vs Map: Remaining Cases",
row.names = TRUE)
| | diff |
|---|---|
| 1 | Aland Islands |
| 2 | Bouvet Island |
| 3 | British Indian Ocean Territory |
| 4 | Svalbard and Jan Mayen |
| 5 | Tokelau |
| 6 | United States Minor Outlying Islands |
setdiff(world_map2$country, country_ISO_codes2$s_name) %>%
enframe(name = NULL, value = "diff") %>%
knitr::kable(caption = "Map vs. ISO: Remaining Cases",
row.names = TRUE)
| | diff |
|---|---|
| 1 | Siachen Glacier |
So our checks indicate success. First, we declined to restore the two defunct designations, one of which reflected a historical country-level entity, and the other, a geographically convenient label. By design, our map data set will not account for the Netherlands Antilles or the Channel Islands.
Second, of the ISO vs Map cases, only the sparsely populated Tokelau possibly matters, but OpenDataSoft does not have the polygon coordinates for it. The Chagos Archipelago, included in our map, makes up the most important part of the British Indian Ocean Territory. The remaining four cases comprise seasonally inhabited regions, remote military bases, or both. These produce negligible data in terms of economic or public health statistics, and can be safely ignored for such purposes.
Third and finally, the original map makers included the Siachen Glacier. This is a geographical entity and a disputed territory: but it does not have an individual ISO code, does not have civilian residents (only military), and does not produce the relevant economic or public health data. If we remove it from world_map2
, however, we will get a small but annoying empty space (usually portrayed as a white dot). So it stays in.
world_map2 <- world_map2 %>%
left_join(country_ISO_codes2, by = c("country" = "s_name")) %>%
tibble()
world_map2 %>% vis_dat()
world_map2 %>% glimpse()
## Rows: 99,442
## Columns: 9
## $ long <dbl> 74.89131, 74.84023, 74.76738, 74.73896, 74.72666, 74.66895, ~
## $ lat <dbl> 37.23164, 37.22505, 37.24917, 37.28564, 37.29072, 37.26670, ~
## $ group <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ order <dbl> 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, ~
## $ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", ~
## $ code_2 <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", ~
## $ code_3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG~
## $ code_num <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ~
## $ form_name <chr> "Islamic Republic of Afghanistan", "Islamic Republic of Afgh~
The one apparent glitch, the NA gap in the data set, comes from the Canary Islands, which has an alpha-2 code “IC” still in use, but no alpha-3 or numeric code. Instead, it is coded as ES-CN: a subdivision of Spain.
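To see exactly which map entries carry missing codes after the join, rather than reading them off the plot, one option is to filter for NA values directly. This is a minimal sketch on toy data: the rows below only mimic the column names of world_map2, and the NA pattern shown is assumed for illustration.

```r
library(dplyr)
library(tibble)

# Toy stand-in for world_map2 after the join (same column names).
toy_map <- tibble(
  country  = c("Afghanistan", "Afghanistan", "Canary Islands", "Siachen Glacier"),
  code_2   = c("AF", "AF", "IC", NA),
  code_3   = c("AFG", "AFG", NA, NA),
  code_num = c(4, 4, NA, NA)
)

# One row per country, keeping only those with at least one missing code.
missing_codes <- toy_map %>%
  distinct(country, code_2, code_3, code_num) %>%
  filter(if_any(c(code_2, code_3, code_num), is.na))

missing_codes
```

Running the same pipeline on the real map data would give a short, named list to check against the decisions made above.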
We now have mapping data with the following standard ISO country codes: Alpha-2 code, Alpha-3 code, and Numeric code. We can match by country
name to the majority of Gapminder.org data sets, and we can match by code to any Global Studies data set which likewise uses one or more of the above ISO country codes. So the data set world_map2
offers an update on ggplot2::map_data("world")
with improved interoperability.
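As an illustration of matching by code rather than by name, here is a minimal sketch of a Choropleth built from a join on the Alpha-3 code. Everything in it is a toy assumption: the two square "countries", their codes, the `iso3` column name, and the `life_exp` values are invented so the example runs on its own.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Toy map in the world_map2 layout: two square "countries".
toy_map <- tibble(
  long    = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat     = c(0, 0, 1, 1,  0, 0, 1, 1),
  group   = rep(c(1, 2), each = 4),
  order   = 1:8,
  country = rep(c("Country A", "Country B"), each = 4),
  code_3  = rep(c("AAA", "BBB"), each = 4)   # made-up codes
)

# Toy statistics keyed by a differently named code column.
toy_stats <- tibble(
  iso3     = c("AAA", "BBB"),
  life_exp = c(71.2, 64.5)                   # made-up values
)

# Join on the code, keeping every polygon vertex from the map.
toy_joined <- toy_map %>%
  left_join(toy_stats, by = c("code_3" = "iso3"))

# One filled polygon per group, colored by the joined statistic.
toy_plot <- ggplot(toy_joined,
                   aes(x = long, y = lat, group = group, fill = life_exp)) +
  geom_polygon(color = "white") +
  coord_quickmap() +
  labs(fill = "Life expectancy")
```

Using left_join() from the map side keeps countries with no matching statistic on the map, where they show up with the NA fill color instead of leaving a hole.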
save_data <- c("world_map2",
"country_ISO_codes2")
# Save Data! --------------------------
save(list = save_data, file = here::here("data",
"tidy_data",
"maps",
"world_map2_project.rda" ))
## Just the map data
saveRDS(world_map2, file = here::here("data",
"tidy_data",
"maps",
"world_map2.rds" ))
Or, please improve and share: github.com/Thom-J-H/map_Gap_2_Tidy
Thomas J. Haslam
2021-08-13