library(tidyverse)
library(here)
library(visdat)

Brief Introduction

The data set world_map2, a derivative and update of ggplot2::map_data("world"), was created for easier mapping of Global Studies data in the Tidyverse. To not solely rely on matching country or region names, which can vary considerably per the data source, world_map2 contains following the ISO country codes: Alpha-2 code, Alpha-3 code, and Numeric code. (To download and/or improve world_map2: world_map2_project.rda).

The process behind making world_map2 is described in a separate markdown. This markdown showcases how to use world_map2 with Global Studies data. This markdown uses data from the Gapminder.org foundation as examples of typical Global Studies data. We want to show the annual results in Choropleth maps.

# our custom map data
load(here::here("data",
                "tidy_data", 
                "maps", 
                "world_map2_project.rda"))


# raw Global Studies dat from Gapminder.org
gni_percapita <- read_csv(here::here("data",
                                         "raw_data",                               
                                         "gnipercapita_ppp_current_international.csv") )

# raw Global Studies dat from Gapminder.org
water_access <- read_csv(here::here("data", 
                                    "raw_data",                               
                                    "at_least_basic_water_source_overall_access_percent.csv") )


# raw Global Studies dat from Gapminder.org
energy_capita <- read_csv(here::here("data",
                                     "raw_data",
                                     "energy_use_per_person.csv"))

Case Study: Energy

The first data set, “Energy use, per person”, is from Gapminder.org but sourced from there to the World Bank. We will proceed by exploring it, creating a Tidy version, and then mapping the Tidy version. Our try at mapping will introduce and troubleshoot a common problem.

EDA Enegry

We see that the data is in a wide format, and not Tidy. It does not use ISO country codes. In fact, none of data sets available from Gapminder.org/data do so. So we will need to match country names.

energy_capita  %>% vis_dat()

energy_capita  %>% glimpse()
## Rows: 169
## Columns: 57
## $ country <chr> "Albania", "Algeria", "Angola", "Antigua and Barbuda", "Argent~
## $ `1960`  <dbl> NA, NA, NA, NA, NA, NA, 3060, 1550, NA, NA, NA, NA, NA, NA, 25~
## $ `1961`  <dbl> NA, NA, NA, NA, NA, NA, 3120, 1550, NA, NA, NA, NA, NA, NA, 25~
## $ `1962`  <dbl> NA, NA, NA, NA, NA, NA, 3170, 1680, NA, NA, NA, NA, NA, NA, 28~
## $ `1963`  <dbl> NA, NA, NA, NA, NA, NA, 3280, 1820, NA, NA, NA, NA, NA, NA, 30~
## $ `1964`  <dbl> NA, NA, NA, NA, NA, NA, 3350, 1860, NA, NA, NA, NA, NA, NA, 30~
## $ `1965`  <dbl> NA, NA, NA, NA, NA, NA, 3460, 1850, NA, NA, NA, NA, NA, NA, 31~
## $ `1966`  <dbl> NA, NA, NA, NA, NA, NA, 3550, 1900, NA, NA, NA, NA, NA, NA, 30~
## $ `1967`  <dbl> NA, NA, NA, NA, NA, NA, 3690, 1920, NA, NA, NA, NA, NA, NA, 31~
## $ `1968`  <dbl> NA, NA, NA, NA, NA, NA, 3760, 2050, NA, NA, NA, NA, NA, NA, 35~
## $ `1969`  <dbl> NA, NA, NA, NA, NA, NA, 3790, 2180, NA, NA, NA, NA, NA, NA, 38~
## $ `1970`  <dbl> NA, NA, NA, NA, NA, NA, 4060, 2420, NA, NA, NA, NA, NA, NA, 41~
## $ `1971`  <dbl> 785.0, 233.0, 637.0, NA, 1390.0, NA, 3990.0, 2510.0, NA, NA, 6~
## $ `1972`  <dbl> 866.0, 263.0, 663.0, NA, 1390.0, NA, 4040.0, 2630.0, NA, NA, 5~
## $ `1973`  <dbl> 763.0, 307.0, 636.0, NA, 1420.0, NA, 4260.0, 2830.0, NA, NA, 8~
## $ `1974`  <dbl> 777.0, 321.0, 624.0, NA, 1430.0, NA, 4290.0, 2730.0, NA, NA, 9~
## $ `1975`  <dbl> 827, 332, 587, NA, 1390, NA, 4350, 2650, NA, NA, 8000, 96, NA,~
## $ `1976`  <dbl> 891.0, 369.0, 559.0, NA, 1420.0, NA, 4410.0, 2870.0, NA, NA, 9~
## $ `1977`  <dbl> 924.0, 401.0, 533.0, NA, 1430.0, NA, 4670.0, 2800.0, NA, NA, 8~
## $ `1978`  <dbl> 1010, 480, 567, NA, 1440, NA, 4630, 2890, NA, NA, 7840, 100, N~
## $ `1979`  <dbl> 864, 590, 556, NA, 1500, NA, 4680, 3140, NA, NA, 8810, 102, NA~
## $ `1980`  <dbl> 1150, 583, 547, NA, 1500, NA, 4740, 3070, NA, NA, 7790, 106, N~
## $ `1981`  <dbl> 989, 615, 532, NA, 1440, NA, 4690, 2900, NA, NA, 8300, 105, NA~
## $ `1982`  <dbl> 967, 776, 506, NA, 1430, NA, 4820, 2830, NA, NA, 9070, 108, NA~
## $ `1983`  <dbl> 1000, 813, 502, NA, 1430, NA, 4560, 2840, NA, NA, 8500, 108, N~
## $ `1984`  <dbl> 1020, 781, 489, NA, 1460, NA, 4650, 2950, NA, NA, 8830, 106, N~
## $ `1985`  <dbl> 917, 791, 501, NA, 1370, NA, 4600, 3050, NA, NA, 9920, 110, NA~
## $ `1986`  <dbl> 964, 868, 489, NA, 1430, NA, 4620, 3060, NA, NA, 10300, 114, N~
## $ `1987`  <dbl> 922, 833, 484, NA, 1480, NA, 4770, 3170, NA, NA, 9520, 110, NA~
## $ `1988`  <dbl> 928, 855, 486, NA, 1510, NA, 4700, 3200, NA, NA, 10500, 117, N~
## $ `1989`  <dbl> 896, 825, 480, NA, 1450, NA, 5000, 3140, NA, NA, 10200, 120, N~
## $ `1990`  <dbl> 813, 861, 497, 1580, 1410, 2180, 5060, 3240, 3170, 2520, 10600~
## $ `1991`  <dbl> 573, 889, 492, NA, 1440, 2320, 4930, 3420, 3090, NA, 10100, 11~
## $ `1992`  <dbl> 418, 889, 479, NA, 1490, 1200, 4960, 3250, 2460, NA, 10800, 12~
## $ `1993`  <dbl> 412, 872, 480, NA, 1470, 652, 5150, 3260, 2180, NA, 11100, 127~
## $ `1994`  <dbl> 441, 824, 471, NA, 1550, 420, 5090, 3230, 1950, NA, 11600, 130~
## $ `1995`  <dbl> 417, 843, 456, NA, 1550, 511, 5130, 3370, 1810, NA, 11400, 138~
## $ `1996`  <dbl> 448, 802, 454, NA, 1590, 562, 5390, 3580, 1510, NA, 11100, 136~
## $ `1997`  <dbl> 385, 809, 449, NA, 1620, 594, 5470, 3550, 1440, NA, 12200, 139~
## $ `1998`  <dbl> 427, 825, 434, NA, 1660, 610, 5550, 3610, 1490, NA, 12400, 142~
## $ `1999`  <dbl> 576, 868, 441, NA, 1670, 594, 5610, 3590, 1370, NA, 11900, 141~
## $ `2000`  <dbl> 580, 870, 439, NA, 1670, 656, 5640, 3570, 1400, NA, 12000, 143~
## $ `2001`  <dbl> 597, 860, 443, NA, 1570, 657, 5450, 3760, 1410, NA, 11700, 154~
## $ `2002`  <dbl> 660, 908, 448, NA, 1510, 618, 5570, 3770, 1410, NA, 11500, 154~
## $ `2003`  <dbl> 648, 953, 468, NA, 1600, 656, 5570, 3970, 1480, NA, 11600, 160~
## $ `2004`  <dbl> 715, 952, 465, 1680, 1730, 698, 5600, 4010, 1540, 2080, 10900,~
## $ `2005`  <dbl> 720, 978, 434, 1680, 1720, 843, 5560, 4090, 1600, 2140, 11700,~
## $ `2006`  <dbl> 707, 1030, 459, 1730, 1850, 865, 5710, 4080, 1560, 2130, 11600~
## $ `2007`  <dbl> 680, 1080, 472, 1740, 1860, 973, 5870, 4020, 1410, 2100, 11200~
## $ `2008`  <dbl> 711, 1070, 492, NA, 1940, 1030, 5960, 4030, 1520, NA, 11300, 1~
## $ `2009`  <dbl> 732, 1150, 515, NA, 1870, 904, 5860, 3800, 1330, NA, 10300, 19~
## $ `2010`  <dbl> 729, 1110, 521, NA, 1930, 863, 5790, 4050, 1280, NA, 10200, 20~
## $ `2011`  <dbl> 765, 1140, 522, NA, 1950, 944, 5750, 3920, 1370, NA, 9910, 212~
## $ `2012`  <dbl> 688, 1230, 552, NA, 1940, 1030, 5580, 3890, 1470, NA, 9660, 22~
## $ `2013`  <dbl> 801, 1250, 534, NA, 1970, 1000, 5470, 3920, 1470, NA, 10400, 2~
## $ `2014`  <dbl> 808, 1330, 545, NA, 2030, 1020, 5330, 3760, 1500, NA, 10600, 2~
## $ `2015`  <dbl> NA, NA, NA, NA, NA, NA, 5480, 3800, NA, NA, NA, NA, NA, NA, 46~

When converting to a Tidy data format, we want to make sure the variable year is not a character. But first, let’s check to see if the country names in energy_capita have a match in our mapping data.

setdiff(energy_capita$country,world_map2$country)
## character(0)
setdiff(world_map2$country, energy_capita$country) %>% 
  enframe(name = NULL, value ="diff")
## # A tibble: 84 x 1
##    diff            
##    <chr>           
##  1 Afghanistan     
##  2 American Samoa  
##  3 Andorra         
##  4 Anguilla        
##  5 Antarctica      
##  6 Aruba           
##  7 Ascension Island
##  8 Azores          
##  9 Bermuda         
## 10 Bonaire         
## # ... with 74 more rows

Yes! All the values for the country variable in our Global Studies data have a match in world_map2. We have only 169 countries in energy_capita, so this comprises only a subset of the countries in world_map2.

Tidy Energy

energy_tidy <- energy_capita  %>%
  pivot_longer(cols = !country,
               names_to = "year",
               names_transform = list(year = as.integer),
               values_to = "energy")

energy_tidy  %>% vis_dat()

energy_tidy  %>% glimpse()
## Rows: 9,464
## Columns: 3
## $ country <chr> "Albania", "Albania", "Albania", "Albania", "Albania", "Albani~
## $ year    <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 19~
## $ energy  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 785, 866, 763, 777~

Map left_join()

We have energy_tidy. It nows seems a simple step to left_join it to the map data, and then plot the energy values for a given year. But this will result in void spaces on our map.

energy_tidy %>%
  filter(year == 2004) %>% 
  left_join(world_map2, by = "country") %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             label = country)) +
  geom_polygon(aes(fill = energy) )+
  scale_fill_viridis_c(option = "C") +
  labs(fill = "Energy Use\nPer Capita",
       title = "Gapminder Data: 2004") +
  theme_void()

The white spaces for Afghanistan and various nations in Africa and Southeast Asia are unnecessarily confusing. Although these nations were not reported on in energy_tidy, that does not mean they were replaced by ocean or otherwise ceased to exist.

We want the map to fail gracefully. Missing countries would be better treated as NA cases. And NA cases should be displayed. knowing that we have no data for certain nations is both important and useful.

Using complete()

So we want to expand the data set before we join it with the mapping data. The Tidyverse verb complete() helps solve this problem. For the missing energy values, we will assign NA using replace_na(). This ensures that our plot will use all listed countries, showing the NA when applicable. Again, we want both graceful failure and to know when we are missing information.

energy_tidy %>%
  filter(year == 2004) %>% 
  complete(country = world_map2$country, 
           fill = (list(energy = NA )) ) %>%
  left_join(world_map2, by = "country") %>%
  replace_na(list(year = 2004)) %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             label = country)) +
  geom_polygon(aes(fill = energy) )+
  scale_fill_viridis_c(option = "C") +
  labs(fill = "Energy Use\nPer Capita",
       title = "Gapminder Data: 2000") +
  theme_void()

This version fails much more gracefully. It also makes clearer for what regions of the world we lack data.

Interactivity made easy

Let’s say that someone thinks Antarctica justs wastes space in this or like plots as it seldom generates economic, political science, or public health data. We can remove it with a simple filter call. Likewise, we can make the map interactive with one additional library and function call: plotly and ggplotly().

For plotly::ggplotly(), the default legend guide will not carry over. We can solve this by changing the title.

In the code below, this is broken down into three steps. First, create the data set for the plot. Second, create the ggplot object. Third, display it as plotly graphic.

# Map data
enegry_dat_2004 <- energy_tidy %>%
  filter(year == 2004) %>% 
  complete(country = world_map2$country, 
           fill = (list(energy = NA )) ) %>%
  left_join(world_map2, by = "country") %>%
  replace_na(list(year = 2004))

# ggplot object
enegry_map_2004 <- enegry_dat_2004 %>%
  filter(code_3 != "ATA") %>%
   ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             label = country)) +
  geom_polygon(aes(fill = energy) )+
  scale_fill_viridis_c(option = "C") +
  labs(fill = "",
       title = "Energy Use Per Capita (2004)") +
  theme_void()

# interactive version
plotly::ggplotly(enegry_map_2004)

So we now have a procedure. This project originated with a course entitled “Telling Stories with Data” (TSD), which introduced non-STEM majors to the Tidyverse and basic data visualization and analysis. Beginning students find it useful to have a proven model to follow. We have one now for our Choropleth maps, Let’s apply it to two other Gapminder.org data sets.

CS: Water Access

The second data set, “At least basic water source, overall access”, is likewise from Gapminder.org but sourced from there to the UN’s Millennium Development Goals. We will apply the procedure – the “recipe” of steps – developed when mapping Energy as above.

EDA: Water Access

The data set “At least basic water source, overall access” has values for 194 countries, some NA, for the years 2000 to 2017.

water_access  %>% vis_dat()

water_access  %>% glimpse()
## Rows: 194
## Columns: 19
## $ country <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angola", "Ant~
## $ `2000`  <dbl> 0.278, 0.879, 0.898, 1.000, 0.411, 0.983, 0.962, 0.951, 0.997,~
## $ `2001`  <dbl> 0.278, 0.879, 0.901, 1.000, 0.423, 0.981, 0.965, 0.954, 0.997,~
## $ `2002`  <dbl> 0.299, 0.879, 0.904, 1.000, 0.434, 0.980, 0.967, 0.957, 0.997,~
## $ `2003`  <dbl> 0.320, 0.879, 0.907, 1.000, 0.444, 0.979, 0.969, 0.960, 0.997,~
## $ `2004`  <dbl> 0.341, 0.879, 0.909, 1.000, 0.454, 0.978, 0.972, 0.963, 0.998,~
## $ `2005`  <dbl> 0.363, 0.879, 0.912, 1.000, 0.463, 0.977, 0.974, 0.967, 0.998,~
## $ `2006`  <dbl> 0.384, 0.879, 0.914, 1.000, 0.472, 0.976, 0.976, 0.970, 0.998,~
## $ `2007`  <dbl> 0.408, 0.879, 0.917, 1.000, 0.480, 0.974, 0.978, 0.973, 0.998,~
## $ `2008`  <dbl> 0.433, 0.879, 0.919, 1.000, 0.488, 0.973, 0.981, 0.976, 0.998,~
## $ `2009`  <dbl> 0.458, 0.879, 0.922, 1.000, 0.496, 0.972, 0.983, 0.979, 0.999,~
## $ `2010`  <dbl> 0.484, 0.878, 0.924, 1.000, 0.504, 0.971, 0.985, 0.983, 0.999,~
## $ `2011`  <dbl> 0.509, 0.878, 0.926, 1.000, 0.512, 0.970, 0.987, 0.986, 0.999,~
## $ `2012`  <dbl> 0.535, 0.879, 0.928, 1.000, 0.520, 0.969, 0.989, 0.989, 0.999,~
## $ `2013`  <dbl> 0.562, 0.895, 0.931, 1.000, 0.528, 0.967, 0.989, 0.993, 0.999,~
## $ `2014`  <dbl> 0.588, 0.910, 0.933, 1.000, 0.535, 0.967, 0.990, 0.996, 1.000,~
## $ `2015`  <dbl> 0.615, 0.910, 0.935, 1.000, 0.543, 0.967, 0.991, 0.997, 1.000,~
## $ `2016`  <dbl> 0.643, 0.910, 0.935, 1.000, 0.551, 0.967, 0.991, 0.999, 1.000,~
## $ `2017`  <dbl> 0.671, 0.910, 0.936, 1.000, 0.558, 0.967, NA, 0.999, 1.000, 1.~

This data set again is in wide format, and not Tidy. But we know how to fix, and can largely recycle our solution for energy_capita to energy_tidy. We now need to check the how well the country names match – do we have coverage?

setdiff(water_access$country,world_map2$country)
## character(0)
setdiff(world_map2$country, water_access$country)  %>% 
  enframe(name = NULL, value ="diff")
## # A tibble: 59 x 1
##    diff            
##    <chr>           
##  1 American Samoa  
##  2 Anguilla        
##  3 Antarctica      
##  4 Aruba           
##  5 Ascension Island
##  6 Azores          
##  7 Bermuda         
##  8 Bonaire         
##  9 Canary Islands  
## 10 Cayman Islands  
## # ... with 49 more rows

So our map data provides complete coverage for our statistical data. Now, let’s Tidy the data, and then use complete() with left_join() to create our plot data sets.

Tidy Water

water_tidy <- water_access %>%
  pivot_longer(cols = !country,
               names_to = "year",
               names_transform = list(year = as.integer),
               values_to = "water")


water_tidy %>% 
  vis_dat()

water_tidy %>% 
  glimpse()
## Rows: 3,492
## Columns: 3
## $ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "A~
## $ year    <int> 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 20~
## $ water   <dbl> 0.278, 0.278, 0.299, 0.320, 0.341, 0.363, 0.384, 0.408, 0.433,~

All good.

Map & Plot

water_dat_2010 <- water_tidy %>%
  filter(year == 2010) %>% 
  complete(country = world_map2$country, 
           fill = (list(water = NA )) ) %>%
  left_join(world_map2, by = "country") %>%
  replace_na(list(year = 2010))

Create Plot Object

water_mpa_2010 <- water_dat_2010 %>%
  filter(code_3 != "ATA") %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             label = country)) +
  geom_polygon(aes(fill = water) )+
  scale_fill_viridis_c(option = "C") +
  labs(fill = "",
       title = "Basic Water Access (2010)") +
  theme_void()

water_mpa_2010

Plotly version

plotly::ggplotly(water_mpa_2010)

Since we have a proven procedure, we can now tackle a slightly more difficult case.

CS: GNI Per Capita

Our third data set, “GNI per capita (PPP, current international $)”, is likewise from Gapminder.org, but sourced from there to the World Bank. It also regretfully does not use any ISO codes for identifying the countries reported on. We are back to matching names.

EDA: GNI Per Capita

This data is wide, not Tidy, and has one additional issue. It reports on 196 countries over the time span 1990-2019. The values for GNI Per Capita should be numeric. Instead, they are character data. Values under 10000 appear as numbers; values over 10000 are expressed as k-units: for example, 13.2k for 13200.

gni_percapita %>% 
  vis_dat()

gni_percapita %>% 
  glimpse()
## Rows: 196
## Columns: 31
## $ country <chr> "Aruba", "Afghanistan", "Angola", "Albania", "United Arab Emir~
## $ `1990`  <chr> "23.1k", NA, "2980", "2550", NA, "6880", "2670", "10.5k", "16.~
## $ `1991`  <chr> "24.8k", NA, "3300", "1880", NA, "7760", "2550", "11.3k", "17.~
## $ `1992`  <chr> "25.4k", NA, "1620", "1740", NA, "8560", "1290", "11.6k", "17.~
## $ `1993`  <chr> "26.1k", NA, "1490", "2110", NA, "9410", "1480", "12.4k", "18.~
## $ `1994`  <chr> "27.3k", NA, "1260", "2300", NA, "10k", "1620", "13.2k", "19.6~
## $ `1995`  <chr> "27.4k", NA, "2340", "2710", NA, "9790", "1870", "12.6k", "20.~
## $ `1996`  <chr> "27.2k", NA, "2410", "3050", NA, "10.4k", "2050", "13.5k", "21~
## $ `1997`  <chr> "28.8k", NA, "2820", "2780", NA, "11.3k", "2240", "14.3k", "22~
## $ `1998`  <chr> "29.4k", NA, "2810", "3110", NA, "11.7k", "2390", "14.9k", "23~
## $ `1999`  <chr> "29k", NA, "2630", "3550", NA, "11.3k", "2510", "15.3k", "24.5~
## $ `2000`  <chr> "31.6k", NA, "2780", "3980", "103k", "11.3k", "2740", "16k", "~
## $ `2001`  <chr> "30.1k", NA, "2890", "4460", "101k", "10.9k", "3090", "15.7k",~
## $ `2002`  <chr> "27.8k", NA, "3370", "4800", "96.3k", "9540", "3580", "15.5k",~
## $ `2003`  <chr> "29.6k", NA, "3450", "5150", "99.2k", "10.6k", "4180", "16.7k"~
## $ `2004`  <chr> "31.9k", NA, "3760", "5560", "102k", "10.9k", "4810", "17.8k",~
## $ `2005`  <chr> "27.1k", NA, "4290", "5990", "99.2k", "12k", "5680", "19.4k", ~
## $ `2006`  <chr> "33.6k", NA, "4710", "6750", "97.7k", "14.5k", "6720", "22.2k"~
## $ `2007`  <chr> "31.2k", NA, "5320", "7480", "89.9k", "16.1k", "7960", "24.6k"~
## $ `2008`  <chr> "35.7k", NA, "5560", "8260", "80.7k", "17k", "8790", "24.5k", ~
## $ `2009`  <chr> "32.4k", "1520", "5820", "8680", "68.9k", "15.8k", "7530", "21~
## $ `2010`  <chr> "30.7k", "1710", "5950", "9540", "64.9k", "17.4k", "7880", "20~
## $ `2011`  <chr> "31k", "1700", "6130", "10.2k", "67.7k", "18.8k", "8310", "19.~
## $ `2012`  <chr> "32k", "1920", "6810", "10.4k", "69.2k", "19.1k", "9740", "18.~
## $ `2013`  <chr> "34.3k", "2020", "7130", "10.8k", "70.5k", "19.6k", "10.4k", "~
## $ `2014`  <chr> "35.1k", "2070", "7680", "11.4k", "73.7k", "19.3k", "10.5k", "~
## $ `2015`  <chr> "35.2k", "2110", "6960", "11.8k", "65.2k", "19.7k", "10.4k", "~
## $ `2016`  <chr> "35.7k", "2150", "6730", "12.2k", "64.3k", "19.9k", "10.9k", "~
## $ `2017`  <chr> "36.3k", "2230", "6860", "13.1k", "67.7k", "23k", "12.5k", "18~
## $ `2018`  <chr> NA, "2260", "6550", "13.8k", "68.8k", "22.4k", "13.2k", "20.5k~
## $ `2019`  <chr> NA, "2330", "6390", "14.3k", "70.2k", "22.1k", "14.5k", "21.5k~

To properly parse this data, we need to explore a bit more and think our way through the problem. We also need to check the country coverage.

setdiff(gni_percapita$country, world_map2$country) %>% 
  enframe(name = NULL, value ="diff")
## # A tibble: 2 x 1
##   diff                     
##   <chr>                    
## 1 Curaçao                  
## 2 Sint Maarten (Dutch part)
setdiff(world_map2$country, gni_percapita$country) %>% 
  enframe(name = NULL, value ="diff")
## # A tibble: 59 x 1
##    diff              
##    <chr>             
##  1 American Samoa    
##  2 Andorra           
##  3 Anguilla          
##  4 Antarctica        
##  5 Ascension Island  
##  6 Azores            
##  7 Bonaire           
##  8 Canary Islands    
##  9 Chagos Archipelago
## 10 Christmas Island  
## # ... with 49 more rows

Troubleshoot to Tidy

We have two mismatches: “Curaçao”, which uses the special character “ç”; and “Sint Maarten (Dutch part)”, which contains the addition of “(Dutch part)”. Although both these names are now becoming more regularly used as the short designations for these countries, we will still normalize them to simpler versions: “Curaçao” will be simplified to “Curacao”, and “Sint Maarten (Dutch part)” to “Sint Maarten”. Why? Special characters do not always play well in data sets or in coding, and the parenthetical information can also cause problems. It would be better if the data sets used the ISO country codes in addition to the short version of the country names. But the Gapminder.org data sets so far do not.

Now, back to the bigger issue. We need to convert the GNI Per Capita values to numbers. But the values previously expressed as something k need to be multiplied by 1000.

We have several ways of doing this. Below is one solution that relies on a cut-off value of 175. This assumes – based an exploration of the data – that we have no nations in this set with a reported GNI Per Capita greater than 199000, and that we have no nations with a reported GNI Per Capita less than 200. The top and bottom values will shortly be displayed.

gni_tidy <- gni_percapita %>%
    pivot_longer(cols = !country,
                 names_to = "year",
                 names_transform = list(year = as.integer),
                 values_to = "gni_ppp_cap") %>%
  mutate(gni_ppp_cap = readr::parse_number(gni_ppp_cap) )%>%
  mutate(gni_ppp_cap = case_when(gni_ppp_cap < 200 ~ gni_ppp_cap * 1000,
                                 TRUE ~ gni_ppp_cap) )

gni_tidy  <- gni_tidy %>%
  mutate( country = case_when(country == "Curaçao" ~ "Curacao",
                              country == "Sint Maarten (Dutch part)"  ~ "Sint Maarten" ,
                              TRUE ~ country) )


# Brief check
gni_tidy  %>% slice_min(gni_ppp_cap, n = 4)
## # A tibble: 4 x 3
##   country     year gni_ppp_cap
##   <chr>      <int>       <dbl>
## 1 Mozambique  1992         270
## 2 Mozambique  1991         300
## 3 Mozambique  1993         300
## 4 Mozambique  1994         310
gni_tidy %>% slice_max(gni_ppp_cap, n = 4)
## # A tibble: 4 x 3
##   country       year gni_ppp_cap
##   <chr>        <int>       <dbl>
## 1 Macao, China  2014      132000
## 2 Qatar         2012      132000
## 3 Qatar         2013      131000
## 4 Macao, China  2013      129000

So as it turns out, our cut-off was effective. The lowest reported GNI Per Capita: 270, for Mozambique in 1992. The highest, 132000 for Qatar in 2012 and Macao in 2014.

We should now recheck country coverage.

setdiff(gni_tidy$country, world_map2$country)
## character(0)
gni_tidy %>%
  vis_dat()

Complete coverage. And all variables the right types. We have mapping data for all the countries in gni_tidy. We are now ready to make our maps. From here, we follow the earlier recipe.

Map & Plot

gni_dat_2017 <- gni_tidy %>%
  filter(year == 2017) %>% 
  complete(country = world_map2$country, 
           fill = (list(gni_ppp_cap = NA)) ) %>%
  left_join(world_map2, by = "country") %>%
  replace_na(list(year = 2017))


gni_map_2017 <- gni_dat_2017%>%
  filter(code_3 != "ATA") %>%
  ggplot(aes(x = long, 
             y = lat, 
             group = group, 
             label = country)) +
  geom_polygon(aes(fill = gni_ppp_cap) )+
  scale_fill_viridis_c(option = "C") +
  labs(fill = "",
       title = "GNI Per Capita for 2017 (in PPP dollars)") +
  theme_void()

gni_map_2017 

plotly::ggplotly(gni_map_2017 )

Summary

The three above case studies show how to use Global studies data andworld_map2 with complete() to produce better Choropleth maps. As opposed to the defaultggplot2::map_data("world"), the updated world_map2 both fails gracefully when we have missing countries or data, and contains the following standard ISO country codes: Alpha-2 code, Alpha-3 code, and Numeric code. As such, we can match by country name to the majority of Gapminder.org data sets, and we can match by to any Global Studies data set which likewise uses one or more of the above ISO country codes. The bottom line: better graphs and improved interoperability while keeping it (relatively) simple in the Tidyverse.

Download from

Or, please improve and share: github.com/Thom-J-H/map_Gap_2_Tidy


Thomas J. Haslam
2021-08-14


Session Info